Each entry consists of the following:
- Term or n-gram
- Language(s) with which it's associated
- Where it can occur: as a standalone term, at the beginning, middle, or end of a word, or some combination of these, and
- An integer weighting value (per term/language pairing)
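To make that layout concrete, here is a minimal sketch of one entry as a Python data structure. This is not the actual LanguageWare dictionary format; the names `Position` and `Entry`, the language codes, and the weight values are all made up for illustration.

```python
from dataclasses import dataclass
from enum import Flag, auto


class Position(Flag):
    """Where a term may occur (hypothetical names, combinable with |)."""
    STANDALONE = auto()
    PREFIX = auto()   # beginning of a word
    INFIX = auto()    # middle of a word
    SUFFIX = auto()   # end of a word


@dataclass
class Entry:
    term: str            # term or n-gram
    positions: Position  # allowed positions, possibly several
    weights: dict        # language code -> integer weight


ANYWHERE = (Position.STANDALONE | Position.PREFIX
            | Position.INFIX | Position.SUFFIX)

# Illustrative entries only; the weights are invented.
nek = Entry("nek", Position.SUFFIX, {"hu": 50})
ka = Entry("か", ANYWHERE, {"ja": 100, "zh-Hans": -100, "zh-Hant": -100})
```

Storing the weights per language inside the entry lets a single term count for one language and against another at the same time.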
Thus, for the Chinese Simplified/Traditional and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive values for Japanese and large negative scores for the Chinese languages. The weights need to be large because, as in the example given before, hiragana and katakana are quite sparse in newspaper headlines, which would otherwise score mainly as Traditional Chinese. To distinguish between the two Chinese scripts, each simplified Han character was given a positive score for Simplified and a negative score for Traditional.
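A rough sketch of how such unigram weights could be summed is below. The weight values, the language codes, and the `score` function are assumptions for illustration, not the real dictionary contents or the real scoring routine.

```python
from collections import Counter

# Hypothetical unigram weights: kana push strongly towards Japanese and
# away from both Chinese scripts; simplified-only Han characters push
# towards Simplified and away from Traditional.
WEIGHTS = {
    "の": {"ja": 100, "zh-Hans": -100, "zh-Hant": -100},
    "を": {"ja": 100, "zh-Hans": -100, "zh-Hant": -100},
    "汉": {"zh-Hans": 20, "zh-Hant": -20},
    "语": {"zh-Hans": 20, "zh-Hant": -20},
}


def score(text):
    """Sum the per-language weights of every character that has an entry."""
    totals = Counter()
    for ch in text:
        for lang, weight in WEIGHTS.get(ch, {}).items():
            totals[lang] += weight
    return dict(totals)


def best_guess(text):
    """Return the highest-scoring language, or None if nothing matched."""
    totals = score(text)
    return max(totals, key=totals.get) if totals else None
```

With these made-up numbers, a single の in a headline that otherwise consists of Han characters is enough to tip the result to Japanese, which is the point of making the kana weights large.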
For Western languages with overlapping stopwords, we also mined our existing lexical dictionaries for distinctive prefixes and suffixes, as well as verbal inflectional endings, to ensure that we identified as many uniquely occurring features of each language as possible. When even that was insufficient, a statistical analysis of relative prevalence factored into our scoring values. If an n-gram-only identifier were run against the example "Schwarzeneggernek" (I've linked here to the Google.com search results), it would come back as German. However, we know that -nek is a Hungarian inflection. By capturing this in our dictionary, we ensure that regardless of what precedes this ending, our system indicates that the word is likely to be Hungarian.
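The position-constrained match for an ending like -nek might look something like the following sketch. The `SUFFIXES` table, its weight, and the `suffix_scores` function are hypothetical; only the -nek example itself comes from the discussion above.

```python
# Hypothetical suffix table: ending -> {language code: integer weight}.
SUFFIXES = {
    "nek": {"hu": 50},
}


def suffix_scores(word):
    """Score a word by its ending only, ignoring whatever precedes it."""
    scores = {}
    lowered = word.lower()
    for suffix, weights in SUFFIXES.items():
        # Require at least one character before the ending, so that the
        # bare string "nek" is not treated as an inflected word.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            for lang, weight in weights.items():
                scores[lang] = scores.get(lang, 0) + weight
    return scores
```

Because the match looks only at the end of the word, the German-looking stem contributes nothing here; in a full system these suffix scores would be added to whatever the other dictionary entries produce.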
Unicode ranges were also useful, as we ensured that the sample texts being parsed were all normalized to UTF-16LE (little endian) first. So if we ever needed to identify Klingon, the dictionary could be extended extremely easily.
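As an illustration of range-based entries, here is a small sketch that decodes UTF-16LE bytes and checks each character against a few code point ranges. The Hiragana and Katakana ranges are the standard Unicode blocks. Klingon has no official Unicode block, so the range used here is the private-use assignment from the ConScript Unicode Registry, included purely as an assumption; the `RANGES` table and the function name are likewise made up.

```python
# (first code point, last code point, language code)
RANGES = [
    (0x3040, 0x309F, "ja"),   # Hiragana block
    (0x30A0, 0x30FF, "ja"),   # Katakana block
    (0xF8D0, 0xF8FF, "tlh"),  # Klingon pIqaD, private use (assumption)
]


def languages_by_range(raw):
    """Decode UTF-16LE bytes and collect languages whose ranges are hit."""
    text = raw.decode("utf-16-le")
    found = set()
    for ch in text:
        code_point = ord(ch)
        for first, last, lang in RANGES:
            if first <= code_point <= last:
                found.add(lang)
    return found
```

Adding a language with its own script then amounts to appending one more row to the table.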
LanguageWare thus includes a Language ID component, which avails of the existing functionality, such as parsing lexemes, and uses this separate dictionary (at last examination 1.4 MB, covering around 35 languages) to "guess" the primary language of text fragments. We tested this against article headlines, many of which were truncated to under 120 characters including spaces, so I'm quite confident that it would hold up very well for tweets too; even tweets where URLs take up space should be identified reasonably accurately. The URLs could, through the use of regular expressions, either be omitted from consideration or processed so that hyphens and slashes are treated like whitespace.
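Both URL options can be sketched with Python's `re` module. The pattern below is a deliberately simple assumption (anything starting with `http://` or `https://` up to the next whitespace), and the function names are made up; a production pattern would need to be more careful about trailing punctuation.

```python
import re

# Simplistic URL pattern, assumed for illustration.
URL_RE = re.compile(r"https?://\S+")


def drop_urls(text):
    """Option 1: omit URLs from consideration entirely."""
    return " ".join(URL_RE.sub(" ", text).split())


def split_urls(text):
    """Option 2: keep URL words, treating hyphens and slashes as spaces."""
    def expand(match):
        without_scheme = re.sub(r"^https?://", "", match.group(0))
        return re.sub(r"[-/]+", " ", without_scheme)
    return " ".join(URL_RE.sub(expand, text).split())
```

For the input `"Új cikk http://example.com/magyar-hirek ma"`, the first function returns `"Új cikk ma"` and the second returns `"Új cikk example.com magyar hirek ma"`, so the words embedded in the URL path can still contribute to the score.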
For more context, try these earlier posts:
Language ID part 1
Language ID part 2
Language ID part 3