LanguageWare's robust, extensible Language ID (part 4)

Having introduced the "prior art" approaches to identifying textual language in parts 1 and 3 (to wit: stop-word presence and n-gram detection), I can now speak to the patented idea we implemented as part of LanguageWare, a set of Java libraries offering NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store a rich set of information with every entry.

Each entry consists of the following (see the code sketch after the list):

  • Term or n-gram
  • Language(s) with which it's associated
  • Whether it can occur as a standalone term, at the beginning of a word, in the middle of a word, at the end of a word, or in some combination of these positions, and
  • An integer weighting value (per term/language pairing)
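
To make this concrete, here is a minimal sketch of what one such entry might look like in Java. The names and types are my own illustration, not the actual LanguageWare internals (which store all of this in a far more compact form):

    import java.util.EnumSet;
    import java.util.Map;

    /** One illustrative dictionary entry; all names here are hypothetical. */
    class LangIdEntry {
        /** Positions at which the term or n-gram may match within a token. */
        enum Position { STANDALONE, WORD_START, WORD_MIDDLE, WORD_END }

        final String termOrNgram;                 // e.g. "nek", or a single kana
        final EnumSet<Position> positions;        // where it is allowed to occur
        final Map<String, Integer> weightByLang;  // per-language integer weights

        LangIdEntry(String termOrNgram, EnumSet<Position> positions,
                    Map<String, Integer> weightByLang) {
            this.termOrNgram = termOrNgram;
            this.positions = positions;
            this.weightByLang = weightByLang;
        }
    }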

Thus, for the problem of disambiguating Simplified Chinese, Traditional Chinese, and Japanese, the Japanese-specific kana (listed as unigrams) were given large positive weights for Japanese and large negative weights for the two Chinese languages: as in the example given earlier, hiragana and katakana are quite sparse in newspaper headlines, which would otherwise score mainly as Traditional Chinese. To distinguish between the two Chinese scripts, each simplified Han character was associated with a positive weight for Simplified Chinese and a negative weight for Traditional Chinese.
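
To illustrate how those weights interact, here is a toy scoring loop in Java. The characters and weight magnitudes are invented for the example; the real dictionary's values were tuned, not guessed:

    import java.util.HashMap;
    import java.util.Map;

    /** Toy unigram scorer; weights are invented for illustration only. */
    class CjkScoringSketch {
        static final Map<Character, Map<String, Integer>> WEIGHTS = new HashMap<>();
        static {
            // A hiragana character: strong positive evidence for Japanese,
            // strong negative evidence for both Chinese scripts.
            WEIGHTS.put('の', Map.of("ja", 100, "zh-Hans", -100, "zh-Hant", -100));
            // A simplified-only Han character (traditional form: 們).
            WEIGHTS.put('们', Map.of("zh-Hans", 50, "zh-Hant", -50));
        }

        static String guess(String text) {
            Map<String, Integer> scores = new HashMap<>();
            for (char c : text.toCharArray()) {
                Map<String, Integer> w = WEIGHTS.get(c);
                if (w == null) continue; // unknown character casts no vote
                w.forEach((lang, v) -> scores.merge(lang, v, Integer::sum));
            }
            // Highest cumulative score wins; ties and empty input are not handled.
            return scores.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse("unknown");
        }
    }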

For Western languages with overlapping stopwords, we also mined our existing lexical dictionaries for distinctive prefixes, suffixes, and verbal inflectional endings, to ensure that we captured as many uniquely occurring features of each language as possible. When even that was insufficient, a statistical analysis of relative prevalence factored into our scoring values. If an n-gram-only identifier were run against the example "Schwarzeneggernek" (I've linked here to the Google.com search results), it would come back as German. However, we know that -nek is a Hungarian inflection. By capturing this ending in our dictionary, we ensure that regardless of what precedes it, our system indicates that the text is likely to be Hungarian.
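
A toy version of that affix check might look like the following; the suffix table and its weights are invented for illustration:

    import java.util.Map;

    /** Toy suffix scorer; entries and weights are invented for illustration. */
    class SuffixSketch {
        // Entries that may only match at the end of a word (WORD_END above).
        static final Map<String, Map<String, Integer>> SUFFIXES = Map.of(
                "nek", Map.of("hu", 80),   // Hungarian dative, as in "Schwarzeneggernek"
                "ung", Map.of("de", 40));  // German nominalizing suffix, as in "Zeitung"

        static void scoreWord(String word, Map<String, Integer> scores) {
            SUFFIXES.forEach((suffix, weights) -> {
                if (word.toLowerCase().endsWith(suffix)) {
                    weights.forEach((lang, w) -> scores.merge(lang, w, Integer::sum));
                }
            });
        }
    }

Run against "Schwarzeneggernek", the -nek entry fires no matter what stem precedes it, which is exactly the property described above.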

Unicode ranges were also useful, since we ensured that the sample texts being parsed were all normalized to UTF-16LE (little-endian) first. So if we ever needed to identify Klingon, the dictionary could be extended very easily.
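
Because the input is guaranteed to be normalized Unicode, script-level evidence can be read straight off the code points. A sketch using the standard java.lang.Character API:

    /** Sketch: coarse script evidence via Unicode blocks (standard Java API). */
    class ScriptSketch {
        static boolean isKana(int codePoint) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
            return block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA;
        }
    }

Adding a new script (Klingon included, once it has an agreed code-point range) then becomes a matter of adding dictionary entries, not changing the engine.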

LanguageWare thus includes a Language ID component, which draws on existing functionality such as lexeme parsing, and uses this separate dictionary (at last examination 1.4 MB, covering around 35 languages) to "guess" the primary language of text fragments. Since we tested this against article headlines, many of which were truncated to under 120 characters including spaces, I'm quite confident it would hold up well even for tweets; even those where URLs take up much of the space should be identified reasonably accurately. The URLs could, through the use of regular expressions, either be omitted from consideration or processed so that hyphens and slashes are treated as whitespace.
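
For the URL handling just mentioned, either strategy is a short regular expression. The pattern below is a deliberately crude sketch, not what LanguageWare actually ships:

    import java.util.regex.Pattern;

    /** Sketch: two crude URL-handling strategies for short texts like tweets. */
    class UrlPreprocessSketch {
        static final Pattern URL = Pattern.compile("https?://\\S+");

        // Strategy 1: omit URLs from consideration entirely.
        static String stripUrls(String text) {
            return URL.matcher(text).replaceAll(" ");
        }

        // Strategy 2: keep the URL's words, treating hyphens and slashes
        // as whitespace so any natural-language tokens in it still score.
        static String splitUrlTokens(String text) {
            return text.replaceAll("[-/]", " ");
        }
    }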

For more context, see the earlier posts in this series:
Language ID part 1
Language ID part 2
Language ID part 3
