Language ID Part 3 - more challenges

Stop word detection usually works...


In my prior post on this subject, the Jabberwocky poem examples hopefully demonstrated that a language label can still be assigned to text when it contains words identifiable as belonging to a language's pronouns, conjunctions, and adpositions. In this context, such identifiers are considered stop words.

The presence of such terms was sufficient for us to recognize the language even when the nouns, verbs, adjectives, and adverbs were unidentifiable (nonexistent in our vocabulary). It is worth noting, however, that in those examples certain inflections hinted at specific qualities of the nonsensical words. Nouns in particular could be spotted in the inflected languages when pluralization or possession was marked: via -s/'s endings in English (though -s can also indicate possession in German), or via the combination of title-case capitalization and, when plural, an -en ending in German. In Japanese, the presumed proper nouns were transliterated into the katakana script, an orthographic convention generally used for foreign terms (including verbs, which are then often inflected using Japanese grammar).
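The stop-word approach described above can be sketched in a few lines. This is a minimal illustration, not the actual system from the post: the word lists here are tiny hypothetical samples, and a real detector would use much larger sets per language.

```python
# Minimal sketch of stop-word-based language ID. The word lists are short
# illustrative samples, not the full sets a production system would use.
STOP_WORDS = {
    "en": {"the", "and", "of", "in", "is", "that", "with"},
    "de": {"der", "die", "das", "und", "mit", "ist", "nicht"},
    "nl": {"de", "het", "een", "en", "van", "niet", "met"},
}

def guess_language(text):
    """Return the language whose stop-word set covers the most tokens."""
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Jabberwocky-style input: the content words are nonsense, but the
# function words still give the language away.
print(guess_language("the borogoves were in the wabe"))  # "en"
```

Note that the nonsense nouns contribute nothing to the score; the label rests entirely on the function words, which is exactly why overlapping stop words (discussed next) cause trouble.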

...except when languages overlap

Another challenge presents itself when written languages share an extensive history and thus use overlapping stop words. Norwegian Bokmål/Danish, Czech/Slovak, and Traditional Chinese/historical Japanese texts are some examples that readily come to mind. When the input texts are short enough, even Dutch and English show enough similarities to generate errors (the words in, over, and is are both common and shared).
Fortunately, in modern Japanese the presence of the kana scripts (hiragana and katakana) allowed us to easily distinguish it from either of the Chinese scripts. Japan also simplified some of the more complicated kanji for pragmatic reasons, though it has retained more of the traditional Han characters than the mainland Chinese government. (As a sidebar, there are also uniquely Japanese kanji characters, though relatively few in number, constructed from constituent Han components.) The same script-based approach was also necessary to distinguish languages that use other scripts, such as Arabic and Cyrillic.
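The kana check is straightforward to implement, since hiragana and katakana occupy contiguous Unicode blocks (U+3040-U+309F and U+30A0-U+30FF respectively). A minimal sketch of the idea, not the post's actual implementation:

```python
# Sketch: distinguish Japanese from Chinese text by checking for kana
# code points. Hiragana occupies U+3040-U+309F and katakana U+30A0-U+30FF,
# so a single combined range test suffices.
def contains_kana(text):
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)

print(contains_kana("日本語のテキスト"))  # True: has hiragana and katakana
print(contains_kana("中文文本"))          # False: Han characters only
```

A text written entirely in kanji (as historical Japanese sometimes was) would still defeat this test, which is why the post flags Traditional Chinese/historical Japanese as an overlapping pair.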
For the Bokmål/Danish challenge, it was necessary to statistically analyze their respective rates of prefix and suffix prevalence. This analysis was performed on proprietary dictionary files, which were historically used for spelling verification.
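The suffix-rate idea can be sketched as follows. Since the post's dictionary files are proprietary, the word list and suffixes below are purely illustrative stand-ins; a real comparison would tally rates over each language's full lexicon and contrast the resulting distributions.

```python
# Hedged sketch of suffix-rate analysis: measure how often candidate
# suffixes terminate words in a language's word list. The tiny word list
# stands in for the proprietary dictionary files the post mentions.
from collections import Counter

def suffix_rates(words, suffixes):
    """Return the fraction of words ending in each candidate suffix."""
    counts = Counter()
    for w in words:
        for s in suffixes:
            if w.endswith(s):
                counts[s] += 1
    total = len(words)
    return {s: counts[s] / total for s in suffixes}

rates = suffix_rates(["hilsen", "jenten", "huset"], ["en", "et"])
```

Comparing such rate profiles between two wordlists highlights which affixes discriminate best between closely related languages.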

...and, obviously, when stop words are scarce.


For groups of languages that extensively share their alphabets, both stop words and "giveaway" unique characters grew quite scarce whenever the text samples were news headlines or HTML file <title> values. The sheer brevity of these fragments (ranging from 20-80 characters inclusive of spaces) meant that distinctive characters appeared less frequently, and stop words, despite their importance, were eschewed because the text length was constrained. Thus both the presence and absence of n-grams (bi- and tri-grams, to be more specific) became crucial for differentiating amongst these languages. For instance, German has useful noun endings such as -keit and -ung (the latter having historical links to the English gerundive -ing suffix) which aren't found in Dutch, whereas Dutch words far more often contain repeated double vowels (especially aa and ee) as well as distinct sequences such as the -ij suffix.
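The n-gram discrimination described above can be sketched as a weighted count over character sequences. The grams and weights below are hand-picked illustrations drawn from the examples in the text (-ung, -eit, aa, ee, ij), not a trained model, and the scoring is deliberately simplistic:

```python
# Sketch: score short texts by weighted counts of discriminative character
# n-grams. Weights are illustrative, not derived from real corpus statistics.
NGRAM_WEIGHTS = {
    "de": {"ung": 3, "eit": 3, "sch": 1},
    "nl": {"aa": 3, "ee": 2, "ij": 3, "sch": 1},
}

def score(text, weights):
    text = text.lower()
    return sum(w * text.count(g) for g, w in weights.items())

def classify(text):
    """Return the language whose n-gram profile scores highest."""
    scores = {lang: score(text, w) for lang, w in NGRAM_WEIGHTS.items()}
    return max(scores, key=scores.get)

print(classify("vrijheid en gelijkheid"))    # "nl": ij hits
print(classify("die Bedeutung der Freiheit"))  # "de": ung and eit hits
```

A production system would instead learn per-language n-gram frequency profiles from corpora and compare them probabilistically, but the headline-length constraint is the same: every bigram or trigram has to earn its keep.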

The next and final post on Language ID will cover some specific aspects of our patented solution!
Here are links to the initial Language ID post and its sequel.
