Language ID (textual) - part 1

Image: word cloud of one person's compilation of English stop words, courtesy of Armand Brahaj (whose site has been infected by malware). Here instead is ranks.nl's list.
 
Now that half a year has elapsed since the inception of this blog, some readers may be wondering when I might cover more topics related to the "Linguistics" part of "SEO, Linguistics, Localization".

In fact, one of the things that prompted me to start this blog was the issuance of two patents, filed in 2005 and 2006, for which I was a co-inventor and the sole inventor, respectively. Both filings concerned language identification (from textual input): the first tackled the challenges of identifying a text's (primary) language, and the second was an application of the first, combined with messaging software.

Rather than overwhelm the reader with extensive explanations, I'm going to attempt a series of posts that covers everything in the way that makes the most sense to me (which is actually how I approach most things in life, not just work). I may end up consolidating or splitting future posts, so I can't yet solve for the X in "part 1 of X". :-)

I should also insert the disclaimer that the intended target audience is people with little to no exposure to any sort of linguistics. However, given my wider audience - particularly on G+ - there are bound to be people who take issue with my oversimplifications. To them, I apologize in advance!

First, the problem. One may ask oneself: what IS the problem with, or what ARE the limitations of, textual language identification solutions? Well, take news article headlines or similarly short textual fragments, for instance. Here are two examples.

"Schwarzenegger's in a 1990 hit film 'Kindergarten Cop'." (a factual statement)
"海上危機管理で協議機関設置へ協力 玄葉外相訪中" (a news headline from November 23, 2011)

Before our implementation, the two prevalent approaches to identifying the language of a text were n-gram detection and stop word identification. But in headlines, stop words tend to be omitted. Although there is no canonical set of stop words per se for any language, in the context of topic identification and indexing they tend to be function words: their parts of speech are usually conjunctions, adpositions and articles.
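To make the stop-word approach concrete, here is a minimal sketch in Python. The word lists, the scoring, the function names and the sample headline are purely illustrative, not taken from any particular library or corpus:

# A minimal sketch of stop-word-based language guessing.
# The tiny word lists and scoring below are illustrative only; real
# systems use much larger lists and smarter weighting.

STOP_WORDS = {
    "en": {"the", "a", "an", "and", "of", "in", "to", "is"},
    "de": {"der", "die", "das", "und", "von", "in", "zu", "ist"},
    "fr": {"le", "la", "les", "et", "de", "en", "est", "un"},
}

def guess_by_stop_words(text):
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # Headlines often drop function words entirely, so the best score
    # can easily be zero, i.e. no decision is possible.
    return best if scores[best] > 0 else None

print(guess_by_stop_words("The cat sat on the mat and purred."))
# -> "en"
print(guess_by_stop_words("Merger talks collapse amid shareholder revolt"))
# -> None (a typical headline: no stop words to latch onto)

The second call is the interesting one: a perfectly ordinary headline leaves this kind of detector with nothing to work with.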

This leaves the detection of n-grams, which are basically common groupings of characters (for example, the consonant cluster "sch" is frequent in German).
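A character n-gram detector can be sketched just as minimally. The trigram sets below are tiny and hand-picked for illustration; real detectors learn ranked n-gram frequency profiles per language from training text:

from collections import Counter

# A minimal sketch of character n-gram language guessing.
NGRAM_PROFILES = {
    "en": {"the", "ing", "and", "ion", "ent"},
    "de": {"sch", "der", "ein", "ich", "und"},
}

def char_trigrams(text):
    # Keep only letters and spaces, lowercase, then slide a 3-character window.
    cleaned = "".join(c.lower() for c in text if c.isalpha() or c == " ")
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))

def guess_by_ngrams(text):
    grams = char_trigrams(text)
    scores = {
        lang: sum(n for g, n in grams.items() if g in profile)
        for lang, profile in NGRAM_PROFILES.items()
    }
    return max(scores, key=scores.get)

# "Schwarzenegger" and "Kindergarten" contain German-looking clusters
# such as "sch" and "der", so this English sentence scores as German.
print(guess_by_ngrams("Schwarzenegger's in the 1990 hit film 'Kindergarten Cop'."))
# -> "de"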

So, using the traditional approaches, the first example would be guessed as being in German, and the second, likely as Traditional Chinese (although the example also contains a fair proportion of Simplified Chinese Han, and just two native Japanese characters). If one were relying on this layer of processing to categorize news articles based on their title values, that would most certainly be a problem.
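The confusion around the second example is easy to see by tallying which Unicode script each character belongs to: only two characters (the hiragana で and へ) are exclusive to Japanese, while the rest are Han ideographs shared between Chinese and Japanese. A quick sketch using only Python's standard library (the bucketing function is my own illustration):

import unicodedata
from collections import Counter

headline = "海上危機管理で協議機関設置へ協力 玄葉外相訪中"

def script_of(ch):
    # Bucket a character by its Unicode name prefix. Han ideographs are
    # shared by Chinese and Japanese; kana pin the text to Japanese.
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han (shared by Chinese and Japanese)"
    if name.startswith("HIRAGANA") or name.startswith("KATAKANA"):
        return "Kana (Japanese)"
    return "other"

counts = Counter(script_of(ch) for ch in headline if not ch.isspace())
print(counts)
# -> 20 Han characters and 2 kana (the hiragana で and へ)

With so little uniquely Japanese material to go on, a purely character-frequency-based guesser has every opportunity to land on the wrong language.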

See the sequel post concerning Language ID here.
See here for the final post about Language ID.
