
Saturday, December 31, 2011

Year-end thoughts, 2011 edition

Over the lifetime I've spent living in various Western countries, I've noticed the predilection for media and individuals alike to focus on retrospection around this time of year: that is, reminiscing about the various events and experiences that one associates with the prior year. In direct contrast to this, it's my understanding that in Japan, it is customary to have 忘年会 (year-forgetting parties) which, paired with the 新年会 (new-year parties, which occur after the 正月三が日 - first three days in January - time period), encourage the forgetting of the prior year through much carousing and imbibing.

This year was particularly unforgettable to those with ties to Japan, however, and I've seen social media statuses speaking to the importance of remembering the disasters that have befallen my cultural homeland. The fallout - both metaphorical and literal (environmental, economic, political, and emotional) - will be palpable for decades, if not centuries, regardless of any desire the world may have to forget.

Personally speaking, I tend to mix traditions at year's end (as I suspect many Nikkei do): indulging in some retrospection, but mainly renewing commitment to work on fallibilities and improve upon specific aspects of life. This time, however, I also spent a surprising amount of time thinking more profoundly about my past, via fleshing out my timeline (courtesy of Facebook's latest UI change). But, this blog isn't the right forum to share such factoids.

I hope that 2012 enables everyone to make progress on what they wish, learn and implement interesting ideas, and generally experience fulfillment, sound health and happiness.

Wednesday, December 28, 2011

LanguageWare's robust, extensible Language ID (part 4)

Having introduced the "prior art" approaches to identifying textual language in part 1 and part 3 (to wit: stop word presence and n-gram detection), I can now speak to the patented idea which we implemented as part of LanguageWare, a set of Java libraries that offer NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store the following types of information.

Each entry consists of the following:

  • Term or n-gram
  • Language(s) with which it's associated
  • Whether it can occur as a standalone term, at the beginning, middle, or end of a word, or some combination of these, and
  • An integer weighting value (per term/language pairing)
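
To make that structure concrete, here is a minimal sketch of how such an entry might be modelled. LanguageWare itself is Java; this illustrative sketch uses Python, and the field names, flag encoding, and weight values are my own invention rather than the patented implementation:

```python
from dataclasses import dataclass, field

# Positional flags: where the term or n-gram may occur within a word.
STANDALONE, PREFIX, INFIX, SUFFIX = 1, 2, 4, 8

@dataclass
class DictEntry:
    term: str       # the term or n-gram itself
    positions: int  # bitmask of STANDALONE/PREFIX/INFIX/SUFFIX
    weights: dict = field(default_factory=dict)  # language -> integer weight

# The Hungarian example discussed below: -nek occurring as a word ending.
nek = DictEntry("nek", positions=SUFFIX, weights={"hu": 50})
```

A single entry can thus carry weights for several languages at once, which is what makes the positive/negative scoring described next possible.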

Thus, for the Simplified/Traditional Chinese and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive values for Japanese and large negative scores for the Chinese languages: as in the example given before, hiragana and katakana are quite sparse in newspaper headlines, which would otherwise score mainly as Traditional Chinese. To distinguish between the two Chinese scripts, each simplified Han character was associated with both a positive score for Simplified and a negative score for Traditional.
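
A toy version of that scoring pass might look like the following; the weight values are entirely hypothetical, chosen only to illustrate the positive/negative pairing described above:

```python
# Hypothetical unigram weights: a hiragana character scores strongly for
# Japanese and against both Chinese variants; Han characters split the
# Simplified/Traditional decision.
WEIGHTS = {
    "の": {"ja": 100, "zh-Hans": -100, "zh-Hant": -100},  # hiragana
    "国": {"ja": 5, "zh-Hans": 10, "zh-Hant": -10},       # simplified Han
    "國": {"zh-Hant": 10, "zh-Hans": -10},                # traditional Han
}

def score(text):
    """Sum per-language weights over the characters; return the best guess."""
    totals = {}
    for ch in text:
        for lang, w in WEIGHTS.get(ch, {}).items():
            totals[lang] = totals.get(lang, 0) + w
    return max(totals, key=totals.get) if totals else None

guess = score("国の")  # a simplified Han character plus a hiragana
```

Even one kana outweighs the Han-character evidence here, mirroring how sparse kana can still tip a headline toward Japanese.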

For Western languages with overlapping stopwords, we also mined our existing lexical dictionaries for distinctive prefixes and suffixes, as well as verbal inflectional endings, to ensure that we identified as many uniquely occurring features of each language as possible. When even that was insufficient, a statistical analysis of relative prevalence factored into our scoring values. If an n-gram-only identifier were run against the example "Schwarzeneggernek" (I've linked here to the search results), it would come back as German. However, we know that -nek is a Hungarian inflection. By capturing this in our dictionary, we ensure that regardless of what precedes this ending, our system indicates that the word is likely to be Hungarian.
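
The suffix-capture idea can be sketched as a small lookup over word-final sequences; the table and weights below are illustrative samples, not the actual dictionary contents:

```python
# Hypothetical suffix table: word-final sequences that are strong signals
# for one language regardless of the (possibly unknown) stem.
SUFFIX_WEIGHTS = {
    "nek": {"hu": 40},   # Hungarian dative, as in "Schwarzeneggernek"
    "keit": {"de": 40},  # German noun ending
    "ung": {"de": 20},   # German noun ending
    "ij": {"nl": 30},    # distinctively Dutch sequence
}

def suffix_votes(word, max_len=4):
    """Collect language votes from word-final sequences of length 2..max_len."""
    votes = {}
    for n in range(2, max_len + 1):
        for lang, w in SUFFIX_WEIGHTS.get(word[-n:], {}).items():
            votes[lang] = votes.get(lang, 0) + w
    return votes

v = suffix_votes("schwarzeneggernek")
```

Because only the ending is consulted, an unknown proper noun in front of the suffix makes no difference, which is exactly the point made above.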

Unicode ranges were also useful, as we ensured that the sample texts being parsed were all normalized to UTF-16LE (little endian) first. So if we ever needed to identify Klingon, the dictionary could be extended extremely easily.

LanguageWare thus includes a Language ID component, which avails of the existing functionality, such as parsing lexemes, and uses this separate dictionary (which at last examination was 1.4 MB, covering around 35 languages) to "guess" the primary language of text fragments. As we tested this against article headlines, many of which were truncated to under 120 characters inclusive of spaces, I'm quite confident that it would hold up very well even for tweets; even those where URLs take up space would be identified reasonably accurately. The URLs could, through the use of regular expressions, either be omitted from consideration or processed such that hyphens and slashes are treated like whitespace.
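
Both URL treatments are straightforward with regular expressions; here is a sketch of the two options (the patterns are illustrative, not what LanguageWare actually uses):

```python
import re

# Option 1: drop URLs entirely. Option 2: keep them, but break hyphens,
# slashes, and dots into whitespace so any real words in the path survive.
URL_RE = re.compile(r"https?://\S+")

def strip_urls(text):
    return URL_RE.sub(" ", text)

def urls_as_words(text):
    return URL_RE.sub(lambda m: re.sub(r"[-/:.]", " ", m.group()), text)

tweet = "Szia! http://example.com/magyar-hirek itt"
```

The second option is attractive for tweets, since path segments often contain language-bearing words that would otherwise be thrown away.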

For more context, try these former posts:
Language ID part 1
Language ID part 2
Language ID part 3

Tuesday, December 6, 2011

Language ID Part 3 - more challenges

Stop word detection usually works...

In my prior post about this subject, the Jabberwocky poem examples hopefully demonstrated that when a text contains words identifiable as belonging to a language's pronoun, conjunction, or adposition parts of speech, a language label can still be assigned to it. Such identifiers are, in this context, considered to be stop words.

The presence of such terms was sufficient for us to recognize the language even when the nouns, verbs, adjectives and adverbs were unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples, some inflections hinted at the nonsensical words having specific qualities. Spotting nouns, in particular, was enabled in the inflected languages when pluralization or possession was shown via -s/'s endings (English, though -s can indicate possession in German also), or by the combination of title-case capitalization and (when plural) an -en ending in German. In Japanese, the presumed proper nouns were transliterated into the katakana script, which is an orthographic convention for foreign terms generally (including verbs, which are then often inflected using Japanese grammar).
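
The stop-word approach can be sketched in a few lines; the word lists below are tiny, illustrative samples rather than real inventories:

```python
# Tiny illustrative stop-word lists; production inventories are far larger.
STOP_WORDS = {
    "en": {"the", "and", "did", "in", "he", "all"},
    "de": {"der", "und", "die", "in", "er", "alle"},
}

def stopword_guess(text):
    """Count stop-word hits per language and return the best-scoring one."""
    tokens = text.lower().split()
    counts = {lang: sum(t in words for t in tokens)
              for lang, words in STOP_WORDS.items()}
    return max(counts, key=counts.get)

# Nonsense content words, real function words -- the language still shows.
g = stopword_guess("Twas brillig and the slithy toves did gyre")
```

The content words are gibberish, but "and", "the", and "did" alone are enough to label the fragment as English, which is the Jabberwocky effect described above.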

...except when languages overlap

Another challenge presents itself, however, when written languages share extensive history and thus use overlapping stop words. Norwegian Bokmål/Danish, Czech/Slovak, and Traditional Chinese/historical Japanese texts are some examples that readily come to mind. When the input texts are short enough, even Dutch and English show enough similarities to generate errors (the words in, over and is are all both common and shared).
Fortunately, in modern Japanese the presence of the kana scripts (hiragana and katakana) allowed us to easily distinguish it from either of the Chinese languages. Japan has also simplified some of the more complicated kanji for pragmatic reasons, though it has retained more of the traditional Han characters than the mainland Chinese government. (As a sidebar, there are uniquely Japanese kanji characters as well, though relatively few in number, constructed from constituent Han components.) This script-based approach was also necessary to distinguish between languages that use other scripts, such as Arabic and Cyrillic.
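
The kana-based distinction, like the Arabic and Cyrillic cases, amounts to checking which Unicode range a character falls in. A rough sketch (the range table is abbreviated and illustrative):

```python
# Coarse script detection via Unicode code-point ranges. Extending to a
# new script is just another entry in the table.
SCRIPT_RANGES = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
]

def script_of(ch):
    """Return the script name for a character, or 'Other' if unlisted."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Other"
```

A single hiragana or katakana hit is enough to rule out both Chinese variants, which is why this check is so cheap and effective for Japanese.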
For the Bokmål/Danish challenge, it was necessary to statistically analyze their respective rates of prefix and suffix prevalence. This was performed on proprietary dictionary files, which had historically been used for spelling verification.

...and, obviously, when stop words are scarce.

For groups of languages that extensively share their alphabets, not only stop words but also "giveaway" unique characters grew quite scarce whenever the text samples were news headlines or HTML file <title> values. The sheer brevity of these fragments (ranging from 20-80 characters inclusive of spaces) meant that distinct characters tended to appear less frequently, and stop words were eschewed, despite their importance, due to the constrained text length. Thus both the presence and absence of n-grams (bi- and tri-grams, to be more specific) became crucial for differentiating amongst these languages. For instance, German has useful noun endings such as -keit and -ung (the latter having historical links to the English gerundive -ing suffix) which aren't found in Dutch, whereas Dutch words show far more repeated double vowels (especially aa and ee) as well as distinct sequences such as the -ij suffix.
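
That bidirectional use of n-gram evidence (presence counting for one language and against its neighbour) can be sketched as follows, with made-up weights:

```python
# Character n-gram evidence for closely related languages: a distinctive
# sequence scores for one language and can penalize its neighbour.
# All weights are illustrative, not measured values.
NGRAM_WEIGHTS = {
    "keit": {"de": 4, "nl": -2},
    "ung": {"de": 3, "nl": -1},
    "ij": {"nl": 3, "de": -2},
    "aa": {"nl": 2, "de": -1},
    "ee": {"nl": 1},
}

def ngram_score(text):
    """Tally weighted n-gram occurrences and return the higher-scoring language."""
    scores = {"de": 0, "nl": 0}
    for gram, weights in NGRAM_WEIGHTS.items():
        hits = text.count(gram)
        for lang, w in weights.items():
            scores[lang] += hits * w
    return max(scores, key=scores.get)

lang = ngram_score("vrijheid en samenwerking")
```

Note how a single -ij hit both boosts Dutch and penalizes German, so even a headline-length fragment with one distinctive sequence can be separated.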

The next and final post on Language ID will cover some specific aspects of our patented solution!
Here are links to the initial Language ID post and its sequel, also.

Thursday, December 1, 2011

Every breath we take (Foursquare et al.)

Approximate location of this blogger, give or take a few hundred metres

After many months of dragging my feet, I joined Foursquare today. For those unfamiliar with it, this geo-social networking service allows a registered user with a smartphone to download an application that makes it easy to "check in" to physical places. With tie-ins to Facebook and Twitter, it encourages users to publicly promote the businesses and services they prefer. These endorsements, in turn, are the incentive that businesses value sufficiently to make offers to those who check in.

Truth be told, I'm not a particularly suitable user of such services as these. First, I'd rather not have my whereabouts documented online to this level of detail, even though I don't live alone (and thus, am not quite so susceptible to being burgled).

Second, I'm an inconspicuous consumer - that is, I try to live frugally, and what I consider to be frivolous purchases mainly take place online (and quite infrequently, on the order of semi-annually).

Third, none of my friends live in locales that I could realistically visit (without taking a long-haul flight). Case in point: the first four people I linked with on Foursquare live in Chiba prefecture, Tokyo, Saskatchewan, and Ontario, respectively. I'm excepting work-related folk in saying that, though, of whom I'm actually quite fond. It's just that my long-standing policy of not conflating professional and personal relationships has resulted in residual reluctance to socialize with them outside of regular work hours.

As counterpoints to the above three observations, though, I discovered that I did indeed have enough reason to join the service nonetheless. So long as I carry an Android phone, it seems my every move is being carefully tracked and filed away anyhow. I'm also interested in which of the areas I roam offer free (and presumably secure) WiFi. And I'd like to start documenting places and dishes that I would recommend to people who visit my city, which the "done" and "tips" features of this service allow me to do ex post facto. I've already started to work on those, as I don't have to commit to stating when I experienced or ate anything, but just that I had.

The fact that I'm not tweeting my whereabouts (and that I intend to check in only sporadically; I've also been known to swap my SIM card into a non-smart phone on occasion) means I can still achieve a semblance of living "under the radar". So to those readers who plan to request a link with me on there, please don't expect to infer too much more about my life, although by agreeing to open up this data to you, I would be showing that I implicitly trust you.

About Mayo


Professional: As "Senior Enterprise SEO Strategist" in IBM's Digital Marketing division, I provide consulting and training services for both internal and external clients. Formerly I was involved in Natural Language Processing, software localization, quality assurance and documentation authoring.
Personal: INTJ Nikkei Nisei ex-patriated Canadian who takes photographs and enjoys Baroque through late Classical music. The G+ page shares some of the "best of" photos.