An eclectic IT career: Mayo Takeuchi

Showing posts with the label linguistics

Thoughts on Google's Knowledge Graph

May 16, 2012

Disambiguating "Taj Mahal": structure or music band? (Image courtesy of Google's own blog.)

Google has announced the roll-out of what is otherwise known as the semantic web: ways to prompt the user to help disambiguate query terms ("strings", as in sequences of textual characters) into more specific concepts ("things"). It is a very catchy slogan. The Mashable article provides a basic overview of what this news means, and as I read it, my thoughts inevitably turned to my former job in LanguageWare (which I partially described over four non-contiguous blog posts last year, related to Language Identification). When one is first exposed to linguistic data that has been amassed for the purpose of spell-checking, it quickly becomes clear that in order to use these same word lists to effect grammatical checks, and even orthographical ones (e.g. whether a proper noun needs to be title-cased even when it doesn't commence a sentence), the part of speech is important. The afor...

LanguageWare's robust, extensible Language ID (part 4)

December 28, 2011

Having introduced the "prior art" approaches to identifying textual language in part 1 and part 3 (to wit: stop word presence and n-gram detection), I can now speak to the patented idea which we implemented as part of LanguageWare, a set of Java libraries that offer NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store the following information for each entry:

- Term or n-gram
- Language(s) with which it's associated
- Whether it can occur as a standalone term, at the beginning of a word, in the middle of a word, at the end of a word, or some combination of these
- An integer weighting value (per term/language pairing)

Thus, for the Chinese Simplified/Traditional and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive va...
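
To make the entry layout described above more concrete, here is a minimal Python sketch of a weighted, position-aware dictionary lookup. It is not LanguageWare's implementation (which is a set of Java libraries); the names `Position`, `Entry`, `positions_in`, `score` and `identify`, the sample entries, and all weights are hypothetical and invented purely for illustration. It also assumes whitespace-separated words, which would not hold for Chinese or Japanese text.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Flag, auto


class Position(Flag):
    """Where inside a word a term is allowed to occur (combinable)."""
    STANDALONE = auto()
    START = auto()
    MIDDLE = auto()
    END = auto()


@dataclass(frozen=True)
class Entry:
    term: str            # term or n-gram
    language: str        # language the weight applies to
    positions: Position  # allowed positions, possibly combined with |
    weight: int          # integer weight for this term/language pairing


# Hypothetical sample data; real entries and weights would be curated.
ENTRIES = [
    Entry("the", "en", Position.STANDALONE, 5),
    Entry("ing", "en", Position.END, 3),
    Entry("und", "de", Position.STANDALONE, 5),
    Entry("sch", "de", Position.START | Position.MIDDLE | Position.END, 3),
]


def positions_in(term, word):
    """Return the combined Position flags at which term occurs in word."""
    if word == term:
        return Position.STANDALONE
    found = Position(0)
    start = word.find(term)
    while start != -1:
        if start == 0:
            found |= Position.START
        elif start + len(term) == len(word):
            found |= Position.END
        else:
            found |= Position.MIDDLE
        start = word.find(term, start + 1)
    return found


def score(text, entries=ENTRIES):
    """Sum the weights of matching entries per language.

    Each entry counts at most once per word.
    """
    totals = defaultdict(int)
    for word in text.lower().split():
        for entry in entries:
            if positions_in(entry.term, word) & entry.positions:
                totals[entry.language] += entry.weight
    return dict(totals)


def identify(text, entries=ENTRIES):
    """Return the highest-scoring language, or None if nothing matched."""
    totals = score(text, entries)
    return max(totals, key=totals.get) if totals else None
```

With these invented entries, `identify("the king is singing")` favours `"en"`, while "wunderbar" contributes nothing because "und" is only listed as a standalone term. Storing the positions as combinable flags keeps each entry small, which seems in the spirit of a compact dictionary, though that is a design guess rather than something the post states.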