Posts

Language ID Part 3 - more challenges

Stop word detection usually works... In my prior post on this subject, I hoped the Jabberwocky poem examples demonstrated that a language label can still be assigned to a text when it contains words identifiable as one of a language's pronouns, conjunctions, or adpositions. In this context, such identifiers are considered stop words. Their presence was sufficient for us to recognize the language even when the nouns, verbs, adjectives and adverbs were unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples, some inflections hinted that the nonsensical words had specific qualities. Nouns in particular could be spotted when, in the inflected languages, pluralization or possession was shown via -s/'s endings (English, though -s can also indicate possession in German) or a combination of title case capitalization and (when plu
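As a rough illustration of the stop word idea described in this excerpt, here is a minimal Python sketch. It is not the patented method or any production system: the tiny stop word lists, the language codes, and the function names `score_languages` and `identify_language` are all assumptions made up for this example, and real stop word lists are far longer.

```python
import re

# Tiny, hand-picked stop word lists. These are assumptions for this sketch;
# a real system would use much longer, curated lists per language.
STOP_WORDS = {
    "en": {"the", "and", "in", "were", "did", "all", "of", "that", "with", "he", "his"},
    "de": {"der", "die", "das", "und", "in", "war", "mit", "er", "sein", "es", "ein"},
    "fr": {"le", "la", "les", "et", "dans", "il", "son", "un", "une", "que", "avec"},
}


def tokenize(text):
    """Lowercase the text and return its runs of letters (no digits or punctuation)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text):
    """Count, for each language, how many tokens appear in its stop word list."""
    tokens = tokenize(text)
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }


def identify_language(text):
    """Return the language with the most stop word hits, or None if unclear."""
    ranked = sorted(score_languages(text).items(), key=lambda item: item[1], reverse=True)
    (best_lang, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # No hits at all, or a tie for first place: decline to guess.
    if best_score == 0 or best_score == runner_up_score:
        return None
    return best_lang


stanza = (
    "'Twas brillig, and the slithy toves Did gyre and gimble in the wabe; "
    "All mimsy were the borogoves, And the mome raths outgrabe."
)
print(identify_language(stanza))
```

Run on the opening stanza of "Jabberwocky", the sketch picks English purely from words such as "and", "the" and "in", even though none of the nonsense words is in any list. Returning `None` on a tie or on zero hits is a design choice for this example; a fuller approach would also weigh the inflection and capitalization cues mentioned above.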

Every breath we take (Foursquare et al.)

Image
Approximate location of this blogger, give or take a few hundred metres

After many months of dragging my feet, I joined Foursquare today. For those unfamiliar with it, this geo-social networking service lets a registered user with a smartphone download an application that makes it easy to "check in" to physical places. With tie-ins to Facebook and Twitter, it encourages users to publicly promote the businesses and services they prefer. Businesses, in turn, value these endorsements enough to make offers to those who check in. Truth be told, I'm not a particularly suitable user of such services. First, I'd rather not have my whereabouts documented online to this level of detail, even though I don't live alone (and thus am not quite so susceptible to being burgled). Second, I'm an inconspicuous consumer - that is, I try to live frugally, and what I consider to be frivolous purchases mainly take

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used was from the children's book by Lewis Carroll - a poem that the protagonist reads and, although not comprehending it, thinks "pretty", finding it "clear" that "somebody killed something". Here it is, courtesy of the English Wikipedia article:

"Jabberwocky"

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling

Trunk.ly acquired by Delicious

On November 9, 2011, it was announced that the newish owners of Delicious had acquired trunk.ly (which I'd blogged about before). However, even earlier (in September), my manager had blogged about Delicious' apparent demise, as precipitated by the takeover by AVOS. Trunk.ly has promised to remain functional until the start of next year, but I've found that attempts to use the Delicious import feature are failing (the page times out). The export from trunk.ly worked without any problems. I'm sincerely hoping that Delicious gets its act together, and that it will soon have incorporated trunk.ly's ease of use, restored lost tags, and be working glitch-free for all the pre-existing (or surviving?) users.

Language ID (textual) - part 1

Image
Word cloud of one person's compilation of English stop words, courtesy of Armand Brahaj (whose site has been infected by malware). Here instead is ranks.nl's list

Now that half a year has passed since the inception of this blog, some readers may be wondering when I might share more topics related to the "Linguistics" part of "SEO, Linguistics, Localization". In fact, one of the triggers for my starting this blog was the issuance of two patents, filed in 2005 and 2006, for which I was a co-inventor and sole inventor, respectively. Both filings concerned language identification (from textual input): the first approached the challenges of identifying a text's (primary) language, and the second was an application of the first, combined with messaging software. Rather than overwhelm the reader with extensive explanations, I'm going to attempt to create a series of posts that will cover everything in the way