
Showing posts with the label stop_words

Language ID Part 3 - more challenges

Stop word detection usually works... In my prior post on this subject, the Jabberwocky poem examples hopefully demonstrated that a language label can still be assigned to text when it contains words identifiable as a language's pronouns, conjunctions, or adpositions. In this context, such identifiers are considered stop words. Their presence was sufficient for us to recognize the language even when the nouns, verbs, adjectives, and adverbs were unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples some inflections hinted at the nonsensical words having specific qualities. Specifically, nouns could be spotted when, in the inflected languages, pluralization or possession was shown via -s/'s endings (English, though -s can indicate possession in German also) or a combination of title case capitalization and (when plu
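To make the stop word idea in this excerpt concrete, here is a minimal Python sketch. It is not the method used in these posts: the tiny word lists and the function names (`tokenize`, `score_languages`, `guess_language`) are assumptions made purely for illustration. It counts how many tokens match each language's stop word list and picks the language with the most hits.

```python
import re
from collections import Counter

# Tiny, hand-picked stop word lists. These are illustrative assumptions only;
# a real system would use much larger, curated lists per language.
STOP_WORDS = {
    "english": {"the", "and", "in", "that", "he", "his", "my", "were",
                "did", "with", "of", "by", "as", "so", "it", "was"},
    "german": {"der", "die", "das", "und", "in", "er", "sein", "mein",
               "waren", "mit", "von", "als", "so", "es", "war", "den"},
    "french": {"le", "la", "les", "et", "dans", "il", "son", "mon",
               "avec", "de", "par", "que", "est", "un", "une", "des"},
}


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text):
    """Count, per language, how many tokens appear in its stop word list."""
    tokens = tokenize(text)
    scores = Counter()
    for lang, words in STOP_WORDS.items():
        scores[lang] = sum(1 for token in tokens if token in words)
    return scores


def guess_language(text):
    """Return the language with the most stop word hits, or None if none match."""
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None


jabberwocky = (
    "'Twas brillig, and the slithy toves Did gyre and gimble in the wabe; "
    "All mimsy were the borogoves, And the mome raths outgrabe."
)
# Invented nonsense with German function words, for illustration only.
nonsense_german = "Der flirze Wompel und die gribben Tuven waren mit dem Schnarz"

print(guess_language(jabberwocky))      # english
print(guess_language(nonsense_german))  # german
```

Even though none of the content words are in any vocabulary, the function words alone separate the two samples. A raw hit count is the simplest possible score; it ignores text length and words shared between languages (such as "in" and "so" above), so a fuller version would likely normalize or weight the counts.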

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used was from the children's book by Lewis Carroll - a poem that the protagonist reads, and although not comprehending it, thinks it "pretty" and that it was "clear" that "somebody killed something". Here it is, courtesy of the Wikipedia (English) article: "Jabberwocky"

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling