Tuesday, November 29, 2011

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about how to identify a language leads to the fundamental question: "what defines a language?" The example our team used was a poem from Lewis Carroll's children's book. The protagonist reads it and, although she does not comprehend it, finds it "pretty" and thinks it "clear" that "somebody killed something". Here it is, courtesy of the English Wikipedia article:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

from Through the Looking-Glass, and What Alice Found There (1872).
Since this novel has enjoyed worldwide popularity, it is unsurprising that translations exist in dozens of languages. Translators have accordingly been challenged to render Carroll's "nonce" words in whatever ways they deemed fit. Here's a site that lists various translations (including Klingon!).
I've chosen one of three listed German ones to quote here, the version found also in Gödel, Escher, Bach.

Der Jammerwoch

Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.

»Bewahre doch vor Jammerwoch!
Die Zähne knirschen, Krallen kratzen!
Bewahr' vor Jubjub-Vogel, vor
Frumiösen Banderschnätzchen!«

Er griff sein vorpals Schwertchen zu,
Er suchte lang das manchsam' Ding;
Dann, stehend unterm Tumtum Baum,
Er an-zu-denken-fing.

Als stand er tief in Andacht auf,
Des Jammerwochen's Augen-feuer
Durch tulgen Wald mit Wiffek kam
Ein burbelnd Ungeheuer!

Eins, Zwei! Eins, Zwei! Und durch und durch
Sein vorpals Schwert zerschnifer-schnück,
Da blieb es todt! Er, Kopf in Hand,
Geläumfig zog zurück.

»Und schlugst Du ja den Jammerwoch?
Umarme mich, mein Böhm'sches Kind!
O Freuden-Tag! O Halloo-Schlag!«
Er schortelt froh-gesinnt.

Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.

Original source:
Scott, Robert. "The Jabberwock Traced to Its True Source", Macmillan's Magazine, Feb 1872.

Compared with the other available German versions (the two cited in Wikipedia, by Lieselotte & Martin Remané and by Christian Enzensberger, respectively), this one seemed somehow closest in "feel" to the original, though with more consistent rhyming. Perhaps my German-speaking readers would share their impressions of how it compares with the others?

Meanwhile, I was disappointed that the aforementioned translations list site (although apparently authoritative enough to be listed as a reference in Wikipedia) didn't list the Japanese version. Here I'm taking the one presented in the Wikipedia page:
ジャバウォックの詩 (literally, Jabawock's poem)

一、二! 一、二! 貫きて尚も貫く

おお芳晴（かんばら）しき日よ! 花柳かな! 華麗かな!』

Whoever created this Japanese rendition had clearly studied Carroll's own annotations (e.g. brillig becomes literally "the hour of twilight-fire"). I'm not convinced that I agree with some of the straight phonetic transliterations, but I do like the more archaic inflections it uses.

I hope these examples show the reader what was meant by "stop words" in the context of identifying language: the short, frequent function words (such as "the" and "and" in English, or "und" and "die" in German) that remain recognizable even when the words around them are invented. Indeed, with this amount of text, the traditional approach works even though so many words are classifiable as either nonce words or portmanteaux. But as mentioned before, newspaper headlines omit these very same identifiable words. And as we'll see in the next part, some languages share so much history (yet are politically segregated) that even their stop words overlap enough to cause false positives and consternation.
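To make the stop-word idea a little more concrete, here is a minimal Python sketch. It is only an illustration: the tiny word lists and the names `STOP_WORDS`, `stop_word_scores` and `guess_language` are my own assumptions for this example, not the lists or code our team actually used, and a real system would use far larger lists and many more languages.

```python
import re

# Tiny, hand-picked stop-word lists (an assumption for this sketch only;
# real lists are much longer and cover many more languages).
STOP_WORDS = {
    "english": {"the", "and", "in", "were", "did", "he", "his", "it",
                "that", "my", "as", "with", "by", "to"},
    "german": {"der", "die", "das", "und", "in", "war", "es", "er",
               "sein", "vor", "mit", "zu", "ein", "den"},
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stop_word_scores(text):
    """Count, per language, how many tokens are known stop words."""
    tokens = tokenize(text)
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }


def guess_language(text):
    """Return the language with the most stop-word hits, or None if no hits."""
    scores = stop_word_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


english_stanza = """'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe."""

german_stanza = """Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben."""

print(guess_language(english_stanza))      # stop words outvote the nonce words
print(guess_language(german_stanza))
print(guess_language("Jabberwock slain"))  # headline-style text: no stop words
```

Even with lists this small, each stanza is tagged correctly because the function words outnumber the single shared hit ("in" appears in both lists). The headline-style input contains no stop words at all, so the sketch returns `None`, which is exactly the weakness mentioned above.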

Wishing my readers a frabjous day!

Here's the prior post about Language ID.
Part 3 of Language ID is here, or you can skip to the final post.

About Mayo

Professional: I served as "Senior Enterprise SEO Strategist" in IBM's Digital Marketing division until early 2018, during which I provided consulting and training services for both internal and external clients. Before this I was involved in Natural Language Processing, software localization, quality assurance and documentation authoring.
Currently, I am stewarding a taxonomy and scaling the learning curve to (the IT sense of) ontologies.
Personal: INTJ Nikkei Nisei ex-patriated Canadian who takes photographs and enjoys Baroque through late Classical music. The G+ page shares some of the "best of" photos.