
Tuesday, November 29, 2011

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used comes from Lewis Carroll's children's book: a poem that the protagonist reads and, although she does not comprehend it, finds "pretty", adding that it is "clear" that "somebody killed something". Here it is, courtesy of the Wikipedia (English) article:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

from Through the Looking-Glass, and What Alice Found There (1872).
Since this novel has enjoyed worldwide popularity, translations unsurprisingly exist in dozens of languages. Translators have accordingly been challenged to render Carroll's "nonce" words in whatever ways they deemed fit. Here's a site that lists various translations (including Klingon!).
I've chosen one of three listed German ones to quote here, the version found also in Gödel, Escher, Bach.

Der Jammerwoch

Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.
»Bewahre doch vor Jammerwoch!
Die Zähne knirschen, Krallen kratzen!
Bewahr' vor Jubjub-Vogel, vor
Frumiösen Banderschnätzchen!«
Er griff sein vorpals Schwertchen zu,
Er suchte lang das manchsan' Ding;
Dann, stehend unterm Tumtum Baum,
Er an-zu-denken-fing.
Als stand er tief in Andacht auf,
Des Jammerwochen's Augen-feuer
Durch tulgen Wald mit Wiffek kam
Ein burbelnd Ungeheuer!
Eins, Zwei! Eins, Zwei! Und durch und durch
Sein vorpals Schwert zerschnifer-schnück,
Da blieb es todt! Er, Kopf in Hand,
Geläumfig zog zurück.
»Und schlugst Du ja den Jammerwoch?
Umarme mich, mein Böhm'sches Kind!
O Freuden-Tag! O Halloo-Schlag!«
Er schortelt froh-gesinnt.
Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.

Original source:
Scott, Robert. "The Jabberwock Traced to Its True Source", MacMillan's Magazine, Feb 1872.

Compared with the other available German versions (the two cited in Wikipedia, by Lieselotte & Martin Remané and by Christian Enzensberger, respectively), this one seemed somehow closest in "feel" to the original (but with more consistent rhyming). Perhaps my German-speaking readers could share their impressions of how it compares with the others?

Meanwhile, I was disappointed that the aforementioned translations list site (although apparently authoritative enough to be listed as a reference in Wikipedia) didn't list the Japanese version. Here I'm taking the one presented in the Wikipedia page:
ジャバウォックの詩 (literally, Jabawock's poem)





一、二! 一、二! 貫きて尚も貫く 

 おお芳晴かんばらしき日よ! 花柳かな! 華麗かな!』 

Whoever created this Japanese rendition had clearly studied Carroll's own annotations (e.g. brillig becomes literally "the hour of twilight-fire"). I'm not convinced that I agree with some of the straight phonetic transliterations, but I do like the more archaic inflections it uses.

These examples hopefully reveal to the reader what was meant by "stop words" in the context of identifying language. Indeed, with this amount of text, the traditional approach works even though so many words are classifiable either as nonce words or portmanteaux. But as mentioned before, newspaper headlines omit these very same identifiable words. And as we'll see in the next part, some languages share so much history (yet are politically segregated) that even these stop words overlap sufficiently to cause false positives and consternation.
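To make the stop word idea a little more concrete, here is a minimal Python sketch that counts stop word hits in the first stanza of the English and German versions quoted above. It is an illustration only, not the approach from the patents: the two word lists are tiny and hand-picked, and the names STOP_WORDS, stop_word_hits and guess_language are made up for this example.

```python
import re

# Tiny, hand-picked stop word lists (an assumption for illustration;
# real lists are far longer and there is no canonical set).
STOP_WORDS = {
    "english": {"the", "and", "in", "were", "did", "was", "a", "of", "he", "his"},
    "german": {"der", "die", "das", "und", "in", "war", "es", "er", "sein", "mit"},
}

ENGLISH_STANZA = """'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe."""

GERMAN_STANZA = """Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben."""


def stop_word_hits(text):
    """Count, per language, how many tokens of the text are stop words."""
    # Runs of letters only; punctuation, digits and hyphens split tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }


def guess_language(text):
    """Return the language with the most stop word hits, or None if there are none."""
    hits = stop_word_hits(text)
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


if __name__ == "__main__":
    for stanza in (ENGLISH_STANZA, GERMAN_STANZA):
        print(guess_language(stanza), stop_word_hits(stanza))
```

Even with nonce words in nearly every line, the handful of function words ("and", "the", "und", "die") is enough to separate the two stanzas. A string made up only of nonce words, such as "brillig slithy toves", produces no hits at all, which is the situation that headlines tend to create.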

Wishing my readers a frabjous day!

Here's the prior post about Language ID.
Part 3 of Language ID is here, or you can skip to the final post.

Friday, November 25, 2011

acquired by Delicious

As of November 9, 2011, it was announced that the newish owners of Delicious had acquired (which I'd blogged about before). However, even earlier (in September), my manager had blogged about Delicious' apparent demise, as precipitated by the takeover by AVOS. has promised to remain functional until the start of next year, but I've found that attempts to use the Delicious import feature are failing (the page times out). The export for worked without any problems.

I'm sincerely hoping that Delicious gets their act together, and that soon it'll have incorporated's ease of use, and restored lost tags and works glitch-free for all the pre-existing (or surviving?) users.

Wednesday, November 23, 2011

Language ID (textual) - part 1

Word cloud of one person's compilation of English stop words, courtesy of Armand Brahaj (whose site has been infected by malware). Here instead is's list
Now that half a year has lapsed since the inception of this blog, some readers may be wondering when I might share more topics that are related to the "Linguistics" part of "SEO, Linguistics, Localization".

In fact, one of the triggers for starting this blog was the issuance of two patents, filed in 2005 and 2006, for which I was a co-inventor and the sole inventor, respectively. Both filings concerned language identification (from textual input): the first approached the challenges of identifying a text's (primary) language, and the second was an application of the first, as combined with messaging software.

Rather than overwhelm the reader with extensive explanations, I'm going to attempt to create a series of posts that will cover everything in the way that makes the most sense to me (which is actually how I approach most things in life, not just work). I may end up consolidating or splitting future posts, so I can't yet solve for X what the "part 1 of X" is. :-)

I should also insert the disclaimer that the intended target audience is people with little to no exposure to any sort of linguistics. However, given my wider audience - particularly on G+ - there are bound to be people who take issue with my oversimplifications. To them, I apologize in advance!

First, the problem. One may ask oneself, what IS the problem, or limitations, with textual language identification solutions? Well, take news article headlines or similarly short textual fragments, for instance. Here are two examples.

"Schwarzenegger's in a 1990 hit film 'Kindergarten Cop'." (a factual statement)
"海上危機管理で協議機関設置へ協力 玄葉外相訪中" (a news headline from November 23, 2011)

Before our implementation, the two prevalent approaches to identifying the language of a text were n-gram detection and stop word identification. But in headlines, stop words tend to be omitted. Although there is no canonical set of stop words per se for any language, in the context of topic identification and indexing they tend to be function words: usually conjunctions, adpositions and articles.

This leaves the detection of n-grams, which basically are common groupings of characters (usually consonant clusters, such as "sch" in German).
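For readers who'd like to see what an n-gram looks like in practice, here is a rough Python sketch that slices the first example headline into character trigrams. The function name char_ngrams is made up for illustration, and a real detector would compare such counts against per-language profiles built from large bodies of text rather than look for a single cluster.

```python
from collections import Counter

HEADLINE = "Schwarzenegger's in a 1990 hit film 'Kindergarten Cop'."


def char_ngrams(text, n=3):
    """Count character n-grams within each word of the text (letters only)."""
    # Replace anything that is not a letter with a space, so that
    # n-grams never span punctuation, digits or word boundaries.
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    grams = Counter()
    for word in cleaned.split():
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


if __name__ == "__main__":
    grams = char_ngrams(HEADLINE)
    # The German-looking cluster "sch" comes from the surname alone.
    print(grams["sch"])
    print(sorted(grams))
```

Words shorter than n letters (such as "in" and "a") contribute nothing, so in this headline most of the trigrams come from the two German-derived words.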

So, using the traditional approaches, the first example would be guessed as being in German, and the second, likely as Traditional Chinese (although the example also contains a fair proportion of Simplified Chinese Han, and just two native Japanese characters). If one were relying on this layer of processing to categorize news articles based on their title values, this is most certainly a problem.
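To see why those two kana matter, here is a small Python sketch that tallies the characters of the second headline by script, using the standard unicodedata module. This is not the method from the patents, and the names script_counts and has_kana are hypothetical. The underlying fact is simply that Han characters are shared between Chinese and Japanese writing, whereas hiragana and katakana are used for Japanese.

```python
import unicodedata
from collections import Counter

HEADLINE = "海上危機管理で協議機関設置へ協力 玄葉外相訪中"


def script_counts(text):
    """Tally characters by script, based on their Unicode character names."""
    counts = Counter()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            counts["han"] += 1
        elif name.startswith("HIRAGANA"):
            counts["hiragana"] += 1
        elif name.startswith("KATAKANA"):
            counts["katakana"] += 1
        elif ch.isalpha():
            counts["other_letter"] += 1
    return counts


def has_kana(text):
    """True if the text contains at least one hiragana or katakana character."""
    counts = script_counts(text)
    return counts["hiragana"] + counts["katakana"] > 0


if __name__ == "__main__":
    print(script_counts(HEADLINE))
    print(has_kana(HEADLINE))
```

A tally like this finds the Han characters vastly outnumbering the kana, which is why a purely proportional guess leans towards Chinese, while a check for any kana at all points to Japanese.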

See the sequel post concerning Language ID, here.
See here for the final post about Language ID.

Monday, November 21, 2011

Sushi Preparation compared to Search Enablement

Courtesy of Kojiro Fish Shop in Wieden, Vienna
 Being a fan of various cuisines, I count myself fortunate in having had the opportunity to grow up in Toronto (and having spent time in gastronomical meccas such as Tokyo and New York). As my parents kept my household quite Japanese, I grew up eating what most of my classmates considered to be exotic foods: umeboshi, chirashi zushi, korokke, grilled fish with daikon oroshi and such.

Thus, when I was recently asked by a virtual friend - by which I mean someone whose acquaintance I made online, and have not yet spent time with in person, as opposed to an artificial being - to review her classmate's journey of learning to make sushi, I thought I might as well take the opportunity to talk about how my views on sushi preparation and enabling search optimization of online content actually have comparable points. Sound strange? Do read on...

First, the sushi making (with the disclaimer that I am not a professional chef, nor would I know how to properly cut fish for nigiri or sashimi, unless they come in rectangular blocks already sans scales or bones.)

Sidebar: initially, sushi arose as a means for people living further from the oceans to preserve fish without salting or drying them out thoroughly - the vinegar-soaked rice was a means to achieve this.
  • The sushi meshi (vinegared rice, aka the contracted form which is "sumeshi") must be mastered first. For real sushi masters, this part takes literally years to accomplish. I'm not going to give a recipe here, but rather things I've noticed that English language recipes don't always mention, which would likely contribute to subpar results.
    • Choose short grain, glutinous rice. Long grain rice and rice that's been parboiled won't work as well. Rinse the rice well (the castoff water is good in dry climates to water plants with; don't waste it!) and then boil (rice cookers do the best job).
    • Prepare the vinegar solution - there is an optimal ratio of solution to rice, which most recipes cover. Once this is prepared, wipe the inside of the container where the rice will be mixed and cooled with some of this solution.
    • This should be a shallow wooden container in which to air and mix the rice. This container needs to be wooden in order to absorb and then release part of the vinegar solution into the rice.
      Use a large, flat wooden spoon and a hand fan once the boiled rice is spread into the aforementioned container. Mix the remaining solution carefully into the rice with the spoon (near-horizontal cutting motions are often described) while fanning the rice. The timing of this step is important, as is ensuring that the grains of rice remain uncompromised.
    • The end result of this process should be shiny, undamaged grains of rice that are evenly flavoured by the vinegar solution.
  • The toppings (nigiri) or fillings (maki) should be of high quality, and cut as suggested by the links above. Sharp, well maintained knives are necessary to descale and debone fish. For the latter, crisp sheets of nori should be placed on the rolling mat shiny side down. Wasabi should be ideally fresh, and it exists not just for flavour, but also to help slow spoilage and prevent food poisoning.
  • Then, keeping your hands wet so they don't stick to the cooling vinegared rice, apply the aforementioned wasabi and toppings/maki fillings to taste. In either case, the pressure to form the pieces of sushi needs to be firm enough to keep it from easily breaking apart, but not so hard that the grains get deformed or compromised in structure. Nori is used in some nigiri pieces (such as the atsuyaki - sweet egg omelet - type as shown in my photo above) to doubly ensure the rice/wasabi/topping remains intact.
  • When enjoying sushi, dip each piece in a bit of shoyu (soy sauce) where some wasabi has been dissolved. 
    • Don't forget to cleanse your palate between pieces of different types by ingesting gari (sushi ginger). The quality of gari (which in restaurants should be made in situ!) is often a good benchmark for how good the sushi is, too.
Now, my views on search enablement:
  • First, the foundation of any search enablement strategy consists of keywords. They need to be self-explanatory without being too lengthy (long-tail searches beyond four words are still rare).
    • These keywords should not compete internally, and need to be applied strategically to where they best apply. 
    • As it's not an exact science, experience helps most in researching and deriving optimal keywords.
    • One would play around with sources of query data, just as one may experiment with how much water to boil the rice in and what amount of vinegar solution works for your volume of rice.
  • Regardless of whether social media is used for inbound marketing or not, link bait must exist, and furthermore be of high quality. 
    •  Much like the sushi toppings or fillings (or ingredients for any cuisine), the better its inherent quality, the better the outcome. 
    • The link bait is what draws people in, generating conversions as well as click-throughs by lending credibility and authority and the "appeal" of a business or product. 
    • And certainly, with this analogy it is possible to have a huge viral success purely based on link bait (sashimi). But the balance of flavours that sumeshi offers with the toppings, along with wasabi and nori, offers the complete package.
  • The successful execution of most campaigns precludes just launching and leaving it - the follow-through (nori, which coincidentally is a Japanese homonym for "glue") is important. 
    • Again, sashimi - just the link bait - can enjoy success to an extent, but even sashimi needs shoyu (soy sauce) and wasabi to complete the experience, and would be ingested with plain boiled rice anyway.
    • Wasabi might be analogous to diligent moderation (perhaps an expiry/renewal date value) of what content remains accessible - outdated, and thus increasingly irrelevant content should be archived and removed from view in a timely manner.
Unlike with most baking recipes, I've found that cooking instructions can be played with, and one can substitute ingredients creatively to some extent. But to me, delicious meals - and sound search enablement - come out of 1) a sound foundation (sumeshi/keywords), 2) deft execution (to ensure quality ingredients/link bait are maximized for gustatory enjoyment), and 3) follow-through (including sunsetting old content/ensuring food doesn't spoil).

About Mayo


Professional: As "Senior Enterprise SEO Strategist" in IBM's Digital Marketing division, I provide consulting and training services for both internal and external clients. Formerly I was involved in Natural Language Processing, software localization, quality assurance and documentation authoring.
Personal: INTJ Nikkei Nisei expatriated Canadian who takes photographs and enjoys Baroque through late Classical music. The G+ page shares some of the "best of" photos.