
Saturday, December 31, 2011

Year-end thoughts, 2011 edition

Over the lifetime I've spent living in various Western countries, I've noticed the predilection for media and individuals alike to focus on retrospection around this time of year: that is, reminiscing about the various events and experiences that one associates with the prior year. In direct contrast to this, it's my understanding that in Japan it is customary to hold a 忘年会 (bōnenkai, a "forget-the-year party") which, paired with the 新年会 (shinnenkai, a "new year party", which occurs after the 正月三が日, the first three days of January), encourages the forgetting of the prior year through much carousing and imbibing.

This year was particularly unforgettable to those with ties to Japan, however, and I've seen social media statuses speaking to the importance of remembering the disasters that have befallen my cultural homeland. The fallout - both metaphorical and literal (environmental, economic, political, and emotional) - will be palpable for decades, if not centuries, regardless of any desire the world may have to forget.

Personally speaking, I tend to mix traditions at year's end (as I suspect many Nikkei do): indulging in some retrospection, but mainly renewing my commitment to work on fallibilities and improve upon specific aspects of life. This time, however, I also spent a surprising amount of time thinking more profoundly about my past, by fleshing out my timeline (courtesy of Facebook's latest UI change). But this blog isn't the right forum to share such factoids.

I hope that 2012 enables everyone to make progress on what they wish, learn and implement interesting ideas, and generally experience fulfillment, sound health and happiness.

Wednesday, December 28, 2011

LanguageWare's robust, extensible Language ID (part 4)

Having introduced the "prior art" approaches to identifying textual language in part 1 and part 3 (to wit: stop word presence and n-gram detection), I can now speak to the patented idea which we implemented as part of LanguageWare, a set of Java libraries that offer NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store several types of information per entry.

Each entry consists of the following:

  • Term or n-gram
  • Language(s) with which it's associated
  • Whether it can occur as a standalone term, at the beginning of a word, the middle of a word, or the end of a word, or some combination of these, and
  • An integer weighting value (per term/language pairing)

Thus, for the Chinese Simplified/Traditional and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive values for Japanese and large negative scores for the Chinese languages. This was necessary because, as in the example given before, hiragana and katakana are quite sparse in newspaper headlines, which would otherwise score mainly as Traditional Chinese. To distinguish between the two Chinese scripts, each simplified Han character was associated with a positive score for Simplified and a negative score for Traditional.

For Western languages with overlapping stop words, we also mined our existing lexical dictionaries for distinctive prefixes and suffixes, as well as verbal inflective endings, to ensure that we identified as many uniquely occurring features of each language as possible. When even that was insufficient, a statistical analysis of relative prevalence factored into our scoring values. If an n-gram-only identifier were run against the example "Schwarzeneggernek" (I've linked here to the search results), it would come back as German. However, we know that -nek is a Hungarian inflection. By capturing this in our dictionary, we ensure that regardless of what precedes this ending, our system indicates that the word is likely to be Hungarian.
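To make the entry structure above a bit more concrete, here is a minimal sketch in Python (LanguageWare itself is Java, and this is not its actual code or dictionary format). The `Entry` class, the position flags, the language codes and every weight value are assumptions invented purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical position flags: where within a word a term is allowed to match.
STANDALONE, PREFIX, INFIX, SUFFIX = 1, 2, 4, 8
ANYWHERE = STANDALONE | PREFIX | INFIX | SUFFIX


@dataclass
class Entry:
    term: str        # term or n-gram
    positions: int   # bitwise OR of the flags above
    weights: dict    # language code -> integer weight (values are made up)


ENTRIES = [
    Entry("nek", SUFFIX, {"hu": 50}),            # Hungarian inflection
    Entry("sch", PREFIX | INFIX, {"de": 10}),    # common German cluster
    # Kana unigrams: positive for Japanese, negative for both Chinese scripts.
    Entry("で", ANYWHERE, {"ja": 40, "zh-Hans": -40, "zh-Hant": -40}),
    Entry("へ", ANYWHERE, {"ja": 40, "zh-Hans": -40, "zh-Hant": -40}),
]


def positions_in(word, term):
    """Return the position flags at which term occurs in word (0 if absent)."""
    if word == term:
        return STANDALONE
    found = 0
    start = word.find(term)
    while start != -1:
        if start == 0:
            found |= PREFIX
        elif start + len(term) == len(word):
            found |= SUFFIX
        else:
            found |= INFIX
        start = word.find(term, start + 1)
    return found


def score(text):
    """Sum the weights of every entry that matches in an allowed position."""
    totals = {}
    for word in text.lower().split():
        for entry in ENTRIES:
            if positions_in(word, entry.term) & entry.positions:
                for lang, weight in entry.weights.items():
                    totals[lang] = totals.get(lang, 0) + weight
    return totals


def guess(text):
    """Return the highest-scoring language, or None if nothing matched."""
    totals = score(text)
    return max(totals, key=totals.get) if totals else None
```

With these made-up weights, `guess("Schwarzeneggernek")` favours Hungarian, because the -nek suffix outweighs the "sch" cluster. The real dictionary would of course hold far more entries, and its weights came from analysis rather than guesswork.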

Unicode ranges were also useful, as we ensured that the sample texts being parsed were all normalized to UTF-16LE (little endian) first. So if we ever needed to identify Klingon, the dictionary could be extended extremely easily.
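As a rough illustration of how Unicode ranges can help, the sketch below counts characters per script block. It is only a sketch: the `RANGES` table and the `script_counts` function are names I've made up, the table covers just a handful of blocks, and Python strings are already decoded code points, so the UTF-16LE normalization step is not shown.

```python
# Code point ranges for a few Unicode blocks (a deliberately tiny subset).
RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "han": (0x4E00, 0x9FFF),       # CJK Unified Ideographs
    "cyrillic": (0x0400, 0x04FF),
    "arabic": (0x0600, 0x06FF),
}


def script_counts(text):
    """Count how many characters of text fall into each listed block."""
    counts = {}
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in RANGES.items():
            if low <= code_point <= high:
                counts[name] = counts.get(name, 0) + 1
                break
    return counts
```

Run against the headline fragment "海上危機管理で協議機関設置へ協力", this finds a large majority of Han characters and only two hiragana, which is exactly why the kana needed such heavy weights.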

LanguageWare thus includes a Language ID component, which avails of the existing functionality, such as parsing lexemes, and uses this separate dictionary (which was at last examination 1.4MB and covered around 35 languages) to "guess" the primary language of text fragments. As we tested this against article headlines, many of which were truncated to under 120 characters inclusive of spaces, I'm quite confident that it would hold up very well even for tweets, including those in which URLs take up some of the space. The URLs could, through the use of regular expressions, either be omitted from consideration or processed such that hyphens or slashes are treated like whitespace.
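The two URL-handling options just mentioned might look something like the following in Python. This is a sketch under my own assumptions: the `URL_RE` pattern only catches http/https links, and the function names `drop_urls` and `split_urls` are hypothetical.

```python
import re

# Simplistic pattern: anything from http:// or https:// up to the next space.
URL_RE = re.compile(r"https?://\S+")


def drop_urls(text):
    """Option 1: omit URLs from consideration altogether."""
    return " ".join(URL_RE.sub(" ", text).split())


def split_urls(text):
    """Option 2: keep URL contents, treating hyphens and slashes as spaces."""
    def expand(match):
        body = re.sub(r"^https?://", "", match.group(0))
        return re.sub(r"[-/]+", " ", body)
    return " ".join(URL_RE.sub(expand, text).split())
```

The second option has the advantage that words embedded in a URL slug can still contribute to the language guess, at the cost of letting domain names into the input.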

For more context, try these former posts:
Language ID part 1
Language ID part 2
Language ID part 3

Tuesday, December 6, 2011

Language ID Part 3 - more challenges

Stop word detection usually works...

 In my prior post about this subject, hopefully the Jabberwocky poem examples demonstrated that when certain types of words occur in text that can be identified as belonging to a language's pronoun/conjunction/adposition parts of speech, a language label can still be assigned to text. Such identifiers are, in this context, considered to be stop words.

The presence of such terms was sufficient for us to recognize the language even when nouns, verbs, adjectives and adverbs are unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples, some inflections hinted at the nonsensical words having specific qualities. Nouns in particular could be spotted in the inflected languages: pluralization or possession was shown via -s/'s endings in English (though -s can indicate possession in German also), and via a combination of title case capitalization and (when plural) -en in German. In Japanese, the presumed proper nouns were transliterated into the katakana script, which is an orthographic convention for foreign terms generally (including verbs, which are then often inflected using Japanese grammar).

...except when languages overlap

Another challenge presents itself, however, when written languages share extensive history and thus use overlapping stop words. Norwegian Bokmål/Danish, Czech/Slovak, and Traditional Chinese/historical Japanese texts are some examples that readily come to mind. When the input text examples are short enough, even Dutch and English show enough similarities to generate errors (the words in, over and is are all both common and shared).
Fortunately in modern Japanese, the presence of the kana scripts (hiragana and katakana) allowed us to easily distinguish between it and either of the Chinese scripts. Japan had also simplified some of the more complicated kanji, for pragmatic reasons, though it has retained more of the traditional Han characters than mainland China has. (As a sidebar, there are uniquely Japanese kanji characters as well, though relatively few in number, constructed using constituent Han components). This script-based approach was also necessary to distinguish between languages that use other scripts, such as Arabic and Cyrillic.
For the Bokmål/Danish challenge, it was necessary to statistically analyze their respective rates of prefix and suffix prevalence. This was performed on proprietary dictionary files, which were historically used for spelling verification.

...and, obviously, when stop words are scarce.

For groups of languages that extensively share their alphabets, not only stop words but also "giveaway" unique characters grew quite scarce whenever the text samples were news headlines or HTML file <title> values. The sheer brevity of these fragments (ranging from 20-80 characters inclusive of spaces) meant that distinct characters tended to appear less frequently, and that stop words were eschewed, despite their importance, because the text length was constrained. Thus both the presence and absence of n-grams (in this case, bi- and tri-grams to be more specific) became crucial to pinpoint and use to differentiate amongst these languages. For instance, in German there are useful noun endings such as -keit and -ung (the latter having historical links to the English gerundive -ing suffix) which aren't found in Dutch, whereas conversely Dutch words contain far more repeated double vowels (especially aa and ee) as well as distinct sequences such as the -ij suffix.
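Here is a small Python sketch of the German/Dutch cues just described. It is only an illustration: the cue lists hold nothing beyond the examples named above, the function names are hypothetical, and a real identifier would weight these cues statistically rather than simply counting them (German has its own "ee" and "aa" words, after all).

```python
def ngrams(word, n):
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Only the cues mentioned in the text; real lists would be far longer.
GERMAN_SUFFIXES = ("keit", "ung")
DUTCH_BIGRAMS = ("aa", "ee", "ij")


def cue_counts(text):
    """Count German suffix cues and Dutch bigram cues in text."""
    german = dutch = 0
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        german += sum(word.endswith(suffix) for suffix in GERMAN_SUFFIXES)
        bigrams = ngrams(word, 2)
        dutch += sum(bigrams.count(cue) for cue in DUTCH_BIGRAMS)
    return {"de": german, "nl": dutch}
```

For example, "Die Möglichkeit einer Lösung" yields two German cues and no Dutch ones, while "Een vrij groot gevaar" yields three Dutch cues and no German ones.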

The next and final post on Language ID will cover some specific aspects of our patented solution!
Here are links to the initial Language ID post and its sequel, also.

Thursday, December 1, 2011

Every breath we take (Foursquare et al.)

Approximate location of this blogger, give or take a few hundred metres

After many months of dragging my feet, I joined Foursquare today. For those unfamiliar with it, this geo-social networking service allows a registered user with a smartphone to download an application that makes it easy to "check in" to physical places. With tie-ins to Facebook and Twitter, it encourages users to publicly promote the businesses and services they prefer. These endorsements, in turn, are what businesses value sufficiently to make offers to those who check in to them.

Truth be told, I'm not a particularly suitable user of such services as these. First, I'd rather not have my whereabouts documented online to this level of detail, even though I don't live alone (and thus, am not quite so susceptible to being burgled).

Second, I'm an inconspicuous consumer - that is, I try to live frugally, and what I consider to be frivolous purchases mainly take place online (and quite infrequently, on the order of semi-annually).

Third, none of my friends live in locales that I could realistically visit (without taking a long haul flight). Case in point: the first four people I linked with on Foursquare live in Chiba prefecture, Tokyo, Saskatchewan, and Ontario, respectively. I'm excepting work-related folk in saying that, though, of whom I'm actually quite fond. It's just that my long-standing policy of not conflating professional and personal relationships has resulted in a residual reluctance to socialize with them outside of regular work hours.

As counterpoints to the above three observations, though, I discovered that I did indeed have enough reason to join the service nonetheless. So long as I carry an Android phone, it seems my every move is being carefully tracked and filed away anyhow. I'm also interested in which of the areas I do roam offer free (and presumably secure) WiFi. And I'd like to start documenting places and dishes that I would recommend to people who visit my city, which the "done" and "tips" features of this service allow me to do ex post facto. I've already started to work on those, as I don't have to commit to stating when I experienced or ate anything, but just that I had.

The fact that I'm not tweeting my whereabouts (and that I intend to check in only sporadically; I've also been known to swap my SIM card into a non-smart phone on occasion) means I can still achieve a semblance of living "under the radar". So to those readers who plan to request a link with me on there, please don't expect to infer too much more about my life, although by agreeing to open up this data to you, I would be showing that I implicitly trust you.

Tuesday, November 29, 2011

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used was from the children's book by Lewis Carroll - a poem that the protagonist reads, and although not comprehending it, thinks it "pretty" and that it was "clear" that "somebody killed something". Here it is, courtesy of the Wikipedia (English) article:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One, two! One, two! and through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

from Through the Looking-Glass, and What Alice Found There (1872).
Since this novel has enjoyed worldwide popularity, unsurprisingly translations exist in dozens of languages. Accordingly, translators have been thus challenged with interpreting the "nonce" words from Carroll, in ways they deemed fit. Here's a site that lists various translations (including Klingon!)
I've chosen one of three listed German ones to quote here, the version found also in Gödel, Escher, Bach.

Der Jammerwoch

Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.
»Bewahre doch vor Jammerwoch!
Die Zähne knirschen, Krallen kratzen!
Bewahr' vor Jubjub-Vogel, vor
Frumiösen Banderschnätzchen!«
Er griff sein vorpals Schwertchen zu,
Er suchte lang das manchsan' Ding;
Dann, stehend unterm Tumtum Baum,
Er an-zu-denken-fing.
Als stand er tief in Andacht auf,
Des Jammerwochen's Augen-feuer
Durch tulgen Wald mit Wiffek kam
Ein burbelnd Ungeheuer!
Eins, Zwei! Eins, Zwei! Und durch und durch
Sein vorpals Schwert zerschnifer-schnück,
Da blieb es todt! Er, Kopf in Hand,
Geläumfig zog zurück.
»Und schlugst Du ja den Jammerwoch?
Umarme mich, mein Böhm'sches Kind!
O Freuden-Tag! O Halloo-Schlag!«
Er schortelt froh-gesinnt.
Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben.

Original source:
Scott, Robert. "The Jabberwock Traced to Its True Source", MacMillan's Magazine, Feb 1872.

Compared with the other available German versions (by Lieselotte & Martin Remané and Christian Enzensberger, respectively, the two that are cited in Wikipedia), this seemed somehow closest in "feel" to the original (but with more consistent rhyming), but perhaps my German speaking readership may share their impressions of this, compared with the others?

Meanwhile, I was disappointed that the aforementioned translations list site (although apparently authoritative enough to be listed as a reference in Wikipedia) didn't list the Japanese version. Here I'm taking the one presented in the Wikipedia page:
ジャバウォックの詩 (literally, Jabawock's poem)





一、二! 一、二! 貫きて尚も貫く 

 おお芳晴かんばらしき日よ! 花柳かな! 華麗かな!』 

Whoever created this Japanese rendition had clearly studied Carroll's own annotations (e.g. brillig becomes literally "the hour of twilight-fire"). I'm not convinced that I agree with some of the straight phonetic transliterations, but I like the more archaic inflections in use.

These examples hopefully reveal to the reader what was meant by "stop words" in the context of identifying language. Indeed, with this amount of text, the traditional approach works even though so many words are classifiable either as nonce words or portmanteaux. But as mentioned before, newspaper headlines omit these very same identifiable words. And as we'll see in the next part, some languages share so much history (yet are politically segregated) that even these stop words overlap sufficiently to cause false positives and consternation.
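For readers who would like to see the stop word approach in action, here is a minimal Python sketch that counts function words in the first stanza of each version. The `STOP_WORDS` lists are tiny samples I've picked for illustration (there is no canonical set), and `stop_word_hits` is a hypothetical name.

```python
import re

# Small illustrative samples of function words; real lists are much longer.
STOP_WORDS = {
    "en": {"the", "and", "in", "of", "he", "it", "with"},
    "de": {"die", "der", "das", "und", "in", "er", "es", "mit"},
}


def stop_word_hits(text):
    """Count, per language, how many words of text are listed stop words."""
    words = re.findall(r"\w+", text.lower())
    return {
        lang: sum(word in stops for word in words)
        for lang, stops in STOP_WORDS.items()
    }


ENGLISH_STANZA = """'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe."""

GERMAN_STANZA = """Es brillig war. Die schlichte Toven
Wirrten und wimmelten in Waben;
Und aller-mümsige Burggoven
Die mohmen Räth' ausgraben."""

print(stop_word_hits(ENGLISH_STANZA))
print(stop_word_hits(GERMAN_STANZA))
```

Even with nearly every content word being nonsense, each stanza scores clearly for its own language. Note that "in" is counted for both, a small preview of the overlap problem.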

Wishing my readers a frabjous day!

Here's the prior post about Language ID.
Part 3 of Language ID is here, or you can skip to the final post.

Friday, November 25, 2011 acquired by Delicious

As of November 9, 2011, it was announced that the newish owners of Delicious had acquired (which I'd blogged about before). However, even earlier (in September), my manager had blogged about Delicious' apparent demise, as precipitated by the takeover by AVOS. has promised to remain functional until the start of next year, but I've found that attempts to use the Delicious import feature are failing (the page times out). The export for worked without any problems.

I'm sincerely hoping that Delicious gets their act together, and that soon it'll have incorporated's ease of use, and restored lost tags and works glitch-free for all the pre-existing (or surviving?) users.

Wednesday, November 23, 2011

Language ID (textual) - part 1

Word cloud of one person's compilation of English stop words, courtesy of Armand Brahaj (whose site has been infected by malware). Here instead is's list
Now that half a year has lapsed since the inception of this blog, some readers may be wondering when I might share more topics that are related to the "Linguistics" part of "SEO, Linguistics, Localization".

In fact, one of the triggers for my starting this blog was the issuance of two patents, filed in 2005 and 2006, for which I was a co-inventor and sole inventor, respectively. Both filings concerned language identification (from textual input): the first approached the challenges of identifying a text's (primary) language, and the second was an application of the first, as combined with messaging software.

Rather than overwhelm the reader with extensive explanations, I'm going to attempt to create a series of posts that will cover everything in the way that makes the most sense to me (which is actually how I approach most things in life, not just work). I may end up consolidating or splitting future posts, so I can't yet solve for X what the "part 1 of X" is. :-)

I should also insert the disclaimer that the intended target audience is people with little to no exposure to any sort of linguistics. However, given my wider audience - particularly on G+ - there are bound to be people who may take issue with my oversimplifications. To them, I apologize in advance!

First, the problem. One may ask oneself, what IS the problem, or limitations, with textual language identification solutions? Well, take news article headlines or similarly short textual fragments, for instance. Here are two examples.

"Schwarzenegger's in a 1990 hit film 'Kindergarten Cop'." (a factual statement)
"海上危機管理で協議機関設置へ協力 玄葉外相訪中" (a news headline from November 23, 2011)

Before our implementation, the two prevalent approaches to parse and identify text as belonging to certain languages were n-gram detection, and stop word identification. But in headlines, stop words tend to be omitted. Although there is no canonical set of stop words per se for any language, in the context of topic identification and indexing they tend to be function words: usually their parts of speech are conjunctions, adpositions and articles.

This leaves the detection of n-grams, which basically are common groupings of characters (usually consonant clusters, such as "sch" in German).

So, using the traditional approaches, the first example would be guessed as being in German, and the second, likely as Traditional Chinese (although the example also contains a fair proportion of Simplified Chinese Han, and just two native Japanese characters). If one were relying on this layer of processing to categorize news articles based on their title values, this is most certainly a problem.

See the sequel post concerning Language ID, here.
See here for the final post about Language ID.

Monday, November 21, 2011

Sushi Preparation compared to Search Enablement

Courtesy of Kojiro Fish Shop in Wieden, Vienna
 Being a fan of various cuisines, I count myself fortunate in having had the opportunity to grow up in Toronto (and having spent time in gastronomical meccas such as Tokyo and New York). As my parents kept my household quite Japanese, I grew up eating what most of my classmates considered to be exotic foods: umeboshi, chirashi zushi, korokke, grilled fish with daikon oroshi and such.

Thus, when I was recently asked by a virtual friend - by which I mean someone whose acquaintance I made online, and have not yet spent time with in person, as opposed to an artificial being - to review her classmate's journey of learning to make sushi, I thought I may as well take the opportunity to talk about how sushi preparation and enabling search optimization of online content actually have comparable points. Sound strange? Do read on...

First, the sushi making (with the disclaimer that I am not a professional chef, nor would I know how to properly cut fish for nigiri or sashimi, unless they come in rectangular blocks already sans scales or bones.)

Sidebar: initially, sushi arose as a means for people living further from the oceans to preserve fish without salting or drying them out thoroughly - the vinegar-soaked rice was a means to achieve this.
  • The sushi meshi (vinegared rice, aka the contracted form which is "sumeshi") must be mastered first. For real sushi masters, this part takes literally years to accomplish. I'm not going to give a recipe here, but rather things I've noticed that English language recipes don't always mention, which would likely contribute to subpar results.
    • Choose short grain, glutinous rice. Long grain rice and rice that's been parboiled won't work as well. Rinse the rice well (the castoff water is good in dry climates to water plants with; don't waste it!) and then boil (rice cookers do the best job).
    • Prepare the vinegar solution - there is an optimal ratio of solution to rice, which most recipes cover. Once this is prepared, wipe the inside of the container where the rice will be mixed and cooled with some of this solution.
    • This should be a shallow wooden container in which to air and mix the rice. This container needs to be wooden in order to absorb and then release part of the vinegar solution into the rice.
    • Use a large, flat wooden spoon and a hand fan once the boiled rice is spread into the aforementioned container. Mix the remaining solution carefully into the rice with the spoon (near horizontal cutting motions are often described) while fanning the rice. The timing of this is important, as is ensuring that the grains of rice remain uncompromised.
    • The end result of this process should be shiny, undamaged grains of rice that are evenly flavoured by the vinegar solution.
  • The toppings (nigiri) or fillings (maki) should be of high quality, and cut as suggested by the links above. Sharp, well maintained knives are necessary to descale and debone fish. For the latter, crisp sheets of nori should be placed on the rolling mat shiny side down. Wasabi should be ideally fresh, and it exists not just for flavour, but also to help slow spoilage and prevent food poisoning.
  • Then, keeping your hands wet so they don't stick to the cooling vinegared rice, apply the aforementioned wasabi and toppings/maki fillings to taste. In either case, the pressure to form the pieces of sushi needs to be firm enough to keep it from easily breaking apart, but not so hard that the grains get deformed or compromised in structure. Nori is used in some nigiri pieces (such as the atsuyaki - sweet egg omelet - type as shown in my photo above) to doubly ensure the rice/wasabi/topping remains intact.
  • When enjoying sushi, dip each piece in a bit of shoyu (soy sauce) where some wasabi has been dissolved. 
    • Don't forget to cleanse your palate between pieces of different types by ingesting gari (sushi ginger). The quality of gari (which in restaurants should be made in situ!) is often a good benchmark for how good the sushi is, too.
Now, my views on search enablement:
  • First, the foundation of any search enablement strategy consists of keywords. They need to be self-explanatory without being too lengthy (long tail searches beyond four words are still rare).
    • These keywords should not compete internally, and need to be applied strategically to where they best apply. 
    • As it's not an exact science, experience helps most in researching and deriving optimal keywords.
    • One would play around with sources of query data, just as one may experiment with how much water to boil the rice in and what amount of vinegar solution works for your volume of rice.
  • Regardless of whether social media is used for inbound marketing or not, link bait must exist, and furthermore be of high quality. 
    •  Much like the sushi toppings or fillings (or ingredients for any cuisine), the better its inherent quality, the better the outcome. 
    • The link bait is what draws people in, generating conversions as well as click-throughs by lending credibility and authority and the "appeal" of a business or product. 
    • And certainly, with this analogy it is possible to have a huge viral success purely based on link bait (sashimi). But the balance of flavours that sumeshi offers with the toppings, along with wasabi and nori, offers the complete package.
  • The successful execution of most campaigns precludes just launching and leaving it - the follow-through (nori, which coincidentally is a Japanese homonym for "glue") is important. 
    • Again, sashimi - just the link bait - can enjoy success to an extent, but even sashimi needs shoyu (soy sauce) and wasabi to complete the experience, and would be ingested with plain boiled rice anyway.
    • Wasabi might be analogous to diligent moderation (perhaps an expiry/renewal date value) of what content remains accessible - outdated, and thus increasingly irrelevant content should be archived and removed from view in a timely manner.
Unlike with most baking recipes, I've found that cooking instructions can be played with and one can substitute ingredients creatively to some extent - but to me, delicious meals - and sound search enablement - come out of 1) a sound foundation (sumeshi/keywords), 2) deft execution (to ensure quality ingredients/link bait are maximized for gustatory enjoyment), and 3) follow-through (including sunsetting old content/ensuring food doesn't spoil).

Wednesday, October 19, 2011

Where is Dennis Ritchie's day?

It's now a week since the creator of the C programming language, and co-creator of the UNIX operating system, Dennis Ritchie, died after a long illness. I still have the distinct impression that Steve Jobs' charisma and Apple's links to pop culture have generated far more hype than Ritchie's profound contributions to technology have.

A few days ago I'd shared the New York Times obituary on Ritchie, which garnered comments from my loyal readers (thank you, Klaus and Mick!) Since then, I'd been looking at various media sources to see what more would be said about him. However, I see announcements instead like this (Californian governor declares October 16 Steve Jobs Day), and threads like this (Google has neither created a doodle nor provided a hyperlink to Ritchie, despite doing the latter for Jobs).

It seems there must be many more people who share my disappointment and outrage that Ritchie's passing has been eclipsed so effectively by the timing of Jobs' death: here's a blog entry from Computerworld, which makes some more interesting comparisons.

Sunday, October 16, 2011

Time management thoughts, Part 1

I'd recently admitted to some friends that, ironically (and funnily enough) the topic of time management has been on my mind. The irony being that this post comes more than halfway through October, with the greatest gap in time that had transpired since the blog was launched in May.

Here is a quote from the TV series "Bones", which has a protagonist whose behaviour I can relate to quite well. She's being interviewed by a bubbly morning chat show hostess in the following exchange:

Courtesy of IMDB:
Stacy Goodyear: I'm Stacie Goodyear and joining me on Wake Up, D.C. is Dr. Temperance Brennan. She is the author of the best-selling mystery novel "Bred in the Bone" and she's also - now tell me if I get this wrong - an anthropologist who works with the F.B.I. to solve crimes?
Dr. Temperance 'Bones' Brennan: Yes, that's correct. I use the bones of people who have been murdered, or burned, or blown up, or eaten by animals or insects, or just decomposed.
Stacy Goodyear: Well, that's exciting. Uh, Dr. Brennan, your book has sold over 300,000 copies. How do you juggle twin careers as a best-selling author and a crime-fighting scientist?
Dr. Temperance 'Bones' Brennan: Well, I do one, then the other.
Stacy Goodyear: And is the work enjoyable? I mean, the part involving rotten bodies?
Dr. Temperance 'Bones' Brennan: Enjoyable? Well, satisfying, yes. Like cracking a code. But in general, when you're looking at someone who's been brutally murdered... it's complicated.
Stacy Goodyear: 'Cause I just thought, you know, yuck!
[she laughs, but Brennan doesn't]

When I saw this episode some years back, I laughed so much at her response to the juggling twin careers question that my outburst may have slightly alarmed my husband. It's exactly the sort of reply I'd have given, and I'd also learned long ago that while I'm capable of some multitasking, the efficacy thereof is highly dependent upon the nature of the tasks involved.

Over the years, the level of care I've needed to devote to scheduling my work has varied greatly. In consulting settings, at one point I'd tracked 15 minute increments of the workday; being billable necessitated this. In managing linguistic data for the NLP research team, I was able to set my own deliverable due dates, and work towards them in a fairly undisciplined fashion.

Recently, I'd returned to work on something for an external client, which meant there were very intense efforts to meet the interim and final deadlines. With the perceived level of urgency being so high, I found that I truly couldn't let my metaphorical mental backburner work on anything else: even my dreams worked through concerns about the tasks at hand (although this wasn't a new phenomenon in my life: during my piano performance studies at the Royal Conservatory of Music in Canada as a teen, I recall having nightly "walk-through" dreams in the fortnight or so leading up to both the grade 9 and 10 exams.)

Part 2 of this topic to follow...

Wednesday, September 21, 2011

Why I won't link to your blog

Today I received the above comment, unsolicited, and after about two minutes' investigation I moved it into the Spam category. Here's a numbered list explaining why:
  1. Although my name is part of the blogspot domain I use, and promote in most places, the message addresses me as "Webmaster", which is possibly today's equivalent of "to whom it may concern". Actually, I have interchangeably experimented with the vanity URL provided to me via my alma mater, such as on Technorati and
  2. The request is for cross-linking, which already devalues the proposition (as it's a "black hat" practice). If this person truly valued my blog, he would link to it without asking me to link to his.
  3. The request uses my domain, implying that it is a "keyword". I've blocked out the destination URL and the keyword he asked for (which, although partially reflecting his website address, was also far too generic to stand a chance at ranking well for it with his SEO approach).
  4. His blogroll still retains the domain, which indicates to me that he isn't serious about using this blog for a business. (As an aside, I'm slowly working on masking my domain with a vanity URL that I've had for years; I may eventually hide mine altogether). Seeing as he has a *.net domain, I have no idea why he wouldn't give his blog address as www.*withheldbusinessname*.net/blog.
  5. The name he signs off in the comment body does not match the handle he signed in as. Furthermore, clicking through to his Blogspot profile, it offers no substantial information except three blog links, two of which share the same title but are totally unrelated to the SEO topics realm. The third link is related to SEO, but isn't a link to the blog he wished to promote as per his comment.
  6. The only author mentioned and profiled on the blog does not match in name to either the signed name in the comment I received or the handle with which he signed into Blogspot.
  7. The blog itself has a glaringly obvious typographical error, "...Quality Bakc Links", in the heading of its most recent post. There is also a five-month gap between the most recent and second most recent posts, which, considering that the most authoritative corporate blogs post daily, is remarkably poor practice. And finally, its assertion that their services are "100% ethical" is already quite undermined by all the points I've addressed in this post.
So, to the commenter who "visited" and left the message today on behalf of this company: the above are the main reasons I will not be cross-linking with you. Thank you, however, for finding me. And let me know if you link to my blog anyway. :-)

Friday, September 16, 2011

How I syndicate web content

Like most individuals who are working on establishing an online presence, I have multiple SNSs (social networking services) on which I wish to share content. The four main services that I use currently, along with my audience demographics are as follows:
  • Twitter: mostly topics of professional interest or music, and breaking news, scientific articles and alma mater related newsbits. My Twitter follower audience is still small and largely impersonal, which encourages me to be mindful that tweets may be mined publicly by anyone.
  • Google+: my preferred Twitter topics, plus photos that I've begun to upload to Picasaweb, which are primarily Vienna-related. On + my audience is academic and more professionally allied than on Facebook, with very little overlap.
  • Facebook (The link to my FB profile is not publicly available, which was my deliberate choice): most of the above, plus the occasional "true status" - things on my mind that only actual friends would find of slight interest to read. Here we find the greatest percentage of people I knew and felt favourably towards from all levels of school, elementary through university, as well as a handful of LinkedIn contacts though not all.
  • LinkedIn: my topics largely overlap with Twitter, save the music and breaking news types of content. My audience encompasses most contacts I made throughout my career, some web-based friends and some school friends.
(And as an aside, I've recently joined XING, but only have a few contacts on it thus far).

Since the mantra I follow is "write (or share) once, publish everywhere that it's relevant", here's what I've ended up doing:
  • Use the Rob McGee Google+ Bot on + to post content that I wish for + public, Twitter, Facebook and LinkedIn (via the use of the #li or #in hashtag) to see. This works best for professional and academic topics. I particularly like that the main image that one sees normally just in Facebook-native link shares also happens when the bot is used.
  • Use the + interface without the bot, for content I only wish for subsets of my + audience to see.
  • Use a private Facebook group I created for some topics, and then use lists to target my audience subsets.
  • Use LinkedIn's status feature to either post content specifically to LI, or also to Twitter.
  • Use Tweetdeck's edit retweet feature from my phone, to post both to Twitter and sometimes to LI. This I often do while winding down in the evenings.
  • Use Android-native apps on my phone (for Huffingtonpost, NPR, BBC News) to share content either to Facebook or Twitter (with the option of forwarding it on also to LI via the aforementioned hashtags).

Two more aggregators I've noted to try out are and Yoono, both from my phone and computer. I may report back on these, when time allows.

Friday, September 9, 2011

Bing's "SEO Fundamentals" are everyone's fundamentals

As a followup perhaps to the Bing/Yahoo! quality checklist, subsequently provided 18 points of what Bing expects web content publishers to implement for SEO.

Well, it seems to me that all their advice applies equally well to those aiming to optimize their web content for any search engine. I think perhaps that there should have been a disclaimer associated with point 1, which concerned the implementation of robots.txt and XML site maps. It's still my understanding that both of these files only provide a set of suggestions for search engines, and their parameters may not necessarily be obeyed by crawlers.
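To illustrate the "suggestions only" point, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "PoliteBot" user agent are all made up for illustration. The parser merely answers the question "does this file ask me to stay away?"; nothing forces a crawler to ask it, or to honour the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what robots.txt *requests* for this user agent and URL.

    This is purely advisory: a crawler that never calls this function
    (or ignores its result) can still fetch the URL.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "PoliteBot", "https://example.com/blog/post.html")
private_ok = is_allowed(ROBOTS_TXT, "PoliteBot", "https://example.com/private/notes.html")
print(public_ok, private_ok)
```

A well-behaved crawler would skip the second URL; an ill-behaved one simply wouldn't run the check at all.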

Point 8, create an RSS feed, also may imply quite a few additional points, such as that new content is expected to be published with some frequency and that said feed can be easily subscribed to by those who may not know how to hack the URL (via point 11, enablement of social media).

Points 12 through 18 are the don'ts, and they also reflect the most prevalent of "black hat" practices such as link farming and buying.

In all, this list should help someone getting started with SEO efforts to perform a sanity check against their current web site.

Thursday, September 8, 2011

Trying out Technorati (claim code)

Their FAQ advises against using redirects, so this may not work - nonetheless, here it is:


I may need to re-claim with my actual domain URL.

Tuesday, September 6, 2011

Criteria for "quality" from Bing/Yahoo!'s perspectives

About a month ago, published this article on things that Bing have disclosed that they penalize web content for from a ranking perspective.

Most of the points they made concerned concision, but the final point on actively discouraging machine-translated text caught my eye. I'd posted in the past about how translation did not equate to localization, so I was rather pleased to imagine that someone was incorporating grammar and spelling checks into the ranking algorithm. However, I also have the following questions:
  • Do they verify that the language attribute found in the HTML matches the body text language that people read?
  • If the language is a distinct flavour, such as English as spoken in India or the Kansai dialect of Japan, is that taken into account during the linguistic quality assessment?
  • Do they penalize on slang, profanities or "text-speak" orthography, or will they process them accurately and take that into account in evaluating the tone of the site? The site comes to mind for this instance, where the main entries and definitions, not to mention examples, are rife with NSFW terms.
I'll follow up with a post, should I find information further to any of the above.
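
As a thought experiment on the first question, here is a rough Python sketch of how such a check could work. I have no knowledge of how Bing (or anyone else) actually does this: the approach, the tiny stop word lists and all the names below are my own assumptions, and the naive stop word count stands in for a real language identifier.

```python
from html.parser import HTMLParser

# Tiny, illustrative stop word lists; a real system would need far more.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "das", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}
PUNCTUATION = ".,;:!?\"'()"


class LangAndText(HTMLParser):
    """Collects the <html lang="..."> value and the words of all text nodes."""

    def __init__(self):
        super().__init__()
        self.declared = None
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            lang = dict(attrs).get("lang")
            if lang:
                # Keep only the primary subtag, e.g. "en" from "en-IN".
                self.declared = lang.split("-")[0].lower()

    def handle_data(self, data):
        self.words.extend(w.strip(PUNCTUATION) for w in data.lower().split())


def guess_language(words):
    """Pick the language whose stop words occur most often, or None."""
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def declared_and_guessed(html):
    parser = LangAndText()
    parser.feed(html)
    parser.close()
    return parser.declared, guess_language(parser.words)


# A made-up page that declares English but contains German text.
page = ('<html lang="en"><body><p>Der Hund ist nicht in dem Haus '
        'und die Katze ist das auch nicht.</p></body></html>')
declared, guessed = declared_and_guessed(page)
print(declared, guessed, declared == guessed)
```

Note that dropping the region subtag is exactly what would lose the "distinct flavour" information raised in the second question.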

Friday, September 2, 2011

The benefits of

Some months ago my team lead had mentioned to me. It's a social bookmarking service that aggregates links that the user has shared out via various social media services. As I often try to share web content that I find interesting but rarely spend the time completing an in-depth perusal of said content, I've found the cumulative archive of what I've been tweeting and publishing via Google+/Buzz and Facebook to be most useful. At the least, it spares me the effort of maintaining browser-specific bookmarks and trawling through my Facebook profile export or tweet history. LinkedIn shares are also supported, but due to the way I cross-publish, I haven't bothered to use it.

Furthermore, has a Top SEO Experts group, which I was able to join. Through it I can find not only the most up to date content that benefits me in my current role, but I can see via the number of shares, how popular or vetted the links have been.

Now, if only I had the time to read everything I wanted to. Perhaps if I didn't require sleep at all...

Tuesday, August 23, 2011

Observations about Twitter hashtags

I've been spending more time on Twitter lately, and wanted to note two things I've gleaned, rather unscientifically.

First, about tweeting topics (or trending hashtags) and culture.

In Japan, many trending topics and/or hashtags encourage sharing of personal information and, more so, interaction between Twitter users. Some examples from the last few days are: "how I came to start tweeting" and "what age would you say I am?". What seems far less prevalent thematically in Japan when compared to the other places I've been watching (French, German, Irish, American) are people (celebrities, sports figures), TV shows, and states of mind. Since I don't keep abreast of most entertainment news, and think twice before presenting too many of my rants for public consumption, I'm finding it easier to incorporate the Japanese themes into my tweets than the Western ones.

Second, a tale of two anniversaries and twitter strategies: MIT150 and Harvard375.

Alongside my employer's centennial, both my alma mater and the "little red brick schoolhouse" are celebrating milestone years. (As an aside, last year was my high school's centennial year).

MIT has its own sesquicentennial-specific domain (linked above) and 4 Twitter IDs (the primary one being @MIT150):
[Twitter profile widget: MIT150]
However, I couldn't see any hashtag use, which implies that their target audience would have had to know about the ID. As well, a cursory search for affiliated accounts shows that the MIT Museum account is currently the most popular with over 40K followers, whereas MIT Press takes second spot with nearly 15,600 followers. The business school (Sloan) takes third place in follower count, at roughly 12K.

On the other hand, Harvard has a surprisingly poor following for its 375th year specific ID:

[Twitter profile widget: Harvard University]

However, the main @Harvard ID uses its own #Harvard375 hashtag, which it has been interspersing with all its other news. This reaches its nearly 63K followers.

[Twitter profile widget: Harvard University (@Harvard), Cambridge, MA]
On the other hand, the self-professed official MIT presence on Twitter has the following stats, which to me were somewhat - but not excessively - surprising:
[Twitter profile widget: MITnews]
From the contrast that the stats present and via my recollections of my exposure to both school cultures, I'm positing that MIT is more likely to use Twitter as a tool to convey useful information to its community, and only chooses to follow sources they know that they would retweet or benefit from factually. On the other hand, Harvard seems to take advantage of the medium (as with other social media tools) to promote itself consciously with public relations and networking in mind. These are likely stemming from fundamental philosophical differences concerning interpersonal communication. Looking at their homepages, the respective real estate allocated to social media is also quite revealing.

The lesson I'm inferring here is that although my instinct is to use Twitter in the MIT style, in order to be successful on Twitter I need to shift to the Harvard way.

p.s. I've decided to shift to a less intensive publishing schedule for the month of September, with apologies in advance for the unpredictable (but reduced) post count. At the same time I may try to become more active on Twitter. No promises though!

Friday, August 19, 2011

iOS vs Android users - commentary on an infographic

A former manager of mine shared this infographic on LinkedIn the other day, and I wanted to share some thoughts on the findings it presents.

First, it mentions that Android users mainly fit the 18-34 age bracket, which really seems to explain many of the other traits they're more likely to have. Specifically, the survey results reveal that Android users tend to have started using the internet around or after 2000, their incomes tend to be (significantly) lower than those of iOS users, they're not as well-traveled, and they tend to hold fewer educational qualifications. From the combination of these I suspect that a large portion of the sampled Android users simply haven't completed their undergraduate work yet.

Next, let's look at the gender skew: more men than women typically use Android. There's a well-touted gender correlation with math and, perhaps stereotypically, with meat-heavy food preferences. Having said that, anecdotally within my local team, there are 4 iPhone users (3 of whom are male): I'm the sole Android user.

Looking at the tendency towards pessimism (the aforementioned female manager from my NLP days, who shared the link, implies that she's an Android user herself, albeit a "realist") combined with fiscally conservative behaviour, it's not surprising that the Android crowd are later adopters of smartphones - though, speaking as an erstwhile software quality assurance person, I think most people who have been testers would be later adopters.

Although I speak as an Android user, I identify with more than half of the iOS traits. I match the core demographics, life experience and phone use traits on the iOS side with only a couple of exceptions. Personality-wise, I'm mostly on the Android side, but that's the only category where I'm overtly typical of the crowd: I can relate to roughly half from each set of all the remaining categories. Not sure whether I could legitimately say that, however, since I've not seen any of the movies mentioned, can't get the TV channels listed or General Gau's chicken (it's Tso's in most places, but Gau's in New England) here in restaurants, and am only starting to explore European wines (I've heard great things about Swiss wineries...)

Mick, I think the Blackberry OS profile suits you quite well. Thanks as always for your comments, by the way!

Wednesday, August 17, 2011

Personal thoughts on Twitter and follower counts

Here I'd like to document various thoughts concerning my journey in Twitter, which I joined in 2009.

At first, I wasn't convinced that I would enjoy using it. Already feeling overwhelmed by the Information Age, I also noticed a lot of highly public yet personal (read: inappropriate or irrelevant for mass consumption) tweets as well as quite a lot of rude behaviour (ad hominem attacks). At the time of joining I had no smartphone, and even now I have a severely minimalist data plan, so I don't tweet "on the go". Since I walk to work, checking the Twitter stream on my commute is also fairly hazardous (although having said that, when I had a painful bus commute I relied on audio casts and preferred musical recordings stored in my iPod due to the ease with which I succumb to motion sickness.)

As of today, mostly due to the aforementioned circumstances, I still only have a handful of tweets. More depressingly, I've noticed some depletion of my followers (my record high was 164; as of today I have 157). However, I'm realizing the value of having a qualified audience, where those following are only folks who enjoy the overall mix of content that I publish.

A way to ensure that one's target audience finds one, as with regular web pages, is well-researched hashtag and keyword use. More than a handful of fellow SEO enthusiasts/would-be opportunity marketers have found me on Twitter, I believe, due to the keywords I've been embedding near the start of many of my blog posts via the post titles.

When I initially started to blog in English this year, I did hand-craft my post announcement tweets. Eventually I lapsed into dependence on automated notices courtesy of networkedblogs, which is my most productive referral site. Due to its close integration with Facebook (literally; they authenticate users via Facebook credentials) it cross-posts there most readily, and because most of my audience seems to know me personally, this isn't surprising either.

I've also consistently shared my tweets on Google Buzz (Google itself posts both new blog entries and notes update time(s) in Buzz), and sometimes on LinkedIn where I've relied on TypePad to list my blog entries, which it does perhaps too enthusiastically (every post used to appear twice, though I've since fixed this). "Write once, publish (presumably once) everywhere" makes sense not only logistically, but also in the consistency of the online persona one presents.

As many best practices as there are, there are also quite a few poor ones in the Twitterverse. One of the latter that I've tried to avoid personally was gleaned first-hand via one of my roles at work as lead seeker: it's where someone repeatedly spams the same type of question, only offers marketing page links or otherwise shows no well-roundedness or individuality as a human. My micro-reviews of the occasional Arte music program and very infrequent dialogue with old school friends hopefully provide some insight into who I am, but for a time I scheduled #QotD (quotes of the day) to be pushed out around 4PM EST, to boost my tweet count.

I'll call it a (work)day now, but if you've made it this far, thank you. Feel free to follow and @ me on Twitter too. At your convenience and only if you wish to, of course.

Monday, August 15, 2011

5 Writing tips, or a response to "8 Essential Tips to better Content Writing"

Here's the original blog post upon which I'm commenting.

(A disclaimer: I've had no interactions with this author save the message I left for him on his blog. I also have no metaphorical axe to grind nor malice with which I'm replying (the apt expression in Japanese would be that I'm not "selling him a fight"). It's simply that I wish to present my critique on the actual 8 listed tips. I certainly agree with his opening paragraph.)

Now, my response proper:

I believe his 8 tips could be condensed into 5. Moreover, in my world they would be re-ordered as the following:

This merges his "valuable" and "solution" tips, and is related to "relevance" too, in terms of what the audience expects to find on the site, topic-wise.

Using vetted sources for information is an essential part of all academic writing; lend credence to one's own assertions whenever possible online, too. This also touches upon the "resourceful" tip although in his article, it seems to speak to visual aids and formatting. I recently was reminded of this valuable lesson when I failed to verify the Mashable coverage of the bogus "study"!

His tip speaks to how the content fits into the overall theme(s) of the site in question, but I would argue that in the context of blogs, topic relevance should be measured against one's actual (vs. intended) audience. As the preferences of the actual audience become apparent (as fed back via comments), a responsive author would find her or his topic choices being affected by this information - it's a two way street.

This point would, ironically, merge his "short 'n' sweet" and "to the point" tips; I (perhaps naïvely) believe one can't effectively produce short posts without their being "to the point". Why produce pointless but brief missives? Perhaps that overlaps with the microblogging realm...

By this, I'm merging both the original post's "resourceful" and "readable" tips. The text should be well written with the target audience's expectations in mind, and presented with some visually appealing items when possible, though my preference is to avoid gratuitous videos or (info)graphics, especially if an alternate mobile design is not implemented for one's readers.

Friday, August 12, 2011

Why most bloggers needn't worry about high bounce rates

I was encouraged to read a couple of posts that talked about bounce rates from a web analytics person. In them, he describes several contexts in which high bounce rates should not be construed as being a negative reflection of the quality of the site or content.

My own bounce rate is nearing 75% to date. In the web metrics world, bounce rate is defined as when "the visitor leaves a site without visiting any other pages [within the same domain] before a specified session-timeout occurs." In the aforementioned blog, the first entry talked about when the page's call to action takes the user to an external page or an advertisement link, and what is most valid for blogs, when the page arrived at is a so-called "destination page".

Most blog designs that I've seen show the most recent entry on the root or landing page. When a blogger's posts are brief enough to be displayed there in their entirety, returning readers only need to read the most recent content before moving away, and so they register as bounces.

This is why I actually prefer the definition that Unica uses for bounce rate, which is where the visitor spends 10 seconds or less before moving away from a given page. This would more reliably indicate whether the user had arrived at content which they were seeking, although clearly if the landing page only exists to provide a call to action such as downloading a product from an external source, it's still plausible that event tracking is necessary to determine if a high bounce rate is problematic.
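
To make the contrast between the two definitions concrete, here is a small Python sketch. The session records, field names and function names are hypothetical, and the data is made up; the 10-second threshold is the Unica figure as I've described it above, applied here (as an assumption) to the landing page only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    pages_viewed: int          # pages seen within the same domain
    seconds_on_landing: float  # time spent on the first page


def bounce_rate_single_page(sessions: List[Session]) -> float:
    """Share of sessions that left without visiting any other page."""
    if not sessions:
        return 0.0
    bounces = sum(1 for s in sessions if s.pages_viewed == 1)
    return bounces / len(sessions)


def bounce_rate_time_based(sessions: List[Session],
                           threshold: float = 10.0) -> float:
    """Share of sessions that left the landing page within `threshold` seconds."""
    if not sessions:
        return 0.0
    bounces = sum(1 for s in sessions if s.seconds_on_landing <= threshold)
    return bounces / len(sessions)


# Made-up sessions for a blog whose landing page shows the latest post in full.
sessions = [
    Session(pages_viewed=1, seconds_on_landing=95.0),   # read the new post, left
    Session(pages_viewed=1, seconds_on_landing=120.0),  # read the new post, left
    Session(pages_viewed=1, seconds_on_landing=4.0),    # wrong page, left at once
    Session(pages_viewed=3, seconds_on_landing=30.0),   # browsed the archive
]

single_page_rate = bounce_rate_single_page(sessions)
time_based_rate = bounce_rate_time_based(sessions)
print(f"single-page: {single_page_rate:.0%}, time-based: {time_based_rate:.0%}")
```

With this sample, the two returning readers who read the newest post and left count as bounces under the first definition but not the second, which is the distinction that matters for blogs.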

However if you, as a blogger, still wish to work on lowering your bounce rate, try following these tips: they're courtesy of a fellow not-too-concerned-about-bouncing blogger.

Wednesday, August 10, 2011

American iOS, Android and BlackBerry OS usage mapped

Mashable had an interesting report to show us from July 2011:

It occurred to me that the North-East, Mid-West and South/West looked vaguely reminiscent of American political party affiliation data by state, so I found this Wikipedia map of the gubernatorial election results data from 2010:

[Map legend: Republican gains, Republican holds, Democratic gains, Democratic holds, Independent win, not contested]

Well, a slight correlation can be discerned, anyway - California may be favouring Android overall, but Mashable's article did report that there are iOS-heavy cities (to be specific, they were reported as San Francisco, San Jose, Modesto, Oxnard, Santa Barbara, Chico, Santa Cruz, San Luis Obispo and Napa).

Given my personal history in Massachusetts and the fact that most of my friends (and colleagues) are iPhone owners, one might think I'd also jumped onto the iOS bandwagon. Apple had also been the most popular smartphone manufacturer in Japan.

However, when I finally joined the smartphone-using populace this past May (closely coinciding with the launch of this blog, actually), I picked an Android phone. Articles like this one, reporting from Bloomberg, piqued my curiosity sufficiently for me to make this choice, when combined with the cumulative gaffes surrounding Apple's iPhone (for design issues and pre-ordering of the 4th gen, to cite just two instances.)

In any case, I'm still hoping my spouse will upgrade to an iPhone 5 not too far in the future: his Nokia is on its last legs, and then I'll be able to make some personal comparisons to the usability of these OSs.

Monday, August 8, 2011

A hoax correlation study: IQ scores and browser choice (amended 8th August 2011)

One of the news aggregators that I visit is Mashable, and recently they published the results of a false correlative study of browser use and IQ score, which supposedly used data from 100,000 users (and was run by a Canadian company).  Here's the link to the Mashable article.

Supposed correlation results from the published hoax:
The fictitious study's conclusion was that “individuals on the lower side of the IQ scale tend to resist a change/upgrade of their browsers.”

Since I blogged about it well before the hoax was exposed, I've decided to keep an amended version up (thanks to Caesar for the comment). My own anecdotal impression had been that on corporate hardware, the vestiges of IE6 use were attributable to bigger bureaucratic organizations, which actually do exhibit tendencies to resist change. Another variable that has historically influenced rates of browser use is, of course, factory settings. Microsoft's IE certainly enjoyed years of being the out-of-the-box default in the PC market.

I personally divide my time these days between my company-sanctioned flavour of Firefox, Opera and Chrome. And since my post about my audience and their browser use, I've been wondering why Blogger counts SimplePie as a browser (it's an RSS reader) while Google Analytics doesn't.

Browser contribution to total:
1. 135 39.94%
2. 121 35.80%
3. 48 14.20%
4. 17 5.03%
5. 10 2.96%
6. 5 1.48%
7. 1 0.30%
8. 1 0.30%

Friday, August 5, 2011

Learning "Englise" - a fun Friday share

I received the following album link from a friend: the photos consist of pages from a Hangul - English phrasebook.

Commented samples from the publication "Living Englise Language Everyday"

Aside from the implicit perceptions of "common" phrases that the authors seem to expect to be spoken or heard in English, the most noticeable grammatical mistakes seemed to arise from the unpredictable use of "to be" in place of "to have". This was actually something I noticed when studying French and German, such as the "j'ai froid" "I am cold" "mir ist kalt" comparisons (and it's "j'ai faim" "I'm hungry" "ich habe Hunger"/"ich bin hungrig") - in Japanese at least, the subject is so often omitted that just saying "寒い" ("[I feel] cold") or "お腹がすいた" ("[My] stomach has become empty", to attempt a literal interpretation) suffices. This would explain, if Korean has a similar construct, why such phrases are so often conjugated with the wrong verb.

The second page of the album simply consists of this:

Ah, and what gems that sequel must contain... hopefully with less emphasis on running over children and overall crime rates!

Wednesday, August 3, 2011

Thoughts on the information age: news aggregators

Several of my fondest childhood memories stem from working in libraries. There, it was often my duty to take a crisp newspaper and clamp it to a wooden holder for broadsheets. My preferred paper was The Globe and Mail from quite early on; one of the alumni from my high school is a prominent columnist there.

The advent of the internet in the early to mid-90s happened to coincide with a period when I didn't subscribe to broadsheets and lived without TV (otherwise known as my time at university). To procrastinate from studies, I often read through some of the newsgroups, and played around with a personal set of HTML pages. Interestingly I was still working in the libraries during this period, but had moved to cataloguing new arrivals of periodicals, and didn't touch newspapers except for the occasional copy of the university papers (The Tech and Tech Talk - I was saddened to learn the latter went out of print in 2009) or Bay Windows, made freely available to the community in stacks at Lobby 7.

A few years later, I began to receive my news online, but still at independent sites like those listed above. It was more recently that I began fully using RSS feeds, but by that point, the information age was instilling the overwhelming sense that no matter how much I read, I would still miss interminably vast amounts of news. It has become all the more crucial to me to find sources whose biases are knowable, and whose integrity has been beyond reproach (the NYT scandal disillusioned me greatly).

Now that individuals have the means to publish their own news compilations, I marvel at their having the time to review, vet and hand-select the content to aggregate in the first place. At this point I've accrued personal interest in the national level headlines of all the countries I've resided in as well as retained a desire to practice Japanese (via reading generally, which is easily accommodated online). This, combined with an attempt to keep abreast of my friends' shared information on various SNS, has led to an increasing portion of my free time spent sedentary and online.  As thankful as I am for the world wide web, the inactive lifestyle part is something I need to change. At the expense of missing even more news, it seems, alas.

Monday, August 1, 2011

"Old school" communication styles

I recently communicated with a newly hired colleague, who had just completed his Master's degree. After inviting me to contact him primarily by email or instant message, he remarked upon how he found teleconferences "old school". His comment gave me pause to think about my experience with globally distributed teamwork.

While I was in my prior role at the software research lab, our team was distributed across CDT (UTC - 5 hours) through GMT and all the way to JST (UTC + 9 hours). Email was definitely the main medium for complex discussions, and as the centrally located team, we primarily conversed via instant messaging (where accents and bad audio quality couldn't interfere with comprehension) with the Japanese, UK and Egyptian colleagues in our mornings, and the American and Canadian ones in our mid afternoons. There were regular teleconferences (from which the Japanese were mostly exempt, as these typically fell late in their night), but those tended to serve as team socialization aids, since in-person visits were infrequent and costly to arrange.

When I joined my current organization, I found that teleconferences were far more frequently held, and used for information transfer as well as the aforementioned socialization, since despite the majority of participants being based in the US, they're still scattered geographically. As the majority are along the East coast, with a smattering of central and west coast folks, the latter have definitely expressed some discomfort with very early morning phone calls.

Anecdotally, I've experienced a much higher prevalence of loquacity amongst these Sales and Marketing folks, which has necessitated an adjustment on my part. The effort to do so will apparently be a lengthy one.

About Mayo

Professional: As "Senior Enterprise SEO Strategist" in IBM's Digital Marketing division, I provide consulting and training services for both internal and external clients. Formerly I was involved in Natural Language Processing, software localization, quality assurance and documentation authoring.
Personal: INTJ Nikkei Nisei ex-patriated Canadian who takes photographs and enjoys Baroque through late Classical music. The G+ page shares some of the "best of" photos.