Posts

Showing posts from 2011

Year-end thoughts, 2011 edition

Over the lifetime I've spent living in various Western countries, I've noticed that media and individuals alike tend to focus on retrospection around this time of year: that is, reminiscing about the events and experiences one associates with the year just ending. In direct contrast, it's my understanding that in Japan it is customary to hold a 忘年会 ("forget-the-year party") which, paired with the 新年会 (New Year party, held after the 正月三が日, the first three days of January), encourages forgetting the prior year through much carousing and imbibing. This year was particularly unforgettable for those with ties to Japan, however, and I've seen social media statuses speaking to the importance of remembering the disasters that have befallen my cultural homeland. The fallout - both metaphorical and literal (environmental, economic, political, and emotional) - will be palpable for decades, if not centuries, regardless of any desire the world may have to forget. P

LanguageWare's robust, extensible Language ID (part 4)

Having introduced the "prior art" approaches to identifying textual language in part 1 and part 3 (to wit: stop word presence and n-gram detection), I can now speak to the patented idea that we implemented as part of LanguageWare, a set of Java libraries that offer NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store the following types of information. Each entry consists of:
- Term or n-gram
- Language(s) with which it's associated
- Whether it can occur as a standalone term, at the beginning of a word, in the middle of a word, at the end of a word, or some combination of these, and
- An integer weighting value (per term/language pairing)
Thus, for the Chinese Simplified/Traditional and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive va
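To make the entry structure described above more concrete, here is a minimal sketch in Python. It is not LanguageWare's implementation (which is a set of Java libraries with a compacted dictionary); the names `Entry`, `ENTRIES`, `score` and `identify`, the position flags, and all of the sample weights are assumptions invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical position flags: where inside a word an entry is allowed to match.
STANDALONE, START, MIDDLE, END = 1, 2, 4, 8
ANY = STANDALONE | START | MIDDLE | END


@dataclass
class Entry:
    term: str             # a term or n-gram
    weights: dict         # language code -> integer weight (per term/language pairing)
    positions: int = ANY  # combination of the position flags above


# Illustrative entries only; the weights are made up for this sketch.
ENTRIES = [
    Entry("の", {"ja": 10}),            # Japanese-specific kana, large positive weight
    Entry("国", {"zh": 1, "ja": 1}),    # character shared by Chinese and Japanese
    Entry("the", {"en": 3}, STANDALONE),
    Entry("ing", {"en": 1}, END),
    Entry("der", {"de": 3}, STANDALONE),
]


def position_of(word, start, length):
    """Classify a match inside a word as standalone, start, middle or end."""
    if start == 0 and length == len(word):
        return STANDALONE
    if start == 0:
        return START
    if start + length == len(word):
        return END
    return MIDDLE


def score(text, entries):
    """Sum the weights of every entry that matches at a permitted position."""
    totals = defaultdict(int)
    for word in text.split():
        for entry in entries:
            start = word.find(entry.term)
            while start != -1:
                if position_of(word, start, len(entry.term)) & entry.positions:
                    for lang, weight in entry.weights.items():
                        totals[lang] += weight
                start = word.find(entry.term, start + 1)
    return dict(totals)


def identify(text, entries):
    """Return the highest-scoring language code, or None if nothing matched."""
    totals = score(text, entries)
    return max(totals, key=totals.get) if totals else None
```

With these made-up entries, `identify("日本の国", ENTRIES)` favours Japanese because the kana weight outweighs the shared character, which is the kind of disambiguation the weighting is meant to allow. A linear scan over a list is used here only for readability; it says nothing about how the real dictionary is stored or searched.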

Language ID Part 3 - more challenges

Stop word detection usually works...
In my prior post on this subject, the Jabberwocky examples hopefully demonstrated that a language label can still be assigned to text when it contains words identifiable as belonging to a language's pronoun, conjunction, or adposition parts of speech. Such identifiers are, in this context, considered to be stop words. The presence of such terms was sufficient for us to recognize the language even when the nouns, verbs, adjectives and adverbs are unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples, some inflections hinted at the nonsensical words having specific qualities. Specifically, nouns could be spotted when, in the inflected languages, pluralization or possession was shown via -s/'s endings (English, though -s can indicate possession in German also) or combination of title case capitalization and (when plu
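As a rough illustration of the stop word approach discussed above, the following Python sketch counts how many tokens of a text appear in small per-language stop word sets. The tiny word lists and the names `STOP_WORDS` and `guess_language` are assumptions made up for this example; a real system would use far larger lists and a less naive tie-breaking rule.

```python
import re

# Tiny, hand-picked stop word sets (illustrative assumptions, not a real resource).
STOP_WORDS = {
    "en": {"the", "and", "in", "of", "he", "his", "that", "were", "did"},
    "de": {"der", "die", "und", "in", "das", "er", "sein"},
    "fr": {"le", "la", "et", "les", "dans", "il", "son"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    # Lower-case the text and keep runs of letters only (no digits/punctuation).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


opening = "'Twas brillig, and the slithy toves Did gyre and gimble in the wabe"
print(guess_language(opening))
```

Even though "brillig", "slithy" and "toves" are in nobody's vocabulary, the words "and", "the", "did" and "in" are enough to tip the count towards English in this sketch.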

Every breath we take (Foursquare et al.)

Image
Approximate location of this blogger, give or take a few hundred metres
After many months of dragging my feet, I joined Foursquare today. For those unfamiliar with it, this geo-social networking service lets a registered user with a smartphone download an application that makes it easy to "check in" to physical places. With tie-ins to Facebook and Twitter, it encourages users to publicly promote the businesses and services they prefer. These endorsements, in turn, are what businesses value sufficiently to make offers to those who check in. Truth be told, I'm not a particularly suitable user of such services. First, I'd rather not have my whereabouts documented online to this level of detail, even though I don't live alone (and thus am not quite so susceptible to being burgled). Second, I'm an inconspicuous consumer - that is, I try to live frugally, and what I consider to be frivolous purchases mainly take

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used was from the children's book by Lewis Carroll - a poem that the protagonist reads, and although not comprehending it, thinks it "pretty" and that it was "clear" that "somebody killed something". Here it is, courtesy of the Wikipedia (English) article:

"Jabberwocky"

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling

Trunk.ly acquired by Delicious

On November 9, 2011, it was announced that the newish owners of Delicious had acquired trunk.ly (which I'd blogged about before). However, even earlier (in September), my manager had blogged about Delicious' apparent demise, as precipitated by the takeover by AVOS. Trunk.ly has promised to remain functional until the start of next year, but I've found that attempts to use the Delicious import feature are failing (the page times out). The export from trunk.ly worked without any problems. I'm sincerely hoping that Delicious gets its act together, and that it will soon have incorporated trunk.ly's ease of use and restored lost tags, and will work glitch-free for all the pre-existing (or surviving?) users.

Language ID (textual) - part 1

Image
Word cloud of one person's compilation of English stop words, courtesy of Armand Brahaj (whose site has been infected by malware). Here instead is ranks.nl's list.
Now that half a year has elapsed since the inception of this blog, some readers may be wondering when I might share more topics related to the "Linguistics" part of "SEO, Linguistics, Localization". In fact, one of the triggers for my starting this blog was the issuance of two patents, filed in 2005 and 2006, for which I was a co-inventor and sole inventor, respectively. Both filings concerned language identification (from textual input): the first approached the challenges of identifying a text's (primary) language, and the second was an application of the first, as combined with messaging software. Rather than overwhelm the reader with extensive explanations, I'm going to attempt to create a series of posts that will cover everything in the way

Sushi Preparation compared to Search Enablement

Image
Courtesy of Kojiro Fish Shop in Wieden, Vienna
Being a fan of various cuisines, I count myself fortunate in having had the opportunity to grow up in Toronto (and having spent time in gastronomical meccas such as Tokyo and New York). As my parents kept my household quite Japanese, I grew up eating what most of my classmates considered to be exotic foods: umeboshi, chirashi zushi, korokke, grilled fish with daikon oroshi and such. Thus, when I was recently asked by a virtual friend - by which I mean someone whose acquaintance I made online and have not yet spent time with in person, as opposed to an artificial being - to review her classmate's journey of learning to make sushi, I thought I might as well take the opportunity to talk about how my views on sushi preparation and on enabling search optimization of online content actually have comparable points. Sound strange? Do read on... First, the sushi making (with the disclaimer that I am not a professional chef, nor would

Where is Dennis Ritchie's day?

It's now a week since Dennis Ritchie, the creator of the C programming language and co-creator of the UNIX operating system, died after a long illness. I still have the distinct impression that Steve Jobs' charisma and Apple's links to pop culture have generated far more hype than Ritchie's profound contributions to technology. A few days ago I shared the New York Times obituary on Ritchie, which garnered comments from my loyal readers (thank you, Klaus and Mick!). Since then, I've been looking at various media sources to see what more would be said about him. However, I instead see announcements like this (Californian governor declares October 16 Steve Jobs Day), and threads like this (Google has neither created a doodle nor provided a hyperlink to Ritchie, despite doing the latter for Jobs). It seems there must be many more people who share my disappointment and outrage that Ritchie's passing has been eclipsed so effectively by the timing of J

Time management thoughts, Part 1

I recently admitted to some friends that, ironically (and funnily enough), the topic of time management has been on my mind. The irony is that this post comes more than halfway through October, after the longest gap between posts since the blog was launched in May. Here is a quote from the TV series "Bones", which has a protagonist whose behaviour I can relate to quite well. She's being interviewed by a bubbly morning chat show hostess in the following exchange, courtesy of IMDB:
Stacy Goodyear: I'm Stacie Goodyear and joining me on Wake Up, D.C. is Dr. Temperance Brennan. She is the author of the best-selling mystery novel "Bred in the Bone" and she's also - now tell me if I get this wrong - an anthropologist who works with the F.B.I. to solve crimes?
Dr. Temperance 'Bones' Brennan: Yes, that's correct. I use the bones of people who have been murdered, or burned, or blown up, or eaten by animals or insects, or

Why I won't link to your blog

Image
Today I received the above comment, unsolicited, and after about two minutes' investigation I moved it into the Spam category. Here's a numbered list explaining why:
1. Although my name is part of the blogspot domain I use, and promote in most places, the message addresses me as "Webmaster", which is possibly today's equivalent of "to whom it may concern". Actually, I have interchangeably experimented with the vanity URL provided to me via my alma mater, such as on Technorati and STC.org.
2. The request is for cross-linking, which already devalues the proposition (as it's a "black hat" practice). If this person truly valued my blog, he would link to it without asking me to link to his.
3. The request uses my domain, implying that it is a "keyword". I've blocked out the destination URL and the keyword he asked for (which, although partially reflecting his website address, was also far too generic to stand a chance at ranking well

How I syndicate web content

Image
Like most individuals who are working on establishing an online presence, I have multiple SNSs (social networking services) on which I wish to share content. The four main services that I currently use, along with my audience demographics, are as follows:
- Twitter: mostly topics of professional interest or music, plus breaking news, scientific articles and alma mater related newsbits. My Twitter follower audience is still small and largely impersonal, which encourages me to be mindful that tweets may be mined publicly by anyone.
- Google+: my preferred Twitter topics, plus photos that I've begun to upload to Picasaweb, which are primarily Vienna-related. On Google+ my audience is academic and more professionally allied than on Facebook, with very little overlap.
- Facebook (the link to my FB profile is not publicly available, which was my deliberate choice): most of the above, plus the occasional "true status" - things on my mind that only actual friends would find of slight int

Bing's "SEO Fundamentals" are everyone's fundamentals

Image
As a followup perhaps to the Bing/Yahoo! quality checklist, Searchenginejournal.com subsequently provided 18 points on what Bing expects web content publishers to implement for SEO. It seems to me that all of this advice applies equally well to those aiming to optimize their web content for any search engine. I think there should have been a disclaimer associated with point 1, which concerned the implementation of robots.txt and XML site maps. It's still my understanding that both of these files only provide a set of suggestions for search engines, and that crawlers may not necessarily obey their parameters. Point 8, create an RSS feed, may also imply quite a few additional points, such as that new content is expected to be published with some frequency and that said feed can be easily subscribed to by those who may not know how to hack the URL (via point 11, enablement of social media). In fact, segmented audience studies have shown that the pu

Trying out Technorati (claim code)

Their FAQ advises against using redirects, so this may not work - nonetheless, here it is: M467DDBXQN92. I may need to re-claim with my actual domain URL.

Criteria for "quality" from Bing/Yahoo!'s perspectives

Image
About a month ago, Searchenginejournal.com published this article on things that Bing has disclosed it penalizes web content for from a ranking perspective. Most of the points concerned concision, but the final point, on actively discouraging machine-translated text, caught my eye. I'd posted in the past about how translation did not equate to localization, so I was rather pleased to imagine that someone was incorporating grammar and spelling checks into the ranking algorithm. However, I also have the following questions:
- Do they verify that the language attribute found in the HTML matches the body text language that people read?
- If the language is a distinct flavour, such as English as spoken in India or the Kansai dialect of Japanese, is that taken into account during the linguistic quality assessment?
- Do they penalize slang, profanities or "text-speak" orthography, or will they process them accurately and take that into account in evaluating the

The benefits of trunk.ly

Image
Some months ago my team lead mentioned trunk.ly to me. It's a social bookmarking service that aggregates links that the user has shared via various social media services. As I often share web content that I find interesting but rarely spend the time to peruse it in depth, I've found the cumulative archive of what I've been tweeting and publishing via Google+/Buzz and Facebook to be most useful. At the least, it spares me the effort of maintaining browser-specific bookmarks and trawling through my Facebook profile export or tweet history. LinkedIn shares are also supported, but due to the way I cross-publish, I haven't bothered to use that option. Furthermore, trunk.ly has a Top SEO Experts group, which I was able to join. Through it I can not only find the most up-to-date content that benefits me in my current role, but also see, via the number of shares, how popular or vetted the links have been. Now, if only I had the tim

Observations about Twitter hashtags

Image
I've been spending more time on Twitter lately, and wanted to note two things I've gleaned, rather unscientifically. First, about tweeting topics (or trending hashtags) and culture. In Japan, many trending topics and/or hashtags encourage sharing of personal information, and more so interaction between Twitter users. Some examples from the last few days are: "how I came to start tweeting" and "what age would you say I am?". What seems far less prevalent thematically in Japan, compared to the other places I've been watching (French, German, Irish, American), are people (celebrities, sports figures), TV shows, and states of mind. Since I don't keep abreast of most entertainment news, and think twice before presenting too many of my rants for public consumption, I'm finding it easier to incorporate the Japanese themes into my tweets than the Western ones. Second, a tale of two anniversaries and Twitter strategies: MIT15

iOS vs Android users - commentary on an infographic

Image
A former manager of mine shared this infographic on LinkedIn the other day, and I wanted to share some thoughts on the findings it presents. First, it mentions that Android users mainly fit the 18-34 age bracket, which seems to explain many of the other traits they're more likely to have. Specifically, the survey results reveal that Android users tend to have started using the internet around or after 2000, their incomes tend to be (significantly) lower than those of iOS users, they're not as well-traveled, and they tend to hold fewer educational qualifications. From the combination of these I suspect that a large portion of the sampled Android users simply haven't completed their undergraduate work yet. Next, let's look at the gender skew - more men than women typically use Android. There's a well-touted gender correlation with math and perhaps stereotypically, with meat-heavy food preferences. Although having said that, anecdotally within my local team, there a

Personal thoughts on Twitter and follower counts

Image
... iff (if and only if) you appreciate my blog!
Here I'd like to document various thoughts concerning my journey on Twitter, which I joined in 2009. At first, I wasn't convinced that I would enjoy using it. Already feeling overwhelmed by the Information Age, I also noticed a lot of highly public yet personal (read: inappropriate or irrelevant for mass consumption) tweets as well as quite a lot of rude behaviour (ad hominem attacks). At the time of joining I had no smartphone, and even now I have a severely minimalist data plan, so I don't tweet "on the go". Since I walk to work, checking the Twitter stream on my commute is also fairly hazardous (although having said that, when I had a painful bus commute I relied on audio casts and preferred musical recordings stored on my iPod, due to the ease with which I succumb to motion sickness). As of today, mostly due to the aforementioned circumstances, I still only have a handful of tweets. More depressingly, I

5 Writing tips, or a response to "8 Essential Tips to better Content Writing"

Image
Here's the original blog post upon which I'm commenting. (A disclaimer: I've had no interactions with this author save the message I left for him on his blog. I also have no metaphorical axe to grind nor malice with which I'm replying (the apt expression in Japanese would be that I'm not "selling him a fight"). It's simply that I wish to present my critique on the actual 8 listed tips. I certainly agree with his opening paragraph.) Now, my response proper: I believe his 8 tips could be condensed into 5. Moreover, in my world they would be re-ordered as the following:
Valuable
This merges his "valuable" and "solution" tips, and is related to "relevance" too, in terms of what the audience expects to find on the site, topic-wise.
Credible
Using vetted sources for information is an essential part of all academic writing; lend credence to one's own assertions whenever possible online, too. This also touches

Why most bloggers needn't worry about high bounce rates

Image
I was encouraged to read a couple of posts about bounce rates by a web analytics person. In them, he describes several contexts in which high bounce rates should not be construed as a negative reflection of the quality of the site or content. My own bounce rate is nearing 75% to date. In the web metrics world, a bounce is defined as occurring when "the visitor leaves a site without visiting any other pages [within the same domain] before a specified session-timeout occurs." In the aforementioned blog, the first entry talked about cases where the page's call to action takes the user to an external page or an advertisement link and, most relevant for blogs, where the page arrived at is a so-called "destination page". Since most blog designs that I've seen provide the most recent entry content for quick viewing on the root or landing page, for people whose blog posts are brief enough to be displayed in their entirety, returning readers only need
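To make the quoted definition concrete, here is a small Python sketch that groups page hits into sessions using a timeout and then computes the share of single-page sessions. The function names, the 30-minute default timeout and the `(visitor, timestamp, page)` hit format are assumptions chosen for illustration; real analytics packages differ in their details.

```python
from collections import defaultdict


def sessionize(hits, timeout=1800):
    """Group (visitor, timestamp_seconds, page) hits into per-visit page lists.

    A new session starts whenever a visitor has been idle for longer than
    `timeout` seconds (1800 is an assumed default, not a standard).
    """
    by_visitor = defaultdict(list)
    for visitor, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
        by_visitor[visitor].append((ts, page))

    sessions = []
    for views in by_visitor.values():
        current = [views[0][1]]
        for (prev_ts, _), (ts, page) in zip(views, views[1:]):
            if ts - prev_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def bounce_rate(sessions):
    """Fraction of sessions that consist of exactly one page view."""
    if not sessions:
        return 0.0
    return sum(len(pages) == 1 for pages in sessions) / len(sessions)
```

Note the simplification: a reader who lands on a blog's front page, reads the latest post in full and leaves satisfied counts as a bounce here, which is exactly why a high figure need not reflect badly on a blog.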

American iOS, Android and BlackBerry OS usage mapped

Image
Mashable had an interesting report to show us from July 2011: It occurred to me that the North-East, Mid-West and South/West looked vaguely reminiscent of American political party affiliation data by state, so I found this Wikipedia map of the gubernatorial election results data from 2010: Legend: Republican gains, Republican holds, Democratic gains, Democratic holds, Independent win, not contested. Well, a slight correlation can be discerned, anyway - California may be favouring Android overall, but Mashable's article did report that there are iOS-heavy cities (to be specific, they were reported as San Francisco, San Jose, Modesto, Oxnard, Santa Barbara, Chico, Santa Cruz, San Luis Obispo and Napa). Given my personal history in Massachusetts and the fact that most of my friends (and colleagues) are iPhone owners, one might think I'd also jumped onto the iOS bandwagon. Apple had also been the most popular smartphone manufacturer in Japan. However, wh

A hoax correlation study: IQ scores and browser choice (amended 8th August 2011)

Image
One of the news aggregators that I visit is Mashable, and recently they published the results of a false correlative study of browser use and IQ score, which supposedly used data from 100,000 users (and was run by a Canadian company). Here's the link to the Mashable article. Supposed correlation results from the published hoax: The fictitious study's conclusion was that “individuals on the lower side of the IQ scale tend to resist a change/upgrade of their browsers.” Since I blogged about it well before the false nature of the study was published, I've decided to keep an amended version up (thanks to Caesar for the comment). My own anecdotal impression had been that on corporate hardware, vestigial IE6 use was attributable to bigger bureaucratic organizations, which actually do exhibit tendencies to resist change. Another variable that has historically influenced rates of browser use, of course, is factory settings. Microsoft's IE certainly enjoyed year

Learning "Englise" - a fun Friday share

Image
I received the following album link from a friend: the photos consist of pages from a Korean-English phrasebook.
Commented samples from the publication "Living Englise Language Everyday"
Aside from the implicit perceptions of "common" phrases that the authors seem to expect to be spoken or heard in English, the most noticeable grammatical mistakes seemed to arise from the unpredictable use of "to be" in place of "to have". This was actually something I noticed when studying French and German, such as the "j'ai froid" "I am cold" "mir ist kalt" comparisons (and it's "j'ai faim" "I'm hungry" "ich habe Hunger"/"ich bin hungrig") - in Japanese at least, the subject is so often omitted that just saying "寒い" ("[I feel] cold") and "お腹がすいた" ("[My] stomach has become empty", to attempt a literal interpretation) suffices. This would ex

Thoughts on the information age: news aggregators

Image
Several of my fondest childhood memories stem from working in libraries. There, it was often my duty to take a crisp newspaper and clamp it to a wooden holder for broadsheets. My preferred paper was The Globe and Mail from quite early on; one of the alumni from my high school is a prominent columnist there. The advent of the internet in the early to mid-90s happened to coincide with a period during which I didn't subscribe to broadsheets and lived without TV (otherwise known as my time at university). To procrastinate from studies, I often read through some of the newsgroups, and played around with a personal set of HTML pages. Interestingly, I was still working in the libraries during this period, but had moved to cataloguing new arrivals of periodicals, and didn't touch newspapers except for the occasional copy of the university papers (The Tech and Tech Talk - I was saddened to learn the latter went out of print in 2009) or Bay Windows, made freely available to the communit