Showing posts with the label Japanese

Year-end thoughts, 2011 edition

Over the lifetime I've spent living in various Western countries, I've noticed the predilection for media and individuals alike to focus on retrospection around this time of year: that is, reminiscing about the various events and experiences that one associates with the prior year. In direct contrast to this, it's my understanding that in Japan it is customary to hold a 忘年会 (bōnenkai, a "forget-the-year party") which, paired with the 新年会 (shinnenkai, a "New Year party", held after the 正月三が日, the first three days of January), encourages the forgetting of the prior year through much carousing and imbibing. This year was particularly unforgettable to those with ties to Japan, however, and I've seen social media statuses speaking to the importance of remembering the disasters that have befallen my cultural homeland. The fallout - both metaphorical and literal (environmental, economic, political, and emotional) - will be palpable for decades, if not centuries, regardless of any desire the world may have to forget. P

LanguageWare's robust, extensible Language ID (part 4)

Having introduced the "prior art" approaches to identifying textual language in part 1 and part 3 (to wit: stop word presence and n-gram detection), I can now speak to the patented idea which we implemented as part of LanguageWare, a set of Java libraries that offer NLP functionality. Simply put, our solution involves a dictionary that is highly compactible (I may ask a guest blogger from my former team to delve into this aspect), which made it possible to store the following information. Each entry consists of:
- Term or n-gram
- Language(s) with which it's associated
- Whether it can occur as a standalone term, at the beginning of a word, in the middle of a word, or at the end of a word, or some combination of these, and
- An integer weighting value (per term/language pairing)
Thus, for the Chinese Simplified/Traditional and Japanese disambiguation problem, the Japanese-specific kana (listed as unigrams) were given large positive va
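To make the entry structure described above more concrete, here is a rough Python sketch. It is not LanguageWare's actual (Java) implementation, nor its compact dictionary format: the names `Entry`, `score` and `identify`, the position flags, the language codes, the sample entries and every weight are assumptions invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical position flags: where inside a word an entry may match.
STANDALONE, START, MIDDLE, END = 1, 2, 4, 8
ANY = STANDALONE | START | MIDDLE | END


@dataclass
class Entry:
    text: str        # term or n-gram
    positions: int   # bitmask of allowed positions
    weights: dict    # language code -> integer weight (per term/language pair)


# Invented sample entries; the weights are illustrative only.
ENTRIES = [
    Entry("の", ANY, {"ja": 10}),                              # kana unigram
    Entry("は", ANY, {"ja": 10}),                              # kana unigram
    Entry("日", ANY, {"ja": 1, "zh-Hans": 1, "zh-Hant": 1}),   # shared character
    Entry("语", ANY, {"zh-Hans": 8}),                          # Simplified form
    Entry("the", STANDALONE, {"en": 5}),
    Entry("ing", END, {"en": 2}),
]


def position_of(token, start, length):
    """Classify a match as standalone, word-initial, word-medial or word-final."""
    end = start + length
    if start == 0 and end == len(token):
        return STANDALONE
    if start == 0:
        return START
    if end == len(token):
        return END
    return MIDDLE


def score(text, entries=ENTRIES):
    """Sum the weights of every entry that matches at an allowed position."""
    totals = defaultdict(int)
    for token in text.split():
        for entry in entries:
            start = token.find(entry.text)
            while start != -1:
                if position_of(token, start, len(entry.text)) & entry.positions:
                    for lang, weight in entry.weights.items():
                        totals[lang] += weight
                start = token.find(entry.text, start + 1)
    return dict(totals)


def identify(text, entries=ENTRIES):
    """Return the highest-scoring language code, or None if nothing matched."""
    totals = score(text, entries)
    return max(totals, key=totals.get) if totals else None
```

With these made-up entries, `identify("日本語の本")` favours `"ja"` because the kana の outweighs the character shared with Chinese, while `score("other")` stays empty because "the" is only allowed as a standalone term.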

Language ID Part 3 - more challenges

Stop word detection usually works...
In my prior post about this subject, hopefully the Jabberwocky poem examples demonstrated that when text contains words identifiable as a language's pronouns, conjunctions or adpositions, a language label can still be assigned to it. Such identifiers are, in this context, considered to be stop words. The presence of such terms was sufficient for us to recognize the language even when the nouns, verbs, adjectives and adverbs were unidentifiable (nonexistent in our vocabulary). However, it's useful to note that in those examples, there were some inflections that hinted at the nonsensical words having specific qualities. Specifically, in the inflected languages, nouns could be spotted when pluralization or possession was shown via -s/'s endings (English, though -s can indicate possession in German also) or a combination of title case capitalization and (when plu
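As a rough illustration of the stop word approach discussed in this excerpt, the Python sketch below counts how many tokens of a text appear in small per-language stop word sets and picks the language with the most hits. The word lists are tiny, hand-picked assumptions (a real system would use far larger ones), and `guess_language` is a hypothetical name, not part of any library.

```python
import re

# Tiny, hand-picked stop word sets; assumptions for illustration only.
STOP_WORDS = {
    "en": {"the", "and", "in", "of", "were", "he", "his"},
    "de": {"der", "die", "und", "das", "in", "ist"},
    "fr": {"le", "la", "et", "les", "des", "est"},
}


def guess_language(text):
    """Return the language whose stop words occur most often, or None."""
    # Lowercase, then keep runs of letters only (drops digits and punctuation).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(word in stop_words for word in words)
        for lang, stop_words in STOP_WORDS.items()
    }
    best = max(counts, key=counts.get)  # ties go to the first language listed
    return best if counts[best] > 0 else None
```

For the opening of "Jabberwocky" ("'Twas brillig, and the slithy toves Did gyre and gimble in the wabe"), the words "and", "the" and "in" are enough for this sketch to answer `"en"`, even though every content word is nonsense. Text with no stop words at all returns `None`.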

Language ID part 2: Callooh! Callay!

As mentioned in part 1, thinking about identifying language leads one to the fundamental question: "what defines a language?" The example that our team used was from the children's book by Lewis Carroll - a poem that the protagonist reads, and although not comprehending it, thinks it "pretty" and that it was "clear" that "somebody killed something". Here it is, courtesy of the Wikipedia (English) article:

"Jabberwocky"

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought--
So rested he by the Tumtum tree,
And stood awhile in thought.

And as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling

Sushi Preparation compared to Search Enablement

Image courtesy of Kojiro Fish Shop in Wieden, Vienna.
Being a fan of various cuisines, I count myself fortunate in having had the opportunity to grow up in Toronto (and having spent time in gastronomical meccas such as Tokyo and New York). As my parents kept my household quite Japanese, I grew up eating what most of my classmates considered to be exotic foods: umeboshi, chirashi zushi, korokke, grilled fish with daikon oroshi and such. Thus, when I was recently asked by a virtual friend - by which I mean someone whose acquaintance I made online, and have not yet spent time with in person, as opposed to an artificial being - to review her classmate's journey of learning to make sushi, I thought I might as well take the opportunity to talk about how my views on sushi preparation and enabling search optimization of online content actually have comparable points. Sound strange? Do read on... First, the sushi making (with the disclaimer that I am not a professional chef, nor would

Learning "Englise" - a fun Friday share

I received the following album link from a friend: the photos consist of pages from a Hangul-English phrasebook. (Image: commented samples from the publication "Living Englise Language Everyday".) Aside from the implicit perceptions of "common" phrases that the authors seem to expect to be spoken or heard in English, the most noticeable grammatical mistakes seemed to arise from the unpredictable use of "to be" in place of "to have". This was actually something I noticed when studying French and German, such as the "j'ai froid" / "I am cold" / "mir ist kalt" comparisons (and it's "j'ai faim" / "I'm hungry" / "ich habe Hunger" or "ich bin hungrig"). In Japanese at least, the subject is so often omitted that one just says "寒い" ("[I feel] cold") and "お腹がすいた" ("[My] stomach has become empty", to attempt a literal interpretation). This would ex

How to pronounce Japanese - a simplified primer on its phonemes

It's actually interesting to note that for much of the 20th century, many Japanese organizations such as schools and some workplaces did participate in radio calisthenics. Even my Saturday morning language schools had them at the start of each school session. In any case, a recent meeting with a UK entrepreneur reminded me of another benefit to living on the European continent: my first name is almost never mispronounced any longer. If you wish to impress your Japanese colleagues or clients, I strongly recommend that you remember the following simple rules: There are only 5 pure vowels in Japanese, which are close to the Italian vowels: from the way I pronounce (Canadian) English, the sounds are: A as in altruism, E as in elbow, I as in index, O as in olfactory and U as in ulna. The letter y can be seen preceding three of the vowels (ya や, yu ゆ, yo よ); it's always used as a consonant. Note: having received linking permission, here is an audio file from a friend

Having a "bad language day"

Since childhood, I've found that if I devote a certain amount of concentrated effort to thinking in one language, there is a transitional period during which trying to speak another language is frustratingly difficult. The worst experience I had of this was a few years ago. After a few weeks of only working, reading and dreaming in English, I bumped into a Japanese faculty member at DCU. I sincerely hope she doesn't remember the incident, as it was painfully humiliating for me: practically no Japanese issued from me at the time, but stubborn pride kept me from switching to English. The fact that it was a chance encounter definitely exacerbated the situation, but I was no stranger to this phenomenon. When I entered the Canadian school system, I'd had practically no prior exposure to English. This meant that for the first few years, I'd answer questions posed to me at school in Japanese, and at home it would take about an hour before I'd revert to Japanese with my parents. Saturd

Thoughts on cross-linking, back-linking

In the early days of the world wide web, most links to external sites were, in my opinion, "legitimate" rather than contrived. My first site dated back to 1994, and consisted of a landing page along with some samples of my academic writing. Back then, besides having no Wikipedia (but a plethora of Usenet newsgroups to refer to), I was mainly able to browse and select what I considered to be quality sites to which to link, and I gave no thought to soliciting inbound links from those destinations. Something that I recall about Japanese sites before the turn of the millennium is that the cultural concept of "giri" (social obligation) was commonly applied to making links mutual, and more interestingly, that authors of content gave explicit permission to have their content linked to by strangers, with the proper etiquette being that when one created an external link, the owner(s) of the destination page would be notified. Now, most SEO blogs and resources speak of the painstaking ro

Link bait thoughts - Infographics

I have a love-hate relationship with infographics. For those who haven't seen many examples, they're concise sources of information presented with plenty of visual aids. Here's a source of a self-referential infographic, followed by 49 great examples. I love them, because my first language uses a logographic script, kanji, and as I grew up with manga, I'd always known that practically any subject, ranging from history to arithmetic and even abstract concepts such as those covered in philosophy, could be learned via a mix of graphics and text. As an aside, when I mention manga to non-otaku, invariably I receive two questions: "Aren't comics for kids?" (Answer: not in Japan - there, manga exists for every age and demographic.) And, "What subjects do non-kid manga cover then?" (Answer: what do you think "novels" cover?) However, the concise presentation of information found in many cases also reminds me of "executive summarie

Balancing diction (quality) and comprehensibility (effectiveness)

Something which by now may have become apparent to my colleagues and friends alike is that in personal writing I gravitate toward long and complicated sentences. The formality of my writing has also been remarked upon by more than one friend. On the other hand, I also try to optimize diction: that is, I have an old habit of attempting to use whatever word I believe is most appropriate, regardless of how rarely one might hear it. In my Japanese language post from May, I had mentioned that I experience a constant struggle to maintain linguistic competence. In fact, it seems self-evident that disuse leads to atrophy in many situations, be they physical (musculature), neural (pathways to access memories) or otherwise. With tweeting, the stringent limit on message length means I struggle with the inevitable prevalence of abbreviations and (in that case, &) initialisms - and rarely, acronyms - far more in English than I do in Japanese. However, in the latter tongue I clearly need

"You're almost no different from a Japanese person" (「ほとんど日本人と異ならないですね」)

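Those were the words said to me a few years ago, shortly after I joined an NLP research team. It's true that, thanks to having indicated on my résumé that I can communicate in Japanese, I have been given job opportunities many times over. That "almost" is a delicate matter, but I am painfully aware that my command of the language is indeed far from perfect. I was born and raised in Canada, so my time spent in Japan amounts to a mere few weeks (incidentally, the one stay that lasted more than two weeks was just before I turned three). I attended a Saturday morning Japanese language school for about ten years and acquired Japanese at roughly a junior high school level at best, but I owe where I am today mainly to continually reading large amounts of fiction (manga included), corresponding with relatives in Japan, and speaking only Japanese with my parents. In any case, since "if you don't use it, you lose it" ("use it or lose it", as English has it), I intend to tweet and blog in Japanese regularly, so please bear with me. Please also feel free to leave comments in Japanese at any time.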