|Disambiguating "Taj Mahal" - structure or music band? Courtesy of Google's own blog|
The Mashable article provides a basic overview of what this news means, and as I read it, my thoughts inevitably turned to my former job with LanguageWare (which I partially described last year in four non-contiguous blog posts related to Language Identification).
When one is first exposed to linguistic data that has been amassed for the purpose of spell-checking, it quickly becomes clear that in order to use these same word lists to effect grammatical checks, and even orthographical ones (e.g. whether a proper noun needs to be title-cased even when it doesn't commence a sentence), the part of speech is important.
The aforementioned Mashable article cites "kings" as an example, where the likely senses are all nouns. Quite a few words are even more difficult to process in this way. Take "bank": it is not just a noun (a repository for financial, genetic, food, blood or other items such as paper, data or memory; geographical and geological senses also exist), but can also be part of a noun phrase ("bank shot" in sports) or a verb (to bank something). Its plural form, "banks", could refer to a surname or a place name, as well as to the verbal inflection.
Granted, most search engine users appear to have been conditioned to minimize stop words and focus on noun phrases, but as my example shows, the shorter queries (one- or two-word terms) need disambiguating more often than not, even when it's a foregone conclusion that the concept being sought is a noun.
To get a sense of how often such disambiguation is necessary, I thought it might be interesting to find out how many pages exist in Wikipedia for this purpose. In its English set of pages, this search yielded 35,452 hits. Given the existence of 3,951,340 total pages (for English only, as of May 16 2012), disambiguation pages constitute roughly 0.9% of the total. The Japanese, French and German language pages are structured differently, in that disambiguating entries are not identified as such in the title (e.g. the pages for Hase in German and in French).
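For anyone who wants to check the arithmetic, here is a minimal sketch using only the two counts quoted above. The variable names are mine, chosen for illustration.

```python
# Counts quoted above for English Wikipedia (as of May 16 2012).
disambiguation_pages = 35_452
total_pages = 3_951_340

# Share of all pages that are disambiguation pages.
share = disambiguation_pages / total_pages
print(f"{share:.1%}")  # 0.9%

# The same figure expressed as "one page in every N".
one_in_n = round(total_pages / disambiguation_pages)
print(f"about 1 in {one_in_n} pages")  # about 1 in 111 pages
```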
In order for pages to be correctly indexed by any search engine as belonging to a specific topic, then, the co-occurrence of terms that semantically reinforce the primary keyword becomes crucial. One genre of writing where topic determination may be more challenging for search engine indexing is scientific journalism (as found in non-refereed publications such as newspapers and non-specialized magazines). Anecdotally, I've noticed that even when the subject matter concerns pure science, the authors may attempt to make the content more accessible or relatable to a lay audience. This in turn means that there can be mentions of popular culture or seemingly less related subjects, often placed prominently (early on in the article) as a means of capturing the readers' attention and providing analogies.
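To make the co-occurrence idea concrete, here is a toy sketch that picks a sense of "bank" by counting how many cue words for each sense appear in a piece of text. The cue lists, the sense labels and the function names are all hypothetical, invented for illustration; this is not a description of how Google or LanguageWare actually works, and a real system would need far richer data.

```python
import re
from collections import Counter

# Hypothetical cue words for a few senses of "bank" (illustrative only).
SENSE_CUES = {
    "bank (finance)": {"money", "loan", "account", "deposit", "interest"},
    "bank (river)": {"river", "water", "shore", "erosion", "flood"},
    "bank (blood)": {"blood", "donor", "transfusion", "plasma"},
}

def score_senses(text, sense_cues=SENSE_CUES):
    """Count, for each sense, how often its cue words occur in the text."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        sense: sum(tokens[cue] for cue in cues)
        for sense, cues in sense_cues.items()
    }

def best_sense(text, sense_cues=SENSE_CUES):
    """Return the highest-scoring sense, or None if no cue word occurs."""
    scores = score_senses(text, sense_cues)
    sense, score = max(scores.items(), key=lambda item: item[1])
    return sense if score > 0 else None

river_text = "The river rose until water covered the bank and flood erosion began."
print(best_sense(river_text))            # bank (river)
print(best_sense("I went to the bank"))  # None
```

The second call shows why short queries are hard: with no reinforcing terms nearby, there is nothing to score. It also hints at the science journalism problem, since an opening analogy drawn from an unrelated field would add cue words for the wrong sense.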
I may return with further thoughts on semantic web-enabled search; certainly I look forward to experimenting with Google's US roll-out.