Monday, May 12, 2008

Playing with Powerset

If we really want to be informed by the resources available through the Internet, beyond the basic need for getting straightforward answers to straightforward questions, to what extent will we be able to benefit from a "semantic edge?" Put another way, can semantic analysis assist us in finding information resources that we might not find (or find only with considerable time and effort) by figuring out the right keywords to use in poking Google? Since this is not a straightforward question, we are unlikely to find a satisfactory answer in the near future; but, thanks to a Reuters report by Eric Auchard, we now have the opportunity to get a feel for what such a semantic edge might provide. This feel comes from the technology of Powerset, a start-up, which, according to Greg Sterling of Sterling Market Intelligence, "could become the basis of a Google-killer."

Auchard reported that Powerset has taken its first step towards challenging Google over the content of the entire World Wide Web with a semantic analysis of the contents of Wikipedia:

Powerset on Sunday unveiled tools for searching Wikipedia that use conversational phrasing instead of keywords, marking the first step of its challenge to established Web search services such as Google.

Powerset's technology breaks down the meaning of words and sentences into related concepts, freeing users from always needing to type the exact words they want to find.

The closely watched Silicon Valley start-up is offering a way of searching millions of entries in Wikipedia's online encyclopedia, helping users find detailed answers to questions rather than isolated links that require further research.

For example, a user who wants to know how many wives King Henry VIII had (six, or two, depending on your definition of marriage) can find an answer via Powerset's service at tinyurl.com/5qpcr9/.

That hyperlink is "live;" and it allows you to see for yourself the results of the Henry VIII question. Fortunately, it also allows you to try out and explore the tool with questions of your own.

Back in 2004, when Tom Malone was flogging his Future of Work book, he was arguing that Wikipedia was a significant indicator of the future of work. I had at least one colleague who basically believed that Wikipedia content was only worth consulting if Google gave it a high enough page rank, but I decided to see how much satisfaction I could get out of playing with Wikipedia. Since I was deep into reading Marcel Proust at the time, I decided to ask it about "Combray" and was pleasantly surprised to read an account of how the city of Illiers, where Proust had spent much of his youth, had changed its name to Illiers-Combray in honor of the author. This gave me the confidence to try "Balbec;" but that one came up dry. Nevertheless, I am happy to report that I would have better luck today. There is no entry for Balbec; but there are pointers to three other entries, one of which is for Cabourg, which was the model for Balbec, as Illiers was for Combray.

This time I began in a similar spirit of playfulness, which led to my trying to home in on questions that would get me "beyond keywords." Here is the series of questions I put to Powerset:

  1. Where is the Floss?
  2. Where is St. Oggs?
  3. How many operas did Cavalli write?
  4. When did Brahms write his first string quartet?
  5. Who were the Smithfield martyrs?
  6. Why is there no general algebraic solution for cubic equations?
  7. Who tried to analyze the music of Bach in terms of religious interpretation?

The first two questions were motivated by the fact that this time I am deep into George Eliot's The Mill on the Floss; and, for both of these questions, I was duly directed to the Wikipedia entry for the novel. (However, it appears that Powerset does not know about capitalization, since the highest-ranking articles for the first question had to do with dental floss!) What was not resolved, however, was whether or not the Floss and St. Oggs were as fictitious as Combray and Balbec. On the other hand, I realized that Wikipedia may not have had the resources to answer that question; and I could not fault Powerset for not answering a question that Wikipedia could not resolve.

This led to the Cavalli question, since Wikipedia had served me so well when I recently wrote about him. Powerset easily homed in on the sentence from Wikipedia that provided a direct answer to my question; so, on the basis of yesterday's post (which was not served by Wikipedia while I was writing it), I formulated the Brahms question and got an equally direct answer. The Smithfield question goes back to George Eliot, who uses it for one of her best throw-away jokes. I was talking with my wife about it, and we both realized that neither of us knew when heretics had been burned in Smithfield. We had both assumed that these martyrs were Catholic victims of either Henry VIII or Oliver Cromwell; so I appreciated being informed that they were actually Protestants, executed during the reign of Queen (Bloody) Mary I, and are properly known as the Marian martyrs.

At this point I realized that I had been asking questions that could have just as easily been answered by feeding judiciously formulated keywords to Google. So I tried harder to come up with a question that could not be so easily addressed. I did not have a particularly good answer to my sixth question in mind, except for the faint memory that it had something to do with Galois Theory. Powerset gave me all sorts of answers about the history of algebra and geometric solutions to cubic equations without any mention of Galois. This led me to probe Wikipedia directly (including the entry for Galois); and I discovered that the information was not in the sorts of places I would have expected it to be.

The final question brought me back to yesterday's post and my acknowledgement of Albert Schweitzer. I realized that I did not want to use his name in the question, since that would lead to searching his Wikipedia entry. Rather, I wanted a question for which his name was the answer. Furthermore, having reviewed the Wikipedia entry for Schweitzer, I knew that the answer was in there. Unfortunately, he was not in the first ten (out of 624) hits returned by Powerset. In fact, the answer was in the twentieth hit, which is pretty far down the list for a relatively straightforward answer, particularly in light of the number of irrelevant hits that pushed it to that level.

On the basis of this unsystematic experience, I would say that the current state of the art of the "semantic edge" is far from a Google-killer. While it will probably get better, there are still some questions as to how valuable it will prove to be and under what circumstances. To a great extent I would say that my experiments were not particularly realistic, which also constitutes a criticism of the interface through which those experiments were performed. As was the case with Smithfield, questions do not pop out of the blue; rather, they arise in conversations, which means that the best understanding of what is being asked almost always requires accounting for the context of the conversation. Put another way, it is probably unrealistic to try to "kill Google" through the same interface that has served Google so well.

This thinking, however, may lead Powerset in a direction it is not prepared to go. It would be a departure from answering a question in isolation to inferring a context from an ongoing discourse and providing (if not volunteering) information relevant to that context. This goes beyond semantic theory to many of the social aspects of a "theory of communicative action," such as that developed by Jürgen Habermas. Such an endeavor, unfortunately, would require long-range planning far beyond the scope of any research funding budget, let alone the timetables of venture capitalists!
