Friday, December 3, 2010

Quantitative CHUTZPAH

Patricia Cohen has a piece in today’s New York Times that definitely deserves attention;  but it is the sort of attention that one reserves for a double-edged sword.  The good news is that she gives a good account of both edges of the sword.  The bad news is that she leads with the blue-sky technocentric evangelism that I continue to hold responsible for much of the current mind rot of our national culture and its possible global implications:

Victorians were enamored of the new science of statistics, so it seems fitting that these pioneering data hounds are now the subject of an unusual experiment in statistical analysis. The titles of every British book published in English in and around the 19th century — 1,681,161, to be exact — are being electronically scoured for key words and phrases that might offer fresh insight into the minds of the Victorians.

This research, which has only recently become possible, thanks to a new generation of powerful digital tools and databases, represents one of the many ways that technology is transforming the study of literature, philosophy and other humanistic fields that haven’t necessarily embraced large-scale quantitative analysis.

Before delving into details, there is one important reflection on Cohen’s choice of a lead sentence, which can be found in the “Lies, damned lies, and statistics” entry in Wikipedia.  That entry makes it clear that there is no hard evidence of who first applied that phrase but that the most plausible evidence seems to point to the Liberal Victorian politician, Sir Charles Wentworth Dilke, whose 1996 biography by Roy Harris Jenkins was entitled Dilke:  A Victorian Tragedy.  There may thus be a bit of exaggeration in the use of that adjective “enamored!”  Neither of these two names appears in Cohen’s article, although one of the cheerleaders for this brave new statistical world is Alice Jenkins, Professor of Victorian Literature and Culture at the University of Glasgow, who may or may not be related to Dilke’s biographer (just as one of the leading researchers, Dan Cohen at George Mason University, may or may not be related to the Times writer).

From a technical point of view, Cohen (the historian, not the Times writer) and his colleagues see the Internet as a vast corpus of text sources that will open up a new generation of methods for lexical analysis.  This is certainly the case.  It is also the case that Peter Norvig, Director of Research at Google, has turned up some interesting results in how one may actually be able to derive semantic conclusions on the basis of relatively brute-force lexical analysis.  Nevertheless, the “bridge” that Norvig has investigated can only be crossed through well-developed methods involving how one chooses the data, how one selects the statistical tools, and how one interprets the results that those tools yield.  In other words, increasing the size of the data sample, even by several orders of magnitude, does not turn lexical analysis, in and of itself, into some kind of “miracle drug.”  For Cohen to claim that lexical methods now allow scholars to “finally and truly test” outstanding hypotheses is nothing more than chutzpah fueled by technocentric enthusiasm.

The Times story offers an amusing counterexample based on research by Meredith Martin, Assistant Professor of English at Princeton:

Ms. Martin at Princeton knows firsthand how electronic searches can unearth both obscure texts and dead ends. She has spent the last 10 years compiling a list of books, newspaper and journal articles about the technical aspects of poetry.

She recalled finding a sudden explosion of the words “syntax” and “prosody” in 1832, suggesting a spirited debate about poetic structure. But it turned out that Dr. Syntax and Prosody were the names of two racehorses.

The punch line to this and any number of other equally amusing anecdotes is that one needs to handle the results of context-free examination of words with the same caution one would apply to an unexploded bomb.
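The trap Martin fell into is easy to reproduce.  Here is a toy sketch (entirely hypothetical corpus, invented for illustration) of the kind of context-free keyword counting involved:  the counter has no notion of word sense, so mentions of two racehorses inflate the tallies just as much as genuine discussions of poetic structure would.

```python
from collections import Counter
import re

# Hypothetical snippets standing in for scanned nineteenth-century texts.
# Only the first sentence is actually about poetry; the other two are
# racing reports mentioning the horses "Dr. Syntax" and "Prosody".
corpus = [
    "The poet's syntax and prosody were much debated in the review.",
    "Dr. Syntax won by a length; Prosody finished a distant second.",
    "At Ascot, Prosody outran Dr. Syntax in the final furlong.",
]

def keyword_counts(texts, keywords):
    """Count case-insensitive occurrences of each keyword across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in keywords:
                counts[word] += 1
    return counts

counts = keyword_counts(corpus, {"syntax", "prosody"})
print(counts["syntax"], counts["prosody"])  # 3 3 -- two of each from the horses
```

A scholar looking only at these counts would see a “spirited debate about poetic structure” where there was mostly a day at the races.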

(There is a similar incident involving our own Department of Homeland Security scouring a massive collection of documents for names that would appear in close proximity in the text stream to that of Osama bin Laden, the assumption being that these would be the names of likely terrorists.  It turned out, however, that one of the names that appeared very frequently in this analysis was that of George W. Bush!  While this became a source for many jokes about Bush being a terrorist, the real punch line was that lexical analysis could not distinguish “talking with” from “talking about!”)
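The failure mode in that anecdote can be sketched in a few lines.  The following is a toy illustration (hypothetical sentence, deliberately simplified heuristic, not the actual method used) of proximity-based name flagging:  any capitalized word within a fixed window of the target name gets flagged, so a sentence *about* bin Laden implicates its subject just as readily as a conversation *with* him would.

```python
import re

def names_near(text, target="Laden", window=8):
    """Return capitalized tokens within `window` words of the target name."""
    tokens = re.findall(r"[A-Za-z]+", text)
    flagged = set()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), i + window + 1
        flagged.update(t for t in tokens[lo:hi]
                       if t[0].isupper() and t not in {"Osama", "bin", "Laden"})
    return flagged

sentence = ("President Bush spoke at length about "
            "Osama bin Laden during the briefing.")
print(names_near(sentence))  # {'President', 'Bush'} -- guilt by proximity
```

Nothing in the window-based count distinguishes the grammatical role of the flagged name; that distinction lives in syntax and semantics, precisely the layers that raw lexical proximity discards.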

From the standpoint of chutzpah, this is a clear case of making too much from not very much.  The fairest response would probably be for the two Cohens, the scholar and the reporter, to share the Chutzpah of the Week award.  The first deserves it for over-promotion;  and the second deserves it for keeping the “evidence to the contrary” in a “below the fold” position in the article, thus contributing to the inflated promotional value.
