Some years ago Stephen King announced that he would put his new book online before publication, for anyone to read freely. His publishers were spitting dollar signs and the fans delighted. In my memory he did as he said, and put the entire book on his website, but the 100,000-or-so words of the manuscript, though all there, were in alphabetical order. If you wanted to read the book from beginning to end, in grammatical sentences and plot-wise, you had to pay your money. If it was meaning you were after, it came at a price. Stephen King is far more generous than my psyche, and the second part of the story turns out to be entirely untrue.
Now Google Labs has done something similar, only on a google – if not a googol – sized (1.0 × 10^100) scale. Google has made freely available a 500-billion-word searchable database from the books in English, French, Spanish, German, Chinese, Hebrew and Russian it has already scanned and digitised. (In fact, two trillion words have been scanned so far, which represents 11 per cent of the books published between 1500 and 2008, and the 500 billion words they’ve put online are just 4 per cent of the total.)
A paper, acclaimed for its wit and elegance according to the New York Times, was published in December on the Science website offering the joys of quantitative analysis to literature researchers and cultural historians everywhere. And so that everyone can play, the nice people at Google, who worked with the two main authors of the Science paper, have developed what they call an Ngram Viewer (ngrams.googlelabs.com). You put up to five words, separated by commas, into a search box, specify the dates, and up comes a multicoloured graph of the relative frequency of those words over time.
What with the usual monstrous holiday hiatus, it seems from twitterings and Facebook statuses that people everywhere (well not quite everywhere) have gone into corners with their new pads, pods, Androids and Airs, and taken a well-deserved break from food and family fun. A brand new word game has evolved: what words occur above frequency w in year t? You pick w and t. What goes up and what goes down? What to make of the fact that the words today and tomorrow occurred with almost exactly the same frequency between 1700 and 1880, slinking along the bottom of the graph together between 0 and 0.001 per cent, when suddenly today makes a great leap up to the very top, while tomorrow, though becoming more frequent from 1880, never rises above 0.002 per cent? Is it, as someone has suggested, the new daily newspapers causing a pressing immediacy, or a new present-centredness as the world prepares itself for modernity? On the other hand, except in 1625 and 1675, the word you has always been more frequent than me; and, against all expectation, from the mid-20th century on, you pulls even further away from me – a bit of a disappointment for those who describe the young of the period as the Me Generation. But what was going on in the first and last quarter of the 17th century to cause those two noticeable blips of self-regard? Melancholy is virtually non-existent before 1570, but begins to rise and then falls until it drops off completely around 1625, about the time of the death of Dowland. It builds again to a great surge in 1650 (when, it says in Wikipedia, ‘the Age of Discovery ends’: reason enough), falls and then picks up, growing nicely and rising with the Romantics in 1800, and then declines gently before starting to increase again after 2000. Sting recorded a very terrible version of Dowland’s songs in 2006.
Fuck is quite absent from books until about 1590 when it jolts up the chart for about eight years and then plummets, before returning in the 1630s, holding its own quite robustly until, of course, it disappears completely between 1820 and the mid to late 1950s when it surges once more (Look Back in Anger, Saturday Night and Sunday Morning, the Beat Poets) and remains ever on the up after that. Not as much as shit, however, which overtook fuck in the 1950s and has remained in the ascendant. Cunt is something of a rarity, hardly visible apart from a small hump around 1700, but then it starts to perk up and continues to rise until the latest available date. I imagine it will have made something of a spurt in 2010.
These are all very raw data. There is no way of telling how words were used, in what context or in what form (apart from a category called ‘English fiction’), or the way in which the meaning of words might have changed over time. Sensible peaks early in the 19th century, and appears lately to be on the rise again, but with a different meaning, surely. The blue line of melancholy intersects with the red line of depression in 1910, and depression (clinical or economic, who knows?) begins to climb towards the top of the graph while melancholy falls, so that their positions in 1840 and 2000 are almost exactly reversed as medical terms, economic conditions and common sensibilities alter.
Steven Pinker, who had a hand in the Science paper, is delighted about it all. ‘There is so much ignorance. We’ve had to speculate what might have happened to the language.’ The Google database is certainly a tool, and will, I imagine, become a better one. It might eventually help to answer what (happened to the language) and when (it happened), but how and why won’t be quite so amenable to statistical analysis. It’s always been the case, apparently. Put the four questions into the Ngram Viewer for the dates 1800 to 2008 and you get four differently coloured parallel lines more or less unchanging with the when at the top, and the what coming next, far above how and the very lowly why. Take the graph back to 1500, and the questions remain pretty much in the same order, but are much more intricately and closely linked, until they unravel in the mid-18th century. Louis Menand points out in the New York Times that among the 13 contributors to the Science paper there isn’t anyone from the humanities, not even a historian of the book. If you feel a little alarmed at the term ‘Culturomics’ which the paper’s authors have given to their analysis, at least we have an alternative to Sudoku, and a new use for the crucible: put in words taken from literature and watch them transmute into numbers and dates instead. Oh, you want meaning? Well, that might have to come at the price of studying, reading and speculation.