Google’s Ngram Viewer

Jenny Diski

Some years ago Stephen King announced that he would put his new book online before publication, for anyone to read freely. His publishers were spitting dollar signs and the fans delighted. In my memory he did as he said, and put the entire book on his website, but the 100,000-or-so words of the manuscript, though all there, were in alphabetical order. If you wanted to read the book from beginning to end, in grammatical sentences and plot-wise, you had to pay your money. If it was meaning you were after, it came at a price. Stephen King is far more generous than my psyche, and the second part of the story turns out to be entirely untrue.

Now Google Labs has done something similar, only on a google – if not a googol – sized (1.0 x 10¹⁰⁰) scale. Google has made freely available a 500-billion-word searchable database from the books in English, French, Spanish, German, Chinese, Hebrew and Russian it has already scanned and digitised. (In fact, two trillion words have been scanned so far, which represents 11 per cent of the books published between 1500 and 2008, and the 500 billion words they’ve put online come from just 4 per cent of all the books published in that period.)

A paper, acclaimed for its wit and elegance according to the New York Times, was published in December on the Science website offering the joys of quantitative analysis to literature researchers and cultural historians everywhere. And so that everyone can play, the nice people at Google, who worked with the two main authors of the Science paper, have developed what they call an Ngram Viewer (ngrams.googlelabs.com). You put up to five words, separated by commas, into a search box, specify the dates, and up comes a multicoloured graph of the relative frequency of those words over time.
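
For anyone curious about what sits behind such a graph, here is a minimal sketch of the arithmetic, and certainly not Google's own code: for each year, count how often a word appears and divide by the total number of words from that year. The function name `relative_frequency` and the two-line toy corpus are invented for illustration; the real Viewer draws on counts from millions of scanned books.

```python
from collections import Counter

def relative_frequency(words_by_year, targets):
    """Return {target: {year: share of that year's words}}.

    words_by_year maps a year to a list of word tokens (a hypothetical,
    already-tokenised corpus). Matching is case-insensitive.
    """
    result = {t: {} for t in targets}
    for year, words in sorted(words_by_year.items()):
        counts = Counter(w.lower() for w in words)
        total = sum(counts.values())
        for t in targets:
            # Counter returns 0 for words it has never seen.
            result[t][year] = counts[t.lower()] / total if total else 0.0
    return result

# Toy corpus, made up for this example only.
corpus = {
    1880: "today we sail and tomorrow we rest".split(),
    1900: "today today the news is out today".split(),
}
freq = relative_frequency(corpus, ["today", "tomorrow"])
for word, by_year in freq.items():
    print(word, by_year)
```

Dividing by the yearly total is what makes the lines comparable across centuries: far more was printed in 1900 than in 1700, so raw counts alone would make almost every word look as if it were on the rise.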

What with the usual monstrous holiday hiatus, it seems from twitterings and Facebook statuses that people everywhere (well not quite everywhere) have gone into corners with their new pads, pods, Androids and Airs, and taken a well deserved break from food and family fun. A brand new word game has evolved: what words occur above frequency w in year t? You pick w and t. What goes up and what goes down? What to make of the fact that the words today and tomorrow occurred with almost exactly the same frequency between 1700 and 1880, slinking along the bottom of the graph together between 0 and 0.001 per cent, when suddenly today makes a great leap up to the very top, while tomorrow, though becoming more frequent from 1880, never rises above 0.002 per cent? Is it, as someone has suggested, the new daily newspapers causing a pressing immediacy, or a new present-centredness as the world prepares itself for modernity? On the other hand, except in 1625 and 1675, the word you has always been more frequent than me; and, against all expectation, from the mid-20th century on, you pulls even further away from me – a bit of a disappointment for those who describe the young of the period as the Me Generation. But what was going on in the first and last quarter of the 17th century to cause those two noticeable blips of self-regard? Melancholy is virtually non-existent before 1570, but begins to rise and then falls until it drops off completely around 1625, about the time of the death of Dowland. It builds again to a great surge in 1650 (when, it says in Wikipedia, ‘the Age of Discovery ends’: reason enough), falls and then picks up, growing nicely and rising with the Romantics in 1800, and then declines gently before starting to increase again after 2000. Sting recorded a very terrible version of Dowland’s songs in 2006.

Fuck is quite absent from books until about 1590 when it jolts up the chart for about eight years and then plummets, before returning in the 1630s, holding its own quite robustly until, of course, it disappears completely between 1820 and the mid to late 1950s when it surges once more (Look Back in Anger, Saturday Night and Sunday Morning, the Beat Poets) and remains ever on the up after that. Not as much as shit, however, which overtook fuck in the 1950s and has remained in the ascendant. Cunt is something of a rarity, hardly visible apart from a small hump around 1700, but then it starts to perk up and continues to rise until the latest available date. I imagine it will have made something of a spurt in 2010.

These are all very raw data. There is no way of telling how words were used, in what context or in what form (apart from a category called ‘English fiction’), or the way in which the meaning of words might have changed over time. Sensible peaks early in the 19th century, and appears lately to be on the rise again, but with a different meaning, surely. The blue line of melancholy intersects with the red line of depression in 1910, and depression (clinical or economic, who knows?) begins to climb towards the top of the graph while melancholy falls, so that their positions in 1840 and 2000 are almost exactly reversed as medical terms, economic conditions and common sensibilities alter.

Steven Pinker, who had a hand in the Science paper, is delighted about it all. ‘There is so much ignorance. We’ve had to speculate what might have happened to the language.’ The Google database is certainly a tool, and will, I imagine, become a better one. It might eventually help to answer what (happened to the language) and when (it happened), but how and why won’t be quite so amenable to statistical analysis. It’s always been the case, apparently. Put the four questions into the Ngram Viewer for the dates 1800 to 2008 and you get four differently coloured parallel lines more or less unchanging with the when at the top, and the what coming next, far above how and the very lowly why. Take the graph back to 1500, and the questions remain pretty much in the same order, but are much more intricately and closely linked, until they unravel in the mid-18th century. Louis Menand points out in the New York Times that among the 13 contributors to the Science paper there isn’t anyone from the humanities, not even a historian of the book. If you feel a little alarmed at the term ‘Culturomics’ which the paper’s authors have given to their analysis, at least we have an alternative to Sudoku, and a new use for the crucible: put in words taken from literature and watch them transmute into numbers and dates instead. Oh, you want meaning? Well, that might have to come at the price of studying, reading and speculation.

Letters

Vol. 33 No. 5 · 3 March 2011

Jenny Diski is mistaken in implying in her piece on Google’s Ngram Viewer that there was a golden age of swearing (LRB, 20 January). The apparent prevalence of the word fuck in the period before 1820, and its complete disappearance for more than a century thereafter, can be explained by the end of the use in printing of the ‘long s’, which modern optical character recognition sees as an ‘f’. All the apparent ‘fucking’ before then is actually just ‘sucking’. Diski is also mistaken in saying that there is no way of telling how the words were used. All the scanned, digitised books are fully searchable by date range: a single click on the ‘fuck’ search page would have taken her to several examples that would have made her realise her initial error. Needless to say, there are hours of adolescent fun to be had with this.

Henry Phillips
Richmond
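
The effect described in this letter is easy to imitate. The sketch below is a toy simulation under an assumed and much simplified model of OCR, in which every long s (ſ, U+017F) is read as an f. The function name `naive_ocr` and the sample phrase are made up for illustration, and no real OCR engine is involved.

```python
LONG_S = "\u017f"  # the archaic long s, printed as 'ſ'

def naive_ocr(printed_text):
    """Simulate a scan that mistakes every long s for an f (an assumption)."""
    return printed_text.replace(LONG_S, "f")

# Hypothetical phrase as an 18th-century printer might have set it.
printed = "a ſenſible perſon"
scanned = naive_ocr(printed)

print(scanned)
# A word search over the scanned text no longer finds the real word...
print("sensible" in scanned)
# ...but does find a spelling that was never actually printed.
print("fenfible" in scanned)
```

On this model, any word count taken from the scanned text undercounts words printed with a long s and credits the difference to look-alike spellings with f, which is the kind of artefact the letter blames for the pre-1820 figures.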

Vol. 33 No. 4 · 17 February 2011

Jenny Diski uses Google’s 500-billion-word database to compare the frequency of ‘you’ and ‘me’ from the mid-20th century on, and she finds it ‘against all expectation’ that ‘you’ outnumbers ‘me’ in that period by an even larger margin than in previous centuries (LRB, 20 January). I would suggest that since ‘you’ is both nominative and accusative, the more accurate comparison is between ‘you’ and ‘me’ plus ‘I’. Come to think of it, since ‘you’ is both singular and plural, the proper comparison is between, on the one hand, ‘you’, and, on the other hand, the total of ‘I’, ‘me’, ‘we’ and ‘us’. That comparison might prove the existence of the ‘Me Generation’. Or is the whole exercise just silly?

Malcolm Mitchell
New York

Jenny Diski misremembers Stephen King publishing one of his novels online with all the words in alphabetical order. Maybe she was confusing King with Douglas Adams and Terry Jones, who published the latter’s novelisation of the computer game Starship Titanic in this fashion in 1997. ‘Douglas, being enamoured of the internet,’ Yoz Grahame, who worked on the game, recalled, ‘wanted to put the whole text of the novel online, and was disappointed when the publishers nixed that idea. However, we still found a way to do it.’

Phil Gyford
London EC2
