
Me call Ishmael


The four most ‘informative’ words in Moby-Dick, statistically speaking, are ‘I’, ‘whale’, ‘you’ and ‘Ahab’. Marcello Montemurro and Damian Zanette worked this out by comparing the text of Moby-Dick to all the possible alternatives obtainable by shuffling Melville’s words into random sequences. These are not the four words that are used most often, or that carry the most ‘information’ in the everyday sense of the term, but the words whose positioning in the original, meaningful text differs most from the way they would be scattered in all other permutations. The ‘information’ here is of the mathematical, measurable kind: ‘most informative’ means ‘least randomly distributed’. It may seem a slightly odd way to try to quantify semantic content, as though when Melville wrote Moby-Dick, it wasn’t so much a matter of finding the right words, as of putting them down in the right order.

As it happens, the real point of the research isn’t to compare the contribution of individual words to the meaning of the whole, but to see how this measure of meaning behaves in different texts. In On the Origin of Species, for example, there are 69 instances of the word ‘instinct’, but more than 50 of them are clustered in Chapter 7 (which, by no coincidence, has the title ‘Instinct’). In any random rearrangement of the entire book, mentions of ‘instinct’ would be pretty evenly scattered. So for this book and that word, comparing the distribution across chapters captures the difference – or a difference, anyway – between authorial intention and meaninglessness.
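The ‘instinct’ comparison can be tried on a toy scale. What follows is a minimal sketch, not Montemurro and Zanette’s actual procedure: it splits a text into equal parts, measures how evenly one word is spread across them (Shannon entropy, in bits), and compares that with the average over random shuffles of the same words. The function names (`chunk_counts`, `entropy`, `clustering_score`) and the toy ‘book’ are assumptions made up for illustration.

```python
import math
import random


def chunk_counts(words, target, n_chunks):
    """Count occurrences of `target` in each of n_chunks roughly equal parts."""
    size = math.ceil(len(words) / n_chunks)
    counts = [0] * n_chunks
    for i, word in enumerate(words):
        if word == target:
            counts[i // size] += 1
    return counts


def entropy(counts):
    """Shannon entropy (bits) of a word's distribution over the parts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def clustering_score(words, target, n_chunks, n_shuffles=50, seed=0):
    """Average entropy over shuffled texts minus entropy in the original.

    A large positive score means the word is far more clustered in the
    original than chance would make it.
    """
    rng = random.Random(seed)
    original = entropy(chunk_counts(words, target, n_chunks))
    shuffled = list(words)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += entropy(chunk_counts(shuffled, target, n_chunks))
    return total / n_shuffles - original


# Toy "book" (hypothetical data): 1000 words in 10 parts of 100.
toy = ["filler"] * 1000
for i in range(600, 620):      # 20 mentions packed into the seventh part
    toy[i] = "instinct"
for i in range(25, 1000, 50):  # 20 mentions spread two per part
    toy[i] = "species"

print(clustering_score(toy, "instinct", n_chunks=10))
print(clustering_score(toy, "species", n_chunks=10))
```

The clustered word should score well above zero, while the evenly spread one should come out at or just below zero, since no shuffle can spread it more evenly than it already is.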

But does the way books can be broken down into sections according to word distribution always coincide with the way they’ve been divided into chapters? Montemurro and Zanette analysed all 5258 books available in electronic form from Project Gutenberg, and found that each book has a different ‘optimum scale’. Moby-Dick, which runs to 218,284 words, seems to be divisible into discrete, coherent chunks of around 1200 words (you might be able to guess that from looking at the contents page, but Montemurro and Zanette have now proved it).

The semantic structure of On the Origin of Species – 155,800 words in length – is, however, revealed in sections of around 3000 words. Quotes and Images From The Novels of Georg Ebers, which Montemurro and Zanette describe as a short book of quotations with ‘no thematic unity building up along the text’, conveys its meaning in sections of 50 to 70 words. A typical book of around 100,000 words apparently has a most informative scale of between 300 and 3000 words.
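One way to picture the search for an ‘optimum scale’ is sketched here, on the assumption that it means the section length at which the original word order tells you most, over and above a shuffled one. The sketch uses the mutual information between a word and the section it falls in, subtracts the average for shuffled versions, and picks the best of a few candidate lengths. It is a simplified stand-in, not the authors’ published method, and the names `word_part_information`, `excess_information` and `best_scale`, like the toy text, are invented for the example.

```python
import math
import random
from collections import Counter


def word_part_information(words, part_size):
    """Mutual information (bits) between word identity and part index."""
    n = len(words)
    joint = Counter((w, i // part_size) for i, w in enumerate(words))
    word_totals = Counter(words)
    part_totals = Counter(i // part_size for i in range(n))
    return sum(
        (c / n) * math.log2(c * n / (word_totals[w] * part_totals[p]))
        for (w, p), c in joint.items()
    )


def excess_information(words, part_size, n_shuffles=20, seed=0):
    """Information in the original order minus the shuffled average."""
    rng = random.Random(seed)
    shuffled = list(words)
    baseline = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        baseline += word_part_information(shuffled, part_size)
    return word_part_information(words, part_size) - baseline / n_shuffles


def best_scale(words, candidate_sizes, **kwargs):
    """Candidate part size with the largest excess information."""
    return max(
        candidate_sizes,
        key=lambda size: excess_information(words, size, **kwargs),
    )


# Toy text (hypothetical): eight "topics", each a run of 50 identical words.
toy_text = [f"topic{k}" for k in range(8) for _ in range(50)]
print(best_scale(toy_text, [5, 50, 200]))
```

On this toy text the 50-word scale should win: parts of 5 words are so small that even a shuffled text looks structured by accident, while parts of 200 blur four topics together.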

Which leaves you, or at least it leaves me, wanting to know which works have the longest maximal informative span. The top three are: Decline and Fall of the Roman Empire Vol. 3 by Edward Gibbon, A History of Rome Vol. 1 by Abel Greenidge, and Civilisation of the Renaissance in Italy by Jacob Burckhardt, with spans of between 4000 and 6000 words. Not exactly Twitter.

Comments on “Me call Ishmael”

  1. Camus123 says:

    What I’d like them to tell me is whether there’s a book in which each noun appears only once. Or have I missed the point?


