Me call Ishmael

The four most ‘informative’ words in Moby-Dick, statistically speaking, are ‘I’, ‘whale’, ‘you’ and ‘Ahab’. Marcelo Montemurro and Damián Zanette worked this out by comparing the text of Moby-Dick to all the possible alternatives obtainable by shuffling Melville’s words into random sequences. These are not the four words that are used most often, or that carry the most ‘information’ in the everyday sense of the term, but the words whose positioning in the original, meaningful text differs most from the way they would be scattered in all other permutations. The ‘information’ here is of the mathematical, measurable kind: ‘most informative’ means ‘least randomly distributed’. It may seem a slightly odd way to try to quantify semantic content, as though when Melville wrote Moby-Dick, it wasn’t so much a matter of finding the right words, as of putting them down in the right order.

As it happens, the real point of the research isn’t to compare the contribution of individual words to the meaning of the whole, but to see how this measure of meaning behaves in different texts. In On the Origin of Species, for example, there are 69 instances of the word ‘instinct’, but more than 50 of them are clustered in Chapter 7 (which, by no coincidence, has the title ‘Instinct’). In any random rearrangement of the entire book, mentions of ‘instinct’ would be pretty evenly scattered. So for this book and that word, comparing the distribution across chapters captures the difference – or a difference, anyway – between authorial intention and meaninglessness.
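For anyone who wants to see the ‘instinct’ comparison in miniature, here is a rough Python sketch. It is not Montemurro and Zanette’s own estimator: the function names, the toy text and the use of equal-sized slices in place of real chapters are all assumptions made for illustration. It measures how evenly a word is spread across the slices (Shannon entropy, in bits) and compares that with the average over random shuffles of the same words.

```python
import math
import random
from collections import Counter


def part_entropy(words, target, n_parts):
    """Entropy (bits) of how `target` is spread over n_parts equal slices.

    0.0 means every occurrence sits in a single slice; higher means more even.
    """
    size = math.ceil(len(words) / n_parts)
    counts = Counter(i // size for i, w in enumerate(words) if w == target)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def clustering_score(words, target, n_parts, n_shuffles=20, seed=0):
    """Average shuffled entropy minus original entropy (hypothetical helper).

    A large positive score means the word is far more clustered in the
    original order than it would be by chance.
    """
    rng = random.Random(seed)
    original = part_entropy(words, target, n_parts)
    shuffled_total = 0.0
    for _ in range(n_shuffles):
        shuffled = words[:]
        rng.shuffle(shuffled)
        shuffled_total += part_entropy(shuffled, target, n_parts)
    return shuffled_total / n_shuffles - original


# Toy "book" (an assumption, not real data): ten 100-word parts,
# with all 20 mentions of 'instinct' packed into the seventh.
toy = []
for part in range(10):
    chunk = ["filler"] * 100
    if part == 6:
        chunk[:20] = ["instinct"] * 20
    toy.extend(chunk)

score = clustering_score(toy, "instinct", n_parts=10)
print(f"clustering score for 'instinct': {score:.2f} bits")
```

In the toy text the original entropy is zero, because ‘instinct’ never leaves its own part, so the whole score comes from how thinly the shuffles scatter it.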

But does the way books can be broken down into sections according to word distribution always coincide with the way they’ve been divided into chapters? Montemurro and Zanette analysed all 5258 books available in electronic form from Project Gutenberg, and found that each book has a different ‘optimum scale’. Moby-Dick, which runs to 218,284 words, seems to be divisible into discrete, coherent chunks of around 1200 words (you might be able to guess that from looking at the contents page, but Montemurro and Zanette have now proved it).

The semantic structure of On the Origin of Species – 155,800 words in length – is however revealed in sections of around 3000 words. Quotes and Images From The Novels of Georg Ebers, which Montemurro and Zanette describe as a short book of quotations with ‘no thematic unity building up along the text’, conveys its meaning in sections of 50 to 70 words. A typical book of around 100,000 words apparently has a most informative scale of between 300 and 3000 words.
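The hunt for an ‘optimum scale’ can be sketched in the same spirit. What follows is loosely modelled on the idea described above, not taken from the paper: the names `weighted_entropy`, `information_at_scale` and `best_scale`, the handful of candidate sizes and the toy text are assumptions. For each section length it adds up, over every distinct word, the gap between shuffled and original entropy, weighted by how often the word occurs, and then picks the length where the gap is widest.

```python
import math
import random
from collections import Counter, defaultdict


def weighted_entropy(words, size):
    """Sum over distinct words of p(word) * entropy of its spread over
    consecutive chunks of `size` words."""
    per_word = defaultdict(Counter)
    for i, w in enumerate(words):
        per_word[w][i // size] += 1
    n = len(words)
    total = 0.0
    for counts in per_word.values():
        k = sum(counts.values())
        h = -sum((c / k) * math.log2(c / k) for c in counts.values())
        total += (k / n) * h
    return total


def information_at_scale(words, size, n_shuffles=10, seed=0):
    """Average shuffled entropy minus original entropy at one chunk size."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(n_shuffles):
        shuffled = words[:]
        rng.shuffle(shuffled)
        baseline += weighted_entropy(shuffled, size)
    return baseline / n_shuffles - weighted_entropy(words, size)


def best_scale(words, sizes):
    """Candidate chunk size with the largest information value."""
    return max(sizes, key=lambda s: information_at_scale(words, s))


# Toy text (an assumption): 20 "topics" of 50 words each,
# every topic cycling through its own five-word vocabulary.
toy = [f"t{b}_{j % 5}" for b in range(20) for j in range(50)]

print("most informative scale:", best_scale(toy, [10, 50, 250]), "words")
```

Because each toy topic is exactly 50 words long, that is the scale the scan should settle on; a real book would need a finer grid of candidate sizes and many more shuffles.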

Which leaves you, or at least it leaves me, wanting to know which works have the longest maximal informative span. The top three are: Decline and Fall of the Roman Empire Vol. 3 by Edward Gibbon, A History of Rome Vol. 1 by Abel Greenidge, and Civilisation of the Renaissance in Italy by Jacob Burckhardt, with spans of between 4000 and 6000 words. Not exactly Twitter.

Comments on “Me call Ishmael”

  1. Camus123 says:

    What I’d like them to tell me is whether there’s a book in which each noun appears only once. Or have I missed the point?
