Me call Ishmael

The four most ‘informative’ words in Moby-Dick, statistically speaking, are ‘I’, ‘whale’, ‘you’ and ‘Ahab’. Marcelo Montemurro and Damian Zanette worked this out by comparing the text of Moby-Dick to all the possible alternatives obtainable by shuffling Melville’s words into random sequences. These are not the four words that are used most often, or that carry the most ‘information’ in the everyday sense of the term, but the words whose positioning in the original, meaningful text differs most from the way they would be scattered in all other permutations. The ‘information’ here is of the mathematical, measurable kind: ‘most informative’ means ‘least randomly distributed’. It may seem a slightly odd way to try to quantify semantic content, as though when Melville wrote Moby-Dick, it wasn’t so much a matter of finding the right words as of putting them down in the right order.

As it happens, the real point of the research isn’t to compare the contribution of individual words to the meaning of the whole, but to see how this measure of meaning behaves in different texts. In On the Origin of Species, for example, there are 69 instances of the word ‘instinct’, but more than 50 of them are clustered in Chapter 7 (which, by no coincidence, has the title ‘Instinct’). In any random rearrangement of the entire book, mentions of ‘instinct’ would be pretty evenly scattered. So for this book and that word, comparing the distribution across chapters captures the difference – or a difference, anyway – between authorial intention and meaninglessness.
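The ‘instinct’ comparison can be tried out on a toy scale. What follows is a minimal sketch, not Montemurro and Zanette's published procedure: it assumes that ‘clustering’ can be scored as the drop in entropy of a word's distribution across equal-sized parts, relative to the average over random shuffles. The function names and the invented ten-chapter ‘book’ are hypothetical.

```python
import math
import random

def counts_per_part(words, target, n_parts):
    """Count how often `target` occurs in each of n_parts equal-sized parts."""
    size = len(words) // n_parts
    return [words[i * size:(i + 1) * size].count(target) for i in range(n_parts)]

def entropy_bits(counts):
    """Shannon entropy (in bits) of a word's distribution over the parts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def clustering_score(words, target, n_parts, n_shuffles=200, seed=0):
    """Average entropy over shuffled texts minus entropy in the original.

    A score near zero means the word is spread about as evenly as chance
    would spread it; a large positive score means it is clustered.
    """
    rng = random.Random(seed)
    original = entropy_bits(counts_per_part(words, target, n_parts))
    pool = list(words)
    shuffled_total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        shuffled_total += entropy_bits(counts_per_part(pool, target, n_parts))
    return shuffled_total / n_shuffles - original

# Invented toy book: ten 100-word chapters, with 'instinct' piled into the seventh.
toy_book = []
for chapter in range(10):
    hits = 50 if chapter == 6 else 2
    toy_book += ["instinct"] * hits + ["filler"] * (100 - hits)

print(clustering_score(toy_book, "instinct", 10))
```

The seed is fixed only so that repeated runs give the same number; the choice of 200 shuffles is arbitrary.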

But does the way books can be broken down into sections according to word distribution always coincide with the way they’ve been divided into chapters? Montemurro and Zanette analysed all 5258 books available in electronic form from Project Gutenberg, and found that each book has a different ‘optimum scale’. Moby-Dick, which runs to 218,284 words, seems to be divisible into discrete, coherent chunks of around 1200 words (you might be able to guess that from looking at the contents page, but Montemurro and Zanette have now proved it).

The semantic structure of On the Origin of Species – 155,800 words in length – is revealed, however, in sections of around 3000 words. Quotes and Images From The Novels of Georg Ebers, which Montemurro and Zanette describe as a short book of quotations with ‘no thematic unity building up along the text’, conveys its meaning in sections of 50 to 70 words. A typical book of around 100,000 words apparently has a most informative scale of between 300 and 3000 words.
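The idea of an ‘optimum scale’ can also be sketched in a few lines. This is an illustration under stated assumptions rather than the researchers' own method: it assumes the information at a given section length is the frequency-weighted gap, summed over all words, between each word's entropy across sections in shuffled text and in the original, and it simply tries a handful of candidate lengths. All names and the toy text are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

def part_entropies(words, n_parts):
    """Entropy (bits) of every word's distribution over n_parts equal parts."""
    size = len(words) // n_parts
    per_word = defaultdict(list)
    for p in range(n_parts):
        for word, count in Counter(words[p * size:(p + 1) * size]).items():
            per_word[word].append(count)
    result = {}
    for word, counts in per_word.items():
        total = sum(counts)
        result[word] = -sum(c / total * math.log2(c / total) for c in counts)
    return result

def information_at_scale(words, part_size, n_shuffles=20, seed=0):
    """Frequency-weighted entropy gap between shuffled and original text."""
    n_parts = len(words) // part_size
    used = list(words[:n_parts * part_size])  # drop any ragged tail
    freq = Counter(used)
    original = part_entropies(used, n_parts)
    rng = random.Random(seed)
    shuffled_sum = defaultdict(float)
    pool = list(used)
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        for word, h in part_entropies(pool, n_parts).items():
            shuffled_sum[word] += h
    return sum(
        freq[w] / len(used) * (shuffled_sum[w] / n_shuffles - original[w])
        for w in freq
    )

def best_scale(words, candidate_sizes):
    """Return the candidate section length with the highest information."""
    return max(candidate_sizes, key=lambda size: information_at_scale(words, size))

# Invented toy text: twenty 50-word blocks, each with its own topic word.
toy_text = []
for block in range(20):
    toy_text += [f"topic{block}", "the"] * 25

print(best_scale(toy_text, [10, 25, 50, 100, 250]))
```

The toy text is built so that each topic word is confined to a single 50-word block, so one would expect the scan to favour sections of about that length; a real book would need far more candidate sizes and a faster way of estimating the shuffled entropy.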

Which leaves you, or at least it leaves me, wanting to know which works have the longest maximal informative span. The top three are: Decline and Fall of the Roman Empire Vol. 3 by Edward Gibbon, A History of Rome Vol. 1 by Abel Greenidge, and Civilisation of the Renaissance in Italy by Jacob Burckhardt, with spans of between 4000 and 6000 words. Not exactly Twitter.

Comments on “Me call Ishmael”

  1. Camus123 says:

    What I’d like them to tell me is whether there’s a book in which each noun appears only once. Or have I missed the point?
