The four most ‘informative’ words in Moby-Dick, statistically speaking, are ‘I’, ‘whale’, ‘you’ and ‘Ahab’. Marcello Montemurro and Damián Zanette worked this out by comparing the text of Moby-Dick to all the possible alternatives obtainable by shuffling Melville’s words into random sequences. These are not the four words that are used most often, or that carry the most ‘information’ in the everyday sense of the term, but the words whose positioning in the original, meaningful text differs most from the way they would be scattered in all other permutations. The ‘information’ here is of the mathematical, measurable kind: ‘most informative’ means ‘least randomly distributed’. It may seem a slightly odd way to try to quantify semantic content, as though, when Melville wrote Moby-Dick, it wasn’t so much a matter of finding the right words as of putting them down in the right order.
As it happens, the real point of the research isn’t to compare the contribution of individual words to the meaning of the whole, but to see how this measure of meaning behaves in different texts. In On the Origin of Species, for example, there are 69 instances of the word ‘instinct’, but more than 50 of them are clustered in Chapter 7 (which, by no coincidence, has the title ‘Instinct’). In any random rearrangement of the entire book, mentions of ‘instinct’ would be pretty evenly scattered. So for this book and that word, comparing the distribution across chapters captures the difference – or a difference, anyway – between authorial intention and meaninglessness.
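To make the comparison concrete, here is a minimal Python sketch of the idea for a single word. It is not Montemurro and Zanette's actual calculation: the function names, the use of equal-length parts in place of real chapters, and the choice of Shannon entropy as the measure of scatter are all assumptions made for illustration. The score is the average entropy of the word's distribution in shuffled copies of the text minus its entropy in the original, so a tightly clustered word scores high.

```python
import math
import random

def chunk_counts(words, target, n_parts):
    """Count occurrences of `target` in each of n_parts equal-length parts."""
    size = math.ceil(len(words) / n_parts)
    counts = [0] * n_parts
    for i, w in enumerate(words):
        if w == target:
            counts[i // size] += 1
    return counts

def entropy(counts):
    """Shannon entropy, in bits, of a list of counts (0.0 if all are zero)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def clustering_score(words, target, n_parts, n_shuffles=200, seed=0):
    """Mean entropy over shuffled texts minus entropy in the original order.

    Hypothetical helper: a large positive value means the word is far more
    clustered in the real text than it would be by chance.
    """
    rng = random.Random(seed)
    original = entropy(chunk_counts(words, target, n_parts))
    shuffled = list(words)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += entropy(chunk_counts(shuffled, target, n_parts))
    return total / n_shuffles - original
```

Run on a word list in which every ‘instinct’ falls inside one part, the original entropy is zero and the score is simply the average entropy of the shuffled versions. A word spread evenly through the book would score close to zero.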
But does the way books can be broken down into sections according to word distribution always coincide with the way they’ve been divided into chapters? Montemurro and Zanette analysed all 5258 books available in electronic form from Project Gutenberg, and found that each book has a different ‘optimum scale’. Moby-Dick, which runs to 218,284 words, seems to be divisible into discrete, coherent chunks of around 1200 words (you might be able to guess that from looking at the contents page, but Montemurro and Zanette have now proved it).
The semantic structure of On the Origin of Species, which is 155,800 words long, is by contrast revealed in sections of around 3000 words. Quotes and Images From The Novels of Georg Ebers, which Montemurro and Zanette describe as a short book of quotations with ‘no thematic unity building up along the text’, conveys its meaning in sections of 50 to 70 words. A typical book of around 100,000 words apparently has a most informative scale of between 300 and 3000 words.
Which leaves you, or at least it leaves me, wanting to know which works have the longest maximal informative span. The top three are: Decline and Fall of the Roman Empire Vol. 3 by Edward Gibbon, A History of Rome Vol. 1 by Abel Greenidge, and Civilisation of the Renaissance in Italy by Jacob Burckhardt, with spans of between 4000 and 6000 words. Not exactly Twitter.
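The notion of an ‘optimum scale’ can be illustrated in the same toy fashion: try several section lengths and keep the one at which the original text differs most from its shuffled versions. The sketch below follows a single word, whereas the published measure presumably takes the whole vocabulary into account. The names `section_entropy` and `best_scale`, the candidate lengths and the entropy-difference score are assumptions for illustration, not the authors' method.

```python
import math
import random

def section_entropy(words, target, section_len):
    """Entropy (bits) of how `target` is spread over sections of section_len words."""
    n_sections = math.ceil(len(words) / section_len)
    counts = [0] * n_sections
    for i, w in enumerate(words):
        if w == target:
            counts[i // section_len] += 1
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def best_scale(words, target, candidates, n_shuffles=50, seed=0):
    """Return the candidate section length with the largest score, plus all scores.

    Score = mean entropy over shuffled texts minus entropy of the original,
    a simplified single-word stand-in for the measure described in the text.
    """
    rng = random.Random(seed)
    scores = {}
    for section_len in candidates:
        shuffled = list(words)
        mean_shuffled = 0.0
        for _ in range(n_shuffles):
            rng.shuffle(shuffled)
            mean_shuffled += section_entropy(shuffled, target, section_len)
        mean_shuffled /= n_shuffles
        scores[section_len] = mean_shuffled - section_entropy(words, target, section_len)
    return max(scores, key=scores.get), scores

# Toy text: 1200 words, with 'whale' filling one 100-word stretch.
words = ["filler"] * 1200
for i in range(400, 500):
    words[i] = "whale"
best, scores = best_scale(words, "whale", [25, 100, 400])
```

In this toy text the 100-word sections win. Shorter sections split the cluster across several parts, and longer ones leave too few parts for the shuffled text to look very different from the original.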