It knows

Daniel Soar

  • The Googlisation of Everything (and Why We Should Worry) by Siva Vaidhyanathan
    California, 265 pp, £18.95, March 2011, ISBN 978 0 520 25882 2
  • In the Plex: How Google Thinks, Works and Shapes Our Lives by Steven Levy
    Simon and Schuster, 424 pp, £18.99, May 2011, ISBN 978 1 4165 9658 5
  • I’m Feeling Lucky: The Confessions of Google Employee Number 59 by Douglas Edwards
    Allen Lane, 416 pp, £20.00, July 2011, ISBN 978 1 84614 512 4

This spring, the billionaire Eric Schmidt announced that there were only four really significant technology companies: Apple, Amazon, Facebook and Google, the company he had until recently been running. People believed him. What distinguished his new ‘gang of four’ from the generation it had superseded – companies like Intel, Microsoft, Dell and Cisco, which mostly exist to sell gizmos and gadgets and innumerable hours of expensive support services to corporate clients – was that the newcomers sold their products and services to ordinary people. Since there are more ordinary people in the world than there are businesses, and since there’s nothing that ordinary people don’t want or need, or can’t be persuaded they want or need when it flashes up alluringly on their screens, the money to be made from them is virtually limitless. Together, Schmidt’s four companies are worth more than half a trillion dollars. The technology sector isn’t as big as, say, oil, but it’s growing, as more and more traditional industries – advertising, travel, real estate, used cars, new cars, porn, television, film, music, publishing, news – are subsumed into the digital economy. Schmidt, who as the ex-CEO of a multibillion-dollar corporation had learned to take the long view, warned that not all four of his disruptive gang could survive. So – as they all converge from their various beginnings to compete in the same area, the place usually referred to as ‘the cloud’, a place where everything that matters is online – the question is: who will be the first to blink?

If the company that falters is Google, it won’t be because it didn’t see the future coming. Of Schmidt’s four technology juggernauts, Google has always been the most ambitious, and the most committed to getting everything possible onto the internet, its mission being ‘to organise the world’s information and make it universally accessible and useful’. Its ubiquitous search box has changed the way information can be got at to such an extent that ten years after most people first learned of its existence you wouldn’t think of trying to find out anything without typing it into Google first. Searching on Google is automatic, a reflex, just part of what we do. But an insufficiently thought-about fact is that in order to organise the world’s information Google first has to get hold of the stuff. And in the long run ‘the world’s information’ means much more than anyone would ever have imagined it could. It means, of course, the totality of the information contained on the World Wide Web, or the contents of more than a trillion webpages (it was a trillion at the last count, in 2008; now, such a number would be meaningless). But that much goes without saying, since indexing and ranking webpages is where Google began when it got going as a research project at Stanford in 1996, just five years after the web itself was invented. It means – or would mean, if lawyers let Google have its way – the complete contents of every one of the more than 33 million books in the Library of Congress or, if you include slightly varying editions and pamphlets and other ephemera, the contents of the approximately 129,864,880 books published in every recorded language since printing was invented. It means every video uploaded to the public internet, a quantity – if you take the Google-owned YouTube alone – that is increasing at the rate of nearly an hour of video every second.

It means the location of businesses, religious institutions, schools, libraries, community centres and hospitals worldwide – a global Yellow Pages. It means the inventories of shops, the archives of newspapers, the minute by minute performance of the stock market. It means, or will mean, if Google keeps going, the exact look of every street corner and roadside on the planet, photographed in high resolution and kept as up to date as possible: the logic, if not yet the practice, of Google Street View, means that city streets should be under ever more regular photographic surveillance, since the fresher and more complete the imagery the more useful people will find it, and the more they will therefore use it.[1] If it doesn’t already have a piece of data, you can be sure that Google is pursuing a way of getting it, of gathering and sorting every kind of public information there is.

But all this is just the stuff that Google makes publicly searchable, or ‘universally accessible’. It’s only a small fraction of the information it actually possesses. I know that Google knows, because I’ve looked it up, that on 30 April 2011 at 4.33 p.m. I was at Willesden Junction station, travelling west. It knows where I was, as it knows where I am now, because like many millions of others I have an Android-powered smartphone with Google’s location service turned on. If you use the full range of its products, Google knows the identity of everyone you communicate with by email, instant messaging and phone, with a master list – accessible only by you, and by Google – of the people you contact most. If you use its products, Google knows the content of your emails and voicemail messages (a feature of Google Voice is that it transcribes messages and emails them to you, storing the text on Google servers indefinitely). If you find Google products compelling – and their promise of access-anywhere, conflagration and laptop-theft-proof document creation makes them quite compelling – Google knows the content of every document you write or spreadsheet you fiddle or presentation you construct. If as many Google-enabled robotic devices get installed as Google hopes, Google may soon know the contents of your fridge, your heart rate when you’re exercising, the weather outside your front door, the pattern of electricity use in your home.

Google knows or has sought to know, and may increasingly seek to know, your credit card numbers, your purchasing history, your date of birth, your medical history, your reading habits, your taste in music, your interest or otherwise (thanks to your searching habits) in the First Intifada or the career of Audrey Hepburn or flights to Mexico or interest-free loans, or whatever you idly speculate about at 3.45 on a Wednesday afternoon. Here’s something: if you have an Android phone, Google can guess your home address, since that’s where your phone tends to be at night. I don’t mean that in theory some rogue Google employee could hack into your phone to find out where you sleep; I mean that Google, as a system, explicitly deduces where you live and openly logs it as ‘home address’ in its location service, to put beside the ‘work address’ where you spend the majority of your daytime hours.

Some people find all this frightening. Since Google still makes more than 95 per cent of its money through selling advertising – that’s $30 billion a year, or about twice the annual global revenue of the entire recorded music industry – the fear is that all the information about us it has hoovered up is used to create scarily exact user profiles which it then offers to advertisers, as the most complete picture of billions of individuals it’s currently possible to build. The fear seems to be based on the assumption that if Google is gathering all this information then it must be doing so in order to sell it: it is a profit-making company, after all. ‘We are not Google’s customers,’ Siva Vaidhyanathan writes in The Googlisation of Everything. ‘We are its product. We – our fancies, fetishes, predilections and preferences – are what Google sells to advertisers.’ Vaidhyanathan, who likes alliteration but isn’t so big on facts, doesn’t explain what he means by ‘sells’ (or whether ‘to sell a fancy’ could mean anything at all), but if he’s implying that Google makes the information it has about us available to advertisers then he’s wrong. It isn’t possible, using Google’s tools, to target an ad to 32-year-old single heterosexual men living in London who work at Goldman Sachs and like skiing, especially at Courchevel. You can do exactly that using Facebook, but the options Google gives advertisers are, by comparison, limited: the closest it gets is to allow them to target display ads to people who may be interested in the category of ‘skiing and snowboarding’ – and advertisers were always able to do that anyway by buying space in Ski & Snowboard magazine. The rest of the time, Google decides the placement of ads itself, using its proprietary algorithms to display them wherever it knows they will get the most clicks. The advertisers are left out of the loop.

So why doesn’t Google market its personal information, when it has so much of it? One answer might be that to do so would be ‘evil’. ‘Don’t be evil’ is Google’s geeky corporate motto – a hostage to fortune if ever there was one, though it usually seems to mean ‘don’t do anything to upset the users.’ We’d be upset – we might even choose to use a competing service – if Google released information about us that we didn’t know it had, or that we didn’t even know ourselves, such as the likelihood, revealed by our searches, that we might be suffering from a particular illness.[2] Facebook gets away with being evil – or does it? – because the personal information it makes available for targeting is information that users have voluntarily surrendered by filling in their profiles: birthday, relationship status, hometown, workplace; every time they click on a ‘Like’ button on the web they are deemed to have declared an interest that can be used for targeting. But another answer might be that the information Google has is too valuable to give away, that it has another reason for collecting every piece of data it possibly can, that the stuff it’s amassing is worth more than just money.

The reason is that Google is learning. The more data it gathers, the more it knows, the better it gets at what it does. Of course, the better it gets at what it does the more money it makes, and the more money it makes the more data it gathers and the better it gets at what it does – an example of the kind of win-win feedback loop Google specialises in – but what’s surprising is that there is no obvious end to the process. Thanks to what it has learned so far, Google is no longer the merely impressive search engine it was a decade ago. Back then, it was assumed that the key to its success in delivering its (as it once seemed) uncannily accurate results was its first and best-known invention, PageRank, the algorithm that assigns to every page on the web a value indicating how authoritative it is, based on the number and the authoritativeness of the pages linking to it. Its inventor was Larry Page (hence, cunningly, PageRank), one of Google’s founders and now once more its CEO; and his model, as Steven Levy explains in In the Plex, was the system of scholarly citation, by which journal articles and books are considered important if they are referred to by other important journal articles and books. Levy is big on origins. Not everyone will think much of the suggestion that Page and Sergey Brin, his co-founder, got where they are today because they were both ‘Montessori kids’ who were taught from an early age to believe anything was possible.[3] But he may be on to something when he says that Page’s academic family background – his father taught at Michigan State, and he hung out at Stanford as a child – meant that when he faced the problem of how to rank importance he recognised that the economy of the web was very similar to the economy of academia. 
Those at the bottom of the ladder (the junior academics, the lowly website owners) seek recognition from those above them (the celebrated professors, the global internet portals) and use citations in the hope that some of the gold dust will rub off on them if they get cited back. Rankings based on citations aren’t necessarily a measure of excellence – if they were, we wouldn’t hear so much about Steven Pinker – but they do reflect where humans have decided that authority lies.
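
For readers who want to see the citation idea at work, here is a minimal sketch of it in Python. It is a toy, not Google's implementation: the four-page 'web', the page names and the damping value of 0.85 are all assumptions made for illustration, and nothing here comes from Levy's account. Each page repeatedly hands a share of its own authority to the pages it links to, so a link from a well-cited page counts for more than a link from an obscure one.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy power-iteration PageRank over a dict {page: [pages it links to]}.

    The damping value and iteration count are illustrative assumptions.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with authority spread evenly across every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page keeps a small baseline share whoever links to it.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Authority is divided equally among the pages cited.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page that cites nobody spreads its authority evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical web: two lowly sites cite the established ones,
# and the established ones cite each other.
web = {
    "portal": ["professor"],
    "professor": ["portal"],
    "junior": ["professor", "portal"],
    "hobbyist": ["professor"],
}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

In this made-up example the two pages that nobody links to end up with only the baseline share, however many citations they hand out, while the two that cite each other hold almost all the authority between them.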

PageRank, however, has always been just one of the factors determining how Google’s search results are ordered. In 2007, Google told the New York Times that it was now using more than 200 signals in its ranking algorithm, and the number must now be higher. What every one of those signals is and how they are weighted is Google’s most precious trade secret, but the most useful signal of all is the least predictable: the behaviour of the person who types their query into the search box. A click on the third result counts as a vote that it ought to come higher. A ‘long click’ – when you select one of the results and don’t come back – is a stronger vote. To test a new version of its algorithm, Google releases it to a small subset of its users and measures its effectiveness through the pattern of their clicks: more happy surfers and it’s just got cleverer. We teach it while we think it’s teaching us. Levy tells the story of a new recruit with a long managerial background who asked Google’s senior vice-president of engineering, Alan Eustace, what systems Google had in place to improve its products. ‘He expected to hear about quality assurance teams and focus groups’ – the sort of set-up he was used to. ‘Instead Eustace explained that Google’s brain was like a baby’s, an omnivorous sponge that was always getting smarter from the information it soaked up.’ Like a baby, Google uses what it hears to learn about the workings of human language. The large number of people who search for ‘pictures of dogs’ and also ‘pictures of puppies’ tells Google that ‘puppy’ and ‘dog’ mean similar things, yet it also knows that people searching for ‘hot dogs’ get cross if they’re given instructions for ‘boiling puppies’. If Google misunderstands you, and delivers the wrong results, the fact that you’ll go back and rephrase your query, explaining what you mean, will help it get it right next time. Every search for information is itself a piece of information Google can learn from.

By 2007, Google knew enough about the structure of queries to be able to release a US-only directory inquiry service called GOOG-411. You dialled 1-800-4664-411 and spoke your question to the robot operator, which parsed it and spoke you back the top eight results, while offering to connect your call. It was free, nifty and widely used, especially because – unprecedentedly for a company that had never spent much on marketing – Google chose to promote it on billboards across California and New York State. People thought it was weird that Google was paying to advertise a product it couldn’t possibly make money from, but by then Google had become known for doing weird and pleasing things. In 2004, it launched Gmail with what was for the time an insanely large quota of free storage – 1GB, five hundred times more than its competitors. But in that case it was making money from the ads that appeared alongside your emails. What was it getting with GOOG-411? It soon became clear that what it was getting were demands for pizza spoken in every accent in the continental United States, along with questions about plumbers in Detroit and countless variations on the pronunciations of ‘Schenectady’, ‘Okefenokee’ and ‘Boca Raton’. GOOG-411, a Google researcher later wrote, was a phoneme-gathering operation, a way of improving voice recognition technology through massive data collection.

Three years later, the service was dropped, but by then Google had launched its Android operating system and had released into the wild an improved search-by-voice service that didn’t require a phone call. You tapped the little microphone icon on your phone’s screen – it was later extended to Blackberries and iPhones – and your speech was transmitted via the mobile internet to Google servers, where it was interpreted using the advanced techniques the GOOG-411 exercise had enabled. The baby had learned to talk. Now that Android phones are being activated at a rate of more than half a million a day,[4] Google suddenly has a vast and growing repository of spoken words, in every language on earth, and a much more powerful learning machine. If your phone mistranscribes what you say, you correct it by typing it in, and Google’s algorithms – once again – are taught how to get better still. It’s a frustratingly faultless learning loop. It’s easy to assume that the end result of this increasing perfection will be a Google machine in the cloud that can correctly transcribe all speech in all languages from Afrikaans to Xhosa, however badly you mumble: useful when you’re driving or have your hands full. But that’s to think small.

Before Google bought YouTube in 2006 for $1.65 billion, it had a fledgling video service of its own, predictably called Google Video, that in its initial incarnation offered the – it seemed – brilliant feature of answering a typed phrase with a video clip in which those words were spoken. The promise was that, for example, you’d be able to search for the phrase ‘in my beginning is my end’ and see T.S. Eliot, on film, reciting from the Four Quartets. But no such luck. Google Video’s search worked by a kind of trickery: it used the hidden subtitles that broadcasters provide for the hard of hearing, which Google had generally paid to use, and searched against the text. The service is just one of the many experiments that Google over the years has killed, but a presumably large reason for its death was that although it appeared to work it was really very limited. Not everything is tailored for the deaf, and subtitles are often wrong. If, however, Google is able to deploy its newly capable voice recognition system to transcribe the spoken words in the two days’ worth of video uploaded to YouTube every minute, there would be an explosion in the amount of searchable material. Since there’s no reason Google can’t do it, it will.

A thought experiment: if Google launched satellites into orbit it could record all terrestrial broadcasts and transcribe those too. That may sound exorbitant, but it’s not obviously crazier than some of the ideas that Google’s founders have dreamed up and found a way of implementing: the idea of photographing all the world’s streets, of scanning all the world’s books, of building cars that drive themselves. It’s the sort of thing that crosses Google’s mind. An April Fool’s joke a few years ago advertised job opportunities at Google’s research centre on the Moon, where listening equipment would provide an ‘ear on the chatter of the universe, the vast web of electromagnetic pulses that may contain signals from intelligent life forms in other galaxies, as well as a complete record of every radio or television signal broadcast from our own planet’. Google takes its April Fool’s jokes very seriously, as the marketing man who wrote some of them, Douglas Edwards, explains in I’m Feeling Lucky: The Confessions of Google Employee Number 59: big arguments broke out when the founders felt that proposed jokes weren’t true to Google’s sense of its mission. The jokes – like the friendly logo, and the homepage doodles – are carefully designed to hint at the scale of Google’s ambition without scaring the world to death.

There seem to be no large Google initiatives – however seemingly tangential to the company’s core competency, and unhelpful to its bottom line – that don’t bring as a side benefit, or as the main benefit, an enormous amount of data to Google. They also threaten to put whole industries out of business by being free. In 2009, Google updated its Maps application for Android to include free turn-by-turn navigation: on-screen and spoken directions to whatever destination you choose. The cost to Google was negligible, and the damage to existing businesses was enormous: companies like Garmin and TomTom had been getting large margins on hundred-pound satnav hardware, and then charging for monthly subscriptions. Not any more. Naturally, those threatened don’t always give up without a fight. That a more esoteric battle has been taking place over Android was revealed earlier this year when a little company called Skyhook took Google to court for alleged unfair business practices. Skyhook makes its money by licensing location-detection technology to hardware manufacturers, and – in an impressive coup – had succeeded in persuading Motorola, among others, that its system was better than Google’s. Motorola agreed to pay to use Skyhook’s service on its Android phones in preference to Google’s built-in free one. When Google executives found out what had happened – as subpoenaed emails between them showed – they were incredulous, and alarmed:

This feels like a disaster :(

I think this is worth a postmortem and maybe a code yellow or something like that to really focus here.

What they were alarmed about was not that their system might not be the best – they didn’t quite believe that – but that if manufacturers started using a competitor’s product they would no longer be getting the data they needed to improve their own.[5] In other words, Google faced the unfamiliar problem of the negative feedback loop: the fewer people that used its product, the less information it would have and the worse the product would get. So the executives swung into action and reminded Motorola of various contractual obligations that went with the Android licence. Google got to keep its data. Coincidentally, last month, it announced its plan to buy Motorola Mobility – along with 19,000 employees, nearly doubling Google’s workforce – for $12.5 billion.

Google isn’t invincible. Eric Schmidt likes to say that its competitors are only one click away: if you don’t like Google’s search results, or its business practices, you can always use Bing. But Google is currently facing anti-trust scrutiny by Senate subcommittees, and the bigger it gets the less answerable the regulatory threat will become. Google is getting cleverer precisely because it is so big. If it’s cut down to size then what will happen to everything it knows? That’s the conundrum. It’s clearly wrong for all the information in all the world’s books to be in the sole possession of a single company. It’s clearly not ideal that only one company in the world can, with increasing accuracy, translate text between 506 different pairs of languages. On the other hand, if Google doesn’t do these things, who will?

[1] In 1999, Google’s web index – its copy of every page on the internet – was updated once every three or four months. By 2003 parts of the index were updated once a day, and by 2007 the rate was once every few minutes. By 2009 it was no longer possible to say that the web was being crawled at such and such a speed: if Google considered there was a chance a page might be updated it engineered things such that any change on that page was reflected in its index exactly as it happened. A search for ‘hudson river’ on 15 January 2009 would have shown that a plane had crash-landed on it before it was reported by CNN.

[2] This is something that Google can in theory know. Google Flu Trends uses aggregated search data for flu-like symptoms to estimate the spread of flu pandemics in various countries around the world. Google published an article in Nature explaining its methodology (‘we applied the Fisher Z-transformation to each correlation, and took the mean of the 36 Z-transformed correlations’), and demonstrated that its tool was as accurate as any existing method of estimating flu levels at any given moment and, since it doesn’t depend on health departments’ weekly reports, much faster at providing results.

[3] As further evidence of Page’s thinking big, Levy reports a conversation from 2003, when Google executives were discussing opening engineering offices overseas. ‘Schmidt asked Page how quickly he would like to grow. “How many engineers does Microsoft have?” Page asked. About 25,000, he was told. “We should have a million.”’

[4] Android’s very rapid growth can mostly be attributed to the fact that the operating system is free for manufacturers to license. Previously, a handset-maker such as Nokia either had to develop its own software or pay large sums to use software developed by another company, such as Microsoft. In fact, Android is what is known as ‘less than free’, since manufacturers get an undisclosed percentage of Google’s ad revenue from phones Android is installed on.

[5] In 2010, Google had been forced by regulators to stop its Street View cars from collecting certain location data after it was discovered that they were (Google says) accidentally also recording some of the data transmitted through WiFi networks in people’s homes. One of the emails in the Skyhook lawsuit explained that now that Google was no longer getting location data from Street View cars it relied heavily on the data from Android handsets ‘to maintain and improve’ its location service. It was a revealing indication of how ingeniously Google projects serve multiple ends.