Big Biology

Hugh Pennington

Big Science took off during the Second World War and justified itself with successful ventures such as the Manhattan Project. Physicists have operated on a grand scale ever since. Lavish public funding has enabled them to conduct enormous experiments, each taking years in the planning and requiring hundreds of scientists and machines that cost hundreds of millions. Biology is different. Its most expensive items of equipment – MRI scanners or electron microscopes or DNA sequencers – cost many orders of magnitude less and don’t need enormous engineering teams because they can be bought off the shelf with manufacturers’ guarantees, just like white goods. But the problems that biologists investigate are far, far more complicated than those that remain for physicists. Living organisms have not been rationally designed, but have evolved, and are still evolving. Variability is everywhere; enormously complex interactions between the thousands of different molecules (themselves very complex) in an organism are universal, but rules that reliably predict and explain them are still vanishingly rare. To find answers to their questions in the face of these difficulties, biologists have been forced to become Big Science practitioners as well.

The technical and administrative tour de force described in the research article ‘Draft Genome Sequence of the Sexually Transmitted Pathogen Trichomonas vaginalis’, published in Science on 12 January, is typical of Big Biology. It has 65 authors. They come from England, Scotland, Denmark, Germany, Belgium, Italy, the Czech Republic, Australia, Canada, Taiwan and the US. America provides 35 of them, 12 from the Institute for Genomic Research (TIGR).

TIGR epitomises Big Biology. It describes itself as a ‘not-for-profit centre dedicated to deciphering and analysing genomes’, and its work, it says, has ‘wide-ranging applications in medicine, agriculture, energy, the environment and biodefence’. Set up in 1992 with a campus in the Shady Grove Life Sciences Center near Washington DC, it has so far sequenced the genomes of more than three dozen disease-causing microbes, including those causing cholera, syphilis, anthrax, malaria and sleeping sickness, as well as the rice plant. But just as US atom-smashing enterprises have to compete with CERN in Geneva, TIGR is not alone. Its 17-acre campus is more than matched by Hinxton Hall, the Wellcome Trust Sanger Institute’s 55-acre site near Cambridge. The Sanger Institute also describes itself as ‘not for profit’ and was also established in 1992. It has sequenced the genomes of more than ninety pathogens as well as those of the mouse, the zebra fish and a third of the human.

DNA sequencing machines work by copying DNA molecules to give fragments of different lengths which are then driven by an electric current through a porous medium that sieves them according to size; the machines display them in order so that the sequence can be deduced. The machines work best with DNA molecules made up of fewer than a thousand bases. Genomes generally have millions of bases and have themselves to be cut into components of manageable size before sequencing. Overlapping fragments are generated so that their relationships to one another and their order can be worked out from the overlap. TIGR and the Sanger Institute have semi-industrialised the process by setting up hundreds of DNA-sequencing machines in one place, dedicating them at any one time to sequencing only a few genomes. The productivity gains have been enormous. DNA sequencing was invented by Frederick Sanger in 1975. By 1990, 230,000 base pairs had been sequenced; the current figure for the Sanger Institute is 3,402,803,300.
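The idea of working out the order of fragments from their overlaps can be sketched in a few lines of Python. This is a toy illustration only, not the software used at TIGR or the Sanger Institute: the function names, the example fragments and the minimum overlap of three bases are all invented for the purpose, and real reads contain errors that this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps: stop with several pieces
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three made-up overlapping fragments of one short stretch of DNA.
pieces = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
assembled = greedy_assemble(pieces)
print(assembled)
```

Each pass joins the two pieces that share the longest end-to-start match, so the three fragments collapse into a single sequence whatever order they are supplied in. The approach depends on overlaps being unique, which is exactly what fails when a genome is full of repeated genes.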

Trichomonas vaginalis is a microbe too small to be seen by the naked eye. It grows best in a moist environment at human body temperature and in the absence of oxygen. It swims jerkily using hair-like flagella and eats bacteria and the cells that line the vagina, the urethra and, maybe, the prostate. It boosts its iron supplies by consuming blood cells shed during menstruation. The majority of parasitised humans are unaware they have it; it causes irritation in a significant minority, mostly women. For most of its history it has been underestimated as a problem. It was the first microbe living in humans to be seen, described and named (by Alfred Donné in France in 1836) and the first to be photographed (by Donné and Léon Foucault, using the ‘microscope daguerréotype’ in 1845). But it languished in medical textbooks as ‘a normal inhabitant of the vagina’ for another hundred years until its unmasking as a pathogen by the successful treatment of gonorrhoea – with which it often coexisted – with antibiotics in the 1940s. It is important because it is common, with more than 170 million people affected worldwide every year, and because its presence raises the risk of HIV infection in those affected between two and fourfold. Although it can survive for a while on surfaces like lavatory seats, it is transmitted from human to human, its only host, by sex. It doesn’t have sex itself.

Medical reasons undoubtedly helped to secure funding for the sequencing of the Trichomonas genome from the US National Institute of Allergy and Infectious Diseases, the Burroughs Wellcome Fund and the Ellison Medical Foundation (to pay for meetings of the ‘Trichomonas community’) and the Chang Gung Memorial Hospital and Taiwan Biotech Company Limited. It is a reasonable guess, however, that the perception of Trichomonas as a particularly primitive member of the eukaryotes (animals and plants are eukaryotes, bacteria are prokaryotes) played a role as well.[*] DNA sequencing data is the only information that can be used to test this perception.

Although it is normal for scientific papers to end with words like ‘suggest’, and ‘may indicate’, and ‘support the hypothesis’ (words denoting that closure is far off), it is unusual to start a paper with ‘draft’, when it has more than a hundred pages of online supporting material, because it tempts referees and editors to return the paper with the instruction to finish the project and then resubmit. But the Trichomonas article contains an enormous amount of finished information. It is a ‘draft’ because the way the genome is organised has so far defeated the final assembly of all the chunks of information generated by the sequencers. The researchers made and analysed 1,403,509 DNA fragments, with an average length of 785 base pairs, but couldn’t put them in proper order because so many genes were repeated over and over again. At least 65 per cent of the genome is made up of families of identical genes present in hundreds of copies.
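Two small calculations make the difficulty concrete. The first multiplies out the figures just quoted to give the total amount of raw sequence read. The second is a toy demonstration, with invented sequences, of why repeats defeat assembly: when a repeated unit is longer than the reads, two genomes that arrange their unique stretches differently can yield exactly the same set of reads. The names `reads_of`, `REPEAT`, `genome_1` and `genome_2`, and the read length of eight, are assumptions made for illustration and have nothing to do with the real Trichomonas data.

```python
def reads_of(genome: str, k: int) -> set[str]:
    """All distinct length-k substrings, standing in for sequencer reads."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


# Figures quoted in the article: fragments sequenced and their mean length.
total_bases = 1_403_509 * 785
print(f"{total_bases:,} bases of raw sequence")

# A made-up repeated unit of ten bases, longer than the reads taken below.
REPEAT = "ACGTTGCATG"
READ_LEN = 8

# Two invented genomes: the unique stretches "GG" and "CA" swap places.
genome_1 = "TTT" + REPEAT + "GG" + REPEAT + "CA" + REPEAT + "AAA"
genome_2 = "TTT" + REPEAT + "CA" + REPEAT + "GG" + REPEAT + "AAA"

# No read is long enough to span a whole repeat plus a base on each side,
# so nothing in the reads records which unique stretch follows which.
same_reads = reads_of(genome_1, READ_LEN) == reads_of(genome_2, READ_LEN)
print(genome_1 != genome_2, same_reads)
```

The genomes differ, yet their reads are indistinguishable, so no amount of overlap-matching can say which order is the true one. The sketch compares sets of reads and ignores how often each read occurs.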

TIGR and the Sanger Institute spend a lot of time developing software and processes to interpret the raw data generated by their DNA sequencers. But Trichomonas’s repetitive nature has, so far, proved too much for TIGR’s AutoJoiner, ContigFattening, ContigGrowing and AutoGrowing algorithms and procedures. Nevertheless, the sequencing project has gone a long way towards explaining how Trichomonas works. It shows that it metabolises amino acids for energy, has antioxidant systems to protect it from oxygen toxicity, and has more than four hundred genes in its degradome – the system it uses to digest proteins. It explains why, after almost fifty years of unregulated antibiotic use, resistance has developed very slowly: Trichomonas has seven copies of the enzyme that converts the approved drug metronidazole to its toxic form and all would have to mutate simultaneously.
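The force of the seven-copies argument shows up in a back-of-envelope calculation. The sketch below assumes, purely for illustration, that each copy mutates independently with the same probability; the function name and the value of one in a thousand are hypothetical and do not come from the Science paper.

```python
def prob_all_copies_mutate(p_single: float, copies: int) -> float:
    """Chance that every copy carries a mutation, assuming independence."""
    return p_single ** copies


# Hypothetical per-copy probability, chosen only to show the scaling.
P_SINGLE = 1e-3

one_copy = prob_all_copies_mutate(P_SINGLE, 1)
seven_copies = prob_all_copies_mutate(P_SINGLE, 7)
print(one_copy, seven_copies)
```

Whatever the true figure for a single copy, raising it to the seventh power makes the combined event vastly rarer, which fits the slow emergence of resistance.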

The sequencers speculate that massive gene expansion, particularly of certain gene families, and the uptake of bacterial genes, were adaptations that occurred when Trichomonas vaginalis was moving from the bowels to become a urogenital parasite. A reasonable explanation; but a somewhat more certain, and more surprising, observation is that it has so many predicted genes, 59,681. The mouse has 37,854 and Homo sapiens 35,845. From the evolutionary standpoint, instead of confirming Trichomonas as a primitive ancestor, sequencing has raised another problem: why so many genes? But this is the way science moves: trying to answer one question raises an even better one.

[*] Eukaryote cells have a nucleus; prokaryote cells don’t.