Until 1995 the only completely sequenced DNA molecules were viral and organelle genomes. That year Craig Venter's group at TIGR, and their collaborators, reported complete genome sequences of two bacterial species, Haemophilus influenzae (49) and Mycoplasma genitalium (50). The H. influenzae sequence gave the first glimpse of the complete instruction set for a living organism. The M. genitalium sequence showed us an approximation to the minimal set of genes required for cellular life.
The methods used to obtain these sequences were as important for subsequent events as the biological insights they revealed. Sequencing of H. influenzae introduced the whole genome shotgun (WGS) method for sequencing cellular genomes. In this method, genomic DNA is fragmented randomly and cloned to produce a random library in E. coli. Clones are sequenced at random, and a computer program assembles the complete genome sequence by comparing all of the sequence reads and aligning matching sequences. Sanger and colleagues used this general strategy to sequence the lambda phage genome (48.5 kb), published in 1982. However, no larger genome was shotgun sequenced until H. influenzae (1.83 Mb). In the interim, shotgun sequencing was used extensively, but only to sequence mapped subclones of larger sequences. This was the strategy used for the 230 kb human CMV (HCMV) sequence, the largest sequence completed before the H. influenzae genome.
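The assembly step described above can be illustrated with a toy greedy overlap merger. This is a minimal sketch, not the TIGR assembler or any published algorithm: the function names, the `min_len` threshold and the example reads are all hypothetical, and the code assumes error-free reads from a single strand with no repeats.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such match is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads sampled from the made-up sequence ATGGCGTACGTTAGC
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
assembled = greedy_assemble(reads)
```

Real reads contain errors, come from both strands and span repeats, so production assemblers need considerably more machinery than this all-against-all comparison suggests.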
Venter and colleagues introduced critical improvements that made it feasible, for the first time, to shotgun sequence complete cellular genomes. Perhaps most important was adoption of the ‘paired ends’ strategy (51, 52). The automated sequencing procedure used in the H. influenzae project used melted double-stranded DNA as template, whereas the HCMV project had used single-stranded M13 vectors. With double-stranded templates it was convenient to sequence each clone from both ends. Because the randomly sheared DNA was carefully sized before cloning, the distance between the reads from the ends of each clone could be estimated. The assembly program used this information to construct ‘scaffolds’ from the blocks of completely overlapped sequence (‘contigs’). When two contigs contained sequences from opposite ends of a single clone, the two contigs could be linked, although a ‘sequence gap’ was said to exist between them. Sequence gaps remaining at the end of the shotgun phase of sequencing could be closed by sequencing from a primer for a site internal to a clone bridging the gap. Gaps between scaffolds are ‘physical gaps’: they contain sequences that do not occur within any of the sequenced clones. Other measures, such as PCR between the ends of scaffolds using a genomic DNA template, were used to close physical gaps.
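The linking of contigs into scaffolds by paired end reads can be sketched as a grouping problem. The example below is a simplified illustration under stated assumptions, not the method used at TIGR: the function `build_scaffolds`, the contig names and the clone and read identifiers are hypothetical, and the sketch ignores the clone size estimates that a real assembler would use to order and orient contigs.

```python
def build_scaffolds(contigs, read_to_contig, clones):
    """Group contigs that are bridged by the two end reads of a clone.

    contigs        -- list of contig names
    read_to_contig -- maps each placed read to the contig containing it
    clones         -- list of (forward_read, reverse_read) pairs
    Returns (scaffolds, sequence_gaps).
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find root lookup with path halving
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    sequence_gaps = set()
    for fwd, rev in clones:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        if a is None or b is None or a == b:
            continue  # unplaced read, or both ends fall in one contig
        # A clone bridges contigs a and b: a sequence gap lies between them
        sequence_gaps.add(frozenset((a, b)))
        parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    scaffolds = sorted(sorted(g) for g in groups.values())
    return scaffolds, sequence_gaps


# Hypothetical assembly: clones A and B bridge c1-c2 and c2-c3
contigs = ["c1", "c2", "c3", "c4"]
read_to_contig = {
    "cloneA.f": "c1", "cloneA.r": "c2",
    "cloneB.f": "c2", "cloneB.r": "c3",
    "cloneC.f": "c4", "cloneC.r": "c4",
}
clones = [("cloneA.f", "cloneA.r"),
          ("cloneB.f", "cloneB.r"),
          ("cloneC.f", "cloneC.r")]
scaffolds, sequence_gaps = build_scaffolds(contigs, read_to_contig, clones)
```

In this toy case c1, c2 and c3 form one scaffold with two sequence gaps, each spanned by a clone that could serve as a template for gap closure, while c4 stands alone; whatever lies between the two scaffolds is a physical gap.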
Another critical factor in the application of shotgun sequencing to cellular genomes was the TIGR assembler. Previous assembly programs were not designed to handle the thousands of sequence reads involved in even the smallest cellular genome projects. However, the TIGR assembler, which had been designed to assemble vast amounts of EST data, was adequate for the job.
Once these initial sequences were reported the floodgates were open and a steady stream of completed genome sequences has been appearing ever since. It is only possible here to touch on a few of the most significant. Because of the large communities of scientists actively engaged in studies that would benefit from the availability of a genome sequence I have chosen to mention the bacteria E. coli and Bacillus subtilis, the yeast Saccharomyces cerevisiae, the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and humans.
Because of its position as the pre-eminent model organism of molecular biology, sequencing of the genome of E. coli (4.6 Mb) was proposed by Blattner as early as 1983 (53). Sequencing proceeded as sequencing technology improved, starting with manual methods and finishing in 1997 with automated sequencers (54). Early sequences, covering ~1.9 Mb, were deposited starting in 1992 and were obtained from an overlapping set of cosmid clones. The final ~2.5 Mb was obtained by shotgun sequencing of ~250 kb I-Sce I fragments. This E. coli genome sequence, along with those of several other strains sequenced subsequently, has yielded a wealth of information about bacterial evolution and pathogenicity (55, 56).
Meanwhile, another model for large-scale genome sequencing projects had emerged: the international consortium. The first genome sequence to be completed by this approach was that of the yeast S. cerevisiae (12.0 Mb) (57), in late 1996. This was also the first eukaryotic organism to be sequenced. The project involved about 600 scientists in Europe, North America and Japan. The participants included both academic laboratories and large sequencing centers.
The next success of the consortium approach was the genome of the bacterium B. subtilis (4.2 Mb) (58), in 1997. The project began in 1990 with the participation of five European laboratories. The project finally became a consortium of 25 laboratories in six European countries coordinated at the Institut Pasteur by Frank Kunst (coordinator) and Antoine Danchin. A consortium of seven Japanese laboratories, coordinated by Naotake Ogasawara and Hiroshi Yoshikawa at the Nara Institute of Science and Technology, Japan, also participated, as well as one Korean and two US laboratories.
The first animal genome sequenced was that of ‘the worm’ C. elegans (97 Mb) (59), in 1998. The authorship of this work was simply ‘The C. elegans Sequencing Consortium’, which was a collaboration between the Washington University Genome Sequencing Center in the United States and the Sanger Centre in the UK.
In 1996, ABI introduced the first commercial DNA sequencer that used capillary electrophoresis rather than a slab gel (the ABI Prism 310), and in 1998 the ABI Prism 3700 with 96 capillaries was announced. For the first time DNA sequencing was truly automated. The considerable labor of pouring slab gels was replaced with automated reloading of the capillaries with polymer matrix. Samples for electrophoresis were automatically loaded from 96-well plates rather than manually loaded as in the previous generation of sequencers. Celera Genomics was founded by Applera Corporation (the parent company of ABI) and Craig Venter in May 1998 to exploit these new machines by applying Venter's methods for WGS sequencing to the human genome, in direct competition with the publicly funded Human Genome Project. Celera acquired 300 of the machines, each capable of producing 1.6 × 10⁵ bases of sequence data per day, for a total theoretical capacity of ~5 × 10⁷ bases of raw sequence data per day.
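The capacity figure follows directly from the two numbers quoted above, as this short check shows. The function name `fleet_capacity` is hypothetical; the inputs are the machine count and per-machine output given in the text.

```python
def fleet_capacity(machines: int, bases_per_machine_per_day: int) -> int:
    """Theoretical raw output per day if every machine runs at its rated output."""
    return machines * bases_per_machine_per_day


# 300 sequencers at 1.6 x 10^5 bases per machine per day
total = fleet_capacity(300, 160_000)
print(f"{total:.1e} bases per day")  # 4.8e+07, i.e. roughly 5 x 10^7
```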
Celera chose the D. melanogaster genome to test the applicability of the WGS approach to a complex eukaryotic genome (60). This involved a scientific collaboration between the scientists at Celera and those of the Berkeley and European Drosophila Genome Projects. These projects had finished 29 Mb of the 120 Mb euchromatic portion of the genome. (About one-third of the 180 Mb Drosophila genome is centromeric heterochromatin.) Using the WGS approach, data were collected over a 4-month period that provided more than 12× coverage of the euchromatic portion of the genome. The results validated the data produced by the ABI 3700s, the applicability of the WGS approach to eukaryotic genomes, and the assembly methods developed at Celera (61). This was a nearly ideal test case because the WGS data could be analyzed separately and then portions of it could be compared with finished sequence already produced by the Drosophila Genome Projects. At the same time the sequence information provided a valuable resource for Drosophila genetics. More than 40 scientists at an ‘Annotation Jamboree’ did initial annotation of the sequence. These scientists, mainly drawn from the Drosophila research community, met at Celera for a 2-week period to identify genes, predict functions, and begin a global synthesis of the genome sequence information.
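To give a sense of scale for the coverage figure quoted for the Drosophila WGS data, the sketch below computes the raw sequence implied by 12× coverage of 120 Mb. It also applies the standard idealized Poisson model of random shotgun sampling, in which the expected fraction of bases never sampled at coverage c is e^(-c). Both function names are hypothetical, the figure is assumed to be average sequence coverage, and the model assumes uniformly random reads, which real libraries only approximate.

```python
import math


def raw_bases_needed(coverage, target_size_bp):
    """Total read bases implied by a given average coverage."""
    return coverage * target_size_bp


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)


euchromatin_bp = 120_000_000  # 120 Mb euchromatic portion, from the text
coverage = 12                 # "more than 12x", from the text

total_bases = raw_bases_needed(coverage, euchromatin_bp)
uncovered_bp = expected_uncovered_fraction(coverage) * euchromatin_bp

print(f"raw sequence: {total_bases:.2e} bases")
print(f"expected uncovered under the idealized model: {uncovered_bp:.0f} bases")
```

Under these assumptions the idealized model leaves well under a kilobase unsampled, so gaps in a real assembly at this depth reflect cloning bias and repeats rather than insufficient random sampling.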