An interesting, perhaps provocative question is whether a sufficient number of genomes have already been sequenced. Most biologists subscribe to a “the more the merrier” view [4], but others have argued that microbial genomics has already reached the stage of diminishing returns, such that each new genome yields information of progressively decreasing utility [5]. There seems to be some substance to this claim; for example, it is unlikely that we will ever see a single bacterial chromosome that is much longer than 13,033,779 nucleotides (as in the myxobacterium Sorangium cellulosum). At the other end of the spectrum, the intracellular cicada symbiont Candidatus Hodgkinia cicadicola, with its 143,795-bp genome, could be considered a cellular organelle rather than an independent organism or, at best, a bacterium well on its way to becoming an organelle [7], so there is hardly any room for further genome reduction of cellular life forms. With respect to other common parameters, such as G+C content, the number of encoded proteins, and metabolic and signaling complexity [8], the extremes might already have been reached, or will be in the near future. Perhaps more importantly, the set of highly conserved genes (that is, those represented in the majority of genomes) is clearly approaching saturation [9]. Similarly, in structural genomics projects, the chances of discovering a new protein fold or even a new superfamily are progressively dropping.
Nevertheless, genome sequencing is here to stay, and there are several compelling reasons for that. First of all, the value of the sequence information is in the eye of the beholder. Many biologists still passionately argue for sequencing their own favorite organism, strain or isolate, no matter how many close relatives have already been sequenced. Indeed, not having a genome sequence for an experimental model is increasingly - and for good reasons - perceived as being stuck in the "dark ages". The availability of the genome sequence allows researchers to easily clone and express any gene, create microarrays to analyze gene expression, and reconstruct the metabolic and signaling networks. Having genomic sequences from closely related organisms opens the door to the quantitative study of mutational patterns, selective regimes, adaptations to ecological factors and, in the case of microbial pathogens, virulence determinants. Potentially even more important is the ability to identify genes and traits that are not present in the given genome - a task that clearly requires a complete genome sequence.
Secondly, the available genome collection, despite its rapid expansion, still barely scratches the surface of the real biological diversity. The availability of genomic data has already led to a revolution in systematics, especially with regard to bacteria and archaea, putting this field on a solid evolutionary footing and giving rise to the new discipline of phylogenomics [10]. Still, judging from the metagenomic data, as many as 90% of the microbial species on Earth remain uncultivated [12], which complicates reconstruction of the global carbon and nitrogen cycles. Genome analysis has already led to several important advances in these areas. Thus, the genome of the marine α-proteobacterium SAR11 (now renamed Candidatus Pelagibacter ubique), apparently the most abundant organism on this planet, opened our eyes to a peculiar role of bacteriorhodopsin-mediated photosynthesis as an auxiliary energy source in the extremely streamlined metabolism of this bacterium [14]. The genome sequence of the deep-sea proteobacterium Idiomarina loihiensis revealed mostly proteolytic, in contrast to the expected saccharolytic, metabolism [15], indicating that the marine habitat of this bacterium contains enough dissolved protein to support a peptide-based diet. The genomes of recently discovered anammox bacteria have yielded valuable insights into the evolution of the global nitrogen cycle and the biochemical reactions that convert nitrate and nitrite into nitrogen gas [16]. This list of unexpected discoveries with biogeochemical implications could be easily extended.
Thirdly, hidden sampling biases in genome sequencing are becoming apparent. For example, starting with Mycoplasma genitalium in 1995, more than 20 mollicute genomes have been sequenced, none of which encoded a single environmental sensor [17]. However, the perception that mollicutes have no signal transduction systems was shattered upon the completion of the (slightly larger) genome of the soil mollicute Acholeplasma laidlawii, which encodes two sensory histidine kinases, three response regulators, an adenylate cyclase, and at least 15 proteins involved in c-di-GMP-mediated regulation (http://www.ncbi.nlm.nih.gov/Complete_Genomes/SignalCensus.html).
Fourthly, although obtaining complete genome sequences from every major lineage [4] would certainly be a dramatic step forward, a single representative genome is by no means sufficient to assess the true biological diversity of a taxon. As a case in point, the sequencing of several genomes from the cyanobacterium Prochlorococcus marinus - a widespread inhabitant of ocean surface waters - was originally aimed at establishing the principal differences between “high-light” and “low-light” ecotypes [18]. However, different strains of P. marinus proved to have vastly different gene repertoires, indicative of high rates of gene acquisition and loss by these organisms. These findings have shown that: (i) the core set of genes shared by all P. marinus isolates is very limited – and shrinking; and (ii) the P. marinus pan-genome, that is, the sum total of genes represented in at least one P. marinus strain, is extremely large – and expanding [19]. This crucial yet unexpected development calls into question the very rationale for assigning organisms with dramatically different genome contents – but (nearly) identical 16S rRNA sequences – to the same “species” (such as P. marinus or Escherichia coli) and brings the study of pan-genomes to the forefront of genomic research.
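The core/pan-genome distinction can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: each strain is represented as a set of gene-family identifiers, and the strain names and gene labels in `TOY_STRAINS` are hypothetical, not real P. marinus data. Real analyses also need an orthology-clustering step to decide which genes count as "the same", which is omitted here.

```python
def core_and_pan_genome(gene_sets):
    """Return (core, pan) for an iterable of per-strain gene-family sets.

    core: gene families present in every strain (set intersection)
    pan:  gene families present in at least one strain (set union)
    """
    gene_sets = [set(g) for g in gene_sets]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan


def accumulation_curve(gene_sets):
    """Core and pan-genome sizes as strains are added one at a time.

    Returns a list of (number_of_strains, core_size, pan_size) tuples,
    following the order in which the strains are given.
    """
    gene_sets = list(gene_sets)
    curve = []
    for n in range(1, len(gene_sets) + 1):
        core, pan = core_and_pan_genome(gene_sets[:n])
        curve.append((n, len(core), len(pan)))
    return curve


# Hypothetical toy data: strain names and gene labels are made up.
TOY_STRAINS = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g3", "g6", "g7"},
}

if __name__ == "__main__":
    for n, core_size, pan_size in accumulation_curve(TOY_STRAINS.values()):
        print(f"{n} strain(s): core = {core_size}, pan = {pan_size}")
```

On this toy input the core shrinks from 4 to 2 to 1 gene families while the pan-genome grows from 4 to 5 to 7, the same qualitative pattern described above for P. marinus. The curve depends on the order in which strains are added, so in practice one would typically average over many random orderings.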
Finally, there remains the crucial issue of using genome sequencing to improve human health. For obvious reasons, the first sequenced genomes were mostly those of common bacterial pathogens. Then the human genome and representative genomes from popular model organisms emerged. As sequencing costs continue to decrease, the use of genomic data for fighting disease becomes more and more attractive. For many bacterial pathogens, multiple strains have been sequenced, often providing clues to the virulence factors, host specificity and drug resistance. Some biologists advocate developing a system of constant genome-based monitoring of various points on the globe, hoping to catch new emerging pathogens before they cause a new epidemic. Such an effort is already well underway for influenza viruses [20]. The human cancer genome project aims at sequencing thousands of tissue samples from various tumors, in hopes of delineating the whole spectrum of mutations that could contribute to cancer [22]. Although this approach has been criticized [6], the prospect of obtaining the full list of potentially oncogenic mutations – thereby achieving a “complete understanding” of the causes of cancer – is certainly too attractive to pass up.