Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Trends Biotechnol. Author manuscript; available in PMC 2011 March 29.
Published in final edited form as:
PMCID: PMC3065831

From complete genome sequence to “complete“ understanding?


The rapidly accumulating genome sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. A major problem in genomics is the widening gap between the rapid progress in genome sequencing and the comparatively slow progress in the functional characterization of sequenced genomes. Here we discuss two key questions of genome biology: whether we need more genomes, and how deep is our understanding of biology based on genomic analysis. We argue that overly specific annotations of gene functions are often less useful than the more generic, but also more robust, functional assignments based on protein family classification. We also discuss problems in understanding the functions of the remaining “conserved hypothetical” genes.


The year 2010 marks the 15th anniversary of the publication of the 1,830,138-base genome of the bacterium Haemophilus influenzae Rd Kw20 - the first cellular life form to have its entire genome sequenced [1]. Aided by the tremendous progress in sequencing technology, genome sequencing is advancing at an ever-increasing pace. By the end of 2009, 1052 genomes representing 720 individual species (636 bacteria, 61 archaea, and 23 eukaryotes) were completely sequenced, deposited in the public nucleotide sequence databases (GenBank\EMBL\DDBJ) and made freely available over the internet. Many more genomes were at various stages of sequencing and assembly, including almost 100 eukaryotic genomes whose preliminary descriptions have been published [2]. Thanks to the advent of the new generation of sequencing technologies, the costs of genome sequencing have dropped so much that the projects to sequence the entire human microbiome (, [3]) and to generate ~5,000 reference genomes for every major prokaryotic lineage (the Genomic Encyclopedia of Bacteria and Archaea:, [4]) have become realistic. Given these remarkable advances, it seems timely to address two lingering questions: ‘How many more genomes do we need?’ and ‘How deep is our understanding of biology derived from genome analysis?’

Don’t we already have enough genomes?

An interesting, perhaps provocative question is whether a sufficient number of genomes have already been sequenced. Most biologists subscribe to “the more the merrier“ view [4], but others have argued that microbial genomics has already reached the stage of diminishing returns, such that each new genome yields information of progressively decreasing utility [5, 6]. There seems to be some substance to this claim; for example, it is unlikely that we ever see a single bacterial chromosome that is much longer than 13,033,779 nucleotides (as in the myxobacterium Sorangium cellulosum). On the other end of the spectrum, intracellular cycada symbiont Candidatus Hodgkinia cicadicola, with its 143,795-bp genome, could be considered a cellular organelle rather than an independent organism or, at best, a bacterium far on its way to become an organelle [7], so there is hardly any room for further genome reduction of cellular life forms. With respect to other common parameters, such as G+C content, the number of encoded proteins, and metabolic and signaling complexity [8], the extremes might already have been reached, or will be in the near future. Perhaps more importantly, the set of highly conserved genes (that is, those represented in the majority of genomes) is clearly approaching saturation [9]. Similarly, in structural genomics projects, the chances of discovering a new protein fold or even a new superfamily are progressively dropping.

Nevertheless, genome sequencing is here to stay, and there are several compelling reasons for that. First of all, the value of the sequence information is in the eye of the beholder. Many biologists still passionately argue for sequencing their own favorite organism, strain or isolate, no matter how many close relatives already have been sequenced. Indeed, not having a genome sequence for an experimental model is increasingly - and for good reasons - perceived as being stuck in the "dark ages". The availability of the genome sequence allows researchers to easily clone and express any gene, create microarrays to analyze gene expression, and reconstruct the metabolic and signaling networks. Having genomic sequences from closely related organisms opens the door to the quantitative study of mutational patterns, selective regimes, adaptations to ecological factors and, in the case of microbial pathogens, virulence determinants. Potentially even more important is the possibility to identify genes and traits that are not present in the given genome - a task that clearly requires a complete genome sequence.

Secondly, the available genome collection, despite its rapid expansion, still barely scratches the surface of the real biological diversity. The availability of genomic data already led to a revolution in systematics, especially with regard to bacteria and archaea, having put this field on a solid evolutionary footing and giving rise to the new discipline of phylogenomics [10, 11]. Still, judging from the metagenomic data, as many as 90% of the microbial species on Earth remain uncultivated [12, 13], which complicates reconstruction of the global carbon and nitrogen cycles. Genome analysis has already led to several important advances in these areas. Thus, the genome of the marine α-proteobacterium SAR11 (now renamed Candidatus Pelagibacter ubique), apparently the most abundant organism on this planet, opened our eyes to a peculiar role of bacteriorhodopsin-mediated photosynthesis as an auxiliary energy source in the extremely streamlined metabolism of this bacterium [14]. The genome sequence of the deep-sea proteobacterium Idiomarina loihiensis revealed mostly proteolytic, in contrast to the expected saccharolytic, metabolism [15], indicating that the marine habitat of this bacterium contains enough dissolved protein to support a peptide-based diet. The genomes of recently discovered anammox bacteria have yielded valuable insights into the evolution of the global nitrogen cycle and the biochemical reactions that convert nitrate and nitrite into nitrogen gas [16]. This list of unexpected discoveries with biogeochemical implications could be easily extended.

Thirdly, hidden sampling biases in genome sequencing are becoming apparent. For example, starting with Mycoplasma genitalium in 1995, more than 20 mollicute genomes have been sequenced, none of which encoded a single environmental sensor [17]. However, the perception that mollicutes have no signal transduction systems was shattered upon the completion of the (slightly larger) genome of the soil mollicute Acholeplasma laidlawii, which encodes two sensory histidine kinases, three response regulators, an adenylate cyclase, and at least 15 proteins involved in c-di-GMP-mediated regulation (

Fourthly, although obtaining complete genome sequences from every major lineage [4] would certainly be a dramatic step forward, a single representative genome is by no means sufficient to assess the true biological diversity of a taxon. As a case in point, the sequencing of several genomes from the cyanobacterium Prochlorococcus marinus - a widespread inhabitant of ocean surface waters - was originally aimed at establishing the principal differences between “high-light” and “low-light” ecotypes [18]. However, different strains of P. marinus proved to have vastly different gene repertoires, indicative of high rates of gene acquisition and loss by these organisms. These findings have shown that: (i) the core set of genes shared by all P. marinus isolates is very limited – and shrinking; and (ii) the P. marinus pan-genome, that is the sum total of genes represented in at least one P. marinus strain, is extremely large – and expanding [19]. This crucial yet unexpected development puts into question the very rationale for assigning organisms with dramatically different genome contents – but (nearly) identical 16S rRNA sequences – to the same “species” (such as P. marinus or Escherichia coli) and puts the study of pan-genomes to the forefront of genomic research.

Finally, there remains the crucial issue of using genome sequencing to improve human health. For obvious reasons, the first sequenced genomes were mostly those of common bacterial pathogens. Then the human genome and representative genomes from popular model organisms emerged. As sequencing costs continue to decrease, the use of genomic data for fighting disease becomes more and more attractive. For many bacterial pathogens, multiple strains have been sequenced, often providing clues to the virulence factors, host specificity and drug resistance. Some biologists advocate developing a system of constant genome-based monitoring of various points on the globe, hoping to catch new emerging pathogens before they cause a new epidemic. Such an effort is already well underway for influenza viruses [20, 21]. The human cancer genome projects aims at sequencing thousands of tissue samples from various tumors, in hopes of delineating the whole spectrum of mutations that could contribute to cancer [22]. Although this approach has been criticized [6], the perspective of obtaining the full list of potentially oncogenic mutations – thereby achieving a “complete understanding” of the causes of cancer – is certainly too attractive to pass.

What part of the genome do you not understand?

With sequenced genomes being released almost every day, how well do we understand the functions of the genes in each new one? The answer to this question will depend on the exact meaning of the word “understanding” (as well as “function”). Modern dictionaries associate “understanding” with such terms as “appreciation”, “comprehension”, “explanation”, “insight”, “interpretation”, “knowledge”, and “mastery”. Accordingly, understanding a genome starts from the “knowledge” of the nucleotide sequence and the sequences of encoded proteins and RNAs, and includes “interpretation” of their functions, “insight” into their complex interactions, and “explanation” of the evolutionary history that shaped each particular genome. This leads to the “comprehension” of the potential activity of each component of the cell, which must be tempered by the “appreciation” that proteins often have additional (e.g. moonlighting [23, 24]) function. Finally, this understanding can be extended into “mastery” – the ability to modify the genome for certain (e.g. biotechnological) applications. Therefore, the problem of understanding the genome can be rephrased as follows: how good is the “parts list” that is compiled for each genome in the form of functional annotation of the predicted protein-coding and RNA-coding genes?

Obviously, this list is never complete. Almost 10 years ago, Peer Bork described the “70% hurdle”: on average, for approximately one-third of the genes in any given genome, the functions could not be predicted through traditional methods of genome analysis; perhaps even worse, the accuracy of functional prediction was only ~70% for the remaining genes [25]. Bork warned that hopes to cross this 70% barrier and achieve a better understanding of the functional content of genomes with the help of high-throughput analytical methods would be tempered by the fact that these methods themselves have high error rates and are most effective when used in concert [25]. Looking back, Bork’s sobering prediction was right on target. High-throughput analyses of gene and protein expression, protein-protein interactions, and ligand binding led to a dramatic increase in the amount of data pertaining to any given gene in model genomes [26]. However, as illustrated in Box 1, accumulation of such data does not necessarily translate into clarity regarding gene function, at least not immediately, and not without much work.

Box 1. The long and circuitous path from experiment to “understanding”

Many “conserved hypothetical” genes remain without an assigned function simply because they have never been studied experimentally. In other cases, experimental studies brought contradictory results that could not be easily reconciled. To illustrate the difficulty of, in the words of Sydney Brenner, "converting data into knowledge and knowledge into understanding", let us consider the history of functional characterization of three widespread genes.

YgjD/Gcp/Kae1/QRI1/OSGEPL family

The E. coli ygjD (gcp) gene has orthologs in almost every bacterial, archaeal and eukaryotic genome. In many eukaryotes it is found in two paralogous copies, such as QRI7 and Kae1 in yeast, At4g22720 and At2g45270 in Arabidopsis thaliana, and OSGEPL and OSGEPL1 in human. In addition, there is a family of more distant bacterial paralogs, represented by E. coli YeaZ and B. subtilis YdiC. We have previously discussed the potential functions of this protein family (which contains an actin/HSP70 superfamily ATPase domain), and expressed doubts about its annotation as "O-sialoglycoprotease", which was based on a single experimental observation, and further suggested an association of this protein with translation (e.g. co-translational degradation of misfolded proteins) [39]. In the past several years, proteins of this family have been studied in several model organisms, and the crystal structures of several family members have been solved [46, 59]. An archaeal YgjD family member showed no protease activity, but has been reported to bind DNA and possess an apurinic endonuclease activity [62]. In yeast, Kae1 is a subunit of the KEOPS complex which regulates transcription, telomere uncapping and telomere length, and is required for cell growth; this protein is targeted to mitochondria and appears to be essential for genome maintenance [46]. Despite all these observations, the actual function of the YgjD family proteins remains enigmatic [46, 60]. A recent study suggested their involvement in biosynthesis of threonylcarbamoyladenosine (t6A), a universal tRNA base modification occurring at position 37 in a subset of tRNAs decoding the ANN codons [63]. If so, translational defects resulting from impaired t6A biosynthesis could explain at least some properties of the ygjD mutants.

YebC/YrbC/DUF28/UPF0082 family

The E. coli yebC is another widespread gene with orthologs in most bacterial and eukaryotic genomes. In some organisms, there are two paralogs, such as yebC and yeeN in E. coli, or yeeI and yrbC in B. subtilis. Products of these genes are listed as “domain of unknown function” (DUF28, PF01709) in the Pfam database [29] and as uncharacterized protein family UPF0082 in UniProt [40]. Crystal structures of three members of this family have been solved [64], but those structures have provided no clear indication of protein function. Analysis of the genome neighborhoods of the yebC-like genes has revealed potential association with the Holliday junction resolvase RuvABC and suggested that YebC might be involved in DNA recombination and repair in bacteria and mitochondria [39]. Indeed, a recent study demonstrated DNA binding by a Pseudomonas aeruginosa YebC family protein, and suggested a role in transcription regulation [65]. However, another study [48] mapped the cytochrome c oxidase deficiency in humans (late-onset Leigh syndrome) to a mutation in the human YebC ortholog CCDC44 (renamed TACO1) and concluded that this protein was required for the proper translation of the mitochondrial COX1 protein. Three possible mechanisms of YebC action to ensure translation of the full-length COX1 polypeptide were considered: (i) securing an accurate start of translation, (ii) stabilizing the elongating polypeptide, and (iii) interacting with the peptide release factor [48]. While involvement in translation appears very likely for such a widespread protein family, its apparent capacity to bind DNA remains to be confirmed and/or explained.

YjgF/YabJ/YER057c/UK114 family

The E. coli yjgF gene has highly conserved homologs in bacteria, archaea and eukaryotes, often with multiple paralogs in the same genome. Representatives of the YjgF protein family are known as "purine regulatory protein YabJ" in B. subtilis and as "tumour-associated antigen UK114" in human and other mammals. Members of this family have been reported to possess ribonuclease activity, to function as a molecular chaperone, calpain activator, transcriptional regulator, and translational inhibitor, and also to affect photosynthesis, isoleucine biosynthesis and mitochondrial genome maintenance (reviewed in [66, 67]). Crystal structures of bacterial, archaeal and eukaryotic members of this family have been solved, revealing an inter-subunit cleft that is capable of binding a variety of small molecule ligands [58]. Despite all these efforts, the cellular functions of the members of the YjgF family remain unclear.

Another important issue here is the definition of “function”, particularly as it applies to (semi)-automated genome annotations. For a limited set of essential genes, the notion of function seems quite straightforward: the function is what the gene product needs to do to allow cell growth. Operationally, if a gene is knocked-out, the cell dies, and the cause of death can be assumed to be the function of the gene in question. For non-essential genes the picture is more complicated as the functions of many, if not most, proteins are inherently multifaceted and complicated. For example, a single oxidoreductase would use a range of substrates and a variety of electron acceptors, making a precise functional assignment difficult, if not impossible. Should the function be assigned on the basis of the substrate with the lowest KM or the highest Vmax, or the one that is most likely to be physiologically relevant? This problem becomes particularly severe for high-throughput enzyme assays, which helped assign general biochemical functions to products of previously uncharacterized genes, but were often unable to pinpoint the natural substrates for the respective enzymes [27, 28]. Furthermore, many proteins, particularly in eukaryotes, lack any (known) enzymatic activity and appear to function exclusively in protein-protein interactions. It could be argued that the “understanding” of a protein function should, at the very least, include knowledge of (i) biochemical activity, if any (i.e. the nature of the catalyzed reaction and the range of utilized substrates and products); (ii) the biological process (pathway, stress response, cell cycle) for which this activity is (most) important; and (iii) the evolutionary aspects, such as phyletic distribution, level of sequence conservation, and frequency of mutation, gene loss and/or non-orthologous gene displacement.

Owing to the paucity of experimental data, this information is rarely available in its entirety, and functional assignments for the majority of the genes are based solely on the sequence similarity of their products to experimentally characterized proteins in a handful of model organisms such as E. coli, Bacillus subtilis, yeast, Dictyostelium, Drosophila, Caenorhabditis elegans, zebrafish, or mouse. Automatic transfer of functional annotation often leads to confusion when, for example, the product of a widespread prokaryotic gene (ytaB in B. subtilis) is often annotated as “mitochondrial benzodiazepine receptor” (it is hard to imagine why B. subtilis and hundreds of other bacteria and archaea would need a receptor for Valium, even apart from the fact that they have no mitochondria). Alternative annotations for the products of this gene family in various organisms include “tryptophan-rich sensory protein TspO”, “carotenoid biosynthetic protein CrtK” and “18 kD translocator protein”. However, despite these discordant annotations, there is little doubt that all members of the TspO/MBR protein family (Pfam family PF03073 [29]) are very similar and perform closely related – and important – functions, which remain to be uncovered [30]. In our opinion, a more productive route towards sensible functional annotation is to replace annotation of individual proteins (particularly, those from poorly studied organisms) with annotation of protein families. In essence, this approach substitutes protein classification (something that we generally know how to do) for specific protein annotation, which except possibly for a handful of obvious cases, will remain questionable until each protein is experimentally characterized, even when predictions appear entirely plausible and supported by high similarity to experimentally characterized homologs and/or operon structure. As experimental assays increasingly lag behind the avalanche of genomic data, such experimental validation of predicted protein function becomes progressively less likely. By contrast, protein family classification is becoming increasingly robust. As an example, recognition of a Fis-type or a winged helix-turn-helix domain allows the recognition of a protein as a DNA-binding transcriptional regulator although the regulated genes (operons) may be difficult to predict [31]. Likewise, numerous membrane proteins are reliably recognized as transporters, whereas the range of their substrates often remains uncertain. The notable success of protein family databases, such as Pfam, InterPro, COGs and CDD [29, 3234], is probably due not only to the fact that these were – and still are – comprehensive collections with many useful features. It could be argued that another key to their success lies in the abandonment of the elusive goal of annotating every single sequence and instead concentrating on the common traits of protein families. In doing so, these databases provided a reasonably robust common framework for annotating the entire protein sets encoded in newly sequenced genomes. Thus, annotation pipelines used at most genome sequencing projects now include comparison against at least one of the available protein family databases.

It is important to note that family assignment is only the first step towards understanding, which, as discussed above, requires knowledge of both the biochemical activity of the protein and the cellular process in which the protein is involved. As an example, the sequence-based prediction that the conserved bacterial protein Era is a GTPase was a good first step in its characterization, and recognition of its involvement in translation was another step forward. However, “true understanding” of the role of this GTPase in the translation process – and its proper functional annotation – came only after an experimental study that revealed the participation of Era in processing and maturation of 16S rRNA [35].

“Conserved hypothetical” and “putative uncharacterized” genes: when and how will their functions become known?

Even in the relatively well-studied model organisms, the great majority of genes have never been experimentally characterized. E. coli K-12 and yeast Saccharomyces cerevisiae appear to be the only organisms for which at least 50% of the genes have been studied experimentally [3638]. Despite the best efforts of experimental and computational biologists, a substantial – and constantly growing, given the acceleration of genome sequencing – number of deduced proteins have no known function (Figure 1). This is hardly surprising in case of lineage-specific genes that are found, for example, only in Vibrio or Burkholderia - bacterial lineages that are extensively sampled by genome sequencing, but do not include well-characterized model organisms. However, some genes that are widespread among bacteria, archaea and/or eukaryotes still remain without functional annotation [39]. The protein products of these genes have been variously referred to as “hypothetical”, “conserved hypothetical”, “uncharacterized” or even “putative uncharacterized” (as of May 1, 2010, 3,118,564 proteins in UniProt were annotated this way [40]). Several lists of “conserved hypothetical” proteins have been compiled, including Domains of Unknown Function (DUFs) in Pfam, R- and S-COGs in the COG database, and Uncharacterized Protein Families (UPFs) in UniProtKB\Swiss-Prot [29, 33, 40]. These lists have been extensively used to guide structural genomics efforts, which resulted in structural (albeit usually not functional) characterization of many such proteins [41, 42].

Figure 1
Accumulation of protein sequences of unknown function in the genome databases. Open symbols indicate the total number of protein sequences encoded in prokaryotic (blue) and eukaryotic (red) genomes; filled symbols indicate the number of “hypothetical” ...

To highlight the distinction between the “hypothetical” genes whose functions remained completely unknown and those that could be assigned a general biochemical function (e.g. a methyltransferase, an oxidoreductase, a transcriptional regulator or a membrane transporter), we denoted the former category of genes “unknown unknown” and the latter category “known unknown” [39]. The “known unknown” category includes also genes of unknown biochemical function that have (partially) known cellular function, such as a “cell division protein” or a “stress response protein”. In purely operational terms, there are more or less clear ways of establishing function for “known unknown” genes, but not for “unknown unknowns”.

Six years ago we analyzed widely conserved “hypothetical” genes and compiled the “top 10” lists of “known unknown” and “unknown unknown” genes [39]. A re-examination of these lists shows that, despite mounting observations, nearly half of those genes still remain without an assigned function (Tables 1 and and2).2). Some of the genes in the two lists have been experimentally characterized, and in a few cases the function has been established [43]. In eukaryotes, products of some, albeit not all, of these widely conserved genes appear to be targeted to mitochondria [4448]. In two instances, mutations in these genes were linked to mitochondrial diseases, such as hereditary paraganglioma [44] and the late-onset Leigh syndrome [48]. In other cases, however, experimental results were contradictory (Box 1). Apparently, the problem was not in the lack of effort to characterize these genes, but in the pleiotropic phenotypes of their mutations, which made it difficult to pinpoint the primary function.

Table 1
Updated “top 10” list of widespread “known unknown” genes
Table 2
Updated “top 10” list of widespread “unknown unknown” genes

What could be the functions of all those “hypothetical” genes?

Given that universally conserved genes are typically involved in translation, transcription or ribosome biogenesis [9], widespread genes are likely to function in these or related processes as well. Indeed, several recently characterized “conserved hypothetical” genes are involved in post-transcriptional modification of tRNA [43, 49]. Besides, a significant number of “orphan” enzyme activities have not been associated with any protein sequence [50], suggesting that some “hypothetical” genes might have well-known functions. Characterization of the “missing” genes in various metabolic pathways allowed assigning functions to a number of formerly “conserved hypothetical” genes [51, 52]}.

Less common “hypothetical” genes are far more abundant in the genomes of free-living organisms than in the relatively streamlined genomes of parasites, symbionts and saprophytes [53]. Based on the observation that the fraction of metabolic and particularly regulatory genes increases with the genome size [17, 54, 55], sophisticated regulation of gene expression and complex (secondary) metabolism, including various post-transcriptional and post-translational modifications, appear to be plausible roles for a fair number of the remaining uncharacterized genes.

Recent studies have highlighted an additional class of functions that might account for the abundance of uncharacterized genes in free-living organisms, namely, detoxification (usually hydrolysis) of potentially hazardous side-products of various metabolic reactions [56]. These activities, commonly referred to as “house-cleaning”, are particularly important for aerobic organisms that have to cope with spontaneous oxidation of nucleotides, amino acids, lipids, and other cellular components. For example, the recently characterized “conserved hypothetical” gene yebR (renamed msrC) has been shown to encode an enzyme that hydrolyzes methionine-(R)-sulfoxide, a product of methionine oxidation [57]. Other cellular reactions that might require house-cleaning include methylation, acetylation and adenylation, among potentially many others. It is probably no coincidence that many poorly characterized proteins appear to function as hydrolases [27, 28].

Finally, it has to be kept in mind that a considerable fraction of genes in many genomes might not have definable cellular functions, but rather originate from viruses and mobile elements and only transiently pass through microbial genomes. Genomes are highly dynamic entities, and each sequence is a temporal snapshot that is likely to include many short-lived elements that are not maintained by selection. The very notion of annotation for such “selfish” genes is different from that applied to “regular” genes with distinct cellular functions [9].

Concluding remarks

In conclusion, it might be worthwhile to make several basic generalizations regarding genomes and the understanding of gene functions:

  • Functions of many widespread genes are known; all universal genes are involved in translation [9]
  • Widespread genes with unknown functions remain uncharacterized for a reason: they often affect multiple processes and their mutations typically are pleiotropic (Box 1)
  • The functions of a substantial fraction of genes in each sequenced genome remain unknown
  • Not every experiment on an unknown gene yields useful clues regarding function.
  • Structural characterization of a protein rarely gives direct clues to its function [41, 42, 58]).
  • Analysis of gene expression rarely gives direct clues to gene functions
  • Delineation of a protein interaction network involving the gene of interest rarely gives direct clues to its function [26, 59, 60]
  • Functional assignments for previously uncharacterized, widely conserved genes are just like any biological discoveries: they require a lot of hard work and a bit of luck

So far there is no single high-throughput approach that would finally reveal the functions of all “hypothetical” genes encoded in the sequenced genomes. This goal may be reachable only through sustained efforts of numerous experimental, computational and structural biologists [61]. At the end of 2009, NIH awarded a grant to the COMputational BRidge to EXperiments (COMBREX,, formerly SciBay) consortium project that aims to coordinate collaborative efforts of various research groups towards computational identification of the most interesting families of “conserved hypothetical” proteins and their experimental characterization ( This project seems to have considerable potential to accelerate functional characterization of the remaining “hypothetical” genes, thereby bringing us closer to the “complete understanding” of genomes – and the organisms themselves.


This study was supported by the Intramural Research Program of the National Library of Medicine at the U.S. National Institutes of Health.


1. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269:496–512. [PubMed]
2. Liolios K, et al. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2010;38:D346–D354. [PMC free article] [PubMed]
3. Ley RE, et al. Worlds within worlds: evolution of the vertebrate gut microbiota. Nat. Rev. Microbiol. 2008;6:776–788. [PMC free article] [PubMed]
4. Wu D, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–1060. [PMC free article] [PubMed]
5. Whitworth DE. Genomes and knowledge - a questionable relationship? Trends Microbiol. 2008;16:512–519. [PubMed]
6. Kaiser J. A skeptic questions cancer genome projects. ScienceInsider, 23 April 2010. 2010 (
7. McCutcheon JP, et al. Origin of an alternative genetic code in the extremely small and GC-rich genome of a bacterial symbiont. PLoS Genet. 2009;5 e1000565. [PMC free article] [PubMed]
8. Galperin MY, Kolker E. New metrics for comparative genomics. Curr. Opin. Biotechnol. 2006;17:440–447. [PMC free article] [PubMed]
9. Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008;36:6688–6719. [PMC free article] [PubMed]
10. Eisen JA, Fraser CM. Phylogenomics: intersection of evolution and genomics. Science. 2003;300:1706–1707. [PubMed]
11. Koonin EV. The origin and early evolution of eukaryotes in the light of phylogenomics. Genome Biol. 2010;11:209. [PMC free article] [PubMed]
12. Hugenholtz P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 2002;3 REVIEWS0003. [PMC free article] [PubMed]
13. DeLong EF. The microbial ocean from genomes to biomes. Nature. 2009;459:200–206. [PubMed]
14. Giovannoni SJ, et al. Genome streamlining in a cosmopolitan oceanic bacterium. Science. 2005;309:1242–1245. [PubMed]
15. Hou S, et al. Genome sequence of the deep-sea gamma-proteobacterium Idiomarina loihiensis reveals amino acid fermentation as a source of carbon and energy. Proc. Natl. Acad. Sci. USA. 2004;101:18036–18041. [PubMed]
16. Klotz MG, Stein LY. Nitrifier genomics and evolution of the nitrogen cycle. FEMS Microbiol. Lett. 2008;278:146–156. [PubMed]
17. Galperin MY. A census of membrane-bound and intracellular signal transduction proteins in bacteria: bacterial IQ, extroverts and introverts. BMC Microbiol. 2005;5:35. [PMC free article] [PubMed]
18. Rocap G, et al. Genome divergence in two Prochlorococcus ecotypes reflects oceanic niche differentiation. Nature. 2003;424:1042–1047. [PubMed]
19. Scanlan DJ, et al. Ecological genomics of marine picocyanobacteria. Microbiol. Mol. Biol. Rev. 2009;73:249–299. [PMC free article] [PubMed]
20. McHardy AC, Adams B. The role of genomics in tracking the evolution of influenza A virus. PLoS Pathog. 2009;5 e1000566. [PMC free article] [PubMed]
21. Lee CW, et al. Large-scale evolutionary surveillance of the 2009 H1N1 influenza A virus using resequencing arrays. Nucleic Acids Res. 2010;38:e111. [PMC free article] [PubMed]
22. Pleasance ED, et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature. 2010;463:191–196. [PMC free article] [PubMed]
23. Jeffery CJ. Moonlighting proteins. Trends Biochem. Sci. 1999;24:8–11. [PubMed]
24. Sriram G, et al. Single-gene disorders: what role could moonlighting enzymes play? Am. J. Hum. Genet. 2005;76:911–924. [PubMed]
25. Bork P. Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res. 2000;10:398–400. [PubMed]
26. Jensen LJ, et al. STRING 8--a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res. 2009;37:D412–D416. [PMC free article] [PubMed]
27. Kuznetsova E, et al. Enzyme genomics: Application of general enzymatic screens to discover new enzymes. FEMS Microbiol. Rev. 2005;29:263–279. [PubMed]
28. Kuznetsova E, et al. Genome-wide analysis of substrate specificities of the Escherichia coli haloacid dehalogenase-like phosphatase family. J. Biol. Chem. 2006;281:36149–36161. [PubMed]
29. Finn RD, et al. The Pfam protein families database. Nucleic Acids Res. 2010;38:D211–D222. [PMC free article] [PubMed]
30. Chapalain A, et al. Bacterial ortholog of mammalian translocator protein (TSPO) with virulence regulating activity. PLoS One. 2009;4:e6096. [PMC free article] [PubMed]
31. Galperin MY. Diversity of structure and function of response regulator output domains. Curr. Opin. Microbiol. 2010;13:150–159. [PMC free article] [PubMed]
32. Hunter S, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–D215. [PMC free article] [PubMed]
33. Tatusov RL, et al. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28:33–36. [PMC free article] [PubMed]
34. Marchler-Bauer A, et al. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009;37:D205–D210. [PMC free article] [PubMed]
35. Tu C, et al. Structure of ERA in complex with the 3' end of 16S rRNA: implications for ribosome biogenesis. Proc. Natl. Acad. Sci. USA. 2009;106:14843–14848. [PubMed]
36. Riley M, et al. Escherichia coli K-12: a cooperatively developed annotation snapshot--2005. Nucleic Acids Res. 2006;34:1–9. [PMC free article] [PubMed]
37. Keseler IM, et al. EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res. 2009;37:D464–D470. [PMC free article] [PubMed]
38. Christie KR, et al. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol. 2009;17:286–294. [PMC free article] [PubMed]
39. Galperin MY, Koonin EV. 'Conserved hypothetical' proteins: prioritization of targets for experimental study. Nucleic Acids Res. 2004;32:5452–5463. [PMC free article] [PubMed]
40. The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 2009;37:D169–D174. [PMC free article] [PubMed]
41. Rigden DJ. Understanding the cell in terms of structure and function: Insights from structural genomics. Curr. Opin. Biotech. 2006;17:457–464. [PubMed]
42. Lee D, et al. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 2007;8:995–1005. [PubMed]
43. El Yacoubi B, et al. The universal YrdC/Sua5 family is required for the formation of threonylcarbamoyladenosine in tRNA. Nucleic Acids Res. 2009;37:2894–2909. [PMC free article] [PubMed]
44. Hao HX, et al. SDH5, a gene required for flavination of succinate dehydrogenase, is mutated in paraganglioma. Science. 2009;325:1139–1142. [PMC free article] [PubMed]
45. Khalimonchuk O, et al. Evidence for a pro-oxidant intermediate in the assembly of cytochrome oxidase. J. Biol. Chem. 2007;282:17442–17449. [PubMed]
46. Oberto J, et al. Qri7/OSGEPL, the mitochondrial version of the universal Kae1/YgjD protein, is essential for mitochondrial genome maintenance. Nucleic Acids Res. 2009;37:5343–5352. [PMC free article] [PubMed]
47. Rudolph C, et al. ApoA-I-binding protein (AI-BP) and its homologues hYjeF_N2 and hYjeF_N3 comprise the YjeF_N domain protein family in humans with a role in spermiogenesis and oogenesis. Horm. Metab. Res. 2007;39:322–335. [PubMed]
48. Weraarpachai W, et al. Mutation in TACO1, encoding a translational activator of COX I, results in cytochrome c oxidase deficiency and late-onset Leigh syndrome. Nat. Genet. 2009;41:833–837. [PubMed]
49. Phillips G, et al. Discovery and characterization of an amidotransferase involved in the modification of archaeal tRNA. J. Biol. Chem. 2010;285:12706–12713. [PMC free article] [PubMed]
50. Pouliot Y, Karp PD. A survey of orphan enzyme activities. BMC Bioinformatics. 2007;8:244. [PMC free article] [PubMed]
51. Osterman A, Overbeek R. Missing genes in metabolic pathways: a comparative genomics approach. Curr. Opin. Chem. Biol. 2003;7:238–251. [PubMed]
52. Hanson AD, et al. 'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering parts list--and how to find it. Biochem. J. 2010;425:1–11. [PMC free article] [PubMed]
53. Kolker E, et al. Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations. Proc. Natl. Acad. Sci. USA. 2005;102:2099–2104. [PubMed]
54. van Nimwegen E. Scaling laws in the functional content of genomes. Trends Genet. 2003;19:479–484. [PubMed]
55. Konstantinidis KT, Tiedje JM. Trends between gene content and genome size in prokaryotic species with larger genomes. Proc. Natl. Acad. Sci. USA. 2004;101:3160–3165. [PubMed]
56. Galperin MY, et al. House cleaning, a part of good housekeeping. Mol. Microbiol. 2006;59:5–19. [PubMed]
57. Lin Z, et al. Free methionine-(R)-sulfoxide reductase from Escherichia coli reveals a new GAF domain function. Proc. Natl. Acad. Sci. USA. 2007;104:9597–9602. [PubMed]
58. Burman JD, et al. The crystal structure of Escherichia coli TdcF, a member of the highly conserved YjgF/YER057c/UK114 family. BMC Struct. Biol. 2007;7:30. [PMC free article] [PubMed]
59. Handford JI, et al. Conserved network of proteins essential for bacterial viability. J. Bacteriol. 2009;191:4732–4749. [PMC free article] [PubMed]
60. Msadek T. Grasping at shadows: revealing the elusive nature of essential genes. J. Bacteriol. 2009;191:4701–4704. [PMC free article] [PubMed]
61. Roberts RJ, et al. An experimental approach to genome annotation. The American Academy of Microbiology colloquium report. 2004 American Society for Microbiology
62. Hecker A, et al. An archaeal orthologue of the universal protein Kae1 is an iron metalloprotein which exhibits atypical DNA-binding properties and apurinic-endonuclease activity in vitro. Nucleic Acids Res. 2007;35:6042–6051. [PMC free article] [PubMed]
63. El Yacoubi B, et al. Function of the YrdC/YgjD conserved protein network: the t6A lead. In: Weil T, Santos M, editors. 23rd tRNA workshop: From the origin of life to biomedicine. 2010. p. 7. (
64. Shin DH, et al. Crystal structure of conserved hypothetical protein Aq1575 from Aquifex aeolicus. Proc. Natl. Acad. Sci. USA. 2002;99:7980–7985. [PubMed]
65. Liang H, et al. The YebC family protein PA0964 negatively regulates the Pseudomonas aeruginosa quinolone signal system and pyocyanin production. J. Bacteriol. 2008;190:6217–6227. [PMC free article] [PubMed]
66. Christopherson MR, et al. YjgF is required for isoleucine biosynthesis when Salmonella enterica is grown on pyruvate medium. J. Bacteriol. 2008;190:3057–3062. [PMC free article] [PubMed]
67. Thakur KG, et al. Mycobacterium tuberculosis Rv2704 is a member of the YjgF/YER057c/UK114 family. Proteins. 2010;78:773–778. [PMC free article] [PubMed]
68. Sayers EW, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009;37:D5–D15. [PMC free article] [PubMed]
69. Koller-Eichhorn R, et al. Human OLA1 defines an ATPase subfamily in the Obg family of GTP-binding proteins. J. Biol. Chem. 2007;282:19928–19937. [PubMed]
70. Kaczanowska M, Ryden-Aulin M. The YrdC protein--a putative ribosome maturation factor. Biochim. Biophys. Acta. 2005;1727:87–96. [PubMed]
71. Krasnikov BF, et al. Identification of the putative tumor suppressor Nit2 as omega-amidase, an enzyme metabolically linked to glutamine and asparagine transamination. Biochimie. 2009;91:1072–1080. [PMC free article] [PubMed]
72. Cooper EL, et al. YsxC, an essential protein in Staphylococcus aureus crucial for ribosome assembly/stability. BMC Microbiol. 2009;9:266. [PMC free article] [PubMed]
73. Mercker M, et al. The BEM46-like protein appears to be essential for hyphal development upon ascospore germination in Neurospora crassa and is targeted to the endoplasmic reticulum. Curr. Genet. 2009;55:151–161. [PubMed]
74. Miller DJ, et al. Structural and biochemical characterization of a novel Mn2+-dependent phosphodiesterase encoded by the yfcE gene. Protein Sci. 2007;16:1338–1348. [PubMed]
75. Keppetipola N, Shuman S. A phosphate-binding histidine of binuclear metallophosphodiesterase enzymes is a determinant of 2',3'-cyclic nucleotide phosphodiesterase activity. J. Biol. Chem. 2008;283:30942–30949. [PMC free article] [PubMed]
76. Rosby R, et al. Knockdown of the Drosophila GTPase nucleostemin 1 impairs large ribosomal subunit biogenesis, cell growth, and midgut precursor cell maintenance. Mol. Biol. Cell. 2009;20:4424–4434. [PMC free article] [PubMed]
77. Jiang M, et al. The Escherichia coli GTPase CgtAE is involved in late steps of large ribosome assembly. J. Bacteriol. 2006;188:6757–6770. [PMC free article] [PubMed]
78. Pereira CM, et al. IMPACT, a protein preferentially expressed in the mouse brain, binds GCN1 and inhibits GCN2 activation. J. Biol. Chem. 2005;280:28316–28323. [PubMed]
79. de Hoog CL, et al. RNA and RNA binding proteins participate in early stages of cell spreading through spreading initiation centers. Cell. 2004;117:649–662. [PubMed]
80. Balaji S, Aravind L. The RAGNYA fold: a novel fold with multiple topological variants found in functionally diverse nucleic acid, nucleotide and peptide-binding proteins. Nucleic Acids Res. 2007;35:5658–5671. [PMC free article] [PubMed]