The rapidly accumulating genome sequence data allow researchers to address fundamental biological questions that were not even asked just a few years ago. A major problem in genomics is the widening gap between the rapid progress in genome sequencing and the comparatively slow progress in the functional characterization of sequenced genomes. Here we discuss two key questions of genome biology: whether we need more genomes, and how deep is our understanding of biology based on genomic analysis. We argue that overly specific annotations of gene functions are often less useful than the more generic, but also more robust, functional assignments based on protein family classification. We also discuss problems in understanding the functions of the remaining “conserved hypothetical” genes.
The year 2010 marks the 15th anniversary of the publication of the 1,830,138-base genome of the bacterium Haemophilus influenzae Rd Kw20 - the first cellular life form to have its entire genome sequenced. Aided by the tremendous progress in sequencing technology, genome sequencing is advancing at an ever-increasing pace. By the end of 2009, 1052 genomes representing 720 individual species (636 bacteria, 61 archaea, and 23 eukaryotes) were completely sequenced, deposited in the public nucleotide sequence databases (GenBank/EMBL/DDBJ) and made freely available over the internet. Many more genomes were at various stages of sequencing and assembly, including almost 100 eukaryotic genomes whose preliminary descriptions have been published. Thanks to the advent of the new generation of sequencing technologies, the costs of genome sequencing have dropped so much that the projects to sequence the entire human microbiome (http://nihroadmap.nih.gov/hmp/) and to generate ~5,000 reference genomes for every major prokaryotic lineage (the Genomic Encyclopedia of Bacteria and Archaea: http://www.jgi.doe.gov/programs/GEBA/) have become realistic. Given these remarkable advances, it seems timely to address two lingering questions: ‘How many more genomes do we need?’ and ‘How deep is our understanding of biology derived from genome analysis?’
An interesting, perhaps provocative question is whether a sufficient number of genomes have already been sequenced. Most biologists subscribe to the “the more the merrier” view, but others have argued that microbial genomics has already reached the stage of diminishing returns, such that each new genome yields information of progressively decreasing utility [5, 6]. There seems to be some substance to this claim; for example, it is unlikely that we will ever see a single bacterial chromosome that is much longer than 13,033,779 nucleotides (as in the myxobacterium Sorangium cellulosum). At the other end of the spectrum, the intracellular cicada symbiont Candidatus Hodgkinia cicadicola, with its 143,795-bp genome, could be considered a cellular organelle rather than an independent organism or, at best, a bacterium far on its way to becoming an organelle, so there is hardly any room for further genome reduction of cellular life forms. With respect to other common parameters, such as G+C content, the number of encoded proteins, and metabolic and signaling complexity, the extremes might already have been reached, or will be in the near future. Perhaps more importantly, the set of highly conserved genes (that is, those represented in the majority of genomes) is clearly approaching saturation. Similarly, in structural genomics projects, the chances of discovering a new protein fold or even a new superfamily are progressively dropping.
Nevertheless, genome sequencing is here to stay, and there are several compelling reasons for that. First of all, the value of the sequence information is in the eye of the beholder. Many biologists still passionately argue for sequencing their own favorite organism, strain or isolate, no matter how many close relatives already have been sequenced. Indeed, not having a genome sequence for an experimental model is increasingly - and for good reasons - perceived as being stuck in the "dark ages". The availability of the genome sequence allows researchers to easily clone and express any gene, create microarrays to analyze gene expression, and reconstruct the metabolic and signaling networks. Having genomic sequences from closely related organisms opens the door to the quantitative study of mutational patterns, selective regimes, adaptations to ecological factors and, in the case of microbial pathogens, virulence determinants. Potentially even more important is the possibility to identify genes and traits that are not present in the given genome - a task that clearly requires a complete genome sequence.
Secondly, the available genome collection, despite its rapid expansion, still barely scratches the surface of the real biological diversity. The availability of genomic data already led to a revolution in systematics, especially with regard to bacteria and archaea, having put this field on a solid evolutionary footing and giving rise to the new discipline of phylogenomics [10, 11]. Still, judging from the metagenomic data, as many as 90% of the microbial species on Earth remain uncultivated [12, 13], which complicates reconstruction of the global carbon and nitrogen cycles. Genome analysis has already led to several important advances in these areas. Thus, the genome of the marine α-proteobacterium SAR11 (now renamed Candidatus Pelagibacter ubique), apparently the most abundant organism on this planet, opened our eyes to a peculiar role of bacteriorhodopsin-mediated photosynthesis as an auxiliary energy source in the extremely streamlined metabolism of this bacterium . The genome sequence of the deep-sea proteobacterium Idiomarina loihiensis revealed mostly proteolytic, in contrast to the expected saccharolytic, metabolism , indicating that the marine habitat of this bacterium contains enough dissolved protein to support a peptide-based diet. The genomes of recently discovered anammox bacteria have yielded valuable insights into the evolution of the global nitrogen cycle and the biochemical reactions that convert nitrate and nitrite into nitrogen gas . This list of unexpected discoveries with biogeochemical implications could be easily extended.
Thirdly, hidden sampling biases in genome sequencing are becoming apparent. For example, starting with Mycoplasma genitalium in 1995, more than 20 mollicute genomes have been sequenced, none of which encoded a single environmental sensor . However, the perception that mollicutes have no signal transduction systems was shattered upon the completion of the (slightly larger) genome of the soil mollicute Acholeplasma laidlawii, which encodes two sensory histidine kinases, three response regulators, an adenylate cyclase, and at least 15 proteins involved in c-di-GMP-mediated regulation (http://www.ncbi.nlm.nih.gov/Complete_Genomes/SignalCensus.html).
Fourthly, although obtaining complete genome sequences from every major lineage  would certainly be a dramatic step forward, a single representative genome is by no means sufficient to assess the true biological diversity of a taxon. As a case in point, the sequencing of several genomes from the cyanobacterium Prochlorococcus marinus - a widespread inhabitant of ocean surface waters - was originally aimed at establishing the principal differences between “high-light” and “low-light” ecotypes . However, different strains of P. marinus proved to have vastly different gene repertoires, indicative of high rates of gene acquisition and loss by these organisms. These findings have shown that: (i) the core set of genes shared by all P. marinus isolates is very limited – and shrinking; and (ii) the P. marinus pan-genome, that is the sum total of genes represented in at least one P. marinus strain, is extremely large – and expanding . This crucial yet unexpected development puts into question the very rationale for assigning organisms with dramatically different genome contents – but (nearly) identical 16S rRNA sequences – to the same “species” (such as P. marinus or Escherichia coli) and puts the study of pan-genomes to the forefront of genomic research.
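The two definitions above reduce to set operations on per-strain gene repertoires: the core genome is the intersection and the pan-genome is the union. The following is a minimal sketch under that reading; the strain and gene names are invented for illustration and do not correspond to real P. marinus data, and a real analysis would first have to cluster genes into orthologous groups.

```python
# Hypothetical gene repertoires for three strains (illustrative names only).
strain_genes = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g5"},
    "strain_C": {"g1", "g2", "g3", "g6", "g7"},
}


def core_genome(repertoires):
    """Genes present in every strain (set intersection)."""
    sets = list(repertoires.values())
    return set.intersection(*sets) if sets else set()


def pan_genome(repertoires):
    """Genes present in at least one strain (set union)."""
    return set().union(*repertoires.values())


def growth_curve(repertoires):
    """Core and pan-genome sizes as strains are added one at a time."""
    curve = []
    seen = {}
    for name, genes in repertoires.items():
        seen[name] = genes
        curve.append((name, len(core_genome(seen)), len(pan_genome(seen))))
    return curve


core = core_genome(strain_genes)
pan = pan_genome(strain_genes)
print(sorted(core))
print(sorted(pan))
for row in growth_curve(strain_genes):
    print(row)
```

In this toy example the core can only stay the same or shrink as strains are added, while the pan-genome can only stay the same or grow, which is the pattern described for P. marinus.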
Finally, there remains the crucial issue of using genome sequencing to improve human health. For obvious reasons, the first sequenced genomes were mostly those of common bacterial pathogens. Then the human genome and representative genomes from popular model organisms emerged. As sequencing costs continue to decrease, the use of genomic data for fighting disease becomes more and more attractive. For many bacterial pathogens, multiple strains have been sequenced, often providing clues to the virulence factors, host specificity and drug resistance. Some biologists advocate developing a system of constant genome-based monitoring of various points on the globe, hoping to catch new emerging pathogens before they cause a new epidemic. Such an effort is already well underway for influenza viruses [20, 21]. The human cancer genome project aims at sequencing thousands of tissue samples from various tumors, in hopes of delineating the whole spectrum of mutations that could contribute to cancer. Although this approach has been criticized, the prospect of obtaining the full list of potentially oncogenic mutations – thereby achieving a “complete understanding” of the causes of cancer – is certainly too attractive to pass up.
With sequenced genomes being released almost every day, how well do we understand the functions of the genes in each new one? The answer to this question will depend on the exact meaning of the word “understanding” (as well as “function”). Modern dictionaries associate “understanding” with such terms as “appreciation”, “comprehension”, “explanation”, “insight”, “interpretation”, “knowledge”, and “mastery”. Accordingly, understanding a genome starts from the “knowledge” of the nucleotide sequence and the sequences of encoded proteins and RNAs, and includes “interpretation” of their functions, “insight” into their complex interactions, and “explanation” of the evolutionary history that shaped each particular genome. This leads to the “comprehension” of the potential activity of each component of the cell, which must be tempered by the “appreciation” that proteins often have additional (e.g. moonlighting [23, 24]) functions. Finally, this understanding can be extended into “mastery” – the ability to modify the genome for certain (e.g. biotechnological) applications. Therefore, the problem of understanding the genome can be rephrased as follows: how good is the “parts list” that is compiled for each genome in the form of functional annotation of the predicted protein-coding and RNA-coding genes?
Obviously, this list is never complete. Almost 10 years ago, Peer Bork described the “70% hurdle”: on average, for approximately one-third of the genes in any given genome, the functions could not be predicted through traditional methods of genome analysis; perhaps even worse, the accuracy of functional prediction was only ~70% for the remaining genes . Bork warned that hopes to cross this 70% barrier and achieve a better understanding of the functional content of genomes with the help of high-throughput analytical methods would be tempered by the fact that these methods themselves have high error rates and are most effective when used in concert . Looking back, Bork’s sobering prediction was right on target. High-throughput analyses of gene and protein expression, protein-protein interactions, and ligand binding led to a dramatic increase in the amount of data pertaining to any given gene in model genomes . However, as illustrated in Box 1, accumulation of such data does not necessarily translate into clarity regarding gene function, at least not immediately, and not without much work.
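Taken at face value, the two figures quoted above can be combined into a rough estimate of the fraction of genes in a genome that carry a correct functional prediction. The symbols are introduced here for illustration only: f_pred is the fraction of genes for which a function could be predicted (about two-thirds) and a is the accuracy of those predictions (about 0.7). The estimate also assumes that the accuracy applies uniformly across the predicted genes.

\[
f_{\text{correct}} \approx f_{\text{pred}} \cdot a \approx \tfrac{2}{3} \times 0.7 \approx 0.47
\]

Under these assumptions, fewer than half of the genes in an average genome would carry a correct functional prediction.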
Many “conserved hypothetical” genes remain without an assigned function simply because they have never been studied experimentally. In other cases, experimental studies brought contradictory results that could not be easily reconciled. To illustrate the difficulty of, in the words of Sydney Brenner, "converting data into knowledge and knowledge into understanding", let us consider the history of functional characterization of three widespread genes.
The E. coli ygjD (gcp) gene has orthologs in almost every bacterial, archaeal and eukaryotic genome. In many eukaryotes it is found in two paralogous copies, such as QRI7 and Kae1 in yeast, At4g22720 and At2g45270 in Arabidopsis thaliana, and OSGEP and OSGEPL1 in human. In addition, there is a family of more distant bacterial paralogs, represented by E. coli YeaZ and B. subtilis YdiC. We have previously discussed the potential functions of this protein family (which contains an actin/HSP70 superfamily ATPase domain), and expressed doubts about its annotation as "O-sialoglycoprotease", which was based on a single experimental observation, and further suggested an association of this protein with translation (e.g. co-translational degradation of misfolded proteins). In the past several years, proteins of this family have been studied in several model organisms, and the crystal structures of several family members have been solved [46, 59]. An archaeal YgjD family member showed no protease activity, but has been reported to bind DNA and possess an apurinic endonuclease activity. In yeast, Kae1 is a subunit of the KEOPS complex which regulates transcription, telomere uncapping and telomere length, and is required for cell growth; this protein is targeted to mitochondria and appears to be essential for genome maintenance. Despite all these observations, the actual function of the YgjD family proteins remains enigmatic [46, 60]. A recent study suggested their involvement in biosynthesis of threonylcarbamoyladenosine (t6A), a universal tRNA base modification occurring at position 37 in a subset of tRNAs decoding the ANN codons. If so, translational defects resulting from impaired t6A biosynthesis could explain at least some properties of the ygjD mutants.
The E. coli yebC is another widespread gene with orthologs in most bacterial and eukaryotic genomes. In some organisms, there are two paralogs, such as yebC and yeeN in E. coli, or yeeI and yrbC in B. subtilis. Products of these genes are listed as “domain of unknown function” (DUF28, PF01709) in the Pfam database  and as uncharacterized protein family UPF0082 in UniProt . Crystal structures of three members of this family have been solved , but those structures have provided no clear indication of protein function. Analysis of the genome neighborhoods of the yebC-like genes has revealed potential association with the Holliday junction resolvase RuvABC and suggested that YebC might be involved in DNA recombination and repair in bacteria and mitochondria . Indeed, a recent study demonstrated DNA binding by a Pseudomonas aeruginosa YebC family protein, and suggested a role in transcription regulation . However, another study  mapped the cytochrome c oxidase deficiency in humans (late-onset Leigh syndrome) to a mutation in the human YebC ortholog CCDC44 (renamed TACO1) and concluded that this protein was required for the proper translation of the mitochondrial COX1 protein. Three possible mechanisms of YebC action to ensure translation of the full-length COX1 polypeptide were considered: (i) securing an accurate start of translation, (ii) stabilizing the elongating polypeptide, and (iii) interacting with the peptide release factor . While involvement in translation appears very likely for such a widespread protein family, its apparent capacity to bind DNA remains to be confirmed and/or explained.
The E. coli yjgF gene has highly conserved homologs in bacteria, archaea and eukaryotes, often with multiple paralogs in the same genome. Representatives of the YjgF protein family are known as "purine regulatory protein YabJ" in B. subtilis and as "tumour-associated antigen UK114" in human and other mammals. Members of this family have been reported to possess ribonuclease activity, to function as a molecular chaperone, calpain activator, transcriptional regulator, and translational inhibitor, and also to affect photosynthesis, isoleucine biosynthesis and mitochondrial genome maintenance (reviewed in [66, 67]). Crystal structures of bacterial, archaeal and eukaryotic members of this family have been solved, revealing an inter-subunit cleft that is capable of binding a variety of small molecule ligands . Despite all these efforts, the cellular functions of the members of the YjgF family remain unclear.
Another important issue here is the definition of “function”, particularly as it applies to (semi)-automated genome annotations. For a limited set of essential genes, the notion of function seems quite straightforward: the function is what the gene product needs to do to allow cell growth. Operationally, if a gene is knocked-out, the cell dies, and the cause of death can be assumed to be the function of the gene in question. For non-essential genes the picture is more complicated as the functions of many, if not most, proteins are inherently multifaceted and complicated. For example, a single oxidoreductase would use a range of substrates and a variety of electron acceptors, making a precise functional assignment difficult, if not impossible. Should the function be assigned on the basis of the substrate with the lowest KM or the highest Vmax, or the one that is most likely to be physiologically relevant? This problem becomes particularly severe for high-throughput enzyme assays, which helped assign general biochemical functions to products of previously uncharacterized genes, but were often unable to pinpoint the natural substrates for the respective enzymes [27, 28]. Furthermore, many proteins, particularly in eukaryotes, lack any (known) enzymatic activity and appear to function exclusively in protein-protein interactions. It could be argued that the “understanding” of a protein function should, at the very least, include knowledge of (i) biochemical activity, if any (i.e. the nature of the catalyzed reaction and the range of utilized substrates and products); (ii) the biological process (pathway, stress response, cell cycle) for which this activity is (most) important; and (iii) the evolutionary aspects, such as phyletic distribution, level of sequence conservation, and frequency of mutation, gene loss and/or non-orthologous gene displacement.
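The ambiguity in choosing a "defining" substrate can be made concrete with standard Michaelis-Menten kinetics, v = Vmax·[S] / (KM + [S]). The sketch below uses entirely hypothetical kinetic parameters for three hypothetical substrates of one enzyme and shows that ranking by lowest KM, by highest Vmax, by Vmax/KM, or by the rate at an assumed cellular concentration can each select a different substrate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kinetics:
    substrate: str
    km: float    # Michaelis constant in mM (hypothetical value)
    vmax: float  # maximal rate in arbitrary units (hypothetical value)

    def rate(self, conc: float) -> float:
        """Michaelis-Menten rate at substrate concentration conc (mM)."""
        return self.vmax * conc / (self.km + conc)


# Invented parameters for one enzyme acting on three substrates.
candidates = [
    Kinetics("substrate_A", km=0.05, vmax=10.0),
    Kinetics("substrate_B", km=5.0, vmax=400.0),
    Kinetics("substrate_C", km=0.5, vmax=150.0),
]

by_lowest_km = min(candidates, key=lambda k: k.km)
by_highest_vmax = max(candidates, key=lambda k: k.vmax)
by_efficiency = max(candidates, key=lambda k: k.vmax / k.km)


def best_at(conc: float) -> Kinetics:
    """Substrate with the highest rate at an assumed concentration (mM)."""
    return max(candidates, key=lambda k: k.rate(conc))


print(by_lowest_km.substrate, by_highest_vmax.substrate, by_efficiency.substrate)
print(best_at(0.01).substrate, best_at(10.0).substrate)
```

With these numbers each criterion points to a different substrate, and the answer under the "physiologically relevant" criterion depends on the concentration assumed, which is rarely known for a newly annotated protein.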
Owing to the paucity of experimental data, this information is rarely available in its entirety, and functional assignments for the majority of the genes are based solely on the sequence similarity of their products to experimentally characterized proteins in a handful of model organisms such as E. coli, Bacillus subtilis, yeast, Dictyostelium, Drosophila, Caenorhabditis elegans, zebrafish, or mouse. Automatic transfer of functional annotation often leads to confusion when, for example, the product of a widespread prokaryotic gene (ytaB in B. subtilis) is often annotated as “mitochondrial benzodiazepine receptor” (it is hard to imagine why B. subtilis and hundreds of other bacteria and archaea would need a receptor for Valium, even apart from the fact that they have no mitochondria). Alternative annotations for the products of this gene family in various organisms include “tryptophan-rich sensory protein TspO”, “carotenoid biosynthetic protein CrtK” and “18 kD translocator protein”. However, despite these discordant annotations, there is little doubt that all members of the TspO/MBR protein family (Pfam family PF03073 ) are very similar and perform closely related – and important – functions, which remain to be uncovered . In our opinion, a more productive route towards sensible functional annotation is to replace annotation of individual proteins (particularly, those from poorly studied organisms) with annotation of protein families. In essence, this approach substitutes protein classification (something that we generally know how to do) for specific protein annotation, which except possibly for a handful of obvious cases, will remain questionable until each protein is experimentally characterized, even when predictions appear entirely plausible and supported by high similarity to experimentally characterized homologs and/or operon structure. 
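The family-based approach can be sketched as a simple grouping step: collect the transferred annotations by family accession, flag families whose members carry discordant labels, and report the family-level label instead. In the sketch below the four annotation strings and the accession PF03073 are taken from the text above; the protein identifiers, the family label string and the function names are hypothetical, and a real pipeline would obtain family assignments from a database search rather than from a hand-written table.

```python
from collections import defaultdict

# Each record: (protein_id, transferred_annotation, family_accession).
# Protein ids are made up; the labels and PF03073 come from the text above.
records = [
    ("prot_1", "mitochondrial benzodiazepine receptor", "PF03073"),
    ("prot_2", "tryptophan-rich sensory protein TspO", "PF03073"),
    ("prot_3", "carotenoid biosynthetic protein CrtK", "PF03073"),
    ("prot_4", "18 kD translocator protein", "PF03073"),
]

# Family-level labels (assumed wording, for illustration only).
family_labels = {"PF03073": "TspO/MBR family protein"}


def discordant_families(records):
    """Return families whose members carry more than one specific label."""
    labels = defaultdict(set)
    for _, annotation, family in records:
        labels[family].add(annotation)
    return {fam: sorted(names) for fam, names in labels.items() if len(names) > 1}


def annotate_by_family(records, family_labels):
    """Replace per-protein labels with the label of the protein's family."""
    return {
        protein_id: family_labels.get(family, "uncharacterized protein family " + family)
        for protein_id, _, family in records
    }


conflicts = discordant_families(records)
family_annotation = annotate_by_family(records, family_labels)
print(conflicts)
print(family_annotation)
```

The family-level label says less than any of the four specific annotations, but it is the only one that is consistent across all members and makes no claim that has yet to be tested experimentally.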
As experimental assays increasingly lag behind the avalanche of genomic data, such experimental validation of predicted protein function becomes progressively less likely. By contrast, protein family classification is becoming increasingly robust. As an example, recognition of a Fis-type or a winged helix-turn-helix domain allows the recognition of a protein as a DNA-binding transcriptional regulator although the regulated genes (operons) may be difficult to predict . Likewise, numerous membrane proteins are reliably recognized as transporters, whereas the range of their substrates often remains uncertain. The notable success of protein family databases, such as Pfam, InterPro, COGs and CDD [29, 32–34], is probably due not only to the fact that these were – and still are – comprehensive collections with many useful features. It could be argued that another key to their success lies in the abandonment of the elusive goal of annotating every single sequence and instead concentrating on the common traits of protein families. In doing so, these databases provided a reasonably robust common framework for annotating the entire protein sets encoded in newly sequenced genomes. Thus, annotation pipelines used at most genome sequencing projects now include comparison against at least one of the available protein family databases.
It is important to note that family assignment is only the first step towards understanding, which, as discussed above, requires knowledge of both the biochemical activity of the protein and the cellular process in which the protein is involved. As an example, the sequence-based prediction that the conserved bacterial protein Era is a GTPase was a good first step in its characterization, and recognition of its involvement in translation was another step forward. However, “true understanding” of the role of this GTPase in the translation process – and its proper functional annotation – came only after an experimental study that revealed the participation of Era in processing and maturation of 16S rRNA .
Even in the relatively well-studied model organisms, the great majority of genes have never been experimentally characterized. E. coli K-12 and yeast Saccharomyces cerevisiae appear to be the only organisms for which at least 50% of the genes have been studied experimentally [36–38]. Despite the best efforts of experimental and computational biologists, a substantial – and constantly growing, given the acceleration of genome sequencing – number of deduced proteins have no known function (Figure 1). This is hardly surprising in the case of lineage-specific genes that are found, for example, only in Vibrio or Burkholderia - bacterial lineages that are extensively sampled by genome sequencing, but do not include well-characterized model organisms. However, some genes that are widespread among bacteria, archaea and/or eukaryotes still remain without functional annotation. The protein products of these genes have been variously referred to as “hypothetical”, “conserved hypothetical”, “uncharacterized” or even “putative uncharacterized” (as of May 1, 2010, 3,118,564 proteins in UniProt were annotated this way). Several lists of “conserved hypothetical” proteins have been compiled, including Domains of Unknown Function (DUFs) in Pfam, R- and S-COGs in the COG database, and Uncharacterized Protein Families (UPFs) in UniProtKB/Swiss-Prot [29, 33, 40]. These lists have been extensively used to guide structural genomics efforts, which resulted in structural (albeit usually not functional) characterization of many such proteins [41, 42].
To highlight the distinction between the “hypothetical” genes whose functions remained completely unknown and those that could be assigned a general biochemical function (e.g. a methyltransferase, an oxidoreductase, a transcriptional regulator or a membrane transporter), we denoted the former category of genes “unknown unknown” and the latter category “known unknown” . The “known unknown” category includes also genes of unknown biochemical function that have (partially) known cellular function, such as a “cell division protein” or a “stress response protein”. In purely operational terms, there are more or less clear ways of establishing function for “known unknown” genes, but not for “unknown unknowns”.
Six years ago we analyzed widely conserved “hypothetical” genes and compiled the “top 10” lists of “known unknown” and “unknown unknown” genes. A re-examination of these lists shows that, despite mounting observations, nearly half of those genes still remain without an assigned function (Tables 1 and 2). Some of the genes in the two lists have been experimentally characterized, and in a few cases the function has been established. In eukaryotes, products of some, albeit not all, of these widely conserved genes appear to be targeted to mitochondria [44–48]. In two instances, mutations in these genes were linked to mitochondrial diseases, such as hereditary paraganglioma and the late-onset Leigh syndrome. In other cases, however, experimental results were contradictory (Box 1). Apparently, the problem was not in the lack of effort to characterize these genes, but in the pleiotropic phenotypes of their mutations, which made it difficult to pinpoint the primary function.
Given that universally conserved genes are typically involved in translation, transcription or ribosome biogenesis, widespread genes are likely to function in these or related processes as well. Indeed, several recently characterized “conserved hypothetical” genes are involved in post-transcriptional modification of tRNA [43, 49]. Besides, a significant number of “orphan” enzyme activities have not been associated with any protein sequence, suggesting that some “hypothetical” genes might have well-known functions. Characterization of the “missing” genes in various metabolic pathways allowed assigning functions to a number of formerly “conserved hypothetical” genes [51, 52].
Less common “hypothetical” genes are far more abundant in the genomes of free-living organisms than in the relatively streamlined genomes of parasites, symbionts and saprophytes . Based on the observation that the fraction of metabolic and particularly regulatory genes increases with the genome size [17, 54, 55], sophisticated regulation of gene expression and complex (secondary) metabolism, including various post-transcriptional and post-translational modifications, appear to be plausible roles for a fair number of the remaining uncharacterized genes.
Recent studies have highlighted an additional class of functions that might account for the abundance of uncharacterized genes in free-living organisms, namely, detoxification (usually hydrolysis) of potentially hazardous side-products of various metabolic reactions . These activities, commonly referred to as “house-cleaning”, are particularly important for aerobic organisms that have to cope with spontaneous oxidation of nucleotides, amino acids, lipids, and other cellular components. For example, the recently characterized “conserved hypothetical” gene yebR (renamed msrC) has been shown to encode an enzyme that hydrolyzes methionine-(R)-sulfoxide, a product of methionine oxidation . Other cellular reactions that might require house-cleaning include methylation, acetylation and adenylation, among potentially many others. It is probably no coincidence that many poorly characterized proteins appear to function as hydrolases [27, 28].
Finally, it has to be kept in mind that a considerable fraction of genes in many genomes might not have definable cellular functions, but rather originate from viruses and mobile elements and only transiently pass through microbial genomes. Genomes are highly dynamic entities, and each sequence is a temporal snapshot that is likely to include many short-lived elements that are not maintained by selection. The very notion of annotation for such “selfish” genes is different from that applied to “regular” genes with distinct cellular functions .
In conclusion, it might be worthwhile to make several basic generalizations regarding genomes and the understanding of gene functions:
So far there is no single high-throughput approach that would finally reveal the functions of all “hypothetical” genes encoded in the sequenced genomes. This goal may be reachable only through sustained efforts of numerous experimental, computational and structural biologists . At the end of 2009, NIH awarded a grant to the COMputational BRidge to EXperiments (COMBREX, http://www.combrex.org/, formerly SciBay) consortium project that aims to coordinate collaborative efforts of various research groups towards computational identification of the most interesting families of “conserved hypothetical” proteins and their experimental characterization (http://www.nigms.nih.gov/News/Results/gogrant_112309.htm). This project seems to have considerable potential to accelerate functional characterization of the remaining “hypothetical” genes, thereby bringing us closer to the “complete understanding” of genomes – and the organisms themselves.
This study was supported by the Intramural Research Program of the National Library of Medicine at the U.S. National Institutes of Health.