Three integrated genomic context methods were used to annotate uncharacterized proteins in 102 bacterial genomes. Of 7853 orthologous groups with unknown function containing 45110 proteins, 1738 groups could be linked to functionally associated partners. In many cases, those partners are uncharacterized themselves (hinting at newly identified modules) or have been described in general terms only. However, we were able to assign pathways, cellular processes or physical complexes for 273 groups (encompassing 3624 previously functionally uncharacterized proteins).
During the last few years, the speed and cost-effectiveness of genome sequencing have increased enormously, and with >100 bacterial genomes now available, the quality and comprehensiveness of their annotation has become a demanding problem. Even in the most recent annotations, a large fraction of open reading frames are still labelled as ‘conserved hypothetical proteins’, sometimes representing more than half of the potential protein-coding regions of a genome (1).
Many of these ‘hypothetical proteins’ occur in fact in more than one bacterial species, and can thus be combined into orthologous groups; this subset of proteins contains the majority of biologically relevant sequences (less likely to be artefactual), and it is amenable to analysis by comparative genomics techniques.
For assigning function to novel proteins, homology-based gene annotation has been the standard during the last decades. However, novel methods have been developed recently, which can complement the classical homology search; these methods are designed to detect presumed functional constraints on genome evolution, and are called ‘genomic context’ approaches.
They predict functional associations between protein-coding genes by analyzing gene fusion events, the conservation of gene neighbourhood, or the significant co-occurrence of genes across different species (2–7). Unlike homology-based annotation, which infers molecular features by information transfer from experimentally characterized proteins, genomic context methods predict functional associations between proteins, such as physical interactions, or co-membership in pathways, regulons or other cellular processes (8). Characterizing protein function in this manner (i.e. by predicting associated partners) is intuitive and generally applicable, but it should be noted that it does not provide information about the exact biochemical or enzymatic function of a protein. Genomic context methods have been successfully used to study protein associations, either individually (2–6) or in combination with other methods or data sets (7,9,10). Recently, a combination of genomic context methods has been applied to infer functional associations between proteins in archaea (11), or to identify functional modules in Escherichia coli (12). Furthermore, these methods have been used to identify genes that correlate with the hyperthermophilic phenotype (13) or to predict target processes for transcription regulators (14). Yet, despite these efforts, >35% of genes in prokaryotes are still annotated as ‘function unknown’ (15). The urgent need to functionally characterize these proteins and to bridge this gap in our knowledge is highlighted by a recent call for community action (1).
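To illustrate the idea of co-occurrence across species, the following minimal sketch compares the presence/absence profiles of two genes using the Jaccard index. This is an illustrative stand-in only: the gene and species names are hypothetical, and the scoring used by the tools cited above may differ.

```python
def cooccurrence(profile_a, profile_b):
    """Jaccard similarity of two phylogenetic profiles.

    Each profile is the set of species in which a gene is present.
    The Jaccard index is an assumption chosen for illustration; it is
    not claimed to be the measure used by any cited method.
    """
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy profiles (species labels are placeholders).
profiles = {
    "geneA": {"sp1", "sp2", "sp3"},
    "geneB": {"sp1", "sp2", "sp3", "sp4"},
    "geneC": {"sp5"},
}

score_ab = cooccurrence(profiles["geneA"], profiles["geneB"])  # shared in 3 of 4 species
score_ac = cooccurrence(profiles["geneA"], profiles["geneC"])  # never found together
```

Genes whose profiles overlap strongly, such as geneA and geneB here, would be candidates for a functional association; genes with disjoint profiles would not.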
Here, we aim at reducing this fraction considerably, by a systematic study of uncharacterized proteins in prokaryotes. We exploit the genomic context of this set to predict their functional role, by determining the biological/cellular process in which the proteins participate.
From the manually curated clusters of orthologous groups (COG) database (16) (http://www.ncbi.nlm.nih.gov/COG/), we extracted clusters of orthologous genes, which were annotated as ‘hypothetical’ or ‘uncharacterized’. In the original procedure to create COGs (16), orthologues are identified using an all-against-all sequence comparison of the proteins encoded in completely sequenced genomes. In considering a protein from a given genome, this comparison reveals the one protein from each of the other genomes to which it is most similar. Each of these proteins is in turn considered. If a reciprocal best-hit relationship between these proteins (or a subset) is revealed, then those that are reciprocal best-hits will form a COG. An additional constraint in this procedure is that a COG must be comprised of one protein from at least three phylogenetically distant genomes.
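The reciprocal best-hit logic described above can be sketched as follows. This is a simplified illustration, not the COG pipeline itself: the input format (a dictionary of pairwise similarity scores) and all function names are assumptions, and the sketch only finds triangles of mutual best hits across three different genomes, without checking phylogenetic distance.

```python
from itertools import combinations


def best_hits(scores):
    """Map each query protein to its best-scoring hit in every other genome.

    scores: dict of (query, target) -> similarity score, where each
    protein is a (genome, name) tuple. This format is hypothetical.
    """
    best = {}
    for (query, target), score in scores.items():
        if query[0] == target[0]:
            continue  # ignore hits within the same genome
        per_genome = best.setdefault(query, {})
        current = per_genome.get(target[0])
        if current is None or score > current[1]:
            per_genome[target[0]] = (target, score)
    return {q: {g: hit[0] for g, hit in hits.items()} for q, hits in best.items()}


def is_reciprocal(best, a, b):
    """True if a and b are each other's best hit in the respective genomes."""
    return best.get(a, {}).get(b[0]) == b and best.get(b, {}).get(a[0]) == a


def rbh_triangles(scores):
    """Return all triples of proteins from three genomes that are mutual best hits."""
    best = best_hits(scores)
    triangles = []
    for a, b, c in combinations(sorted(best), 3):
        if len({a[0], b[0], c[0]}) != 3:
            continue  # require three different genomes
        if is_reciprocal(best, a, b) and is_reciprocal(best, b, c) and is_reciprocal(best, a, c):
            triangles.append((a, b, c))
    return triangles


# Hypothetical toy data: a, b and c are mutual best hits; x only hits b weakly.
A, B, C, X = ("G1", "a"), ("G2", "b"), ("G3", "c"), ("G1", "x")
toy_scores = {
    (A, B): 90, (B, A): 90,
    (B, C): 80, (C, B): 80,
    (A, C): 85, (C, A): 85,
    (X, B): 40,
}
triangles = rbh_triangles(toy_scores)
```

In this toy example, protein x is excluded because its best hit b does not point back to it, which is the reciprocity condition at work.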
We have recently (7) extended the COG database by considering an additional set of 37 newly sequenced complete genomes. This resulted both in the extension of the original COGs with new proteins and in the creation of entirely novel orthologous groups, which we termed ‘NOGs’ (non-supervised orthologous groups). Similar to the original procedure, assignments for NOGs were made based on triangles of reciprocal best matches between species in all-against-all Smith–Waterman searches, allowing for recent duplications within the genome, and including a clean-up step to join any remaining genes by simple bidirectional hits. NOGs are fully automatically generated, and they do not have any manually curated functional annotation (for a more detailed description see Supplementary Material, p. 12).
Functionally unannotated COGs and NOGs were analysed using the tool STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, http://string.embl.de/) (7), applying a conservative score threshold of 0.4 [for a benchmark see (7)]. STRING calculates this ‘confidence score’ on the basis of three genomic context methods: conserved gene neighbourhood, gene fusion events and significant co-occurrence of the genes across a specific subset of species. The prediction accuracy of functional links is often higher than the confidence score indicates [e.g. when tested against E.coli small molecule metabolism (12)]. Each COG and NOG was queried manually, and the STRING output was inspected to determine whether it allowed the assignment of a cellular role to the uncharacterized proteins clustered in this group, based on their predicted functional partners.
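As a rough sketch of how three evidence scores might be merged and compared with the 0.4 threshold, the example below combines them under an independence assumption (one minus the product of the complements). This combination rule is an assumption made for illustration; the text above does not specify the exact computation, and the function names and input values are hypothetical.

```python
THRESHOLD = 0.4  # score threshold quoted in the text


def combined_score(neighbourhood, fusion, cooccurrence):
    """Combine three evidence scores in [0, 1], assuming independence.

    Assumed rule (not taken from the text): 1 - prod(1 - s_i).
    """
    remaining = 1.0
    for score in (neighbourhood, fusion, cooccurrence):
        remaining *= 1.0 - score
    return 1.0 - remaining


def passes_threshold(neighbourhood, fusion, cooccurrence, threshold=THRESHOLD):
    """True if the combined score reaches the chosen threshold."""
    return combined_score(neighbourhood, fusion, cooccurrence) >= threshold


# Hypothetical evidence values for two candidate associations.
kept = passes_threshold(0.2, 0.0, 0.3)     # combined is about 0.44
dropped = passes_threshold(0.1, 0.1, 0.1)  # combined is about 0.27
```

Under this rule, several weak lines of evidence can jointly lift an association over the threshold even when none would pass alone.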
In a global search for conserved uncharacterized proteins, we retrieved a total of 7853 orthologous groups annotated to be of unknown function. All groups contain orthologous proteins derived from several genomes.
In the set of 102 prokaryotic genomes considered here, 82% of all the genes are included in orthologous groups. Most of the remaining proteins might be without orthologues in any other genome and thus cannot be included in a comparative analysis; others might even be gene prediction artefacts.
The list of functionally uncharacterized groups contains 1162 groups originally assembled by Tatusov et al. (16) (COGs), and 6691 groups derived from more recently completed genome sequences by a similar procedure (7) (NOGs). The entire list was submitted to the STRING server for genomic context analysis, using a score threshold of 0.4. We found that 5628 of the 7853 COGs/NOGs did not reveal any genomic context association. The remaining 2225 groups, however (703 COGs and 1522 NOGs, see Figure 1), were found to be functionally related to other orthologous groups based on gene neighbourhood, gene fusion events or phylogenetic co-occurrence.
We manually checked the list to remove COGs/NOGs, for which some molecular function was indeed already known [e.g. in the SWISSPROT database (17)]. This reduced the list to 1737 groups (containing 6367 proteins).
For 399 COGs and 1065 NOGs of these groups (84.1% of 1740), we identified only uncharacterized interaction partners or only proteins with a very unspecific function (see Supplementary Material). We observed that these groups form a larger association network of uncharacterized proteins; we split this network by a standard clustering procedure [for methods see (12)], in order to identify sets of hypothetical proteins that potentially co-operate in separable functional units. We observe that 975 groups of orthologues (66.6%) are clustered in 384 functional modules (see Figure 1), whereas 492 have only single connections. The modules of unknown function identified here may represent novel pathways, complexes or operons, and should be given high priority for experimental analysis, as they are likely to perform complex functions requiring several gene products, and are important enough for the cell to be selected for as a unit.
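To give a feel for how an association network can be split into candidate modules, the sketch below groups orthologous groups by connected components. Connected components are a deliberately simple substitute for the clustering procedure of (12), which is not reproduced here; the group identifiers in the example are placeholders.

```python
from collections import defaultdict


def connected_modules(edges):
    """Split an undirected association network into connected components.

    edges: iterable of (group_a, group_b) pairs, each representing a
    predicted association. Returns a list of sorted member lists.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    modules = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        members = []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        modules.append(sorted(members))
    return modules


# Hypothetical associations between placeholder group identifiers.
toy_edges = [("grp1", "grp2"), ("grp2", "grp3"), ("grp4", "grp5")]
modules = connected_modules(toy_edges)
```

A real clustering procedure would typically also break up large, loosely connected components, which this simplification does not attempt.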
For the remaining 127 COGs and 146 NOGs, we were indeed able to predict a cellular role, a connected pathway or a complex (see Table 1 for selected representatives and Supplementary Material for the whole list). A novel function was predicted for 24% of COGs, but only for 12% of NOGs, showing a correlation between orthologous group size (average size: ~17 proteins/COG and ~4 proteins/NOG) and the likelihood of belonging to a conserved genomic context.
We found functional associations for uncharacterized COGs/NOGs to a broad variety of cellular activities. These include chromatin-associated processes such as DNA-repair, transcription and translation, metabolic or signalling pathways and membrane-associated transport and secretion processes.
The specificity and annotation depth of predicted partners for COGs/NOGs of unknown function vary. Associated partners range from well-defined and experimentally characterized COGs, when, for instance, the hypothetical COG3310 phylogenetically co-occurs significantly in several proteobacteria with an experimentally characterized operon of pilus assembly proteins (18), to barely specified pathways such as COG4029, COG4048, COG4050, COG4051, COG4052 and COG4069, which appear only in methanogenic archaea in a conserved operon with genes coding for phosphatases (COG4087) (19,20) (Table 1).
Newly annotated NOGs are often related to very specific cellular processes, present in very few species known so far [e.g. NOG14679 related to Photosystem II (21)].
Recent experimental data confirm and complete our genomic context approach. An example is NOG06495 (see Table 1) consisting of hypothetical proteins from three different species, which our procedure predicts to be associated with proteins of the type III protein-secretion pathway. Recently, experimental evidence was published (22), which is consistent with our genomic context prediction.
In several cases, a number of uncharacterized COGs were found clustering together, but were also linked to at least one better characterized group. We found, for instance, a cluster of 15 groups (including COG3456, see Table 1 and Supplementary Material), revealing a putative novel functional pathway or complex; genes of these orthologous groups are closely associated (scores >0.7) by conserved gene neighbourhood (data not shown) and phylogenetic co-occurrence (see Figure S1a in Supplementary Material) in several proteobacterial species. However, the proteins of this novel pathway are also connected to homologues of a flagellar motor protein (COG1360) (23) by conserved operon architecture, a fusion event with COG3455 (hypothetical) and phylogenetic co-occurrence (see Figure S1b in Supplementary Material). Proteins of COG1360 contain an OmpA/MotB domain and are thought to function as porin-like integral membrane proteins (24) or lipid-anchored proteins (25); and COG3523 contains an ImcF domain, which has been proposed to be involved in Vibrio cholerae cell surface reorganization (26). Furthermore, COG3515 contains the ‘ImpA-related N-terminal domain’ of inner membrane proteins; this domain has been found in extracellular proteins and is associated with colony variations in Actinobacillus actinomycetemcomitans (27). These findings support the assumption that the novel pathway/protein complex plays a role in a membrane-associated transport process.
An example of a more specific prediction, via association with a well-characterized pathway, is COG0217, which consists of hypothetical proteins present in a variety of eubacterial species. Crystal structure analyses of a representative of this group (Aq1575 from Aquifex aeolicus) give no hint of any putative function (28). As shown in Figure 2a, genes coding for these proteins occur consistently (in 24 species) in a putative operon together with Holliday junction resolvasome genes. The resolvasome in E.coli is known to play an important role in the late stages of homologous genetic recombination and in the recombinational repair of damaged DNA (29). In the majority of species, the functional resolvasome depends on three different subunits (DNA-binding subunit RUVA, DNA helicase RUVB and DNA endonuclease RUVC).
Structural studies indicate that two RuvA tetramers sandwich the formation of heteroduplex DNA and hexameric rings of RuvB face the junction. Thus, RuvB promotes dual helicase action that ‘pumps’ DNA through the RuvAB complex by ATP hydrolysis.
The third protein, RuvC endonuclease, resolves the Holliday junction by introducing nicks into two DNA strands (29).
RUVC is absent in most low GC Gram-positives (30) and in Borrelia; further studies in Mycoplasma pneumoniae suggest that Holliday branch migration and resolution differ from those in E.coli, and a novel resolvase is being searched for (31).
Proteins of COG0217 are frequently found in conserved gene neighbourhood with the classical ‘three-protein-resolvasome’, but also in the equivalent operon of the atypical resolvasome of some low GC Gram-positives, which lacks the endonuclease (Figure 2a). The network view (Figure 2b) illustrates the functional association of COG0217 based on its conserved neighbourhood with well-annotated genes, and the functional associations between the other genes of this operon. Together with other, below-threshold associations (see Supplementary Material, Info-Box 1), an accessory function in DNA repair is predicted, possibly in response to oxidative damage and in conjunction with the resolvasome (32).
It should be noted that while the predicted association with the resolvasome is quite strong, it does not provide a precise molecular function for the members of COG0217. However, as for many other such cases, the prediction significantly narrows down the putative function of this group. The prediction can serve as a guide for future experimental inquiry, aiming to identify the role of these hypothetical proteins in the Holliday junction resolvasome, for example, by selected mutational analysis in different species.
Taken together, our large-scale analysis of bacterial proteins of unknown function predicts functional associations for 1740 out of 7853 orthologous groups (22.2%); 1466 of these are linked purely to other uncharacterized proteins, thus hinting at potentially novel functional modules. We can assign cellular processes, complexes or operons for 273 so far uncharacterized orthologous groups (i.e. for 3624 proteins). In contrast, homology searches reveal no functional information for these proteins, at best pointing to very poorly characterized domains [e.g. DUF-domains in the PFAM database (33)]. Thus, the results show the significance of context-based methods in function prediction and emphasize their complementarity to homology-based methods. Our study should be a first step in following the call for community action (1), and may help to pave the way for comprehensive protein function annotation.
Supplementary Material is available at NAR Online.