In a global search for conserved uncharacterized proteins, we retrieved a total of 7853 orthologous groups annotated to be of unknown function. All groups contain orthologous proteins derived from several genomes.
In the set of 102 prokaryotic genomes considered here, 82% of all genes are included in orthologous groups. Most of the remaining proteins might have no orthologue in any other genome and thus cannot be included in a comparative analysis; others might even be gene prediction artefacts.
The list of functionally uncharacterized groups contains 1162 groups originally assembled by Tatusov et al. (COGs), and 6691 groups derived from more recently completed genome sequences by a similar procedure (7) (NOGs). The entire list was submitted to the STRING server for genomic context analysis, using a score threshold of 0.4. We found that 5628 of the 7853 COGs/NOGs did not reveal any genomic context association. The remaining 2225 groups, however (703 COGs and 1522 NOGs, see Figure), were found to be functionally related to other orthologous groups based on gene neighbourhood, gene fusion events or phylogenetic co-occurrence.
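The thresholding step described above can be illustrated with a minimal Python sketch. It is not the STRING server's implementation: the input format (triples of group, group, score), the function name `groups_with_context`, the placeholder identifiers and the toy scores are all assumptions made for illustration, as is the choice to treat a score equal to 0.4 as passing the threshold.

```python
from collections import defaultdict

SCORE_THRESHOLD = 0.4  # threshold quoted in the text


def groups_with_context(associations, query_groups, threshold=SCORE_THRESHOLD):
    """Map each query group to its partners scoring at or above the threshold.

    `associations` is an iterable of (group_a, group_b, score) triples;
    this layout is a hypothetical stand-in for real association data.
    Query groups without any qualifying partner are absent from the result.
    """
    queries = set(query_groups)
    partners = defaultdict(set)
    for group_a, group_b, score in associations:
        if score < threshold:
            continue  # below-threshold associations are ignored
        if group_a in queries:
            partners[group_a].add(group_b)
        if group_b in queries:
            partners[group_b].add(group_a)
    return dict(partners)


# Toy data with placeholder identifiers (not real COG/NOG scores).
toy_associations = [
    ("Q1", "K1", 0.75),
    ("Q1", "Q2", 0.35),
    ("Q3", "K2", 0.40),
    ("Q2", "K3", 0.10),
]
toy_result = groups_with_context(toy_associations, ["Q1", "Q2", "Q3"])
print(sorted(toy_result))  # query groups that retain at least one partner
```

In this toy input, the query group `Q2` has only below-threshold links and therefore drops out, which mirrors the groups in the text that revealed no genomic context association.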
Figure 1. Uncharacterized prokaryotic proteins: prevalence and flow chart of the annotation strategy. (1) Extraction of COGs annotated as ‘hypothetical’ or ‘uncharacterized’, and (1') NOGs without any functional annotation from ...
We manually checked the list to remove COGs/NOGs for which some molecular function was in fact already known [e.g. in the SWISSPROT database (17)]. This reduced the list to 1737 groups (containing 6367 proteins).
For 399 COGs and 1065 NOGs of these groups (84.1% of 1740), we identified only uncharacterized interaction partners or only proteins with a very unspecific function (see Supplementary Material). We observed that these groups form a larger association network of uncharacterized proteins; we split this network by a standard clustering procedure [for methods see (12)] in order to identify sets of hypothetical proteins that potentially co-operate in separable functional units. We observe that 975 groups of orthologues (66.6%) are clustered in 384 functional modules (see Figure), whereas 492 have only single connections. The modules of unknown function identified here may represent novel pathways, complexes or operons, and should be given high priority for experimental analysis, as they are likely to perform complex functions requiring several gene products, and are important enough for the cell to be selected for as a unit.
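As a rough illustration of how an association network can be split into candidate modules, the following Python sketch groups nodes into connected components. This is a deliberately simple stand-in and not the clustering procedure referred to in (12), which is not reproduced here; the function name and the placeholder node labels are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Return the connected components of an undirected graph, largest first.

    `edges` is an iterable of (node_a, node_b) pairs, e.g. pairs of
    orthologous groups linked by an above-threshold association.
    """
    adjacency = defaultdict(set)
    for node_a, node_b in edges:
        adjacency[node_a].add(node_b)
        adjacency[node_b].add(node_a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Placeholder labels, not real orthologous groups.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
toy_modules = connected_components(toy_edges)
print([sorted(module) for module in toy_modules])
```

A dedicated clustering algorithm would additionally break up large components along weakly connected regions, which connected components alone cannot do.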
For the remaining 127 COGs and 146 NOGs, we were indeed able to predict a cellular role, a connected pathway or a complex (see Table for selected representatives and Supplementary Material for the whole list). A novel function was predicted for 24% of COGs, but only for 12% of NOGs, indicating a correlation between orthologous group size (average size: ~17 proteins/COG and ~4 proteins/NOG) and the likelihood of belonging to a conserved genomic context.
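The two percentages can be checked against the group counts given in the text. The short Python sketch below assumes that the denominators are the groups remaining after manual filtering, i.e. those with a predicted role plus those linked only to uncharacterized or unspecific partners; the variable and function names are illustrative.

```python
# Counts taken from the text.
predicted_role = {"COG": 127, "NOG": 146}
only_uncharacterized_partners = {"COG": 399, "NOG": 1065}


def prediction_rate(kind):
    """Fraction of remaining groups of this kind with a predicted role."""
    total = predicted_role[kind] + only_uncharacterized_partners[kind]
    return predicted_role[kind] / total


for kind in ("COG", "NOG"):
    print(f"{kind}: {prediction_rate(kind):.1%}")
# prints "COG: 24.1%" and "NOG: 12.1%"
```

Under this assumption the rates are 127/526 and 146/1211, in agreement with the rounded values of 24% and 12%.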
Selected uncharacterized orthologous groups, and their putative functions or pathways predicted here
We found functional associations for uncharacterized COGs/NOGs to a broad variety of cellular activities. These include chromatin-associated processes such as DNA-repair, transcription and translation, metabolic or signalling pathways and membrane-associated transport and secretion processes.
The specificity and annotation depth of predicted partners for COGs/NOGs of unknown function vary. Associated partners range from well-defined and experimentally characterized COGs [for instance, the hypothetical COG3310 phylogenetically co-occurs significantly in several proteobacteria with an experimentally characterized operon of pilus assembly proteins (18)] to barely specified pathways, such as COG4029, COG4048, COG4050, COG4051, COG4052 and COG4069, which appear only in methanogenic archaea in a conserved operon with genes coding for phosphatases (COG4087) (19) (Table).
Newly annotated NOGs are often related to very specific cellular processes that are present in very few species known so far [e.g. NOG14679, related to Photosystem II (21)].
Recent experimental data confirm and complement our genomic context approach. An example is NOG06495 (see Table), consisting of hypothetical proteins from three different species, which our procedure predicts to be associated with proteins of the type III protein-secretion pathway. Experimental evidence consistent with this genomic context prediction was recently published (22).
In several cases, a number of uncharacterized COGs were found clustering together, but were also linked to at least one better characterized group. We found, for instance, a cluster of 15 groups (including COG3456, see Table and Supplementary Material), revealing a putative novel functional pathway or complex; genes of these orthologous groups are closely associated (scores >0.7) by conserved gene neighbourhood (data not shown) and phylogenetic co-occurrence (see Figure S1a in Supplementary Material) in several proteobacterial species. However, the proteins of this novel pathway are also connected to homologues of a flagellar motor protein (COG1360) (23) by conserved operon architecture, a fusion event with COG3455 (hypothetical) and phylogenetic co-occurrence (see Figure S1b in Supplementary Material). Proteins of COG1360 contain an OmpA/MotB domain and are thought to function as porin-like integral membrane proteins (24) or lipid-anchored proteins (25), and COG3523 contains an ImcF domain, which has been proposed to be involved in Vibrio cholerae cell surface reorganization (26). Furthermore, COG3515 contains the ‘ImpA-related N-terminal domain’ of inner membrane proteins; this domain has been found in extracellular proteins and is associated with colony variations in Actinobacillus actinomycetemcomitans. These findings support the assumption that the novel pathway/protein complex plays a role in a membrane-associated transport process.
An example of a more specific prediction, via association with a well-characterized pathway, is COG0217, which consists of hypothetical proteins present in a variety of eubacterial species. Crystal structure analyses of a representative of this group (Aq1575 from Aquifex aeolicus) give no hint of any putative function (28). As shown in Figure a, genes coding for these proteins occur consistently (in 24 species) in a putative operon together with Holliday junction resolvasome genes. The resolvasome in E. coli is known to play an important role in the late stages of homologous genetic recombination and in the recombinational repair of damaged DNA (29). In the majority of species, the functional resolvasome depends on three different subunits (DNA-binding subunit RuvA, DNA helicase RuvB and DNA endonuclease RuvC).
Figure 2. (a) Case study: evidence linking COG0217 to well-annotated proteins. Species tree showing conserved operon architecture and co-occurrence of genes coding for subunits of the Holliday junction resolvasome (COG0217: uncharacterized conserved protein (red), ...
Structural studies indicate that two RuvA tetramers sandwich the Holliday junction, where heteroduplex DNA forms, and that hexameric rings of RuvB face the junction. Thus, RuvB promotes a dual helicase action that ‘pumps’ DNA through the RuvAB complex by ATP hydrolysis.
The third protein, RuvC endonuclease, resolves the Holliday junction by introducing nicks into two DNA strands (29). RuvC is absent in most low GC Gram-positives (30) and in Borrelia; further studies in Mycoplasma pneumoniae suggest that Holliday branch migration and resolution differ from those in E. coli, and a novel resolvase is being searched for (31).
Proteins of COG0217 are frequently found in conserved gene neighbourhood with the classical ‘three-protein resolvasome’, but also in the equivalent operon of the atypical resolvasome of some low GC Gram-positives, which lacks the endonuclease (Figure a). The network view (Figure b) illustrates the functional association of COG0217 based on its conserved neighbourhood with well-annotated genes, and the functional associations between the other genes of this operon. Together with other, below-threshold associations (see Supplementary Material, Info-Box 1), an accessory function in DNA repair is predicted, possibly in response to oxidative damage and in conjunction with the resolvasome (32).
It should be noted that, while the predicted association with the resolvasome is quite strong, it does not provide a precise molecular function for the members of COG0217. However, as in many other such cases, the prediction significantly narrows down the putative function of this group. It can serve as a guide for future experimental inquiry aiming to identify the role of these hypothetical proteins in the Holliday junction resolvasome, for example, by selected mutational analysis in different species.
Taken together, our large-scale analysis of bacterial proteins of unknown function predicts functional associations for 1740 out of 7853 orthologous groups (22.2%); 1466 of these show links purely among uncharacterized proteins, thus hinting at potentially novel functional modules. We can assign cellular processes, complexes or operons to 273 so far uncharacterized orthologous groups (i.e. to 3624 proteins). In contrast, homology searches reveal no functional information for these proteins, at best pointing to very poorly characterized domains [e.g. DUF domains in the PFAM database (33)]. Thus, the results show the significance of context-based methods in function prediction and emphasize their complementarity to homology-based methods. Our study should be a first step in following the call for community action (1), and may help to pave the way for comprehensive protein function annotation.