Genomics centers discover increasingly many protein sequences and structures, but not necessarily their full biological functions. Thus, currently, fewer than one percent of proteins have experimentally verified biochemical activities. To fill this gap, function prediction algorithms apply metrics of similarity between proteins on the premise that those sufficiently alike in sequence, or structure, will perform identical functions. Although high sensitivity is elusive, network analyses that integrate these metrics together hold the promise of rapid gains in function prediction specificity.
There is a large gap between the number of known proteins and the number that are functionally characterized. Out of a few thousand ongoing high-throughput genome projects, the nine hundred or so that are complete collectively yield over 13 million protein sequences. A sliver of these, or 1%, has experimental annotations. Most others carry inferred annotations (64%), and fully a third remain cryptic, being labeled “putative”, “uncharacterized”, “hypothetical” or “unknown function” (35%) in the UniProt database. The same is true for protein structures solved by Structural Genomics (SG), a worldwide effort that aims to inform function through structural knowledge. In keeping with a selection bias against homologs of known structures, 40% of the nearly 10,000 SG structures solved thus far have unknown function in the Protein Data Bank, and even after putative automated annotations nearly 3000 structures remain listed as unannotated in the Structural Genomics Knowledgebase.
These numbers likely underestimate the magnitude of the problem since existing annotations are not necessarily accurate. Most rely on homology, assuming that evolutionary proximity implies shared function. But even at sequence identities of 70% or greater, careful studies showed that 10% of enzyme pairs had different substrates; and differences in the actual enzymatic reactions are not uncommon near 50% sequence identity [5,6]. Thus databases may carry misannotations that could then propagate, and be amplified, via otherwise accurate annotation methods. Indeed, an analysis of 37 well-characterized enzyme families suggests that electronically curated databases carry misannotations whereas, reassuringly, the manually curated SwissProt database is nearly free of them and is thus closer to a gold standard.
The reason for this discrepancy between human and computer-generated functional knowledge is that many aspects of protein evolution naturally confound both the sensitivity and specificity of automated efforts to pinpoint function. First, individual proteins are multifunctional. This is clear when a protein carries multiple binding or catalytic sites, or promiscuous ones (meaning they are non-specific). But folding, cellular targeting, post-translational modifications, allosteric regulation and degradation are functions in their own right; and their interplay with context is seen in metalloproteins that bind distinct metal ions depending on cellular location. Second, evolutionary relatedness, or the lack of it, can be deceiving. After gene duplication, paralogs may develop entirely unrelated functions, such as eye lens crystallins that originate from enzymes. Conversely, there are over a hundred examples of enzymatic convergence in which unrelated proteins converged to perform similar reactions. Functional convergence is difficult to discern even at the molecular level: a study of nine types of ligand (AMP, ATP, FAD, FMN, glucose, heme, NAD, phosphate and steroid molecules) illustrates that each one can bind in a variety of binding pockets with a wide range of electrostatic or hydrophobic properties. And third, the functional response to even single residue perturbations may range from dramatic fold changes, to switches in functional specificity or catalytic function, and on to no changes in function despite variations in the side chain character or positions of catalytic residues.
This complexity suggests that proteins should be viewed as evolving in a functional landscape with a non-trivial topology. Specifically, the relationship between changes in proteins and changes in their functions has many forms (Figure 1): it can be smooth and predictable (Figure 1, red line), but it can also be abrupt (green line) or absent (blue line). Thus, in response to changes in context or in sequence, the function can sometimes jump to distant parts of the functional landscape rather than stay close by. In that light, the problem of function annotation is twofold: to describe the functional landscape that is available to proteins, and to correctly determine which parts of this landscape a protein occupies given the pitfalls illustrated in Figure 1. We briefly address the first point next, and then focus the balance of the review on the second point.
Nomenclatures that tally, classify and compare individual protein functions have begun to describe part of this functional landscape. The Enzyme Commission (EC) functional classification is a hierarchy of four numbers that describe catalytic reactions in successively finer detail. Level 1 describes broad enzymatic classes and level 4 describes specific substrates, so enzymes that share more EC numbers, counting from level 1, should ideally be increasingly related mechanistically. But detailed comparisons show that enzymes with identical EC numbers at the first three levels may still differ significantly in catalytic process. EC numbers must therefore be interpreted with care.
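As a minimal sketch of how the EC hierarchy is typically compared, the function below counts how many leading levels two EC numbers share. The function name and the handling of unassigned levels (written as "-") are assumptions made for illustration, not part of any cited tool.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC levels two EC numbers have in common.

    Comparison stops at the first level that differs or that is
    unassigned ("-") in either number.
    """
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b or level_a == "-":
            break
        shared += 1
    return shared


# Trypsin (EC 3.4.21.4) and chymotrypsin (EC 3.4.21.1) agree on three levels.
print(shared_ec_levels("3.4.21.4", "3.4.21.1"))  # 3
```

As the text cautions, a shared three-level prefix is only a coarse indicator and does not guarantee an identical catalytic mechanism.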
The Gene Ontology (GO) is a more general alternative. It has distinct terms for Molecular Function, such as growth factor receptor binding; Biological Process, such as cell proliferation; and Cellular Component, such as nuclear membrane. Moreover, eighteen different evidence codes specify the basis for each annotation, and hence its reliability. For example, EXP, IEP, ISS, IC and IEA mean, respectively, inferred from experiment, from expression pattern, from sequence or structural similarity, by the curator, and via electronic annotation. This GO framework creates child-parent hierarchical relationships described by a directed acyclic graph.
Many other classification schemes exist. For example, the Transporter Classification describes transport proteins, and others apply to cellular pathways and processes, such as KEGG and EcoCyc. The latter classifies Escherichia coli genes based on their association with metabolic pathways, while MetaCyc is its most generalized version for bacteria.
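To illustrate what a directed acyclic graph of child-parent relationships implies in practice, here is a small sketch that collects all ancestors of a term, so that an annotation to a specific term can be read as an annotation to each of its more general parents. The toy hierarchy and its term names are hypothetical and are not taken from GO.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following child-to-parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy hierarchy (not copied from GO). Term "C" has two parents,
# which is what makes the structure a DAG rather than a tree.
toy_parents = {
    "C": ["A", "B"],
    "A": ["root"],
    "B": ["root"],
}

print(sorted(ancestors("C", toy_parents)))  # ['A', 'B', 'root']
```

The `seen` set ensures that a parent shared by several paths, such as "root" here, is visited only once.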
Given such classifications to codify the functional landscape, annotation methods then rely on a correlation between functional and structural similarity metrics of the type shown in Figure 1, red line. Many choices of protein similarity metrics are possible, however, to assess likeness.
The simplest protein similarity metrics exploit homology of whole sequences. BLAST/PSI-BLAST are routinely used, and the top hit with a known function provides the annotation. A better strategy is to gather GO terms among all hits, and transfer those that recur with statistically significant frequency. Using a “Function Association Matrix” to apply this strategy, PFP reached ~100% coverage and 60% accuracy in a benchmark set of 2000 non-redundant sequences, with some improvements from an iterative use of PSI-BLAST. Specificity can be raised by distinguishing between orthologs and paralogs. A recent comparative phylogenetic analysis of yeast Saccharomyces cerevisiae genes showed significant differences in functional inheritance between them. SIFTER exploits these differences to transfer GO terms based on Bayesian statistics of duplication and speciation events. This phylogenomic approach is slower, but more accurate. Yet, as already mentioned, small sequence changes can profoundly impact function: melamine deaminase and atrazine chlorohydrolase share 98% sequence identity but differ in function.
Therefore a second type of similarity metric focuses on local sequence motifs rather than on whole sequence comparisons. These motifs consist of residues that directly mediate function, and which therefore should be the most specific for annotation. As a basis for these searches, InterPro assembles functional signatures of proteins gleaned from eleven databases. As motifs become smaller, the chance of random similarities and false positives rises. To recover specificity, EficaZ identifies functionally discriminating residues. These are derived from Hidden Markov Models of alignments of enzymes, PROSITE patterns and family-specific sequence identity thresholds. Fourth-level EC annotations reach 92% specificity and 82% sensitivity in non-trivial controls with mutual sequence identity below 40%. Similarly, ConFunc has 24% greater prediction specificity than BLAST, also in sequences with low sequence identity. It uses position-specific scoring matrices derived from discriminating residue profiles in GO term-specific sub-alignments of PSI-BLAST hits.
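The idea of transferring GO terms that recur among all hits, rather than copying the top hit, can be sketched as follows. This is a deliberately simplified stand-in: it applies a plain frequency cutoff (`min_fraction`, a hypothetical parameter) instead of a statistical significance test, and it does not reproduce PFP's Function Association Matrix or any score weighting.

```python
from collections import Counter


def transfer_recurring_terms(hit_annotations, min_fraction=0.5):
    """Return GO terms carried by at least `min_fraction` of the hits.

    hit_annotations: list of sets of GO terms, one set per sequence hit.
    The result maps each retained term to the fraction of hits carrying it.
    """
    if not hit_annotations:
        return {}
    counts = Counter(term for terms in hit_annotations for term in set(terms))
    n_hits = len(hit_annotations)
    return {
        term: count / n_hits
        for term, count in counts.items()
        if count / n_hits >= min_fraction
    }


# Illustrative annotations for four hypothetical hits.
hits = [
    {"kinase activity", "ATP binding"},
    {"kinase activity"},
    {"kinase activity", "ATP binding"},
    {"DNA binding"},
]
print(transfer_recurring_terms(hits))
```

In this toy input, "kinase activity" recurs in three of four hits and "ATP binding" in two of four, so both pass the default cutoff, while "DNA binding" does not.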
A third type of protein similarity metric exploits three-dimensional (3D) protein structures. One may either directly align structures to each other, or more generally find how they fit into broad classifications of structures, such as SCOP and CATH [33,34], or specialized ones, such as the Structure Function Linkage Database. These methods are justified by the fact that 70% of CATH fold types are associated with just one function, and that GO terms agree more often within a SCOP superfamily than across different superfamilies that share the same fold. Structural alignments are now faster with little accuracy loss, and those methods that do not explicitly construct the alignments can be especially fast [39,40].
A further similarity metric focuses on local structural features. The local geometry of clefts and pockets, or their surface electrostatics, informs comparisons of active sites and ligand-binding sites at the molecular level. For example, pevoSOAR annotates enzymes based on matching cavities and pockets with known functional sites collected in the CASTp database. SURFNET and the ConSurf-HSSP database also focus on cleft comparison, while EF-site and MultiBind/MAPPIS check electrostatic properties and physico-chemical properties of binding sites, respectively. Model structures can be used as well: FINDSITE threads query sequences to find their putative binding sites and to suggest potential ligands. Its accuracy was 67% in controls with less than 35% sequence similarity to any target protein.
A closely related fifth type of similarity metric is based on 3D templates, which narrow local structural searches even further. These templates are composed of a few residues that are directly associated with function and positioned with respect to each other in a defined spatial geometry. The Ser-His-Asp catalytic triad of serine proteases is a case in point. Its residues are not necessarily sequential and may therefore be very difficult to detect from sequence analysis. Yet their 3D templates could be geometrically matched to other protein structures so as to identify other proteases better than sequence homology methods could. The Catalytic Site Atlas is a resource that provides 3D templates for over 53,000 protein chains, each one based on experimentally verified small functional motifs. However, these sites often have three residues or fewer, and hence do not include surrounding residues that may also modulate catalysis. Moreover, many proteins are not enzymes.
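A geometric template match of the kind described above can be sketched as a brute-force search for residues of the right types whose pairwise distances agree with those of the template. Everything here is an assumption for illustration: each residue is reduced to a single point, the coordinates are invented, and `tol` is a hypothetical distance tolerance. Published methods use richer residue representations and far more efficient searches, and a distance-only comparison cannot tell a match from its mirror image.

```python
from itertools import permutations
from math import dist


def match_template(template, structure, tol=1.0):
    """Find residue tuples in `structure` that match `template`.

    template, structure: lists of (residue_type, (x, y, z)).
    A match assigns one structure residue to each template residue so that
    residue types agree and every pairwise distance differs by at most `tol`.
    Returns a list of index tuples into `structure`, in template order.
    """
    n = len(template)
    hits = []
    for combo in permutations(range(len(structure)), n):
        # Residue types must agree position by position.
        if any(structure[i][0] != template[k][0] for k, i in enumerate(combo)):
            continue
        # All pairwise distances must agree within the tolerance.
        if all(
            abs(
                dist(template[a][1], template[b][1])
                - dist(structure[combo[a]][1], structure[combo[b]][1])
            )
            <= tol
            for a in range(n)
            for b in range(a + 1, n)
        ):
            hits.append(combo)
    return hits


# Invented coordinates for a Ser-His-Asp template and a small target structure.
triad = [("SER", (0, 0, 0)), ("HIS", (3, 0, 0)), ("ASP", (3, 4, 0))]
target = [
    ("GLY", (0, 20, 0)),
    ("ASP", (4, 5, 1)),
    ("SER", (1, 1, 1)),
    ("HIS", (4, 1, 1)),
    ("SER", (20, 0, 0)),
]
print(match_template(triad, target))  # [(2, 3, 1)]
```

The matched residues need not be adjacent in the list, which mirrors the point that triad residues are not necessarily sequential.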
To follow this strategy, it is therefore important to also derive the templates themselves. The Reverse Templates (RT) method breaks down a query protein structure into tri-peptide segments and searches them against non-redundant protein structures. GASPS generates templates based on their ability to distinguish related structures from others. A recent state-of-the-art template-based method, FLORA, constructs templates from the residues specific to functional sub-groups in functionally diverse CATH superfamilies, and it outperforms other similar methods in three-digit EC annotation in an unbiased set of control enzymes.
In a complementary approach, 3D templates may also be defined and then compared objectively by relying on evolution. This requires no prior assumptions on functional mechanisms and amino acids. Rather, in a series of steps, key functional residues are extracted from phylogenomic comparisons of aligned sequences, and they are mapped onto the protein structure. Next, templates are picked from the functional residues that cluster at the surface. Their geometric matches to other structures then define template “hits”. Finally, various computational filters select among those hits the ones that are least likely to arise by chance. In practice the Evolutionary Trace Annotation (ETA) server, depicted in Figure 2A, uses the ranked lists of evolutionarily important residues produced by Evolutionary Trace (ET) [55,56]. Top-ranked ET residues are good candidates for 3D templates because they are known to generally overlap functional sites and identify their determinants, such that their targeted mutations efficiently engineer proteins with selective separation of function or rewired functional specificity. Evolution is also central to each of the three specificity filters. The first one is a Support Vector Machine trained to reject template hits that do not fall on residues that are themselves ranked as being evolutionarily important by ET. The second imposes plurality, so that a function is passed to a protein only if that function recurs more often than any other in all of its hits. And the third filter requires hit reciprocity, so that if the template of protein A has a hit on protein B, the reverse is also true: the template of protein B will hit protein A. With all of these filters applied together, the positive predictive value up to the third digit of EC numbers rose to 92% in a large-scale control over more than 1200 SG proteins. Sensitivity, on the order of 40%, can be raised to 53% by using more accurate ET-based templates.
Similar results are obtained for GO annotations: 53% sensitivity and 94% PPV at the third GO depth over 2300 proteins, while 76% of the predictions were still correct at the deepest available GO level.
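The reciprocity and plurality filters described above lend themselves to a short sketch. The Support Vector Machine filter is omitted, and the data structures, function names, protein identifiers and three-digit EC labels below are all hypothetical; this is not the ETA server's implementation.

```python
from collections import Counter


def reciprocal_hits(hits):
    """Keep only hits that hold in both directions.

    hits: dict mapping each protein to the set of proteins its template matches.
    """
    return {
        protein: {q for q in targets if protein in hits.get(q, set())}
        for protein, targets in hits.items()
    }


def plurality_function(query, hits, known_function):
    """Return the function that recurs more often than any other among the
    annotated hits of `query`, or None if there is no hit or the top is tied."""
    votes = Counter(
        known_function[q] for q in hits.get(query, ()) if q in known_function
    )
    ranked = votes.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie: no single function has a plurality
    return ranked[0][0]


# Toy data: the template of Q hits A, B and C, but only A and B hit Q back.
toy_hits = {"Q": {"A", "B", "C"}, "A": {"Q"}, "B": {"Q"}, "C": set()}
toy_functions = {"A": "3.1.1", "B": "3.1.1", "C": "2.7.1"}

filtered = reciprocal_hits(toy_hits)
print(plurality_function("Q", filtered, toy_functions))  # 3.1.1
```

Applying reciprocity before plurality removes the one-way hit on C, so only mutually supported hits contribute votes.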
Since these different metrics focus on different protein features, an expectation is that they would yield better predictions when combined. For example, ProFunc is a meta-server that combines fourteen different types of computational annotation (5 sequence-based, 5 structure-based, 4 template-based), and which reaches 60% coverage with 70% accuracy in controls over 92 protein structures of known function [49,64]. ProKnow is another meta-server that is knowledge-based and which combines similarity metrics from fold and sequence comparisons, from motifs and from interaction relationships among proteins. In 1500 distinctly folded protein controls, its coverage and accuracy were 93% and 89%, respectively, at the first level of the GO classification, decreasing to 44% accuracy at the ninth, deepest available level.
An alternative to meta-servers is to pool annotations into network structures. Genes or gene products define the nodes of such networks, and the associations between them that suggest functional similarities are indicated by edges. A key advantage is that any number of similarity metrics can be represented at once simply by adding new edges between the protein nodes, or strengthening existing edges, regardless of whether they arise from sequence, structure, or evolutionary data over the whole or part of the protein. Moreover, these edges can also describe functional associations from yeast two-hybrid experiments, co-expression, conserved genomic neighborhood, phylogenetic co-occurrence and literature co-occurrence; for example, the STRING database now covers nearly 30% of all protein sequences in UniProt with such data. To benchmark prediction quality or to make novel predictions on protein function, biological process or gene phenotype, one can then apply the concepts of connectivity, centrality, modularity, clustering or graph cuts and maximum flows on graphs. Network methods can be broadly ordered into local and global approaches depending on whether their calculated predictions require some or all nodes and edges in the graph, respectively.
Local network methods consider nearest neighbors: the functions of a node are predicted from its annotated direct neighbors. This heuristic approach remains the standard to measure prediction accuracy and coverage since its predictive power is not easily surpassed and it scales at most linearly with the total number of nodes in the network. For example, given reliable underlying network information, local methods have been shown to predict a spectrum of effects ranging from gene essentiality to tissue-specific loss-of-function phenotypes in the nematode Caenorhabditis elegans. However, local network methods generally require additional considerations to yield statistical confidence values, and non-local alternatives are more accurate.
Some non-local methods can gather information from larger neighborhoods. They apply the concept of network modules, or motifs, which are groups of genes or proteins with the same molecular function or taking part in the same biological process. The detection of modules involves clustering and statistical testing of significance against random networks. In yeast, where detailed and reliable genome-wide interaction data are available, module detection identified both novel molecular complexes and specific biological roles, such as highly significant gene promoter motifs that regulate transcription. However, not all functionally coherent groups of proteins can be represented through modules. For example, transmembrane receptors bind to many extra- and intra-cellular molecular partners but they much less frequently form complexes with other membrane proteins. Hence, it is unlikely that protein interaction networks can be completely decomposed into functional modules.
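A nearest-neighbor prediction of this kind can be sketched as a weighted vote over the annotated direct neighbors of a node. The adjacency format, the edge weights and the labels below are assumptions chosen for illustration; as noted above, such a vote does not by itself provide a statistical confidence value.

```python
from collections import Counter


def predict_from_neighbors(node, edges, labels):
    """Predict a node's function by a weighted vote of its annotated neighbors.

    edges: dict mapping node -> dict of neighbor -> edge weight.
    labels: dict mapping annotated nodes to their function.
    Returns None when no direct neighbor carries an annotation.
    """
    votes = Counter()
    for neighbor, weight in edges.get(node, {}).items():
        if neighbor in labels:
            votes[labels[neighbor]] += weight
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical network: P1 is unannotated and has three neighbors.
toy_edges = {"P1": {"P2": 0.9, "P3": 0.4, "P4": 0.3}}
toy_labels = {"P2": "transport", "P3": "signaling", "P4": "signaling"}

print(predict_from_neighbors("P1", toy_edges, toy_labels))  # transport
```

Here the single strong edge (0.9) outweighs two weaker edges (0.4 + 0.3 = 0.7), which shows how edge weights can change the outcome relative to a simple head count.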
Fully global methods seek to optimize annotations by finding the minimum of a quadratic polynomial, H, over all nodes and edges. Here, H is a positive cost function, the minimum of which reflects the topology of the graph and yields a distribution of numerical labels (discrete or continuous, positive and negative) indicating functional memberships. In the input, only the nodes with known functions carry labels. In the output, after optimization, most nodes carry some labels, including those initially unknown. Minimization of H is an optimization problem equivalent to maximum a posteriori (MAP) estimates in Bayesian networks, to stationary states in Markovian random fields, or to minimum cost solutions in graph-based semi-supervised learning. This last method, also referred to as network diffusion, is notable for its improved accuracy and coverage over local methods. Also, when the network edges are positive, its solution can be computed in a time that grows linearly with network size, enabling global analysis of very large networks, potentially with millions of protein nodes. Finally, it allows the integration of heterogeneous data by optimizing the relative weights of individual networks; for example, those built from local evolutionary, global geometrical, topological and sequence relationships lead, after weighted integration, to an increase in sensitivity of 17% over the best single network.
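To make the quadratic cost concrete, the sketch below minimizes one common form from graph-based semi-supervised learning, H(f) = sum over edges of w_ij (f_i - f_j)^2 + mu * sum over labeled nodes of (f_i - y_i)^2. This particular form, the parameter `mu`, and the Gauss-Seidel iteration are assumptions for illustration and may differ from the formulations used in the cited studies. Setting the derivative of H with respect to each f_i to zero gives the update used in the loop.

```python
def diffuse_labels(edges, seeds, mu=1.0, sweeps=200):
    """Minimize H(f) = sum_edges w_ij (f_i - f_j)^2 + mu * sum_seeds (f_i - y_i)^2.

    edges: symmetric adjacency, dict node -> dict of neighbor -> positive weight;
           every neighbor must also appear as a key.
    seeds: dict node -> known label value y_i (for example +1 or -1).
    Returns a dict node -> continuous label f_i.
    """
    f = {node: seeds.get(node, 0.0) for node in edges}
    for _ in range(sweeps):
        for node, neighbors in edges.items():
            anchor = mu if node in seeds else 0.0
            total = sum(neighbors.values()) + anchor
            if total == 0:
                continue  # isolated, unlabeled node keeps its value
            pulled = sum(w * f[n] for n, w in neighbors.items())
            # Stationarity of H: f_i = (sum_j w_ij f_j + mu y_i) / (sum_j w_ij + mu)
            f[node] = (pulled + anchor * seeds.get(node, 0.0)) / total
    return f


# Toy chain A - B - C - D with A labeled +1 and D labeled -1.
chain = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
scores = diffuse_labels(chain, {"A": 1.0, "D": -1.0})
print({node: round(value, 3) for node, value in scores.items()})
```

For this chain the exact minimizer is f_A = 0.6, f_B = 0.2, f_C = -0.2, f_D = -0.6, so the unlabeled nodes B and C inherit the sign of the nearer labeled node with reduced magnitude. Each sweep touches every edge once, so its cost is linear in the size of the network.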
Most recently, in the context of Structural Genomics, this machine learning technique improved the specificity and coverage of function annotations. A network of protein structures was generated from reciprocal 3D template hits derived from the ETA method (Fig. 2A). At the start, labels indicated the enzymatic activity of known proteins in the Protein Data Bank. Graph-based semi-supervised learning was then applied to transfer known functional labels of enzymatic activity to proteins whose function was unknown and to assign a statistical confidence score to all predictions (Fig. 2B). By comparison to the ETA method, this global analysis raised accuracy by 6% at 65% coverage (from 90% to 96% accuracy) at the substrate-specific fourth EC level. It also increased accuracy over standard BLAST annotation by 10% (from 85% to 95% accuracy, also at 65% coverage; see Fig. 2C). In other controls, it improved over other structure-based methods, such as FLORA, reducing false positives and thereby raising accuracy from 60% to 90% (measured at 97% sensitivity). Finally, as a direct additional control, a new annotation of a carboxylesterase (EC 3.1.1.1), in a vancomycin-resistant strain of Staphylococcus aureus, was tested experimentally and confirmed (Fig. 2D).
The pace of discovery of protein sequences and structures is accelerating, and with it the need to interpret their biological meaning efficiently. While diverse experimental techniques inform on biological processes and phenotype, direct and high-throughput experimental screens that simultaneously measure a wide array of different biochemical activities remain unusual. Any assay is best when tuned to a specific substrate and reaction, optimal conditions will be different for different proteins, and protein promiscuity and functional multiplicity can lead to false positives and false negatives. For these reasons, continued progress in automated annotation is imperative. This means increasing the specificity and sensitivity of function predictions.
Network-based inferences of function are likely to be well-suited for both tasks. Specificity should rise because any type of functionally relevant association between proteins can be integrated in a unified computational framework. Global network analyses also efficiently apply all of the network's information to each node, and statistical significance can point to the most reliable annotations [67,77,80]. Indeed, specificity does rise as a result of these integrative and global features of networks.
One key hurdle for further increases in specificity is the errors that may be contained in the primary data used to encode the networks. Both individual links and the functional labels they propagate may be inaccurate. It is therefore critical to systematically vet reference gold standards, and to test predictions objectively through systematic experiments. A second hurdle is computational. The sheer number of intrinsic relationships between gene and protein sequences poses a computational barrier even to today's most scalable network analysis methods. For example, there are more than a hundred billion orthology relationships between protein sequences in the current STRING database, which covers around 2.5 million proteins across 630 organisms. Global network optimization on such a scale remains a steep challenge. Finally, more involved descriptions of protein dissimilarities or functional anti-correlations should eventually be taken into account. This, however, leads to incompatibilities between functional labels, network frustration, and multiple minima for which efficient optimizations also remain a challenge.
Ultimately, the problem of raising sensitivity may prove harder. There are clearly diverse evolutionary and molecular solutions to carrying out a given function [11,12]. Whenever we come across such a new solution, it is unlikely that existing metrics of similarity will discern the conserved features that mediate the common function. One approach to increase sensitivity, for example in the context of 3D templates, is to reduce the number of residues in the templates and so increase their number of hits. Another approach is to enlarge the repertoire of functional markers, for example by generating multiple 3D templates for each protein, which also leads to more hits. As these and other more sensitive strategies are integrated into one network, the hope would be that they complement each other sufficiently that the network recovers specificity and still preserves the gains in sensitivity.
We gratefully acknowledge grant support from the National Institutes of Health, NIH GM079656 and GM066099, and from the National Science Foundation, NSF CCF 0905536.