This work shows that the diffusion of protein functions over a network of local structural and evolutionary similarities yields accurate functional predictions. Two features distinguish the diffusion process. (1) It is guided by functionally relevant links. These links are defined by reciprocal ETA matches, which establish that two proteins share key functional determinants in identical structural configurations. Importantly, they can be generated from ET analysis without any prior knowledge of a protein's likely function or mechanism, yet they reflect both evolutionary and structural information about the most functionally relevant parts of a protein. (2) Network diffusion puts every link and every known prior annotation on par across the entire network, so that all annotations compete without bias, and the best one at each node is objectively assessed with a statistical z-score.
Compared to other state-of-the-art approaches, the confidence values proved better at identifying unreliable predictions, which in turn improved annotation accuracy at the third and at the fourth (most specific) EC levels. These results are general, since they apply across all types of enzymes, and accurate, since false positive rates decrease substantially, by 2- to 5-fold. The many predictions on unannotated proteins demonstrate the benefits of the repeated use of evolutionary information and its integration with structural information over the structural proteome.
In order to identify the sources of information that improve annotation, we examined the impact of negative information on diffusion, using identical PDB 90 networks and Structural Genomics test sets. Accuracy-coverage curves with negative labels (Figure S4A, red) have a substantial accuracy advantage over the same curves without negative labels (purple). Without the negative labels, accuracy falls: at 50% coverage it drops by 16% and 10.7% for 3 and 4 EC digit predictions, respectively. Thus, the -1 entries in the y vector, which indicate the knowledge that a protein does not perform a specific enzymatic function, contribute significantly to accuracy. By contrast, many annotation methods, for instance the nearest-neighbor and BLAST approaches we benchmark against, do not make use of this information. Hence, knowledge of which proteins in the network lack a particular function is critical for function prediction with diffusion, and may contribute to accuracy advantages over other methods.
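The role of the -1 entries can be illustrated with a toy example. The sketch below is not the formulation used in this work; it assumes a generic Laplacian-regularized diffusion that minimizes sum_i (f_i - y_i)^2 + alpha * sum_(i,j) w_ij (f_i - f_j)^2, solved by Jacobi iteration. The function name `diffuse`, the parameter `alpha`, and the 5-node chain are all hypothetical.

```python
def diffuse(edges, y, alpha=1.0, iters=500):
    """Generic Laplacian-regularized label diffusion (illustrative assumption).

    edges: list of (i, j, weight) undirected links.
    y:     prior labels, +1 (has function), -1 (lacks function), 0 (unknown).
    Returns the diffused score f for every node.
    """
    n = len(y)
    nbrs = [[] for _ in range(n)]
    for i, j, w in edges:
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    f = [0.0] * n
    for _ in range(iters):
        # Jacobi update from the stationarity condition of the objective:
        # f_i = (y_i + alpha * sum_j w_ij f_j) / (1 + alpha * sum_j w_ij)
        f = [
            (y[i] + alpha * sum(w * f[j] for j, w in nbrs[i]))
            / (1.0 + alpha * sum(w for _, w in nbrs[i]))
            for i in range(n)
        ]
    return f


# Hypothetical chain of five proteins: 0 - 1 - 2 - 3 - 4
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0)]
with_neg = diffuse(edges, [1, 0, 0, 0, -1])    # node 4 is known to lack the function
without_neg = diffuse(edges, [1, 0, 0, 0, 0])  # same network, negative label dropped
```

In this toy network, node 3 receives a negative score when the negative label on node 4 is present and a weakly positive score when it is dropped, which is the kind of spurious positive signal that negative labels suppress.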
Additionally, in order to assess the contribution of distant positive labels to annotation, we examined the shortest path lengths between proteins with the same and differing functions. Figure S4C shows a stacked histogram comparing the lengths of shortest paths in the network between nodes with a correct prediction and nodes in the dataset, segmented by the confidence z-score. Proteins with the same function (blue) tend to have shorter distances between them than proteins with different functions (orange), indicating that functions generally cluster in the network. However, the distributions have long tails, especially for predictions with a z-score less than 3, so that in a number of instances proteins with the same function can be quite distant. Based on our accuracies, the diffusion process is presumably able to connect these distant proteins. Therefore, both negative labels and positive labels distant by 10 or more nodes are additional information sources that contribute to more accurate predictions in ETA Network Diffusion.
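A path-length comparison of this kind can be sketched with a breadth-first search. The code below is a minimal illustration on an unweighted network, not the analysis pipeline behind Figure S4C; the adjacency dictionary, protein identifiers, and function names are hypothetical.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_lengths_by_function(adj, labels):
    """Split pairwise shortest-path lengths into same-function and
    different-function groups. Unlabeled or unreachable nodes are skipped."""
    same, diff = [], []
    nodes = sorted(labels)
    for a in nodes:
        dist = shortest_path_lengths(adj, a)
        for b in nodes:
            if a < b and b in dist:  # count each unordered pair once
                (same if labels[a] == labels[b] else diff).append(dist[b])
    return same, diff


# Hypothetical chain p1 - p2 - p3 - p4; p3 is unannotated.
adj = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
labels = {"p1": "3.1.1.1", "p2": "3.1.1.1", "p4": "6.3.4.15"}
same, diff = path_lengths_by_function(adj, labels)
```

Histograms of `same` and `diff` would then show whether proteins sharing a function sit closer together in the network than proteins that do not.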
Strikingly, these results rely on the large-scale comparison of just six evolutionarily important template residues, chosen protein by protein. The accuracy of the network shows that these residues effectively capture the determinants of protein function. This, in essence, validates on a large scale the notion that ET analysis identifies key functional residues, consistent with the conclusions of many experimental case studies [36]. Notably, although this study draws on many previous ideas, such as 3D templates [57], [76], evolutionary importance [35], functional site analysis [88], molecular determinants of function [23], [20], and network analysis [64], [69], [89], [90], it combines them uniquely by repeatedly relying on evolution at each step of the annotation process.
First, the 3D template residues are selected for their evolutionary importance, measured by ET, and for their structural clustering. This local structural motif defined by evolution obviates the need for any prior knowledge, or assumptions, about the nature and determinants of function. This is an advantage because, compared to the size of the proteome, relatively few proteins have reliable data on the molecular mechanisms underlying their function and specificity, as may be available from the catalytic site atlas [34]. Likewise, these 3D templates also replace searches for structural features, such as clefts, cavities or depressions, which are suggestive but rarely sufficient [88].
Second, the selection of which 3D template matches are functionally relevant also relies on evolution. Out of the profusion of purely geometric matches between a template and protein structures, only those that involve residues with evolutionary importance similar to the template residues are retained. Every accepted match, and therefore every edge in the network, indicates reciprocal similarities of evolutionary constraint and structural context, which raises the likelihood of a true functional similarity.
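The reciprocity requirement can be sketched as a simple filter over directed matches. This is an illustration only: it assumes the evolutionary-importance filtering has already been applied, so that each pair (a, b) means "the template of protein a matched protein b". The function name and protein identifiers are hypothetical.

```python
def reciprocal_edges(matches):
    """Keep only pairs matched in both directions.

    matches: iterable of (template_protein, target_protein) directed pairs
             that already passed the geometric and evolutionary filters.
    Returns a set of undirected edges, each a frozenset of two proteins.
    """
    directed = set(matches)
    return {
        frozenset((a, b))
        for a, b in directed
        if a != b and (b, a) in directed
    }


# Hypothetical matches: A and B match each other; A matches C one way only.
matches = [("A", "B"), ("B", "A"), ("A", "C")]
edges = reciprocal_edges(matches)
```

Here only the A-B pair becomes a network edge, because the A-to-C match has no counterpart from C's template back to A.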
It follows, third, that diffusion over a network defined by these evolutionary template matches disseminates evolution-guided inferences over the structural proteome. The correlation between the confidence z-score associated with every diffused function and the reliability of annotations confirms that these three hierarchical types of evolutionary inferences (the 3D template, the match, and the diffusion) are all well founded: evolutionary analysis thus dramatically narrows the search for the essential determinants of a protein's function and for their comparison.
The global network approach also has many intrinsic advantages. It removes the heuristic aspect of the ETA voting approach [57], it enables global and formal integration of information over the entire structural proteome, and, as a future direction, it prepares the integration of ETA information with many other types of functionally relevant protein similarity, since the latter usually come in the form of pairwise relationships and are therefore well suited for network representation [64], [69]. Specifically, the diffusion process is non-local and draws information from all of the functional labels in the network, not just those from direct matches. As a result, it extends prediction coverage compared to strictly local techniques. This is illustrated, for example, by the gene PHO147 in Pyrococcus horikoshii (PDB 2dz9 chain A), as shown in . This protein matches solely unannotated proteins in a well-connected cluster, so neither ETA nor nearest neighbors can make a prediction. Network diffusion, however, enables more distant annotations to inform the annotation of this node, leading to a correct fourth EC digit prediction.
The computation of a confidence z-score that correlates with prediction accuracy is another contribution of this work. Together with the global diffusion process, it enables unbiased consideration of all possible functions, establishes an objective criterion for selecting the best candidate, and attaches a confidence value to it. As a result, predictions can be stratified by the z-score, yielding the accuracy versus coverage receiver-operator curves, shown in , that remain close to 100% accuracy over a large coverage. Predictions with a z-score above 2 are over 99% accurate, and those with a z-score above 1 are 98% accurate. Conversely, on the FLORA set, the vast majority of false positives also had the lowest confidences (z<−0.05) (). The z-score is therefore an adequate marker of confidence with which to recognize unreliable predictions that otherwise would become false positives. Overall, we see 2- to 5-fold reductions in false positives when compared to ETA, FLORA, nearest neighbors or BLAST.
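Stratifying predictions by z-score to obtain an accuracy versus coverage curve can be sketched as follows. This is a minimal illustration that assumes each prediction is summarized as a (z-score, correct or not) pair; the function name and the five example predictions are hypothetical and are not data from this study.

```python
def accuracy_coverage(predictions, total):
    """Accuracy versus coverage as the z-score threshold is lowered.

    predictions: list of (z_score, is_correct) pairs, one per predicted protein.
    total:       number of proteins in the test set (the coverage denominator).
    Returns a list of (z_threshold, coverage, accuracy) points.
    Tied z-scores are not merged in this sketch.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    curve, correct = [], 0
    for k, (z, ok) in enumerate(ranked, start=1):
        correct += bool(ok)
        # Keeping the k most confident predictions covers k / total proteins.
        curve.append((z, k / total, correct / k))
    return curve


# Hypothetical predictions, for illustration only.
preds = [(3.2, True), (0.4, False), (2.5, True), (-0.2, False), (1.1, True)]
curve = accuracy_coverage(preds, total=5)
```

In this made-up example, accuracy stays at 100% down to a threshold of 1.1 (60% coverage) and falls once the two low-confidence predictions are admitted.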
Collectively these improvements yield confident predictions at the 4th EC level, which identifies precise substrates in many cases. For example, the predicted EC annotation for gene PHO147 in Pyrococcus horikoshii (PDB structure 2dz9A) is biotin—acetyl-CoA-carboxylase ligase (EC 6.3.4.15). This function indicates the substrates ATP, biotin and acetyl-CoA-carboxylase, which would not be obtainable from a 3 digit EC annotation (EC 6.3.4, Carbon—Nitrogen Ligases), which usually describes the chemical reaction.
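The difference between 3 and 4 EC digit agreement can be made concrete with a short helper. This is a sketch with hypothetical function names; it treats an EC number as four dot-separated fields and treats a "-" field as unassigned.

```python
def ec_prefix(ec, level):
    """Return the first `level` fields of an EC number, or None if any of
    them is missing or unassigned ('-')."""
    if not 1 <= level <= 4:
        raise ValueError("EC level must be between 1 and 4")
    fields = ec.split(".")[:level]
    if len(fields) < level or any(f in ("-", "") for f in fields):
        return None
    return ".".join(fields)


def agree_at_level(predicted, known, level):
    """True if two EC numbers share the same first `level` fields."""
    p, k = ec_prefix(predicted, level), ec_prefix(known, level)
    return p is not None and p == k


# EC 6.3.4.15 versus another number in the same sub-subclass, EC 6.3.4.
third = agree_at_level("6.3.4.15", "6.3.4.14", 3)
fourth = agree_at_level("6.3.4.15", "6.3.4.14", 4)
```

The two numbers agree at the third level but not at the fourth, so only a 4 digit comparison distinguishes their substrates.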
In the future, a number of network diffusion limitations remain to be addressed. Here only enzymatic functions were considered, although ETA itself makes both enzymatic and non-enzymatic predictions using Gene Ontology (GO) terms [57]. The reason was that the network diffusion of labels taken from a GO directed acyclic graph (DAG) is more complex than from the simple EC hierarchy. Another concern is to further extend the coverage of yet unannotated proteins. As seen in , ETA network diffusion performs better than a BLAST search when there are fewer homologs at high confidence z-scores. However, many non-homologous proteins share molecular function as a result of convergent evolution [91], and variations can produce enzymes with similar function but differing sequence motifs [92]. Moreover, enzymatic function can be flexible and depend on context and expression level [7], such that enzymes are promiscuous and may perform several functions [93]. Presumably, to achieve even greater coverage, these problems will need to be addressed by raising the function detection sensitivity of the network. Further improvements in template construction [39], [94] or data integration [69] are possible directions towards these goals.
In practice, the competitive diffusion of Evolutionary Trace Annotations via a global network of local evolutionary and structural similarities provides a highly specific and reliable method to predict the function of novel protein structures. With the goal of minimizing false positives, we showed that the confidence z-score can reliably select correct annotations and identify those that are likely to be false. The improvement over sequence comparison and nearest neighbor methods is most striking for predictions at the fourth EC level. This leads to 257 high-confidence functional predictions of Structural Genomics proteins (Table S1). For one of these, the prediction of carboxylesterase activity in Staphylococcus aureus protein SAV0321 (PDB ID 3h04), we have demonstrated the accuracy of our method through an in vitro assay.