An alternate to meta-servers is to pool annotations into network structures. Genes or gene products define the nodes of such networks, and the associations between them that suggest functional similarities are indicated by edges. A key advantage is that any number of similarity metrics can be represented at once simply by adding new edges between the protein nodes, or strengthening existing edges, regardless of whether they arise from sequence, structure, or evolutionary data over the whole or part of the protein. Moreover, these edges can also describe functional associations from yeast-two-hybrid; co-expression; conserved genomic neighborhood; phylogenetic co-occurrence and literature co-occurrence; for example, the STRING database [66
] now covers nearly 30% of all protein sequences in UniProt with such data. To benchmark prediction quality or to make novel predictions on protein function, biological process or gene phenotype, one can then apply the concepts of connectivity, centrality, modularity, clustering or graph cuts and maximum flows on graphs [67
]. Network methods can be broadly ordered into local and global approaches depending on whether their calculated predictions require some or all nodes and edges in the graph, respectively.
Local network methods consider nearest neighbors and the functions of a node are predicted from its annotated direct neighbors. This heuristic approach remains the standard to measure prediction accuracy and coverage since it predictive power is not easily surpassed and it scales at most linearly with the total number of nodes in the network [68
]. For example, given reliable underlying network information, local methods have been shown to predict a spectrum of effects ranging from gene essentiality to tissue-specific loss-of-function phenotypes in the nematode Caenorhabditis elegans
]. However, local network generally require additional considerations to yield statistical confidence values [70
], and non-local alternatives are more accurate.
Some non-local methods can gather information from larger neighborhoods. They apply the concept of network modules, or motifs, which are groups of genes or proteins with the same molecular function or taking part in the same biological process. The detection of modules involves clustering and statistical testing of significance against random networks [67
]. In yeast, where detailed and reliable genome-wide interaction data are available, module detection identified both novel molecular complexes and specific biological roles [71
], such as highly significant gene promoter motifs that regulate transcription [72
]. However, not all functionally coherent groups of proteins can be represented through modules. For example, transmembrane receptors bind to many extra- and intra-cellular molecular partners but they much less frequently form complexes with other membrane proteins [73
]. Hence, it is unlikely that protein interaction networks can be completely decomposed into functional modules.
Fully global methods seek to optimize annotations by finding the minimum of a quadratic polynomial, H
, over all nodes and edges. Here, H
is a positive cost function the minimum of which reflects the topology of the graph and yields a distribution of numerical labels (discrete or continuous, positive and negative) indicating functional memberships. In the input, only the nodes with known functions carry labels. In the output, after optimization, most nodes carry some labels including those initially unknown. Minimization of H
is an optimization problem equivalent to maximum a posteriori (MAP) estimates in Bayesian networks [74
], to stationary states in Markovian random fields [75
], or to minimum cost solutions in graph-based semi-supervised learning [76
]. This last method, also referred to as network diffusion, is notable for its improved accuracy and coverage over local methods [77
]. Also, when the network edges are positive, it produces a solution that grows linearly with network size [78
], enabling global analysis of very large networks—potentially millions of protein nodes. Finally, it allows the integration of heterogeneous data by optimizing the relative weights of individual networks; for example, those built from local evolutionary, global geometrical, topological and sequence relationships lead, after weighted integration, to an increase in sensitivity of 17% over the best single network [79
Most recently, in the context of Structural Genomics, this machine learning technique improved the specificity and coverage of function annotations. A network of protein structures was generated from reciprocal 3D template hits derived from the ETA method [80
] (). At the start, labels indicated the enzymatic activity of known proteins in the Protein Data Bank. Graph-based semi-supervised learning was then applied to transfer known functional labels of enzymatic activity to proteins whose function was unknown and to assign a statistical confidence score to all predictions (). By comparison to the ETA method [54
], this global analysis raised accuracy by 6% at 65% coverage (from 90% to 96% accuracy) at the substrate-specific fourth EC level. It also increased accuracy and coverage over standard BLAST annotation by 10% (from 85% to 95% accuracy also at 65% coverage, see ). In other controls, it improved over other structure-based methods, such as FLORA, reducing false positives to raise accuracy rose from 60% to 90% (measured at 97% sensitivity). Finally, as a direct additional control, a new annotation of a carboxylesterase (EC 18.104.22.168), in a vancomycin resistant strain of Staphylococcus aureus
, was tested experimentally and confirmed ().