While there often is a direct relationship between sequence similarity and conservation of protein structure, the same is not true for protein function: transfer of function based solely on the similarity between two sequences can be highly unreliable. Common evolutionary origin does not guarantee functional conservation of paralogs and the more distant the evolutionary relationship, the less reliable the transfer. Indeed, large-scale studies have shown that the transfer of functional annotation is only accurate for highly similar pairs of proteins [10]. However, even when two protein sequences do not appear to have overall sequence similarity, their alignment can contain short conserved sequence motifs, and these patterns of residues can be characteristic of a particular function. More powerful methods such as PSI-BLAST [12] or hidden Markov models can also be used to improve recognition performance. Methods such as ConFunc [13] and PFP [14] use clustering methods to refine and improve such homology-based predictions.
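The caution above about similarity-based transfer can be made concrete with a minimal sketch. It computes percent identity over the gap-free columns of a pairwise alignment and transfers an annotation only above a cutoff. The function names and the 60% default cutoff are illustrative assumptions, not values taken from the studies cited here, and the aligned sequences are assumed to come from some external alignment program.

```python
from typing import Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def transfer_annotation(
    aligned_query: str,
    aligned_hit: str,
    hit_annotation: str,
    min_identity: float = 60.0,  # hypothetical cutoff, chosen for illustration only
) -> Optional[str]:
    """Return the hit's annotation only if the pair is similar enough, else None."""
    identity = percent_identity(aligned_query, aligned_hit)
    return hit_annotation if identity >= min_identity else None
```

A stricter cutoff trades coverage for reliability, which mirrors the observation that transfer is only accurate for highly similar pairs.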
Domain databases such as Pfam [15], which recognizes the "accumulated sequence conservation of a long sequence segment", are also very useful tools for predicting function. Many Pfam functional domains and alignments are manually constructed by experts and are often among the best sources of functional information.
In many cases the most interesting functional information, such as catalytic and ligand-binding residues, is to be found at the residue level. One example of residue-level transfer can be found in the Catalytic Site Atlas [16]. Here catalytic residues extracted from the literature are supplemented by catalytic residues annotated from PSI-BLAST searches. One recent development has been Firestar [17], which is a server that integrates a database of experimentally validated functional residues with a sequence alignment analysis tool that evaluates the reliability of functional transfer. Firestar highlights potential functionally important residues such as ligand-binding residues and catalytic residues and allows users to assess whether the functionally important residues can be transferred.
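Residue-level transfer of this kind can be illustrated with a short sketch that maps annotated positions from a template onto a query through a pairwise alignment and reports whether each residue is conserved. This is a generic illustration under stated assumptions and not the procedure used by the Catalytic Site Atlas or Firestar; the function name and the returned tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Mapping = Tuple[Optional[int], str, Optional[str], bool]


def map_functional_residues(
    aligned_template: str,
    aligned_query: str,
    template_sites: Iterable[int],
) -> Dict[int, Mapping]:
    """Map 1-based template positions onto the query via a pairwise alignment.

    Each value is (query_position, template_residue, query_residue, conserved).
    A site aligned to a gap in the query maps to (None, residue, None, False).
    """
    if len(aligned_template) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(template_sites)
    result: Dict[int, Mapping] = {}
    t_pos = 0  # position in the ungapped template
    q_pos = 0  # position in the ungapped query
    for t_res, q_res in zip(aligned_template, aligned_query):
        if t_res != "-":
            t_pos += 1
        if q_res != "-":
            q_pos += 1
        if t_res != "-" and t_pos in wanted:
            if q_res == "-":
                result[t_pos] = (None, t_res, None, False)
            else:
                result[t_pos] = (q_pos, t_res, q_res, t_res == q_res)
    return result
```

In this sketch, a site would be considered transferable only when the aligned query residue is identical; a real tool would also weigh the local alignment quality around each site.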
Protein phosphorylation has a crucial role in almost all cellular signaling processes and is the most widespread post-translational modification in eukaryotes [18]. The first machine-learning-based method for prediction of phosphorylation sites, NetPhos, was published a decade ago; it uses ensembles of neural networks to distinguish between phosphorylated and non-phosphorylated residues [19]. However, mammals have more than 500 protein kinases with very different sequence specificities. Newer methods have thus instead focused on deriving separate sequence motifs for individual kinases or families of closely related kinases. The Scansite method relies on position-specific scoring matrices that are determined from data obtained in in vitro binding assays using degenerate peptide libraries [20]. Alternatively, machine-learning algorithms can be used to derive a sequence motif for each kinase (or kinase family) based on its known in vivo substrates. The first such method, NetPhosK, consisted of neural networks for only six kinase families [21] and was later extended to 17 families. Many other kinase-specific methods have been developed using a variety of different machine-learning algorithms (see [22] and references therein for an overview).
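To show how a position-specific scoring matrix can be applied to candidate phosphorylation sites, the following minimal sketch sums per-position scores around each serine or threonine. The matrix values are invented toy numbers and do not correspond to any Scansite matrix or real kinase; the names `TOY_PSSM` and `scan_sites` are likewise hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix: offset from the acceptor residue -> residue -> score.
# These values are made up for illustration and describe no real kinase.
TOY_PSSM: Dict[int, Dict[str, float]] = {
    -3: {"R": 1.5, "K": 1.0},
    -2: {"R": 1.0},
    +1: {"L": 0.5, "F": 0.5},
}


def scan_sites(
    sequence: str,
    pssm: Dict[int, Dict[str, float]],
    acceptors: str = "ST",
    default: float = 0.0,
) -> List[Tuple[int, str, float]]:
    """Score every acceptor residue; returns (1-based position, residue, score)."""
    hits: List[Tuple[int, str, float]] = []
    for i, residue in enumerate(sequence):
        if residue not in acceptors:
            continue
        score = 0.0
        for offset, column in pssm.items():
            j = i + offset
            if 0 <= j < len(sequence):  # ignore offsets that fall off the ends
                score += column.get(sequence[j], default)
        hits.append((i + 1, residue, score))
    return hits
```

Keeping one matrix per kinase or kinase family, as the newer methods do, would amount to calling `scan_sites` once per matrix and comparing the resulting scores.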
As experimental phospho-proteomics approaches continue to produce vast numbers of phosphorylation sites, a key problem is to match these sites to the kinases that phosphorylate them. NetPhorest is a new atlas of consensus sequence motifs with a nonredundant collection of 125 sequence-based classifiers for linear motifs in phosphorylation-dependent signaling [23]. It covers more than 180 kinases and 100 phosphorylation-dependent binding domains (such as Src homology 2 (SH2), phosphotyrosine binding (PTB), BRCA1 C-terminal (BRCT), WW and 14-3-3). The resource is maintained by an automated pipeline, which uses phylogenetic trees to structure the available in vivo and in vitro data to derive probabilistic sequence models of linear motifs. The resource can therefore be updated automatically as new data become available, and this type of approach represents an entirely new angle on the sustainability of tools for protein function annotation.
The cellular substrate specificities of kinases are heavily influenced by contextual factors such as co-activators, protein scaffolds and expression [18]. The systems-biology-oriented method NetworKIN takes the context into account by augmenting the sequence motifs with a network context for the kinases and phosphoproteins [24]. The network is constructed on the basis of known and predicted functional associations from the STRING database, which integrates evidence from curated pathway databases, automatic literature mining, high-throughput experiments and genomic context [25]. For further details on prediction of biological networks see [26] and references therein.
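The idea of augmenting motif scores with network context can be sketched as a simple re-ranking of candidate kinases. This does not reproduce NetworKIN's actual scoring scheme, which is not described here: multiplying a motif score by an association score is an assumption made purely for illustration, and the kinase and substrate names in the usage note are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def rank_kinases(
    motif_scores: Dict[str, float],
    association_scores: Dict[FrozenSet[str], float],
    substrate: str,
) -> List[Tuple[str, float]]:
    """Rank candidate kinases for one substrate site.

    motif_scores: kinase -> sequence-motif score for the site.
    association_scores: unordered protein pair -> functional association score.
    The combined score is motif * association (an illustrative assumption);
    a missing pair counts as an association of 0.0.
    """
    ranked: List[Tuple[str, float]] = []
    for kinase, motif in motif_scores.items():
        context = association_scores.get(frozenset((kinase, substrate)), 0.0)
        ranked.append((kinase, motif * context))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

With hypothetical inputs in which `KinA` has the better motif match but `KinB` is more strongly associated with the substrate, the context can reverse the motif-only ranking, which is the effect the contextual approach is designed to capture.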
Many proteins are glycoproteins, and the most important types of glycosylation are N-linked, O-linked GalNAc (mucin-type), and O-β-linked GlcNAc (intracellular/nuclear) [21]. Glycosylation prediction is not a trivial task because of the lack of a clear consensus recognition sequence; however, it has been possible to develop useful models for prediction of O-GalNAc-glycosylation (NetOGlyc) using a neural-network-based approach that combines a range of features derived from sequence [27]. A recent advance in the glycosylation field has been the development of a new method, NetCGlyc, for predicting the unusual modification C-mannosylation [28