A recent trend in computational methods for annotation of protein function is that many prediction tools are combined in complex workflows and pipelines to facilitate the analysis of feature combinations, for example, the entire repertoire of kinase-binding motifs in the human proteome.
As more sequenced genomes become available, computational methods for predicting protein function from sequence data continue to be of high importance. In fact, such methods represent the only viable strategy for keeping up with the growth of genomic information. In the current era of pan- and metagenomics it is obvious that computational annotation is essential for turning sequence data into functional knowledge that can be used to understand biological mechanisms and their evolutionary trends.
The computational annotation of structural and functional properties of proteins from their amino acid sequences is often possible, because similar functional or structural elements can be identified via similar sequence patterns. However, it is important to realize that there are two reasons for these similarities: some are due to homology (common ancestry), whereas others are due to convergent evolution (common selective pressure). This has consequences for the methods used to infer the annotations: while similarities due to common ancestry can often be identified by alignment techniques - either pairwise or profile-based - similarities produced by common selective pressures are often of a more subtle nature and are best identified using machine-learning techniques such as artificial neural networks, support vector machines (SVMs) or hidden Markov models adapted to the topology and sequential structure of the functional patterns in a given protein.
Functional patterns can be local, taking the shape of linear motifs or regions, or they can be reflected by more global features such as amino acid composition or pair frequencies, or by combinations of local and global features. Annotation based on homology has, in a broad sense, been used for as long as amino acid sequences have been compared. However, annotation of non-homologous patterns is also a very old discipline within bioinformatics. One of the very first published prediction methods in this context was a reduced-alphabet weight matrix that calculated a score for signal peptide cleavage sites position by position.
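The position-by-position scoring idea behind such a weight matrix can be sketched in a few lines of Python. This is a minimal illustration and not the published method: the four-class reduced alphabet, the toy training windows, the uniform background, the pseudocount and all function names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical reduced alphabet: the 20 amino acids collapsed into four classes.
GROUPS = {
    "s": "AGSCT",    # small
    "h": "VILMFWY",  # hydrophobic
    "c": "DEKRH",    # charged
    "p": "NQP",      # other polar
}
REDUCED = {aa: cls for cls, residues in GROUPS.items() for aa in residues}
CLASSES = "shcp"


def reduce_seq(seq):
    """Translate an amino acid sequence into the reduced alphabet."""
    return "".join(REDUCED[aa] for aa in seq)


def build_matrix(windows, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned windows.

    A uniform background over the reduced classes is assumed for simplicity.
    """
    reduced = [reduce_seq(w) for w in windows]
    width = len(reduced[0])
    background = 1.0 / len(CLASSES)
    matrix = []
    for pos in range(width):
        column = Counter(r[pos] for r in reduced)
        denom = len(reduced) + pseudocount * len(CLASSES)
        matrix.append(
            {c: math.log(((column[c] + pseudocount) / denom) / background)
             for c in CLASSES}
        )
    return matrix


def scan(seq, matrix):
    """Score every window of the sequence; returns (start, score) pairs."""
    reduced = reduce_seq(seq)
    width = len(matrix)
    return [
        (start, sum(matrix[i][reduced[start + i]] for i in range(width)))
        for start in range(len(reduced) - width + 1)
    ]


# Toy aligned windows standing in for real cleavage-site examples.
toy_windows = ["LAVAG", "LALAS", "VASAG", "LAIAT"]
toy_matrix = build_matrix(toy_windows)
toy_scores = scan("DEKRLAVAGDEKR", toy_matrix)
best_start, best_score = max(toy_scores, key=lambda item: item[1])
```

The highest-scoring window marks the most likely site under the matrix. In practice the windows would come from a curated set of experimentally verified sites, and the background frequencies from a representative protein set.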
No matter which type of functional feature a method attempts to identify, a crucial aspect of its usefulness is its predictive performance and, in particular, its ability to generalize to novel, unannotated data. The selection of dissimilar datasets for training, testing and validation is therefore critical to the practical usefulness of a given method. Overfitting to existing data has been and still is a common problem. When test and validation data are too similar to the training data, the reported predictive performance can be grossly overestimated, and the ability to generalize to novel data may in fact be entirely absent.
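As an illustration of this kind of dataset preparation, the sketch below greedily removes sequences that are too similar to ones already kept, and flags test entries that resemble training entries. It is a simplified example under stated assumptions: the similarity measure is the `difflib.SequenceMatcher` ratio from the Python standard library, used here as a crude stand-in for alignment-based identity, and the 0.4 threshold and function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_redundancy(entries, threshold=0.4):
    """Greedily keep entries whose similarity to every kept entry is below threshold.

    `entries` is a list of (name, sequence) pairs; the order decides which
    member of a similar pair survives.
    """
    kept = []
    for name, seq in entries:
        if all(similarity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


def leaking_test_entries(train, test, threshold=0.4):
    """Names of test entries that are too similar to any training entry."""
    return [
        name
        for name, seq in test
        if any(similarity(seq, train_seq) >= threshold for _, train_seq in train)
    ]


toy_entries = [
    ("a", "MKTLLLTLVVVTIVCLDLGYT"),
    ("b", "MKTLLLTLVVVTIVCLDLGYA"),  # near-duplicate of "a"
    ("c", "GSHMDEEQRRPPNNQQGGSSD"),
]
nonredundant = reduce_redundancy(toy_entries)
leaks = leaking_test_entries(toy_entries[:1], toy_entries[1:])
```

A real pipeline would replace `similarity` with scores from pairwise or profile alignments and would choose the threshold according to the type of feature being predicted.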
Interestingly, several of the breakthroughs in predicting functional features and structure have been linked to improvements in dataset preparation rather than to the invention of new algorithms as such [3-6]. Prediction of protein secondary structure represents one example [3,4], and prediction of signal peptides another. This also holds true for the new class of advanced workflow-oriented prediction schemes in which hundreds of prediction tools are integrated. There, the structuring of the experimental data and their conversion into datasets relevant for machine learning represents the most significant part of the inventive step, rather than the sophistication of the individual prediction tools.
In this review, we will provide an overview of how these different approaches can be used to annotate a number of functional features. We have chosen to focus on the structure-independent aspect of annotation - in other words, which features can be predicted without knowing or explicitly predicting the three-dimensional structure of the protein under consideration. Table 1 contains a list of websites with extensive references to such protein-annotation tools. We will begin by considering the identification of functionally important residues - that is, those involved in catalysis or binding. The prediction of post-translational modifications will then be described - exemplified by phosphorylation, glycosylation and lipid attachment. Then we will discuss how to predict which part of the cell a protein is destined for, on the basis of either the actual sorting signals or differences in global properties of proteins from different compartments. A related question is whether the protein is embedded in a membrane, and if so, which parts traverse the membrane and which parts are exposed to the two compartments separated by the membrane. Finally, we will discuss how these single-feature predictions can be integrated with each other and with overall homology-based detection schemes to assign a functional class to the entire protein.
An important current problem is to predict features that can be successfully used in comparative analysis of rather similar protein sequences, such as those derived from the same transcript by alternative splicing, from genome variation data (single-nucleotide polymorphisms, SNPs), variants arising by somatic mutation, or protein families from one or more species. Here the aim often is not to identify all functional features per se, but rather to single out differential functional features that may explain disease phenotypes or biochemical differences between organisms. The solution, as illustrated in Additional data file 1, is to structure and combine a large set of tools that can then be used to screen differential properties of datasets from large cohorts; this solution is now in development by the Epipe Consortium .
When many features are considered simultaneously, an effective way of structuring feature annotation is to develop an ontology of protein feature types. An ontology provides a structured and precisely defined common controlled vocabulary in a dynamic environment so that changes can occur as different uses are invented and new terms added. Recently, a new Protein Feature Ontology has been jointly developed by the BioSapiens, UniProt and Gene Ontology (GO) consortia , as an addition to the existing GO evidence ontology. This development is also very important for the future evolution of function-prediction tools.
While there often is a direct relationship between sequence similarity and conservation of protein structure, the same is not true for protein function: transfer of function based solely on the similarity between two sequences can be highly unreliable. Common evolutionary origin does not guarantee functional conservation of paralogs and the more distant the evolutionary relationship, the less reliable the transfer. Indeed, large-scale studies have shown that the transfer of functional annotation is only accurate for highly similar pairs of proteins [10,11]. However, even when two protein sequences do not appear to have overall sequence similarity, their alignment can contain short conserved sequence motifs, and these patterns of residues can be characteristic of a particular function. More powerful methods such as PSI-BLAST  or hidden Markov models can also be used to improve recognition performance. Methods such as ConFunc  and PFP  use clustering methods to refine and improve such homology-based predictions.
Domain databases such as Pfam, which recognize the "accumulated sequence conservation of a long sequence segment", are also very useful tools for predicting function. Many Pfam functional domains and alignments are manually constructed by experts and are often among the best sources of functional information.
In many cases the most interesting functional information, such as catalytic and ligand-binding residues, is to be found at the residue level. One example of residue-level transfer can be found in the Catalytic Site Atlas . Here catalytic residues extracted from the literature are supplemented by catalytic residues annotated from PSI-BLAST searches. One recent development has been Firestar , which is a server that integrates a database of experimentally validated functional residues with a sequence alignment analysis tool that evaluates the reliability of functional transfer. Firestar highlights potential functionally important residues such as ligand-binding residues and catalytic residues and allows users to assess whether the functionally important residues can be transferred.
Protein phosphorylation has a crucial role in almost all cellular signaling processes and is the most widespread post-translational modification in eukaryotes. The first machine-learning-based method for prediction of phosphorylation sites, NetPhos, was published a decade ago; it uses ensembles of neural networks to distinguish between phosphorylated and non-phosphorylated residues.
However, mammals have more than 500 protein kinases with very different sequence specificities. Newer methods have thus instead focused on deriving separate sequence motifs for individual kinases or families of closely related kinases. The Scansite method relies on position-specific scoring matrices that are determined from data obtained in in vitro binding assays using degenerate peptide libraries. Alternatively, machine-learning algorithms can be used to derive a sequence motif for each kinase (or kinase family) based on its known in vivo substrates. The first such method, NetPhosK, consisted of neural networks for only six kinase families and was later extended to 17 families. Many other kinase-specific methods have been developed using a variety of different machine-learning algorithms (see  and references therein for an overview).
As experimental phospho-proteomics approaches continue to produce vast numbers of phosphorylation sites, a key problem is to match these sites to the kinases that phosphorylate them. NetPhorest is a new atlas of consensus sequence motifs with a nonredundant collection of 125 sequence-based classifiers for linear motifs in phosphorylation-dependent signaling . It covers more than 180 kinases and 100 phosphorylation-dependent binding domains (such as Src homology 2 (SH2), phosphotyrosine binding (PTB), BRCA1 C-terminal (BRCT), WW and 14-3-3). The resource is maintained by an automated pipeline, which uses phylogenetic trees to structure the available in vivo and in vitro data to derive probabilistic sequence models of linear motifs. This type of approach is therefore automatically maintained as new data become available and represents an entirely new angle on the sustainability of tools for protein function annotation.
The cellular substrate specificities of kinases are heavily influenced by contextual factors such as co-activators, protein scaffolds and expression . The systems-biology-oriented method NetworKIN takes the context into account by augmenting the sequence motifs with a network context for the kinases and phosphoproteins . The network is constructed on the basis of known and predicted functional associations from the STRING database, which integrates evidence from curated pathway databases, automatic literature mining, high-throughput experiments and genomic context . For further details on prediction of biological networks see  and references therein.
Many proteins are glycoproteins and the most important types of glycosylations are N-linked, O-linked GalNAc (mucin-type), and O-β-linked GlcNAc (intracellular/nuclear) . Glycosylation prediction is not a trivial task because of the lack of a clear consensus recognition sequence; however, it has been possible to develop useful models for prediction of O-GalNAc-glycosylation (NetOGlyc) using a neural network based approach that combines a range of features derived from sequence . A recent advance in the glycosylation field has been the development of a new method - NetCGlyc - for predicting the unusual modification C-mannosylation .
Automated sequence annotation of subcellular localization is a major step in protein functional annotation. This is particularly important in eukaryotic cells, which contain several subcellular compartments. Signal peptide prediction has a quite long history that will not be reviewed here. That area indeed represents one of the big successes in the entire field of predictive bioinformatics: algorithms are approaching a performance level comparable to the quality of the underlying experimental data, perhaps in some cases even better [6,29].
The SignalP scheme [30,31] was the first neural-network-based approach predicting both the presence of the secretory signal peptide and its cleavage site. It gave an order of magnitude improvement in performance. As mentioned above, this improvement was also based on new dataset preparation principles inspired by developments in protein structure prediction. Other published machine-learning-based methods that perform well in this area include LOCTree, based on several binary SVMs arranged in three different decision trees and specific for plants, non-plants and prokaryotes; BaCelLo [29,33], which is based on a decision tree of binary SVMs and is specific for animals, fungi and plants; TargetP, based on neural networks and specific for non-plants, plants and prokaryotes; and WoLF PSORT, a classifier that computes a large number of sequence features and is specific for animals, fungi and plants. A general trend in the benchmarking of these algorithms is perhaps that the performance of multi-compartment predictors tends to be overestimated.
One subcellular location for which a wide range of sequence-based prediction methods has been developed is insertion into membranes. Structurally, integral membrane proteins come in two basic shapes, either tightly packed bundles of α-helices or β-barrels that often form permeable pores across the membrane. For various reasons, most computational work on membrane proteins has focused on the former. Generally speaking, topology predictors usually look for three important sequence characteristics of transmembrane alpha-helices: first, hydrophobic stretches of approximately 20 amino acids spanning the core of the lipid bilayer; second, a flanking 'aromatic belt' of tryptophan and tyrosine residues situated in the lipid-water interface; and third, an over-representation of the positively charged amino acids lysine and arginine in short cytoplasmic loops, known as the positive-inside rule .
Early attempts at predicting transmembrane topology from sequence were based on identifying peaks in hydrophobicity plots, using the positive-inside rule for uncertain cases and to predict the overall orientation of the protein . More recent approaches use machine-learning algorithms to extract statistical sequence preferences from membrane proteins with known structures [36-40]. Including evolutionary information by basing the prediction on sequence profiles has been shown to increase performance levels by around 5-10% [37,39,41]. Current predictors attain around 80% accuracy on known membrane protein structures, although their performance might be overestimated when applied to whole-genome data .
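The hydrophobicity-plot approach, combined with the positive-inside rule, can be sketched as follows. This is an illustrative simplification rather than any of the cited predictors: the Kyte-Doolittle hydropathy scale is used as one common choice of scale, and the window length of 19, the threshold of 1.6, the flank length of 15 and the function names are all assumptions made for the example.

```python
# Kyte-Doolittle hydropathy values (one common choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window, as (centre index, score) pairs."""
    half = window // 2
    return [
        (i, sum(KD[aa] for aa in seq[i - half:i + half + 1]) / window)
        for i in range(half, len(seq) - half)
    ]


def candidate_helices(seq, window=19, threshold=1.6):
    """Merge consecutive window centres scoring at or above the threshold.

    Returns (first centre, last centre) index pairs, one per hydrophobic peak.
    """
    helices, start, last = [], None, None
    for centre, score in hydropathy_profile(seq, window):
        if score >= threshold:
            if start is None:
                start = centre
            last = centre
        elif start is not None:
            helices.append((start, last))
            start = None
    if start is not None:
        helices.append((start, last))
    return helices


def cytoplasmic_side(seq, helix, window=19, flank=15):
    """Apply the positive-inside rule to the residues flanking one peak.

    Counts lysine and arginine just outside the outermost windows of the peak.
    """
    half = window // 2
    left = max(0, helix[0] - half)
    right = min(len(seq), helix[1] + half + 1)
    n_flank = seq[max(0, left - flank):left]
    c_flank = seq[right:right + flank]
    n_pos = n_flank.count("K") + n_flank.count("R")
    c_pos = c_flank.count("K") + c_flank.count("R")
    if n_pos == c_pos:
        return "undetermined"
    return "N-terminal flank" if n_pos > c_pos else "C-terminal flank"


# Toy sequence: charged N-terminal flank, hydrophobic core, polar C-terminal flank.
toy_seq = "MSKRKRKSGS" + "LLIVALLAVILLAVLIVALLA" + "SDENGSTDEQNSGS"
toy_helices = candidate_helices(toy_seq)
toy_orientation = cytoplasmic_side(toy_seq, toy_helices[0])
```

For the toy sequence the single hydrophobic peak is flanked by lysine and arginine only on its N-terminal side, so that side is assigned to the cytoplasm. The machine-learning approaches described above replace the fixed scale and threshold with statistical preferences learned from known structures.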
In recent years, elucidation of the complexity of some membrane protein structures has led to the development of methods that predict not only transmembrane helices, but other structural features as well, such as re-entrant loops and interfacial helices [43,44]. Other methods, such as Phobius, combine the prediction of transmembrane helices with the simultaneous prediction of signal peptides, leading to improved performance levels for proteins that contain both .
A wide variety of proteins has been shown to contain covalently bound lipid groups . Lipid anchor attachment is also a common way to link soluble proteins to membranes in eukaryotes. This modification directs the anchored protein to its very specific cellular location with an important impact on the final function. Predictors are presently available for modifications such as myristoylation, palmitoylation and prenylation [46,47]. The most common and best-studied lipid anchor modification is the glycosylphosphatidylinositol (GPI) linkage to the carboxy-terminal sequence portion that targets the protein toward the extracellular leaflet of the plasma membrane. In recent years, advances have also been made in predicting GPI-anchored proteins [48,49].
Ultimately, the integration of various functional signals, ranging from key residues to signals for subcellular localization and post-translational modifications, can be extrapolated to global functional roles. These roles are typically expressed in general classification schemes, which aim at the complete description of known cellular functions of proteins . Inspired by well-established catalogues, such as the Enzyme Committee (EC) nomenclature system for enzymes , these schemes comprise functional classes used in the characterization of genomes . Similarly, generalized non-hierarchical structures, such as GO, express complex relationships between classes and subclasses . One of the major challenges in function prediction is thus to capture the salient features of protein sequences and map those to existing functional classification schemes, often by combining information with other elements, for example subcellular localization or post-translational modifications.
Examples include attempts to predict EC categories from sequence alone, the prediction of functional classes from keywords and other annotations, and the association of sequence with GO.
Non-homologous function prediction combining many features was first implemented in the ProtFun method for human proteins . By design, the strength of the ProtFun method lies in classification of unannotated and orphan proteins. This strategy is based on the observation that proteins with the same function tend to exhibit similar feature patterns and functional similarity, which can be deduced from biochemical and biophysical properties such as average hydrophobicity, charge and amino acid composition as well as from local features such as glycosylation, phosphorylation and other post-translational modifications.
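The global properties mentioned here can be computed directly from sequence. The sketch below assembles a simple feature vector of the kind that could be fed to a classifier; it is not the ProtFun implementation. The use of the Kyte-Doolittle scale for average hydrophobicity, the crude net charge (lysine and arginine minus aspartate and glutamate, ignoring histidine and termini) and the function names are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (one common choice of scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def global_features(seq):
    """Global sequence-derived properties of a protein."""
    counts = Counter(seq)
    length = len(seq)
    return {
        "length": length,
        "composition": {aa: counts[aa] / length for aa in AMINO_ACIDS},
        "mean_hydropathy": sum(KD[aa] for aa in seq) / length,
        # Crude net charge: ignores histidine, termini and pH dependence.
        "net_charge": counts["K"] + counts["R"] - counts["D"] - counts["E"],
    }


def feature_vector(seq):
    """Flatten the features into a fixed-length list of numbers."""
    features = global_features(seq)
    return (
        [features["length"], features["mean_hydropathy"], features["net_charge"]]
        + [features["composition"][aa] for aa in AMINO_ACIDS]
    )


toy_features = global_features("KKRDA")
toy_vector = feature_vector("KKRDA")
```

In a combined predictor, outputs of local-feature tools (for example predicted glycosylation or phosphorylation sites) would be appended to such a vector before classification.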
More recent methods have adopted a ProtFun-like approach in combination with homology or structural input and have reported improved performance, particularly in prediction of the GO categories [58,59]. One desirable element of function prediction is the association of annotation assignments to a score that reflects the quality of the assignment. The methods need to cluster the functional space into consistent clusters and subsequently provide probabilistic estimates of assignment accuracy ; the recently developed method CORRIE can detect EC classes with high coverage . Newer methods presumably benefit from the increasing quality and quantity of functional protein annotation. Furthermore, the combination of non-homologous prediction methods with homologous or structural methods is likely to overcome limitations inherent in each individual method.
A major challenge for the area of sequence-based protein function prediction is multi-functionality, where proteins have different roles in different compartments, tissues and organs. The low number of genes in the human genome has in itself increased the interest in experimental detection of this type of protein, and similarly, detection of alternative splicing by exon and tiling arrays also contributes large amounts of functional evidence of pleiotropy where a single gene influences multiple phenotypic traits. This situation calls for systems-biology-oriented approaches where data from protein interaction screens, gene expression data, and many other types of data are integrated. From a prediction perspective the entire area of multi-functional proteins is interesting as it also will call for new benchmarking principles for novel algorithms. Today most of the systems biology approaches still focus on proteins belonging to one single functional category. This problem indeed represents a major future challenge.
Additional data file 1 contains a workflow combining the prediction and annotation tools of the Epipe method and an example output.