The main reason why template-free prediction of protein structure often fails is the lack of correct long-range residue–residue contacts (Floudas et al.). At present, the energy functions used by ab initio prediction methods, together with the sampling methods, are not sophisticated enough to correctly discover these types of contacts (Zhang, 2008). It has been shown that the ability to predict contacts with accuracy above 22% could improve ab initio predictions of protein structures (Zhang et al.). Thus residue–residue contact prediction is an important bioinformatics research area that could help to identify the structures that are not reachable by homology modeling.
Currently, there are three main approaches to contact prediction. Template-based contact predictors rely on detecting templates of known structure and then transferring contacts from the template to the target (Misura et al.; Skolnick et al.). The performance and reliability of these predictors depend on the quality of the templates and can typically be approximated by the sequence similarity between the target and the template (Wu and Zhang, 2008). Although template-based methods are the most accurate contact prediction methods, they are also the most limited in that they require the existence of a template with significant sequence similarity to the target (Rost, 1999; Tramontano, 2003). Machine learning approaches to contact prediction are based on training models from contact maps of known structures (Cheng and Baldi, 2007; Hamilton et al.; Wu and Zhang, 2008; Vullo et al.). Since these methods do not rely on templates, they can be applied to a larger range of targets, in particular targets with little or no detectable sequence similarity to known structures. Statistical contact prediction methods take advantage of the fact that mutations lead to changes in contacts for conformational reasons (Kundrotas and Alexov, 2006; Shindyalov et al.). Contacts are detected by searching for position pairs that show a similar pattern of variation (correlated mutations) (Olmea and Valencia, 1997). Although this group of methods has not so far been shown to perform well in contact prediction when applied alone (Halperin et al.), correlated mutations have been combined with machine learning approaches to achieve improved results (Eyal et al.; Olmea and Valencia, 1997; Shackelford and Karplus, 2007).
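As a rough illustration of the correlated-mutation idea, the sketch below scores every pair of columns in a multiple sequence alignment by mutual information and ranks the pairs. Mutual information is used here only as a simple stand-in for "similar pattern of variation"; it is not claimed to be the measure used in the studies cited above. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def rank_column_pairs(alignment, min_separation=1):
    """Rank column pairs (i, j) by decreasing mutual information."""
    length = len(alignment[0])
    scored = []
    for i, j in combinations(range(length), 2):
        if j - i >= min_separation:
            score = mutual_information(column(alignment, i), column(alignment, j))
            scored.append((score, i, j))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored


# Hypothetical toy alignment: columns 0 and 1 vary together,
# column 2 is fully conserved, column 3 varies independently of column 0.
toy_alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
ranked = rank_column_pairs(toy_alignment)
```

In this toy case the top-ranked pair is columns 0 and 1, the only pair whose residues change together. On real alignments such raw scores are noisy, which is consistent with the observation above that correlated mutations tend to work better as an input to machine learning methods than on their own.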
The best machine learning methods for contact prediction report accuracies in the range of 25–40%, depending on the nature of the test set (Cheng and Baldi, 2007; Vullo et al.; Wu and Zhang, 2008) and the number of predictions considered. SVM-SEQ is a support vector machine trained on profiles, secondary structure, solvent accessibility, contact potentials, residue types and segment window information. It achieved an average accuracy of 25.8% for short-, medium- and long-range contacts on a set of template-free modeling targets (Wu and Zhang, 2008). SVMcon is also a support vector machine, trained on features similar to those of SVM-SEQ. An accuracy of around 37% was reported for this tool using a test set of known fold sequences with sequence identity <25% to the training set (Cheng and Baldi, 2007). A similar performance was reported using a bidirectional two-layered neural network trained on protein sequence, predicted secondary structure and hydrophobic interaction scales (Vullo et al.).
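Because the reported accuracies depend on the number of predictions considered and on the contact range, it may help to see how such a figure can be computed. The following is a minimal sketch, assuming accuracy means the fraction of the top-ranked predicted pairs that are true contacts. The sequence-separation cutoffs, the toy data and the function name are hypothetical and are not taken from the cited studies.

```python
def contact_accuracy(predictions, native_contacts, top_k, min_sep=0, max_sep=None):
    """Fraction of the top_k highest-scoring predictions that are native contacts.

    predictions     -- iterable of (i, j, score) residue pairs
    native_contacts -- iterable of (i, j) pairs in contact in the known structure
    min_sep/max_sep -- optional bounds on sequence separation |i - j|
    """
    in_range = [
        (score, i, j)
        for i, j, score in predictions
        if abs(i - j) >= min_sep and (max_sep is None or abs(i - j) <= max_sep)
    ]
    in_range.sort(key=lambda t: -t[0])  # highest score first
    selected = in_range[:top_k]
    if not selected:
        return 0.0
    native = {tuple(sorted(pair)) for pair in native_contacts}
    correct = sum(1 for _, i, j in selected if tuple(sorted((i, j))) in native)
    return correct / len(selected)


# Hypothetical toy data
native = {(1, 30), (5, 40), (10, 12)}
predicted = [(1, 30, 0.9), (5, 41, 0.8), (5, 40, 0.7), (10, 12, 0.6), (2, 50, 0.1)]

# Assumed cutoff: pairs at least 24 residues apart are treated as long-range.
long_range_top2 = contact_accuracy(predicted, native, top_k=2, min_sep=24)
long_range_top3 = contact_accuracy(predicted, native, top_k=3, min_sep=24)
all_ranges_top4 = contact_accuracy(predicted, native, top_k=4)
```

With the same predictions, the toy accuracy is 0.5 for the top two long-range pairs, 2/3 for the top three, and 0.75 for the top four pairs of any range, which is why figures from different studies are only comparable when the test set, the contact range and the number of predictions are matched.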
Long-range contacts are a particularly hard challenge for ab initio tertiary structure prediction methods because these methods are typically based on assembly of single backbone fragments that do not incorporate such contacts directly. Instead, these methods rely on long-range contacts appearing as a result of minimizing the energy function using a sampling procedure (Bujnicki, 2006). In this article, we propose a machine learning method that trains hidden Markov models (HMMs) to recognize local neighborhoods of proteins that already incorporate a number of short-, medium- and long-range residue–residue contacts. To this end, we apply the concept of local descriptors of protein structure, which are structural entities consisting of the entire neighborhood of an amino acid and thus containing several backbone fragments located in proximity to each other (Hvidsten et al.). The HMMs are trained by combining the sequence signal in structurally similar neighborhoods with predicted secondary structure information and homology information. We show that a library of representative local descriptors can be correctly aligned to sequences even in the absence of templates and that the quality of the resulting residue–residue contact predictions represents an improvement over previously published methods. The performance of the approach is shown to be virtually fold independent and thus promises to be of considerable help in the most difficult application area of ab initio tertiary structure prediction.
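To make the notion of a local neighborhood concrete, the sketch below collects the residues lying within a distance cutoff of a central residue and groups them into contiguous backbone fragments. It is only an illustration of the general idea and not the descriptor construction used in this work: the use of Cα coordinates, the cutoff value, the toy coordinates and the function names are all assumptions.

```python
import math


def neighborhood_fragments(ca_coords, center, radius):
    """Group residues within `radius` of residue `center` into contiguous fragments.

    ca_coords -- list of (x, y, z) coordinates, one per residue (assumed C-alpha)
    Returns a list of fragments, each a list of consecutive residue indices.
    """
    close = [
        k for k, xyz in enumerate(ca_coords)
        if math.dist(xyz, ca_coords[center]) <= radius
    ]
    fragments = []
    for k in close:
        if fragments and k == fragments[-1][-1] + 1:
            fragments[-1].append(k)  # extends the current backbone fragment
        else:
            fragments.append([k])    # starts a new fragment
    return fragments


def cross_fragment_pairs(fragments):
    """Residue pairs that lie in different fragments of the same neighborhood."""
    pairs = []
    for a in range(len(fragments)):
        for b in range(a + 1, len(fragments)):
            pairs.extend((i, j) for i in fragments[a] for j in fragments[b])
    return pairs


# Hypothetical toy chain: residues 0-4 run along x, residue 5 forms a turn,
# and residues 6-9 run back alongside the first strand.
toy_coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0),
    (15.2, 0.0, 0.0), (19.0, 2.5, 0.0), (15.2, 5.0, 0.0), (11.4, 5.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0),
]
fragments = neighborhood_fragments(toy_coords, center=2, radius=6.5)
pairs = cross_fragment_pairs(fragments)
```

For the toy chain, the neighborhood of residue 2 consists of two fragments, residues 1–3 and residues 7–9. Pairs drawn from different fragments are separated in sequence yet close in space, which illustrates how a single neighborhood can carry medium- and long-range contacts.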
Fig. 1. The local descriptor denoted 1gr8a_#407 (i.e. the local neighborhood around amino acid number 407 in protein domain 1gr8a_). The left figure shows the local descriptor 1gr8a_#407 (red) in the structure domain 1gr8a_, while the middle figure shows a close ...