Bioinformatics. 2009 May 15; 25(10): 1264–1270.
Published online 2009 March 16. doi:  10.1093/bioinformatics/btp149
PMCID: PMC2677742

Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue–residue contacts

Abstract

Motivation: Correct prediction of residue–residue contacts in proteins that lack good templates with known structure would take ab initio protein structure prediction a large step forward. The lack of correct contacts, and in particular long-range contacts, is considered the main reason why these methods often fail.

Results: We propose a novel hidden Markov model (HMM)-based method for predicting residue–residue contacts from protein sequences using as training data homologous sequences, predicted secondary structure and a library of local neighborhoods (local descriptors of protein structure). The library consists of recurring structural entities incorporating short-, medium- and long-range interactions and is general enough to reassemble the cores of nearly all proteins in the PDB. The method is tested on an external test set of 606 domains with no significant sequence similarity to the training set as well as 151 domains with SCOP folds not present in the training set. Considering the top 0.2 · L predictions (L=sequence length), our HMMs obtained an accuracy of 22.8% for long-range interactions in new fold targets, and an average accuracy of 28.6% for long-, medium- and short-range contacts. This is a significant performance increase over currently available methods when comparing against results published in the literature.

Availability: http://predictioncenter.org/Services/FragHMMent/

Contact: torgeir.hvidsten/at/plantphys.umu.se

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

The main reason why template-free prediction of protein structure often fails is the lack of correct long-range residue–residue contacts (Floudas et al., 2006). At present, the energy functions used by ab initio prediction methods, together with the sampling methods, are not sophisticated enough to correctly discover these types of contacts (Zhang, 2008). It has been shown that the ability to predict contacts with accuracy above 22% could improve ab initio predictions of protein structures (Zhang et al., 2003). Thus residue–residue contact prediction is an important bioinformatics research area that could help to identify the structures that are not reachable by homology modeling.

Currently, there are three different main approaches to contact prediction. Template-based contact predictors rely on detecting templates of known structure and then transferring contacts from the template to the target (Misura et al., 2006; Skolnick et al., 2004). The performance and reliability of these predictors depend on the quality of the templates and can typically be approximated by the sequence similarity between the target and the template (Wu and Zhang, 2008). Although template-based methods are the most accurate contact prediction methods, they are also the most limited in that they require the existence of a template with significant sequence similarity to the target (Rost, 1999; Tramontano, 2003). Machine learning approaches to contact prediction are based on training models from contact maps of known structures (Cheng and Baldi, 2007; Hamilton et al., 2004; Wu and Zhang, 2008; Vullo et al., 2006). Since these methods do not rely on templates, they can be applied to a larger range of targets, in particular targets with little or no detectable sequence similarity to known structures. Statistical contact prediction methods take advantage of the fact that mutations lead to changes in contacts for conformational reasons (Kundrotas and Alexov, 2006; Shindyalov et al., 1994). Contacts are detected by searching for position pairs that show a similar pattern of variation (correlated mutations) (Olmea and Valencia, 1997). Although this group of methods so far has not been proven to perform well in contact prediction when applied alone (Halperin et al., 2006), correlated mutations have been combined with machine learning approaches to achieve improved results (Eyal et al., 2007; Olmea and Valencia, 1997; Shackelford and Karplus, 2007).

The best machine learning methods for contact prediction report accuracy in the area of 25–40% depending on the nature of the test set (Cheng and Baldi, 2007; Vullo et al., 2006; Wu and Zhang, 2008) and the number of predictions considered. SVM-SEQ is a support vector machine trained on profiles, secondary structure, solvent accessibility, contact potentials, residue types and segment window information. It achieved an average accuracy of 25.8% for short-, medium- and long-range contacts on a set of template-free modeling targets (Wu and Zhang, 2008). SVMcon is also a support vector machine trained on similar features as SVM-SEQ. An accuracy of around 37% was reported for this tool using a test set of known fold sequences with sequence identity <25% to the training set (Cheng and Baldi, 2007). A similar performance was reported using a bidirectional two layered neural network trained on protein sequence, predicted secondary structure and hydrophobic interaction scales (Vullo et al., 2006).

Long-range contacts are a particularly hard challenge for ab initio tertiary structure prediction methods because these methods are typically based on assembly of single backbone fragments that do not incorporate such contacts directly. Instead, these methods rely on long-range contacts appearing as a result of minimizing the energy function using a sampling procedure (Bujnicki, 2006). In this article, we propose a machine learning method that trains HMMs to recognize local neighborhoods of proteins that already incorporate a number of short-, medium- and long-range residue–residue contacts. To this end, we apply the concept of local descriptors of protein structure, which are structural entities consisting of the entire neighborhood of an amino acid and thus containing several backbone fragments located in proximity to each other (Fig. 1; Hvidsten et al., 2008). The HMMs are trained by combining the sequence signal in structurally similar neighborhoods with predicted secondary structure information and homology information. We show that a library of representative local descriptors can be correctly aligned to sequences even in the absence of templates and that the quality of the resulting residue–residue contact predictions represents an improvement over previously published methods. The performance of the approach is shown to be virtually fold independent and thus promises to be of considerable help in the most difficult application area of ab initio tertiary structure prediction.

Fig. 1.
The local descriptor denoted 1gr8a_#407 (i.e. the local neighborhood around amino acid number 407 in protein domain 1gr8a_). The left figure shows the local descriptor 1gr8a_#407 (red) in the structure domain 1gr8a_, while the middle figure shows a close ...

2 MATERIALS AND METHODS

2.1 Structure classification of contacts

We define two residues to be in contact if the distance between their Cβ atoms (Cα for glycine residues) is <8 Å. This is the definition used in the residue–residue contact prediction assessment in CASP (Critical Assessment of techniques for protein Structure Prediction; Izarzugaza et al., 2007). Contacts are classified into three groups, long-, medium- and short-range contacts, defined by the number of amino acids that separate the two residues in contact. Long-range contacts are separated by 24 or more amino acids, medium-range contacts by between 12 and 23 amino acids and short-range contacts by between 6 and 11 amino acids (Cheng and Baldi, 2007; Vullo et al., 2006; Wu and Zhang, 2008). Contacts separated by fewer than six amino acids are not considered since they are mostly the result of secondary structure.
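
As a minimal sketch of the definitions above, the following Python fragment tests whether two residues are in contact and assigns a pair to a sequence-separation class. The function names and the coordinate representation are illustrative assumptions and are not taken from the published implementation. Treating a separation of exactly 24 as long-range is also an assumption, chosen so that every separation of six or more falls into exactly one class.

```python
import math

CONTACT_CUTOFF = 8.0  # Angstrom; C-beta to C-beta distance (C-alpha for glycine)


def in_contact(xyz_i, xyz_j, cutoff=CONTACT_CUTOFF):
    """Return True if the two representative atoms are closer than the cutoff."""
    return math.dist(xyz_i, xyz_j) < cutoff


def contact_range(i, j):
    """Classify a residue pair by its sequence separation |i - j|."""
    sep = abs(i - j)
    if sep < 6:
        return None       # not considered: mostly secondary-structure contacts
    if sep <= 11:
        return "short"
    if sep <= 23:
        return "medium"
    return "long"         # assumption: a separation of exactly 24 counts as long
```

For example, contact_range(3, 40) gives "long" and contact_range(5, 12) gives "short".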

2.2 Model assessment

In order to assess the performance of the proposed method we use a set of standard assessment definitions. Accuracy (Acc) is the number of correctly predicted contacts (Nc) divided by the number of predicted contacts (Np), Acc = Nc/Np. Coverage (Cov) is the number of correctly predicted contacts divided by the total number of true contacts (Nt), Cov = Nc/Nt. As is common in other studies, we control the number of contact predictions using the length of the target sequence (L). This is accomplished by only considering the Np = Pct · L best predictions, where Pct must be set to some fixed value.
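
The assessment measures can be illustrated with a short Python sketch. The function names are hypothetical, and truncating Pct · L to an integer is an assumption, since the text does not state how fractional values of Np are handled.

```python
def top_predictions(scored_contacts, seq_len, pct):
    """Keep the Np = Pct * L highest-scoring predictions.

    scored_contacts is a list of ((i, j), score) tuples.
    """
    n_pred = int(pct * seq_len)  # assumption: fractional Np is truncated
    ranked = sorted(scored_contacts, key=lambda item: item[1], reverse=True)
    return [pair for pair, _score in ranked[:n_pred]]


def accuracy_and_coverage(predicted, true_contacts):
    """Return (Acc, Cov) with Acc = Nc / Np and Cov = Nc / Nt."""
    predicted = set(predicted)
    true_contacts = set(true_contacts)
    n_correct = len(predicted & true_contacts)
    acc = n_correct / len(predicted) if predicted else 0.0
    cov = n_correct / len(true_contacts) if true_contacts else 0.0
    return acc, cov
```

With L = 10 and Pct = 0.2, only the two best-scoring predictions are retained before accuracy and coverage are computed.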

2.3 Training and test set

The local descriptors of protein structure were extracted from a set of 4013 structures in the ASTRAL release 1.63 and then grouped into a library of representative substructures (Brenner et al., 2000; Hvidsten et al., 2008). This library was used together with sequences, predicted secondary structure and detected homologous sequences to train the hidden Markov models (HMMs). Secondary structure was predicted using PSI-PRED (McGuffin et al., 2000) with the non-redundant sequence database (NR) at NCBI (ftp://ftp.ncbi.nlm.nih.gov/) (Pruitt et al., 2005). Homology information was used in terms of profiles (PSSM) obtained from running PSI-BLAST with three iterations against NR (Altschul and Koonin, 1998; Altschul et al., 1997; Pruitt et al., 2005).

The test set was identified by blasting all sequences in ASTRAL release 1.69 against release 1.63 (the training set) and retaining all domains with a BLAST E-score >0.05. See Supplementary Material File 1. The test set was then divided into domains with SCOP folds not present in the training set and domains of known SCOP folds (Andreeva et al., 2004; Lo Conte et al., 2002).

2.4 Local descriptors of protein structure

Local descriptors of protein structure are a recent development for describing recurring 3D substructures in proteins. The method is a non-rigid body approach capable of, in particular, comparing protein structures by analyzing similarities between proteins on a local level. Substructures are composed of all segments of the amino acid backbone chain that are in proximity to each other in space but not necessarily along the amino acid sequence (Fig. 1; Hvidsten et al., 2008). The local descriptors that are of interest to this study are those that are composed of three or more segments and thus contain a number of contacts between different parts of the protein. All local substructures in the training set were organized into a library consisting of 7057 representative groups of structurally similar local descriptors (Hvidsten et al., 2008). See Supplementary Material File 2. To reduce the search space, groups in this study only contain local descriptors with the same segment order along the backbone. The library provides a set of building blocks for protein structures that are common to proteins independent of their global fold. The structural alignment of local descriptors in a group is mirrored by a sequence alignment between the corresponding backbone fragments. These sequence alignments constitute the training examples used in this study (Fig. 1).

2.5 Hidden Markov models

In this article, profile HMMs are used to recognize and align local descriptors in target sequences [for an introduction to profile HMMs, see Eddy (1998)]. Each HMM represents one structural neighborhood (descriptor group). The idea is to model the residues in segments of the descriptor as match states (M) and the rest of the sequence as insert states (I). Thus, each column in the alignment of a descriptor group corresponds to a specific match state (Fig. 1). Some groups may contain descriptor fragments of varying length because only parts of the fragments structurally match the group according to the defined similarity threshold. This is handled by using delete states that are tied to specific match states. In order to ensure that whole fragments are not deleted, there are two different types of delete states that are disconnected: delete states that are located at the beginning of fragments (DB) and delete states that are located at the end of the fragments (DE). We do not expect a significant sequence signal from these structurally unmatched positions, and thus the delete states are associated with the same emission probabilities as the insert states. The HMMs are built using the architecture shown in Figure 2. This is an architecture that utilizes both secondary structure and sequence data simultaneously. The models consist of one transition matrix T and two emission matrices Ea and Es that correspond to emission probabilities estimated according to the sequence alignments resulting from similar structural neighborhoods and predicted secondary structure, respectively. The emission matrices contain the emission probabilities en,m for each observable amino acid or secondary structure m in every state n. The emission probabilities in Ea were estimated using BLOSUM62, while those in Es were estimated using a pseudo count (Henikoff and Henikoff, 1992, 1996).
In order to use the Viterbi algorithm with the models, the observed sequence $O = \{O_a, O_s\}$ is defined as consisting of two parallel, associated series of observations of equal length, $O_a$ and $O_s$. Since profiles are used as search elements rather than the sequence of the target, $O_a$ is a linear sequence of arrays, $O_a = \{O_{\gamma,1}, O_{\gamma,2}, \ldots, O_{\gamma,J}\}$. Each $O_{\gamma,j}$ is an array representing a specific position in the target, where each value $\gamma_{j,m}$ is the observed fraction of amino acid $m$ in position $j$ in the target. Thus $O_{\gamma,j} = (\gamma_{j,1}, \gamma_{j,2}, \ldots, \gamma_{j,20})$. The secondary structure $O_s$ is predicted from the profile and is thus a single-ordered linear sequence of secondary structure elements (i.e. helix, strand or coil), $O_s = \{O_{s,1}, O_{s,2}, \ldots, O_{s,J}\}$. The emission probability $E(O_j)$ can then be defined as $E(O_j) = E_{a,n}(O_{\gamma,j}) \, E_{s,n}(O_{s,j})$ at position $j$ in state $n$. As $O_{\gamma,j}$ is an array of amino acid probabilities, the emission function $E_{a,n}(O_{\gamma,j})$ is defined as $E_{a,n}(O_{\gamma,j}) = e_{n,1}\gamma_{j,1} + e_{n,2}\gamma_{j,2} + \cdots + e_{n,20}\gamma_{j,20}$. The optimal path through the HMM is then found using the Viterbi algorithm (Rabiner, 1989; Viterbi, 1967).
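
The combined emission for one state and one target position can be sketched in a few lines of Python. This is an illustration of the definition above under stated assumptions, not the authors' code: the amino acid ordering, the three-letter secondary structure alphabet and the function name are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the profile
SS_STATES = "HEC"                     # assumed symbols: helix, strand, coil


def emission(e_a_n, e_s_n, profile_column, ss_symbol):
    """Emission of state n at position j from a profile column and an SS symbol.

    e_a_n          -- 20 amino acid emission probabilities of state n
    e_s_n          -- 3 secondary structure emission probabilities of state n
    profile_column -- 20 observed amino acid fractions at position j
    ss_symbol      -- predicted secondary structure at position j
    """
    # Amino acid term: sum over m of e[n][m] * gamma[j][m]
    aa_term = sum(e * g for e, g in zip(e_a_n, profile_column))
    # Secondary structure term: probability of the single predicted symbol
    ss_term = e_s_n[SS_STATES.index(ss_symbol)]
    return aa_term * ss_term
```

When the profile column is a one-hot vector, the amino acid term reduces to the ordinary single-residue emission probability, so a plain sequence is handled as a special case of a profile.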

Fig. 2.
The topology of the HMMs used to align structural neighborhoods to a target sequence. The underlying red structure represents states emitting secondary structure (labeled ‘ss’) and the blue overlaying structure represents states emitting ...

2.6 HMM thresholds

Given a target sequence and an HMM trained to recognize a specific local substructure, the Viterbi algorithm gives both the most probable position of the backbone fragments of that local substructure on the sequence (i.e. the alignment) and the corresponding probability of the alignment. The resulting alignment can be wrong for two reasons: either some of the fragments are assigned to wrong positions along the sequence, or the local substructure associated with the HMM does not exist in the target structure at all. To evaluate the performance of the HMMs with respect to the former, we performed a leave-one-domain-out cross-validation for each group using the associated HMM (Cawley and Talbot, 2003); the leave-one-domain-out procedure ensures that all local descriptors from the same domain are left out of the training set. To evaluate the performance with respect to the latter, we matched the HMM to 1000 sequences that did not contain the local substructure (negative examples), and selected a threshold that discriminated scores from these negative examples from cross-validation scores obtained from sequences known to contain the local descriptor. This was done by choosing the threshold that maximized sensitivity plus specificity. We found that the best approach to discriminate positive targets from negative targets was to consider the sum of the log values from the match and delete state emissions only. This eliminated the problem of accounting for different sequence lengths when comparing scores from different targets.
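
The threshold selection described above can be sketched as an exhaustive search over the observed scores. The function name and the tie-breaking rule (the lowest threshold wins) are assumptions made for illustration. A sequence is called positive when its score is strictly higher than the threshold.

```python
def choose_threshold(positive_scores, negative_scores):
    """Return (threshold, sensitivity, specificity) maximizing sens + spec.

    positive_scores -- cross-validation scores of sequences containing the shape
    negative_scores -- scores of sequences that do not contain it
    """
    best = None
    # Only observed scores need to be tried: sens and spec change nowhere else.
    for t in sorted(set(positive_scores) | set(negative_scores)):
        sens = sum(s > t for s in positive_scores) / len(positive_scores)
        spec = sum(s <= t for s in negative_scores) / len(negative_scores)
        if best is None or sens + spec > best[1] + best[2]:
            best = (t, sens, spec)
    return best
```

For perfectly separable scores the search returns the highest negative score as the threshold, with sensitivity and specificity both equal to one.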

2.7 Predicting contacts

Contact predictions were obtained by matching all HMMs to the target, and accepting the ones with a score higher than the associated threshold. Each assigned HMM then received a score equal to the modified Viterbi score (i.e. only considering match and delete states, see Section 2.6) divided by the HMM threshold. Given the Viterbi alignment, contacts were then transferred to the target from the corresponding local substructure recognized by the HMM. Obviously, the contact map for each local descriptor in a group can differ slightly given the discrete definition of a contact (8 Å). In this study, contacts from the central local descriptor of each group were transferred, and only contacts between residues located in different backbone fragments were considered.

Each predicted contact was given a score equal to the sum of the scores from all HMMs predicting that contact. Thus, a contact predicted by many different local descriptor groups was given a higher score than contacts predicted by fewer models. The number of predicted contacts was chosen according to the length of the protein and the chosen Pct-value. The contacts with the highest score within each contact range were chosen.
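
The scoring and selection steps can be sketched as follows. This is a simplified illustration with hypothetical names: each accepted HMM is represented only by its score (assumed to be already divided by its threshold) and by the list of contacts it transfers to the target. Truncating Pct · L to an integer and treating a separation of exactly 24 as long-range are assumptions.

```python
from collections import defaultdict


def contact_range(i, j):
    """Sequence-separation class of a residue pair (None if below 6)."""
    sep = abs(i - j)
    if sep < 6:
        return None
    if sep <= 11:
        return "short"
    if sep <= 23:
        return "medium"
    return "long"  # assumption: a separation of exactly 24 counts as long


def aggregate_contacts(hmm_hits):
    """Sum HMM scores per contact; hmm_hits is a list of (score, [(i, j), ...])."""
    totals = defaultdict(float)
    for hmm_score, contacts in hmm_hits:
        for i, j in contacts:
            totals[(min(i, j), max(i, j))] += hmm_score
    return dict(totals)


def select_top(totals, seq_len, pct_by_range):
    """Keep the int(Pct * L) highest-scoring contacts within each range."""
    chosen = {}
    for rng, pct in pct_by_range.items():
        ranked = sorted(
            (item for item in totals.items() if contact_range(*item[0]) == rng),
            key=lambda item: item[1],
            reverse=True,
        )
        chosen[rng] = [pair for pair, _score in ranked[: int(pct * seq_len)]]
    return chosen
```

A contact reported by two HMMs therefore outranks a contact reported by either one alone, which is the reward for agreement between descriptor groups described in the text.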

3 RESULTS AND DISCUSSION

In this article, we have developed multi-data HMMs to recognize local substructures in sequence and to predict contacts. This was done by (i) constructing a library of 7057 representative and recurring local neighborhoods (local descriptor groups), (ii) training an HMM to recognize each such local neighborhood in a sequence domain using sequence and secondary structure data, (iii) evaluating the HMMs' ability to (a) correctly align the local substructure to a target sequence given that the neighborhood in fact is present in the target's structure (by cross-validation) and (b) correctly discriminate targets that contain the local substructure from those that do not, and, finally, (iv) using the HMMs that performed satisfactorily to predict contacts in a number of previously unseen targets (i.e. test set) with little or no sequence similarity to the training set. Figure 3 gives an overview of the method.

Fig. 3.
Method overview.

3.1 Evaluation using the training set

In our method, the ability to properly align a local descriptor to a target sequence is a prerequisite for correct contact prediction. This ability was evaluated in two steps depending on whether we know if the local substructure is present in the target structure or not.

We measured the ability to correctly align each local descriptor to its corresponding domain using an HMM trained on the remaining local descriptors in the group (i.e. leave-one-out cross-validation). The maximum sequence identity between two domains in the training set was 40%. Since each local descriptor consists of at least three separate backbone fragments, each of them needs to be correctly aligned to obtain a perfect alignment. Figure 4 shows that a little over 50% of the segments in a local descriptor group are correctly aligned, and that this number increases to almost 60% when allowing an alignment error of at most four positions.

Fig. 4.
The average percentage of descriptor segments in a group that has been aligned within a certain residue distance from the true positions.

In order to successfully use local descriptors to predict (long-, medium- and short-range) contacts, we need to ensure that the models are able to discriminate targets that lack the corresponding substructure. Obviously, given a new target sequence we do not know which local substructures it contains. Thus, we associate a threshold value with each HMM. Only target sequences with a Viterbi score above this threshold are assumed to contain the corresponding local substructure. We require that these thresholds should discriminate at least 95% of sequences not containing the local shape (i.e. specificity > 0.95). Of the 7057 descriptors in our library, 6776 passed this criterion while still being able to recognize at least one target that contained the local shape (i.e. sensitivity > 0). The descriptor similarity group analyzed in Figure 1 obtained a perfect sensitivity score and a specificity score of 0.954. It is interesting that this local shape exhibits such a clear sequence signal despite the fact that it contains only five members belonging to four different SCOP folds. The group library as a whole obtained an average sensitivity of 74.7% and a specificity of 96.0%.

3.2 Pct-values

Previously published contact prediction methods often consider the number of top ranked predictions corresponding to a Pct of 0.5 for each type of contact (i.e. long-, medium- and short-range contacts), thus reporting contact predictions corresponding to a total Pct of 1.5. The motivation is that the number of contacts in a protein is linear in the protein length. However, we argue that packing a long protein sequence will naturally require a higher fraction of long-range interactions than packing a short protein sequence. Therefore, we look at the actual distributions of long-, medium- and short-range interactions in proteins with known structure. As can be seen from Figure 5, assuming a Pct of 0.5 for each type of contact is a rather inaccurate approximation. As expected, the length of the target sequence has a great impact on the distribution of the different types of contacts. We used this result to create a cubic spline curve for each distribution and then used these curves to compute the Pct-value for each type of contact given the length of the target. The distribution itself is of interest as it to some degree explains why it is so difficult to predict the structure of medium and long proteins with ab initio methods. Not only does the number of contacts increase in a longer amino acid sequence, but a large fraction of these contacts are long-range interactions. Thus, using fixed Pct values makes prediction accuracies for contact predictors look better than they actually are.
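
The idea of a length-dependent Pct can be sketched in Python. Two simplifications should be noted: piecewise-linear interpolation is used here as a simpler stand-in for the cubic spline used in the study, and the knot values are placeholders chosen only to make the example run; they are not read from Figure 5.

```python
from bisect import bisect_right

# Placeholder (length, Pct) knots for one contact range; NOT values from Figure 5.
EXAMPLE_KNOTS = [(50, 0.2), (150, 1.0), (300, 2.0)]


def interpolate_pct(length, knots):
    """Piecewise-linear Pct as a function of sequence length.

    knots is a list of (length, pct) pairs sorted by length; values outside
    the covered interval are clamped to the nearest knot.
    """
    xs = [x for x, _ in knots]
    if length <= xs[0]:
        return knots[0][1]
    if length >= xs[-1]:
        return knots[-1][1]
    k = bisect_right(xs, length)
    (x0, y0), (x1, y1) = knots[k - 1], knots[k]
    return y0 + (y1 - y0) * (length - x0) / (x1 - x0)


def n_predictions(length, knots):
    """Number of contacts to predict for this range: int(Pct(L) * L)."""
    return int(interpolate_pct(length, knots) * length)
```

Because Pct itself grows with L in such a curve, the number of requested predictions grows faster than linearly with sequence length, in contrast to a fixed Pct.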

Fig. 5.
The three curves show the distribution of short-, medium- and long-range contacts as a function of sequence length. The blue fields represent the density of Pct for the specific amino acid sequence length (x-axis). The red line is a spline curve extracted ...

3.3 Contact predictions

To evaluate our approach for contact prediction, we used an external test set containing 757 domains with BLAST E-score higher than 0.05 to the closest neighbor in the training set. Of these targets, 151 belong to SCOP folds not present in the training set. Contacts in both test sets were predicted using the best assignments corresponding to a Pct of 0.2, a Pct of 0.5 and the spline-interpolated Pct-values for each contact range.

Table 1 (known folds) and Table 2 (new folds) show some interesting trends. Obviously, a Pct of 0.2 gives higher accuracy and lower coverage than a Pct of 0.5. Spline interpolated Pct-values, which are designed to mimic the true number of contacts, allow in general more predictions than the commonly used values of 0.2 and 0.5, and thus result in higher coverage and lower accuracy than these values. Longer distances between contacts generally decrease the quality of predictions. Known folds in general can be more accurately predicted than new folds, although the difference is rather small.

Table 1.
Contact prediction results for the test set of 606 domains with known folds
Table 2.
Contact prediction results on test set of 151 domains with new folds not present in the training set

Since the spline interpolation forces the method to look for more long-range contacts that are inherently difficult to predict, it is hard to compare these results to those obtained with fixed Pct-values. However, considering that previously published results suggest that a contact prediction accuracy of >22% should improve ab initio structure prediction methods, a Pct-value of 0.2 seems a reasonable choice. This results in accuracy >22% for all ranges both for known and new folds.

Structurally matching all local descriptor groups in our library to the new fold targets indicated that on average 73% of the amino acids in a target were covered by at least one group. We observed little or no correlation between the descriptor coverage and the contact prediction accuracy (Pct=0.2, see Fig. 6), which shows that the method is robust both in describing and predicting contacts even for new folds. The lack of correlation is most likely due to high descriptor coverage for virtually all new fold targets and the fact that we only consider a small fraction of the very best contact predictions. Moreover, since we reward predictions made by several HMMs, contact predictions should typically be located in the most highly covered regions of proteins. A similar lack of correlation was observed in applying local descriptors to fold recognition (Hvidsten et al., 2008). Interestingly, the five new fold targets that did not match any of the descriptor groups in our library still obtained an average accuracy of 24% (all ranges). These were all small domains with an average length of only 50 amino acids. Such domains are typically not well covered by local descriptors which require at least three different parts of the backbone to be close in space. Correct contact predictions in these domains must either be due to the fact that HMMs correctly assign only some, but not all, of the fragments in a local descriptor or that incorrect assignments still contribute correct contact predictions. The former is quite often observed in the training set (Fig. 4). The latter should occur from time to time given that we employ a discrete definition of local descriptor similarity (Hvidsten et al., 2008) and that we look at new fold proteins with no sequence similarity to the training set. 
Thus, local descriptors falling just outside our similarity threshold should have an almost equal chance of being recognized by the ‘correct’ HMM as local descriptors falling just inside the threshold.

Fig. 6.
Contact prediction accuracy (Pct = 0.2) for new fold targets plotted against the fraction of the targets structurally matched by at least one local descriptor group in the library (i.e. descriptor coverage). Correlation coefficients between prediction ...

Figure 7A shows a match of group 1gr8a_#407 (Fig. 1) to the recombinational repair protein RecR. There are no examples of this fold in the training set. The group predicts 16 long-range, 11 medium-range and no short-range contacts. Of these, five and four predictions are correct (Fig. 7B), resulting in an accuracy of 31 and 36%, respectively.

Fig. 7.
(A) The assignment of group 1gr8a_#407 (Fig. 1) to the recombinational repair protein RecR (PDB code 1vdd, chain A). Positions 1–54 in the structure are not shown. (B) Contacts correctly predicted by the group. Medium-range contacts are in blue ...

4 CONCLUSION

We have developed a fold-independent method that predicts contacts based on sequence information. The HMMs are trained using sequence alignments from structurally similar local descriptors of protein structure, homologous sequences and predicted secondary structure. Testing our method on targets with unknown folds, we find that our method yields better results than state-of-the-art sequence-based contact prediction methods (Cheng and Baldi, 2007; Vullo et al., 2006; Wu and Zhang, 2008). In particular, SVM-SEQ (Wu and Zhang, 2008) obtained an accuracy of 20.2% for long-range contacts, 23.3% for medium-range contacts and 34.0% for short-range contacts when using a Pct-value of 0.2 on new fold targets. However, when comparing these numbers with our new fold results in Table 2, one should keep in mind that the comparison is made on test sets of similar rather than exactly the same difficulty (i.e. targets with folds not present in the training set). In general, improving sequence-based residue–residue contact prediction could help advance prediction of tertiary structure, especially in the particularly difficult new fold category [for a comparative assessment, see for example the CASP7 results (Jauch et al., 2007)]. Indeed, providing sufficiently reliable geometrical constraints would be useful in all cases where good templates are unavailable. Therefore, developing new sequence-based methods for predicting residue–residue contacts may also be critical to building a folding algorithm that correctly predicts protein structure from sequence.

Supplementary Material

[Supplementary Data]
[Supplementary Data]

ACKNOWLEDGEMENTS

We would like to thank Daniel Larson for making the illustration showing the topology of the HMMs. Computations were in part carried out in CoE BioExploratorium at University of Warsaw.

Funding: The Knut and Alice Wallenberg Foundation; the Swedish Foundation for Strategic Research; the Swedish Research Council; the Swedish Governmental Agency for Innovation Systems (VINNOVA); the Polish Ministry of Science and Higher Education (grant PBZ-MIN-014/P05/2004); the National Institutes of Health (LM007085).

Conflict of Interest: none declared.

REFERENCES

  • Altschul SF, Koonin EV. Iterated profile searches with PSI-BLAST–a tool for discovery in protein databases. Trends Biochem. Sci. 1998;23:444–447.
  • Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
  • Andreeva A, et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. 2004;32:D226–D229.
  • Brenner SE, et al. The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 2000;28:254–256.
  • Bujnicki JM. Protein-structure prediction by recombination of fragments. Chembiochem. 2006;7:19–27.
  • Cawley GC, Talbot NLC. Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers. Pattern Recognit. Soc. 2003;36:2585–2592.
  • Cheng J, Baldi P. Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics. 2007;8:113.
  • Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763.
  • Eyal E, et al. A pair-to-pair amino acids substitution matrix and its applications for protein structure prediction. Proteins. 2007;67:142–153.
  • Floudas CA, et al. Advances in protein structure prediction and de novo protein design: a review. Chem. Eng. Sci. 2006;61:966–988.
  • Halperin I, et al. Correlated mutations: advances and limitations. A study on fusion proteins and on the Cohesin-Dockerin families. Proteins. 2006;63:832–845.
  • Hamilton N, et al. Protein contact prediction using patterns of correlation. Proteins. 2004;56:679–684.
  • Henikoff JG, Henikoff S. Using substitution probabilities to improve position-specific scoring matrices. Comput. Appl. Biosci. 1996;12:135–143.
  • Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA. 1992;89:10915–10919.
  • Hvidsten TR, et al. Local descriptors of protein structure: a systematical analysis of the sequence-structure relationship in proteins using short- and long-range interactions. Proteins Struct. Funct. Bioinform. 2008 in press.
  • Izarzugaza JM, et al. Assessment of intramolecular contact predictions for CASP7. Proteins. 2007;69(Suppl. 8):152–158.
  • Jauch R, et al. Assessment of casp7 structure predictions for template free targets. Proteins Struct. Funct. Bioinform. 2007;69(Suppl. 8):57–67.
  • Kundrotas PJ, Alexov EG. Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives. BMC Bioinformatics. 2006;7:503.
  • Lo Conte L, et al. SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 2002;30:264–267.
  • McGuffin LJ, et al. The PSIPRED protein structure prediction server. Bioinformatics. 2000;16:404–405.
  • Misura KM, et al. Physically realistic homology models built with ROSETTA can be more accurate than their templates. Proc. Natl Acad. Sci. USA. 2006;103:5361–5366.
  • Olmea O, Valencia A. Improving contact predictions by the combination of correlated mutations and other sources of sequence information. Fold. Des. 1997;2:S25–S32.
  • Pruitt KD, et al. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005;33:D501–D504.
  • Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 1989;77:257–286.
  • Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12:85–94.
  • Shackelford G, Karplus K. Contact prediction using mutual information and neural nets. Proteins. 2007;69(Suppl. 8):159–164.
  • Shindyalov IN, et al. Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? Protein Eng. 1994;7:349–358.
  • Skolnick J, et al. Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins. 2004;56:502–518.
  • Tramontano A. Of men and machines. Nat. Struct. Biol. 2003;10:87–90.
  • Viterbi AJ. Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Trans. Inf. Theory IT. 1967;13:10.
  • Vullo A, et al. A two-stage approach for improved prediction of residue contact maps. BMC Bioinformatics. 2006;7:180.
  • Wu S, Zhang Y. A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics. 2008;24:924–931.
  • Zhang Y. Progress and challenges in protein structure prediction. Curr. Opin. Struct. Biol. 2008;18:342–348.
  • Zhang Y, et al. TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys. J. 2003;85:1145–1164.

Articles from Bioinformatics are provided here courtesy of Oxford University Press