One of the ultimate goals of computational biology is reliable prediction of the three-dimensional structure of proteins from sequences because of significant difficulties in obtaining high resolution crystallographic data. An intermediate but useful step is to predict the secondary structure, which might simplify the complicated 3D structure prediction problem. This reduces the difficult 3D structure determination to a simpler one-dimensional problem. This reduction is possible since proteins can form local conformational patterns like α-helices and β-sheets. Many have shown that predicting secondary structure can be a first step toward predicting 3D structure [1
Sometimes, secondary structure prediction can even be useful when a protein native structure is already known, for example to predict amyloid fibrils: misfolded β-sheet-rich structures, responsible for Alzheimer’s and Parkinson’s diseases [5
Secondary structure prediction can be used to predict some aspects of protein functions; in genome analysis, it is used to help annotate sequences, classify proteins, identify domains, and recognize functional motifs [6
First attempts at secondary structure prediction were based mainly on single amino acid propensities. In order to improve prediction accuracies, some researchers began including evolutionary information in the prediction procedure [7
]. These attempts consist of folding evolutionary propensities into the prediction by introducing a measure of sequence variability.
Enlarged databases, new searching models and algorithms make it feasible to extend family divergence, even when the structures of some of the protein family members are unknown (e.g. large-scale searches with PSI-BLAST [9
] or Hidden Markov Models [10
]). By using divergent profiles, the prediction accuracy reached 0.76 in Q3
(the fraction of correctly predicted residues in one of the three states: helix, strand, and other, see definition in methods). It is believed that enlarged databases and extended searching techniques contributed significantly to the increase in the accuracy of the secondary structure prediction.
There exist several popular and historically important prediction methods, including empirical statistical methods [12
], information theory [13
], nearest neighbor [17
], hidden Markov models [18
], and artificial intelligence (AI) approaches [19
As one of the AI approaches, neural networks simulate the operations of synaptic connections among neurons. As the neurons are layered to process different levels of signals, the neural networks accept the original input (which might be a sequence segment, having been modified by a weighting factor), combine more information, integrate all to an output, send the output to the next level of processor, and so on. During this process, the network adjusts the weights and biases continuously, according to the known information, i.e. the network is trained on a set of homologous sequences of known structures for the secondary structure prediction problem. The more layers it has, the more information it distinguishes. The PHD [19
] is one representative method, with the most recent versions reporting accuracy of prediction up to 0.77.
In the field of ab initio protein structure prediction, the Rosetta algorithm, a computational technique developed by Baker and his colleagues at the University of Washington, was quite successful in predicting the three-dimensional structure of a folded protein from its linear sequence of amino acids during the fourth (2000) and the fifth (2002) Critical Assessment of Techniques for Protein Structure Prediction (CASP4, CASP5), and most recently during CASP6 (2004).
Based upon the assumption that the distribution of structures sampled by an isolated chain segment can be approximated by the distribution of conformations of that sequence segment in known structures, Rosetta predicts local structures by searching all possible conformations. The popular view is that folding to the native structure occurs when conformations and relative orientations of the segments allow burial of the hydrophobic residues and pairing of β-strands without steric clashes [20
It is commonly believed that similarities between the sequences of two proteins infer similarities between their structures, especially when the sequence identity is greater than about 25%. Most protein sequence folds into a unique three-dimensional structure, yet sometimes highly dissimilar sequences can assume similar structures. From an evolutionary point of view, the structure can be more conserved than sequence.
The sequence similarity of very distant homologues is difficult to identify. But the conservation of some special motifs can possibly be captured using special methods. Those special motifs are more important than general matches found in alignments of close homologues due to their uniqueness. The problem of such detection is usually confounded by the use of structure-independent sequence substitution matrices.
Since a similar sequence implies a similar structure and conserved local motifs may assume a similar shape, we attempt to find a local alignment method to obtain structure information to predict the secondary structure of query sequences. For this study, we assembled a set of segments obtained through local sequence alignment and use this information for predictions. We applied BLAST on query sequences against a structure database in which all the information of proteins is available (PDB or DSSP), and then use these results to introduce evolutionary information into predictions. We applied various weighting procedures to the matches based on the sequence similarity scores. Finally, we calculated the normalized scores for each position and for each secondary structure at that position. Note that since the scores of being E, H or C are sometimes close or even equal, so that choosing the highest score is either not possible or not fully justifiable. Furthermore, the predicted secondary structure element is not physically meaningful in all cases, such as helices having only two residues, or mixtures of strand (E) and helix (H) residues. Therefore, instead of predicting the secondary structure with the highest score, we incorporated artificial intelligence (AI) approaches for the final secondary structure assignment.