3.1. Structure Prediction of Partial-Length Sequences
The accuracy of I-TASSER and ProtinfoCM/HH depends on their ability to select templates or template fragments that are homologous to the input sequence. I-TASSER begins the structure prediction process by threading the input sequence to identify template PDB structures that are homologous to it. Structure fragments excised from the resulting templates are then used to build models. ProtinfoCM/HH is a simpler template-based method that builds a model from the single template found to be most homologous to the input sequence.
Overall, Rosetta performed considerably worse than the other two methods, which was expected given that it is an ab initio method. Although Rosetta also generates models from template fragments found to be homologous to the input sequence, these fragments are considerably smaller than the fragments I-TASSER uses to build its models. Unlike the other two methods, Rosetta performed better, on average, on subsequences representing more than 50% of the source native sequence than on the native sequences themselves. This is likely due to the nature of ab initio modeling, which, unlike template-based modeling, involves building thousands of models, clustering them, and selecting an output model from one of the largest clusters. In ab initio modeling, the longer the input sequence, the more models must be generated to produce sizable clusters that contain accurate models, which may explain why predictions made from subsequences were more accurate than those made from the longer native sequences. In our experiment, we generated 10,000 models for each input sequence, a typical number for structure prediction. Generating more models makes it more likely that the clustering will yield a cluster containing an accurate prediction. Therefore, if many more models were generated for each input sequence, the prediction quality for subsequences and native sequences would likely converge.
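The cluster-and-select step described above can be illustrated with a minimal sketch. It uses a generic greedy scheme (repeatedly take the model with the most neighbors within a distance radius) and is not claimed to be the procedure implemented in Rosetta; the function name, the radius, and the toy one-dimensional "models" are all hypothetical, and a real application would use pairwise structural distances such as RMSD.

```python
def greedy_cluster(dist, radius):
    """Greedy clustering over a symmetric distance matrix (list of lists).

    Returns a list of (center_index, member_indices) in the order found.
    Each round picks the unassigned model with the most unassigned
    neighbours within `radius`, so the first cluster is the largest.
    """
    unassigned = set(range(len(dist)))
    clusters = []
    while unassigned:
        center = max(
            sorted(unassigned),  # sorted so ties resolve to the lowest index
            key=lambda i: sum(1 for j in unassigned if dist[i][j] <= radius),
        )
        members = sorted(j for j in unassigned if dist[center][j] <= radius)
        clusters.append((center, members))
        unassigned -= set(members)
    return clusters


# Toy stand-in for decoy models: points on a line (hypothetical values).
positions = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(a - b) for b in positions] for a in positions]

clusters = greedy_cluster(dist, radius=0.5)
largest_center, largest_members = clusters[0]
```

With few models, a sparse region of conformational space may never form a sizable cluster, which is consistent with the argument that longer sequences need more models before the largest clusters become informative.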
Seven control sequences modeled by I-TASSER and three modeled by ProtinfoCM/HH produced TM-scores greater than 0.4. In all ten cases, the shuffled sequences originated from 2Q2F, a linear alpha helix membrane protein. To investigate this phenomenon, we reshuffled the 2Q2F sequence and found that the output structures often showed high TM-scores when aligned to 2Q2F. We hypothesize that this is an artifact of TM-score: a structure as simple and linear as 2Q2F can readily produce a relatively high TM-score when compared to an arbitrary structure derived from known proteins.
The fact that all the methods evaluated involve selecting templates homologous to the input sequence may explain why the subsequences representing the smallest fractions of the native sequences tended to yield the lowest-quality predictions. The smaller the fraction of the native sequence an input subsequence represents, the more difficult it is for the sequence similarity searches used by the structure prediction methods to identify sequences that are homologous to the native sequence. This hypothesis is further supported by the result that, in most cases, if the prediction made from the source native sequence was poor, the subsequence predictions were also poor. This suggests that structures predicted from subsequences are accurate because the threading and sequence similarity searches used in the structure prediction process can identify templates homologous to the native sequence despite having only a partial-length input sequence. Another way to interpret the apparent decline in prediction accuracy as the subsequences get shorter is that a subsequence of a protein, in vitro or in its biological context, may fold differently than the native sequence. The apparent decline in performance for shorter subsequences may therefore be the result of accurate predictions of the folded subsequences.
3.2. Application of Protein Structure Prediction to EST Data
We have shown that I-TASSER, Rosetta and ProtinfoCM/HHpred can predict the structures of subsequences representing 50% or more of a native protein sequence with accuracy similar to that of structures predicted from native protein sequences. Given that EST sequencing techniques use nebulization to randomly fractionate the cDNA before sequencing [36], our benchmarking set of subsequences simulates translated EST sequences. Therefore, if a method such as ESTScan is used to predict the protein-coding region from high-quality EST sequences, and the resulting coding region contains 50% or more of the corresponding native protein sequence, these structure prediction methods can reliably predict the partial protein structure. Additionally, C-scores output by I-TASSER could help to identify ESTs that do not represent real protein sequences: a low C-score for a model produced from a translated EST suggests that the sequence would be unlikely to fold into a stable structure. Following structure prediction, an EST structure could be used as a source of information for annotating the EST. The partial structures could be input into automated structure-based function prediction methods, or inspected visually by a structural biologist. Functional information derived from accurate EST structures could supplement or validate EST annotations made using existing techniques that rely on sequence information alone. We examined applying structure-based functional analysis tools to the models generated from subsequences, but we were able to find functional annotations for only three of the ten benchmarking PDB structures. We determined that this sample size would be insufficient to rigorously investigate whether subsequence models could be useful for automated functional annotation, but future work should address this question.
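How a partial-length input relates to its native sequence can be shown with a minimal sketch. It assumes contiguous subsequences taken at a uniformly random offset, which is an assumption for illustration rather than a description of how the benchmarking set was constructed; the function name and the example sequence are hypothetical.

```python
import random


def random_subsequence(native: str, fraction: float, rng: random.Random) -> str:
    """Return a contiguous subsequence covering roughly `fraction` of `native`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    length = max(1, int(len(native) * fraction))
    # randint is inclusive at both ends, so the window always fits.
    start = rng.randint(0, len(native) - length)
    return native[start:start + length]


# Hypothetical sequence, not one of the benchmarking proteins.
native = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
rng = random.Random(0)  # fixed seed so the sketch is reproducible

fragment = random_subsequence(native, 0.5, rng)
coverage = len(fragment) / len(native)
```

A coverage value computed this way is the quantity the 50% criterion above refers to.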
3.3. Measuring the Foldability of Arbitrary Polypeptide Sequences
The only existing method we are aware of that assigns a single score representing the foldability of an arbitrary amino acid sequence is FoldIndex. FoldIndex was evaluated on a dataset in which the sequences labeled as foldable were full-length native protein sequences and the non-foldable sequences were proteins known to be intrinsically disordered [9]. While these proteins may be a reasonable choice as a set of non-foldable sequences, evaluation on this dataset alone is not enough to conclude whether a method will be reliable for arbitrary protein sequences. Intrinsically disordered proteins are relatively rare and poorly defined; therefore, treating them as a gold standard for arbitrary non-foldable sequences is problematic. We suggest that our dataset, which consists of native sequences, subsequences, and non-foldable sequences generated by shuffling the residues of the foldable sequences, is more reliable and comprehensive for evaluating foldability prediction of arbitrary polypeptide sequences. Experimental evidence has shown that shuffled sequences are very unlikely to produce stable tertiary structures in vitro. Shuffling, rather than randomly generating sequences, yields sequences with amino acid compositions identical to those of the native sequences. This ensures that any predictive ability observed is not the result of detecting deviations from the amino acid compositions of naturally occurring proteins.
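The composition-preserving property of shuffling can be checked directly. The following is a minimal sketch; the function name and the example sequence are hypothetical and are not taken from the benchmarking set.

```python
import random
from collections import Counter


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of the residues in `seq`."""
    residues = list(seq)
    rng.shuffle(residues)  # in-place Fisher-Yates shuffle
    return "".join(residues)


# Hypothetical sequence used only for illustration.
native = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
rng = random.Random(42)  # fixed seed for reproducibility

shuffled = shuffle_sequence(native, rng)

# Residue counts are unchanged; only the order differs.
same_composition = Counter(shuffled) == Counter(native)
```

By contrast, drawing each residue independently at random would generally change the composition, so a classifier could succeed merely by detecting unusual residue frequencies.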
3.4. Foldability Prediction Using C-score
We investigated using C-score as a means of predicting foldability, which we define as the likelihood that a given sequence of amino acids represents a portion of a stable protein structure. Our results demonstrate that C-score can effectively distinguish between shuffled and unshuffled sequences from our dataset, which suggests it would be useful for predicting the foldability of arbitrary polypeptide sequences. For comparison, we also evaluated the E-value output by a standard BLAST search in the same way C-score was evaluated. E-value represents a naive approach to predicting foldability, based on the reasonable assumption that non-foldable sequences would be unlikely to show sequence similarity to known proteins. Although we show that E-value has some predictive power, C-score is much more effective for classifying sequences labeled as foldable or non-foldable. C-score is an estimate of the quality of a structure prediction output by I-TASSER. The score is calculated from the quality of the alignments produced by matching the query sequence to regions of template structures from the PDB and from the degree of structural convergence of the assembly refinement simulations. Refinement simulations assemble structure conformations in parallel by combining template fragments and ab initio modeling of regions of the query sequence that did not align well to structure fragments from the PDB [4]. Therefore, C-score may be particularly useful for distinguishing foldable and non-foldable sequences, because it is based on the likelihood that the query sequence is homologous to known structures at the subsequence level, rather than globally, as performed by BLAST. Furthermore, the score is also derived from the structural convergence resulting from ab initio modeling, meaning it exploits information from what are essentially constrained simulations of protein folding.
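One way to quantify how well a score separates foldable from non-foldable sequences is the area under the ROC curve, which equals the probability that a randomly chosen foldable sequence outscores a randomly chosen non-foldable one. The sketch below is an illustration only: this section does not state which metric was used, the function name is hypothetical, and the scores are made-up placeholders rather than measured C-scores or E-values.

```python
def separation_auc(foldable_scores, nonfoldable_scores):
    """Probability that a foldable score exceeds a non-foldable score.

    Ties count as half a win. Higher scores must mean "more foldable".
    """
    wins = 0.0
    for f in foldable_scores:
        for n in nonfoldable_scores:
            if f > n:
                wins += 1.0
            elif f == n:
                wins += 0.5
    return wins / (len(foldable_scores) * len(nonfoldable_scores))


# Placeholder values for illustration only; NOT results from the experiment.
cscore_foldable = [1.2, 0.4, -0.8, -1.5]
cscore_shuffled = [-3.9, -2.7, -1.0, -4.2]
auc_cscore = separation_auc(cscore_foldable, cscore_shuffled)

# A lower E-value indicates a more significant hit, so negate it before ranking.
evalue_foldable = [1e-30, 1e-5, 0.2, 3.0]
evalue_shuffled = [0.5, 4.0, 8.0, 1.5]
auc_evalue = separation_auc(
    [-e for e in evalue_foldable],
    [-e for e in evalue_shuffled],
)
```

A value of 0.5 corresponds to no separation and 1.0 to perfect separation, so the two scores can be compared on the same scale even though one is a log-based confidence and the other an expectation value.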
E-value and C-score, however, were not effective for distinguishing shuffled and unshuffled subsequences that represented less than 50% of their source native sequence. Although in our experiment we considered all subsequences derived from native sequences to be foldable, many subsequences representing relatively small portions of native proteins might not in fact fold into stable conformations. Therefore, C-score may have correctly classified the shorter subsequences in our dataset as non-foldable. Better assessing the accuracy of C-score as a predictor of foldability would require evaluation on a dataset in which the sequences labeled as non-foldable have been verified experimentally.