One approach to predict a protein fold from a sequence (a target) is based on structures of related proteins that are used as templates. We present an algorithm that examines a set of candidates for templates, builds from each of the templates an atomically detailed model, and ranks the models. The algorithm performs a hierarchical selection of the best model using a diverse set of signals. After a quick and suboptimal screening of template candidates from the Protein Data Bank, the current method fine-tunes the selection to a few models. More detailed signals test the compatibility of the sequence and the proposed structures, and are merged into a global fitness measure using linear programming. This algorithm is a component of the prediction server LOOPP (http://www.loopp.org). Large-scale training and test sets were designed and are presented. Recent results of the LOOPP server in CASP8 are discussed.
Homology modeling of protein structures is usually divided into three steps. In the first step, structural templates are identified from a set of experimentally determined protein structures (the Protein Data Bank, PDB ). In the second step, an alignment between the sequence of the target and each of the templates is obtained, and atomically detailed structures are constructed from the templates based on the alignment. The atomically detailed models are finally assessed and ranked. The process of template selection can be difficult; therefore we divide it into two sequential steps: (i) template enrichment (Phase 1) and (ii) template focusing and model building (Phase 2). Empirically, we found that to begin with about 1.7% of the target-template pairs are true hits, while after Phase 1 enrichment the percentage of true hits increases to 18%.
The following definitions are used throughout the manuscript: a “hit” is a prediction by the algorithm that the proposed match of a template and a target is likely to be successful. A true hit means that the prediction of the algorithm is correct. It is frequently referred to as a T pair or a T match. A false hit is an incorrect template-target match of the algorithm. It is also called a D pair (D for decoy).
Earlier, we presented a paper on a mathematical programming based method for enrichment of suitable templates for target proteins . The present work follows the previous paper and constitutes the next step of the LOOPP server (http://www.loopp.org) for protein structure prediction. Tentative hits identified in the first step (Phase 1) are forwarded to Phase 2, where atomically detailed models are built with the program Modeller , based on the templates determined during Phase 1 and the alignments of SSALN . The models are assessed using a new learning and scoring algorithm, described and discussed in the present manuscript, which constitutes Phase 2 of the LOOPP server. Phase 2 typically provides a final list of five to twenty top structural candidates for the sequence of the target.
From the perspective of finding the best model, the division into Phases is not optimal. The enrichment step may miss some true hits (structural templates that provide good atomic models of the target) and not include them in the subset forwarded to Phase 2. These misses, even if they would have been detectable by the filters of Phase 2, remain undetected because they are never forwarded. Therefore the current LOOPP procedure is less sensitive than an alternative implementation that examines all the PDB structures with the best measures we have at hand. We discuss below the reasons that led to the present computational model of LOOPP.
At the core of the algorithms for Phase 1 and Phase 2 are similarity measures that test the fitness of the sequence of the target to the sequence and structure of a template. As discussed in reference , the different similarity measures are learned with mathematical programming and are combined into scores that rank the pairs of targets and templates. The algorithm of Phase 1 uses only a fraction of the similarity measures that are available to us. Not using all of the measures results in suboptimal performance and less accurate ranking of some of the pairs compared to the ranking of Phase 2. The reason for not using Phase 2 from the beginning is computational cost: some of the similarity measures that are used effectively in Phase 2 are expensive to compute. Phase 1 examines a representative set of the whole Protein Data Bank (PDB) , and the large number of comparisons makes it necessary to avoid some of the expensive similarity measures used in Phase 2.
For example, consider the comparison of two sequences. Let the raw optimal score between target i and template j be Tij. The Z score is defined as Zij = (Tij − ⟨Tij⟩)/σij, where the brackets ⟨…⟩ denote an average over optimal alignments of the template sequence j with randomly shuffled sequences that have the same amino acid composition as the target i, and σij is the standard deviation of those shuffled-alignment scores. To obtain meaningful averages, hundreds to thousands of alignments are required, making the Z score calculation more expensive than a single alignment by two to three orders of magnitude.
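The shuffling procedure above can be sketched as follows. Here `align_score` is a hypothetical placeholder for any pairwise alignment scoring routine, not the scoring used by LOOPP itself:

```python
import random

def z_score(target_seq, template_seq, align_score, n_shuffles=500, seed=0):
    """Estimate the alignment Z score: compare the raw optimal score of the
    real target against scores for shuffled sequences with the same amino
    acid composition.  A sketch; each shuffle costs one full alignment,
    which is why the Z score is 2-3 orders of magnitude more expensive
    than a single raw-score calculation."""
    rng = random.Random(seed)
    raw = align_score(target_seq, template_seq)
    residues = list(target_seq)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)          # preserves amino acid composition
        scores.append(align_score("".join(residues), template_seq))
    mean = sum(scores) / n_shuffles
    var = sum((s - mean) ** 2 for s in scores) / n_shuffles
    return (raw - mean) / (var ** 0.5 or 1.0)  # guard against zero variance
```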
Therefore, despite the observation that the Z score is significantly more sensitive and specific, we did not use it in Phase 1. Phase 1 ranking is based on raw scores only (and BLAST statistical evaluation when possible), evaluated for 13,875 proteins in the database. Of course, a sequence alignment score is not the only similarity measure that we use to select the candidates of Phase 1; for example, threading and alignment against secondary structure were used as well (see the appendix of reference  for a complete list of similarity measures that we used in Phase 1). We then make the assumption that Phase 1 is sufficiently accurate to capture the true hits in the top 200 (from a total of 13,875 candidates). If Phase 1 ranking were perfect (in the sense that the template providing the best structural model is ranked number 1), then only one template would be required for further model building. However, it is not. Another complication is that Phase 1 depends on the quality of subsequent steps, such as the alignment (which we perform with SSALN ) and the construction of an atomically detailed model (which we do with Modeller ). It is likely that modeling the structures with other programs (e.g., different alignment algorithms or different construction of loops and side chains) would impact the learning.
We strictly differentiate between learning and testing of the prediction model (see section II). We call the set of proteins we learn from the Learning Set (LS) and use it to optimize the parameters and the functional form of the computational model. The set TS1 includes the proteins of our most comprehensive test case, which is built completely independently of the LS. To account for some of the inaccuracies of Phase 1 ranking, the number of structures that we forward to Phase 2 is 200. In our learning and test cases of Phase 1 we miss 157,992 of the 418,037 true hits (LS) and 35,759 of the 91,449 true hits (TS1). These numbers may seem highly significant; however, in the end we wish to obtain one good model per protein, and we care less about having 100 good models for a particular target sequence. The number of proteins that lost all of their templates is small and stands at 811 out of 12,689 proteins (LS) and 198 out of 3,802 proteins (TS1). These are fractions of 0.064 (LS) and 0.052 (TS1) of the total number of proteins we have considered. The comparable loss for the learning and test cases is reassuring from the perspective of over-learning, and we expect it to be similar for future predictions.
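The quoted loss fractions follow directly from the counts in the text:

```python
# Fraction of proteins that lose all of their true templates after Phase 1,
# computed from the counts quoted in the text.
ls_lost, ls_total = 811, 12689        # learning set (LS)
ts1_lost, ts1_total = 198, 3802       # test set TS1
ls_fraction = round(ls_lost / ls_total, 3)     # 0.064
ts1_fraction = round(ts1_lost / ts1_total, 3)  # 0.052
```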
Phase 2, which is discussed in the present manuscript, deals with a much smaller number of candidates for true hits. The limited number of candidates allows for full construction of atomically detailed models for each candidate and the use of comprehensive measures of model accuracy in a calculation feasible on a typical cluster. On a cluster of 20 CPUs a structure prediction of a protein of length 200 amino acids requires 3–5 hours. The time is significantly longer for longer proteins (about 11–15 hours for 500 amino acids), however even this calculation is accessible with moderate computational resources.
The rest of the text is divided into a detailed description of the data sets used for training and testing, a description of the learned model, and a detailed analysis of the performance of the model on various tests. We finally discuss the performance of LOOPP during CASP8.
The learning set (LS) follows from the one used in Phase 1 . It constitutes an extensive dataset of proteins selected from the available folds of the Protein Data Bank (PDB)  and from the domains of SCOP  as of 6/28/2005. In the initial selection, proteins from the PDB with less than 70 percent sequence identity to other members of the set were kept, providing a total of 9,513 single peptide chains. This set is supplemented with representative folds from the complete SCOP hierarchy, giving a total of 13,825 targets and templates in the learning set, which we call LS. From a total of 71,824,926 pairs, 1,187,173 pairs were shortlisted as probable true templates by the Phase 1 prediction model and forwarded to Phase 2. A complete list of the selected structures is available at http://www.loopp.org/ls2pairs.txt
In Phase 2, all-atom models are generated for the targets based on all the templates identified in Phase 1. This is done using Modeller , which generates an atomically detailed structure of the target based on the fold of the template and an alignment of the target with the template. The alignments were generated by the algorithm SSALN  implemented in LOOPP. Modeller is a widely used resource in the field, and hence is a component of our structure prediction system LOOPP (http://www.loopp.org). The learning of scores for optimal alignment in LOOPP is documented elsewhere [4,6] and is beyond the scope of this paper. Here we focus on the identification of the best templates and models for a given target.
The true templates (T) and the decoys (D) among these pairs are identified based on the similarity of the model (built from the template) to the native structure. Since the accuracy of a predicted model is based on how close it is to the target structure, we define an acceptable model as one within 6.5 Å all-atom RMS distance from the native structure. For each target, we identify the true models using this criterion and define the remaining ones as decoys (D). For targets that do not have any true model under this criterion, we relax the RMS distance cutoff to 7.0 Å to identify the true templates and decoys. We thus obtain 209,090 true hits (T) and 978,083 decoys (D) for a total of 12,527 proteins in the LS. Some of these proteins have both T and D representatives, whereas others have only T or only D: 11,694 proteins have one or more T pairs and 11,093 proteins have one or more D pairs as classified above.
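The classification criterion can be sketched as follows; `models` maps template identifiers to all-atom RMSDs from the native structure, and the cutoffs are the ones quoted above:

```python
def classify_pairs(models, primary_cutoff=6.5, relaxed_cutoff=7.0):
    """Split one target's models into true hits (T) and decoys (D) by
    all-atom RMSD to the native structure, relaxing the cutoff when no
    model passes the primary one.  A sketch of the paper's criterion."""
    true_hits = {t for t, rmsd in models.items() if rmsd <= primary_cutoff}
    if not true_hits:  # no acceptable model: relax the cutoff to 7.0 A
        true_hits = {t for t, rmsd in models.items() if rmsd <= relaxed_cutoff}
    decoys = set(models) - true_hits
    return true_hits, decoys
```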
Figure 1 shows the histogram of TM scores for the T pairs. TM-align is a structural alignment algorithm that was developed in Skolnick’s laboratory and is used extensively in our studies . Along with the alignment, it also provides a numeric score between 0 and 1 that indicates the degree of structural similarity between the two structures. The higher the score, the closer the structures; a TM score of 0.5 or above is considered significant . We see from Figure 1 that most of the T pairs have a TM score of 0.5 or more. Of a total of 209,090 T pairs, about 11,500 pairs have TM scores below 0.5. Although these are classified as T based on our RMSD criterion, their TM scores indicate that these pairs might not have significant structural similarity.
Similar to the LS, the test sets used here also follow from Phase 1 and are designed to be independent of the LS. We have generated two test cases: TS1 and TS2. In the construction of TS1 we examine all new PDB entries from 6/28/2005 to 6/13/2006; hence, the structures collected do not overlap with the training set. Using the same screening procedure as for the LS, we get 4,183 proteins in TS1. Of the 22,096,370 pairs, 310,031 pairs were shortlisted as probable true hits in Phase 1 and forwarded to Phase 2. These consist of 3,779 individual proteins. The Phase 2 classification procedure yields 39,364 Ts and 270,667 Ds.
The second test set (TS2) is based on the CASP7 targets (http://predictioncenter.org/casp7/). CASP is a community-wide experiment for the critical assessment of methods for protein structure prediction [8,9], in which protein sequences with pre-determined but undisclosed structures are provided as targets, and the models predicted by various groups are assessed based on their similarity to the native structure. The CASP7 experiment was conducted from May to July 2006, and therefore our learning set did not overlap with structures from CASP7. We had 82 proteins with 702,828 pairs to begin with in Phase 1. The Phase 1 prediction model forwarded 3,451 hits, consisting of 55 individual proteins, to Phase 2. Using the same criterion as in LS and TS1, we obtain 577 T and 2,874 D pairs in the CASP7-based TS2.
In addition to TS1 and TS2, we report preliminary results of the LOOPP server in CASP8.
The similarity measures used here are similar to those used in Phase 1 , except that we allow the use of more expensive measures that add significantly to the sensitivity of the algorithm at sizable computational cost. Assessing a model in Phase 2 typically requires 51 seconds compared to 0.4 seconds for a pair assessment in Phase 1 (the estimate was for a protein of 189 amino acids). When we compare a target to a template, we have the detailed three-dimensional description of the template, whereas we only have sequence-related information for the target. We built a set of characteristics (called C) of each protein to probe the similarity between the target and the template. The characteristic C is a string of vectors attached to amino acid sites. The site vectors may include the identity of the amino acid at the site, secondary structure, substitution probability in that site derived from multiple sequence alignment (profile), etc. The target characteristics include:
The three-dimensional co-ordinates of the template allow us to generate more characteristics:
There are many ways of combining and comparing these characteristics between the target and the template to obtain matching scores that we call features (F) of pairs. For example, we may match a sequence with a sequence, secondary structure with a profile, test sequence fitness to contact maps (threading ), match predicted secondary structure of the target with actual secondary structure of the template, etc. The features are generally denoted by a four-letter code such as SEQG (global sequence alignment), SEQL (local sequence alignment), TRDG (global threading), TRDL (local threading), TBLS (PSI-BLAST), TSCG (global secondary structure alignment), PSMG (sequence-to-profile matching, global), SRFL (local exposed surface area alignment), etc. Refer to the appendix of reference 2 for more details. To avoid repetition, the features are not discussed in detail in the present manuscript. We used 20 features in Phase 1, but have dropped 2 of these in Phase 2 (KMER and Contact Factor) because they proved insignificant, leaving 18 features.
Given two characteristics C1 and C2, there is a need for an alignment between the target and the template to generate a score, a scalar feature F12. In LOOPP, we consider approximate alignments (BLAST), and exact (local or global) alignments determined by dynamic programming. We also use constant or structurally dependent gap penalties . We use four different types of scores when we compare the characteristics of the target and template to obtain a scalar feature value:
where the average ⟨S(C1, PC2)⟩P is over the optimal scores of the permutations of the characteristic C2 at the amino acid sites; a single permutation is denoted by PC2. The computational cost of the Z score was discussed in the introduction; it is between 100 and 1,000 times more expensive than the calculation of the raw score.
Z scores have been obtained for 12 of the 18 features in Phase 2. The raw energies, reverse energies, and Z scores contribute a total of 48 features. In addition, we also use two new secondary structure based features, SSPOS and SSCOMP. SSPOS compares the positions of the secondary structure elements of the template (actual) and the target (predicted), whereas SSCOMP compares the amino acid compositions of the actual and predicted secondary structures of the template and the target, respectively. The prediction of the secondary structure of the target sites was made with the program SABLE [10,11]. Further, in Phase 2 we also generate atomic models for the chosen pairs and use the following potentials to derive the corresponding energies for these models. These energies are then used, along with the features described earlier, in our training.
Modeller  generates the atomic models based on templates chosen in Phase 1 and alignments provided by the SSALN algorithm . To assess the quality of the Modeller output we compare three structures: 1) the native structure of the target sequence (which is known during training), 2) the structure of the template (the homologous protein on which the modeling is based), and 3) the model of the target that Modeller built based on the template. Ideally the similarity (based on the TM score ) of structure 3 to structure 1 should be higher than the similarity of structure 2 to structure 1, since a refinement was performed. This is, however, not always the case: sometimes structure 2 is closer to structure 1 than structure 3 is. This is especially unfortunate when the template and target structures are very close to begin with and Modeller produces structures that are farther from the target than the template.
The quality of the model is expected to degrade as the model drifts away from the template. We therefore examine the TM alignment score between the template and the model. By examining the drift between the template and the structure generated by Modeller we obtain another independent assessment of the quality of the results. Thus, we have a total of 55 features that we use in our prediction model.
Given a set of features that measure the similarity between pairs of proteins (we have 55 features, some of them strongly correlated), we seek a computational model that uses these features to classify the pairs as T (true matches) or D (decoy, i.e., false matches). We have already seen in Phase 1  that the tools of Mathematical Programming (MP)  are effective in this regard, and hence we use MP in Phase 2 as well. The construction of the mathematical programming model is explained in detail in the Phase 1 paper ; since we follow a similar approach, we avoid repeating the details and provide only the important points. We seek a score, Q(i, j), to rank the matching quality of a pair of a target i and a template j, which is a linear combination of the features.
The γα are the unknown coefficients to be determined from the learning set with MP; the score function depends parametrically on them. In Phase 1  we used scores with a moderately more complex dependence on the features. For example, a score was defined as a quadratic expansion of the features (e.g., the similarity score of the pair of proteins i, j (template and target) was Q(i, j) = Σα γα Fα(i, j) + Σα,β γαβ Fα(i, j) Fβ(i, j)). In the present study we found that the more complex expansion did not enhance the recognition capacity of the algorithms; the number of true hits did not increase. We therefore retain the simple linear form of equation (1), Q(i, j) = Σα γα Fα(i, j).
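The linear score of equation (1) amounts to a weighted sum over the feature codes; a minimal sketch, with hypothetical feature values and coefficients for one target-template pair:

```python
def q_score(features, gamma):
    """Linear matching score Q(i, j) = sum over features of
    gamma_a * F_a(i, j), keyed by the four-letter feature codes
    (SEQG, TRDG, TBLS, ...).  A minimal sketch of equation (1)."""
    return sum(gamma[name] * value for name, value in features.items())

# hypothetical feature values and coefficients for one pair
pair = {"SEQG": 1.2, "TRDG": 0.4, "TBLS": 2.0}
gamma = {"SEQG": 0.5, "TRDG": 1.0, "TBLS": 0.25}
score = q_score(pair, gamma)  # 0.6 + 0.4 + 0.5 = 1.5
```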
The training process requires sets of target and template pairs that are pre-classified as D (target-decoy) and T (target-true hit) pairs. It determines the linear coefficients γ subject to maximal margin and feasibility conditions [23,24]. We require that a match of a protein i with any T template (say j) have a better score than a match with a D template (say k). The requirement is written as the inequality QT(i, j) − QD(i, k) > 0 for a single comparison of the score QD(i, k) of a decoy (D) with the score QT(i, j) of a true match (T) for target protein i. This inequality is linear, and the set of all inequalities that we consider can be written compactly as
where the indices ij and ik denote a true and a decoy pair, respectively. The sum is over the elements of the scalar product, i.e., of the linear coefficients γ (to be determined) and the difference of the similarity measures of the two pairs. Numerically it is difficult to differentiate between a solution that is slightly larger than zero and the solution γ = 0, which is exactly zero. To set a scale for the values of the parameters, and to avoid the trivial solution, it is convenient to write the inequalities with a margin of one, Σα γα (Fα(ij) − Fα(ik)) ≥ 1
Equation (3) allows learning from negative (D) examples in addition to positive (T) examples. It therefore suggests a richer and more complete description of the data compared to classification algorithms that learn from positive examples only.
After extensive trials of different functional forms of the similarity measures, we were not able to find a single Q function that makes the problem feasible for all pairs of T and D and generalizes well to the test cases. This means that the set of features we have at present is insufficient to generate such a desired Q score. We believe that such a single function exists, since a free energy surface that selects native folds has been illustrated for many proteins; we simply do not know the proper functional form. Nevertheless, it is still possible to find a simple similarity measure that minimizes an error function and generalizes well to the test cases. The solution, however, does not satisfy all the inequalities of Eq. (3). An alternative formulation is therefore required, which is a slight adjustment of Eq. (3).
where the ηij,ik are (positive) slack variables that make the new set of inequalities feasible.
Apart from the inequalities generated as QTij−QDik > 0, as mentioned above, we also generate inequalities of the type QTi1−QTij > 0, where the pair i1 is the best true hit and the pairs ij are the remaining true hits (ranked 2 and below) for a given protein i. Eq. (4) then becomes:
These inequalities are added to the inequalities obtained from the QTij−QDik pairs during learning. This is done to enforce learning such that the best T is ranked first, in addition to learning to rank the Ts above the Ds. In Phase 2 we aim to identify the best hit in the top 1 or top 5, and we expect the QTi1−QTij > 0 inequalities to help in this regard.
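As a toy illustration of the margin inequalities (not the interior-point method of the actual solver used in the paper), a simple perceptron-style update can find coefficients γ satisfying Σα γα (Fα(T) − Fα(D)) ≥ 1 on a small separable set of feature-difference vectors:

```python
def learn_gamma(diffs, n_features, lr=0.1, epochs=200):
    """Find coefficients gamma satisfying the margin inequalities
    sum_a gamma_a * (F_a(T) - F_a(D)) >= 1 on a separable toy set.
    A perceptron-style stand-in for the paper's LP solver
    (illustration only, without slack variables)."""
    gamma = [0.0] * n_features
    for _ in range(epochs):
        updated = False
        for d in diffs:
            margin = sum(g * x for g, x in zip(gamma, d))
            if margin < 1.0:  # violated inequality: nudge gamma toward it
                gamma = [g + lr * x for g, x in zip(gamma, d)]
                updated = True
        if not updated:       # all inequalities satisfied
            break
    return gamma

# toy difference vectors (true-pair features minus decoy-pair features)
diffs = [[1.0, 0.5], [0.8, 1.0], [1.2, -0.2]]
gamma = learn_gamma(diffs, 2)
```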
We use the Mathematical Programming (MP) solver PF3 , which is based on an interior point algorithm and is tuned specifically to solve problems like Eqs. (4) and (5) that are frequently encountered in bioinformatics. The actual number of pair comparisons (the number of inequalities) that we attempt to satisfy in Phase 2 is much smaller than in Phase 1, since we have fewer T and D pairs. Hence, here we attempt to solve all 10,922,967 inequalities (the total from the QTij−QDik and QTi1−QTij pairs) in one go. PF3 provides a solution to this problem within minutes using 20 computer nodes. The solution is then checked on all the inequalities of the set. The results, as stated above, are not exact: not all inequalities are satisfied with the computed vector of coefficients γ. Another way to quantify our prediction system is by the statistical significance of the predictions. Such a measure can be obtained from statistical theory (as in BLAST ), which we do not have for our multiple signals, or from numerical studies of a large number of examples. We can estimate the frequencies of T and D values quite accurately since the number of empirical observations is large. A score is considered significant if the probability that it is a D (Pr(D)) is less than or equal to 0.001. Since our learning set was significantly cleaned up in Phase 1, we have a limited set in Phase 2; hence, using such a probability measure identifies approximately one D per protein in the dataset in Phase 2. We define the significance score as SC = 1 − Pr(D), and this score is used to discriminate the Ts from the Ds in each step of our prediction model. In Phase 1 we used a tree of multiple linear scoring branches; we use a similar approach in learning Phase 2, as described below.
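The empirical significance measure can be sketched as follows, assuming score samples for known T and D pairs from the learning set (a simple tail-fraction estimate, not the exact density estimate used in the paper):

```python
def significance(score, t_scores, d_scores):
    """Empirical significance SC = 1 - Pr(D) at a given score value:
    among all training pairs scoring at least `score`, the fraction
    that are decoys estimates Pr(D).  A sketch of the paper's
    empirical significance measure."""
    above_t = sum(s >= score for s in t_scores)
    above_d = sum(s >= score for s in d_scores)
    total = above_t + above_d
    if total == 0:
        return 0.0  # no empirical support at this score value
    return 1.0 - above_d / total
```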
We have observed that a single Q score is insufficient to discriminate the Ts from the Ds. This is because some features, like the sequence similarity measures, carry very strong signals that mask the signal from other features. We believe that if we remove the sequences detected with one Q score, it is possible to learn another score that focuses on weaker (but nevertheless significant) signals. The idea is to use multiple scores, where different Q are learned in sequence on a shrinking template database.
We have learned that PSI-BLAST is the most dominant signal and contributes significantly to the first branch. It is therefore convenient to use PSI-BLAST as a single feature in the zeroth branch and identify all the strong PSI-BLAST pairs. This filters out the strong PSI-BLAST pairs and thereby aids in recognizing the signal from the other features in the following branches. We had a similar zeroth branch in Phase 1 as well .
We generate a PSI-BLAST profile for the target sequence using the NR database http://helixweb.nih.gov/hilexdb.php (three cycles with an E-value threshold of 0.01). This profile is compared to the sequences of the templates to yield the desired score (The log of the E-value, called TBLS_e in the feature table of the Appendix). Using the significance measure discussed earlier, all pairs with significance score, SC, larger than 0.999 were declared hits. These are kept for the final analysis in which the selected pairs from all the branches are sorted and ranked to identify the best models. All pairs that are not hits are forwarded to the next branch.
An assessment of an optimal single score can be made using PSI-BLAST, which is widely employed in template detection. In Phase 2 we use PSI-BLAST with a reasonably high confidence level (E-value smaller than or equal to 0.001), and at that accuracy it detects less than half of the true hits we pick with the tree. Relaxing the confidence level to an E-value of 0.1, we find in branch 2 alone a larger number of true hits (72,625 instead of 22,433). Unfortunately, the number of false hits increases significantly, from 9,817 to 207,827, making the selection of top templates very difficult.
Another feature of significance is the final score of the previous LOOPP version that participated in CASP7 (LP7). This score is a single linear combination of a subset of the features discussed earlier and in the appendix of ref. . We find that this score carries an important signal and can be used to find many strong hits beyond PSI-BLAST. We apply the same significance measure to all pairs forwarded from the PSI-BLAST branch and identify those with SC larger than 0.999 as hits, which are chosen for final ranking. The rest, as usual, are forwarded to the next branch.
With the pairs recognizable by straightforward PSI-BLAST and LP7 scores removed in the zeroth and first branches, respectively, we seek in the second branch a prediction model linear in the features Fα described in the Appendix of reference . The coefficients γα(2) (the superscript (2) stands for branch 2) are determined from equation (5). We load all 10,922,967 inequalities to solve for γα(2). We then evaluate the scores for all the pairs and identify those with SC greater than 0.999. These are retained for final ranking and the rest are forwarded to the next branch. This procedure is repeated through branches 2 to 6, with a diminishing number of T and D pairs to learn from in each branch. We select predicted T pairs from each of these branches following the same SC procedure: all hits with scores above the threshold are forwarded for final evaluation, and the pairs below threshold are forwarded to the next branch. As before, the threshold was set at a score value for which the probability of being a D was equal to or smaller than 0.001. In Figure 2 we show the probability densities of T and D for the second branch. While there is a significant overlap between the two, there remains a significant tail of T pairs that can be picked with sufficient confidence. As we move from one branch to the next, the recognition capacity diminishes, except in branch 6, where there is an increase in recognition capacity compared to the previous branches. This is due to the introduction of new features (FREADY, SIFT) in this branch. We stop our learning at the 6th branch because, beyond this, the recognition capacity becomes insignificant, i.e., the number of T pairs recognized with SC greater than 0.999 is negligible.
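The branch-and-forward logic of the tree can be sketched as follows; the `branches` list stands in for the learned per-branch significance functions:

```python
def cascade(pairs, branches, threshold=0.999):
    """Hierarchical filtering: each branch scores the pairs still in play,
    keeps those whose significance SC exceeds the threshold, and forwards
    the rest to the next branch.  `branches` is a list of functions mapping
    a pair to its SC value (a sketch of the tree of scoring branches)."""
    hits, remaining = [], list(pairs)
    for sc_of in branches:
        forwarded = []
        for pair in remaining:
            (hits if sc_of(pair) > threshold else forwarded).append(pair)
        remaining = forwarded  # below-threshold pairs go to the next branch
    return hits  # pooled hits from all branches, ready for final ranking
```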
As in Phase 1, we also tried similarity measures derived from the features transformed into a uniform distribution, which can be thought of as an alternative kernel , as well as the quadratic expansion of the uniformly distributed variables. Although these were very useful in Phase 1, they were ineffective in Phase 2.
At this stage we have a collection of predicted T pairs from branches zero to six. The typical number of T predictions per protein varies from 1 to 200. We now need to rank the predicted T pairs that are pooled together from the different branches. We use another linear score for the final sorting of the matches, and we learn it the same way as in the previous branches: using the features mentioned earlier and solving for the coefficients from inequalities of the type given in Eq. (5), using PF3. The score thus obtained is a linear combination of the same features, but the learning is done from the T and D pairs chosen from branches zero to six. Hence, the learning set here is minimal and is enriched with more Ts and fewer Ds; this also yields a smaller number of inequalities to learn from. The linear score thus learned is used for the final sorting of the pairs chosen from the branches of the tree. By far the largest coefficient of the linear combination is, as in branch 6, that of LP7.
It is possible that re-learning the final ranking score using the pairs filtered from branches zero to six and the same features leads to over-learning. Over-learning can produce a prediction model that performs well on the training set but poorly on independent test cases. We examine the performance of the prediction model on the test sets to gain insight into this potential problem.
The computed coefficients for all five branches of the tree plus the final ranking are summarized in Table 1a. For the purpose of comparison it is convenient to normalize the vector of coefficients to unit norm, |γ| = 1, for all branches. We notice that some of the features are not used in all the branches (FREADY, the LP7 score, etc.); these are mainly features that were derived much later and were not available during the initial stages of learning. In Table 1b we provide the product of each coefficient and the variance of the corresponding feature. This measure is another indicator of the potential contribution of a particular feature to the total score of a branch: if the variance of a feature is high, it can (potentially) make a significant contribution even if the coefficient is small. Table 1b suggests dominant contributions from only a few features (All-atom energy, SEQG_e, TRDG_e, TRSG_e, PSMG_e, PSML_e). However, this picture is somewhat misleading. It is possible to have a large variance and still only low recognition capacity. We know that the use of the other features (such as the Z scores) is necessary in order to obtain good recognition.
There are two factors that make the branches different from each other. First, the prediction spaces of the branches differ due to the sequential elimination of T pairs. Second, the coefficient vectors are not the same. These differences are further enhanced in the final ranking branch, since it learns from the pairs selected from the 0–6 branches pooled together. To examine how similar the coefficient vectors of the different branches are, we evaluate the scalar products of the normalized vectors; these are presented in Table 2. We find that the coefficients of branches 2–5 are similar, with scalar products in the range 0.87–0.99, whereas those of branch 6 and the final ranking differ from the rest (scalar products in the range 0.03–0.4). However, the scalar product of the coefficient vectors of branch 6 and the final ranking is 0.8838, showing that these two are very similar. Both scores are dominated by LP7.
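The branch comparison above is a scalar product of unit-normalized vectors, i.e. a cosine similarity. A minimal sketch, with hypothetical coefficient vectors standing in for the branch coefficients of Table 2:

```python
import math

def cosine(u, v):
    """Scalar product of the unit-normalized coefficient vectors (Table 2)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

# Hypothetical coefficient vectors for three branches; branch 6 puts most
# of its weight on a feature (e.g. LP7) the earlier branches barely use.
branch_gammas = {2: [1.0, 2.0, 0.5], 3: [1.1, 1.9, 0.6], 6: [0.1, 0.1, 5.0]}
sim_2_3 = cosine(branch_gammas[2], branch_gammas[3])  # similar branches
sim_2_6 = cosine(branch_gammas[2], branch_gammas[6])  # dissimilar branches
```

Values near 1 indicate branches that weight the features almost identically; values near 0 indicate essentially unrelated scoring directions.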
Looking at the features providing the dominant signals in each branch, we find that in branches 2–5 the secondary structure based features SSPOS and SSCOMP are very dominant. Apart from these, OPTM_e, SEQG_z, TRDG_z, TRSG_z, and TSSL_z provide signals in branches 2–5, whereas SEQL_z is strong in branches 2, 4, and 5. OPTM_e is a mixture of threading, secondary structure, and sequence alignment substitution tables; SEQG_z and SEQL_z are global and local sequence alignment Z scores, respectively; TRDG_z and TRSG_z are threading-based global Z scores; TSSL_z is a combination of sequence, secondary structure, and threading signals. In addition, the profile-sequence matching score PSML_z is dominant in branches 2 and 3, whereas PSMG_z is dominant in branches 4 and 5. Further, the threading scores TRDL_z and TSSG_z are dominant in branch 3 and in branches 4 and 5, respectively.
Branch 6 is different because new features are introduced that provide dominant signals. The LP7 score, FREADY, and SIFT [20,21] are all new features that are dominant in this branch. In the final ranking branch we find that the weight of the LP7 score is very high, thus skewing the other weights. FREADY is a coarse-grained energy computed from the final atomically detailed model. SIFT is a model assessment score that combines sequence dependent secondary structure and exposed surface area prediction with a mean radial distribution function-based assessment of packing, which is independent of the sequence of the template. LP7 is a linear combination of the following features: protein length, SEQL_e, TRDL_r, TRSG_e, TRSG_r, PSML_r, SRFG_e, OPTM_e, TSSG_r, TRSL_e, and TSCL_e.
Although the dominant LP7 score does not contain any Z scores, note the predominance of Z scores over raw or reverse scores among the other features for which Z scores were evaluated. This underlines the significance of Z scores and justifies the computational time spent in evaluating these expensive features. Also significant is the absence of the PSI-BLAST related features (TBLS, TBSS, and SBLS) that were the most dominant signals in the branches of the Phase 1 tree. Although we use PSI-BLAST (TBLS_e) in branch zero and recognize the maximum number of T pairs in this branch, we find that in the remaining branches and in the final ranking PSI-BLAST plays a minimal role. The profile matching scores PSMG_z and PSML_z, along with the simple global and local alignment scores SEQG_z and SEQL_z, are the only sequence-based features making any contribution to these branches. This suggests that the tree method eliminates the dominant PSI-BLAST signal in branch 0 and helps pick up the signals from the other features in the later branches. It also validates our tree-based algorithm and enables us to identify hits that cannot be recognized with PSI-BLAST alone. However, the drawback is that valuable PSI-BLAST hits can be lost in the final ranking when other features are dominant there. We discuss this in detail in a later section.
The developed prediction tree is applied as follows. The target sequence is first examined with branch 0: PSI-BLAST scores are computed for target-template pairs where the templates are taken from the complete Protein Data Bank (PDB). LOOPP has a standard database that is used by all other branches; however, PSI-BLAST is so efficient that we probe the complete PDB with it. If the probability of observing a false hit is smaller than 10^-3, we accept the match as a hit and store it in the list of candidates. If the structure is found in the standard LOOPP database, we remove it from the set of structures to examine in the next branch. In the next branch we seek hits of the target with templates that were not detected before; hence the data set we examine is a subset of the total. Each subsequent branch does not consider the hits of previous branches. At the end of the process (when the last branch delivers its hits to the pool of candidates) all the hits are ranked against each other. The final ranking is done with another Mathematical Programming score.
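The branch-by-branch procedure above can be sketched as a cascade: each branch scores only the templates that earlier branches did not already accept, and the pooled hits are ranked at the end. The scorers, thresholds, and template names below are all hypothetical.

```python
# Sketch of the cascaded prediction tree: each branch examines only the
# templates that earlier branches did not already accept as hits, and all
# pooled hits are ranked at the end by a final score.

def run_tree(templates, branch_scorers, thresholds, final_score):
    remaining = set(templates)
    pool = []
    for scorer, threshold in zip(branch_scorers, thresholds):
        hits = {t for t in remaining if scorer(t) >= threshold}
        pool.extend(hits)
        remaining -= hits        # later branches skip accepted templates
    return sorted(pool, key=final_score, reverse=True)

# Hypothetical two-branch run: branch 0 catches 'a', branch 1 catches 'b'.
templates = ["a", "b", "c"]
scores0 = {"a": 0.9, "b": 0.2, "c": 0.1}
scores1 = {"a": 0.9, "b": 0.8, "c": 0.3}
ranked = run_tree(templates, [scores0.get, scores1.get], [0.5, 0.5],
                  final_score=scores1.get)
```

The key design point is the shrinking candidate set: a template accepted by an early branch (such as the PSI-BLAST branch) is never re-scored, which is what allows the later branches to pick up signals that PSI-BLAST masks.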
Figure 3 shows comparative statistics of the number of hits identified per protein in Phase 2 with respect to Phase 1. The plot shows a histogram of the number of proteins as a function of the number of hits forwarded from Phase 1 and the corresponding number of hits detected in Phase 2. We see that most of the proteins with between 0 and 10 hits identified in Phase 1 are also identified in Phase 2. There are 899 proteins for which no hits are detected in Phase 2, irrespective of the number of hits forwarded from Phase 1. Of these, there is a handful of cases (77) in which proteins with more than 10 hits forwarded from Phase 1 have zero hits detected in Phase 2. The remaining proteins have at least one hit identified in Phase 2, and a few have as many as a hundred hits identified in both Phase 1 and Phase 2.
As in Phase 1, we evaluate the number of T pairs identified and the number of proteins with at least one hit and/or T hit identified in Phase 2. These results are tabulated in Table 3. A few points to note:
Therefore, the overall performance of the tree is similar in LS and TS1, although it is slightly diminished in TS2. Further, the tree is able to identify T pairs in more than 90% of the proteins in both LS and TS1, thus enhancing the prediction capacity of LOOPP. The performance of the final ranking on the hits detected by the tree is described next.
In Phase 1, we used the number of T pairs identified and the number of proteins with at least one T pair detected to evaluate our target recognition performance. These were useful measures in Phase 1 because we forwarded the top 200 hits from Phase 1 to Phase 2, where we carried out more accurate and expensive calculations. In Phase 2 we seek to identify the best models and hence we need to evaluate ranking. Ultimately, we need to identify the best model or the top 5 models (as in CASP). Since we know the native structures in this case, we already know the best model based on the RMSD of the model generated by LOOPP with respect to the native structure. We rank the models identified by the Phase 2 tree by the linear score of the final ranking, and then check whether our ranking scheme places the actual RMSD-wise best model in the top 1 or top 5 positions. We also check whether the pair ranked first is a T or a D (and whether at least one of the top 5 is a T) as classified by our initial scheme. These results are tabulated in Table 4 for both LS and TS. For comparison, we provide the same results for LP7 as well.
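The top-1/top-5 evaluation just described amounts to sorting the models by the final ranking score and checking where the quality-wise best model lands. A small sketch, with hypothetical model names and scores:

```python
# Sketch of the top-1 / top-5 evaluation: models are sorted by the final
# ranking score, and we check whether the quality-wise (RMSD- or TM-wise)
# best model lands in the first position or among the first k.

def top_k_success(models, rank_score, quality, k):
    """models: list of ids; rank_score/quality: dicts id -> float."""
    ranked = sorted(models, key=lambda m: rank_score[m], reverse=True)
    best = max(models, key=lambda m: quality[m])
    return best in ranked[:k]

# Hypothetical scores: the ranking puts the truly best model ('m2') second.
models = ["m1", "m2", "m3"]
rank_score = {"m1": 0.9, "m2": 0.8, "m3": 0.1}
quality = {"m1": 0.6, "m2": 0.95, "m3": 0.3}   # e.g. TM score to the native
top1 = top_k_success(models, rank_score, quality, k=1)
top5 = top_k_success(models, rank_score, quality, k=5)
```

In this example the top-1 criterion fails while the top-5 criterion succeeds, which is exactly the gap between the two columns reported in Table 4.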
Table 4 summarizes the important results of this paper discussed below:
Figure 4 shows a histogram of the number of proteins versus the TM score of the top model and of the best model in the top 5 as identified by Phase 2. More proteins have top models with TM scores ≥ 0.65 than below it. However, there are 1454 proteins whose top models have TM scores < 0.65, and 1303 proteins whose best model in the top 5 has a TM score < 0.65. Table 5 shows the T and D classification of these models. There are more Ts than Ds in both the TM ≥ 0.65 and TM < 0.65 categories. However, the Ts are far more numerous for TM ≥ 0.65 than for TM < 0.65, showing the overall enrichment of true hits in Phase 2. The Ts in the TM < 0.65 category and the Ds in the TM ≥ 0.65 category are borderline cases, in which there are discrepancies between Modeller and TMalign.
The LOOPP server participated in the CASP8 structure prediction experiment during the summer of 2008 (http://predictioncenter.gc.ucdavis.edu/casp8/index.cgi). The LOOPP server accepts a sequence electronically, identifies templates, generates atomically detailed models, scores them, and e-mails back the results and the top scoring models. It does not use results from any other servers, whereas meta-servers were included in the ranking of CASP8 and in the discussion below. We did not do as well as we hoped, and we are still analyzing the results. Unfortunately, during CASP8 the LOOPP server was not stable: we updated its databases well into the exercise and found several bugs during the exercise. To ensure that the results reported are meaningful and reproducible, we report them twice: one set for the actual LOOPP server and a second set for the stable version of LOOPP that was achieved towards the end of the competition. We analyze in more detail the models of the stable version only.
Among the groups that submitted at least 100 of the total of 115 targets (there were 66 such groups), LOOPP is ranked in the lower third. The average TM score over all targets of the model ranked first in the LOOPP CASP submission was 0.612, compared to the best group by our assessment, the Zhang server, with a TM score of 0.702. The rank of the first model is 51 compared to the other servers. If the stable LOOPP is considered, the average TM score of the first model climbs to 0.647 and the rank to 45.
LOOPP does better when ranked by the best model out of the five submissions to CASP8. The average TM score of the actual LOOPP submission was 0.671 (rank 44; the best average was again the Zhang server with 0.719). When the stable version of LOOPP is considered, the score is 0.691 and the rank 26. While LOOPP requires considerable improvement, perhaps the most striking observation for us is the high density of groups in the neighborhood of 0.6–0.7 TM scores, suggesting that the differences between groups are not as large as one may suspect. For comparison, the TM score of the group ranked 58 was 0.604.
To gain further insight into the performance of the algorithm, we analyze in more detail the first 63 targets. There are 13362 models generated for these 63 targets. Of these, LOOPP Phase 2 identifies 1560 as hits for 61 targets (2 proteins have no hits identified). A model with a TM score to the native structure of 0.5 or better is considered a true hit. The models include 1951 true hits, of which LOOPP identifies 1393. When the TM scores of the templates to the native structures are examined, we obtain 2169 true hits.
Further, of the 1560 hits identified, the alignment and the model building (MB) make 672 models worse than the template, of which 69 were bad templates to begin with. Thus, model building "spoils" 603 good templates out of the 1560 hits. Of the 603, 39 are unacceptable (TM score less than 0.5), whereas 564 are still acceptable (TM score greater than 0.5), although the template was better than the model. Similarly, MB makes 885 models better than the template; of these, 58 models are still bad in spite of the improvement over the template, and 800 were already good templates to begin with. In 27 cases MB turns a bad template into an acceptable model, and in 3 cases MB does not affect the template/model at all. Also, there are 124 hits where MB makes the model worse than the template by more than 0.05 TM score, of which 116 are from good templates.
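The bookkeeping above, comparing the TM score of the template (to the native) with that of the final model against the 0.5 acceptability cutoff, can be sketched as below; the TM score values are hypothetical.

```python
# Classify a model-building (MB) outcome by comparing the TM score of the
# template (to the native) with the TM score of the final model, using the
# 0.5 cutoff for an acceptable structure.  Input values are hypothetical.

def classify(tm_template, tm_model, cutoff=0.5):
    if tm_model < tm_template:
        kind = "spoiled" if tm_template >= cutoff else "worse-but-bad-anyway"
    elif tm_model > tm_template:
        kind = "improved"
    else:
        kind = "unchanged"
    acceptable = tm_model >= cutoff  # is the final model still a true hit?
    return kind, acceptable

c1 = classify(0.7, 0.6)   # good template, model worse but still acceptable
c2 = classify(0.7, 0.4)   # good template spoiled into an unacceptable model
c3 = classify(0.4, 0.6)   # MB improves a bad template into a good model
```

Tallying such classifications over all hits yields exactly the counts discussed above (spoiled-but-acceptable, spoiled-and-unacceptable, improved, unchanged).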
We find the best model in the top 5 hits 77% of the time, the best model in the top 1 position 43% of the time, the best template in the top 5 80% of the time, and the best template in the top 1 position 38% of the time. The overall performance is lower than what we observed for the training, test, and CASP7 sets. It is possible that the additions to the Protein Data Bank after 6/28/2005 are sufficiently different that re-training of the model is required (for CASP8 we updated the databases but not the prediction model).
A few concrete observations are discussed below.
To further elucidate some successes and failures, we present the structural alignments with the native structures of the best predicted LOOPP models (from the top 5) for four targets: T0428, T0415, T0411, and T0472 (Figure 5). These four targets were chosen to illustrate our hits and misses. T0428 is one of our best predictions: we identify the best template available and produce a very good model with a TM score of 0.95 to the native. T0472 is one of our worst misses: there is a very good PSI-BLAST hit for this target, which LOOPP fails to identify, and hence the model is very poor (TM score 0.27 to the native). T0415 and T0411 lie between these two extremes. For T0415, although we identify the best template, our model is not the best (TM score 0.73 to the native); for T0411, although we do not identify the best template, our model is reasonably good (TM score 0.71 to the native). There are other cases, such as T0419 and T0478, where the target itself is a tough one with no reasonable template in the database, and hence the models are also poor.
We presented a computational model for selecting templates from the protein data bank and building atomically detailed structures for target sequences. We illustrated that both in template selection and in model building there is significant "bleeding", i.e., loss of good templates. The losses are due to mis-classification of good templates and to suboptimal refinement. Nevertheless, because of the growth in database size and coverage, we have in many cases multiple hits for a single target; even if a few good templates are missed, other good templates often fill in. If we measure the success of the algorithm by its ability to find a sound template, then the algorithm is quite successful, as is evident in Table 4. It therefore seems that future research should focus on the refinement step.
This research was supported by NIH grant GM067823 to Ron Elber. Brinda Kizhakke Vallat was supported by a Human Frontier Science Program Long Term Fellowship (LT00469/2007-L).