Homology modeling of protein structures is usually divided into three steps. In the first step, structural templates are identified from a set of experimentally determined protein structures (the Protein Data Bank, PDB [1]). In the second step, an alignment between the sequence of the target and each of the templates is obtained. In the third step, atomically detailed structures are constructed from the templates based on the alignments; the resulting models are finally assessed and ranked. The process of template selection can be difficult. We therefore divide the template selection process into two sequential steps: (i) template enrichment (Phase 1) and (ii) template focusing and model building (Phase 2). Empirically, we found that before enrichment about 1.7% of the target-template pairs are true hits, while after Phase 1 enrichment the percentage of true hits increases to 18%.
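For illustration only, the overall pipeline can be sketched in a few lines of Python. All function names in the sketch (rank_by_cheap_scores, align, build_model, score_model) are hypothetical placeholders rather than LOOPP code; the sketch merely fixes the order of operations described above.

```python
def predict_structure(target_seq, pdb_templates):
    """Minimal sketch of the two-phase pipeline (placeholder functions only)."""
    # Phase 1: rank templates with inexpensive similarity measures and
    # keep a short list (200 in the current implementation) for Phase 2.
    candidates = rank_by_cheap_scores(target_seq, pdb_templates)[:200]

    # Phase 2: for each surviving candidate, align (e.g., with SSALN),
    # build an atomically detailed model (e.g., with Modeller), and score it.
    scored_models = []
    for template in candidates:
        alignment = align(target_seq, template)
        model = build_model(target_seq, template, alignment)
        scored_models.append((score_model(model), model))

    # Assess and rank; report the five to twenty top structural candidates.
    scored_models.sort(key=lambda pair: pair[0], reverse=True)
    return [model for _, model in scored_models[:20]]
```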
The following definitions are used throughout the manuscript: a “hit” is a prediction by the algorithm that a proposed match of a template and a target is likely to be successful. A true hit is a prediction of the algorithm that is correct; we frequently refer to it as a T pair or a T match. A false hit is an incorrect template-target match of the algorithm; it is also called a D pair (D for decoy).
Earlier, we presented a mathematical programming based method for the enrichment of suitable templates for target proteins [2]. The present work follows the previous paper and constitutes the next step of the LOOPP server (http://www.loopp.org) for protein structure prediction. Tentative hits identified in the first step (Phase 1) are forwarded to Phase 2, where atomically detailed models are built with the program Modeller [3], based on the templates determined during Phase 1 and the alignments of SSALN [4]. The models are assessed using a new learning and scoring algorithm, described and discussed in the present manuscript, which constitutes Phase 2 of the LOOPP server. Phase 2 typically provides a final list of five to twenty top structural candidates for the sequence of the target.
From the perspective of finding the best model, the division into Phases is not optimal. The enrichment step may miss some true hits (structural templates that provide good atomic models of the target) and fail to include them in the subset forwarded to Phase 2. These misses remain undetected even when the filters of Phase 2 would have caught them, simply because Phase 2 never sees them. Therefore the current LOOPP procedure is less sensitive than an alternative implementation that examines all the PDB structures with the best measures we have at hand. We discuss below the reasons that led to the present computational model of LOOPP.
At the core of the algorithms of Phase 1 and Phase 2 are similarity measures that we use to test the fitness of the sequence of the target to the sequence and structure of a template. As discussed in reference [2], the different similarity measures are learned with mathematical programming and are combined into scores that rank the pairs of targets and templates. The algorithm of Phase 1 uses only a fraction of the similarity measures that are available to us. Not using all of the measures results in suboptimal performance and a less accurate ranking of some of the pairs compared to the ranking of Phase 2. The reason for not using the Phase 2 measures from the start is computational cost. Some of the similarity measures that are used effectively in Phase 2 are expensive to compute. Phase 1 examines a representative set of the whole Protein Data Bank (PDB) [1], and the large number of comparisons makes it necessary to avoid some of the expensive similarity measures used in Phase 2.
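Schematically, a ranking score in either phase is a learned combination of similarity measures. The following sketch assumes, purely for illustration, a linear functional form; the measure names and numerical values are hypothetical, and the weights stand for parameters obtained by the mathematical programming training of reference [2].

```python
def ranking_score(measures, weights):
    """Combine precomputed similarity measures into one ranking score.

    measures: measure name -> value for a single target-template pair
    weights:  measure name -> weight learned by mathematical programming
    """
    return sum(weights[name] * value for name, value in measures.items())

# Hypothetical cheap Phase 1 measures for one target-template pair.
measures = {"seq_align": 142.0, "threading": 0.8, "ss_align": 0.6}
weights = {"seq_align": 0.02, "threading": 1.5, "ss_align": 0.9}
print(ranking_score(measures, weights))  # higher score -> better ranked pair
```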
For example, consider the comparison of two sequences. Let $S_{ij}$ be the raw optimal alignment score between target $i$ and template $j$. The Z score $Z_{ij}$ is defined as $Z_{ij} = \left( S_{ij} - \langle S_{ij} \rangle \right) / \sigma_{ij}$, where $\sigma_{ij} = \sqrt{\langle S_{ij}^{2} \rangle - \langle S_{ij} \rangle^{2}}$ is the standard deviation of the shuffled-score distribution. The angular brackets $\langle \cdots \rangle$ denote an average over optimal alignments of the template sequence $j$ with randomly shuffled sequences that have the same amino acid composition as the target $i$.
To obtain meaningful averages, hundreds to thousands of shuffled alignments are required, making the Z score calculation more expensive than a single alignment by two to three orders of magnitude.
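To make the cost explicit, a direct implementation of the Z score could look like the sketch below; align_score stands in for any optimal-alignment scorer (e.g., Smith-Waterman) and is an assumed placeholder, not part of LOOPP. Each call performs n_shuffles + 1 alignments, which is the source of the two to three orders of magnitude overhead.

```python
import random

def z_score(target_seq, template_seq, align_score, n_shuffles=1000):
    """Z score of the raw score S_ij of target i and template j, relative
    to alignments of the template against shuffled copies of the target."""
    s_ij = align_score(target_seq, template_seq)
    residues = list(target_seq)
    shuffled_scores = []
    for _ in range(n_shuffles):
        random.shuffle(residues)  # preserves the amino acid composition
        shuffled_scores.append(align_score("".join(residues), template_seq))
    mean = sum(shuffled_scores) / n_shuffles
    sigma = (sum((s - mean) ** 2 for s in shuffled_scores) / n_shuffles) ** 0.5
    return (s_ij - mean) / sigma
```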
Therefore, despite the observation that the Z score is significantly more sensitive and specific, we did not use it in Phase 1. Phase 1 ranking is based on raw scores only (and BLAST statistical evaluation when possible), evaluated for the 13,875 proteins in the database. Of course, a score of sequence alignment is not the only similarity measure that we use to select the candidates of Phase 1; for example, threading and alignment against secondary structure were used as well (see the appendix of reference [2] for a complete list of the similarity measures used in Phase 1). We then make the assumption that Phase 1 [2] is sufficiently accurate to capture the true hits in the top 200 (from a total of 13,875 candidates). If the Phase 1 ranking were perfect (in the sense that the template providing the best structural model is ranked number 1), then only one template would be required for further model building. However, it is not. Another complication is that Phase 1 depends on the quality of subsequent steps, such as the alignment (which we perform with SSALN [4]) and the construction of an atomically detailed model (which we do with Modeller [3]). It is likely that modeling the structures with other programs (e.g., different alignment algorithms or a different build-up of loops and side chains) would impact the learning.
We strictly differentiate between learning and testing of the prediction model (see section II). We call the set of proteins used for learning the Learning Set (LS) and use it to optimize the parameters and the functional form of the computational model. The set TS1 includes the proteins of our most comprehensive test case, which is built completely independently of the LS. To account for some of the inaccuracies of the Phase 1 ranking, the number of structures that we forward to Phase 2 is 200. In our learning and test cases of Phase 1 we miss 157,992 of the 418,037 true hits (LS) and 35,759 of the 91,449 true hits (TS1). These numbers may seem highly significant; however, in the end we wish to obtain one good model per protein, and we care less about having 100 good models for a particular target sequence. The number of proteins that lose all of their templates is small: 811 out of 12,689 proteins (LS) and 198 out of 3,802 proteins (TS1). These are fractions of 0.064 (LS) and 0.052 (TS1) of the total number of proteins we have considered. The comparable loss for the learning and test cases is reassuring from the perspective of over-learning, and we expect it to be similar for future predictions.
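The quoted fractions follow directly from the counts above; the short check below (plain Python, with the numbers copied from the text) also shows the contrast between the pair-level and the target-level loss.

```python
# Pair-level misses: true target-template pairs lost by Phase 1.
print(157992 / 418037)  # ~0.38 of the true pairs (LS)
print(35759 / 91449)    # ~0.39 of the true pairs (TS1)

# Target-level losses: proteins that lose all of their true templates.
print(811 / 12689)      # ~0.064 of the proteins (LS)
print(198 / 3802)       # ~0.052 of the proteins (TS1)
```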
Phase 2, which is discussed in the present manuscript, deals with a much smaller number of candidates for true hits. The limited number of candidates allows for the full construction of atomically detailed models for each candidate and the use of comprehensive measures of model accuracy, in a calculation feasible on a typical cluster. On a cluster of 20 CPUs, a structure prediction for a protein of 200 amino acids requires 3–5 hours. The time is significantly longer for longer proteins (about 11–15 hours for 500 amino acids); however, even this calculation is accessible with moderate computational resources.
The rest of the text is divided between a detailed description of the data sets that were used for training and testing, a description of the learned model, and a detailed analysis of the performance of the model on various tests. Finally, we discuss the performance of LOOPP during CASP8.