Article sections

- Abstract
- I. Introduction
- II. Training, testing, and data sets
- III. A system for identification of best models: Characteristics, Features, Similarity Measures, and Scores
- IV. Results and Discussion
- V. Concluding remarks
- References

Proteins. Author manuscript; available in PMC 2010 September 1.

PMCID: PMC2719020

NIHMSID: NIHMS95514

Brinda Kizhakke Vallat,^{1} Jaroslaw Pillardy,^{2} Peter Májek,^{3} Jaroslaw Meller,^{4,5} Thomas Blom,^{1} BaoQiang Cao,^{1} and Ron Elber^{1}

Full contact info: Ron Elber, Institute of Computational Engineering and Sciences, University of Texas at Austin, 1 University Station, ICES C0200, Austin TX 78712, Email: ron@ices.utexas.edu, phone: 512-232-5415, fax: 512-471-8694

One approach to predicting a protein fold from a sequence (a target) is based on structures of related proteins that are used as templates. We present an algorithm that examines a set of candidates for templates, builds from each of the templates an atomically detailed model, and ranks the models. The algorithm performs a hierarchical selection of the best model using a diverse set of signals. After a quick and suboptimal screening of template candidates from the Protein Data Bank, the current method fine-tunes the selection to a few models. More detailed signals test the compatibility of the sequence and the proposed structures, and are merged to give a global fitness measure using linear programming. This algorithm is a component of the prediction server LOOPP (http://www.loopp.org). Large-scale training and test sets were designed and are presented. Recent results of the LOOPP server in CASP8 are discussed.

Homology modeling of protein structures is usually divided into three steps. In the first step structural templates are identified from a set of experimentally determined protein structures (the protein data bank PDB [1]). In the second step, an alignment between the sequence of the target with each of the templates is obtained. Based on the alignment, atomically detailed structures are constructed from the templates. The atomically detailed models are finally assessed and ranked. The process of template selection can be difficult. Therefore we divide the template selection process into two sequential steps: (i) template enrichment (Phase 1) and (ii) template focusing and model building (Phase 2). Empirically, we found that to begin with about 1.7% of the target-template pairs are true hits while after Phase 1 enrichment the percentage of true hits increases to 18%.

The following definitions are used throughout the manuscript: a “hit” is a prediction by the algorithm that the proposed match of a template and a target is likely to be successful. A true hit means that the prediction of the algorithm is correct. It is frequently referred to as a T pair or a T match. A false hit is an incorrect template-target match of the algorithm. It is also called a D pair (D for decoy).

Earlier, we presented a paper on a mathematical programming based method for enrichment of suitable templates for target proteins [2]. The present work follows the previous paper and constitutes the next step of the LOOPP server (http://www.loopp.org) for protein structure prediction. Tentative hits identified in the first step (Phase 1) are forwarded to Phase 2, where atomically detailed models are built with the program Modeller [3] based on the templates determined during Phase 1 and the alignments of SSALN [4]. The models are assessed using a new learning and scoring algorithm described and discussed in the present manuscript, which constitutes Phase 2 of the LOOPP server. Phase 2 typically provides a final list of five to twenty top structural candidates for the sequence of the target.

From the perspective of finding the best model the division into Phases is not optimal. The enrichment step may miss some true hits (structural templates that provide good atomic models of the target) and not include them in the subset forwarded to Phase 2. These misses, even if detectable by the filters of Phase 2, obviously remain undetected. Therefore the current LOOPP procedure is less sensitive than an alternative implementation that examines all the PDB structures with the best measures we have at hand. We discuss below the reasons that led to the present computational model of LOOPP.

At the core of the algorithms for Phase 1 and Phase 2 one finds similarity measures that we use to test the fitness of the sequence of the target to the sequence and structure of a template. As discussed in reference [2] the different similarity measures are learned with mathematical programming and are made into scores that rank the pairs of target and templates. The algorithm of Phase 1 uses only a fraction of the similarity measures that are available to us. Not using all of the measures results in suboptimal performance and less accurate ranking of some of the pairs compared to the ranking of Phase 2. The reason for not using Phase 2 from the start is computational cost. Some of the similarity measures that are used effectively in Phase 2 are expensive to compute. Phase 1 examines a representative set of the whole Protein Data Bank (PDB) [1] and the large number of comparisons makes it necessary to avoid some of the expensive similarity measures used in Phase 2.

For example, consider the comparison of two sequences. Let the raw optimal score between target *i* and template *j* be *T*_{ij}. The Z score (

Therefore, despite the observation that the Z score is significantly more sensitive and specific, we did not use it in Phase 1. Phase 1 ranking is based on raw scores only (and BLAST statistical evaluation when possible) that are evaluated for 13,875 proteins in the database. Of course a score of sequence alignment is not the only similarity measure that we use to select the candidates of Phase 1. For example, threading and alignment against secondary structure were used as well. (Check the appendix of reference [2] for a complete list of similarity measures that we used in Phase 1.) We then make the assumption that Phase 1 [2] is sufficiently accurate to capture the true hits in the top 200 (from a total of 13,875 candidates). If Phase 1 ranking were perfect (in the sense that the template providing the best structural model is ranked number 1), then only one template would be required for further model building. However, it is not. Another complication is that Phase 1 depends on the quality of future steps, such as the alignment (which we perform with SSALN [4]) and the construction of an atomically detailed model (which we do with Modeller [3]). It is likely that modeling of the structures with other programs (e.g., different alignment algorithms or a different build-up of loops and side chains) will impact the learning.

We strictly differentiate between learning and testing of the prediction model (see section II). We call the set of proteins from which we learn the Learning Set (LS) and use it to optimize the parameters and the functional form of the computational model. The set TS1 includes the proteins of our most comprehensive test case, which is built completely independently from the LS. To account for some of the inaccuracies of Phase 1 ranking the number of structures that we forward to Phase 2 is 200. In our learning and test cases of Phase 1 we miss 157,992 of the 418,037 true hits (LS) and 35,759 of the 91,449 true hits (TS1). These numbers may seem highly significant; however, at the end of the day we wish to obtain one good model per protein. We care less about having 100 good models for a particular target sequence. The number of proteins that lost all of their templates is small and stands at 811 out of 12,689 proteins (LS) and 198 out of 3,802 proteins (TS1). These are fractions of 0.064 (LS) and 0.052 (TS1) of the total number of proteins we have considered. The comparable loss for the learning and test cases is reassuring from the perspective of over-learning and we expect it to be similar for future predictions.
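The loss figures quoted above can be checked with a few lines of arithmetic. The sketch below only re-derives the fractions from the counts given in the text; the dictionary layout and function name are illustrative choices, not part of LOOPP.

```python
# Counts quoted in the text for Phase 1 with a top-200 cutoff.
phase1_counts = {
    "LS": {"missed_hits": 157992, "true_hits": 418037,
           "lost_proteins": 811, "proteins": 12689},
    "TS1": {"missed_hits": 35759, "true_hits": 91449,
            "lost_proteins": 198, "proteins": 3802},
}


def loss_fractions(counts):
    """Return (fraction of true hits missed, fraction of proteins with no template left)."""
    hit_loss = counts["missed_hits"] / counts["true_hits"]
    protein_loss = counts["lost_proteins"] / counts["proteins"]
    return hit_loss, protein_loss


for name, counts in phase1_counts.items():
    hit_loss, protein_loss = loss_fractions(counts)
    print(f"{name}: {hit_loss:.3f} of true hits missed, "
          f"{protein_loss:.3f} of proteins lose all templates")
```

Roughly 38 to 39 percent of the individual true hits are missed in both sets, yet only about 5 to 6 percent of the proteins are left without any true template, which is the quantity that matters when one good model per protein is the goal.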

Phase 2, which is discussed in the present manuscript, deals with a much smaller number of candidates for true hits. The limited number of candidates allows for full construction of atomically detailed models for each candidate and the use of comprehensive measures of model accuracy in a calculation feasible on a typical cluster. On a cluster of 20 CPUs a structure prediction of a protein of length 200 amino acids requires 3–5 hours. The time is significantly longer for longer proteins (about 11–15 hours for 500 amino acids), however even this calculation is accessible with moderate computational resources.

The rest of the text is divided between a detailed description of the data sets that were used for training and testing, description of the learned model, and detailed analysis of the performance of the model on various tests. We finally discuss the performance of LOOPP during CASP8.

The learning set (LS) follows from the one used in Phase 1 [2]. It constitutes an extensive dataset of proteins selected from the available folds of the Protein Data Bank (PDB) [1] and of the domains of SCOP [5] as of 6/28/2005. In the initial selection, proteins from the PDB with less than 70 percent sequence identity to other members of the set were kept, providing a total of 9,513 single peptide chains. This is supplemented with representative folds from the complete SCOP hierarchy [5], giving a total of 13,825 targets and templates in the learning set, which we call LS. From a total of 71,824,926 pairs, 1,187,173 pairs were shortlisted as probable true templates by the Phase 1 prediction model and forwarded to Phase 2. A complete list of the selected structures is available at http://www.loopp.org/ls2pairs.txt

In Phase 2, the all-atom models are generated for the targets based on all the templates identified in Phase 1. This is done using Modeller [3], which generates an atomically detailed structure of the target based on the fold of the template and an alignment of the target with the template. The alignments were generated by the algorithm SSALN [4] implemented in LOOPP. Modeller is a widely used resource in the field, and hence is a component in our structure prediction system LOOPP http://www.loopp.org. The learning of scores for optimal alignment in LOOPP is documented elsewhere [4,6] and is beyond the scope of this paper. Here we focus on the identification of the best templates and models for a given target.

The true templates (T) and the decoys (D) from these pairs are identified based on the similarity of the model (built from the template) to the native. Since the accuracy of a predicted model is based on how close it is to the target structure, we define an acceptable model as one within 6.5 Å all-atom RMS distance from the native structure. For each target, we identify the true models using this criterion and define the remaining ones as decoys (D). For targets which do not have any true model using this criterion, we relax the RMS distance cutoff to 7.0 Å to identify the true templates and decoys. We thus obtain 209,090 true hits (T) and 978,083 decoys (D) for a total 12,527 proteins in the LS. Some of these proteins have both T and D representatives, whereas some others have either T or D. 11,694 proteins have one or more T pairs and 11,093 proteins have one or more D pairs as classified above.
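The classification rule above can be sketched as follows. This is a minimal illustration, assuming that "within 6.5 Å" means less than or equal to the cutoff and that the RMS distances have already been computed; the function name and the template identifiers are hypothetical.

```python
def classify_models(rmsd_by_template, cutoff=6.5, relaxed_cutoff=7.0):
    """Label each model of one target as 'T' (true hit) or 'D' (decoy).

    rmsd_by_template maps a template identifier to the all-atom RMS distance
    (in Angstrom) between the model built from that template and the native
    structure. If no model passes the primary cutoff, the relaxed cutoff is
    used instead, as described in the text.
    """
    labels = {t: ("T" if r <= cutoff else "D") for t, r in rmsd_by_template.items()}
    if "T" not in labels.values():
        labels = {t: ("T" if r <= relaxed_cutoff else "D")
                  for t, r in rmsd_by_template.items()}
    return labels


# Hypothetical targets: the first has a model under 6.5 A, the second does not.
print(classify_models({"tmplA": 3.2, "tmplB": 6.8, "tmplC": 12.0}))
print(classify_models({"tmplD": 6.9, "tmplE": 9.5}))
```

Note that the relaxed cutoff is applied only to targets without any model under 6.5 Å, so a 6.8 Å model is a decoy for the first target while a 6.9 Å model is a true hit for the second.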

Figure 1 shows the histogram of the probability of observing a TM score as a function of the TM score for the T pairs. TM-align is a structural alignment algorithm that was developed in Skolnick’s laboratory and is used extensively in our studies [7]. Along with the alignment, it also provides a numeric score between 0 and 1, which indicates the degree of structural similarity between the two structures. The higher the score, the closer the structures, and a TM score of 0.5 and above is considered significant [7]. We see from Figure 1 that most of the T pairs have a TM score of 0.5 or more. Of a total of 209,090 T pairs, about 11,500 pairs have TM scores less than 0.5. Although these are classified as T based on our RMSD criterion, their TM scores indicate that these pairs might not have significant structural similarity.

Similar to the LS, the test sets used here also follow from Phase 1 and are designed to be independent of LS. We have generated two test cases: TS1 and TS2. In the construction of TS1 we examine all new PDB entries from 6/28/2005 to 6/13/2006. Hence, the structures collected do not overlap with the training set. Using the same screening procedure as in LS, we get 4,183 proteins in TS1. Of the 22,096,370 pairs, 310,031 pairs were shortlisted as probable true hits in Phase 1 and forwarded to Phase 2. These consist of 3,779 individual proteins. The Phase 2 classification procedure yields 39,364 Ts and 270,667 Ds.

The second test set (TS2) is based on CASP7 targets (http://predictioncenter.org/casp7/). CASP is a community-wide experiment for critical assessment of methods for protein structure prediction [8,9], where protein sequences with pre-determined but undisclosed structures are provided as targets for prediction and models predicted by various groups are assessed based on their similarity with the native structure. Our second test set is taken from the previous CASP experiment, CASP7. The CASP7 experiment was conducted from May to July 2006 and therefore our learning set did not overlap with structures from CASP7. We had 82 proteins with 702,828 pairs to begin with in Phase 1. The Phase 1 prediction model forwarded 3,451 hits consisting of 55 individual proteins to Phase 2. Using the same criterion as in LS and TS1, we obtain 577 T and 2,874 D pairs in the CASP7 based TS2.

In addition to TS1 and TS2 we report preliminary results of LOOPP server on CASP8.

The similarity measures used here are similar to those used in Phase 1 [2], except that we allow the use of more expensive measures that add significantly to the sensitivity of the algorithm at sizable computational cost. Assessing a model in Phase 2 typically requires 51 seconds compared to 0.4 seconds for a pair assessment in Phase 1 (the estimate was for a protein of 189 amino acids). When we compare a target to a template, we have the detailed three-dimensional description of the template, whereas we only have sequence-related information for the target. We built a set of characteristics (called *C*) of each protein to probe the similarity between the target and the template. The characteristic *C* is a string of vectors attached to amino acid sites. The site vectors may include the identity of the amino acid at the site, secondary structure, substitution probability in that site derived from multiple sequence alignment (profile), etc. The target characteristics include:

- A site-specific amino acid frequency (profile) generated from multiple sequence alignment of the target with homologous sequences from a standard sequence database (NR -- http://helixweb.nih.gov/helixdb.php). The profile was created by PSI-BLAST with a single iteration and E value of 0.001.

The three-dimensional co-ordinates of the template allow us to generate more characteristics:

- The raw sequence and profile of the template generated from the same database we used for the profile of the target.
- The actual secondary structure of the template (extracted with the DSSP program [12]).
- The actual exposed surface area of the template protein (extracted with the DSSP program [12]).
- THOM2 contacts between structural sites (used in threading calculations) [13].

There are many ways of combining and comparing these characteristics between the target and the template to obtain matching scores that we call features (*F*) of pairs. For example, we may match a sequence with a sequence, secondary structure with a profile, test sequence fitness to contact maps (threading [14]), match predicted secondary structure of the target with actual secondary structure of the template, etc. The features are generally denoted by a four letter code such as SEQG (global sequence alignment), SEQL (local sequence alignment), TRDG (global threading), TRDL (local threading), TBLS (PSI-BLAST), TSCG (global secondary structure alignment), PSMG (sequence to profile matching - global), SRFL (local exposed surface area alignment) etc. Refer to the appendix of reference [2] for more details. To avoid repetition the features are not discussed in detail in the present manuscript. We used 20 features in Phase 1, but have dropped 2 of these in Phase 2 (KMER and Contact Factor) because they proved insignificant, thus leaving 18 features.

Given two characteristics *C*_{1} and *C*_{2}, there is a need for an alignment between the target and the template to generate a score, a scalar feature *F*_{12}. In LOOPP, we consider approximate alignments (BLAST), and exact (local or global) alignments determined by dynamic programming. We also use constant or structurally dependent gap penalties [4]. We use four different types of scores when we compare the characteristics of the target and template to obtain a scalar feature value:

- Raw scores. For input characteristics *C*_{1} and *C*_{2} the raw score is *S*(*C̄*_{1}, *C̄*_{2}) where *C̄* denotes an extended character vector with the addition of deletions and insertions as required for an alignment. By convention the lower the energy (raw score) the better is the match. Raw scores are denoted by a _e extension to the four letter feature code, like TRDG_e, or SEQL_e.
- “Reverse score” [15] is computed as *S*(*C̄*_{1}, *C̄*_{2r}) − *S*(*C̄*_{1}, *C̄*_{2}) where *C*_{2r} is the reverse characteristic input of the second protein. For example, if *C*_{2} is the protein amino acid sequence *C*_{2} = *a*_{1}*a*_{2}…*a*_{n} then *C*_{2r} = *a*_{n}*a*_{n−1}…*a*_{1}. The reverse sequence provides an inexpensive measure of the deviation of the raw score from a match by chance. Higher values of the reverse score suggest a better match. Reverse scores are denoted by a _r extension such as TRDG_r, SEQL_r.
- The E-values from variants of BLAST and PSI-BLAST [16].
- The Z score was introduced already in the introduction and a more complete description is given below. These scores are denoted by a _z extension such as TRDG_z, SEQL_z.

$$Z({\overline{C}}_{1},{\overline{C}}_{2})=\frac{S({\overline{C}}_{1},{\overline{C}}_{2})-{\langle S({\overline{C}}_{1},P{\overline{C}}_{2})\rangle}_{P}}{\sqrt{{\langle {S}^{2}({\overline{C}}_{1},P{\overline{C}}_{2})\rangle}_{P}-{{\langle S({\overline{C}}_{1},P{\overline{C}}_{2})\rangle}^{2}}_{P}}}$$

where the average ⟨*S*(*C̄*_{1}, *PC̄*_{2})⟩_{P} is over the optimal scores of the permutations of the characteristic
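The reverse score and the Z score can be illustrated with a small sketch. It is not the LOOPP implementation: the raw score here is a toy global-alignment energy with made-up match, mismatch, and gap values (lower is better, following the convention stated for raw scores), the permutations are random shuffles of the template sequence, and all function names and the number of shuffles are assumptions.

```python
import random
import statistics


def alignment_energy(a, b, match=-1.0, mismatch=1.0, gap=1.0):
    """Toy raw score S: optimal global alignment energy (lower is better)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            cur.append(min(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def reverse_score(target, template, energy=alignment_energy):
    """Energy against the reversed template minus energy against the template."""
    return energy(target, template[::-1]) - energy(target, template)


def z_score(target, template, energy=alignment_energy, n_shuffles=100, seed=0):
    """Z = (S - <S>_P) / sqrt(<S^2>_P - <S>_P^2), averaged over shuffled templates."""
    rng = random.Random(seed)
    letters = list(template)
    shuffled_energies = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_energies.append(energy(target, "".join(letters)))
    mean = statistics.fmean(shuffled_energies)
    std = statistics.pstdev(shuffled_energies)  # population standard deviation
    if std == 0.0:
        raise ValueError("all shuffled scores are identical; Z is undefined")
    return (energy(target, template) - mean) / std


seq = "ACDEFGHIKLMNPQRSTVWY"
print(reverse_score(seq, seq), z_score(seq, seq))
```

With this lower-is-better toy energy, a strong match gives a positive reverse score and a negative Z. Each Z score needs one optimal alignment per shuffle (100 here), while the reverse score needs only one extra alignment, which illustrates why the latter is described as inexpensive.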

*Z* scores have been obtained for 12 of the 18 features in Phase 2. The raw energies, reverse energies and *Z* scores similarity measures contribute a total of 48 features. In addition, we also use two new secondary structure based features SSPOS and SSCOMP. SSPOS compares the position of the secondary structure elements between the template (actual) and target (predicted) whereas SSCOMP compares the amino acid composition of the actual and predicted secondary structures of the template and the target respectively. The prediction of the secondary structure of the target sites was made with the program SABLE [10,11]. Further, in Phase 2, we also generate atomic models for the chosen pairs and use the following potentials to derive the corresponding energies for these models. These energies are then used along with the features described earlier, in our training.

- ENEALL, an all-atom potential [17]. This energy is a simple all-atom energy function (the distance dependence of contact is capture by a few steps). It was designed by Mathematical Programming optimization of a set of natives, approximate structures, and decoys.
- TE13, [18]. This is one of the first energy functions computed with Mathematical Programming. It is a residue based contact potential with 13 steps to describe the distance dependence of the interaction of a pair of amino acids.
- FREADY [19]. A new coarse-grained potential for proteins that includes two point masses per amino acid. The pair interactions have a complex distance and angular dependence. It was derived by fitting distribution functions generated by Molecular Dynamics with the FREADY potential against distributions of the same variables extracted from the Protein Data Bank.

Modeller [3] generates the atomic models based on templates chosen from phase 1 and alignments provided by the SSALN algorithm [4]. To assess the quality of Modeller output we compare three structures: 1) The native structure of the target sequence (which is known in the training), 2) The structure of the template (the structure of the homologous protein on which the modeling is based), and 3) The model of the target that Modeller built based on the template. Ideally the similarity (based on the TM score [7]) of structure 3 and structure 1 should be higher than the similarity of structure 2 and 1 since a refinement was performed. This is however not always the case. Sometimes structure 2 is closer to 1 than structure 3 to 1. This is especially unfortunate when the template and target structures are very close to begin with and Modeller produces structures that are farther from the target than the template.

The quality of the model is expected to be a monotonically decreasing function of the TM score between the template and the final structure. We therefore examine the TM alignment score between the template and the model. By examining the drift between the template and the structure generated by Modeller we have another independent assessment of the quality of the results. Thus, we have a total of 55 features that we use in our prediction model.

Given a set of features that measures the similarity between pairs of proteins (we have 55 features, some of them strongly correlated), we seek a computational model that uses these features to classify the pairs according to T (true matches) and D (decoy (false) matches). We have already seen in Phase 1 [2] that the tools of Mathematical Programming (MP) [22] are effective in this regard and hence we use MP in Phase 2 as well. The construction of the mathematical programming model is explained in detail in Phase 1 [2] and since we follow a similar approach, we avoid the repetition of the details and just provide the important points. We seek a score, *Q*(*i, j*), to rank the matching quality of a pair of a target *i* and a template *j*, which is a linear combination of the features.

$$Q(i,j)=\sum _{\alpha =1,\dots ,K}{\gamma}_{\alpha}{F}_{\alpha}^{ij}$$

(1)
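As a small illustration of Eq. (1), the sketch below evaluates the linear score for a few templates of one target and sorts them so that the highest score comes first. The coefficient and feature values are made up, and the function names are hypothetical.

```python
def q_score(gamma, features):
    """Q(i, j) = sum over alpha of gamma_alpha * F_alpha^{ij}, as in Eq. (1)."""
    if len(gamma) != len(features):
        raise ValueError("gamma and features must have the same length K")
    return sum(g * f for g, f in zip(gamma, features))


def rank_templates(gamma, features_by_template):
    """Return template identifiers sorted from highest to lowest Q."""
    return sorted(features_by_template,
                  key=lambda t: q_score(gamma, features_by_template[t]),
                  reverse=True)


# Hypothetical example with K = 2 features per target-template pair.
gamma = [0.5, -1.0]
features_by_template = {"tmplA": [2.0, 0.5], "tmplB": [1.0, 1.0]}
print(rank_templates(gamma, features_by_template))
```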

The *γ*_{α} are the unknown coefficients to be determined from the learning set with MP. The score function depends parametrically on the coefficients. In Phase 1 [2] we used scores with moderately more complex dependence on the features. For example a score was defined as a quadratic expansion of the features (e.g. the similarity score of the pair of proteins

The training process requires sets of target and template pairs that are pre-classified as D (target-decoy) and T (target-true hit) pairs. It determines the linear coefficients *γ* subject to maximal margin and feasibility conditions [23,24]. We require that a match of a protein *i* with any T template (say *j*) will have a better score than a match with a D template (say *k*). The requirement is written as the inequality *Q*_{T}(*i*, *j*) − *Q*_{D}(*i*, *k*) > 0:

$$\begin{array}{l}{Q}_{T}(i,j)-{Q}_{D}(i,k)>0\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall ij,ik\\ \sum _{\alpha =1}^{K}{\gamma}_{\alpha}{F}_{\alpha}^{ij,T}-\sum _{\alpha =1}^{K}{\gamma}_{\alpha}{F}_{\alpha}^{ik,D}=\sum _{\alpha =1}^{K}{\gamma}_{\alpha}({F}_{\alpha}^{ij,T}-{F}_{\alpha}^{ik,D})>0\end{array}$$

(2)

where the indices *ij* and *ik* denote a true and a decoy pair respectively. The sum is over the elements of the scalar product, i.e. of the linear coefficients *γ* (to be determined) and the difference of the similarity measures of the two pairs. Numerically it is difficult to distinguish a solution for which the left-hand side is only slightly larger than zero from the trivial solution *γ* = 0, for which it is exactly zero. To set a scale for the values of the parameters, and to avoid a trivial solution, it is convenient to write

$$\sum _{\alpha =1}^{K}{\gamma}_{\alpha}({F}_{\alpha}^{ij,T}-{F}_{\alpha}^{ik,D})>1\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall ij,ik$$

(3)

Equation (3) allows learning from negative (D) examples in addition to positive (T) examples. It therefore suggests a richer and more complete description of the data compared to classification algorithms that learn from positive examples only.

After significant trials of different functional forms of the similarity measures we were not able to find a single *Q* function that makes the problem feasible for all pairs of T and D and generalizes well to the test cases. It means that the set of features we have at present is insufficient to generate such a desired *Q* score. We believe that such a single function exists since a free energy surface that selects native folds was illustrated for many proteins. It is just that we do not know the proper functional form. Nevertheless, it is still possible to find a simple similarity measure that minimizes an error function and generalizes well to the test cases. The solution, however, does not solve all inequalities of Eq. (3). An alternative formulation is therefore required which is a slight adjustment of Eq. (3).

$$\begin{array}{l}\sum _{\alpha =1}^{K}{\gamma}_{\alpha}({F}_{\alpha}^{ij,T}-{F}_{\alpha}^{ik,D})>1-{\eta}_{ij,ik}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall ij,ik\\ {\eta}_{ij,ik}\ge 0\end{array}$$

(4)

$$\text{Subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.16667em}{0ex}}min\left(\sum _{ij,ik}{\eta}_{ij,ik}\right)$$

where the *η*_{ij,ik} are (non-negative) slack variables that make the solution of the new set of inequalities feasible.
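To make the role of the slack variables concrete: for fixed coefficients the smallest admissible slack of a T/D pair is max(0, 1 − Σ γ_α (F^T_α − F^D_α)), so minimizing the total slack amounts to minimizing a hinge-type loss. The sketch below minimizes that loss by plain subgradient descent on a tiny made-up example. This is a deliberately simple stand-in and not the interior point solver used in the paper; the step size, iteration count, feature values, and function names are all assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_differences(true_feats, decoy_feats):
    """Feature differences F^{ij,T} - F^{ik,D} for every T/D pair of one target."""
    return [[t - d for t, d in zip(tf, df)]
            for tf in true_feats for df in decoy_feats]


def total_slack(gamma, diffs):
    """Sum of the minimal slacks eta = max(0, 1 - gamma . diff)."""
    return sum(max(0.0, 1.0 - dot(gamma, d)) for d in diffs)


def fit_gamma(diffs, steps=200, learning_rate=0.05):
    """Subgradient descent on the total slack (a stand-in for an LP solver)."""
    gamma = [0.0] * len(diffs[0])
    for _ in range(steps):
        gradient = [0.0] * len(gamma)
        for d in diffs:
            if dot(gamma, d) < 1.0:  # inequality violated: slack is active
                gradient = [g - x for g, x in zip(gradient, d)]
        gamma = [g - learning_rate * x for g, x in zip(gamma, gradient)]
    return gamma


# Hypothetical target with two true templates and two decoys, K = 2 features.
true_feats = [[2.0, 0.5], [1.5, 0.2]]
decoy_feats = [[0.5, 0.4], [0.2, 0.9]]
diffs = pair_differences(true_feats, decoy_feats)
gamma = fit_gamma(diffs)
print(gamma, total_slack(gamma, diffs))
```

On this separable toy set the total slack reaches zero, meaning every T template outscores every D template by a margin of at least one. On real data the slacks generally stay positive for some pairs, which is exactly the situation the slack formulation is meant to handle.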

Apart from the inequalities generated as *Q*_{T}^{ij} −

$$\begin{array}{l}\sum _{\alpha =1}^{K}{\gamma}_{\alpha}({F}_{\alpha}^{ij,T}-{F}_{\alpha}^{ik,D})>1-{\eta}_{ij,ik}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall ij,ik\\ \text{and}\\ \sum _{\alpha =1}^{K}{\gamma}_{\alpha}({F}_{\alpha}^{i1,T}-{F}_{\alpha}^{ij,D})>1-{\eta}_{i1,ij}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall i1,ij\\ {\eta}_{ij,ik}\ge 0\end{array}$$

(5)

These inequalities are added to the inequalities obtained from *Q*_{T}^{ij} −

We use the Mathematical Programming (MP) solver PF3 [22], which is based on the interior point algorithm and is tuned specifically to solve problems like Eqs. (4) and (5) that are frequently encountered in the field of bioinformatics. The actual number of pair comparisons (the number of inequalities) that we attempt to satisfy in Phase 2 is much smaller than in Phase 1 since we have fewer T and D pairs. Hence, here we attempt to solve all the 10,922,967 (total from *Q*_{T}^{ij} −

We have observed that a single *Q* score is insufficient in discriminating the Ts and the Ds. This is because some features like sequence similarity measures carry very strong signals that mask the signal from other features. We believe that if we remove sequences detected with one *Q* score, it is possible to learn another score that focuses on weaker (but nevertheless significant) signals. The idea is to use multiple scores where different *Q* are learned in sequence on a shrinking template database.

We have learnt that PSI-BLAST is the most dominant signal and contributes significantly to the first branch. It is therefore convenient to use PSI-BLAST as a single feature in the first branch and identify all the strong PSI-BLAST pairs. This will help to filter out the strong PSI-BLAST pairs and thereby aid in recognizing the signal from the other features in the following branches. We had a similar zeroth branch in Phase 1 as well [2].

We generate a PSI-BLAST profile for the target sequence using the NR database http://helixweb.nih.gov/helixdb.php (three cycles with an E-value threshold of 0.01). This profile is compared to the sequences of the templates to yield the desired score (the log of the E-value, called TBLS_e in the feature table of the Appendix). Using the significance measure discussed earlier, all pairs with significance score, *SC*, larger than 0.999 were declared hits. These are kept for the final analysis in which the selected pairs from all the branches are sorted and ranked to identify the best models. All pairs that are not hits are forwarded to the next branch.

An assessment of an optimal single score can be made using PSI-BLAST, which is widely employed in template detection. In Phase 2 we are using PSI-BLAST with a reasonably high confidence level (E-value smaller than or equal to 0.001) and with that accuracy it detects less than half of the true hits we pick with the tree. Reducing the confidence level to an E-value of 0.1 we find just in branch 2 a larger number of true hits (72,625 instead of 22,433). Unfortunately, the number of false hits increases significantly from 9,817 to 207,827, making the selection of top templates very difficult.
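The trade-off described above becomes clearer when the counts are converted into the fraction of hits that are true. The short sketch below uses only the four counts quoted in the text; the function name is an illustrative choice.

```python
def fraction_true(true_hits, false_hits):
    """Fraction of all reported hits that are true hits."""
    return true_hits / (true_hits + false_hits)


# (true hits, false hits) at the two E-value thresholds quoted in the text.
counts_by_evalue = {0.001: (22433, 9817), 0.1: (72625, 207827)}

for evalue, (true_hits, false_hits) in counts_by_evalue.items():
    print(f"E-value <= {evalue}: {fraction_true(true_hits, false_hits):.3f} of hits are true")
```

Loosening the threshold roughly triples the number of true hits, but the fraction of hits that are true drops from about 0.70 to about 0.26, so a randomly chosen hit is then far more likely to be a decoy.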

Another feature of significance is the final score obtained in the previous LOOPP version that participated in CASP7 (LP7). This score is a single linear combination of a subset of features discussed earlier and in the appendix of ref. [2]. We find that this score has an important signal and can be used to find many of the strong hits beyond PSI-BLAST. We apply the same significance measure to all pairs forwarded from the PSI-BLAST branch; those with *SC* larger than 0.999 are declared hits and are chosen for final ranking. The rest, as usual, are forwarded to the next branch.
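The branch-by-branch filtering can be sketched as a simple cascade: each branch assigns a significance score to the pairs it receives, keeps those above the threshold as hits, and forwards the rest. The significance values below are supplied by hand and the field names are hypothetical; how *SC* is actually computed is not shown here.

```python
def run_branches(pairs, branches, threshold=0.999):
    """Pass pairs through successive branches.

    branches is a list of functions, each mapping a pair to its significance
    score SC in that branch. Returns (hits, remaining), where hits is a list
    of (branch_index, pair) and remaining holds pairs no branch selected.
    """
    hits = []
    remaining = list(pairs)
    for branch_index, significance in enumerate(branches):
        forwarded = []
        for pair in remaining:
            if significance(pair) > threshold:
                hits.append((branch_index, pair))
            else:
                forwarded.append(pair)
        remaining = forwarded
    return hits, remaining


# Hypothetical pairs with precomputed significance scores for two branches.
pairs = [
    {"id": "a", "sc_psiblast": 0.9995, "sc_lp7": 0.10},
    {"id": "b", "sc_psiblast": 0.5000, "sc_lp7": 0.9999},
    {"id": "c", "sc_psiblast": 0.2000, "sc_lp7": 0.30},
]
branches = [lambda p: p["sc_psiblast"], lambda p: p["sc_lp7"]]
hits, remaining = run_branches(pairs, branches)
print([(b, p["id"]) for b, p in hits], [p["id"] for p in remaining])
```

Because each branch only sees what earlier branches did not select, a later score is never asked to compete with the strong signal already captured upstream.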

With the pairs recognizable by straightforward PSI-BLAST and LP7 scores removed in the zero and first branches respectively, we seek in the second branch a prediction model linear in the features *F*_{α} described in the Appendix of reference [2]. The coefficients γ

Overlap of the probability density of the scores for true (T) and false/decoy (D) pairs of the second branch of the LS (first branch using mathematical programming).

As in Phase 1, we then tried similarity measures derived from features that are transformed into a uniform distribution, which can be thought of as an alternative kernel [24], and also the quadratic expansion of the uniformly distributed variables. Although these were very useful in Phase 1, they were ineffective in Phase 2.

At this stage we have a collection of predicted T pairs from branches zero to six. The typical number of T predictions per protein varies from 1 to 200. We now need to rank the predicted T pairs that are pooled together from the different branches. We use another linear score for the final sorting of the matches and we learn it the same way as we did in the last five branches, using the features mentioned earlier and solving for γ_{α}^{(2)} from inequalities of the type given in Eq. (5), using PF3. The score thus obtained is a linear combination of the same features, but the learning is done from the T and D pairs chosen from branches zero to six. Hence, the learning set here is minimal and is enriched with more Ts and fewer Ds. This also yields fewer inequalities to learn from. The linear score thus learnt is used for the final sorting of the pairs chosen from branches zero to six of the tree. By far the largest coefficient of the linear combination of scores of branch 6 is that of LP7.

It is possible that re-learning the final ranking score using the pairs filtered from branches zero to six and the same features leads to over-learning. Over-learning can yield a prediction model that performs well on the training set but poorly on independent test cases. We examine the performance of the prediction model on the test sets to gain insight into this potential problem.

The computed coefficients for all five branches of the tree plus the final ranking are summarized in Table 1a. For the purpose of comparison it is convenient to normalize the vector of coefficients such that $\sum_{\alpha} \overline{\gamma}_{\alpha}^{2} = 1$ for all branches. We notice that some of the features are not used in all the branches (FREADY, LP7 score, etc.). These are mainly features that were derived much later and were not available during the initial stages of learning. In Table 1b we provide the product of the coefficient and the variance of the feature under consideration. This measure is another indicator of the potential contribution of a particular feature to the total score of a branch. If the variance of a feature is high, then it can (potentially) make a significant contribution even if the coefficient is small. Table 1b suggests dominant contributions from only a few features (All-atom energy, SEQG_e, TRDG_e, TRSG_e, PSMG_e, PSML_e). However, this picture is somewhat misleading. It is possible to have a large variance and still only low recognition capacity. We know that the use of the other features (such as the Z-scores) is necessary in order to obtain good recognition.
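The two quantities reported in Tables 1a and 1b can be illustrated with a minimal sketch. The feature values and raw coefficients below are made up for illustration, the function names are hypothetical, and the use of the population variance is an assumption (the text does not say which variance estimator was used).

```python
import math
from statistics import pvariance


def normalize(coeffs):
    """Scale a coefficient vector so that the sum of its squared entries is 1."""
    norm = math.sqrt(sum(c * c for c in coeffs.values()))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero coefficient vector")
    return {name: c / norm for name, c in coeffs.items()}


def variance_weighted(coeffs, feature_values):
    """Multiply each coefficient by the variance of its feature (Table 1b style).

    Assumption: population variance over the observed feature values.
    """
    return {name: coeffs[name] * pvariance(feature_values[name]) for name in coeffs}


# Illustrative (made-up) raw coefficients and feature samples.
raw = {"SEQG_z": 3.0, "TRDG_z": 4.0}
values = {"SEQG_z": [1.0, 2.0, 3.0], "TRDG_z": [0.0, 10.0, 20.0]}

unit = normalize(raw)                      # squared entries now sum to 1
contrib = variance_weighted(unit, values)  # coefficient times feature variance
```

In this toy case the second feature has a much larger variance, so its variance-weighted entry dominates even though the two normalized coefficients are of similar size.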

Numerical weight coefficients for the different features in the different branches. Table 1a: the coefficients that are used in the scoring. Table 1b: the coefficients multiplied by the variance of the values that a particular feature takes. The entries **...**

There are two factors that make the branches different from each other. First, the prediction spaces of the branches are different due to sequential elimination of T pairs. Second, the coefficient vectors are not the same. These differences are further enhanced in the final ranking branch since it uses the pairs selected from the 0–6 branches pooled together for learning. It is of interest to examine how similar the coefficients of the vectors are in the different branches and so we evaluate the scalar product of the normalized vectors between the different branches. This is presented in Table 2. We find that the coefficients of branches 2–5 are similar since their scalar products are in the range of 0.87–0.99 whereas those of branches 6 and final ranking are different from the rest (given by scalar products in the range 0.03–0.4). However, the scalar product of the coefficient vectors between branch 6 and final ranking is 0.8838 showing that they are very similar. Both scores are dominated by LP7.
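The comparison behind Table 2 amounts to a scalar product between normalized coefficient vectors. A minimal sketch follows; the vectors are invented for illustration, the function names are hypothetical, and a feature that a branch does not use is assumed to have coefficient zero.

```python
import math


def unit_vector(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in v]


def scalar_product(a, b):
    """Scalar product of the normalized versions of two coefficient vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must list the same features in the same order")
    ua, ub = unit_vector(a), unit_vector(b)
    return sum(x * y for x, y in zip(ua, ub))


def similarity_table(branches):
    """All pairwise scalar products between named branch vectors."""
    names = sorted(branches)
    return {(p, q): scalar_product(branches[p], branches[q])
            for p in names for q in names}


# Made-up vectors: two branches sharing a direction, one using a new feature only.
branches = {
    "branch_2": [1.0, 2.0, 0.0],
    "branch_3": [2.0, 4.0, 0.0],
    "branch_6": [0.0, 0.0, 5.0],
}
table = similarity_table(branches)
```

A value near 1 indicates that two branches weight the features in nearly the same proportions, while a value near 0 indicates that they rely on different features.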

Looking at the features providing the dominant signals in each branch, we find that in branches 2–5, the secondary structure based features, SSPOS and SSCOMP are very dominant. Apart from these, OPTM_e, SEQG_z, TRDG_z, TRSG_z, and TSSL_z provide signals in branches 2–5 whereas SEQL_z is strong in branches 2, 4 and 5. OPTM_e is a mixture of threading, secondary structure, and sequence alignment substitution tables; SEQG_z and SEQL_z are global and local sequence alignment Z scores respectively; TRDG_z and TRSG_z are threading-based global Z scores; TSSL_z is a combination of sequence, secondary structure and threading signals. In addition, profile-sequence matching score, PSML_z is dominant in branches 2 and 3 whereas PSMG_z is dominant in branches 4 and 5. Further, threading scores TRDL_z and TSSG_z are dominant in branch 3 and branches 4 & 5 respectively.

Branch 6 is different because new features are introduced that provide dominant signals. The LP7 score, FREADY [19] and SIFT [20,21] are all new features that are dominant in this branch. In the final ranking branch we find that the weight of the LP7 score is very high, thus skewing the other weights. FREADY is a coarse grained energy computed from the final atomically detailed model. SIFT is a model assessment score that combines sequence dependent secondary structure and exposed surface area prediction with a mean radial distribution function-based assessment of packing, which is independent of the sequence of the template. LP7 is a linear combination of the following features: Protein length, SEQL_e, TRDL_r, TRSG_e, TRSG_r, PSML_r, SRFG_e, OPTM_e, TSSG_r, TRSL_e and TSCL_e.

Although the dominant LP7 score does not contain any *Z* scores, note the predominance of *Z* scores rather than raw or reverse scores among the other features that have *Z* scores evaluated. This shows the significance of *Z* scores and justifies the computational time spent in evaluating these expensive features. Also of significance is the absence of PSI-BLAST related features (TBLS, TBSS and SBLS), which were the most dominant signals in the branches of the Phase 1 tree. Although we use PSI-BLAST (TBLS_e) in branch zero and recognize the largest number of T pairs in this branch, we find that in the rest of the branches and in the final ranking PSI-BLAST plays a minimal role. The profile matching scores PSMG_z and PSML_z, along with the simple global and local alignment scores SEQG_z and SEQL_z, are the only sequence-based features making any contribution to these branches. This suggests that the tree method removes the dominant PSI-BLAST signals in branch 0 and helps in picking up the signals from the other features in the later branches. It also supports our tree-based algorithm, which enables us to identify hits that cannot be recognized with PSI-BLAST alone. However, the drawback is that valuable PSI-BLAST hits can be lost in the final ranking because other features are dominant there. We discuss this in detail in a later section.

The developed prediction tree is applied as follows. The target sequence is first examined with branch 0. PSI-BLAST scores are computed for all target-template pairs, where the templates are taken from the complete Protein Data Bank (PDB). LOOPP has a standard database that is used by all other branches; however, PSI-BLAST is so efficient to compute that we probe the complete PDB with it. If the probability of observing a false hit is smaller than 10^{−3}, we accept the match as a hit and store it in the list of candidates. If the structure is found in the standard LOOPP database, we remove it from the set of structures that we need to examine in the next branch. In the next branch we seek hits of the target with templates that were not detected before. Hence the data set we examine is a subset of the total. The subsequent branches do not consider the hits of previous branches. At the end of the process (when the last branch delivers its hits to the pool of candidates) all the hits are ranked against each other. The final ranking is done with another Mathematical Programming score.
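The sequential procedure described above can be sketched as a simple cascade. This is an illustrative simplification, not the LOOPP implementation: all names are hypothetical, every branch is assumed to see the same candidate list (in LOOPP, branch 0 probes the complete PDB while later branches use the standard database), and a higher final score is assumed to mean a better match. Only the two thresholds (10^{−3} and 0.999) are taken from the text.

```python
def apply_tree(candidates, branches, final_score):
    """Run candidates through an ordered cascade of branch tests.

    candidates  : list of template identifiers
    branches    : ordered list of predicates, accept(template) -> bool
    final_score : function template -> float (assumed: higher is better)
    """
    remaining = list(candidates)
    pool = []
    for accept in branches:
        hits = [t for t in remaining if accept(t)]
        pool.extend(hits)
        hit_set = set(hits)
        # Templates accepted here are not examined by later branches.
        remaining = [t for t in remaining if t not in hit_set]
    # All pooled hits are ranked against each other with one final score.
    return sorted(pool, key=final_score, reverse=True)


# Made-up per-template values for illustration.
features = {
    "tmplA": {"p_false": 1e-5, "sc": 0.50, "final": 0.2},
    "tmplB": {"p_false": 0.5, "sc": 0.9995, "final": 0.9},
    "tmplC": {"p_false": 0.5, "sc": 0.10, "final": 0.7},
}
branches = [
    lambda t: features[t]["p_false"] < 1e-3,  # branch 0: PSI-BLAST significance
    lambda t: features[t]["sc"] > 0.999,      # branch 1: LP7 significance
]
ranked = apply_tree(list(features), branches, lambda t: features[t]["final"])
```

Here `tmplC` is rejected by both branches and never reaches the final ranking, even though its final score would have placed it above `tmplA`.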

Figure 3 shows the comparative statistics of the number of hits identified per protein in Phase 2 with respect to Phase 1. The plot shows a histogram of the number of proteins as a function of the number of hits forwarded from Phase 1 and the corresponding number of hits detected in Phase 2. We see that most of the proteins with 0 and 10 hits that were identified in Phase 1 are also identified in Phase 2. There are 899 proteins for which no hits are detected in Phase 2 irrespective of the number of hits forwarded from Phase 1. Of these, there are a handful of cases (77) in which proteins with more than 10 hits forwarded from Phase 1 have zero hits detected in Phase 2. The remaining proteins have at least one hit identified in Phase 2, and a few have as many as one hundred hits identified in both Phase 1 and Phase 2.

Histogram of the number of proteins as a function of the number of pairs forwarded from Phase 1 and the number of pairs identified by Phase 2 tree.

As in Phase 1, we evaluate the number of T pairs identified and the number of proteins with at least one hit and (or) T hit identified in Phase 2. These results are tabulated in Table 3. A few points to note:

- The total number of T pairs solved by the tree in the LS is 162,762 out of 209,090 forwarded from Phase 1, which is 78%. Since more than 90% of the proteins have at least one T hit (see below), this performance is actually better than it looks at first sight.
- PSI-BLAST provides the most dominant signal, detecting 62,419 (30%) of T pairs, followed by the LP7 score with 52,727 (25%) T pairs. The remaining branches put together identify 54,373 (26%) of T pairs.
- The tree model that we generated on the training set generalizes quite well to the test sets, with comparable performance on TS1: 78.6% of T pairs identified overall (30,943 detected out of 39,364 forwarded from Phase 1). The performance of the LP7 score is significantly weaker in TS1 than in the LS. This poorer performance of the LP7 branch in TS1 is compensated by slight increases in the performance of the other branches.
- The performance in TS2 is comparatively lower than LS and TS1 with 768 T pairs detected out of 1125 forwarded from Phase 1 (68.3%).
- Table 3 also provides the number of proteins with at least one hit identified (may be T or D) and the number of proteins with at least one T hit identified. The tree model identifies at least one hit in 91% of proteins (11,420 out of 12,527 forwarded from Phase 1) and detects at least one T hit in 92% of proteins (10,795 out of 11,694 forwarded from Phase 1) in the LS. TS1 performs comparably with 94% of proteins with at least one hit and 95% of proteins with at least one T hit identified. However, TS2 performs a bit lower with 85% and 81% respectively. Since TS2 is the smallest set, it is possible that statistical fluctuations cause the difference.
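The recovery percentages quoted in the list above follow directly from the reported counts. A short check, using only numbers stated in the text (the helper name is hypothetical):

```python
def fraction_percent(found, forwarded):
    """Percentage of forwarded items that were identified."""
    if forwarded <= 0:
        raise ValueError("forwarded count must be positive")
    return 100.0 * found / forwarded


# (T pairs identified in Phase 2, T pairs forwarded from Phase 1), as reported.
counts = {
    "LS": (162762, 209090),
    "TS1": (30943, 39364),
    "TS2": (768, 1125),
}
recovery = {name: round(fraction_percent(*pair), 1) for name, pair in counts.items()}

# Proteins with at least one T hit in the LS, as reported.
ls_proteins_with_t = round(fraction_percent(10795, 11694))
```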

Therefore, the overall performance of the tree is similar in LS and TS1, although it is slightly diminished in TS2. Further, the tree is able to identify T pairs in more than 90% of the proteins in both LS and TS1, thus enhancing the prediction capacity of LOOPP. The performance of the final ranking on the hits detected by the tree is examined next.

In Phase 1, we used the number of T pairs identified and the number of proteins with at least one T pair detected to evaluate our performance in target recognition. These were useful measures in Phase 1 because we forwarded the top 200 hits from Phase 1 to Phase 2, where we carried out more accurate and expensive calculations. In Phase 2 we seek to identify the best models and hence we need to evaluate ranking. Ultimately, we need to identify the best model or the top 5 models (as in CASP). Since we know the native structures in this case, we already know the best model based on the RMSD of the model generated by LOOPP with respect to the native structure. We rank the models identified by the Phase 2 tree according to the linear score from the final ranking. Then we check whether our identification/ranking scheme places the actual RMSD-wise best model in the top 1 or top 5 positions. Additionally, we check whether the pair in the top 1 position is a T or a D, and whether at least one pair in the top 5 is a T, as classified by our initial scheme. These results are tabulated in Table 4 for both LS and TS. For comparison, we provide the same results for LP7 as well.
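The four checks described above can be written down compactly. The sketch below is illustrative only: the record layout and function name are hypothetical, the values are made up, and a higher ranking score is assumed to be better.

```python
def evaluate_ranking(models, k=5):
    """Check where the RMSD-wise best model and the true hits land in a ranking.

    models: list of dicts with keys
        'score'   : final ranking score (assumed: higher is better)
        'rmsd'    : RMSD of the model to the native structure
        'is_true' : True if the pair is classified T, False if D
    """
    if not models:
        raise ValueError("no models to evaluate")
    ranked = sorted(models, key=lambda m: m["score"], reverse=True)
    best = min(models, key=lambda m: m["rmsd"])  # best model by RMSD
    top_k = ranked[:k]
    return {
        "best_in_top1": ranked[0] is best,
        "best_in_topk": any(m is best for m in top_k),
        "true_in_top1": ranked[0]["is_true"],
        "true_in_topk": any(m["is_true"] for m in top_k),
    }


# Made-up models for one target.
models = [
    {"name": "m1", "score": 0.9, "rmsd": 3.0, "is_true": True},
    {"name": "m2", "score": 0.8, "rmsd": 1.5, "is_true": True},
    {"name": "m3", "score": 0.1, "rmsd": 9.0, "is_true": False},
]
result = evaluate_ranking(models, k=5)
```

Averaging each of the four flags over all proteins in a set gives percentages of the kind reported in Table 4.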

Table 4 summarizes the important results of this paper discussed below:

- In the LS, the tree identifies the best hit in the top 5 in 90% of the proteins and the best hit in the top 1 in 59% of proteins. Further, 94% of proteins in the LS have a true hit in the top 5 and 90% have a true hit in the top 1.
- Compared to LP7, the performance of the tree is remarkably better, since LP7 identifies the best hit in the top 5 in 70% of the proteins and in the top 1 in 43% of the proteins of the LS. Similarly, LP7 identifies a true hit in the top 5 in 89% of proteins and in the top 1 in 84% of proteins. Hence, the tree has significantly improved template recognition compared to LP7.
- These numbers for TS1 are similar to those of the LS, which alleviates the concern about over-learning. However, in TS2, which is the CASP7 dataset, the performance of the tree is lower than in LS and TS1, consistent with the other results.

Figure 4 shows a histogram of the number of proteins versus the TM score of the top model and of the best model in the top 5 as identified by Phase 2. For most proteins the top models have TM scores ≥ 0.65. However, there are 1454 proteins for which the top models have TM scores < 0.65 and 1303 proteins for which the best model in the top 5 has a TM score < 0.65. Table 5 shows the T and D classification of these models. There are more Ts than Ds in both the TM ≥ 0.65 and TM < 0.65 categories. However, the Ts are far more numerous for TM ≥ 0.65 than for TM < 0.65, showing the overall enrichment of true hits in Phase 2. The Ts in the TM < 0.65 category and the Ds in the TM ≥ 0.65 category are borderline cases in which there are discrepancies between Modeller and TM-align.

Plot of the number of proteins as a function of TM scores of the top 1 hit and the best hit in the top 5 hits identified by LOOPP Phase 2 tree + final ranking scheme.

The LOOPP server participated in the CASP8 structure prediction experiment during the summer of 2008 (http://predictioncenter.gc.ucdavis.edu/casp8/index.cgi). The LOOPP server accepts sequences electronically, identifies the templates, generates atomically detailed models, scores them, and e-mails back the results and top scoring models. It does not use results from any other servers, whereas meta-servers were included in the ranking of CASP8 and in the discussions below. We have not done as well as we hoped and we are still studying the results. Unfortunately, during CASP8 the LOOPP server was not stable. We updated its databases well into the exercise and found several bugs during the exercise. To ensure that the results reported are meaningful and reproducible we report them twice: one set for the actual LOOPP server and a second set for the stable version of LOOPP that was achieved towards the end of the competition. We analyze in more detail the models of the stable version only.

Among the groups that submit at least 100 targets of the total of 115 targets (there were 66 such groups) LOOPP is ranked in the lower third. The average TM score over all targets of the model ranked first in LOOPP CASP submission was 0.612 compared to the best group by our assessment, the Zhang server, with TM score of 0.702. The rank of the first model is 51 compared to other servers. If the stable LOOPP is considered, the average TM score of the first model of LOOPP climbs to 0.647 and the rank to 45.

LOOPP is doing better when ranking the best model out of the five submissions to CASP8. The average TM score of actual LOOPP submission was 0.671 (rank 44, the best average was again the Zhang server with 0.719). When the stable version of LOOPP was considered the score was 0.691 and the rank was 26. While LOOPP requires considerable improvement perhaps the most striking observation for us is the high density of groups in the neighborhood of 0.6–0.7 TM scores suggesting that the differences between groups are not as large as one may suspect. For comparison the TM score of the group ranked 58 was 0.604.

To gain further insight into the performance of the algorithm we analyze in more detail the first 63 targets. There are 13362 models generated for the 63 targets. Of these, LOOPP Phase 2 identifies 1560 as hits for 61 targets (2 proteins have no hits identified). A model with a TM score to the native structure greater than or equal to 0.5 is considered a true hit. The models include 1951 true hits, of which LOOPP identifies 1393. When the TM scores of the templates to the native structures are examined we get 2169 true hits.

Further, of the 1560 hits identified, the alignment and the model building (MB) make 672 models worse than the template, of which 69 are bad templates to begin with. So, model building from the template “spoils” 603 good templates out of the whole 1560 hits. Of the 603, 39 become unacceptable (less than 0.5 TM), whereas 564 are still acceptable (greater than 0.5 TM) although the template was better than the model. Similarly, MB makes 885 models better than the template, of which 58 models are still bad in spite of the improvement compared to the template, and 800 were already good templates to begin with. There are 27 cases where MB takes a bad template and makes a good model out of it. There are 3 cases where MB does not affect the template/model at all. Also, there are 124 hits where MB makes the model worse by more than 0.05 in TM score when compared to the template, of which 116 are from good templates.
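The bookkeeping in this paragraph can be reproduced with a small classifier over (template TM, model TM) pairs. The sketch is illustrative: the function names and the example pairs are made up, and only the 0.5 TM threshold and the reported counts come from the text.

```python
from collections import Counter

GOOD_TM = 0.5  # TM score at or above which a structure counts as acceptable


def classify(template_tm, model_tm):
    """Label one hit: effect of model building, template quality, model quality."""
    if model_tm < template_tm:
        change = "worse"
    elif model_tm > template_tm:
        change = "better"
    else:
        change = "unchanged"
    return (change, template_tm >= GOOD_TM, model_tm >= GOOD_TM)


def tally(pairs):
    """Count hits in each (change, good_template, good_model) category."""
    return Counter(classify(t, m) for t, m in pairs)


# Made-up (template TM, model TM) pairs, one per category of interest.
pairs = [(0.7, 0.6), (0.7, 0.45), (0.4, 0.55), (0.4, 0.45), (0.6, 0.6)]
counts = tally(pairs)

# Consistency of the counts reported in the text.
worse, better, unchanged = 672, 885, 3
total_hits = worse + better + unchanged   # all 1560 hits
spoiled_good = worse - 69                 # good templates made worse
better_split = 58 + 800 + 27              # breakdown of the improved models
```

The category `("worse", True, False)` corresponds to the 39 good templates that model building turns into unacceptable models, and `("better", False, True)` to the 27 bad templates that are rescued.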

We find the best model in the top 5 hits 77% of the time, the best model in the top 1 43% of the time, the best template in the top 5 80% of the time, and the best template in the top 1 38% of the time. The overall performance is lower than what we observed for the training, test and CASP7 sets. It is possible that the additions to the Protein Data Bank made after 6/28/2005 are sufficiently different that re-training of the model is required (for CASP8 we updated the databases but not the prediction model).

A few concrete observations are discussed below:

- Targets 397 and 465 have no hit identified. 397, because there seems to be no good template (in Phase 2 input). 465 has a single reasonable template (TM approx. 0.5), which the Phase 2 tree fails to identify.
- In thirteen cases (out of 61), we miss the best hit. 6 of them because there is no good template (in Phase 2 input), 2 of them because of unsuccessful model building and 5 because Phase 2 tree fails to identify the reasonable templates.
- Our learning de-emphasizes target and template matches with high sequence similarity. As a result the model that we finally developed missed several trivial PSI-BLAST hits during CASP8. The PSI-BLAST hits were detected at the appropriate zero branch. However, when all the hits were collected, the hits from other branches masked the PSI-BLAST hits. We will need to rectify the model and probably assign a special protocol for sequences with high PSI-BLAST scores.
- T0413 is a difficult target on which we perform well, with our model ranked as the second best server model by the CASP evaluators. The target is a poly(3-hydroxybutyrate) depolymerase with an α/β hydrolase fold (PDB ID: 3D0K). The CASP evaluators have listed 20 close templates in the PDB for this target, most of them within 2.75 Å C-alpha RMSD from the target native. Most of these are esterases with an α/β hydrolase fold; 1JJID, a carboxylesterase, is the top one with 2.4 Å CA RMSD from the target native. Although we do not identify any of the top templates listed by the CASP evaluators, we do identify a putative esterase with an α/β hydrolase fold as the top template (1PV1A), which has a 2.6 Å CA RMSD from 3D0K. Further, our top template, 1PV1A, has a better TM score with the target native (0.65) than the top template provided by the CASP evaluators, 1JJID (0.58). Since our learning is based on TM scores rather than RMSDs, our server follows our training well and picks 1PV1A ahead of 1JJID. We also find that the other top servers for this target (Zhang server, Robetta) pick neither any of the top templates listed by the CASP evaluators nor 1PV1A, the template identified by LOOPP. The FEIG server, which performed comparably to LOOPP, does not provide template information for comparison. Further, the alignment of 1PV1A to the target has 222 out of 304 (73%) residues aligned within ± 4 residues of the TM alignment of 1PV1A with 3D0K, and the model to native TM score is 0.65, similar to the template to native TM score. Therefore, the success of the prediction for this target is mainly due to the success in template identification along with a reasonably good alignment.

To further elucidate some successes and failures we present the structural alignments of the best predicted LOOPP models from the top 5 with the native structures of four targets: T0428, T0415, T0411 and T0472 (Figure 5). These four targets have been chosen to highlight our hits and misses. T0428 is one of our best predictions, where we identify the best template available and produce a very good model with a TM score of 0.95 with the native. T0472 is one of the worst misses, since there is a very good PSI-BLAST hit for this target, which LOOPP fails to identify, and hence the model is also very poor (TM score of 0.27 with the native). T0415 and T0411 lie between these two extreme examples. In T0415, although we identify the best template, our model is not the best (TM score of 0.73 with the native), and in T0411, although we do not identify the best template, our model is reasonably good (TM score of 0.71 with the native). There are other cases, like T0419 and T0478, where the target itself is a tough one with no reasonable template in the database and hence the models are also poor.

We present a computational model for selecting templates from the protein data bank and building atomically detailed structures for target sequences. We show that both in target selection and in model building there is significant “bleeding” and loss of good templates. The losses are due to mis-classification of good templates and to suboptimal refinement. Nevertheless, because of the growth of database sizes and coverage we have in many cases multiple hits for a single target. Even if a few good templates are missed, in many cases there are other good templates to fill in. If we measure the success of the algorithm by its ability to find a sound template, then the algorithm is quite successful, as is evident in Table 4. It therefore seems that future research should focus on the step of refinement.

This research was supported by NIH grant GM067823 to Ron Elber. Brinda Kizhakke Vallat was supported by a Human Frontier Science Program Long Term Fellowship: LT00469/2007-L.

1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Research. 2000;28(1):235–242.

2. Vallat BK, Pillardy J, Elber R. A template-finding algorithm and a comprehensive benchmark for homology modeling of proteins. Proteins. 2008 Aug 15;72(3):910–928.

3. Eswar N, Marti-Renom M, Webb B, Madhusudhan MS, Eramian D, Shen M, Pieper U, Sali A. Comparative Protein Structure Modeling with Modeller. Current Protocols in Bioinformatics. 2006;5.6:5.6.1–5.6.30.

4. Qiu J, Elber R. SSALN: An alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins-Structure Function and Bioinformatics. 2006;62(4):881–891.

5. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP - a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology. 1995;247(4):536–540.

6. Chun-Nam JY, Thorsten J, Elber R. Support Vector Training of Protein Alignment Models. Lecture notes in bioinformatics: RECOMB 2007. 2007.

7. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research. 2005;33(7):2302–2309.

8. Moult J. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology. 2005;15(3):285–289.

9. Kryshtafovych A, Venclovas C, Fidelis K, Moult J. Progress over the first decade of CASP experiments. Proteins-Structure Function and Bioinformatics. 2005;61:225–236.

10. Adamczak R, Porollo A, Meller J. Accurate prediction of solvent accessibility using neural networks-based regression. Proteins-Structure Function and Bioinformatics. 2004;56(4):753–767.

11. Adamczak R, Porollo A, Meller J. Combining prediction of secondary structure and solvent accessibility in proteins. Proteins-Structure Function and Bioinformatics. 2005;59(3):467–475.

12. Kabsch W, Sander C. Dictionary of protein secondary structure – Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22(12):2577–2637.

13. Meller J, Elber R. Linear programming optimization and a double statistical filter for protein threading protocols. Proteins-Structure Function and Genetics. 2001;45(3):241–261.

14. Meller J, Elber R. Protein recognition by sequence-to-structure fitness: Bridging efficiency and capacity of threading models. Computational Methods for Protein Folding. 2002;120:77–130. Advances in Chemical Physics.

15. Karplus K, Karchin R, Shackelford G, Hughey R. Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics. 2005;21(22):4107–4115.

16. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–3402.

17. Qiu J, Elber R. Atomically detailed potentials to recognize native and approximate protein structures. Proteins-Structure Function and Bioinformatics. 2005;61:44–55.

18. Tobi D, Elber R. Distance dependent, pair potential for protein folding: Results from linear optimization. Proteins-Structure Function and Genetics. 2000;41:40–16.

19. Májek P, Elber R. A coarse grained potential for fold recognition and molecular dynamics simulations of proteins. To be submitted.

20. Adamczak R, Meller J. On the Transferability of Folding and Threading Potentials and Sequence-Independent Filters for Protein Folding Simulations. Molecular Physics. 2004;102(11–12):1291–1305.

21. Adamczak R, Meller J. Efficient and Accurate Protein Model Quality Assessment with Structural Profiles. To be published.

22. Wagner M, Meller J, Elber R. Large-scale linear programming techniques for the design of protein folding potentials. Mathematical Programming. 2004;101(2):301–318.

23. Meller J, Wagner M, Elber R. Maximum feasibility guideline in the design and analysis of protein folding potentials. Journal of Computational Chemistry. 2002;23(1):111–118.

24. Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge: Cambridge University Press; 2000.

25. Karlin S, Altschul SF. Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences of the United States of America. 1993;90(12):5873–5877.
