Search tips
Search criteria 


Logo of bioinfoLink to Publisher's site
Bioinformatics. 2008 July 1; 24(13): i182–i189.
PMCID: PMC2718643

Designing succinct structural alphabets


Motivation: The 3D structure of a protein sequence can be assembled from the substructures corresponding to small segments of this sequence. For each small sequence segment, there are only a few more likely substructures. We call them the ‘structural alphabet’ for this segment. Classical approaches such as ROSETTA used sequence profile and secondary structure information, to predict structural fragments. In contrast, we utilize more structural information, such as solvent accessibility and contact capacity, for finding structural fragments.

Results: Integer linear programming technique is applied to derive the best combination of these sequence and structural information items. This approach generates significantly more accurate and succinct structural alphabets with more than 50% improvement over the previous accuracies. With these novel structural alphabets, we are able to construct more accurate protein structures than the state-of-art ab initio protein structure prediction programs such as ROSETTA. We are also able to reduce the Kolodny's library size by a factor of 8, at the same accuracy.

Availability: The online FRazor server is under construction

Contact:scli/at/,mli/at/, j3xu/at/


A small amount of structural fragments can model protein structures accurately (Kolodny et al., 2002). Thus building such structural fragment libraries has attracted intensive research. The library size and accuracy are dominating factors for modelling and predicting the protein structures accurately. Compact independent libraries for protein structures have been built, and it is difficult to reduce the size of such libraries further. However, a sequence segment does not adopt all the structural fragments in a library with equal probabilities. Therefore, it is more reasonable to build a customized structural candidate list for each sequence segment. This way, the size of structural candidate list can be much more succinct, and the protein structure can be modelled more accurately.

1.1 Fragment libraries

Structural fragment libraries are also referred to as ‘structural alphabet’ in literature. The size of a fragment library may vary from dozens to hundreds. The fragments may have fixed or variable lengths. Typically, fragments in these libraries have lengths not more than nine since the structure database may not contain representative resemblances for longer fragments (Fidelis et al., 1994).

Kolodny et al. (2002) studied fragment library with k-means clustering methods and showed that it is unnecessary to have a large fragment library to accurately model protein structures and construct near native structures. In that paper, fragments with lengths 4–7 were built with library sizes varying from 4 to 250. Criteria of evaluating fragment libraries for building protein structures were studied in Holmesand and Tsai (2004). Besides extracting structural fragments from known proteins, research has also been conducted on constructing structural fragments with ab initio methods, and such methods produced longer fragments (Lovell et al., 2003).

For the protein structure prediction purpose, it is more desirable to have a position-specific structural fragment list for every sequence segment of the target. Only a limited number of structural fragments in the fragment libraries can be adopted as candidate structural fragment for a sequence segment. ROSETTA implemented this idea based on only two type of information: sequence profile and secondary structure (Rohl et al., 2004; Simons et al., 1997). Specifically, sequence profiles for the query sequence and each sequence in the structure database are generated by PSI-BLAST (Altschul et al., 1997). A profile–profile similarity score between a query sequence segment and a structural fragment is calculated using a distance function called City Block Metric (CBM):

equation image

where [ell] is the fragment length, S(aa,i) and X(aa,i) are the frequencies of amino acid aa at position i in the sequence segment and in the structural fragment, respectively. In addition, for a query sequence segment, its predicted secondary structure is compared with the known secondary structure of each structural fragment.

Unlike ROSETTA with fixed fragment lengths, TASSER (Zhang, 2007), extracts the structural fragments from structural models generated by threading programs, with variable fragment length.

1.2 Fragment-based protein structure prediction

The structure predictions based on fragment assembly have shown promising results. For example, two top automatic methods in CASP7 (Moult et al., 2005), ROSETTA and TASSER, both use fragment assembly strategy. Fragment-based protein structure prediction is done in two steps: (1) identify the building blocks, which are fragments of known structures; (2) construct the protein structure with those building blocks using some search or simulation algorithms.

Fragment-based protein structure prediction method can be traced back to Pauling and Corey (1951), in which a protein fold is simplified into smaller parts by using the regular secondary structure element prediction. Research intensified after the work of Jones and Thirup (1986), which uses known structures to refine structures. Various fragment based structure prediction methods were developed in Jones and Thirup (1986), Claessens et al. (1989), Unger et al. (1989), Simon et al. (1991), Levitt (1992), Sippl (1993), Wendoloski and Salemme (1992), Bowie and Eisenberg (1994), with some success. Later, ROSETTA was developed, (Bradley et al., 2003; Chivian et al., 2005; Simons et al., 1997). ROSETTA significantly improved the protein structure prediction based on fragments. Some recent fragment assembly algorithms, including Haspel et al. (2003), Inbar et al. (2003) and Lee et al. (2005), use longer fragments and/or different simulation algorithms. In another related work, De Brevern et al. (2004) proposed a 16-letter alphabet to predict local structures of a protein.

1.3 Our contributions

A major bottleneck for the fragment-based protein structure prediction methods is designing succinct and highly accurate structural alphabet. A constant factor reduction on the library size will result in an exponential reduction of the search space. We have introduced new ideas and have demonstrated that

  • introducing structural information items, such as secondary structure, solvent accessibility and contact capacity, can improve the prediction of structural fragments;
  • by using integer linear programming, we can derive the best combination of both sequence and structural information items, and it is possible to significantly reduce the structural alphabet (or library) size, at the same level of accuracy;
  • such reduction will indeed significantly improve the protein structure prediction, with all other conditions unchanged.

The Occam's Razor principle tells us that smaller the alphabet is, the more likely it produces the right structure. A software package, FRazor, ‘F’ for fragment, based on integer linear programming is developed. By comparing our FRazor to the ROSETTA's fragment selection method, with the threshold as 1 Å, and fragment length 9, the position coverage is improved from 56.4% to 79.1% for β-sheets, and from 55.5% to 67.9% for loops while reducing the candidate list size from 25 to 10 simultaneously. With the candidate list size remaining at 25, our method can improve the position coverage of β-sheets from 56.4% to 89.6% and the position coverage of loops from 55.5% to 78.1%. The improvement of our method is over 50% for β-sheets and loops on average over the previous accuracies. Even for α-helices, where the improvement space is very little, FRazor still improves about 20% of the remaining open gap. Applying our method to Kolodny's library, our method reduces its size by a factor of 8 while achieving the same accuracy. Applying our fragment selection method to ROSETTA, with exactly the same settings for the rest of the system, we improve the prediction accuracy from 2% to 26% for 5 out of 6 standard benchmark proteins used in Simons et al. (1997), Kolodny et al. (2002) and Hamelryck et al. (2006).

In our work Li et al. (2007), we utilize structural fragment libraries in a new method for protein structure prediction. Experimental results show that structural fragments serve as a solid foundation for local structural bias prediction. In this article, we focus on generating high-quality structural fragment libraries.


2.1 Problem statement

Given a protein target sequence t of length n, we parse t into a collection of sequence segments. We use a sliding window of a fixed length [ell] and step size 1 to parse t. Let these segments be qe1,qe2,…,qep, p=n[ell]+1 (qe stands for query sequence element), and denote the native structural fragments of these segments as ns1,ns2,…,nsp.

We need to have a collection of structural fragments from which we can select the structural candidates for sequence segments. Denote this collection of structural fragments as (se in below stands for structural element): S={se1,se2,…,seq}.

We refer to this collection of structural fragments as structural space. In this article, S is obtained by parsing the 40 protein structures listed in Table 1A.

Table 1.
Proteins for the Structure Space and Training Set

We wish to select some structural fragments for each sequence segment, such that our selections contain adequate structural fragments close to the native structure of the sequence segment. Intuitively, the more the native-like structural fragments, the better decoys can be constructed.

Stated formally, given qej, 1≤jp, integer k and k′, k′≤k and a distance threshold θ, we wish to select a set of structural fragments, denoted as Sj[subset or is implied by]S, such that:

  • |Sj|=k;
  • [exists]Fj[subset or is implied by]Sj with |Fj|≥k′ and
  • s[set membership]Fj, dist(s,nsj)≤θ. dist is a given distance function.

Fj is referred to as a subset of near native structures for qej. If Fj is non-empty, we say that qej is covered by Sj. Using the 40 proteins in Table 1A to generate S, we have verified that over 99% sequence positions (over all protein sequences) are covered by S. Sj is referred to as the structural candidate list, or simply the candidate list, of qej. Sj is the ‘alphabet’ for qej.

2.1.1 Structural distance criteria

To compute the distance between structural fragments, we use the standard measure, which is the backbone Cα-carbon root-mean-squared deviation (or CαRMSD). CαRMSD, or simply RMSD, satisfies the triangle inequality for fragments of equal length. Our methods are also applicable to other distance measures.

2.1.2 Structural space

We generated the structural space from the 40 proteins in Table 1A, selected from the PDB. We observed that over 99% of the sequence/structural segments from the CASP7 proteins are covered by the structural fragments from these proteins.

ROSETTA and TASSER use similar approaches, selecting the fragments directly from the PDB (Berman et al., 2000). It is also possible to use the existing independent libraries, such as Kolodny's fragment library (Kolodny et al., 2002). It is not important which method to use, so long as we ensure that any structural fragment can find at least one resemblance in our structural space within some distance threshold.

2.2 Generalized linear model

It is reasonable to assume that introducing more structural information will help structural fragment prediction. However, how to combine these information items remains a difficulty. In this article, we design an integer linear programming model to integrate both sequence and structural information items optimally.

Between each structural fragment sei in the structural space and each sequence segment qej, a feature vector, An external file that holds a picture, illustration, etc.
Object name is btn165i1.jpg, of length d=4×9, will be computed to measure how well sei and qej match, for 1≤iq and 1≤jp. Each entry in Vi,j may be a linear or non-linear scoring function. We label Vi,j with +1 if dist(sei,nsj)≤θ, and −1 otherwise.

Immediately, we may employ traditional machine learning approaches, such as support vector machines (SVMs), to classify Vi,j into two classes: class +1 contains the feature vectors labelled with +1 and class −1 contain the feature vectors labelled with −1. Then the set of sei's whose Vi,j is labelled +1 can be treated as the candidate list for qej. However, such an approach is too simplistic for our case. We do not really need to classify all the structural elements correctly. As a matter of fact, we do not care if most of the structural elements are classified wrongly. We only need at least one (or a few) of the structural fragments for qej classified correctly. Often there are close-to-native structural fragments for a particular sequence segment, we are just interested in selecting a candidate list of size say k=25, and one of them is native like. Furthermore k is a constant much smaller than the total number of native-like structural fragments, hence we do not need to classify most of the structural elements correctly. It is still possible to design an SVM to classify subsets of k-elements with at least one element correct. However, this significantly increases learning dimension hence requiring more data. Furthermore, since the features are for individual elements, this approach is less natural than our direct customized approach.

On the other hand, our classification task can be easily modelled by a linear model, since a candidate list is separable by a hyperplane with high probability. To see this, let us treat the feature vectors as high-dimensional points. Suppose we have a set of random points, each point is labelled with +1 or −1, we want to find a subset of the point set, such that the subset contains at least one point with label +1. We first form the smallest convex hull containing all points. For each vertex of the convex hull, if it is labelled with +1, then we use a hyperplane to separate it from the rest. We are done. The probability that no vertex on the hull is labelled with +1 can be estimated as 1−(1−P)|H|, where P is the probability that a point is in class +1 and |H| is the expected number of points on the convex hull. According to Sims and Kim (2006), P≈(1/1.6)9=0.015 for fragments of length 9. According to Efron (1965), we may assume |H| is sufficiently large to make 1−(1−P)|H| approach to 1. Thus, with high probability, a linear separator can be trivially obtained.

These two observations have inspired us to design a system of linear models to solve our problem.

2.2.1 A generalized linear model formulation.

A general linear model has the form:

equation image

where w=(w0,…,wM)T, x is the input data, and (ϕ1,…,ϕM)T are the basis functions. Here, w is the weight vector or the parameters we want to train, and w0 is called a bias parameter and used for any fixed offset in the data. The basis functions, the ϕk's, are generally non-linear and are applied to the original data variables. In a linear model, y(x,w) is a non-linear function of the input variables due to the non-linearity of the basis functions. We refer readers to Bishop (2006) for comprehensive treatment of the linear models.

We now generalize the spirit of linear model, Equation (2), to design an integer linear program (ILP) to model our problem. As mentioned above, we use a feature vector An external file that holds a picture, illustration, etc.
Object name is btn165i2.jpg to measure the similarity between a structural fragment sei and a sequence segment qej. Without loss of generality, we assume An external file that holds a picture, illustration, etc.
Object name is btn165i3.jpg. Each structural fragment sei is associated with a weight vector An external file that holds a picture, illustration, etc.
Object name is btn165i4.jpg. The distance between a structural fragment sei and a sequence segment qej is computed by the dot product between Wi and Vi,j:

equation image

Thus, An external file that holds a picture, illustration, etc.
Object name is btn165i5.jpg may be regarded as a set of basis functions, the ϕk's, and we wish to adjust the parameters An external file that holds a picture, illustration, etc.
Object name is btn165i6.jpg so that for some nj, where senj is a ‘native−like’ structure for qej, Dnj,j is ranked top k among other Di,j's for 1≤iq. Thus we actually have a system of pq linear models, for 1≤iq and 1≤jp. Also note that we only train one set of An external file that holds a picture, illustration, etc.
Object name is btn165i7.jpg for each structural element sei, and eventually we require only one of the Di,j's, 1≤iq, which is a native-like structure for qej, to be ranked well.

This model is generic. Although we assume a linear combination of the features, we do not assume any linearity about An external file that holds a picture, illustration, etc.
Object name is btn165i8.jpg's, and they can contain quadratic terms and so on. For example, in Equation (1), a feature vector with length 180 is used. Each structure or sequence segment of length 9 is represented by 9×20 frequency distribution matrices. The feature vector has a size 180, and each entry is the absolute value of the difference between the corresponding entries. Here the Vi,j values are pre-calculated.

Now, our task is to train the Wi's. We used the 30 protein sequences, whose structures are known, in Table 1B as the Training Set. We parsed these proteins, length 9, Step 1, to obtain the set of qej's. The Structure Space sei's are obtained from the 40 proteins, also with known structures, in Table 1A.

For each sequence segment qej, we pre-compute Qj, the set of structural fragments which have a distance to qej's native structure less than the distance threshold θ. Our objective here is to optimize the weights, the Wi's, of the distance function such that we have a distance function that ranks at least k′ element in Qj well. In the following formulation, we use k′=1 for the simplicity of presentation. We can easily extend the above model to the case such that in each candidate lists, at least k′, 1≤k′≤k, native-like structures are included.

For 1≤iq, indexing the structural space and 1≤jp, indexing the sequence segments, the ILP is as follows:

equation image

equation image

equation image

equation image

equation image

equation image

We now interpret the above ILP formulation.

  • Variable gj=1 indicates no element in Qj is included as one of the k elements for Sj. The objective of the ILP, Equation (3), is therefore to minimize An external file that holds a picture, illustration, etc.
Object name is btn165i9.jpg.
  • The constant ε in Equation (4) is created as a gap to separate the native and non-native-like structural fragments. Ideally we want to have the parameter settings such that:
    equation image
    where senj is a native-like structure for qej, and sei is a non-native-like structure for qej. Equation (4) is used to achieve this goal. It is clear that −2≤Dnj,jDi,j≤2. If dnj,i,j=0, we rank the near native-like structure senj better than the non-native-like structure sei.
  • Equation (5) intends to check if a native-like structure senj is in the candidate list of size k. If fnj,j=0, then the number of non-native-like structural fragments for qej that scores higher than senj[set membership]Qj is less than k. When this fails, fnj,j=1.
  • If gj=0, Equation (6) ensures that at least one near native structure in Qj for sequence segment qej is in its candidate list. The objective function Equation (3) is used to minimize the number of sequence segments whose candidate list of size k does not contain a near native structure. Equation (7) is just to normalize the parameter distributions.

2.3 Basis functions: the Vi,j's

The basis functions we use in this article are the entries extracted from the sequence and structural information. Specifically, 4×[ell] entries, the An external file that holds a picture, illustration, etc.
Object name is btn165i10.jpg's, are created for each structural fragment and sequence segment pair. That is, for each sei and qej pair, the Vi,j feature vector has d=36 entries (i.e. the ϕk's), with four entries, corresponding to the four types of scores below, for each position.

For simplicity, in this section, we denote a structural fragment as se, and denote a sequence segment as qe, without superscripts. They both have length [ell]=9. The i-th positions of qe and se are denoted as qe[i] and se[i], respectively. For each position i, the following four types of score entries are created.

2.3.1 Mutation scores

Mutation score is similar to that of ROSETTA, as shown in Equation (1), which is to compute the similarity score between profiles. The profiles for both the template and the sequence are obtained from 5-rounds of PSI-BLAST with a cutoff of 9×10−4. Mutation score between se and qe consists of [ell] entries. One entry is calculated for each corresponding pair of positions. The value at position i, 1≤i[ell] is defined to be:

equation image

where we recall from Equation (1) that S(aa,i) and X(aa,i) are the frequencies of amino acid aa at position i for sequence segment and structural fragment, respectively. We have tested other possibilities, such as the City Block Metric, dot product and the one from Kim et al. (2003). We found that Equation (??) is slightly more stable.

2.3.2 Secondary structure score

This term is to measure the similarity between the secondary structure of sej and that of qej. The secondary structure for a structural element sej is computed by DSSP, (Kabsch and Sander, 1983). The secondary structure of a sequence is predicted by PSIPRED (Jones, 1999), then it is parsed into qej's. The program predicts the confidences αi, βi and li for position i to be α-helix, β-sheet and loop, respectively.

The secondary structure score computation at position i is from Xu (2005):

  • If the secondary structure type of se[i] is α-helix, then we use αili
  • If the secondary structure type of se[i] is β-sheet, then we use βili
  • If it is loop, we just use 0.

2.3.3 Contact capacity score

For each structural position se[i], a contact number ni is calculated. There is a contact between two residues if the distance between their Cβ atoms is within a given cutoff 7 Å. Contact capacity is to measure the capacity that a residue has c contacts with any other residues in a protein.

Given a protein structure, let N(aa,c) be the number of residues with type aa and c contacts, N(c) be the total number of residues having c contacts, N(aa) be the number of residues with type aa and N be the total number of residues. Then for an amino acid type aa, the capacity to have c contacts is defined to be:

equation image

The contact capacity score for position i is computed as: An external file that holds a picture, illustration, etc.
Object name is btn165i11.jpgS(aa,i) × CC(niaa).

2.3.4 Environmental fitness score

The environment for each structural position is defined by the combination of secondary structure type and solvent accessibility. Three secondary structure types are used: α-helix, β-strand or loop; and three accessibility levels are defined: buried, intermediate and accessible. So in total there are nine states of the structural environment and each structural position has one of the nine environmental states. Let F(Ei,aa) be the fitness score for an amino acid aa in environment state Ei. The fitness score between se[i] and qe[i] is calculated as: An external file that holds a picture, illustration, etc.
Object name is btn165i12.jpgS(aa, i) × F(Ei,aa). For more details, we refer the readers to Kim et al. (2003).


We have implemented FRazor with C++, on Linux. The ILP is implemented with the package CPLEX. Additionally, we built some heuristics into the program in the case that ILP cannot find an optimal solution within a reasonable amount of time.

3.1 Evaluation criteria

We use fragment coverage, local fit approximation and position coverage as three evaluation criteria.

One way to evaluate the significance of selected structural fragments for each target is to simply count the percentage of sequence segments covered by the structural candidate lists for a given structure distance threshold. This percentage is referred to as fragment coverage.

Local fit approximation is a criterion developed in Kolodny et al. (2002) to evaluate the quality of a fragment library. For each sequence segment, the most similar structure in terms of RMSD from the structural candidate list is calculated. Then we take the average of the RMSD values over the entire sequence segment as the local fit score.

However, a better approach for the protein prediction purpose is to count the number of positions ‘correctly predicted’ in a target t. By ‘correctly predicting a position’ we mean that at least one sequence segment containing the position is covered. The percentage of the positions which are correctly predicted is referred to as position coverage in this work. This criterion is also used by Simons et al. (1997). The positions are divided into three cases α-helix, β-sheet, and Loop. We evaluate the coverage for each type of positions.

First, we compare FRazor's score function with ROSETTA's CBM. Then we show that our program does a much better job in selecting structural candidates from a fragment library. Finally, we show that the decoys assembled from the fragments generated by our method have better quality than those from ROSETTA's fragments.

3.2 Dataset

Our dataset consists of three parts: (1) Structure Space; (2) Training Set and (3) Testing Set. The Structure Space is the collection of structural fragments from which we can select the candidate structural fragments for a sequence segment. Training set consists of the fragments used to compute our parameters. Testing set contains proteins for evaluating our method.

The proteins for Structure Space and Training Set are both from a non-homologous (<30% homology) list with resolution <2 Å, dated on March 26, 2006. The list of these proteins was created by the program PISCES (Wang and Dunbrack, 2003), and totally there are 3177 chains. We used the first 70 chains. The Structure Space is made from 40 protein chains as shown in Table 1, Panel A. We parse these proteins with a sliding window of size [ell]=9 and step size 1. Totally there are 9658 residues. The resulting structural space consists of 9338 length-9 structural fragments. The training data consist of the other 30 chains, which are also shown in Table 1, Panel B. We also parse them into length-[ell]=9 segments with sliding window of step size 1. Totally there are 6584 residues.

For the Testing Set, we use proteins from CASP7 which were created after April, 2006; there are in total 94 proteins. Also the Testing Set are parsed into segments of length [ell]=9. The CASP7 test proteins do not share high sequence identity with proteins in PDB released before March 26, 2006, which contain proteins in our Structure Space and Training Set. We also used six test proteins that were used in previous studies in Simons et al. (1997), Kolodny et al. (2002) and Hamelryck et al. (2006) to compare the quality of their decoys assembled from FRazor's fragments with that of ROSETTA's fragments. These six test proteins are: Protein A (PDB code 1FC2), Homeodomain (1ENH), Protein G (2GB1), Cro repressor (2CRO), Protein L7/L12 (1CTF) and Calbindin (4ICB).

3.3 FRazor versus CBM

It is an interesting question whether structural information, such as secondary structure, solvent accessibility and contact capacity, can help the prediction of structural fragments. In this experiment, we explore this question by comparing FRazor against the CBM model (Simons et al., 1997), where only sequence profile is used. The experimental results are listed in Table 2, where the fragment candidate list size is set to be 25, the number of templates used is 40, i.e. the 40 proteins in Table 1A, and the fragment length is 9.

Table 2.
Position coverage for CBM versus FRazor (FR)'s score function

Observe Table 2. With the threshold value as 0.5 Å, the position coverage increases from 10.0% to 37.6%, and from 26.6% to 38.7% for β-sheets and loops, respectively. With the threshold value as 1 Å, the position coverage increases from 56.4% to 89.6%, and 55.5% to 78.1% for β-sheets and loops, respectively. For threshold 1.5 Å, significant improvement is observed for β-sheets and loops as well. Overall, we can have a position coverage 88.2% and 96.7% for threshold value 1 Å and 1.5 Å , respectively, and the two values for CBM are 72.2% and 89.9%. While the improvement for α-helix looks small, because there is nothing much left to improve upon, FRazor still made 20% improvement over the remaining gap, for 0.5 Å and 1 Å.

In Table 3, we fix the threshold value as 1 Å and we compare the results by varying the candidate list size. The position coverage is displayed. The improvement for β-sheets is more than 30% on average with the same candidate list size. The improvement for loops is more than 20% on average for all the cases. From the table we can see that, the position coverage is increased from 56.4% to 79.1%, and from 55.5% to 67.9% for β-sheets and loops, respectively, while reducing the fragment candidate size from 25 to 10 simultaneously. By using 5 as the candidate list size, FRazor's performance is better than that of CBM with 40 as fragment candidate list size for β-sheets and loops. Also with using 15 as the candidate list size, FRazor's performance is better than CBM with 40 as the candidate list size in all the cases.

Table 3.
Position coverage percentage (%) for CBM versus FRazor (FR) at threshold value 1Å

Table 4 shows the results of fragment coverage and local fit criteria. In Table 4, we fix the threshold value as 1 Å and we compare the results by varying the candidate list size. This table demonstrates that FRazor with candidate list size 10 has higher fragment coverage than the fragment coverage of CBM with candidate list size 40, with scores 43.3% and 40.8%, respectively.

Table 4.
Fragment coverage and local fit score for threshold value as 1Å

For all these evaluation criteria, we can safely draw a conclusion that FRazor is able to identify compact candidate lists for sequence segments. Besides the results reported, we conducted experiments on varying the fragment length and candidate list size. These experimental results suggest that FRazor is stable and robust, and consistent improvement is observed.

3.4 Selecting fragments from a library

Sequence-specific fragment candidate lists are able to model a protein more accurately than an independent fragment library. In this section, we show that FRazor can produce a more accurate fragment candidate list than an independent library by comparing to the fragment libraries from Kolodny et al. (2002). From another aspect that each structural fragment can be mapped to an entry in a fragment library, FRazor is able to select a subset of fragments from a library for a sequence segment. The libraries from Kolodny et al. (2002) with fragment length 7 are used, and the library sizes are 50, 100, 150, 200 and 250. In order to have a fair comparison, we re-evaluated the performances of these libraries on our test data. Denote the library size as L.

Table 5 shows the results of Kolodny library, and FRazor's customized lists. By using candidate list size 25, the fragment coverage score is better than the library with 200 fragments. The local fit score by using 100 fragments is comparable with a fragment library size 250.

Table 5.
Customized fragment lists versus independent fragment libraries

3.5 Application to protein structure prediction

We also compared FRazor with ROSETTA's fragment generation module. This ultimate test is to examine the quality of the decoys folded from the fragments generated by FRazor. We replaced ROSETTA's fragment generation method by FRazor to test its accuracy. To fairly evaluate FRazor, we used ROSETTA's energy function and its default setting. The ROSETTA's fragment generation code is obtained from the ROSETTA package (version 2.0.1). For both FRazor and ROSETTA's fragment generation module, their structural fragments are selected from the same set of 40 proteins, which is included in ROSETTA's fragment generation module. Note that these are different 40 proteins from Table 1A. We used the same 30 proteins in Table 1B to train. Using FRazor instead of ROSETTA's fragment generation module, with everything else remain unchanged, we demonstrate that FRazor improves structure prediction accuracy significantly.

We used the six proteins that were used in previous studies (Hamelryck et al., 2006; Kolodny et al., 2002; Simons et al., 1997) to evaluate FRazor. We assembled 1000 decoys for each protein using structural fragments generated from both FRazor and ROSETTA, respectively, and then compared FRazor and ROSETTA in terms of the percentage of good decoys1 and the average RMSD of all the 1000 decoys. As shown in Table 6, FRazor can generate 1.8–26% more good decoys than ROSETTA's fragment generation method. The average RMSD of the decoys generated by FRazor is also much smaller for four of the six test proteins. For the other two test proteins, both FRazor and ROSETTA have similar average RMSD.

Table 6.
Decoy quality comparison between ROSETTA and FRazor

FRazor also generated the best decoys with better RMSDs. For example, the best decoy generated by FRazor for the Cro repressor protein (PDB code 2CRO) has a much lower RMSD to its native structure than that generated by ROSETTA. As shown in Figure 1, the first is the best decoy for the Cro repressor protein generated by ROSETTA with RMSD 3.02 Å, the second is the best decoy generated by FRazor with RMSD 2.57 Å, and the third is the native structure. In addition to the Cro repressor protein, the best decoys for both Homeodomain (PDB code 1ENH) and Protein L7/L12 generated (PDB code 1CTF) by FRazor also have much lower RMSDs than the best generated by ROSETTA. For the other three proteins 1FC2, 2GB1 and 4ICB, their best decoys generated by ROSETTA are slightly better than those by FRazor.

Fig. 1.
Best decoys generated by ROSETTA and FRazor for the Cro repressor protein 2CRO.


Our results illustrate that structural information can help the accurate prediction of structural fragments for a sequence segment. Our experimental results demonstrate that our method generates structural fragments of significantly better quality, compared to ROSETTA's fragment generator and Kolodny's library. This is further justified by the end results of decoys generation.

We have generated our fragment library based on 40 proteins. Currently we are in the process of extending this to a much larger set of representative proteins. The scoring function we used to map a sequence segment to a structural fragment consists of mutation score, secondary structure score, contact capacity score, and environment fitness score. To improve the performance, a natural idea is to use more scoring terms. A promising way is to combine the scores from threading results, like the case in Zhang et al. (2005), Zhang (2007). Currently all the scoring terms depend on only a single position, which implies that residues in a protein sequence are assumed to be independent. However, some residues are obviously correlated, and it may obtain better performance if we encode the correlation information into our scoring function. The challenge to do so is to deal with the sparsity in training data since there will be many more parameters to be trained. We may use some regularization technique to resolve this issue.

Our work significantly improves the accuracy of β-sheet and loop positions, and this gives us the possibility of predicting loop regions more accurately. This work is underway. The program can also assign weights to the positions of a structure automatically. This might be useful for identifying structure motifs. A position with a small weight may imply this position is unstable.


We thank David Baker and his ROSETTA group, for developing and allowing us using the ROSETTA program, and ISMB'08 referees for helpful comments.

Funding: This work is supported by NSERC OGP0046506, the Canada Research Chair program and MITACS, and was made possible by the facilities of the SHARCNET ( D.B. was also partially supported by a Chinese Government Scholarship Program and an NSFC grant 60496320.

Conflict of Interest: none declared.


1A decoy is good if its RMSD to the native structure is less than 6 Å.


  • Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Berman HM, et al. The protein data bank. Nucleic Acids Res. 2000;28:235–242. [PMC free article] [PubMed]
  • Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics) Springer; 2006.
  • Bowie JU, Eisenberg D. An evolutionary approach to folding small α-helical proteins that uses sequence information and an empirical guiding fitness function. Proc. Natl Acad. Sci. 1994;91:4436–4440. [PubMed]
  • Bradley P, et al. Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins Struct. Funct. Genet. 2003;53(Suppl. 6):457–468. [PubMed]
  • De Brevern A, et al. Local backbone structure prediction of proteins. In Silico Biol. 2004;4:381–386. [PMC free article] [PubMed]
  • Chivian D, et al. Rosetta predictions in CASP5: successes, failures, and prospects for complete automation. Proteins Struct. Funct. Genet. 2005;61(Suppl. 7):157–166. [PubMed]
  • Claessens M, et al. Modelling the polypeptide backbone with ‘spare parts’ from known protein structures. Protein Eng. 1989;2:335–345. [PubMed]
  • Efron B. The convex hull of a random set of points. Biometrika. 1965;52:331–343.
  • Fidelis K, et al. Comparison of systematic search and database methods for constructing segments of protein structure. Protein Eng. 1994;7:953–960. [PubMed]
  • Hamelryck T, et al. Sampling realistic protein conformations using local structural bias. PLoS Computat. Biol. 2006;2:e131. [PMC free article] [PubMed]
  • Haspel N, et al. Reducing the computational complexity of protein folding via fragment folding and assembly. Protein Sci. 2003;12:1177–1187. [PubMed]
  • Holmesand JB, Tsai J. Some fundamental aspects of building protein structures from fragment libraries. Protein Sci. 2004;13:1636–1650. [PubMed]
  • Inbar Y, et al. Protein structure prediction via combinatorial assembly of sub-structural units. Bioinformatics. 2003;19(Suppl. 1):158–168. [PubMed]
  • Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 1999;292:195–202. [PubMed]
  • Jones TA, Thirup S. Using known substructures in protein model building and crystallography. EMBO J. 1986;5:819–823. [PubMed]
  • Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. [PubMed]
  • Kim D, et al. PROSPECT II: protein structure prediction program for genome-scale applications. Protein Eng. 2003;16:641–650. [PubMed]
  • Kolodny R, et al. Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol. 2002;323:297–307. [PubMed]
  • Lee J, et al. Protein structure prediction based on fragment assembly and parameter optimization. Biophys. Chem. 2005;115:209–214. [PubMed]
  • Levitt M. Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 1992;226:507–533. [PubMed]
  • Li SC, et al. Technical Report. University of Waterloo; 2007. FALCON: zero in on the native protein structure.
  • Lovell SC, et al. Ab initio construction of polypeptide fragments: efficient generation of accurate, representative ensembles. Proteins. 2003;51:41–55. [PubMed]
  • Moult J, et al. Critical assessment of methods of protein structure prediction (casp):round 6. Proteins Struct. Funct. Genet. 2005;61:3–7. [PubMed]
  • Pauling L, Corey RB. The pleated sheet, a new layer configuration of polypeptide chains. Proc. Natl Acad. Sci. 1951;37:251–256. [PubMed]
  • Rohl CA, et al. Protein structure prediction using Rosetta. Methods Enzymol. 2004;383:66–93. [PubMed]
  • Simon I, et al. Calculation of protein conformation as an assembly of stable overlapping segments: application to Bovine pancreatic trypsin inhibitor. Proc. Natl Acad. Sci. 1991;88:3661–3665. [PubMed]
  • Simons KT, et al. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 1997;268 [PubMed]
  • Sims GE, Kim SH. A method for evaluating the structural quality of protein models by using higher-order varphi-psi pairs scoring. Proc. Natl Acad. Sci. 2006;103:4428–4432. [PubMed]
  • Sippl M. Recognition of errors in three-dimensional structures of proteins. Proteins. 1993;17:355–362. [PubMed]
  • Unger R, et al. A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins Struct. Funct. Genet. 1989;5:355–373. [PubMed]
  • Wendoloski JJ, Salemme FR. Probit: a statistical approach to modeling proteins from partial coordinate data using substructure libraries. J. Mol. Graph. 1992;10:124–126. [PubMed]
  • Wang G, Dunbrack RL., Jr PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589–1591. [PubMed]
  • Xu J. Fold recognition by predicted alignment accuracy. IEEE/ACM Trans. Comput. Biol. Bioinform. 2005;2:157–165. [PubMed]
  • Zhang Y. Proteins. Suppl. 8 2007. Template-based modeling and free modeling by I-TASSER in CASP7. [PubMed]
  • Zhang Y, et al. TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins. 2005;61(Suppl. 7):91–98. [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press