We have implemented FRazor in C++ on Linux. The ILP is solved with the CPLEX package. Additionally, we built heuristics into the program for cases in which the ILP solver cannot find an optimal solution within a reasonable amount of time.
3.1 Evaluation criteria
We use fragment coverage, local fit approximation and position coverage as three evaluation criteria.
One way to evaluate the significance of selected structural fragments for each target is to simply count the percentage of sequence segments covered by the structural candidate lists for a given structure distance threshold. This percentage is referred to as fragment coverage.
Local fit approximation is a criterion developed by Kolodny et al. to evaluate the quality of a fragment library. For each sequence segment, we find the structure in its structural candidate list that is most similar in terms of RMSD. We then average these RMSD values over all sequence segments to obtain the local fit score.
However, a better approach for protein structure prediction is to count the number of positions ‘correctly predicted’ in a target t. By ‘correctly predicting a position’ we mean that at least one sequence segment containing the position is covered. The percentage of positions that are correctly predicted is referred to as position coverage in this work. This criterion is also used by Simons et al. The positions are divided into three cases: α-helix, β-sheet and loop. We evaluate the coverage for each type of position.
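The three criteria can be illustrated with a minimal sketch. It assumes that the lowest RMSD between each sequence segment's native structure and its candidate list has already been computed, and that a segment counts as covered when that RMSD is at or below the threshold. The function name, argument names and toy numbers are hypothetical and are not part of FRazor.

```python
from statistics import mean


def evaluate_target(best_rmsd, seg_len, n_positions, threshold):
    """Compute the three criteria for one target.

    best_rmsd[i] is the lowest RMSD (in angstroms) between the native
    structure of the segment starting at position i and any fragment in
    that segment's candidate list (assumed to be precomputed).
    """
    # One segment per sliding-window start position.
    assert len(best_rmsd) == n_positions - seg_len + 1

    # Assumption: a segment is "covered" if its best RMSD <= threshold.
    covered = [r <= threshold for r in best_rmsd]

    # Fragment coverage: fraction of segments that are covered.
    fragment_coverage = sum(covered) / len(best_rmsd)

    # Local fit: average best RMSD over all segments.
    local_fit = mean(best_rmsd)

    # Position coverage: a position is correctly predicted if at least
    # one covered segment contains it.
    hit = [False] * n_positions
    for start, ok in enumerate(covered):
        if ok:
            for pos in range(start, start + seg_len):
                hit[pos] = True
    position_coverage = sum(hit) / n_positions

    return fragment_coverage, local_fit, position_coverage


# Toy target: 6 positions, segments of length 3, threshold 1.0 angstrom.
frag_cov, fit, pos_cov = evaluate_target([0.8, 1.4, 2.0, 1.2], 3, 6, 1.0)
```

In this toy case only the first segment is covered, yet it accounts for three of the six positions, which shows why position coverage can be much higher than fragment coverage for the same candidate lists.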
First, we compare FRazor's score function with ROSETTA's CBM. Then we show that our program is considerably better at selecting structural candidates from a fragment library. Finally, we show that the decoys assembled from the fragments generated by our method are of better quality than those assembled from ROSETTA's fragments.
Our dataset consists of three parts: (1) Structure Space; (2) Training Set and (3) Testing Set. The Structure Space is the collection of structural fragments from which we select the candidate structural fragments for a sequence segment. The Training Set consists of the fragments used to compute our parameters. The Testing Set contains the proteins used to evaluate our method.
The proteins for the Structure Space and the Training Set both come from a non-homologous (<30% homology) list with resolution <2 Å, dated March 26, 2006. The list was created by the program PISCES (Wang and Dunbrack, 2003) and contains 3177 chains in total. We used the first 70 chains. The Structure Space is built from 40 protein chains, listed in Panel A. We parse these proteins with a sliding window of size 9 and step size 1. In total there are 9658 residues, and the resulting Structure Space consists of 9338 length-9 structural fragments. The training data consist of the other 30 chains, listed in Panel B. We also parse them into length-9 segments with a sliding window of step size 1. These chains contain 6584 residues in total.
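The fragment count follows directly from the sliding-window parse: a chain of n residues yields n - 8 windows of length 9 at step size 1. The sketch below illustrates this; the function names and the example sequence are made up for illustration, and the closed-form count assumes every chain has at least 9 residues.

```python
def sliding_windows(chain, size=9, step=1):
    """Return every window of `size` consecutive residues in `chain`."""
    return [chain[i:i + size] for i in range(0, len(chain) - size + 1, step)]


def fragment_count(total_residues, n_chains, size=9):
    """Number of step-1 windows over `n_chains` chains.

    Each chain of n residues gives n - (size - 1) windows, so the total
    depends only on the residue total and the number of chains
    (assuming every chain has at least `size` residues).
    """
    return total_residues - n_chains * (size - 1)


# Made-up 11-residue sequence: 11 - 8 = 3 windows of length 9.
windows = sliding_windows("MKVLAAGIVGL")

# Figures reported for the Structure Space: 9658 residues in 40 chains.
n_fragments = fragment_count(9658, 40)
```

With the reported 9658 residues in 40 chains, this gives 9658 - 40 × 8 = 9338 fragments, matching the size of the Structure Space stated above.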
For the Testing Set, we use proteins from CASP7, which were created after April 2006; there are 94 proteins in total. The Testing Set is also parsed into segments of length 9. The CASP7 test proteins do not share high sequence identity with proteins in the PDB released before March 26, 2006, which include the proteins in our Structure Space and Training Set. We also used six test proteins from the previous studies of Simons et al., Kolodny et al. and Hamelryck et al. to compare the quality of decoys assembled from FRazor's fragments with that of decoys assembled from ROSETTA's fragments. These six test proteins are: Protein A (PDB code 1FC2), Homeodomain (1ENH), Protein G (2GB1), Cro repressor (2CRO), Protein L7/L12 (1CTF) and Calbindin (4ICB).
3.3 FRazor versus CBM
An interesting question is whether structural information, such as secondary structure, solvent accessibility and contact capacity, can help the prediction of structural fragments. In this experiment, we explore this question by comparing FRazor against the CBM model (Simons et al.), which uses only the sequence profile. The experimental results are listed in the accompanying table, where the fragment candidate list size is set to 25, the number of templates is 40 (i.e. the 40 proteins in Panel A) and the fragment length is 9.
Position coverage for CBM versus FRazor (FR)'s score function
As the accompanying table shows, with a threshold of 0.5 Å the position coverage increases from 10.0% to 37.6% for β-sheets and from 26.6% to 38.7% for loops. With a threshold of 1 Å, it increases from 56.4% to 89.6% for β-sheets and from 55.5% to 78.1% for loops. For a threshold of 1.5 Å, significant improvement is observed for β-sheets and loops as well. Overall, FRazor achieves a position coverage of 88.2% and 96.7% at thresholds of 1 Å and 1.5 Å, respectively, whereas CBM achieves 72.2% and 89.9%. The improvement for α-helices looks small because there is little left to improve upon, yet FRazor still closes 20% of the remaining gap at 0.5 Å and 1 Å.
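One way to read "improvement over the remaining gap" is as the share of the distance between the baseline coverage and 100% that the new method closes. The sketch below applies this reading to the coverage values quoted above; the formula and the function name are our assumption about how such a figure can be computed, not a definition taken from the evaluation itself.

```python
def gap_closed(baseline_pct, improved_pct):
    """Fraction of the gap between the baseline and 100% that is closed.

    Assumed definition: (improved - baseline) / (100 - baseline).
    """
    return (improved_pct - baseline_pct) / (100.0 - baseline_pct)


# Position coverage values quoted in the text (CBM -> FRazor).
beta_1A = gap_closed(56.4, 89.6)       # beta-sheets, threshold 1 angstrom
loop_1A = gap_closed(55.5, 78.1)       # loops, threshold 1 angstrom
overall_1A = gap_closed(72.2, 88.2)    # all positions, threshold 1 angstrom
overall_15A = gap_closed(89.9, 96.7)   # all positions, threshold 1.5 angstrom
```

Under this reading, a method that starts close to 100% can close a sizeable share of the remaining gap even when the absolute gain in percentage points is small, which is the situation described for α-helices.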
Next, we fix the threshold value at 1 Å and compare the position coverage while varying the candidate list size; the results are given in the accompanying table. With the same candidate list size, the improvement for β-sheets is more than 30% on average, and the improvement for loops is more than 20% on average in all cases. The table also shows that the position coverage increases from 56.4% to 79.1% for β-sheets and from 55.5% to 67.9% for loops even when the fragment candidate list size is reduced from 25 to 10. With a candidate list size of 5, FRazor outperforms CBM with a candidate list size of 40 for β-sheets and loops. With a candidate list size of 15, FRazor outperforms CBM with a candidate list size of 40 in all cases.
Position coverage percentage (%) for CBM versus FRazor (FR) at threshold value 1Å
The accompanying table shows the results for the fragment coverage and local fit criteria. Again, we fix the threshold value at 1 Å and vary the candidate list size. The table demonstrates that FRazor with a candidate list size of 10 has higher fragment coverage than CBM with a candidate list size of 40, with scores of 43.3% and 40.8%, respectively.
Fragment coverage and local fit score at threshold value 1 Å
Across all these evaluation criteria, we can conclude that FRazor is able to identify compact candidate lists for sequence segments. In addition to the results reported here, we conducted experiments that varied the fragment length and the candidate list size. These experiments suggest that FRazor is stable and robust, with consistent improvement observed.
3.4 Selecting fragments from a library
Sequence-specific fragment candidate lists are able to model a protein more accurately than an independent fragment library. In this section, we show that FRazor can produce a more accurate fragment candidate list than an independent library by comparing it with the fragment libraries of Kolodny et al. Viewed another way, because each structural fragment can be mapped to an entry in a fragment library, FRazor can be seen as selecting a subset of fragments from a library for each sequence segment. We use the libraries of Kolodny et al. with fragment length 7; the library sizes are 50, 100, 150, 200 and 250. To make the comparison fair, we re-evaluated the performance of these libraries on our test data. Denote the library size as L. The accompanying table shows the results for the Kolodny libraries and for FRazor's customized lists. With a candidate list size of 25, FRazor's fragment coverage is better than that of the library with 200 fragments. The local fit score obtained with a candidate list of 100 fragments is comparable to that of a fragment library of size 250.
Customized fragment lists versus independent fragment libraries
3.5 Application to protein structure prediction
We also compared FRazor with ROSETTA's fragment generation module. The ultimate test is to examine the quality of the decoys folded from the fragments generated by FRazor. We replaced ROSETTA's fragment generation method with FRazor to test its accuracy. To evaluate FRazor fairly, we used ROSETTA's energy function with its default settings. ROSETTA's fragment generation code was obtained from the ROSETTA package (version 2.0.1). For both FRazor and ROSETTA's fragment generation module, the structural fragments are selected from the same set of 40 proteins, which are included in ROSETTA's fragment generation module. Note that these 40 proteins are different from those in Panel A. We used the same 30 proteins in Panel B for training. By using FRazor in place of ROSETTA's fragment generation module, with everything else unchanged, we demonstrate that FRazor improves structure prediction accuracy significantly.
We used the six proteins from previous studies (Hamelryck et al.; Kolodny et al.; Simons et al.) to evaluate FRazor. We assembled 1000 decoys for each protein using structural fragments generated by FRazor and by ROSETTA, and then compared the two in terms of the percentage of good decoys and the average RMSD of all 1000 decoys. As shown in the accompanying table, FRazor generates 1.8–26% more good decoys than ROSETTA's fragment generation method. The average RMSD of the decoys generated by FRazor is also much smaller for four of the six test proteins. For the other two test proteins, FRazor and ROSETTA have similar average RMSD.
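The two decoy statistics can be computed as in the minimal sketch below. The RMSD cutoff that defines a "good" decoy is not stated in this section, so it is left as a parameter; the cutoff of 5.0 Å, the function name and the four RMSD values are hypothetical and serve only as an illustration.

```python
from statistics import mean


def summarize_decoys(rmsds, good_cutoff):
    """Return (percentage of good decoys, average RMSD) for one protein.

    rmsds: RMSD of each decoy to the native structure, in angstroms.
    good_cutoff: assumed RMSD cutoff at or below which a decoy is "good".
    """
    n_good = sum(1 for r in rmsds if r <= good_cutoff)
    return 100.0 * n_good / len(rmsds), mean(rmsds)


# Hypothetical decoy set with an illustrative 5.0 angstrom cutoff.
pct_good, avg_rmsd = summarize_decoys([3.0, 4.5, 6.0, 8.5], good_cutoff=5.0)
```

Reporting both numbers is useful because a fragment set can raise the share of good decoys without changing the average much, or the reverse.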
Decoy quality comparison between ROSETTA and FRazor
FRazor also generated best decoys with lower RMSDs. For example, the best decoy generated by FRazor for the Cro repressor protein (PDB code 2CRO) has a much lower RMSD to its native structure than the best decoy generated by ROSETTA. In the accompanying figure, the first structure is the best decoy for the Cro repressor protein generated by ROSETTA, with RMSD 3.02 Å; the second is the best decoy generated by FRazor, with RMSD 2.57 Å; and the third is the native structure. In addition to the Cro repressor protein, the best decoys generated by FRazor for both Homeodomain (PDB code 1ENH) and Protein L7/L12 (PDB code 1CTF) also have much lower RMSDs than the best decoys generated by ROSETTA. For the other three proteins, 1FC2, 2GB1 and 4ICB, the best decoys generated by ROSETTA are slightly better than those generated by FRazor.
Best decoys generated by ROSETTA and FRazor for the Cro repressor protein 2CRO.