CPHmodels-3.0 is a web server that predicts protein 3D structure by single-template homology modeling. The server employs a hybrid of the scoring functions of CPHmodels-2.0 and a novel remote homology-modeling algorithm. The server first attempts to model a query sequence using the fast CPHmodels-2.0 profile–profile scoring function, which is suitable for close homology modeling. The new, computationally costly remote homology-modeling algorithm is engaged only if no suitable PDB template is identified in the initial search. CPHmodels-3.0 was benchmarked in the CASP8 competition and produced models for 94% of the targets (117 out of 128); 74% of these were predicted as high-reliability models (87 out of 117). These achieved an average RMSD of 4.6Å when superimposed on the 3D structure. The remaining 26% low-reliability models (30 out of 117) could be superimposed on the true 3D structure with an average RMSD of 9.3Å. These performance values place the CPHmodels-3.0 method in the group of high-performing 3D prediction tools. Besides its accuracy, one of the important features of the method is its speed. For most queries, the response time of the server is <20min. The web server is available at http://www.cbs.dtu.dk/services/CPHmodels/.
Sequence profiles have broad application in bioinformatics prediction algorithms, dating back to the pioneering work by Rost and Sander (1). The field of protein structure prediction has benefited greatly from this work, and most high-performing algorithms for protein homology modeling use sequence profiles as their main vehicle (2–4). Prediction of local protein structure features can also improve when sequence profiles are used to represent the protein sequences (5–7). Here, we use a scheme for close and remote homology modeling that builds on these findings. Two protein sequences are aligned using local sequence alignment with a scoring matrix constructed by combining sequence profiles and local protein structural features such as secondary structure and relative surface accessibility.
The use of such local protein structural features improves the alignment accuracy. The fold recognition ability is further improved by the use of a double-sided Z-score and a baseline correction for sequence length and amino acid composition.
The method has been implemented as a web server with a simple user interface. Here, we describe the server and evaluate its performance on 117 target sequences that were modeled during the CASP8 competition.
The combinatorial extension program CE (9) was used to construct two benchmark data sets. Pairs of PDB structures were chosen that could be superimposed with a CE Z-score >3.8 and with a mutual sequence identity of less than 40%. A Hobohm 1 algorithm (10) was used to identify clusters of structurally similar proteins, and a maximum of 10 structures per cluster were included. This procedure leaves us with a training and a test set of 1377 and 690 protein pairs, respectively.
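The cluster-capping step can be illustrated with a minimal sketch of a Hobohm-1-style greedy pass. This is not the authors' implementation: the function name `hobohm1_clusters`, the `is_similar` predicate (which would stand in for a CE-based similarity test) and the rule of dropping surplus cluster members are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Sequence


def hobohm1_clusters(
    items: Sequence[str],
    is_similar: Callable[[str, str], bool],
    max_per_cluster: int = 10,
) -> Dict[str, List[str]]:
    """Greedy Hobohm-1-style clustering (illustrative sketch).

    Items are visited in the given order. An item joins the cluster of the
    first representative it is similar to; otherwise it founds a new cluster.
    Each cluster keeps at most `max_per_cluster` members (representative
    included); surplus members are dropped (an assumption of this sketch).
    """
    clusters: Dict[str, List[str]] = {}
    for item in items:
        for rep in clusters:
            if is_similar(rep, item):
                if len(clusters[rep]) < max_per_cluster:
                    clusters[rep].append(item)
                break
        else:
            # No existing representative is similar: start a new cluster.
            clusters[item] = [item]
    return clusters
```

Because the pass is greedy, the result depends on the order in which structures are visited; the source does not state which ordering was used.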
A position-specific scoring matrix (PSSM) is generated for a query sequence by searching for up to five iterations, with default settings, against a local version of the Uniprot database using PsiBlast (8). After each iteration, the PSSM generated by Blast is saved and used to search for a template in PDB. Provided that a template is found with a Blast e-value <10⁻⁵, a PSSM is also generated for the template using the same number of Blast iterations as for the query. Next, the query is aligned to the template using a scoring matrix that at each position is calculated as the average of the score of the template sequence in the query PSSM and the score of the query sequence in the template PSSM. This query–template alignment is accepted as a reliable model provided that the Blast e-value is <10⁻⁵ and the sequence identity is >30%.
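The averaged scoring matrix described above can be sketched as follows. This is a minimal sketch, not the server's code: the `Pssm` representation (one dictionary of amino acid scores per sequence position) and the function name `profile_profile_matrix` are assumptions chosen for readability.

```python
from typing import Dict, List

# Assumed representation: one {amino acid: log-odds score} row per position.
Pssm = List[Dict[str, float]]


def profile_profile_matrix(
    query_seq: str,
    query_pssm: Pssm,
    template_seq: str,
    template_pssm: Pssm,
) -> List[List[float]]:
    """Return S where S[i][j] is the mean of two scores:
    the template residue j looked up in query PSSM row i, and
    the query residue i looked up in template PSSM row j."""
    if len(query_seq) != len(query_pssm):
        raise ValueError("query sequence and PSSM differ in length")
    if len(template_seq) != len(template_pssm):
        raise ValueError("template sequence and PSSM differ in length")
    matrix: List[List[float]] = []
    for i, q_res in enumerate(query_seq):
        row = []
        for j, t_res in enumerate(template_seq):
            row.append(0.5 * (query_pssm[i][t_res] + template_pssm[j][q_res]))
        matrix.append(row)
    return matrix
```

The resulting matrix would then be passed to a local alignment routine; gap penalties and the alignment step itself are outside this sketch.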
In situations where the query sequence is a difficult target and no suitable template or alignment was found using the setup described for CPHmodels-2.0, it is necessary to search for a template using a refined algorithm that is computationally more costly. This includes a PsiBlast search against a reduced non-redundant protein sequence database (nr), profile-profile alignment including predicted local structure information obtained from NetSurfP (7), and a double-sided Z-score evaluation. The predicted local structural features include secondary structure and relative surface accessibility. We describe the different steps involved in this remote-homology modeling procedure in the Supplementary Material.
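The two-stage control flow (fast search first, costly search only as a fallback) can be sketched as below. The thresholds come from the text; the `Hit` record, the function names and the two search callables are hypothetical placeholders, and the remote search internals are deliberately left out because they are described only in the Supplementary Material.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

EVALUE_CUTOFF = 1e-5     # Blast e-value threshold stated in the text
IDENTITY_CUTOFF = 0.30   # sequence identity threshold stated in the text


@dataclass
class Hit:
    template_id: str
    evalue: float
    identity: float  # fraction in the range 0..1


def is_reliable(hit: Hit) -> bool:
    """Acceptance rule for the fast close-homology alignment."""
    return hit.evalue < EVALUE_CUTOFF and hit.identity > IDENTITY_CUTOFF


def choose_template(
    query: str,
    fast_search: Callable[[str], Optional[Hit]],
    remote_search: Callable[[str], Optional[Hit]],
) -> Tuple[str, Optional[Hit]]:
    """Run the fast search first; call the costly remote search only
    when the fast search yields no reliable hit."""
    hit = fast_search(query)
    if hit is not None and is_reliable(hit):
        return "close", hit
    return "remote", remote_search(query)
```

Passing the searches in as callables keeps the expensive step lazy: `remote_search` is never invoked for queries that the fast stage already handles.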
Once the best template has been found, Cα-atom coordinates are extracted according to the sequence alignment and used as a starting point for the homology-modeling process. Missing atoms are added using the segmod program (11), and the structure is refined using the encad program (12), both from the GeneMine package (www.bioinformatics.ucla.edu/genemine/).
Optimal alignment parameters were estimated on the benchmark training data set to maximize the fraction of correctly aligned residues, that is, residues placed within 4Å of their position in the crystal structure. This measure is commonly known as the f4 measure. The result of this benchmark calculation is shown in Figure 1. For the CPHmodels-3.0 method, we find that an average of 47% and 42% of the residues are correctly aligned for the training and test data sets, respectively. These numbers are significantly higher than what is obtained using any of the other three methods included in the benchmark (P<0.005 in all cases, binomial test).
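A minimal sketch of the f4 calculation is given below. It assumes that the model and reference Cα coordinates are already expressed in the same frame, that unaligned residues are marked as `None`, and that the fraction is taken over all reference residues; the source does not spell out the denominator or whether the 4Å bound is inclusive, so both choices are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def f4(
    model: Sequence[Optional[Coord]],
    reference: Sequence[Coord],
    cutoff: float = 4.0,
) -> float:
    """Fraction of reference residues whose modeled C-alpha atom lies
    within `cutoff` angstroms of the reference position.

    model[i] is None where residue i was not aligned or not modeled.
    """
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same length")
    if not reference:
        raise ValueError("reference must not be empty")
    hits = sum(
        1
        for m, r in zip(model, reference)
        if m is not None and math.dist(m, r) <= cutoff
    )
    return hits / len(reference)
```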
The method was next benchmarked to validate its ability to identify the correct fold. The test set is composed of 690 query–target pairs, and some sequences can appear more than once as either a query or a target sequence. In total, the test set is formed by a unique set of 1216 PDB chains. Each query sequence in the test set was aligned against the same pool of 1216 representative template structures. Next, the performance of the prediction methods was evaluated in terms of the rank of the target in the sorted list of template structures. Many templates other than the specific target structure could potentially share structural similarity with the query, and these templates could show up as ‘false’ false positives in the rank analysis even though they are actually perfect hits. To exclude these ‘false’ false positives from the rank analysis, all template hits with an alignment score greater than that of the target in question and a CE structural alignment Z-score to the query structure >3.8 were removed from the list. In this way, only ‘true’ false positive template hits are included when calculating the rank of the target. The result of the benchmark calculation is shown in Figure 2. For the CPHmodels-3.0 method with double-sided Z-score, we find that 74% of the queries in the test data set identify the correct template within the top 10 of the template pool. This performance is significantly higher than what is obtained for the three other methods in the benchmark (P<0.01 comparing to CPHmodels-3.0, e.g. Z-score, P<0.001 comparing to both CPHmodels-2.0, and Blosum. P-values are calculated using a binomial test).
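The rank calculation with the ‘false’ false positives removed can be sketched as follows. The function name `target_rank` and the two dictionaries (alignment score per template, and CE Z-score of each template against the query structure) are assumed data structures for illustration; templates without a CE Z-score are treated here as structurally dissimilar.

```python
from typing import Dict


def target_rank(
    scores: Dict[str, float],
    ce_z_to_query: Dict[str, float],
    target: str,
    z_cutoff: float = 3.8,
) -> int:
    """Rank (1 = best) of `target` among templates sorted by alignment score.

    Templates that outscore the target but are themselves structurally
    similar to the query (CE Z-score > z_cutoff) are skipped, so only
    'true' false positives push the target down the list.
    """
    target_score = scores[target]
    rank = 1
    for template, score in scores.items():
        if template == target or score <= target_score:
            continue
        if ce_z_to_query.get(template, 0.0) > z_cutoff:
            continue  # 'false' false positive: a valid alternative template
        rank += 1
    return rank
```

A query then counts as a success for the top-10 statistic when `target_rank(...) <= 10`.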
In the CASP8 competition, the CPHmodels-3.0 server submitted models for 117 targets out of 128. For 38 targets with a significant Blast hit, the CPHmodels-2.0 profile log-odds method was used. For the remaining 79 targets, the CPHmodels-3.0 method was used. The performance of the server is summarized in Figure 3. A large fraction of the models (85%) were structurally superimposable [CE structural alignment Z-score above 3.8 (8)] on their target. A Z-score threshold of 10 separates the ‘good’ models with an f4≥0.6 from the ‘bad’ models with f4<0.6. The difference in f4 between the models with a Z-score>10 and the models with a Z-score<10 is highly statistically significant (P<0.001, t-test). The average RMSD for models with a Z-score>10 is 4.6Å, and the average RMSD for the models with a Z-score<10 is 9.3Å. This difference is also highly statistically significant (P<0.001, t-test). A total of 95% (51/54) of the models with a Z-score>10 shared structural similarity with their target.
We have evaluated the performance of the CPHmodels-3.0 server using the data from the official CASP8 result page. Here, 72 of the 174 registered methods competed in the class for automatic servers, and 66 of these made predictions for >80% of the targets. Among these, CPHmodels-3.0 achieved an average rank of 24 on the 164 TBM & FM domains from all targets when sorting on the different quality measures (Table 1).
CPHmodels-3.0 was thus well within the top half of the servers that made predictions for most of the domains. It must be noted that CPHmodels-3.0 is a single-template server, and its ranking, especially in the cumulative scores, might have been better if more than one domain had been modeled for some of the targets (excluding the cumulative score performance measures improves the rank to 17). Multi-domain modeling is something the user can do manually by resubmitting un-modeled parts of the sequence to the server. One other important aspect of the server is its speed. For most queries in the CASP competition, the response time of the server was <20min.
One of the aims when implementing the CPHmodels-3.0 was to make a front-end that was easy to understand for users without any prior knowledge of homology modeling, and at the same time provide a result that is as accurate as possible. A detailed description of the server including a flowchart is given in Supplementary Material.
The input to the web server is a raw text file (i.e. not MS Word™ or another formatted document type) containing a single sequence in FASTA format. Alternatively, the sequence can be pasted into a text field. After a job is submitted, the web page updates until the result appears; a web link is also provided for the user to bookmark, or the result link can be e-mailed when the job has finished.
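Before submitting, a user could check locally that a file holds exactly one FASTA record. The sketch below is a hypothetical client-side helper, not part of the server; it assumes a header line starting with `>` is required, which the text does not state explicitly.

```python
from typing import Tuple


def read_single_fasta(text: str) -> Tuple[str, str]:
    """Parse plain text holding exactly one FASTA record.

    Returns (header without '>', sequence in upper case).
    Raises ValueError for missing headers, multiple records or an
    empty sequence.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input must start with a '>' header line")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("input must contain a single sequence")
    sequence = "".join(lines[1:]).upper()
    if not sequence:
        raise ValueError("sequence is empty")
    return lines[0][1:].strip(), sequence
```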
An example of the output is shown in Figure 4. The output is divided into the following sections:
The output from the remote homology template search is described below (Figure 4C).
Next is the final formatted alignment of the query sequence and the ‘datomseq’ from above.
CPHmodels-3.0 is an easy-to-use web server for comparative protein homology modeling. It has been shown in benchmark calculations, including the CASP8 competition, to have a performance comparable to the majority of high-performing 3D prediction tools. The server response time is very short for most targets (<20min). The method uses an optimized alignment scoring function that, beyond secondary structure, includes predicted relative surface accessibility, which to our knowledge has not previously been used in publicly available protein homology modeling servers. The method also employs a double-sided Z-score to rank individual template hits. This Z-score ranking attempts to reduce the bias imposed on the alignment score by the composition and length of the query and template database sequences, and was shown to significantly improve the overall prediction accuracy.
The current method is single-template based and only makes use of the top-ranked template structure. It should therefore be possible to improve the overall performance once a strategy has been implemented to utilize information from multiple templates, as previously demonstrated (13–16). Results from the CASP8 competition have shown that the overall performance of the method (as measured by, for instance, the cumulative GDT_TS score) could be improved. The server only builds one continuous protein chain model, meaning that for multi-domain proteins, the method might fail to build a model for a second, smaller domain. This can be overcome manually by resubmitting the un-modeled part of the protein sequence to the server to obtain a model for the remaining part as well, thus increasing the coverage of the query sequence (and hence the overall GDT_TS score). However, this does not overcome the problem of structurally relating such models of multiple domains to each other, which remains an unsolved problem for any modeling server.
Supplementary Data are available at NAR Online.
Funding for open access charge: XXX.
Conflict of interest statement. None declared.
We would like to thank Garry Gippert for his input and discussions during the early stages of this work.