Template-based modeling (i.e. homology modeling and protein threading) is becoming more powerful and important for structure prediction as the PDB grows and prediction protocols improve. The current PDB may contain all templates for single-domain proteins, according to the seminal study by Zhang and Skolnick (2005a). This implies that the structures of many new proteins can be predicted using template-based methods.
The error in a template-based model comes from template selection and sequence-template alignment, in addition to the structural difference between the sequence and the template. At higher sequence identity (>50%), template-based models can be accurate enough to be useful in virtual ligand screening (Bjelic and Aqvist, 2004; Caffrey et al., 2005), designing site-directed mutagenesis experiments (Skowronek et al., 2006; Wells et al., 2006), small ligand docking prediction, and function prediction (Baker and Sali, 2001; Skolnick et al., 2000). When sequence identity is below 30%, it is difficult to recognize the best template and generate accurate sequence-template alignments, so the resultant models have a wide range of accuracies (Chakravarty et al., 2008; Sanchez et al., 2000). Pieper et al. (2006) have shown that 76% of all the models in MODBASE are from alignments in which the sequence and template share <30% sequence identity. Therefore, to greatly enlarge the pool of useful models, it is essential to improve fold recognition and alignment methods for sequence-template pairs with <30% sequence identity. Considering that there are currently millions of proteins without experimental structures, even a slight improvement in prediction accuracy can have a significant impact on large-scale structure prediction and its applications. As reported in Melo and Sali (2007), even a 1% improvement in the accuracy of fold assessment for the ~4.2 million models in MODBASE can correctly identify ~42 000 more models.
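The arithmetic behind this estimate can be checked directly. The snippet below is only an illustrative sketch: the function name `additional_correct_models` is hypothetical, and it assumes the accuracy gain applies uniformly across all models.

```python
def additional_correct_models(total_models: int, gain_percent: int) -> int:
    """Number of extra models correctly identified for a given accuracy gain.

    Assumption (for illustration only): the gain applies uniformly to
    every model. Integer arithmetic avoids floating-point rounding.
    """
    return total_models * gain_percent // 100


# ~4.2 million MODBASE models, 1% improvement in fold assessment accuracy
extra = additional_correct_models(4_200_000, 1)
print(extra)
```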
Alignment accuracy is determined by the scoring function used to drive sequence-template alignment. When the sequence and template are not close homologs, their alignment can be significantly improved by incorporating homologous information (i.e. a sequence profile) into the scoring function. HHpred (Söding, 2005), possibly the best profile-based method, is a representative example. HHpred uses only the sequence profile and predicted secondary structure for remote homolog detection. It works very well when the proteins under consideration have a large amount of homologous information in the public sequence databases, but not as well when they are low-homology. A protein is low-homology if there is insufficient homologous information available for it in the sequence databases (see Section 2 for a quantitative definition). Many threading methods, such as MUSTER (Wu and Zhang, 2008), Phyre2 (Kelley and Sternberg, 2009) and SPARKS/SP3/SP5 (Zhang et al., 2004, 2008; Zhou and Zhou, 2004, 2005), aim to go beyond profile-based methods by combining homologous information with a variety of structure information. However, recent CASP evaluations (Moult et al., 2005, 2007) demonstrate that HHpred is actually as good as, if not better than, these threading methods. Clearly, it is very challenging to outperform HHpred by simply adding structure information to template-based methods. In fact, Ginalski et al. (2005) claimed that ‘presently, the advantage of including the structural information in the fitness function cannot be clearly proven in benchmarks’.
This article describes a new scoring function for protein threading. In this function, the relative importance of structure information is determined according to the amount of homologous information available. When the proteins under consideration are low-homology, our method relies more on structure information; otherwise, it relies more on homologous information. This method enables us to significantly advance template-based modeling over profile-based methods such as HHpred, especially for low-homology proteins.
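To make the idea of homology-dependent weighting concrete, the following minimal sketch blends a profile-based score and a structure-based score with a weight that grows as homologous information becomes scarce. It is not the scoring function described in this article: the names `structure_weight` and `combined_score`, the `diversity` measure, its assumed range of 1 to 20, and the linear interpolation are all hypothetical choices made for illustration.

```python
def structure_weight(diversity: float, low: float = 1.0, high: float = 20.0) -> float:
    """Weight in [0, 1] given to structure information.

    `diversity` is a hypothetical measure of how much homologous
    information is available (assumed range: low..high). Sparse
    homologous information (diversity near `low`) gives a weight near 1;
    abundant information (diversity near `high`) gives a weight near 0.
    """
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(diversity, low), high)  # keep within the assumed range
    return (high - clamped) / (high - low)    # linear interpolation (assumption)


def combined_score(profile_score: float, structure_score: float, diversity: float) -> float:
    """Blend the two score components according to the available homology."""
    w = structure_weight(diversity)
    return (1.0 - w) * profile_score + w * structure_score


# Low-homology protein: the structure term dominates.
print(combined_score(profile_score=2.0, structure_score=4.0, diversity=1.0))
# Homology-rich protein: the profile term dominates.
print(combined_score(profile_score=2.0, structure_score=4.0, diversity=20.0))
```

A linear blend is used here only because it is the simplest form that shifts emphasis smoothly between the two sources of information; the actual dependence on homologous information could take any other form.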
The capability of predicting the structures of low-homology proteins without close homologs in the PDB is particularly important because (i) a large portion of the proteins in the PDB, which will be used as templates, belong to this class; and (ii) a majority of the Pfam (Finn et al., 2008; Sammut et al., 2008) families without solved structures are low-homology (see Section 2 for exact numbers). Therefore, to predict structures for proteins in Pfam using templates, it is essential to have a method that works well on low-homology proteins. In addition, low-homology proteins may represent a substantial portion of the microbial sequences (e.g. Staphylococcus aureus) generated by numerous metagenomic projects. It is very challenging to predict the structure of a low-homology protein because (i) its sequence profile is not diverse enough to link it to remote homologs in the PDB; and (ii) its predicted secondary structure usually has low accuracy, since secondary structure is usually predicted from homologous information.
Experimental results indicate that our method greatly outperforms the best profile-based method, HHpred, and the top CASP8 servers on low-homology proteins. Tested on the CASP8 hard targets, our method also outperforms nine of the top 10 CASP8 servers and is very close to the best one, Zhang-Server (Zhang, 2009). This is significant considering that the top CASP8 servers use a combination of multiple structure prediction techniques, including consensus methods, multiple-template modeling, template-free modeling and model refinement, while our method is a classical single-template-based threading method without any post-threading refinement.