Depending on whether similar structures are found in the PDB library, the protein structure prediction can be categorized into template-based modeling and free modeling. Although threading is an efficient tool to detect the structural analogs, the advancements in methodology development have come to a steady state. Encouraging progress is observed in structure refinement which aims at drawing template structures closer to the native; this has been mainly driven by the use of multiple structure templates and the development of hybrid knowledge-based and physics-based force fields. For free modeling, exciting examples have been witnessed in folding small proteins to atomic resolutions. However, predicting structures for proteins larger than 150 residues still remains a challenge, with bottlenecks from both force field and conformational search.
In recent years, despite many debates, structure genomics has probably been one of the most noteworthy efforts in protein structure determination. It aims to obtain 3D models of all proteins by an optimized combination of experimental structure solution and computer-based structure prediction [1,2•]. Two factors will dictate the success of structure genomics: experimental structure determination of optimally selected proteins and efficient computer modeling algorithms. Based on about 40 000 structures in the PDB library (many of which are redundant), 4 million models/fold-assignments can be obtained by a simple combination of the PSI-BLAST search and the comparative modeling technique [4•]. Development of more sophisticated and automated computer modeling approaches will dramatically enlarge the scope of modelable proteins in the structure genomics project.
The crucial problems/efforts in the field of protein structure prediction include: first, for the sequences of similar structures in PDB (especially those of weakly/distant homologous relation to the target), how to identify the correct templates and how to refine the template structure closer to the native; second, for the sequences without appropriate templates, how to build models of correct topology from scratch. The progress made along these directions was assessed in the recent CASP7 experiment  under the categories of template-based modeling (TBM) and free modeling (FM). Here, I will review the new progress and challenges in these directions.
The canonical procedure of the TBM consists of four steps: first, finding known structures (templates) related to the sequence to be modeled (target); second, aligning the target sequence to the template structure; third, building structural frameworks by copying the aligned regions or by satisfying the spatial restraints from templates; fourth, constructing the unaligned loop regions and adding side-chain atoms. The first two steps are actually done in a single procedure called threading (or fold recognition) [6,7] because the correct selection of templates relies on the accurate alignment. Similarly, the last two steps are performed simultaneously since the atoms of the core and loop regions are in close interaction.
The existence of similar structures in the PDB is a necessary precondition for successful TBM. An important question is how complete the current PDB structure library is. Figure 1 shows a distribution of the best templates found by structural alignment for 1413 representative single-domain proteins between 80 and 200 residues. Remarkably, even excluding the homologous templates of sequence identity >20%, all the target proteins have at least one structural analog in the PDB with a Cα root-mean-squared deviation (rmsd) to the target <6 Å covering >70% of the target residues. The average rmsd and coverage are 2.96 Å and 86%, respectively. Zhang and Skolnick [9••] recently showed that high-quality full-length models could be built for all the protein targets with an average rmsd of 2.25 Å when using the best templates in the PDB. These data demonstrate that the structural universe of the current PDB library is essentially complete for solving the protein structure problem, at least for single-domain proteins. However, most of the target–template pairs at this level of sequence identity (~15%) are difficult to identify by threading. In fact, after excluding the templates of sequence identity >30%, only two-thirds of the proteins could be assigned by the current threading techniques to templates of a correct topology, and even then with some alignment errors (average rmsd ~4 Å). Thus, the role of the structure genomics initiative is to bridge the target–template gap for the remaining one-third of the proteins, as well as to improve the alignment accuracy for the other two-thirds by providing evolutionarily closer template proteins.
Since its invention in the early 1990s [6,7], threading has become one of the most active areas in protein structure prediction. Numerous algorithms have been developed during the past 15 years for the purpose of identifying structure templates from the PDB, using techniques that include sequence profile–profile alignments (PPAs) [10–13], structural profile alignments, hidden Markov models (HMMs) [15,16••], machine learning [17,18], and others.
The sequence PPA is probably the most often-used and robust threading approach. Instead of matching the single sequences of target and template, PPA aligns a target multiple sequence alignment (MSA) with a template MSA. The alignment score in the PPA is usually calculated, at each pair of aligned positions, as the product of the amino-acid frequencies in the target MSA and the log-odds of the same amino acids in the template MSA (the template profile), summed over the amino-acid types. There are alternative ways of calculating the PPA scores. The profile-alignment-based methods demonstrated advantages in several recent blind tests [21,22,23•]. In LiveBench-8, for example, all top four servers (BASD/MASP/MBAS, SFST/STMP, FFAS03, and ORF2/ORFS) were based on the sequence PPA. In CAFASP and the recent CASP Server Section [23•], several sequence-profile-based methods were ranked at the top of the single-threading servers. Wu and Zhang recently showed that the accuracy of the sequence PPAs can be further improved by about 5–6% by incorporating a variety of additional structural information.
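The PPA column score described above can be sketched in a few lines. This is a minimal illustration, not the scoring function of any of the servers named here: the profile representation (one dictionary per MSA column), the function names, the linear gap penalty, and the use of a plain global dynamic-programming alignment are all assumptions made for the example.

```python
def column_score(target_freqs, template_log_odds):
    """Score one target MSA column against one template profile column.

    target_freqs: dict mapping amino acid -> frequency in the target column.
    template_log_odds: dict mapping amino acid -> log-odds in the template column.
    Amino acids missing from the template column contribute zero (an assumption).
    """
    return sum(freq * template_log_odds.get(aa, 0.0)
               for aa, freq in target_freqs.items())


def global_alignment_score(target_profile, template_profile, gap=-1.0):
    """Best global alignment score between two profiles (linear gap penalty).

    The gap value is hypothetical; real threading programs tune gap
    opening/extension penalties and often add structural terms.
    """
    n, m = len(target_profile), len(template_profile)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + column_score(target_profile[i - 1],
                                                    template_profile[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy two-column profiles (made-up numbers, for illustration only).
target_profile = [{"A": 0.8, "G": 0.2}, {"L": 1.0}]
template_profile = [{"A": 1.0, "G": 0.5, "L": -1.0}, {"L": 2.0, "A": -1.0}]
score = global_alignment_score(target_profile, template_profile)
```

With these toy numbers the two columns align without gaps, so the total is simply the sum of the two column scores.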
In CASP7, HHsearch [16••], an HMM–HMM alignment method, stood out as the best single-threading server. The principle of the HMM–HMM alignments and the PPAs is similar in that both try to perform a pair-wise alignment of the target MSA with the template MSA. Instead of representing the MSAs by sequence profiles, HHsearch uses profile HMMs that can generate the sequences with certain probabilities, given by the product of amino-acid emission and insertion/deletion probabilities. HHsearch aligns the target and template HMMs by maximizing the probability that the two models coemit the same amino-acid sequence. In this way, the amino-acid frequencies and the insertions and deletions of both HMMs are matched up together in an optimum way [16••].
Although the average performance differs among different algorithms, there is not a single-threading program that can outperform other methods for every target. This naturally leads to the prevalence of the so-called meta-server [25,26•,27], which collects and combines results from a set of different threading programs. There are two ways to generate predictions in meta-servers. One is to build a hybrid model by cutting and pasting selected structural fragments from multiple templates. The combined model has on average larger coverage and better topology than the best single template. One drawback is that the hybrid models often have nonphysical local clashes between atoms. The second way is to select the best model based on a variety of scoring functions or machine-learning techniques, which has emerged as a new research topic called Model Quality Assessment Programs (MQAPs). Despite considerable efforts in developing various MQAP scores, the most robust score turns out to be the one based on the structure consensus [29•], that is, the best models are those simultaneously hit by various threading algorithms. The idea behind the consensus approach is simple: there are more ways for a threading program to select a wrong template than a right one. Therefore, the chances for multiple threading programs to make a common but wrong selection are much lower than the chances to make a common and correct selection.
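The consensus idea can be illustrated with a small sketch that ranks models by their average similarity to all the other models. The choices here are assumptions for the example only: each model is reduced to a set of residue-residue contacts, similarity is the Jaccard index of two contact sets, and the server names are hypothetical. Published consensus MQAP scores typically compare superimposed 3D structures instead.

```python
def jaccard(contacts_a, contacts_b):
    """Jaccard similarity between two sets of residue contact pairs."""
    union = contacts_a | contacts_b
    return len(contacts_a & contacts_b) / len(union) if union else 1.0


def select_by_consensus(models):
    """Return (best_name, scores) where each score is the mean similarity
    of one model to every other model in the pool.

    models: dict mapping a model name -> set of (i, j) residue contacts.
    """
    if len(models) < 2:
        raise ValueError("consensus needs at least two models")
    scores = {}
    for name, contacts in models.items():
        sims = [jaccard(contacts, other_contacts)
                for other, other_contacts in models.items() if other != name]
        scores[name] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical pool: three servers roughly agree, one picks an unrelated fold.
models = {
    "server_a": {(1, 5), (2, 6), (3, 7)},
    "server_b": {(1, 5), (2, 6), (4, 9)},
    "server_c": {(10, 20), (11, 21), (12, 22)},
    "server_d": {(1, 5), (2, 6), (3, 7), (4, 9)},
}
best, scores = select_by_consensus(models)
```

In this toy pool the outlier `server_c` shares no contacts with the others and receives a score of zero, while `server_d`, which overlaps most with the agreeing models, is selected.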
The meta-server predictors have dominated the server predictions in previous experiments (e.g. CAFASP4, LiveBench-8, and CASP6). In the recent CASP7 experiment [23•], however, Zhang-Server (an automated server based on profile–profile threading and I-TASSER structure refinement [31••]) clearly outperformed the others (including the meta-servers that used it as an input [29•]). A list of the top 10 automated servers in the CASP7 experiment is shown in Table 1. On the one hand, these data highlight the challenge that the MQAP methods face in correctly ranking and selecting the best models; on the other hand, the success of the composite threading plus refinement servers (such as Zhang-Server, ROBETTA, and MetaTasser) demonstrates the advantage of structure refinement in the TBM prediction.
The goal of protein structure refinement is to draw the templates closer to the native, which has proven to be an extremely nontrivial problem. Until only a few years ago, most of the TBM procedures either kept the templates unchanged or drove the templates away from the native structures [32,33].
Early efforts on template structure refinement have been focused on the molecular dynamics (MD)-based atomic simulations, which attempt to refine low-resolution models by running the classic software such as AMBER and CHARMM. Except for some isolated instances, however, no systematic improvement was achieved . The failure of the MD-based structure refinements seems contrary to the reported successes of the MD potentials in discriminating the native from structural decoys. Wroblewska and Skolnick [35••] recently showed that the AMBER plus GB potential could only discriminate the native from roughly minimized TASSER structure decoys . After a 2-ns MD simulation, none of the native structures have the lowest energy among decoys and the energy–rmsd correlation vanishes. A noteworthy observation was recently made by Summa and Levitt [37••] who exploited different molecular mechanics (MM) potentials (AMBER99, OPLS-AA, GROMOS96, and ENCAD) on the refinement of 75 proteins by in vacuo energy minimization. The authors found that a knowledge-based atomic contact potential based on the PDB statistics outperforms all the traditional MM potentials by moving almost all the test proteins closer to the native state, while the MM potentials, except for AMBER99, essentially drive the decoys away from the native. The vacuum simulation without solvation may be a part of the reason for the failure of the MM potentials. But this observation demonstrates the potential of the hybrid knowledge-based and physics-based potentials in the protein structure refinement.
Encouraging template refinements have been recently achieved by combining the hybrid potentials with spatial restraints from threading templates [9••,38••,39•]. Misura et al. [38••] first built low-resolution models by ROSETTA  using a fragment library enriched by the query-template alignment; the Cβ-contact restraints were used to guide the assembly procedure. The low-resolution models were then refined by a physics-based atomic potential. As a result, in 22 of 39 test cases, at least 1 of the 10 lowest energy models was found closer to the native than the template.
A more comprehensive test of the template refinement procedure based on TASSER simulations, combined with consensus spatial restraints from multiple templates, was reported by Zhang and Skolnick [9••,36]. For 1489 test cases, TASSER reduces the rmsd of the templates in the majority of cases with an average rmsd reduction from 6.7 to 4.4 Å over the threading aligned regions. Even starting from the best templates as identified by the structural alignment, TASSER refines the models from 2.5 to 1.88 Å in the aligned regions. Here, TASSER has built the structures based on a reduced model (specified by Cα and side-chain center of mass) with a purely knowledge-based force field. One of the major contributions to the refinements is the use of multiple threading templates where the consensus spatial restraint is more accurate than that from the individual template. Second, the composite knowledge-based energy terms have been extensively optimized using large-scale structure decoys  which help coordinate the complicated correlations between different interaction terms.
The progress of threading template refinements has been assessed in the recent CASP7 experiment, where the assessors compared the predicted models with the best structural template (or ‘virtual predictor group’) and commented that ‘The best group in this respect (24, Zhang) managed to achieve a higher GDT-TS score than the virtual group in more than half the assessment units and a higher GDT-HA score in approximately one-third of cases’ [42•]. This comparison may not entirely reflect the template refinement ability of the algorithms, because the predictors actually start from threading templates rather than from the best structural alignments; the latter require knowledge of the native structure, which was not available when the predictions were made. Conversely, a global GDT score comparison may favor the full-length models because the template alignment is shorter than the models. In a direct comparison of the rmsd over the same aligned regions, we find that the first I-TASSER model is closer to the native than the best initial template in 86 of 105 TBM cases, while the other 13 (6) cases are worse than (equal to) the template. The average rmsd is 4.9 and 3.8 Å for the templates and models, respectively, over the same aligned regions [31••].
When structural analogs do not exist in the PDB library or cannot be successfully identified by threading (which is more often the case, as shown by Figure 1), the structure prediction has to be generated from scratch. This type of prediction has been termed ‘ab initio’ or ‘de novo’ modeling, a term that may easily be understood as modeling ‘from first principles’. In CASP7, it is named ‘free modeling’, which I think reflects more appropriately the status of the field, since the most efficient methods in this category still rely on hybrid approaches that include both knowledge-based and physics-based potentials. Evolutionary information is often used in generating sparse spatial restraints or identifying local structural building blocks.
The best-known idea for free modeling is probably the one pioneered by Bowie and Eisenberg who assembled new tertiary structures using small fragments (mainly 9-mer) cut from other PDB proteins . On the basis of similar idea, Baker and coworkers developed ROSETTA , which has worked extremely well for free modeling in the CASP experiments and made the fragment assembly approach popular in the field. In the new developments of ROSETTA [44••,45•], the authors first assemble structures in a reduced knowledge-based model with conformations specified by the heavy backbone atoms and Cβs. In the second stage, Monte Carlo simulations with an all-atom physics-based potential are performed to refine the details of the low-resolution models. An exciting achievement was demonstrated in CASP6 by generating a model for T0281 (70 residues) of 1.6 Å away from the crystal structure. In CASP7, ROSETTA built a model for T0283 (112 residues) with rmsd = 1.8 Å over 92 residues (Figure 2, left panel). Despite significant success, the computer cost of the procedure (~150 CPU days for a small protein <100 residues) is still too expensive for the routine use.
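The fragment assembly idea can be sketched as a Metropolis Monte Carlo loop in which each move replaces a window of backbone torsion angles with a fragment drawn from a library. This is a toy illustration only and not the ROSETTA protocol: the generic (rather than position-specific) fragment library, the 3-residue fragments used instead of 9-mers, the starting conformation, and the energy function are all assumptions chosen to keep the example short.

```python
import math
import random


def fragment_assembly(length, fragment_library, energy,
                      steps=2000, temperature=1.0, seed=0):
    """Toy fragment-insertion Monte Carlo over (phi, psi) torsion angles.

    fragment_library: list of equal-length fragments, each a list of
    (phi, psi) tuples. energy: callable scoring a full conformation
    (lower is better). Returns the lowest-energy conformation seen.
    """
    rng = random.Random(seed)
    frag_len = len(fragment_library[0])
    conformation = [(-120.0, 120.0)] * length  # arbitrary extended start
    current_e = energy(conformation)
    best, best_e = list(conformation), current_e
    for _ in range(steps):
        start = rng.randrange(length - frag_len + 1)
        fragment = rng.choice(fragment_library)
        trial = conformation[:start] + list(fragment) + conformation[start + frag_len:]
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            conformation, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(conformation), current_e
    return best, best_e


def toy_energy(conformation):
    """Hypothetical energy that simply rewards helical torsion angles."""
    return sum(((phi + 60.0) ** 2 + (psi + 45.0) ** 2) / 1000.0
               for phi, psi in conformation)


library = [
    [(-60.0, -45.0)] * 3,    # helix-like 3-mer
    [(-120.0, 130.0)] * 3,   # strand-like 3-mer
]
start_energy = toy_energy([(-120.0, 120.0)] * 9)
best, best_e = fragment_assembly(9, library, toy_energy,
                                 steps=500, temperature=0.5, seed=1)
```

In a realistic setting the fragments would be selected per sequence window from PDB structures and the energy would be a knowledge-based or physics-based potential; the loop structure, however, stays the same.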
Another successful free modeling approach, called TASSER  by Zhang and Skolnick, constructs 3D models based on a purely knowledge-based approach. Continuous fragments of various sizes are excised from threading alignments and used to reassemble protein structures in an on-and-off lattice system. A newer version of I-TASSER was recently developed by Wu et al. [46••], which refines the TASSER cluster centroids by iterative Monte Carlo simulations. Although the procedure uses structural fragments and spatial restraints from threading templates, it often constructs models of correct topology even when the topologies of individual templates are incorrect. In CASP7, among 19 FM and FM/TBM targets, I-TASSER builds correct topology (~3–5 Å) for 7 cases with sequences up to 155 residues long. Figure 2 (right panel) shows one example of T0382 (123 residues) where all initial templates have a wrong topology (>9 Å) but the final model is 3.6 Å away from the X-ray structure.
Significant efforts have been made on the purely physics-based protein folding and structure prediction. The very first milestone of successful ab initio protein folding is probably the 1997 work of Duan and Kollman, who folded the villin headpiece (a 36-mer) by MD simulations in explicit solvent for two months on parallel supercomputers with models up to 4.5 Å . With the help of the worldwide-distributed computers, this small protein was recently folded by Pande and coworkers  to 1.7 Å with a total simulation time of 300 µs or approximately 1000 CPU years. To reduce the computing cost, Scheraga and coworkers [49•] developed a reduced physics-based model, called UNRES, which represents protein conformations by Cα, side-chain center, and a virtual peptide group. The low-energy UNRES models are then converted to all-atom representations based on ECEPP/3. In CASP6, a structure genomic target of TM0487 (T0230, 102 residues) was folded to a structure within 7.3 Å by the approach. Using ASTRO-FOLD on the ECEPP/3 optimization, Floudas and coworkers  recently constructed a model of 5.2 Å for a four-helical bundle protein of 102 residues in a double-blind prediction.
Since a detailed physicochemical description of protein folding principles does not yet exist, the protein structure prediction problem is largely defined by the evolutionary or structural distance between the target and the solved proteins in the PDB library. For the proteins with close templates, full-length models can be constructed by copying the template framework. Recent studies show that if using the best possible template structures in PDB, the state-of-the-art modeling algorithms could build high-quality full-length models for almost all single-domain proteins with an average rmsd ~2.3 Å; this suggests that the current PDB structure universe may be approaching complete for solving the protein structure prediction problem [9••]. However, most of the target–template pairs are evolutionarily too distant to be detected with the current threading approaches.
The development of efficient threading algorithms to detect weakly/distantly homologous templates has been a central theme in the field and may persist as a principal direction, as the gap between threading and the best structural alignment is obvious and tempting. However, progress in reducing this gap has been slow or incremental since the invention of the PPA techniques. There is no single-threading method that outperforms all others on every target; this has resulted in the prevalence of the meta-servers and MQAPs, which generate predictions by collecting and selecting models from a set of other threading programs. In contrast, template structure refinement has enjoyed promising progress. In the recent CASP7 experiment [23•], automated threading plus structure refinement servers outperformed both the threading-only servers and the MQAP-based meta-servers by a clear margin. Nevertheless, the template refinement mainly occurs at the topology level. The demand for atomic-level structural refinements, which can generate models of use in drug screening and biochemical function inference, is keener than ever, especially as more and more template structures become available through structure genomics and traditional structural biology.
Free modeling is certainly the ‘Holy Grail’ of the protein structure prediction because its success would mark the eventual solution to the problem. Although a purely physics-based ab initio simulation has the advantage in revealing the pathway of protein folding, the best current free-modeling results come from those which combine both knowledge-based and physics-based approaches. Although there are consistent successes in building correct topology (3–6 Å) for small proteins, the more exciting high-resolution free modeling (<2 Å) is rarer and computationally expensive. There is evidence that the current atomic potentials have the lowest energy near the native state and the bottleneck of high-resolution folding seems to be the insufficient conformational sampling [44••]. However, a golf-hole-like energy landscape without middle-range funnel should not be the one taken in nature, which can be a deeper reason for the failure of conformational search. Thus, the bottleneck for free modeling comes from the lack of both funnel-like force fields and efficient space searching, especially for proteins of larger sizes.
The project is supported in part by KU Start-up Fund 06194, the Alfred P. Sloan Foundation, and Grant Number R01GM083107 of the National Institute of General Medical Sciences.
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest