We examine the ability of current state-of-the-art methods in protein structure prediction to discriminate topologically distant folds encoded by highly similar (>90% sequence identity) designed proteins in blind protein structure prediction experiments. We detail the corresponding prognosis for the protein fold recognition field and highlight the features of the methodologies that successfully deciphered this folding riddle.
Natural proteins with over 35% sequence similarity tend to fold into similar conformations, yet several evolutionarily related natural protein pairs with up to 40% similarity have been observed to produce substantially different topologies [2,3]. Two sequences of the same length that differ at only three residues were posted as sequential targets in the recent 8th Community-Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP8). Targets T0498 and T0499 thus posed a riddle for the international protein fold prediction community: do these 95% identical sequences maintain the same topological fold or adopt different ones?
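The identity figure quoted above follows from simple counting. As a minimal sketch, assuming an ungapped comparison of two equal-length sequences and the 56-residue length stated later in this article, percent identity can be computed as follows. The sequences are hypothetical placeholders (not the actual T0498 and T0499 sequences), and the substituted positions are arbitrary.

```python
def percent_identity(seq_a, seq_b):
    """Percent of positions with identical residues (equal-length, ungapped)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def differing_positions(seq_a, seq_b):
    """Return (1-based position, residue_a, residue_b) for each mismatch."""
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Hypothetical 56-residue placeholders, NOT the real target sequences.
toy_a = "A" * 56
residues = list(toy_a)
for pos in (10, 20, 30):          # arbitrary 1-based positions (assumption)
    residues[pos - 1] = "G"
toy_b = "".join(residues)

print(differing_positions(toy_a, toy_b))
print(round(percent_identity(toy_a, toy_b), 1))   # 53 of 56 identical -> 94.6
```

Three differences in 56 residues give about 94.6% identity, which rounds to the 95% used throughout the text.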
The proteins were artificially produced in the group of John Orban and Philip Bryan as a study of how much sequence identity can be tolerated while maintaining the 3-α and α/β folds of streptococcal protein G domain A (GA) and domain B (GB), respectively. The two 16% identity domains of protein G were brought together in sequence space first by adding terminal tails to GA to make it equal in length to GB and then by progressively mutating sites of nonidentity. The key in this approach was linking each fold to its natural function: human serum albumin binding for the 3-α GA fold and IgG binding for the α/β GB fold. This linkage of fold to function allowed the application of powerful biological selection methods to determine clusters of sites in each protein that could be substituted with the corresponding amino acid in the other protein. Iteratively combining mutations identified by the selection methods resulted in two 88% identical proteins. More recently, two 95% identical sequences possessing different folds, GA95 and GB95, were designed and provided as the two CASP8 targets discussed here. The designed protein pairs maintain the fold and specific binding function of the proteins from which they were derived, with no measurable structural or functional character of the domain represented in the alternate protein.
A prerequisite of recognizing a fold is prior observation of the fold. Structural genomics consortia contribute thousands of new protein structures each year, yet previously unobserved folds are seldom found. This pattern seems to indicate that the majority of folds that can be detected by current laboratory techniques have already been observed. The completeness of the structural fold space has been addressed using a subset of 1,489 proteins covering the Protein Data Bank at the level of 35% sequence identity; all but two folds can be resolved using templates found within the same set. Thus, template-based modeling appears to be feasible given the best template(s) within the set. The search for the best template for a given query protein is known as ‘fold recognition’.
The best performing freely available fold recognition web server methods are maintained by Yang Zhang within the local meta-threading server (LOMETS) fold recognition pipeline of I-TASSER (iterative threading assembly refinement algorithm), the best performing protein structure prediction server in the past two CASP experiments. As an isolated meta-threading server, LOMETS runs its component methods locally, which avoids the disruptive effects of internet traffic and remote server availability that corrupt so many meta-servers. The nine methods of LOMETS are representative of the fold recognition field (normally targeted toward naturally occurring proteins) and can be summarized as various combinations of the following: comparing target to known structure sequence profiles, secondary structure preferences, environmental fitness, pairwise contact probabilities, structure profiles, simulated mutations, single-body or residue-specific knowledge-based potentials, and profile hidden Markov models (HMMs).
Most web server groups predicted both T0498 and T0499 to adopt the α/β fold of protein GB (Figure 1a). For example, our own predictions for T0498 did not significantly resemble the target structure [37.2 global distance test total score (GDT-TS); Figure 1b, left], yet all five of our predictions for T0499 were within the top 10 total predictions (88.4 GDT-TS; Figure 1b, right). The models for T0499 exemplify progress in another major challenge in protein structure prediction: refinement of model quality from the best template.
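GDT-TS, the score quoted above, averages the percentage of Cα atoms in a model that lie within 1, 2, 4, and 8 Å of their positions in the experimental structure. The sketch below assumes the model has already been superimposed on the experimental structure; the official measure searches over many superpositions to maximize each fraction, so this single-superposition version gives only a lower bound. The function name and coordinates are illustrative.

```python
import math


def gdt_ts(model_ca, native_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Approximate GDT-TS (0-100) for pre-superimposed C-alpha coordinates.

    Assumption: a single fixed superposition. The official score maximizes
    each per-cutoff fraction over many superpositions.
    """
    if len(model_ca) != len(native_ca) or not native_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(m, n) for m, n in zip(model_ca, native_ca)]
    fractions = [sum(d <= c for d in distances) / len(distances)
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy coordinates in angstroms (hypothetical, for illustration only).
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

print(gdt_ts(model, native))   # fractions 1/4, 2/4, 3/4, 3/4 -> 56.25
```

A model identical to the experimental structure scores 100, and a model with no Cα atom within 8 Å scores 0.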
The side chain interactions visible in the experimental structures of GA95 and GB95, as well as the simulated mutant models depicted in Figure 2, demonstrate interactions within a relatively stable, folded state, which are not necessarily illustrative of those interactions occurring during the folding process. Even when the structures are known, it is difficult to ascertain exactly what makes the two proteins follow different fold trajectories. Yet fold recognition methods do not simulate folding. Rather, they rely on calculated interactions within simulated mutants of these folded structures to test the accuracy of fit for a possible template; thus, even with a perfect energy function, mistakes in fold recognition could occur.
In this case, a multitude of experimentally derived structures for GA and GB and detectable sequence similarity within this group reasonably limit the fold search to these topologies. Crossing fold assignments for GA95 and GB95 enables interrogation of side chain packing for the three nonidentical residues (Figure 2). The clash occurring between F30 and A20 when the nonidentities from T0499 are applied to the structure of GA95 (Figure 2, right) signals an incorrect fold to predictors. Conversely, minimal steric clashes emerge when the T0498 sequence is applied to the structure of GB95 (Figure 2, left). This absence of incriminating evidence for the pairing of the T0498 sequence with the GB95 fold could mislead predictors into selecting this fold topology.
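The check described above, placing one sequence's side chains on the other protein's structure and looking for overlaps, can be approximated by counting atom pairs that sit closer than a distance cutoff. This is a minimal sketch: the 3.0 Å heavy-atom cutoff, the function name, and the coordinates are assumptions for illustration, not values taken from this article or from any particular modeling package.

```python
import math


def count_clashes(atoms_a, atoms_b, cutoff=3.0):
    """Count atom pairs (one from each residue) closer than `cutoff` angstroms.

    Assumption: a single distance cutoff for all heavy atoms. Real packing
    checks typically use element-specific van der Waals radii.
    """
    return sum(1 for a in atoms_a for b in atoms_b
               if math.dist(a, b) < cutoff)


# Hypothetical side chain heavy-atom coordinates for two residues.
residue_x = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
residue_y = [(1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]

print(count_clashes(residue_x, residue_y))   # only the 1.0 A pair clashes -> 1
```

A threaded model with many such close contacts is an unlikely fit, whereas a low count, as the text notes, does not prove the fold is correct.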
Out of over 150 contributing teams, four groups recognized the difference in fold caused by three nonidentical residues in the 56 amino acid proteins: HHpred, FOLDpro, Feig, and Coma. The accurate predictions of these groups demonstrate sensitivity to subtle changes affecting folding not previously demonstrated in a bona fide blind prediction scenario.
The Söding group uses HMM emission sequences to evaluate target template matches. The emission sequence of HMMs includes position-specific insertion and deletion probabilities along with the sequence distributions found in multiple sequence alignment profiles. HHsearch specifically includes secondary structures via a substitution matrix derived from comparing measurements on the template to target predictions and to the confidence thereof. To interrogate alignments, the HHsearch method maximizes the coemission log-odds probability for the pair of HMMs derived for a given protein pair. HHsearch directs the structural similarity search hierarchically by searching databases of alignments organized by fold family rather than lists of disconnected sequences. The CSI-BLAST (context-specific iterative basic local alignment search tool) sequence similarity search method recently published by the group was likely used to build the profile input to the HMMs for each alignment.
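To make the coemission log-odds idea concrete, the sketch below scores a single pair of profile columns. It assumes the column score takes the form log2 Σ_a q(a) p(a) / f(a), as described in the HHsearch publication, where q and p are the emission probabilities of the two HMM columns and f is the background frequency. The article itself does not give the formula, so treat this form, the function name, and the four-letter toy alphabet (real profiles use 20 amino acids) as assumptions. The full method also scores insertion and deletion transitions and finds the best path through both HMMs, which is omitted here.

```python
import math


def coemission_log_odds(q_col, p_col, background):
    """Log-odds (bits) that two profile columns co-emit the same residue.

    Assumed form: log2( sum_a q(a) * p(a) / f(a) ).
    """
    if not (len(q_col) == len(p_col) == len(background)):
        raise ValueError("all distributions must cover the same alphabet")
    total = sum(q * p / f for q, p, f in zip(q_col, p_col, background))
    return math.log2(total) if total > 0 else float("-inf")


# Toy 4-letter alphabet with uniform background (hypothetical values).
background = (0.25, 0.25, 0.25, 0.25)
conserved = (1.0, 0.0, 0.0, 0.0)      # column that always emits letter 0
other = (0.0, 1.0, 0.0, 0.0)          # column that always emits letter 1
uniform = (0.25, 0.25, 0.25, 0.25)    # uninformative column

print(coemission_log_odds(conserved, conserved, background))  # 2.0
print(coemission_log_odds(uniform, uniform, background))      # 0.0
print(coemission_log_odds(conserved, other, background))      # -inf
```

Columns that agree on a conserved residue score positively, uninformative columns score zero, and columns with no residue in common score negative infinity.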
The Cheng group uses a supervised classification approach previously used for fold classification, invoking support vector machines to combine global profile-profile alignment, secondary structure, solvent accessibility, contact map, and strand hydrogen bond pairing.
The Feig group used a very typical set of methods, including fold recognition functions overlapping those in LOMETS (including HHsearch/HHpred), standard model construction, and a modified cluster calculation using a standard discriminatory potential function. Other promising work by this group in the refinement category includes the use of an implicit continuum dielectric solvent based on generalized Born theory to drive lattice-based coarse-grained searches, Monte Carlo molecular dynamics, and restrained normal mode sampling.
The Venclovas group invokes a profile comparison method for detection of distant evolutionary relationships across profile databases, adding a modified two-level SEG (segment sequences by local complexity) algorithm to filter noninformative profile regions, variable gap penalties, and adaptive parameterization. The underlying sequence similarity search is driven by their PSI-BLAST-ISS (position-specific iterative BLAST intermediate sequence search), which evaluates and refines output profile alignments. The manual submissions by this group displayed the overall best performance in CASP8.
A handful of the automated algorithms were able to recognize the fold switch caused by the three nonidentical residues of GA95 and GB95 (Figure 2). However, the experimentally unobserved 60% of naturally occurring proteins and the prospect of designing new folds heralded by Top7 demand more methods sensitive enough to detect subtle triggers in fold switching and predict previously unobserved topologies.
Developments in the protein fold prediction field can often be limited to incremental engineering optimizations. In this fold recognition problem, the proper application of support vector machines and HMM methods enabled success for two groups. In addition, two groups created their own improvements on PSI-BLAST: CSI-BLAST and PSI-BLAST-ISS. Both enhance the quality and relevance of a search by interrogating low-quality regions in the alignment by context, and together they comprise the first significant improvements on the enormously popular algorithm in a decade. The novel algorithmic adjustments in fold recognition used in CASP8 demonstrate significant progress amounting to new tools for the field.
Future developments are anticipated to include the steady stream of mathematical enhancements observed since the inception of the protein structure prediction field, as well as new conceptual paradigms such as functional signatures and the use of template-free modeling to drive the difficult fold recognition problems.
The authors thank the Samudrala CompBio group for thoughtful comments and advice. RS was supported in this work by a National Science Foundation (NSF) CAREER award and GEMSEC, an NSF-MRSEC at the University of Washington (DMR 0520567). JH was supported by the National Institute of Dental and Craniofacial Research (NIDCR) Ruth L Kirschstein National Research Service Award (NRSA): NIH F30DE017522.
The electronic version of this article is the complete one and can be found at: http://www.F1000.com/Reports/Biology/content/1/69
The authors declare that they have no competing interests.