Motivation: Template-based modeling, including homology modeling and protein threading, is the most reliable method for protein 3D structure prediction. However, alignment errors and template selection are still the main bottleneck for current template-base modeling methods, especially when proteins under consideration are distantly related.
Results: We present a novel context-specific alignment potential for protein threading, including alignment and template selection. Our alignment potential measures the log-odds ratio of one alignment being generated from two related proteins to being generated from two unrelated proteins, by integrating both local and global context-specific information. The local alignment potential quantifies how well one sequence residue can be aligned to one template residue based on context-specific information of the residues. The global alignment potential quantifies how well two sequence residues can be placed into two template positions at a given distance, again based on context-specific information. By accounting for correlation among a variety of protein features and making use of context-specific information, our alignment potential is much more sensitive than the widely used context-independent or profile-based scoring function. Experimental results confirm that our method generates significantly better alignments and threading results than the best profile-based methods on several large benchmarks. Our method works particularly well for distantly related proteins or proteins with sparse sequence profiles because of the effective integration of context-specific, structure and global information.
Motivation: Protein contact map describes the pairwise spatial and functional relationship of residues in a protein and contains key information for protein 3D structure prediction. Although studied extensively, it remains challenging to predict contact map using only sequence information. Most existing methods predict the contact map matrix element-by-element, ignoring correlation among contacts and physical feasibility of the whole-contact map. A couple of recent methods predict contact map by using mutual information, taking into consideration contact correlation and enforcing a sparsity restraint, but these methods demand for a very large number of sequence homologs for the protein under consideration and the resultant contact map may be still physically infeasible.
Results: This article presents a novel method PhyCMAP for contact map prediction, integrating both evolutionary and physical restraints by machine learning and integer linear programming. The evolutionary restraints are much more informative than mutual information, and the physical restraints specify more concrete relationship among contacts than the sparsity restraint. As such, our method greatly reduces the solution space of the contact map matrix and, thus, significantly improves prediction accuracy. Experimental results confirm that PhyCMAP outperforms currently popular methods no matter how many sequence homologs are available for the protein under consideration.
Although studied extensively, designing highly accurate protein energy potential is still challenging. A lot of knowledge-based statistical potentials are derived from the inverse of the Boltzmann law and consist of two major components: observed atomic interacting probability and reference state. These potentials mainly distinguish themselves in the reference state and use a similar simple counting method to estimate the observed probability, which is usually assumed to correlate with only atom types. This paper takes a rather different view on the observed probability and parameterizes it by the protein sequence profile context of the atoms and the radius of the gyration, in addition to atom types. Experiments confirm that our position-specific statistical potential outperforms currently the popular ones in several decoy discrimination tests. Our results imply that in addition to reference state, the observed probability also makes energy potentials different and evolutionary information greatly boost performance of energy potentials.
Protein structure alignment is a fundamental problem in computational structure biology. Many programs have been developed for automatic protein structure alignment, but most of them align two protein structures purely based upon geometric similarity without considering evolutionary and functional relationship. As such, these programs may generate structure alignments which are not very biologically meaningful from the evolutionary perspective. This paper presents a novel method DeepAlign for automatic pairwise protein structure alignment. DeepAlign aligns two protein structures using not only spatial proximity of equivalent residues (after rigid-body superposition), but also evolutionary relationship and hydrogen-bonding similarity. Experimental results show that DeepAlign can generate structure alignments much more consistent with manually-curated alignments than other automatic tools especially when proteins under consideration are remote homologs. These results imply that in addition to geometric similarity, evolutionary information and hydrogen-bonding similarity are essential to aligning two protein structures.
Despite significant progress in recent years, ab initio folding is still one of the most challenging problems in structural biology. This paper presents a probabilistic graphical model for ab initio folding, which employs Conditional Random Fields (CRFs) and directional statistics to model the relationship between the primary sequence of a protein and its three-dimensional structure. Different from the widely-used fragment assembly method and the lattice model for protein folding, our graphical model can explore protein conformations in a continuous space according to their probability. The probability of a protein conformation reflects its stability and is estimated from PSI-BLAST sequence profile and predicted secondary structure. Experimental results indicate that this new method compares favorably with the fragment assembly method and the lattice model.
protein structure prediction; ab initio folding; conditional random fields (CRFs); directional statistics; fragment assembly; lattice model
This paper presents RaptorX, a statistical method for template-based protein modeling that improves alignment accuracy by exploiting structural information in a single or multiple templates. RaptorX consists of three major components: single-template threading, alignment quality prediction and multiple-template threading. This paper summarizes the methods employed by RaptorX and presents its CASP9 result analysis, aiming to identify major bottlenecks with RaptorX and template-based modeling and hopefully directions for further study. Our results show that template structural information helps a lot with both single-template and multiple-template protein threading especially when closely-related templates are unavailable and there is still large room for improvement in both alignment and template selection. The RaptorX web server is available at http://raptorx.uchicago.edu.
single-template threading; multiple-template threading; alignment quality prediction; probabilistic alignment; multiple protein alignment; CASP
Motivation: Building an accurate alignment of a large set of distantly related protein structures is still very challenging.
Results: This article presents a novel method 3DCOMB that can generate a multiple structure alignment (MSA) with not only as many conserved cores as possible, but also high-quality pairwise alignments. 3DCOMB is unique in that it makes use of both local and global structure environments, combined by a statistical learning method, to accurately identify highly similar fragment blocks (HSFBs) among all proteins to be aligned. By extending the alignments of these HSFBs, 3DCOMB can quickly generate an accurate MSA without using progressive alignment. 3DCOMB significantly excels others in aligning distantly related proteins. 3DCOMB can also generate correct alignments for functionally similar regions among proteins of very different structures while many other MSA tools fail. 3DCOMB is useful for many real-world applications. In particular, it enables us to find out that there is still large improvement room for multiple template homology modeling while several other MSA tools fail to do so.
Availability: 3DCOMB is available at http://ttic.uchicago.edu/~jinbo/software.htm.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Motivation: Alignment errors are still the main bottleneck for current template-based protein modeling (TM) methods, including protein threading and homology modeling, especially when the sequence identity between two proteins under consideration is low (<30%).
Results: We present a novel protein threading method, CNFpred, which achieves much more accurate sequence–template alignment by employing a probabilistic graphical model called a Conditional Neural Field (CNF), which aligns one protein sequence to its remote template using a non-linear scoring function. This scoring function accounts for correlation among a variety of protein sequence and structure features, makes use of information in the neighborhood of two residues to be aligned, and is thus much more sensitive than the widely used linear or profile-based scoring function. To train this CNF threading model, we employ a novel quality-sensitive method, instead of the standard maximum-likelihood method, to maximize directly the expected quality of the training set. Experimental results show that CNFpred generates significantly better alignments than the best profile-based and threading methods on several public (but small) benchmarks as well as our own large dataset. CNFpred outperforms others regardless of the lengths or classes of proteins, and works particularly well for proteins with sparse sequence profiles due to the effective utilization of structure information. Our methodology can also be adapted to protein sequence alignment.
Supplementary data are available at Bioinformatics online.
Most threading methods predict the structure of a protein using only a single template. Due to the increasing number of solved structures, a protein without solved structure is very likely to have more than one similar template structures. Therefore, a natural question to ask is if we can improve modeling accuracy using multiple templates. This paper describes a new multiple-template threading method to answer this question. At the heart of this multiple-template threading method is a novel probabilistic-consistency algorithm that can accurately align a single protein sequence simultaneously to multiple templates. Experimental results indicate that our multiple-template method can improve pairwise sequence-template alignment accuracy and generate models with better quality than single-template models even if they are built from the best single templates (P-value<10-6) while many popular multiple sequence/structure alignment tools fail to do so. The underlying reason is that our probabilistic-consistency algorithm can generate accurate multiple sequence/template alignments. In another word, without an accurate multiple sequence/template alignment the modeling accuracy cannot be improved by simply using multiple templates to increase alignment coverage. Blindly tested on the CASP9 targets with more than one good template structures, our method outperforms all other CASP9 servers except two (Zhang-Server and QUARK of the same group). Our probabilistic-consistency algorithm can possibly be extended to align multiple protein/RNA sequences and structures.
protein modeling; multiple-template threading; probabilistic alignment matrix; probabilistic-consistency algorithm; multiple sequence/template alignment
Compared with the protein 3-class secondary structure (SS) prediction, the 8-class prediction gains less attention and is also much more challenging, especially for proteins with few sequence homologs. This paper presents a new probabilistic method for 8-class SS prediction using conditional neural fields (CNFs), a recently invented probabilistic graphical model. This CNF method not only models the complex relationship between sequence features and SS, but also exploits the interdependency among SS types of adjacent residues. In addition to sequence profiles, our method also makes use of non-evolutionary information for SS prediction. Tested on the CB513 and RS126 data sets, our method achieves Q8 accuracy of 64.9 and 64.7%, respectively, which are much better than the SSpro8 web server (51.0 and 48.0%, respectively). Our method can also be used to predict other structure properties (e.g. solvent accessibility) of a protein or the SS of RNA.
Bioinformatics; Conditional neural fields; Eight class; Protein; Secondary structure prediction
Global Distance Test (GDT) is one of the commonly accepted measures to assess the quality of predicted protein structures. Given a set of distance thresholds, GDT maximizes the percentage of superimposed (or matched) residue pairs under each threshold, and reports the average of these percentages as the final score. The computation of GDT score was conjectured to be NP-hard. All available methods are heuristic and do not guarantee the optimality of scores. These heuristic strategies usually result in underestimated GDT scores. Contrary to the conjecture, the problem can be solved exactly in polynomial time, albeit the method would be too slow for practical usage. In this paper we propose an efficient tool called OptGDT to obtain GDT scores with theoretically guaranteed accuracies. Denote ℓ as the number of matched residue pairs found by OptGDT for a given threshold d. Let ℓ′ be the optimal number of matched residues pairs for threshold d/(1 + ε), where ε is a parameter in our computation. OptGDT guarantees that ℓ ≥ ℓ′. We applied our tool to CASP8 (The eighth Critical Assessment of Structure Prediction Techniques) data. For 87.3% of the predicted models, better GDT scores are obtained when OptGDT is used. In some cases, the number of matched residue pairs were improved by at least 10%. The tool runs in time O(n3 log n/ε5) for a given threshold d and parameter ε. In the case of globular proteins, the tool can be improved to a randomized algorithm of O(n log2 n) runtime with probability at least 1 − O(1/n). Released under the GPL license and downloadable from http://bioinformatics.uwaterloo.ca/∼scli/OptGDT/.
algorithms; alignment; computational molecular biology; linear programming; protein folding
Protein threading is one of the most successful protein structure prediction methods. Most protein threading methods use a scoring function linearly combining sequence and structure features to measure the quality of a sequence-template alignment so that a dynamic programming algorithm can be used to optimize the scoring function. However, a linear scoring function cannot fully exploit interdependency among features and thus, limits alignment accuracy.
This paper presents a nonlinear scoring function for protein threading, which not only can model interactions among different protein features, but also can be efficiently optimized using a dynamic programming algorithm. We achieve this by modeling the threading problem using a probabilistic graphical model Conditional Random Fields (CRF) and training the model using the gradient tree boosting algorithm. The resultant model is a nonlinear scoring function consisting of a collection of regression trees. Each regression tree models a type of nonlinear relationship among sequence and structure features. Experimental results indicate that this new threading model can effectively leverage weak biological signals and improve both alignment accuracy and fold recognition rate greatly.
protein threading; conditional random fields; gradient tree boosting; regression tree; nonlinear scoring function
Accurate tertiary structures are very important for the functional study of non-coding RNA molecules. However, predicting RNA tertiary structures is extremely challenging, because of a large conformation space to be explored and lack of an accurate scoring function differentiating the native structure from decoys. The fragment-based conformation sampling method (e.g. FARNA) bears shortcomings that the limited size of a fragment library makes it infeasible to represent all possible conformations well. A recent dynamic Bayesian network method, BARNACLE, overcomes the issue of fragment assembly. In addition, neither of these methods makes use of sequence information in sampling conformations. Here, we present a new probabilistic graphical model, conditional random fields (CRFs), to model RNA sequence–structure relationship, which enables us to accurately estimate the probability of an RNA conformation from sequence. Coupled with a novel tree-guided sampling scheme, our CRF model is then applied to RNA conformation sampling. Experimental results show that our CRF method can model RNA sequence–structure relationship well and sequence information is important for conformation sampling. Our method, named as TreeFolder, generates a much higher percentage of native-like decoys than FARNA and BARNACLE, although we use the same simple energy function as BARNACLE.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
One of the major challenges with protein template-free modeling is an efficient sampling algorithm that can explore a huge conformation space quickly. The popular fragment assembly method constructs a conformation by stringing together short fragments extracted from the Protein Data Base (PDB). The discrete nature of this method may limit generated conformations to a subspace in which the native fold does not belong. Another worry is that a protein with really new fold may contain some fragments not in the PDB. This article presents a probabilistic model of protein conformational space to overcome the above two limitations. This probabilistic model employs directional statistics to model the distribution of backbone angles and 2nd-order Conditional Random Fields (CRFs) to describe sequence-angle relationship. Using this probabilistic model, we can sample protein conformations in a continuous space, as opposed to the widely used fragment assembly and lattice model methods that work in a discrete space. We show that when coupled with a simple energy function, this probabilistic method compares favorably with the fragment assembly method in the blind CASP8 evaluation, especially on alpha or small beta proteins. To our knowledge, this is the first probabilistic method that can search conformations in a continuous space and achieves favorable performance. Our method also generated three-dimensional (3D) models better than template-based methods for a couple of CASP8 hard targets. The method described in this article can also be applied to protein loop modeling, model refinement, and even RNA tertiary structure prediction.
conditional random fields (CRFs); directional statistics; fragment assembly; lattice model; protein structure prediction; template-free modeling
Motivation: The challenge of template-based modeling lies in the recognition of correct templates and generation of accurate sequence-template alignments. Homologous information has proved to be very powerful in detecting remote homologs, as demonstrated by the state-of-the-art profile-based method HHpred. However, HHpred does not fare well when proteins under consideration are low-homology. A protein is low-homology if we cannot obtain sufficient amount of homologous information for it from existing protein sequence databases.
Results: We present a profile-entropy dependent scoring function for low-homology protein threading. This method will model correlation among various protein features and determine their relative importance according to the amount of homologous information available. When proteins under consideration are low-homology, our method will rely more on structure information; otherwise, homologous information. Experimental results indicate that our threading method greatly outperforms the best profile-based method HHpred and all the top CASP8 servers on low-homology proteins. Tested on the CASP8 hard targets, our threading method is also better than all the top CASP8 servers but slightly worse than Zhang-Server. This is significant considering that Zhang-Server and other top CASP8 servers use a combination of multiple structure-prediction techniques including consensus method, multiple-template modeling, template-free modeling and model refinement while our method is a classical single-template-based threading method without any post-threading refinement.
Motivation: One of the major bottlenecks with ab initio protein folding is an effective conformation sampling algorithm that can generate native-like conformations quickly. The popular fragment assembly method generates conformations by restricting the local conformations of a protein to short structural fragments in the PDB. This method may limit conformations to a subspace to which the native fold does not belong because (i) a protein with really new fold may contain some structural fragments not in the PDB and (ii) the discrete nature of fragments may prevent them from building a native-like fold. Previously we have developed a conditional random fields (CRF) method for fragment-free protein folding that can sample conformations in a continuous space and demonstrated that this CRF method compares favorably to the popular fragment assembly method. However, the CRF method is still limited by its capability of generating conformations compatible with a sequence.
Results: We present a new fragment-free approach to protein folding using a recently invented probabilistic graphical model conditional neural fields (CNF). This new CNF method is much more powerful than CRF in modeling the sophisticated protein sequence-structure relationship and thus, enables us to generate native-like conformations more easily. We show that when coupled with a simple energy function and replica exchange Monte Carlo simulation, our CNF method can generate decoys much better than CRF on a variety of test proteins including the CASP8 free-modeling targets. In particular, our CNF method can predict a correct fold for T0496_D1, one of the two CASP8 targets with truly new fold. Our predicted model for T0496 is significantly better than all the CASP8 models.
Struct2Net is a web server for predicting interactions between arbitrary protein pairs using a structure-based approach. Prediction of protein–protein interactions (PPIs) is a central area of interest and successful prediction would provide leads for experiments and drug design; however, the experimental coverage of the PPI interactome remains inadequate. We believe that Struct2Net is the first community-wide resource to provide structure-based PPI predictions that go beyond homology modeling. Also, most web-resources for predicting PPIs currently rely on functional genomic data (e.g. GO annotation, gene expression, cellular localization, etc.). Our structure-based approach is independent of such methods and only requires the sequence information of the proteins being queried. The web service allows multiple querying options, aimed at maximizing flexibility. For the most commonly studied organisms (fly, human and yeast), predictions have been pre-computed and can be retrieved almost instantaneously. For proteins from other species, users have the option of getting a quick-but-approximate result (using orthology over pre-computed results) or having a full-blown computation performed. The web service is freely available at http://struct2net.csail.mit.edu.
Protein structure prediction without using templates (i.e., ab initio folding) is one of the most challenging problems in structural biology. In particular, conformation sampling poses as a major bottleneck of ab initio folding. This article presents CRFSampler, an extensible protein conformation sampler, built on a probabilistic graphical model Conditional Random Fields (CRFs). Using a discriminative learning method, CRFSampler can automatically learn more than ten thousand parameters quantifying the relationship among primary sequence, secondary structure, and (pseudo) backbone angles. Using only compactness and self-avoiding constraints, CRFSampler can efficiently generate protein-like conformations from primary sequence and predicted secondary structure. CRFSampler is also very flexible in that a variety of model topologies and feature sets can be defined to model the sequence-structure relationship without worrying about parameter estimation. Our experimental results demonstrate that using a simple set of features, CRFSampler can generate decoys with much higher quality than the most recent HMM model.
protein conformation sampling; conditional random fields (CRFs); discriminative learning
We developed and tested RAPTOR++ in CASP8 for protein structure prediction. RAPTOR++ contains four modules: threading, model quality assessment, multiple protein alignment and template-free modeling. RAPTOR++ first threads a target protein to all the templates using three methods and then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. Based upon the predicted quality, RAPTOR++ employs different strategies as follows. If multiple alignments have good quality, RAPTOR++ builds a multiple protein alignment between the target and top templates and then generates a 3D model using MODELLER. If all the alignments have very low quality, RAPTOR++ uses template-free modeling. Otherwise, RAPTOR++ submits a threading-generated 3D model with the best quality. RAPTOR++ was not ready for the first 1/3 targets and was under development during the whole CASP8 season. The template-based and template-free modeling modules in RAPTOR++ are not closely integrated. We are using our template-free modeling technique to refine template-based models.
template-based modeling; template-free modeling; protein threading; model quality assessment
Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new fold targets. However, we find out that even for new fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them.
In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority voting method assuming that all the individual servers are equally important and independent, the newly developed method evaluates their correlation by using maximum likelihood estimation and extracts independent latent servers from them by using principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new fold CASP7 targets are considered, our method achieves an average accuracy of 37%, which is much better than that of the majority voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06. These methods demonstrate an average accuracy of 13.0%, 10.8%, 25.8% and 21.2%, respectively.
Reducing server correlation and optimally combining independent latent servers show a significant improvement over the traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction use.
Motivation: The 3D structure of a protein sequence can be assembled from the substructures corresponding to small segments of this sequence. For each small sequence segment, there are only a few more likely substructures. We call them the ‘structural alphabet’ for this segment. Classical approaches such as ROSETTA used sequence profile and secondary structure information, to predict structural fragments. In contrast, we utilize more structural information, such as solvent accessibility and contact capacity, for finding structural fragments.
Results: Integer linear programming technique is applied to derive the best combination of these sequence and structural information items. This approach generates significantly more accurate and succinct structural alphabets with more than 50% improvement over the previous accuracies. With these novel structural alphabets, we are able to construct more accurate protein structures than the state-of-art ab initio protein structure prediction programs such as ROSETTA. We are also able to reduce the Kolodny's library size by a factor of 8, at the same accuracy.
Availability: The online FRazor server is under construction