RNA is a large group of functionally important biomacromolecules. In striking analogy to proteins, the function of RNA depends on its structure and dynamics, which in turn is encoded in the linear sequence. However, while there are numerous methods for computational prediction of protein three-dimensional (3D) structure from sequence, with comparative modeling being the most reliable approach, there are very few such methods for RNA. Here, we present ModeRNA, a software tool for comparative modeling of RNA 3D structures. As an input, ModeRNA requires a 3D structure of a template RNA molecule, and a sequence alignment between the target to be modeled and the template. It must be emphasized that a good alignment is required for successful modeling, and for large and complex RNA molecules the development of a good alignment usually requires manual adjustments of the input data based on previous expertise of the respective RNA family. ModeRNA can model post-transcriptional modifications, a functionally important feature analogous to post-translational modifications in proteins. ModeRNA can also model DNA structures or use them as templates. It is equipped with many functions for merging fragments of different nucleic acid structures into a single model and analyzing their geometry. Windows and UNIX implementations of ModeRNA with comprehensive documentation and a tutorial are freely available.
After decades of research, protein structure prediction remains a very challenging problem. In order to address the different levels of complexity of structural modeling, two types of modeling techniques, template-based modeling and template-free modeling, have been developed. Template-based modeling can often generate a moderate- to high-resolution model when a similar, homologous template structure is found for a query protein but fails if no template or only incorrect templates are found. Template-free modeling, such as fragment-based assembly, may generate models of moderate resolution for small proteins of low topological complexity. Seldom have the two techniques been integrated to improve protein modeling. Here we develop a recursive protein modeling approach to selectively and collaboratively apply template-based and template-free modeling methods to model template-covered (i.e. certain) and template-free (i.e. uncertain) regions of a protein. A preliminary implementation of the approach was tested on a number of hard modeling cases during the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP9) and successfully improved the quality of modeling in most of these cases. Recursive modeling can significantly reduce the complexity of protein structure modeling and integrate template-based and template-free modeling to improve the quality and efficiency of protein structure prediction.
Protein structure prediction; recursive protein modeling; template-free modeling; template-based modeling; CASP
Methods for computationally predicting deleterious mutations have recently been investigated for proteins, mainly by probabilistic estimations in the context of genomic research for identifying single nucleotide polymorphisms that can potentially affect protein function. It has been demonstrated that in cases where a few homologs are available, ab initio predicted structures modeled by the Rosetta method can become useful for including structural information to improve the deleterious mutation prediction methods for proteins. In the field of RNAs where very few homologs are available at present, this analogy can serve as a precursor to investigate a deleterious mutation prediction approach that is based on RNA secondary structure. When attempting to develop models for the prediction of deleterious mutations in RNAs, useful structural information is available from folding algorithms that predict the secondary structure of RNAs, based on energy minimization. Detecting mutations with desired structural effects among all possible point mutations may then be valuable for the prediction of deleterious mutations that can be tested experimentally. Here, a method is introduced for the prediction of deleterious mutations in the secondary structure of RNAs. The mutation prediction method, based on subdivision of the initial structure into smaller substructures and construction of eigenvalue tables, is independent of the folding algorithms but relies on their success to predict the folding of small RNA structures. Application of this method to predict mutations that may cause structural rearrangements, thereby disrupting stable motifs, is given for prokaryotic transcription termination in the thiamin pyrophosphate and S-adenosyl-methionine induced riboswitches. Riboswitches are mRNA structures that have recently been found to regulate transcription termination or translation initiation in bacteria by conformation rearrangement in response to direct metabolite binding.
Predicting deleterious mutations in riboswitches may make it possible to intervene systematically in bacterial genetic control.
Most threading methods predict the structure of a protein using only a single template. Due to the increasing number of solved structures, a protein without a solved structure is very likely to have more than one similar template structure. Therefore, a natural question to ask is whether we can improve modeling accuracy using multiple templates. This paper describes a new multiple-template threading method to answer this question. At the heart of this multiple-template threading method is a novel probabilistic-consistency algorithm that can accurately align a single protein sequence simultaneously to multiple templates. Experimental results indicate that our multiple-template method can improve pairwise sequence-template alignment accuracy and generate models with better quality than single-template models even if they are built from the best single templates (P-value < 10^-6), while many popular multiple sequence/structure alignment tools fail to do so. The underlying reason is that our probabilistic-consistency algorithm can generate accurate multiple sequence/template alignments. In other words, without an accurate multiple sequence/template alignment the modeling accuracy cannot be improved by simply using multiple templates to increase alignment coverage. Blindly tested on the CASP9 targets with more than one good template structure, our method outperforms all other CASP9 servers except two (Zhang-Server and QUARK of the same group). Our probabilistic-consistency algorithm can possibly be extended to align multiple protein/RNA sequences and structures.
protein modeling; multiple-template threading; probabilistic alignment matrix; probabilistic-consistency algorithm; multiple sequence/template alignment
Comparative modelling is utilized to predict the 3-dimensional conformation of a given protein (target) based on its sequence
alignment to experimentally determined protein structure (template). The use of such technique is already rewarding and
increasingly widespread in biological research and drug development. The accuracy of the predictions as commonly accepted
depends on the score of sequence identity of the target protein to the template. To assess the relationship between sequence
identity and model quality, we carried out an analysis of a set of 4753 sequence and structure alignments. Throughout this
research, the model accuracy was measured by root mean square deviations of Cα atoms of the target-template structures.
Surprisingly, the results show that sequence identity of the target protein to the template is not a good descriptor to predict
the accuracy of the 3-D structure model. Indeed, in a large number of cases, comparative modelling with lower sequence identity
of target to template proteins led to a more accurate 3-D structure model. As a consequence of this study, we suggest new tips for
improving the quality of comparative models, particularly for models whose target-template sequence identity is below 50%.
comparative modelling; homology modelling; model refinement
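The accuracy measure used above, the root mean square deviation of Cα atoms, can be illustrated with a minimal sketch. It assumes the two structures are already superposed and that the Cα coordinates are given in matching residue order; the function name `ca_rmsd` is hypothetical, and no optimal superposition step is performed here.

```python
import math


def ca_rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) C-alpha coordinates.

    Assumption: both coordinate sets are already superposed and paired
    residue by residue; this sketch does not fit one structure onto the other.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))
```

In practice a superposition method would normally be applied before this step, since the raw deviation depends on the relative orientation of the two structures.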
We developed and tested RAPTOR++ in CASP8 for protein structure prediction. RAPTOR++ contains four modules: threading, model quality assessment, multiple protein alignment and template-free modeling. RAPTOR++ first threads a target protein to all the templates using three methods and then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. Based upon the predicted quality, RAPTOR++ employs different strategies as follows. If multiple alignments have good quality, RAPTOR++ builds a multiple protein alignment between the target and top templates and then generates a 3D model using MODELLER. If all the alignments have very low quality, RAPTOR++ uses template-free modeling. Otherwise, RAPTOR++ submits a threading-generated 3D model with the best quality. RAPTOR++ was not ready for the first third of the targets and was under development during the whole CASP8 season. The template-based and template-free modeling modules in RAPTOR++ are not closely integrated. We are using our template-free modeling technique to refine template-based models.
template-based modeling; template-free modeling; protein threading; model quality assessment
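The three-way strategy described above can be sketched as a small decision function. This is only an illustration: the thresholds `good` and `poor`, the score scale, and the strategy labels are assumptions introduced here, not values reported for RAPTOR++.

```python
def choose_strategy(predicted_qualities, good=0.6, poor=0.3):
    """Pick a modeling strategy from the predicted quality of each alignment.

    `good` and `poor` are hypothetical thresholds on a 0..1 quality scale.
    """
    n_good = sum(1 for q in predicted_qualities if q >= good)
    if n_good >= 2:
        # Several good alignments: build a multiple protein alignment.
        return "multi-template"
    if all(q < poor for q in predicted_qualities):
        # Every alignment looks unreliable: fall back to template-free modeling.
        return "template-free"
    # Otherwise keep the single best threading-generated model.
    return "best-single-threading-model"
```

For example, `choose_strategy([0.8, 0.7, 0.2])` selects the multi-template route, while a list of uniformly low scores selects template-free modeling.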
The prediction of intramolecular contacts has a useful application in predicting the three-dimensional structures of proteins. The accuracy of the template-based contact prediction methods depends on the quality of the template structures. To reduce the false positive predictions associated with using the entire set of template-derived contacts, we develop selection filters that use sequence conservation information to predict subsets of contacts more likely to be structurally conserved between the template and the target. The method is developed specifically for protein families with few available templates such as the G protein-coupled receptor (GPCR) family. It is validated on a test set of 342 template-target pairs from three protein families, and applied to one template-target pair from the GPCR family. We find that the filter selection method increases the accuracy of contact prediction with sufficient coverage for structure prediction.
structural homology; sequence conservation; contact prediction; intramolecular contacts; template-based structure prediction; G protein-coupled receptors
Pair-wise residue-residue contacts in proteins can be predicted from both threading templates and sequence-based machine learning. However, most structure modeling approaches only use the template-based contact predictions in guiding the simulations; this is partly because the sequence-based contact predictions are usually considered to be less accurate than those from threading. With the rapid progress in sequence databases and machine-learning techniques, it is necessary to have a detailed and comprehensive assessment of the contact-prediction methods in different template conditions.
We develop two methods for protein-contact predictions: SVM-SEQ is a sequence-based machine learning approach which trains a variety of sequence-derived features on contact maps; SVM-LOMETS collects consensus contact predictions from multiple threading templates. We test both methods on the same set of 554 proteins which are categorized into ‘Easy’, ‘Medium’, ‘Hard’ and ‘Very Hard’ targets based on the evolutionary and structural distance between templates and targets. For the Easy and Medium targets, SVM-LOMETS obviously outperforms SVM-SEQ; but for the Hard and Very Hard targets, the accuracy of the SVM-SEQ predictions is higher than that of SVM-LOMETS by 12–25%. If we combine the SVM-SEQ and SVM-LOMETS predictions together, the total number of correctly predicted contacts in the Hard proteins will increase by more than 60% (or 70% for the long-range contact with a sequence separation ≥24), compared with SVM-LOMETS alone. The advantage of SVM-SEQ is also shown in the CASP7 free modeling targets where the SVM-SEQ is around four times more accurate than SVM-LOMETS in the long-range contact prediction. These data demonstrate that the state-of-the-art sequence-based contact prediction has reached a level which may be helpful in assisting tertiary structure modeling for the targets which do not have close structure templates. The maximum yield should be obtained by the combination of both sequence- and template-based predictions.
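As a rough illustration of combining the two kinds of prediction, the sketch below merges two contact lists and optionally keeps only long-range contacts. The abstract does not state how the predictions are combined, so the plain set union used here is an assumption, and `combine_contacts` is a hypothetical name.

```python
def combine_contacts(seq_based, template_based, min_separation=0):
    """Merge two lists of predicted contacts given as residue-index pairs.

    Assumption: predictions are combined by a simple union. Pairs are
    normalised so that (i, j) and (j, i) count as the same contact.
    """
    merged = {tuple(sorted(pair)) for pair in seq_based}
    merged |= {tuple(sorted(pair)) for pair in template_based}
    # Keep only contacts whose sequence separation reaches the cutoff.
    return sorted(p for p in merged if p[1] - p[0] >= min_separation)
```

Calling the function with `min_separation=24` restricts the result to the long-range contacts discussed above.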
In the area of protein structure prediction, recently a lot of effort has gone into the development of Model Quality Assessment Programs (MQAPs). MQAPs distinguish high quality protein structure models from inferior models. Here, we propose a new method to use an MQAP to improve the quality of models. With a given target sequence and template structure, we construct a number of different alignments and corresponding models for the sequence. The quality of these models is scored with an MQAP and used to choose the most promising model. An SVM-based selection scheme is suggested for combining MQAP partial potentials, in order to optimize for improved model selection.
The approach has been tested on a representative set of proteins. The ability of the method to improve models was validated by comparing the MQAP-selected structures to the native structures with the model quality evaluation program TM-score. Using the SVM-based model selection, a significant increase in model quality is obtained (as shown with a Wilcoxon signed rank test yielding p-values below 10^-15). The average increase in TM-score is 0.016; the maximum observed increase in TM-score is 0.29.
In template-based protein structure prediction alignment is known to be a bottleneck limiting the overall model quality. Here we show that a combination of systematic alignment variation and modern model scoring functions can significantly improve the quality of alignment-based models.
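The selection step, scoring the models built from alternative alignments and keeping the most promising one, might look like the sketch below. A fixed linear weighting of partial potentials stands in for the trained SVM, and the weights, the score orientation (higher is better) and the function names are all assumptions made for illustration.

```python
def combined_score(partial_potentials, weights):
    """Weighted sum of MQAP partial potentials (stand-in for the SVM output)."""
    if len(partial_potentials) != len(weights):
        raise ValueError("need exactly one weight per partial potential")
    return sum(w * p for w, p in zip(weights, partial_potentials))


def select_model(models, weights):
    """Return the name of the model with the highest combined score.

    `models` maps a model name (one per alternative alignment) to its list
    of partial potentials; higher combined scores are assumed to be better.
    """
    return max(models, key=lambda name: combined_score(models[name], weights))
```

With `models = {"aln1": [1.0, 0.0], "aln2": [0.0, 1.0]}` and weights `[0.2, 0.8]`, the second model is selected because its combined score is higher.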
Comparative modeling is a technique to predict the three dimensional structure of a given protein sequence based primarily on its alignment to one or more proteins with experimentally determined structures. A major bottleneck of current comparative modeling methods is the lack of methods to accurately refine a starting initial model so that it approaches the resolution of the corresponding experimental structure. We investigate the effectiveness of a graph-theoretic clique finding approach to solve this problem.
Our method takes into account the information presented in multiple templates/alignments at the three-dimensional level by mixing and matching regions between different initial comparative models. This method enables us to obtain an optimized conformation ensemble representing the best combination of secondary structures, resulting in the refined models of higher quality. In addition, the process of mixing and matching accumulates near-native conformations, resulting in discriminating the native-like conformation in a more effective manner. In the seventh Critical Assessment of Structure Prediction (CASP7) experiment, the refined models produced are more accurate than the starting initial models.
This novel approach can be applied without any manual intervention to improve the quality of comparative predictions where multiple template/alignment combinations are available for modeling, producing conformational models of higher quality than the starting initial predictions.
Protein structure is more conserved than sequence in nature. In this direction we developed a novel methodology that significantly improves conventional homology modelling when sequence identity is low, by taking into consideration 3D structural features of the template, such as size and shape. Herein, our new homology modelling approach was applied to the homology modelling of the RNA-dependent RNA polymerase (RdRp) of dengue (type II) virus. The RdRp of dengue was chosen due to the low sequence similarity shared between the dengue virus polymerase and the available templates, while purposely avoiding the use of the actual X-ray structure that is available for the dengue RdRp. The novel approach takes advantage of 3D space corresponding to protein shape and size by creating a 3D scaffold of the template structure. The dengue polymerase model built by the novel approach exhibited all features of RNA-dependent RNA polymerases and was almost identical to the X-ray structure of the dengue RdRp, as opposed to the model built by conventional homology modelling. Therefore, we propose that the space-aided homology modelling approach can be of more general use for homology modelling of enzymes sharing low sequence similarity with the template structures.
Understanding the numerous functions that RNAs play in living cells depends critically on knowledge of their three-dimensional structure. Due to the difficulties in experimentally assessing structures of large RNAs, there is currently great demand for new high-resolution structure prediction methods. We present a novel method for the fully automated prediction of RNA 3D structures from a user-defined secondary structure. The concept is founded on a machine translation system. The translation engine operates on the RNA FRABASE database tailored to the dictionary relating the RNA secondary structure and tertiary structure elements. The translation algorithm is very fast. An initial 3D structure is composed within seconds on a single processor. The method assures the prediction of large RNA 3D structures of high quality. Our approach needs neither structural templates nor RNA sequence alignment, required for comparative methods. This enables the building of unresolved yet native and artificial RNA structures. The method is implemented in a publicly available, user-friendly server RNAComposer. It works in an interactive mode and a batch mode. The batch mode is designed for large-scale modelling and accepts atomic distance restraints. Presently, the server is set to build RNA structures of up to 500 residues.
The crystallographic phase problem is the primary bottleneck encountered when attempting to solve macromolecular structures for which no close crystallographic structural homologues are known. Typically, isomorphous “heavy-atom” replacement and/or anomalous dispersion methods must be used in such cases to obtain experimentally-determined phases. Even three-dimensional NMR structures of the same macromolecule are often not sufficient to solve the crystallographic phase problem. RNA crystal structures present additional challenges due to greater difficulty in obtaining suitable heavy-atom derivatives. We present a unique approach to solving the phase problem for novel RNA crystal structures that has enjoyed a reasonable degree of success. This approach involves modeling only those portions of the RNA sequence whose structure can be predicted readily, i.e., the individual A-form helical regions and well-known stem-loop sub-structures. We have found that no prior knowledge of how the helices and other structural elements are arranged with respect to one another in three-dimensional space, or in some cases, even the sequence, is required to obtain a useable solution to the phase problem, using simultaneous molecular replacement of a set of generic helical RNA fragments.
Ribozyme; Crystallographic Phase Problem; Molecular Replacement; RNA Crystallography; RNA Structure Solution
RNA secondary structure prediction methods based on probabilistic modeling can be developed using stochastic context-free grammars (SCFGs). Such methods can readily combine different sources of information that can be expressed probabilistically, such as an evolutionary model of comparative RNA sequence analysis and a biophysical model of structure plausibility. However, the number of free parameters in an integrated model for consensus RNA structure prediction can become untenable if the underlying SCFG design is too complex. Thus a key question is, what small, simple SCFG designs perform best for RNA secondary structure prediction?
Nine different small SCFGs were implemented to explore the tradeoffs between model complexity and prediction accuracy. Each model was tested for single sequence structure prediction accuracy on a benchmark set of RNA secondary structures.
Four SCFG designs had prediction accuracies near the performance of current energy minimization programs. One of these designs, introduced by Knudsen and Hein in their PFOLD algorithm, has only 21 free parameters and is significantly simpler than the others.
Protein threading is widely used in the prediction of protein structure and the subsequent functional annotation. Most threading approaches employ similar criteria for the template identification for use in both protein structure and function modeling. Using structure similarity alone might result in a high false positive rate in protein function inference, which suggests that selecting functional templates should be subject to a different set of constraints. In this study, we extend the functionality of eThread, a recently developed approach to meta-threading, focusing on the optimal selection of functional templates. We optimized the selection of template proteins to cover a broad spectrum of protein molecular function: ligand, metal, inorganic cluster, protein, and nucleic acid binding. In large-scale benchmarks, we demonstrate that the recognition rates in identifying templates that bind molecular partners in similar locations are very high, typically 70–80%, at the expense of a relatively low false positive rate. eThread also provides useful insights into the chemical properties of binding molecules and the structural features of binding. For instance, the sensitivity in recognizing similar protein-binding interfaces is 58% at only 18% false positive rate. Furthermore, in comparative analysis, we demonstrate that meta-threading supported by machine learning outperforms single-threading approaches in functional template selection. We show that meta-threading effectively detects many facets of protein molecular function, even in a low-sequence identity regime. The enhanced version of eThread is freely available as a webserver and stand-alone software at http://www.brylinski.org/ethread.
protein function inference; template-based modeling; protein meta-threading; ligand-binding; metal-binding; iron/sulfur-binding; protein-protein interactions; protein-DNA interactions
For successful protein structure prediction by comparative modeling, in addition to identifying a good template protein with known structure, obtaining an accurate sequence alignment between a query protein and a template protein is critical. It has been known that the alignment accuracy can vary significantly depending on our choice of various alignment parameters such as gap opening penalty and gap extension penalty. Because the accuracy of sequence alignment is typically measured by comparing it with its corresponding structure alignment, there is no good way of evaluating alignment accuracy without knowing the structure of a query protein, which is obviously not available at the time of structure prediction. Moreover, there is no universal alignment parameter option that would always yield the optimal alignment.
In this work, we develop a method to predict the quality of the alignment between a query and a template. We train support vector regression (SVR) models to predict the MaxSub score as a measure of alignment quality. The alignment between a query protein and a template of length n is transformed into an (n + 1)-dimensional feature vector, which is then used as input to the trained SVR model to predict the alignment quality. Performance is evaluated by various measures, including the Pearson correlation coefficient between the observed and predicted MaxSub scores; the results show a high correlation coefficient of 0.945. For a pair of query and template, 48 alignments are generated by changing alignment options. The trained SVR models are then applied to predict the MaxSub scores of these alignments and to select the best alignment option specifically for the query-template pair. This adaptive selection procedure results in a 7.4% improvement in MaxSub scores compared with using the single best parameter option for all query-template pairs.
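The difference between adaptive selection and a single global parameter option can be made concrete with a short sketch. It assumes the predicted MaxSub scores are already available as a nested mapping and that every query-template pair has a score for every option; the function names and data layout are hypothetical, and the SVR itself is not implemented here.

```python
def adaptive_selection(predicted):
    """For each query-template pair, pick the option with the best predicted score.

    `predicted` maps pair -> {alignment option: predicted MaxSub score}.
    """
    return {pair: max(scores, key=scores.get) for pair, scores in predicted.items()}


def single_best_option(predicted):
    """Pick the one option with the highest total predicted score over all pairs."""
    totals = {}
    for scores in predicted.values():
        for option, score in scores.items():
            totals[option] = totals.get(option, 0.0) + score
    return max(totals, key=totals.get)
```

The adaptive variant can choose a different option for each pair, which is where the reported improvement over a fixed option would come from.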
The present work demonstrates that the alignment quality can be predicted with reasonable accuracy. Our method is useful not only for selecting the optimal alignment parameters for a chosen template based on predicted alignment quality, but also for filtering out problematic templates that are not suitable for structure prediction due to poor alignment accuracy. This is implemented as a part in FORECAST, the server for fold-recognition and is freely available on the web at
Motivation: Protein structure prediction is one of the most important problems in structural bioinformatics. Here we describe MULTICOM, a multi-level combination approach to improve the various steps in protein structure prediction. In contrast to those methods which look for the best templates, alignments and models, our approach tries to combine complementary and alternative templates, alignments and models to achieve on average better accuracy.
Results: The multi-level combination approach was implemented via five automated protein structure prediction servers and one human predictor which participated in the eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. The MULTICOM servers and human predictor were consistently ranked among the top predictors on the CASP8 benchmark. The methods can predict moderate- to high-resolution models for most template-based targets and low-resolution models for some template-free targets. The results show that the multi-level combination of complementary templates, alternative alignments and similar models aided by model quality assessment can systematically improve both template-based and template-free protein modeling.
Availability: The MULTICOM server is freely available at http://casp.rnet.missouri.edu/multicom_3d.html
Pseudoknotted structures play important structural and functional roles in RNA cellular functions at the level of transcription, splicing and translation. However, the problem of computational prediction for large pseudoknotted folds remains. Here we develop a domain-based method for predicting complex and large pseudoknotted structures from RNA sequences. The model is based on the observation that large RNAs can be separated into different structural domains. The basic idea is to first identify the domains and then predict the structures for each domain. Assembly of the domain structures gives the full structure. The use of the domain-based approach leads to a reduction of computational time by a factor of about N^2 for an N-nt sequence. As applications of the model, we predict structures for a variety of RNA systems, such as regions in human telomerase RNA (hTR), internal ribosome entry site (IRES) and HIV genome. The lengths of these sequences range from 200 nt to 400 nt. The results show good agreement with the experiments.
hepatitis delta virus (HDV); human immunodeficiency virus (HIV); human telomerase RNA (hTR); internal ribosome entry site (IRES); large RNAs; Pseudoknots; structural predictions
The continuously increasing amount of RNA sequence and experimentally determined 3D structure data drives the development of computational methods supporting exploration of these data. Contemporary functional analysis of RNA molecules, such as ribozymes or riboswitches, covers various issues, among which tertiary structure modeling becomes more and more important. A growing number of tools to model and predict RNA structure calls for an evaluation of these tools and the quality of the outcomes they produce. Thus, the development of reliable methods designed to meet this need is relevant in the context of RNA tertiary structure analysis and can highly influence the quality and usefulness of RNA tertiary structure prediction in the near future. Here, we present RNAlyzer, a computational method for comparison of RNA 3D models with the reference structure and for discrimination between the correct and incorrect models. Our approach is based on the idea of local neighborhood, defined as a set of atoms included in the sphere centered around a user-defined atom. A unique feature of the RNAlyzer is the simultaneous visualization of the model-reference structure distance at different levels of detail, from the individual residues to the entire molecules.
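The local neighborhood defined above, the set of atoms inside a sphere centered on a chosen atom, can be sketched as follows. The data layout (a mapping from atom name to coordinates) and the function name are assumptions for illustration and do not reflect RNAlyzer's actual interface.

```python
import math


def local_neighborhood(atoms, center_name, radius):
    """Return the sorted names of atoms within `radius` of the chosen atom.

    `atoms` maps an atom name to its (x, y, z) coordinates. The center atom
    itself is included, since its distance to the center is zero.
    """
    center = atoms[center_name]
    return sorted(
        name for name, xyz in atoms.items() if math.dist(xyz, center) <= radius
    )
```

Increasing `radius` step by step gives neighborhoods of growing size, which matches the idea of comparing structures at several levels of detail.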
Predicting protein structure from sequence is one of the most significant and challenging problems in bioinformatics. Numerous bioinformatics techniques and tools have been developed to tackle almost every aspect of protein structure prediction ranging from structural feature prediction, template identification and query-template alignment to structure sampling, model quality assessment, and model refinement. How to synergistically select, integrate and improve the strengths of the complementary techniques at each prediction stage and build a high-performance system is becoming a critical issue for constructing a successful, competitive protein structure predictor.
Over the past several years, we have constructed a standalone protein structure prediction system MULTICOM that combines multiple sources of information and complementary methods at all five stages of the protein structure prediction process including template identification, template combination, model generation, model assessment, and model refinement. The system was blindly tested during the ninth Critical Assessment of Techniques for Protein Structure Prediction (CASP9) in 2010 and yielded very good performance. In addition to studying the overall performance on the CASP9 benchmark, we thoroughly investigated the performance and contributions of each component at each stage of prediction.
Our comprehensive and comparative study not only provides useful and practical insights about how to select, improve, and integrate complementary methods to build a cutting-edge protein structure prediction system but also identifies a few new sources of information that may help improve the design of a protein structure prediction system. Several components used in the MULTICOM system are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.
Protein structure prediction; Template identification; Template combination; Model generation; Model assessment; Model combination; Model refinement
Building reliable structural models of G protein-coupled receptors (GPCRs) is a difficult task due to the paucity of suitable templates, low sequence identity, and the wide variety of ligand specificities within the superfamily. Template-based modeling is known to be the most successful method for protein structure prediction. However, refinement of homology models within 1–3 Å Cα RMSD of the native structure remains a major challenge. Here we address this problem by developing a novel protocol (foldGPCR) for modeling the transmembrane (TM) region of GPCRs in complex with a ligand, aimed to accurately model the structural divergence between the template and target in the TM helices. The protocol is based on predicted conserved inter-residue contacts between the template and target, and exploits an all-atom implicit membrane force field. The placement of the ligand in the binding pocket is guided by biochemical data. The foldGPCR protocol is implemented by a stepwise hierarchical approach, in which the TM helical bundle and the ligand are assembled by simulated annealing trials in the first step, and the receptor-ligand complex is refined with replica exchange sampling in the second step. The protocol is applied to model the human β2-adrenergic receptor (β2AR) bound to carazolol, using contacts derived from the template structure of bovine rhodopsin. Comparison to the X-ray crystal structure of the β2AR shows that our protocol is particularly successful in accurately capturing helix backbone irregularities and helix-helix packing interactions that distinguish rhodopsin from β2AR.
class A GPCR; structure prediction; simulated annealing; ligand binding; implicit solvent; membrane protein
Consensus is a server developed to produce high-quality alignments for comparative modeling, and to identify the alignment regions that can be reliably copied from a given template. This is accomplished even when target–template sequence identity is as low as 5%. Combining the output from five different alignment methods, the server produces a consensus alignment, with a reliability measure indicated for each position and a prediction of the regions suitable for modeling. Models built using the server predictions are typically within 3 Å rms deviation of the crystal structure. Users can upload a target protein sequence and specify a template (PDB code); if no template is given, the server will search for one. The method has been validated on a large set of homologous protein structure pairs. The Consensus server should prove useful for modelers for whom the structural reliability of the model is critical in their applications. It is currently available at http://structure.bu.edu/cgi-bin/consensus/consensus.cgi.
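The idea of combining several alignments into one with a per-position reliability can be sketched as a simple majority vote. This is only an illustration under stated assumptions, not the Consensus server's actual algorithm: the dictionary representation, the function name `consensus_alignment`, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter


def consensus_alignment(alignments, min_support=0.6):
    """Majority-vote consensus over several target-template alignments.

    alignments: list of dicts, one per alignment method, each mapping a
    target residue index to the template residue index it is aligned to
    (None, or a missing key, means the target residue is unaligned).

    Returns a dict mapping each target index to a tuple
    (template_index, reliability, suitable_for_modeling), where
    reliability is the fraction of methods that agree on the partner.
    """
    n_methods = len(alignments)
    positions = sorted(set().union(*(a.keys() for a in alignments)))
    result = {}
    for pos in positions:
        # Count how many methods propose each template partner.
        votes = Counter(a.get(pos) for a in alignments)
        partner, count = votes.most_common(1)[0]
        reliability = count / n_methods
        suitable = partner is not None and reliability >= min_support
        result[pos] = (partner, reliability, suitable)
    return result
```

With five input alignments, a position on which four methods agree would receive a reliability of 0.8 and be flagged as suitable at the assumed default threshold, while a position where the most common choice is a gap is never flagged.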
Regulatory antisense RNAs are a class of ncRNAs that regulate gene expression by prohibiting the translation of an mRNA through stable interactions with a target sequence. There is great demand for efficient computational methods to predict the specific interaction between an ncRNA and its target mRNA(s). A number of algorithms in the literature can predict a variety of such interactions, but only at a very high computational cost. Although some existing target prediction approaches are much faster, they are specialized for interactions with a single binding site.
In this paper we present a novel algorithm to accurately predict the minimum free energy structure of RNA-RNA interaction under the most general type of interactions studied in the literature. Moreover, we introduce a fast heuristic method to predict the specific (multiple) binding sites of two interacting RNAs.
We verify the performance of our algorithms for joint structure and binding site prediction on a set of known interacting RNA pairs. Experimental results show our algorithms are highly accurate and outperform all competitive approaches.
Protein synthesis from an RNA template can initiate by two different known mechanisms: cap-dependent translation initiation and cap-independent translation initiation. The latter is driven by RNA sequences called internal ribosome entry sites (IRESs) that are found in both viral RNAs and cellular mRNAs. The diverse mechanisms used by IRESs are reflected in their structural diversity, and this structural diversity challenges us to develop a cohesive model linking IRES function to structure. With more direct structural information available for the viral IRESs, data suggest an inverse correlation between the degree to which an IRES RNA can form a stable structure on its own, and the number of factors that it requires to function. Lessons learned from the viral IRESs may help understand the cellular IRESs, although more structural data are needed before any strong links can be made.
Protein tertiary structure prediction is a fundamental problem in computational biology and identifying the most native-like model from a set of predicted models is a key sub-problem. Consensus methods work well when the redundant models in the set are the most native-like, but fail when the most native-like model is unique. In contrast, structure-based methods score models independently and can be applied to model sets of any size and redundancy level. Additionally, structure-based methods have a variety of important applications including analogous fold recognition, refinement of sequence-structure alignments, and de novo prediction. The purpose of this work was to develop a structure-based model selection method based on predicted structural features that could be applied successfully to any set of models.
Here we introduce SELECTpro, a novel structure-based model selection method derived from an energy function comprising physical, statistical, and predicted structural terms. Novel and unique energy terms include predicted secondary structure, predicted solvent accessibility, predicted contact map, β-strand pairing, and side-chain hydrogen bonding.
SELECTpro participated in the new model quality assessment (QA) category in CASP7, submitted predictions for all 95 targets, and achieved top results. The average difference in GDT-TS between models ranked first by SELECTpro and the most native-like model was 5.07. This GDT-TS difference was less than 1% of the GDT-TS of the most native-like model for 18 targets, and less than 10% for 66 targets. SELECTpro also ranked the single most native-like model first for 15 targets, in the top five for 39 targets, and in the top ten for 53 targets, more often than any other method. Because the ranking metric is skewed by model redundancy and ignores poor models with a better ranking than the most native-like model, the BLUNDER metric is introduced to overcome these limitations. SELECTpro is also evaluated on a recent benchmark set of 16 small proteins with large decoy sets of 12500 to 20000 models for each protein, where it outperforms the benchmarked method (I-TASSER).
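The two per-target quantities reported above, the GDT-TS difference between the top-ranked model and the most native-like model, and the rank assigned to the most native-like model, can be computed as in the following sketch. It assumes that a lower energy means a better predicted model; the function names are hypothetical, and the BLUNDER metric is not reproduced here because its definition is not given in this text.

```python
def selection_loss(energies, gdt_ts):
    """GDT-TS of the most native-like model minus that of the top-ranked model.

    energies: predicted score per model (assumed: lower is better).
    gdt_ts:   true GDT-TS per model, in the same order.
    """
    top = min(range(len(energies)), key=energies.__getitem__)
    return max(gdt_ts) - gdt_ts[top]


def rank_of_best(energies, gdt_ts):
    """1-based rank that the scoring gives to the most native-like model."""
    best = max(range(len(gdt_ts)), key=gdt_ts.__getitem__)
    order = sorted(range(len(energies)), key=energies.__getitem__)
    return order.index(best) + 1


# Toy example with made-up numbers: the second model has the lowest energy,
# but the first model is the most native-like.
energies = [-10.0, -12.0, -8.0]
gdt_ts = [60.0, 55.0, 40.0]
loss = selection_loss(energies, gdt_ts)
rank = rank_of_best(energies, gdt_ts)
```

Averaging `selection_loss` over all targets gives a figure of the same kind as the reported mean difference, and counting the targets with `rank_of_best` equal to 1, at most 5, or at most 10 gives the corresponding top-one, top-five, and top-ten tallies.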
SELECTpro is an effective model selection method that scores models independently and is appropriate for use on any model set. SELECTpro is available for download as a standalone application at: . SELECTpro is also available as a public server at the same site.