We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.
We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All the above process had been performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a, much richer, protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations.
Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
Motivation: Picking peaks from experimental NMR spectra is a key unsolved problem for automated NMR protein structure determination. Such a process is a prerequisite for resonance assignment, nuclear overhauser enhancement (NOE) distance restraint assignment, and structure calculation tasks. Manual or semi-automatic peak picking, which is currently the prominent way used in NMR labs, is tedious, time consuming and costly.
Results: We introduce new ideas, including noise-level estimation, component forming and sub-division, singular value decomposition (SVD)-based peak picking and peak pruning and refinement. PICKY is developed as an automated peak picking method. Different from the previous research on peak picking, we provide a systematic study of the proposed method. PICKY is tested on 32 real 2D and 3D spectra of eight target proteins, and achieves an average of 88% recall and 74% precision. PICKY is efficient. It takes PICKY on average 15.7 s to process an NMR spectrum. More important than these numbers, PICKY actually works in practice. We feed peak lists generated by PICKY to IPASS for resonance assignment, feed IPASS assignment to SPARTA for fragments generation, and feed SPARTA fragments to FALCON for structure calculation. This results in high-resolution structures of several proteins, for example, TM1112, at 1.25 Å.
Availability: PICKY is available upon request. The peak lists of PICKY can be easily loaded by SPARKY to enable a better interactive strategy for rapid peak picking.
Finding structural similarities between proteins often helps revealing shared functionality which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures remains however a hard problem. An alternative approach is to approximate each protein 3D structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences, or structural sequences. In this paper, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments of size 4 residues. We show that the global structural sequences approximate well the native structures of proteins, with an average cRMS of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Without surprise, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary structure sequence, but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.
Protein structure; Structural alphabet; Structure classification; Protein sequence comparison; Sequence feature space
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at ]
Motivation: The next generation sequencing technologies are generating billions of short reads daily. Resequencing and personalized medicine need much faster software to map these deep sequencing reads to a reference genome, to identify SNPs or rare transcripts.
Results: We present a framework for how full sensitivity mapping can be done in the most efficient way, via spaced seeds. Using the framework, we have developed software called ZOOM, which is able to map the Illumina/Solexa reads of 15× coverage of a human genome to the reference human genome in one CPU-day, allowing two mismatches, at full sensitivity.
Availability: ZOOM is freely available to non-commercial users at http://www.bioinfor.com/zoom
Contact: firstname.lastname@example.org, email@example.com
Motivation: Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet.
Results: We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance mAtrix aLIgnment database. We combined several metrics from information retrieval popular in the literature: mean precision, area under the Receiver Operating Characteristic curve and recall at a fixed error rate and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low-sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches.
Availability: A table of results as well as the substitution matrices and residue groupings from this study can be downloaded from http://www.rpgroup.caltech.edu/publications/supplements/alphabets.
Supplementary information: Supplementary data are available at Bioinformatics online.
Recent work has shown that the accuracy of ab initio structure prediction can be significantly improved by integrating evolutionary information in form of intra-protein residue-residue contacts. Following this seminal result, much effort is put into the improvement of contact predictions. However, there is also a substantial need to develop structure prediction protocols tailored to the type of restraints gained by contact predictions. Here, we present a structure prediction protocol that combines evolutionary information with the resolution-adapted structural recombination approach of Rosetta, called RASREC. Compared to the classic Rosetta ab initio protocol, RASREC achieves improved sampling, better convergence and higher robustness against incorrect distance restraints, making it the ideal sampling strategy for the stated problem. To demonstrate the accuracy of our protocol, we tested the approach on a diverse set of 28 globular proteins. Our method is able to converge for 26 out of the 28 targets and improves the average TM-score of the entire benchmark set from 0.55 to 0.72 when compared to the top ranked models obtained by the EVFold web server using identical contact predictions. Using a smaller benchmark, we furthermore show that the prediction accuracy of our method is only slightly reduced when the contact prediction accuracy is comparatively low. This observation is of special interest for protein sequences that only have a limited number of homologs.
Recently, a breakthrough has been achieved in modeling the atomic 3D structures of proteins from their sequence alone without requiring any experimental work on the protein itself. To achieve this goal, a database of evolutionary related sequences is analyzed to find co-evolving residues, giving insight into which residues are in close proximity to each other. These residue-residue contacts can help to drive a computer simulation with an atomic-scale physical model of the protein structure from a random starting conformation to a native-like 3D conformation. Although much effort is being put into the improvement of residue-residue contact predictions, their accuracy will always be limited. Therefore, structure prediction protocols with a high tolerance against incorrect distance restraints are needed. Here, we present a structure prediction protocol that combines evolutionary information with the iterative sampling approach of the molecular modeling suite Rosetta, called RASREC. RASREC has been shown to converge faster to near-native models and to be more robust against incorrect distance restraints than standard prediction protocols. It is therefore perfectly suited for restraints obtained from predicted residue-residue contacts with limited accuracy. We show that our protocol outperforms other currently published structure prediction methods and is able to achieve accurate structures, even if the accuracy of predicted contacts is low.
Predicting accurate fragments from sequence has recently become a critical step for protein structure modeling, as protein fragment assembly techniques are presently among the most efficient approaches for de novo prediction. A key step in these approaches is, given the sequence of a protein to model, the identification of relevant fragments - candidate fragments - from a collection of the available 3D structures. These fragments can then be assembled to produce a model of the complete structure of the protein of interest. The search for candidate fragments is classically achieved by considering local sequence similarity using profile comparison, or threading approaches. In the present study, we introduce a new profile comparison approach that, instead of using amino acid profiles, is based on the use of predicted structural alphabet profiles, where structural alphabet profiles contain information related to the 3D local shapes associated with the sequences. We show that structural alphabet profile-profile comparison can be used efficiently to retrieve accurate structural fragments, and we introduce a fully new protocol for the detection of candidate fragments. It identifies fragments specific of each position of the sequence and of size varying between 6 and 27 amino-acids. We find it outperforms present state of the art approaches in terms (i) of the accuracy of the fragments identified, (ii) the rate of true positives identified, while having a high coverage score. We illustrate the relevance of the approach on complete target sets of the two previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) rounds 9 and 10. A web server for the approach is freely available at http://bioserv.rpbs.univ-paris-diderot.fr/SAFrag.
The hierarchical and partially redundant nature of protein structures justifies the definition of frequently occurring conformations of short fragments as 'states'. Collections of selected representatives for these states define Structural Alphabets, describing the most typical local conformations within protein structures. These alphabets form a bridge between the string-oriented methods of sequence analysis and the coordinate-oriented methods of protein structure analysis.
A Structural Alphabet has been derived by clustering all four-residue fragments of a high-resolution subset of the protein data bank and extracting the high-density states as representative conformational states. Each fragment is uniquely defined by a set of three independent angles corresponding to its degrees of freedom, capturing in simple and intuitive terms the properties of the conformational space. The fragments of the Structural Alphabet are equivalent to the conformational attractors and therefore yield a most informative encoding of proteins. Proteins can be reconstructed within the experimental uncertainty in structure determination and ensembles of structures can be encoded with accuracy and robustness.
The density-based Structural Alphabet provides a novel tool to describe local conformations and it is specifically suitable for application in studies of protein dynamics.
Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we called Protein Blocks (PBs). They allow an accurate description of every region of 3D protein backbones and have been used in local structure prediction. In the present study, we use this structural alphabet to analyze and predict the loops connecting two repetitive structures.
We first analyzed the secondary structure assignments. Use of five different assignment methods (DSSP, DEFINE, PCURVE, STRIDE and PSEA) showed the absence of consensus: 20% of the residues were assigned to different states. The discrepancies were particularly important at the extremities of the repetitive structures. We used PBs to describe and predict the short loops because they can help analyze and in part explain these discrepancies. An analysis of the PB distribution in these regions showed some specificities in the sequence-structure relationship. Of the amino acid over- or under-representations observed in the short loop databank, 20% did not appear in the entire databank. Finally, predicting 3D structure in terms of PBs with a Bayesian approach yielded an accuracy rate of 36.0% for all loops and 41.2% for the short loops. Specific learning in the short loops increased the latter by 1%.
This work highlights the difficulties of assigning repetitive structures and the advantages of using more precise descriptions, that is, PBs. We observed some new amino acid distributions in the short loops and used this information to enhance local prediction. Instead of describing entire loops, our approach predicts each position in the loops locally. It can thus be used to propose many different structures for the loops and to probe and sample their flexibility. It can be a useful tool in ab initio loop prediction.
Motivation: Inferring structural dependencies among a protein’s side chains helps us understand their coupled motions. It is known that coupled fluctuations can reveal pathways of communication used for information propagation in a molecule. Side-chain conformations are commonly represented by multivariate angular variables, but existing partial correlation methods that can be applied to this inference task are not capable of handling multivariate angular data. We propose a novel method to infer direct couplings from this type of data, and show that this method is useful for identifying functional regions and their interactions in allosteric proteins.
Results: We developed a novel extension of canonical correlation analysis (CCA), which we call ‘kernelized partial CCA’ (or simply KPCCA), and used it to infer direct couplings between side chains, while disentangling these couplings from indirect ones. Using the conformational information and fluctuations of the inactive structure alone for allosteric proteins in the Ras and other Ras-like families, our method identified allosterically important residues not only as strongly coupled ones but also in densely connected regions of the interaction graph formed by the inferred couplings. Our results were in good agreement with other empirical findings. By studying distinct members of the Ras, Rho and Rab sub-families, we show further that KPCCA was capable of inferring common allosteric characteristics in the small G protein super-family.
Availability and implementation:
As modeling of changes in backbone conformation still lacks a computationally efficient solution, we developed a discretisation of the conformational states accessible to the protein backbone similar to the successful rotamer approach in side chains. The BriX fragment database, consisting of fragments from 4 to 14 residues long, was realized through identification of recurrent backbone fragments from a non-redundant set of high-resolution protein structures. BriX contains an alphabet of more than 1,000 frequently observed conformations per peptide length for 6 different variation levels. Analysis of the performance of BriX revealed an average structural coverage of protein structures of more than 99% within a root mean square distance (RMSD) of 1 Angstrom. Globally, we are able to reconstruct protein structures with an average accuracy of 0.48 Angstrom RMSD. As expected, regular structures are well covered, but, interestingly, many loop regions that appear irregular at first glance are also found to form a recurrent structural motif, albeit with lower frequency of occurrence than regular secondary structures. Larger loop regions could be completely reconstructed from smaller recurrent elements, between 4 and 8 residues long. Finally, we observed that a significant amount of short sequences tend to display strong structural ambiguity between alpha helix and extended conformations. When the sequence length increases, this so-called sequence plasticity is no longer observed, illustrating the context dependency of polypeptide structures.
Large-scale DNA sequencing efforts produce large amounts of protein sequence data. However, in order to understand the function of a protein, its tertiary three-dimensional structure is required. Despite worldwide efforts in structural biology, experimental protein structures are determined at a significantly slower pace. As a result, computational methods for protein structure prediction receive significant attention. A large part of the structure prediction problem lies in the enormous size of the problem: proteins seem to occur in an infinite variety of shapes. Here, we propose that this huge complexity may be overcome by identifying recurrent protein fragments, which are frequently reused as building blocks to construct proteins that were hitherto thought to be unrelated. The BriX database is the outcome of identifying about 2,000 canonical shapes among 1,261 protein structures. We show any given protein can be reconstructed from this library of building blocks at a very high resolution, suggesting that the modelling of protein backbones may be greatly aided by our database.
Structural similarities among proteins can provide valuable insight into their functional mechanisms and relationships. As the number of available three-dimensional (3D) protein structures increases, a greater variety of studies can be conducted with increasing efficiency, among which is the design of protein structural alphabets. Structural alphabets allow us to characterize local structures of proteins and describe the global folding structure of a protein using a one-dimensional (1D) sequence. Thus, 1D sequences can be used to identify structural similarities among proteins using standard sequence alignment tools such as BLAST or FASTA.
We used self-organizing maps in combination with a minimum spanning tree algorithm to determine the optimum size of a structural alphabet and applied the k-means algorithm to group protein fragnts into clusters. The centroids of these clusters defined the structural alphabet. We also developed a flexible matrix training system to build a substitution matrix (TRISUM-169) for our alphabet. Based on FASTA and using TRISUM-169 as the substitution matrix, we developed the SA-FAST alignment tool. We compared the performance of SA-FAST with that of various search tools in database-scale search tasks and found that SA-FAST was highly competitive in all tests conducted. Further, we evaluated the performance of our structural alphabet in recognizing specific structural domains of EGF and EGF-like proteins. Our method successfully recovered more EGF sub-domains using our structural alphabet than when using other structural alphabets. SA-FAST can be found at .
The goal of this project was two-fold. First, we wanted to introduce a modular design pipeline to those who have been working with structural alphabets. Secondly, we wanted to open the door to researchers who have done substantial work in biological sequences but have yet to enter the field of protein structure research. Our experiments showed that by transforming the structural representations from 3D to 1D, several 1D-based tools can be applied to structural analysis, including similarity searches and structural motif finding.
The effectiveness of sequence alignment in detecting structural homology among protein sequences decreases markedly when pairwise sequence identity is low (the so-called “twilight zone” problem of sequence alignment). Alternative sequence comparison strategies able to detect structural kinship among highly divergent sequences are necessary to address this need. Among them are alignment-free methods, which use global sequence properties (such as amino acid composition) to identify structural homology in a rapid and straightforward way. We explore the viability of using tetramer sequence fragment composition profiles in finding structural relationships that lie undetected by traditional alignment. We establish a strategy to recast any given protein sequence into a tetramer sequence fragment composition profile, using a series of amino acid clustering steps that have been optimized for mutual information. Our method has the effect of compressing the set of 160,000 unique tetramers (if using the 20-letter amino acid alphabet) into a more tractable number of reduced tetramers (around 15 to 30), so that a meaningful tetramer composition profile can be constructed. We test remote homology detection at the topology and fold superfamily levels using a comprehensive set of fold homologs, culled from the CATH database, that share low pairwise sequence similarity. Using the receiver operating characteristic (ROC) measure, we demonstrate potentially significant improvement in using information-optimized reduced tetramer composition, over methods relying only on the raw amino acid composition or on traditional sequence alignment, in homology detection at or below the “twilight zone”.
Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding site, catalytic pocket... However, the description of protein loops with conventional tools is an uneasy task. Regular secondary structures, helices and strands, have been widely studied whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied.
We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA allows the simplification of a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called structural letter. The difficult task of the structural grouping of huge data sets is thus easily accomplished by handling structural letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to the structural-letter sequence, named structural word. This approach permits a systematic analysis of loops of all sizes since we consider the structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop-lengths are covered by only 3310 highly recurrent structural words out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of recurrent motifs exhibit a significant level of amino-acid conservation with at least four significant positions and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints.
We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, it is the first time that pattern mining helps to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might permit to decrease the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
Knowledge of all residue-residue contacts within a protein allows determination of the protein fold. Accurate prediction of even a subset of long-range contacts (contacts between amino acids far apart in sequence) can be instrumental for determining tertiary structure. Here we present BCL::Contact, a novel contact prediction method that utilizes artificial neural networks (ANNs) and specializes in the prediction of medium to long-range contacts. BCL::Contact comes in two modes: sequence-based and structure-based. The sequence-based mode uses only sequence information and has individual ANNs specialized for helix-helix, helix-strand, strand-helix, strand-strand, and sheet-sheet contacts. The structure-based mode combines results from 32-fold recognition methods with sequence information to a consensus prediction. The two methods were presented in the 6th and 7th Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments. The present work focuses on elucidating the impact of fold recognition results onto contact prediction via a direct comparison of both methods on a joined benchmark set of proteins. The sequence-based mode predicted contacts with 42% accuracy (7% false positive rate), while the structure-based mode achieved 45% accuracy (2% false positive rate). Predictions by both modes of BCL::Contact were supplied as input to the protein tertiary structure prediction program Rosetta for a benchmark of 17 proteins with no close sequence homologs in the protein data bank (PDB). Rosetta created higher accuracy models, signified by an improvement of 1.3 Å on average root mean square deviation (RMSD), when driven by the predicted contacts. Further, filtering Rosetta models by agreement with the predicted contacts enriches for native-like fold topologies.
CASP; computational structural biology; contact prediction; structure prediction
iPARTS is an improved web server for aligning two RNA 3D structures based on a structural alphabet (SA)-based approach. In particular, we first derive a Ramachandran-like diagram of RNAs by plotting nucleotides on a 2D axis using their two pseudo-torsion angles η and θ. Next, we apply the affinity propagation clustering algorithm to this η-θ plot to obtain an SA of 23-nt conformations. We finally use this SA to transform RNA 3D structures into 1D sequences of SA letters and continue to utilize classical sequence alignment methods to compare these 1D SA-encoded sequences and determine their structural similarities. iPARTS takes as input two RNA 3D structures in the PDB format and outputs their global alignment (for determining overall structural similarity), semiglobal alignments (for detecting structural motifs or substructures), local alignments (for finding locally similar substructures) and normalized local structural alignments (for identifying more similar local substructures without non-similar internal fragments), with graphical display that allows the user to visually view, rotate and enlarge the superposition of aligned RNA 3D structures. iPARTS is now available online at http://bioalgorithm.life.nctu.edu.tw/iPARTS/.
The contemporary proteinogenic repertoire contains 20 amino acids with diverse functional groups and side chain geometries. Primordial proteins, in contrast, were presumably constructed from a subset of these building blocks. Subsequent expansion of the proteinogenic alphabet would have enhanced their capabilities, fostering the metabolic prowess and organismal fitness of early living systems. While the addition of amino acids bearing innovative functional groups directly enhances the chemical repertoire of proteomes, the inclusion of chemically redundant monomers is difficult to rationalize. Here, we studied how a simplified chorismate mutase evolves upon expanding its amino acid alphabet from nine to potentially 20 letters. Continuous evolution provided an enhanced enzyme variant that has only two point mutations, both of which extend the alphabet and jointly improve protein stability by >4 kcal/mol and catalytic activity tenfold. The same, seemingly innocuous substitutions (Ile→Thr, Leu→Val) occurred in several independent evolutionary trajectories. The increase in fitness they confer indicates that building blocks with very similar side chain structures are highly beneficial for fine-tuning protein structure and function.
Proteins are linear polymers of a set of typically 20 different amino acid building blocks. The amino acid sequence—encoded by a genetic template—directs the folding of newly synthesized proteins into compact 3D structures and dictates the function of the protein product. Monomers containing distinct physico-chemical properties and geometries allow the formation of highly sophisticated architectures, and diverse functional groups enable enzymes to catalyze a plethora of chemical transformations. Nevertheless, the biochemical rationale for the exact composition (and particularly the redundancy) of the proteinogenic amino acid alphabet, which contains multiple building blocks that are chemically similar, remains enigmatic. By subjecting a simplified enzyme—constructed from only nine different amino acids—to directed evolution, we were able to investigate the impact of amino acid diversity on protein function. The most prolific variant selected in the course of the experiments expanded its amino acid alphabet, albeit through two surprisingly subtle mutations (isoleucine to threonine and leucine to valine). The mutations improve both stability and catalytic activity of the enzyme, thereby demonstrating that the presence of structurally similar amino acids specified by the genetic code is highly beneficial for protein fitness.
Conventional NMR structure determination requires nearly complete assignment of the cross peaks of a refined NOESY peak list. Depending on the size of the protein and quality of the spectral data, this can be a time-consuming manual process requiring several rounds of peak list refinement and structure determination. Programs such as Aria, CYANA, and AutoStructure can generate models using unassigned NOESY data but are very sensitive to the quality of the input peak lists and can converge to inaccurate structures if the signal-to-noise of the peak lists is low. Here, we show that models with high accuracy and reliability can be produced by combining the strengths of the high-resolution structure prediction program Rosetta with global measures of the agreement between structure models and experimental data. A first round of models generated using CS-Rosetta (Rosetta supplemented with backbone chemical shift information) are filtered on the basis of their goodness-of-fit with unassigned NOESY peak lists using the DP-score, and the best fitting models are subjected to high resolution refinement with the Rosetta rebuild-and-refine protocol. This hybrid approach uses both local backbone chemical shift and the unassigned NOESY data to direct Rosetta trajectories toward the native structure and produces more accurate models than AutoStructure/CYANA or CS-Rosetta alone, particularly when using raw unedited NOESY peak lists. We also show that when accurate manually refined NOESY peak lists are available, Rosetta refinement can consistently increase the accuracy of models generated using CYANA and AutoStructure.
Motivation. To assist efforts in determining and exploring the functional properties of proteins, it is desirable to characterize and predict protein flexibilities. Results. In this study, the conformational entropy is used as an indicator of the protein flexibility. We first explore whether the conformational change can capture the protein flexibility. The well-defined decoy structures are converted into one-dimensional series of letters from a structural alphabet. Four different structure alphabets, including the secondary structure in 3-class and 8-class, the PB structure alphabet (16-letter), and the DW structure alphabet (28-letter), are investigated. The conformational entropy is then calculated from the structure alphabet letters. Some of the proteins show high correlation between the conformation entropy and the protein flexibility. We then predict the protein flexibility from basic amino acid sequence. The local structures are predicted by the dual-layer model and the conformational entropy of the predicted class distribution is then calculated. The results show that the conformational entropy is a good indicator of the protein flexibility, but false positives remain a problem. The DW structure alphabet performs the best, which means that more subtle local structures can be captured by large number of structure alphabet letters. Overall this study provides a simple and efficient method for the characterization and prediction of the protein flexibility.
gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms.
we sought the role of mRNA secondary structures and their information contents for five vertebrate and plant splice site datasets. We selected 900-nucleotide sequences centered at each (real or decoy) donor and acceptor sites, and predicted their corresponding RNA structures by Vienna software. Then, based on whether the nucleotide is in a stem or not, the conventional four-letter nucleotide alphabet was translated into an eight-letter alphabet. Zero-, first- and second-order Markov models were selected as the signal detection methods. It is shown that applying the eight-letter alphabet compared to the four-letter alphabet considerably increases the accuracy of both donor and acceptor site predictions in case of higher order Markov models.
Our results imply that RNA structure contains important data and future gene prediction programs can take advantage of such information.
Motivation: Predictions of protein local structure, derived from sequence alignment information alone, provide visualization tools for biologists to evaluate the importance of amino acid residue positions of interest in the absence of X-ray crystal/NMR structures or homology models. They are also useful as inputs to sequence analysis and modeling tools, such as hidden Markov models (HMMs), which can be used to search for homology in databases of known protein structure. In addition, local structure predictions can be used as a component of cost functions in genetic algorithms that predict protein tertiary structure. We have developed a program (predict-2nd) that trains multilayer neural networks and have applied it to numerous local structure alphabets, tuning network parameters such as the number of layers, the number of units in each layer and the window sizes of each layer. We have had the most success with four-layer networks, with gradually increasing window sizes at each layer.
Results: Because the four-layer neural nets occasionally get trapped in poor local optima, our training protocol now uses many different random starts, with short training runs, followed by more training on the best performing networks from the short runs. One recent addition to the program is the option to add a guide sequence to the profile inputs, increasing the number of inputs per position by 20. We find that use of a guide sequence provides a small but consistent improvement in the predictions for several different local-structure alphabets.
Availability: Local structure prediction with the methods described here is available for use online at http://www.soe.ucsc.edu/compbio/SAM_T08/T08-query.html. The source code and example networks for PREDICT-2ND are available at http://www.soe.ucsc.edu/~karplus/predict-2nd/ A required C++ library is available at http://www.soe.ucsc.edu/~karplus/ultimate/
Supplementary information: Supplementary data are available at Bioinformatics online.
Designing a protein sequence that will fold into a predefined structure is of both practical and fundamental interest. Many successful, computational designs in the last decade resulted from improved understanding of hydrophobic and polar interactions between side chains of amino acid residues in stabilizing protein tertiary structures. However, the coupling between main-chain backbone structure and local sequence has yet to be fully addressed. Here, we attempt to account for such coupling by using a sequence profile derived from the sequences of five residue fragments in a fragment library that are structurally matched to the five-residue segments contained in a target structure. We further introduced a term to reduce low complexity regions of designed sequences. These two terms together with optimized reference states for amino-acid residues were implemented in the RosettaDesign program. The new method, called RosettaDesign-SR, makes a 12% increase (from 34 to 46%) in fraction of proteins whose designed sequences are more than 35% identical to wild-type sequences. Meanwhile, it reduces 8% (from 22% to 14%) to the number of designed sequences that are not homologous to any known protein sequences according to psi-blast. More importantly, the sequences designed by RosettaDesign-SR have 2–3% more polar residues at the surface and core regions of proteins and these surface and core polar residues have about 4% higher sequence identity to wild-type sequences than by RosettaDesign. Thus, the proteins designed by RosettaDesign-SR should be less likely to aggregate and more likely to have unique structures due to more specific polar interactions.
protein design; structure-derived sequence profile; RosettaDesign; local sequence-structure coupling
The prediction of the structures of proteins without detectable sequence similarity to any protein of known structure remains an outstanding scientific challenge. Here we report significant progress in this area. We first describe de novo blind structure predictions of unprecendented accuracy we made for two proteins in large families in the recent CASP11 blind test of protein structure prediction methods by incorporating residue–residue co-evolution information in the Rosetta structure prediction program. We then describe the use of this method to generate structure models for 58 of the 121 large protein families in prokaryotes for which three-dimensional structures are not available. These models, which are posted online for public access, provide structural information for the over 400,000 proteins belonging to the 58 families and suggest hypotheses about mechanism for the subset for which the function is known, and hypotheses about function for the remainder.
Proteins are long chains made up of small building blocks called amino acids. These chains fold up in various ways to form the three-dimensional structures that proteins need to be able work properly. Therefore, to understand how a protein works it is important to determine its structure, but this is very challenging. It is possible to predict the structure of a protein with high accuracy if previous experiments have revealed the structure of a similar protein. However, for almost half of all known families of proteins, there are currently no members whose structures have been solved.
The three-dimensional shape of a protein is determined by interactions between various amino acids. During evolution, the structure and activity of proteins often remain the same across species, even if the amino acid sequences have changed. This is because pairs of amino acids that interact with each other tend to ‘co-evolve’; that is, if one amino acid changes, then the second amino acid also changes in order to accommodate it. By identifying these pairs of co-evolving amino acids, it is possible to work out which amino acids are close to each other in the three-dimensional structure of the protein. This information can be used to generate a structural model of a protein using computational methods.
Now, Ovchinnikov et al. developed a new method to predict the structures of proteins that combines data on the co-evolution of amino acids, as identified by GREMLIN with the structural prediction software called Rosetta. A community-wide experiment called CASP—which tests different methods of protein prediction—showed that, in two cases, this new method works much better than anything previously used to predict the structures of proteins. Ovchinnikov et al. then used this method to make models for proteins belonging to 58 different protein families with currently unknown structures.
These predictions were found to be highly accurate and the protein families each have thousands of members, so Ovchinnikov et al.'s findings are expected to be useful to researchers in a wide variety of research areas. A future challenge is to extend these methods to the many protein families that have hundreds rather than thousands of members.
co-evolution; structure prediction; protein fold; protein family; B. subtilis; E. coli; H. salinarum; S. solfataricus
Many antibody crystal structures have been solved. Structural modeling programs have been developed that utilize this information to predict 3-D structures of an antibody based upon its sequence. Because of the problem of self-reference, the accuracy and utility of these predictions can only be tested when a new structure has not yet been deposited in the Protein Data Bank.
We have solved the crystal structure of the Fab fragment of RAC18, a protective anti-ricin mAb, to 1.9 Å resolution. We have also modeled the Fv structure of RAC18 using publicly available Ab modeling tools Prediction of Immunoglobulin Structures (PIGS), RosettaAntibody, and Web Antibody Modeling (WAM). The model structures underwent energy minimization. We compared results to the crystal structure on the basis of root-mean-square deviation (RMSD), template modeling score (TM-score), Z-score, and MolProbity analysis.
The crystal structure showed a pocket formed mainly by AA residues in each of the heavy chain complementarity determining regions (CDRs). There were differences between the crystal structure and structures predicted by the modeling tools, particularly in the CDRs. There were also differences among the predicted models, although the differences were small and within experimental error. No one modeling program was clearly superior to the others. In some cases, choosing structures based only on sequence homology to the crystallized Ab yielded RMSDs comparable to the models.
Molecular modeling programs accurately predict the structure of most regions of antibody variable domains of RAC18. The hypervariable CDRs proved most difficult to model, particularly H chain CDR3. Because CDR3 is most often involved in contact with antigen, this defect must be considered when using models to identify potential contacts between antibody and antigen. Because this study represents only a single case, the results cannot be generalized. Rather they highlight the utility and limitations of modeling programs.