We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.
We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to that obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets taken from the literature or human-designed, outperforming them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All the above process had been performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a, much richer, protein representation based on evolutionary information for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even if they were obtained using a simple representation, are also able to capture the crucial information needed for state-of-the-art protein representations.
Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at ]
Predicting accurate fragments from sequence has recently become a critical step for protein structure modeling, as protein fragment assembly techniques are presently among the most efficient approaches for de novo prediction. A key step in these approaches is, given the sequence of a protein to model, the identification of relevant fragments - candidate fragments - from a collection of the available 3D structures. These fragments can then be assembled to produce a model of the complete structure of the protein of interest. The search for candidate fragments is classically achieved by considering local sequence similarity using profile comparison, or threading approaches. In the present study, we introduce a new profile comparison approach that, instead of using amino acid profiles, is based on the use of predicted structural alphabet profiles, where structural alphabet profiles contain information related to the 3D local shapes associated with the sequences. We show that structural alphabet profile-profile comparison can be used efficiently to retrieve accurate structural fragments, and we introduce a fully new protocol for the detection of candidate fragments. It identifies fragments specific of each position of the sequence and of size varying between 6 and 27 amino-acids. We find it outperforms present state of the art approaches in terms (i) of the accuracy of the fragments identified, (ii) the rate of true positives identified, while having a high coverage score. We illustrate the relevance of the approach on complete target sets of the two previous Critical Assessment of Techniques for Protein Structure Prediction (CASP) rounds 9 and 10. A web server for the approach is freely available at http://bioserv.rpbs.univ-paris-diderot.fr/SAFrag.
Finding structural similarities between proteins often helps revealing shared functionality which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures remains however a hard problem. An alternative approach is to approximate each protein 3D structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences, or structural sequences. In this paper, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments of size 4 residues. We show that the global structural sequences approximate well the native structures of proteins, with an average cRMS of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Without surprise, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary structure sequence, but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.
Protein structure; Structural alphabet; Structure classification; Protein sequence comparison; Sequence feature space
The hierarchical and partially redundant nature of protein structures justifies the definition of frequently occurring conformations of short fragments as 'states'. Collections of selected representatives for these states define Structural Alphabets, describing the most typical local conformations within protein structures. These alphabets form a bridge between the string-oriented methods of sequence analysis and the coordinate-oriented methods of protein structure analysis.
A Structural Alphabet has been derived by clustering all four-residue fragments of a high-resolution subset of the protein data bank and extracting the high-density states as representative conformational states. Each fragment is uniquely defined by a set of three independent angles corresponding to its degrees of freedom, capturing in simple and intuitive terms the properties of the conformational space. The fragments of the Structural Alphabet are equivalent to the conformational attractors and therefore yield a most informative encoding of proteins. Proteins can be reconstructed within the experimental uncertainty in structure determination and ensembles of structures can be encoded with accuracy and robustness.
The density-based Structural Alphabet provides a novel tool to describe local conformations and it is specifically suitable for application in studies of protein dynamics.
With the immense growth in the number of available protein structures, fast and accurate structure comparison has been essential. We propose an efficient method for structure comparison, based on a structural alphabet. Protein Blocks (PBs) is a widely used structural alphabet with 16 pentapeptide conformations that can fairly approximate a complete protein chain. Thus a 3D structure can be translated into a 1D sequence of PBs. With a simple Needleman–Wunsch approach and a raw PB substitution matrix, PB-based structural alignments were better than many popular methods. iPBA web server presents an improved alignment approach using (i) specialized PB Substitution Matrices (SM) and (ii) anchor-based alignment methodology. With these developments, the quality of ∼88% of alignments was improved. iPBA alignments were also better than DALI, MUSTANG and GANGSTA+ in >80% of the cases. The webserver is designed to for both pairwise comparisons and database searches. Outputs are given as sequence alignment and superposed 3D structures displayed using PyMol and Jmol. A local alignment option for detecting subs-structural similarity is also embedded. As a fast and efficient ‘sequence-based’ structure comparison tool, we believe that it will be quite useful to the scientific community. iPBA can be accessed at http://www.dsimb.inserm.fr/dsimb_tools/ipba/.
Motivation: Predictions of protein local structure, derived from sequence alignment information alone, provide visualization tools for biologists to evaluate the importance of amino acid residue positions of interest in the absence of X-ray crystal/NMR structures or homology models. They are also useful as inputs to sequence analysis and modeling tools, such as hidden Markov models (HMMs), which can be used to search for homology in databases of known protein structure. In addition, local structure predictions can be used as a component of cost functions in genetic algorithms that predict protein tertiary structure. We have developed a program (predict-2nd) that trains multilayer neural networks and have applied it to numerous local structure alphabets, tuning network parameters such as the number of layers, the number of units in each layer and the window sizes of each layer. We have had the most success with four-layer networks, with gradually increasing window sizes at each layer.
Results: Because the four-layer neural nets occasionally get trapped in poor local optima, our training protocol now uses many different random starts, with short training runs, followed by more training on the best performing networks from the short runs. One recent addition to the program is the option to add a guide sequence to the profile inputs, increasing the number of inputs per position by 20. We find that use of a guide sequence provides a small but consistent improvement in the predictions for several different local-structure alphabets.
Availability: Local structure prediction with the methods described here is available for use online at http://www.soe.ucsc.edu/compbio/SAM_T08/T08-query.html. The source code and example networks for PREDICT-2ND are available at http://www.soe.ucsc.edu/~karplus/predict-2nd/ A required C++ library is available at http://www.soe.ucsc.edu/~karplus/ultimate/
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Many proteins with vastly dissimilar sequences are found to share a common fold, as evidenced in the wealth of structures now available in the Protein Data Bank. One idea that has found success in various applications is the concept of a reduced amino acid alphabet, wherein similar amino acids are clustered together. Given the structural similarity exhibited by many apparently dissimilar sequences, we undertook this study looking for improvements in fold recognition by comparing protein sequences written in a reduced alphabet.
Results: We tested over 150 of the amino acid clustering schemes proposed in the literature with all-versus-all pairwise sequence alignments of sequences in the Distance mAtrix aLIgnment database. We combined several metrics from information retrieval popular in the literature: mean precision, area under the Receiver Operating Characteristic curve and recall at a fixed error rate and found that, in contrast to previous work, reduced alphabets in many cases outperform full alphabets. We find that reduced alphabets can perform at a level comparable to full alphabets in correct pairwise alignment of sequences and can show increased sensitivity to pairs of sequences with structural similarity but low-sequence identity. Based on these results, we hypothesize that reduced alphabets may also show performance gains with more sophisticated methods such as profile and pattern searches.
Availability: A table of results as well as the substitution matrices and residue groupings from this study can be downloaded from http://www.rpgroup.caltech.edu/publications/supplements/alphabets.
Supplementary information: Supplementary data are available at Bioinformatics online.
Experiments have shown that the canonical AUCG genetic alphabet is not the only possible nucleotide alphabet. In this work we address the question 'is the canonical alphabet optimal?' We make the assumption that the genetic alphabet was determined in the RNA world. Computational tools are used to infer the RNA secondary structure (shape) from a given RNA sequence, and statistics from RNA shapes are gathered with respect to alphabet size. Then, simulations based upon the replication and selection of fixed-sized RNA populations are used to investigate the effect of alternative alphabets upon RNA's ability to step through a fitness landscape. These results show that for a low copy fidelity the canonical alphabet is fitter than two-, six- and eight-letter alphabets. In higher copy-fidelity experiments, six-letter alphabets outperform the four-letter alphabets, suggesting that the canonical alphabet is indeed a relic of the RNA world.
As modeling of changes in backbone conformation still lacks a computationally efficient solution, we developed a discretisation of the conformational states accessible to the protein backbone similar to the successful rotamer approach in side chains. The BriX fragment database, consisting of fragments from 4 to 14 residues long, was realized through identification of recurrent backbone fragments from a non-redundant set of high-resolution protein structures. BriX contains an alphabet of more than 1,000 frequently observed conformations per peptide length for 6 different variation levels. Analysis of the performance of BriX revealed an average structural coverage of protein structures of more than 99% within a root mean square distance (RMSD) of 1 Angstrom. Globally, we are able to reconstruct protein structures with an average accuracy of 0.48 Angstrom RMSD. As expected, regular structures are well covered, but, interestingly, many loop regions that appear irregular at first glance are also found to form a recurrent structural motif, albeit with lower frequency of occurrence than regular secondary structures. Larger loop regions could be completely reconstructed from smaller recurrent elements, between 4 and 8 residues long. Finally, we observed that a significant amount of short sequences tend to display strong structural ambiguity between alpha helix and extended conformations. When the sequence length increases, this so-called sequence plasticity is no longer observed, illustrating the context dependency of polypeptide structures.
Large-scale DNA sequencing efforts produce large amounts of protein sequence data. However, in order to understand the function of a protein, its tertiary three-dimensional structure is required. Despite worldwide efforts in structural biology, experimental protein structures are determined at a significantly slower pace. As a result, computational methods for protein structure prediction receive significant attention. A large part of the structure prediction problem lies in the enormous size of the problem: proteins seem to occur in an infinite variety of shapes. Here, we propose that this huge complexity may be overcome by identifying recurrent protein fragments, which are frequently reused as building blocks to construct proteins that were hitherto thought to be unrelated. The BriX database is the outcome of identifying about 2,000 canonical shapes among 1,261 protein structures. We show any given protein can be reconstructed from this library of building blocks at a very high resolution, suggesting that the modelling of protein backbones may be greatly aided by our database.
R3D-BLAST is a BLAST-like search tool that allows the user to quickly and accurately search against the PDB for RNA structures sharing similar substructures with a specified query RNA structure. The basic idea behind R3D-BLAST is that all the RNA 3D structures deposited in the PDB are first encoded as 1D structural sequences using a structural alphabet of 23 distinct nucleotide conformations, and BLAST is then applied to these 1D structural sequences to search for those RNA substructures whose 1D structural sequences are similar to that of the query RNA substructure. R3D-BLAST takes as input an RNA 3D structure in the PDB format and outputs all substructures of the hits similar to that of the query with a graphical display to show their structural superposition. In addition, each RNA substructure hit found by R3D-BLAST has an associated E-value to measure its statistical significance. R3D-BLAST is now available online at http://genome.cs.nthu.edu.tw/R3D-BLAST/ for public access.
Protein structures are classically described in terms of secondary structures. Even if the regular secondary structures have relevant physical meaning, their recognition from atomic coordinates has some important limitations such as uncertainties in the assignment of boundaries of helical and β-strand regions. Further, on an average about 50% of all residues are assigned to an irregular state, i.e., the coil. Thus different research teams have focused on abstracting conformation of protein backbone in the localized short stretches. Using different geometric measures, local stretches in protein structures are clustered in a chosen number of states. A prototype representative of the local structures in each cluster is generally defined. These libraries of local structures prototypes are named as “structural alphabets”. We have developed a structural alphabet, named Protein Blocks, not only to approximate the protein structure, but also to predict them from sequence. Since its development, we and other teams have explored numerous new research fields using this structural alphabet. We review here some of the most interesting applications.
protein structures; biochemistry; amino acids; secondary structures; propensities; structural alphabet; structure prediction; structural superimposition; mutation; binding site; Bayes theorem; Support Vector Machines.
Because loops connect regular secondary structures, analysis of the former depends directly on the definition of the latter. The numerous assignment methods, however, can offer different definitions. In a previous study, we defined a structural alphabet composed of 16 average protein fragments, which we called Protein Blocks (PBs). They allow an accurate description of every region of 3D protein backbones and have been used in local structure prediction. In the present study, we use this structural alphabet to analyze and predict the loops connecting two repetitive structures.
We first analyzed the secondary structure assignments. Use of five different assignment methods (DSSP, DEFINE, PCURVE, STRIDE and PSEA) showed the absence of consensus: 20% of the residues were assigned to different states. The discrepancies were particularly important at the extremities of the repetitive structures. We used PBs to describe and predict the short loops because they can help analyze and in part explain these discrepancies. An analysis of the PB distribution in these regions showed some specificities in the sequence-structure relationship. Of the amino acid over- or under-representations observed in the short loop databank, 20% did not appear in the entire databank. Finally, predicting 3D structure in terms of PBs with a Bayesian approach yielded an accuracy rate of 36.0% for all loops and 41.2% for the short loops. Specific learning in the short loops increased the latter by 1%.
This work highlights the difficulties of assigning repetitive structures and the advantages of using more precise descriptions, that is, PBs. We observed some new amino acid distributions in the short loops and used this information to enhance local prediction. Instead of describing entire loops, our approach predicts each position in the loops locally. It can thus be used to propose many different structures for the loops and to probe and sample their flexibility. It can be a useful tool in ab initio loop prediction.
The FALC-Loop web server provides an online interface for protein loop modeling by employing an ab initio loop modeling method called FALC (fragment assembly and analytical loop closure). The server may be used to construct loop regions in homology modeling, to refine unreliable loop regions in experimental structures or to model segments of designed sequences. The FALC method is computationally less expensive than typical ab initio methods because the conformational search space is effectively reduced by the use of fragments derived from a structure database. The analytical loop closure algorithm allows efficient search for loop conformations that fit into the protein framework starting from the fragment-assembled structures. The FALC method shows prediction accuracy comparable to other state-of-the-art loop modeling methods. Top-ranked model structures can be visualized on the web server, and an ensemble of loop structures can be downloaded for further analysis. The web server can be freely accessed at http://falc-loop.seoklab.org/.
Local structures predicted from protein sequences are employed extensively in every aspect of modeling and prediction of protein structure and function. For more than 50 years, they have been predicted at a low-resolution coarse-grained level (e.g. three-state secondary structure). Here, we combine a two-state classifier with real-value predictor to predict local structure in continuous representation by backbone torsion angles. The accuracy of the angles predicted by this approach is close to that derived from NMR chemical shifts. Their substitution for predicted secondary structure as restraints for ab initio structure prediction doubles the success rate. This result demonstrates the potential of predicted local structure for fragment-free tertiary-structure prediction. It further implies potentially significant benefits from employing predicted real-valued torsion angles as a replacement of or supplement to the secondary-structure prediction tools employed almost exclusively in many computational methods ranging from sequence alignment to function prediction.
A statistical analysis of the PDB structures has led us to define a new set of small 3D structural prototypes called Protein Blocks (PBs). This structural alphabet includes 16 PBs, each one is defined by the (φ, Ψ) dihedral angles of 5 consecutive residues. The amino acid distributions observed in sequence windows encompassing these PBs are used to predict by a Bayesian approach the local 3D structure of proteins from the sole knowledge of their sequences. LocPred is a software which allows the users to submit a protein sequence and performs a prediction in terms of PBs. The prediction results are given both textually and graphically.
Structure prediction; Confidence index; Bayesian approach; Amino Acid Sequence; Models, Molecular; Molecular Sequence Data; Protein Conformation; Proteins; chemistry
gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms.
we sought the role of mRNA secondary structures and their information contents for five vertebrate and plant splice site datasets. We selected 900-nucleotide sequences centered at each (real or decoy) donor and acceptor sites, and predicted their corresponding RNA structures by Vienna software. Then, based on whether the nucleotide is in a stem or not, the conventional four-letter nucleotide alphabet was translated into an eight-letter alphabet. Zero-, first- and second-order Markov models were selected as the signal detection methods. It is shown that applying the eight-letter alphabet compared to the four-letter alphabet considerably increases the accuracy of both donor and acceptor site predictions in case of higher order Markov models.
Our results imply that RNA structure contains important data and future gene prediction programs can take advantage of such information.
We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. Overall, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.
membrane proteins; protein structure prediction; protein function prediction; alignments; transmembrane segment prediction; homology modeling; ab initio modeling
Prediction of protein structures from their sequences is still one of the open grand challenges of computational biology. Some approaches to protein structure prediction, especially ab initio ones, rely to some extent on the prediction of residue contact maps. Residue contact map predictions have been assessed at the CASP competition for several years now. Although it has been shown that exact contact maps generally yield correct three-dimensional structures, this is true only at a relatively low resolution (3–4 Å from the native structure). Another known weakness of contact maps is that they are generally predicted ab initio, that is not exploiting information about potential homologues of known structure.
We introduce a new class of distance restraints for protein structures: multi-class distance maps. We show that Cα trace reconstructions based on 4-class native maps are significantly better than those from residue contact maps. We then build two predictors of 4-class maps based on recursive neural networks: one ab initio, or relying on the sequence and on evolutionary information; one template-based, or in which homology information to known structures is provided as a further input. We show that virtually any level of sequence similarity to structural templates (down to less than 10%) yields more accurate 4-class maps than the ab initio predictor. We show that template-based predictions by recursive neural networks are consistently better than the best template and than a number of combinations of the best available templates. We also extract binary residue contact maps at an 8 Å threshold (as per CASP assessment) from the 4-class predictors and show that the template-based version is also more accurate than the best template and consistently better than the ab initio one, down to very low levels of sequence identity to structural templates. Furthermore, we test both ab-initio and template-based 8 Å predictions on the CASP7 targets using a pre-CASP7 PDB, and find that both predictors are state-of-the-art, with the template-based one far outperforming the best CASP7 systems if templates with sequence identity to the query of 10% or better are available. Although this is not the main focus of this paper we also report on reconstructions of Cα traces based on both ab initio and template-based 4-class map predictions, showing that the latter are generally more accurate even when homology is dubious.
Accurate predictions of multi-class maps may provide valuable constraints for improved ab initio and template-based prediction of protein structures, naturally incorporate multiple templates, and yield state-of-the-art binary maps. Predictions of protein structures and 8 Å contact maps based on the multi-class distance map predictors described in this paper are freely available to academic users at the url .
The primary obstacle to de novo protein structure prediction is conformational sampling: the native state generally has lower free energy than non-native structures but is exceedingly difficult to locate. Structure predictions with atomic level accuracy have been made for small proteins using the Rosetta structure prediction methodology, but for larger and more complex proteins, the native state is virtually never sampled and it has been unclear how much of an increase in computing power would be required to successfully predict the structures of such proteins. In this paper we develop an approach to determining how much computer power is required to accurately predict the structure of a protein, based on a reformulation of the conformational search problem as a combinatorial sampling problem in a discrete feature space. We find that conformational sampling for many proteins is limited by critical “linchpin” features, often the backbone torsion angles of individual residues, which are sampled very rarely in unbiased trajectories and when constrained dramatically increase the sampling of the native state. These critical features frequently occur in less regular and likely strained regions of proteins that contribute to protein function. In a number of proteins, the linchpin features are in regions found experimentally to form late in folding, suggesting a correspondence between folding in silico and in reality.
protein structure prediction; Rosetta; full-atom refinement; distributed computing
Protein template identification is essential to protein structure and function predictions. However, conventional whole-chain threading approaches often fail to recognize conserved substructure motifs when the target and templates do not share the same fold. We develop a new approach, SEGMER, for identifying protein substructure similarities by segmental threading. The target sequence is split into segments of 2–4 consecutive or non-consecutive secondary structural elements, which are then threaded through PDB to identify appropriate substructure motifs. SEGMER is tested on 144 non-redundant hard proteins. When combined with whole-chain threading, the TM-score of alignments and accuracy of spatial restraints of SEGMER increase by 16% and 25%, respectively, compared to that by the whole-chain threading methods only. When tested on 12 Free Modeling targets from CASP8, SEGMER increases the TM-score and contact accuracy by 28% and 48%, respectively. This significant improvement should have important impact on protein structure modeling and functional inference.
protein structure prediction; segmental threading; contact restraints
In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino-acids. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies and especially dealing with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.
lncRNAs; ncRNAs; high-throughput; RNA-seq; comparative biology; sequencing
Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures.
We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or match ab initio predictors down to essentially any level of template quality.
We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly.
The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: and we plan on running them on a multi-genomic scale and make the results public in the near future.
Sequence homologs are an important source of information about proteins. Amino acid profiles, representing the position-specific mutation probabilities found in profiles, are a richer encoding of biological sequences than the individual sequences themselves. However, profile comparisons are an order of magnitude slower than sequence comparisons, making profiles impractical for large datasets. Also, because they are such a rich representation, profiles are difficult to visualize. To address these problems, we describe a method to map probabilistic profiles to a discrete alphabet while preserving most of the information in the profiles. We find an informationally optimal discretization using the Information Bottleneck approach (IB). We observe that an 80-character IB alphabet captures nearly 90% of the amino acid occurrence information found in profiles, compared to the consensus sequence's 78%. Distant homolog search with IB sequences is 88% as sensitive as with profiles compared to 61% with consensus sequences (AUC scores 0.73, 0.83, and 0.51, respectively), but like simple sequence comparison, is 30 times faster. Discrete IB encoding can therefore expand the range of sequence problems to which profile information can be applied to include batch queries over large databases like SwissProt, which were previously computationally infeasible.
Motivation: Recent advances in high-throughput technologies have made it possible to investigate not only individual protein interactions, but also the association of these proteins in complexes. So far the focus has been on the prediction of complexes as sets of proteins from the experimental results. The modular substructure and the physical interactions within the protein complexes have been mostly ignored.
Results: We present an approach for identifying the direct physical interactions and the subcomponent structure of protein complexes predicted from affinity purification assays. Our algorithm calculates the union of all maximum spanning trees from scoring networks for each protein complex to extract relevant interactions. In a subsequent step this network is extended to interactions which are not accounted for by alternative indirect paths. We show that the interactions identified with this approach are more accurate in predicting experimentally derived physical interactions than baseline approaches. Based on these networks, the subcomponent structure of the complexes can be resolved more satisfactorily and subcomplexes can be identified. The usefulness of our method is illustrated on the RNA polymerases for which the modular substructure can be successfully reconstructed.
Availability: A Java implementation of the prediction methods and supplementary material are available at http://www.bio.ifi.lmu.de/Complexes/Substructures/.
Supplementary information: Supplementary data are available at Bioinformatics online.