Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.
In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.
We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
Transmembrane proteins control the flow of information and substances into and out of the cell and are involved in a broad range of biological processes. Their interfacing role makes them rewarding drug targets, and it is estimated that more than 50% of recently launched drugs target membrane proteins. However, experimentally determining the three-dimensional structure of a transmembrane protein is still a difficult task, and few of the currently known tertiary structures are of transmembrane proteins despite the fact that as many as one quarter of the proteins in a given organism are transmembrane proteins. Computational methods for predicting the basic topology of a transmembrane protein are therefore of great interest, and these methods must be able to distinguish between mature, membrane-spanning proteins and proteins that, when first synthesized, contain an N-terminal membrane-spanning signal peptide. In this work, we present Philius, a new computational approach that outperforms previous methods in simultaneously detecting signal peptides and correctly predicting the topology of transmembrane proteins. Philius also supplies a set of confidence scores with each prediction. A Philius Web server is available to the public as well as precomputed predictions for over six million proteins in the Yeast Resource Center database.
Motivation: A mass spectrum produced via tandem mass spectrometry can be tentatively matched to a peptide sequence via database search. Here, we address the problem of assigning a posterior error probability (PEP) to a given peptide-spectrum match (PSM). This problem is considerably more difficult than the related problem of estimating the error rate associated with a large collection of PSMs. Existing methods for estimating PEPs rely on a parametric or semiparametric model of the underlying score distribution.
Results: We demonstrate how to apply non-parametric logistic regression to this problem. The method makes no explicit assumptions about the form of the underlying score distribution; instead, the method relies upon decoy PSMs, produced by searching the spectra against a decoy sequence database, to provide a model of the null score distribution. We show that our non-parametric logistic regression method produces accurate PEP estimates for six different commonly used PSM score functions. In particular, the estimates produced by our method are comparable in accuracy to those of PeptideProphet, which uses a parametric or semiparametric model designed specifically to work with SEQUEST. The advantage of the non-parametric approach is applicability and robustness to new score functions and new types of data.
Availability: C++ code implementing the method as well as supplementary information is available at http://noble.gs.washington.edu/proj/qvality
Motivation: A recently developed DNaseI assay has given us our first genome-wide view of chromatin structure. In addition to cataloging DNaseI hypersensitive sites, these data allows us to more completely characterize overall features of chromatin accessibility. We employed a Bayesian hierarchical change-point model (CPM), a generalization of a hidden Markov Model (HMM), to characterize tiled microarray DNaseI sensitivity data available from the ENCODE project.
Results: Our analysis shows that the accessibility of chromatin to cleavage by DNaseI is well described by a four state model of local segments with each state described by a continuous mixture of Gaussian variables. The CPM produces a better fit to the observed data than the HMM. The large posterior probability for the four-state CPM suggests that the data falls naturally into four classes of regions, which we call major and minor DNaseI hypersensitive sites (DHSs), regions of intermediate sensitivity, and insensitive regions. These classes agree well with a model of chromatin in which local disruptions (DHSs) are concentrated within larger domains of intermediate sensitivity, the accessibility islands. The CPM assigns 92% of the bases within the ENCODE regions to the insensitive regions. The 5.8% of the bases that are in regions of intermediate sensitivity are clearly enriched in functional elements, including genes and activating histone modifications, while the remaining 2.2% of the bases in hypersensitive regions are very strongly enriched in these elements.
Availability: The CPM software is available upon request from the authors.
Contact: firstname.lastname@example.org; email@example.com; Charles_Lawrence@brown.edu
Supplementary information: Supplementary data are available at Bioinformatics online. Source code is available at http://noble.gs.washington.edu/proj/segment.
The identification of peptides by tandem mass spectrometry (MS/MS) is a central method of proteomics research, but due to the complexity of MS/MS data and the large databases searched, the accuracy of peptide identification algorithms remains limited. To improve the accuracy of identification we applied a machine-learning approach using a hidden Markov model (HMM) to capture the complex and often subtle links between a peptide sequence and its MS/MS spectrum.
Our model, HMM_Score, represents ion types as HMM states and calculates the maximum joint probability for a peptide/spectrum pair using emission probabilities from three factors: the amino acids adjacent to each fragmentation site, the mass dependence of ion types and the intensity dependence of ion types. The Viterbi algorithm is used to calculate the most probable assignment between ion types in a spectrum and a peptide sequence, then a correction factor is added to account for the propensity of the model to favor longer peptides. An expectation value is calculated based on the model score to assess the significance of each peptide/spectrum match.
We trained and tested HMM_Score on three data sets generated by two different mass spectrometer types. For a reference data set recently reported in the literature and validated using seven identification algorithms, HMM_Score produced 43% more positive identification results at a 1% false positive rate than the best of two other commonly used algorithms, Mascot and X!Tandem. HMM_Score is a highly accurate platform for peptide identification that works well for a variety of mass spectrometer and biological sample types.
The program is freely available on ProteomeCommons via an OpenSource license. See http://bioinfo.unc.edu/downloads/ for the download link.
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido.
Mass spectrometry; protein identification; graphical models; Bayesian inference
Spectral counting methods provide an easy means of identifying proteins with differing abundances between complex mixtures using shotgun proteomics data. The crux spectral-counts command, implemented as part of the Crux software toolkit, implements four previously reported spectral counting methods, the spectral index (SIN), the exponentially modified protein abundance index (emPAI), the normalized spectral abundance factor (NSAF), and the distributed normalized spectral abundance factor (dNSAF).
We compared the reproducibility and the linearity relative to each protein’s abundance of the four spectral counting metrics. Our analysis suggests that NSAF yields the most reproducible counts across technical and biological replicates, and both SIN and NSAF achieve the best linearity.
With the crux spectral-counts command, Crux provides open-source modular methods to analyze mass spectrometry data for identifying and now quantifying peptides and proteins. The C++ source code, compiled binaries, spectra and sequence databases are available at
Summary: We present a large-scale implementation of the Rankprop protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the Rankprop algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, including propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation.
Availability: The Rankprop web server and source code are available at http://rankprop.gs.washington.edu
Contact: firstname.lastname@example.org; email@example.com
Summary: Qvality is a C++ program for estimating two types of standard statistical confidence measures: the q-value, which is an analog of the p-value that incorporates multiple testing correction, and the posterior error probability (PEP, also known as the local false discovery rate), which corresponds to the probability that a given observation is drawn from the null distribution. In computing q-values, qvality employs a standard bootstrap procedure to estimate the prior probability of a score being from the null distribution; for PEP estimation, qvality relies upon non-parametric logistic regression. Relative to other tools for estimating statistical confidence measures, qvality is unique in its ability to estimate both types of scores directly from a null distribution, without requiring the user to calculate p-values.
Availability: A web server, C++ source code and binaries are available under MIT license at http://noble.gs.washington.edu/proj/qvality
Supplementary information: Supplementary data are available at Bioinformatics online.
Accurate modeling of peptide fragmentation is necessary for the development of robust scoring functions for peptide-spectrum matches, which are the cornerstone of MS/MS-based identification algorithms. Unfortunately, peptide fragmentation is a complex process that can involve several competing chemical pathways, which makes it difficult to develop generative probabilistic models that describe it accurately. However, the vast amounts of MS/MS data being generated now make it possible to use data-driven machine learning methods to develop discriminative ranking-based models that predict the intensity ranks of a peptide's fragment ions. We use simple sequence-based features that get combined by a boosting algorithm in to models that make peak rank predictions with high accuracy. In an accompanying manuscript, we demonstrate how these prediction models are used to significantly improve the performance of peptide identification algorithms. The models can also be useful in the design of optimal MRM transitions, in cases where there is insufficient experimental data to guide the peak selection process. The prediction algorithm can also be run independently through PepNovo+, which is available for download from http://bix.ucsd.edu/Software/PepNovo.html.
MS/MS; peptide; fragmentation; prediction; machine learning; ranking; boosting; MRM
Summary: We present a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. We show that retrieving data from this format is more than 2900 times faster than a naive approach using wiggle files.
Availability and Implementation: Reference implementation in Python and C components available at http://noble.gs.washington.edu/proj/genomedata/ under the GNU General Public License.
Despite a recent surge of interest in database-independent peptide identifications, accurate de novo peptide sequencing remains an elusive goal. While the recently introduced spectral network approach resulted in accurate peptide sequencing in low-complexity samples, its success depends on the chance of presence of spectra from overlapping peptides. On the other hand, while multistage mass spectrometry (collecting multiple MS 3 spectra from each MS 2 spectrum) can be applied to all spectra in a complex sample, there are currently no software tools for de novo peptide sequencing by multistage mass spectrometry. We describe a rigorous probabilistic framework for analyzing spectra of overlapping peptides and show how to apply it for multistage mass spectrometry. Our software results in both accurate de novo peptide sequencing from multistage mass spectra (despite the inferior quality of MS 3 spectra) and improved interpretation of spectral networks. We further study the problem of de novo peptide sequencing with accurate parent mass (but inaccurate fragment masses), the protocol that may soon become the dominant mode of spectral acquisition. Most existing peptide sequencing algorithms (based on the spectrum graph approach) do not track the accurate parent mass and are thus not equipped for solving this problem. We describe a de novo peptide sequencing algorithm aimed at this experimental protocol and show that it improves the sequencing accuracy on both tandem and multistage mass spectrometry.
Availability: The open-source implementation of our software is available at http://proteomics.bioprojects.org.
Supplementary information:: Supplementary data are available at Bioinformatics online.
We applied a dynamic Bayesian network method that identifies joint patterns from multiple functional genomics experiments to ChIP-seq histone modification and transcription factor data, and DNaseI-seq and FAIRE-seq open chromatin readouts from the human cell line K562. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
To interpret LC-MS/MS data in proteomics, most popular protein identification algorithms primarily use predicted fragment m/z values to assign peptide sequences to fragmentation spectra. The intensity information is often undervalued, since it is not as easy to predict and incorporate into algorithms. Nevertheless, the use of intensity to assist peptide identification is an attractive prospect and can potentially improve the confidence of matches and generate more identifications. Based on our previously reported study of fragmentation intensity patterns, we developed a protein identification algorithm, SeQuence IDentfication (SQID), which makes use of the coarse intensity from a statistical analysis. The scoring scheme was validated by comparing with Sequest and X!Tandem using three datasets, and the results indicate an improvement in the number of identified peptides, including unique peptides that are not identified by Sequest or X!Tandem. The software and source code are available under the GNU GPL license at: http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.
protein identification algorithm; intensity; tandem mass spectrometry; database search
Selective peptide transport by the transporter associated with antigen processing (TAP) represents one of the main candidate mechanisms that may regulate the presentation of antigenic peptides to HLA class I molecules. Because TAP-binding preferences may significant impact T-cell epitope selection, there is great interest in applying computational techniques to systematically discover these elements.
We describe TAP Hunter, a web-based computational system for predicting TAP-binding peptides. A novel encoding scheme, based on representations of TAP peptide fragments and composition effects, allows the identification of variable-length TAP ligands using SVM as the prediction engine. The system was rigorously trained and tested using 613 experimentally verified peptide sequences. The results showed that the system has good predictive ability with area under the receiver operating characteristics curve (AROC) ≥0.88. In addition, TAP Hunter is compared against several existing public available TAP predictors and has showed either superior or comparable performance.
TAP Hunter provides a reliable platform for predicting variable length peptides binding onto the TAP transporter. To facilitate the usage of TAP Hunter to the scientific community, a simple, flexible and user-friendly web-server is developed and freely available at http://datam.i2r.a-star.edu.sg/taphunter/.
Shotgun proteomics experiments are dependent upon database search engines to identify peptides from tandem mass spectra. Many of these algorithms score potential identifications by evaluating the number of fragment ions matched between each peptide sequence and an observed spectrum. These systems, however, generally do not distinguish between matching an intense peak and matching a minor peak. We have developed a statistical model to score peptide matches that is based upon the multivariate hypergeometric distribution. This scorer, part of the “MyriMatch” database search engine, places greater emphasis on matching intense peaks. The probability that the best match for each spectrum has occurred by random chance can be employed to separate correct matches from random ones. We evaluated this software on data sets from three different laboratories employing three different ion trap instruments. Employing a novel system for testing discrimination, we demonstrate that stratifying peaks into multiple intensity classes improves the discrimination of scoring. We compare MyriMatch results to those of Sequest and X!Tandem, revealing that it is capable of higher discrimination than either of these algorithms. When minimal peak filtering is employed, performance plummets for a scoring model that does not stratify matched peaks by intensity. On the other hand, we find that MyriMatch discrimination improves as more peaks are retained in each spectrum. MyriMatch also scales well to tandem mass spectra from high-resolution mass analyzers. These findings may indicate limitations for existing database search scorers that count matched peaks without differentiating them by intensity. This software and source code is available under Mozilla Public License at this URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
Proteomics; Identification; Statistical Distribution; Reversed Database; Peak Filtering
We present a peptide-spectrum alignment strategy that employs a dynamic Bayesian network (DBN) for the identification of spectra produced by tandem mass spectrometry (MS/MS). Our method is fundamentally generative in that it models peptide fragmentation in MS/MS as a physical process. The model traverses an observed MS/MS spectrum and a peptide-based theoretical spectrum to calculate the best alignment between the two spectra. Unlike all existing state-of-the-art methods for spectrum identification that we are aware of, our method can learn alignment probabilities given a dataset of high-quality peptide-spectrum pairs. The method, moreover, accounts for noise peaks and absent theoretical peaks in the observed spectrum. We demonstrate that our method outperforms, on a majority of datasets, several widely used, state-of-the-art database search tools for spectrum identification. Furthermore, the proposed approach provides an extensible framework for MS/MS analysis and provides useful information that is not produced by other methods, thanks to its generative structure.
Peptide and protein identification via tandem mass spectrometry (MS/MS) lies at the heart of proteomic characterization of biological samples. Several algorithms are able to search, score, and assign peptides to large MS/MS datasets. Most popular methods, however, underutilize the intensity information available in the tandem mass spectrum due to the complex nature of the peptide fragmentation process, thus contributing to loss of potential identifications. We present a novel probabilistic scoring algorithm called Context-Sensitive Peptide Identification (CSPI) based on highly flexible Input-Output Hidden Markov Models (IO-HMM) that capture the influence of peptide physicochemical properties on their observed MS/MS spectra. We use several local and global properties of peptides and their fragment ions from literature. Comparison with two popular algorithms, Crux (re-implementation of SEQUEST) and X!Tandem, on multiple datasets of varying complexity, shows that peptide identification scores from our models are able to achieve greater discrimination between true and false peptides, identifying up to ∼25% more peptides at a False Discovery Rate (FDR) of 1%. We evaluated two alternative normalization schemes for fragment ion-intensities, a global rank-based and a local window-based. Our results indicate the importance of appropriate normalization methods for learning superior models. Further, combining our scores with Crux using a state-of-the-art procedure, Percolator, we demonstrate the utility of using scoring features from intensity-based models, identifying ∼4-8 % additional identifications over Percolator at 1% FDR. IO-HMMs offer a scalable and flexible framework with several modeling choices to learn complex patterns embedded in MS/MS data.
Identifying peptides from the fragmentation spectra is a fundamental step in mass spectrometry (MS) data processing. The significance (discriminability) of every peak varies, providing additional information for potentially enhancing the identification sensitivity and the correct match rate. However this important information was not considered in previous algorithms. Here we presented a novel method based on Peptide Matching Discriminability (PMD), in which the PMD information of every peak reflects the discriminability of candidate peptides. In addition, we developed a novel peptide scoring algorithm Dispec based on PMD, by taking three aspects of discriminability into consideration: PMD, intensity discriminability and m/z error discriminability. Compared with Mascot and Sequest, Dispec identified remarkably more peptides from three experimental datasets with the same confidence at 1% PSM-level FDR. Dispec is also robust and versatile for various datasets obtained on different instruments. The concept of discriminability enhances the peptide identification and thus may contribute largely to the proteome studies. As an open-source program, Dispec is freely available at http://bioinformatics.jnu.edu.cn/software/dispec/.
Motivation:Granzyme B (GrB) and caspases cleave specific protein substrates to induce apoptosis in virally infected and neoplastic cells. While substrates for both types of proteases have been determined experimentally, there are many more yet to be discovered in humans and other metazoans. Here, we present a bioinformatics method based on support vector machine (SVM) learning that identifies sequence and structural features important for protease recognition of substrate peptides and then uses these features to predict novel substrates. Our approach can act as a convenient hypothesis generator, guiding future experiments by high-confidence identification of peptide-protein partners.
Results:The method is benchmarked on the known substrates of both protease types, including our literature-curated GrB substrate set (GrBah). On these benchmark sets, the method outperforms a number of other methods that consider sequence only, predicting at a 0.87 true positive rate (TPR) and a 0.13 false positive rate (FPR) for caspase substrates, and a 0.79 TPR and a 0.21 FPR for GrB substrates. The method is then applied to ∼25 000 proteins in the human proteome to generate a ranked list of predicted substrates of each protease type. Two of these predictions, AIF-1 and SMN1, were selected for further experimental analysis, and each was validated as a GrB substrate.
Availability: All predictions for both protease types are publically available at http://salilab.org/peptide. A web server is at the same site that allows a user to train new SVM models to make predictions for any protein that recognizes specific oligopeptide ligands.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online
Tandem mass spectrometry-based shotgun proteomics has become a widespread technology for analyzing complex protein mixtures. A number of database searching algorithms have been developed to assign peptide sequences to tandem mass spectra. Assembling the peptide identifications to proteins, however, is a challenging issue because many peptides are shared among multiple proteins. IDPicker is an open-source protein assembly tool that derives a minimum protein list from peptide identifications filtered to a specified False Discovery Rate. Here, we update IDPicker to increase confident peptide identifications by combining multiple scores produced by database search tools. By segregating peptide identifications for thresholding using both the precursor charge state and the number of tryptic termini, IDPicker retrieves more peptides for protein assembly. The new version is more robust against false positive proteins, especially in searches using multispecies databases, by requiring additional novel peptides in the parsimony process. IDPicker has been designed for incorporation in many identification workflows by the addition of a graphical user interface and the ability to read identifications from the pepXML format. These advances position IDPicker for high peptide discrimination and reliable protein assembly in large-scale proteomics studies. The source code and binaries for the latest version of IDPicker are available from http://fenchurch.mc.vanderbilt.edu/.
bioinformatics; parsimony; protein assembly; protein inference; false discovery rate
Numerous organisms have evolved a wide range of toxic peptides for self-defense and predation. Their effective interstitial and macro-environmental use requires energetic and structural stability. One successful group of these peptides includes a tri-disulfide domain arrangement that offers toxicity and high stability. Sequential tri-disulfide connectivity variants create highly compact disulfide folds capable of withstanding a variety of environmental stresses. Their combination of toxicity and stability make these peptides remarkably valuable for their potential as bio-insecticides, antimicrobial peptides and peptide drug candidates. However, the wide sequence variation, sources and modalities of group members impose serious limitations on our ability to rapidly identify potential members. As a result, there is a need for automated high-throughput member classification approaches that leverage their demonstrated tertiary and functional homology.
We developed an SVM-based model to predict sequential tri-disulfide peptide (STP) toxins from peptide sequences. One optimized model, called PredSTP, predicted STPs from training set with sensitivity, specificity, precision, accuracy and a Matthews correlation coefficient of 94.86 %, 94.11 %, 84.31 %, 94.30 % and 0.86, respectively, using 200 fold cross validation. The same model outperforms existing prediction approaches in three independent out of sample testsets derived from PDB.
PredSTP can accurately identify a wide range of cystine stabilized peptide toxins directly from sequences in a species-agnostic fashion. The ability to rapidly filter sequences for potential bioactive peptides can greatly compress the time between peptide identification and testing structural and functional properties for possible antimicrobial and insecticidal candidates. A web interface is freely available to predict STP toxins from http://crick.ecs.baylor.edu/.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0633-x) contains supplementary material, which is available to authorized users.
Machine learning; SVM; Tri-disulfide peptide toxins; Sequential tri-disulfide peptides (STPs); Inhibitory cytine knot (ICKs); Cylotides; Nonknotted STPs, Insecticidal peptides; Antimicrobial peptides
Mass spectrometry (MS) is an essential analytical tool in proteomics. Many existing algorithms for peptide detection are based on isotope template matching and usually work at different charge states separately, making them ineffective to detect overlapping peptides and low abundance peptides.
We present BPDA, a Bayesian approach for peptide detection in data produced by MS instruments with high enough resolution to baseline-resolve isotopic peaks, such as MALDI-TOF and LC-MS. We model the spectra as a mixture of candidate peptide signals, and the model is parameterized by MS physical properties. BPDA is based on a rigorous statistical framework and avoids problems, such as voting and ad-hoc thresholding, generally encountered in algorithms based on template matching. It systematically evaluates all possible combinations of possible peptide candidates to interpret a given spectrum, and iteratively finds the best fitting peptide signal in order to minimize the mean squared error of the inferred spectrum to the observed spectrum. In contrast to previous detection methods, BPDA performs deisotoping and deconvolution of mass spectra simultaneously, which enables better identification of weak peptide signals and produces higher sensitivities and more robust results. Unlike template-matching algorithms, BPDA can handle complex data where features overlap. Our experimental results indicate that BPDA performs well on simulated data and real MS data sets, for various resolutions and signal to noise ratios, and compares very favorably with commonly used commercial and open-source software, such as flexAnalysis, OpenMS, and Decon2LS, according to sensitivity and detection accuracy.
Unlike previous detection methods, which only employ isotopic distributions and work at each single charge state alone, BPDA takes into account the charge state distribution as well, thus lending information to better identify weak peptide signals and produce more robust results. The proposed approach is based on a rigorous statistical framework, which avoids problems generally encountered in algorithms based on template matching. Our experiments indicate that BPDA performs well on both simulated data and real data, and compares very favorably with commonly used commercial and open-source software. The BPDA software can be downloaded from http://gsp.tamu.edu/Publications/supplementary/sun10a/bpda.
Dynamic Bayesian Network (DBN) is an approach widely used for reconstruction of gene regulatory networks from time-series microarray data. Its performance in network reconstruction depends on a structure learning algorithm. REVEAL (REVerse Engineering ALgorithm) is one of the algorithms implemented for learning DBN structure and used to reconstruct gene regulatory networks (GRN). However, the two-stage temporal Bayes network (2TBN) structure of DBN that specifies correlation between time slices cannot be obtained by score metrics used in REVEAL.
In this paper, we study a more sophisticated score function for DBN first proposed by Nir Friedman for stationary DBNs structure learning of both initial and transition networks but has not yet been used for reconstruction of GRNs. We implemented Friedman's Bayesian Information Criterion (BIC) score function, modified K2 algorithm to learn Dynamic Bayesian Network structure with the score function and tested the performance of the algorithm for GRN reconstruction with synthetic time series gene expression data generated by GeneNetWeaver and real yeast benchmark experiment data.
We implemented an algorithm for DBN structure learning with Friedman's score function, tested it on reconstruction of both synthetic networks and real yeast networks and compared it with REVEAL in the absence or presence of preprocessed network generated by Zou&Conzen's algorithm. By introducing a stationary correlation between two consecutive time slices, Friedman's score function showed a higher precision and recall than the naive REVEAL algorithm.
Friedman's score metrics for DBN can be used to reconstruct transition networks and has a great potential to improve the accuracy of gene regulatory network structure prediction with time series gene expression datasets.