Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Protein secondary structure prediction method based on probabilistic models such as hidden Markov model (HMM) appeals to many because it provides meaningful information relevant to sequence-structure relationship. However, at present, the prediction accuracy of pure HMM-type methods is much lower than that of machine learning-based methods such as neural networks (NN) or support vector machines (SVM).
In this paper, we report a new method of probabilistic nature for protein secondary structure prediction, based on dynamic Bayesian networks (DBN). The new method models the PSI-BLAST profile of a protein sequence using a multivariate Gaussian distribution, and simultaneously takes into account the dependency between the profile and secondary structure and the dependency between profiles of neighboring residues. In addition, a segment length distribution is introduced for each secondary structure state. Tests show that the DBN method has made a significant improvement in the accuracy compared to other pure HMM-type methods. Further improvement is achieved by combining the DBN with an NN, a method called DBNN, which shows better Q3 accuracy than many popular methods and is competitive to the current state-of-the-arts. The most interesting feature of DBN/DBNN is that a significant improvement in the prediction accuracy is achieved when combined with other methods by a simple consensus.
The DBN method using a Gaussian distribution for the PSI-BLAST profile and a high-ordered dependency between profiles of neighboring residues produces significantly better prediction accuracy than other HMM-type probabilistic methods. Owing to their different nature, the DBN and NN combine to form a more accurate method DBNN. Future improvement may be achieved by combining DBNN with a method of SVM type.
Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
Transmembrane proteins control the flow of information and substances into and out of the cell and are involved in a broad range of biological processes. Their interfacing role makes them rewarding drug targets, and it is estimated that more than 50% of recently launched drugs target membrane proteins. However, experimentally determining the three-dimensional structure of a transmembrane protein is still a difficult task, and few of the currently known tertiary structures are of transmembrane proteins despite the fact that as many as one quarter of the proteins in a given organism are transmembrane proteins. Computational methods for predicting the basic topology of a transmembrane protein are therefore of great interest, and these methods must be able to distinguish between mature, membrane-spanning proteins and proteins that, when first synthesized, contain an N-terminal membrane-spanning signal peptide. In this work, we present Philius, a new computational approach that outperforms previous methods in simultaneously detecting signal peptides and correctly predicting the topology of transmembrane proteins. Philius also supplies a set of confidence scores with each prediction. A Philius Web server is available to the public as well as precomputed predictions for over six million proteins in the Yeast Resource Center database.
A significant amount of attention has recently been focused on modeling of gene regulatory networks. Two frequently used large-scale modeling frameworks are Bayesian networks (BNs) and Boolean networks, the latter one being a special case of its recent stochastic extension, probabilistic Boolean networks (PBNs). PBN is a promising model class that generalizes the standard rule-based interactions of Boolean networks into the stochastic setting. Dynamic Bayesian networks (DBNs) is a general and versatile model class that is able to represent complex temporal stochastic processes and has also been proposed as a model for gene regulatory systems. In this paper, we concentrate on these two model classes and demonstrate that PBNs and a certain subclass of DBNs can represent the same joint probability distribution over their common variables. The major benefit of introducing the relationships between the models is that it opens up the possibility of applying the standard tools of DBNs to PBNs and vice versa. Hence, the standard learning tools of DBNs can be applied in the context of PBNs, and the inference methods give a natural way of handling the missing values in PBNs which are often present in gene expression measurements. Conversely, the tools for controlling the stationary behavior of the networks, tools for projecting networks onto sub-networks, and efficient learning schemes can be used for DBNs. In other words, the introduced relationships between the models extend the collection of analysis tools for both model classes.
Gene regulatory networks; Probabilistic Boolean networks; Dynamic Bayesian networks
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. The peptide fragmentation spectra generated by these workflows exhibit characteristic fragmentation patterns that can be used to identify the peptide. In other fields, where the compounds of interest do not have the convenient linear structure of peptides, fragmentation spectra are identified by comparing new spectra with libraries of identified spectra, an approach called spectral matching. In contrast to sequence-based tandem mass spectrometry search engines used for peptides, spectral matching can make use of the intensities of fragment peaks in library spectra to assess the quality of a match. We evaluate a hidden Markov model approach (HMMatch) to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. We demonstrate that HMMatch has good specificity and superior sensitivity, compared to sequence database search engines such as X!Tandem. HMMatch achieves good results from relatively few training spectra, is fast to train, and can evaluate many spectra per second. A statistical significance model permits HMMatch scores to be compared with each other, and with other peptide identification tools, on a unified scale. HMMatch shows a similar degree of concordance with X!Tandem, Mascot, and NIST's MS Search, as they do with each other, suggesting that each tool can assign peptides to spectra that the others miss. Finally, we show that it is possible to extrapolate HMMatch models beyond a single peptide's training spectra to the spectra of related peptides, expanding the application of spectral matching techniques beyond the set of peptides previously observed.
computational molecular biology; mass spectroscopy; HMM; peptide identification; algorithms
Mocapy++ is a toolkit for parameter learning and inference in dynamic Bayesian networks (DBNs). It supports a wide range of DBN architectures and probability distributions, including distributions from directional statistics (the statistics of angles, directions and orientations).
The program package is freely available under the GNU General Public Licence (GPL) from SourceForge http://sourceforge.net/projects/mocapy. The package contains the source for building the Mocapy++ library, several usage examples and the user manual.
Mocapy++ is especially suitable for constructing probabilistic models of biomolecular structure, due to its support for directional statistics. In particular, it supports the Kent distribution on the sphere and the bivariate von Mises distribution on the torus. These distributions have proven useful to formulate probabilistic models of protein and RNA structure in atomic detail.
Deciphering the biological networks underlying complex phenotypic traits, e.g., human disease is undoubtedly crucial to understand the underlying molecular mechanisms and to develop effective therapeutics. Due to the network complexity and the relatively small number of available experiments, data-driven modeling is a great challenge for deducing the functions of genes/ proteins in the network and in phenotype formation. We propose a novel knowledge-driven systems biology method that utilizes qualitative knowledge to construct a Dynamic Bayesian network (DBN) to represent the biological network underlying a specific phenotype. Edges in this network depict physical interactions between genes and/or proteins. A qualitative knowledge model first translates typical molecular interactions into constraints when resolving the DBN structure and parameters. Therefore, the uncertainty of the network is restricted to a subset of models which are consistent with the qualitative knowledge. All models satisfying the constraints are considered as candidates for the underlying network. These consistent models are used to perform quantitative inference. By in silico inference, we can predict phenotypic traits upon genetic interventions and perturbing in the network. We applied our method to analyze the puzzling mechanism of breast cancer cell proliferation network and we accurately predicted cancer cell growth rate upon manipulating (anti)cancerous marker genes/proteins.
Dynamic Bayesian network; genetic network; phenotype prediction; genetic intervention; systems biology; breast cancer; cell proliferation
Dynamic Bayesian network (DBN) is among the mainstream approaches for modeling various biological networks, including the gene regulatory network (GRN). Most current methods for learning DBN employ either local search such as hill-climbing, or a meta stochastic global optimization framework such as genetic algorithm or simulated annealing, which are only able to locate sub-optimal solutions. Further, current DBN applications have essentially been limited to small sized networks.
To overcome the above difficulties, we introduce here a deterministic global optimization based DBN approach for reverse engineering genetic networks from time course gene expression data. For such DBN models that consist only of inter time slice arcs, we show that there exists a polynomial time algorithm for learning the globally optimal network structure. The proposed approach, named GlobalMIT+, employs the recently proposed information theoretic scoring metric named mutual information test (MIT). GlobalMIT+ is able to learn high-order time delayed genetic interactions, which are common to most biological systems. Evaluation of the approach using both synthetic and real data sets, including a 733 cyanobacterial gene expression data set, shows significantly improved performance over other techniques.
Our studies demonstrate that deterministic global optimization approaches can infer large scale genetic networks.
Peptide and protein identification via tandem mass spectrometry (MS/MS) lies at the heart of proteomic characterization of biological samples. Several algorithms are able to search, score, and assign peptides to large MS/MS datasets. Most popular methods, however, underutilize the intensity information available in the tandem mass spectrum due to the complex nature of the peptide fragmentation process, thus contributing to loss of potential identifications. We present a novel probabilistic scoring algorithm called Context-Sensitive Peptide Identification (CSPI) based on highly flexible Input-Output Hidden Markov Models (IO-HMM) that capture the influence of peptide physicochemical properties on their observed MS/MS spectra. We use several local and global properties of peptides and their fragment ions from literature. Comparison with two popular algorithms, Crux (re-implementation of SEQUEST) and X!Tandem, on multiple datasets of varying complexity, shows that peptide identification scores from our models are able to achieve greater discrimination between true and false peptides, identifying up to ∼25% more peptides at a False Discovery Rate (FDR) of 1%. We evaluated two alternative normalization schemes for fragment ion-intensities, a global rank-based and a local window-based. Our results indicate the importance of appropriate normalization methods for learning superior models. Further, combining our scores with Crux using a state-of-the-art procedure, Percolator, we demonstrate the utility of using scoring features from intensity-based models, identifying ∼4-8 % additional identifications over Percolator at 1% FDR. IO-HMMs offer a scalable and flexible framework with several modeling choices to learn complex patterns embedded in MS/MS data.
To decipher the complexity and improve the understanding of host-pathogen interactions, biologists must adopt new system level approaches in which the hierarchy of biological interactions and dynamics can be studied. This paper presents the application of systems biology for the cross-comparative analysis and interactome modeling of three different infectious agents, leading to the identification of novel, unique and common molecular host responses (biosignatures).
A computational systems biology method was utilized to create interactome models of the host responses to Brucella melitensis (BMEL), Salmonella enterica Typhimurium (STM) and Mycobacterium avium paratuberculosis (MAP). A bovine ligated ileal loop biological model was employed to capture the host gene expression response at four time points post infection. New methods based on Dynamic Bayesian Network (DBN) machine learning were employed to conduct a systematic comparative analysis of pathway and Gene Ontology category perturbations.
A cross-comparative assessment of 219 pathways and 1620 gene ontology (GO) categories was performed on each pathogen-host condition. Both unique and common pathway and GO perturbations indicated remarkable temporal differences in pathogen-host response profiles. Highly discriminatory pathways were selected from each pathogen condition to create a common system level interactome model comprised of 622 genes. This model was trained with data from each pathogen condition to capture unique and common gene expression features and relationships leading to the identification of candidate host-pathogen points of interactions and discriminatory biosignatures.
Our results provide deeper understanding of the overall complexity of host defensive and pathogen invasion processes as well as the identification of novel host-pathogen interactions. The application of advanced computational methods for developing interactome models based on DBN has proven to be instrumental in conducting multi-conditional cross-comparative analyses. Further, this approach generates a fully simulateable model with capabilities for predictive analysis as well as for diagnostic pattern recognition. The resulting biosignatures may represent future targets for identification of emerging pathogens as well as for development of antimicrobial drugs, immunotherapeutics, or vaccines for prevention and treatment of diseases caused by known, emerging/re-emerging infectious agents.
Adequate control of serum glucose in critically ill patients is a complex problem requiring continuous monitoring and intervention, which have a direct effect on clinical outcomes. Understanding temporal relationships can help to improve our knowledge of complex disease processes and their response to treatment. We discuss a Dynamic Bayesian Network (DBN) model that we created using the open-source Projeny toolkit to represent various clinical variables and the temporal and atemporal relationships underlying insulin and glucose homeostasis. We evaluated this model by comparing the DBN model’s insulin dose predictions against those of a rule-based protocol (eProtocol-insulin) currently used in the ICU. The results suggest that the DBN model’s predictions are as effective as or better than those of the rule-based protocol. The limitations of our methods are discussed, with a brief note on their generalizability.
Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.
In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.
We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
Accurate modeling of peptide fragmentation is necessary for the development of robust scoring functions for peptide-spectrum matches, which are the cornerstone of MS/MS-based identification algorithms. Unfortunately, peptide fragmentation is a complex process that can involve several competing chemical pathways, which makes it difficult to develop generative probabilistic models that describe it accurately. However, the vast amounts of MS/MS data being generated now make it possible to use data-driven machine learning methods to develop discriminative ranking-based models that predict the intensity ranks of a peptide's fragment ions. We use simple sequence-based features that get combined by a boosting algorithm in to models that make peak rank predictions with high accuracy. In an accompanying manuscript, we demonstrate how these prediction models are used to significantly improve the performance of peptide identification algorithms. The models can also be useful in the design of optimal MRM transitions, in cases where there is insufficient experimental data to guide the peak selection process. The prediction algorithm can also be run independently through PepNovo+, which is available for download from http://bix.ucsd.edu/Software/PepNovo.html.
MS/MS; peptide; fragmentation; prediction; machine learning; ranking; boosting; MRM
A better understanding of the mechanisms involved in gas-phase fragmentation of peptides is essential for the development of more reliable algorithms for high-throughput protein identification using mass spectrometry (MS). Current methodologies depend predominantly on the use of derived m/z values of fragment ions, and, the knowledge provided by the intensity information present in MS/MS spectra has not been fully exploited. Indeed spectrum intensity information is very rarely utilized in the algorithms currently in use for high-throughput protein identification.
In this work, a Bayesian neural network approach is employed to analyze ion intensity information present in 13878 different MS/MS spectra. The influence of a library of 35 features on peptide fragmentation is examined under different proton mobility conditions. Useful rules involved in peptide fragmentation are found and subsets of features which have significant influence on fragmentation pathway of peptides are characterised. An intensity model is built based on the selected features and the model can make an accurate prediction of the intensity patterns for given MS/MS spectra. The predictions include not only the mean values of spectra intensity but also the variances that can be used to tolerate noises and system biases within experimental MS/MS spectra.
The intensity patterns of fragmentation spectra are informative and can be used to analyze the influence of various characteristics of fragmented peptides on their fragmentation pathway. The features with significant influence can be used in turn to predict spectra intensities. Such information can help develop more reliable algorithms for peptide and protein identification.
Reverse engineering cellular networks is currently one of the most challenging problems in systems biology. Dynamic Bayesian networks (DBNs) seem to be particularly suitable for inferring relationships between cellular variables from the analysis of time series measurements of mRNA or protein concentrations. As evaluating inference results on a real dataset is controversial, the use of simulated data has been proposed. However, DBN approaches that use continuous variables, thus avoiding the information loss associated with discretization, have not yet been extensively assessed, and most of the proposed approaches have dealt with linear Gaussian models.
We propose a generalization of dynamic Gaussian networks to accommodate nonlinear dependencies between variables. As a benchmark dataset to test the new approach, we used data from a mathematical model of cell cycle control in budding yeast that realistically reproduces the complexity of a cellular system. We evaluated the ability of the networks to describe the dynamics of cellular systems and their precision in reconstructing the true underlying causal relationships between variables. We also tested the robustness of the results by analyzing the effect of noise on the data, and the impact of a different sampling time.
The results confirmed that DBNs with Gaussian models can be effectively exploited for a first level analysis of data from complex cellular systems. The inferred models are parsimonious and have a satisfying goodness of fit. Furthermore, the networks not only offer a phenomenological description of the dynamics of cellular systems, but are also able to suggest hypotheses concerning the causal interactions between variables. The proposed nonlinear generalization of Gaussian models yielded models characterized by a slightly lower goodness of fit than the linear model, but a better ability to recover the true underlying connections between variables.
We examine the efficacy of using discrete Dynamic Bayesian Networks (dDBNs), a data-driven modeling technique employed in machine learning, to identify functional correlations among neuroanatomical regions of interest. Unlike many neuroimaging analysis techniques, this method is not limited by linear and/or Gaussian noise assumptions. It achieves this by modeling the time series of neuroanatomical regions as discrete, as opposed to continuous, random variables with multinomial distributions. We demonstrated this method using an fMRI dataset collected from healthy and demented elderly subjects and identify correlates based on a diagnosis of dementia. The results are validated in three ways. First, the elicited correlates are shown to be robust over leave-one-out cross-validation and, via a Fourier bootstrapping method, that they were not likely due to random chance. Second, the dDBNs identified correlates that would be expected given the experimental paradigm. Third, the dDBN's ability to predict dementia is competitive with two commonly employed machine-learning classifiers: the support vector machine and the Gaussian naïve Bayesian network. We also verify that the dDBN selects correlates based on non-linear criteria. Finally, we provide a brief analysis of the correlates elicited from Buckner et al.'s data that suggests that demented elderly subjects have reduced involvement of entorhinal and occipital cortex and greater involvement of the parietal lobe and amygdala in brain activity compared with healthy elderly (as measured via functional correlations among BOLD measurements). Limitations and extensions to the dDBN method are discussed.
Bayesian networks; dementia; nonlinear analysis; functional connectivity; Talairach atlas; amygdala
Shotgun proteomics yields tandem mass spectra of peptides that can be identified by database search algorithms. When only a few observed peptides suggest the presence of a protein, establishing the accuracy of the peptide identifications is necessary for accepting or rejecting the protein identification. In this protocol, we describe the properties of peptide identifications that can differentiate legitimately identified peptides from spurious ones. The chemistry of fragmentation, as embodied in the ‘mobile proton’ and ‘pathways in competition’ models, informs the process of confirming or rejecting each spectral match. Examples of ion-trap and tandem time-of-flight (TOF/TOF) mass spectra illustrate these principles of fragmentation.
Dynamic Bayesian Network (DBN) is an approach widely used for reconstruction of gene regulatory networks from time-series microarray data. Its performance in network reconstruction depends on a structure learning algorithm. REVEAL (REVerse Engineering ALgorithm) is one of the algorithms implemented for learning DBN structure and used to reconstruct gene regulatory networks (GRN). However, the two-stage temporal Bayes network (2TBN) structure of DBN that specifies correlation between time slices cannot be obtained by score metrics used in REVEAL.
In this paper, we study a more sophisticated score function for DBN first proposed by Nir Friedman for stationary DBNs structure learning of both initial and transition networks but has not yet been used for reconstruction of GRNs. We implemented Friedman's Bayesian Information Criterion (BIC) score function, modified K2 algorithm to learn Dynamic Bayesian Network structure with the score function and tested the performance of the algorithm for GRN reconstruction with synthetic time series gene expression data generated by GeneNetWeaver and real yeast benchmark experiment data.
We implemented an algorithm for DBN structure learning with Friedman's score function, tested it on reconstruction of both synthetic networks and real yeast networks and compared it with REVEAL in the absence or presence of preprocessed network generated by Zou&Conzen's algorithm. By introducing a stationary correlation between two consecutive time slices, Friedman's score function showed a higher precision and recall than the naive REVEAL algorithm.
Friedman's score metrics for DBN can be used to reconstruct transition networks and has a great potential to improve the accuracy of gene regulatory network structure prediction with time series gene expression datasets.
Online automated quality assessment is critical to determine a sensor's fitness for purpose in real-time applications. A Dynamic Bayesian Network (DBN) framework is proposed to produce probabilistic quality assessments and represent the uncertainty of sequentially correlated sensor readings. This is a novel framework to represent the causes, quality state and observed effects of individual sensor errors without imposing any constraints upon the physical deployment or measured phenomenon. It represents the casual relationship between quality tests and combines them in a way to generate uncertainty estimates of samples. The DBN was implemented for a particular marine deployment of temperature and conductivity sensors in Hobart, Australia. The DBN was shown to offer a substantial average improvement (34%) in replicating the error bars that were generated by experts when compared to a fuzzy logic approach.
online filtering; automated; quality assessment; sensors; dynamic Bayesian networks
Data-independent tandem mass spectrometry isolates and fragments all of the molecular species within a given mass-to-charge window, regardless of whether a precursor ion was detected within the window. For shotgun proteomics on complex protein mixtures, data-independent MS/MS offers certain advantages over the traditional data-dependent MS/MS: identification of low-abundance peptides with insignificant precursor peaks; more direct relative quantification, free of biases caused by competing precursors and dynamic exclusion; and faster throughput due to simultaneous fragmentation of multiple peptides. However, data-independent MS/MS, especially on low-resolution ion-trap instruments, strains standard peptide identification programs, because of less precise knowledge of the peptide precursor mass and large numbers of spectra composed of two or more peptides. Here we describe a computer program called DeMux that deconvolves mixture spectra and improves the peptide identification rate by ~25%. We compare the number of identifications made by data-independent and data-dependent MS/MS at the peptide and protein levels: conventional data-dependent MS/MS makes a greater number of identifications but is less reproducible from run to run.
Protein/peptide analysis is commonly based on use of a tandem mass spectrometer combining electrospray ionization and collision-induced dissociation (CID). The major disadvantage of collisional ion activation is the internal heating of the parent ion, which predominantly yields cleavages of the weakest bonds, resulting in less informative fragment spectra; this often limits peptide sequence determination.
To better elucidate the peptide structure, more selective fragmentation techniques are of huge interest. Electron transfer dissociation (ETD) in a non-linear Paul trap has been introduced as a new fragmentation technique, which avoids internal parent ion heating. Induced by the electron transfer, the intermediate peptide radical cations fragment randomly at each amino acid position of the peptide backbone, which is particularly suitable for the sequence characterization of larger peptides and for post-translational modification (PTM) identification.
We investigated the electron transfer dissociation of large peptides using the Bruker HCTultra PTM Discovery System. Multiply charged positive peptide target ions are generated via conventional electrospray or via off-line nanospray. Electron transfer dissociation of larger peptides is done in a non-linear three-dimensional Paul trap, where consecutive trapping of peptide cations as well as reagent anion accumulation is enabled under full automatic software control.
Common resonance precursor activation via CID MS/ MS almost always results in poor sequence information if the peptide molecular weight exceeds 3000 Da. ETD MS/MS of the same precursor ion provides much better sequence information. Based on the enhanced resolution of the utilized instrument, since even quadruply charged fragment ions are identified, the entire amino acid sequence is read out of the ETD MS/MS spectrum.
Even when larger PTM peptides were investigated under ETD conditions, labile bounded PTMs remain attached to the peptide backbone, and are localized without ambiguity. ETD MS/MS data of larger phosphorylated as well as glycosylated peptides will be presented.
The identification of peptides by tandem mass spectrometry (MS/MS) is a central method of proteomics research, but due to the complexity of MS/MS data and the large databases searched, the accuracy of peptide identification algorithms remains limited. To improve the accuracy of identification we applied a machine-learning approach using a hidden Markov model (HMM) to capture the complex and often subtle links between a peptide sequence and its MS/MS spectrum.
Our model, HMM_Score, represents ion types as HMM states and calculates the maximum joint probability for a peptide/spectrum pair using emission probabilities from three factors: the amino acids adjacent to each fragmentation site, the mass dependence of ion types and the intensity dependence of ion types. The Viterbi algorithm is used to calculate the most probable assignment between ion types in a spectrum and a peptide sequence, then a correction factor is added to account for the propensity of the model to favor longer peptides. An expectation value is calculated based on the model score to assess the significance of each peptide/spectrum match.
We trained and tested HMM_Score on three data sets generated by two different mass spectrometer types. For a reference data set recently reported in the literature and validated using seven identification algorithms, HMM_Score produced 43% more positive identification results at a 1% false positive rate than the best of two other commonly used algorithms, Mascot and X!Tandem. HMM_Score is a highly accurate platform for peptide identification that works well for a variety of mass spectrometer and biological sample types.
The program is freely available on ProteomeCommons via an OpenSource license. See http://bioinfo.unc.edu/downloads/ for the download link.
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido.
Mass spectrometry; protein identification; graphical models; Bayesian inference
State Space Model (SSM) is a relatively new approach to inferring gene regulatory networks. It requires less computational time than Dynamic Bayesian Networks (DBN). There are two types of variables in the linear SSM, observed variables and hidden variables. SSM uses an iterative method, namely Expectation-Maximization, to infer regulatory relationships from microarray datasets. The hidden variables cannot be directly observed from experiments. How to determine the number of hidden variables has a significant impact on the accuracy of network inference. In this study, we used SSM to infer Gene regulatory networks (GRNs) from synthetic time series datasets, investigated Bayesian Information Criterion (BIC) and Principle Component Analysis (PCA) approaches to determining the number of hidden variables in SSM, and evaluated the performance of SSM in comparison with DBN.
True GRNs and synthetic gene expression datasets were generated using GeneNetWeaver. Both DBN and linear SSM were used to infer GRNs from the synthetic datasets. The inferred networks were compared with the true networks.
Our results show that inference precision varied with the number of hidden variables. For some regulatory networks, the inference precision of DBN was higher but SSM performed better in other cases. Although the overall performance of the two approaches is compatible, SSM is much faster and capable of inferring much larger networks than DBN.
This study provides useful information in handling the hidden variables and improving the inference precision.
Accurate peptide identification is important to high-throughput proteomics analyses that use mass spectrometry. Search programs compare fragmentation spectra (MS/MS) of peptides from complex digests with theoretically derived spectra from a database of protein sequences. Improved discrimination is achieved with theoretical spectra that are based on simulating gas phase chemistry of the peptides, but the limited understanding of those processes affects the accuracy of predictions from theoretical spectra.
We employed a robust data mining strategy using new feature annotation functions of MAE software, which revealed under-prediction of the frequency of occurrence in fragmentation of the second peptide bond. We applied methods of exploratory data analysis to pre-process the information in the MS/MS spectra, including data normalization and attribute selection, to reduce the attributes to a smaller, less correlated set for machine learning studies. We then compared our rule building machine learning program, DataSqueezer, with commonly used association rules and decision tree algorithms. All used machine learning algorithms produced similar results that were consistent with expected properties for a second gas phase mechanism at the second peptide bond.
The results provide compelling evidence that we have identified underlying chemical properties in the data that suggest the existence of an additional gas phase mechanism for the second peptide bond. Thus, the methods described in this study provide a valuable approach for analyses of this kind in the future.