PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-7 (7)
 

Clipboard (0)
None

Select a Filter Below

Journals
Authors
more »
Year of Publication
Document Types
1.  Unsupervised pattern discovery in human chromatin structure through genomic segmentation 
Nature methods  2012;9(5):473-476.
We applied a dynamic Bayesian network method that identifies joint patterns from multiple functional genomics experiments to ChIP-seq histone modification and transcription factor data, and DNaseI-seq and FAIRE-seq open chromatin readouts from the human cell line K562. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
doi:10.1038/nmeth.1937
PMCID: PMC3340533  PMID: 22426492
2.  Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure 
BMC Bioinformatics  2011;12:154.
Background
Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.
Results
In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.
Conclusions
We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at http://noble.gs.washington.edu/proj/pssp.
doi:10.1186/1471-2105-12-154
PMCID: PMC3118164  PMID: 21569525
3.  Learning a Weighted Sequence Model of the Nucleosome Core and Linker Yields More Accurate Predictions in Saccharomyces cerevisiae and Homo sapiens 
PLoS Computational Biology  2010;6(7):e1000834.
DNA in eukaryotes is packaged into a chromatin complex, the most basic element of which is the nucleosome. The precise positioning of the nucleosome cores allows for selective access to the DNA, and the mechanisms that control this positioning are important pieces of the gene expression puzzle. We describe a large-scale nucleosome pattern that jointly characterizes the nucleosome core and the adjacent linkers and is predominantly characterized by long-range oscillations in the mono, di- and tri-nucleotide content of the DNA sequence, and we show that this pattern can be used to predict nucleosome positions in both Homo sapiens and Saccharomyces cerevisiae more accurately than previously published methods. Surprisingly, in both H. sapiens and S. cerevisiae, the most informative individual features are the mono-nucleotide patterns, although the inclusion of di- and tri-nucleotide features results in improved performance. Our approach combines a much longer pattern than has been previously used to predict nucleosome positioning from sequence—301 base pairs, centered at the position to be scored—with a novel discriminative classification approach that selectively weights the contributions from each of the input features. The resulting scores are relatively insensitive to local AT-content and can be used to accurately discriminate putative dyad positions from adjacent linker regions without requiring an additional dynamic programming step and without the attendant edge effects and assumptions about linker length modeling and overall nucleosome density. Our approach produces the best dyad-linker classification results published to date in H. sapiens, and outperforms two recently published models on a large set of S. cerevisiae nucleosome positions. Our results suggest that in both genomes, a comparable and relatively small fraction of nucleosomes are well-positioned and that these positions are predictable based on sequence alone. We believe that the bulk of the remaining nucleosomes follow a statistical positioning model.
Author Summary
DNA in eukaryotes is packaged into a chromatin complex, the most basic element of which is the nucleosome. The precise positioning of the nucleosome cores allows for selective access to the DNA, and the mechanisms that control this positioning are important pieces of the gene expression puzzle. In this work, we describe a large-scale DNA sequence pattern that jointly characterizes the sequence preferences of the nucleosome core and the adjacent linkers. We show that this pattern can be used to predict nucleosome positions in both H. sapiens and S. cerevisiae more accurately than previously published methods. The model is most accurate in predicting the most stably positioned nucleosomes, and describes a sequence composition pattern that determines a locally optimal dyad (nucleosomal DNA mid-point) position. In contrast to some previous models, this model is not based primarily on excluding poly-A/T sequences, nor does the model prefer 10 bp periodicity. Our results suggest that local sequence composition is one of many factors that direct the positioning of nucleosomes, while dynamic processes such as transcriptional elongation and the actions of chromatin remodeling complexes also play a significant role in the overall chromatin landscape.
doi:10.1371/journal.pcbi.1000834
PMCID: PMC2900294  PMID: 20628623
4.  A dynamic Bayesian network for identifying protein-binding footprints from single molecule-based sequencing data 
Bioinformatics  2010;26(12):i334-i342.
Motivation: A global map of transcription factor binding sites (TFBSs) is critical to understanding gene regulation and genome function. DNaseI digestion of chromatin coupled with massively parallel sequencing (digital genomic footprinting) enables the identification of protein-binding footprints with high resolution on a genome-wide scale. However, accurately inferring the locations of these footprints remains a challenging computational problem.
Results: We present a dynamic Bayesian network-based approach for the identification and assignment of statistical confidence estimates to protein-binding footprints from digital genomic footprinting data. The method, DBFP, allows footprints to be identified in a probabilistic framework and outperforms our previously described algorithm in terms of precision at a fixed recall. Applied to a digital footprinting data set from Saccharomyces cerevisiae, DBFP identifies 4679 statistically significant footprints within intergenic regions. These footprints are mainly located near transcription start sites and are strongly enriched for known TFBSs. Footprints containing no known motif are preferentially located proximal to other footprints, consistent with cooperative binding of these footprints. DBFP also identifies a set of statistically significant footprints in the yeast coding regions. Many of these footprints coincide with the boundaries of antisense transcripts, and the most significant footprints are enriched for binding sites of the chromatin-associated factors Abf1 and Rap1.
Contact: jay.hesselberth@ucdenver.edu; william-noble@u.washington.edu
Supplementary information: Supplementary material is available at Bioinformatics online.
doi:10.1093/bioinformatics/btq175
PMCID: PMC2881360  PMID: 20529925
5.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics (Oxford, England)  2008;24(13):i348-i356.
Motivation
Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results
We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability
Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Contact
noble@gs.washington.edu
doi:10.1093/bioinformatics/btn189
PMCID: PMC2665034  PMID: 18586734
6.  Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks 
PLoS Computational Biology  2008;4(11):e1000213.
Hidden Markov models (HMMs) have been successfully applied to the tasks of transmembrane protein topology prediction and signal peptide prediction. In this paper we expand upon this work by making use of the more powerful class of dynamic Bayesian networks (DBNs). Our model, Philius, is inspired by a previously published HMM, Phobius, and combines a signal peptide submodel with a transmembrane submodel. We introduce a two-stage DBN decoder that combines the power of posterior decoding with the grammar constraints of Viterbi-style decoding. Philius also provides protein type, segment, and topology confidence metrics to aid in the interpretation of the predictions. We report a relative improvement of 13% over Phobius in full-topology prediction accuracy on transmembrane proteins, and a sensitivity and specificity of 0.96 in detecting signal peptides. We also show that our confidence metrics correlate well with the observed precision. In addition, we have made predictions on all 6.3 million proteins in the Yeast Resource Center (YRC) database. This large-scale study provides an overall picture of the relative numbers of proteins that include a signal-peptide and/or one or more transmembrane segments as well as a valuable resource for the scientific community. All DBNs are implemented using the Graphical Models Toolkit. Source code for the models described here is available at http://noble.gs.washington.edu/proj/philius. A Philius Web server is available at http://www.yeastrc.org/philius, and the predictions on the YRC database are available at http://www.yeastrc.org/pdr.
Author Summary
Transmembrane proteins control the flow of information and substances into and out of the cell and are involved in a broad range of biological processes. Their interfacing role makes them rewarding drug targets, and it is estimated that more than 50% of recently launched drugs target membrane proteins. However, experimentally determining the three-dimensional structure of a transmembrane protein is still a difficult task, and few of the currently known tertiary structures are of transmembrane proteins despite the fact that as many as one quarter of the proteins in a given organism are transmembrane proteins. Computational methods for predicting the basic topology of a transmembrane protein are therefore of great interest, and these methods must be able to distinguish between mature, membrane-spanning proteins and proteins that, when first synthesized, contain an N-terminal membrane-spanning signal peptide. In this work, we present Philius, a new computational approach that outperforms previous methods in simultaneously detecting signal peptides and correctly predicting the topology of transmembrane proteins. Philius also supplies a set of confidence scores with each prediction. A Philius Web server is available to the public as well as precomputed predictions for over six million proteins in the Yeast Resource Center database.
doi:10.1371/journal.pcbi.1000213
PMCID: PMC2570248  PMID: 18989393
7.  Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification 
Bioinformatics  2008;24(13):i348-i356.
Motivation: Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
Results: We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Availability: Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Contact:noble@gs.washington.edu
doi:10.1093/bioinformatics/btn189
PMCID: PMC2665034  PMID: 18586734

Results 1-7 (7)