Cancers arise from successive rounds of mutation and selection, generating clonal populations that vary in size, mutational content and drug responsiveness. Ascertaining the clonal composition of a tumor is therefore important both for prognosis and therapy. Mutation counts and frequencies resulting from next-generation sequencing (NGS) potentially reflect a tumor's clonal composition; however, deconvolving NGS data to infer a tumor's clonal structure presents a major challenge. We propose a generative model for NGS data derived from multiple subsections of a single tumor, and we describe an expectation-maximization procedure for estimating the clonal genotypes and relative frequencies using this model. We demonstrate, via simulation, the validity of the approach, and then use our algorithm to assess the clonal composition of a primary breast cancer and associated metastatic lymph node. After dividing the tumor into subsections, we perform exome sequencing for each subsection to assess mutational content, followed by deep sequencing to precisely count normal and variant alleles within each subsection. By quantifying the frequencies of 17 somatic variants, we demonstrate that our algorithm predicts clonal relationships that are both phylogenetically and spatially plausible. Applying this method to larger numbers of tumors should cast light on the clonal evolution of cancers in space and time.
Cancers arise from a series of mutations that occur over time. As a result, as a tumor grows each cell inherits a distinctive genotype, defined by the set of all somatic mutations that distinguish the tumor cell from normal cells. Acertaining these genotype patterns, and identifying which ones are associated with the growth of the cancer and its ability to metastasize, can potentially give clinicians insights into how to treat the cancer. In this work, we describe a method for inferring the predominant genotypes within a single tumor. The method requires that a tumor be sectioned and that each section be subjected to a high-throughput sequencing procedure. The resulting mutations and their associated frequencies within each tumor section are then used as input to a probabilistic model that infers the underlying genotypes and their relative frequencies within the tumor. We use simulated data to demonstrate the validity of the approach, and then we apply our algorithm to data from a primary breast cancer and associated metastatic lymph node. We demonstrate that our algorithm predicts genotypes that are consistent with an evolutionary model and with the physical topology of the tumor itself. Applying this method to larger numbers of tumors should cast light on the evolution of cancers in space and time.
The analysis of a shotgun proteomics experiment results in a list of peptide-spectrum matches (PSMs) in which each fragmentation spectrum has been matched to a peptide in a database. Subsequently, most protein inference algorithms rank peptides according to the best-scoring PSM for each peptide. However, there is disagreement in the scientific literature on the best method to assess the statistical significance of the resulting peptide identifications. Here, we use a previously described calibration protocol to evaluate the accuracy of three different peptide-level statistical confidence estimation procedures: the classical Fisher’s method, and two complementary procedures that estimate significance, respectively, before and after selecting the top-scoring PSM for each spectrum. Our experiments show that the latter method, which is employed by MaxQuant and Percolator, produces the most accurate, well-calibrated results.
Shotgun proteomics; peptides; statistics
Motivation: Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA–DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction.
Methods: We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
Results: We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms—two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method—on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions.
Availability and implementation: A Python implementation of the proposed method is available at http://cbio.ensmp.fr/pastis.
email@example.com or firstname.lastname@example.org
The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives 10–40% improvement in the number of distinct peptide identifications relative to a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as similarity score between experimental spectra and peptide sequences.
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido.
Mass spectrometry; protein identification; graphical models; Bayesian inference
Motivation Accurate knowledge of the genome-wide binding of transcription factors in a particular cell type or under a particular condition is necessary for understanding transcriptional regulation. Using epigenetic data such as histone modification and DNase I, accessibility data has been shown to improve motif-based in silico methods for predicting such binding, but this approach has not yet been fully explored.
Results We describe a probabilistic method for combining one or more tracks of epigenetic data with a standard DNA sequence motif model to improve our ability to identify active transcription factor binding sites (TFBSs). We convert each data type into a position-specific probabilistic prior and combine these priors with a traditional probabilistic motif model to compute a log-posterior odds score. Our experiments, using histone modifications H3K4me1, H3K4me3, H3K9ac and H3K27ac, as well as DNase I sensitivity, show conclusively that the log-posterior odds score consistently outperforms a simple binary filter based on the same data. We also show that our approach performs competitively with a more complex method, CENTIPEDE, and suggest that the relative simplicity of the log-posterior odds scoring method makes it an appealing and very general method for identifying functional TFBSs on the basis of DNA and epigenetic evidence.
Availability and implementation: FIMO, part of the MEME Suite software toolkit, now supports log-posterior odds scoring using position-specific priors for motif search. A web server and source code are available at http://meme.nbcr.net. Utilities for creating priors are at http://research.imb.uq.edu.au/t.bailey/SD/Cuellar2011.
Supplementary information: Supplementary data are available at Bioinformatics online.
The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlate strongly with mammalian evolutionary constraint, and provide an unbiased approach for evaluating metrics of evolutionary constraint in human. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.
Spectral counting methods provide an easy means of identifying proteins with differing abundances between complex mixtures using shotgun proteomics data. The crux spectral-counts command, implemented as part of the Crux software toolkit, implements four previously reported spectral counting methods, the spectral index (SIN), the exponentially modified protein abundance index (emPAI), the normalized spectral abundance factor (NSAF), and the distributed normalized spectral abundance factor (dNSAF).
We compared the reproducibility and the linearity relative to each protein’s abundance of the four spectral counting metrics. Our analysis suggests that NSAF yields the most reproducible counts across technical and biological replicates, and both SIN and NSAF achieve the best linearity.
With the crux spectral-counts command, Crux provides open-source modular methods to analyze mass spectrometry data for identifying and now quantifying peptides and proteins. The C++ source code, compiled binaries, spectra and sequence databases are available at
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
We applied a dynamic Bayesian network method that identifies joint patterns from multiple functional genomics experiments to ChIP-seq histone modification and transcription factor data, and DNaseI-seq and FAIRE-seq open chromatin readouts from the human cell line K562. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
Computational analysis of mass spectra remains the bottleneck in many proteomics experiments. SEQUEST was one of the earliest software packages to identify peptides from mass spectra by searching a database of known peptides. Though still popular, SEQUEST performs slowly. Crux and TurboSEQUEST have successfully sped up SEQUEST by adding a precomputed index to the search, but the demand for ever-faster peptide identification software continues to grow. Tide, introduced here, is a software program that implements the SEQUEST algorithm for peptide identification and that achieves a dramatic speedup over Crux and SEQUEST. The optimization strategies detailed here employ a combination of algorithmic and software engineering techniques to achieve speeds up to 170 times faster than a recent version of SEQUEST that uses indexing. For example, on a single Xeon CPU, Tide searches 10,000 spectra against a tryptic database of 27,499 C. elegans proteins at a rate of 1,550 spectra per second, which compares favorably with a rate of 8.8 spectra per second for a recent version of SEQUEST with index running on the same hardware.
shotgun proteomics; peptide identification
Motivation: A question that often comes up after applying a motif finder to a set of co-regulated DNA sequences is whether the reported putative motif is similar to any known motif. While several tools have been designed for this task, Habib et al. pointed out that the scores that are commonly used for measuring similarity between motifs do not distinguish between a good alignment of two informative columns (say, all-A) and one of two uninformative columns. This observation explains why tools such as Tomtom occasionally return an alignment of uninformative columns which is clearly spurious. To address this problem, Habib et al. suggested a new score [Bayesian Likelihood 2-Component (BLiC)] which uses a Bayesian information criterion to penalize matches that are also similar to the background distribution.
Results: We show that the BLiC score exhibits other, highly undesirable properties, and we offer instead a general approach to adjust any motif similarity score so as to reduce the number of reported spurious alignments of uninformative columns. We implement our method in Tomtom and show that, without significantly compromising Tomtom's retrieval accuracy or its runtime, we can drastically reduce the number of uninformative alignments.
Availability and Implementation: The modified Tomtom is available as part of the MEME Suite at http://meme.nbcr.net.
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: Supplementary data are available at Bioinformatics online.
In shotgun proteomics, the quality of a hypothesized match between an observed spectrum and a peptide sequence is quantified by a score function. Because the score function lies at the heart of any peptide identification pipeline, this function greatly affects the final results of a proteomics assay. Consequently, valid statistical methods for assessing the quality of a given score function are extremely important. Previously, several research groups have used samples of known protein composition to assess the quality of a given score function. We demonstrate that this approach is problematic, because the outcome can depend on factors other than the score function itself. We then propose an alternative use of the same type of data to assess the quality of a given score function. The central idea of our approach is that database matches that are not explained by any protein in the purified sample comprise a robust representation of incorrect matches. We apply our alternative assessment scheme to several commonly used score functions, and we show that our approach generates a reproducible measure of the calibration of a given peptide identification method. Furthermore, we show how our quality test can be useful in the development of novel score functions.
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches have been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single task neural network approach, and that the resulting model achieves state-of-the-art performance.
High-throughput proteomics experiments involving tandem mass spectrometry produce large volumes of complex data that require sophisticated computational analyses. As such, the field offers many challenges for computational biologists. In this article, we briefly introduce some of the core computational and statistical problems in the field and then describe a variety of outstanding problems that readers of PLoS Computational Biology might be able to help solve.
A growing body of experimental evidence supports the hypothesis that the 3D structure of chromatin in the nucleus is closely linked to important functional processes, including DNA replication and gene regulation. In support of this hypothesis, several research groups have examined sets of functionally associated genomic loci, with the aim of determining whether those loci are statistically significantly colocalized. This work presents a critical assessment of two previously reported analyses, both of which used genome-wide DNA–DNA interaction data from the yeast Saccharomyces cerevisiae, and both of which rely upon a simple notion of the statistical significance of colocalization. We show that these previous analyses rely upon a faulty assumption, and we propose a correct non-parametric resampling approach to the same problem. Applying this approach to the same data set does not support the hypothesis that transcriptionally coregulated genes tend to colocalize, but strongly supports the colocalization of centromeres, and provides some evidence of colocalization of origins of early DNA replication, chromosomal breakpoints and transfer RNAs.
As genome-wide experiments and annotations become more prevalent, researchers increasingly require tools to help interpret data at this scale. Many functional genomics experiments involve partitioning the genome into labeled segments, such that segments sharing the same label exhibit one or more biochemical or functional traits. For example, a collection of ChlP-seq experiments yields a compendium of peaks, each labeled with one or more associated DNA-binding proteins. Similarly, manually or automatically generated annotations of functional genomic elements, including cis-regulatory modules and protein-coding or RNA genes, can also be summarized as genomic segmentations.
We present a software toolkit called Segtools that simplifies and automates the exploration of genomic segmentations. The software operates as a series of interacting tools, each of which provides one mode of summarization. These various tools can be pipelined and summarized in a single HTML page. We describe the Segtools toolkit and demonstrate its use in interpreting a collection of human histone modification data sets and Plasmodium falciparum local chromatin structure data sets.
Segtools provides a convenient, powerful means of interpreting a genomic segmentation.
The problem of identifying proteins from a shotgun proteomics experiment has not been definitively solved. Identifying the proteins in a sample requires ranking them, ideally with interpretable scores. In particular, “degenerate” peptides, which map to multiple proteins, have made such a ranking difficult to compute. The problem of computing posterior probabilities for the proteins, which can be interpreted as confidence in a protein’s presence, has been especially daunting. Previous approaches have either ignored the peptide degeneracy problem completely, addressed it by computing a heuristic set of proteins or heuristic posterior probabilities, or by estimating the posterior probabilities with sampling methods. We present a probabilistic model for protein identification in tandem mass spectrometry that recognizes peptide degeneracy. We then introduce graph-transforming algorithms that facilitate efficient computation of protein probabilities, even for large data sets. We evaluate our identification procedure on five different well-characterized data sets and demonstrate our ability to efficiently compute high-quality protein posteriors.
Mass spectrometric identification of cross-linked peptides can provide valuable information about the structure of protein complexes. We describe a straightforward database search scheme that identifies and assigns statistical confidence estimates to spectra from cross-linked peptides. The method is well suited to targeted analysis of a single protein complex, without requiring an isotope labeling strategy. Our approach uses a SEQUEST-style search procedure in which the database is comprised of a mixture: single peptides with and without linkers attached, and cross-linked products. In contrast to several previous approaches, we generate theoretical spectra that account for all of the expected peaks from a cross-linked product, and we employ an empirical curve-fitting procedure to estimate statistical confidence measures. We show that our fully automated procedure successfully re-identifies spectra from a previous study, and we provide evidence that our statistical confidence estimates are accurate.
protein-protein interaction; peptide identification; calibration; cross-linked peptides
Summary: A motif is a short DNA or protein sequence that contributes to the biological function of the sequence in which it resides. Over the past several decades, many computational methods have been described for identifying, characterizing and searching with sequence motifs. Critical to nearly any motif-based sequence analysis pipeline is the ability to scan a sequence database for occurrences of a given motif described by a position-specific frequency matrix.
Results: We describe Find Individual Motif Occurrences (FIMO), a software tool for scanning DNA or protein sequences with motifs described as position-specific scoring matrices. The program computes a log-likelihood ratio score for each position in a given sequence database, uses established dynamic programming methods to convert this score to a P-value and then applies false discovery rate analysis to estimate a q-value for each position in the given sequence. FIMO provides output in a variety of formats, including HTML, XML and several Santa Cruz Genome Browser formats. The program is efficient, allowing for the scanning of DNA sequences at a rate of 3.5 Mb/s on a single CPU.
Availability and Implementation: FIMO is part of the MEME Suite software toolkit. A web server and source code are available at http://meme.sdsc.edu.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods—i.e., measures of similarity between query and target sequences—provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrate the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional “semantic space.” Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
Searching a protein or DNA sequence database to find sequences that are evolutionarily related to a query is one of the foundational problems in computational biology. These database searches rely on pairwise comparisons of sequence similarity between the query and targets, but despite years of method refinements, pairwise comparisons still often fail to detect more distantly related targets. In this study, we adapt recent work from natural language processing to exploit the global structure of the data space in this detection problem. In particular, we borrow the idea of a semantic embedding, where by training on a large text data set, one learns an embedding of words into a low-dimensional semantic space such that words embedded close to each other are likely to be semantically related. We present the ProtEmbed algorithm, which learns an embedding of protein sequences into a semantic space where evolutionarily-related proteins are embedded in close proximity. The flexible training algorithm allows additional pieces of evidence, such as 3D structural information, to be incorporated in the learning process and enables ProtEmbed to achieve state-of-the-art performance for the task of detecting targets that have remote evolutionary relationships to the query.
Drawing valid conclusions from an experiment often requires associating statistical confidence measures with the observed data. But these measures can be stated in terms of p-values, false discovery rates or q-values. What are the differences? And how should you decide which one to use?
Accurately modeling the DNA sequence preferences of transcription factors (TFs), and using these models to predict in vivo genomic binding sites for TFs, are key pieces in deciphering the regulatory code. These efforts have been frustrated by the limited availability and accuracy of TF binding site motifs, usually represented as position-specific scoring matrices (PSSMs), which may match large numbers of sites and produce an unreliable list of target genes. Recently, protein binding microarray (PBM) experiments have emerged as a new source of high resolution data on in vitro TF binding specificities. PBM data has been analyzed either by estimating PSSMs or via rank statistics on probe intensities, so that individual sequence patterns are assigned enrichment scores (E-scores). This representation is informative but unwieldy because every TF is assigned a list of thousands of scored sequence patterns. Meanwhile, high-resolution in vivo TF occupancy data from ChIP-seq experiments is also increasingly available. We have developed a flexible discriminative framework for learning TF binding preferences from high resolution in vitro and in vivo data. We first trained support vector regression (SVR) models on PBM data to learn the mapping from probe sequences to binding intensities. We used a novel -mer based string kernel called the di-mismatch kernel to represent probe sequence similarities. The SVR models are more compact than E-scores, more expressive than PSSMs, and can be readily used to scan genomics regions to predict in vivo occupancy. Using a large data set of yeast and mouse TFs, we found that our SVR models can better predict probe intensity than the E-score method or PBM-derived PSSMs. Moreover, by using SVRs to score yeast, mouse, and human genomic regions, we were better able to predict genomic occupancy as measured by ChIP-chip and ChIP-seq experiments. Finally, we found that by training kernel-based models directly on ChIP-seq data, we greatly improved in vivo occupancy prediction, and by comparing a TF's in vitro and in vivo models, we could identify cofactors and disambiguate direct and indirect binding.
Transcription factors (TFs) are proteins that bind sites in the non-coding DNA and regulate the expression of targeted genes. Being able to predict the genome-wide binding locations of TFs is an important step in deciphering gene regulatory networks. Historically, there was very limited experimental data on the DNA-binding preferences of most TFs. Computational biologists used known sites to estimate simple binding site motifs, called position-specific scoring matrices, and scan the genome for additional potential binding locations, but this approach often led to many false positive predictions. Here we introduce a machine learning approach to leverage new high resolution data on the binding preferences of TFs, namely, protein binding microarray (PBM) experiments which measure the in vitro binding affinities of TFs with respect to an array of double-stranded DNA probes, and chromatin immunoprecipitation experiments followed by next generation sequencing (ChIP-seq) which measure in vivo genome-wide binding of TFs in a given cell type. We show that by training statistical models on high resolution PBM and ChIP-seq data, we can more accurately represent the subtle DNA binding preferences of TFs and predict their genome-wide binding locations. These results will enable advances in the computational analysis of transcriptional regulation in mammalian genomes.