The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow a similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives a 10–40% improvement in the number of distinct peptide identifications across a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as a similarity score between experimental spectra and peptide sequences.
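A minimal sketch of the target-decoy q-value estimation that the thresholds above refer to, assuming one best-scoring match per spectrum; the function name and interface are illustrative, not the authors' implementation:

```python
import numpy as np

def target_decoy_qvalues(scores, is_target):
    """Estimate q-values from target and decoy PSM scores.

    At each score threshold, FDR is approximated by the number of decoy
    matches above the threshold divided by the number of target matches;
    the q-value is the minimum FDR over all thresholds at least this permissive.
    """
    scores = np.asarray(scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    order = np.argsort(-scores)                      # best scores first
    targets = np.cumsum(is_target[order])
    decoys = np.cumsum(~is_target[order])
    fdr = decoys / np.maximum(targets, 1)
    qvals_sorted = np.minimum.accumulate(fdr[::-1])[::-1]
    qvals = np.empty_like(qvals_sorted)
    qvals[order] = qvals_sorted
    return qvals
```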
Gene transcription can be regulated by remote enhancer regions through chromosome looping, either in cis or in trans. Cancer cells are characterized by wholesale changes in long-range gene interactions, but the role that these long-range interactions play in cancer progression and metastasis is not well understood. In this study, we used IGFBP3, a gene involved in breast cancer pathogenesis, as bait in a 4C-seq experiment comparing normal breast cells (HMEC) with two breast cancer cell lines (MCF7, an ER-positive cell line, and MDA-MB-231, a triple negative cell line). The IGFBP3 long-range interaction profile was substantially altered in breast cancer: many interactions seen in normal breast cells were lost, and novel interactions appeared in the cancer cell lines. We found that in HMEC, the breast carcinoma amplified sequence (BCAS) genes 1–4 were among the top 10 most significantly enriched regions of interaction with IGFBP3. 3D-FISH analysis indicated that the translocation-prone BCAS genes, which are located on chromosomes 1, 17, and 20, are in close physical proximity to IGFBP3 and to each other in normal breast cells. We also found that epidermal growth factor receptor (EGFR), a gene implicated in tumorigenesis, interacts significantly with IGFBP3 and that this interaction may play a role in their regulation. Breakpoint analysis suggests that when an IGFBP3-interacting region undergoes a translocation, an additional interaction detectable by 4C is gained. Overall, our data from multiple lines of evidence suggest an important role for long-range chromosomal interactions in the pathogenesis of cancer.
Chemical cross-linking is an attractive technique for the study of the structure of protein complexes due to its low sample consumption and short analysis time. Furthermore, distance constraints obtained from the identification of cross-linked peptides by mass spectrometry can be used to construct and validate protein models. If a sufficient number of distance constraints are obtained, then determining the secondary structure of a protein can allow inference of the protein’s fold. In this work, we show how the distance constraints obtained from cross-linking experiments can identify secondary structures within the protein sequence. Molecular modeling of alpha helices and beta sheets indicates cross-linking patterns based on the topological distances between reactive residues. DSS cross-linking experiments with model alpha-helix-containing proteins corroborated the molecular modeling predictions. The patterns established here can be extended to other cross-linkers with known spacer lengths.
Tandem mass spectrometry experiments generate from thousands to millions of spectra. These spectra can be used to identify the presence of proteins in biological samples. In this work, we propose a new method to identify peptides (substrings of proteins) based on clustered tandem mass spectrometry data. In contrast to previously proposed approaches, which identify one representative spectrum for each cluster using traditional database searching algorithms, our method uses all available information to score all the spectra in a cluster against candidate peptides using Bayesian model selection. We illustrate the performance of our method by applying it to data from a standard mixture of seven proteins.
Bayesian analysis; Bioinformatics; Clustered tandem mass spectra; False discovery rate; Peptide identification; Proteomics
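The cluster-level scoring described in the clustered-spectra abstract above can be illustrated with a minimal sketch. Assuming the spectra in a cluster are conditionally independent given the generating peptide, the evidence for each candidate peptide is pooled across all spectra and normalized into posterior probabilities; the likelihood values and all names here are placeholders, not the authors' actual model:

```python
import numpy as np

def cluster_peptide_posteriors(log_lik, log_prior=None):
    """Posterior probability of each candidate peptide given a spectrum cluster.

    log_lik: array of shape (n_peptides, n_spectra) holding
             log p(spectrum_j | peptide_i) under some likelihood model.
    log_prior: optional log prior over the candidate peptides.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    log_joint = log_lik.sum(axis=1)              # pool evidence across the cluster
    if log_prior is not None:
        log_joint = log_joint + np.asarray(log_prior, dtype=float)
    log_joint -= log_joint.max()                 # numerical stability
    post = np.exp(log_joint)
    return post / post.sum()

# Example: three candidate peptides scored against a cluster of four spectra.
log_lik = np.log(np.array([[0.6, 0.5, 0.7, 0.6],
                           [0.3, 0.4, 0.2, 0.3],
                           [0.1, 0.1, 0.1, 0.1]]))
print(cluster_peptide_posteriors(log_lik))
```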
Additive genetic variance (VA) and total genetic variance (VG) are core concepts in biomedical, evolutionary and production-biology genetics. What determines the large variation in reported VA/VG ratios from line-cross experiments is not well understood. Here we report how the VA/VG ratio, and thus the ratio between narrow- and broad-sense heritability (h2/H2), varies as a function of the regulatory architecture underlying genotype-to-phenotype (GP) maps. We studied five dynamic models (of the cAMP pathway, glycolysis, circadian rhythms, the cell cycle, and heart cell dynamics). We assumed genetic variation to be reflected in model parameters and extracted phenotypes summarizing the system dynamics. Even when imposing purely linear genotype-to-parameter maps and no environmental variation, we observed quite low VA/VG ratios. In particular, systems with positive feedback and cyclic dynamics gave more non-monotone genotype-phenotype maps and much lower VA/VG ratios than those without. The results show that some regulatory architectures consistently maintain a transparent genotype-to-phenotype relationship, whereas other architectures generate more subtle patterns. Our approach can be used to elucidate these relationships across a whole range of biological systems in a systematic fashion.
The broad-sense heritability of a trait is the proportion of phenotypic variance attributable to genetic causes, while the narrow-sense heritability is the proportion attributable to additive gene effects. A better understanding of what underlies variation in the ratio of the two heritability measures, or the equivalent ratio of additive variance VA to total genetic variance VG, is important for production biology, biomedicine and evolution. We find that reported VA/VG values from line crosses vary greatly and ask whether the biological mechanisms underlying such differences can be elucidated by linking computational biology models with genetics. To this end, we made use of models of the cAMP pathway, glycolysis, circadian rhythms, the cell cycle and cardiocyte dynamics. We assumed additive gene action from genotypes to model parameters and studied the resulting GP maps and VA/VG ratios of system-level phenotypes. Our results show that some types of regulatory architectures consistently preserve a transparent genotype-to-phenotype relationship, whereas others generate more subtle patterns. In particular, systems with positive feedback and cyclic dynamics resulted in more non-monotonicity in the GP map, leading to lower VA/VG ratios. Our approach can be used to elucidate the VA/VG relationship across a whole range of biological systems in a systematic fashion.
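The VA/VG ratio discussed above can be made concrete with a single-locus sketch, assuming Hardy-Weinberg genotype frequencies: VG is the variance of the genotypic values, and VA is the portion of that variance captured by least-squares regression on allele count. This is a textbook calculation for illustration, not the authors' multi-locus model pipeline:

```python
import numpy as np

def va_vg_one_locus(genotypic_values, p):
    """Additive (VA) and total genetic (VG) variance at one biallelic locus.

    genotypic_values: phenotype means for genotypes (aa, Aa, AA).
    p: frequency of allele A (assumes 0 < p < 1 and Hardy-Weinberg frequencies).
    """
    g = np.asarray(genotypic_values, dtype=float)
    counts = np.array([0.0, 1.0, 2.0])                  # copies of allele A
    freqs = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
    mean_g = np.sum(freqs * g)
    vg = np.sum(freqs * (g - mean_g) ** 2)
    # Weighted least-squares regression of genotypic value on allele count.
    mean_c = np.sum(freqs * counts)
    cov = np.sum(freqs * (counts - mean_c) * (g - mean_g))
    var_c = np.sum(freqs * (counts - mean_c) ** 2)
    alpha = cov / var_c                                  # average allelic effect
    va = alpha ** 2 * var_c
    return va, vg

# Example: a non-monotone GP map (heterozygote above both homozygotes).
va, vg = va_vg_one_locus([1.0, 2.0, 1.0], p=0.5)
print(va / vg)   # 0.0: the genetic variance is entirely non-additive
```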
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The Python code used for this paper is available at http://noble.gs.washington.edu/proj/fido.
Mass spectrometry; protein identification; graphical models; Bayesian inference
In the context of evaluating audiological service outcomes, the primary objective was to determine baseline and target profiles on the Speech, Spatial and Qualities of Hearing scale (SSQ); a secondary objective was to test a short form of the SSQ. We also took the opportunity to compare the responses of samples providing consistent versus inconsistent self-assessments.
A 2×2×2 factorial design crossed age, reported presence versus absence of hearing difficulty, and low versus high self-rated hearing ability.
Eight samples (total n=413), representing two age ranges, a response of “yes” or “no” to a question about having hearing difficulty, and either low or high self-rated hearing ability on six items from the SSQ.
Using present and previous results, we determined baseline SSQ profiles indicating the pattern of response likely to be observed prior to clinical intervention, as well as both an achieved outcome and an “ideal” target outcome from such intervention. The six-item SSQ yielded better test-retest results in consistent than in inconsistent samples. The inconsistent samples showed signs of differing interpretations of “hearing difficulty”.
Baseline and both actual and ideal target outcomes can guide comparative appraisal of clinical achievements; more research is needed to determine a robust short form of the SSQ.
The salamander has the remarkable ability to regenerate its limb after amputation. Cells at the site of amputation form a blastema and then proliferate and differentiate to regrow the limb. To better understand this process, we performed deep RNA sequencing of the blastema over a time course in the axolotl, a species whose genome has not been sequenced. Using a novel comparative approach to analyzing RNA-seq data, we characterized the transcriptional dynamics of the regenerating axolotl limb with respect to the human gene set. This approach involved de novo assembly of axolotl transcripts, RNA-seq transcript quantification without a reference genome, and transformation of abundances from axolotl contigs to human genes. We found a prominent burst in oncogene expression during the first day and a peak in blastemal/limb bud gene expression at 7 to 14 days. In addition, we found that limb patterning genes, SALL genes, and genes involved in angiogenesis, wound healing, defense/immunity, and bone development are enriched during blastema formation and development. Finally, we identified a category of genes with no prior literature support for limb regeneration that are candidates for further evaluation based on their expression pattern during the regenerative process.
Salamanders such as the axolotl can fully regenerate a limb upon amputation, making them the vertebrate champions of regeneration. On the other hand, humans and other mammals possess a very limited ability to regenerate limb structures. Learning about the genes, gene networks, and pathways activated in the salamander during limb regeneration will provide cues to improving the regenerative response in mammals. Elucidating these genes, networks, and pathways is difficult, however, because the axolotl does not yet have its genome sequenced and because it has diverged evolutionarily from species with a sequenced genome. Here, we produce a set of gene transcripts via RNA sequencing (RNA-seq) for the axolotl and provide information on the nature of the genes activated during regeneration. To determine the identity of these axolotl genes, we use comparative transcriptomics techniques to match the axolotl transcript data to that of the well-annotated human gene set. Supporting previous studies, we find upregulation of many genes previously found to be involved in limb development and regeneration. In addition, we find a burst of cancer-related genes during the first phase of regeneration and identify a set of genes previously not associated with the regeneration process.
Cellular signal transduction generally involves cascades of post-translational protein modifications that rapidly catalyze changes in protein-DNA interactions and gene expression. High-throughput measurements are improving our ability to study each of these stages individually, but do not capture the connections between them. Here we present an approach for building a network of physical links among these data that can be used to prioritize targets for pharmacological intervention. Our method recovers the critical missing links between proteomic and transcriptional data by relating changes in chromatin accessibility to changes in expression, and then uses these links to connect the proteomic and transcriptome data. We applied our approach to integrate epigenomic, phosphoproteomic and transcriptome changes induced by the variant III mutation of the epidermal growth factor receptor (EGFRvIII) in a cell line model of glioblastoma multiforme (GBM). To test the relevance of the network, we used small molecules to target highly connected nodes implicated by the network model that were not detected by the experimental data in isolation, and we found that a large fraction of these agents alter cell viability. Among these are two compounds, ICG-001, targeting CREB binding protein (CREBBP), and PKF118-310, targeting β-catenin (CTNNB1), which have not been tested previously for effectiveness against GBM. At the level of transcriptional regulation, we used chromatin immunoprecipitation sequencing (ChIP-Seq) to experimentally determine the genome-wide binding locations of p300, a transcriptional co-regulator highly connected in the network. Analysis of p300 target genes suggested its role in tumorigenesis. We propose that this general method, in which experimental measurements are used as constraints for building regulatory networks from the interactome while taking into account noise and missing data, should be applicable to a wide range of high-throughput datasets.
The ways in which cells respond to changes in their environment are controlled by networks of physical links among proteins and genes. The initial signal of a change in conditions rapidly passes through these networks from the cytoplasm to the nucleus, where it can lead to long-term alterations in cellular behavior by controlling the expression of genes. These cascades of signaling events underlie many normal biological processes. As a result, being able to map out how these networks change in disease can provide critical insights for new approaches to treatment. We present a computational method for reconstructing these networks by finding links between the rapid short-term changes in proteins and the longer-term changes in gene regulation. This method brings together systematic measurements of protein signaling, genome organization and transcription in the context of protein-protein and protein-DNA interactions. When used to analyze datasets from an oncogene-expressing cell line model of human glioblastoma, our approach identifies key nodes that affect cell survival and functional transcriptional regulators.
Motivation: Accurate knowledge of the genome-wide binding of transcription factors in a particular cell type or under a particular condition is necessary for understanding transcriptional regulation. Using epigenetic data, such as histone modification and DNase I accessibility data, has been shown to improve motif-based in silico methods for predicting such binding, but this approach has not yet been fully explored.
Results: We describe a probabilistic method for combining one or more tracks of epigenetic data with a standard DNA sequence motif model to improve our ability to identify active transcription factor binding sites (TFBSs). We convert each data type into a position-specific probabilistic prior and combine these priors with a traditional probabilistic motif model to compute a log-posterior odds score. Our experiments, using histone modifications H3K4me1, H3K4me3, H3K9ac and H3K27ac, as well as DNase I sensitivity, show conclusively that the log-posterior odds score consistently outperforms a simple binary filter based on the same data. We also show that our approach performs competitively with a more complex method, CENTIPEDE, and suggest that the relative simplicity of the log-posterior odds scoring method makes it an appealing and very general method for identifying functional TFBSs on the basis of DNA and epigenetic evidence.
Availability and implementation: FIMO, part of the MEME Suite software toolkit, now supports log-posterior odds scoring using position-specific priors for motif search. A web server and source code are available at http://meme.nbcr.net. Utilities for creating priors are at http://research.imb.uq.edu.au/t.bailey/SD/Cuellar2011.
Supplementary information: Supplementary data are available at Bioinformatics online.
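The combination of a motif score with a position-specific prior described in the abstract above amounts to adding log prior odds to the motif log-likelihood ratio. A minimal sketch with an illustrative interface (not FIMO's actual API):

```python
import numpy as np

def log_posterior_odds(motif_log_likelihood_ratio, site_prior):
    """Combine a motif log-likelihood ratio with a position-specific prior.

    log [P(site | sequence, data) / P(background | sequence, data)]
        = log [P(sequence | site) / P(sequence | background)]
          + log [P(site) / P(background)]
    where P(site) is the position-specific prior derived from epigenetic data.
    """
    prior = np.clip(np.asarray(site_prior, dtype=float), 1e-6, 1 - 1e-6)
    return np.asarray(motif_log_likelihood_ratio, dtype=float) + np.log(prior / (1 - prior))

# Example: the same motif match at an accessible vs. an inaccessible position.
print(log_posterior_odds([5.0, 5.0], [0.2, 0.001]))
```

With a strong prior from open chromatin, an identical motif match scores far higher at the accessible position than at the inaccessible one, which is the behavior the probabilistic combination is meant to capture.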
The mechanism by which homologous chromosomes pair during meiosis, as a prelude to recombination, has long been mysterious. At meiosis, the telomeres in many organisms attach to the nuclear envelope and move together to form the telomere bouquet, perhaps to facilitate the homologous search. It is believed that diffusion alone is not sufficient to account for the formation of the bouquet, and that some directed movement is also required. Here we consider the formation of the telomere bouquet in a wheat-rye hybrid both experimentally and using mathematical modelling. The large size of the wheat nucleus and wheat's commercial importance make chromosomal pairing in wheat a particularly interesting and important process, which may well shed light on pairing in other organisms. We show that, prior to bouquet formation, sister chromatid telomeres are always attached to a hemisphere of the nuclear membrane and tend to associate in pairs. We study a mutant lacking the Ph1 locus, a locus ensuring correct homologous chromosome pairing, and discover that bouquet formation is delayed in the wild type compared to the mutant. Further, we develop a mathematical model of bouquet formation involving diffusion and directed movement, where we show that directed movement alone is sufficient to explain bouquet formation dynamics.
The appearance of sexual reproduction over a billion years ago led to a revolution in how organisms pass on genetic material to their offspring. In sexually reproducing organisms, parental diploid cells, containing two nearly identical copies of each chromosome (homologues), produce gametes containing only one copy of each chromosome. This in turn requires the pairing of the related homologous chromosomes to ensure their subsequent segregation into the gametes. How this pairing is achieved is poorly understood since chromosomes must search the entire nucleus for their homologous partner. Many organisms move the ends of each chromosome (the telomeres) along the periphery of the nucleus into a small patch forming the telomere bouquet. We show here that direct movement of telomeres towards the bouquet site, potentially driven by molecular motors, can explain bouquet formation dynamics. We focus in particular on a wheat-rye hybrid since understanding homologous pairing in wheat could have profound implications for breeding resistant crops by aiding the production of hybrids. We also show that wheat seems to have evolved a mechanism to delay the onset of telomere bouquet formation, perhaps in order to ensure chromosomes find their correct homologous partners.
The ENCODE Project has generated a wealth of experimental information mapping diverse chromatin properties in several human cell lines. Although each such data track is independently informative toward the annotation of regulatory elements, their interrelations contain much richer information for the systematic annotation of regulatory elements. To uncover these interrelations and to generate an interpretable summary of the massive datasets of the ENCODE Project, we apply unsupervised learning methodologies, converting dozens of chromatin datasets into discrete annotation maps of regulatory regions and other chromatin elements across the human genome. These methods rediscover and summarize diverse aspects of chromatin architecture, elucidate the interplay between chromatin activity and RNA transcription, and reveal that a large proportion of the genome lies in a quiescent state, even across multiple cell types. The resulting annotation of non-coding regulatory elements correlates strongly with mammalian evolutionary constraint, and provides an unbiased approach for evaluating metrics of evolutionary constraint in humans. Lastly, we use the regulatory annotations to revisit previously uncharacterized disease-associated loci, resulting in focused, testable hypotheses through the lens of the chromatin landscape.
Spectral counting methods provide an easy means of identifying proteins with differing abundances between complex mixtures using shotgun proteomics data. The crux spectral-counts command, distributed as part of the Crux software toolkit, implements four previously reported spectral counting methods: the spectral index (SIN), the exponentially modified protein abundance index (emPAI), the normalized spectral abundance factor (NSAF), and the distributed normalized spectral abundance factor (dNSAF).
We compared the four spectral counting metrics with respect to their reproducibility and the linearity of their relationship to each protein’s abundance. Our analysis suggests that NSAF yields the most reproducible counts across technical and biological replicates, and that both SIN and NSAF achieve the best linearity.
With the crux spectral-counts command, Crux provides open-source modular methods to analyze mass spectrometry data for identifying and now quantifying peptides and proteins. The C++ source code, compiled binaries, spectra and sequence databases are available at
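Of the four metrics listed above, the normalized spectral abundance factor has a particularly simple form: each protein's spectral count is divided by its length and then normalized across all proteins in the run. A minimal sketch of NSAF only (the shared-peptide weighting used by dNSAF is omitted):

```python
import numpy as np

def nsaf(spectral_counts, protein_lengths):
    """Normalized spectral abundance factor:
    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)."""
    saf = np.asarray(spectral_counts, dtype=float) / np.asarray(protein_lengths, dtype=float)
    return saf / saf.sum()

# Example: two proteins with equal counts but different lengths.
print(nsaf([100, 100], [500, 250]))   # the shorter protein gets the larger NSAF
```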
Inferring the combinatorial regulatory code of transcription factors (TFs) from genome-wide TF binding profiles is challenging. A major reason is that TF binding profiles significantly overlap and are therefore highly correlated. Clustered occurrence of multiple TFs at genomic sites may arise from chromatin accessibility and local cooperation between TFs, or binding sites may simply appear clustered if the profiles are generated from diverse cell populations. Overlaps in TF binding profiles may also result from measurements taken at closely related time intervals. It is thus of great interest to distinguish TFs that directly regulate gene expression from those that are indirectly associated with gene expression. Graphical models, in particular Bayesian networks, provide a powerful mathematical framework to infer different types of dependencies. However, existing methods do not perform well when the features (here: TF binding profiles) are highly correlated, when their association with the biological outcome is weak, and when the sample size is small. Here, we develop novel computational methods, the Neighbourhood Consistent PC (NCPC) algorithms, which deal with these scenarios much more effectively than existing methods do. We further present a novel graphical representation, the Direct Dependence Graph (DDGraph), to better display the complex interactions among variables. NCPC and DDGraph can also be applied to other problems involving highly correlated biological features. Both methods are implemented in the R package ddgraph, available as part of Bioconductor (http://bioconductor.org/packages/2.11/bioc/html/ddgraph.html). Applied to real data, our method identified TFs that specify different classes of cis-regulatory modules (CRMs) in Drosophila mesoderm differentiation. Our analysis also found depletion of the early transcription factor Twist binding at the CRMs regulating expression in visceral and somatic muscle cells at later stages, which suggests a CRM-specific repression mechanism that so far has not been characterised for this class of mesodermal CRMs.
Transcription factors (TFs) are proteins that bind to DNA and regulate gene expression. Recent technological advances make it possible to map TF binding patterns across the whole genome. Multiple single-gene studies showed that combinatorial binding of multiple transcription factors determines the gene transcriptional output. A common naive assumption is that correlated binding profiles may indicate combinatorial binding. However, it has been found that many TFs bind to distinct hotspots whose role is currently unclear. It is thus of great interest to find transcription factor combinations whose correlated binding is causally most immediate to gene expression. Building upon theories of statistical dependence and causality, we develop novel graphical model-based algorithms that handle highly correlated transcription factor binding profiles more efficiently and reliably than existing algorithms do. These algorithms can also be applied to other biological areas involving highly correlated variables, such as the analysis of high-throughput gene knock-down experiments.
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
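A minimal sketch of the cross-validation scheme discussed above: PSMs are partitioned into folds, a model is trained on the target and decoy labels of all but one fold, and only the held-out fold is scored by that model, so no PSM is ranked by a model that was fit to it. The training function is abstracted and all names are illustrative:

```python
import numpy as np

def cross_validated_psm_scores(features, is_target, train_fn, n_folds=3, seed=0):
    """Score PSMs so that each PSM is scored only by a model trained without it.

    train_fn(train_features, train_is_target) must return a callable that maps
    a feature matrix to scores (e.g., a fitted linear classifier's decision values).
    """
    features = np.asarray(features, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(is_target))
    scores = np.empty(len(is_target), dtype=float)
    for f in range(n_folds):
        held_out = fold == f
        model = train_fn(features[~held_out], is_target[~held_out])
        scores[held_out] = model(features[held_out])
    return scores
```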
We applied a dynamic Bayesian network method that identifies joint patterns from multiple functional genomics experiments to ChIP-seq histone modification and transcription factor data, and DNaseI-seq and FAIRE-seq open chromatin readouts from the human cell line K562. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. Software and genome browser tracks are at http://noble.gs.washington.edu/proj/segway/.
Computational analysis of mass spectra remains the bottleneck in many proteomics experiments. SEQUEST was one of the earliest software packages to identify peptides from mass spectra by searching a database of known peptides. Though still popular, SEQUEST performs slowly. Crux and TurboSEQUEST have successfully sped up SEQUEST by adding a precomputed index to the search, but the demand for ever-faster peptide identification software continues to grow. Tide, introduced here, is a software program that implements the SEQUEST algorithm for peptide identification and that achieves a dramatic speedup over Crux and SEQUEST. The optimization strategies detailed here employ a combination of algorithmic and software engineering techniques to achieve speeds up to 170 times faster than a recent version of SEQUEST that uses indexing. For example, on a single Xeon CPU, Tide searches 10,000 spectra against a tryptic database of 27,499 C. elegans proteins at a rate of 1,550 spectra per second, which compares favorably with a rate of 8.8 spectra per second for a recent version of SEQUEST with indexing, running on the same hardware.
shotgun proteomics; peptide identification
Tandem mass spectrometry has emerged as a powerful tool for the characterization of complex protein samples, an increasingly important problem in biology. The effort to efficiently and accurately perform inference on data from tandem mass spectrometry experiments has resulted in several statistical methods. We use a common framework to describe the predominant methods and discuss them in detail. These methods are classified into the following categories: set cover methods, iterative methods, and Bayesian methods. For each method, we analyze and evaluate the outcome and methodology of published comparisons to other methods; we use this comparison to comment on the strengths and weaknesses, as well as the overall utility, of all the methods. We discuss the similarities between these methods and suggest directions for the field that would help unify their shared assumptions in a more rigorous manner and enable efficient and reliable protein inference.
Mass spectrometry; Proteomics; Bayesian methods
Many human diseases, arising from mutations of disease susceptibility genes (genetic diseases), are also associated with viral infections (virally implicated diseases), either in a directly causal manner or by indirect associations. Here we examine whether viral perturbations of the host interactome may underlie such virally implicated disease relationships. Using as models two different human viruses, Epstein-Barr virus (EBV) and human papillomavirus (HPV), we find that host targets of viral proteins reside in network proximity to products of disease susceptibility genes. Expression changes in virally implicated disease tissues and comorbidity patterns cluster significantly in the network vicinity of viral targets. The topological proximity found between cellular targets of viral proteins and disease genes was exploited to uncover a novel pathway linking HPV to Fanconi anemia.
Many “virally implicated human diseases” (diseases for which there is scientific consensus of viral involvement) are associated with genetic alterations in particular disease susceptibility genes. We proposed and demonstrated, for two human viruses (Epstein-Barr virus and human papillomavirus), that topological proximity should exist between host targets of viruses and genes associated with virally implicated diseases in host interactome networks (the local impact hypothesis). For representative EBV- and HPV16-implicated diseases, genes in the neighborhood of viral targets in the host interactome have significantly shifted expression levels in virally implicated disease tissues, in line with the local impact hypothesis. The viral neighborhoods in the host interactome, together with their disease associations (which we define as “viral disease networks”), contain connections known to be informative about disease mechanisms, as well as diseases whose associations with viruses are not yet known. We prioritized these diseases as candidate virally implicated diseases based on network topology, and benchmarked this prioritization using relative risk measurements, which capture population-based clinical associations between candidate diseases and viral infection. Exogenous expression of HPV viral proteins in a human cell line offered evidence for a novel disease pathway that links HPV to Fanconi anemia.
Motivation: A question that often comes up after applying a motif finder to a set of co-regulated DNA sequences is whether the reported putative motif is similar to any known motif. While several tools have been designed for this task, Habib et al. pointed out that the scores that are commonly used for measuring similarity between motifs do not distinguish between a good alignment of two informative columns (say, all-A) and one of two uninformative columns. This observation explains why tools such as Tomtom occasionally return an alignment of uninformative columns which is clearly spurious. To address this problem, Habib et al. suggested a new score [Bayesian Likelihood 2-Component (BLiC)] which uses a Bayesian information criterion to penalize matches that are also similar to the background distribution.
Results: We show that the BLiC score exhibits other, highly undesirable properties, and we offer instead a general approach to adjust any motif similarity score so as to reduce the number of reported spurious alignments of uninformative columns. We implement our method in Tomtom and show that, without significantly compromising Tomtom's retrieval accuracy or its runtime, we can drastically reduce the number of uninformative alignments.
Availability and Implementation: The modified Tomtom is available as part of the MEME Suite at http://meme.nbcr.net.
Supplementary Information: Supplementary data are available at Bioinformatics online.
In shotgun proteomics, the quality of a hypothesized match between an observed spectrum and a peptide sequence is quantified by a score function. Because the score function lies at the heart of any peptide identification pipeline, this function greatly affects the final results of a proteomics assay. Consequently, valid statistical methods for assessing the quality of a given score function are extremely important. Previously, several research groups have used samples of known protein composition to assess the quality of a given score function. We demonstrate that this approach is problematic, because the outcome can depend on factors other than the score function itself. We then propose an alternative use of the same type of data to assess the quality of a given score function. The central idea of our approach is that database matches that are not explained by any protein in the purified sample comprise a robust representation of incorrect matches. We apply our alternative assessment scheme to several commonly used score functions, and we show that our approach generates a reproducible measure of the calibration of a given peptide identification method. Furthermore, we show how our quality test can be useful in the development of novel score functions.
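One way to make the idea above concrete, assuming the search engine reports p-values: matches to proteins absent from the purified sample serve as a proxy for incorrect matches, and for a well-calibrated score their p-values should be roughly uniform. The sketch below returns the empirical curve to compare against the diagonal; it is an illustration, not necessarily the authors' exact procedure:

```python
import numpy as np

def calibration_curve(p_values, is_outside_sample, thresholds=None):
    """Empirical calibration check for a search engine's reported p-values.

    p_values: p-values for all reported matches.
    is_outside_sample: True for matches to proteins not in the purified sample
                       (treated as a proxy for incorrect matches).
    Returns (thresholds, fraction of proxy-incorrect p-values <= each threshold);
    for a well-calibrated score the two arrays should agree closely.
    """
    p = np.asarray(p_values, dtype=float)[np.asarray(is_outside_sample, dtype=bool)]
    thresholds = np.linspace(0.01, 1.0, 100) if thresholds is None else np.asarray(thresholds)
    empirical = np.array([(p <= t).mean() for t in thresholds])
    return thresholds, empirical
```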
In higher eukaryotes, how the replication program is specified in different cell types remains to be fully understood. We show for seven human cell lines that about half of the genome is divided into domains that display a characteristic U-shaped replication timing profile with early initiation zones at borders and late replication at centers. Significant overlap is observed between the U-domains of different cell lines and also with germline replication domains exhibiting an N-shaped nucleotide compositional skew. From the demonstration that the average fork polarity is directly reflected by both the compositional skew and the derivative of the replication timing profile, we argue that the N-shape displayed by this derivative in U-domains supports the existence of large-scale gradients of replication fork polarity in somatic and germline cells. Analysis of chromatin interaction (Hi-C) and chromatin marker data reveals that U-domains correspond to high-order chromatin structural units. We discuss possible models for replication origin activation within U/N-domains. The compartmentalization of the genome into replication U/N-domains provides new insights into the organization of the replication program in the human genome.
DNA replication in human cells requires the parallel progression along the genome of thousands of replication machineries. Comprehensive knowledge of genetic inheritance at different developmental stages relies on elucidating the mechanisms that regulate the location and progression of these machineries throughout the DNA synthetic phase of the cell cycle. Here, we demonstrate in multiple human cell types the existence of a new type of megabase-sized replication domain across which the average orientation of the replication machinery changes in a linear manner. These domains are revealed in seven somatic cell types by a U-shaped pattern in the replication timing profiles, as well as by N-shaped patterns in the DNA compositional asymmetry profile reflecting the existence of a replication-associated mutational asymmetry in the germline. These domains therefore correspond to a robust mode of replication across cell types and during evolution. Using genome-wide data on the frequency of interaction of distant chromatin segments in two cell lines, we find that these U/N-replication domains correspond remarkably well to self-interacting folding units of the chromatin fiber.
A variety of functionally important protein properties, such as secondary structure, transmembrane topology and solvent accessibility, can be encoded as a labeling of amino acids. Indeed, the prediction of such properties from the primary amino acid sequence is one of the core projects of computational biology. Accordingly, a panoply of approaches has been developed for predicting such properties; however, most such approaches focus on solving a single task at a time. Motivated by recent, successful work in natural language processing, we propose to use multitask learning to train a single, joint model that exploits the dependencies among these various labeling tasks. We describe a deep neural network architecture that, given a protein sequence, outputs a host of predicted local properties, including secondary structure, solvent accessibility, transmembrane topology, signal peptides and DNA-binding residues. The network is trained jointly on all these tasks in a supervised fashion, augmented with a novel form of semi-supervised learning in which the model is trained to distinguish between local patterns from natural and synthetic protein sequences. The task-independent architecture of the network obviates the need for task-specific feature engineering. We demonstrate that, for all of the tasks that we considered, our approach leads to statistically significant improvements in performance, relative to a single-task neural network approach, and that the resulting model achieves state-of-the-art performance.
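A minimal sketch of a shared-trunk, multi-head network of the kind described above, written with PyTorch; the window size, hidden dimension and task list are placeholders rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class MultiTaskProteinTagger(nn.Module):
    """Shared layers plus one output head per labeling task, applied to a
    fixed-length window of integer-encoded amino acids centered on a residue."""

    def __init__(self, vocab_size=25, embed_dim=16, window=15, hidden=128, task_sizes=None):
        super().__init__()
        # Example tasks and label counts; placeholders, not the paper's exact set.
        task_sizes = task_sizes or {"secondary_structure": 3,
                                    "solvent_accessibility": 2,
                                    "transmembrane": 2}
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.shared = nn.Sequential(nn.Linear(embed_dim * window, hidden), nn.Tanh())
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in task_sizes.items()})

    def forward(self, windows):
        # windows: (batch, window) tensor of amino-acid indices
        h = self.embed(windows).flatten(start_dim=1)
        h = self.shared(h)
        return {task: head(h) for task, head in self.heads.items()}

# Joint training would sum per-task cross-entropy losses over the returned logits.
model = MultiTaskProteinTagger()
logits = model(torch.randint(0, 25, (4, 15)))
```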
High-throughput proteomics experiments involving tandem mass spectrometry produce large volumes of complex data that require sophisticated computational analyses. As such, the field offers many challenges for computational biologists. In this article, we briefly introduce some of the core computational and statistical problems in the field and then describe a variety of outstanding problems that readers of PLoS Computational Biology might be able to help solve.
A growing body of experimental evidence supports the hypothesis that the 3D structure of chromatin in the nucleus is closely linked to important functional processes, including DNA replication and gene regulation. In support of this hypothesis, several research groups have examined sets of functionally associated genomic loci, with the aim of determining whether those loci are statistically significantly colocalized. This work presents a critical assessment of two previously reported analyses, both of which used genome-wide DNA–DNA interaction data from the yeast Saccharomyces cerevisiae, and both of which rely upon a simple notion of the statistical significance of colocalization. We show that these previous analyses rely upon a faulty assumption, and we propose a correct non-parametric resampling approach to the same problem. Applying this approach to the same data set does not support the hypothesis that transcriptionally coregulated genes tend to colocalize, but strongly supports the colocalization of centromeres, and provides some evidence of colocalization of origins of early DNA replication, chromosomal breakpoints and transfer RNAs.
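A minimal sketch of a resampling test for colocalization: compare the mean pairwise distance among the loci of interest with the same statistic for repeatedly resampled locus sets of equal size. This toy version draws loci uniformly at random; as the abstract argues, a sound null model must additionally control for relevant properties of the tested loci, which the sketch omits:

```python
import numpy as np

def colocalization_test(distances, loci, n_resamples=10000, seed=0):
    """Toy resampling test for 3D colocalization of a set of loci.

    distances: square matrix of pairwise 3D (or interaction-derived) distances
               between all loci in the genome model.
    loci: indices of the functionally associated loci being tested.
    Returns an empirical p-value for the observed mean pairwise distance being
    as small as it is, under uniform resampling of locus sets of the same size.
    """
    rng = np.random.default_rng(seed)
    D = np.asarray(distances, dtype=float)
    loci = np.asarray(loci)

    def mean_pairwise(idx):
        sub = D[np.ix_(idx, idx)]
        return sub[np.triu_indices(len(idx), k=1)].mean()

    observed = mean_pairwise(loci)
    null = np.array([mean_pairwise(rng.choice(len(D), size=len(loci), replace=False))
                     for _ in range(n_resamples)])
    return (1 + np.sum(null <= observed)) / (1 + n_resamples)
```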