SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering.
Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project where the true genotypes are known in advance. All methods were found to have high overall accuracy (> 98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data.
Methods that tailor their approach to calling low frequency variants by either varying the number of clusters (KRLMM) or using information from other SNPs (OptiCall and GenoSNP) offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the open-source crlmm package distributed via the Bioconductor project (http://www.bioconductor.org).
Genotyping; Clustering; Microarray data analysis
Motivation: Repetitive sequences account for approximately half of the human genome. Accurately ascertaining sequences in these regions with next generation sequencers is challenging, and requires a different set of analytical techniques than for reads originating from unique sequences. Complicating the matter are repetitive regions subject to programmed rearrangements, as is the case with the antigen-binding domains in the Immunoglobulin (Ig) and T-cell receptor (TCR) loci.
Results: We developed a probability-based score and visualization method to aid in distinguishing true structural variants from alignment artifacts. We demonstrate the usefulness of this method in its ability to separate real structural variants from false positives generated with existing upstream analysis tools. We validated our approach using both target-capture and whole-genome experiments. Capture sequencing reads were generated from primary lymphoid tumors, cancer cell lines and an EBV-transformed lymphoblast cell line over the Ig and TCR loci. Whole-genome sequencing reads were from a lymphoblastoid cell-line.
Availability: We implement our method as an R package available at https://github.com/Eitan177/targetSeqView. Code to reproduce the figures and results are also available.
Supplementary data are available at Bioinformatics online.
Epigenome-wide association studies of human disease and other quantitative traits are becoming increasingly common. A series of papers reporting age-related changes in DNA methylation profiles in peripheral blood have already been published. However, blood is a heterogeneous collection of different cell types, each with a very different DNA methylation profile.
Using a statistical method that permits estimating the relative proportion of cell types from DNA methylation profiles, we examine data from five previously published studies, and find strong evidence of cell composition change across age in blood. We also demonstrate that, in these studies, cellular composition explains much of the observed variability in DNA methylation. Furthermore, we find high levels of confounding between age-related variability and cellular composition at the CpG level.
Our findings underscore the importance of considering cell composition variability in epigenetic studies based on whole blood and other heterogeneous tissue sources. We also provide software for estimating and exploring this composition confounding for the Illumina 450k microarray.
RNA-sequencing (RNA-seq) is a flexible technology for measuring genome-wide expression that is rapidly replacing microarrays as costs become comparable. Current differential expression analysis methods for RNA-seq data fall into two broad classes: (1) methods that quantify expression within the boundaries of genes previously published in databases and (2) methods that attempt to reconstruct full length RNA transcripts. The first class cannot discover differential expression outside of previously known genes. While the second approach does possess discovery capabilities, statistical analysis of differential expression is complicated by the ambiguity and variability incurred while assembling transcripts and estimating their abundances. Here, we propose a novel method that first identifies differentially expressed regions (DERs) of interest by assessing differential expression at each base of the genome. The method then segments the genome into regions comprised of bases showing similar differential expression signal, and then assigns a measure of statistical significance to each region. Optionally, DERs can be annotated using a reference database of genomic features. We compare our approach with leading competitors from both current classes of differential expression methods and highlight the strengths and weaknesses of each. A software implementation of our method is available on github (https://github.com/alyssafrazee/derfinder).
Bioinformatics; Differential expression; False discovery rate; Genomics; RNA sequencing
High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.
Summary: Frozen robust multiarray analysis (fRMA) is a single-array preprocessing algorithm that retains the advantages of multiarray algorithms and removes certain batch effects by downweighting probes that have high between-batch residual variance. Here, we extend the fRMA algorithm to two new microarray platforms—Affymetrix Human Exon and Gene 1.0 ST—by modifying the fRMA probe-level model and extending the frma package to work with oligo ExonFeatureSet and GeneFeatureSet objects.
Availability and implementation: All packages are implemented in R. Source code and binaries are freely available through the Bioconductor project. Convenient links to all software and data packages can be found at http://mnmccall.com/software
The Gene Expression Barcode project, http://barcode.luhs.org, seeks to determine the genes expressed for every tissue and cell type in humans and mice. Understanding the absolute expression of genes across tissues and cell types has applications in basic cell biology, hypothesis generation for gene function and clinical predictions using gene expression signatures. In its current version, this project uses the abundant publicly available microarray data sets combined with a suite of single-array preprocessing, quality control and analysis methods. In this article, we present the improvements that have been made since the previous version of the Gene Expression Barcode in 2011. These include a variety of new data mining tools and summaries, estimated transcriptomes and curated annotations.
The behavior of epigenetic mechanisms in the brain is obscured by tissue heterogeneity and disease-related histological changes. Not accounting for these confounders leads to biased results. We develop a statistical methodology that estimates and adjusts for celltype composition by decomposing neuronal and non-neuronal differential signal. This method provides a conceptual framework for deconvolving heterogeneous epigenetic data from postmortem brain studies. We apply it to find cell-specific differentially methylated regions between prefrontal cortex and hippocampus. We demonstrate the utility of the method on both Infinium 450k and CHARM data.
DNA methylation; epigenetics; differentially methylated region; brain region; cell-type heterogeneity; deconvolution; NeuN; neuron; glia; postmortem brain; fluorescence activated cell sorting
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adopt wavelet based functional models to the context of tiling arrays. The high dimensionality of the data triggered us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
Human cancers nearly ubiquitously harbor epigenetic alterations. While such alterations in epigenetic marks, including DNA methylation, are potentially heritable, they can also be dynamically altered. Given this potential for plasticity, the degree to which epigenetic changes can be subject to selection and act as drivers of neoplasia has been questioned. Here, we carried out genome-scale analyses of DNA methylation alterations in lethal metastatic prostate cancer and created DNA methylation “cityscape” plots to visualize these complex data. We show that somatic DNA methylation alterations, despite showing marked inter-individual heterogeneity among men with lethal metastatic prostate cancer, were maintained across all metastases within the same individual. The overall extent of maintenance in DNA methylation changes was comparable to that of genetic copy number alterations. Regions that were frequently hypermethylated across individuals were markedly enriched for cancer and development/differentiation related genes. Additionally, regions exhibiting high consistency of hypermethylation across metastases within individuals, even if variably hypermethylated across individuals, showed enrichment of cancer-related genes. Interestingly, whereas some regions showed intra-individual metastatic tumor heterogeneity in promoter methylation, such methylation alterations were generally not correlated with gene expression. This was despite a general tendency for promoter methylation patterns to be strongly correlated with gene expression, particularly at regions that were variably methylated across individuals. These findings suggest that DNA methylation alterations have the potential for producing selectable driver events in carcinogenesis and disease progression and highlight the possibility of targeting such epigenome alterations for development of longitudinal markers and therapeutic strategies.
In honeybee societies, distinct caste phenotypes are created from the same genotype, suggesting a role for epigenetics in deriving these behaviorally different phenotypes. We found no differences in DNA methylation between irreversible worker/queen castes, but substantial differences between nurses and forager subcastes. Reverting foragers back to nurses reestablished methylation levels for a majority of genes and provided the first evidence in any organism of reversible epigenetic changes associated with behavior.
Motivation: Although chromatin immunoprecipitation coupled with
high-throughput sequencing (ChIP-seq) or tiling array hybridization (ChIP-chip) is
increasingly used to map genome-wide–binding sites of transcription factors (TFs),
it still remains difficult to generate a quality ChIPx (i.e. ChIP-seq or ChIP-chip)
dataset because of the tremendous amount of effort required to develop effective
antibodies and efficient protocols. Moreover, most laboratories are unable to easily
obtain ChIPx data for one or more TF(s) in more than a handful of biological contexts.
Thus, standard ChIPx analyses primarily focus on analyzing data from one experiment, and
the discoveries are restricted to a specific biological context.
Results: We propose to enrich this existing data analysis paradigm by
developing a novel approach, ChIP-PED, which superimposes ChIPx data on large amounts of
publicly available human and mouse gene expression data containing a diverse collection of
cell types, tissues and disease conditions to discover new biological contexts with
potential TF regulatory activities. We demonstrate ChIP-PED using a number of examples,
including a novel discovery that MYC, a human TF, plays an important
functional role in pediatric Ewing sarcoma cell lines. These examples show that ChIP-PED
increases the value of ChIPx data by allowing one to expand the scope of possible
discoveries made from a ChIPx experiment.
Supplementary data are available at Bioinformatics
Background Gestational age at birth strongly predicts neonatal, adolescent and adult morbidity and mortality through mostly unknown mechanisms. Identification of specific genes that are undergoing regulatory change prior to birth, such as through changes in DNA methylation, would increase our understanding of developmental changes occurring during the third trimester and consequences of pre-term birth (PTB).
Methods We performed a genome-wide analysis of DNA methylation (using microarrays, specifically CHARM 2.0) in 141 newborns collected in Baltimore, MD, using novel statistical methodology to identify genomic regions associated with gestational age at birth. Bisulphite pyrosequencing was used to validate significant differentially methylated regions (DMRs), and real-time PCR was performed to assess functional significance of differential methylation in a subset of newborns.
Results We identified three DMRs at genome-wide significance levels adjacent to the NFIX, RAPGEF2 and MSRB3 genes. All three regions were validated by pyrosequencing, and RAGPEF2 also showed an inverse correlation between DNA methylation levels and gene expression levels. Although the three DMRs appear very dynamic with gestational age in our newborn sample, adult DNA methylation levels at these regions are stable and of equal or greater magnitude than the oldest neonate, directionally consistent with the gestational age results.
Conclusions We have identified three differentially methylated regions associated with gestational age at birth. All three nearby genes play important roles in the development of several organs, including skeletal muscle, brain and haematopoietic system. Therefore, they may provide initial insight into the basis of PTB's negative health outcomes. The genome-wide custom DNA methylation array technology and novel statistical methods employed in this study could constitute a model for epidemiologic studies of epigenetic variation.
Epigenetic epidemiology; differentially methylated regions; pre-term birth; gestational age; genome-wide DNA methylation
Background During the past 5 years, high-throughput technologies have been successfully used by epidemiology studies, but almost all have focused on sequence variation through genome-wide association studies (GWAS). Today, the study of other genomic events is becoming more common in large-scale epidemiological studies. Many of these, unlike the single-nucleotide polymorphism studied in GWAS, are continuous measures. In this context, the exercise of searching for regions of interest for disease is akin to the problems described in the statistical ‘bump hunting’ literature.
Methods New statistical challenges arise when the measurements are continuous rather than categorical, when they are measured with uncertainty, and when both biological signal, and measurement errors are characterized by spatial correlation along the genome. Perhaps the most challenging complication is that continuous genomic data from large studies are measured throughout long periods, making them susceptible to ‘batch effects’. An example that combines all three characteristics is genome-wide DNA methylation measurements. Here, we present a data analysis pipeline that effectively models measurement error, removes batch effects, detects regions of interest and attaches statistical uncertainty to identified regions.
Results We illustrate the usefulness of our approach by detecting genomic regions of DNA methylation associated with a continuous trait in a well-characterized population of newborns. Additionally, we show that addressing unexplained heterogeneity like batch effects reduces the number of false-positive regions.
Conclusions Our framework offers a comprehensive yet flexible approach for identifying genomic regions of biological interest in large epidemiological studies using quantitative high-throughput methods.
Epigenetic epidemiology; DNA methylation; genome-wide analysis; bump hunting; batch effects
It has recently been proposed that variation in DNA methylation at specific genomic locations may play an important role in the development of complex diseases such as cancer. Here, we develop 1- and 2-group multiple testing procedures for identifying and quantifying regions of DNA methylation variability. Our method is the first genome-wide statistical significance calculation for increased or differential variability, as opposed to the traditional approach of testing for mean changes. We apply these procedures to genome-wide methylation data obtained from biological and technical replicates and provide the first statistical proof that variably methylated regions exist and are due to interindividual variation. We also show that differentially variable regions in colon tumor and normal tissue show enrichment of genes regulating gene expression, cell morphogenesis, and development, supporting a biological role for DNA methylation variability in cancer.
Bump finding; Functional data analysis; Multiple testing; Preprocessing; Variably methylation regions (VMRs)
454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.
We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
Early screening for cancer is arguably one of the greatest public health advances over the last fifty years. However, many cancer screening tests are invasive (digital rectal exams), expensive (mammograms, imaging) or both (colonoscopies). This has spurred growing interest in developing genomic signatures that can be used for cancer diagnosis and prognosis. However, progress has been slowed by heterogeneity in cancer profiles and the lack of effective computational prediction tools for this type of data.
We developed anti-profiles as a first step towards translating experimental findings suggesting that stochastic across-sample hyper-variability in the expression of specific genes is a stable and general property of cancer into predictive and diagnostic signatures. Using single-chip microarray normalization and quality assessment methods, we developed an anti-profile for colon cancer in tissue biopsy samples. To demonstrate the translational potential of our findings, we applied the signature developed in the tissue samples, without any further retraining or normalization, to screen patients for colon cancer based on genomic measurements from peripheral blood in an independent study (AUC of 0.89). This method achieved higher accuracy than the signature underlying commercially available peripheral blood screening tests for colon cancer (AUC of 0.81). We also confirmed the existence of hyper-variable genes across a range of cancer types and found that a significant proportion of tissue-specific genes are hyper-variable in cancer. Based on these observations, we developed a universal cancer anti-profile that accurately distinguishes cancer from normal regardless of tissue type (ten-fold cross-validation AUC > 0.92).
We have introduced anti-profiles as a new approach for developing cancer genomic signatures that specifically takes advantage of gene expression heterogeneity. We have demonstrated that anti-profiles can be successfully applied to develop peripheral-blood based diagnostics for cancer and used anti-profiles to develop a highly accurate universal cancer signature. By using single-chip normalization and quality assessment methods, no further retraining of signatures developed by the anti-profile approach would be required before their application in clinical settings. Our results suggest that anti-profiles may be used to develop inexpensive and non-invasive universal cancer screening tests.
Gene expression; Cancer; Genomic signatures; Microarray normalization and quality assessment; Anti-profiles
DNA methylation is an important epigenetic modification involved in gene regulation, which can now be measured using whole-genome bisulfite sequencing. However, cost, complexity of the data, and lack of comprehensive analytical tools are major challenges that keep this technology from becoming widely applied. Here we present BSmooth, an alignment, quality control and analysis pipeline that provides accurate and precise results even with low coverage data, appropriately handling biological replicates. BSmooth is open source software, and can be downloaded from http://rafalab.jhsph.edu/bsmooth.
First identified as histone-modifying proteins, lysine acetyltranferases (KATs) and deacetylases (KDACs) antagonize each other through modification of the side chains of lysine residues in histone proteins1. (De)acetylation of many non-histone proteins involved in chromatin, metabolism or cytoskeleton regulation were further identified in eukaryotic organisms2–6, but the corresponding modifying enzymes and substrate-specific functions of the modification are unclear. Moreover, mechanisms underlying functional specificity of individual KDACs7 remain enigmatic, and the substrate spectra of each KDAC lack comprehensive definition. Here we dissect the functional specificity of twelve critical human KDACs using a genome-wide synthetic lethality screen8–13 in cultured human cells. The genetic interaction profiles revealed enzyme-substrate relationships between individual KDACs and many important substrates governing a wide array of biological processes including metabolism, development and cell cycle progression. We further confirmed that (de)acetylation of the catalytic subunit of the adenosine monophosphate-activated protein kinase (AMPK), a critical cellular energy-sensing protein kinase complex, is controlled by the opposing catalytic activities of HDAC1 and p300. Its deacetylation enhances physical interaction with the upstream kinase LKB1, in turn leading to AMPK phosphorylation and activation, resulting in lipid breakdown in human liver cells. These findings provide new insights into previously underappreciated metabolism-regulatory roles of HDAC1 in coordinating nutrient availability and cellular responses upstream of AMPK, and demonstrate the importance of high-throughput genetic interaction profiling to elucidate functional specificity and critical substrates of individual human KDACs potentially valuable for therapeutic applications.
In systems biology, the task of reverse engineering gene pathways from data has been limited not just by the curse of dimensionality (the interaction space is huge) but also by systematic error in the data. The gene expression barcode reduces spurious association driven by batch effects and probe effects. The binary nature of the resulting expression calls lends itself perfectly to modern regularization approaches that thrive in high-dimensional settings.
The Partitioned LASSO-Patternsearch algorithm is proposed to identify patterns of multiple dichotomous risk factors for outcomes of interest in genomic studies. A partitioning scheme is used to identify promising patterns by solving many LASSO-Patternsearch subproblems in parallel. All variables that survive this stage proceed to an aggregation stage where the most significant patterns are identified by solving a reduced LASSO-Patternsearch problem in just these variables. This approach was applied to genetic data sets with expression levels dichotomized by gene expression bar code. Most of the genes and second-order interactions thus selected and are known to be related to the outcomes.
We demonstrate with simulations and data analyses that the proposed method not only selects variables and patterns more accurately, but also provides smaller models with better prediction accuracy, in comparison to several alternative methodologies.
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complimentary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.
copy number; batch effects; robust; multilevel model; high-throughput; oligonucleotide array
Motivation: Changes in the copy number of chromosomal DNA segments [copy number variants (CNVs)] have been implicated in human variation, heritable diseases and cancers. Microarray-based platforms are the current established technology of choice for studies reporting these discoveries and constitute the benchmark against which emergent sequence-based approaches will be evaluated. Research that depends on CNV analysis is rapidly increasing, and systematic platform assessments that distinguish strengths and weaknesses are needed to guide informed choice.
Results: We evaluated the sensitivity and specificity of six platforms, provided by four leading vendors, using a spike-in experiment. NimbleGen and Agilent platforms outperformed Illumina and Affymetrix in accuracy and precision of copy number dosage estimates. However, Illumina and Affymetrix algorithms that leverage single nucleotide polymorphism (SNP) information make up for this disadvantage and perform well at variant detection. Overall, the NimbleGen 2.1M platform outperformed others, but only with the use of an alternative data analysis pipeline to the one offered by the manufacturer.
Availability: The data is available from http://rafalab.jhsph.edu/cnvcomp/.
Contact: firstname.lastname@example.org; email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
While genome-wide association studies are ongoing to identify sequence variation influencing susceptibility to major depressive disorder (MDD), epigenetic marks, such as DNA methylation, which can be influenced by environment, might also play a role. Here we present the first genome-wide DNA methylation (DNAm) scan in MDD. We compared 39 postmortem frontal cortex MDD samples to 26 controls. DNA was hybridized to our Comprehensive High-throughput Arrays for Relative Methylation (CHARM) platform, covering 3.5 million CpGs. CHARM identified 224 candidate regions with DNAm differences >10%. These regions are highly enriched for neuronal growth and development genes. Ten of 17 regions for which validation was attempted showed true DNAm differences; the greatest were in PRIMA1, with 12–15% increased DNAm in MDD (p = 0.0002–0.0003), and a concomitant decrease in gene expression. These results must be considered pilot data, however, as we could only test replication in a small number of additional brain samples (n = 16), which showed no significant difference in PRIMA1. Because PRIMA1 anchors acetylcholinesterase in neuronal membranes, decreased expression could result in decreased enzyme function and increased cholinergic transmission, consistent with a role in MDD. We observed decreased immunoreactivity for acetylcholinesterase in MDD brain with increased PRIMA1 DNAm, non-significant at p = 0.08.
While we cannot draw firm conclusions about PRIMA1 DNAm in MDD, the involvement of neuronal development genes across the set showing differential methylation suggests a role for epigenetics in the illness. Further studies using limbic system brain regions might shed additional light on this role.
DNA methylation is a key regulator of gene function in a multitude of both normal and abnormal biological processes, but tools to elucidate its roles on a genome-wide scale are still in their infancy. Methylation sensitive restriction enzymes and microarrays provide a potential high-throughput, low-cost platform to allow methylation profiling. However, accurate absolute methylation estimates have been elusive due to systematic errors and unwanted variability. Previous microarray preprocessing procedures, mostly developed for expression arrays, fail to adequately normalize methylation-related data since they rely on key assumptions that are violated in the case of DNA methylation. We develop a normalization strategy tailored to DNA methylation data and an empirical Bayes percentage methylation estimator that together yield accurate absolute methylation estimates that can be compared across samples. We illustrate the method on data generated to detect methylation differences between tissues and between normal and tumor colon samples.
DNA methylation; Epigenetics; Microarray