Massively-parallel cDNA sequencing has opened the way to deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here, we present the Trinity methodology for de novo full-length transcriptome reconstruction, and evaluate it on samples from fission yeast, mouse, and whitefly – an insect whose genome has not yet been sequenced. Trinity fully reconstructs a large fraction of the transcripts present in the data, also reporting alternative splice isoforms and transcripts from recently duplicated genes. In all cases, Trinity performs better than other available de novo transcriptome assembly programs, and its sensitivity is comparable to methods relying on genome alignments. Our approach provides a unified and general solution for transcriptome reconstruction in any sample, especially in the complete absence of a reference genome.
DNA methylation is highly dynamic during mammalian embryogenesis. It is broadly accepted that the paternal genome is actively depleted of 5-methyl cytosine at fertilization, followed by passive loss that reaches a minimum at the blastocyst stage. However, this model is based on limited data, and to date no base-resolution maps exist to support and refine it. Here, we generated genome-scale DNA methylation maps in mouse gametes and through post-implantation embryogenesis. We find that the oocyte already exhibits global hypomethylation, most prominently at specific families of long interspersed element-1 and long terminal repeat retro-elements, which are disparate between gametes and resolve to lower methylation values in the zygote. Surprisingly, the oocyte contributes a unique set of Differentially Methylated Regions (DMRs), including many CpG Island promoter regions, that are maintained in the early embryo but are lost upon specification and absent from somatic cells. In contrast, sperm-contributed DMRs are largely intergenic and resolve to hypermethylation after the blastocyst stage. Our data provide a complete genome-scale, base-resolution timeline of DNA methylation in the pre-specified embryo, when this epigenetic modification is most dynamic, before returning to the canonical somatic pattern.
Sequencing-based approaches have led to new insights about DNA methylation. While many different techniques for genome-scale mapping of DNA methylation have been employed, throughput has been a key limitation for most. To further facilitate the mapping of DNA methylation, we describe a protocol for gel-free multiplexed reduced representation bisulfite sequencing (mRRBS) that reduces the workload dramatically and enables processing of 96 or more samples per week. mRRBS achieves similar CpG coverage to the original RRBS protocol, while the higher throughput and lower cost make it better suited for large-scale DNA methylation mapping studies, including cohorts of cancer samples.
Learning to read and write the transcriptional regulatory code is of central importance to progress in genetic analysis and engineering. Here, we describe a massively parallel reporter assay (MPRA) that enables systematic dissection of transcriptional regulatory elements by integrating microarray-based DNA synthesis and high-throughput tag sequencing. We apply MPRA to compare more than 27,000 distinct variants of two inducible enhancers in human cells: a synthetic cAMP-regulated enhancer and the virus-inducible interferon beta enhancer. We first show that the resulting data define accurate maps of functional transcription factor binding sites in both enhancers at single-nucleotide resolution. We then use the data to train quantitative sequence-activity models (QSAMs) of the two enhancers. We show that QSAMs from two cellular states can be combined to identify novel enhancer variants that optimize potentially conflicting objectives, such as maximizing induced activity while minimizing basal activity.
We have developed a process for transcriptome analysis of bacterial communities that accommodates both intact and fragmented starting RNA and combines efficient rRNA removal with strand-specific RNA-seq. We applied this approach to an RNA mixture derived from three diverse cultured bacterial species and to RNA isolated from clinical stool samples. The resulting expression profiles were highly reproducible, enriched up to 40-fold for non-rRNA transcripts, and correlated well with profiles representing undepleted total RNA.
The developmental potential of human pluripotent stem cells suggests that they can produce disease-relevant cell types for biomedical research. However, substantial variation has been reported among pluripotent cell lines, which could affect their utility and clinical safety. Such cell-line-specific differences must be better understood before one can confidently use embryonic stem (ES) or induced pluripotent stem (iPS) cells in translational research. Toward this goal we have established genome-wide reference maps of DNA methylation and gene expression for 20 previously derived human ES lines and 12 human iPS cell lines, and we have measured the in vitro differentiation propensity of these cell lines. This resource enabled us to assess the epigenetic and transcriptional similarity of ES and iPS cells and to predict the differentiation efficiency of individual cell lines. The combination of assays yields a scorecard for quick and comprehensive characterization of pluripotent cell lines.
Despite rapid progress in characterizing transcription factor-driven reprogramming of somatic cells to an induced pluripotent stem (iPS) cell state, many mechanistic questions still remain. To gain insight into the earliest events in the reprogramming process, we systematically analyzed the transcriptional and epigenetic changes that occur during early factor induction after discrete numbers of divisions. We observed rapid, genome-wide changes in the euchromatic histone modification, H3K4me2, at more than a thousand loci including large subsets of pluripotency-related gene promoters and enhancers. In contrast, patterns of the repressive H3K27me3 modification remained largely unchanged except for focused depletion specifically at positions where H3K4 methylation is gained. These chromatin regulatory events precede transcriptional changes within the corresponding loci. Our data provide evidence for an early, organized, and population-wide epigenetic response to ectopic reprogramming factors that clarifies the temporal order of certain events during reprogramming.
DNA methylation plays an important role in development and disease. The primary sites of DNA methylation in vertebrates are cytosines in the CpG dinucleotide context, which account for roughly three quarters of the total DNA methylation content in human and mouse cells. While the genomic distribution, inter-individual stability, and functional role of CpG methylation are reasonably well understood, little is known about DNA methylation targeting CpA, CpT, and CpC (non-CpG) dinucleotides. Here we report a comprehensive analysis of non-CpG methylation in 76 genome-scale DNA methylation maps across pluripotent and differentiated human cell types. We confirm non-CpG methylation to be predominantly present in pluripotent cell types and observe a decrease upon differentiation and near complete absence in various somatic cell types. Although no function has been assigned to it in pluripotency, our data highlight that non-CpG methylation patterns reappear upon iPS cell reprogramming. Intriguingly, the patterns are highly variable and show little conservation between different pluripotent cell lines. We find a strong correlation of non-CpG methylation and DNMT3 expression levels while showing statistical independence of non-CpG methylation from pluripotency-associated gene expression. In line with these findings, we show that knockdown of DNMT3A and DNMT3B in hESCs results in a global reduction of non-CpG methylation. Finally, non-CpG methylation appears to be spatially correlated with CpG methylation. In summary, these results contribute further to our understanding of cytosine methylation patterns in human cells using a large representative sample set.
Epigenetic modifications including DNA methylation at the position 5 of the cytosine base provide regulatory information to the genome sequence. The primary target of cytosine methylation in mammals is the CpG dinucleotide. However, previous studies in the mouse and more recent work in humans have highlighted the presence of non-CpG methylation in pluripotent cells. Currently, little is known about the role of this type of DNA methylation. We sought to further characterize non-CpG methylation by employing a comprehensive data set of genome-scale methylation maps across various human cell types. Our analysis reveals that non-CpG methylation varies dramatically between pluripotent cells and is closely linked to CpG methylation. Moreover, we show that depletion of the de novo DNA methyltransferases results in a global reduction of non-CpG methylation levels. Taken together, these findings further advance our understanding of cytosine methylation and describe its distribution among a large number of human cell types.
Regulation of RNA levels is determined through the interplay between RNA production, processing and degradation. However, since most global studies of RNA regulation do not distinguish the separate contributions of these processes, relatively little is known about how they are temporally integrated to determine changes in RNA levels. In particular, while some studies emphasize the role of changes in the rate of transcription, others suggest a prominent involvement of time-varying degradation rates. Here, we combine metabolic labeling of RNA at high temporal resolution with advanced RNA quantification assays and computational modeling to estimate RNA transcription and degradation rates during the model response of immune dendritic cells (DCs) to pathogens. We find that changes in transcription rates determine the majority of temporal changes in RNA levels, but that changes in degradation rate are important for shaping sharp ‘peaked’ responses. Furthermore, transcription rate changes precede corresponding changes in RNA level by a small lag (15-30 min), which is shorter for induced than for repressed genes. We used massively parallel sequencing of the newly-transcribed RNA population – including non-polyadenylated transcripts – to estimate constant RNA degradation and processing rates. We find that temporally constant degradation rates vary significantly between genes and contribute substantially to the observed differences in the dynamic response, and that specific groups of transcripts, mostly cytokines and transcription factors, undergo faster mRNA maturation. Our study provides a new quantitative approach to study key steps in the integrative process of RNA regulation.
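As a rough illustration of how transcription and degradation rates jointly set RNA levels, the sketch below integrates a textbook first-order model, dR/dt = alpha(t) - beta * R, with forward Euler. The model itself, the function names simulate_rna and step_alpha, and all parameter values are assumptions chosen for illustration; they are not taken from the study and need not match its computational model.

```python
def simulate_rna(alpha, beta, r0, dt, steps):
    """Integrate dR/dt = alpha(t) - beta * R with forward Euler.

    alpha: function t -> transcription rate (hypothetical units: molecules/min)
    beta:  constant degradation rate (1/min)
    r0:    RNA level at t = 0
    Returns the list of RNA levels at t = 0, dt, 2*dt, ...
    """
    levels = [r0]
    r = r0
    for i in range(steps):
        t = i * dt
        r += (alpha(t) - beta * r) * dt
        levels.append(r)
    return levels


def step_alpha(t):
    # Transcription rate after an assumed step from 1.0 to 2.0 at t = 0.
    return 2.0


beta = 0.05        # assumed degradation rate; half-life = ln(2) / beta
r0 = 1.0 / beta    # steady state under the old transcription rate of 1.0
levels = simulate_rna(step_alpha, beta, r0=r0, dt=0.1, steps=3000)

print(f"start: {levels[0]:.1f}, end: {levels[-1]:.1f}")
```

Under this model with a constant degradation rate, the RNA level relaxes toward the new steady state alpha/beta with a half-time of ln(2)/beta, which is one simple way a delay between a change in transcription rate and the corresponding change in RNA level can arise.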
We have adapted a solution hybrid selection protocol to enrich pathogen DNA in clinical samples dominated by human genetic material. Using mock mixtures of human and Plasmodium falciparum malaria parasite DNA as well as clinical samples from infected patients, we demonstrate an average of approximately 40-fold enrichment of parasite DNA after hybrid selection. This approach will enable efficient genome sequencing of pathogens from clinical samples, as well as sequencing of endosymbiotic organisms such as Wolbachia that live inside diverse metazoan phyla.
Sequencing-based DNA methylation profiling methods are comprehensive and, as accuracy and affordability improve, will increasingly supplant microarrays for genome-scale analyses. Here, four sequencing-based methodologies were applied to biological replicates of human embryonic stem cells to compare their CpG coverage genome-wide and in transposons, resolution, cost, concordance and its relationship with CpG density and genomic context. The two bisulfite methods reached concordance of 82% for CpG methylation levels and 99% for non-CpG cytosine methylation levels. Using binary methylation calls, two enrichment methods were 99% concordant, while regions assessed by all four methods were 97% concordant. To achieve comprehensive methylome coverage while reducing cost, an approach integrating two complementary methods was examined. The integrative methylome profile along with histone methylation, RNA, and SNP profiles derived from the sequence reads allowed genome-wide assessment of allele-specific epigenetic states, identifying most known imprinted regions and new loci with monoallelic epigenetic marks and monoallelic expression.
DNA methylation; Sequencing; Bisulfite
DNA methylation is a key component of mammalian gene regulation and the most classical example of an epigenetic mark. DNA methylation patterns are mitotically heritable and stable over time, but they undergo considerable changes in response to cell differentiation, diseases and environmental influences. Several methods have been developed for DNA methylation profiling on a genomic scale. Here, we benchmark four of these methods on two sample pairs, comparing their accuracy and power to detect DNA methylation differences. The results show that all evaluated methods (MeDIP-seq: methylated DNA immunoprecipitation, MethylCap-seq: methylated DNA capture by affinity purification, RRBS: reduced representation bisulfite sequencing, and the Infinium HumanMethylation27 assay) produce accurate DNA methylation data. However, these methods differ in their ability to detect differentially methylated regions between pairs of samples. We highlight strengths and weaknesses of the four methods and give practical recommendations for the design of epigenomic case-control studies.
Epigenome profiling; epigenetics; sequencing; differentially methylated regions; molecular diagnostics; biomarker discovery; cancer
Strand-specific, massively-parallel cDNA sequencing (RNA-Seq) is a powerful tool for novel transcript discovery, genome annotation, and expression profiling. Despite multiple published methods for strand-specific RNA-Seq, no consensus exists as to how to choose between them. Here, we developed a comprehensive computational pipeline to compare library quality metrics from any RNA-Seq method. Using the well-annotated Saccharomyces cerevisiae transcriptome as a benchmark, we compared seven library construction protocols, including both published and our own novel methods. We found marked differences in strand-specificity, library complexity, evenness and continuity of coverage, agreement with known annotations, and accuracy for expression profiling. Weighing each method’s performance and ease, we identify the dUTP second strand marking and the Illumina RNA ligation methods as the leading protocols, with the former benefitting from the current availability of paired-end sequencing. Our analysis provides a comprehensive benchmark, and our computational pipeline is applicable for assessment of future protocols in other organisms.
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
RNA-Seq provides an unbiased way to study a transcriptome, including both coding and non-coding genes. To date, most RNA-Seq studies have critically depended on existing annotations, and thus focused on expression levels and variation in known transcripts. Here, we present Scripture, a method to reconstruct the transcriptome of a mammalian cell using only RNA-Seq reads and the genome sequence. We apply it to mouse embryonic stem cells, neuronal precursor cells, and lung fibroblasts to accurately reconstruct the full-length gene structures for the vast majority of known expressed genes. We identify substantial variation in protein-coding genes, including thousands of novel 5′-start sites, 3′-ends, and internal coding exons. We then determine the gene structures of over a thousand lincRNA and antisense loci. Our results open the way to direct experimental manipulation of thousands of non-coding RNAs, and demonstrate the power of ab initio reconstruction to render a comprehensive picture of mammalian transcriptomes.
Recent studies in budding yeast have shown that antisense transcription occurs at many loci. However, the functional role of antisense transcripts has been demonstrated only in a few cases and it has been suggested that most antisense transcripts may result from promiscuous bi-directional transcription in a dense genome.
Here, we use strand-specific RNA sequencing to study antisense transcription in Saccharomyces cerevisiae. We detect 1,103 putative antisense transcripts expressed in mid-log phase growth, ranging from 39 short transcripts covering only the 3' UTR of sense genes to 145 long transcripts covering the entire sense open reading frame. Many of these antisense transcripts overlap sense genes that are repressed in mid-log phase and are important in stationary phase, stress response, or meiosis. We validate the differential regulation of 67 antisense transcripts and their sense targets in relevant conditions, including nutrient limitation and environmental stresses. Moreover, we show that several antisense transcripts and, in some cases, their differential expression have been conserved across five species of yeast spanning 150 million years of evolution. Divergence in the regulation of antisense transcripts to two respiratory genes coincides with the evolution of respiro-fermentation.
Our work provides support for a global and conserved role for antisense transcription in yeast gene regulation.
Bisulfite sequencing measures absolute levels of DNA methylation at single-nucleotide resolution, providing a robust platform for molecular diagnostics. Here, we optimize bisulfite sequencing for genome-scale analysis of clinical samples. Specifically, we outline how restriction digestion targets bisulfite sequencing to hotspots of epigenetic regulation; we show that 30 ng of DNA are sufficient for genome-scale analysis; we demonstrate that our protocol works well on formalin-fixed, paraffin-embedded (FFPE) samples; and we describe a statistical method for assessing significance of altered DNA methylation patterns.
Epigenome profiling; epigenetics; bisulfite sequencing; human disease samples; cancer; biomarker development; molecular diagnostics; FFPE
DNA methylation is essential for normal development [1–3] and has been implicated in many pathologies including cancer [4,5]. Our knowledge about the genome-wide distribution of DNA methylation, how it changes during cellular differentiation and how it relates to histone methylation and other chromatin modifications in mammals remains limited. Here we report the generation and analysis of genome-scale DNA methylation profiles at nucleotide resolution in mammalian cells. Using high-throughput reduced representation bisulphite sequencing [6] and single-molecule-based sequencing, we generated DNA methylation maps covering most CpG islands, and a representative sampling of conserved non-coding elements, transposons and other genomic features, for mouse embryonic stem cells, embryonic-stem-cell-derived and primary neural cells, and eight other primary tissues. Several key findings emerge from the data. First, DNA methylation patterns are better correlated with histone methylation patterns than with the underlying genome sequence context. Second, CpG methylation is a dynamic epigenetic mark that undergoes extensive changes during cellular differentiation, particularly in regulatory regions outside of core promoters. Third, analysis of embryonic-stem-cell-derived and primary cells reveals that ‘weak’ CpG islands associated with a specific set of developmentally regulated genes undergo aberrant hypermethylation during extended proliferation in vitro, in a pattern reminiscent of that reported in some primary tumours. More generally, the results establish reduced representation bisulphite sequencing as a powerful technology for epigenetic profiling of cell populations relevant to developmental biology, cancer and regenerative medicine.
DNA methylation is a critical epigenetic mark that is essential for mammalian development and aberrant in many diseases including cancer. Over the past decade multiple methods have been developed and applied to characterize its genome-wide distribution. Of these, Reduced Representation Bisulfite Sequencing (RRBS) generates nucleotide resolution Illumina-based libraries that enrich for CpG-dense regions by methylation-insensitive restriction digestion. Here we provide an extensive, optimized protocol for generating RRBS libraries and discuss the power of this strategy for methylome profiling. We include information on sequence analysis and the relative coverage over genomic regions of interest for a representative mouse MspI generated RRBS library. Contemporary sequencing and array-based technologies are compared against sample throughput and coverage, highlighting the variety of options available to investigate methylation on the genome-scale.
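To make the enrichment step concrete, the following minimal sketch performs an in silico MspI digest. MspI recognizes CCGG and cuts after the first C (C^CGG) regardless of CpG methylation. The function name mspi_fragments and the 40 to 220 bp size-selection window are assumptions for illustration, not parameters stated in the protocol; only fragments bounded by an MspI cut on both sides are reported.

```python
import re


def mspi_fragments(seq, min_len=40, max_len=220):
    """Return MspI-to-MspI fragments of seq within an assumed size window.

    The size window (min_len, max_len) is a hypothetical choice.
    """
    seq = seq.upper()
    # MspI cuts between the first and second base of each CCGG site.
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    # Fragments between consecutive cut sites; sequence ends are ignored
    # because they are not flanked by two MspI cuts.
    frags = [seq[a:b] for a, b in zip(cuts, cuts[1:])]
    return [f for f in frags if min_len <= len(f) <= max_len]


# Toy sequence: one short and one long MspI-to-MspI fragment.
toy = "AAAA" + "CCGG" + "T" * 50 + "CCGG" + "A" * 300 + "CCGG" + "TT"
selected = mspi_fragments(toy)
print([len(f) for f in selected])
```

In the toy sequence the 54 bp fragment passes the assumed window, while the 304 bp fragment is excluded. Every retained fragment begins with CGG, so each one contributes at least one CpG at its end.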
The three-dimensional folding of chromosomes compartmentalizes the genome and can bring distant functional elements, such as promoters and enhancers, into close spatial proximity [2-6]. Deciphering the relationship between chromosome organization and genome activity will aid in understanding genomic processes, like transcription and replication. However, little is known about how chromosomes fold. Microscopy is unable to distinguish large numbers of loci simultaneously or at high resolution. To date, the detection of chromosomal interactions using chromosome conformation capture (3C) and its subsequent adaptations required the choice of a set of target loci, making genome-wide studies impossible [7-10].
We developed Hi-C, an extension of 3C that is capable of identifying long range interactions in an unbiased, genome-wide fashion. In Hi-C, cells are fixed with formaldehyde, causing interacting loci to be bound to one another by means of covalent DNA-protein cross-links. When the DNA is subsequently fragmented with a restriction enzyme, these loci remain linked. A biotinylated residue is incorporated as the 5' overhangs are filled in. Next, blunt-end ligation is performed under dilute conditions that favor ligation events between cross-linked DNA fragments. This results in a genome-wide library of ligation products, corresponding to pairs of fragments that were originally in close proximity to each other in the nucleus. Each ligation product is marked with biotin at the site of the junction. The library is sheared, and the junctions are pulled-down with streptavidin beads. The purified junctions can subsequently be analyzed using a high-throughput sequencer, resulting in a catalog of interacting fragments.
Direct analysis of the resulting contact matrix reveals numerous features of genomic organization, such as the presence of chromosome territories and the preferential association of small gene-rich chromosomes. Correlation analysis can be applied to the contact matrix, demonstrating that the human genome is segregated into two compartments: a less densely packed compartment containing open, accessible, and active chromatin and a more dense compartment containing closed, inaccessible, and inactive chromatin regions. Finally, ensemble analysis of the contact matrix, coupled with theoretical derivations and computational simulations, shows that at the megabase scale the Hi-C data are consistent with a fractal globule conformation.
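The correlation analysis described above can be sketched as follows. The sketch assumes the contact matrix has already been normalized (for example, observed over expected contacts at each genomic distance), correlates the rows, and uses the sign of the leading eigenvector of the correlation matrix to split bins into two groups. The function name compartment_signal, the use of the leading eigenvector as the splitting criterion, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def compartment_signal(normalized_contacts):
    """Leading eigenvector of the row-correlation matrix.

    normalized_contacts: square, symmetric matrix assumed to be already
    corrected for genomic distance (hypothetical preprocessing step).
    """
    m = np.asarray(normalized_contacts, dtype=float)
    corr = np.corrcoef(m)              # Pearson correlation between rows
    vals, vecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return vecs[:, -1]                 # eigenvector of the largest eigenvalue


# Toy data: bins in the same (hidden) compartment contact each other more.
labels = np.array([0, 0, 0, 1, 1, 0, 1, 1])
same = labels[:, None] == labels[None, :]
toy = np.where(same, 2.0, 0.5)

signal = compartment_signal(toy)
groups = signal > 0
print(groups.astype(int))
```

The overall sign of an eigenvector is arbitrary, so the two groups are only defined up to a swap. Deciding which group corresponds to open chromatin requires an independent annotation such as gene density or chromatin accessibility.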
We describe Hi-C, a method that probes the three-dimensional architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing. We constructed spatial proximity maps of the human genome with Hi-C at a resolution of 1 Mb. These maps confirm the presence of chromosome territories and the spatial proximity of small, gene-rich chromosomes. We identified an additional level of genome organization that is characterized by the spatial segregation of open and closed chromatin to form two genome-wide compartments. At the megabase scale, the chromatin conformation is consistent with a fractal globule, a knot-free conformation that enables maximally dense packing while preserving the ability to easily fold and unfold any genomic locus. The fractal globule is distinct from the more commonly used globular equilibrium model. Our results demonstrate the power of Hi-C to map the dynamic conformations of whole genomes.