The 2013-2015 West African epidemic of Ebola virus disease (EVD) reminds us how little is known about biosafety level-4 viruses. Like Ebola virus, Lassa virus (LASV) can cause hemorrhagic fever with high case fatality rates. We generated a genomic catalog of almost 200 LASV sequences from clinical and rodent reservoir samples. We show that whereas the 2013-2015 EVD epidemic is fueled by human-to-human transmissions, LASV infections mainly result from reservoir-to-human infections. We elucidated the spread of LASV across West Africa and show that this migration was accompanied by changes in LASV genome abundance, fatality rates, codon adaptation, and translational efficiency. By investigating intrahost evolution, we found that mutations accumulate in epitopes of viral surface proteins, suggesting selection for immune escape. This catalog will serve as a foundation for the development of vaccines and diagnostics.
Although RNA-seq is a powerful tool, the considerable time and cost associated with library construction have limited its use in many applications. RNAtag-Seq, an approach to generate multiple RNA-seq libraries in a single reaction, lowers time and cost per sample, and it produces data on prokaryotic and eukaryotic samples that are comparable to those generated by traditional strand-specific RNA-seq approaches.
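As a rough illustration of how reads from a pooled library could be assigned back to their samples, the sketch below assumes that each sample carries a short sequence tag at the 5' end of its reads. The tag sequences, the tag length, and the function name are hypothetical; the abstract does not describe the tag design or the authors' software.

```python
from collections import defaultdict

# Hypothetical sample tags; the real tag sequences and lengths are not given in the text.
BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}
BARCODE_LEN = 6


def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Group reads by the tag at their 5' end and trim the tag from assigned reads."""
    by_sample = defaultdict(list)
    for read in reads:
        tag = read[:barcode_len]
        if tag in barcodes:
            # Known tag: store the read without its tag under the sample name.
            by_sample[barcodes[tag]].append(read[barcode_len:])
        else:
            # Unknown tag: keep the full read so it can be inspected later.
            by_sample["unassigned"].append(read)
    return dict(by_sample)
```

Exact matching is used here for simplicity; a practical pipeline would probably tolerate sequencing errors in the tag.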
Intra-tumoral heterogeneity plays a critical role in tumor evolution. To define the contribution of DNA methylation to heterogeneity within tumors, we performed genome-scale bisulfite sequencing of 104 primary chronic lymphocytic leukemias (CLL). Compared to 26 normal B cell samples, CLLs consistently displayed higher intra-sample variability of DNA methylation patterns across the genome, which appears to arise from stochastically disordered methylation in malignant cells. Transcriptome analysis of bulk and single CLL cells revealed that methylation disorder was linked to low-level expression. Disordered methylation was further associated with adverse clinical outcome. We therefore propose that disordered methylation plays a similar role to genetic instability, enhancing the ability of cancer cells to search for superior evolutionary trajectories.
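One way to make "disordered methylation" concrete is to count sequencing reads whose CpGs disagree with one another. The sketch below is an assumption-laden illustration, not the measure defined by the authors: the abstract does not specify a metric, and the read encoding (a string of "1" for methylated and "0" for unmethylated CpGs) and the `min_cpgs` cutoff are chosen here only for the example.

```python
def discordant_fraction(reads, min_cpgs=4):
    """Return the fraction of eligible reads with mixed methylation states.

    Each read is a string such as "1101", one character per CpG it covers.
    Reads covering fewer than `min_cpgs` CpGs are ignored (assumed cutoff).
    Returns None when no read is eligible.
    """
    eligible = [r for r in reads if len(r) >= min_cpgs]
    if not eligible:
        return None
    # A read is discordant if it contains both methylated and unmethylated CpGs.
    discordant = sum(1 for r in eligible if len(set(r)) > 1)
    return discordant / len(eligible)
```

Under this sketch, a locus covered only by fully methylated or fully unmethylated reads scores 0, and a locus where every read is mixed scores 1.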
DNA methylation is a key epigenetic modification involved in regulating gene expression and maintaining genomic integrity. Here we inactivated all three catalytically active DNA methyltransferases in human embryonic stem cells (ESCs) using CRISPR/Cas9 genome editing to further investigate their roles and genomic targets. Disruption of DNMT3A or DNMT3B individually, as well as of both enzymes in tandem, creates viable, pluripotent cell lines with distinct effects on their DNA methylation landscape as assessed by whole-genome bisulfite sequencing. Surprisingly, in contrast to mouse, deletion of DNMT1 resulted in rapid cell death in human ESCs. To overcome the immediate lethality, we generated a doxycycline (DOX) responsive tTA-DNMT1* rescue line and readily obtained homozygous DNMT1 mutant lines. However, DOX-mediated repression of the exogenous DNMT1* initiates rapid, global loss of DNA methylation, followed by extensive cell death. Our data provide a comprehensive characterization of DNMT mutant ESCs, including single base genome-wide maps of their targets.
Human pluripotent stem cell derived models that accurately recapitulate neural development in vitro and allow for the generation of specific neuronal subtypes are of major interest to the stem cell and biomedical community. Notch signaling, particularly through the Notch effector HES5, is a major pathway critical for the onset and maintenance of neural progenitor cells (NPCs) in the embryonic and adult nervous system [1-3]. This can be exploited to isolate distinct populations of human embryonic stem (ES) cell derived NPCs [4]. Here, we report the transcriptional and epigenomic analysis of six consecutive stages derived from a HES5-GFP reporter ES cell line [5] differentiated along the neural trajectory aimed at modeling key cell fate decisions including specification, expansion and patterning during the ontogeny of cortical neural stem and progenitor cells. In order to dissect the regulatory mechanisms that orchestrate the stage-specific differentiation process, we developed a computational framework to infer key regulators of each cell state transition based on the progressive remodeling of the epigenetic landscape and then validated these through a pooled shRNA screen. We were also able to refine our previous observations on epigenetic priming at transcription factor binding sites and show here that they are mediated by combinations of core and stage-specific factors. Taken together, we demonstrate the utility of our system and outline a general framework, not limited to the context of the neural lineage, to dissect regulatory circuits of differentiation.
Pluripotent stem cells provide a powerful system to dissect the underlying molecular dynamics that regulate cell fate changes during mammalian development. Here we report the integrative analysis of genome wide binding data for 38 transcription factors with extensive epigenome and transcriptional data across the differentiation of human embryonic stem cells to the three germ layers. We describe core regulatory dynamics and show the lineage specific behavior of selected factors. In addition to the orchestrated remodeling of the chromatin landscape, we find that the binding of several transcription factors is strongly associated with specific loss of DNA methylation in one germ layer and in many cases a reciprocal gain in the other layers. Taken together, our work shows context-dependent rewiring of transcription factor binding, downstream signaling effectors, and the epigenome during human embryonic stem cell differentiation.
The 2013–2015 Ebola virus disease (EVD) epidemic is caused by the Makona variant of Ebola virus (EBOV). Early in the epidemic, genome sequencing provided insights into virus evolution and transmission and offered important information for outbreak response. Here, we analyze sequences from 232 patients sampled over 7 months in Sierra Leone, along with 86 previously released genomes from earlier in the epidemic. We confirm sustained human-to-human transmission within Sierra Leone and find no evidence for import or export of EBOV across national borders after its initial introduction. Using high-depth replicate sequencing, we observe both host-to-host transmission and recurrent emergence of intrahost genetic variants. We trace the increasing impact of purifying selection in suppressing the accumulation of nonsynonymous mutations over time. Finally, we note changes in the mucin-like domain of EBOV glycoprotein that merit further investigation. These findings clarify the movement of EBOV within the region and describe viral evolution during prolonged human-to-human transmission.
• In Sierra Leone, transmission has primarily been within-country, not between-country
• Infectious doses are large enough for intrahost variants to transmit between hosts
• A prolonged epidemic removes deleterious mutations from the viral population
• There is preliminary evidence for human RNA editing effects on the Ebola genome
Ebola virus genomes from 232 patients sampled over 7 months in Sierra Leone were sequenced. Transmission of intrahost genetic variants suggests a sufficiently high infectious dose during transmission. The human host may have caused direct alterations to the Ebola virus genome.
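The analysis of purifying selection described above depends on distinguishing nonsynonymous from synonymous mutations. The following is a minimal sketch of that classification for a single substitution in an in-frame coding sequence, using the standard genetic code written in the DNA alphabet. The function name and the example sequence are hypothetical, and this is not the authors' pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (DNA alphabet).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(coding_seq, position, new_base):
    """Classify a single-base change at a 0-based position of an in-frame sequence."""
    codon_start = position - position % 3
    ref_codon = coding_seq[codon_start:codon_start + 3]
    offset = position - codon_start
    # Build the mutated codon by swapping in the new base.
    alt_codon = ref_codon[:offset] + new_base + ref_codon[offset + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "nonsynonymous"
```

For example, in the toy sequence `ATGGCTAAA`, changing position 5 from T to C leaves the alanine codon unchanged in meaning, whereas changing position 3 from G to C replaces alanine with proline.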
In its largest outbreak, Ebola virus disease is spreading through Guinea, Liberia, Sierra Leone, and Nigeria. We sequenced 99 Ebola virus genomes from 78 patients in Sierra Leone to ∼2000× coverage. We observed a rapid accumulation of interhost and intrahost genetic variation, allowing us to characterize patterns of viral transmission over the initial weeks of the epidemic. This West African variant likely diverged from central African lineages around 2004, crossed from Guinea to Sierra Leone in May 2014, and has exhibited sustained human-to-human transmission subsequently, with no evidence of additional zoonotic sources. Because many of the mutations alter protein sequences and other biologically meaningful targets, they should be monitored for impact on diagnostics, vaccines, and therapies critical to outbreak response.
We have developed a robust RNA sequencing method for generating complete de novo assemblies with intra-host variant calls of Lassa and Ebola virus genomes in clinical and biological samples. Our method uses targeted RNase H-based digestion to remove contaminating poly(rA) carrier and ribosomal RNA. This depletion step improves both the quality of data and quantity of informative reads in unbiased total RNA sequencing libraries. We have also developed a hybrid-selection protocol to further enrich the viral content of sequencing libraries. These protocols have enabled rapid deep sequencing of both Lassa and Ebola virus and are broadly applicable to other viral genomics studies.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0519-7) contains supplementary material, which is available to authorized users.
Deep mutational scanning has emerged as a promising tool for mapping sequence–activity relationships in proteins, ribonucleic acid and deoxyribonucleic acid. In this approach, diverse variants of a sequence of interest are first ranked according to their activities in a relevant assay, and this ranking is then used to infer the shape of the fitness landscape around the wild-type sequence. Little is currently known, however, about the degree to which such fitness landscapes are dependent on the specific assay conditions from which they are inferred. To explore this issue, we performed comprehensive single-substitution mutational scanning of APH(3′)II, a Tn5 transposon-derived kinase that confers resistance to aminoglycoside antibiotics, in Escherichia coli under selection with each of six structurally diverse antibiotics at a range of inhibitory concentrations. We found that the resulting local fitness landscapes showed significant dependence on both antibiotic structure and concentration, and that this dependence can be exploited to guide protein engineering. Specifically, we found that differential analysis of fitness landscapes allowed us to generate synthetic APH(3′)II variants with orthogonal substrate specificities.
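The ranking step described above can be illustrated with a simple enrichment score. The sketch below assumes that variant activity is approximated by the log2 change in read-count ratio across selection, normalized to wild type. This is a common convention rather than a method stated in the abstract, and the variant names, the pseudocount, and both function names are hypothetical.

```python
import math


def enrichment_scores(counts_before, counts_after, wild_type, pseudocount=0.5):
    """Return log2 enrichment of each variant relative to the wild-type sequence.

    counts_before / counts_after map variant name -> read count before and
    after selection. The pseudocount (assumed value) avoids division by zero.
    """
    def ratio(variant):
        after = counts_after.get(variant, 0) + pseudocount
        before = counts_before[variant] + pseudocount
        return after / before

    wt_ratio = ratio(wild_type)
    return {v: math.log2(ratio(v) / wt_ratio) for v in counts_before}


def rank_variants(scores):
    """Order variants from most enriched to most depleted."""
    return sorted(scores, key=scores.get, reverse=True)
```

Computing such scores separately for each antibiotic and concentration, then comparing them, would be one way to approach the differential analysis the abstract mentions.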
Differentiation of human embryonic stem cells (hESCs) provides a unique opportunity to study the regulatory mechanisms that facilitate cellular transitions in a human context. To that end, we performed comprehensive transcriptional and epigenetic profiling of populations derived through directed differentiation of hESCs representing each of the three embryonic germ layers. Integration of whole genome bisulfite sequencing, chromatin immunoprecipitation-sequencing and RNA-sequencing reveals unique events associated with specification towards each lineage. Dynamic alterations in DNA methylation and H3K4me1 are evident at putative distal regulatory elements bound by pluripotency factors or activated in specific lineages. In addition, we identified germ layer-specific H3K27me3 enrichment at sites exhibiting high DNA methylation in the undifferentiated state. A better understanding of these initial specification events will facilitate identification of deficiencies in current approaches leading to more faithful differentiation strategies as well as provide insights into the rewiring of human regulatory programs during cellular transitions.
DNA methylation is a defining feature of mammalian cellular identity and essential for normal development [1,2]. Most cell types, except germ cells and pre-implantation embryos [3–5], display relatively stable DNA methylation patterns with 70–80% of all CpGs being methylated [6]. Despite recent advances, our understanding of when, where and how many CpGs participate in genomic regulation remains too limited. Here we report the in-depth analysis of 42 whole genome bisulfite sequencing (WGBS) data sets across 30 diverse human cell and tissue types. We observe dynamic regulation for only 21.8% of autosomal CpGs within a normal developmental context, a majority of which are distal to transcription start sites. These dynamic CpGs co-localize with gene regulatory elements, particularly enhancers and transcription factor binding sites (TFBS), which allow identification of key lineage specific regulators. In addition, differentially methylated regions (DMRs) often harbor SNPs associated with cell type related diseases as determined by GWAS. The results also highlight the general inefficiency of WGBS as 70–80% of the sequencing reads across these data sets provided little or no relevant information regarding CpG methylation. To further demonstrate the utility of our DMR set, we use it to classify unknown samples and identify representative signature regions that recapitulate major DNA methylation dynamics. In summary, although in theory every CpG can change its methylation state, our results suggest that only a fraction does so as part of coordinated regulatory programs. Therefore our selected DMRs can serve as a starting point to help guide novel, more effective reduced representation approaches to capture the most informative fraction of CpGs as well as further pinpoint putative regulatory elements.
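To show what a statement such as "only a fraction of CpGs is dynamic" means computationally, the sketch below flags a CpG as dynamic when its methylation level varies across samples by at least a chosen amount. The range-based criterion, the `min_range` threshold, and the CpG identifiers are assumptions made for illustration; the abstract does not give the authors' statistical definition.

```python
def dynamic_fraction(methylation, min_range=0.3):
    """Return the fraction of CpGs whose methylation varies across samples.

    methylation maps a CpG identifier to a list of methylation levels
    (0.0 to 1.0), one per sample. A CpG counts as dynamic when the
    difference between its highest and lowest level is at least
    `min_range` (assumed threshold).
    """
    if not methylation:
        return 0.0
    dynamic = [
        cpg for cpg, levels in methylation.items()
        if max(levels) - min(levels) >= min_range
    ]
    return len(dynamic) / len(methylation)
```

A real analysis would also need to account for sequencing coverage and sampling noise before calling a CpG dynamic.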
While genetic lesions responsible for some Mendelian disorders can be rapidly discovered through massively parallel sequencing (MPS) of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple Mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing, and de novo assembly did we find that each of six MCKD1 families harbors an equivalent, but apparently independently arising, mutation in sequence dramatically underrepresented in MPS data: the insertion of a single C in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (~1.5-5 kb), GC-rich (>80%), coding VNTR in the mucin 1 gene. The results provide a cautionary tale about the challenges in identifying genes responsible for Mendelian, let alone more complex, disorders through MPS.
RNA-Seq is an effective method to study the transcriptome, but can be difficult to apply to scarce or degraded RNA from fixed clinical samples, rare cell populations, or cadavers. Recent studies have proposed several methods for RNA-Seq of low quality and/or low quantity samples, but their relative merits have not been systematically analyzed. Here, we compare five such methods using metrics relevant to transcriptome annotation, transcript discovery, and gene expression. Using a single human RNA sample, we constructed and sequenced ten libraries with these methods and two control libraries. We find that the RNase H method performed best for low quality RNA, and confirmed this with actual degraded samples. RNase H can even effectively replace oligo (dT) based methods for standard RNA-Seq. SMART and NuGEN had distinct strengths for low quantity RNA. Our analysis allows biologists to select the most suitable methods and provides a benchmark for future method development.
Understanding the principles governing mammalian gene regulation has been hampered by the difficulty in measuring in vivo binding dynamics of large numbers of transcription factors (TFs) to DNA. Here, we develop a high-throughput Chromatin ImmunoPrecipitation (HT-ChIP) method to systematically map protein-DNA interactions. HT-ChIP was applied to define the dynamics of DNA binding by 25 TFs and 4 chromatin marks at 4 time-points following pathogen stimulus of dendritic cells. Analyzing over 180,000 TF-DNA interactions, we find that TFs vary substantially in their temporal binding landscapes. These data suggest a model for transcription regulation whereby TF networks are hierarchically organized into cell differentiation factors, factors that bind targets prior to stimulus to prime them for induction, and factors that regulate specific gene programs. Overlaying HT-ChIP data on gene expression dynamics shows that many TF-DNA interactions are established prior to the stimuli, predominantly at immediate-early genes, and identifies specific TF ensembles that coordinately regulate gene induction.
Recent molecular studies have revealed that, even when derived from a seemingly homogenous population, individual cells can exhibit substantial differences in gene expression, protein levels, and phenotypic output [1–5], with important functional consequences [4,5]. Existing studies of cellular heterogeneity, however, have typically measured only a few pre-selected RNAs [1,2] or proteins [5,6] simultaneously because genomic profiling methods [3] could not be applied to single cells until very recently [7–10]. Here, we use single-cell RNA-Seq to investigate heterogeneity in the response of bone marrow derived dendritic cells (BMDCs) to lipopolysaccharide (LPS). We find extensive, and previously unobserved, bimodal variation in mRNA abundance and splicing patterns, which we validate by RNA-fluorescence in situ hybridization (RNA-FISH) for select transcripts. In particular, hundreds of key immune genes are bimodally expressed across cells, surprisingly even for genes that are very highly expressed at the population average. Moreover, splicing patterns demonstrate previously unobserved levels of heterogeneity between cells. Some of the observed bimodality can be attributed to closely related, yet distinct, known maturity states of BMDCs; other portions reflect differences in the usage of key regulatory circuits. For example, we identify a module of 137 highly variable, yet co-regulated, antiviral response genes. Using cells from knockout mice, we show that variability in this module may be propagated through an interferon feedback circuit involving the transcriptional regulators Stat2 and Irf7. Our study demonstrates the power and promise of single-cell genomics in uncovering functional diversity between cells and in deciphering cell states and circuits.
It was a zoological sensation when a living specimen of the coelacanth was first discovered in 1938, as this lineage of lobe-finned fish was thought to have gone extinct 70 million years ago. The modern coelacanth looks remarkably similar to many of its ancient relatives, and its evolutionary proximity to our own fish ancestors provides a glimpse of the fish that first walked on land. Here we report the genome sequence of the African coelacanth, Latimeria chalumnae. Through a phylogenomic analysis, we conclude that the lungfish, and not the coelacanth, is the closest living relative of tetrapods. Coelacanth protein-coding genes are significantly more slowly evolving than those of tetrapods, unlike other genomic features. Analyses of changes in genes and regulatory elements during the vertebrate adaptation to land highlight genes involved in immunity, nitrogen excretion and the development of fins, tail, ear, eye, brain, and olfaction. Functional assays of enhancers involved in the fin-to-limb transition and in the emergence of extra-embryonic tissues demonstrate the importance of the coelacanth genome as a blueprint for understanding tetrapod evolution.
DNA methylation is a mechanism of epigenetic regulation that is common to all vertebrates. Functional studies underscore its relevance for tissue homeostasis, but the global dynamics of DNA methylation during in vivo differentiation remain underexplored. Here we report high-resolution DNA methylation maps of adult stem cell differentiation in mouse, focusing on 19 purified cell populations of the blood and skin lineages. DNA methylation changes were locus-specific and relatively modest in magnitude. They frequently overlapped with lineage-associated transcription factors and their binding sites, suggesting that DNA methylation may protect cells from aberrant transcription factor activation. DNA methylation and gene expression provided complementary information, and combining the two enabled us to infer the cellular differentiation hierarchy of the blood lineage directly from genomic data. In summary, these results demonstrate that in vivo differentiation of adult stem cells is associated with small but informative changes in the genomic distribution of DNA methylation.
Epigenomics; bioinformatics; stem cells; blood lineage; skin lineage; hematopoietic stem cells; hair follicle stem cells; computational epigenetics
Massively-parallel cDNA sequencing has opened the way to deep and efficient probing of transcriptomes. Current approaches for transcript reconstruction from such data often rely on aligning reads to a reference genome, and are thus unsuitable for samples with a partial or missing reference genome. Here, we present the Trinity methodology for de novo full-length transcriptome reconstruction, and evaluate it on samples from fission yeast, mouse, and whitefly – an insect whose genome has not yet been sequenced. Trinity fully reconstructs a large fraction of the transcripts present in the data, also reporting alternative splice isoforms and transcripts from recently duplicated genes. In all cases, Trinity performs better than other available de novo transcriptome assembly programs, and its sensitivity is comparable to methods relying on genome alignments. Our approach provides a unified and general solution for transcriptome reconstruction in any sample, especially in the complete absence of a reference genome.
DNA methylation is highly dynamic during mammalian embryogenesis. It is broadly accepted that the paternal genome is actively depleted of 5-methyl cytosine at fertilization, followed by passive loss that reaches a minimum at the blastocyst stage. However, this model is based on limited data, and to date no base-resolution maps exist to support and refine it. Here, we generated genome-scale DNA methylation maps in mouse gametes and through post-implantation embryogenesis. We find that the oocyte already exhibits global hypomethylation, most prominently at specific families of long interspersed element-1 and long terminal repeat retro-elements, which are disparate between gametes and resolve to lower methylation values in zygote. Surprisingly, the oocyte contributes a unique set of Differentially Methylated Regions (DMRs), including many CpG Island promoter regions, that are maintained in the early embryo but are lost upon specification and absent from somatic cells. In contrast, sperm-contributed DMRs are largely intergenic and resolve to hypermethylation after the blastocyst stage. Our data provide a complete genome-scale, base-resolution timeline of DNA methylation in the pre-specified embryo, when this epigenetic modification is most dynamic, before returning to the canonical somatic pattern.
Sequencing-based approaches have led to new insights about DNA methylation. While many different techniques for genome-scale mapping of DNA methylation have been employed, throughput has been a key limitation for most. To further facilitate the mapping of DNA methylation, we describe a protocol for gel-free multiplexed reduced representation bisulfite sequencing (mRRBS) that reduces the workload dramatically and enables processing of 96 or more samples per week. mRRBS achieves similar CpG coverage to the original RRBS protocol, while the higher throughput and lower cost make it better suited for large-scale DNA methylation mapping studies, including cohorts of cancer samples.
Learning to read and write the transcriptional regulatory code is of central importance to progress in genetic analysis and engineering. Here, we describe a massively parallel reporter assay (MPRA) that enables systematic dissection of transcriptional regulatory elements by integrating microarray-based DNA synthesis and high-throughput tag sequencing. We apply MPRA to compare more than 27,000 distinct variants of two inducible enhancers in human cells: a synthetic cAMP-regulated enhancer and the virus-inducible interferon beta enhancer. We first show that the resulting data define accurate maps of functional transcription factor binding sites in both enhancers at single-nucleotide resolution. We then use the data to train quantitative sequence-activity models (QSAMs) of the two enhancers. We show that QSAMs from two cellular states can be combined to identify novel enhancer variants that optimize potentially conflicting objectives, such as maximizing induced activity while minimizing basal activity.
We have developed a process for transcriptome analysis of bacterial communities that accommodates both intact and fragmented starting RNA and combines efficient rRNA removal with strand-specific RNA-seq. We applied this approach to an RNA mixture derived from three diverse cultured bacterial species and to RNA isolated from clinical stool samples. The resulting expression profiles were highly reproducible, enriched up to 40-fold for non-rRNA transcripts, and correlated well with profiles representing undepleted total RNA.
The developmental potential of human pluripotent stem cells suggests that they can produce disease-relevant cell types for biomedical research. However, substantial variation has been reported among pluripotent cell lines, which could affect their utility and clinical safety. Such cell-line-specific differences must be better understood before one can confidently use embryonic stem (ES) or induced pluripotent stem (iPS) cells in translational research. Toward this goal we have established genome-wide reference maps of DNA methylation and gene expression for 20 previously derived human ES lines and 12 human iPS cell lines, and we have measured the in vitro differentiation propensity of these cell lines. This resource enabled us to assess the epigenetic and transcriptional similarity of ES and iPS cells and to predict the differentiation efficiency of individual cell lines. The combination of assays yields a scorecard for quick and comprehensive characterization of pluripotent cell lines.