Medulloblastomas are the most common malignant brain tumors in children [1]. Identifying and understanding the genetic events that drive these tumors is critical for the development of more effective diagnostic, prognostic and therapeutic strategies. Recently, our group and others described distinct molecular subtypes of medulloblastoma based on transcriptional and copy number profiles [2–5]. Here, we utilized whole exome hybrid capture and deep sequencing to identify somatic mutations across the coding regions of 92 primary medulloblastoma/normal pairs. Overall, medulloblastomas exhibit low mutation rates consistent with other pediatric tumors, with a median of 0.35 non-silent mutations per megabase. We identified twelve genes mutated at statistically significant frequencies, including previously known mutated genes in medulloblastoma such as CTNNB1, PTCH1, MLL2, SMARCA4 and TP53. Recurrent somatic mutations were identified in an RNA helicase gene, DDX3X, often concurrent with CTNNB1 mutations, and in the nuclear co-repressor (N-CoR) complex genes GPS2, BCOR, and LDB1, novel findings in medulloblastoma. We show that mutant DDX3X potentiates transactivation of a TCF promoter and enhances cell viability in combination with mutant but not wild-type beta-catenin. Together, our study reveals the alteration of Wnt, Hedgehog, histone methyltransferase and now N-CoR pathways across medulloblastomas and within specific subtypes of this disease, and nominates the RNA helicase DDX3X as a component of pathogenic beta-catenin signaling in medulloblastoma.
Pacific Biosciences technology offers a fundamentally new data type with the potential to overcome some limitations of current next-generation sequencing platforms through significantly longer reads, single-molecule sequencing, low composition bias and an error profile that is orthogonal to other platforms. With these potential advantages in mind, we evaluate the utility of the Pacific Biosciences RS platform for human medical amplicon resequencing projects.
We evaluated the Pacific Biosciences technology for SNP discovery in medical resequencing projects using the Genome Analysis Toolkit, observing high sensitivity and specificity for calling differences in amplicons containing known true or false SNPs. We assessed data quality: most errors were indels (~14%) with few apparent miscalls (~1%). We also define a custom data processing pipeline for analyzing human Pacific Biosciences data.
Critically, the error properties were largely free of the context-specific effects that affect other sequencing technologies. These data show excellent utility for follow-up validation and extension studies in human data and medical genetics projects, and the approach can be extended to other organisms with a reference genome.
DNA replication initiates at distinct origins in eukaryotic genomes, but the genomic features that define these sites are not well understood.
We have taken a combined experimental and bioinformatic approach to identify and characterize origins of replication in three distantly related fission yeasts: Schizosaccharomyces pombe, Schizosaccharomyces octosporus and Schizosaccharomyces japonicus. Using single-molecule deep sequencing to construct amplification-free high-resolution replication profiles, we located origins and identified sequence motifs that predict origin function. We then mapped nucleosome occupancy by deep sequencing of mononucleosomal DNA from the corresponding species, finding that origins tend to occupy nucleosome-depleted regions.
The sequences that specify origins are evolutionarily plastic, with low complexity nucleosome-excluding sequences functioning in S. pombe and S. octosporus, and binding sites for trans-acting nucleosome-excluding proteins functioning in S. japonicus. Furthermore, chromosome-scale variation in replication timing is conserved independently of origin location and via a mechanism distinct from known heterochromatic effects on origin function. These results are consistent with a model in which origins are simply the nucleosome-depleted regions of the genome with the highest affinity for the origin recognition complex. This approach provides a general strategy for understanding the mechanisms that define DNA replication origins in eukaryotes.
The fission yeast clade, comprising Schizosaccharomyces pombe, S. octosporus, S. cryophilus and S. japonicus, occupies the basal branch of Ascomycete fungi and is an important model of eukaryote biology. A comparative annotation of these genomes identified a near extinction of transposons and the associated innovation of transposon-free centromeres. Expression analysis established that meiotic genes are subject to antisense transcription during vegetative growth, suggesting a mechanism for their tight regulation. In addition, trans-acting regulators control new genes within the context of expanded functional modules for meiosis and stress response. Differences in gene content and regulation also explain why, unlike the Saccharomycotina, fission yeasts cannot use ethanol as a primary carbon source. These analyses elucidate the genome structure and gene regulation of fission yeast and provide tools for investigation across the Schizosaccharomyces clade.
Differences in gene expression are thought to be an important source of phenotypic diversity, so dissecting the genetic components of natural variation in gene expression is important for understanding the evolutionary mechanisms that lead to adaptation. Gene expression is a complex trait that, in diploid organisms, results from transcription of both maternal and paternal alleles. Directly measuring allelic expression rather than total gene expression offers greater insight into regulatory variation. The recent emergence of high-throughput sequencing offers an unprecedented opportunity to study allelic transcription at a genomic scale for virtually any species. By sequencing transcript pools derived from heterozygous individuals, estimates of allelic expression can be directly obtained. The statistical power of this approach is influenced by the number of transcripts sequenced and the ability to unambiguously assign individual sequence fragments to specific alleles on the basis of transcribed nucleotide polymorphisms. Here, using mathematical modelling and computer simulations, we determine the minimum sequencing depth required to accurately measure relative allelic expression and detect allelic imbalance via high-throughput sequencing under a variety of conditions. We conclude that, within a species, a minimum of 500–1000 sequencing reads per gene are needed to test for allelic imbalance, and consequently, at least 5 to 10 million reads are required for studying a genome expressing 10 000 genes. Finally, using 454 sequencing, we illustrate an application of allelic expression by testing for cis-regulatory divergence between closely related Drosophila species.
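The link between read depth and the power to detect allelic imbalance can be illustrated with a minimal sketch. It assumes an exact two-sided, equal-tailed binomial test against a 50:50 allelic ratio and that every read can be assigned to an allele; this is not the model used in the study, and the function names (`imbalance_power`, `total_reads`) are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def imbalance_power(n_reads, true_fraction, alpha=0.05):
    """Power to reject H0 (allele fraction = 0.5) given the true allele fraction.

    Assumption: equal-tailed exact binomial test, each tail at most alpha / 2.
    """
    cumulative, lower = 0.0, -1
    for k in range(n_reads + 1):
        cumulative += binom_pmf(k, n_reads, 0.5)
        if cumulative > alpha / 2:
            break
        lower = k  # largest count still inside the lower rejection tail
    if lower < 0:
        return 0.0  # too few reads to ever reject at this alpha
    upper = n_reads - lower  # the null is symmetric, so mirror the lower tail
    return sum(binom_pmf(k, n_reads, true_fraction)
               for k in range(n_reads + 1) if k <= lower or k >= upper)

def total_reads(reads_per_gene, n_genes):
    """Total reads needed, assuming reads are spread evenly across genes."""
    return reads_per_gene * n_genes

if __name__ == "__main__":
    for n in (100, 500, 1000):
        print(n, round(imbalance_power(n, 0.6), 3))
    print(total_reads(500, 10_000), total_reads(1000, 10_000))
```

The even-spread assumption in `total_reads` reproduces the arithmetic behind the 5 to 10 million figure; real expression levels vary widely between genes, so weakly expressed genes would fall below the per-gene target at that depth.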
cis-regulation; Drosophila melanogaster; Drosophila simulans; gene expression; hybrids
We have adapted a solution hybrid selection protocol to enrich pathogen DNA in clinical samples dominated by human genetic material. Using mock mixtures of human and Plasmodium falciparum malaria parasite DNA as well as clinical samples from infected patients, we demonstrate an average of approximately 40-fold enrichment of parasite DNA after hybrid selection. This approach will enable efficient genome sequencing of pathogens from clinical samples, as well as sequencing of endosymbiotic organisms such as Wolbachia that live inside diverse metazoan phyla.
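As a worked illustration of what a fold-enrichment figure means, the sketch below compares the fraction of pathogen reads before and after selection. The definition (ratio of pathogen read fractions) and the read counts are assumptions chosen for illustration, not values or methods taken from the study.

```python
def fold_enrichment(pathogen_before, total_before, pathogen_after, total_after):
    """Ratio of the pathogen read fraction after selection to that before.

    Assumed definition: (pathogen_after / total_after) / (pathogen_before / total_before),
    rearranged so the integer products are formed before the single division.
    """
    if pathogen_before == 0 or total_after == 0:
        raise ValueError("need pathogen reads before selection and reads after selection")
    return (pathogen_after * total_before) / (pathogen_before * total_after)

if __name__ == "__main__":
    # Hypothetical counts: 1% parasite reads before selection, 40% after.
    print(fold_enrichment(10, 1000, 400, 1000))
```

Under this definition the achievable enrichment is bounded by the starting fraction: a sample that is already 10% pathogen can be enriched at most 10-fold.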
Despite the ever-increasing output of Illumina sequencing data, loci with extreme base compositions are often under-represented or absent. To evaluate sources of base-composition bias, we traced genomic sequences ranging from 6% to 90% GC through the process by quantitative PCR. We identified PCR during library preparation as a principal source of bias and optimized the conditions. Our improved protocol significantly reduces amplification bias and minimizes the previously severe effects of PCR instrument and temperature ramp rate.
Early life experiences have a major impact on adult phenotypes [1–3]. However, the mechanisms by which animals retain a cellular memory of early experience are not well understood. Here we show that adult wild-type C. elegans that transiently passed through the stress-resistant dauer larval stage exhibit distinct gene expression profiles and life history traits, as compared to adult animals that bypassed this stage. Using chromatin immunoprecipitation experiments coupled with massively parallel sequencing, we find that genome-wide levels of specific histone tail modifications are markedly altered in post-dauer animals. Mutations in subsets of genes implicated in chromatin remodeling abolish, or alter, the observed changes in gene expression and life history traits in post-dauer animals. Modifications to the epigenome as a consequence of early experience may contribute in part to a memory of early experience, and generate phenotypic variation in an isogenic population.
Joubert syndrome (JBTS), related disorders (JSRD) and Meckel syndrome (MKS) are ciliopathies. We now report that MKS2 and JBTS2 loci are allelic and due to mutations in TMEM216, encoding an uncharacterized tetraspan transmembrane protein. JBTS2 patients displayed frequent nephronophthisis and polydactyly, and two cases conformed to the Oro-Facio-Digital type VI phenotype, whereas skeletal dysplasia was common in MKS fetuses. A single p.R73L mutation was identified in all patients of Ashkenazi Jewish descent (n=10). TMEM216 localized to the base of primary cilia, and loss of TMEM216 in patient fibroblasts or following siRNA knockdown caused defective ciliogenesis and centrosomal docking, with concomitant hyperactivation of RhoA and Dishevelled. TMEM216 complexed with Meckelin, encoded by a gene also mutated in JSRD and MKS. Abrogation of tmem216 expression in zebrafish led to gastrulation defects that overlap with other ciliary morphants. The data implicate a new family of proteins in the ciliopathies, and further support allelism between ciliopathy disorders.
We report the application of single molecule-based sequencing technology for high-throughput profiling of histone modifications in mammalian cells. By obtaining over 4 billion bases of sequence from chromatin immunoprecipitated DNA, we generated genome-wide chromatin state maps of mouse embryonic stem cells, neural progenitor cells and embryonic fibroblasts. We find that lysine 4 and lysine 27 tri-methylation effectively discriminate genes that are expressed, poised for expression, or stably repressed, and therefore reflect cell state and lineage potential. Lysine 36 tri-methylation marks primary coding and non-coding transcripts, facilitating gene annotation. Lysine 9 and lysine 20 tri-methylation are detected at satellite, telomeric and active long-terminal repeats, and can spread into proximal unique sequences. Lysine 4 and lysine 9 tri-methylation mark imprinting control regions. Finally, we show that chromatin state can be read in an allele-specific manner by using single nucleotide polymorphisms. This study provides a framework for the application of comprehensive chromatin profiling towards characterization of diverse mammalian cell populations.
An automated method for constructing libraries for 454 sequencing significantly reduces the cost and time required.
We present an automated, high throughput library construction process for 454 technology. Sample handling errors and cross-contamination are minimized via end-to-end barcoding of plasticware, along with molecular DNA barcoding of constructs. Automation-friendly magnetic bead-based size selection and cleanup steps have been devised, eliminating major bottlenecks and significant sources of error. Using this methodology, one technician can create 96 sequence-ready 454 libraries in 2 days, a dramatic improvement over the standard method.
Targeting genomic loci by massively parallel sequencing requires new methods to enrich templates to be sequenced. We developed a capture method that uses biotinylated RNA “baits” to “fish” targets out of a “pond” of DNA fragments. The RNA is transcribed from PCR-amplified oligodeoxynucleotides originally synthesized on a microarray, generating sufficient bait for multiple captures at concentrations high enough to drive the hybridization. We tested this method with 170-mer baits that target >15,000 coding exons (2.5 Mb) and four regions (1.7 Mb total) using Illumina sequencing as read-out. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper. The uniformity was such that ~60% of target bases in the exonic “catch”, and ~80% in the regional catch, had at least half the mean coverage. One lane of Illumina sequence was sufficient to call high-confidence genotypes for 89% of the targeted exon space.
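The uniformity measure quoted above, the share of target bases with at least half the mean coverage, can be computed as in the following minimal sketch. It assumes per-base coverage is already available as a list of integers; the function name and the example values are hypothetical.

```python
def fraction_at_half_mean(coverage):
    """Fraction of target bases whose coverage is at least half the mean coverage."""
    if not coverage:
        raise ValueError("coverage must contain at least one base")
    mean_cov = sum(coverage) / len(coverage)
    threshold = mean_cov / 2
    return sum(1 for depth in coverage if depth >= threshold) / len(coverage)

if __name__ == "__main__":
    # Hypothetical per-base depths over a short target region.
    uneven = [0, 0, 0, 40]      # mean 10, threshold 5: one base in four passes
    even = [10, 10, 10, 10]     # every base passes
    print(fraction_at_half_mean(uneven), fraction_at_half_mean(even))
```

A perfectly even capture scores 1.0 on this measure, so the reported values of roughly 0.6 and 0.8 indicate how far the exonic and regional catches are from that ideal.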
Cancer results from somatic alterations in key genes, including point mutations, copy number alterations and structural rearrangements. A powerful way to discover cancer-causing genes is to identify genomic regions that show recurrent copy-number alterations (gains and losses) in tumor genomes. Recent advances in sequencing technologies suggest that massively parallel sequencing may provide a feasible alternative to DNA microarrays for detecting copy-number alterations. Here, we present: (i) a statistical analysis of the power to detect copy-number alterations of a given size; (ii) SegSeq, an algorithm to identify chromosomal breakpoints using massively parallel sequence data; and (iii) analysis of experimental data from three matched pairs of tumor and normal cell lines. We show that a collection of ∼14 million aligned sequence reads from human cell lines has power to detect events comparable to that of the current generation of DNA microarrays and has over two-fold better precision for localizing breakpoints (typically, to within ∼1 kb).
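The basic idea of reading copy number from sequence data can be sketched by comparing tumor and normal read counts in genomic windows. The example below uses simple fixed-width binning with depth normalization; it is not the SegSeq algorithm, and the function name, window size and read positions are assumptions made for illustration.

```python
import math
from collections import Counter

def log2_copy_ratios(tumor_positions, normal_positions, window):
    """Depth-normalized log2(tumor / normal) read-count ratio per fixed-width window.

    Windows with no tumor or no normal reads are skipped rather than imputed.
    """
    tumor_counts = Counter(pos // window for pos in tumor_positions)
    normal_counts = Counter(pos // window for pos in normal_positions)
    # Rescale so that differing total read depth does not look like a gain or loss.
    scale = len(normal_positions) / len(tumor_positions)
    ratios = {}
    for index in sorted(normal_counts):
        if tumor_counts[index] == 0:
            continue
        ratios[index] = math.log2(tumor_counts[index] / normal_counts[index] * scale)
    return ratios

if __name__ == "__main__":
    # Hypothetical read start positions on one chromosome, 1 kb windows.
    normal = list(range(0, 1000, 100)) + list(range(1000, 2000, 100))
    tumor = list(range(0, 1000, 100)) + list(range(1000, 2000, 50))
    for index, ratio in log2_copy_ratios(tumor, normal, 1000).items():
        print(index, round(ratio, 3))
```

In this toy input the second window holds twice as many tumor reads as the first, so the two log2 ratios differ by one unit. A real analysis would also need to handle mappability and to place breakpoints statistically, which this sketch does not attempt.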
High-throughput sequencing platforms provide an approach for detecting rare HIV-1 variants and documenting more fully quasispecies diversity. We applied this technology to the V3 loop-coding region of env in samples collected from 4 chronically HIV-infected subjects in whom CCR5 antagonist (vicriviroc [VVC]) therapy failed. Between 25,000 and 140,000 amplified sequences were obtained per sample. Profound baseline V3 loop sequence heterogeneity existed; predicted CXCR4-using populations were identified in a largely CCR5-using population. The V3 loop forms associated with subsequent virologic failure, either through CXCR4 use or the emergence of high-level VVC resistance, were present as minor variants at 0.8–2.8% of baseline samples. Extreme, rapid shifts in population frequencies toward these forms occurred, and deep sequencing provided a detailed view of the rapid evolutionary impact of VVC selection. Greater V3 diversity was observed post-selection. This previously unreported degree of V3 loop sequence diversity has implications for viral pathogenesis, vaccine design, and the optimal use of HIV-1 CCR5 antagonists.