Clonal evolution is a key feature of cancer progression and relapse. We studied intratumoral heterogeneity in 149 chronic lymphocytic leukemia (CLL) cases by integrating whole-exome sequence and copy number to measure the fraction of cancer cells harboring each somatic mutation. We identified driver mutations as predominantly clonal (e.g., MYD88, trisomy 12 and del(13q)) or subclonal (e.g., SF3B1, TP53), corresponding to earlier and later events in CLL evolution. We sampled leukemia cells from 18 patients at two timepoints. Ten of 12 CLL cases treated with chemotherapy (but only 1 of 6 without treatment) underwent clonal evolution, predominantly involving subclones with driver mutations (e.g., SF3B1, TP53) that expanded over time. Furthermore, presence of a subclonal driver mutation was an independent risk factor for rapid disease progression. Our study thus uncovers patterns of clonal evolution in CLL, providing insights into its stepwise transformation, and links the presence of subclones with adverse clinical outcome.
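To illustrate how an allele fraction and a copy number estimate can be combined into the fraction of cancer cells harboring a mutation, a minimal sketch follows. It assumes a known sample purity, a known local copy number in tumor cells, a diploid normal genome, and a mutation multiplicity of 1 by default. The function name, its parameters, and the simple point estimate are illustrative assumptions, not the inference procedure used in the study.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Point estimate of the fraction of cancer cells carrying a mutation.

    vaf          -- observed variant allele fraction (alt reads / total reads)
    purity       -- fraction of sampled cells that are cancer cells (assumed known)
    tumor_cn     -- total copy number at the locus in cancer cells (assumed known)
    multiplicity -- mutated copies per mutated cancer cell (assumed 1)
    normal_cn    -- copy number in contaminating normal cells (assumed diploid)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per sampled cell.
    avg_copies = purity * tumor_cn + (1 - purity) * normal_cn
    ccf = vaf * avg_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1; cap it there.
    return min(ccf, 1.0)
```

Under these assumptions, a heterozygous mutation at a diploid locus in a sample of 80% purity with an allele fraction of 0.2 corresponds to roughly half of the cancer cells, that is, a subclonal event.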
Purine biosynthesis and metabolism, conserved in all living organisms, is essential for cellular energy homeostasis and nucleic acid synthesis. The de novo synthesis of purine precursors is under tight negative feedback regulation mediated by adenosine and guanine nucleotides. We describe a new distinct early-onset neurodegenerative condition resulting from mutations in the adenosine monophosphate deaminase 2 gene (AMPD2). Patients have characteristic brain imaging features of pontocerebellar hypoplasia (PCH), due to loss of brainstem and cerebellar parenchyma. We found that AMPD2 plays an evolutionarily conserved role in the maintenance of cellular guanine nucleotide pools by regulating the feedback inhibition of adenosine derivatives on de novo purine synthesis. AMPD2 deficiency results in defective GTP-dependent initiation of protein translation, which can be rescued by administration of purine precursors. These data suggest AMPD2-related PCH as a new, potentially treatable early-onset neurodegenerative disease.
Purine; pyrimidine; deaminase; salvage; translation; GTP; de novo synthesis; neurodegeneration
While genetic lesions responsible for some Mendelian disorders can be rapidly discovered through massively parallel sequencing (MPS) of whole genomes or exomes, not all diseases readily yield to such efforts. We describe the illustrative case of the simple Mendelian disorder medullary cystic kidney disease type 1 (MCKD1), mapped more than a decade ago to a 2-Mb region on chromosome 1. Ultimately, only by cloning, capillary sequencing, and de novo assembly did we find that each of six MCKD1 families harbors an equivalent, but apparently independently arising, mutation in sequence dramatically underrepresented in MPS data: the insertion of a single C in one copy (but a different copy in each family) of the repeat unit comprising the extremely long (~1.5-5 kb), GC-rich (>80%), coding VNTR in the mucin 1 gene. The results provide a cautionary tale about the challenges in identifying genes responsible for Mendelian, let alone more complex, disorders through MPS.
To characterize the role of rare complete human knockouts in autism spectrum disorders (ASD), we identify genes with homozygous or compound heterozygous loss-of-function (LoF) variants (defined as nonsense and essential splice sites) from exome sequencing of 933 cases and 869 controls. We identify a two-fold increase in complete knockouts of autosomal genes with low rates of LoF variation (≤5% frequency) in cases and estimate a 3% contribution to ASD risk by these events, confirming this observation in an independent set of 563 probands and 4,605 controls. Outside the pseudo-autosomal regions on the X-chromosome, we similarly observe a significant 1.5-fold increase in rare hemizygous knockouts in males, contributing to another 2% of ASDs in males. Taken together these results provide compelling evidence that rare autosomal and X-chromosome complete gene knockouts are important inherited risk factors for ASD.
Detection of somatic point substitutions is a key step in characterizing the cancer genome. Mutations in cancer are rare (0.1–100/Mb) and often occur only in a subset of the sequenced cells, either due to contamination by normal cells or due to tumor heterogeneity. Consequently, mutation calling methods need to be both specific, avoiding false positives, and sensitive to detect clonal and sub-clonal mutations. The decreased sensitivity of existing methods for low allelic fraction mutations highlights the pressing need for improved and systematically evaluated mutation detection methods. Here we present MuTect, a method based on a Bayesian classifier designed to detect somatic mutations with very low allele-fractions, requiring only a few supporting reads, followed by a set of carefully tuned filters that ensure high specificity. We also describe novel benchmarking approaches, which use real sequencing data to evaluate the sensitivity and specificity as a function of sequencing depth, base quality and allelic fraction. Compared with other methods, MuTect has higher sensitivity with similar specificity, especially for mutations with allelic fractions as low as 0.1 and below, making MuTect particularly useful for studying cancer subclones and their evolution in standard exome and genome sequencing data.
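The idea of scoring a candidate site by comparing a "variant present" model against a "sequencing error only" model can be sketched as a log-odds calculation. The sketch below is a simplified illustration and not MuTect's implementation: it assumes a single uniform per-base error rate instead of per-read base qualities, evaluates the variant model at the observed allele fraction, and omits the downstream filters. The function name and the default error rate are hypothetical.

```python
import math


def variant_lod(ref_count, alt_count, error_rate=0.001):
    """log10 likelihood ratio: variant at the observed allele fraction vs. no variant.

    Assumes every read has the same error rate and that an erroneous base is
    equally likely to be any of the three other bases.
    """
    depth = ref_count + alt_count
    if depth == 0:
        return 0.0

    def log10_likelihood(f):
        # Probability that a read shows the alt base: a true alt read without
        # error, or a true ref read miscalled as this particular alt base.
        p_alt = f * (1 - error_rate) + (1 - f) * error_rate / 3
        # Probability that a read shows the ref base, by the same reasoning.
        p_ref = f * error_rate / 3 + (1 - f) * (1 - error_rate)
        return alt_count * math.log10(p_alt) + ref_count * math.log10(p_ref)

    observed_fraction = alt_count / depth
    return log10_likelihood(observed_fraction) - log10_likelihood(0.0)
```

A site with no supporting reads scores 0, and the score grows with the number of supporting reads at a fixed depth. A caller built on this idea would accept sites whose score exceeds a chosen threshold and then apply additional filters.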
The incidence of esophageal adenocarcinoma (EAC) has risen 600% over the last 30 years. With a five-year survival rate of 15%, identification of new therapeutic targets for EAC is greatly important. We analyze the mutation spectra from whole exome sequencing of 149 EAC tumor/normal pairs, 15 of which have also been subjected to whole genome sequencing. We identify a mutational signature defined by a high prevalence of A to C transversions at AA dinucleotides. Statistical analysis of exome data identified 26 significantly mutated genes. Of these genes, four (TP53, CDKN2A, SMAD4, and PIK3CA) have been previously implicated in EAC. The novel significantly mutated genes include chromatin modifying factors and candidate contributors: SPG20, TLR4, ELMO1, and DOCK2. Functional analyses of EAC-derived mutations in ELMO1 reveal increased cellular invasion. We therefore propose that activation of the RAC1 pathway may contribute to EAC tumorigenesis.
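How often a signature such as A to C transversions at AA dinucleotides occurs in a mutation list can be tallied with a few lines of code. The sketch below assumes each mutation is given as a trinucleotide context (mutated base in the middle) plus the alternate base, and it assumes the mutated A is the 3' base of the AA dinucleotide, which the text does not specify. Mutations reported as T to G are reverse-complemented so that both strands are counted. All names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def orient_to_a(context, alt):
    """Report a mutation of T on the opposite strand, so the mutated base reads A."""
    if context[1] == "T":
        return context.translate(COMPLEMENT)[::-1], alt.translate(COMPLEMENT)
    return context, alt


def aa_a_to_c_fraction(mutations):
    """Fraction of mutations that are A>C with an A immediately 5' of the mutated A.

    mutations -- iterable of (trinucleotide_context, alt_base) pairs, where the
                 middle base of the context is the reference base.
    """
    hits = 0
    total = 0
    for context, alt in mutations:
        context, alt = orient_to_a(context.upper(), alt.upper())
        total += 1
        if context[1] == "A" and alt == "C" and context[0] == "A":
            hits += 1
    return hits / total if total else 0.0
```

Comparing this fraction between a tumor cohort and a reference set of mutations is one simple way to ask whether the signature is enriched.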
Prior studies have identified recurrent oncogenic mutations in colorectal adenocarcinoma1 and have surveyed exons of protein-coding genes for mutations in 11 affected individuals2,3. Here we report whole-genome sequencing from nine individuals with colorectal cancer, including primary colorectal tumors and matched adjacent non-tumor tissues, at an average of 30.7× and 31.9× coverage, respectively. We identify an average of 75 somatic rearrangements per tumor, including complex networks of translocations between pairs of chromosomes. Eleven rearrangements encode predicted in-frame fusion proteins, including a fusion of VTI1A and TCF7L2 found in 3 out of 97 colorectal cancers. Although TCF7L2 encodes TCF4, which cooperates with β-catenin4 in colorectal carcinogenesis5,6, the fusion lacks the TCF4 β-catenin–binding domain. We found a colorectal carcinoma cell line harboring the fusion gene to be dependent on VTI1A-TCF7L2 for anchorage-independent growth using RNA interference-mediated knockdown. This study shows previously unidentified levels of genomic rearrangements in colorectal carcinoma that can lead to essential gene fusions and other oncogenic events.
Lung adenocarcinoma, the most common subtype of non-small cell lung cancer, is responsible for over 500,000 deaths per year worldwide. Here, we report exome and genome sequences of 183 lung adenocarcinoma tumor/normal DNA pairs. These analyses revealed a mean exonic somatic mutation rate of 12.0 events/megabase and identified the majority of genes previously reported as significantly mutated in lung adenocarcinoma. In addition, we identified statistically recurrent somatic mutations in the splicing factor gene U2AF1 and truncating mutations affecting RBM10 and ARID1A. Analysis of nucleotide context-specific mutation signatures grouped the sample set into distinct clusters that correlated with smoking history and alterations of reported lung adenocarcinoma genes. Whole genome sequence analysis revealed frequent structural re-arrangements, including in-frame exonic alterations within EGFR and SIK2 kinases. The candidate genes identified in this study are attractive targets for biological characterization and therapeutic targeting of lung adenocarcinoma.
Despite recent insights into melanoma genetics, systematic surveys for driver mutations are challenged by an abundance of passenger mutations caused by carcinogenic ultraviolet (UV) light exposure. We developed a permutation-based framework to address this challenge, employing mutation data from intronic sequences to control for passenger mutational load on a per gene basis. Analysis of large-scale melanoma exome data by this approach discovered six novel melanoma genes (PPP6C, RAC1, SNX31, TACC1, STK19 and ARID2), three of which - RAC1, PPP6C and STK19 - harbored recurrent and potentially targetable mutations. Integration with chromosomal copy number data contextualized the landscape of driver mutations, providing oncogenic insights in BRAF- and NRAS-driven melanoma as well as those without known NRAS/BRAF mutations. The landscape also clarified a mutational basis for RB and p53 pathway deregulation in this malignancy. Finally, the spectrum of driver mutations provided unequivocal genomic evidence for a direct mutagenic role of UV light in melanoma pathogenesis.
Systemic lupus erythematosus (SLE) is a common systemic autoimmune disease with complex etiology but strong clustering in families (λS = ~30). We performed a genome-wide association scan using 317,501 SNPs in 720 women of European ancestry with SLE and in 2,337 controls, and we genotyped consistently associated SNPs in two additional independent sample sets totaling 1,846 affected women and 1,825 controls. Aside from the expected strong association between SLE and the HLA region on chromosome 6p21 and the previously confirmed non-HLA locus IRF5 on chromosome 7q32, we found evidence of association with replication (1.6 × 10⁻²³ < P_overall < 1.1 × 10⁻⁷; odds ratio 0.82–1.62) in four regions: 16p11.2 (ITGAM), 11p15.5 (KIAA1542), 3p14.3 (PXK) and 1q25.1 (rs10798269). We also found evidence for association (P < 1 × 10⁻⁵) at FCGR2A, PTPN22 and STAT4, regions previously associated with SLE and other autoimmune diseases, as well as at ≥9 other loci (P < 2 × 10⁻⁷). Our results show that numerous genes, some with known immune-related functions, predispose to SLE.
As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ∼313 genes per genome, and ∼95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.
Establishing the age of each mutation segregating in contemporary human populations is important to fully understand our evolutionary history1,2 and will help facilitate the development of new approaches for disease gene discovery3. Large-scale surveys of human genetic variation have reported signatures of recent explosive population growth4-6, notable for an excess of rare genetic variants, qualitatively suggesting that many mutations arose recently. To more quantitatively assess the distribution of mutation ages, we resequenced 15,336 genes in 6,515 individuals of European (n=4,298) and African (n=2,217) American ancestry and inferred the age of 1,146,401 autosomal single nucleotide variants (SNVs). We estimate that ~73% of all protein-coding SNVs and ~86% of SNVs predicted to be deleterious arose in the past 5,000-10,000 years. The average age of deleterious SNVs varied significantly across molecular pathways, and disease genes contained a significantly higher proportion of recently arisen deleterious SNVs compared to other genes. Furthermore, European Americans had an excess of deleterious variants in essential and Mendelian disease genes compared to African Americans, consistent with weaker purifying selection due to the out-of-Africa dispersal. Our results better delimit the historical details of human protein-coding variation, illustrate the profound effect recent human history has had on the burden of deleterious SNVs segregating in contemporary populations, and provide important practical information that can be used to prioritize variants in disease gene discovery.
Autism spectrum disorders are a genetically heterogeneous constellation of syndromes characterized by impairments in reciprocal social interaction. Available somatic treatments have limited efficacy. We have identified inactivating mutations in the gene BCKDK (Branched Chain Ketoacid Dehydrogenase Kinase) in consanguineous families with autism, epilepsy, and intellectual disability. The encoded protein is responsible for phosphorylation-mediated inactivation of the E1α subunit of branched-chain ketoacid dehydrogenase (BCKDH). Patients with homozygous BCKDK mutations display reductions in BCKDK messenger RNA and protein, E1α phosphorylation, and plasma branched-chain amino acids. Bckdk knockout mice show abnormal brain amino acid profiles and neurobehavioral deficits that respond to dietary supplementation. Thus, autism presenting with intellectual disability and epilepsy caused by BCKDK mutations represents a potentially treatable syndrome.
The somatic genetic basis of chronic lymphocytic leukemia, a common and clinically heterogeneous leukemia occurring in adults, remains poorly understood.
We obtained DNA samples from leukemia cells in 91 patients with chronic lymphocytic leukemia and performed massively parallel sequencing of 88 whole exomes and whole genomes, together with sequencing of matched germline DNA, to characterize the spectrum of somatic mutations in this disease.
Nine genes that are mutated at significant frequencies were identified, including four with established roles in chronic lymphocytic leukemia (TP53 in 15% of patients, ATM in 9%, MYD88 in 10%, and NOTCH1 in 4%) and five with unestablished roles (SF3B1, ZMYM3, MAPK1, FBXW7, and DDX3X). SF3B1, which functions at the catalytic core of the spliceosome, was the second most frequently mutated gene (with mutations occurring in 15% of patients). SF3B1 mutations occurred primarily in tumors with deletions in chromosome 11q, which are associated with a poor prognosis in patients with chronic lymphocytic leukemia. We further discovered that tumor samples with mutations in SF3B1 had alterations in pre–messenger RNA (mRNA) splicing.
Our study defines the landscape of somatic mutations in chronic lymphocytic leukemia and highlights pre-mRNA splicing as a critical cellular process contributing to chronic lymphocytic leukemia.
Prostate cancer is the second most common cancer in men worldwide and causes over 250,000 deaths each year1. Overtreatment of indolent disease also results in significant morbidity2. Common genetic alterations in prostate cancer include losses of NKX3.1 (8p21)3,4 and PTEN (10q23)5,6, gains of the androgen receptor gene (AR)7,8 and fusion of ETS-family transcription factor genes with androgen-responsive promoters9–11. Recurrent somatic base-pair substitutions are believed to be less contributory in prostate tumorigenesis12,13 but have not been systematically analyzed in large cohorts. Here we sequenced the exomes of 112 prostate tumor/normal pairs. Novel recurrent mutations were identified in multiple genes, including MED12 and FOXA1. SPOP was the most frequently mutated gene, with mutations involving the SPOP substrate binding cleft in 6–15% of tumors across multiple independent cohorts. SPOP-mutant prostate cancers lacked ETS rearrangements and exhibited a distinct pattern of genomic alterations. Thus, SPOP mutations may define a new molecular subtype of prostate cancer.
Neighboring genes are often coordinately expressed within cis-regulatory modules, but evidence that nonparalogous genes share functions in mammals is lacking. Here, we report that mutation of either TMEM138 or TMEM216 causes a phenotypically indistinguishable human ciliopathy, Joubert syndrome. Despite a lack of sequence homology, the genes are aligned in a head-to-tail configuration and joined by chromosomal rearrangement at the amphibian-to-reptile evolutionary transition. Expression of the two genes is mediated by a conserved regulatory element in the noncoding intergenic region. Coordinated expression is important for their interdependent cellular role in vesicular transport to primary cilia. Hence, during vertebrate evolution of genes involved in ciliogenesis, nonparalogous genes were arranged into a functional gene cluster with shared regulatory elements.
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analysis of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
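The three filtering criteria named above (fraction of missing data, read depth, and balance of alternative to reference reads) can be sketched as a single predicate applied to each heterozygous call. The record layout and every threshold below are hypothetical placeholders chosen for illustration; the study's actual cutoffs are not stated in this text.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    missing_fraction: float  # fraction of samples with no genotype at this site
    read_depth: int          # total reads covering the site in this sample
    alt_reads: int           # reads supporting the alternative allele
    ref_reads: int           # reads supporting the reference allele


def passes_filters(call, max_missing=0.1, min_depth=10, balance_range=(0.3, 0.7)):
    """Return True if a heterozygous call clears all three illustrative filters."""
    informative = call.alt_reads + call.ref_reads
    if informative == 0:
        return False
    # Allele balance: share of informative reads that support the alt allele.
    balance = call.alt_reads / informative
    low, high = balance_range
    return (
        call.missing_fraction <= max_missing
        and call.read_depth >= min_depth
        and low <= balance <= high
    )
```

Applying the same predicate to calls from both centers is one way to bring their distributions of rare variation closer together before any association testing.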
This study evaluates association of rare variants and autism spectrum disorders (ASD) in case and control samples sequenced by two centers. Before doing association analyses, we studied how to combine information across studies. We first harmonized the whole-exome sequence (WES) data, across centers, in terms of the distribution of rare variation. Key features included filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. After filtering, the vast majority of variant calls from seven samples sequenced at both centers matched. We also evaluated whether one should combine summary statistics from data from each center (meta-analysis) or combine data and analyze it together (mega-analysis). For many gene-based tests, we showed that mega-analysis yields more power. After quality control of data from 1,039 ASD cases and 870 controls and a range of analyses, no gene showed exome-wide evidence of significant association. Our results comport with recent results demonstrating that hundreds of genes affect risk for ASD; they suggest that rare risk variants are scattered across these many genes, and thus larger samples will be required to identify those genes.
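For readers unfamiliar with the distinction, meta-analysis combines per-center summary statistics, whereas mega-analysis pools the raw data and tests once. One common way to combine summary statistics is a sample-size-weighted Z-score (Stouffer's method), sketched below. This is a generic illustration and an assumption on our part; it is not necessarily the combination rule evaluated in the study, and the function name is hypothetical.

```python
import math


def stouffer_z(z_scores, sample_sizes):
    """Combine per-center Z-scores, weighting each by the square root of its sample size."""
    if len(z_scores) != len(sample_sizes) or not z_scores:
        raise ValueError("need one sample size per Z-score")
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator
```

Two equally sized centers that each report Z = 2 yield a combined Z of about 2.83. A mega-analysis would instead merge the genotype data from both centers and run the gene-based test once on the pooled sample.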
Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified1,2. To identify further genetic risk factors, we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n= 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant and the overall rate of mutation is only modestly higher than the expected rate. In contrast, there is significantly enriched connectivity among the proteins encoded by genes harboring de novo missense or nonsense mutations, and excess connectivity to prior ASD genes of major effect, suggesting a subset of observed events are relevant to ASD risk. The small increase in rate of de novo events, when taken together with the connections among the proteins themselves and to ASD, is consistent with an important but limited role for de novo point mutations, similar to that documented for de novo copy number variants. Genetic models incorporating these data suggest that the majority of observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (i.e., not necessarily causal). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favor of CHD8 and KATNAL2 as genuine autism risk factors.
Medulloblastomas are the most common malignant brain tumors in children1. Identifying and understanding the genetic events that drive these tumors is critical for the development of more effective diagnostic, prognostic and therapeutic strategies. Recently, our group and others described distinct molecular subtypes of medulloblastoma based on transcriptional and copy number profiles2–5. Here, we utilized whole exome hybrid capture and deep sequencing to identify somatic mutations across the coding regions of 92 primary medulloblastoma/normal pairs. Overall, medulloblastomas exhibit low mutation rates consistent with other pediatric tumors, with a median of 0.35 non-silent mutations per megabase. We identified twelve genes mutated at statistically significant frequencies, including previously known mutated genes in medulloblastoma such as CTNNB1, PTCH1, MLL2, SMARCA4 and TP53. Recurrent somatic mutations were identified in an RNA helicase gene, DDX3X, often concurrent with CTNNB1 mutations, and in the nuclear co-repressor (N-CoR) complex genes GPS2, BCOR, and LDB1, novel findings in medulloblastoma. We show that mutant DDX3X potentiates transactivation of a TCF promoter and enhances cell viability in combination with mutant but not wild type beta-catenin. Together, our study reveals the alteration of Wnt, Hedgehog, histone methyltransferase and now N-CoR pathways across medulloblastomas and within specific subtypes of this disease, and nominates the RNA helicase DDX3X as a component of pathogenic beta-catenin signaling in medulloblastoma.
Multiple myeloma is an incurable malignancy of plasma cells, and its pathogenesis is poorly understood. Here we report the massively parallel sequencing of 38 tumor genomes and their comparison to matched normal DNAs. Several new and unexpected oncogenic mechanisms were suggested by the pattern of somatic mutation across the dataset. These include the mutation of genes involved in protein translation (seen in nearly half of the patients), genes involved in histone methylation, and genes involved in blood coagulation. In addition, a broader than anticipated role of NF-κB signaling was suggested by mutations in 11 members of the NF-κB pathway. Of potential immediate clinical relevance, activating mutations of the kinase BRAF were observed in 4% of patients, suggesting the evaluation of BRAF inhibitors in multiple myeloma clinical trials. These results indicate that cancer genome sequencing of large collections of samples will yield new insights into cancer not anticipated by existing knowledge.
Hepatocellular carcinoma (HCC) is a highly heterogeneous disease, and prior attempts to develop genomics-based classification for HCC have yielded highly divergent results, indicating the difficulty of identifying a unified molecular anatomy. We performed a meta-analysis of gene expression profiles in datasets from 8 independent patient cohorts across the world. In addition, aiming to establish the real world applicability of a classification system, we profiled 118 formalin-fixed, paraffin-embedded tissues from an additional patient cohort. A total of 603 patients were analyzed, representing the major etiologies of HCC (hepatitis B and C) collected from Western and Eastern countries. We observed 3 robust HCC subclasses (termed S1, S2, and S3), each correlated with clinical parameters such as tumor size, extent of cellular differentiation, and serum alpha-fetoprotein levels. An analysis of the components of the signatures indicated that S1 reflected aberrant activation of the WNT signaling pathway, S2 was characterized by proliferation as well as MYC and AKT activation, and S3 was associated with hepatocyte differentiation. Functional studies indicated that the WNT pathway activation signature characteristic of S1 tumors was not simply the result of beta-catenin mutation, but rather was the result of TGF-beta activation, thus representing a new mechanism of WNT pathway activation in HCC. These experiments establish the first consensus classification framework for HCC based on gene-expression profiles, and highlight the power of integrating multiple datasets to define a robust molecular taxonomy of the disease.
hepatocellular carcinoma; transcriptome; meta-analysis; transforming growth factor-beta; WNT pathway
As researchers begin probing deep coverage sequencing data for increasingly rare mutations and subclonal events, the fidelity of next generation sequencing (NGS) laboratory methods will become increasingly critical. Although error rates for sequencing and polymerase chain reaction (PCR) are well documented, the effects that DNA extraction and other library preparation steps could have on downstream sequence integrity have not been thoroughly evaluated. Here, we describe the discovery of novel C > A/G > T transversion artifacts found at low allelic fractions in targeted capture data. Characteristics such as sequencer read orientation and presence in both tumor and normal samples strongly indicated a non-biological mechanism. We identified the source as oxidation of DNA during acoustic shearing in samples containing reactive contaminants from the extraction process. We show generation of 8-oxoguanine (8-oxoG) lesions during DNA shearing, present analysis tools to detect oxidation in sequencing data and suggest methods to reduce DNA oxidation through the introduction of antioxidants. Further, informatics methods are presented to confidently filter these artifacts from sequencing data sets. Though only seen in a low percentage of reads in affected samples, such artifacts could have profoundly deleterious effects on the ability to confidently call rare mutations, and eliminating other possible sources of artifacts should become a priority for the research community.
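The two characteristics highlighted above, a dependence on read orientation and a low allelic fraction, suggest a simple screening rule for C > A/G > T calls. The sketch below is an illustrative assumption and not the informatics method presented in the paper: it assumes alt-supporting read pairs have already been counted in two orientation classes (labeled F1R2 and F2R1 here), and the thresholds and function names are hypothetical.

```python
def orientation_skew(alt_f1r2, alt_f2r1):
    """Return 0.0 when alt reads are split evenly by orientation, 1.0 when all share one."""
    total = alt_f1r2 + alt_f2r1
    if total == 0:
        return 0.0
    return abs(alt_f1r2 - alt_f2r1) / total


def looks_like_oxidation_artifact(ref, alt, alt_f1r2, alt_f2r1, allele_fraction,
                                  min_skew=0.9, max_allele_fraction=0.1):
    """Flag low-fraction C>A or G>T calls whose support comes from one orientation."""
    is_candidate_change = (ref.upper(), alt.upper()) in {("C", "A"), ("G", "T")}
    return (
        is_candidate_change
        and allele_fraction <= max_allele_fraction
        and orientation_skew(alt_f1r2, alt_f2r1) >= min_skew
    )
```

A call flagged by such a rule would merit closer inspection or removal, whereas a C > A call supported evenly by both orientations would pass through unchanged.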
Knowledge of “actionable” somatic genomic alterations present in each tumor (e.g., point mutations, small insertions/deletions, and copy number alterations that direct therapeutic options) should facilitate individualized approaches to cancer treatment. However, clinical implementation of systematic genomic profiling has rarely been achieved beyond limited numbers of oncogene point mutations. To address this challenge, we utilized a targeted, massively parallel sequencing approach to detect tumor genomic alterations in formalin-fixed, paraffin-embedded (FFPE) tumor samples. Nearly 400-fold mean sequence coverage was achieved, and single nucleotide sequence variants, small insertions/deletions, and chromosomal copy number alterations were detected simultaneously with high accuracy compared to other methods in clinical use. Putatively actionable genomic alterations, including those that predict sensitivity or resistance to established and experimental therapies, were detected in each tumor sample tested. Thus, targeted deep sequencing of clinical tumor material may enable mutation-driven clinical trials and, ultimately, “personalized” cancer treatment.