Knowledge of individual ancestry is important for genetic association studies where population structure leads to false positive signals. Estimating individual ancestry with targeted sequence data, which constitutes the bulk of current sequence datasets, is challenging. Here, we propose a new method for accurate estimation of genetic ancestry. Our method skips genotype calling and directly analyzes sequence reads. We validate the method using simulated and empirical data and show that the method can accurately infer worldwide continental ancestry with whole genome shotgun coverage as low as 0.001X. For estimates of fine-scale ancestry within Europe, the method performs well with coverage of 0.1X. At an even finer scale, the method improves discrimination between exome-sequenced participants originating from different provinces within Finland. Finally, we show that our method can be used to improve case-control matching in genetic association studies and reduce the risk of spurious findings due to population structure.
The Drug-Gene Interaction database (DGIdb) mines existing resources that generate hypotheses about how mutated genes might be targeted therapeutically or prioritized for drug development. It provides an interface for searching lists of genes against a compendium of drug-gene interactions and potentially druggable genes. DGIdb can be accessed at dgidb.org.
To test the hypothesis that rare variants are associated with drug-induced long QT syndrome (diLQTS) and torsade de pointes (TdP).
diLQTS is associated with the potentially fatal arrhythmia TdP. The contribution of rare genetic variants to the underlying genetic framework predisposing to diLQTS has not been systematically examined.
We performed whole exome sequencing (WES) on 65 diLQTS cases and 148 drug-exposed controls of European descent. We employed rare variant analyses (variable threshold [VT] and sequence kernel association test [SKAT]) and gene-set analyses to identify genes enriched with rare amino-acid coding (AAC) variants associated with diLQTS. Significant associations were reanalyzed by comparing diLQTS cases to 515 ethnically matched controls from the NHLBI GO Exome Sequencing Project (ESP).
Rare variants in 7 genes were enriched in the diLQTS cases according to SKAT or VT compared to drug-exposed controls (p<0.001). Of these, we replicated the diLQTS associations for KCNE1 and ACN9 using 515 ESP controls (p<0.05). A total of 37% of the diLQTS cases also had ≥1 rare AAC variant, as compared to 21% of controls (p=0.009), in a predefined set of seven congenital LQTS (cLQTS) genes encoding potassium channels or channel modulators (KCNE1, KCNE2, KCNH2, KCNJ2, KCNJ5, KCNQ1, AKAP9).
By combining WES with aggregated rare variant analyses, we implicate rare variants in KCNE1 and ACN9 as risk factors for diLQTS. Moreover, diLQTS cases were more burdened by rare AAC variants in cLQTS genes encoding potassium channel modulators, supporting the idea that multiple rare variants, notably across cLQTS genes, predispose to diLQTS.
exome; torsade de pointes; long QT syndrome; genetics; adverse drug event
Data from eight breast cancer genome sequencing projects identified 25 patients with HER2 somatic mutations in cancers lacking HER2 gene amplification. To determine the phenotype of these mutations, we functionally characterized thirteen HER2 mutations using in vitro kinase assays, protein structure analysis, cell culture and xenograft experiments. Seven of these mutations are activating mutations, including G309A, D769H, D769Y, V777L, P780ins, V842I, and R896C. HER2 in-frame deletion 755-759, which is homologous to EGFR exon 19 in-frame deletions, had a neomorphic phenotype with increased phosphorylation of EGFR or HER3. L755S produced lapatinib resistance, but was not an activating mutation in our experimental systems. All of these mutations were sensitive to the irreversible kinase inhibitor, neratinib. These findings demonstrate that HER2 somatic mutation is an alternative mechanism to activate HER2 in breast cancer and they validate HER2 somatic mutations as drug targets for breast cancer treatment.
Genomics; Breast Cancer; Receptor Tyrosine Kinase; Oncogene
Retinoblastoma is a rare childhood cancer of the developing retina. Most retinoblastomas initiate with biallelic inactivation of the RB1 gene through diverse mechanisms including point mutations, nucleotide insertions, deletions, loss of heterozygosity and promoter hypermethylation. Recently, a novel mechanism of retinoblastoma initiation was proposed. Gallie and colleagues discovered that a small proportion of retinoblastomas lack RB1 mutations and have MYCN amplification. In this study, we identified recurrent chromosomal, regional and focal genomic lesions in 94 primary retinoblastomas with their matched normal DNA using SNP 6.0 chips. We also analyzed the RB1 gene mutations and compared the mechanism of RB1 inactivation to the recurrent copy number variations in the retinoblastoma genome. In addition to the previously described focal amplification of MYCN and deletions in RB1 and BCOR, we also identified recurrent focal amplification of OTX2, a transcription factor required for retinal photoreceptor development. We identified 10 retinoblastomas in our cohort that lacked RB1 point mutations or indels. We performed whole genome sequencing on those 10 tumors and their corresponding germline DNA. In one of the tumors, the RB1 gene was unaltered, the MYCN gene was amplified and RB1 protein was expressed in the nuclei of the tumor cells. In addition, several tumors had complex patterns of structural variations and we identified 3 tumors with chromothripsis at the RB1 locus. This is the first report of chromothripsis as a mechanism for RB1 gene inactivation in cancer.
chromothripsis; retinoblastoma; RB1; MYCN
New technologies for DNA sequencing, coupled with advanced analytical approaches, are now providing unprecedented speed and precision in decoding human genomes. This combination of technology and analysis, when applied to the study of cancer genomes, is revealing specific and novel information about the fundamental genetic mechanisms that underlie cancer’s development and progression. This review outlines the history of the past several years of development in this realm, and discusses the current and future applications that will further elucidate cancer’s genomic causes.
The genetic structure of the indigenous hunter-gatherer peoples of southern Africa, the oldest known lineage of modern humans, is important for understanding human diversity. Studies based on mitochondrial1 and small sets of nuclear markers2 have shown that these hunter-gatherers, known as Khoisan, San, or Bushmen, are genetically divergent from other humans1,3. However, until now, fully sequenced human genomes have been limited to recently diverged populations4–8. Here we present the complete genome sequences of an indigenous hunter-gatherer from the Kalahari Desert and a Bantu from southern Africa, as well as protein-coding regions from an additional three hunter-gatherers from disparate regions of the Kalahari. We characterize the extent of whole-genome and exome diversity among the five men, reporting 1.3 million novel DNA differences genome-wide, including 13,146 novel amino acid variants. In terms of nucleotide substitutions, the Bushmen seem to be, on average, more different from each other than, for example, a European and an Asian. Observed genomic differences between the hunter-gatherers and others may help to pinpoint genetic adaptations to an agricultural lifestyle. Adding the described variants to current databases will facilitate inclusion of southern Africans in medical research efforts, particularly when family and medical histories can be correlated with genome-wide data.
Cancer immunoediting, the process whereby the immune system controls tumour outgrowth and shapes tumour immunogenicity, is comprised of three phases: elimination, equilibrium and escape1–5. Although many immune components that participate in this process are known, its underlying mechanisms remain poorly defined. A central tenet of cancer immunoediting is that T cell recognition of tumour antigens drives the immunologic destruction or sculpting of a developing cancer. However, our current understanding of tumour antigens comes largely from analyses of cancers that develop in immunocompetent hosts and thus may have already been edited. Little is known about the antigens expressed in nascent tumour cells, whether they are sufficient to induce protective anti-tumour immune responses or whether their expression is modulated by the immune system. Here, using massively parallel sequencing, we characterize expressed mutations in highly immunogenic methylcholanthrene-induced sarcomas derived from immunodeficient Rag2−/− mice which phenotypically resemble nascent primary tumour cells1,3,5. Employing class I prediction algorithms, we identify mutant spectrin-β2 as a potential rejection antigen of the d42m1 sarcoma and validate this prediction by conventional antigen expression cloning and detection. We also demonstrate that cancer immunoediting of d42m1 occurs via a T cell-dependent immunoselection process that promotes outgrowth of pre-existing tumour cell clones lacking highly antigenic mutant spectrin-β2 and other potential strong antigens. These results demonstrate that the strong immunogenicity of an unedited tumour can be ascribed to expression of highly antigenic mutant proteins and show that outgrowth of tumour cells that lack these strong antigens via a T cell-dependent immunoselection process represents one mechanism of cancer immunoediting.
The 11th International Meeting on Human Genome Variation and Complex Genome Analysis (HGV2009: Tallinn, Estonia, 11th–13th September 2009) provided a stimulating workshop environment where diverse academics and industry representatives explored the latest progress, challenges, and opportunities in relating genome variation to evolution, technology, health, and disease. Key themes included Genome-Wide Association Studies (GWAS), progress beyond GWAS, sequencing developments, and bioinformatics approaches to large-scale datasets.
HGV2009; SNP; variation; GWAS; CNV
The commonest pediatric brain tumors are low-grade gliomas (LGGs). We utilized whole genome sequencing to discover multiple novel genetic alterations involving BRAF, RAF1, FGFR1, MYB, MYBL1 and genes with histone-related functions, including H3F3A and ATRX, in 39 LGGs and low-grade glioneuronal tumors (LGGNTs). Only a single non-silent somatic alteration was detected in 24/39 (62%) tumors. Intragenic duplications of the FGFR1 tyrosine kinase domain (TKD) and rearrangements of MYB were recurrent and mutually exclusive in 53% of grade II diffuse LGGs. Transplantation of Trp53-null neonatal astrocytes containing TKD-duplicated FGFR1 into brains of nude mice generated high-grade astrocytomas with short latency and 100% penetrance. TKD-duplicated FGFR1 induced FGFR1 autophosphorylation and upregulation of the MAPK/ERK and PI3K pathways, which could be blocked by specific inhibitors. Focusing on the therapeutically challenging diffuse LGGs, our study of 151 tumors has discovered genetic alterations and potential therapeutic targets across the entire range of pediatric LGGs/LGGNTs.
Brain tumors (gliomas) contain large populations of infiltrating macrophages and recruited microglia, which in experimental murine glioma models promote tumor formation and progression. Among the barriers to understanding the contributions of these stromal elements to high-grade glioma (glioblastoma; GBM) biology is the relative paucity of tools to characterize infiltrating macrophages and resident microglia. In this study, we leveraged multiple RNA analysis platforms to identify new monocyte markers relevant to GBM patient outcome.
High-confidence lists of mouse resident microglia- and bone marrow-derived macrophage-specific transcripts were generated using converging RNA-seq and microarray technologies and validated using qRT-PCR and flow cytometry. Expression of select cell surface markers was analyzed in brain-infiltrating macrophages and resident microglia in an induced GBM mouse model, while allogeneic bone marrow transplantation was performed to trace the origins of infiltrating and resident macrophages. Glioma tissue microarrays were examined by immunohistochemistry, and the Gene Expression Omnibus (GEO) database was queried to determine the prognostic value of identified microglia biomarkers in human GBM.
We generated a unique catalog of differentially-expressed bone marrow-derived monocyte and resident microglia transcripts, and demonstrated that brain-infiltrating macrophages acquire F11R expression in GBM and following bone-marrow transplantation. Moreover, mononuclear cell F11R expression positively correlates with human high-grade glioma and additionally serves as a biomarker for GBM patient survival, regardless of GBM molecular subtype.
These studies establish F11R as a novel monocyte prognostic marker for GBM critical for defining a subpopulation of stromal cells for future potential therapeutic intervention.
We report the results of whole genome and transcriptome sequencing of tumor and adjacent normal tissue samples from 17 patients with non-small cell lung carcinoma (NSCLC). We identified 3,726 point mutations and over 90 indels in the coding sequence, with an average mutation frequency more than 10-fold higher in smokers than in never-smokers. Novel alterations in genes involved in chromatin modification and DNA repair pathways were identified along with DACH1, CFTR, RELN, ABCB5, and HGF. Deep digital sequencing revealed diverse clonality patterns in both never-smokers and smokers. All validated EGFR and KRAS mutations were present in the founder clones, suggesting possible roles in cancer initiation. Analysis revealed 14 fusions including ROS1 and ALK as well as novel metabolic enzymes. Cell cycle and JAK-STAT pathways are significantly altered in lung cancer along with perturbations in 54 genes that are potentially targetable with currently available drugs.
Producing gene fusions through genomic structural rearrangements is a major mechanism for tumor evolution. Therefore, accurately detecting gene fusions and the originating rearrangements is of great importance for personalized cancer diagnosis and targeted therapy. We present a tool, BreakTrans, that systematically maps predicted gene fusions to structural rearrangements. Thus, BreakTrans not only validates both types of predictions, but also provides mechanistic interpretations. BreakTrans effectively validates known fusions and discovers novel events in a breast cancer cell line. Applying BreakTrans to 43 breast cancer samples in The Cancer Genome Atlas identifies 90 genomically validated gene fusions. BreakTrans is available at http://bioinformatics.mdanderson.org/main/BreakTrans
Gene expression profiling classifies breast cancer into intrinsic subtypes based on the biology of the underlying disease pathways. We have used material from a prospective randomized trial of tamoxifen versus placebo in premenopausal women with primary breast cancer (NCIC CTG MA.12) to evaluate the prognostic and predictive significance of intrinsic subtypes identified by both the PAM50 gene set and by immunohistochemistry.
Total RNA from 398 of 672 (59%) patients was available for intrinsic subtyping with a quantitative reverse transcriptase PCR (qRT-PCR) 50-gene predictor (PAM50) for luminal A, luminal B, HER-2–enriched, and basal-like subtypes. A tissue microarray was also constructed from 492 of 672 (73%) of the study population to assess a panel of six immunohistochemical (IHC) antibodies to define the same intrinsic subtypes.
Classification into intrinsic subtypes by the PAM50 assay was prognostic for both disease-free survival (DFS; P = 0.0003) and overall survival (OS; P = 0.0002), whereas classification by the IHC panel was not. Luminal subtype by PAM50 was predictive of tamoxifen benefit [DFS: HR, 0.52; 95% confidence interval (CI), 0.32–0.86 vs. HR, 0.80; 95% CI, 0.50–1.29 for nonluminal subtypes], although the interaction test was not significant (P = 0.24), whereas neither subtyping by central immunohistochemistry nor by local estrogen receptor (ER) or progesterone receptor (PR) status were predictive. Risk of relapse (ROR) modeling with the PAM50 assay produced a continuous risk score in both node-negative and node-positive disease.
In the MA.12 study, intrinsic subtype classification by qRT-PCR with the PAM50 assay was superior to IHC profiling for both prognosis and prediction of benefit from adjuvant tamoxifen.
Summary: Despite recent progress, computational tools that identify gene fusions from next-generation whole transcriptome sequencing data are often limited in accuracy and scalability. Here, we present a software package, BreakFusion, that combines the strength of reference alignment followed by read-pair analysis and de novo assembly to achieve a good balance in sensitivity, specificity and computational efficiency.
Supplementary data are available at Bioinformatics online.
As part of the molecular revolution sweeping medicine, comprehensive genomic studies are adding powerful dimensions to medical research. However, their power exposes new regulatory, strategic, and quality assurance challenges for biorepositories. A key issue is that unlike other research techniques commonly applied to banked specimens, nucleic acid sequencing, if sufficiently extensive, yields data that could identify a patient. This evolving paradigm renders the concepts of anonymized and anonymous specimens increasingly outdated. The challenges for biorepositories in this new era include refined consent processes and wording, selection and use of legacy specimens, quality assurance procedures, institutional documentation, data sharing, and interaction with institutional review boards. Given current trends, biorepositories should consider these issues now, even if they are not currently experiencing sample requests for genomic analysis. We summarize our current experiences and best practices at Washington University Medical School, St Louis, MO, our perceptions of emerging trends, and recommendations.
Genomic studies; Biorepositories; Biobanks; Quality assurance; Regulatory standards
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods under which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including indels, inversions, and translocations. We examined BreakDancer's performance in simulation, comparison with other methods, analysis of an acute myeloid leukemia sample, and the 1,000 Genomes trio individuals. We found that it substantially improved the detection of small and intermediate size indels from 10 bp to 1 Mbp that are difficult to detect via a single conventional approach.
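The idea of detecting structural variation from paired-end reads can be illustrated with a minimal sketch. It is not BreakDancer's algorithm; it only shows the general principle that read pairs whose mapped insert size departs strongly from the library's typical insert size hint at an insertion or deletion between the mates. The function name, the use of the median absolute deviation as the spread estimate, the cutoff of five deviations, and the toy insert sizes are all assumptions made for illustration.

```python
from statistics import median


def flag_discordant_pairs(insert_sizes, n_mads=5.0):
    """Return indices of read pairs whose insert size looks anomalous.

    Illustrative only: a pair is flagged when its insert size lies more than
    n_mads median absolute deviations away from the library median.
    """
    center = median(insert_sizes)
    # The median absolute deviation is robust to the outliers we want to find.
    mad = median(abs(size - center) for size in insert_sizes)
    if mad == 0:
        mad = 1.0  # avoid a zero-width window on perfectly uniform toy data
    return [
        i for i, size in enumerate(insert_sizes)
        if abs(size - center) > n_mads * mad
    ]


# Hypothetical mapped insert sizes (bp) for a library of roughly 300 bp.
sizes = [298, 305, 301, 296, 310, 302, 299, 1250, 304, 297, 95, 300]
discordant = flag_discordant_pairs(sizes)
print(discordant)  # indices of the 1250 bp and 95 bp pairs
```

In this toy input, the pair spanning 1250 bp would be consistent with a deletion in the sample relative to the reference, and the 95 bp pair with an insertion. A real caller would also use mate orientation, mapping quality, and clustering of several supporting pairs before reporting a variant.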
The human Y chromosome began to evolve from an autosome hundreds of millions of years ago, acquiring a sex-determining function and undergoing a series of inversions that suppressed crossing over with the X chromosome1,2. Little is known about the Y chromosome’s recent evolution because only the human Y chromosome has been fully sequenced. Prevailing theories hold that Y chromosomes evolve by gene loss, the pace of which slows over time, eventually leading to a paucity of genes, and stasis3,4. These theories have been buttressed by partial sequence data from newly emergent plant and animal Y chromosomes5-8, but they have not been tested in older, highly evolved Y chromosomes like that of humans. We therefore finished sequencing the male-specific region of the Y chromosome (MSY) in our closest living relative, the chimpanzee, achieving levels of accuracy and completion previously reached for the human MSY. We then compared the MSYs of the two species and found that they differ radically in sequence structure and gene content, implying rapid evolution during the past 6 million years. The chimpanzee MSY harbors twice as many massive palindromes as the human MSY, yet it has lost large fractions of the MSY protein-coding genes and gene families present in the last common ancestor. We suggest that the extraordinary divergence of the chimpanzee and human MSYs was driven by four synergistic factors: the MSY’s prominent role in sperm production, genetic hitchhiking effects in the absence of meiotic crossing over, frequent ectopic recombination within the MSY, and species differences in mating behavior. While genetic decay may be the principal dynamic in the evolution of newly emergent Y chromosomes, wholesale renovation is the paramount theme in the ongoing evolution of chimpanzee, human, and perhaps other older MSYs.
To assess the genetic consequences of induced pluripotent stem cell (iPSC) reprogramming, we sequenced the genomes of ten murine iPSC clones derived from three independent reprogramming experiments, and compared them to their parental cell genomes. We detected hundreds of single nucleotide variants (SNVs) in every clone, with an average of 11 in coding regions. In two experiments, all SNVs were unique for each clone and did not cluster in pathways, but in the third, all four iPSC clones contained 157 shared genetic variants, which could also be detected in rare cells (<1 in 500) within the parental MEF pool. These data suggest that most of the genetic variation in iPSC clones is not caused by reprogramming per se, but is rather a consequence of cloning individual cells, which “captures” their mutational history. These findings have implications for the development and therapeutic use of cells that are reprogrammed by any method.
The St. Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP) is participating in the international effort to identify somatic mutations that drive cancer. These cancer genome sequencing efforts will not only yield an unparalleled view of the altered signaling pathways in cancer but should also identify new targets against which novel therapeutics can be developed. Although these projects are still deep in the phase of generating primary DNA sequence data, important results are emerging and valuable community resources are being generated that should catalyze future cancer research. We describe here the rationale for conducting the PCGP, present some of the early results of this project and discuss the major lessons learned and how these will affect the application of genomic sequencing in the clinic.
Exome sequencing of 343 families, each with a single child on the autism spectrum and at least one unaffected sibling, reveals de novo small indels and point substitutions, which come mostly from the paternal line in an age-dependent manner. We do not see significantly greater numbers of de novo missense mutations in affected versus unaffected children, but gene-disrupting mutations (nonsense, splice site, and frame shifts) are twice as frequent, 59 to 28. Based on this differential and the number of recurrent and total targets of gene disruption found in our and similar studies, we estimate between 350 and 400 autism susceptibility genes. Many of the disrupted genes in these studies are associated with the fragile X protein, FMRP, reinforcing links between autism and synaptic plasticity. We find FMRP-associated genes are under greater purifying selection than the remainder of genes and suggest they are especially dosage-sensitive targets of cognitive disorders.
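The 59 to 28 differential can be checked with a short calculation. The sketch below is not the statistical test used in the study, which the abstract does not name. It is an exact two-sided binomial test under the assumption, not stated in the abstract, that equal numbers of affected and unaffected children were compared, so that under the null hypothesis each gene-disrupting mutation is equally likely to fall in either group.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than the
    observed count k out of n trials with success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # The small tolerance guards against floating-point ties.
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# Counts of gene-disrupting de novo mutations quoted in the abstract.
affected, unaffected = 59, 28

ratio = affected / unaffected
p_value = binomial_two_sided_p(affected, affected + unaffected)
print(f"ratio = {ratio:.2f}, two-sided p = {p_value:.4f}")
```

The ratio comes out slightly above 2, matching the "twice as frequent" wording, and the p-value falls well below 0.01 under the equal-group-size assumption.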
Medulloblastoma is a malignant childhood brain tumour comprising four discrete subgroups. To identify mutations that drive medulloblastoma we sequenced the entire genomes of 37 tumours and matched normal blood. One hundred and thirty-six genes harbouring somatic mutations in this discovery set were sequenced in an additional 56 medulloblastomas. Recurrent mutations were detected in 41 genes not yet implicated in medulloblastoma: several target distinct components of the epigenetic machinery in different disease subgroups, e.g., regulators of H3K27 and H3K4 trimethylation in subgroup-3 and 4 (e.g., KDM6A and ZMYM3), and CTNNB1-associated chromatin remodellers in WNT-subgroup tumours (e.g., SMARCA4 and CREBBP). Modelling of mutations in mouse lower rhombic lip progenitors that generate WNT-subgroup tumours, identified genes that maintain this cell lineage (DDX3X) as well as mutated genes that initiate (CDH1) or cooperate (PIK3CA) in tumourigenesis. These data provide important new insights into the pathogenesis of medulloblastoma subgroups and highlight targets for therapeutic development.
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.
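As a rough illustration of what comparing a tumor and normal pair at one site can look like, the sketch below computes a Phred-scaled score from the probability that the two samples share a genotype. It is a simplified stand-in, not SomaticSniper's actual model: it assumes the genotype posteriors of the two samples are independent and already computed, and the function name, the three-genotype dictionaries, and the numbers are hypothetical.

```python
import math


def somatic_score(tumor_post, normal_post):
    """Phred-scaled probability that tumor and normal share a genotype.

    Illustrative assumption: the two posterior distributions are independent,
    so P(same genotype) is the sum over genotypes of their product.
    """
    p_same = sum(
        tumor_post[g] * normal_post.get(g, 0.0) for g in tumor_post
    )
    p_same = max(p_same, 1e-30)  # avoid log10(0) for fully disjoint posteriors
    return -10.0 * math.log10(p_same)


# Hypothetical genotype posteriors at one site (reference allele A).
normal = {"AA": 0.999, "AG": 0.001, "GG": 0.0}
tumor = {"AA": 0.01, "AG": 0.98, "GG": 0.01}

score = somatic_score(tumor, normal)
print(round(score, 1))
```

A higher score means the samples are less likely to share a genotype, which is the signal of interest for a site unique to the tumor. Comparing a sample with itself gives a score near zero.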
With more chromosomes than any other sequenced genome, the macronuclear genome of Oxytricha trifallax has a unique and complex architecture, including alternative fragmentation and predominantly single-gene chromosomes.
The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor “silent” germline micronuclear genome by a process of “unscrambling” and fragmentation. The tiny macronuclear “nanochromosomes” typically encode single, protein-coding genes (a small portion, 10%, encode 2–8 genes), have minimal noncoding regions, and are differentially amplified to an average of ∼2,000 copies. We report the high-quality genome assembly of ∼16,000 complete nanochromosomes (∼50 Mb haploid genome size) that vary from 469 bp to 66 kb long (mean ∼3.2 kb) and encode ∼18,500 genes. Alternative DNA fragmentation processes ∼10% of the nanochromosomes into multiple isoforms that usually encode complete genes. Nucleotide diversity in the macronucleus is very high (SNP heterozygosity is ∼4.0%), suggesting that Oxytricha trifallax may have one of the largest known effective population sizes of eukaryotes. Comparison to other ciliates with nonscrambled genomes and long macronuclear chromosomes (on the order of 100 kb) suggests several candidate proteins that could be involved in genome rearrangement, including domesticated MULE and IS1595-like DDE transposases. The assembly of the highly fragmented Oxytricha macronuclear genome is the first completed genome with such an unusual architecture. This genome sequence provides tantalizing glimpses into novel molecular biology and evolution. For example, Oxytricha maintains tens of millions of telomeres per cell and has also evolved an intriguing expansion of telomere end-binding proteins. In conjunction with the micronuclear genome in progress, the O. trifallax macronuclear genome will provide an invaluable resource for investigating programmed genome rearrangements, complementing studies of rearrangements arising during evolution and disease.
The macronuclear genome of the ciliate Oxytricha trifallax, contained in its somatic nucleus, has a unique genome architecture. Unlike its diploid germline genome, which is transcriptionally inactive during normal cellular growth, the macronuclear genome is fragmented into at least 16,000 tiny (∼3.2 kb mean length) chromosomes, most of which encode single actively transcribed genes and are differentially amplified to a few thousand copies each. The smallest chromosome is just 469 bp, while the largest is 66 kb and encodes a single enormous protein. We found considerable variation in the genome, including frequent alternative fragmentation patterns, generating chromosome isoforms with shared sequence. We also found limited variation in chromosome amplification levels, though insufficient to explain mRNA transcript level variation. Another remarkable feature of Oxytricha's macronuclear genome is its inordinate fondness for telomeres. In conjunction with its possession of tens of millions of chromosome-ending telomeres per macronucleus, we show that Oxytricha has evolved multiple putative telomere-binding proteins. In addition, we identified two new domesticated transposase-like protein classes that we propose may participate in the process of genome rearrangement. The macronuclear genome now provides a crucial resource for ongoing studies of genome rearrangement processes that use Oxytricha as an experimental or comparative model.