Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
We developed CREST (Clipping REveals STructure), an algorithm that uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. Application of CREST to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias (T-ALLs) and a human melanoma cell line, COLO-829, identified 160 somatic structural variations. Experimental validation exceeded 80% demonstrating that CREST had a high predictive accuracy.
Gene duplication is an important source of phenotypic change and adaptive evolution. We use a novel genomic approach to identify highly identical sequence missing from the reference genome, confirming the cortical development gene Slit-Robo Rho GTPase activating protein 2 (SRGAP2) duplicated three times in humans. We show that the promoter and first nine exons of SRGAP2 duplicated from 1q32.1 (SRGAP2A) to 1q21.1 (SRGAP2B) ~3.4 million years ago (mya). Two larger duplications later copied SRGAP2B to chromosome 1p12 (SRGAP2C) and to proximal 1q21.1 (SRGAP2D), ~2.4 and ~1 mya, respectively. Sequence and expression analysis shows SRGAP2C is the most likely duplicate to encode a functional protein and among the most fixed human-specific duplicate genes. Our data suggest a mechanism where incomplete duplication created a novel function —at birth, antagonizing parental SRGAP2 function 2–3 mya a time corresponding to the transition from Australopithecus to Homo and the beginning of neocortex expansion.
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
massively parallel sequencing; next generation sequencing; human genome; variant detection; short read alignment; whole genome sequencing
To identify somatic mutations in paediatric diffuse intrinsic pontine gliomas (DIPGs), we performed whole genome sequencing of 7 DIPGs and matched germline DNA, and targeted sequencing of an additional 43 DIPGs and 36 non-brainstem paediatric glioblastomas (non-BS-PGs). 78% of DIPGs and 22% of non-BS-PGs contained p.K27M mutation in H3F3A, encoding histone H3.3, or the related HIST1H3B, encoding histone H3.1. An additional 14% of non-BS-PGs had somatic p.G34R H3F3A mutations.
Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.
Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher–Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.
Availability: PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.
Supplementary information: Supplementary data are available at Bioinformatics online.
Understanding the prevailing mutational mechanisms responsible for human genome structural variation requires uniformity in the discovery of allelic variants and precision in terms of breakpoint delineation. We develop a resource based on capillary end-sequencing of 13.8 million fosmid clones from 17 human genomes and characterize the complete sequence of 1,054 large structural variants corresponding to 589 deletions, 384 insertions, and 81 inversions. We analyze the 2,081 breakpoint junctions and infer potential mechanism of origin. Three mechanisms account for the bulk of germline structural variation: microhomology-mediated processes involving short (2–20 bp) stretches of sequence (28%), non-allelic homologous recombination (NAHR) (22%) and L1 retrotransposition (19%). The high quality and long-range continuity of the sequence reveals more complex mutational mechanisms including repeat-mediated inversions and gene conversion that are most often missed by other methods including comparative genomic hybridization, SNP microarrays and next-generation sequencing.
There is a complex relationship between the evolution of segmental duplications and rearrangements associated with human disease. We performed a detailed analysis of one region on chromosome 16p12.1 associated with neurocognitive disease and identified one of the largest structural inconsistencies with the human reference assembly. Various genomic analyses show that all examined humans are homozygously inverted relative to the reference genome for a 1.1-Mbp region on 16p12.1. We determined that this assembly discrepancy stems from two common structural configurations with worldwide frequencies of 17.6% (S1) and 82.4% (S2). This polymorphism arose from the rapid integration of segmental duplications, precipitating two local inversions within the human lineage over the last 10 million years. The two human haplotypes differ by 333 kbp of additional duplicated sequence present in S2 but not in S1. Importantly, we show that the S2 configuration harbors directly oriented duplications specifically predisposing this chromosome to disease rearrangement.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implication in tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of the overall statistical power due to segmentation and discretization of individual sample's data. We propose a population-based approach for RCNA detection with no need of single-sample analysis, which is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of single-sample CNA calling based two-step approaches. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Supplementary information: Supplementary data are available at Bioinformatics online.
In birds, as in mammals, one pair of chromosomes differs between the sexes. In birds, males are ZZ and females ZW. In mammals, males are XY and females XX. Like the mammalian XY pair, the avian ZW pair is believed to have evolved from autosomes, with most change occurring in the chromosomes found in only one sex – the W and Y chromosomes1–5. By contrast, the sex chromosomes found in both sexes – the Z and X chromosomes – are assumed to have diverged little from their autosomal progenitors2. Here we report findings that overturn this assumption for both the chicken Z and human X chromosomes. The chicken Z chromosome, which we sequenced essentially to completion, is less gene-dense than chicken autosomes but contains a massive tandem array containing hundreds of duplicated genes expressed in testes. A comprehensive comparison of the chicken Z chromosome to the finished sequence of the human X chromosome demonstrates that each evolved independently from different portions of the ancestral genome. Despite this independence, the chicken Z and human X chromosomes share features that distinguish them from autosomes: the acquisition and amplification of testis-expressed genes, as well as a low gene density resulting from an expansion of intergenic regions. These features were not present on the autosomes from which the Z and X chromosomes originated but were instead acquired during the evolution of the Z and X as sex chromosomes. We conclude that the avian Z and mammalian X chromosomes followed convergent evolutionary trajectories, despite their evolving with opposite (female vs. male) systems of heterogamety. More broadly, in birds and mammals, sex chromosome evolution involved not only gene loss in sex-specific chromosomes, but also marked expansion and gene acquisition in sex chromosomes common to males and females.
The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 novel insertion sequences corresponding to 720 genomic loci. We show that a substantial fraction of these sequences are either missing, fragmented or mis-assigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determine that 18–37% of these novel insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identifies novel exons and conserved non-coding sequences not yet represented in the reference genome. We develop a method to accurately genotype these novel insertions by mapping next-generation sequencing datasets to the breakpoint thereby providing a means to characterize copy-number status for regions previously inaccessible to SNP microarrays.
A genomic era of cancer studies is developing rapidly, fueled by the emergence of next-generation sequencing technologies that provide exquisite sensitivity and resolution. This article discusses several areas within cancer genomics that are being transformed by the application of new technology, and in the process are dramatically expanding our understanding of this disease. Although, we anticipate that there will be many exciting discoveries in the near future, the ultimate success of these endeavors rests on our ability to translate what is learned into better diagnosis, treatment and prevention of cancer.
To date, few peptides in the complex mixture of platypus venom have been identified and sequenced, in part due to the limited amounts of platypus venom available to study. We have constructed and sequenced a cDNA library from an active platypus venom gland to identify the remaining components.
We identified 83 novel putative platypus venom genes from 13 toxin families, which are homologous to known toxins from a wide range of vertebrates (fish, reptiles, insectivores) and invertebrates (spiders, sea anemones, starfish). A number of these are expressed in tissues other than the venom gland, and at least three of these families (those with homology to toxins from distant invertebrates) may play non-toxin roles. Thus, further functional testing is required to confirm venom activity. However, the presence of similar putative toxins in such widely divergent species provides further evidence for the hypothesis that there are certain protein families that are selected preferentially during evolution to become venom peptides. We have also used homology with known proteins to speculate on the contributions of each venom component to the symptoms of platypus envenomation.
This study represents a step towards fully characterizing the first mammal venom transcriptome. We have found similarities between putative platypus toxins and those of a number of unrelated species, providing insight into the evolution of mammalian venom.
Summary: Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. We present VarScan, an open source tool for variant detection that is compatible with several short read aligners. We demonstrate VarScan's ability to detect SNPs and indels with high sensitivity and specificity, in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples.
Availability and Implementation: Source code and documentation freely available at http://genome.wustl.edu/tools/cancer-genomics implemented as a Perl package and supported on Linux/UNIX, MS Windows and Mac OSX.
Supplementary information: Supplementary data are available at Bioinformatics online.
The systematic karyotyping of bone marrow cells was the first genomic approach used to personalize therapy for patients with leukemia. The paradigm established by cytogenetic studies in leukemia (from gene discovery to therapeutic intervention) now has the potential to be rapidly extended with the use of whole-genome sequencing approaches for cancer, which are now possible. We are now entering a period of exponential growth in cancer gene discovery that will provide many novel therapeutic targets for a large number of cancer types. Establishing the pathogenetic relevance of individual mutations is a major challenge that must be solved. However, after thousands of cancer genomes have been sequenced, the genetic rules of cancer will become known and new approaches for diagnosis, risk stratification and individualized treatment of cancer patients will surely follow.
array CGH; cancer; comparative genomic hybridization; genomics; next-generation sequencing; SNP array
Next-generation sequencing technology provides an attractive means to obtain large-scale sequence data necessary for comparative genomic analysis. To analyse the patterns of mutation rate variation and selection intensity across the avian genome, we performed brain transcriptome sequencing using Roche 454 technology of 10 different non-model avian species. Contigs from de novo assemblies were aligned to the two available avian reference genomes, chicken and zebra finch. In total, we identified 6499 different genes across all 10 species, with ~1000 genes found in each full run per species. We found evidence for a higher mutation rate of the Z chromosome than of autosomes (male-biased mutation) and a negative correlation between the neutral substitution rate (dS) and chromosome size. Analyses of the mean dN/dS ratio (ω) of genes across chromosomes supported the Hill–Robertson effect (the effect of selection at linked loci) and point at stochastic problems with ω as an independent measure of selection. Overall, this study demonstrates the usefulness of next-generation sequencing for obtaining genomic resources for comparative genomic analysis of non-model organisms.
Next generation sequencing; 454; Avian genomics; Male-mutation bias; Selection; Hill-Robertson effect
Ostertagia ostertagi is a gastrointestinal parasitic nematode that affects cattle and leads to a loss of production. In this study, we present the first large-scale genomic survey of Ostertagia ostertagi by the analysis of expressed transcripts from three stages of the parasite: third-stage larvae, fourth-stage larvae and adult worms. Using an in silico approach, 2,284 genes were identified from over 7,000 expressed sequence tags and abundant transcripts were analyzed and characterized by their functional profile. Of the 2,284 genes, 66% had similarity to other known or predicted genes while the rest were novel and potentially represent genes specific to the species and/or stages. Furthermore, a subset of the novel proteins were structurally annotated and assigned putative function based on orthologs in C. elegans and corresponding RNA interference phenotypes. Hence, over 70% of the genes were annotated using protein sequences, domains and pathway databases. Differentially expressed transcripts from the two larval stages and their functional profiles were also studied leading to a more detailed understanding of the parasite’s life-cycle. The identified transcripts are a valuable resource for genomic studies of O. ostertagi and can facilitate the design of control strategies and vaccine programs.
Cattle; Parasite; Nematode; Transcripts; ESTs; Ostertagia ostertagi; Comparative genomics
Hookworm infection is one of the most important neglected diseases in developing countries, with approximately 1 billion people infected worldwide. To better understand hookworm biology and nematode parasitism, the present study generated a near complete transcriptome of the canine hookworm Ancylostoma caninum to a very high coverage using high throughput technology, and compared it to those of the free-living nematode Caenorhabditis elegans and the parasite Brugia malayi.
The generated transcripts from four developmental stages, infective L3, serum stimulated L3, adult male and adult female, covered 93% of the A. caninum transcriptome. The broad diversity among nematode transcriptomes was confirmed, and an impact of parasitic adaptation on transcriptome diversity was inferred. Intra-population analysis showed that A. caninum has higher coding sequence diversity than humans. Examining the developmental expression profiles of A. caninum revealed major transitions in gene expression from larval stages to adult. Adult males expressed the highest number of selectively expressed genes, but adult female expressed the highest number of selective parasitism-related genes. Genes related to parasitism adaptation and A. caninum specific genes exhibited more expression selectivity while those conserved in nematodes tend to be consistently expressed. Parasitism related genes were expressed more selectively in adult male and female worms. The comprehensive analysis of digital expression profiles along with transcriptome comparisons enabled identification of a set of parasitism genes encoding secretory proteins in animal parasitic nematode.
This study validated the usage of deep sequencing for gene expression profiling. Parasitic adaptation of the canine hookworm is related to its diversity and developmental dynamics. This comprehensive comparative genomic and expression study substantially improves our understanding of the basic biology and parasitism of hookworms and, is expected, in the long run, to accelerate research toward development of vaccines and novel anthelmintics.
Completely penetrant mutations in the surfactant protein B gene (SFTPB) and ≥75% reduction of SFTPB expression disrupt pulmonary surfactant function and cause neonatal respiratory distress syndrome. To inform studies of genetic regulation of SFTPB expression, we created a catalogue of SFTPB variants by comprehensive resequencing from an unselected, population-based cohort (N=1,116). We found an excess of low frequency variation (81 SNPs and 5 small insertion/deletions). Despite its small genomic size (9.7 kb), SFTPB was characterized by weak linkage disequilibrium (LD) and high haplotype diversity. Using the HapMap Yoruban and European populations, we identified a recombination hot spot that spans SFTPB, was not detectable in our focused resequencing data, and accounts for weak LD. Using homology based software tools, we discovered no definitively damaging exonic variants. We conclude that excess low frequency variation, intragenic recombination, and lack of common, disruptive exonic variants favor complete resequencing as the optimal approach for genetic association studies to identify regulatory SFTPB variants that cause neonatal respiratory distress syndrome in genetically diverse populations.
Rare population variants are known to have important biomedical implications, but their systematic discovery has only recently been enabled by advances in DNA sequencing. The design process of a discovery project remains formidable, being limited to ad hoc mixtures of extensive computer simulation and pilot sequencing. Here, the task is examined from a general mathematical perspective.
We pose and solve the population sequencing design problem and subsequently apply standard optimization techniques that maximize the discovery probability. Emphasis is placed on cases whose discovery thresholds place them within reach of current technologies. We find that parameter values characteristic of rare-variant projects lead to a general, yet remarkably simple set of optimization rules. Specifically, optimal processing occurs at constant values of the per-sample redundancy, refuting current notions that sample size should be selected outright. Optimal project-wide redundancy and sample size are then shown to be inversely proportional to the desired variant frequency. A second family of constants governs these relationships, permitting one to immediately establish the most efficient settings for a given set of discovery conditions. Our results largely concur with the empirical design of the Thousand Genomes Project, though they furnish some additional refinement.
The optimization principles reported here dramatically simplify the design process and should be broadly useful as rare-variant projects become both more important and routine in the future.
Wilson and King were among the first to recognize that the extent of phenotypic change between humans and great apes was dissonant with the rate of molecular change. Proteins are virtually identical1,2; cytogenetically there are few rearrangements that distinguish ape-human chromosomes3; rates of single-basepair change4-7 and retroposon activity8-10 have slowed particularly within hominid lineages when compared to rodents or monkeys. Here, we perform a systematic analysis of duplication content of four primate genomes (macaque, orangutan, chimpanzee and human) in an effort to understand the pattern and rates of genomic duplication during hominid evolution. We find that the ancestral branch leading to human and African great apes shows the most significant increase in duplication activity both in terms of basepairs and in terms of events. This duplication acceleration within the ancestral species is significant when compared to lineage-specific rate estimates even after accounting for copy-number polymorphism and homoplasy. We discover striking examples of recurrent and independent gene-containing duplications within the gorilla and chimpanzee that are absent in the human lineage. Our results suggest that the evolutionary properties of copy-number mutation differ significantly from other forms of genetic mutation and, in contrast to the hominid slowdown of single basepair mutations, there has been a genomic burst of duplication activity at this period during human evolution.
Structural variations in the form of DNA insertions and deletions are an important aspect of human genetics and especially relevant to medical disorders. Investigations have shown that such events can be detected via tell-tale discrepancies in the aligned lengths of paired-end DNA sequencing reads. Quantitative aspects underlying this method remain poorly understood, despite its importance and conceptual simplicity. We report the statistical theory characterizing the length-discrepancy scheme for Gaussian libraries, including coverage-related effects that preceding models are unable to account for.
Deletion and insertion statistics both depend heavily on physical coverage, but otherwise differ dramatically, refuting a commonly held doctrine of symmetry. Specifically, coverage restrictions render insertions much more difficult to capture. Increased read length has the counterintuitive effect of worsening insertion detection characteristics of short inserts. Variance in library insert length is also a critical factor here and should be minimized to the greatest degree possible. Conversely, no significant improvement would be realized in lowering fosmid variances beyond current levels. Detection power is examined under a straightforward alternative hypothesis and found to be generally acceptable. We also consider the proposition of characterizing variation over the entire spectrum of variant sizes under constant risk of false-positive errors. At 1% risk, many designs will leave a significant gap in the 100 to 200 bp neighborhood, requiring unacceptably high redundancies to compensate. We show that a few modifications largely close this gap and we give a few examples of feasible spectrum-covering designs.
The theory resolves several outstanding issues and furnishes a general methodology for designing future projects from the standpoint of a spectrum-wide constant risk.
Genetic lesions affecting a number of kinases and other elements within the epidermal growth factor receptor (EGFR) signaling pathway have been implicated in the pathogenesis of human non–small-cell lung cancer (NSCLC). We performed mutational profiling of a large cohort of lung adenocarcinomas to uncover other potential somatic mutations in genes of this pathway that could contribute to lung tumorigenesis. We have identified in 2 of 207 primary lung tumors a somatic activating mutation in exon 2 of MEK1 (i.e., mitogen-activated protein kinase kinase 1 or MAP2K1) that substitutes asparagine for lysine at amino acid 57 (K57N) in the nonkinase portion of the kinase. Neither of these two tumors harbored known mutations in other genes encoding components of the EGFR signaling pathway (i.e., EGFR, HER2, KRAS, PIK3CA, and BRAF). Expression of mutant, but not wild-type, MEK1 leads to constitutive activity of extracellular signal–regulated kinase (ERK)-1/2 in human 293T cells and to growth factor–independent proliferation of murine Ba/F3 cells. A selective MEK inhibitor, AZD6244, inhibits mutant-induced ERK activity in 293T cells and growth of mutant-bearing Ba/F3 cells. We also screened 85 NSCLC cell lines for MEK1 exon 2 mutations; one line (NCI-H1437) harbors a Q56P substitution, a known transformation-competent allele of MEK1 originally identified in rat fibroblasts, and is sensitive to treatment with AZD6244. MEK1 mutants have not previously been reported in lung cancer and may provide a target for effective therapy in a small subset of patients with lung adenocarcinoma.