Non-human primates provide genetic model systems biologically intermediate between humans and other mammalian model organisms. Populations of Caribbean vervet monkeys (Chlorocebus aethiops sabaeus) are genetically homogeneous and large enough to permit well-powered genetic mapping studies of quantitative traits relevant to human health, including expression quantitative trait loci (eQTL). Previous transcriptome-wide investigation in an extended vervet pedigree identified 29 heritable transcripts for which levels of expression in peripheral blood correlate strongly with expression levels in the brain. Quantitative trait linkage analysis using 261 microsatellite markers identified significant (n = 8) and suggestive (n = 4) linkages for 12 of these transcripts, including both cis- and trans-eQTL. Seven transcripts, located on different chromosomes, showed maximum linkage to markers in a single region of vervet chromosome 9; this observation suggests the possibility of a master trans-regulator locus in this region. For one cis-eQTL (at B3GALTL, beta-1,3-glucosyltransferase), we conducted follow-up single nucleotide polymorphism genotyping and fine-scale association analysis in a sample of unrelated Caribbean vervets, localizing this eQTL to a region of <200 kb. These results suggest the value of pedigree and population samples of the Caribbean vervet for linkage and association mapping studies of quantitative traits. The imminent whole genome sequencing of many of these vervet samples will enhance the power of such investigations by providing a comprehensive catalog of genetic variation.
Summary: Despite recent progress, computational tools that identify gene fusions from next-generation whole transcriptome sequencing data are often limited in accuracy and scalability. Here, we present a software package, BreakFusion, that combines the strength of reference alignment followed by read-pair analysis and de novo assembly to achieve a good balance in sensitivity, specificity and computational efficiency.
Supplementary data are available at Bioinformatics online.
The emergence of next-generation sequencing (NGS) technologies offers an incredible opportunity to comprehensively study DNA sequence variation in human genomes. Commercially available platforms from Roche (454), Illumina (Genome Analyzer and HiSeq 2000), and Applied Biosystems (SOLiD) have the capability to completely sequence individual genomes to high levels of coverage. NGS data are particularly advantageous for the study of structural variation (SV) because they offer the sensitivity to detect variants of various sizes and types, as well as the precision to characterize their breakpoints at base pair resolution. In this chapter, we present methods and software algorithms that have been developed to detect SVs and copy number changes using massively parallel sequencing data. We describe visualization and de novo assembly strategies for characterizing SV breakpoints and removing false positives.
Next-generation sequencing; Paired-end sequencing; 454; Illumina; Solexa; ABI SOLiD; Insertions; Deletions; Duplications; Inversions; Translocations; Indels; Copy number variants
Detection and characterization of genomic structural variation are important for understanding the landscape of genetic variation in human populations and in complex diseases such as cancer. Recent studies demonstrate the feasibility of detecting structural variation using next-generation, short-insert, paired-end sequencing reads. However, the utility of these reads is not entirely clear, nor are the analysis methods under which accurate detection can be achieved. The algorithm BreakDancer predicts a wide variety of structural variants including indels, inversions, and translocations. We examined BreakDancer's performance in simulation, comparison with other methods, analysis of an acute myeloid leukemia sample, and the 1,000 Genomes trio individuals. We found that it substantially improved the detection of small and intermediate size indels from 10 bp to 1 Mbp that are difficult to detect via a single conventional approach.
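The read-pair idea behind this kind of detection can be illustrated with a minimal sketch. It is not BreakDancer's actual implementation: the ReadPair fields, the classify_pair function, and the three-standard-deviation cutoff are all hypothetical choices made for illustration. The sketch labels a pair by comparing its mapped span and orientation with what the sequencing library is expected to produce.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    # Hypothetical record for one mapped paired-end read (not a BreakDancer type).
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair as 'normal' or as evidence for a candidate SV type.

    Assumes a library in which properly paired reads map to the same
    chromosome on opposite strands, with a span close to mean_insert.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        return "inversion"
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        # Reads map farther apart than expected: sequence missing in the sample.
        return "deletion"
    if span < mean_insert - n_sd * sd_insert:
        # Reads map closer than expected: extra sequence in the sample.
        return "insertion"
    return "normal"


pairs = [
    ReadPair("chr1", 1000, "+", "chr1", 1300, "-"),
    ReadPair("chr1", 1000, "+", "chr1", 5000, "-"),
    ReadPair("chr1", 1000, "+", "chr7", 1300, "-"),
]
labels = [classify_pair(p, mean_insert=300, sd_insert=30) for p in pairs]
```

A real caller would additionally cluster anomalous pairs that support the same event and score each cluster; a single anomalous pair is weak evidence on its own.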
The unprecedented resolution of high-throughput genomics has enabled the recent discovery of a phenomenon by which specific regions of the genome are shattered and then stitched together via a single devastating event, referred to as chromothripsis. Potential mechanisms governing this process are now emerging, with implications for our understanding of the role of genomic rearrangements in development and disease.
Transposable elements (TEs) are abundant in the human genome, and some are capable of generating new insertions through RNA intermediates. In cancer, the disruption of cellular mechanisms that normally suppress TE activity may facilitate mutagenic retrotranspositions. We performed single-nucleotide resolution analysis of TE insertions in 43 high-coverage whole-genome sequencing data sets from five cancer types. We identified 194 high-confidence somatic TE insertions, as well as thousands of polymorphic TE insertions in matched normal genomes. Somatic insertions were present in epithelial tumors but not in blood or brain cancers. Somatic L1 insertions tend to occur in genes that are commonly mutated in cancer, disrupt the expression of the target genes, and are biased toward regions of cancer-specific DNA hypomethylation, highlighting their potential impact in tumorigenesis.
The human Y chromosome began to evolve from an autosome hundreds of millions of years ago, acquiring a sex-determining function and undergoing a series of inversions that suppressed crossing over with the X chromosome [1,2]. Little is known about the Y chromosome’s recent evolution because only the human Y chromosome has been fully sequenced. Prevailing theories hold that Y chromosomes evolve by gene loss, the pace of which slows over time, eventually leading to a paucity of genes, and stasis [3,4]. These theories have been buttressed by partial sequence data from newly emergent plant and animal Y chromosomes [5-8], but they have not been tested in older, highly evolved Y chromosomes like that of humans. We therefore finished sequencing the male-specific region of the Y chromosome (MSY) in our closest living relative, the chimpanzee, achieving levels of accuracy and completion previously reached for the human MSY. We then compared the MSYs of the two species and found that they differ radically in sequence structure and gene content, implying rapid evolution during the past 6 million years. The chimpanzee MSY harbors twice as many massive palindromes as the human MSY, yet it has lost large fractions of the MSY protein-coding genes and gene families present in the last common ancestor. We suggest that the extraordinary divergence of the chimpanzee and human MSYs was driven by four synergistic factors: the MSY’s prominent role in sperm production, genetic hitchhiking effects in the absence of meiotic crossing over, frequent ectopic recombination within the MSY, and species differences in mating behavior. While genetic decay may be the principal dynamic in the evolution of newly emergent Y chromosomes, wholesale renovation is the paramount theme in the ongoing evolution of chimpanzee, human, and perhaps other older MSYs.
The St. Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP) is participating in the international effort to identify somatic mutations that drive cancer. These cancer genome sequencing efforts will not only yield an unparalleled view of the altered signaling pathways in cancer but should also identify new targets against which novel therapeutics can be developed. Although these projects are still deep in the phase of generating primary DNA sequence data, important results are emerging and valuable community resources are being generated that should catalyze future cancer research. We describe here the rationale for conducting the PCGP, present some of the early results of this project and discuss the major lessons learned and how these will affect the application of genomic sequencing in the clinic.
Massively parallel sequencing technologies continue to alter the study of human genetics. As the cost of sequencing declines, next-generation sequencing (NGS) instruments and datasets will become increasingly accessible to the wider research community. Investigators are understandably eager to harness the power of these new technologies. Sequencing human genomes on these platforms, however, presents numerous production and bioinformatics challenges. Production issues like sample contamination, library chimaeras and variable run quality have become increasingly problematic in the transition from technology development lab to production floor. Analysis of NGS data, too, remains challenging, particularly given the short-read lengths (35–250 bp) and sheer volume of data. The development of streamlined, highly automated pipelines for data analysis is critical for transition from technology adoption to accelerated research and publication. This review aims to describe the state of current NGS technologies, as well as the strategies that enable NGS users to characterize the full spectrum of DNA sequence variation in humans.
massively parallel sequencing; next generation sequencing; human genome; variant detection; short read alignment; whole genome sequencing
Retroposed processed gene transcripts are an important source of material for new gene formation on evolutionary timescales. Most prior work on gene retrocopy discovery compared copies in reference genome assemblies to their source genes. Here, we explore gene retrocopy insertion polymorphisms (GRIPs) that are present in the germlines of individual humans, mice, and chimpanzees, and we identify novel gene retrocopy insertions in cancerous somatic tissues that are absent from patient-matched non-cancer genomes.
Through analysis of whole-genome sequence data, we found evidence for 48 GRIPs in the genomes of one or more humans sequenced as part of the 1,000 Genomes Project and The Cancer Genome Atlas, but which were not in the human reference assembly. Similarly, we found evidence for 755 GRIPs at distinct locations in one or more of 17 inbred mouse strains but which were not in the mouse reference assembly, and 19 GRIPs across a cohort of 10 chimpanzee genomes, which were not in the chimpanzee reference genome assembly. Many of these insertions are new members of existing gene families whose source genes are highly and widely expressed, and the majority have detectable hallmarks of processed gene retrocopy formation. We estimate the rate of novel gene retrocopy insertions in humans and chimps at roughly one new gene retrocopy insertion for every 6,000 individuals.
We find that gene retrocopy polymorphisms are a widespread phenomenon, present a multi-species analysis of these events, and provide a method for their ascertainment.
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/, implemented in C and supported on Linux and Mac OS X.
Contact: firstname.lastname@example.org; email@example.com
Supplementary information: Supplementary data are available at Bioinformatics online.
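The abstract above does not spell out the SomaticSniper model, so the following is only a hedged sketch of the general idea of comparing a tumor and a normal sample at one site. It assumes per-sample genotype likelihoods are already available, treats the two samples as independent with a uniform prior over genotype pairs (a simplifying assumption, not the published prior), and reports a Phred-scaled probability that the two genotypes are identical. The function name somatic_score and the three-genotype diploid model are hypothetical.

```python
import math

GENOTYPES = ("AA", "AB", "BB")  # hypothetical diploid genotypes at a biallelic site


def somatic_score(tumor_lik, normal_lik, max_score=255.0):
    """Phred-scaled probability that tumor and normal share the same genotype.

    tumor_lik and normal_lik map each genotype to P(reads | genotype).
    A high score means the genotypes are unlikely to be identical, which is
    the pattern expected at a somatic site.
    """
    # Joint weight of every (tumor genotype, normal genotype) combination.
    joint = {
        (gt, gn): tumor_lik[gt] * normal_lik[gn]
        for gt in GENOTYPES
        for gn in GENOTYPES
    }
    total = sum(joint.values())
    p_same = sum(joint[(g, g)] for g in GENOTYPES) / total
    if p_same <= 0.0:
        return max_score
    return min(max_score, -10.0 * math.log10(p_same))


het = {"AA": 1e-6, "AB": 0.9, "BB": 1e-6}
hom_ref = {"AA": 0.9, "AB": 1e-6, "BB": 1e-6}

score_somatic = somatic_score(tumor_lik=het, normal_lik=hom_ref)
score_germline = somatic_score(tumor_lik=het, normal_lik=het)
```

In this toy input the tumor looks heterozygous while the normal looks homozygous reference, so the first score is high; when both samples look heterozygous the score is close to zero.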
We developed CREST (Clipping REveals STructure), an algorithm that uses next-generation sequencing reads with partial alignments to a reference genome to directly map structural variations at the nucleotide level of resolution. Application of CREST to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias (T-ALLs) and a human melanoma cell line, COLO-829, identified 160 somatic structural variations. The experimental validation rate exceeded 80%, demonstrating that CREST has high predictive accuracy.
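Partial alignments of this kind appear in SAM/BAM files as soft-clipped reads, and the position where the clip meets the aligned bases marks a candidate breakpoint. The sketch below shows only that first step and is not CREST itself: the function names and the min_support threshold are hypothetical, and the later assembly and realignment of the clipped sequence are not shown.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_breakpoints(pos, cigar):
    """Return (side, reference position) for each soft-clipped end of a read.

    pos is the 1-based leftmost aligned position, as in the SAM POS field.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard clips carry no sequence and may sit outside soft clips; drop them.
    ops = [(n, op) for n, op in ops if op != "H"]
    breakpoints = []
    if ops and ops[0][1] == "S":
        breakpoints.append(("left", pos))
    # Operations that consume reference bases determine the aligned span.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    if len(ops) > 1 and ops[-1][1] == "S":
        breakpoints.append(("right", pos + ref_len - 1))
    return breakpoints


def candidate_breakpoints(alignments, min_support=2):
    """Keep clip positions supported by at least min_support reads."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(soft_clip_breakpoints(pos, cigar))
    return sorted(bp for bp, n in counts.items() if n >= min_support)


reads = [(100, "20S80M"), (100, "35S65M"), (50, "70M30S"), (10, "100M")]
candidates = candidate_breakpoints(reads)
```

Requiring several reads to clip at the same coordinate is one simple way to separate true breakpoints from sporadic clipping caused by low-quality read ends.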
Gene duplication is an important source of phenotypic change and adaptive evolution. We use a novel genomic approach to identify highly identical sequence missing from the reference genome, confirming that the cortical development gene Slit-Robo Rho GTPase activating protein 2 (SRGAP2) duplicated three times in humans. We show that the promoter and first nine exons of SRGAP2 duplicated from 1q32.1 (SRGAP2A) to 1q21.1 (SRGAP2B) ~3.4 million years ago (mya). Two larger duplications later copied SRGAP2B to chromosome 1p12 (SRGAP2C) and to proximal 1q21.1 (SRGAP2D), ~2.4 and ~1 mya, respectively. Sequence and expression analysis shows SRGAP2C is the most likely duplicate to encode a functional protein and among the most fixed human-specific duplicate genes. Our data suggest a mechanism where incomplete duplication created a novel function at birth, antagonizing parental SRGAP2 function, 2–3 mya, a time corresponding to the transition from Australopithecus to Homo and the beginning of neocortex expansion.
To identify somatic mutations in paediatric diffuse intrinsic pontine gliomas (DIPGs), we performed whole genome sequencing of 7 DIPGs and matched germline DNA, and targeted sequencing of an additional 43 DIPGs and 36 non-brainstem paediatric glioblastomas (non-BS-PGs). 78% of DIPGs and 22% of non-BS-PGs contained a p.K27M mutation in H3F3A, encoding histone H3.3, or in the related HIST1H3B, encoding histone H3.1. An additional 14% of non-BS-PGs had somatic p.G34R H3F3A mutations.
Motivation: The expansion of cancer genome sequencing continues to stimulate development of analytical tools for inferring relationships between somatic changes and tumor development. Pathway associations are especially consequential, but existing algorithms are demonstrably inadequate.
Methods: Here, we propose the PathScan significance test for the scenario where pathway mutations collectively contribute to tumor development. Its design addresses two aspects that established methods neglect. First, we account for variations in gene length and the consequent differences in their mutation probabilities under the standard null hypothesis of random mutation. The associated spike in computational effort is mitigated by accurate convolution-based approximation. Second, we combine individual probabilities into a multiple-sample value using Fisher–Lancaster theory, thereby improving differentiation between a few highly mutated genes and many genes having only a few mutations apiece. We investigate accuracy, computational effort and power, reporting acceptable performance for each.
Results: As an example calculation, we re-analyze KEGG-based lung adenocarcinoma pathway mutations from the Tumor Sequencing Project. Our test recapitulates the most significant pathways and finds that others for which the original test battery was inconclusive are not actually significant. It also identifies the focal adhesion pathway as being significantly mutated, a finding consistent with earlier studies. We also expand this analysis to other databases: Reactome, BioCarta, Pfam, PID and SMART, finding additional hits in ErbB and EPHA signaling pathways and regulation of telomerase. All have implications and plausible mechanistic roles in cancer. Finally, we discuss aspects of extending the method to integrate gene-specific background rates and other types of genetic anomalies.
Availability: PathScan is implemented in Perl and is available from the Genome Institute at: http://genome.wustl.edu/software/pathscan.
Supplementary information: Supplementary data are available at Bioinformatics online.
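The two ingredients named in the Methods paragraph of the PathScan abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the PathScan code: the per-gene probability uses a simple uniform per-base mutation rate, the convolution shown is exact rather than the approximation the authors describe, and the combination step is plain Fisher's method without Lancaster's correction for discrete data. All function names are hypothetical.

```python
import math


def gene_hit_probability(length_bp, rate_per_bp):
    """Probability that a gene of the given length carries at least one mutation."""
    return 1.0 - (1.0 - rate_per_bp) ** length_bp


def prob_at_least(hit_probs, m):
    """P(at least m genes mutated) when gene i is hit with probability hit_probs[i].

    Builds the distribution of the number of mutated genes by convolving
    one Bernoulli distribution per gene.
    """
    dist = [1.0]  # dist[k] = probability that exactly k genes are mutated
    for p in hit_probs:
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * (1.0 - p)
            new[k + 1] += mass * p
        dist = new
    return sum(dist[m:])


def fisher_combine(p_values):
    """Combine per-sample p-values (each in (0, 1]) with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with 2k
    degrees of freedom, whose upper tail has a closed form for even degrees.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Toy pathway of three genes with different lengths (hypothetical values).
probs = [gene_hit_probability(n, 1e-6) for n in (1_000, 20_000, 100_000)]
p_sample = prob_at_least(probs, 2)          # one sample with two mutated genes
p_combined = fisher_combine([p_sample] * 4)  # four samples with the same result
```

Because the longest gene dominates the per-sample probability, the same mutation count is less surprising in a pathway of long genes than in one of short genes, which is the length effect the abstract describes.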
Understanding the prevailing mutational mechanisms responsible for human genome structural variation requires uniformity in the discovery of allelic variants and precision in terms of breakpoint delineation. We develop a resource based on capillary end-sequencing of 13.8 million fosmid clones from 17 human genomes and characterize the complete sequence of 1,054 large structural variants corresponding to 589 deletions, 384 insertions, and 81 inversions. We analyze the 2,081 breakpoint junctions and infer potential mechanism of origin. Three mechanisms account for the bulk of germline structural variation: microhomology-mediated processes involving short (2–20 bp) stretches of sequence (28%), non-allelic homologous recombination (NAHR) (22%) and L1 retrotransposition (19%). The high quality and long-range continuity of the sequence reveals more complex mutational mechanisms including repeat-mediated inversions and gene conversion that are most often missed by other methods including comparative genomic hybridization, SNP microarrays and next-generation sequencing.
There is a complex relationship between the evolution of segmental duplications and rearrangements associated with human disease. We performed a detailed analysis of one region on chromosome 16p12.1 associated with neurocognitive disease and identified one of the largest structural inconsistencies with the human reference assembly. Various genomic analyses show that all examined humans are homozygously inverted relative to the reference genome for a 1.1-Mbp region on 16p12.1. We determined that this assembly discrepancy stems from two common structural configurations with worldwide frequencies of 17.6% (S1) and 82.4% (S2). This polymorphism arose from the rapid integration of segmental duplications, precipitating two local inversions within the human lineage over the last 10 million years. The two human haplotypes differ by 333 kbp of additional duplicated sequence present in S2 but not in S1. Importantly, we show that the S2 configuration harbors directly oriented duplications specifically predisposing this chromosome to disease rearrangement.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implication in tumorigenesis. Current commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of the overall statistical power due to segmentation and discretization of individual sample's data. We propose a population-based approach for RCNA detection with no need of single-sample analysis, which is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. Directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of single-sample CNA calling based two-step approaches. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Supplementary information: Supplementary data are available at Bioinformatics online.
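A rough sketch of the between-site correlation idea in the CMDS abstract: if a region is recurrently gained or lost, the intensity ratios of neighbouring sites rise and fall together across samples, so correlations near the diagonal of the site-by-site correlation matrix are high. The code below is a simplified illustration, not the published CMDS procedure; the function names, the half_window parameter, and the toy data are hypothetical, and no significance testing is shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def diagonal_scores(ratios, half_window=2):
    """Per-site mean correlation with neighbouring sites across all samples.

    ratios[s][i] is the raw intensity ratio of sample s at chromosomal site i.
    Only site pairs within half_window of each other are used, which
    corresponds to a band along the diagonal of the full correlation matrix.
    """
    n_sites = len(ratios[0])
    columns = [[row[i] for row in ratios] for i in range(n_sites)]
    scores = []
    for i in range(n_sites):
        lo, hi = max(0, i - half_window), min(n_sites, i + half_window + 1)
        vals = [pearson(columns[i], columns[j]) for j in range(lo, hi) if j != i]
        scores.append(sum(vals) / len(vals) if vals else 0.0)
    return scores


# Toy data: sites 0-2 are gained together in samples 0 and 2; sites 3-4 are noise.
toy = [
    [1.0, 1.1, 0.9, 0.1, 0.1],
    [0.0, 0.1, 0.0, -0.1, 0.1],
    [1.0, 0.9, 1.0, -0.1, -0.1],
    [0.0, 0.0, 0.1, 0.1, -0.1],
]
scores = diagonal_scores(toy, half_window=1)
```

Restricting the computation to a diagonal band keeps the cost roughly linear in the number of sites, which is in the spirit of the computational savings the abstract claims.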
In birds, as in mammals, one pair of chromosomes differs between the sexes. In birds, males are ZZ and females ZW. In mammals, males are XY and females XX. Like the mammalian XY pair, the avian ZW pair is believed to have evolved from autosomes, with most change occurring in the chromosomes found in only one sex – the W and Y chromosomes [1–5]. By contrast, the sex chromosomes found in both sexes – the Z and X chromosomes – are assumed to have diverged little from their autosomal progenitors [2]. Here we report findings that overturn this assumption for both the chicken Z and human X chromosomes. The chicken Z chromosome, which we sequenced essentially to completion, is less gene-dense than chicken autosomes but contains a massive tandem array containing hundreds of duplicated genes expressed in testes. A comprehensive comparison of the chicken Z chromosome to the finished sequence of the human X chromosome demonstrates that each evolved independently from different portions of the ancestral genome. Despite this independence, the chicken Z and human X chromosomes share features that distinguish them from autosomes: the acquisition and amplification of testis-expressed genes, as well as a low gene density resulting from an expansion of intergenic regions. These features were not present on the autosomes from which the Z and X chromosomes originated but were instead acquired during the evolution of the Z and X as sex chromosomes. We conclude that the avian Z and mammalian X chromosomes followed convergent evolutionary trajectories, despite their evolving with opposite (female vs. male) systems of heterogamety. More broadly, in birds and mammals, sex chromosome evolution involved not only gene loss in sex-specific chromosomes, but also marked expansion and gene acquisition in sex chromosomes common to males and females.
The extent of human genomic structural variation suggests that there must be portions of the genome yet to be discovered, annotated and characterized at the sequence level. We present a resource and analysis of 2,363 novel insertion sequences corresponding to 720 genomic loci. We show that a substantial fraction of these sequences are either missing, fragmented or mis-assigned when compared to recent de novo sequence assemblies from short-read next-generation sequence data. We determine that 18–37% of these novel insertions are copy-number polymorphic, including loci that show extensive population stratification among Europeans, Asians and Africans. Complete sequencing of 156 of these insertions identifies novel exons and conserved non-coding sequences not yet represented in the reference genome. We develop a method to accurately genotype these novel insertions by mapping next-generation sequencing datasets to the breakpoint thereby providing a means to characterize copy-number status for regions previously inaccessible to SNP microarrays.
A genomic era of cancer studies is developing rapidly, fueled by the emergence of next-generation sequencing technologies that provide exquisite sensitivity and resolution. This article discusses several areas within cancer genomics that are being transformed by the application of new technology, and in the process are dramatically expanding our understanding of this disease. Although we anticipate that there will be many exciting discoveries in the near future, the ultimate success of these endeavors rests on our ability to translate what is learned into better diagnosis, treatment and prevention of cancer.
To date, few peptides in the complex mixture of platypus venom have been identified and sequenced, in part due to the limited amounts of platypus venom available to study. We have constructed and sequenced a cDNA library from an active platypus venom gland to identify the remaining components.
We identified 83 novel putative platypus venom genes from 13 toxin families, which are homologous to known toxins from a wide range of vertebrates (fish, reptiles, insectivores) and invertebrates (spiders, sea anemones, starfish). A number of these are expressed in tissues other than the venom gland, and at least three of these families (those with homology to toxins from distant invertebrates) may play non-toxin roles. Thus, further functional testing is required to confirm venom activity. However, the presence of similar putative toxins in such widely divergent species provides further evidence for the hypothesis that there are certain protein families that are selected preferentially during evolution to become venom peptides. We have also used homology with known proteins to speculate on the contributions of each venom component to the symptoms of platypus envenomation.
This study represents a step towards fully characterizing the first mammal venom transcriptome. We have found similarities between putative platypus toxins and those of a number of unrelated species, providing insight into the evolution of mammalian venom.
Summary: Massively parallel sequencing technologies hold incredible promise for the study of DNA sequence variation, particularly the identification of variants affecting human disease. The unprecedented throughput and relatively short read lengths of Roche/454, Illumina/Solexa, and other platforms have spurred development of a new generation of sequence alignment algorithms. Yet detection of sequence variants based on short read alignments remains challenging, and most currently available tools are limited to a single platform or aligner type. We present VarScan, an open source tool for variant detection that is compatible with several short read aligners. We demonstrate VarScan's ability to detect SNPs and indels with high sensitivity and specificity, in both Roche/454 sequencing of individuals and deep Illumina/Solexa sequencing of pooled samples.
Availability and Implementation: Source code and documentation are freely available at http://genome.wustl.edu/tools/cancer-genomics, implemented as a Perl package and supported on Linux/UNIX, MS Windows and Mac OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.
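Variant detection from short-read alignments, as described in the VarScan abstract, often comes down to counting reads at each position and applying thresholds. The sketch below illustrates that style of heuristic filter only; it is not VarScan's code, and the function name, parameter names, and default values are hypothetical. A low frequency cutoff is what allows rare alleles to be reported in deep sequencing of pooled samples.

```python
def call_variant(ref_reads, alt_reads, min_coverage=8, min_alt_reads=2, min_alt_freq=0.01):
    """Return the variant allele frequency if the site passes all filters, else None.

    ref_reads and alt_reads are counts of reads supporting the reference and
    the variant allele at one position.
    """
    coverage = ref_reads + alt_reads
    if coverage < min_coverage or alt_reads < min_alt_reads:
        return None
    freq = alt_reads / coverage
    return freq if freq >= min_alt_freq else None


# Individual sample: a heterozygous site is expected near 50% variant reads.
individual = call_variant(ref_reads=14, alt_reads=16, min_alt_freq=0.2)

# Deep pooled sample: a rare allele is reported only with a low frequency cutoff.
pooled = call_variant(ref_reads=950, alt_reads=50, min_alt_freq=0.01)

# Too few reads overall, so no call is made.
shallow = call_variant(ref_reads=3, alt_reads=2)
```

Because the filter works on read counts alone, it does not depend on which aligner or platform produced them, provided the counts have been extracted consistently.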