Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls.
This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%.
In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1333-7) contains supplementary material, which is available to authorized users.
INDEL; 1000 Genomes Project; Distribution; Mutagenesis
Mobile elements constitute greater than 45% of the human genome as a result of repeated insertion events during human genome evolution. Although most mobile elements are fixed within the human population, some elements (including ALU, long interspersed elements (LINE) 1 (L1), and SVA) are still actively duplicating and may result in life-threatening human diseases such as cancer, motivating the need for accurate mobile-element insertion (MEI) detection tools. We developed a software package, TANGRAM, for MEI detection in next-generation sequencing data, currently serving as the primary MEI detection tool in the 1000 Genomes Project. TANGRAM takes advantage of valuable mapping information provided by our own MOSAIK mapper, and until recently required MOSAIK mappings as its input. In this study, we report a new feature that enables TANGRAM to be used on alignments generated by any mainstream short-read mapper, making it accessible for many genomic users. To demonstrate its utility for cancer genome analysis, we have applied TANGRAM to the TCGA (The Cancer Genome Atlas) mutation calling benchmark 4 dataset. TANGRAM is fast, accurate, easy to use, and open source on https://github.com/jiantao/Tangram.
mobile-element insertion; structural variation; ALU
The human genome reference assembly is crucial for aligning and analyzing sequence data, and for genome annotation, among other roles. However, the models and analysis assumptions that underlie the current assembly need revising to fully represent human sequence diversity. Improved analysis tools and updated data reporting formats are also required.
Mobile elements (MEs) constitute greater than 50% of the human genome as a result of repeated insertion events during human genome evolution. Although most of these elements are now fixed in the population, some MEs, including ALU, L1, SVA and HERV-K elements, are still actively duplicating. Mobile element insertions (MEIs) have been associated with human genetic disorders, including Crohn’s disease, hemophilia, and various types of cancer, motivating the need for accurate MEI detection methods. To comprehensively identify and accurately characterize these variants in whole genome next-generation sequencing (NGS) data, a computationally efficient detection and genotyping method is required. Current computational tools are unable to call MEI polymorphisms with sufficiently high sensitivity and specificity, or call individual genotypes with sufficiently high accuracy.
Here we report Tangram, a computationally efficient MEI detection program that integrates read-pair (RP) and split-read (SR) mapping signals to detect MEI events. By utilizing SR mapping in its primary detection module, a feature unique to this software, Tangram is able to pinpoint MEI breakpoints with single-nucleotide precision. To understand the role of MEI events in disease, it is essential to produce accurate individual genotypes in clinical samples. Tangram is able to determine sample genotypes with very high accuracy. Using simulations and experimental datasets, we demonstrate that Tangram has superior sensitivity, specificity, breakpoint resolution and genotyping accuracy, when compared to other, recently developed MEI detection methods.
Tangram serves as the primary MEI detection tool in the 1000 Genomes Project, and is implemented as a highly portable, memory-efficient, easy-to-use C++ computer program, built under an open-source development model.
Structural variation; Mobile element insertion; Retrotransposon; Endogenous retrovirus; L1; Alu; SVA; High-throughput sequencing
Many tumors are composed of genetically divergent cell subpopulations. We report SubcloneSeeker, a package capable of exhaustive identification of subclone structures and evolutionary histories with bulk somatic variant allele frequency measurements from tumor biopsies. We present a statistical framework to elucidate whether specific sets of mutations are present within the same subclones, and the order in which they occur. We demonstrate how subclone reconstruction provides crucial information about tumorigenesis and relapse mechanisms, guides functional study by variant prioritization, and has potential as a rational basis for informed therapeutic strategies for the patient. SubcloneSeeker is available at: https://github.com/yiq/SubcloneSeeker.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0443-x) contains supplementary material, which is available to authorized users.
The simultaneous targeting of host and pathogen processes represents an untapped approach for the treatment of intracellular infections. Hypoxia-inducible factor-1 (HIF-1) is a host cell transcription factor that is activated by and required for the growth of the intracellular protozoan parasite Toxoplasma gondii at physiological oxygen levels. Parasite activation of HIF-1 is blocked by inhibiting the family of closely related Activin-Like Kinase (ALK) host cell receptors ALK4, ALK5, and ALK7, which was determined in part by use of an ALK4,5,7 inhibitor named SB505124. Besides inhibiting HIF-1 activation, SB505124 also potently blocks parasite replication under normoxic conditions. To determine whether SB505124 inhibition of parasite growth was exclusively due to inhibition of ALK4,5,7 or because the drug inhibited a second kinase, SB505124-resistant parasites were isolated by chemical mutagenesis. Whole-genome sequencing of these mutants revealed mutations in the Toxoplasma MAP kinase, TgMAPK1. Allelic replacement of mutant TgMAPK1 alleles into wild-type parasites was sufficient to confer SB505124 resistance. SB505124 independently impacts TgMAPK1 and ALK4,5,7 signaling since drug resistant parasites could not activate HIF-1 in the presence of SB505124 or grow in HIF-1 deficient cells. In addition, TgMAPK1 kinase activity is inhibited by SB505124. Finally, mice treated with SB505124 had significantly lower tissue burdens following Toxoplasma infection. These data therefore identify SB505124 as a novel small molecule inhibitor that acts by inhibiting two distinct targets, host HIF-1 and TgMAPK1.
Understanding how a compound blocks growth of an intracellular pathogen is important not only for developing these compounds into drugs that can be prescribed to patients, but also because these data will likely provide novel insight into the biology of these pathogens. Forward genetic screens are one established approach towards defining these mechanisms. But performing these screens with intracellular parasites has been limited not only because of technical limitations but also because the compounds may have off-target effects in either the host or parasite. Here, we report the first compound that kills a pathogen by simultaneously inhibiting distinct host- and parasite-encoded targets. Because developing drug resistance simultaneously to two targets is less likely, this work may highlight a new approach to antimicrobial drug discovery.
Next generation sequencing is helping to overcome limitations in organisms less accessible to classical or reverse genetic methods by facilitating whole genome mutational analysis studies. One traditionally intractable group, the Apicomplexa, contains several important pathogenic protozoan parasites, including the Plasmodium species that cause malaria.
Here we apply whole genome analysis methods to the relatively accessible model apicomplexan, Toxoplasma gondii, to optimize forward genetic methods for chemical mutagenesis using N-ethyl-N-nitrosourea (ENU) and ethylmethane sulfonate (EMS) at varying dosages.
By comparing three different lab-strains we show that spontaneously generated mutations reflect genome composition, without nucleotide bias. However, the single nucleotide variations (SNVs) are not distributed randomly over the genome; most of these mutations reside either in non-coding sequence or are silent with respect to protein coding. This is in contrast to the random genomic distribution of mutations induced by chemical mutagenesis. Additionally, we report a genome-wide transition vs transversion ratio (ti/tv) of 0.91 for spontaneous mutations in Toxoplasma, with slightly higher ratios of 1.20 and 1.06 for variants induced by ENU and EMS, respectively. We also show that in the Toxoplasma system, surprisingly, both ENU and EMS have a proclivity for inducing mutations at A/T base pairs (78.6% and 69.6%, respectively).
The number of SNVs between related laboratory strains is relatively low and is constrained by purifying selection against changes to amino acid sequence. From an experimental mutagenesis point of view, both ENU (24.7%) and EMS (29.1%) are more likely to generate variation within exons than would naturally accumulate over time in culture (19.1%), demonstrating the utility of these approaches for yielding proportionally greater changes to the amino acid sequence. These results will not only direct the methods of future chemical mutagenesis in Toxoplasma, but also aid in designing forward genetic approaches in less accessible pathogenic protozoa.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-354) contains supplementary material, which is available to authorized users.
Whole genome sequencing; Chemical mutagenesis; In vitro adaptation; SNV calling; Apicomplexa
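The transition vs transversion ratio reported above can be illustrated with a minimal Python sketch. It assumes the SNVs are already available as (reference base, alternate base) pairs; the function names and the input layout are hypothetical and are not taken from the study's pipeline.

```python
# Minimal sketch: compute a ti/tv ratio from (ref, alt) base pairs.
# Function names and input layout are illustrative assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt)."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip non-variants and anything that is not a plain A/C/G/T change.
        if ref == alt or ref not in BASES or alt not in BASES:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
ratio = ti_tv_ratio(example)
```

With two transitions and two transversions in `example`, the ratio is 1.0. Because there are four possible transitions and eight possible transversions, mutation with no bias toward either class would give a ratio near 0.5.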
Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations (“ultrasensitive”) and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, “motif-breakers”). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.
MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-network based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into our command and pipeline launcher system GKNO (http://gkno.me).
Motivation: A common question arises at the beginning of every experiment where RNA-Seq is used to detect differential gene expression between two conditions: How many reads should we sequence?
Results: Scotty is an interactive web-based application that assists biologists to design an experiment with an appropriate sample size and read depth to satisfy the user-defined experimental objectives. This design can be based on data available from either pilot samples or publicly available datasets.
Availability: Scotty can be freely accessed on the web at http://euler.bc.edu/marthlab/scotty/scotty.php
Supplementary data are available at Bioinformatics online.
Motivation: High-throughput biological research requires simultaneous visualization as well as analysis of genomic data, e.g. read alignments, variant calls and genomic annotations. Traditionally, such integrative analysis required desktop applications operating on locally stored data. Many current terabyte-size datasets generated by large public consortia projects, however, are already only feasibly stored at specialist genome analysis centers. As even small laboratories can afford very large datasets, local storage and analysis are becoming increasingly limiting, and it is likely that most such datasets will soon be stored remotely, e.g. in the cloud. These developments will require web-based tools that enable users to access, analyze and view vast remotely stored data with a level of sophistication and interactivity that approximates desktop applications. As rapidly dropping cost enables researchers to collect data intended to answer questions in very specialized contexts, developers must also provide software libraries that empower users to implement customized data analyses and data views for their particular application. Such specialized, yet lightweight, applications would empower scientists to better answer specific biological questions than possible with general-purpose genome browsers currently available.
Results: Using recent advances in core web technologies (HTML5), we developed Scribl, a flexible genomic visualization library specifically targeting coordinate-based data such as genomic features, DNA sequence and genetic variants. Scribl simplifies the development of sophisticated web-based graphical tools that approach the dynamism and interactivity of desktop applications.
Supplementary data are available at Bioinformatics online.
The Smith-Waterman algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools for next-generation sequencing data. Although various fast Smith-Waterman implementations have been developed, they are either designed as monolithic protein database searching tools, which do not return detailed alignments, or are embedded into other tools. These issues make reusing these efficient Smith-Waterman implementations impractical.
To facilitate easy integration of the fast Single-Instruction-Multiple-Data Smith-Waterman algorithm into third-party software, we wrote a C/C++ library, which extends Farrar’s Striped Smith-Waterman (SSW) to return alignment information in addition to the optimal Smith-Waterman score. In this library we developed a new method to generate the full optimal alignment results and a suboptimal score in linear space at little cost to efficiency. This improvement makes the fast Single-Instruction-Multiple-Data Smith-Waterman practical for genomic applications. SSW is available both as a C/C++ software library and as a stand-alone alignment tool at: https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.
The SSW library has been used in the primary read mapping tool MOSAIK, the split-read mapping program SCISSORS, the MEI detector TANGRAM, and the read-overlap graph generation program RZMBLR. The speeds of the mentioned software are improved significantly by replacing their ordinary Smith-Waterman or banded Smith-Waterman module with the SSW Library.
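For readers unfamiliar with the underlying recurrence, the following Python sketch shows a textbook Smith-Waterman local alignment with traceback. It is not the SSW library's method: it uses a simple linear gap penalty, a full quadratic-space matrix, and no SIMD striping. The function name and scoring parameters are illustrative assumptions.

```python
# Textbook Smith-Waterman sketch (linear gap penalty, quadratic space).
# NOT the striped SIMD, linear-space method of the SSW library; the
# scoring values below are arbitrary illustrative choices.
def smith_waterman(query, target, match=2, mismatch=-2, gap=-3):
    """Return (best_score, aligned_query, aligned_target)."""
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,   # match or mismatch
                          H[i - 1][j] + gap,       # gap in target
                          H[i][j - 1] + gap)       # gap in query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero-scoring cell is reached.
    i, j = best_pos
    aligned_q, aligned_t = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


score, q_aln, t_aln = smith_waterman("ACGT", "TTACGTTT")
```

Here the query matches a four-base stretch of the target exactly, so the score is four matches at 2 points each. Returning the alignment strings as well as the score is the kind of detail that the abstract notes score-only implementations omit.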
The basal complex in Toxoplasma functions as the contractile ring in the cell division process. Basal complex contraction tapers the daughter cytoskeleton toward the basal end and is required for daughter segregation. We have previously shown that the protein MORN1 is essential for basal complex assembly and likely acts as a scaffolding protein. To further our understanding of the basal complex we combined subcellular fractionation with an affinity purification of the MORN1 complex and identified its protein composition. We identified two new components of the basal complex, one of which uniquely associated with the basal complex in mature parasites, the first of its kind. In addition, we identified several other novel cytoskeleton proteins with different spatiotemporal dynamics throughout cell division. Since many of these proteins are unique to Apicomplexa this study significantly contributes to the annotation of their unique cytoskeleton. Furthermore we show that G-actin binding protein TgCAP is localized at the apical cap region in intracellular parasites, but quickly re-distributes to a cytoplasmic localization pattern upon egress.
Apicomplexa; Toxoplasma; MORN1; basal complex; cytoskeleton; motility; IMC
Toxoplasma gondii has a largely clonal population in North America and Europe, with types I, II and III clonal lineages accounting for the majority of strains isolated from patients. RH, a particular type I strain, is most frequently used to characterize Toxoplasma biology. However, compared to other type I strains, RH has unique characteristics such as faster growth, increased extracellular survival rate and inability to form orally infectious cysts. Thus, to identify candidate genes that could account for these parasite phenotypic differences, we determined genetic differences and differential parasite gene expression between RH and another type I strain, GT1. Moreover, as differences in host cell modulation could affect Toxoplasma replication in the host, we determined differentially modulated host processes among the type I strains through host transcriptional profiling.
Through whole genome sequencing, we identified 1,394 single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) between RH and GT1. These SNPs/indels together with parasite gene expression differences between RH and GT1 were used to identify candidate genes that could account for type I phenotypic differences. A polymorphism in dense granule protein, GRA2, determined RH and GT1 differences in the evasion of the interferon gamma response. In addition, host transcriptional profiling identified that genes regulated by NF-κB, such as interleukin (IL)-12p40, were differentially modulated by the different type I strains. We subsequently showed that this difference in NF-κB activation was due to polymorphisms in GRA15. Furthermore, we observed that RH, but not other type I strains, recruited phosphorylated IκBα (a component of the NF-κB complex) to the parasitophorous vacuole membrane and this recruitment of p-IκBα was partially dependent on GRA2.
We identified candidate parasite genes that could be responsible for phenotypic variation among the type I strains through comparative genomics and transcriptomics. We also identified differentially modulated host pathways among the type I strains, and these can serve as a guideline for future studies in examining the phenotypic differences among type I strains.
Toxoplasma; Type I strains; Comparative genomics; Transcriptomics; NF-κB
Next generation sequencing and advances in genomic enrichment technologies have enabled the discovery of the full spectrum of variants from common to rare alleles in the human population. The application of such technologies can be limited by the amount of DNA available. Whole genome amplification (WGA) can overcome such limitations. Here we investigate the applicability of WGA by comparing SNP and INDEL variant calls from a single genomic/WGA sample pair in two separate capture experiments: a 50 Mbp whole exome capture and a custom capture array of a 4 Mbp region on chr12.
Our results comparing variant calls derived from genomic and WGA DNA show that the majority of variant SNP and INDEL calls are common to both callsets, both at the site and genotype level and suggest that allele bias plays a minimal role when using WGA DNA in re-sequencing studies.
Although the results of this study are based on a limited sample size, they suggest that using WGA DNA allows the discovery of the vast majority of variants, and achieves high concordance metrics, when comparing to genomic DNA calls.
Whole genome amplified DNA; Capture sequencing; Next generation sequencing; Variant discovery
Summary: ART is a set of simulation tools that generate synthetic next-generation sequencing reads. This functionality is essential for testing and benchmarking tools for next-generation sequencing data analysis including read alignment, de novo assembly and genetic variation discovery. ART generates simulated sequencing reads by emulating the sequencing process with built-in, technology-specific read error models and base quality value profiles parameterized empirically in large sequencing datasets. We currently support all three major commercial next-generation sequencing platforms: Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD. ART also allows the flexibility to use customized read error model parameters and quality profiles.
Availability: Both source and binary software packages are available at http://www.niehs.nih.gov/research/resources/software/art
Supplementary data are available at Bioinformatics online.
Exocytosis is essential to the lytic cycle of apicomplexan parasites and required for the pathogenesis of toxoplasmosis and malaria. DOC2 proteins recruit the membrane fusion machinery required for exocytosis in a Ca²⁺-dependent fashion. Here, the phenotype of a Toxoplasma gondii conditional mutant impaired in host cell invasion and egress was pinpointed to a defect in secretion of the micronemes, an apicomplexan-specific organelle that contains adhesion proteins. Whole genome sequencing identified the etiological point mutation in TgDOC2.1. A conditional allele of the orthologous gene engineered into Plasmodium falciparum was also defective in microneme secretion. However, the major effect was on invasion, suggesting microneme secretion is dispensable for Plasmodium egress.
DNA capture technologies combined with high-throughput sequencing now enable cost-effective, deep-coverage, targeted sequencing of complete exomes. This is well suited for SNP discovery and genotyping. However there has been little attention devoted to Copy Number Variation (CNV) detection from exome capture datasets despite the potentially high impact of CNVs in exonic regions on protein function.
As members of the 1000 Genomes Project analysis effort, we investigated 697 samples in which 931 genes were targeted and sampled with 454 or Illumina paired-end sequencing. We developed a rigorous Bayesian method to detect CNVs in the genes, based on read depth within target regions. Despite substantial variability in read coverage across samples and targeted exons, we were able to identify 107 heterozygous deletions in the dataset. The experimentally determined false discovery rate (FDR) of the cleanest dataset from the Wellcome Trust Sanger Institute is 12.5%. We were able to substantially improve the FDR in a subset of gene deletion candidates that were adjacent to another gene deletion call (17 calls). The estimated sensitivity of our call-set was 45%.
This study demonstrates that exonic sequencing datasets, collected both in population based and medical sequencing projects, will be a useful substrate for detecting genic CNV events, particularly deletions. Based on the number of events we found and the sensitivity of the methods in the present dataset, we estimate on average 16 genic heterozygous deletions per individual genome. Our power analysis informs ongoing and future projects about sequencing depth and uniformity of read coverage required for efficient detection.
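The idea of calling deletions from read depth can be sketched as a small Bayesian calculation. The abstract does not specify the model, so everything in this example is an assumption: read counts are treated as Poisson with a mean proportional to copy number, only copy numbers 1, 2 and 3 are considered, and the prior values are arbitrary. It shows the general approach, not the study's method.

```python
# Illustrative sketch only: posterior over copy number from read depth.
# The Poisson likelihood, candidate copy numbers and priors are all
# assumptions for this example, not the model used in the study.
import math


def log_poisson(k, lam):
    """Log of the Poisson probability of observing k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def copy_number_posterior(observed_reads, expected_diploid_reads, priors=None):
    """Return {copy_number: posterior probability} for one target region.

    expected_diploid_reads is the read count expected at copy number 2,
    for example estimated from other samples and targets (must be > 0).
    """
    if priors is None:
        priors = {1: 0.01, 2: 0.98, 3: 0.01}  # hypothetical prior
    log_post = {}
    for cn, prior in priors.items():
        lam = expected_diploid_reads * cn / 2.0  # depth scales with copies
        log_post[cn] = math.log(prior) + log_poisson(observed_reads, lam)
    # Normalise in log space to avoid underflow.
    top = max(log_post.values())
    norm = sum(math.exp(v - top) for v in log_post.values())
    return {cn: math.exp(v - top) / norm for cn, v in log_post.items()}


deleted = copy_number_posterior(observed_reads=50, expected_diploid_reads=100)
normal = copy_number_posterior(observed_reads=100, expected_diploid_reads=100)
```

A region with half the expected depth is assigned to copy number 1 (a heterozygous deletion) despite the strong prior for the diploid state. Real capture data are more variable than a Poisson model allows, which is consistent with the abstract's emphasis on coverage variability across samples and exons.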
Motivation: Analysis of genomic sequencing data requires efficient, easy-to-use access to alignment results and flexible data management tools (e.g. filtering, merging, sorting, etc.). However, the enormous amount of data produced by current sequencing technologies is typically stored in compressed, binary formats that are not easily handled by the text-based parsers commonly used in bioinformatics research.
Results: We introduce a software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first publicly available C++ API for BAM file support and a command-line toolkit.
Availability: BamTools was written in C++, and is supported on Linux, Mac OSX and MS Windows. Source code and documentation are freely available at http://github.org/pezmaster31/bamtools.
The evolution of gene expression is a challenging problem in evolutionary biology, for which accurate, well-calibrated measurements and methods are crucial.
We quantified gene expression with whole-transcriptome sequencing in four diploid, prototrophic strains of Saccharomyces species grown under the same condition to investigate the evolution of gene expression. We found that variation in expression is gene-dependent with large variations in each gene's expression between replicates of the same species. This confounds the identification of genes differentially expressed across species. To address this, we developed a statistical approach to establish significance bounds for inter-species differential expression in RNA-Seq data based on the variance measured across biological replicates. This metric estimates the combined effects of technical and environmental variance, as well as Poisson sampling noise by isolating each component. Despite a paucity of large expression changes, we found a strong correlation between the variance of gene expression change and species divergence (R² = 0.90).
We provide an improved methodology for measuring gene expression changes in evolutionary diverged species using RNA Seq, where experimental artifacts can mimic evolutionary effects.
GEO Accession Number: GSE32679
RNA-Seq; Comparative transcriptomics; S. cerevisiae; S. paradoxus; S. mikatae; S. bayanus
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.
The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.
This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
We embarked on this study to explore the 1000 Genomes Project (1000GP) pilot dataset as a substrate for Mobile Element Insertion (MEI) discovery and analysis. MEI is already well known as a significant component of genetic variation in the human population. However the full extent and effects of MEI can only be assessed by accurate detection in large whole-genome sequencing efforts such as the 1000GP. In this study we identified 7,380 distinct genomic locations of variant MEI and carried out rigorous validation experiments that confirmed the high accuracy of the detected events. We were able to measure the frequency of each variant in three continental population groups and found that inherited MEI variants propagate through populations in much the same way as single nucleotide polymorphisms, except that MEI are more strongly suppressed in protein coding parts of the genome. We also found evidence that the MEI mutation rate has not been constant over human population history, rather that different populations appear to have different characteristic MEI mutation rates.
Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.
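To make the format concrete, the sketch below parses one tab-separated VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) in plain Python. It does not use VCFtools or its Perl API; the function names are hypothetical, and header lines, genotype columns and multi-allelic sites are deliberately left out.

```python
# Minimal sketch of reading one VCF data line with the standard library.
# Not VCFtools; function names are illustrative, and header, genotype
# and multi-allelic handling are omitted.
def parse_vcf_line(line):
    """Split a VCF data line into its eight fixed, tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            # INFO entries without "=" are flags, stored here as True.
            info_dict[key] = value if sep else True
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt, "qual": qual, "filter": flt, "info": info_dict}


def classify(ref, alt):
    """Roughly classify a single REF/ALT pair by allele length."""
    if alt.startswith("<"):
        return "symbolic"          # e.g. <DEL> for structural variants
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"


record = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB\n")
kind = classify(record["ref"], record["alt"])
```

The example record is a single-base substitution at position 14370 on chromosome 20, with a read-depth annotation (`DP=14`) and a flag (`DB`) in the INFO column.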