Insertion coordinate convention.
Number of deletion call sets supporting each reference MEI locus. The average number of deletion call sets supporting an MEI event is about eight (blue), while for all deletions in the 1000GP release (gray dashed line) the average is about three. The peak in the number of call sets for Alu MEI deletions corresponds to the eight Illumina RP-based call sets (BC, Wash U, WTSI for both pilots, Broad for pilot 1, and U.Wash for pilot 2) and two SR call sets (Pindel for both pilots).
UCSC browser display of reference MEI. (top) The deletion (red track with 1000GP deletion id's P1_M_061510_12_213 for low coverage pilot and P2_M_061510_12_22 from the trio pilot) matches to the annotated AluYg6 element at chr12:8516855–8517156, present in the NCBI36 reference sequence but missing in the sequenced sample. The black RepeatMasker track shows that the AluYg6 element matches the deletion start and end coordinates. The green tracks indicate the extent of the chimpanzee assembly, which does not include the AluYg6 element. The blue DGV tracks show that this particular deletion has been previously identified by several experiments with various degrees of position resolution. (bottom) Example of questionable reference MEI. The blue track at the top marks a detected deletion (id P2_M_061510_3_301) at chromosome 3, 60,660,331 bp that overlaps >50% with a short annotated L1HS element, but the start and end coordinates do not match precisely. The chimpanzee genome (in yellow) has a gap in the region, but the edges do not align precisely. This deletion was included in the count of 2,010 reference MEI, but adds to the level of uncertainty.
1000 Genomes Project pilot sample breakdown. a) Venn diagram of pilot samples by sequencing platform (Illumina and 454 only). The bulk of the samples were sequenced by Illumina. The circle areas are only roughly proportional to the number of samples contained. b) Venn diagram of samples used for MEI detection (left) and genotyping (right). MEI detected as insertions (red) and deletions (blue) have different signatures and detection algorithms, which accounts for the difference between the sample sets used.
Illumina paired end fragment length distributions. Left) Low coverage pilot fragment length distributions for a random selection of 20 lanes of Illumina read pair data. Most libraries have a median fragment length from 100 to 300 bp with a wide variety of shapes. Right) Trio pilot fragment length distributions for 130 lanes of Illumina read pair data for NA12878. Five libraries are shown in different colors with different characteristic shapes. The small peak visible in orange at 550 bp is shifted by 300 bp from the main peak. This small peak arises from reference Alu insertions of length 300 bp. This small Alu peak occurs for all libraries in both pilots.
MEI insertion sensitivity vs. coverage for the two methods. Coverage for the RP method is quantified as “span” coverage on the blue scale. Span coverage is calculated based on the fragment gap between the reads at the end of the fragment where RP detection is sensitive to large structural variations. The SR algorithm sensitivity depends on read coverage (red scale at the top) because the insertion can be detected anywhere within a given read (except within 20 bp of the ends). The detection sensitivity at maximum coverage is determined by the trio overlap calculations from Table S6. Sensitivity at reduced coverage values is calculated by down-sampling the number of supporting reads and counting the fraction of insertions that survive the selection criteria.
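The down-sampling step described in this caption can be sketched as follows. This is a minimal illustration and not the study's pipeline: the function name, the per-site supporting-read counts, and the threshold of two supporting reads are assumptions made for the example. Each supporting read is kept independently with probability equal to the reduced-coverage fraction, and a site survives if enough reads remain.

```python
import random

def surviving_fraction(support_counts, keep_fraction, min_support=2, seed=0):
    """Fraction of sites that still pass `min_support` after random thinning.

    support_counts: supporting reads per detected insertion (hypothetical input).
    keep_fraction:  reduced coverage divided by full coverage, between 0 and 1.
    min_support:    assumed selection criterion (minimum supporting reads).
    """
    rng = random.Random(seed)
    survivors = 0
    for n_reads in support_counts:
        # Keep each supporting read independently with probability keep_fraction.
        kept = sum(1 for _ in range(n_reads) if rng.random() < keep_fraction)
        if kept >= min_support:
            survivors += 1
    return survivors / len(support_counts)
```

Multiplying the surviving fraction by the sensitivity measured at full coverage would give a sensitivity estimate at the reduced coverage, under the assumption that read loss is random.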
Non-reference MEI insertion breakpoint resolution. (top) the position residual between matched RP to SR insertions. (bottom) 1000GP loci vs. dbRIP. The dbRIP hg18 coordinates were shifted by TSD such that both lists adopt the ‘leftmost’ coordinate convention.
Venn diagrams of MEI insertion overlap with recent studies. (top) L1 overlap with Ewing and Kazazian. (bottom) Alu overlap with Hormozdiari et al.
Genomic distance to nearest element of the same family. (top) Non-reference MEI. 1000GP and HuRef distributions are plotted, as well as L1 distances for Ewing and Kazazian and Alu distances for Hormozdiari et al. Distances <1 indicate insertions within annotated elements.
Insertion position resolution comparison. Non-reference MEI were matched to dbRIP using a 200 bp window.
Number of MEI per 1 Mb binned region across the genome. (top) Dotted gray line is a simple Poisson model for MEI distributed uniformly across the accessible genome (2.85 Gb). The red arrow points to a significant hotspot on chromosome 6 at position 33 Mb in the HLA region, where 19 MEI were detected in a 1 Mb region. (bottom) MEI density profile across chromosome 6 showing the spike in the HLA region at 33 Mb.
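A uniform Poisson model of this kind can be sketched as below. The total of 7,380 MEI loci is taken from the combined event list and is assumed here to be the count entering the model; the 2,850 bins follow from the 2.85 Gb accessible genome. The helper names are hypothetical. This gives a mean of about 2.6 MEI per bin.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson mean lam, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def poisson_tail(k_min, lam, n_terms=200):
    """P(X >= k_min), summing the upper tail directly to avoid cancellation."""
    return sum(poisson_pmf(k, lam) for k in range(k_min, k_min + n_terms))

n_mei = 7380    # assumed total MEI count (combined event list)
n_bins = 2850   # 2.85 Gb accessible genome in 1 Mb bins
lam = n_mei / n_bins

# Expected number of 1 Mb bins holding 19 or more MEI under uniform placement.
expected_bins_ge_19 = n_bins * poisson_tail(19, lam)
```

Under these assumptions the expected number of bins with 19 or more MEI is far below one, which is why a single such bin stands out as a hotspot.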
MEI insertion length. a) Comparison of insertion lengths with 617 dbRIP assembled MEI insertions that match 1000 Genomes MEI using a 200 bp window around insertion position. b) MEI insertion length residual distribution. c) The insertion length from MEI deletions (red) is the number of reference nucleotides in the deleted region (the annotated mobile element plus one copy of the TSD and any carry-over sequence). Sharp peaks at 300 bp and 6000 bp are the Alu and L1 insertions respectively. The insertion length for MEI detected as insertions (blue) is estimated from the span of the mapping coordinates within the mobile element. This estimate does not take into account any inserted sequence that is not part of the mobile element such as the TSD, poly-A tail, or carry-over sequence.
Genotyping efficiency. top) Fraction of MEI sites surviving genotype quality thresholds in low coverage data for non-reference MEI (blue steps, GQ≥7) and for reference MEI (red, GQ≥10). Also shown is genotype accuracy based on validation experiments for non-reference MEI (dashed with grey 95% confidence interval). bottom) Sample-by-sample fraction of MEI sites surviving the genotype quality threshold vs. coverage in low coverage samples. Non-reference MEI (crosses) show a genotyping efficiency approaching 60% at 4 fragments/base spanning coverage, while reference MEI (circles) genotyping efficiency is nearly flat at 80%. Samples from the three population groups show the same trends. Coverage here is calculated as spanning coverage, which is most relevant for RP detection.
Hardy-Weinberg Equilibrium test. Proportions of each genotype as a function of allele frequency for each population group (blue: CEU, red YRI, and green CHBJPT). Also plotted in gray dashed lines for comparison is the proportion expected from HWE.
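The HWE expectation used for comparison follows the standard relation between allele frequency and genotype proportions. A minimal sketch, with a hypothetical function name and the VCF genotype labels used elsewhere in these captions:

```python
def hwe_proportions(p):
    """Expected genotype proportions under HWE for alternate-allele frequency p."""
    q = 1.0 - p
    return {
        "0/0": q * q,      # homozygous reference
        "0/1": 2 * p * q,  # heterozygous
        "1/1": p * p,      # homozygous alternate
    }
```

For example, an allele frequency of 0.5 gives proportions of 0.25, 0.5, and 0.25 for the three genotypes.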
Genotype matrix of low coverage samples. Each element in the matrix corresponds to a sample and a locus at which the genotype is color coded. Sample populations are labeled across the top, separated by green lines. The chromosome order for the MEI loci is labeled on the right side, with non-reference MEI (“insertions”) and reference MEI (“deletions”) grouped separately. This matrix was the input to the Principal Component Analysis plotted in the main text (Figure S16d).
Principal Component Analysis population clustering for PCR genotypes, MEI ins, MEI del, combined. A matrix of genotypes for each site and sample was input to a PCA and the resulting first two components are plotted against each other. The sum of insertion alleles is the value in the matrix elements. For elements corresponding to sites and samples without genotypes, the global average genotype value was used. a) Genotypes from PCR validation for the low coverage pilot. b) Genotypes from low coverage non-reference MEI only. c) Genotypes from reference MEI only. d) Genotypes from samples with both non-reference and reference MEI. Population clusters become tighter as more MEI insertion information is added to PCA.
Coalescent simulation allele frequency spectra for the combined CEU, YRI, CHB and JPT population groups. AF is binned in units of 0.1. The lowest bin (0–0.1) is not plotted to allow the spectra at higher AF to be compared. The normalizations for MEI detected as insertions (red) and deletions (green) are set so that the two components sum to the total unbiased MEI AFS (blue).
MEI insertion rate vs. coalescent time for increasing MEI site selection thresholds. The estimated MEI insertion rates (main text Eq.2) for each sample is plotted vs. the coalescent time derived from SNP heterozygosity. Panel a) is the same as from the main text and corresponds to genotyped sites with GQ≥7, which also corresponds to sites with at least two supporting fragments. As more supporting fragments are required b) NF≥3, c) NF≥5, d) NF≥7, the numbers of genotyped sites decrease, but the trend between populations in the MEI insertion rates remains.
Combined MEI event list (external Excel file). Genomic coordinates with confidence intervals are listed for each of the 7,380 MEI loci. Each event is characterized by an element type (ELEMENT: Alu, L1, or SVA), element STRAND (+ or −), detection (DET: DEL or INS for reference and non-reference MEI, respectively), event ID, estimated insertion length (LEN), detection algorithm (ALG), validation status (VAL), validation method (VALMETH: PCR; ASM for assembly; events marked 7SLRNA should be discarded due to proximity to an annotated 7SLRNA element), population (POP: CEU, YRI, CHB, or JPT), allele frequency in three major groups (AF), number of genotyped samples in the three groups, number of insertion alleles in the three groups, previous study IDs (DBVARID, DBRIPID, PUBID), TSD length, number of insertion-supporting fragments from the 5′ side (NALT5) and from the 3′ side (NALT3), the 1000 Genomes CALL SET name, quality value (Q), gene/exon/UTR/CDS interrupted (GENE), sub-family, inserted sequence when available, and a list of all samples in which the alternate allele was detected (ALTSAMPLES). Note: 71 events identified by the VAL field as invalidated or in close proximity to a 7SLRNA locus are marked in yellow and were not included in the counts of interrupted genes, exons, UTRs, or CDS regions.
Samples with corresponding sequence coverage (external Excel file). Sequence coverage for each of the 185 samples, calculated in terms of Illumina span coverage for RP detection, 454 base coverage for SR detection, and Illumina base coverage (including single-end read data) for deletion detection.
Reference MEI detection method breakdown. (external Excel file) Thirteen different algorithms contributed to the detection of MEI present in the reference but not in a sample. a) Breakdown by pilot. b) Breakdown by algorithm. The bulk of MEI deletions were found by Illumina RP and SR methods.
Validation genotypes for non-reference MEI datasets (external Excel file). Complete genotyping information for all samples tested at the 746 sites used for false detection rate estimates and for genotyping assessment. a) Additional validation results for non-reference MEI loci (external Excel file) Genome coordinates for 267 additional validation PCR experiments carried out at Yale, EMBL, and LSU. These experiments were done as preliminary tests (EMBL, Yale, LSU-PRELIM) and for testing specific loci (SVA, de novo, exon interrupting).
MEI sensitivity based on comparison to gold standard events (external Excel file). The fraction of HuRef MEI found by this study is a lower limit to the detection sensitivity to common MEI alleles. a) MEI insertion detection sensitivity. b) MEI deletion sensitivity based on loci detected in the same samples from Mills et al.
Trios (external Excel file). a) Overlap between RP and SR in the same trio samples (NA12878 and NA19240) can be used to estimate detection sensitivity. Columns RP and SR are the counts of all loci for the two samples broken down by element type. RP-only and SR-only count loci where only one method found the insertion. RP+SR is the count of loci detected by both methods. The detection sensitivity estimates (εRP, εSR, and ε) with corresponding statistical 1-sigma errors are derived from the overlaps. The combined detection efficiency is based on the union of the two independent methods. b) Counts of MEI site differences between two individuals. The trio samples were used for this because of the relatively high coverage and corresponding sensitivity to low frequency alleles. Corrections to the counts compensate for less-than-perfect detection sensitivity and false detections. The trio children from two populations (CEU and YRI) have the most differences (2034±120) while the CEU parents have the fewest (663±120). The YRI parents' count of sites is between the other pairs. These differences are plotted vs. the corresponding coalescent time in the main text. c) De novo insertion hunt. Any MEI appearing in a child of the family trios but not in either parent would be a de novo MEI insertion. Six candidates were found in NA12878 (a) and 15 in NA19240 (b). All but one de novo candidate occurred at a site not found in any of the other samples. This site was PCR tested and identified in NA12892 (mother).
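The overlap-based sensitivity estimate in part a) can be illustrated with the standard two-method (capture-recapture) calculation. This is a sketch under the assumption that the RP and SR methods detect loci independently; the exact formulas used for the table are not stated in this caption, and the function name, binomial error form, and example counts are hypothetical.

```python
import math

def overlap_sensitivity(rp_only, sr_only, both):
    """Estimate per-method and combined sensitivity from two-method overlap."""
    n_rp = rp_only + both   # all loci found by RP
    n_sr = sr_only + both   # all loci found by SR
    eff_rp = both / n_sr    # fraction of SR loci that RP also found
    eff_sr = both / n_rp    # fraction of RP loci that SR also found
    # Union of two independent methods: a locus is missed only if both miss it.
    combined = 1.0 - (1.0 - eff_rp) * (1.0 - eff_sr)
    # Assumed binomial 1-sigma errors on the per-method estimates.
    err_rp = math.sqrt(eff_rp * (1.0 - eff_rp) / n_sr)
    err_sr = math.sqrt(eff_sr * (1.0 - eff_sr) / n_rp)
    return {"eff_rp": eff_rp, "err_rp": err_rp,
            "eff_sr": eff_sr, "err_sr": err_sr,
            "combined": combined}
```

With illustrative counts of 50 RP-only, 50 SR-only, and 50 shared loci, each method has an estimated sensitivity of 0.5 and the combined sensitivity is 0.75.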
Sub-family breakdown (external Excel file). Fragments from 1,105 of the Alu insertions were assembled into contigs spanning the Alu element to allow subfamily identification. The subfamilies are compared with those from the reference MEI detected as deletions and to the Venter MEI.
Non-reference MEI genotyping validation (external Excel file). Genotype contingency table for non-reference MEI vs. genotypes from PCR validation experiments. “0/0” are homozygous reference, “0/1” are heterozygous insertions, and “1/1” are homozygous insertions (VCF file genotype label convention). Counts in each box are the numbers of sites and samples with the corresponding combination of genotype from sequencing and PCR. The overall genotyping accuracy is the fraction of counts on the diagonal while the genotyping efficiency is the fraction of all genotyped sites & samples divided by sites×samples for the given pilot dataset. Only genotypes with Q≥7 are included. The low coverage (a) accuracy is 87% and the efficiency is 57%. The trio pilot (b) accuracy is 95.7% and the genotyping efficiency is 89.9%. The improved genotyping performance for the trio pilot is a consequence of higher coverage.
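The two summary quantities defined in this caption can be computed from a contingency table as sketched below. The function names and the example table are hypothetical and do not reproduce the study's counts; rows are sequencing genotypes and columns are PCR genotypes, both in the order 0/0, 0/1, 1/1.

```python
def genotyping_accuracy(table):
    """Fraction of counts on the diagonal of a square contingency table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return diagonal / total

def genotyping_efficiency(n_genotyped, n_sites, n_samples):
    """Genotyped site-sample pairs divided by all site-sample pairs."""
    return n_genotyped / (n_sites * n_samples)

# Hypothetical 3x3 table of site-sample counts.
example_table = [
    [8, 1, 1],
    [1, 6, 1],
    [0, 1, 1],
]
```

For the example table, 15 of 20 counts lie on the diagonal, giving an accuracy of 0.75.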
MEI genotyping corrections. (external Excel file). a) Detection sensitivity. b) Genotyping efficiency with correction factors used in constructing the allele frequency spectra for each population and element type. c) Heterozygosity counts and correction factors for each sample and element family.
Loss of Function variants (external Excel file). Counts of insertions occurring within genes, UTR, and CDS regions annotated from Gencode version 3b. This table is partially shown in the main text. Only insertions with breakpoint confidence intervals entirely within the annotation region are counted. Any insertion candidate subsequently invalidated is not counted. A random placement model is used to estimate the number of expected insertions in the absence of selection. a) MEI counts. b) The corresponding counts of SNPs from the low coverage pilots are also listed along with the expected numbers of SNPs based on random placement. The suppression factor for MEI (~46×) is similar to that of a SNP changing a stop codon (~42×).
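A simple form of random placement model can be sketched as follows. It assumes insertions land uniformly across the accessible genome, so the expected count in an annotation class scales with the fraction of the genome that class covers. The function names and example numbers are hypothetical, and the model actually used for the table may include further corrections.

```python
def expected_random_hits(n_insertions, region_bp, genome_bp):
    """Expected insertions in a region under uniform random placement."""
    return n_insertions * region_bp / genome_bp

def suppression_factor(expected, observed):
    """Ratio of expected to observed insertions; values above 1 mean depletion."""
    return expected / observed
```

For instance, 1,000 insertions placed at random in a 3 Gb genome would be expected to hit a 30 Mb annotation class 10 times; observing only 2 would correspond to a suppression factor of 5.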
Mobile element consensus sequences (external Excel file). Repbase element names and sequences for each of the elements added to the reference genome for MEI insertion detection.
The 1000 Genomes Project Consortium.