Characterizing structural variants in the human genome is of great importance, but a genome-wide analysis to detect interspersed repeats has not been done. Thus, the degree to which mobile DNAs contribute to genetic diversity, heritable disease, and oncogenesis remains speculative. We performed transposon insertion profiling by microarray (TIP-chip) to map human L1(Ta) retrotransposons (LINE-1s) genome-wide. This identified numerous novel human L1(Ta) insertional polymorphisms with highly variable allele frequencies. We also explored TIP-chip's usefulness to identify candidate alleles associated with different phenotypes in clinical cohorts. Our data suggest that the occurrence of new insertions is twice as high as previously estimated, and that these repeats are under-recognized as sources of human genomic and phenotypic diversity. We have just begun to probe the universe of human L1(Ta) polymorphisms, and as TIP-chip is applied to other insertions such as Alu SINEs, it will expand the catalog of genomic variants even further.
Following completion of the human genome reference sequence, comparative genomics across and within species is identifying functional elements and establishing relationships between genetic variation and phenotypic diversity. The HapMap project addresses interindividual human SNP variation (International HapMap Consortium, 2003). Recent studies have shown the human genome also contains extensive structural variants (SVs), encompassing in aggregate greater nucleotide content than SNPs, and with potential relationships to genetic diseases (Stankiewicz and Lupski, 2010). A more extensive and specialized toolkit of molecular methods is required to fully appreciate this dynamic dimension of our genomes.
Studies of SVs, mainly focused on copy number variations (CNVs), rely on fosmid library paired-end sequencing and comparative genomic hybridization (CGH) (Scherer et al., 2007). These techniques are somewhat biased against the discovery of those SVs that are less than several kb long and high in copy number; these account for the majority of human SV sequence. Therefore, important gaps remain in understanding the full spectrum of human genetic variation. SV of high copy interspersed repeats or insertion sequence variations (ISVs) (Gresham et al., 2008) are relatively uncharacterized. These sequences, most of which are derived from “copy-and-paste” retroelements, differ in structure, copy number, and location. They pose a significant challenge even to whole-genome sequencing, and are often underrepresented in genome assemblies.
Besides representing a major class of structural variant, ISVs can serve as sites for nonallelic homologous recombination to create CNVs. Large genome-wide studies have found statistical enrichment of mobile DNAs near CNVs and translocation or inversion breakpoints (Cooper et al., 2007; Korbel et al., 2007). Examples of L1 or Alu involved in disease related deletions and translocation junctions are also well-documented (Gu et al., 2008; Kolomietz et al., 2002; Morisada et al., 2010); however most were discovered by CGH and related methods which are blind to new insertions.
ISVs reflecting polymorphic mobile element insertions have significant functional impact. Short interspersed elements (SINEs, such as Alus) frequently serve as gene enhancers and promoters or affect transcript structure (i.e., by being incorporated into exons or used as sites for alternative mRNA splicing; Cordaux and Batzer, 2009). Similarly, evidence suggests that long interspersed elements 1 (LINE1/L1) can alter mRNA splicing of target transcripts (Belancio et al., 2008; Speek, 2001), result in transcript initiation or truncation/reinitiation (Wheelan et al., 2005), and may play roles in generating neuronal diversity (Muotri et al., 2005). Intronic L1 insertions can also dampen or otherwise subtly alter gene expression (Chen et al., 2006; Han and Boeke, 2005; Han et al., 2004). Indeed, intronic insertions can lead to decreased transcript levels and loss of gene function in humans (Schwahn et al., 1998; Ustyugova et al., 2006) and other mammals (Credille et al., 2009; Yajima et al., 1999). Therefore, such intronic L1 insertions might well predispose to complex traits in humans. There are also examples of exonic mobile element insertions causing genetic diseases by germline or early embryonic integration (Kazazian et al., 1988; Van de Water et al., 1998) and transforming mutations in cancer by somatic insertion (Miki et al., 1992). It is unknown whether the relative rarity of such reports reflects mobile element quiescence in the context of effective host defense mechanisms or systematic biases against their discovery. Even in a healthy individual, we still do not know the number of mobile element copies of each type present or the frequencies of these alleles in the general population. For all these reasons, we developed an effective general tool, TIP-chip, for genome-wide discovery of human insertion polymorphisms.
We describe here primarily our findings using TIP-chip to identify polymorphic L1 insertions. Most ISV families (L1, Alu, and SVA) continue to accumulate in our genomes through the activity of L1, specifically, the youngest L1 family, L1PA1, also known as the “transcribed L1, subset a” or L1(Ta)s (Skowronski et al., 1988). In vitro assays unequivocally demonstrate the retrotransposition capacity of full-length (6 kb) intact L1(Ta)s (Brouha et al., 2003; Moran et al., 1996); nearly all known human L1 polymorphisms are related to L1(Ta) insertions.
Our approach for identifying L1(Ta) ISV depends on a ligation-mediated PCR (Arnold and Hodgson, 1991; Wheelan et al., 2006). In this method, partially complementary oligonucleotides (vectorettes) are ligated to restriction enzyme (RE) digested genomic DNA. This requires first strand PCR priming from known (transposon end) sequence. The 3′ terminus of the first strand primer hybridizes to a three base pair sequence unique to the L1(Ta) subset (Skowronski et al., 1988). In subsequent cycles, the 3′ end of the first strand pairs with a second primer allowing exponential amplification. The resulting amplicons include the extreme 3′ end of the L1(Ta) and unique downstream DNA sequence. The amplicon mixture is fluorescently labeled and hybridized to tiling microarrays (Figure 1). TIP-chip data consist of small numbers of high intensity probes (Figure S1 available online) recognizable as peaks formed by contiguous probes when associated with corresponding genomic locations. Multiple PCR templates are generated for each genomic L1(Ta) by parallel RE digests prior to vectorette ligation. The combination of REs used maximizes genomic coverage; an insertion lies within 1–5 kb of at least one 3′ RE site in approximately 92% of the genome (Experimental Procedures). The interprobe distance on the tiling arrays used is such that for approximately 90.5% of the genome, there are three probes (average of 7) within 3 kb of an insertion. Since sequences closer to the 3′ end of an insertion are included in more RE fragments, and shorter templates amplify better, there is an inverse relationship between probe fluorescence intensity and distance from the L1(Ta). Thus, peak shapes reflect both insertion position and orientation.
To evaluate the general applicability of TIP-chip to map other human mobile elements, we designed primers specific to SINEs that have been recently expanding in humans through the activity of L1(Ta), AluYa5/8 families and AluYb8/9 families, and a family of autonomous endogenous retrovirus, Hs_a HERV-K (Experimental Procedures). In all three cases we were able to detect numerous insertions of those types included in the March 2006 human reference sequence (hs_ref [hg18] NCBI Build 36.1) (Figure S1). HERV-K TIP-chip showed relatively lower numbers of insertions and levels of polymorphism, though we were able to discover nonreference LTRs. Alu insertions in contrast are abundant and highly polymorphic. In addition to intergenic and intronic Alus, we discovered a polymorphic exonic AluYb in the complement component 7 (C7) locus in the first sample we analyzed (Xing et al., 2009). Thus, TIP-chip appears a robust method for identifying insertions of a wide variety of transposable element types. For the remainder of the paper, we focus on our experience with L1(Ta) detection by TIP-chip.
To test L1(Ta) TIP-chip utility and reproducibility, we used 385K feature X chromosome genomic tiling arrays. As expected, numerous peak positions corresponded to one of the 38 known L1(Ta)s exactly matching our forward primer sequence in the hs_ref (Figure 2). In a family of 4, we identified 28 peaks reflecting reference L1(Ta)s, and correctly identified orientation in 84% of these. No non-hs_ref L1(Ta) insertions cataloged in the database of human retrotransposon insertion polymorphisms (dbRIP) (Wang et al., 2006) or included in alternate genome assemblies (Levy et al., 2007; Venter et al., 2001) were detected (Table S1). Importantly, however, this experiment showed 6 previously unknown L1(Ta)s, which were verified by 3′ junction PCR analyses (Table S1). Of the 34 L1(Ta)s seen in this family, 13 were polymorphic. All showed sex-linked inheritance.
We tested whether TIP-chip could comprehensively map L1(Ta)s on a whole-genome tiling microarray (four 2.1M feature arrays). For data analysis, we developed a Hidden Markov Model (HMM) for recognizing characteristically asymmetric peaks and imposed a multivariate cutoff algorithm for retaining peaks (Experimental Procedures). Figure 3A illustrates the distribution of L1(Ta) peaks in peripheral blood leukocyte (PBL) DNA from a healthy individual (sample 1); data from other representative samples are included in Figure S2. In these examples, we recognize a range of total peak numbers in excess of Ne, the expected number of different L1(Ta) alleles per diploid human genome (515; see Experimental Procedures), and impose a cutoff based on this value. In the sample illustrated in Figure 3A, we retained 514 peaks, 323 of which correspond to reference L1(Ta)s. Of the 191 candidate non-hs_ref L1(Ta)s identified in the sample, 49 were in dbRIP (Wang et al., 2006) or included in the alternate genome assemblies, and 3 were confirmed by data in Beck et al. (2010) [this issue of Cell]. We attempted to verify the 139 others by site-specific PCR crossing the 3′ junction of the L1 or spanning the insertion (Table S1) and recovered amplicons consistent with 91 insertions. Of a sequenced subset, 22 were sequence-verified, a recovery which allowed us to estimate that 56 reflect true L1 insertions. Thus, of novel peaks retained by the cutoff algorithm, 108 appear to represent true insertions verifiable by data mining or PCR validation, for an overall assay positive predictive value of 84%. Additionally, in this sample, we were able to sequence verify an additional seven insertions among peaks that did not meet the cutoff. Thus, cutoff criteria can be relaxed to maximize new L1(Ta) discovery, but are kept close to Ne here to conservatively reflect the expected number of true positives.
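The peak accounting for sample 1 can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the variable names are ours, and it assumes that reference peaks are counted as true positives when the overall positive predictive value is computed, which is the reading that reproduces the reported 84%.

```python
# Counts quoted in the text for sample 1 (whole-genome TIP-chip).
retained_peaks = 514      # peaks retained by the cutoff algorithm
reference_peaks = 323     # retained peaks matching reference L1(Ta)s
in_databases = 49         # in dbRIP or the alternate genome assemblies
confirmed_by_beck = 3     # confirmed by data in Beck et al. (2010)
pcr_estimated_true = 56   # estimated true insertions among PCR-tested peaks

candidate_nonref = retained_peaks - reference_peaks
pcr_tested = candidate_nonref - in_databases - confirmed_by_beck
novel_true = in_databases + confirmed_by_beck + pcr_estimated_true

# Assumption: reference peaks count as true positives in the overall PPV.
ppv = (reference_peaks + novel_true) / retained_peaks

print(candidate_nonref, pcr_tested, novel_true, f"{ppv:.1%}")
```

Under that assumption the 191 candidates, the 139 PCR-tested peaks, the 108 supported novel insertions, and the 84% figure are mutually consistent.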
Identification of L1(Ta)s included in hs_ref serves as a quality metric; most high quality TIP-chip data sets identify about 300 of 460 perfect matches to our L1(Ta) primer present in the reference genome. This value is comparable to numbers of reference L1(Ta)s included in the alternate genome assemblies (Figure 3B). Unidentified reference L1(Ta)s can be ascribed to polymorphic insertions absent from an individual (true negatives) and undetected L1(Ta)s [false negatives, e.g., due to < 3 probes in the 1 kilobase downstream of the L1(Ta) 3′ end]. In whole-genome TIP-chip studies of 15 unrelated individuals, there are 56 reference L1(Ta)s undetected in any individual; 47 fall in this ‘probe poor’ category. Forty of these lie in repeat-rich regions [>900bp of the 1 kilobase following the L1(Ta) 3′ end are annotated by RepeatMasker (Smit et al., 2004) (Figure S3)]. For the remaining 9 insertions, insertion allele frequencies are reported for five of them in dbRIP, with four sufficiently infrequent that their absence in this sample set is expected (insertion allele frequencies 0.019–0.051). Twelve ‘probe poor’ reference L1(Ta)s are found on the X chromosome. Of these, 9 were detected on the 385K chromosome X array platform, indicating that detection difficulty on the whole-genome array does not reflect failure to amplify these sequences and could be solved by improved probe content (Figure S3).
We compared L1(Ta) identification by TIP-chip directly with assembled whole-genome sequencing data for Hs_alt_huref (Venter) DNA. Xing et al. (2009) found 49 nonreference Hs_alt_huref insertions by analyzing indel-containing contigs. We found 40 more in the Hs_alt_huref assembly deposited at NCBI and an additional 32 sequence-verified insertions by TIP-chip (Table S1).
To assess whole-genome TIP-chip reproducibility and address the hypothesis that L1(Ta) insertions commonly occur in early stages of human embryonic development so as to create significant somatic mosaicism (Kano et al., 2009; van den Hurk et al., 2007), we performed whole-genome TIP-chip analysis on PBLs (4 paired samples) or lymphoblastoid lines (1 paired sample) of 5 phenotypically discordant monozygotic twin sets. We find high agreement between L1(Ta) TIP-chip peaks in comparing these samples (Figure 4). No peak discrepancies (i.e., peak presence versus absence) were found in pairwise comparisons. We attempted PCRs at 89 peak positions showing differences in peak height between twins and discovered no insertions unique to one individual in a twin pair.
TIP-chip enables assessment of L1(Ta) genotypes in numerous samples and thus is useful for determining population-based allele frequencies. Given this, we can estimate the average allele frequency of the L1(Ta) complement present in any individual, a parameter we call i. We determined i using two independent methods, the first based on TIP-chip as the sole means of genotyping 75 males for X chromosome insertions and the second based on whole-genome analyses (below). On the X chromosome, 161 high-scoring peaks served as the basis for the i calculation. Of these, 33 correspond to L1(Ta)s in hs_ref, and extensive validation PCRs for 10 samples on this array platform indicate a positive predictive value of 80.5% for non-hs_ref insertions (Figure S4). Nonreference L1(Ta)s showed an extremely broad range of allele frequencies (0.013 to 0.987, average 0.58; Figure 5A, Table S2). The average L1(Ta) allele frequency, i, was determined to be 0.75 (Figure 5B). This parameter defines genome-wide variation of this class of ISV.
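To make the definition of i concrete, the following sketch computes hemizygous allele frequencies and the per-individual average from a presence/absence table of X chromosome peaks. The function names and the four-sample toy data are illustrative assumptions, not the study's data or software.

```python
from statistics import mean

def allele_frequencies(calls):
    """calls maps insertion id -> list of 0/1 peak calls, one per male.

    In hemizygous males, allele frequency is the fraction of samples
    showing a peak at that position.
    """
    return {ins: sum(c) / len(c) for ins, c in calls.items()}

def average_i(calls):
    """Average, over individuals, of the mean allele frequency of the
    insertions each individual carries."""
    freqs = allele_frequencies(calls)
    n_samples = len(next(iter(calls.values())))
    per_individual = []
    for s in range(n_samples):
        carried = [freqs[ins] for ins, c in calls.items() if c[s]]
        if carried:  # skip a sample with no peaks at all
            per_individual.append(mean(carried))
    return mean(per_individual)

# Hypothetical toy data: three insertions genotyped in four males.
toy_calls = {
    "ins_A": [1, 1, 1, 1],  # fixed
    "ins_B": [1, 1, 0, 0],  # common
    "ins_C": [0, 0, 0, 1],  # rare
}
i_value = average_i(toy_calls)
print(allele_frequencies(toy_calls), i_value)
```

Because any one individual is more likely to carry common insertions than rare ones, the per-individual average in the toy data (about 0.78) exceeds the plain mean across insertions (about 0.58), which mirrors the distinction the text draws between i and the population-wide spectrum of allele frequencies.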
We also evaluated new insertion discovery rates and L1(Ta) insertion characteristics on the X chromosome. Discovery rates for potential novel L1(Ta)s were highest in the first 10 samples, reflecting high frequency polymorphisms, and thereafter decreased with sample number, although low allele frequency insertions continued to be found at ~0.8 insertions per sample throughout (samples 11–75; Figure S4). Insertion-spanning PCRs were designed to estimate L1(Ta) length for 16 novel insertions, of which 7 (44%) appear full-length (FL; 6 kb). In combination with reference L1(Ta)s detected (n = 49), allele frequency and L1(Ta) size were uncorrelated (p = 0.11), concordant with earlier work (Boissinot et al., 2004). Similarly, allele frequencies were similar in intergenic (0.76, sd 0.33) and intragenic (0.67, sd 0.38) L1(Ta)s (p = 0.41). When considered with hs_ref L1(Ta) insertion lengths on the X chromosome and compared to autosomal reference insertions, we found more FL L1(Ta)s on chromosome X, but the difference was not statistically significant (37.9% FL on X; 26.2% FL on autosomes, sd = 18.2%).
Our second assessment of i, based on whole-genome data, reflects a weighted average of reference and non-hs_ref L1(Ta) allele frequencies in proportions reflective of one individual's genome. Whole-genome TIP-chip data for 15 unrelated individuals provided an allele frequency for the reference L1(Ta)s (average = 0.94). The non-hs_ref L1(Ta) allele frequency was based on: (1) non-hs_ref L1(Ta)s with allele frequency data in dbRIP (Myers et al., 2002) (a.f. range 0.15–0.83; average 0.47) and (2) genotyping panels of individuals for 8 novel L1(Ta)s found in sample 1 (a.f. 0–0.82, average 0.38, n = 196; Figure 6). Genotype distributions for these insertions departed in varying degrees from Hardy-Weinberg equilibrium values, as expected for studies of heterogeneous ethnic populations. Even low frequency insertions were found in multiple ethnic groups; one was absent from African individuals. These data give an estimate for i of 0.83. Thus, the two values of i are in relatively good agreement. For the following sections, we used the X chromosome-derived value of 0.75 as it used the highest quality array platform, was hemizygous-based, and surveyed the largest population.
The chromosomal distribution of TIP-chip peaks largely reflects chromosome size (Figures S5A and S5B). A 2-fold enrichment on the X chromosome for L1(Ta) elements is observed across the 3 published haploid sequence assemblies, consistent with the elevated overall density of older L1s on the X (Bailey et al., 2000). The tendency of L1(Ta)s to accumulate in AT-rich regions has been described previously (Gasior et al., 2006). GC content analysis of genomic intervals surrounding candidate and verified novel L1(Ta)s found by TIP-chip confirms this observation (Figure S5C).
Although genes are enriched in GC-rich genomic intervals, we identified many L1(Ta)s within genes. Intragenic sequences comprise 41% of the genome (1% exons and 40% introns; UCSC known genes definitions). In the genome-wide L1(Ta) TIP-chip survey in Figure 3A, 201 (39%) reflect intronic insertions, and 313 (61%) are intergenic. No exonic insertions of L1(Ta) were identified; however we did find an exonic Alu insertion by TIP-chip (above).
Because intronic insertions can affect gene function, we evaluated intragenic insertions further. Gene category studies were based on distribution of associated molecular functions, biological processes, and pathways as annotated in PANTHER (Thomas et al., 2003). Intragenic insertions were most frequently in genes categorized as “unclassified” molecular function and/or process, though neither group was overrepresented statistically. Also of note, no L1(Ta)s (or Alus) were identified in the four homeobox gene clusters, HOXA, HOXB, HOXC and HOXD, a megabase region relatively devoid of interspersed repeats (Lander et al., 2001). Reference and candidate L1(Ta)s inferred from peaks within or near (<5 kb) genes were enriched in antisense orientation relative to target gene (p < 0.0001, Figures S6A and S6B).
To identify possible genetic etiologies for X-linked disease, insertions were profiled on the X chromosome in 10 males with unexplained muscular dystrophy or X-linked cardiomyopathy and 69 unrelated male probands with presumptively X-linked intellectual disability. No novel dystrophin insertions were seen in the first group. However, several novel L1(Ta) X chromosome insertions were discovered in the latter cohort; 6 were low frequency insertions based on genotyping (Table S4). Three insertions were “private” (unique to the proband) based on screening ~600 individuals of ethnically diverse backgrounds.
Two L1(Ta) low-frequency alleles are shown in Figure 7 (see also Figure S6). Each is intronic and in antisense orientation relative to the gene; one is located in the Nance-Horan syndrome (NHS) gene, the other in DACH2. NHS is caused by protein-truncating mutations and characterized by congenital ocular anomalies and partially penetrant intellectual disability. The L1(Ta) insertion is a 206 bp sequence in the first intron. This allele was found in 5 of 361 control males (allele frequency 1.38%) without intellectual disability, so its clinical significance is unclear. The insertion in the DACH2 locus is private; it consists of a 368 bp L1 sequence located in the second intron of DACH2 and is accompanied by a 4 bp target deletion. The DACH2 Drosophila ortholog dachshund regulates neuronal differentiation (Martini et al., 2000). Mammalian Dach2 is highly expressed in fetal brain relative to other tissues (Kent et al., 2002), and mapping studies have implicated it as a potential locus for intellectual disability. Though functional effects of this intronic insertion are as yet unknown, it illustrates how L1(Ta) mapping can identify infrequent or private insertions meriting further study in the context of disease.
L1s and other ISVs reflect an important source of human genetic diversity. They are understudied because conventional genomic approaches generally exclude high copy number, large repeats. In silico studies mining human genome sequencing data for novel L1(Ta) insertions and their characterization in demographically diverse samples have provided important insights into L1(Ta) activity (Bennett et al., 2004; Boissinot et al., 2000; Konkel et al., 2007; Myers et al., 2002; Witherspoon et al., 2006; Xing et al., 2009). Thus far, this approach has limited novel L1(Ta) discovery to relatively few individuals and/or loci. As an alternative, several one-sided PCR methods have been described to clone insertion sites of L1(Ta) elements, but these have not readily lent themselves to high-throughput L1(Ta) mapping. Thus reliable identification of infrequent or somatic insertions has been untenable, and even common insertions are poorly characterized. Similarly, direct measures of ongoing L1(Ta) activity have been difficult to accomplish experimentally.
In our assessment, TIP-chip represents the first method to comprehensively and quickly map retroelement insertions. Using TIP-chip, we discovered numerous novel L1(Ta) insertion alleles, including high frequency alleles, in many demographics. In a typical individual, TIP-chip identifies over 100 L1(Ta) insertions absent from hs_ref. In Hs_alt_huref DNA, the method was able to detect 32 novel insertions not incorporated into the assembly of whole-genome shotgun sequencing reads. These findings underscore the incompleteness of reference genome assemblies with respect to ISVs.
In papers submitted in parallel, Beck et al. (2010) and Iskow et al. (2010) used fosmid end-sequence mapping and deep sequencing approaches to generate genome-wide L1(Ta) insertion datasets. All the methods have advantages and disadvantages. Primary advantages of the fosmid method include its unbiased ability to identify large indels, its utility to detect insertions in repetitive DNA, and its low false positive rate. Its main disadvantages are that it is low throughput and cannot identify small insertions; many L1(Ta) insertions are < 1 kb. Short-read deep sequencing approaches can detect precise insertion positions, a major advantage. Challenges include optimizing amplicon sizes and sequencing coverage to allow multiplex runs and thereby reduce cost per sample. As with TIP-chip, insertions in repetitive regions are difficult or impossible to map, though we have shown that this disadvantage of TIP-chip can be mitigated in part by increasing the length of the vectorette PCR amplicons at no additional cost and by improving probe and array design. TIP-chip is the fastest and, we believe, the most cost-effective method today. It is also especially valuable when describing polymorphisms in specific genomic regions is desirable (e.g., single chromosomes, candidate gene loci), as these can be easily tiled on small custom arrays and run at low cost on many samples. Moreover, once more complete maps of transposon insertions are available, small but genome-wide transposon genotyping arrays can be designed for association studies. Finally, TIP-chip effectively detects many types of ISVs, including SINEs, and the two-color platform allows two element types to be distinguished on one array.
We have re-examined several properties of L1(Ta)s with the most comprehensive data set now available. While quality metrics have varied between and within these multiarray runs, we have no evidence that total L1(Ta) burden varies substantially between individuals. The chromosomal distribution of TIP-chip peaks largely reflects chromosome size, and shows a modest albeit not statistically significant enrichment of L1(Ta)s on chromosome 4, like the distribution of L1(Ta)s in hs_ref (Figures S5A and S5B). A 2-fold enrichment on the X chromosome for L1(Ta) elements is observed across the published haploid sequence assemblies, consistent with elevated overall density of older L1s on the X (Bailey et al., 2000).
We also observe a predilection for L1(Ta)s to accumulate in AT-rich regions, reflecting the mechanism of ORF2p-mediated insertion and/or selection against insertions in proximity to genes (Gasior et al., 2007). Thus far, we have found verified novel insertions only in intergenic or intronic regions; no exonic L1(Ta) insertions (or insertions otherwise obviously deleterious to gene function) were observed. These findings are consistent with prior in silico analyses of polymorphic L1(Ta) integrations, but contrast with studies of Alu insertions, which are seen frequently in proximity to genes and occasionally in exons (Xing et al., 2009), perhaps providing selective advantage (Lander et al., 2001). In genome-wide mapping of AluYa5/8 and AluYb8/9 insertions by TIP-chip in a single sample, we observed an exonic insertion, and we expect that features of exonic sequence (GC content and uniqueness) will make for especially effective probe coverage and high quality TIP-chip peaks in these areas. Of L1(Ta) elements inserted within or near (<5000 bp) genes, we noticed a statistically significant enrichment for antisense orientation, whether considering reference L1(Ta)s alone or all candidate L1(Ta) insertions identified by TIP-chip. These results and other analyses (Figures S6A and S6B) suggest that L1(Ta)s inserted in antisense orientation relative to host genes are less deleterious overall, consistent with the hypothesis that sense insertions can lead to polymerase elongation defects and/or premature polyadenylation (Han et al., 2004; Perepelitsa-Belancio and Deininger, 2003). Presumably, such a bias against sense insertions is more obvious in reference L1(Ta)s and L1(pre-Ta)s (older elements), due to increased selection time. Mechanisms for target gene dysregulation by L1(Ta)s in both orientations have been posited, however (Belancio et al., 2008; Han and Boeke, 2005; Han et al., 2004; Speek, 2001; Wheelan et al., 2005).
We have gained insights into the prevalence of polymorphic L1(Ta)s by performing X chromosome directed screens in large numbers of males and by genome-wide TIP-chip L1(Ta) discovery followed by genotyping human genetic diversity panels by site-specific PCR. Our X chromosome data suggest that across all L1(Ta) insertions in one human, the average insertion allele frequency is about 0.75. Many novel insertions we describe in this study show high allele frequencies across different populations. This suggests that, despite the status of various human genome projects, we are in the early phases of describing these important ISVs. Additionally, we found many uncommon alleles, some of which are likely private insertions unique to a limited kindred or individual.
The sheer quantity and low allele frequency of many novel insertions described suggest L1(Ta)s remain highly active in modern humans. Indeed, TIP-chip data provide an experimental basis for revisiting estimates of L1 activity (i.e., occurrences of de novo insertions in the general population). By comparing the Hs_alt_huref L1(Ta) profile as discovered with TIP-chip and in silico analysis to the hs_ref profile, we revise the current estimate of L1(Ta) insertion rate from 1 insertion in every 225 births to approximately 1 in 108 (Experimental Procedures). This number is a conservative estimate, as we have not exhaustively PCR verified TIP-chip peaks in this sample and excluded many peaks from consideration. That we readily identified one low-frequency insertion absent from African individuals in one sample and three potential private insertions in a single chromosome study of a clinical cohort (see below) also suggests that L1(Ta) activity, and the LINE and SINE ISVs it enables, may have been previously underestimated.
Finally, although TIP-chip can be employed for ISV discovery throughout the entire genome, the method has the unique advantage that it can be used to efficiently characterize relatively rare insertions over narrower intervals in surveying large populations. This feature may make TIP-chip especially useful in clinical genetics. Here we examined X chromosome L1(Ta) sites in 69 males with clinically defined X-linked intellectual disability, and verified 6 novel, relatively uncommon L1(Ta) insertions and 3 private insertions within this group (insertion allele frequencies < 0.0018–0.0025). Three are in or near brain-expressed genes or genes with known roles in central nervous system development. Though the biological effect of these particular intronic L1(Ta) insertions remains uncertain, the study shows how knowledge of L1(Ta) positions can identify candidate risk alleles meriting further study.
In summary, we have developed a high-throughput method, TIP-chip, for mapping an active group of mobile DNAs in humans. We show that the technique is readily generalized to other interspersed repeats. We illustrate initial insights it has provided into L1(Ta) genomic distribution and the dynamics of these repeats in our genomes. Genome-wide TIP-chip studies of several individuals show that L1(Ta)s are extremely polymorphic and an underappreciated type of SV underlying human genetic diversity. Future L1(Ta) and ISV mapping by TIP-chip and similar methods will continue to expand our understanding of human genomic diversity and play an increasingly important role in identifying causes of genetic disease.
Aliquots of high Mr genomic DNA were digested in parallel with six REs (AseI, BspHI, BstYI, HindIII, NcoI, and PstI) chosen by a greedy algorithm to maximize genomic fragments 1–5 kb long. Sticky ends were ligated to vectorette adapters. Vectorette PCR was performed using a touchdown PCR program and ExTaq polymerase (Takara Bio; Shiga, Japan). Amplicons were purified and concentrated using Microcon columns (Millipore; Billerica, MA) and digested with REs to generate smaller fragments. These fragments were labeled with Cy3-dUTP or Cy5-dUTP (Enzo Biochem; New York, NY) using exo− Klenow polymerase-mediated (New England Biolabs; Ipswich, MA) extension from random 9-mers (Stratagene-Agilent Technologies; Santa Clara, CA). After additional clean up and concentration using Microcon columns, labeled amplicon fragments were hybridized to 2.1M feature HD2 whole-genome economy-type microarrays or 385K feature single chromosome arrays (Nimblegen/Roche Applied Science; Madison, WI) according to manufacturer's instructions. Arrays were hybridized in MAUI mixers (Biomicro Systems; Salt Lake City, UT), washed, and scanned using a Genepix (Molecular Devices; Sunnyvale, CA). A detailed description may be found in Supplemental Experimental Procedures.
Probe coordinates and fluorescence intensity values (.gff files) were generated using Nimblescan (Roche Nimblegen). Peaks corresponding to candidate transposon insertion sites were identified by custom L1 Signal Analysis (LISA-map) software (Huang, et al. in preparation, available on request). Peak positions that overlap with the insertions found in the hs_ref genome (referred to as reference peaks) were used as a quality control measurement. The software detects peaks based on a HMM incorporating probe intensity and peak morphology. Peaks are ranked by the sum of the posterior probability of each probe being in a peak. The best cutoff for each sample was determined by systematically varying four different parameters after exclusion of peaks identified as vectorette PCR background. Peaks were removed (i) after the ith reference peak in the ranked list, (ii) if the region showed ‘noisy’ background (variance = j), (iii) if the peak was made up of fewer than k consecutive probes (allowing 1 failed probe within the peak interval), and (iv) if local background intensity (defined by a 40 probe window flanking the peak) was above threshold m. Finally, peaks were reranked based on maximum probe intensity and peaks below the last reference peak were deleted. Cutoff values for each variable were imposed to target a total peak number closest to the expected number of L1(Ta) insertion positions per diploid human, Ne = 515 (see below), while removing the fewest reference L1(Ta) peaks. Reference L1(Ta)s that did not make the cutoff (on average < 12% per sample) were retained in the final list.
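The four-parameter cutoff can be pictured as a short filtering routine. The sketch below is a hypothetical illustration of the steps described above and is not the LISA-map software: the `Peak` fields, the function name, and the direction of the variance comparison (regions with background variance above j are treated as noisy) are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    name: str
    posterior_sum: float   # sum of per-probe posterior probabilities (HMM)
    max_intensity: float   # maximum probe intensity within the peak
    n_probes: int          # consecutive probes making up the peak
    bg_variance: float     # background variance of the region
    bg_intensity: float    # local background in the flanking probe window
    is_reference: bool     # overlaps a known hs_ref L1(Ta)

def apply_cutoff(peaks, i, j, k, m):
    """Apply cutoffs (i)-(iv), rerank, and restore reference peaks."""
    ranked = sorted(peaks, key=lambda p: p.posterior_sum, reverse=True)
    # (i) keep the ranked list only up to the i-th reference peak
    kept, refs_seen = [], 0
    for p in ranked:
        kept.append(p)
        if p.is_reference:
            refs_seen += 1
            if refs_seen == i:
                break
    # (ii) noisy background, (iii) too few probes, (iv) high local background
    kept = [p for p in kept
            if p.bg_variance <= j and p.n_probes >= k and p.bg_intensity <= m]
    # rerank by maximum intensity; drop peaks below the last reference peak
    kept.sort(key=lambda p: p.max_intensity, reverse=True)
    last_ref = max((idx for idx, p in enumerate(kept) if p.is_reference),
                   default=-1)
    kept = kept[:last_ref + 1]
    # reference peaks that missed the cutoff are retained in the final list
    names = {p.name for p in kept}
    return kept + [p for p in peaks if p.is_reference and p.name not in names]

# Hypothetical toy peaks for illustration only.
toy_peaks = [
    Peak("ref1", 9.0, 900, 6, 0.1, 50, True),
    Peak("novel1", 8.0, 800, 5, 0.1, 60, False),
    Peak("noisy", 7.0, 950, 5, 0.9, 60, False),
    Peak("ref2", 6.0, 700, 4, 0.2, 55, True),
    Peak("weak", 5.5, 300, 5, 0.1, 50, False),
    Peak("ref3", 5.0, 650, 2, 0.1, 50, True),
    Peak("late", 1.0, 990, 6, 0.1, 50, False),
]
result = apply_cutoff(toy_peaks, i=3, j=0.5, k=3, m=100)
print([p.name for p in result])
```

In practice the text describes choosing i, j, k, and m per sample by scanning parameter combinations for the one whose retained peak count is closest to Ne while discarding the fewest reference peaks; a loop over candidate values around `apply_cutoff` would be one way to express that search.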
The expected number of different L1(Ta) alleles per diploid human genome (Ne; i.e., true, unique TIP-chip peaks) was estimated assuming that total L1(Ta) number does not vary significantly between individuals. In the three sequenced haploid genome assemblies (hs_ref, hs_alt_HuRef, hs_alt_celera from ftp://ftp.ncbi.nih.gov/genomes/), the L1(Ta) counts are 413, 363, and 460 respectively. We used the average of these values (412) as an estimate of L1(Ta) insertions per haploid genome. Determining the diploid L1(Ta) content then requires an estimate of zygosity, derived from the average allele frequency for L1(Ta)s found in any single individual (i, see text). This value is assumed to be constant and invariant among chromosomes. Allele frequencies of 161 candidate novel L1(Ta) insertions found by chromosome X TIP-chip were defined based on 75 male samples profiled (allele frequency = number of TIP-chip peaks found at that genomic location divided by 75, Figure 5A, Table S2). Then, i was determined for each of the individuals by averaging the allele frequencies for each insertion on their X chromosome; the mean of the 75 i values was 0.75 (0.95 for hs_ref L1(Ta)s; 0.58 for nonreference L1s). Defining this average frequency value on a per individual genome basis is fundamental to both our derivation and application of this estimate. If one instead considers the universe of L1(Ta) insertion allele frequencies, the average allele frequency value would asymptotically approach zero as more people are profiled and rarer and rarer insertion alleles discovered. The product of the number of insertions in a haploid genome times the average allele frequency (412*0.75 = 309) provides the number of expected homozygous insertions. Therefore, the expected number of distinct L1(Ta) alleles per diploid human genome, Ne, is 412*2 – 309 = 515.
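The arithmetic behind Ne can be checked with a short Python sketch. The function names and the toy presence matrix are hypothetical; the haploid counts and the mean i of 0.75 are the values quoted above.

```python
from statistics import mean

# L1(Ta) counts in the three haploid assemblies, as quoted in the text.
HAPLOID_L1TA_COUNTS = {"hs_ref": 413, "hs_alt_HuRef": 363, "hs_alt_celera": 460}


def per_individual_i(presence):
    """Average allele frequency of the insertions carried by each individual.

    `presence` is a hypothetical matrix: one row per individual, one column per
    insertion site, 1 if a TIP-chip peak was called there and 0 otherwise.
    Every individual is assumed to carry at least one insertion.
    """
    n_individuals = len(presence)
    n_sites = len(presence[0])
    # Allele frequency per site = peaks observed at the site / individuals profiled.
    freqs = [sum(row[s] for row in presence) / n_individuals for s in range(n_sites)]
    return [mean(freqs[s] for s in range(n_sites) if row[s]) for row in presence]


def expected_distinct_alleles(n_haploid, i):
    """Ne = 2 * haploid count minus the expected homozygous insertions."""
    homozygous = n_haploid * i
    return 2 * n_haploid - homozygous


n_haploid = mean(HAPLOID_L1TA_COUNTS.values())   # average haploid L1(Ta) count
ne = expected_distinct_alleles(n_haploid, 0.75)  # 0.75 is the mean i from the text
print(n_haploid, ne)

# Toy example: three individuals, two insertion sites.
toy_i = per_individual_i([[1, 1], [1, 0], [1, 0]])
```

Computing i per individual, rather than averaging over every insertion ever observed, is what keeps the estimate from drifting toward zero as rarer alleles are discovered.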
We followed the method described by Xing et al. (2009). These authors used SNPs to estimate divergence of the haploid genome hs_alt_HuRef and the NCBI reference hs_ref build at 18,483 generations. The authors then cataloged nonreference L1s in indel-containing contigs without ‘N’ nucleotides from the diploid Venter genome; their resulting estimate for L1(Ta) new alleles is 1 in 225 individuals (1 in 212 considering all L1). This value is based on both L1(Ta) retrotransposition events and establishment of homozygosity. Our group analyzed the haploid assembly (hs_alt_HuRef) of the Venter genome at NCBI and identified additional L1(Ta)-containing reads by searching for exact matches to the primer used in our vectorette PCR. In addition, TIP-chip identified 32 more that were subsequently verified by sequencing. This sums to 121 nonreference L1(Ta) insertions in the hs_alt_HuRef genome, a value higher than previously recognized. Using the nonreference i derived above, 0.58, we estimated the ratio of homozygous to heterozygous insertions at 41:59 [i^2 : 2*i*(1-i)], giving a total number of non-hs_ref L1(Ta) insertions of 85 in the haploid genome. This provides a basis for revising the estimate of L1(Ta) insertions upwards to one insertion per 108 individuals.
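The numbers in this estimate can be traced with a small Python sketch. The homozygous-to-heterozygous ratio follows directly from the formula given above. The remaining two steps are not spelled out in the text, so the code encodes one reading that lands close to the reported figures: it assumes a haploid assembly captures every homozygous insertion and half of the heterozygous ones, and that the rate per individual is the divergence time divided by twice the per-haploid insertion count. Function names are hypothetical.

```python
def hom_het_fractions(i):
    """Split insertions into homozygous and heterozygous fractions from i."""
    hom = i ** 2
    het = 2 * i * (1 - i)
    total = hom + het
    return hom / total, het / total


def per_haploid_insertions(observed_in_haploid_assembly, i):
    """Nonreference insertions per haploid genome.

    Assumption: the haploid assembly contains all homozygous insertions and,
    on average, half of the heterozygous ones.
    """
    hom, het = hom_het_fractions(i)
    per_diploid = observed_in_haploid_assembly / (hom + het / 2)
    return per_diploid / 2


def individuals_per_new_insertion(generations, per_haploid):
    """Assumption: each individual contributes two haploid genomes per generation."""
    return generations / (2 * per_haploid)


hom, het = hom_het_fractions(0.58)
n_hap = per_haploid_insertions(121, 0.58)
rate = individuals_per_new_insertion(18483, 85)

print(round(100 * hom), round(100 * het))  # homozygous : heterozygous, in percent
print(n_hap, rate)
```

Under these assumptions the sketch gives roughly 86 insertions per haploid genome and roughly one new insertion per 109 individuals, in line with the reported 85 and 108 once fractional values are truncated.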
Supported in part by NIH grants P01-CA16319 and RC1 HG005359 and grants from the Brain Science Institute at Johns Hopkins University School of Medicine and the Goldhirsh Foundation (J.D.B.), and NIH grant K08-CA134746 and a Career Award for Medical Scientists from the Burroughs Wellcome Foundation (K.H.B.). We thank Jon Alder, Joe Costello, Lisa Scheifele, Daniel Yuan, Syntyche Walker, Ed Davis, Kate O'Donnell, Lixin Dai, Wengfeng An, and Christina Schrum for helpful discussions and Audrey Hendley, Naera El-Sharkawy, Lisa Scheifele, and Daniel Yuan for technical assistance. We thank Robert B. Weiss and Kevin M. Flanigan for X-linked dilated cardiomyopathy and Becker muscular dystrophy patient samples; Cindy Skinner, Cassandra Obie, and Abby Adamczyk for assistance in providing X-linked intellectual disability genomic DNA samples; and Pei-Lung Chen, Darci Ferrer, Sarah E. Ritter, and Gary Cutting for familial and twin genomic DNA. Finally, we thank Bang Wong at ClearScience and Cheng Lai Victor Huang for assistance with artwork.
Supplemental Information: Supplemental Information includes Extended Experimental Procedures, four tables, six figures, and Supplemental References and can be found with this article online at doi:10.1016/j.cell.2010.05.026.