L1s and other ISVs reflect an important source of human genetic diversity. They are understudied because conventional genomic approaches generally exclude high copy number, large repeats.
In silico studies mining human genome sequencing data for novel L1(Ta) insertions and their characterization in demographically diverse samples have provided important insights into L1(Ta) activity (
Bennett et al., 2004;
Boissinot et al., 2000;
Konkel et al., 2007;
Myers et al., 2002;
Witherspoon et al., 2006;
Xing et al., 2009). Thus far, this approach has limited novel L1(Ta) discovery to relatively few individuals and/or loci. As an alternative, several one-sided PCR methods have been described to clone insertion sites of L1(Ta) elements, but these have not readily lent themselves to high-throughput L1(Ta) mapping. Thus, reliable identification of infrequent or somatic insertions has been untenable, and even common insertions are poorly characterized. Similarly, direct measures of ongoing L1(Ta) activity have been difficult to accomplish experimentally.
In our assessment, TIP-chip represents the first method to comprehensively and quickly map retroelement insertions. Using TIP-chip, we discovered numerous novel L1(Ta) insertion alleles, including high-frequency alleles, in demographically diverse samples. In a typical individual, TIP-chip identifies over 100 L1(Ta) insertions absent from hs_ref. In Hs_alt_huref DNA, the method detected 32 novel insertions not incorporated into the assembly of whole-genome shotgun sequencing reads. These findings underscore the incompleteness of reference genome assemblies with respect to ISVs.
In papers submitted in parallel,
Beck et al. (2010) and
Iskow et al. (2010) used fosmid end-sequence mapping and deep sequencing approaches to generate genome-wide L1(Ta) insertion datasets. All the methods have advantages and disadvantages. Primary advantages of the fosmid method include its unbiased ability to identify large indels, its utility in detecting insertions in repetitive DNA, and its low false positive rate. Its main disadvantages are that it is low throughput and cannot identify small insertions; many L1(Ta) insertions are < 1 kb. Short-read deep sequencing approaches can detect precise insertion positions, a major advantage. Challenges include optimizing amplicon sizes and sequencing coverage to allow multiplex runs and thereby reduce cost per sample. As with TIP-chip, insertions in repetitive regions are difficult or impossible to map, though we have shown that this disadvantage of TIP-chip can be mitigated in part by increasing the length of the vectorette PCR amplicons, at no additional cost, and by improving probe and array design. TIP-chip is the fastest and, we believe, the most cost-effective method today. It is also especially valuable when describing polymorphisms in specific genomic regions is desirable (e.g., single chromosomes, candidate gene loci), as these can be easily tiled on small custom arrays and run at low cost on many samples. Moreover, once more complete maps of transposon insertions are available, small but genome-wide transposon genotyping arrays can be designed for association studies. Finally, TIP-chip effectively detects many types of ISVs, including SINEs, and the two-color platform allows two element types to be distinguished on one array.
We have re-examined several properties of L1(Ta)s with the most comprehensive data set now available. While quality metrics have varied between and within these multiarray runs, we have no evidence that total L1(Ta) burden varies substantially between individuals. The chromosomal distribution of TIP-chip peaks largely reflects chromosome size, and shows a modest albeit not statistically significant enrichment of L1(Ta)s on chromosome 4, like the distribution of L1(Ta)s in hs_ref (
Figures S5A and S5B). A 2-fold enrichment on the X chromosome for L1(Ta) elements is observed across the published haploid sequence assemblies, consistent with elevated overall density of older L1s on the X (
Bailey et al., 2000).
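The comparison of insertion counts with chromosome size can be illustrated with a minimal sketch. The chromosome names, lengths, and counts below are hypothetical placeholders rather than data from this study, and the function name is ours. The sketch computes, for each chromosome, the ratio of observed insertions to the number expected if insertions were distributed in proportion to chromosome length; a ratio near 2 would correspond to a 2-fold enrichment of the kind described for the X chromosome.

```python
def fold_enrichment(counts, lengths):
    """Return the observed/expected insertion ratio for each chromosome.

    Expected counts assume insertions are distributed in proportion
    to chromosome length (an assumption made for this illustration).
    """
    total_count = sum(counts.values())
    total_length = sum(lengths[chrom] for chrom in counts)
    ratios = {}
    for chrom, observed in counts.items():
        expected = total_count * lengths[chrom] / total_length
        ratios[chrom] = observed / expected
    return ratios


# Hypothetical values for illustration only (lengths in Mb).
example_lengths = {"chrA": 200.0, "chrB": 100.0, "chrC": 100.0}
example_counts = {"chrA": 40, "chrB": 20, "chrC": 40}
example_ratios = fold_enrichment(example_counts, example_lengths)
```

A formal assessment would additionally require a significance test and a correction for the mappable fraction of each chromosome, neither of which is attempted here.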
We also observe a predilection for L1(Ta)s to accumulate in AT-rich regions, reflecting the mechanism of ORF2p-mediated insertion and/or selection against insertions in proximity to genes (
Gasior et al., 2007). Thus far, we have found verified novel insertions only in intergenic or intronic regions; no exonic L1(Ta) insertions (or insertions otherwise obviously deleterious to gene function) were observed. These findings are consistent with prior
in silico analyses of polymorphic L1(Ta) integrations, but contrast with studies of
Alu insertions which are seen frequently in proximity to genes and occasionally in exons (
Xing et al., 2009), perhaps providing selective advantage (
Lander et al., 2001). In genome-wide mapping of
AluYa5/8 and
AluYb8/9 insertions by TIP-chip in a single sample, we observed an exonic insertion, and we expect that features of exonic sequence (GC content and uniqueness) will make for especially effective probe coverage and high-quality TIP-chip peaks in these areas. Of L1(Ta) elements inserted within or near (<5,000 bp) genes, we noticed a statistically significant enrichment for antisense orientation, whether considering reference L1(Ta)s or all candidate L1(Ta) insertions identified by TIP-chip. These results and other analyses (
Figures S6A and S6B) suggest that L1(Ta)s inserted in antisense orientation relative to host genes are less deleterious overall, consistent with the hypothesis that sense insertions can lead to polymerase elongation defects and/or premature polyadenylation (
Han et al., 2004;
Perepelitsa-Belancio and Deininger, 2003). Presumably, such a bias against sense insertions is more obvious in reference L1(Ta)s and L1(pre-Ta)s (older elements), due to increased selection time. Mechanisms for target gene dysregulation by L1(Ta)s in both orientations have been posited, however (
Belancio et al., 2008;
Han and Boeke, 2005;
Han et al., 2004;
Speek, 2001;
Wheelan et al., 2005).
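One way to assess an orientation bias of this kind is an exact one-sided binomial test, sketched below. This is an illustration under stated assumptions, not necessarily the test used in this study: the null hypothesis that sense and antisense insertions are equally likely (p = 0.5), the function name, and the counts are all hypothetical.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """Return P(X >= k) for X ~ Binomial(n, p), an exact one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts: 70 antisense insertions among 100 genic insertions.
antisense, total = 70, 100
p_value = binomial_tail(antisense, total)
```

A small p-value would indicate more antisense insertions than expected if orientation were random with respect to the host gene.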
We have gained insights into the prevalence of polymorphic L1(Ta)s by performing X chromosome-directed screens in large numbers of males and by genome-wide TIP-chip L1(Ta) discovery followed by genotyping of human genetic diversity panels by site-specific PCR. Our X chromosome data suggest that across all L1(Ta) insertions in one human, the average insertion allele frequency is about 0.75. Many novel insertions we describe in this study show high allele frequencies across different populations. This suggests that, despite the status of various human genome projects, we are in the early phases of describing these important ISVs. Additionally, we found many uncommon alleles, some of which are likely private insertions unique to a limited kindred or individual.
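Because males carry a single X chromosome, X-linked insertion allele frequencies can be read directly from presence/absence calls. The sketch below shows one plausible way to compute per-site frequencies and the average frequency of the insertions carried by one individual. The site names, genotype calls, and function names are hypothetical and are not taken from this study.

```python
def allele_frequencies(genotypes):
    """Per-site insertion allele frequency from hemizygous male X genotypes.

    `genotypes` maps a site name to a list of 0/1 calls (1 = insertion
    present), one call per male. Each male contributes one X chromosome,
    so frequency = carriers / males typed.
    """
    return {site: sum(calls) / len(calls) for site, calls in genotypes.items()}


def mean_frequency_in_individual(genotypes, index):
    """Mean population frequency of the insertions carried by one male."""
    freqs = allele_frequencies(genotypes)
    carried = [freqs[site] for site, calls in genotypes.items() if calls[index] == 1]
    return sum(carried) / len(carried) if carried else None


# Hypothetical calls for three sites typed in four males.
example_genotypes = {
    "siteA": [1, 1, 1, 1],
    "siteB": [1, 1, 0, 0],
    "siteC": [0, 0, 0, 1],
}
example_mean = mean_frequency_in_individual(example_genotypes, 3)
```

Note that an average taken over the insertions present in one individual is weighted toward common alleles, since an individual is more likely to carry a high-frequency insertion than a rare one.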
The sheer quantity and low allele frequency of many novel insertions described suggest L1(Ta)s remain highly active in modern humans. Indeed, TIP-chip data provide an experimental basis for revisiting estimates of L1 activity (i.e., occurrences of de novo insertions in the general population). By comparing the Hs_alt_huref L1(Ta) profile, as discovered with TIP-chip and in silico analysis, to the hs_ref profile, we revise the current estimate of L1(Ta) insertion rate from 1 insertion in every 225 births to approximately 1 in 108 (Experimental Procedures). This number is a conservative estimate, as we have not exhaustively PCR-verified TIP-chip peaks in this sample and excluded many peaks from consideration. That we readily identified one low-frequency insertion absent from African individuals in one sample and three potential private insertions in a single-chromosome study of a clinical cohort (see below) also suggests that L1(Ta) activity, and the LINE and SINE ISVs it enables, may have been previously underestimated.
Finally, although TIP-chip can be employed for ISV discovery throughout the entire genome, the method has the unique advantage that it can be used to efficiently characterize relatively rare insertions over narrower intervals when surveying large populations. This feature may make TIP-chip especially useful in clinical genetics. Here we examined X chromosome L1(Ta) sites in 69 males with clinically defined X-linked intellectual disability and verified 6 novel, relatively uncommon L1(Ta) insertions and 3 private insertions within this group (insertion allele frequencies < 0.0018–0.0025). Three are in or near brain-expressed genes or genes with known roles in central nervous system development. Though the biological effect of these particular intronic L1(Ta) insertions remains uncertain, the study shows how knowledge of L1(Ta) positions can identify candidate risk alleles meriting further study.
In summary, we have developed a high-throughput method, TIP-chip, for mapping an active group of mobile DNAs in humans. We show that the technique is readily generalized to other interspersed repeats. We illustrate initial insights it has provided into L1(Ta) genomic distribution and the dynamics of these repeats in our genomes. Genome-wide TIP-chip studies of several individuals show that L1(Ta)s are extremely polymorphic and an underappreciated type of SV underlying human genetic diversity. Future L1(Ta) and ISV mapping by TIP-chip and similar methods will continue to expand our understanding of human genomic diversity and play an increasingly important role in identifying causes of genetic disease.