Long Interspersed Element-1 (LINE-1 or L1) sequences comprise the bulk of retrotransposition activity in the human genome; however, the abundance of highly active or ‘hot’ L1s in the human population remains largely unexplored. Here, we used a fosmid-based, paired-end DNA sequencing strategy to identify 68 full-length L1s which are differentially present among individuals but are absent from the human genome reference sequence. The majority of these L1s were highly active in a cultured cell retrotransposition assay. Genotyping 26 elements revealed that two L1s are only found in Africa and that two more are absent from the H952 subset of the Human Genome Diversity Panel. Therefore, these results suggest that ‘hot’ L1s are more abundant in the human population than previously appreciated, and that ongoing L1 retrotransposition continues to be a major source of inter-individual genetic variation.
L1s comprise ~17% of human DNA and have been an instrumental force in shaping genome architecture (Lander et al., 2001). Most L1s are molecular fossils that cannot move (retrotranspose) to new genomic locations (Grimaldi and Singer, 1983; Lander et al., 2001). However, a small number of human-specific L1 (L1Hs) elements remain retrotransposition-competent (Badge et al., 2003; Brouha et al., 2003; Sassaman et al., 1997). On occasion, their retrotransposition has resulted in sporadic cases of human disease (reviewed in Babushok and Kazazian, 2007; Kazazian et al., 1988).
During the past fifteen years, computational, molecular biological, and genomic approaches have been used to identify and characterize L1Hs elements (Badge et al., 2003; Boissinot et al., 2000; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Moran et al., 1996; Myers et al., 2002; Ovchinnikov et al., 2001; Sheen et al., 2000; Xing et al., 2009). Several themes have emerged from these studies. First, L1Hs elements can be stratified into several subfamilies (pre-Ta, Ta-0, Ta-1, Ta1-d, Ta1-nd) based upon the presence of diagnostic sequence variants contained within their 5′ and 3′ untranslated regions (UTRs) (Boissinot et al., 2000; Skowronski et al., 1988; Smit et al., 1995). Second, many L1Hs elements are dimorphic in that they are differentially present in individual genomes and/or are present in an individual, but absent from the haploid Human Genome Reference sequence (HGR) (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Lander et al., 2001; Myers et al., 2002; Xing et al., 2009). Third, it has been estimated that the average human genome contains ~80–100 active (retrotransposition-competent) L1Hs elements, and that only a small number of highly active L1Hs elements (‘hot’ L1s) account for the bulk of retrotransposition activity in the HGR (Brouha et al., 2003). Those studies, as well as recent efforts to identify insertion, deletion, and inversion polymorphisms (structural variants) in humans (Kidd et al., 2008; Korbel et al., 2007; Tuzun et al., 2005; Xing et al., 2009) indicate that ongoing L1 retrotransposition contributes to inter-individual genetic variation.
Here, we employed a fosmid-based, paired-end DNA resource to identify full-length L1Hs elements in the genomes of six individuals of diverse geographic origin. Over half (37/68) of the newly identified L1s were ‘hot’ for retrotransposition when examined in a cultured cell assay (Moran et al., 1996). Genotyping a subset of these L1s further revealed that some are likely restricted to Africans, whereas others are absent from the Human Genome Diversity Panel (HGDP) (Cann et al., 2002) suggesting that they are present at very low allele frequencies.
To identify novel, full-length L1s in the genomes of geographically diverse individuals, we exploited a fosmid-based, paired-end DNA sequencing strategy that previously was used to identify structural variants in human DNA (Kidd et al., 2008; Tuzun et al., 2005). Fragments of genomic DNA approximately 40kb in size were individually cloned using fosmid vectors (see Extended Experimental Procedures). Sequence reads were obtained from both ends of each insert (paired-end sequences) and compared to the HGR. End-sequences from genomic fragments that do not differ significantly in size from the HGR will map ~40kb away from each other. In contrast, paired-end sequences derived from genomic fragments containing a full-length, dimorphic ~6kb L1Hs element will be separated by ~34kb when mapped to the HGR (Figure 1) (Tuzun et al., 2005). In general, the predicted variants were required to be supported by two fosmid clones containing putative insertions from the same individual. The size cutoffs used in our screening protocols are biased to allow the identification of full-length or near full-length L1 insertion polymorphisms, but not severely 5′ truncated L1 sequences, which are replication-deficient (Table 1). Through this scheme, we should be able to identify the bulk of full-length L1s in an individual genome that are dimorphic when compared to the HGR.
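The mapping logic described above can be illustrated with a short sketch. The function names, the input format, and the site labels are hypothetical and chosen only to mirror the numbers in the text (a ~40kb insert, a ~6kb L1, and support from two clones in the same individual); this is not the pipeline used in the study.

```python
from collections import defaultdict

def expected_reference_span(insert_size_bp=40_000, insertion_bp=0):
    """Distance between paired ends when mapped to a reference lacking the insertion."""
    return insert_size_bp - insertion_bp

def supported_sites(clones, min_clones=2):
    """Return (individual, site) pairs backed by at least `min_clones` fosmids.

    `clones` is an iterable of (individual, site) tuples, one per fosmid whose
    paired ends map closer together than expected (hypothetical input format).
    """
    counts = defaultdict(int)
    for individual, site in clones:
        counts[(individual, site)] += 1
    return sorted(key for key, n in counts.items() if n >= min_clones)

# Illustrative records only; the site labels are placeholders.
clones = [
    ("NA18956", "chr10:site_A"),
    ("NA18956", "chr10:site_A"),
    ("NA18555", "chr10:site_A"),  # a single clone in this individual
    ("NA12878", "chr4:site_B"),
]

print(expected_reference_span(insertion_bp=6_000))  # 34000
print(supported_sites(clones))  # [('NA18956', 'chr10:site_A')]
```

Counting support per individual, rather than per site across all libraries, reflects the requirement that both supporting clones come from the same person.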
Fosmids fulfilling the above mapping criterion were subjected to a series of screens (Figure 1). First, allele-specific oligonucleotide hybridization using probes directed against diagnostic sequences in the L1Hs 5′ UTR identified insertion fosmids that contain putative dimorphic L1Hs elements (Boissinot et al., 2000; Tuzun et al., 2005). Second, Southern blotting with a probe directed against the 5′ UTR of L1.3 (Accession# L19088) enabled the identification of fosmids that contained putative full-length L1Hs elements (Dombroski et al., 1993; Sassaman et al., 1997). Third, a suppression PCR-based method (ATLAS) (Badge et al., 2003) and/or direct sequencing was used to verify the presence of a full-length (or near full-length) L1Hs element in the fosmid. Finally, genomic sequences flanking the 5′ and 3′ ends of the newly identified L1Hs elements were used as probes in BLAT searches (http://genome.ucsc.edu/cgi-bin/hgBlat?command=start) (Kent, 2002) to confirm that the L1 was absent from the HGR (NCBI build 36.1/hg18). Flanking sequences also were used to determine whether any of the L1Hs elements were present in a database of known polymorphic retrotransposon insertions (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 2006). Two additional L1Hs elements were identified through direct sequencing of the fosmids (#1-2-1 and 10-2-1).
We first conducted a pilot study to examine a fosmid library from a female individual (G248; NA15510) for full-length L1Hs insertions (Table 1) (Tuzun et al., 2005). Despite the fact that this library was optimized for identifying ~8kb insertion polymorphisms as part of the Human Genome Structural Variation project (HGSV) (Kidd et al., 2008; Tuzun et al., 2005), we were able to identify five novel L1Hs elements using our screening protocol (Table 1).
The above data provided ‘proof of principle’ that our strategy was effective for identifying full-length, dimorphic L1Hs elements. Thus, we next screened fosmid libraries from five females representing four distinct geographic populations that were studied as part of the HapMap project (one Japanese (NA18956), one Chinese (NA18555), one Western European CEPH (NA12878), and two Yoruban individuals (NA19240, NA19129)) (Consortium, 2005; Kidd et al., 2008). Size cutoffs allowed detection of insertion polymorphisms as small as ~4.2–5.5kb and enabled the identification of an additional 64 L1Hs elements (Table 1) (Kidd et al., 2008). As our strategy is biased toward finding novel, full-length L1s, we generally observed a decrease in the number of L1Hs elements identified in each successive library screen (e.g., ABC13 was the last library analyzed and contained relatively few novel L1Hs elements). In total, we identified 69 L1Hs elements that were absent from the HGR, one of which was identified in two different individuals (#4-1 and 5–77, respectively). This element also was completely annotated in dbRIP, unlike 65 of the distinct 68 L1s identified in this study (Table 1). The number of elements discovered at each stage of the analysis is detailed in the Extended Experimental Procedures.
We next tested if the L1Hs elements identified in our screens were active for retrotransposition in cultured cells. Sixty-seven elements were cloned into either a pBluescript and/or pCEP4 L1 expression vector that contained an mneoI retrotransposition indicator cassette in its 3′ UTR (#2-42 was refractory to cloning; details in Experimental Procedures) (Freeman et al., 1994; Moran et al., 1996). The pBluescript-based L1 constructs lack an exogenous promoter; thus, L1 expression is driven from its native 5′ UTR. Elements isolated from libraries ABC11–13 were assayed in this context. L1s isolated from the G248, ABC9, and ABC10 libraries were assayed in pCEP4 (CMV+/5′UTR+) and/or pBluescript (5′UTR+) based contexts. The resultant plasmids were transfected into HeLa cells and successful retrotransposition events were detected as G418-resistant foci (Figure 2a) (Moran et al., 1996). Retrotransposition activities are reported relative to L1.3, and ‘hot’ refers to an L1 that jumps at >10% of L1.3 (see Table S1). Notably, 22 elements yielded similar retrotransposition efficiencies relative to L1.3 when tested in either a CMV+/5′UTR+ or a 5′UTR+ context (data not shown). Since the subcloning procedure does not involve PCR, we truly are testing the retrotransposition capability of each of the identified L1Hs elements in our screen.
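The ‘hot’ criterion can be written down as a small sketch. The function names and the colony counts below are hypothetical; only the >10% of L1.3 threshold comes from the text, and the counts are assumed to have already been normalized for transfection efficiency.

```python
def relative_activity(colonies, l13_colonies):
    """Retrotransposition activity as a percentage of the L1.3 control."""
    if l13_colonies <= 0:
        raise ValueError("the L1.3 control must yield at least one colony")
    return 100.0 * colonies / l13_colonies

def is_hot(percent_of_l13, threshold=10.0):
    """An element is 'hot' if it jumps at more than `threshold` percent of L1.3."""
    return percent_of_l13 > threshold

# Hypothetical, already-normalized G418-resistant colony counts.
counts = {"L1.3": 200, "element_A": 50, "element_B": 16}

for name in ("element_A", "element_B"):
    pct = relative_activity(counts[name], counts["L1.3"])
    print(name, pct, "hot" if is_hot(pct) else "not hot")
```

With these made-up counts, element_A scores 25% of L1.3 and is classified as ‘hot’, whereas element_B scores 8% and is not.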
Each individual contained between three and nine ‘hot’ L1s in their genome and 55% (37/67) of the L1Hs elements tested were ‘hot’ for retrotransposition (Figures 2a & 2b, Table 1). These 37 ‘hot’ L1Hs elements represent an approximately 4-fold increase in the number of ‘hot’ L1s identified in previous studies (Badge et al., 2003; Brouha et al., 2002; Brouha et al., 2003; Kimberland et al., 1999; Lander et al., 2001; Sassaman et al., 1997). Examination of the 3′ UTR sequences of the 68 L1s uncovered six elements that contain an ACG in place of the Ta subfamily diagnostic ACA characters. These elements are termed ‘pre-Ta’, and represent an older L1 subfamily (Boissinot et al., 2000; Brouha et al., 2003; Kazazian et al., 1988; Lander et al., 2001; Myers et al., 2002; Skowronski et al., 1988). Two pre-Ta L1s (#3-5 and 5–55) were ‘hot’ for retrotransposition (Figure 2b; Table S1). These data agree with previous studies, which showed that a de novo insertion of a pre-Ta L1 into the Factor VIII gene resulted in a sporadic case of hemophilia A (Kazazian et al., 1988).
We next sequenced each L1Hs element in its entirety and compared these data to fosmid sequences previously deposited in GenBank (Kidd et al., 2008). We annotated each L1 for hallmarks of retrotransposition as well as their chromosomal environment (Table S2). In general, the L1Hs elements were flanked by target-site duplications that ranged from 6 to 20bp, inserted into an L1 endonuclease consensus cleavage sequence (Cost and Boeke, 1998; Feng et al., 1996; Morrish et al., 2002), and their 3′ ends had either homopolymeric poly (A) tails that ranged from ~8–41bp in size or interrupted poly (A) tails/3′ transductions ranging from ~18bp to 1,105bp in length (Table S2) (Goodier et al., 2000; Holmes et al., 1994; Moran et al., 1999; Pickeral et al., 2000).
A subset of the elements (~32/68) contained an additional 1–14bp of untemplated nucleotides at their 5′ ends, termed 5′ end heterogeneity (Athanikar et al., 2004; Lavie et al., 2004). Five of these L1s have an extra G at their 5′ ends, and one has three extra Gs when compared to a ‘hot’ L1Hs consensus sequence (Brouha et al., 2003). These extra nucleotides potentially could result either from a terminal transferase activity associated with the L1 reverse transcriptase, or reverse transcription of the 7-methylguanosine cap at the 5′ end of L1 RNA (Boeke, 2003; Gilbert et al., 2005; Symer et al., 2002). The majority of elements identified were full-length; however, we also found 7 elements (e.g. #1-5 and 2–30) that were truncated within their 5′ UTR. These data, along with the fact that the fosmid libraries provided ~4–5 fold coverage of each haplotype from the 6 individuals (Kidd et al., 2008), indicate that our screening procedure identified the majority of the full-length L1s in these genomes.
The 68 L1Hs elements were dispersed throughout the genome. We did not identify L1Hs elements on chromosomes 16 or 19 (Figure 2c); however, this result probably reflects our small sample size rather than a systematic bias against their ability to insert on these chromosomes (Lander et al., 2001). Consistent with this interpretation, we previously were able to detect the insertion of engineered L1s into chromosomes 16 and 19 of HeLa cells (Gilbert et al., 2005).
Approximately 32% (22/68) of L1Hs elements were present in the introns of known RefSeq genes (http://www.ncbi.nlm.nih.gov/RefSeq/), and mutations in several of these genes are implicated in human genetic disorders (Table S3). Thirteen L1 insertions were in the anti-sense orientation (i.e., were transcribed in the opposite orientation to the gene), whereas 9 L1 insertions were in the same transcriptional orientation as the gene. Since ~26–38% of the genome is spanned by genes (Venter et al., 2001), the data suggest that the L1s have inserted randomly with respect to gene content, which is in agreement with previous studies (Gilbert et al., 2005; Gilbert et al., 2002; Ovchinnikov et al., 2001; Symer et al., 2002).
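The comparison between the observed intronic fraction and the genic fraction of the genome can be checked with an exact binomial calculation. The sketch below is illustrative only: the helper names are hypothetical, the two-sided rule (summing all outcomes no more likely than the observed one) is one common convention, and the test is not reported in the study itself.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_binom_p(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)

n, k = 68, 22  # 22 of 68 L1Hs elements lie in RefSeq introns
p_values = {p: two_sided_binom_p(k, n, p) for p in (0.26, 0.38)}

for p, p_value in p_values.items():
    # Expected number of genic insertions if insertion were random w.r.t. genes.
    print(f"genic fraction {p:.2f}: expected {n * p:.1f}, p = {p_value:.3f}")
```

The observed count of 22 falls between the expected counts at the two ends of the 26–38% range, and neither bound is rejected at the 5% level, which is in line with random insertion with respect to gene content.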
Our sequencing studies uncovered several expected trends and some unexpected results. All 37 ‘hot’ L1 elements and the 6 low-level activity elements had two intact open reading frames (ORFs). A consensus sequence derived from these 37 ‘hot’ L1s was identical at the amino acid level to a previously derived consensus (Brouha et al., 2003).
Inactive elements generally had frame shift (5/24) or chain-terminating nonsense mutations (9/24) in at least one of the L1 ORFs. However, 10 of these low-level activity or inactive elements contained two intact open reading frames. One L1 (#3-24) contained an S228P missense mutation within the endonuclease (EN) domain of ORF2p (Feng et al., 1996; Weichenrieder et al., 2004). Though L1s containing EN mutations are unable to retrotranspose in HeLa cells, they can retrotranspose in Chinese Hamster Ovary (CHO) cells deficient in the non-homologous end-joining (NHEJ) pathway of DNA repair, presumably by parasitizing a free 3′ OH group to initiate target-primed reverse-transcription (TPRT) (Morrish et al., 2007; Morrish et al., 2002). Interestingly, although #3-24 is inactive in NHEJ proficient cell lines, the L1 retrotransposed at roughly 60% the efficiency of the wild-type control, L1.3, in NHEJ deficient CHO cells (Morrish et al., 2002). Introducing the S228P change into L1.3 (Sassaman et al., 1997) also allowed efficient EN-independent retrotransposition, indicating that this mutation is largely responsible for the inactivity of #3-24 in HeLa cells (Figure S1).
Analysis of genomic sequences flanking the 68 L1Hs elements revealed a number of interesting findings. The poly (A) tails of 25 L1s were interrupted or contained 3′ transductions (Goodier et al., 2000; Holmes et al., 1994; Moran et al., 1999; Pickeral et al., 2000), seventeen of which clustered into ‘subfamilies’ of L1Hs elements. In one case, we identified an L1 (#2-1) as the likely source element for one of these ‘subfamilies’. For #1-3, 3–31, and 1–5, these transductions/interrupted poly (A) tails were identical to those in L1Hs elements that have caused disease-producing mutations (e.g., L1RP, LRE3) (Brouha et al., 2002; Kimberland et al., 1999). In other cases, the transductions denote examples of recently amplified subfamilies (Goodier et al., 2000; Lander et al., 2001; Pickeral et al., 2000).
Examining the 5′ genomic flanks showed that the retrotransposition of a full-length L1 from the ABC9 genomic library (#2-24) that integrated on chromosome 10 was accompanied by ~250bp of an Alu element which maps to chromosome 16. The Alu sequence is in the opposite transcriptional orientation to the L1, 13bp of unmapped sequence separates the elements, and the whole insertion was flanked by target site duplications (TSDs) (Figure S2). Thus, though most of the full-length L1Hs elements identified here have amplified by canonical retrotransposition, recombination and/or replication-mediated repair processes may facilitate the integration of some elements (Gilbert et al., 2005; Gilbert et al., 2002; Symer et al., 2002). Additionally, our screen allowed us to resolve possible sequence anomalies in the HGR. For example, one fosmid that lacks a dimorphic L1Hs element (#6-105) actually contains two L1s (a PA2 and pre-Ta element) that likely were collapsed into a harlequin element during the HGR assembly (Figure S2).
Finally, the data also enabled us to examine allelic heterogeneity associated with L1Hs elements. For example, one L1 (#5-70) was present in the HGR, but contained a stop codon in ORF2 and was not tested for activity (Brouha et al., 2003). Interestingly, #5-70 retrotransposed at ~8% of the level of L1.3, further illustrating how allelic heterogeneity can impact retrotransposon activity (Lutz et al., 2003; Seleme et al., 2006).
The 68 L1Hs elements identified here are dimorphic with respect to presence; thus, we tested if a subset of these L1s represented population-restricted or potentially private alleles. To address this question, we first compiled existing genotyping data (Badge et al., 2003; Myers et al., 2002; Xing et al., 2009). Additional genotyping then was conducted on a subset of the L1s discovered here (26 in total; see Supplemental Information for selection criteria). The 26 L1s first were genotyped in a CEPH panel of 129 unrelated individuals. Nine L1s absent from the CEPH panel then were genotyped in a Zimbabwean panel of 72 unrelated individuals. Finally, if the element was absent from both panels, it was genotyped on the H952 subset of the HGDP consisting of ~1050 individuals from ~51 worldwide populations (Figure 3a and Table S4) (Cann et al., 2002; Rosenberg, 2006).
Two elements (#3-5 and 3–31) genotyped on the HGDP exist at very low allele frequencies and were only found in Africans. Two other L1Hs elements (#1-5 and 3–24) were absent from the HGDP (Table S4). Element #3-24 (the S228P mutant described above) was found in the ABC10 Yoruban library. Further genotyping revealed that the L1Hs element containing the mutation was present in her mother (but not her father), excluding a de novo origin (Figure 3b). The other putatively ‘private’ L1Hs element was from G248 (#1-5), so we could not examine its segregation in a trio. Interestingly, this L1 insertion occurred into an intron of the ABCA1 gene (Figure 3c); mutations in ABCA1 have been associated with Tangier disease and low serum HDL levels (Frikke-Schmidt, 2009).
To estimate the total number of active L1s in one individual, we carried out in silico genotyping of the 68 L1Hs elements in ABC13, the last library examined in our subtractive scheme. We identified 20 regions containing distinct L1 insertions identified in the first 5 individuals that corresponded to insertion fosmids in the ABC13 HGSV track (http://hgsv.washington.edu/) of the UCSC genome browser (Figure 4a, Table S4) (Kent et al., 2002; Kidd et al., 2008). PCR genotyping confirmed that ABC13 contained 18 of these 20 elements (Figure 4b), and was homozygous with respect to presence for three of the elements. This result suggests that in silico genotyping could be used as a screening tool to identify L1Hs elements present at low allele frequencies in the population (Table S4).
Adding the 18 L1Hs elements identified by in silico genotyping to the seven novel L1Hs elements identified in the ABC13 genome through our fosmid screens revealed that this individual contains 25/68 L1Hs elements identified in this study. Additional genotyping revealed that this individual contains 2 of the ‘hot’ L1s characterized in a previous study (Table 1) (Brouha et al., 2003). Combining these numbers with our retrotransposition data indicates that the ABC13 genome contains 14 potentially ‘hot’ L1Hs elements, and that at least 3 of these elements are present in a homozygous state.
Our data suggest that, on average, the 68 L1Hs elements identified here are present at lower allele frequencies, are more active, and may be evolutionarily younger than those in previous studies (Brouha et al., 2003). To test this hypothesis, we derived maximum likelihood estimates for the ages of Ta-1 L1Hs elements in our dataset and that of Brouha et al. (Brouha et al., 2003; Marchani et al., 2009). This analysis revealed that the Ta-1 L1Hs elements identified here are significantly younger (1.0 MY; 95% C.I. 0.98–1.01 MY) than those reported previously: 2.01 MY (95% C.I. 2.00–2.02 MY) (Marchani et al., 2009) and 1.73 MY (95% C.I. 1.69–1.77 MY) (Brouha et al., 2003).
The maximum likelihood estimated age (Marchani et al., 2009) (1.0 MY) of the L1s reported here differs significantly from that calculated using the ad hoc method, which uses sequence divergence within subfamilies of elements to determine age (Carroll et al., 2001) (1.18 MY old). These two methods are known to be respectively robust (the maximum likelihood method) and sensitive (the ad hoc method) to the presence of multiple active lineages in the dataset (i.e. departures from the master gene model of L1 evolution) (Cordaux et al., 2004). The difference in these two estimates may indicate that members of multiple active L1Hs subfamilies are present in our dataset, and suggests that the true age of the L1s may be younger than either calculation suggests. Indeed, the above data are consistent with the hypothesis that the HGR is strongly biased in favor of older, fixed L1Hs elements.
We next used a neighbor joining approach, rooted with an intact chimpanzee L1 element, to generate a phylogenetic tree of the 68 full-length L1Hs elements (Figure 5, see Extended Experimental Procedures). As predicted, pre-Ta elements were located near the root of the tree. Interestingly, two known (L1RP & LRE3) and five other currently amplifying ‘subfamilies’ clustered together on the tree (Figure 5; see groups of colored elements), even though the interrupted poly (A) tail/transduction sequences themselves were excluded from the sequence alignments.
We have developed a systematic process to identify novel, dimorphic, active L1Hs elements in genomes of individuals from diverse geographic populations. Many of the newly identified L1Hs elements exist at low allele frequencies in the population and four L1Hs elements represent ‘rare’ alleles, three of which appear to be restricted to Africans. Sequence-based age estimates further reveal that these L1Hs elements appear to be, on average, evolutionarily younger than those identified in previous studies (Brouha et al., 2003; Marchani et al., 2009). These data are consistent with the notion that full-length active L1s are systematically underrepresented in available genome reference sequences (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Sassaman et al., 1997; Sheen et al., 2000; Xing et al., 2009).
Our study has underscored the effectiveness of fosmid paired-end libraries in the discovery of novel, active L1Hs elements. Though a number of technologies have been developed to identify polymorphic L1s (Badge et al., 2003; Boissinot et al., 2004; Brouha et al., 2003; Moran et al., 1996; Myers et al., 2002; Sheen et al., 2000; Xing et al., 2009), the approach described here is not reliant upon PCR fidelity, readily allowing the identification of active L1Hs elements and making sequencing of genomic flanking sequences, poly (A) tails, and L1-mediated transductions relatively straightforward. Thus, we predict that the fosmid-based approach likely will be superior to second-generation, low-coverage genome sequencing methodologies (e.g., many individual genomes characterized in the 1000 genomes project; http://www.1000genomes.org/page.php) for comprehensively identifying and characterizing ‘rare’ L1 alleles in individual genomes. Indeed, recently published genome sequences highlight the difficulties in detecting and unambiguously mapping highly repetitive insertions (relative to a reference genome), including L1Hs elements (Bentley et al., 2008; McKernan et al., 2009; Wang et al., 2008; Wheeler et al., 2008).
Our analysis revealed that many active L1s cluster in small ‘subfamilies’. In the strictest sense, these data argue against a master gene model (Deininger et al., 1992) and instead support a model in which multiple active source L1Hs elements (including members of both the pre-Ta and Ta-subfamilies) are currently retrotransposing in modern human genomes (Cordaux et al., 2004). We cannot formally exclude a ‘stealth’ model, where L1s in unfavorable expression contexts sometimes give rise to new retrotransposition-competent source elements that can be expressed from a more favorable genomic context (Han et al., 2005). However, the most parsimonious explanation of our data is that multiple source L1Hs elements and subfamilies with limited ‘life-spans’ exist in the genome. We posit that ‘hot’ L1Hs elements must give rise to new, active progeny at a faster rate than they are inactivated by cellular mutational processes (see Figure 6 for model); this can lead to a scenario where small numbers of currently active L1Hs lineages may out-compete older L1s for limiting reagents, such as host factors (Boissinot and Furano, 2001). This competition scenario both supports and extends current lineage succession models and could potentially explain the monophyletic history of L1s and the appearance of a replication-dominant L1Hs subfamily (Boissinot et al., 2000; Cordaux et al., 2004; Seleme et al., 2006).
Our data set is still relatively small, and it remains difficult to estimate the actual number of ‘hot’ L1s in the extant population. However, our ability to readily identify rare ‘hot’ L1s in the genomes of geographically diverse individuals strongly suggests that these highly active L1Hs elements are more abundant in the population than previously appreciated. The active L1Hs elements identified here also have the potential to impact modern human genomes by retrotransposing flanking genomic sequences to new chromosomal locations and by serving as substrates for non-allelic homologous recombination (reviewed in Cordaux and Batzer, 2009; Moran et al., 1999). The proteins encoded by these L1s also may promote the retrotransposition of Alu elements and non-coding RNAs (Bennett et al., 2008; Dewannieux et al., 2003; Garcia-Perez et al., 2007). Indeed, our data support the hypothesis that ‘hot’ L1s are actively retrotransposing in modern-day human genomes and suggest that some of the L1 alleles identified here could serve as source elements for disease-producing L1 insertions.
Genomic DNA from the 6 individuals was obtained from transformed lymphoblastoid cell lines (available from the Coriell Cell Repository). The DNA was hydrodynamically sheared, end-repaired, size selected for 40kb fragments by pulsed field gel electrophoresis, and ligated into fosmid vectors (Donahue and Ebling, 2007). Agencourt Biosciences Corporation constructed all libraries, with the exception of the G248 library, which was constructed as part of the human genome project finishing effort. From each library, approximately 1 million individual cloned fragments were arrayed into 384-well plates. End-sequence pairs were obtained from both ends of each DNA fragment using standard capillary sequencing and were mapped back to the HGR. Insertion-containing fosmids were identified as the subset of fosmids containing an apparent insert that was ~3 standard deviations smaller than the library mean (Kidd et al., 2008; Tuzun et al., 2005).
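The size criterion for insertion-containing fosmids can be sketched as follows. The function name and the toy span values are hypothetical; in practice the mean and standard deviation would be estimated from the full library rather than from a few dozen clones, and the exact statistics used in the study are described in the cited references.

```python
from statistics import mean, stdev

def insertion_candidates(spans_bp, n_sd=3.0):
    """Return indices of clones whose mapped span is more than n_sd SDs below the mean."""
    mu = mean(spans_bp)
    sd = stdev(spans_bp)
    cutoff = mu - n_sd * sd
    return [i for i, span in enumerate(spans_bp) if span < cutoff]

# Toy library: 30 clones near 40kb plus one clone spanning only 34kb on the reference.
spans = [39_000, 40_000, 41_000] * 10 + [34_000]

print(insertion_candidates(spans))  # [30]
```

A clone that maps ~6kb shorter than expected is consistent with a full-length L1 present in the fosmid but absent from the reference.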
Insertion-containing fosmids identified in silico were screened for L1Hs elements in the following manner. First, all insertion fosmids were subjected to allele-specific oligonucleotide hybridization to identify characters in the 5′ UTRs of newer L1 subfamilies (Badge et al., 2003; Boissinot et al., 2000). This protocol was adapted from ‘hybridization of bacterial DNA on filters’ (Sambrook, 1989). Fosmid DNAs were prepared according to the Very Low-Copy Plasmid/Cosmid Purification protocol for the Qiagen-tip 100 Midi prep kit (Qiagen). Those DNAs were subjected to Southern blotting followed by ATLAS (Badge et al., 2003) and/or direct sequencing to identify L1Hs elements that were absent from the HGR. Sequences flanking the L1Hs elements then were used as probes in BLAT searches at the UCSC genome browser (http://genome.ucsc.edu/) to determine the insertion site in the HGR (Kent, 2002; Kent et al., 2002). Detailed protocols for each step of the screening process, as well as the number of fosmids positive at each stage of the analysis, can be found in the Extended Experimental Procedures.
In general, L1Hs elements were cloned directly from insertion-containing fosmids by digestion with AccI (Sassaman et al., 1997). The restricted DNA was separated on a 0.8% agarose gel, and the ~6kb L1-containing restriction fragment was cloned into an L1 expression vector. This method captures the vast majority of the L1Hs sequence, leaving only the first ~35bp and last ~50bp of the original L1 5′ and 3′ UTRs present in the cloning vector, respectively. One element, #2-42, was refractory to this cloning procedure, as it contains a polymorphism near the 3′ end of ORF2 that creates an additional AccI site. The PDH L1.3 mutant was generated by site-directed mutagenesis. Each L1Hs element was sequenced in its entirety. Detailed protocols for the creation of each construct are included in the Extended Experimental Procedures.
We used a modification of a transient transfection protocol to conduct retrotransposition assays in HeLa and CHO cells (Moran et al., 1996; Morrish et al., 2002; Wei et al., 2000). Briefly, cells in 6-well dishes were transfected using the Fugene 6 reagent (Roche) with 1μg of plasmid (containing the indicator cassette) per well. Cells were fed with media ~24 hours post plating, and daily from 72 hours with media plus either 400μg/mL G418 or 10μg/mL blasticidin. Fourteen days post transfection, cells were fixed and stained with 0.1% crystal violet. Colonies were counted in the appropriate wells, and these counts were normalized to GFP transfection efficiency. Detailed protocols for culture and assay conditions are found in the Extended Experimental Procedures.
The genomic locations of L1Hs insertions were compared to a database of human retrotransposon insertion polymorphisms (dbRIP; http://dbrip.brocku.ca/) (Wang et al., 2006). PCR genotyping assays were designed for a subset of L1Hs elements that were not completely annotated in dbRIP. Genotyping initially was conducted on a CEPH panel of 129 unrelated individuals of Northern European ancestry. If a L1Hs element was absent from the CEPH panel, it was genotyped on a panel containing genomic DNAs from 72 unrelated Zimbabwean individuals. Finally, if an L1Hs element was absent from both genotyping panels, it was genotyped on the H952 subset (Rosenberg, 2006) of the HGDP (Cann et al., 2002) (see Figure 3b). In silico genotyping was conducted using the HGSV track of the UCSC genome browser (Kent et al., 2002; Kidd et al., 2008). Details about these analyses are in the Extended Experimental Procedures.
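The tiered genotyping scheme can be summarized as a simple cascade. The function and the panel labels below are hypothetical shorthand for the three panels named in the text; the sketch only captures the order in which panels are consulted, not the PCR assays themselves.

```python
PANELS = ("CEPH", "Zimbabwean", "HGDP H952")

def genotyping_cascade(presence):
    """Walk the panels in order and stop at the first one where the element is found.

    `presence` maps panel name -> bool (True if the insertion allele was observed).
    Returns (panels_tested, first_panel_with_allele), where the second value is
    None if the element is absent from all three panels.
    """
    tested = []
    for panel in PANELS:
        tested.append(panel)
        if presence.get(panel, False):
            return tested, panel
    return tested, None

# An element absent from CEPH but present in the Zimbabwean panel.
print(genotyping_cascade({"CEPH": False, "Zimbabwean": True}))
# An element absent from every panel (a putatively 'private' allele).
print(genotyping_cascade({"CEPH": False, "Zimbabwean": False, "HGDP H952": False}))
```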
Sequences of the 69 full-length L1 elements were classified into subfamilies using the L1Xplorer analysis website (Penzkofer et al., 2005). Ta-1, Ta-0 and Non-Canonical (NC) (Brouha et al., 2003) elements were separately aligned using Muscle 3.52 (Edgar, 2004) on the Phylemon web server (http://phylemon.bioinfo.cipf.es/cgi-bin/home.cgi) (Tarraga et al., 2007). Raw alignments were manually refined to remove all indels, all variable CpG sites and the L1 polypurine tract using Jalview (Waterhouse et al., 2009). Maximum likelihood estimates of the age (T) of each group, the sampling variance of T, and its 95% confidence intervals were calculated using the mleT script (Marchani et al., 2009) running under Matlab 7.2 -2007a (The Mathworks Inc., Natick, MA). The subroutine CountMutations (Marchani et al., 2009) was also utilized to calculate the number of substitutions in the datasets to enable the “ad hoc” subfamily age estimation method (Marchani et al., 2009).
The sequences of the 69 elements were aligned as described above. Raw alignments were manually refined using Jalview (Waterhouse et al., 2009) to remove large indels and truncated elements; this led to the exclusion of #6-113 due to a large 5′ UTR deletion.
A single Neighbor Joining tree of the 68 remaining full-length elements was constructed using the PHYLIP package (Felsenstein, 1989). Branch lengths were corrected using the Kimura 2 parameter model (Kimura, 1980). To assess the reliability of the phylogeny, 1000 bootstrapped re-samples of the multiple alignment were made using the seqboot program of the PHYLIP package (Felsenstein, 1989). The neighbor joining tree derived from the full dataset was manually annotated with bootstrap values using Dendroscope (Huson et al., 2007) (Figure 5). Only bifurcations that occurred in more than 70% of bootstrap re-samples are labeled.
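For readers unfamiliar with the Kimura 2-parameter correction, the pairwise distance it produces can be sketched in a few lines. This is a minimal illustration of the standard formula, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions; the function name is hypothetical, and the study itself used the PHYLIP package rather than this code.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences of equal length.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * log((1 - 2 * p - q) * sqrt(1 - 2 * q))

# One transition in ten aligned sites: the corrected distance exceeds the raw 0.1.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

The correction matters little for sequences as similar as young L1Hs elements, but it keeps branch lengths comparable when older elements such as the chimpanzee outgroup are included.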
We thank Prof. Sir Alec Jeffreys FRS for access to CEPH and Zimbabwean DNA samples, and Prof. Mark Jobling for access to HGDP DNA samples. We thank Dr. Elizabeth Marchani for advice on maximum likelihood age estimates and Dr. José Luis Garcia-Perez for plasmid JJ105/L1.3. We thank Dr. Garcia-Perez and members of the Moran lab for helpful comments. C.R.B. was supported in part by NIH training grants T32GM7544 & T32000040. J.M.K. was supported by a National Science Foundation Graduate Research Fellowship. Work in the laboratory of E.E.E. was supported by grant HG004120. P.C. and C.M. were supported by a Wellcome Trust Project Grant (075163/Z/04/Z) to R.M.B. and Prof. Sir Alec Jeffreys, FRS. J.V.M. is supported by NIH grants GM066695 and GM060518. The University of Michigan Cancer Center Support Grant (5P30CA46592) helped defray sequencing costs incurred in this study. J.V.M. and E.E.E. are Investigators of the Howard Hughes Medical Institute.
Accession numbers for all elements are tabulated in the Supplemental Information. Two L1Hs elements (Accession Numbers (#1-5) GU477636 and (#6-102) GU477637) were recently posted in GenBank.