Long interspersed element 1s (LINE-1s or L1s) are a family of non-long-terminal-repeat retrotransposons that predominate in the human genome. Active LINE-1 elements encode proteins required for their mobilization. L1-encoded proteins also act in trans to mobilize short interspersed elements (SINEs), such as Alu elements. L1 and Alu insertions have been implicated in many human diseases, and their retrotransposition provides an ongoing source of human genetic diversity. L1/Alu elements are expected to ensure their transmission to subsequent generations by retrotransposing in germ cells or during early embryonic development. Here, we determined that several subfamilies of Alu elements are expressed in undifferentiated human embryonic stem cells (hESCs) and that most expressed Alu elements are active elements. We also exploited expression from the L1 antisense promoter to map expressed elements in hESCs. Remarkably, we found that expressed Alu elements are enriched in the youngest subfamily, Y, and that expressed L1s are mostly located within genes, suggesting an epigenetic control of retrotransposon expression in hESCs. Together, these data suggest that distinct subsets of active L1/Alu elements are expressed in hESCs and that the degree of somatic mosaicism attributable to L1 insertions during early development may be higher than previously anticipated.
The human genome is highly complex in structure, but only ~1.5% of human DNA has protein coding potential (53). More than 40% of the genome is composed of sequences derived from mobile genetic elements (transposons and retrotransposons) (53). At present, only long interspersed element 1s (LINE-1s or L1s) and some short interspersed elements (SINEs) are actively transposing in the human genome (62). LINE-1 elements (here LINE-1s) are autonomous retrotransposons that constitute ~17% of human DNA (53), and recent estimates indicate that an average human genome contains around 80 to 100 sequences that are able to transpose, i.e., are retrotransposition-competent LINE-1s (here RC-L1s) (19, 71). However, these elements vary dramatically in their retrotransposition activity in cell culture-based retrotransposition assays (19). In addition, allelic heterogeneity in retrotransposition activity (56, 73) and the existence of RC-L1 elements that show presence/absence polymorphism between individuals (8, 15, 84) imply that there can be significant variation in RC-L1 activity between individual genomes.
An RC-L1 is ~6 kb in length (29, 72) and contains an ~900-bp-long 5′ untranslated region (UTR) with internal promoter activity (78), two open reading frames (ORFs), an ~150-bp-long 3′ UTR, and a poly(A) tail (72). ORF1 encodes a 40-kDa protein with RNA binding and nucleic acid chaperone activities (38, 40, 52, 59, 60). ORF2 encodes a 150-kDa protein with reverse transcriptase (RT) and endonuclease activities (33, 61). Both proteins are required for the mobilization of L1 within the human genome (65). L1 retrotransposition involves the reverse transcription of an mRNA intermediate by a mechanism termed target-primed reverse transcription (25, 26, 55, 64). The mobilization of SINEs occurs by a similar mechanism (46), through the use of LINE-1-encoded ORF2p (28).
Alu elements are the most successful human SINEs, and they are present at greater than one million copies in the human genome (53). Alu elements are nonautonomous non-long-terminal-repeat retrotransposons derived from human gene 7SL (reviewed in references 9 and 23), and the average human genome contains ~6,000 active core Alu elements (12). An Alu core is defined as the ~280-bp region that includes both Alu monomers that are capable of retrotransposing in cultured cells but excludes any flanking genomic 5′ or 3′ regions.
Despite the high prevalence of transposable elements in the human genome and the presence of several LINE and SINE subfamilies in this genome, apparently at present only certain members of each class are active (designated “young,” “human-specific,” or “hot” elements [reviewed in reference 62]). As a consequence, L1 and Alu elements can act as insertional mutagens, and indeed, many cases of human disease have been caused by such insertions (11, 37). In addition to their potential as insertional mutagens, there are many ways in which de novo L1/Alu insertions and L1/L1 or Alu/Alu recombination can impact the human genome (reviewed in references 11, 23, 37, 44, and 50). Overall, it is estimated that 1 in 35 to 45 newborns harbors a de novo L1 or Alu retrotransposition event (24, 31, 42, 49). These new events must occur either in parental germ cells or early in embryonic development, prior to the partitioning of the germ cell lineage. Indeed, through the characterization of human mutagenic insertions and the use of mouse models of L1 retrotransposition, it has been revealed that L1 retrotransposition can occur in germ cells, during early embryonic development, and in particular somatic tissues (3, 7, 18, 27, 35, 47, 66-68, 80). On the other hand, recent studies have revealed that L1 mobilization processes are a source of genomic variation among humans, with particular impact on our somatic genome, as revealed by the identification of several de novo L1 insertions in a cohort of lung tumors (10, 27, 31, 42, 45).
Human embryonic stem cells (hESCs) offer an excellent model to study biological processes during early human development, as they mimic pluripotent cells isolated from the inner cell mass (ICM) of human embryos (79). Several hESC lines and human embryonic carcinoma (hEC) cell lines express L1 retrotransposition intermediates (ribonucleoprotein particles [RNPs; 39, 52, 58]), and a diverse range of L1 mRNAs (representing active and inactive subfamilies) are expressed in these cells (35, 41, 58, 74). Furthermore, several cultured hESC lines can support the retrotransposition of engineered LINE-1 elements using a cultured-cell-based assay (35, 65). There is a growing but disparate set of observations relating to host factors that influence the retrotransposition of Alu and L1, including the differential effect of APOBEC proteins on the mobility of L1 and Alu (recently reviewed in reference 21), the control of L1 expression by DNA methylation in germ cells by DNMT3L, Piwi proteins, and Piwi-interacting RNAs (17, 57), as well as single-stranded retrotransposon DNA degradation by the exonuclease Trex1 (77). These observations suggest that there is a diverse array of host defense systems that can interfere with L1 retrotransposition. Perhaps the most enigmatic feature of these systems is the fact that full-length L1 human-specific (L1Hs) elements contain an active antisense promoter in their 5′ UTR (76). Recently, it was reported that, in conjunction with the sense L1 promoter, transcripts initiated from the antisense promoter could trigger an RNA interference (RNAi) response that attenuates the mobility of L1 in cultured cells (75, 85). Intriguingly, deletion of the L1 antisense promoter enhances retrotransposition in cultured cells (85), but it has been retained in the vast majority of endogenous active elements, suggesting that it has some essential, perhaps regulatory, function.
To characterize a sample of the active retrotransposon “transcriptome” of hESCs, we cloned and sequenced expressed Alu elements and tested their retrotransposition potential in cultured human cells. We also utilized antisense L1 (AS-L1) transcripts, expressed in hESCs, to identify and map expressed L1 elements and their host genes. In addition, we found that the antisense promoter of L1 is robust over evolutionary time and that most expressed L1s are located within genes, suggesting epigenetic control of their expression.
All reagents were purchased from GIBCO-Invitrogen, unless otherwise indicated. The cell lines PA-1 (86) and HeLa-HA were grown as previously described (6). Briefly, cells were passaged by standard trypsinization (using a 0.05% stock) and the culture medium was minimum essential medium (MEM) supplemented with 10% heat-inactivated fetal bovine serum, 1× nonessential amino acids, and 1 mM l-glutamine. 2102Ep (4) and N-Tera2D1 cl1 (N-Tera2D1) (5) cells were grown in high-glucose Dulbecco MEM supplemented with 10% fetal bovine serum and 1 mM l-glutamine. hESCs were grown as previously described (35). hESC lines H7, H9, and H13B (WA07, WA09, and WA13) were obtained from Wicell and maintained on gelatin-coated plates using irradiated mouse embryonic fibroblasts (MEFs) from CF-1 mice (Chemicon). Gamma irradiation with a 2100 Cesium source indicator was used to mitotically inactivate MEFs. MEFs were used at a density of 25,000/cm². The culture medium for hESCs was Dulbecco MEM-knockout supplemented with 4 ng/ml b-FGF, 20% knockout serum replacement, 1 mM l-glutamine, 50 μM β-mercaptoethanol, and 0.1 mM nonessential amino acids. hESCs were manually passaged twice a week. Transfected hESCs were grown in Matrigel-coated plates (B&D) using MEF-conditioned medium for 24 h (35). All of the cell lines were grown in a humidified incubator at 37°C with 7% CO2.
Approval from the Spanish National Embryo Ethical Committee was obtained to work with hESCs.
Plasmid DNAs were purified using a Midiprep kit from Qiagen, checked for superhelicity by electrophoresis on 0.7% agarose-ethidium bromide gels (only highly supercoiled preparations of DNA [>90%] were used for transfection), and filtered through a 0.22-μm filter. The following plasmids were used. pRL-SV40 is a 4.8-kb plasmid that contains the coding region of Renilla luciferase under the transcriptional control of the early simian virus 40 (SV40) promoter. It is cloned in a modified pBSKS II (Stratagene) plasmid that contains an SV40 late polyadenylation signal. 5S-FF is a 5.7-kb plasmid that contains the 5′ UTR of a human L1Hs element (L1.3) (71) cloned in the sense orientation in plasmid pGL3-basic (Promega). 5AS-FF is a 5.7-kb plasmid that contains the 5′ UTR of a human L1Hs element (L1.3) (71) cloned in the antisense orientation in plasmid pGL3-basic (Promega). Derivatives of plasmids 5S-FF and 5AS-FF containing the 5′ UTR from older LINE-1s (L1PA2, L1PA3, L1PA4, L1PA6, L1PA7, L1PA8, and L1PA10) were constructed using the same procedure. pCEP-5′UTRORF2NoNeo has been described previously (2). It contains a 5.0-kb NotI-BamHI fragment containing the L1.3 5′ UTR and L1.3 ORF2 cloned in pCEP4 (Invitrogen). pAluNF1-neoIII contains a 2.1-kb fragment containing the 7SL promoter, a copy of the NF1 Alu element (a Ya5 member) (82), a neo3 self-splicing indicator cassette (30), a 33-bp poly(A) tail, and a BC1 transcription termination sequence cloned in pBSKS-II (Invitrogen). An AgeI and a BstZ17I site were introduced into the 5′ and 3′ ends of Alu NF1, respectively, to facilitate the cloning of Alu elements expressed in hESCs. pCEP-EGFP contains the 0.9-kb coding sequence of the humanized enhanced green fluorescent protein (EGFP), which was derived from plasmid phrGFP-C (Stratagene) cloned in pCEP4 (Invitrogen).
hESCs were transfected by nucleofection (Amaxa) exactly as described previously (35), using 4 × 10⁶ cells and 4 μg of purified DNA (2 μg of pRL-SV40 and 2 μg of either 5S-FF or 5AS-FF). The luciferase signal was read using the dual-system kit from Promega.
The Alu trans-retrotransposition assay in HeLa-HA cells was conducted in six-well tissue culture plates as previously described (34). Briefly, HeLa-HA cells were plated at 4 × 10⁴/well in six-well tissue culture plates. We used a full plate per Alu construct to be analyzed. Approximately 14 to 18 h after plating, three wells of the plate were cotransfected with 0.66 μg of a reporter plasmid (pAluNF1-neoIII) and 0.33 μg of a driver L1 that lacks an indicator cassette (pCEP-5′UTRORF2NoNeo). We used 3 μl of Fugene 6 transfection reagent (Roche Biochemical). The remaining three wells were cotransfected with equal amounts of an EGFP reporter plasmid (humanized Renilla green fluorescent protein [pCEP-EGFP]), a reporter plasmid, and a driver L1. At 72 h posttransfection, this set of wells was trypsinized and subjected to flow cytometry. The percentage of EGFP-positive cells was used to determine the transfection efficiency of each sample. At 72 h posttransfection, cells in the remaining wells were subjected to G-418 selection (400 μg/ml) for 12 days. The retrotransposition efficiency is expressed as the number of G-418-resistant foci divided by the number of transfected (EGFP-positive) cells.
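The efficiency calculation described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis script: it assumes that the number of transfected cells is estimated as the number of cells plated multiplied by the EGFP-positive fraction, and all function names and numerical values are hypothetical.

```python
def retrotransposition_efficiency(g418_foci: int, cells_plated: int,
                                  egfp_positive_fraction: float) -> float:
    """G-418-resistant foci per transfected (EGFP-positive) cell.

    Assumption: transfected cells = cells plated x EGFP-positive fraction.
    """
    if cells_plated <= 0:
        raise ValueError("cells_plated must be positive")
    if not 0.0 < egfp_positive_fraction <= 1.0:
        raise ValueError("EGFP-positive fraction must be in (0, 1]")
    transfected_cells = cells_plated * egfp_positive_fraction
    return g418_foci / transfected_cells


def percent_of_reference(efficiency: float, reference_efficiency: float) -> float:
    """Activity of a test Alu expressed as a percentage of a reference Alu."""
    return 100.0 * efficiency / reference_efficiency


# Hypothetical counts, for illustration only.
reference = retrotransposition_efficiency(200, 40_000, 0.50)
test_alu = retrotransposition_efficiency(30, 40_000, 0.40)
print(percent_of_reference(test_alu, reference))
```

Normalizing each construct to its own EGFP-positive fraction keeps differences in transfection efficiency from being read as differences in Alu activity.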
Cells were washed twice with 1× phosphate-buffered saline (Invitrogen), and total RNA was extracted using the TRIzol reagent (Invitrogen). To generate cDNAs, 4 μg of total RNA was treated with 100 U of RNase-free DNase I (Promega) for 1 h at 37°C. To prevent contamination with genomic DNA, the DNase treatment was repeated twice. Then, 1 μg of RNA was reverse transcribed with Moloney murine leukemia virus RT (25 U; Promega) primed with a 3′ random amplification of cDNA ends (RACE) primer for 1 h at 42°C by following the manufacturer's instructions. The sequence of the RACE primer is 5′GCGAGCACAGAATTAATACGACTCACTATAGGTTTTTTTTTTTTVN.
To generate a library of expressed Alu elements, RACE-primed cDNAs were used in a PCR with primers Outer (5′GCGAGCACAGAATTAATACGACT) and Alu_library (5′ GGTGGCTCACGCCTGTAATCCCAG) in triplicate using High Fidelity Expand Taq (Roche). We used a 3′ RACE primer to prevent amplification of exonized Alu elements. The PCR conditions included an initial cycle of 95°C for 2 min, followed by 25 cycles of 30 s at 94°C, 30 s at 54°C, and 30 s at 72°C, with a final step of 72°C for 10 min. Thirty microliters of each PCR product was resolved on 2% agarose gels, the band was excised, and the DNA was extracted using the QIAquick extraction kit (Qiagen). PCR amplification products were cloned in pGEMT-Easy (Promega), and approximately 15 clones per reaction were randomly sequenced using M13 universal primers. Sequences were analyzed by BLAT (51) at http://www.genome.ucsc.edu using the March 2006 human genome assembly. The Alu subfamily was determined using RepeatMasker at http://www.repeatmasker.org.
To generate a library of expressed LINEs, RACE-primed cDNAs were used in a PCR with primers Outer and ABIEL_library (5′GTGAGATGAACCCGGTACCTCAG) in triplicate using High Fidelity Expand Taq (Roche). The PCR conditions included an initial cycle of 95°C for 2 min, followed by 30 cycles of 30 s at 94°C, 30 s at 54°C, and 90 s at 72°C, with a final step of 72°C for 10 min. Thirty microliters of each PCR product was resolved on 2% agarose gels, and products were excised and purified in two groups, 100- to 300-bp and 300- to 600-bp sizes, unless otherwise indicated. DNA was extracted using the QIAquick extraction kit (Qiagen), and products were cloned in pGEMT-Easy (Promega). Approximately 30 clones per reaction were randomly sequenced using M13 universal primers. Sequences were first analyzed by RepeatMasker (http://www.repeatmasker.org) to determine the subfamily of LINE-1 that generated the antisense transcript. The unique nonrepeated portion of the sequence was extracted and mapped using BLAT (51) (http://www.genome.ucsc.edu) to the March 2006 human reference sequence (NCBI36.3/hg18) to identify a source LINE-1 locus.
Total RNA was extracted using TRIzol (Invitrogen) and following the manufacturer's directions. Next, 1 μg was treated with 2 U of RNase-free DNase I (Invitrogen) for 30 min at room temperature. To prevent genomic DNA contamination, this step was repeated twice. Then, a High-Capacity cDNA reverse transcription kit (Applied Biosystems) was used to generate cDNAs.
To determine the L1 expression level, triplicate samples of diluted (1/5 and 1/10) cDNAs were used in a real-time PCR using Platinum SYBR green quantitative PCR SuperMix-UDG (Invitrogen) in an MX3005P real-time PCR machine (Stratagene). We used two sets of oligonucleotide primers (27) to amplify 61-bp and 84-bp amplicons from the 5′ UTR and ORF2 regions, respectively, of a consensus L1Hs element. Glyceraldehyde 3-phosphate dehydrogenase (GAPDH) was amplified as an internal normalization control as previously described (63). We determined the threshold cycles (CTs) for LINE-1 and GAPDH and performed a melting curve analysis from 50°C to 95°C with readings every 0.2°C to confirm the identities of the amplified products. The CT obtained from the GAPDH PCR was used to normalize the mRNA contents of the samples.
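The GAPDH normalization described above can be illustrated with the widely used 2^(-ΔCT) calculation. The text states only that the GAPDH CT was used for normalization, so the exact formula, the assumption of ~100% amplification efficiency, and every name and value below are illustrative assumptions rather than the authors' procedure.

```python
def relative_expression(ct_target: float, ct_reference: float) -> float:
    """Expression of a target relative to a reference gene (2^-dCT).

    Assumes both amplicons double every cycle (100% PCR efficiency).
    """
    return 2.0 ** -(ct_target - ct_reference)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fold_difference(ct_target_a, ct_ref_a, ct_target_b, ct_ref_b) -> float:
    """Fold difference in normalized expression, sample A over sample B.

    Each argument is a list of replicate CT values (e.g., triplicates).
    """
    expr_a = relative_expression(mean(ct_target_a), mean(ct_ref_a))
    expr_b = relative_expression(mean(ct_target_b), mean(ct_ref_b))
    return expr_a / expr_b


# Hypothetical triplicate CTs: L1 and GAPDH in two cell types.
print(fold_difference([20.1, 19.9, 20.0], [18.0, 18.1, 17.9],
                      [23.6, 23.4, 23.5], [18.0, 17.9, 18.1]))
```

Averaging the replicate CTs before exponentiating is one common choice; averaging the per-replicate 2^(-ΔCT) values instead gives slightly different numbers.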
To determine which L1s are expressed in pluripotent cells, a fraction of the cDNAs was subjected to conventional RT-PCR using primers that amplify a 235-bp portion of L1 ORF1 (35, 36). Next, amplified products were cloned into pGEMT-Easy (Promega) and 30 randomly selected clones were sequenced. Upon sequencing and RepeatMasker analysis, we determined the type (i.e., subfamily) of L1 expressed as described previously (35, 36).
Finally, to determine the expression level of Alu subfamilies Y, S, and J, triplicate samples of diluted cDNAs were used in a real-time PCR using the conditions described above. We designed primers specific for each Alu subfamily (available upon request) by using a database of all known human Alu elements (12). As described above, we also determined the CT for GAPDH, which was used to normalize the mRNA content in the samples. Note that L1/Alu quantification using this procedure may also amplify L1/Alu fragments that are exonized or present in longer transcription units.
Genomic DNA was extracted from H13B (grown on Matrigel as described previously) and HeLa cells using the DNeasy Blood Mini kit (Qiagen) by following the manufacturer's instructions. We then used 200 ng of genomic DNA per genotyping PCR using High Fidelity Expand Taq (Roche). The PCR conditions included an initial cycle of 95°C for 4 min, followed by 35 cycles of 30 s at 94°C, 30 s at 54°C, and 60 s at 72°C, with a final step of 72°C for 10 min. Twenty microliters of each PCR product was resolved on 1.5% agarose gels, and the amplification products were excised, purified (using the QIAquick extraction kit [Qiagen]), and cloned into pGEMT-Easy (Promega). We sequenced at least four clones of each PCR product to confirm the identity of the amplified product.
The 5′ UTR of L1PA1, L1PA2, L1PA3, L1PA4, L1PA6, L1PA7, L1PA8, and L1PA10 elements was analyzed using the TF Search at http://www.cbrc.jp/research/db/TFSEARCH.html. Only those transcription factor binding sites (TFBS) that showed a score of >0.93 were considered in the analysis. As a control for false-positive TFBSs, we generated a scrambled sequence of each 5′ UTR sequence (at http://www.molbiol.ru/eng/scripts/01_16.html) that was analyzed using the TF Search. None of the TFBSs identified in the LINE-1 5′ UTRs was identified in the scrambled sequences (see Document 1 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf).
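The logic of the scrambled-sequence control can be sketched as follows. This is not an interface to TF Search, which is a web tool; the sketch assumes that hits have already been collected as (factor name, score) pairs, and the function names and example hits are hypothetical.

```python
import random


def scramble(sequence: str, seed=None) -> str:
    """Shuffle a sequence, preserving its base composition."""
    bases = list(sequence)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


def filter_hits(hits, min_score: float = 0.93):
    """Keep only hits scoring above the threshold used in the text."""
    return [(name, score) for name, score in hits if score > min_score]


def specific_hits(real_hits, scrambled_hits, min_score: float = 0.93):
    """Hits found in the real 5' UTR but not in its scrambled control."""
    control = {name for name, _ in filter_hits(scrambled_hits, min_score)}
    return [hit for hit in filter_hits(real_hits, min_score)
            if hit[0] not in control]


# Hypothetical hit lists for a real 5' UTR and its scrambled control.
real = [("factor_A", 0.95), ("factor_B", 0.90), ("factor_C", 0.97)]
control = [("factor_C", 0.94)]
print(specific_hits(real, control))
```

Because the scrambled sequence keeps the same base composition, any site that also appears in it is more likely to reflect composition than a conserved motif.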
To determine the percentage of full-length L1Hs elements within human genes, we performed BLAST searches of four major genomic DNA sequence data sources. These were the GenBank nucleotide database (April 2008), the human genome reference assembly (NCBI36.3/hg18) (53), the Celera Genomics human genome shotgun assembly (AADB/November 2001) (81), and the HuRef diploid human genome sequence (J. Craig Venter Institute whole-genome shotgun assembly [May 2007; http://www.jcvi.org/research/huref/]). This assembly represents a composite haploid version of the diploid genome sequence from a single individual (J. Craig Venter) (54). To be considered for further analysis, the identified L1Hs elements had to have a genomic size of >5,900 bp and show ≥98.5% sequence identity to a known hot L1 (L1.3, accession no. L19088) (71). The 37-bp poly(A) tail located at the 3′ end of the L1.3 element sequence was removed so that L1 hits would not be excluded due to variation in the length of this simple-sequence tract. All L1 sequences meeting these criteria and their flanking sequences were exhaustively compared to remove redundant sequences. This analysis identified a nonredundant set of 533 elements whose insertion points were mapped to the human reference sequence (NCBI36.3/hg18), irrespective of whether the element was present in the reference assembly. The insertion coordinates of the 533 mapped elements were compared to the transcription start and stop coordinates of a nonredundant set of 20,304 human genes derived from the UCSC Genome Browser RefSeq Genes track. Where multiple transcripts were present for a gene, the transcript with the largest genomic size was used. One hundred sixty-four (~30%) of the 533 L1 elements mapped within RefSeq Gene transcription units by these criteria.
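The comparison of insertion coordinates with gene transcription units can be sketched as a simple interval test. The sketch below is an assumption-laden illustration, not the authors' pipeline: it assumes 0-based, half-open coordinates (as in UCSC tables), uses a linear scan that is adequate for a few hundred insertions, and all record names and coordinates are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Transcript(NamedTuple):
    gene: str
    chrom: str
    tx_start: int  # 0-based, inclusive (assumed)
    tx_end: int    # exclusive (assumed)


def longest_transcript_per_gene(transcripts: Iterable[Transcript]) -> Dict[str, Transcript]:
    """Keep, for each gene, the transcript with the largest genomic size."""
    best: Dict[str, Transcript] = {}
    for t in transcripts:
        current = best.get(t.gene)
        if current is None or (t.tx_end - t.tx_start) > (current.tx_end - current.tx_start):
            best[t.gene] = t
    return best


def genes_containing(chrom: str, position: int,
                     genes: Iterable[Transcript]) -> List[str]:
    """Names of genes whose transcription unit spans the insertion point."""
    return [g.gene for g in genes
            if g.chrom == chrom and g.tx_start <= position < g.tx_end]


def fraction_within_genes(insertions: List[Tuple[str, int]],
                          genes: Iterable[Transcript]) -> float:
    genes = list(genes)
    inside = sum(1 for chrom, pos in insertions
                 if genes_containing(chrom, pos, genes))
    return inside / len(insertions)


# Hypothetical toy data.
transcripts = [
    Transcript("GENE1", "chr1", 1_000, 5_000),
    Transcript("GENE1", "chr1", 1_000, 9_000),   # longer isoform is kept
    Transcript("GENE2", "chr2", 20_000, 30_000),
]
genes = longest_transcript_per_gene(transcripts).values()
l1_insertions = [("chr1", 8_500), ("chr2", 10), ("chr3", 500)]
print(fraction_within_genes(l1_insertions, genes))
```

For genome-scale sets, an interval tree or a sorted sweep would replace the linear scan; with 533 elements and ~20,000 genes the simple version is still fast.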
To determine if L1s within genes are preferentially expressed in pluripotent cells, we used a hypergeometric analysis as follows. The total population of N segments was set to 533, of which n (164) have a particular annotation, X = “located within genes.” In each sample of K segments (the expressed L1s), we counted the k segments carrying that annotation. Next, we calculated the probability of the observation using the hypergeometric distribution as follows:

$$P(x = k) = \frac{\binom{n}{k}\binom{N-n}{K-k}}{\binom{N}{K}}$$

where N is the number of segments on the reference list, n is the number of segments on the reference list annotated with X, K is the number of segments on the input list, and k is the number of segments on the input list annotated with X, and

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$

To generate a P value, we used the following equation:

$$P = \sum_{i=k}^{\min(n,K)} \frac{\binom{n}{i}\binom{N-n}{K-i}}{\binom{N}{K}}$$
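A hypergeometric enrichment test of this kind can be computed directly with the standard library. The sketch below uses the standard hypergeometric probability and, as an assumption, an upper-tail P value (the probability of observing k or more annotated segments); N = 533 and n = 164 are taken from the text, whereas the K and k values shown are hypothetical.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, n: int, K: int) -> float:
    """P(x = k): k annotated segments in a sample of K drawn from N,
    of which n are annotated."""
    return comb(n, k) * comb(N - n, K - k) / comb(N, K)


def hypergeom_upper_tail(k: int, N: int, n: int, K: int) -> float:
    """P(x >= k), used here as an enrichment P value (assumed convention)."""
    return sum(hypergeom_pmf(i, N, n, K) for i in range(k, min(n, K) + 1))


N_TOTAL = 533      # full-length L1Hs elements mapped (from the text)
N_IN_GENES = 164   # of those, located within genes (from the text)

# Hypothetical sample: 20 expressed L1s, 12 of them within genes.
print(hypergeom_upper_tail(12, N_TOTAL, N_IN_GENES, 20))
```

`math.comb` requires Python 3.8 or later and works with exact integers, so no overflow occurs for populations of this size.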
To obtain a profile of Alu elements expressed in hESCs, we isolated total RNAs from three undifferentiated hESC lines (H7, H9, and H13B) and employed a 3′ RACE primer to generate a library of cDNAs. These cDNAs were used in PCRs with an outer primer and a primer designed to be able to amplify a broad range of Alu elements (12). We conducted PCRs in triplicate, cloned and sequenced the products, and identified the subfamily of each Alu element sequence using RepeatMasker (http://www.repeatmasker.org/), analyzing more than 100 distinct sequences. We observed that hESC lines express a wide range of Alu elements, including both old and young subfamilies (Fig. 1a; see Table 1 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf). We confirmed the expression of subfamilies Y, S, and J by quantitative RT-PCR using primers specific for each subfamily (Fig. 1b; see Materials and Methods). Subfamilies Y, Sx, and Sp were the most abundant, and there were no large differences between male and female hESCs in the type of Alu elements expressed (see Table 1 posted at the above URL).
An average human genome contains ~6,000 active core Alu elements, including the modern Y and older S Alu subfamilies (12). We therefore analyzed each cloned Alu element for the presence of 124 conserved positions that are retained by active core Alu elements (12). We found that ~70% of the Alu elements expressed in hESCs contain the majority (>80%) of these conserved nucleotides (see Table 2 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf). Thus, hESCs express many Alu elements that contain a potentially active core element, with a modest enrichment of the youngest Alu subfamily, Y (see below).
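Scoring a cloned Alu core against a set of conserved positions can be sketched as below. The 124 positions themselves come from reference 12 and are not listed here, so the sketch uses a small hypothetical set; it also assumes that each sequence has already been aligned to the consensus so that positions are directly comparable (1-based).

```python
def conserved_fraction(aligned_sequence: str, conserved_positions: dict) -> float:
    """Fraction of conserved positions at which the sequence carries the
    expected base. Positions are 1-based consensus coordinates."""
    matches = sum(
        1 for position, base in conserved_positions.items()
        if position <= len(aligned_sequence)
        and aligned_sequence[position - 1].upper() == base
    )
    return matches / len(conserved_positions)


def is_potentially_active(aligned_sequence: str, conserved_positions: dict,
                          threshold: float = 0.80) -> bool:
    """True if more than `threshold` of the conserved positions are retained."""
    return conserved_fraction(aligned_sequence, conserved_positions) > threshold


# Hypothetical stand-in for the 124 conserved positions of reference 12.
toy_positions = {1: "G", 3: "C", 5: "G", 7: "T", 9: "A"}
print(conserved_fraction("GGCGGGTAA", toy_positions))
print(is_potentially_active("GGCGGGAAA", toy_positions))
```

Sequences containing insertions or deletions would need a proper pairwise alignment before this position-by-position comparison is meaningful.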
We next sought to determine whether the core sequences derived from hESC-expressed Alu elements are active in cultured cells (Fig. 1c) (28). To avoid bias, we chose 13 hESC-expressed Alu elements at random and cloned their core sequence into the backbone of plasmid pAluNF1-neoIII (see Materials and Methods). All of these Alu cores contain the conserved G25 nucleotide, which is critical for SRP 9/14 binding, and all but two elements contained the G159 nucleotide, which also is present in the conserved SRP 9/14 binding site (see Fig. 1 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf). Of the 13 randomly selected Alu elements analyzed, 7 belong to the Sx subfamily, 2 each belong to the Sp and Sg subfamilies, 1 belongs to the Sc subfamily, and 1 belongs to the Y subfamily. We then tested these Alu constructs for retrotransposition in HeLa cells (28). In this assay, an untagged driver L1 (that produces ORF2p [2, 13]) is cotransfected with an Alu element tagged with a reporter gene (conferring resistance to G-418) that can only be activated upon a single round of trans retrotransposition (Fig. 1c). As controls we used a known active Ya5 Alu element (28, 82) which was transfected with or without an L1 driver.
Of the 13 Alu elements analyzed (Fig. 1d and e), 8 (~61%) had at least 10% of the activity of a Ya5 Alu element (elements A-1_Sx, A-12_Y, A-13_Sx, A-14_Sx, A-15_Sg, A-16_Sc, A-19_Sp, and A-20_Sx, Fig. 1d and e). The remaining five Alu elements had less than 2% of the activity of the Ya5 element (A-5_Sx, A-18_Sp, A-21_Sx, A-32_Sg, and A-34_Sx), likely due to the presence of mutations, deletions, and insertions within the core sequence (see Fig. 1 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf). In agreement with a previous study (12), we found an old Sx Alu element that displayed a high level of trans retrotransposition (clone A-20_Sx, which shows ~70% of the activity of a Ya5 Alu element; Fig. 1d and e). Control experiments indicated that Alu Ya5 mobilization could only be achieved upon cotransfection with a driver L1 (Fig. 1d and e, compare Ya5 and Ya5 with no driver) (28, 43). Thus, undifferentiated hESCs express a wide variety of Alu elements and some of these contain an active core element.
Endogenous L1 RNPs can be detected in hESC and hEC cell lines, suggesting that the L1 sense promoter is active in pluripotent cells (35, 41, 58, 74). We confirmed these findings by determining the amount of sense L1 transcripts present in pluripotent cells (and compared this to the amount in differentiated cells as a control). Briefly, we designed real-time PCR primer sets annealing to either the 5′ UTR or ORF2 region of a consensus L1 and determined the L1 mRNA contents of hESC, hEC, and HeLa cells (see Materials and Methods). We observed that, on average, pluripotent cells express 10 to 15 times more sense L1 mRNA than differentiated cells (Fig. 2a). In addition, we used conventional RT-PCR and sequencing to determine which L1s are expressed in pluripotent cells as described previously (35, 36). We observed that a wide range of L1s is expressed in hEC cells and hESCs (Fig. 2b).
Next, we examined if the L1 antisense (AS) promoter was active in pluripotent cells and if it could allow the identification of expressed L1s by expression originating from their AS promoter. A previous report characterized an antisense promoter located in the 5′ UTR of L1 (region between bp 400 and 600) (76) and reported human cell-specific transcripts originating from these promoters.
To test the strength of both sense and antisense L1 promoters in pluripotent cells, we cloned the 5′ UTR of an L1Hs element (L1.3) (71) into the firefly luciferase reporter pGL3-basic plasmid (Fig. 2c) in the sense (5S-FF) or antisense orientation (5AS-FF). We then cotransfected these plasmids into cultured cells with a Renilla luciferase internal control (driven by the SV40 promoter). When plasmids were transfected into the male hEC line 2102Ep, we detected active transcription produced from both sense and antisense L1 promoters (Fig. 2d, top graph). We also detected efficient transcription from both L1 promoters in H13B (male) and H7 (female) hESCs (Fig. 2d, middle and bottom graphs). Although hESCs are notoriously difficult to transfect, the observed difference in promoter strength is not likely caused by a technical difficulty, as the level of luciferase units (LUs) is far higher than in the untransfected controls (10- to 100-fold, depending on the construct; see Fig. 2 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf). Notably, the sense promoter was 3 to 13 times more active than the antisense L1Hs promoter in all of the cell lines tested, which is consistent with previous reports (76, 85).
Given that the L1 antisense promoter is active in hECs and hESCs, we reasoned that L1s located in different genomic loci could give rise to unique transcripts that could be mapped precisely to the genome (Fig. 3) (76). Thus, we adapted a 3′ RACE protocol to precisely map mRNAs produced by transcription from the L1 antisense promoter into flanking genomic sequences (see Fig. 3 for a simplified example with five L1s). The assay involved the isolation of total RNA from cells (Fig. 3a), the generation of a cDNA library using a 3′ RACE primer (in triplicate, Fig. 3b), and a final PCR (also in triplicate) that used a primer complementary to the 3′ RACE adapter and an L1 AS primer to allow the specific amplification of AS-L1 transcripts (Fig. 3c). PCR products were then cloned and sequenced, and the unique part of L1 AS transcripts was mapped back to the human genome reference sequence (HGRS) using BLAT (51) (Fig. 3d and e). This procedure will identify L1s with active AS transcription and very likely sense transcription.
We first conducted a mapping reaction using total RNA isolated from the hEC cell line 2102Ep and observed amplification products that ranged in size from ~50 to 500 bp (data not shown). We cloned amplification products from this pool and sequenced randomly selected clones. We observed that the average size of inserts in the library was approximately 170 bp, likely due to bulk cloning (see Fig. 3a and b posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf), and identified several AS-L1 transcripts that could be precisely mapped to regions of the HGRS that contained an annotated full-length L1 element (Fig. 4a and Table 1). These results indicated that the antisense promoter of L1 can be used to identify the genomic loci from which individual L1s are expressed, at least by using their AS promoter as a proxy, in pluripotent cells. Notably, we also identified AS-L1 transcripts that precisely map to the HGRS but that did not have an annotated full-length L1 element upstream. These transcripts likely indicate the presence of novel dimorphic L1 insertions (see below).
We next conducted the L1 AS mapping procedure with total RNAs isolated from three undifferentiated hESC lines (H7, H9, and H13B). To avoid size enrichment artifacts, we size fractionated the amplified PCR products into two groups, <300 bp and 300 to 600 bp. As in hEC cells, we observed a range of amplification products in the three hESC lines. However, we obtained a better representation of the sizes of the AS-L1 transcripts in these libraries, likely due to the fractionation of amplified products, with maximum sizes of 300 and 595 bp in each size group (average sizes of 104 and 382 bp, respectively; see Fig. 3a and b posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf).
We next characterized approximately 30 clones per library (in triplicate) and then mapped the sequences to the HGRS using BLAT (51). An overview of the expressed AS-L1s in the three hESC lines is shown in Fig. 4b to d (see Table 2 for a detailed description of each annotated entry). These data indicate that active AS-L1 expression occurs from different chromosomal locations in hESCs. Notably, the majority of AS-L1 transcripts map to both human-specific and older L1 subfamilies in the HGRS (Fig. 4; see Fig. 6a), suggesting that the activity of the L1 antisense promoter is not restricted to L1Hs elements (see below). As we reasoned that those L1s showing AS transcription very likely would express sense L1 mRNA, we next analyzed whether L1Hs elements that generate an AS-L1 transcript in hEC cells and hESCs correspond to previously characterized active or hot L1 elements present in the HGRS (19). Indeed, among the L1Hs elements characterized in 2102Ep, H13B, H7, and H9 cells, some were previously demonstrated to be retrotransposition competent in a cell culture-based assay (19) and some correspond to elements that have >20% of the activity of a hot reference L1 (see Table 2, last column). Remarkably, our results indicate that in some hESC lines, ~30% of the expressed L1Hs elements (identified on the basis of active AS-L1 expression) correspond to known RC-L1Hs (19).
In analyzing L1 expression in hESCs, we found that almost half of the characterized L1Hs elements correspond to loci lacking an L1Hs element in the HGRS. We reasoned that loci showing AS-L1 expression where a full-length L1Hs was absent from the HGRS represented polymorphic elements that are differentially present or absent in different hESC lines, consistent with previous findings (8, 15, 84). Notably, we did not find any polymorphic L1s in the set of identified older L1 elements (L1PA2 to L1PA15), consistent with their lack of activity during recent human evolution (14-16). To confirm that these loci contain polymorphic L1Hs elements, we genotyped a set of eight putatively polymorphic L1Hs elements isolated from H13B cells (Fig. 5a and Table 2). Two representative examples of the genotyping are shown in Fig. 5c and d, where the loci on chromosomes 10 and 3 (clones 13_F6 and 13_F10, respectively) produce amplification products consistent with the presence of an L1Hs insertion. Cloning and sequencing confirmed the presence of both L1Hs elements. We also confirmed the presence of the other six polymorphic L1Hs elements in the genome of H13B cells (data not shown), and in one case we found that the element is also present in the genome of HeLa cells (Fig. 5d). Thus, our results indicate that the L1 antisense promoter can be used to detect the presence of polymorphic L1 elements that are expressed in hESCs.
In our analysis of the AS-L1 transcripts expressed in 2102Ep and hESCs, we observed that between 40 and 55% of the sequences correspond to older L1s, which suggests that the activity of the antisense L1 promoter is conserved through LINE-1 evolution. In hESCs, we found an overrepresentation of AS-L1 transcripts generated by L1PA2, L1PA3, and L1PA4 elements, although older subfamilies could also generate AS-L1 transcripts (Fig. 6a). Indeed, each clone containing an AS-L1 transcript originating from an old L1 (i.e., non-L1Hs elements) could be precisely mapped to an annotated repeat in the HGRS (Tables 1 and 2). This suggests that they likely do not represent PCR recombination artifacts, although some of the older L1-containing AS-L1 transcripts may represent artifacts of priming within a longer transcript (generated upstream of the full-length L1 element [20, 48]).
To unambiguously determine if the activity of the L1 antisense promoter is conserved through evolution, we cloned the first 900 bp of a cohort of old L1s (L1PA2, L1PA3, L1PA4, L1PA6, L1PA7, L1PA8, and L1PA10) into the vector pGL3Basic and determined their promoter strengths in both the sense and antisense orientations in hEC cells relative to those of the sense and antisense promoters of an L1Hs or L1PA1 element (L1.3, see above). Remarkably, we observed that both sense and antisense promoters from the cohort of old L1s were active in several hEC lines, including PA-1, 2102Ep, and N-Tera2D1 (Fig. 6b; see Fig. 4 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf), although their strength was always lower than that of an L1PA1 element in our experimental settings (see Fig. 5 posted at the above URL). This may mean that different TFBSs are differentially present in each promoter (see Document 1 posted at the above URL). In addition, technical limitations associated with the use of luciferase-based constructs to measure promoter strength may influence the reported level of activity of a promoter. Thus, these results indicate that the antisense promoter of L1 is conserved through evolution, at least up to L1PA10 elements.
When analyzing the expression of AS-L1s in hESCs, we identified genomic loci with active AS-L1 expression that were common to different hESC lines. Of these, five loci were shared by all of the hESC lines and seven were shared by at least two hESC lines (Fig. 4c and d). Within these groups, four and six AS-L1 transcripts, respectively, are generated by full-length L1Hs elements, and the remainder are generated by L1PA3 and L1PA7 elements. Although the coverage of our procedure is unknown and may be biased toward shorter transcripts, it seems that active AS-L1 expression is largely confined to discrete loci in hESCs and that different hESC lines share some of these loci.
In addition, we found that most of the expressed AS-L1 transcripts characterized could be mapped to known/annotated genes and expressed sequence tags (ESTs; Table 2: 88%, 80%, and 81% in H13B, H7, and H9, respectively) and that their expression level appears to be independent of the L1's age. Notably, the proportion of expressed L1 elements that reside within genes in pluripotent cells seemed much higher than expected. To determine if expressed L1s in pluripotent cells are disproportionately located within genes, we compared our data to a nonredundant data set of human-specific full-length L1 sequences. These sequences were extracted from four large genomic DNA data sources (the GenBank nucleotide sequence database [April 2008], the HGRS, the Celera Genomics human genome sequence, and the HuRef diploid human assembly). Of the 533 full-length L1Hs elements in these data sets, 164 are located within the transcription unit of RefSeq genes (~30%; see Materials and Methods for details). This starkly contrasts with the ~80% of the expressed L1s reported here that are located within known genes. A hypergeometric analysis (see Materials and Methods) under the null hypothesis that the distribution of expressed L1s in genes is the same as their genomic distribution, irrespective of expression, confirmed that this null hypothesis can be robustly rejected (P = 4.25e-05, 1.70e-06, 8.84e-08, and 1.68e-09 in 2102Ep, H7, H9, and H13B cells, respectively). Assuming that full-length, human-specific L1s maintain a promoter activity similar to that of our data set of expressed L1s, these data strongly suggest that there is epigenetic regulation of the L1 antisense promoter in pluripotent cells.
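The logic of the hypergeometric analysis described above can be illustrated with a minimal Python sketch. It treats the 533 full-length L1Hs elements as the population, the 164 located within RefSeq genes as the "successes," and asks how likely it is to draw at least the observed number of genic elements by chance. The population values come from the text; the sample values (25 expressed L1s, of which 20 are genic) are hypothetical placeholders chosen only to approximate the ~80% figure, because the per-line counts are given in Table 2 rather than here. The function name is illustrative, and the one-sided upper-tail form is an assumption; the exact procedure is described in Materials and Methods.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """Return P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population: total number of elements (e.g., all full-length L1Hs elements)
    successes:  elements in the population with the property (e.g., within genes)
    draws:      size of the sample (e.g., expressed L1s in one cell line)
    observed:   number of sampled elements with the property
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("successes and draws must lie between 0 and population")
    upper = min(draws, successes)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(observed, 0), upper + 1)
    )
    return favorable / comb(population, draws)


if __name__ == "__main__":
    # 533 and 164 are taken from the text; 25 and 20 are hypothetical sample counts.
    p_value = hypergeom_upper_tail(population=533, successes=164, draws=25, observed=20)
    print(f"One-sided hypergeometric P value: {p_value:.3g}")
```

Substituting the actual number of expressed L1s and the number within genes for each cell line would give a per-line P value under the same null hypothesis, although the result need not match the reported values if the original analysis used a different background set or test form.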
In agreement with the above hypothesis, the expression of Alu elements correlated with the known number of Alu elements belonging to each subfamily in the human genome (Fig. 1a; see Table 1 posted at http://www.juntadeandalucia.es/fundacionprogresoysalud/repositorio/files/0000/0080/16._MCB00561_SupInfo_final.pdf) (9, 12, 62). However, we did detect a >2-fold enrichment of expressed Alu Y elements over Alu J elements (the average abundance across all three lines was 18% [Alu Y] and 7% [Alu J]), which is intriguing, as both subfamilies are present at similar copy numbers in the human genome (Fig. 1a and b) (9, 12, 62).
A recent study has determined that there are about 6,000 Alu active core elements per genome, including members of old and young subfamilies (12). Our results indicate that hESCs express a wide range of Alu elements, including both young and old subfamilies. These results are also consistent with previous L1 expression analyses where it was found that hESCs express a range of L1s of various ages (35). When corrected for the known copy number of Alu elements in the human genome, our results indicate that hESCs primarily express young Alu subfamily members (Y and S), which is in agreement with their recent evolutionary amplification in humans (9). To obtain an unbiased overview of expressed Alu elements that contain an active core, we analyzed the activity of a randomly selected cohort of Alu elements expressed in hESCs, and determined that, on average, 60% of them have >10% of the activity of a reference hot Alu element (12, 82). This result reflects the activity of the Alu cores and does not incorporate the influence of 5′ and 3′ flanking regions on the assayed Alu elements' activity (1, 22, 70). When combined with an in silico estimate of Alu activity, assuming that those elements with less than 10% variation with respect to the active core consensus represent active elements (12), approximately 40 to 50% of the core of Alu elements expressed in hESCs have a trans-retrotransposition potential of at least 10% of that of an active Alu element known to have caused a human disease (82).
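The in silico criterion mentioned above (elements with less than 10% variation relative to the active core consensus are counted as active) can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline used in the study: it assumes the element and the consensus are already aligned to the same length, counts every differing position (including gap characters) as variation, and uses short toy sequences that are not real Alu sequences. The function names and the default threshold parameter are illustrative.

```python
def percent_divergence(seq, consensus):
    """Percentage of aligned positions at which seq differs from consensus.

    Both strings must be pre-aligned and of equal length; gaps ('-') count
    as differences in this simplified sketch.
    """
    if len(seq) != len(consensus):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not consensus:
        raise ValueError("consensus must not be empty")
    mismatches = sum(1 for a, b in zip(seq.upper(), consensus.upper()) if a != b)
    return 100.0 * mismatches / len(consensus)


def is_putatively_active(seq, consensus, threshold=10.0):
    """Apply the <10% variation rule of thumb to a pre-aligned core sequence."""
    return percent_divergence(seq, consensus) < threshold


if __name__ == "__main__":
    # Toy 20-nt strings for illustration only; not an actual Alu core consensus.
    toy_consensus = "ACGTACGTACGTACGTACGT"
    toy_elements = {
        "element_A": "ACGTACGTACGTACGTACGA",  # 1 of 20 positions differs
        "element_B": "TCGTACGAACGTACGTACGA",  # 3 of 20 positions differ
    }
    for name, seq in toy_elements.items():
        div = percent_divergence(seq, toy_consensus)
        print(name, f"{div:.1f}% divergent", "active" if div < 10.0 else "inactive")
```

A real analysis would first align each expressed Alu core to the consensus with a dedicated alignment tool and would likely treat indels more carefully than this position-by-position count.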
To obtain a locus-specific census of expressed L1s in hESCs, we developed a method employing the antisense promoter contained within the L1 5′ UTR (76). First, we demonstrated that both sense (78) and antisense (76) L1Hs promoters are active in hEC cells and hESCs by using a plasmid reporter assay. Thus, it is likely that both promoters are active in genomic L1 copies, although our mapping protocol only captures transcription initiating from the L1 AS promoter. Indeed, we have confirmed that the mapping technique is useful in identifying L1Hs at various genomic loci that show active antisense transcription. Although we do not know the coverage of the transcriptome achieved by this procedure, the rate of false positives obtained is low, as we have never detected nonannotated old L1s generating AS-L1 transcripts (having analyzed >200 independent transcripts), and we have confirmed the existence of polymorphic L1Hs purely on the basis of their AS-L1 expression. It is worth mentioning that a significant proportion (~50%) of AS-L1 transcripts originated from old L1 subfamilies. Thus, we analyzed a panel of old L1s for promoter activity and found that the activity of the L1 AS promoter is robust throughout L1 evolution (at least as far back as L1PA10 elements). These old L1s also contain active sense promoters and potentially can produce double-stranded RNAs that could trigger an RNAi response to regulate L1 activity (85). Due to its functional conservation during evolution, it will be interesting to elucidate if the AS L1 promoter serves as an autoregulatory signal or participates in any step of L1 retrotransposition.
We also have determined that some of the expressed AS-L1s correspond to previously identified active L1Hs elements. In agreement, recent reports on somatic human tissues revealed the presence of many L1 elements transcribed in those cells, but few of them are likely to be active (69). However, it should be noted that allelic heterogeneity could impact the activity of a given L1 allele, as previously reported (56, 73). Furthermore, we have determined that almost half of the identified L1Hs elements expressed in hESCs are polymorphic, consistent with previous reports (8, 15, 84). It was previously shown that de novo L1 insertions can accumulate during early human embryonic development in vivo (80). Very recently, in a mouse model of human L1 retrotransposition, it was found that most de novo insertions occur in early embryonic development and that insertions in germ cells are uncommon (47). Indeed, all known human mutagenic L1/Alu insertions could have occurred early in development, indicating that hESCs are a bona fide model to study the accumulation of new L1/Alu retrotransposition events in humans (47). Our results indicate that approximately 40% of all expressed Alu elements and up to 20% of expressed L1Hs elements in hESCs (on the basis of their AS promoter) may represent active elements with the potential to retrotranspose. This suggests that the degree of somatic mosaicism attributable to L1 insertions generated during early development may be much higher than previously anticipated (10, 27, 31, 42, 45).
Having captured a cohort of L1s actively expressed in hESCs, we have found that expression of L1s (at least from their antisense promoter) appears to be confined to discrete genomic loci and that some of these loci are shared by different hESC lines. Indeed, recent studies that analyzed the regulated retrotransposon transcriptome in a panel of human somatic tissues by using deep-sequencing cap analysis gene expression and other methods noted that, despite making up a third of our genome, retrotransposons are less expressed, on average, than nonrepetitive regions of the genome (32, 69). In hESCs, in contrast, the expression of L1s located in known or hypothetical human genes is readily detectable. This is reflected in a very significant enrichment of L1s expressed from within known genes, relative to their genomic distribution (~80% of loci versus 30% expected). These data strongly suggest that while the L1 antisense promoter (and, by extrapolation, the sense promoter) is intrinsically active in human cells, there is apparently a relaxation in the control of L1 expression in pluripotent cells, most likely mediated by the widespread epigenetic remodeling typical of these cells. These results are also consistent with a recent report that demonstrated efficient epigenetic silencing and reactivation (by chromatin-modifying agents) of EGFP-marked de novo L1 retrotransposition insertion events in a panel of hEC cell lines (36). It remains to be seen whether the L1 expression detected in our study reflects programmed expression activation of hESC-specific genes or is a general feature of extensive epigenetic reprogramming. One model is that gene expression required to maintain the pluripotent state exposes L1 promoters within specific genes to chromatin contexts permissive for L1 expression. 
Indeed, it is tempting to speculate that active L1 elements present in expressed chromatin areas of human embryonic cells (i.e., ICM cells) would be more likely to generate copies of themselves that would be transmitted to new generations (Fig. 7).
We thank Thierry Heidmann (CNRS, France), Alicia Barroso-delJesus (Andalusian Stem Cell Bank, Spain), Peter W. Andrews (University of Sheffield, United Kingdom), and Astrid Roy-Engel (Tulane University) for providing the neo3 cassette, plasmid pRL-SV40, hEC cell lines, and HeLa-HA cells, respectively. We also thank Victoria Kelly (University of Michigan) for help with luciferase readings during early stages of this study, Sandra R. Richardson and John V. Moran (University of Michigan) for critically reading the manuscript, and Carolina Elosua (Andalusian Stem Cell Bank) and Alberto Labarga (Scientifik, Spain) for help with statistical analyses. J.L.C. thanks Gustavo Melen and Alicia Barroso-DelJesus (Andalusian Stem Cell Bank, Spain) for cloning advice and support. We are indebted to John V. Moran (Howard Hughes Medical Institute, University of Michigan), who enabled the proof-of-principle experiments for the antisense-based identification of expressed L1s using 2102Ep cells to be carried out in his laboratory when J.L.G.-P. was a postdoctoral fellow.
This work was supported by a Marie Curie IRG action (FP7-PEOPLE-2007-4-3-IRG to J.L.G.P.); by the Instituto de Salud Carlos III/FEDER, Spain (EMER07/056 to J.L.G.P., CP07/00065 to J.L.G.P., FIS PI08171 to J.L.G.P., and FIS PI070527 to J.A.M.); by the Junta de Andalucia, Spain (CICE P09-CTS-4980 to J.L.G.P., PeS PI-002 to J.L.G.P., and CICE P08-CTS-3678 to S.M.); by the Wellcome Trust (075163/Z/04/Z to R.M.B.); and by an MRC Capacity Building Ph.D. Studentship in Bioinformatics (to R.K.H.).
Published ahead of print on 1 November 2010.