Retrotransposons are mobile genetic elements that employ a germ line “copy-and-paste” mechanism to spread throughout metazoan genomes1. At least 50% of the human genome is derived from retrotransposons, with three active families (L1, Alu and SVA) associated with insertional mutagenesis and disease2-3. Epigenetic and post-transcriptional suppression block retrotransposition in somatic cells4-5, excluding early embryo development and some malignancies6-7. Recent reports of L1 expression8-9 and copy number variation10-11 (CNV) in the human brain suggest L1 mobilization may also occur during later development. However, the corresponding integration sites have not been mapped. Here we apply a high-throughput method to identify numerous L1, Alu and SVA germ line mutations, as well as 7,743 putative somatic L1 insertions in the hippocampus and caudate nucleus of three individuals. Surprisingly, we also found 13,692 and 1,350 somatic Alu and SVA insertions, respectively. Our results demonstrate that retrotransposons mobilize to protein-coding genes differentially expressed and active in the brain. Thus, somatic genome mosaicism driven by retrotransposition may reshape the genetic circuitry that underpins normal and abnormal neurobiological processes.
Malignancy and ageing are commonly associated with the accumulation of deleterious mutations that lead to loss of function, cell death or uncontrolled growth. Retrotransposition is clearly mutagenic; an estimated 400 million retrotransposon-derived structural variants are present in the global human population3 and more than 70 diseases involve heritable and de novo retrotransposition events2. Presumably for this reason transposition-competent retrotransposons are heavily methylated and transcriptionally inactivated4-5. Nevertheless, substantial somatic L1 retrotransposition has been detected in neural cell lineages10-12. Given the complex structural and functional organization of the mammalian brain, its adaptive and regenerative capabilities13 and the unresolved etiology of many neurobiological disorders, these somatic insertions could be of major significance14.
One explanation for the observed transpositional activity in the brain may be that the L1 promoter is transiently released from epigenetic suppression during neurogenesis11-12. Transposition-competent L1s can then repeatedly mobilize to different loci in individual cells and produce somatic mosaicism. Several lines of evidence support this model, including L1 transcription8-9 and CNV in brain tissues from human donors of various ages10-11 as well as mobilization of engineered L1s in vitro and in transgenic rodents10,12. Importantly, it is not known where somatic L1 insertions occur in the genome nor, considering that open chromatin is susceptible to L1 integration15, whether these events disproportionately affect protein-coding loci expressed in the brain.
Mapping the individual retrotransposition events that collectively form a somatic mosaic is challenging due to the rarity of each mutant allele in a heterogeneous cell population. We therefore developed a high-throughput protocol called retrotransposon capture sequencing (RC-seq). Firstly, fragmented genomic DNA was hybridized to custom sequence capture arrays targeting the 5′ and 3′ termini of full-length L1, Alu and SVA retrotransposons (Fig. 1a, Supplementary Table S1, Supplementary Table S2). Immobile ERVK and ERV1 LTR elements were included as negative controls. Secondly, the captured DNA was deeply sequenced, yielding ~25 million paired-end 101mer reads per sample (Fig. 1b). Lastly, read pairs were mapped using a conservative computational pipeline designed to identify known (Fig. 1c) and novel (Fig. 1d, Supplementary Fig. S1a-d) retrotransposon insertions with uniquely mapped read pairs (“diagnostic reads”) spanning their termini.
Previous work has equated L1 CNV with somatic mobilization in vivo10-11. To test this assumption with RC-seq, we first screened five brain sub-regions taken from three individuals (donors A, B and C) for L1 CNV. A significant (p<0.001) increase was observed in the number of L1 ORF2 copies present in DNA extracted from the hippocampus of donor C and a similar though smaller increase for donor A (Fig. 2). RC-seq was then applied to the brain regions that exhibited the highest (hippocampus) and lowest (caudate nucleus) L1 CNV using samples from all three donors, including a technical replicate of donor A caudate nucleus. A total of 177.4 million RC-seq paired-end reads were generated from seven libraries (Supplementary Table S3). RC-seq achieved deep sequencing coverage of known active retrotransposons, high reproducibility and limited sequence capture bias (Supplementary Results).
Read pairs diagnostic for novel retrotransposon insertions were clustered based on their insertion site, relative orientation and retrotransposon family. A total of 25,229 clusters were produced. Proximal clusters arranged on opposing strands indicated two termini of one insertion and were paired, resulting in a catalogue of 24,540 novel insertions (Supplementary Table S4). Unsurprisingly, the vast majority of these were either L1 (32.2%) or Alu (60.9%) (Fig. 3a). To segregate germ line mutations from other events, we combined the three largest available catalogues of L1 and Alu polymorphisms6,16-17 as an annotation database and also performed RC-seq upon genomic DNA extracted from pooled human blood, producing 6,150 clusters (Supplementary Table S5) that were intersected with the existing brain RC-seq clusters. Any brain clusters that (a) contained RC-seq reads from more than one region or individual, (b) overlapped a blood RC-seq cluster or (c) matched a known polymorphism were designated as germ line insertions. Overall, 8.4% of Alu insertions in the brain were annotated as germ line, versus only 1.9% for L1. Nearly all unannotated L1 insertions matched fewer than three diagnostic RC-seq reads (Fig. 3b) and were considered potential somatic insertions.
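The clustering and germ line annotation steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the merging distance `max_gap`, the dictionary layout of reads and clusters, and the function names are all assumptions made for illustration, and the pairing of clusters on opposing strands is omitted.

```python
from collections import defaultdict

def cluster_diagnostic_reads(reads, max_gap=100):
    """Group reads by (chrom, strand, family), then merge nearby positions.

    reads: iterable of dicts with keys chrom, pos, strand, family, sample.
    max_gap is a hypothetical distance; the text does not state one.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r["chrom"], r["strand"], r["family"])].append(r)

    clusters = []
    for (chrom, strand, family), members in groups.items():
        members.sort(key=lambda r: r["pos"])
        current = [members[0]]
        for r in members[1:]:
            if r["pos"] - current[-1]["pos"] <= max_gap:
                current.append(r)  # extends the open cluster
            else:
                clusters.append(_summarize(chrom, strand, family, current))
                current = [r]
        clusters.append(_summarize(chrom, strand, family, current))
    return clusters

def _summarize(chrom, strand, family, members):
    return {
        "chrom": chrom, "strand": strand, "family": family,
        "start": members[0]["pos"], "end": members[-1]["pos"],
        "n_reads": len(members),
        # one label per donor/region combination, e.g. "A_hippocampus"
        "samples": {r["sample"] for r in members},
    }

def is_germline(cluster, blood_clusters, known_polymorphisms, max_gap=100):
    """Apply criteria (a)-(c): shared across samples, seen in blood, or known."""
    def overlaps(other):
        return (other["chrom"] == cluster["chrom"]
                and other["family"] == cluster["family"]
                and other["start"] <= cluster["end"] + max_gap
                and other["end"] >= cluster["start"] - max_gap)
    return (len(cluster["samples"]) > 1
            or any(overlaps(b) for b in blood_clusters)
            or any(overlaps(k) for k in known_polymorphisms))
```

Under these assumptions, a cluster that fails all three criteria would remain a candidate somatic insertion, subject to the read-count consideration noted in the text.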
Candidate insertions were validated by PCR amplification and capillary sequencing. Thirty-five germ line L1, Alu, SVA and LTR insertions were readily confirmed by single-step PCR (Supplementary Table S6). Given low target molecule abundance and the high genomic frequency of the L1 3′ end, we devised a 5′ end nested PCR validation assay for somatic insertions. From 850 and 2,601 full-length (≥90%) L1 and Alu insertions, respectively representing 11.0% and 19.0% of the putative somatic insertions found for each family, we selected 29 examples (14 L1, 15 Alu) for validation. Nearly all of the chosen examples were exonic or intronic and were prioritized based on the degree of 5′ truncation, with longer insertions preferred. Optimization of the protocol, combined with substantial input DNA (100 ng), ultimately led to the confirmation of 14/14 L1s and 12/15 Alus (Supplementary Table S7, Supplementary Fig. S2). Four somatic SVA insertions were also assayed using the same process and two were confirmed (both SVA_F) before the available input material was exhausted.
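As a quick arithmetic check, the proportions quoted above follow from the somatic insertion totals reported earlier (7,743 L1 and 13,692 Alu). The short Python sketch below reproduces them and expresses the validation outcomes as percentages; the variable and function names are illustrative only.

```python
# Counts taken from the text; names are illustrative.
somatic_total = {"L1": 7743, "Alu": 13692}
full_length = {"L1": 850, "Alu": 2601}
confirmed = {"L1": (14, 14), "Alu": (12, 15), "SVA": (2, 4)}  # (confirmed, assayed)

def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

full_length_share = {
    family: percent(full_length[family], somatic_total[family])
    for family in somatic_total
}
confirmation_rate = {
    family: percent(ok, assayed) for family, (ok, assayed) in confirmed.items()
}

print(full_length_share)
print(confirmation_rate)
```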
Repeated attempts to PCR amplify the corresponding 3′ junctions consistently yielded off-target amplicons, leaving validation based exclusively on 5′ junctions. For this reason, we could not experimentally identify the target site duplications (TSDs) that are indicative of retrotransposition via target-primed reverse transcription (TPRT)1. We propose that the 3′ junctions of insertions validated at their 5′ end did not amplify efficiently due to the confounding factors listed above, as well as the presence of long polyA tails in on-target amplicons but often not, as we found, in off-target amplicons.
However, TSDs could in some cases be found directly by RC-seq (Supplementary Fig. S1d). An examination of germ line insertions sequenced to high depth (≥10 reads) at both their 5′ and 3′ ends revealed that 43/50 (86%) presented TSDs. Due to their very low abundance - and therefore low sequencing coverage - only three putative somatic insertions were detected by at least one RC-seq read at both termini. Two of these examples (one L1 and one Alu) presented TSDs. Despite these and other data strongly supporting retrotransposition as the main cause of somatic mobilization (Supplementary Results), an insufficient number of examples were sequenced at both ends to distinguish whether TPRT or an alternative retrotransposition mechanism18 was primarily responsible.
The somatic origin of each insertion was demonstrated by its presence in one of the assayed brain tissues and absence from the other, according to RC-seq and PCR results. As illustrative examples, an intronic somatic L1 insertion in HDAC1 is detailed in Fig. 4a and Fig. 4b whilst an exonic somatic Alu insertion in RAI1 is shown in Fig. 4c and Fig. 4d. These experimental results indicated that insertions detected by RC-seq occurred in vivo and did not represent sequencing artifacts.
Donor element annotation revealed that 80.2% of somatic L1 insertions corresponded to the most recently active human L1 subfamilies, L1-Ta and pre-Ta (Supplementary Fig. S3a). The normalized hippocampus:caudate nucleus ratio for somatic L1 insertions was 1.3, 0.5 and 2.2 for donors A, B and C, respectively, paralleling trends from the L1 CNV assay (Fig. 2). Protein-coding loci were disproportionately affected (Supplementary Table S8) compared to random expectation and compared to prior germ line frequencies (P<0.0001 for exons and introns, χ2 test). Pre-existing microarray expression data indicated that genes containing intronic L1s were twice as likely to be differentially over-expressed in the brain compared to random expectation (P<0.0001, χ2 test). Key loci were found to contain somatic L1 insertions, including tumor suppressor genes deleted in neuroblastoma and glioma (e.g. CAMTA1), dopamine receptors (e.g. DRD3) and neurotransmitter transporters (SLC6A5, SLC6A6, SLC6A9). Globally, a Gene Ontology analysis revealed enrichment for terms relevant to neurogenesis and synaptic function (Supplementary Table S9).
Unlike L1, Alu retrotransposition has not previously been reported in normal brain cells, a major finding of the present work. However, the L1 transposition machinery is known to mobilize Alu in trans19 and 83.0% of the somatic Alu insertions corresponded to the AluY subfamily most active in the human germ line (Supplementary Fig. S3b), making the coincidence of somatic L1 and Alu mobilization plausible. On a per element basis the observed Alu activity was approximately twenty-fold lower than L1 (Supplementary Results). Thus, it is unlikely that Alu CNV would be statistically significant if assayed by TaqMan qPCR10. The genomic patterns of Alu and L1 insertions were also different; somatic Alu insertions were not overrepresented in introns but were even more common in exons than L1 (Supplementary Table S8). Alu exonization is a noted cause of genetic disease2. Overall, L1, Alu and, to a more limited extent, SVA mobilization produced a large number of insertions that affected protein-coding genes.
Our results provide clear evidence that somatic L1 and Alu mobilization fundamentally alters the genetic landscape of the human brain and that retrotransposition is the primary mechanism underlying this phenomenon. In contrast to germ line activity6,16, somatic insertions disproportionately impacted protein-coding loci. Germ line insertions are rarely found in regions where they generate a deleterious phenotype because such mutations are strongly selected against during evolution. Somatic events, on the other hand, are present for one generation and may affect protein-coding loci in a specific environmental context, perhaps drawn to open chromatin in transcribed regions15. Apart from the obvious effects of exonic insertions, intronic events could act as subtle “transcriptional rheostats”20 or as cis-regulatory elements21 akin to the IAP insertion responsible for the viable yellow allele of Agouti in the mouse22.
Several recent studies have catalogued retrotransposon insertions in the human germ line and tumors6,16,23-24. Through RC-seq we have extended these data to the brain and linked somatic retrotransposition to neurobiological genes. For instance, HDAC1 is a genome-wide transcriptional regulator that controls the canonical L1 promoter4,25 and is implicated in psychiatric disease and tumorigenesis26. Another example highlighted here, RAI1, is a transcription factor highly expressed in the brain and previously linked with schizophrenia and Smith-Magenis syndrome27. An exonic Alu insertion in RAI1, as shown in Fig. 4c, could therefore have phenotypic consequences.
The hippocampus appears predisposed to somatic L1 retrotransposition10, which is intriguing given that its subgranular zone is a major source of adult neurogenesis13. This is also consistent with the hypothesis that L1 retrotransposition is related to neural plasticity14. Even more intriguing is the possibility that the APOBECs, RNA/DNA editing enzymes that have expanded under strong positive selection in the primate lineage and been shown to control L1 mobility, may modulate somatic retrotransposition in the brain28.
Mutagenesis due to somatic retrotransposition has obvious tumorigenic potential29 and may play a role in other diseases and biological processes. For example, deletion of the chromatin remodeling HDAC1 cofactor MeCP212,25 leads to increased L1 copy number and may inhibit neuronal maturation in Rett syndrome30. Somatic mosaicism could also be a factor in neurological dimorphisms seen among discordant monozygotic twins14. Future studies may determine whether the overall frequency of somatic retrotransposition varies considerably between individuals, as suggested by our data and previous experiments10, and between populations. Ultimately, direct identification of transcripts disrupted by somatic retrotransposition, together with its epigenetic regulation, may provide insights into the molecular processes underlying human cognition, neurodevelopmental disorders and neoplastic transformation.
Tissues were provided by the Netherlands Brain Bank (Amsterdam, The Netherlands) for three post mortem donors without evidence of neurodegeneration. Pooled human genomic DNA was purchased from Promega.
Quantitative PCR experiments were performed with minor modifications to an earlier approach10. Quantification included five technical replicates. For each assay, the ratio of L1 ORF2 to α-satellite repeats (SATA) was normalized to the ratio obtained for caudate nucleus. Ratios were compared across brain regions with a repeated measures one-way ANOVA with Bonferroni correction.
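The normalization described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the inputs are taken to be relative quantities already derived from the qPCR run, technical replicates are averaged before the ratio is formed, and the function name is hypothetical. The ANOVA step is not shown.

```python
from statistics import mean

def normalized_l1_ratios(orf2, sata, reference="caudate nucleus"):
    """Return the L1 ORF2 : SATA ratio per region, scaled to the reference region.

    orf2, sata: dicts mapping region name -> list of replicate quantities.
    Averaging replicates before dividing is an assumption of this sketch.
    """
    ratios = {region: mean(orf2[region]) / mean(sata[region]) for region in orf2}
    reference_ratio = ratios[reference]
    return {region: ratio / reference_ratio for region, ratio in ratios.items()}
```

By construction the reference region has a normalized ratio of 1, so values above 1 in another region indicate a relative gain in L1 ORF2 copies.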
A NimbleGen Sequence Capture 2.1M Array was customized to contain oligonucleotide probes tiled across the termini of full-length L1, Alu and SVA retrotransposons, as well as LTRs intended to act as negative controls. Probes were not filtered for repetitiveness. Eight probes were typically generated per L1, SVA and LTR and four probes per Alu, with a total of 4,885 probes across 875 targeted elements.
DNA sequencing libraries were constructed using an Illumina paired-end kit, with substantial modifications (see Supplementary Methods). 2.5μg starting genomic DNA was used for each RC-seq library. Ligation mediated PCR (LM-PCR) based amplification was performed pre and post hybridization. The average insert size was ~250nt. Enrichment was confirmed by qPCR against Alu. Sequencing was performed by ARK-Genomics, The Roslin Institute, on an Illumina GAIIx instrument.
Paired-end RC-seq reads were mapped to hg19 using SOAP2. Reads where both ends could be aligned to the genome, but not at the same locus, indicated novel retrotransposon insertions. These alignments were corroborated by BLAT, stringently filtered and clustered. Clusters were annotated using published retrotransposon databases6,16-17 and the NCBI RefSeq database.
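The rule for flagging candidate read pairs, both ends aligned but not at the same locus, can be illustrated with a minimal Python sketch. The `max_span` threshold and the tuple representation of an alignment are assumptions for illustration; the actual filtering criteria are given in the Supplementary Methods and are not reproduced here.

```python
def is_discordant(mate1, mate2, max_span=1000):
    """Return True when both mates align but to different loci.

    mate1, mate2: (chrom, pos) tuples, or None if the mate did not align.
    max_span is a hypothetical cutoff chosen to exceed the ~250 nt insert size.
    """
    if mate1 is None or mate2 is None:
        return False  # both ends must align to qualify
    chrom1, pos1 = mate1
    chrom2, pos2 = mate2
    return chrom1 != chrom2 or abs(pos1 - pos2) > max_span
```

In this sketch a pair with one mate near a reference retrotransposon and the other at a distant unique locus would be flagged, and would then proceed to the corroboration, filtering and clustering steps described above.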
J.K.B. is supported by a Wellcome Trust Clinical Fellowship (090385/Z/09/Z) through the Edinburgh Clinical Academic Track (ECAT). G.J.F. is funded by an Institute Strategic Programme Grant and a New Investigator Award from the British BBSRC (BB/H005935/1) and a C.J. Martin Overseas Based Biomedical Fellowship from the Australian NHMRC (575585). Human brain tissues were provided by the Netherlands Brain Bank to P.H. with ethical consent to be used as described in the study.
Supplementary Information is linked to the online version of the paper at www.nature.com/nature.
Author Information RC-seq sequences can be downloaded from the NCBI Sequence Read Archive (SRA) at www.ncbi.nlm.nih.gov/sra using the identifier SRA024401.1. Reprints and permissions information are available at www.nature.com/reprints. The authors declare competing financial interests: D.J.G., T.A.R. and J.A.J. are employed by Roche NimbleGen, Inc., and Roche NimbleGen capture arrays and reagents were used in the study.