Most intermediate-size mouse structural variants are due to transposition
High-resolution data from whole genome shotgun (WGS) sequencing of four inbred mouse strains, A/J, DBA/2J, 129S1/SvImJ (henceforth, 129S1), and 129X1/SvJ (129X1) (Mural et al. 2002), whose genomes remain unassembled, were deposited recently at the National Center for Biotechnology Information (NCBI) trace archive (Mural et al. 2002; Wade and Daly 2005). To identify genomic variants distinguishing these strains, we downloaded ~26 million WGS sequence traces (cumulative length ~18 billion nucleotides, nt) and aligned them individually to the reference C57BL/6J (C57) genome assembly using GMAP. This software application was developed to map exons and therefore is well suited to aligning genomic fragments with intervening breaks. It appeared to speed alignments by 10–100-fold over other applications such as Blat (R. M. Stephens, N. Volfovsky, unpublished data). We found that 73% of the individual sequence traces align unambiguously to the C57 reference genome with minimal or no variation (Supplementary Fig. 1). Many traces validate known SNPs and/or identify new ones, and show that significant portions of the compared strains’ genomes are non-polymorphic in pairwise comparisons (Wade et al. 2002). By contrast, others align to multiple repetitive elements or to no unique locus, identify short tandem repeat (STR) polymorphisms (N. Volfovsky et al., manuscript in preparation), and/or identify indel variants.
Discovery of structural variation between inbred mouse strains by WGS trace alignment
Categorization and coverage of 26 million WGS sequence traces
Upon merging overlapping individual WGS traces (Supplementary Fig. 1), more than ten thousand intermediate-sized variants, ranging from 100 nt to 10 kb, are predicted by this analysis to be present in the C57 reference but absent from at least one of the other strains (Supplementary Table 1). We call such an indel variant a “polymorphic insertion in C57” since it is present in the reference genome but absent from another strain. Even more variants were found present in at least one of the four unassembled strains but absent from the reference (“polymorphic insertion in strain X”). These latter variants are difficult to characterize without full genome assemblies, precluding their detailed analysis here. We do not wish to imply by this nomenclature that the polymorphisms’ mechanism of formation is known in all cases; an indel variant that we call an insertion in a given strain could alternatively have been deleted from another strain. All polymorphisms identified here were determined from comparisons with the reference C57 mouse genome. Our alignment procedures, categorization of WGS traces, and resulting sequence coverage for each strain are described in Supplementary Fig. 1 and Supplementary Materials and Methods. Comprehensive data about the genomic variants distinguishing mouse strains, as discovered in this study, are available using PolyBrowse, our new genomic polymorphism query and display website at http://polybrowse.abcc.ncifcrf.gov/ (Stephens et al. 2008).
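The merging step described above can be illustrated with a minimal sketch. It assumes that each trace-derived indel call has already been reduced to a hypothetical `(chromosome, start, end)` tuple in half-open reference coordinates; the function names, the toy coordinates, and the treatment of touching intervals as overlapping are illustrative assumptions, not the pipeline used in this study.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), half-open coordinates.
Interval = Tuple[str, int, int]


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals on the same chromosome that overlap or touch."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def in_size_range(interval: Interval, min_len: int = 100, max_len: int = 10_000) -> bool:
    """Keep only intermediate-sized variants (100 nt to 10 kb in the text)."""
    _, start, end = interval
    return min_len <= end - start <= max_len


# Toy calls, invented for illustration only.
calls = [
    ("chr10", 1000, 1400),
    ("chr10", 1300, 2100),
    ("chr10", 5000, 5050),
    ("chr3", 200, 900),
]
merged = merge_overlapping(calls)
kept = [iv for iv in merged if in_size_range(iv)]
print(kept)
```

Sorting first means each interval only needs to be compared with the most recently merged one, so the whole pass is linear after the sort.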
Intermediate-sized structural variation between mouse strains due to endogenous retrotransposition
Almost all such variants include at least 70% sequence content from various classes of repetitive elements, as identified by RepeatMasker (Smit et al. 2007). A large majority contain >90% transposon sequences per variant. Their length distribution is strikingly bimodal, matching transposons’ known structures in the mouse genome. Of these transposon indels, L1 (LINE, long interspersed element) retrotransposons are the most numerous. L1 integrants are frequently truncated from the 5’ end, but many others are full length (Symer et al. 2002). L1 polymorphisms contributed the most variant nucleotides to the strains’ genomes overall; their mean ± standard deviation (SD) length is 1,130 ± 590 nucleotides (nt). Other classes of active transposable elements, including short interspersed elements (SINEs, mostly B2 elements) and long terminal repeat-containing retrotransposons (e.g. ERV-K and MaLR elements), are also very frequently polymorphic between strains.
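The repeat-content thresholds quoted above (at least 70%, or more than 90%, of a variant's sequence) can be computed as in the following sketch. It assumes that repeat annotations overlapping a variant are available as plain `(start, end)` pairs in the same half-open coordinates as the variant; the function name and the toy coordinates are hypothetical, and no RepeatMasker output parsing is shown.

```python
from typing import Iterable, Tuple


def repeat_fraction(variant_start: int, variant_end: int,
                    repeats: Iterable[Tuple[int, int]]) -> float:
    """Fraction of [variant_start, variant_end) covered by the union of repeats."""
    length = variant_end - variant_start
    if length <= 0:
        raise ValueError("variant must have positive length")
    # Clip every annotation to the variant, then sweep left to right so that
    # overlapping annotations are not counted twice.
    clipped = sorted((max(s, variant_start), min(e, variant_end)) for s, e in repeats)
    covered = 0
    cursor = variant_start
    for s, e in clipped:
        s = max(s, cursor)
        if e > s:
            covered += e - s
            cursor = e
    return covered / length


# Toy example: a 1,000 nt variant with three partially overlapping annotations.
fraction = repeat_fraction(0, 1000, [(-50, 400), (300, 800), (950, 1200)])
is_repeat_derived = fraction >= 0.70
print(fraction, is_repeat_derived)
```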
We tabulated a total of 666,328 “reference L1s” (each > 100 nt) in the haploid C57 reference genome using RepeatMasker (Smit et al. 2007), based on their evolutionary ages and structures (Supplementary Table 3). These counts are likely to be inexact because gaps remain in the reference genome assembly, currently 98.6% complete. Remaining gaps frequently include highly repetitive sequences. Mouse Y chromosome sequences have not been assembled, and some transposons are “compound” elements contiguous to one another that cannot be counted unambiguously.
At least 127,803 L1 elements (19.2% of the total) are present in all four strains’ unassembled genomes and in C57, so we call them “non-polymorphic”. Notably, some of these may be fixed in all mouse lineages, but their presence has been determined only for the five inbred strains here. By contrast, at least 6,723 (1%) distinct elements are L1 polymorphisms in C57, i.e. present in the C57 reference and possibly other strains, but absent from at least one strain. We compared the absent or present status (A/P call) for all five inbred strains in 1,861 fully predicted cases out of 6,723 L1 polymorphisms. These pairwise comparisons confirmed that 129S1 and 129X1 strains are most similar, while A/J and DBA/2J are most divergent (Supplementary Table 4). These results corroborate both earlier phylogenetic analyses using SNPs and other genomic markers, and strains’ known breeding histories (Wade et al. 2002).
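A pairwise comparison of A/P calls can be sketched as follows. The calls below are invented toy data, chosen only so that the ranking mirrors the pattern reported above; they are not the calls from this study, and the fraction-of-discordant-loci metric is an assumption for illustration.

```python
from itertools import combinations

# Hypothetical A/P calls: one character per locus, 'P' = present, 'A' = absent.
# Every locus is present in C57 and absent from at least one other strain.
calls = {
    "C57":    "PPPPPPPP",
    "A/J":    "APAPAPPP",
    "DBA/2J": "PAPAPAPP",
    "129S1":  "PPPPPPAA",
    "129X1":  "PPPPPPAA",
}


def discordance(a: str, b: str) -> float:
    """Fraction of loci at which two strains have different A/P calls."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


pairs = {
    (s1, s2): discordance(calls[s1], calls[s2])
    for s1, s2 in combinations(calls, 2)
}
closest = min(pairs, key=pairs.get)
most_divergent = max(pairs, key=pairs.get)
print(closest, most_divergent)
```

Restricting the comparison to loci with a call in every strain, as in the 1,861 fully predicted cases, avoids having to decide how a missing call should count.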
If a similar proportion of all reference L1s were polymorphic, then up to ~33,000 L1s would be absent from at least one of the four unassembled strains. Additionally, many thousands of other currently unknown L1 integrants, absent from the reference genome, are likely to be present in one or more of the unassembled mouse strains. Thus the analysis presented here substantially under-estimates structural variation including transposition-mediated variation between the strains.
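The extrapolation above can be reproduced with simple arithmetic. The sketch below assumes that the “similar proportion” is the polymorphic fraction among reference L1s whose status could be called (non-polymorphic plus polymorphic); that reading is our assumption, chosen because it yields a value close to the ~33,000 quoted in the text.

```python
# Counts taken from the text.
reference_l1s = 666_328
non_polymorphic = 127_803
polymorphic = 6_723

# Assumption: the proportion is computed among L1s with a definite call.
classified = non_polymorphic + polymorphic
polymorphic_fraction = polymorphic / classified

# Scale that fraction up to all reference L1s.
projected_polymorphic = polymorphic_fraction * reference_l1s

print(f"{polymorphic_fraction:.1%} of classified L1s are polymorphic")
print(f"projected polymorphic reference L1s: {projected_polymorphic:,.0f}")
```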
To validate predictions of L1s present or absent in the strains, we arbitrarily selected a set of 31 L1 integrants for testing by polymerase chain reaction (PCR). This collection is an arbitrary sample of mouse L1s genome-wide, as we included 22 independent polymorphic L1s present in the C57 reference but absent from at least one of the other strains. Of these, 11 were chosen from several regions of chromosome 10, and others were picked at a frequency of approximately one per chromosome. The remaining 9 elements were chosen for validation based upon their activity in a screen for fusion transcripts (see below). PCR assays were run both across left and right junctions between L1s and flanking genomic sequences, and across empty and/or occupied genomic target sites. We required results from the three PCR tests to be self-consistent. Predictions from all but one of 78 individual WGS traces (99%) identifying empty target sites (where reference L1s are absent from a strain) were validated (Supplementary Table 5), suggesting very low error rates in trace sequencing and alignments, and minimal confounding by other forms of genomic variation such as copy number variants. A predicted integrant on chromosome 17 could not be assayed in any strain, probably because its target site lies within an ancient element repeated in many genomic locations.
Validation of L1 polymorphisms in classical and wild mouse strains.
We wanted to determine if more extensive genomic variation distinguishes other lineages. Therefore, the same L1 integrants were assayed by PCR in 16 additional mouse strains and related species that have been studied in large-scale SNP discovery and analysis projects (Frazer et al. 2007; Wade and Daly 2005; Yang et al. 2007). Strikingly, none of the 31 L1s assayed (0%) is present in SPRET/EiJ, although Mus spretus diverged from ancestors of the classical inbred strains approximately one million years ago, and our collection emphasized integrants known to be polymorphic among those laboratory strains. If we had assayed mostly non-polymorphic L1s, presumably some would be present at conserved loci in Mus spretus. Only 2/28 (7%) each are present in CAST/EiJ (Mus castaneus) and MOLF/EiJ (Mus molossinus), respectively, and 1/30 (3%) is in PWD/PhJ. For comparison, the overall contribution from the genomes of these ancestral strains to classical inbred mouse strains has been estimated to be 3% from CAST/EiJ, 10% from MOLF/EiJ and 6% from PWD/PhJ, illustrating that our collection approximates the genome-wide contributions of these ancestors estimated by SNP analysis (Frazer et al. 2007). However, in WSB/EiJ, a strain most closely related to Mus musculus domesticus (the common ancestor for a majority of classical mouse strain genomes) (Wade et al. 2002), only a minority (10 out of 29; 34%) of the assayed L1 integrants is present. This value deviates substantially from the expected contribution (68%) from Mus musculus domesticus to the classical inbred mouse strains (Frazer et al. 2007), but might be explained by the small sample size and non-random distribution of L1s assayed here.
Although most of the integrants chosen for validation are polymorphic, three of the 31 validated integrants are non-polymorphic in the five strains. Of these, none are fixed in all 21 lineages. Several integrants are present only in a few strains, suggesting that they integrated very recently in evolutionary time, quite possibly within the past few hundred years or less. This relatively rapid rate of genomic change is comparable with that reported for copy number variants, which have emerged within several hundred generations of inbreeding of C57BL/6 sub-strains (Egan et al. 2007). While more than 19% of reference L1 elements are non-polymorphic in the five strains, a substantially smaller fraction likely will be non-polymorphic in all strains. These results are consistent with a recent analysis of SNPs in classical inbred mice, supporting their intrasubspecific origin (Yang et al. 2007). Additional WGS sequencing of divergent mouse species such as Mus spretus and Mus castaneus likely would identify fundamentally different patterns of transposon integrants and resulting differences in chromosome structures.
The chromosomal distributions of reference and polymorphic L1 retrotransposons were compared to genes and G/C-rich regions. As expected, L1s are not uniformly distributed genome-wide, but tend to be located in gene-poor regions (Ostertag and Kazazian 2001). Strikingly, the mouse genome contains many more reference L1 elements than exons. Polymorphic L1s and exons contribute to similar extents. L1s are also enriched in A/T-rich genomic regions (Gasior et al. 2007). Variation in L1 polymorphism densities along chromosomes is not due simply to differences in WGS trace coverage (Supplementary Fig. 2, Supplementary Tables 2–3). We cannot analyze the Y chromosome since its coverage is minimal due to its composition of arrayed huge repeats.
Chromosomal distribution of mouse L1s and SNPs
Compared with autosomes, the X chromosome has a significantly higher density of reference L1s (p = 0), as expected (Ostertag and Kazazian 2001). Less purifying selection on the sex chromosomes would allow accumulation of deleterious L1s on chromosome X (Boissinot et al. 2001). Chromosome 11 contains a substantially lower density of reference L1s (p = 0).
Non-random distribution of L1 retrotransposons on chromosomes and within genes
By contrast, there are many fewer L1 polymorphisms on the X chromosome and chromosome 10, and increased numbers of L1 polymorphisms on chromosomes 1 and 3. Out of 600,486 autosomal L1s, 6,484 (1.08%) are polymorphic, while only 237 out of 65,038 L1s on the X chromosome (0.36%) are polymorphic (p = 1.47 × 10⁻²²). The high density of L1s on the X chromosome, together with its paradoxical lack of L1 polymorphisms, could be due to prevention of or strong selection against new insertions, or selection for older ones. This apparent contradiction suggests that non-polymorphic L1s may play an important biological role there, perhaps in X inactivation (Lyon 1998).
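The autosome versus X chromosome comparison can be checked with a short sketch. The text does not state which statistical test produced the reported p value, so the pooled two-proportion z-test used here is a swapped-in, generic choice and its p value should not be expected to match the published one; only the counts come from the text.

```python
import math


def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion z-test; returns (p1, p2, z, two-sided p value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


# Counts from the text: polymorphic L1s / all L1s on autosomes and on X.
auto_frac, x_frac, z, p_value = two_proportion_z(6_484, 600_486, 237, 65_038)
print(f"autosomes: {auto_frac:.2%}, X: {x_frac:.2%}, z = {z:.1f}")
```

With counts this large, any reasonable test of equal proportions gives a very small p value; the roughly threefold difference in the proportions themselves is the more informative quantity.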
We compared L1 variants and SNPs pairwise between the reference genome and A/J or DBA/2J, respectively. Such pairwise comparisons revealed that most polymorphic L1 integration sites coincide with SNP-dense regions (p < 1E-10; Supplementary Materials and Methods). A plausible explanation for this concordance between a large majority of L1 variants and SNP-dense regions is that most polymorphic transposon integration sites and flanking genomic sequences, co-inherited from distant ancestors, then diverged with a subsequent accumulation of SNPs. Alternatively, these two forms of genomic variation might be expected to coincide in those chromosomal regions where such changes can be tolerated. While independent polymorphic L1s are substantially less numerous than SNPs (Frazer et al. 2007), they contain at least a thousand-fold more nucleotides per variant.
Importantly, occasional L1 variants integrated into genomic regions without apparent SNPs, so-called “identical by descent” (IBD) regions. Such transposon integrants clearly have caused substantial local variation despite the lack of SNPs. Screening for polymorphic transposons might provide a powerful new way to genotype mouse strains and other mammalian species, particularly in IBD regions with few or no SNPs available (2005; Yang et al. 2007).
Several structural features of polymorphic L1s are consistent with their young evolutionary ages. In contrast with both reference and non-polymorphic elements, polymorphic L1s have a bimodal length distribution with a significantly increased number of long, full-length elements. They also more frequently have target site duplications (TSDs) and poly(A) tails, and when present, their TSDs and poly(A) tails are significantly longer than those of reference or non-polymorphic L1s (Supplementary Fig. 3). Polymorphic L1s also have a canonical target site preference, lower nucleotide substitution rate, and more frequently are classified as young, active L1 subfamily members (Supplementary Table 6). These results strongly suggest that such genomic integrants are bona fide products of recent retrotransposition (Symer et al. 2002).
Polymorphic L1s are bona fide products of recent retrotransposition
These results collectively show that polymorphic L1s are substantially younger than other L1s in the mouse genome. However, L1 polymorphisms typically are localized in high-density SNP regions, suggesting their localization and co-inheritance within divergent ancestral blocks (Wade et al. 2002). Clearly, determination of the ages and evolutionary relationships of individual transposon integrants and other genomic variants along chromosomes in different strains will require further investigation.
Multiple forms of transcriptional variation have been linked previously with transposons, which may contribute cryptic or alternative promoters, terminators and/or splice sites, affect RNA polymerase processivity, trigger altered chromatin conformations, mediate homologous recombination and/or template small RNA expression (Belancio et al. 2006; Ostertag and Kazazian 2001; Speek 2001; Wheelan et al. 2005; Yang and Kazazian 2006). However, the extent of transcriptional variation due to endogenous transposition is not known.
Just over half (53%) of both non-polymorphic and polymorphic L1s are located within 100 kb of annotated RefSeq genes. Approximately 20% of both reference L1s and L1 variants occur inside transcription units, representing a significant bias against L1 integrants within genes, since annotated RefSeq genes, including introns, make up 28–30% of the mouse genome (An et al. 2006). Presumably this relative exclusion of L1 elements from genes reflects selection against them, or less likely, their non-random integration into intergenic regions.
Of the non-polymorphic L1s within introns, approximately 68% are oriented antisense to the open reading frame (ORF; Supplementary Table 7). A smaller majority (58%) of polymorphic L1s are antisense within genes. An antisense orientation bias also was observed for de novo L1 integrants within genes in cultured human cells (Symer et al. 2002). By contrast, both non-polymorphic and polymorphic L1s within an interval of 100 kb upstream or downstream of genes occur in both orientations (Supplementary Table 7), suggesting a neutral orientation preference during retrotransposon integration per se, as expected (Gilbert et al. 2005). Presumably the observed orientation bias within genes is due to positive selection upon antisense elements or negative selection upon sense integrants (Boissinot et al. 2001). The smaller majority of antisense polymorphic L1s within genes may reflect selection over a shorter period of time upon these evolutionarily younger integrants.
To find L1s associated with transcriptional variation in mouse strains, we screened pooled testis cDNA libraries for fragments of L1 TF sequences. This approach allowed us to discover a new antisense promoter active within many full-length, young L1s (Li et al. 2008). In an initial survey, a diverse collection of spliced, polyadenylated L1-gene fusion cDNAs, initiated by L1 elements in various gene introns or in intergenic regions, was identified (Supplementary Table 8). Their corresponding antisense L1 templates are polymorphic, but non-polymorphic elements also can be expressed (Li et al. 2008). Approximately half are present in the C57 genome, while others are absent (Supplementary Table 8). The latter putative L1 integrants were identified in other strains’ genomic DNA by chromosome walking from expressed exons into adjacent introns. Each unknown L1 integrant’s genomic flanks were sequenced, revealing canonical TSDs and a poly(A) tail. Once identified, the presence or absence of each L1 template was determined by PCR in all 21 lineages. In one case, a polymorphic L1 is present exclusively in the A/J lineage but none of the others, suggesting that it integrated very recently.
To verify that fusion transcripts are present exclusively in strains containing a putative genomic L1 template, we analyzed total RNAs isolated from adult male testes from the five strains. For example, fusion transcripts of L1-Drosha and an L1-novel gene were identified only in strains with relevant antisense L1 polymorphisms present. Similarly, other fusion transcripts were detected only in strains with corresponding L1 templates, including a chimeric transcript from the L1-Arhgap15 locus (Supplementary Fig. 2, Supplementary Table 8).
Transcriptional variation due to L1 variants
Fusion L1 transcripts are exemplified by the L1-Drosha fusion transcript, which is expressed at ~30% of the level of native Drosha in testis. This transcript contains both translation start and splice donor sites from L1, and is spliced in-frame with downstream exons encoding catalytic domains of Drosha (also called Rnasen), an RNaseIII gene centrally involved in microRNA biosynthesis (Murchison and Hannon 2004). Similarly, an L1-Parp8 fusion transcript also is predicted to be in-frame, and its open reading frame contains most functional domains of Parp8. As a control, an assay for readthrough transcripts for the canonical genes, from which L1 polymorphisms are spliced out with usual introns, showed comparable expression levels. Remarkably, a novel, spliced transcript 1ASII-1 is promoted by a polymorphic L1 in a genomic region where no cDNA or expressed sequence tag (EST) had been reported previously.
No appreciable fusion L1-Drosha transcript was identified by reverse transcriptase-mediated (RT-) PCR in non-gonadal tissues. By contrast, the novel fusion transcript 1ASII-1 was detected both in testis and 11 day-embryo tissues. We speculate that mechanisms such as transcriptional or post-transcriptional gene silencing, position effects, and/or availability of tissue-specific transcription factors may contribute to variable expression and control of particular transposon integrants in different developmental states (Whitelaw and Martin 2001). These and other fusion transcripts may encode protein variants or noncoding RNAs with regulatory or other functions.
We asked what proportion of endogenous L1 variants might contribute to transcriptional variation in the strains. Therefore, we screened adult testis total RNA samples for more L1 fusion transcripts. Out of 205 full-length, antisense L1 polymorphisms predicted inside RefSeq genes in the C57 genome, an arbitrary sample of 68 was screened. Of these, 13 (19%) drive fusion L1-gene transcripts, including 40% of the TF polymorphisms tested (Supplementary Table 9; Li et al. 2008). Additionally, fusion L1-Arhgap15 transcription was identified in another screen (Supplementary Table 9 and Supplementary Fig. 2b). Notably, two distinct intronic L1 polymorphisms occur in Grid2 in different strains, but only one drives expression of a fusion L1-Grid2 transcript while the other does not (Supplementary Table 9). Thus, we speculate that both polymorphic and non-polymorphic L1s may initiate additional transcripts in testes or other tissues, developmental stages and/or disease states such as cancers.
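Because the 19% estimate rests on a sample of 68 elements, it can help to see the sampling uncertainty around it. The sketch below computes a 95% Wilson score interval for 13 out of 68; the choice of interval and the fixed z value of 1.96 are our assumptions for illustration and are not part of the analysis reported here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 13 of the 68 screened full-length antisense L1 polymorphisms drive fusion transcripts.
low, high = wilson_interval(13, 68)
print(f"observed {13 / 68:.0%}, approximate 95% interval {low:.0%} to {high:.0%}")
```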
Another way by which L1 variants can affect tissue-specific gene structure and expression is illustrated by the rd7 mouse model of retinal degeneration (Chen et al. 2006). A de novo insertion of a full-length antisense L1 into exon 5 of Nr2e3 disrupts that gene’s normal transcription and splicing. Its donor itself is polymorphic, present only in C57, NZB/BinJ, and AKR/J out of the 21 strains tested, thereby providing the first example of a “hot” endogenous mouse L1 that actively retrotransposed from its chromosomal location (Brouha et al. 2003). Thus, other full-length, polymorphic L1s also may be highly active donors in vivo.
Genomic and transcriptional variation due to endogenous transposition
Ontology analysis (Mi et al. 2005) of annotated genes containing L1 polymorphisms showed a significant exclusion from certain categories of genes, including genes associated with cell cycle, nucleic acid metabolism and oncogenesis (Supplementary Table 10). By contrast, L1 polymorphisms are significantly enriched in the receptors category of molecular functions, suggesting that these genes generally may tolerate added structural or transcriptional variability mediated by transposon integration events. Non-polymorphic L1s and reference L1s were enriched significantly in brain-associated genes along with other ontological categories (Supplementary Table 10). A recent, high resolution analysis of copy number variation between mouse strains revealed that these structural variants also are excluded from similar groups of mouse genes required in fundamental cellular processes, e.g. those involved in cell cycle and nucleic acid metabolism (Cutler et al. 2007).
Exclusion or enrichment of polymorphic L1s within genes in various ontological categories.