Numerous inbred mouse strains comprise models for human diseases and diversity, but the molecular differences between them are mostly unknown. Several mammalian genomes have been assembled, providing a framework for identifying structural variations. To identify variants between inbred mouse strains at a single nucleotide resolution, we aligned 26 million individual sequence traces from four laboratory mouse strains to the C57BL/6J reference genome. We discovered and analyzed over ten thousand intermediate-length genomic variants (from 100 nucleotides to 10 kilobases) distinguishing these strains from the C57BL/6J reference. Approximately 85% of such variants are due to recent mobilization of endogenous retrotransposons, predominantly L1 elements, greatly exceeding that reported in humans. Many genes’ structure and expression are altered directly by polymorphic L1 retrotransposons, including Drosha, Parp8, Scn1a, Arhgap15 and others including novel genes. L1 polymorphisms are distributed non-randomly across the genome, as they are excluded significantly from the X chromosome and from genes associated with the cell cycle, but are enriched in receptor genes. Thus, recent endogenous L1 retrotransposition has diversified genomic structures and transcripts significantly, distinguishing mouse lineages and driving a major portion of natural genetic variation.
Inbred mouse strains form a foundation for mammalian genetics research. Hundreds of distinct lineages including well-known laboratory strains were generated from limited founders by repetitive crosses of highly related animals within the past 100–300 years. Individuals of a given strain are both virtually homozygous at all autosomal loci and isogenic (Beck et al. 2000). The power of mouse genetics research in part comes from naturally occurring genetic variation between different strains. Phenotypic differences between mouse lineages, such as disease susceptibility traits, behavioral differences and many other characteristics, are widely used to model human developmental and metabolic disorders, cancers, and many other diseases and traits (Beck et al. 2000).
Genome sequence assemblies have been completed recently for the mouse and other mammalian species. Large-scale resequencing projects have focused upon identification of certain forms of sequence variation, especially short variants such as single nucleotide polymorphisms (SNPs) (International HapMap Consortium 2005), that might account for functional differences between mammalian individuals or lineages. Such work has helped map a small number of quantitative trait loci, tabulated common variants associated with cancers and other diseases, and facilitated analysis of mammalian evolution (Conrad et al. 2006; Frazer et al. 2007; Wade and Daly 2005). More recently, longer structural variants have been identified, distinguishing human individuals and mouse sub-strains (Egan et al. 2007; Korbel et al. 2007; Mills et al. 2006). Several recent studies on human structural variation revealed that non-homologous end joining and endogenous transposition of retroelements have contributed mechanistically to most insertion or deletion (indel) changes between human genomes (Korbel et al. 2007; Mills et al. 2006).
Various classes of repetitive elements, mostly transposons, make up nearly half of the mammalian genomes assembled (Lander et al. 2001; Waterston et al. 2002). While some retrotransposon families are actively mobilized in mouse and human genomes (Kazazian 2004), occasionally resulting in disease-causing mutations (Chen et al. 2005) and various forms of genomic instability (Symer et al. 2002), their contributions to structural variation are largely unknown. Since transposons can introduce promoters, terminators, and alternative splice sites, and affect local chromatin structures (Belancio et al. 2006; Chen et al. 2006; Roy-Engel et al. 2005; Wheelan et al. 2005; Whitelaw and Martin 2001), their active mobilization in genomes is a likely determinant of transcriptional variation (Horie et al. 2007), and therefore at least some cases of phenotypic variation.
A comprehensive analysis of structural variation between classical inbred mouse strains has not been conducted to date, except for SNPs and certain copy number variants (CNVs). In this study, to identify intermediate-length structural variants between inbred mouse strains at extremely high resolution, i.e. single nucleotide resolution, we aligned individual sequence traces to the reference mouse genome using a fast and accurate new method. Virtually all sampled predictions were validated by specific polymerase chain reaction (PCR) assays. Surprisingly, most of the identified genomic variants between mouse strains were caused by recent mobilization of endogenous transposable elements, of which L1 retrotransposons were most active. Additionally, as described here, we found that a substantial number of these polymorphic transposons directly altered transcript structures and expression levels in corresponding mouse strains.
High-resolution data from whole genome shotgun (WGS) sequencing of four inbred mouse strains, A/J, DBA/2J, 129S1/SvImJ (henceforth, 129S1), and 129X1/SvJ (129X1) (Mural et al. 2002), whose genomes remain unassembled, were deposited recently at the National Center for Biotechnology Information (NCBI) trace archive (Mural et al. 2002; Wade and Daly 2005). To identify genomic variants distinguishing these strains, we downloaded ~26 million WGS sequence traces (cumulative length ~18 billion nucleotides, nt) and aligned them individually to the reference C57BL/6J (C57) genome assembly using GMAP. This software application was developed to map exons and therefore is well suited to aligning genomic fragments with intervening breaks. It appeared to run alignments 10–100-fold faster than other applications such as Blat (R. M. Stephens, N. Volfovsky, unpublished data). We found that 73% of the individual sequence traces align unambiguously to the C57 reference genome with minimal or no variation (Fig. 1, Table 1, Supplementary Fig. 1). Many traces validate known SNPs and/or identify new ones, and show that significant portions of the compared strains’ genomes are non-polymorphic in pairwise comparisons (Wade et al. 2002). By contrast, others align to multiple repetitive elements or to no unique locus, identify short tandem repeat (STR) polymorphisms (N. Volfovsky et al., manuscript in preparation), and/or identify indel variants.
Upon merging overlapping individual WGS traces (Fig. 1a and Supp. Fig. 1), more than ten thousand intermediate-sized variants, ranging from 100 nt to 10 kb, are predicted by this analysis to be present in the C57 reference but absent from at least one of the other strain(s) (Fig. 1, Fig. 2 and Supplementary Table 1). We call such indel variants a “polymorphic insertion in C57” since they are present in the reference genome (Fig. 1) but absent from another strain. Even more variants were found present in at least one of the four unassembled strains but absent from the reference (“polymorphic insertion in strain X”). These latter variants are difficult to characterize without full genome assemblies, precluding their detailed analysis here. We do not wish to imply by this nomenclature that the polymorphisms’ mechanism of formation is known in all cases; an indel variant that we call an insertion in a given strain could alternatively have been deleted from another strain. All polymorphisms identified here were determined from comparisons with the reference C57 mouse genome. Our alignment procedures, categorization of WGS traces, and resulting sequence coverage for each strain are described in Fig. 1, Table 1, Supplementary Fig. 1, Supplementary Tables 1–2 and Supplementary Materials and Methods. Comprehensive data about the genomic variants distinguishing mouse strains, as discovered in this study, are available using PolyBrowse, our new genomic polymorphism query and display website at http://polybrowse.abcc.ncifcrf.gov/ (Stephens et al. 2008).
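The indel calls described above can be illustrated with a minimal Python sketch. It assumes that each aligned trace has already been reduced to a list of collinear, forward-strand alignment blocks; this block format, the function name find_reference_insertions, and the flank and junction thresholds are hypothetical conveniences for illustration, not GMAP's output format or the parameters used in this study. Only the size window of 100 nt to 10 kb is taken from the text.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """One gap-free aligned segment of a trace (hypothetical format)."""
    ref_start: int    # 0-based, inclusive, on the reference
    ref_end: int      # exclusive
    trace_start: int  # 0-based, inclusive, on the trace
    trace_end: int    # exclusive


MIN_LEN = 100        # lower size bound, from the text
MAX_LEN = 10_000     # upper size bound, from the text
MIN_FLANK = 50       # assumption: aligned bases required on each side
MAX_TRACE_GAP = 10   # assumption: unaligned trace bases tolerated at the junction


def find_reference_insertions(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Return reference intervals that the trace skips over.

    Such an interval is present in the reference but absent from the
    sequenced strain (a "polymorphic insertion in C57" in the text's terms).
    Assumes all blocks lie on the forward strand of one chromosome.
    """
    calls = []
    ordered = sorted(blocks, key=lambda b: b.ref_start)
    for left, right in zip(ordered, ordered[1:]):
        ref_gap = right.ref_start - left.ref_end
        trace_gap = right.trace_start - left.trace_end
        shortest_flank = min(left.ref_end - left.ref_start,
                             right.ref_end - right.ref_start)
        if (MIN_LEN <= ref_gap <= MAX_LEN
                and 0 <= trace_gap <= MAX_TRACE_GAP
                and shortest_flank >= MIN_FLANK):
            calls.append((left.ref_end, right.ref_start))
    return calls


# Toy trace: 800 nt that align as two 400-nt blocks separated by 1,100 nt
# of reference sequence.
example = [Block(1000, 1400, 0, 400), Block(2500, 2900, 400, 800)]
print(find_reference_insertions(example))  # [(1400, 2500)]
```

Requiring aligned sequence on both sides of the gap mirrors the constraint that a single trace must span a variant on both sides before it can be counted.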
Almost all such variants include at least 70% sequence content from various classes of repetitive elements (Fig. 2a), as identified by RepeatMasker (Smit et al. 2007). A large majority contains >90% transposon sequences per variant. Their length distribution is strikingly bimodal, matching transposons’ known structures in the mouse genome (Fig. 2). Of these transposon indels, L1 (LINE, Long Interspersed Element) retrotransposons are the most numerous. L1 integrants are frequently truncated from the 5′ end, but many others are full length (Symer et al. 2002). L1 polymorphisms contributed the most variant nucleotides to the strains’ genomes overall; their mean ± standard deviation (SD) length is 1,130 ± 590 nucleotides (nt). Other classes of active transposable elements, including short interspersed elements (SINEs, mostly B2 elements) and long terminal repeat-containing retrotransposons (e.g. ERV-K and MaLR elements), are also very frequently polymorphic between strains (Fig. 2b).
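The repeat content of a variant, as used for the 70% and 90% thresholds above, can be computed as the fraction of the variant interval covered by repeat annotations. The sketch below is illustrative only: it assumes the RepeatMasker annotations have already been parsed into half-open (start, end) intervals, and the function name repeat_fraction and the toy coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open: start inclusive, end exclusive


def repeat_fraction(variant: Interval, repeats: Iterable[Interval]) -> float:
    """Fraction of the variant covered by at least one repeat annotation.

    Overlapping annotations are merged so that no base is counted twice.
    """
    v_start, v_end = variant
    # Keep only annotations that overlap the variant, clipped to its bounds.
    clipped = sorted(
        (max(start, v_start), min(end, v_end))
        for start, end in repeats
        if start < v_end and end > v_start
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Close the previous merged run and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (v_end - v_start)


# Toy example: a 1,000-nt variant with three partly overlapping annotations.
fraction = repeat_fraction((0, 1000), [(-50, 600), (500, 900), (950, 1200)])
print(f"{fraction:.0%} repeat content")  # 95% repeat content
```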
We tabulated a total of 666,328 “reference L1s” (each > 100 nt) in the haploid C57 reference genome using RepeatMasker (Smit et al. 2007), based on their evolutionary ages and structures (Fig. 1b and Supplementary Table 3). These counts are likely to be inexact because gaps remain in the reference genome assembly, currently 98.6% complete (Table 1). Remaining gaps frequently include highly repetitive sequences. Mouse Y chromosome sequences have not been assembled, and some transposons are “compound” elements contiguous to one another that cannot be counted unambiguously.
At least 127,803 L1 elements (19.2% of the total) are present in all four strains’ unassembled genomes and in C57, so we call them “non-polymorphic” (Fig. 1b). Notably, some of these may be fixed in all mouse lineages, but their presence has been determined only for the five inbred strains here. By contrast, at least 6,723 (1%) distinct elements are L1 polymorphisms in C57, i.e. present in the C57 reference and possibly other strains, but absent from at least one strain. We compared the absent or present status (A/P call) for all five inbred strains in 1,861 fully predicted cases out of 6,723 L1 polymorphisms. These pairwise comparisons confirmed that 129S1 and 129X1 strains are most similar, while A/J and DBA/2J are most divergent (Supplementary Table 4). These results corroborate both earlier phylogenetic analyses using SNPs and other genomic markers, and strains’ known breeding histories (Wade et al. 2002).
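The pairwise comparison of A/P calls can be sketched as follows. The five calls are invented toy data with no biological meaning, and the function name pairwise_discordance is hypothetical; the sketch only shows how a discordance rate between two strains could be derived from fully predicted calls.

```python
from itertools import combinations
from typing import Dict, List, Tuple

STRAINS = ["C57", "A/J", "DBA/2J", "129S1", "129X1"]

# Toy calls (assumption): one character per strain, in STRAINS order,
# P = present, A = absent. Every element is present in C57 and absent
# from at least one other strain, as for an L1 polymorphism in C57.
TOY_CALLS = {
    "L1_a": "PAPPP",
    "L1_b": "PPAPP",
    "L1_c": "PAAPP",
    "L1_d": "PPPAA",
    "L1_e": "PAPAA",
}


def pairwise_discordance(
    calls: Dict[str, str], strains: List[str]
) -> Dict[Tuple[str, str], float]:
    """Fraction of elements whose A/P call differs, for each strain pair."""
    result = {}
    for i, j in combinations(range(len(strains)), 2):
        n_diff = sum(1 for call in calls.values() if call[i] != call[j])
        result[(strains[i], strains[j])] = n_diff / len(calls)
    return result


rates = pairwise_discordance(TOY_CALLS, STRAINS)
for pair, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(pair, f"{rate:.0%}")
```

A lower discordance rate indicates a more similar pair of strains; the same table could serve as a distance matrix for clustering.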
If a similar proportion of all reference L1s were polymorphic, then up to ~33,000 L1s would be absent from at least one of the four unassembled strains. Additionally, many thousands of other currently unknown L1 integrants, absent from the reference genome, are likely to be present in one or more of the unassembled mouse strains. Thus the analysis presented here substantially under-estimates structural variation including transposition-mediated variation between the strains.
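The arithmetic behind this estimate can be checked with a short Python sketch. It reflects one plausible reading of the extrapolation, namely that the polymorphic fraction observed among elements with a determined status is applied to all reference L1s; the variable names are illustrative and the counts are those reported above.

```python
# Counts reported in the text.
reference_l1 = 666_328      # all reference L1s (> 100 nt) in C57
non_polymorphic = 127_803   # present in all five strains
polymorphic = 6_723         # present in C57, absent from at least one strain

# Shares of the full reference set.
share_non_polymorphic = non_polymorphic / reference_l1
share_polymorphic = polymorphic / reference_l1

# Assumption: elements with a determined status are representative of the
# remaining reference L1s.
classified = non_polymorphic + polymorphic
polymorphic_fraction = polymorphic / classified
extrapolated = polymorphic_fraction * reference_l1

print(f"non-polymorphic share of reference L1s: {share_non_polymorphic:.1%}")
print(f"polymorphic share of reference L1s:     {share_polymorphic:.1%}")
print(f"polymorphic fraction among classified:  {polymorphic_fraction:.1%}")
print(f"extrapolated polymorphic L1s:           ~{extrapolated:,.0f}")
```

Under this reading, about 5% of classified elements are polymorphic, which scales to roughly 33,000 elements genome-wide.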
To validate predictions of L1s present or absent in the strains, we arbitrarily selected a set of 31 L1 integrants for validation by polymerase chain reaction (PCR) (Table 2). This collection is an arbitrary sample of mouse L1s genome-wide, as we included 22 independent polymorphic L1s present in the C57 reference but absent from at least one of the other strains. Of these, 11 were chosen from several regions of chromosome 10, and others were picked at a frequency of approximately one per chromosome. The remaining 9 elements were chosen for validation based upon their activity in a screen for fusion transcripts (see below). PCR assays were run both across left and right junctions between L1s and flanking genomic sequences, and across empty and/or occupied genomic target sites. We required results from the three PCR tests to be self-consistent. Predictions from all but one of 78 individual WGS traces (99%) identifying empty target sites (where reference L1s are absent from a strain) were validated (Supplementary Table 5), suggesting very low error rates in trace sequencing and alignments, and minimal confounding by other forms of genomic variation such as copy number variants. A predicted integrant on chr. 17 could not be assayed in any strain, probably because its target site lies within an ancient element repeated in many genomic locations (Table 2).
We wanted to determine if more extensive genomic variation distinguishes other lineages. Therefore, the same L1 integrants were assayed by PCR in 16 additional mouse strains and related species that have been studied in large-scale SNP discovery and analysis projects (Table 2) (Frazer et al. 2007; Wade and Daly 2005; Yang et al. 2007). Strikingly, none of the 31 L1s assayed (0%) is present in SPRET/EiJ. This is consistent with the divergence of Mus spretus from ancestors of the classical inbred strains approximately one million years ago, and with the fact that our collection emphasized integrants known to be polymorphic among those laboratory strains. If we had assayed mostly non-polymorphic L1s, presumably some would be present at conserved loci in Mus spretus. Only 2/28 (7%) each are present in CAST/EiJ (Mus castaneus) and MOLF/EiJ (Mus molossinus), respectively, and 1/30 (3%) is in PWD/PhJ. For comparison, the overall contribution from the genomes of these ancestral strains to classical inbred mouse strains has been estimated to be 3% from CAST/EiJ, 10% from MOLF/EiJ and 6% from PWD/PhJ, illustrating that our collection approximates the genome-wide contributions of these ancestors estimated by SNP analysis (Frazer et al. 2007). However, in WSB/EiJ, a strain most closely related to Mus musculus domesticus (the common ancestor for a majority of classical mouse strain genomes) (Wade et al. 2002), only a small minority (10 out of 29; 34%) of the assayed L1 integrants is present. This value deviates substantially from the expected contribution (68%) from Mus musculus domesticus to the classical inbred mouse strains (Frazer et al. 2007), but might be explained by the small sample size and non-random distribution of L1s assayed here (Table 2).
Although most of the integrants chosen for validation are polymorphic, three of the 31 validated integrants are non-polymorphic in the five strains. Of these, none are fixed in all 21 lineages (Table 2). Several integrants are present only in a few strains, suggesting that they integrated very recently in evolutionary time, quite possibly within the past few hundred years or less. This relatively rapid rate of genomic change is comparable with that reported for copy number variants, which have emerged within several hundred generations of inbreeding of C57BL/6 sub-strains (Egan et al. 2007). While more than 19% of reference L1 elements are non-polymorphic in the five strains, a substantially smaller fraction likely will be non-polymorphic in all strains. These results are consistent with a recent analysis of SNPs in classical inbred mice, supporting their intrasubspecific origin (Yang et al. 2007). Additional WGS sequencing of divergent mouse species such as Mus spretus and Mus castaneus likely would identify fundamentally different patterns of transposon integrants and resulting differences in chromosome structures.
The chromosomal distributions of reference and polymorphic L1 retrotransposons were compared to genes and G/C-rich regions (Fig. 3). As expected, L1s are not uniformly distributed genome-wide, but tend to be located in gene-poor regions (Ostertag and Kazazian 2001). Strikingly, the mouse genome contains many more reference L1 elements than exons. Polymorphic L1s and exons contribute to similar extents (Fig. 3a). L1s are also enriched in A/T-rich genomic regions (Gasior et al. 2007). Variation in L1 polymorphism densities along chromosomes is not due simply to differences in WGS trace coverage (Supplementary Fig. 2, Supplementary Tables 2–3). We cannot analyze the Y chromosome, since its coverage is minimal owing to its composition of arrayed huge repeats.
Compared with autosomes, the X chromosome has a significantly higher density of reference L1s (Fig. 3a and Table 3; p = 0), as expected (Ostertag and Kazazian 2001). Less purifying selection on the sex chromosomes would allow accumulation of deleterious L1s on chromosome X (Boissinot et al. 2001). Chromosome 11 contains a substantially lower density of reference L1s (Table 3a; p = 0).
By contrast, there are many fewer L1 polymorphisms on the X chromosome and chromosome 10, and increased numbers of L1 polymorphisms on chromosomes 1 and 3. Out of 600,486 autosomal L1s, 6,484 (1.08%) are polymorphic, while only 237 out of 65,038 L1s on the X chromosome (0.36%) are polymorphic (p = 1.47 × 10^-22; Fig. 3a and Table 3a). The high density of L1s on the X chromosome, together with its paradoxical lack of L1 polymorphisms, could be due to prevention of or strong selection against new insertions, or selection for older ones. This apparent contradiction suggests that non-polymorphic L1s may play an important biological role there, perhaps in X inactivation (Lyon 1998).
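The autosome versus X chromosome comparison can be reproduced approximately from the counts given above. The sketch below uses a pooled two-proportion z-test purely as an illustration; the statistical test actually used for the reported p-value is described in the supplementary methods and is not assumed here, so the resulting p-value is not expected to match the published one. The function name two_proportion_z is hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> Tuple[float, float, float, float]:
    """Pooled two-proportion z-test (normal approximation, two-sided).

    Returns (proportion 1, proportion 2, z statistic, two-sided p-value).
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value


# Counts reported in the text: polymorphic L1s out of all L1s.
autosomal_rate, x_rate, z_stat, p_val = two_proportion_z(6_484, 600_486, 237, 65_038)

print(f"autosomes: {autosomal_rate:.2%} polymorphic")
print(f"X chromosome: {x_rate:.2%} polymorphic")
print(f"z = {z_stat:.1f}, two-sided p = {p_val:.2e}")
```

Whatever test is chosen, the roughly threefold lower polymorphic fraction on the X chromosome is far outside what sampling noise would produce with counts of this size.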
We compared L1 variants and SNPs pairwise between the reference genome and A/J or DBA/2J, respectively. Such pairwise comparisons revealed that most polymorphic L1 integration sites coincide with SNP-dense regions (p < 1E-10; Fig. 3b and Supplementary Materials and Methods). A plausible explanation for this concordance between a large majority of L1 variants and SNP-dense regions is that most polymorphic transposon integration sites and their flanking genomic sequences were co-inherited from distant ancestors and then diverged through subsequent accumulation of SNPs. Alternatively, these two forms of genomic variation might be expected to coincide in those chromosomal regions where such changes can be tolerated. While independent polymorphic L1s are substantially less numerous than SNPs (Frazer et al. 2007), they contain at least a thousand-fold more nucleotides per variant (Fig. 3b).
Importantly, occasional L1 variants integrated into genomic regions without apparent SNPs, so-called “identical by descent” (IBD) (insets, Fig. 3b). However, such transposon integrants clearly have caused substantial local variation despite lack of SNPs. Screening for polymorphic transposons might provide a powerful new way to genotype mouse strains and other mammalian species, particularly in IBD regions with few or no SNPs available (2005; Yang et al. 2007).
Several structural features of polymorphic L1s are consistent with their young evolutionary ages. In contrast with both reference and non-polymorphic elements, polymorphic L1s have a bimodal length distribution with a significantly increased number of long, full-length elements (Fig. 4). They also more frequently have target site duplications (TSDs) and poly(A) tails, and when present, their TSDs and poly(A) tails are significantly longer than those of reference or non-polymorphic L1s (Supplementary Fig. 3). Polymorphic L1s also have a canonical target site preference, lower nucleotide substitution rate, and more frequently are classified as young, active L1 subfamily members (Supplementary Table 6). These results strongly suggest that such genomic integrants are bona fide products of recent retrotransposition (Symer et al. 2002).
Three young L1 subfamilies are currently active in mouse; some members of these active subfamilies have caused murine diseases by insertional mutagenesis. Ranked by their occurrence in the reference genome, these are TF, A and GF (Goodier et al. 2001; Naas et al. 1998; Ostertag and Kazazian 2001; Saxton and Martin 1998). Similarly, a majority (59%) of polymorphic L1s are products of retrotransposition by young, active donors, i.e. TF (28%), A (23%) and GF (8%) subfamily members (Supplementary Table 3).
These results collectively show that polymorphic L1s are substantially younger than other L1s in the mouse genome. However, L1 polymorphisms typically are localized in high-density SNP regions (Fig. 3b), suggesting their localization and co-inheritance within divergent ancestral blocks (Wade et al. 2002). Clearly, determination of the ages and evolutionary relationships of individual transposon integrants and other genomic variants along chromosomes in different strains will require further investigation.
Multiple forms of transcriptional variation have been linked previously with transposons, which may contribute cryptic or alternative promoters, terminators and/or splice sites, affect RNA polymerase processivity, trigger altered chromatin conformations, mediate homologous recombination and/or template small RNA expression (Belancio et al. 2006; Ostertag and Kazazian 2001; Speek 2001; Wheelan et al. 2005; Yang and Kazazian 2006). However, the extent of transcriptional variation due to endogenous transposition is not known.
Slightly more than half (53%) of both non-polymorphic and polymorphic L1s are located within 100 kb of annotated RefSeq genes. Approximately 20% of both reference L1s and L1 variants occur inside transcription units, representing a significant bias against L1 integrants within genes, since 28–30% of the mouse genome consists of annotated RefSeq genes including introns (An et al. 2006) (Table 3b). Presumably this relative exclusion of L1 elements from genes reflects selection against them, or less likely, their non-random integration into intergenic regions.
Of the non-polymorphic L1s within introns, approximately 68% are oriented antisense to the open reading frame (ORF; Supplementary Table 7). A smaller majority (58%) of polymorphic L1s are antisense within genes. An antisense orientation bias also was observed for de novo L1 integrants within genes in cultured human cells (Symer et al. 2002). By contrast, both non-polymorphic and polymorphic L1s within an interval of 100 kb upstream or downstream of genes occur in both orientations (Supplementary Table 7), suggesting a neutral orientation preference during retrotransposon integration per se, as expected (Gilbert et al. 2005). Presumably the observed orientation bias within genes is due to positive selection upon antisense elements or negative selection upon sense integrants (Boissinot et al. 2001). The smaller majority of antisense polymorphic L1s within genes may reflect selection over a shorter period of time upon these evolutionarily younger integrants.
To find L1s associated with transcriptional variation in mouse strains, we screened pooled testis cDNA libraries for fragments of L1 TF sequences. This approach allowed us to discover a new antisense promoter active within many full-length, young L1s (Li et al. 2008). In an initial survey, a diverse collection of spliced, polyadenylated L1-gene fusion cDNAs, initiated by L1 elements in various gene introns or in intergenic regions, was identified (Supplementary Table 8). Their corresponding antisense L1 templates are polymorphic, but non-polymorphic elements also can be expressed (Li et al. 2008). Approximately half are present in the C57 genome, while others are absent (Table 2 and Supplementary Table 8). The latter putative L1 integrants were identified in other strains’ genomic DNA by chromosome walking from expressed exons into adjacent introns. Each unknown L1 integrant’s genomic flanks were sequenced, revealing canonical TSDs and a poly(A) tail. Once identified, the presence or absence of each L1 template was determined by PCR in all 21 lineages. In one case, a polymorphic L1 is present exclusively in the A/J lineage but none of the others, suggesting that it integrated very recently (Table 2).
To verify that fusion transcripts are present exclusively in strains containing a putative genomic L1 template, we analyzed total RNAs isolated from adult male testes from the five strains. For example, fusion transcripts of L1-Drosha, L1-Parp8 and an L1-novel gene were identified only in strains with relevant antisense L1 polymorphisms present (Fig. 5). Similarly, other fusion transcripts were detected only in strains with corresponding L1 templates, including a chimeric transcript from the L1-Arhgap15 locus (Table 2, Supplementary Fig. 2, Supplementary Table 8).
Fusion L1 transcripts are exemplified by the L1-Drosha fusion transcript, which is expressed at ~30% of the level of native Drosha in testis (Fig. 5a). This transcript contains both translation start and splice donor sites from L1, and is spliced in-frame with downstream exons encoding catalytic domains of Drosha (also called Rnasen), an RNaseIII gene centrally involved in microRNA biosynthesis (Murchison and Hannon 2004). Similarly, an L1-Parp8 fusion transcript also is predicted to be in-frame, and its open reading frame contains most functional domains of Parp8 (Fig. 5b). As a control, an assay for readthrough transcripts for the canonical genes, from which L1 polymorphisms are spliced out with usual introns, showed comparable expression levels. Remarkably, a novel, spliced transcript 1ASII-1 is promoted by a polymorphic L1 (Fig. 5c) in a genomic region where no cDNA or expressed sequence tag (EST) had been reported previously.
No appreciable fusion L1-Drosha transcript was identified by reverse transcriptase-mediated (RT-) PCR in non-gonadal tissues (Fig. 5a). By contrast, the novel fusion transcript 1ASII-1 was detected both in testis and in 11-day embryo tissues (Fig. 5c). We speculate that mechanisms such as transcriptional or post-transcriptional gene silencing, position effects, and/or availability of tissue-specific transcription factors may contribute to variable expression and control of particular transposon integrants in different developmental states (Whitelaw and Martin 2001). These and other fusion transcripts may encode protein variants or noncoding RNAs with regulatory or other functions.
We asked what proportion of endogenous L1 variants might contribute to transcriptional variation in the strains. Therefore, we screened adult testis total RNA samples for more L1 fusion transcripts. Out of 205 full-length, antisense L1 polymorphisms predicted inside RefSeq genes in the C57 genome, an arbitrary sample of 68 was screened. Of these, 13 (19%) drive fusion L1-gene transcripts, including 40% of the TF polymorphisms tested (Supplementary Table 9) (Li et al. 2008). Additionally, fusion L1-Arhgap15 transcription was identified in another screen (Table 2, Supplementary Table 9, and Supplementary Fig. 2b). Notably, two distinct intronic L1 polymorphisms occur in Grid2 in different strains, but only one drives expression of a fusion L1-Grid2 transcript while the other does not (Supplementary Table 9). Thus, we speculate that both polymorphic and non-polymorphic L1s may initiate additional transcripts in testes or other tissues, developmental stages and/or disease states such as cancers.
Another way by which L1 variants can affect tissue-specific gene structure and expression (Fig. 6) is illustrated by the rd7 mouse model of retinal degeneration (Chen et al. 2006). A de novo insertion of a full-length antisense L1 into exon 5 of Nr2e3 disrupts that gene’s normal transcription and splicing. Its donor itself is polymorphic, present only in C57, NZB/BinJ, and AKR/J out of the 21 strains tested (Table 2), thereby providing the first example of a “hot” endogenous mouse L1 that actively retrotransposed from its chromosomal location (Brouha et al. 2003). Thus, other full-length, polymorphic L1s also may be highly active donors in vivo.
Ontology analysis (Mi et al. 2005) of annotated genes containing L1 polymorphisms showed a significant exclusion from certain categories of genes, including genes associated with cell cycle, nucleic acid metabolism and oncogenesis (Table 4, Supplementary Table 10). By contrast, L1 polymorphisms are significantly enriched in the receptors category of molecular functions, suggesting that these genes generally may tolerate added structural or transcriptional variability mediated by transposon integration events. Non-polymorphic L1s and reference L1s were enriched significantly in brain-associated genes along with other ontological categories (Supplementary Table 10). A recent, high resolution analysis of copy number variation between mouse strains revealed that these structural variants also are excluded from similar groups of mouse genes required in fundamental cellular processes, e.g. those involved in cell cycle and nucleic acid metabolism (Cutler et al. 2007).
In this comprehensive study of intermediate-length structural variants that distinguish different inbred mouse strains, we found that a large majority was caused by endogenous retrotransposition, predominantly by L1 retrotransposons. Other classes of active retrotransposons, including LTR elements and SINEs, also have caused substantial variation between the strains (Fig. 2). These variants, which could become a useful adjunct to SNPs and STRs in genotyping studies, can be accessed in detail using the mouse PolyBrowse website (Stephens et al. 2008). While we identified over ten thousand independent variants (Fig. 2), their total numbers do not remotely approximate the 8.3 million SNPs identified to date in 16 classical and wild strains (Frazer et al. 2007). Nevertheless, summation of their cumulative lengths (Fig. 2) strongly suggests that these variants have altered millions of nucleotides genome-wide, affecting the structures of perhaps hundreds of genes. Recently, a similar scope of structural variation has been attributed to copy number variation between mouse strains (Cutler et al. 2007).
The extent of recent endogenous transposition in causing structural variation between mouse strains also appears to be substantially larger than that in humans, where non-homologous end joining appears to have been a predominant mechanism for generating variation (Korbel et al. 2007; Levy et al. 2007; Mills et al. 2006). The reasons for this striking difference are unclear, since human L1 retrotransposons (which mobilize LINEs, SINEs and SVA elements) paradoxically are more active than mouse L1s in tissue culture assays (Han and Boeke 2004). Moreover, their overall content in the human genome exceeds that in mouse (Lander et al. 2001; Waterston et al. 2002). Determination and comparison of the rates of structural variation by endogenous retrotransposition and by other mechanisms (Egan et al. 2007; Korbel et al. 2007) in mouse, man and other species will require additional study.
In this study, we used GMAP (Wu and Watanabe 2005) in a new way to align individual sequence traces to the C57 reference genome assembly (Fig. 1 and Supplementary Fig. 1). It is important to note that this alignment procedure, while fast and accurate, is also very stringent, so many additional polymorphisms are likely to remain uncounted. For example, variants in genomic regions with low sequence trace coverage were not counted here. If by chance no single sequence trace spanned a variant substantially on both sides, that variant would not be counted. Moreover, polymorphisms that are present in an unassembled genome but absent from the C57 reference genome were not fully identified here. In an effort to describe the complete extent of variants existing between strains, we currently are comparing classes of variants that can be identified by different methods including mate pair alignments (Dew et al. 2005), and documenting many more novel variants present in strains with unassembled genomes.
The genomes of more distantly related mouse species such as Mus spretus are likely to be even more distinct from the classical strains analyzed here, due in large part to consequences of active endogenous transposition. As shown in Table 2, not a single one of the arbitrary, polymorphic L1 retrotransposons that we assayed is present in the Mus spretus genomic DNA, suggesting that a major component of its genomic architecture (likely corresponding to many thousands of elements, on average approximately 1 kb long) is fundamentally different from that in its relatives. It is possible that such non-coding genomic compartments, outside of conserved exons, have been shaped differentially by endogenous transposition, but might contribute nevertheless to important biological differences between species, since their coding exons are expected to be extremely similar.
A substantial fraction of L1 variants directly affect neighboring gene expression and structures in a range of tissues, possibly contributing to functional differences between strains (Muotri et al. 2005). However, we presume that a majority of both polymorphic and non-polymorphic L1s still do not significantly affect expression of overlapping or nearby genes in most tissues (Supplementary Table 9), as we do not anticipate large differences between strains in the structure or expression of most genes. We cannot exclude the possibility that polymorphic transposons, in many cases, may cause subtle differences in the expression and structures of many genes (Han et al. 2004). It will be of great interest to compare transcriptomes in various mouse species with very distinctive genome structures, for example using gene expression microarrays or ultra-high-throughput sequencing, to elucidate the relationship between structural variation and transcriptional variation more fully (Stranger et al. 2007).
Many of the novel fusion L1 transcripts that we identified reflect altered gene structures. For example, the L1-Drosha and L1-Parp8 fusion transcripts (Fig. 5a and b, Supplementary Table 8) are predicted to encode many of the catalytic domains of the native gene products, together with short domains from the antisense L1 elements. Others, such as the novel spliced transcript 1ASII-1 (Fig. 5c), also demonstrate that transcription levels can be altered dramatically at a genomic locus previously thought to be devoid of exons. As the biological significance of such fusion transcripts remains unclear, we currently are evaluating whether such transcripts, initiated by certain polymorphic transposons, could rescue upstream promoter traps or affect tissue-specific gene expression levels. At least some of the variant fusion transcripts resulting directly from L1 retrotransposon polymorphisms may be noncoding RNAs with possible regulatory roles.
It is entirely possible that other structural variants, including those caused by other classes of retrotransposon polymorphisms (Fig. 2), may exert even larger effects upon transcriptional variation. For example, LTR retrotransposons may contain stronger promoters active in additional tissues and in other genomic contexts (Horie et al. 2007). Thus the functional consequences of transposon-mediated genomic variation upon transcripts may be variable themselves (Han et al. 2004). Variable transcription or added regulation mediated by polymorphic transposon promoters could provide a selective advantage that helps explain how mammalian hosts tolerate huge numbers of transposons in their genomes, despite the negative burden that their dispersal and maintenance engenders (Bestor 2003; Boissinot et al. 2001; Han et al. 2004; Yoder et al. 1997).
The generation of diversity between and within very recently separated mouse lineages by active mobilization of L1 retrotransposons emphasizes in detail that these elements are a built-in, active, dynamic engine for evolutionary changes – driving genetic variation and providing a substrate for natural selection – that operates even now (Kazazian 2004). As we documented here, the resulting changes caused by endogenous transposons are not merely structural, genomic variants: they can bring about direct changes in expressed transcripts, and quite likely phenotypic variation, as well.
Approximately 26 million sequence traces (~18 billion nucleotides) from four inbred mouse strains (A/J, DBA/2J, 129S1/SvImJ, and 129X1/SvJ) were downloaded from the tracedb archive, National Center for Biotechnology Information (NCBI, NIH). Only high-quality (>300 nt with Phred score >Q20) sequence traces were included, thereby excluding a very small percentage of traces. GMAP was used to align each individual trace to the C57 genome assembly (Stephens et al. 2008; Wu and Watanabe 2005). Possible alignment categories included no best alignment; polymorphism in C57; polymorphism in strain X; almost perfect alignment; and others (Fig. 1 and Supplementary Fig. 1). Candidate indels’ boundaries were determined by merging traces.
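The quality criterion above can be sketched in a few lines of Python. This is only an illustration under one assumed reading of the criterion: a trace is kept when more than 300 of its bases have a Phred score above 20. The function name and parameter names are hypothetical, and the authors' actual filtering procedure is not described beyond the sentence above.

```python
def passes_quality(phred_scores, min_bases=300, min_q=20):
    """Return True if a trace has more than `min_bases` bases with Phred > `min_q`.

    Assumed interpretation of ">300 nt with Phred score >Q20"; the thresholds
    are taken from the text, the counting rule is an assumption.
    """
    high_quality_bases = sum(1 for q in phred_scores if q > min_q)
    return high_quality_bases > min_bases


def filter_traces(traces):
    """Keep (trace_id, phred_scores) pairs that pass the assumed criterion."""
    return [(trace_id, quals) for trace_id, quals in traces if passes_quality(quals)]
```

A trace with 301 bases at Q30 would be retained under this reading, whereas a 500-base trace scored uniformly at Q20 would not.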
PolyBrowse, a query tool and graphical browser at http://polybrowse.abcc.ncifcrf.gov/ based on GBrowse (Stein et al. 2002), was developed to display all indels described here together with other available genomic variants and annotated features (Stephens et al. 2008). C57 reference genomic data were downloaded from the UCSC website, http://genome.ucsc.edu/, Feb. 2006 release. Protein domains were predicted using the SMART database, http://smart.embl-heidelberg.de/ (Letunic et al. 2006).
Procedures are described in Supplementary Materials and Methods.
Total RNA was isolated from grossly dissected adult testes (fasted, 72–75 day old males, harvested at same time of day), frozen in RNALater (Ambion), and homogenized in Trizol (Invitrogen) following standard protocols.
Genomic DNA from C57, 129S1, 129X1, A/J and DBA/2J mice was purchased from The Jackson Laboratory (Bar Harbor, ME). A locus-specific PCR amplicon was designed across the empty target site of each polymorphic repetitive element (Table 2 and Supplementary Table 5). Occasionally the same PCR reaction detected smaller integrants (<500 nt), while both left and/or right junctions of larger integrants were assayed using unique locus-specific primers in flanking genomic sequences paired with primers within the repetitive element (sequences available upon request). PCR products were assessed by agarose gel electrophoresis using standard methods.
Screens of commercial phage libraries and online EST libraries were performed as described in Supplementary Materials and Methods.
Synthesis of cDNAs was performed using SuperScript II (Invitrogen) with oligo-dT and gene-specific primers. Sequencing was performed as described in Supplementary Methods.
SNP reference genome coordinates were downloaded from NIEHS Perlegen and Celera databases (stored at tracedb, NCBI website) and compared to polymorphic transposon coordinates as described (Stephens et al. 2008).
To test various hypotheses about the genome-wide distribution of the 6,723 independent polymorphic L1s identified here, we generated lists of simulated integrants using a random number generator to assign chromosomal coordinates. To approximate genomic or intragenic distributions, 6,723 integrant locations were simulated 500 times, resulting in 3,361,500 simulated L1 insertions. Intronic integrants were identified by comparison with a database of RefSeq genes (NCBI). P-values were calculated using the binomial statistic and were adjusted by applying the Bonferroni correction (SPSS software) (Slonim 2002).
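The simulation and test described above can be sketched as follows. This is a minimal illustration, not the procedure the authors ran (they report using SPSS): it assumes that integrant sites are drawn uniformly over the genome, with each chromosome chosen in proportion to its length, and it computes a one-sided binomial tail probability in pure Python. All function names, and the chromosome-length dictionary passed in, are hypothetical.

```python
import random
from math import exp, lgamma, log, log1p


def simulate_integrants(chrom_lengths, n, rng):
    """Draw n (chromosome, position) sites uniformly over the genome.

    chrom_lengths maps chromosome name -> length in nucleotides; a chromosome
    is picked with probability proportional to its length (an assumption).
    """
    names = list(chrom_lengths)
    weights = [chrom_lengths[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=n)
    return [(chrom, rng.randrange(chrom_lengths[chrom])) for chrom in chosen]


def binom_logpmf(k, n, p):
    """Log of the binomial probability mass at k, for 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log1p(-p))


def binom_tail_p(k, n, p):
    """One-sided tail: P(X <= k) if k is at or below n*p, otherwise P(X >= k)."""
    ks = range(0, k + 1) if k <= n * p else range(k, n + 1)
    return min(1.0, sum(exp(binom_logpmf(i, n, p)) for i in ks))


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

In use, `p` would be the proportion of simulated integrants falling in a region of interest (for example, one chromosome or the set of RefSeq introns), `k` the number of observed polymorphic L1s in that region, and `n` the 6,723 observed L1s; the resulting tail probability would then be multiplied by the number of regions tested.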
To sample gene categories randomly for ontology analysis, based on their relative lengths, the simulation was performed 1,000 times, resulting in 6,723,000 simulated integrants. To investigate whether genes are involved in a biological process affected by polymorphisms, we used the GeneID associated with each accession to query the PANTHER database (Mi et al. 2005) at http://www.panther.org. Simulated integrants or reference L1s were used alternatively as reference groups, as indicated. Biological process or molecular function categories were deemed significant if, upon applying the Bonferroni correction, their p-values were less than 0.01 as determined by the binomial statistic (Mi et al. 2005).
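Length-weighted sampling of gene categories, as described above, might look like the following sketch. It assumes each gene is represented as a tuple of identifier, length and a list of ontology categories, and that a simulated integrant lands in a gene with probability proportional to that gene's length. The data layout and function names are hypothetical; the actual category assignments in the study came from PANTHER.

```python
import random
from collections import Counter


def simulate_category_counts(genes, n_integrants, rng):
    """Count ontology categories hit by length-weighted random integrants.

    genes: list of (gene_id, length_nt, categories) tuples (assumed layout).
    Longer genes are proportionally more likely to receive an integrant.
    """
    lengths = [length for _, length, _ in genes]
    counts = Counter()
    for _, _, categories in rng.choices(genes, weights=lengths, k=n_integrants):
        counts.update(categories)
    return counts


def expected_fraction(genes, category):
    """Expected share of integrants in `category` under length weighting."""
    total = sum(length for _, length, _ in genes)
    in_category = sum(length for _, length, cats in genes if category in cats)
    return in_category / total
```

The expected fraction for a category can serve as the binomial success probability when the observed number of polymorphic L1s in that category is compared against the simulated reference group.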
We thank Drs. Maxine Singer, Michael Kuehn, Beverly Mock, Maura Gillison, and Berton Zbar for helpful comments on drafts of this manuscript, and members of the Symer lab for constructive discussions. This research was supported by the Intramural Research Program of the Center for Cancer Research, National Cancer Institute, NIH and in part was funded by NCI contract N01-CO-12400. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government.
NCI-Frederick is accredited by the Association for Assessment and Accreditation of Laboratory Animal Care International and follows the U.S. Public Health Service Policy for the Care and Use of Laboratory Animals. Animal care was provided in accordance with the procedures outlined in the "Guide for Care and Use of Laboratory Animals" (National Research Council, 1996, National Academy Press, Washington, D.C.). Mouse studies were performed following a protocol approved by the Animal Care and Use Committee, NCI-Frederick.
This manuscript is accompanied by Supplementary Information. GenBank accession numbers EF591871 – EF591883 are included in tables with novel sequences. No unpublished results from individuals other than the authors have been referenced in this manuscript.