There are two competing views on the origin and composition of the genome of classical inbred strains6,7
. The first view claims that the genome of these strains is 68% M. m. domesticus
, 10% M. m. molossinus
, 6% M. m. musculus
, 3% M. m. castaneus
and 13% of unknown origin6
. On the other hand, we concluded that 92% is of M. m. domesticus
, 6% of M. m. musculus
and 1% of M. m. castaneus
. Both studies were based on NIEHS data6
but took different approaches to the use of wild-derived inbred strains as reference genomes to infer subspecific origin. Frazer and coworkers assumed that the four wild-derived strains, WSB/EiJ, PWD/PhJ, CAST/EiJ and MOLF/EiJ, were faithful representatives of four subspecies, M. m. domesticus, M. m. musculus, M. m. castaneus
and M. m. molossinus
, respectively. We concluded that three of these wild-derived strains, PWD/PhJ, CAST/EiJ and MOLF/EiJ, had introgressed haplotypes from other subspecies. In regions where a given wild-derived strain has undergone such intersubspecific introgression, its genotypes are not suitable as a reference for that subspecies. The results presented here conclusively demonstrate that classical inbred strains are overwhelmingly derived from M. m. domesticus
, that the non-M. m. domesticus
contribution to their genomes is largely of M. m. molossinus
origin, and that intersubspecific introgression is common in wild-derived laboratory strains.
The wild caught mice used here represent a geographically diverse sample. The genomes of these mice are overwhelmingly derived from a single subspecies (mean: 99.84%; range: 98.42–100%). Half of the wild caught mice carry small regions with haplotypes from a second subspecies, mostly in heterozygous combinations. We acknowledge that a larger and more geographically diverse set of mice would be of great interest, but it would have little impact on our conclusions regarding the origin of the genome of the laboratory mouse. We also acknowledge that our definition of diagnostic alleles in SNPs and VINOs may change with the inclusion of more samples. However, this definition provides a simple and robust method to assign phylogenetic origin while preserving enough flexibility to account for genotyping error, homoplasy and gene flow among subspecies in the wild. Although our method works very well at the Mb scale, it has limitations in providing subspecific assignments at finer scales (Supplementary Figure 8).
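As an illustration of how diagnostic alleles might be turned into a regional assignment, the following is a minimal sketch and not the procedure used in this study. It assumes that each diagnostic allele observed in a genomic window has already been labeled with a subspecies, and it tolerates a minority of discordant calls to mimic the flexibility described above for genotyping error, homoplasy and gene flow. The function name, the input format and both thresholds (`min_calls`, `min_fraction`) are hypothetical.

```python
from collections import Counter


def assign_window_origin(diagnostic_calls, min_calls=3, min_fraction=0.8):
    """Assign a subspecies to one genomic window (illustrative sketch only).

    diagnostic_calls: list of subspecies labels, one per diagnostic allele
        observed in the window (hypothetical input format).
    min_calls, min_fraction: assumed thresholds, not values from the study.
    Returns the majority label, or None when support is too weak.
    """
    # Too few diagnostic alleles: leave the window unassigned.
    if len(diagnostic_calls) < min_calls:
        return None
    label, count = Counter(diagnostic_calls).most_common(1)[0]
    # Tolerate a minority of discordant calls, but require a clear majority.
    if count / len(diagnostic_calls) < min_fraction:
        return None
    return label


if __name__ == "__main__":
    calls = ["domesticus"] * 9 + ["musculus"]
    print(assign_window_origin(calls))
```

With few diagnostic alleles per window, a rule of this kind returns no assignment, which is one way to picture why assignments become less reliable at finer scales.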
Excluding hybrid strains, 28 wild-derived strains have intersubspecific introgressions covering between 1% and 27% of their genome (Supplementary Table 1
). In CAST/EiJ and PWD/PhJ, the two strains that were used as references in previous studies, introgression covers 12% and 7% of their genome, respectively, confirming 96% of the regions that were declared introgressed in our previous study (Supplementary Figure 9
). We have been able to identify additional regions of introgression in CAST/EiJ and PWD/PhJ due to the better reference genotypes for each subspecies and the combined use of SNPs and VINOs. Subspecies, time since derivation, and laboratory history appear to have a strong effect on the prevalence and extent of intersubspecific introgression, which could have occurred in the wild or in the laboratory. The limited extent of introgression in wild caught samples suggests that breeding in the laboratory played a major role in shaping the genomes of wild-derived strains. Independent confirmation was obtained by comparing the genomes of wild-derived and classical inbred strains. Fifteen wild-derived strains have inherited haplotypes from classical inbred strains. Contamination by classical strains was expected, and likely intentional, in some cases (e.g., SOD1/EiJ and RBB/DnJ) but not in others (e.g., CASA/EiJ and CALB/RkJ). Introgression in the remaining wild-derived strains probably arose through a combination of gene flow in the wild (in samples captured close to hybrid zones and recently colonized regions) and breeding in the laboratory to non-classical mouse stocks (most likely other wild-derived mice). Wild-derived inbred strains have been used frequently as models in evolutionary studies20
. Our results suggest that new information about the subspecific origin of the strains should be incorporated in the analyses.
A complementary strength of our study was the ability to account and correct for ascertainment biases in the SNPs included in the array. Most of these SNPs were selected on the basis of the local phylogeny among the NIEHS strains. This approach ensured that all major local branches were represented while ignoring minor branches. However, the approach also had limitations because locally all branches represented in the array were allocated the same number of SNPs and, therefore, long and short local branches would appear to be equal in length17
. Furthermore, there are subspecies-specific false negative rates in SNP identification in the NIEHS study and prior identification of a SNP is a necessary condition for its presence in the array7
. Subspecies-specific false negative rates in SNP discovery should also negatively affect the rate at which selected SNPs are converted into successful genotyping assays17
. For example, M. m. castaneus
SNPs should be underrepresented compared to the true level of diversity due to the combined effects of our selection criteria and the higher assay failure rate. However, we were able to overcome the high failure rate by using VINOs. For the purpose of this study, VINOs have the critical advantage of being less subject to ascertainment biases within a given phylogenetic group. However, VINOs can only be reliably detected in homozygosity, resulting in a significant undercounting of VINOs in some samples (Supplementary Table 1
). We conclude that the combination of SNP and VINO genotype data in wild caught mice has enormous value for population studies.
Among the most useful results of the present study are the maps of subspecific origin and haplotype diversity of the genome of classical inbred strains. These maps should allow researchers to combine information from multiple crosses to refine candidate intervals. They should also extend the advantages of the very high-density genotype data in the 15 NIEHS strains (and eventually whole genome sequence) to many additional classical strains5,10
. Our maps will enable researchers to determine not only which strains share the same haplotype in a given region but also the sequence divergence among those strains that do not share it. We have also calculated the number of variants used to infer IBD and a score to guide interpretation of these trees by potential users. In particular, we have flagged haplotypes with weak support. Our data and tools should allow researchers to rapidly determine the number of haplotypes in a given region and the level of sequence divergence among them. Both are important considerations for association mapping. These data will also allow researchers to identify discrete regions of genetic divergence between substrains. Finally, they may be used to select strains with the desired level and type of genetic variation in any given region of the genome.
The spatial distribution of mean genetic variation observed in the 100 classical strains analyzed here is very similar to the one reported previously for a set of only 12 classical strains7
(Supplementary Figure 10).
Although our approach based on recombination intervals cannot be directly extended to wild-derived strains, we have used a fixed window approach to determine the level of haplotype diversity and IBD among these strains. This analysis demonstrates that, as expected, there is much more diversity in wild-derived strains than in classical strains and, therefore, more opportunities to optimize genetic research. Analysis of the frequency distribution of genotype identity in pairwise comparisons between wild-derived strains provides insight into the natural history of these strains and the populations from which they were derived. In contrast with comparisons to classical inbred strains, these distributions are typically unimodal in intrasubspecific comparisons (Supplementary Figure 6b
). However, we also observe a strong signature of IBD in several pairwise comparisons. Some of the strongest cases involve pairs of strains derived from mice trapped in geographically close localities (Supplementary Table 1
). Excess IBD can be explained by the presence of introgression from classical inbred strains that are themselves IBD for a significant fraction of their genome (Supplementary Figure 6
). There are some strains that are connected to several cliques creating a complex network. Finally, all M. m. molossinus
wild-derived strains (Supplementary Table 1
) have very high levels of IBD (~34%). This observation and the unusually high level of genotype identity between the M. m. molossinus
haplotypes present in classical strains and wild-derived M. m. molossinus
strains strongly suggest a recent population bottleneck in this hybrid subspecies.
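The fixed window comparison of genotype identity between pairs of strains described above can be pictured with a short sketch. This is only an illustration under stated assumptions, not the pipeline used in this study: genotypes are given as parallel lists of calls with `None` for missing data, and the window size and the identity threshold used to call a window IBD (`threshold`) are hypothetical choices.

```python
def window_identity(geno_a, geno_b, positions, window_size):
    """Fraction of matching genotype calls in each fixed-size window.

    geno_a, geno_b: genotype calls for two strains at the same markers
        (None marks a missing call; hypothetical input format).
    positions: marker positions in bp, in the same order as the calls.
    Returns {window_index: identity}, skipping windows with no usable calls.
    """
    totals = {}
    for a, b, pos in zip(geno_a, geno_b, positions):
        if a is None or b is None:
            continue  # ignore markers with a missing call in either strain
        window = pos // window_size
        matches, n = totals.get(window, (0, 0))
        totals[window] = (matches + (a == b), n + 1)
    return {w: m / n for w, (m, n) in sorted(totals.items())}


def ibd_fraction(identity_by_window, threshold=0.99):
    """Share of windows whose identity meets an assumed IBD threshold."""
    if not identity_by_window:
        return 0.0
    ibd = sum(1 for v in identity_by_window.values() if v >= threshold)
    return ibd / len(identity_by_window)


if __name__ == "__main__":
    a = ["A", "A", "G", "C", "T", "T"]
    b = ["A", "A", "G", "T", "T", None]
    pos = [10, 20, 30, 110, 120, 130]
    identity = window_identity(a, b, pos, window_size=100)
    print(identity, ibd_fraction(identity))
```

Plotting the per-window identities for one pair of strains as a histogram gives the kind of frequency distribution discussed above, where an excess of near-identical windows would appear as a second mode.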
In summary, our observation of residual heterozygosity among inbred mouse strains, the striking local differences in the level of genetic similarity between substrains, the identification of large deletions of different ages and the prevalence of contamination emphasize the importance of deep, unbiased and frequent genetic characterization of laboratory stocks. Our genome browser provides access to the trees and links between recombination intervals, local trees, and the maps for subspecific origin and haplotype diversity. Our analysis demonstrates that classical inbred strains are in fact mosaics of a handful of haplotypes present in the founder fancy mice population. The genetic divergence among these haplotypes varies widely both locally and across the genome. Furthermore, the contribution of subspecies other than M. m. domesticus
is limited and its distribution highlights the complex population structure in these strains. On the other hand, wild-derived laboratory strains represent a deep reservoir of genetic diversity untapped in classical strains and are in many cases analogous to the three-way intersubspecific hybrids that classical inbred strains were thought to be. Our previous work7,21
combined with the results of the deep survey of mouse resources presented here demonstrates that the laboratory mouse represents an unparalleled model for genetic studies in mammals.