We identified CNVs in inbred laboratory mouse strains by hybridizing genomic DNA to GeneChip® Mouse Exon 1.0 ST Arrays, which contain probes that specifically target exons and are thus well suited to identifying CNVs that underlie gene expression differences. These arrays provide at least an order of magnitude more probes than previous efforts. CNVs were identified from probe hybridization intensity data using a Hidden Markov Model (HMM) approach. The HMM assigned probabilities to each probe for three states: duplicated, deleted or ground, relative to the B6 reference strain. The use of the HMM and a large number of exon-specific probes provides high power to detect large CNVs that involve known genes; smaller or intergenic CNVs are less likely to be identified using this method. We identified a total of 68 duplications and 47 deletions (Table S1), some of which have been identified previously. The limited overlap between our results and those previously published was expected because our probes did not uniformly interrogate non-genic regions. The size and frequency distribution of CNVs identified by our approach are shown in Figure S1. In this paper we focus on a duplication that is among the most consistent findings of this and previous studies of CNVs in mice. This duplication is located on chromosome 17 from 30,174,390 to 30,651,226 bp (build 36) and encompasses full copies of Glo1 and Dnahc8 and partial copies of Glp1r and Btbd9.
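To illustrate how a three-state HMM can segment probe intensities, the following is a minimal sketch rather than the model used in this study. It assumes Gaussian emissions on a log2 intensity ratio (strain versus B6), and the state means, standard deviation and transition probability are hypothetical values chosen for illustration. It also decodes the single most likely state path with the Viterbi algorithm, whereas the analysis described above assigned per-probe state probabilities.

```python
import math

STATES = ("deleted", "ground", "duplicated")

# Hypothetical parameters (not taken from the paper).
MEANS = {"deleted": -1.0, "ground": 0.0, "duplicated": 0.58}  # mean log2 ratio per state
SIGMA = 0.35   # assumed common standard deviation of probe noise
STAY = 0.98    # assumed probability of remaining in the same state


def log_emission(x, state):
    """Log density of a Gaussian emission for one probe."""
    mu = MEANS[state]
    return -0.5 * ((x - mu) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Sticky transitions: stay with probability STAY, else split the rest evenly."""
    return math.log(STAY if prev == cur else (1 - STAY) / (len(STATES) - 1))


def viterbi(ratios):
    """Most likely state path for probes ordered by genomic position."""
    if not ratios:
        return []
    # Uniform prior over states for the first probe.
    score = {s: math.log(1 / len(STATES)) + log_emission(ratios[0], s) for s in STATES}
    back = []
    for x in ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(x, cur))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these assumed parameters, a run of ten consecutive probes near a log2 ratio of 0.6 is called as duplicated, while a single outlying probe is absorbed into the ground state. This mirrors the point made above that large CNVs covered by many probes are detected with higher power than small ones.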
PCR to Confirm Duplication on Chromosome 17
We confirmed the presence of this duplication using real-time PCR to quantify genomic DNA template near the predicted ends of the duplication (Table S2) and subsequently used PCR with primers directed across the predicted duplication; such primers would only produce a product in the presence of a tandem duplication (reverse primers P3 or P4 and forward primer P11; see Table S2). We then sequenced across the duplication boundary, which allowed us to precisely define the location and the full extent of the duplication (30,174,390–30,651,226; mouse genome build 36). We used this PCR-based assay to test for the duplication in 71 inbred mouse strains, which included the 40 strains that make up the JAX phenome panel, and determined that 23 of these strains have the duplication.
Characterization of Duplication in Wild-caught Mice
To further evaluate the history of this duplication, we examined unrelated individuals collected directly from the wild over a large geographic range from all three mouse subspecies: M. m. domesticus, M. m. musculus, and M. m. castaneus. PCR and sequencing at the tandem duplication boundary confirmed that domesticus mice from Germany (100%) and France (25%) and a single castaneus mouse from Taiwan (12.5%) had the duplication. The duplication was absent in some domesticus populations (those from Iran), as well as in most castaneus and all musculus mice, and was also absent in the closely related species M. spretus. These results indicate that the duplication predates the isolation of laboratory mice from the wild. Whether the single observation of this duplication in a castaneus mouse is the result of gene flow between domesticus and castaneus populations or indicates that the duplication predates the divergence of these two populations is an important question that cannot be resolved with the current data.
Presence of the duplication in wild mice.
Relationship between Duplication and Gene Expression in BXD RI Strains
We next explored the possibility that this duplication causes heritable gene expression differences. We used WebQTL (www.webqtl.org) to explore the relationship between the duplication and Glo1 expression in the BXD recombinant inbred (RI) lines. The BXD RI lines are derived from a cross between DBA/2J, which has the duplication, and B6, which does not. We identified highly significant cis-eQTLs for all 4 probesets that target Glo1
(), as well as probe set 1458719_at which has been annotated as either Btbd9
() in hippocampal expression data (Hippocampus consortium M430v2 (Jun06) PDNN). We observed similar eQTLs in all other tissues for which expression data in the BXD RI strains was available (whole brain, striatum, cerebellum, eye, hematopoietic cells, kidney and liver).
cis-eQTL for Glo1 and the probe 1458719_at.
The probeset 1458719_at maps to an intron of Glp1r and is homologous to several mouse ESTs (BM233846, BQ560923, BX632944). These ESTs appear to be fusion products of the duplicated copy of Btbd9 and an intron of Glp1r; mRNA transcription presumably continues across the duplication boundary and incorporates sequence from an intronic region of Glp1r, thus producing a novel gene product in strains that possess this duplication. Consistent with this hypothesis, we have only observed appreciable signal from 1458719_at in strains that are positive for the duplication. We did not observe any significant cis-eQTLs in any of the tissues that we examined for the many probes that correctly measure the partially duplicated genes Btbd9 and Glp1r nor did we observe evidence of significant eQTL for Dnahc8. Thus, while 2 genes are fully duplicated (Glo1 and Dnahc8) and two others are partially duplicated (Btbd9 and Glp1r), only Glo1 shows a statistically significant increase in expression as a result of the duplication.
Relationship between Duplication and Gene Expression in a Panel of Inbred Strains
Having established that an eQTL for Glo1 was highly significant when considering a cross between two inbred strains, we sought to extend these findings by mapping an eQTL for expression of Glo1 in a panel of inbred strains, similar to approaches that have been proposed for genome-wide association analysis and in an effort to follow up on the findings of Hovatta et al. We obtained expression data from the amygdala for 27 inbred strains for which duplication status was also known. To our surprise, of the 48 SNPs examined between 30 and 31 Mb, neither single SNPs nor 3-SNP haplotypes were strongly associated with Glo1 expression in the amygdala. Specifically, the maximum −log10(p) for any single SNP was 2.9 (rs3150712; 30.49 Mb) whereas the maximum −log10(p) for any 3-SNP haplotype was 2.77 (rs33190587; 30.12 Mb). When the same tests were re-run with a correction for population structure, the highest scoring SNP and 3-SNP haplotype did not change, but their −log10(p) decreased to 2.15 and 2.28, respectively. In contrast, the presence of the duplication, as determined using our PCR-based assay, was very strongly associated with Glo1 expression (−log10(p)>6), revealing the expected cis-eQTL for Glo1. Correction for population structure did not diminish this result. These data were surprising because we had expected that SNPs and 3-SNP haplotypes in the vicinity of the duplication would be strongly associated with the duplication and hence with expression of Glo1.
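The −log10(p) scores reported above can be converted back to p-values and compared against a multiple-testing threshold. The sketch below is illustrative only: the Bonferroni correction over the 48 SNPs is an assumption introduced here, not a threshold stated in the text, and the function names are hypothetical.

```python
import math


def neg_log10(p):
    """Convert a p-value to the -log10(p) scale."""
    return -math.log10(p)


def p_from_neg_log10(score):
    """Convert a -log10(p) score back to a p-value."""
    return 10.0 ** (-score)


N_SNPS = 48   # SNPs examined between 30 and 31 Mb
ALPHA = 0.05

# Assumed Bonferroni threshold on the -log10 scale (not stated in the text).
bonferroni_threshold = neg_log10(ALPHA / N_SNPS)

# Scores reported above; the duplication is reported only as ">6",
# so 6.0 is used here as a lower bound.
reported_scores = {
    "best single SNP": 2.9,
    "best 3-SNP haplotype": 2.77,
    "duplication (lower bound)": 6.0,
}

passes_threshold = {
    name: score > bonferroni_threshold for name, score in reported_scores.items()
}
```

Under this assumed threshold (about 2.98 on the −log10 scale), neither the best SNP nor the best haplotype would pass, whereas the duplication would pass by a wide margin.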
Haplotype Structure Among Inbred Strains from 30 to 31 Mb
SNP-association results from the BXD RI strains, but not from a panel of inbred strains, identified a cis-eQTL for Glo1 at the location of the duplication. To better understand this surprising result, we examined the haplotype structure from 30–31 Mb in the 71 strains where the duplication status was known, using the same 48 SNPs. Importantly, these SNPs flanked, but were not internal to, the duplicated region. We observed four distinct haplotypes that contained the duplication (denoted with green borders). Careful examination of these data led us to conclude that multiple non-allelic homologous recombination events had taken place within the duplicated region. Non-allelic homologous recombination can occur between two chromosomes that both contain the duplication or between one duplicated and one non-duplicated chromosome. Such a recombination would lead to the exchange of either the distal or proximal haplotypes without altering duplication status. In particular, this explanation accounts for the SNP haplotype observed in the proximal regions of B6 and related strains (not duplicated) and DBA/2J (duplicated). Thus, we hypothesize that recombination within the duplication is at least partially responsible for the complex haplotype structure around the duplication.
Haplotype blocks from 30–31 Mb across 71 inbred strains at 48 SNP markers.
Sequencing to Obtain Additional SNPs within the Duplication
The number of SNPs located within the duplication that were available from public databases was relatively small compared to the number of SNPs on either side of the duplicated region. Those that were available had higher than average rates of missing genotypes, and genotypes appeared to be disproportionately missing in strains that were positive for the duplication. We suspected that both phenomena were due to SNP assays or strain-specific SNP genotypes being scored as failures because heterozygous genotypes were obtained from inbred strains. A heterozygous genotype contradicts the assumption that these mice are inbred, and could thus appear to be a technical error. We suspected that these apparently heterozygous genotypes were due to polymorphisms between the proximal and distal duplicated regions. To test this hypothesis, we sequenced 9 SNPs that were entirely within the duplication in 18 inbred strains (6 of which had the duplication). We observed multiple apparently heterozygous genotypes in duplicated strains, but none in non-duplicated strains (data not shown). The heterozygous genotypes can best be explained by a region that is polymorphic when comparing the proximal and distal copies of the duplication, rather than being heterozygous due to differences between two chromosomes. The heterozygous genotypes agreed with the observed haplotype structure, and it was possible to identify the haplotypes of the proximal and distal copies. These observations are consistent with the hypothesis that the complex relationship between the duplication and surrounding haplotypes is due to crossovers that occurred within the duplicated region.
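The comparison described above, apparent heterozygous calls in duplicated versus non-duplicated inbred strains, can be sketched as a simple tally. The strain names and genotype calls below are invented for illustration and are not data from this study; the function names are likewise hypothetical.

```python
def is_heterozygous(call):
    """True for a two-allele call whose alleles differ, e.g. "AG"."""
    return len(call) == 2 and call[0] != call[1]


def het_calls_by_group(genotypes, duplicated_strains):
    """Count apparent heterozygous calls in duplicated and non-duplicated strains."""
    counts = {"duplicated": 0, "non_duplicated": 0}
    for strain, calls in genotypes.items():
        group = "duplicated" if strain in duplicated_strains else "non_duplicated"
        counts[group] += sum(1 for call in calls if is_heterozygous(call))
    return counts


# Invented genotypes at three SNPs inside the duplicated interval.
example_genotypes = {
    "strain_1": ["AG", "CC", "CT"],  # hypothetical strain with the duplication
    "strain_2": ["AA", "CC", "CT"],  # hypothetical strain with the duplication
    "strain_3": ["AA", "CC", "TT"],  # hypothetical strain without the duplication
    "strain_4": ["GG", "CC", "CC"],  # hypothetical strain without the duplication
}
example_counts = het_calls_by_group(example_genotypes, {"strain_1", "strain_2"})
```

In a truly inbred strain, a heterozygous call inside the interval is more plausibly a difference between the proximal and distal copies than a difference between homologous chromosomes, which is why such calls are expected only in the duplicated group.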
In order to avoid the ambiguity associated with heterozygous genotypes we used sequencing primers that spanned the duplication boundary (duplicated strains only). For comparison, we also used sequencing primers where one primer was outside of the duplicated region and the other was inside the duplicated region (all strains). This allowed us to uniquely compare SNPs that might be polymorphic between the proximal and distal copies in the duplicated strains, and also to directly extend our haplotype analysis into the duplicated region. When considering the proximal duplication boundary (~30.17 Mb), we observed two informative SNPs (30,174,423 and 30,174,589) both of which appear to have been exchanged via crossing over between duplicated and non-duplicated strains. The region near to 30.65 Mb was more interesting: a SNP at 30,650,629 is polymorphic such that the second duplication-containing haplotype block (the one that includes the A/J strain) differs from all other duplicated strains. The second duplication-containing haplotype block has a characteristic allele that is never observed in the analogous region in any of the other duplicated or non-duplicated strains. The second SNP in this region (30,650,736) perfectly distinguishes the internal from the external duplicated and non-duplication sequence and is thus never observed in non-duplicated strains. Both observations are consistent with these SNPs arising after the duplication occurred, though we cannot exclude the possibility of recombination or gene conversion with an unobserved, non-duplicated chromosome that contained these SNPs. 
Had the SNP at 30,650,736 been present in the inbred strain database, it would have performed as well as our PCR assay in predicting the presence of the duplication. Exhaustive SNP coverage may therefore help to address problems with inbred strain haplotype mapping, but it will also increase the multiple testing burden and is thus unlikely to offer a solution to the problem of using SNPs to identify CNVs among inbred strain panels.
Evidence for Recent Loss of the Duplication
Another observation that we made in examining these data is that four strains (129X1/SvJ, BALB/cJ, BTBR T+tf/J and PERA/EiJ) appeared to belong to one of the two common duplication-containing haplotypes but did not contain the duplication (denoted with red boxes). We hypothesized that the duplication has reverted to the non-duplicated state in these strains by a process of non-allelic homologous recombination (sometimes termed unequal crossing over). This hypothesis was supported both by examination of DNA samples obtained from previous generations of these inbred strains and by taking advantage of the known breeding history of these strains. In this regard, one compelling case was the absence of the duplication in BALB/cJ. Historically, we know that BALB/cByJ and BALB/cJ were separated in the 1930s, but they remain isogenic at all typed SNPs and clearly share the same haplotype throughout the duplicated region. The absence of the duplication in BALB/cJ could be explained by a reversion event that occurred after their separation from BALB/cByJ in 1937. We obtained DNA samples from JAX for BALB/cByJ from 1982 and 2000 and determined that both had the duplication. We also obtained DNA from BALB/cBy (used by Tafti et al, see below), which was separated from BALB/cByJ in the 1970s, and found that, unlike BALB/cByJ, BALB/cBy did not have the duplication. Finally, we obtained DNA from 1998 and 2006 for BALB/cJ and found that the duplication was not present in either sample. These data are consistent with the hypothesis that the duplication has reverted to the non-duplicated state and/or that duplicated and non-duplicated alleles have been segregating within the BALB/c lineage for some time. Because of the known breeding history of the BALB/c mice, the reversion observed in 129X1/SvJ, which shares the BALB/c haplotype, is presumed to have an independent origin. We obtained a sample from the F30 generation of PERA/EiJ (1982) and found that this sample was positive for the duplication, whereas the modern samples (F99) were negative for the duplication. SNPs that were inside the duplication in the F30 sample of PERA/EiJ exactly matched the other strains in that haplotype block (data not shown). These observations are consistent with loss of the duplication due to unequal crossing over, as discussed above. Genomic DNA from BTBR T+tf/J from 1996 and 2004 as well as a sample from 129X1/SvJ from the 1990s were similar to modern samples and thus offered no further insights.
Thus, this duplication cannot be reliably predicted based on single SNPs or multi-SNP haplotypes due to simple recombination, genotyping problems due to heterozygous genotypes, and reversion to the non-duplicated state via non-allelic homologous recombination. The observation of changes in CNVs over time is consistent with recent reports from Watkins-Chow & Pavan and Egan et al, which show changes in CNVs among and within inbred strains. Indeed, Egan et al appear to have identified a copy number difference for the duplication discussed in this paper between two different A/J mice (denoted as INTRA-1 on page 76 of the supplemental materials for that paper).
Relationship between the Duplication and Glo1 Expression in Wild-Caught Mice
We examined gene expression in whole brain homogenates from unrelated, outbred, wild-caught domesticus mice from Germany (6 mice) and France (6 mice). We found a highly significant (−log10(p)>4) association between the presence of the duplication, as determined using our PCR-based assay, and expression of Glo1 as measured by all four Glo1-specific probesets as well as 1458719_at.
We examined expression levels of all 4 Glo1 probesets as well as 1458719_at in a previously published dataset that examined inbred wild-derived domesticus strains. Even with a very small sample size (n = 2 per group), Glo1 expression was significantly higher in the domesticus-derived inbred strains (−log10(p)>2.25) relative to the other two inbred strains. PCR-based genotyping of genomic DNA confirmed that only the domesticus-derived inbred strain had the duplication. These data further support a direct relationship between the duplication and Glo1 expression in outbred, wild-caught and wild-derived inbred mouse strains.
Relationship between the Duplication and Anxiety-Like Behavior Using Publicly Available Datasets
Our data clearly establish that increases in Glo1 expression in panels of inbred strains, outbred populations and even in wild-derived populations are driven by a duplication of the Glo1 gene. This knowledge should facilitate an examination of this gene's role in behavioral phenotypes. We used the results from our PCR-based genotyping method to score presence or absence of the duplication in many common inbred strains using data shown in Table S3 and found significant correlations with various classical tests of anxiety-like behavior, including the elevated zero maze, elevated plus maze, light dark box and open field test, that were available from the JAX Phenome site (www.jax.org/phenome). A partial summary of these findings is presented in Table S4
. For example, data from 8 strains tested in the elevated zero maze showed that the duplication was associated with fewer beam breaks in the closed quadrant of the open field and more fecal boli (r 0.023, respectively). Data from another set of 13 strains tested in elevated zero maze study (Brown1; unpublished) showed that the duplication was associated with greater numbers of fecal boli (r 0.034). Data from 13 strains tested in the elevated plus maze (Brown1) again showed that the duplication was associated with a greater number of fecal boli (r = 0.73; p<0.01). Data from 13 strains tested in the light dark box (Brown1) showed that the duplication was associated with less total activity and a greater number of fecal boli (r = 0.76; p<0.01, r = 0.83; p<0.01, respectively). Data from 21 strains tested in the open field test showed that the duplication was associated with less movement in the open field for each of the first 5 minutes (e.g. distance traveled in the first minute: r = 0.57; p<0.01). Data from a separate study of 8 strains also tested in the open field showed that the duplication was associated with less activity and time in the center of the open field (r = 0.89; p<0.01 and r 0.010, respectively). Data from a third study of the open field (Brown1; unpublished) showed that the duplication was associated with less activity and a greater number of fecal boli (r = 0.73; p<0.01, r = 0.67; p<0.013, respectively). It is important to note that in all cases, presence of the duplication was negatively associated with activity and positively associated with anxiety-like behavior, consistent with our hypothesis and the observations of Hovatta et al. A number of other phenotypes were also correlated with the presence of the duplication even though they have no obvious relationship to anxiety. For example, latency to respond in the hot plate test, a measure of nociception, was longer in strains that carried the duplication (r = 0.81; p<0.01). Taken together, these data demonstrate that the duplication is correlated with behaviors including, but not limited to, those associated with anxiety.
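Correlating a binary trait such as duplication status with a quantitative strain mean amounts to a point-biserial correlation, which is Pearson's r with one variable coded 0/1. The following minimal sketch assumes that coding; the example values are invented and are not data from the studies cited above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; with xs coded 0/1 this is the point-biserial r."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Invented strain means: 1 = duplication present, 0 = absent.
duplication = [1, 1, 1, 0, 0, 0]
fecal_boli = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
r_example = pearson_r(duplication, fecal_boli)
```

With this coding, a positive r means the phenotype is higher in strains carrying the duplication and a negative r means it is lower.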
Relationship between the Duplication and Anxiety-Like Behavior Using New Inbred Strain Data
Because of the limited number of inbred strains for which behavioral data were available in the public databases (21 or fewer strains per study), and the genetic and environmental variability presumed to exist for anxiety-like behavior, the power of the correlations observed in public databases was limited. To test the association between the duplication and anxiety-like behavior more rigorously, we used our own behavioral data from 38 inbred strains (901 individual mice total) for which the duplication status was known. We observed a significant reduction in percent time in the center of the open field among strains that had the duplication (p = 0.0043), further demonstrating that the duplication was associated with greater anxiety-like behavior. The strength of this association was not changed when a correction for population structure available in SNPster (snpster.gnf.org) was applied, suggesting that this association was not due to artifacts associated with population structure. While none of these correlations would have been significant if a correction for multiple comparisons (e.g. for all CNVs identified in this study) had been applied, such corrections were unwarranted because we had a strong prior hypothesis that Glo1 expression affects anxiety-like behavior. In the present study we have focused on a single CNV, and so the required threshold has been set to the traditional value of 0.05.
Relationship between behavior, duplication, gene expression and haplotype block.
Relationship between Glo1 Expression and Anxiety-Like Behavior in Inbred Strains
Because our hypothesis is that the duplication increases gene expression and thus alters behavior, we also examined the relationship between the duplication and Glo1 expression in the amygdala (p<0.000001) and the relationship between Glo1 expression in the amygdala and anxiety-like behavior (p = 0.0012). Both correlations were more significant than the correlation between the duplication and behavior. We used multiple regression to examine the relationship between the duplication, Glo1 expression and behavior. Both forward- and reverse-selection methods arrived at a model that included expression but did not include our PCR-based measure of the duplication. This might be attributed to some strains having more than two copies of the duplicated region and showing correspondingly higher expression; our PCR-based technique does not determine the number of extra copies of this region. These results are consistent with the hypothesis that expression of Glo1 is a better predictor of behavior than the duplication itself.
Relationship between Flanking Haplotypes and Behavior in Inbred Strains
An alternative hypothesis that might also explain the correlation between the duplication and anxiety-like behavior is that the functionally significant allele is genetically linked to, but not contained within, the duplication. To test this hypothesis, we calculated the average anxiety-like behavior associated with the three of the four duplication-containing haplotypes for which our study of 38 inbred strains provided corresponding behavioral data. We found that each of the three duplication-containing haplotypes was associated with higher anxiety-like behavior compared to the average anxiety-like behavior associated with all non-duplicated strains, which is consistent with the hypothesis that the duplication alters behavior, but is inconsistent with the alternative hypothesis that the duplication is genetically linked to another allele that alters anxiety-like behavior. We considered separately the two strains that were members of the duplication-containing haplotype blocks but had apparently lost the duplication (BTBR T+tf/J and PERA/EiJ; red boxes); their anxiety-like behavior was similar to the average behavior of the non-duplicated strains. This observation further supports the hypothesis that the presence of the duplication directly affects anxiety-like behavior.
Relationship between the Duplication, Anxiety-like Behavior and Glo1 expression Using Outbred CD-1 Mice
To further test the relationship between the duplication and anxiety-like behavior, we evaluated behavior in the open field in outbred CD-1 mice. We found that 52 of the 94 mice examined had the duplication while 42 did not. We also examined an additional 12 CD-1 mice that were not tested behaviorally; 7 had the duplication and 5 did not. Our PCR assay cannot discriminate between mice that are heterozygous or homozygous for the duplication; nevertheless, assuming Hardy-Weinberg proportions, it is possible to solve the equations p² + 2pq + q² = 1 and p + q = 1 given the data above, where q² is estimated by the fraction of mice lacking the duplication (47 of 106). This yields a frequency of 0.334 for the duplication and 0.666 for the non-duplicated allele. These values apply only to the subset of 106 CD-1 mice that we genotyped, but should provide an approximate guide for future studies.
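The allele frequencies quoted above can be reproduced from the genotype counts with a short calculation. This sketch assumes Hardy-Weinberg proportions and an assay that detects the duplication in both heterozygotes and homozygotes, so that only mice lacking the duplication (frequency q²) have a known genotype; the function name is hypothetical.

```python
import math


def allele_freqs_from_dominant_assay(n_positive, n_negative):
    """Estimate (p, q) under Hardy-Weinberg proportions.

    n_positive: animals in which the duplication was detected (p^2 + 2pq)
    n_negative: animals lacking the duplication (q^2)
    """
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("no animals genotyped")
    q = math.sqrt(n_negative / total)  # frequency of the non-duplicated allele
    return 1.0 - q, q


# Counts from the text: 52 + 7 mice with the duplication, 42 + 5 without.
p_dup, q_nondup = allele_freqs_from_dominant_assay(52 + 7, 42 + 5)
```

Here q = sqrt(47/106), which rounds to 0.666, and p = 1 − q rounds to 0.334, matching the values reported in the text.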
We observed a significant decrease in total distance traveled in the first 5 minutes of the open field test (F(1,92) = 3.81; p<0.05) and a significant decrease in the time spent in the center of the open field among CD-1 mice that were positive for the duplication (F(1,92) = 3.14; p<0.05). Moreover, we observed greater Glo1 expression among outbred CD-1 mice that were positive for the duplication (F(1,57) = 19.31; p<0.00005). These data are consistent with the positive relationship between the duplication and anxiety-like behavior observed among the inbred strain panels, and support the hypothesis that these relationships are unlikely to be due to linkage between the duplication and a nearby allele. In both inbred and outbred mice many other alleles are presumed to also affect anxiety-like behavior, so that the contribution of this duplication would account for only a small percentage of the total trait variance; this is characteristic of all complex traits and has made the elucidation of their molecular correlates a challenging problem.
Behavior and gene expression in CD-1 mice with and without the duplication.