|Home | About | Journals | Submit | Contact Us | Français|
The extent to which differences in germ line DNA copy number contribute to natural phenotypic variation is unknown. We analyzed the copy number content of the mouse genome to a sub-10 kb resolution. We identified over 1,300 copy number variant regions (CNVRs), most of which are < 10 kb in length, are found in more than one strain, and, in total, span 3.2% (85 Mb) of the genome. To assess the potential functional impact of copy number variation, we mapped expression profiles of purified hematopoietic stem/progenitor cells, adipose tissue and hypothalamus to CNVRs in cis. Of the more than 600 significant associations between CNVRs and expression profiles, most map to CNVRs outside of the transcribed regions of genes. In hematopoietic stem/progenitor cells, up to 28% of strain-dependent expression variation is associated with copy number variation, supporting the role of germ line CNVs as major contributors to natural phenotypic variation in the laboratory mouse.
Copy number variants (CNVs), currently defined as genomic sequences greater than one kilobase that are polymorphic in copy number, have been identified in diverse species including human, chimp, rat, mouse, and drosophila1–10. In the short interval since the discovery of wide-spread copy number variation in apparently healthy individuals, there has been rapid expansion of both CNV detection techniques and their application across a range of biological samples and species. From these studies, it is apparent that copy number variation exceeds single nucleotide polymorphisms (SNPs) as a source of genetic variation, and that many CNVs contain or overlap genes and, thereby, may have functional effects. However, the role of copy number variation in mediating both ‘normal’ phenotypic variation and disease susceptibility is only beginning to emerge11–14.
Fundamental questions about the nature and impact of CNVs remain unanswered, mainly due to methodological constraints. We set out to determine the copy number variable content of the mouse genome and estimate its functional impact, as measured by gene expression profiling in vivo. The inbred mouse is an ideal model organism for this study for several reasons, including its homozygous genome, the ease with which biological samples can be acquired, and the preeminent role of the mouse as a model for biomedically relevant traits and diseases. Gene expression variation is a trait amenable to genetic mapping because it is easily quantified in vivo, it is the phenotype most proximally related to genetics, and the expression of all genes can be measured simultaneously. Finally, it is reasonable to hypothesize that the effect size of structural variations on gene expression will be large, so that a genome-wide association study could be informative, even with modest sample sizes.
To map the CNV content of the mouse genome, we selected 17 Tier 1–3 Mouse Phenome Project strains15 and three additional strains of biomedical interest (LG/J, NZB/BINJ, 129X1/SvJ), representing all major inbred lineages. We performed comparative genomic hybridization using a long-oligonucleotide array containing 2,149,887 probes evenly spaced across the reference genome with a median inter-probe spacing of 1,015 bases. We performed segmentation using wuHMM, a Hidden Markov Model algorithm that utilizes sequence-level information and can detect CNVs less than 5 kb in length (fewer than five probes) at a low false positive rate16. wuHMM scores CNVs based on the number and median log2-ratio of the probes comprising the prediction, such that calls with higher scores are more likely to represent true events. CNVs called in different strains that overlap can be assigned different boundaries due to technical or biological sources of variability. Because fine-mapping all putative CNVs is not feasible at present, a common approach to handling complexity and ambiguity in CNV boundaries is to treat overlapping CNVs as a unit, or, copy number variable region (CNVR)4. We merged overlapping wuHMM calls into CNVRs, some of which have complex architectures (Figure 1). We refer to CNVRs as ‘complex’ or ‘simple,’ as determined by wuHMM boundary concordance across strains (see Methods). To assign CNVR genotype calls to strains for QTL mapping and to improve upon the sensitivity of wuHMM, we clustered the log2-ratios of each CNVR (see Methods). The number of genotypes per CNVR was determined by selecting the cluster number that maximized the average silhouette function, which is a measure of clustering quality17. Genotypes were assigned according to the clusters in which strains were grouped. We refer to genotypes that differ from the reference strain’s genotype as ‘abnormal’ in complex CNVRs, and as ‘gain’ or ‘loss’ in simple CNVRs if the mean log2-ratio is greater or less than the reference sample, respectively.
Using initial parameters, wuHMM identified 10,681 putative CNVs which were merged into 3,359 CNVRs. To determine the false positive rate (FPR) of our CNV predictions, we randomly selected 61 short CNVRs for independent validation by qualitative (for losses) or quantitative (for gains) PCR (qPCR). The FPR approached 0 for CNVRs with average scores exceeding 1.5 and 2.5 for gains and losses, respectively (Supplementary Table 1). Therefore, we selected these score thresholds, resulting in an empirically estimated individual strain CNVR genotype FPR < 4.0%. For complex CNVRs, the same threshold was applied if the region contained either wuHMM gains or losses exceeding the corresponding threshold. We called the 1,333 CNVRs that passed these thresholds ‘high-confidence’ CNVRs and retained them for further analysis and quantitative trait mapping (Supplementary Figure 1, available at http://graubertlab.dom.wustl.edu/downloads.html, and Supplementary Table 2).
The 1,333 high-confidence CNVRs span 85 million non-redundant bases (3% of the genome) and are distributed across all 19 autosomes and the X chromosome (Figure 2). The CNVRs range in length from 1,871 bases to 3.84 Mb (mean length is 64 kb, median is 9 kb, over 50% are less than 10 kb) (Figure 3A). Although the length distribution of CNVRs is highly right-skewed, confirming previous estimates derived from CNVR mapping studies performed with lower resolution platforms18 and paired-end mapping19, the overall contribution of small CNVRs (i.e., less than 10 kb) to the total copy number variable content of the genome makes up only 3.3 Mb (0.13%) (Figure 3B), a finding consistent across all strains (Supplementary Figure 2). Complex CNVRs make up 23% of all CNVRs, but 63% of the CNV sequence content. The majority of small CNVRs are exclusively genotyped as losses (82%), probably reflecting the increased power to detect homozygous losses versus integral gains with a small number of aCGH probes. We detected a total of 663 gains, 2,854 losses, and 2,772 abnormal CNVR genotypes. 67% of CNVRs were called as gain, loss, or abnormal in more than one strain. The number of CNVR gains, losses, or abnormal genotypes ranges from 215 (C58/J) to 413 (KK/HIJ) per strain (mean = 331). The total CNV sequence per strain ranges from 26.4 (C58/J) to 48.3 (NOD/ShiLtJ) Mb (mean = 39.1 Mb); no single strain contributed disproportionately to the CNVR map (Figure 3C and Supplementary Figure 2).
Several previous reports have investigated the extent of copy number variation in inbred strains of mice1,2,5,20,21. If de novo events contribute only minimally to copy number variation among individuals within a strain22,23, then as detection technologies improve, studies assaying the same strains will have increasingly concordant results. We compared our CNVR map to previous reports that also used high-density oligonucleotide aCGH (see Methods). We found that when we compared CNVRs defined using strains in common with other studies, our map largely recapitulated the CNVRs found in the other studies: 64–84% of CNV content in the other studies was also detected in our high-confidence CNVRs (Supplementary Table 3). 48–87% of the copy number variable content that we report in the 19 strains is novel. However, when we compared CNVR maps regardless of strain we found that only 16% of the copy number variable content in our map was novel, suggesting that much of the total copy number variable sequence of the reference genome is known at the presently available detection limit.
Non-allelic homologous recombination (NAHR) has been proposed as a mechanism of CNV formation24. The hypothesis that segmental duplications (sequences >1 kb and having > 90% similarity to at least one other genomic region) act as nurseries of CNV by promoting NAHR has been supported by the enrichment of segmental duplications within and around CNVRs20,25. By permutation testing (see Methods), we found that there is significantly more segmental duplication sequence within and directly bordering medium (10–100 kb) and large (>100 kb) CNVRs (fold = 3.0 and 12.9, respectively, P < 0.01), but that segmental duplications are found less often than expected by chance within and near small (<10 kb) CNVRs (fold = 0.37, P < 0.01) (Figure 4 and Supplementary Table 4), consistent with a prior report of stronger association between segmental duplications and long CNVRs4. The pattern of enrichment of segmental duplication sequences near medium and large CNVRs extends to 2 Mb beyond the CNVR boundaries (fold from 2.25 - 1.43 and 7.80 - 2.45, respectively, P < 0.01) as does the pattern of depletion around small CNVRs (fold = 0.27 – 0.75, respectively, P < 0.01). Like segmental duplications, it has been suggested that repetitive elements may facilitate CNV generation through NAHR. Indirect evidence supporting this hypothesis has been presented in inbred mice where LINEs are enriched within segmental duplications20. We found that LINEs are enriched within medium and large CNVRs (fold = 1.61 and 1.50, P < 0.01), but are not enriched in small CNVRs (fold = 0.95, P = 0.81). We found an enrichment of LINE elements in sequences flanking all CNVRs types, although the association is less for small CNVRs (fold = 1.14 for small, 1.51 for medium, 1.43 for large, P< 0.01). Therefore, it is unlikely that small CNVRs are variations in the copy number of repetitive elements themselves26, but rather LINEs may facilitate the removal or expansion of neighboring sequence. Long terminal repeats (LTRs) are enriched within all CNVRs (fold = 1.3, 1.4, 1.53, P<0.01). This association persists for regions surrounding CNVRs to at least 10 kb for medium and large, but not small CNVRs. SINEs are depleted within and surrounding medium and large, but not small CNVRs (fold = 0.7, 0.45, P<0.01). Taken together, this analysis confirms that CNVRs greater than 10 kb frequently contain or directly border highly homologous elements of the genome that can facilitate NAHR and therefore CNVR generation. But, with the exception of the weak association between the regions surrounding small CNVRs and LINE sequences, there is no apparent genomic feature that could facilitate NAHR and give rise to the abundant, small, high-confidence CNVRs. Therefore, their origins will require detailed genomic analysis and further exploration.
We next determined the gene content of the high-confidence CNVRs, finding that 432 high-confidence CNVRs contain or partially contain 679 genes. Previous CNVR studies of the mouse genome have shown that CNVRs overlap coding sequence no more often than expected by chance, in contrast to CNVRs in human and rat genomes which appear to be enriched for gene content1,4,8,21. With a more comprehensive and finer-resolution map, we retested this hypothesis by permutation analysis. We found that small, medium and large CNVRs are found in genic regions less frequently than expected by chance (fold=0.86, 0.71, 0.90 respectively, P<0.01, 0.01, 0.05) (Figure 4).
To estimate the overall impact of CNV on gene expression in vivo, we first performed expression profiling of hematopoietic stem/progenitors cells using the Illumina Mouse Beadchip-6v1 platform (see Methods). Among many cell types and tissues suitable for this study we chose to profile a population that has well-defined surface markers, enabling the enrichment of a highly purified cell population that is transcriptionally active27, increasing the number of genes that could be assessed for association with CNVRs. We pooled bone marrow cells from two individuals from each strain and analyzed 2–3 biological replicates per strain (46 expression experiments). 29% of the probes on the array were detected as ‘present’ in at least three strains (see Methods). To validate the sort purity, we examined the expression profiles of the cell surface markers utilized in the sort strategy and found that they were consistent with the immunophenotype of the post-sort products (Supplementary Figure 3).
To determine the extent to which expression variation is associated with copy number variation, we first identified the genes that exhibit strain-specific expression. We identified 1,469 probes with significantly higher between- versus within-strain expression differences (P < 0.01, see Methods). We also determined the strain-specific expression profiles in epididymal adipose tissue and hypothalamus, as those data sets were publicly available28,29. We removed expression data for strains that were not profiled in our CNVR mapping work, leaving 15 strains from each study. Since no strain replicates were available in these studies, we identified strain-specific probes as those with a ratio of maximum to minimum expression > 3, the same threshold used to identify ‘variable’ expression traits in those studies (Table 1). It is impossible to determine if the differences in the number of ‘Present’ and strain-specific expression traits between tissues is due to fundamental differences in cross-tissue expression variation or, more likely, to the significant differences in the expression profiling platforms and analysis methods utilized in these studies.
CNVRs may impact local gene expression through a variety of mechanisms, including gene dosage, removal or relocation of regulatory material, or ‘neighborhood effects’ that disrupt local chromatin structure30. We estimated the overall contribution of CNVRs on local expression by in silico eQTL mapping, in which gene expression profiles were treated as quantitative traits and CNVRs as genetic markers. We limited the analysis to CNVR-expression traits that are tightly linked (< 2 Mb apart) because of reduced power to detect trans effects with a small sample size. We calculated eQTL significance using a weighted permutation method that accounts for the complex ancestral relationship among inbred strains31,32, and controlled the family-wise error rate arising from testing the association between a trait and multiple CNVRs by applying the Holm multiple testing correction to each trait’s p-values separately33.
We identified 672 significant associations between strain-specific expression traits and CNVRs in the hematopoietic stem/progenitor compartment. Because we used an alpha threshold of 0.05, after correcting for multiple tests we would expect to find only 113 associations by chance. The number of traits associated with a CNVR (degree of pleiotropy) ranged from 1–18 (mean=2.47, median=2); the number of CNVRs associated with a trait ranged from 1–9 (mean=1.65, median=1). While there were more eQTLs in which the Illumina probe sequence overlapped the CNVR than expected by chance (P < 0.05 by Fisher’s Exact Test), most eQTLs (92.3%) map outside of the corresponding CNVR. If these intergenic CNVRs mediate expression variation, they do so via mechanisms other than changes in gene dosage. CNVRs of each categorization, either by size or complexity, were found to be eQTLs and each was as likely to be an eQTL as expected by chance. After selecting the most significant association per trait from the 672 eQTLs, we found that 408 strain-specific expression traits representing 391 genes (27.8% of 1,469 strain-specific traits) were associated with 214 CNVRs (16% of all 1,333 CNVRs and 44.2% of the 484 testable CNVRs) (Table 1 and Supplementary Table 5). The frequency of eQTLs dropped with increasing distance from CNVR boundaries to expression probe locations (proximity) (Supplementary Figure 4). Similarly, the fraction of expression variation explained by a trait’s association with a CNVR decreased significantly with proximity (Supplementary Figure 4)
To validate the KL eQTLs, we queried the expression profiles of the 391 eQTL-associated genes in Kit+/Lineage−/Sca1+ (KLS) hematopoietic stem cells purified from BXD recombinant inbred mice34. Because the BXD mice are homozygous for either the C57BL/6J or DBA/2J genotype at most loci and SNP genotype data is publicly available, we were able to assign an inferred CNVR genotype based on the parental strain of origin of the SNP markers spanning each CNVR (Supplementary Table 6). Of the 160 KL eQTL-associated genes that were unambiguously annotated with a gene symbol, 74 genes (93 probe sets) were present on the Affymetrix U74A expression platform and 31 were detected as expressed in >80% of the RI lines. We found that 29% of these testable eQTL-associated genes had expression profiles that were also associated with the inferred CNVR genotype in the KLS BXD data (P-value < 0.05) (Supplementary Table 7).
Smaller proportions of strain-specific expression variation were associated with CNVRs in the other two tissues that we were able to analyze: 181 of 4,083 (4.4%) and 78 of 2,879 (2.7%) strain-specific traits in adipose tissue and hypothalamus, respectively, after selecting the most significant associations per trait (Table 1). Similarly, fewer CNVRs were detected as eQTLs: 24.9% and 15.0% of testable CNVRs in adipose tissue and hypothalamus, respectively. While there is variability in the impact of CNV on expression variation between tissues, differences in the number of eQTLs we detected in adipose tissue and hypothalamus cells are likely due to the reduced power (25% fewer samples) and less robust methods used to identify strain-specific expression in these data. The relationships between eQTL frequency and proximity, and between eQTL effect size and proximity, were present to a lesser extent in adipose and were not present in the hypothalamus (Supplementary Figure 4). As we found in the hematopoietic compartment, few adipose and hypothalamus eQTLs overlapped their associated traits (6.0% and 6.4%, respectively), but this was more than expected by chance (P<1e-5 and P<0.01 in adipose and hypothalamus, respectively). CNVRs across all length and complexity ranges were observed as eQTLs; no categorization was enriched or depleted.
Next, we asked whether any eQTLs were shared across tissues. Because we utilized expression data from different platforms, we defined expression trait overlap at the level of gene annotation rather than probe sequence. We found twenty-three eQTLs present in more than one tissue, five of which were gene-dosage effects (Figure 5 and Table 2). A correlation between Alad gene dosage, mRNA abundance, and enzymatic activity was previously demonstrated35,36 and Alad expression variation was associated with a cis-eQTL reported in an F2 inter-cross37, demonstrating that our analysis was able to detect known gene dosage eQTLs. Further, we found that strain-specific Glo1 over-expression is due to a large gain and that this gene-dosage effect is consistent across all three tissues that we tested (Figure 6A). A strain-specific expression pattern of Glo1 in hypothalamus was previously shown to be associated with and potentially casual for anxiety-related behavior38. Our analysis is the first, to our knowledge, to show that this expression variation is due to a CNV. Most eQTLs are found in only one tissue, indicating that tissue-specific factors compensate for CNVR-mediated gene expression variation. For example, the expression of guanylate-binding protein 1 (Gbp1) is associated with a CNVR containing its 3′-exon and 3′-UTR in hematopoietic and adipose cells, but not hypothalamus (Figure 6B). The expression pattern of Gbp1 (highly expressed in both hematopoietic stem/progenitor cells and adipose tissue in strains that contain the CNV, but not expressed at detectable levels in strains without the CNV or in the hypothalamus regardless of CNV genotype) is consistent with a model of expression regulation where hypothalamus-specific down-regulation or alterative splicing of Gbp1 overcomes the CNVR effect apparent in other tissues.
We reasoned that CNVRs that mediate expression variation by large scale disruption or modification of local chromatin structure rather than by gene dosage were likely to impact the expression of more than one gene. We tested one implication of this hypothesis using random permutations of the hematopoietic eQTL data. We calculated the probabilities of finding the observed number of CNVRs with a given degree of pleiotropy (defined as the number of expression traits associated with a CNVR). We found that there were more CNVRs with 7 and 8 associated expression traits than expected by chance (P < 0.05, 10,000 permutations). One CNVR (CNVR-ID 3014) with seven associated traits is a deletion located approximately 100 kb from the Major histocompatibility (Mhc) locus on chromosome 17 that removes highly conserved sequence with predicted regulatory potential. All of the associated traits are Mhc class Ib genes, many of which are expressed in multiple tissues and have unknown specific functions39. Genes at this locus have been speculated to undergo distal regulation via a chromosomal looping mechanism40 and, therefore, copy number changes that modify this looping structure would be expected to have pleiotropic effects on local expression. Alternatively, because the H2-T locus is known to have strain-specific duplications39, it is possible that the expression variation that we observed was due to gene dosage differences that are too complex for our computational methods to properly detect but are, in effect, tagged by the associated CNVR.
The central goal of our work was to estimate the functional impact of germ line copy number variation in vivo. To achieve this goal, we first identified CNVRs in twenty inbred strains at the highest resolution reported to date. We discovered 1,333 CNVRs spanning approximately 3% of the mouse genome. On average, there are over 300 CNVs per strain. As predicted, we found that the frequency of CNVRs increased with decreasing CNVR length, but that short CNVs account for only a small fraction of the total copy number variable sequence content of the mouse genome. We speculate that this trend will hold as higher resolution technologies are developed. Unexpectedly, we found that small CNVs (<10 kb) lack the enrichment of highly homologous sequences that frequently flank, and are presumed to contribute to the formation of medium (10–100 kb) and large (>100 kb) CNVs. Determining the mechanisms that generated these CNVs would facilitate the design of targeted assays to detect new CNVs and provide a better understanding of the forces that shaped the mouse genome. We are aware of only one report documenting similar short deletions in a small number of human genomes and therefore a mouse-to-human CNVR comparison will be informative as high-resolution human data become available41. A caveat of our CNVR map is that, as is true for all comparative genomic hybridization experiments, we were limited to finding variants in comparison to a reference sequence; sequences that do not exist in the C57BL/6J genome but vary in copy number among other strains were not detected. Therefore, the total extent of copy number variation relative to the union of all inbred mouse genomes must await comprehensive sequencing of other strains. However, a reasonable estimate of the amount of mouse genomic sequence lost in the C57BL/6J strain is the amount of genomic material lost per strain relative to C57BL/6J, which ranged from 16.8 to 33.8 Mb (mean = 25.5 Mb).
Using a relatively small number of inbred mouse strains, we found that all classes of CNVs were associated with gene expression changes in a variety of tissues. We found that 28% of strain-specific expression traits were associated with copy number variation in the hematopoietic progenitor/stem compartment, consistent with the 18% previously reported in human lymphoblastoid cell lines42. To validate these eQTLs, we inferred the CNVR genotypes of the BXD RI panel and analyzed publicly available KLS expression data. Over 29% of the testable KL eQTLs were supported in the BXD data set, a striking concordance given the substantial experimental and biological differences between the studies. We also detected many CNVR eQTLs in adipose tissue and hypothalamus, even though these data were produced with different mice, using different expression platforms, and the eQTL analysis was performed with 25% fewer strains. Much of the recent speculation on the potential impact of CNVs on phenotypic variation has centered on gene-dosage effects43. However, we found that only 7.3% of CNVR eQTLs contain the associated expression probe and therefore were due to gene-dosage effects. Presumably, the remaining CNVR eQTLs reflect expression variation mediated by alteration of regulatory material or local chromatin structure. This would be consistent with a model where (subtle) alterations in expression patterns are better tolerated than complete or partial gene gains or losses.
Some of the CNVR eQTLs reported here may be in linkage disequilibrium with another allele causing the associated expression change, underscoring the need to characterize the relationship between CNVs and other genetic variants. It is likely that there are additional eQTLs not detected here: CNVRs that alter expression in only one or two strains, trans eQTLs, eQTLs that associate with genes expressed in tissues not sampled here, and eQTLs with weak effects. Increasing the number of strains and the tissues sampled would address some of these limitations. However, extending this work to a much larger population with greater genetic diversity (i.e., the Collaborative Cross44) would increase the power to detect trans and weaker effects and therefore enable a clearer understanding the overall impact of CNVR on expression variability. Future work must reach beyond identifying statistical associations to better characterize the mechanisms by which a CNVR affects phenotypic (including expression) variation. In addition to estimating the impact of CNVRs on expression variation, the CNVR eQTLs reported here may be of practical value in identifying the causal variants in traditional QTLs because they present plausible hypotheses linking genetic differences between inbred strains to complex traits.
Male mice were obtained from The Jackson Laboratory (Bar Harbor, ME), housed in a specific pathogen-free facility, and sacrificed at 8–10 weeks of age. The same individual mice were used for both DNA- and RNA-based analyses. All experiments were performed in compliance with the guidelines of the Animal Studies Committee at Washington University, St. Louis, MO.
DNA was prepared from spleen, liver, kidney, and tail by phenol-chloroform extraction, and was quantified using UV spectroscopy (NanoDrop 1000, Thermo Scientific, Wilmington, DE). Kidney DNA for aCGH experiments were pooled in equal masses from 2–6 individuals per strain. Only individual samples passing NimbleGen quality control requirements were pooled.
A tiling-path CGH array for whole-genome analysis in mouse (mm8, NCBI Build 36) was utilized (http://www.nimblegen.com). Isothermal probes from 45–75 bp were selected with a median probe spacing of 1 kb. Labeling, hybridization, washing and array imaging were performed as previously described45. Previously, we demonstrated that regions of the mouse genome with high sequence divergence between the test and reference strains have lower aCGH probe signal intensities and can, therefore, potentially disrupt the identification of CNVs16. Using an imputed single nucleotide map46, we defined regions of high sequence divergence between the test and reference genomes for input to wuHMM, a Hidden Markov Model algorithm for CNV detection16. All putative wuHMM CNV calls with scores less than 1.5 or 1.9 (gains or losses, respectively) were discarded, as we have previously shown that they contain a high number of false positive predictions. CNVRs were defined by merging overlapping wuHMM calls across all individuals. To assess the complexity of the CNVRs, we calculated average boundary concordances (the average of the length of the intersection of a CNV and CNVR divided by the total CNVR length). CNVRs having average concordances <= 0.75 (Supplementary Figure 5) comprised less than 23% of the CNVRs detected in this study. We refer to these regions as ‘complex’ and all other CNVRs as ‘simple’. All microarray data, aCGH and expression, is available for download from GEO (http://www.ncbi.nlm.nih.gov/geo/) under series accession GSE10656.
Clustering of CNVRs was performed using partitioning among medoids (PAM) as implemented in R17. The average silhouette function calculates the average between versus within group distances and ranges from −1 to 1, with 1 representing perfect clustering17. We modified this function to weight groupings by their agreement with wuHMM calls. We executed PAM, varying the number of clusters from 2–7, and calculated the weighted average silhouette. The number of clusters with the maximum, modified average silhouette was selected for the number of genotypes per CNVR. Sometimes a clustering would result in a group of strains in which no wuHMM call had been made, representing a new gain, loss or abnormal genotype. These genotypes were disallowed and these strains were assigned into the same genotype label as the reference strains. CNVRs with both average silhouettes < 0.3 and average scores < 2.0 were discarded, as they were likely to represent spurious clusters.
61 simple CNVRs were randomly selected for validation from the set with average scores between 1.3 and 3.3. These CNVRs ranged from 887 bases to 67 kb (2 to 47 aCGH probes) and scored from 1.3 – 2.3 for gains, and 1.9 – 3.3 for losses. For qualitative PCR validation (losses only), primers were designed to target reference sequence within the predicted boundaries of the CNVR, prioritizing amplicons near or overlapping the aCGH probes with the maximum log2-ratio magnitudes. One to three amplicons were designed per CNVR. A positive control amplicon was designed for a region with no predicted CNVs in any of the 20 strains (primer sequences in Supplementary Table 8). For quantitative PCR (gains only), relative copy numbers were determined by real-time PCR (qPCR) using TaqMan detection chemistry and the ABI Prism 7300 Sequence Detection System (Applied Biosystems, http://www.appliedbiosystems.org), as previously described1. A CNVR loss was validated if no amplicon was produced using primers targeted within predicted CNVR boundaries. A CNVR gain was validated when qPCR demonstrated a >2-fold increase in inferred relative copy number relative to the reference strain. We defined the false positive rate (FPR) as the number of false positives divided by the number of gain and loss genotypes at or exceeding a given score threshold. The FPR for putative copy number losses with scores between 2.0 and 2.5 was 25% (152/608 CNV calls tested). Nearly a third of these amplicons (50/152) exhibited altered electrophoretic mobility consistent with the CNV strain distributions predicted by aCGH analysis. To better understand this phenomenon, we cloned and sequenced two of the amplicons from four affected strains and discovered three novel SNPs in each amplicon which overlapped an aCGH probe sequence in the CNVR in each case. Sequence divergence can disrupt probe hybridization resulting in decreased signal intensity and, at times, false positive deletion calls. Further, we found a 14- and a 10-bp insertion near the probe sequence in the affected strains, which accounted for the altered size of the amplicons. The co-occurrence of SNPs and in/dels has previously been reported and their potential causal relationship is under investigation47. For CNVRs with average scores exceeding 1.5 and 2.5 for gains and losses, respectively, the FPR approached 0 (Supplementary Table 1). Therefore, only calls that exceeded these thresholds were retained for further analysis.
CNVR coordinates were translated from mm6 to mm8 using liftOver, when necessary (http://genome.ucsc.edu/cgi-bin/hgLiftOver). We defined subsets of CNVRs by selecting only those CNVRs that have a gain, loss, or abnormal genotype in at least one of the strains in common with another study. Overlap between studies was reported as either the total shared sequence in the intersection of CNVRs or as the number of CNVRs that have overlapping boundaries. For comparisons of CNV content by CNVR size, sequence overlap was determined by calculating the total sequence intersection between only small, medium or large, high-confidence CNVRs and all CNVRs from other studies.
Sequence may be reported as copy number variable exclusively in other studies due to differences in genome coverage20, de novo events22, or because lower resolution platforms tend to over-estimate CNVR boundaries1,21. The comparison to a study that mainly targeted segmental duplicated regions of the genome resulted in the lowest agreement (63.9%)20. Many of these regions have sparse probe coverage on the platform that we utilized and therefore are problematic regions in which to detect CNVs. The second lowest overlap (64.3%) was with a study that specifically targeted the identification of de novo events in C57BL/6-derived strains22. It is possible that the 36.7% of CNV content exclusive to that study was not detected here because those sequences did not exist in, or comprised an undetectable fraction of the samples used in our study. We also assessed the overlap between CNVRs in our study and others, defined across all strains, to determine the overall consensus of reported copy number variation in the inbred mouse genome. To perform this comparison, we first merged all CNVs from previous studies into a single set of CNVRs finding that the amount of novel CNV sequence content is relatively low (16%) (Supplementary Table 3).
The association between CNVRs and genomic features was tested by randomly permuting the chromosome and position of each CNVR 100 times and determining the sequence content of the resulting region or flanking regions. Gene overlap enrichment was tested similarly, except that the test statistic was the number of CNVRs per permutation that overlapped at least one gene using UCSC’s knownGene annotation (http://genome.ucsc.edu/cgi-bin/hgGateway?org=Mouse&db=mm8).
Bone marrow cells were harvested from mouse femurs and stained with FITC-conjugated lineage markers (Gr-1, CD19, B220, CD3, CD4, CD8, TER119, and IL-7Rα) and APC-conjugated c-kit (BD Biosciences, San Diego, CA). Lineage-negative, c-kit positive cells were enriched using a modified MoFlo high speed sorter (Cytomation, Fort Collins, CO). Total RNA was prepared using Trizol LS (Invitrogen, Carlsbad, CA) and its concentration quantified using UV spectroscopy (Nanodrop). Total RNA quality was then determined by Agilent 2100 Bioanalyzer (Agilent Technologies) according to the manufacturer’s recommendations.
RNA transcripts were amplified by T7 linear amplification (MessageAmp TotalPrep amplification kit; ABI-Ambion). First strand synthesis was primed with oligo-dT, followed by in vitro transcription to generate amplified RNAs (aRNA). The aRNAs were then quantitated on a spectrophotometer, and quality determined by Agilent 2100 bioanalyzer according to the manufacturer’s recommendations. Hybridization to the MouseWG-6 v1.1 Expression Beadchip (Illumina), washing, and signal detection were performed using standard protocols. Quantitated data were imported into Beadstudio software (Illumina). On-slide spot replicates were averaged by Beadstudio and individual spot data was reported. Probes were defined as ‘present’ in a sample when the signal was significantly higher than in a set of negative control probes, (P < 0.05 after correcting for multiple tests). A probe was defined as present in a strain if it was called ‘present’ in all replicate samples of that strain. The correlation of within-strain expression profiles exceeded between-strain correlations in all but two strains (average within strain correlation = 0.9782, average between-strain correlation = 0.9528), demonstrating that the expression profiles reflect biological variation and not technical artifacts (i.e., due to differences in cell staining, sorting, RNA labeling, or hybridization).
Expression quantitative trait mapping was implemented as previously described28,31,32 with the exception that CNVR instead of SNP genotypes were used as predictor variables. Null distributions of F-statistics for CNVR-expression trait tests were generated by 10,000 random permutations of expression values. The permutations were weighted according to strain-relatedness as defined using an imputed SNP map46 (exponent = 3) such that closely related strains more frequently replaced each other than distantly related strains. All permutation analyses were implemented on custom software and executed on a compute cloud (http://aws.amazon.com/ec2). Often a single trait was tested against multiple CNVRs therefore the permutation-derived P-values were corrected by applying the Holm multiple testing correction separately for each trait.
BXD RI SNP genotype data was downloaded from: http://www.genenetwork.org/dbdoc/BXDGeno.html. A CNVR genotype of ‘B’, ‘D’, or ‘U’ was assigned for each CNVR to each strain if the two markers spanning the CNVR were both C57BL/6J, both DBA/2J, or discordant, respectively. BXD KLS expression data was downloaded from GEO, accession number GSE2031. Of the genes identified as having significant associations with CNVRs in cis in the KL expression data set, only those that were detected in at least 80% of the samples from either or both CNVR genotype groups were assessed for concordant expression in the BXD KLS data. Association between KLS expression and inferred CNVR genotype was performed as for KL expression data.
This work was supported in part by a grant from the NIH/NCI (CA101937). P.C was supported in part by the National Human Genome Research Institute (T32 HG000045) and a Kauffman Fellowship. Mice were kindly provided through a collaboration with the Mouse Phenome Project (The Jackson Laboratory, Bar Harbor, ME). Additional mice were provided by Ming You. We thank Tim Ley, Dan Link, and Matt Walter for helpful discussions. Cell sorting was performed by the High Speed Cell Sorter Core in the Alvin J. Siteman Cancer Center at Washington University School of Medicine. The Siteman Cancer Center is supported in part by an NCI Cancer Center Support Grant (P30 CA91842).