|Home | About | Journals | Submit | Contact Us | Français|
Genetic mapping on fully sequenced individuals is transforming our understanding of the relationship between molecular variation and variation in complex traits. Here we report a combined sequence and genetic mapping analysis in outbred rats that maps 355 quantitative trait loci for 122 phenotypes. We identify 35 causal genes involved in 31 phenotypes, implicating novel genes in models of anxiety, heart disease and multiple sclerosis. The relation between sequence and genetic variation is unexpectedly complex: at approximately 40% of quantitative trait loci a single sequence variant cannot account for the phenotypic effect. Using comparable sequence and mapping data from mice, we show the extent and spatial pattern of variation in inbred rats differ significantly from those of inbred mice, and that the genetic variants in orthologous genes rarely contribute to the same phenotype in both species.
Unraveling the complex relationship between phenotype and genotype poses a formidable challenge for biomedical science. Despite considerable success in identifying genetic loci that contribute to quantitative variation and disease susceptibility in humans1, in most organisms the causal genetic variants at loci that contribute to complex phenotypes remain unclear2. Finding the responsible molecular changes would allow us to understand how phenotypic variation arises and confirm the identity of relevant genes.
In this report we present results from an outbred rat heterogeneous stock (hereafter NIH-HS) in a combined sequence-based and genetic mapping analysis of 160 phenotypes. The NIH-HS, established in the 1980s in NIH, is descended from eight inbred progenitors, BN/SsN, MR/N, BUF/N, M520/N, WN/N, ACI/N, WKY/N, and F344/N3, containing segregating variation representative of commonly used laboratory rats.
Heterogeneous stocks (HS) have three characteristics suited to genetic mapping: (i) quantitative trait loci (QTLs) can be resolved to megabase resolution; (ii) the complete sequence of genotyped HS animals can be imputed with high accuracy from the progenitor genomes; (iii) the population has a well-defined haplotype space that can be exploited to determine whether genetic association is caused by single sequence variants or by haplotypes4-6. This distinction is fundamental to understanding the signals from genome wide association studies, where it is unknown how often causality can be attributed to a single variant. In natural populations it is rarely feasible to test for haplotypic effects because of the difficulty of estimating the large number of unknown rare haplotypes7.
Here we describe the sequence of the eight progenitors, the development of a rat SNP array, the genotyping and phenotyping of 1,407 outbred NIH-HS rats, and the mapping of hundreds of quantitative trait loci (QTLs). We use the haplotypic properties of the NIH-HS to investigate the molecular basis of these QTLs.
We generated sequence data equivalent to an average of 22X SOLiD coverage of the eight NIH-HS inbred founder strains. After mapping to the reference strain (BN/NHsdMcwi8) we report our results with respect to an accessible genome, which represents ~88% of the reference genome (Table 1). We identified 7.2M SNP sites (containing 19.8M genotypes differing from the reference in at least one strain), 633,000 indels (<10bp with the majority consisting of one (79.3%) or two (12.3%) basepair changes) and 44,000 structural variants.
We assessed the sensitivity and specificity of variant calls by comparison with 2.1 Mb of DNA from one non-reference strain, LE/Stm, finished to an estimated accuracy of one error per 100,000 bp9. Although LE/Stm is not an NIH-HS progenitor strain, it is one of the few non-reference rat strains cloned into a library of bacterial artificial chromosomes (BACs) (so suitable for highly accurate clone based sequencing)9 and one that is similarly diverged from the reference strain (BN/NHsdMcwi). Comparison of SOLiD and capillary variant calls showed 2.7% of SNPs, 2.2% of indels and 16.7% of structural variants were false positive calls. These error rates were independently confirmed in the NIH-HS strains by analysis of a randomly selected subset of variants using PCR-based resequencing, which confirmed all selected SNPs (84/84) and indels (80/80) and most of the called structural variants (53/54). In contrast, false-negative rates were much higher: 17.2% for SNPs, 41.4% for indels and 65% for structural variants. Most false-negative SNPs and indels are next to repeats (77.9 and 80.8% respectively).
Table 1 summarises the variation in each strain. Excluding BN/SsN (which is a sub-strain of the reference, with consequently far fewer differences than the other strains), the average number of SNPs per strain is 2.8M.
Sequence diversity in the NIH-HS progenitors has the following characteristics. First, diversity between all pairs of strains is similar, so that there are no extremely sequence divergent strains (Supplementary Figure 1). Second, in total 29% of 7.2M SNPs are private to a strain, hence unique haplotypes are relatively common in the NIH-HS. Third, regions of low diversity are small (median 400kb), with no blocks over 35 Mb (Figure 1a). Within divergent regions, there is a median of 151 differences per 100kb (Figure 1b).
In comparison with the eight inbred strains that founded the mouse HS4,10, the rat founders are less diverse (10.2M SNPs in the mouse founders), but that diversity is more homogeneous: in the mouse genomes, long tracts of identical haplotypes alternate with segments of much greater diversity (Figures 1a, 1b).
The NIH-HS rats were phenotyped with a protocol that includes six disease models (anxiety, diabetes, hypertension, aortic elastic lamina rupture, multiple sclerosis, osteoporosis) and measures of risk factors for common diseases (e.g. lipid and cholesterol levels and cardiac hypertrophy)11(Table 2). In total, 160 phenotypes were measured (Supplementary Table 1). We selected 1,407 animals for genotyping and 198 non-phenotyped parents, together with the HS founders.
We designed a high density Affymetrix SNP genotyping array (RATDIV), using sequences from 13 inbred strains, which interrogated 803,485 SNPs. The SOLiD and RATDIV calls agreed at 99.98% of the 560,000 SNPs segregating in the 8 NIH-HS founders. We genotyped the NIH-HS with the array, and reconstructed the mosaics of NIH-HS founder haplotypes from 265,551 polymorphic high quality SNPs. In the NIH-HS the mean minor allele frequency is 22% (Figure 1c) and linkage disequilibrium falls below 0.2 (median R2) within 1Mb on the autosomes (Figure 1d). Four pairs of loci show high interchromosomal LD, due to mis-assembly of the reference sequence used here (Rnor3.4); these loci were excluded from the analysis (Supplementary Table 2).
The NIH-HS contains individuals of varying relatedness that generate population structure and hence false positive genetic associations. We evaluated two strategies for dealing with relatedness; mixed models in which the genotypic similarity matrix between individuals modeled their phenotypic correlation12, and resampling methods to identify loci that replicate consistently across multiple QTL models fitted on subsamples of the mapping population13. In both strategies, QTLs were detected by haplotype association14.
We compared the methods by simulation to find out which best controlled the false positive rate while retaining power. Mixed models performed better than resampling when phenotypes were simulated to be normally distributed, but the reverse was true for non-normally distributed phenotypes (i.e. binary phenotypes and those with a negative binomial distribution). Since these methods have different advantages, we mapped all traits with both methods, but only report those QTLs detected at false discovery rate (FDR) of 10% by that method which performed best for each trait (thresholds in Supplementary Table 1). Figure 2 shows a genome scan for one phenotype (platelet aggregation), revealing three loci at 10% FDR.
We identified 355 QTLs for 122 phenotypes with a mean of 2.9 QTLs per phenotype (Supplementary Table 3). The number of QTLs per phenotype and the QTL effect sizes (Figure 1e) have markedly skewed distributions, with a median effect size of 5% (mean effect size 6.5 %). Large effect QTLs are rare: only 22 QTLs explain more than 15% of the variance. We identified 28 QTLs that explained less than 2.5% of the phenotypic variance.
Figure 1f shows the correlation between the heritability and the total variance explained jointly by the detected QTLs. On average the QTLs explain 42% of the heritable phenotypic variance. In comparison with QTLs mapped in other rat crosses in the Rat Genome Database, there is significant overlap with NIH-HS QTLs for the number of arterial elastic lamina ruptures, total cholesterol levels and heart weight (at a nominal p-value of 0.05; Supplementary Table 4).
We estimated the QTLs’ confidence intervals by simulating a large number of QTLs throughout the genome with various effect sizes, and calculated the distribution of the confidence intervals’ widths as a function of their significance (Supplementary Figure 2). The median size of the 90% confidence interval is 4.5Mb, on average containing more than 40 genes.
We investigated the extent to which our near complete catalogue of segregating sequence variants would identify genes and causative mutations. The HS permits a test, called merge analysis6, of whether a variant is responsible for phenotypic variation, under the assumption that a single imputed variant, or variants on a single progenitor haplotype, are causal. Because genetic variation segregates in the form of the progenitor haplotypes in the HS, QTLs can always be explained by variation in the haplotypes. When a QTL is due to a single variant though, genotypic variation at the variant will explain phenotypic variation better than progenitor haplotypes. To measure whether a single variant explains the QTL we calculated the difference d = logPmerge – logPhap where logPhap, is the maximum negative log 10 p-value of the haplotype test of no association and logPmerge is the maximum logP of all merge logP values of variants under the QTL. Any imputed variant that exceeds the maximum haplotype logP is termed a candidate variant. If d<0 then no candidate variants exist at the QTL. We investigated the characteristics of these candidate variants at 343 QTLs mapped using mixed models: at 131 QTLs (38%) we identified at least one candidate variant (Supplementary Table 3).
There are three ways in which focusing on these candidate variants helps identify genes at a QTL. First, we increase resolution by ruling out the great majority (usually over 90%) of sequence variants under most QTLs as being causal. We found 28 QTLs at which only a single gene contained candidate variants (Table 3). An example is Catenin-delta 2 (Ctnnd2) at a QTL for an anxiety-related phenotype (Figure 3a). CTNND2 is a protein found in complexes with cadherin cell adhesion molecules at neuronal synapses15. Figure 3b shows another example for a locus influencing heart weight, where out of 82 coding genes under the QTL, only Shank2 contained candidate SNPs. Shank2 encodes a synaptic protein 16 not previously associated with cardiovascular physiology.
Second, merge analysis identifies some candidate variants lying within coding regions. Those predicted to affect protein structure are more likely to be causal. Thus we identified a potential causal nucleotide in a QTL for antibody recognition of CD45RC on CD4+ and CD8+ T cells (Figure 3c). The antibody used binds to the CD45RC isoform, which expresses a C-domain, encoded by the sixth exon, in which we found a candidate variant changing an amino acid (p.Arg114His).
At 43 out of 91 non-synonymous candidate variants, where similar protein structures were available17 we predicted the structural consequences of mutations (for a further 48 candidate variants there were no homologies with known protein structures). Nine genes, listed in Table 3, contained candidate variants for which structural evidence suggests protein structure or interactions might be altered.
An example is shown in Figure 3d, for the protein TBX21, encoded by a gene under a QTL influencing the proportion of CD4+ cells with high expression of CD25.) Here the candidate variant changes glycine to arginine (p.Gly175Arg). The additional arginine could alter the DNA-binding characteristics of this protein.
Figure 3e shows the crystal structure of human ABCB10, a mitochondrial transporter induced by GATA1 during erythroid differentiation18,19. The candidate variant (p.Thr233Met), predicted to influence mean red cell volume, maps to a position in the protein structure where the residue side chain points to the centre of the transporter channel (Figure 3e). Threonine has a polar uncharged side chain while methionine has a hydrophobic side chain, a difference likely altering transporter function.
Third, merge analysis eliminates candidate genes at a QTL that are distant from any candidate variant. This approach confirmed a well-established relationship between a cluster of apolipoprotein genes at a QTL on chromosome 1 and cholesterol biosynthesis (HDL, LDL and total cholesterol). Similarly, merge analysis identified a locus influencing platelet aggregation, on chromosome 4, that harbors the von Willebrand factor gene, encoding a key glycoprotein involved in blood coagulation.
Merge analysis also contributes to an understanding of the pathogenesis of experimental autoimmune encephalomyelitis (EAE), an autoimmune neuroinflammatory disease with clinical and pathological similarities to multiple sclerosis (MS)20. The MHC class II region on chromosome 20 (Eae1) is known to influence EAE susceptibility. However, attempts to identify the responsible gene have had limited success. In this study, the two variants most likely to be causative for the QTL on chromosome 20 (i.e. highest logP) are a variant in an intron of Btnl2 and a variant 274 bp upstream of RT1-Db1, both in the class II region. The human orthologue of RT1-Db1, HLA-DRB1, is associated with multiple sclerosis with risk allele HLA-DRB1*15:0121.
Unexpectedly 212 QTLs (62%) had no candidate variant (Figure 4a). We considered four explanations for this observation: (i) causative variants were missing from the sequence catalogue; (ii) haplotype mapping is biased towards QTLs without candidate variants; (iii) the merge analysis underestimated statistical significance compared to single marker association; (iv) the presence of multiple causal variants.
First, causal variants may have been missed because our sequence data are incomplete. Despite linkage disequilibrium extending over a few megabases, not all variants are tagged by a nearby variant with identical strain distribution pattern (SDP) in the founders. For example, only 50% of the structural variants are tagged by a SNP lying within 1Mb.
However, because only a limited set of possible SDPs exist in the HS, we can test whether missing genotypes are responsible for failure to detect candidate variants. We generated SDPs for all diallelic and tri-allelic variants at every locus within the 212 QTLs and tested each by merge analysis, to see how many would have been candidates. Only 44 QTLs had candidate diallelic variants and 165 had diallelic or triallelic variants. Thus if the effects are attributable to a single diallelic variant that we had failed to sequence, then there are still 168 QTLs (49%) without a candidate variant. If the effects are attributable to a di-allelic or tri-allelic variant, the fraction reduces to 14%. However, triallelic SNPs are very uncommon and therefore unlikely to explain the large number of QTLs without candidate variants.
Second, haplotype mapping might simply not be powerful enough to detect candidate variants, or be biased towards QTLs without candidate variants. We addressed the first possibility by simulation and show the results in Figure 4a. In each case we report the distribution of the difference d between maximum merge and haplotype logPs, so that if candidate variants exist then d > 0 (where d = logPmerge – logPhap as defined above). When simulated QTLs arise from single causal variants, merge analysis does indeed identify candidate variants at almost all QTLs placed at random regions of the genome as well as at QTLs simulated in the same locations as the detected QTLs.
We also considered the performance of the method at QTLs where a single variant is highly likely to be the causal variant, namely at cis-acting expression QTLs 22,23. We tested 1,398 eQTLs detected in the hippocampus of HS mice24. We found that the merge analysis identified variants that exceed the haplotype-based test at 97% of QTLs (Figure 4b). Interestingly, when we carried out the same analysis on trans eQTLs, the distribution of d values was similar to that seen for the rat phenotypic QTLs (Figure 4b). This difference between cis and trans-eQTLs is true across all logP values, indicating that the difference is not due to lower power to detect trans eQTLs.
Since mapping QTLs using haplotype analysis might bias results towards finding loci without candidates (a winner’s curse is likely to operate), we used merge analysis to map QTLs genome-wide. The two methods do not identify the same QTLs (152 are unique to the merge method) but the merge method identified 16% fewer than the haplotype method. Importantly, only 9% of the merge-identified QTLs had no candidate variants (Supplemental figure 3). Consequently haplotype mapping will overestimate the number of QTLs without a candidate variant while merge analysis underestimates it. Therefore our best estimate of the proportion of QTLs without candidate variants is obtained from combining both methods. From the set of QTLs found by either merge or haplotype mapping we find that 44% of QTLs cannot be explained by single causal variants (instead of 62% when only the haplotype-based QTLs were considered). Thus while a winner’s curse does operate in favour of the haplotype analysis, it cannot account for all QTLs without a candidate variant.
The third explanation was that the merge analysis under-estimates statistical significance. We compared the performance of the merge analysis with single marker association at genotyped SNPs. Across all phenotypes, the R2 between the logP values was 0.9; agreement was strongest for the most highly associated SNPs. This result indicates that merge analysis performs as well as SNP analysis.
Finally, we investigated the extent to which multiple variants at QTLs would account for our findings. We investigated the consequences of a variety of complex QTL architectures by simulation and show the results in Figure 4a. Simulating multiple causal variants, on different haplotypes, reduced the frequency that any single variant exceeded the maximum haplotype logP, although this was still insufficient to mimic the observed distribution (Figure 4a). Simulating irreducible haplotypic effects arising from the reconstructed haplotype mosaics in the HS (rather than from a selection of sequence variants) also led to fewer QTLs with candidate variants (Figure 4a), although again it did not match the proportion observed with the real QTLs. Our simulations suggest that the presence of multiple causal variants at a locus accounts in part for the failure to find candidate causal variants.
It is often assumed that genetic loci underlying a phenotype identified in one species are homologous to those underlying the same phenotype in another, and that natural variation within these loci will pinpoint the same genes 25-27. However, there have been no genome-wide tests of the hypothesis for natural variation. Our data allowed us to examine whether genes and QTLs identified in the NIH-HS overlapped those found for the same phenotypes in a mouse HS10.
In total 38 measures were common to both studies, and were mapped using the same mixed model method. Only one measure, the ratio of CD4+ to CD8+ T-cells, showed overlap (using a FDR of 10% and looking in the 90% QTL confidence interval) but this was not significant (empirical p-value of 0.1). We repeated the analysis using QTLs called at a lower significance threshold (20th percentile of the extreme value distribution for each measure) and expanding the width of each QTL to 8Mb. Table 4 shows overlap for eight phenotypes, only two of which were significant at an empirical p-value of 0.05: serum urea concentration and the ratio of CD4+ to CD8+ T-cells. Overall, genetic variants in orthologous genes rarely contribute to the same phenotype in the two populations.
To test whether QTL overlap existed within similar pathways we compared the enrichment of KEGG pathways28. Only one measure, the proportion of B cells in the white blood cells, showed significant enrichment of a pathway (corrected p-value < 0.05). Even at a more relaxed significance threshold of 0.05 (non-corrected for multiple testing), only three measures show significant enrichment in the same KEGG pathways.
Using 1,407 outbred rats we have mapped 122 phenotypes and identified 355 QTLs at high resolution. We have shown how combining sequence with high resolution mapping data can lead to the immediate identification of candidate genes, and in some cases to the identification of candidate causal variants at many QTL. We highlight two examples here.
The locus on chromosome 10 regulating frequency of CD25+ CD4 T cells, and the frequencies of CD4 and CD8 T cells, has previously been shown to control CD4 and CD8 frequencies in a cross between ACI and F34429, both represented in the NIH-HS rats. The amino acid substitution at position 175 (p.Gly175Arg) of the TBX21 protein is a very strong causal candidate at this QTL since this domain is important for DNA interactions. Tbx21 has been implicated in the genetic control of regulatory T cells30, a subset of T cells with high surface expression of CD25, and might indirectly regulate the frequency of CD4+ and CD8+ T cells via the transcriptional repressor Sin3a31,32.
We implicated Abcb10 in red blood cell differentiation. Evidence from mouse knockouts indicates that this gene is essential for erythropoiesis18,19,33. The p.Thr233Met mutation positions a larger, bulkier residue into a region that is tightly packed in the open-outwards conformation of ABC transporters, potentially interfering with conformational changes which are essential for transport of the substrate.
Two noteworthy features of the genetic architecture of complex traits in the rat emerge from this study: (i) the contrast with human GWAS findings; (ii) about half of QTLs cannot be attributed to a single causal variant. We discuss these points below.
Rat and mouse HS experiments differ from human GWAS in two ways. In the rodent GWAS, far fewer subjects are required to detect a significant effect and fewer loci of larger effect explain more of the variance. In rats the median proportion of heritability explained by joint QTLs is 39.1% (mean 42.3%), in mice 32.2% (mean 42.0%). In humans the equivalent figure is often less than 10%.
One explanation for these differences is the markedly different allele frequencies: human populations are characterized by a preponderance of rare alleles (minor allele frequency less than 1%); HS populations have a relatively uniform distribution of minor allele frequencies (Figure 1c). However, it is important to realize that the mouse and rat differ in the degree of segregating variation (in the rat NIH-HS there are 7.2M SNP sites, compared to 10.2M in the mouse HS). In the rat there are 2.8M SNPs per HS strain, the corresponding number in the mouse HS is 4.4M. In other words, total sequence variation per se is not a critical determinant of the explanatory power of the QTLs. Furthermore, the heritabilities of the homologous phenotypes in rat NIH-HS and in HS mice are highly correlated (0.6, p-value 0.0002) (Supplemental Figure 4), implying that the additional sequence variation in the mouse does not give rise to an increase in heritability.
The failure to detect a single candidate variant at half of rat QTLs was surprising. We showed that while reliance on haplotype mapping can underestimate the number of QTLs without candidate variants, after taking this bias into account (by detecting QTLs with merge and haplotype analysis) there is still a large fraction (44%) of QTLs without candidate variants. The contrast between the 44% figure and the 97% that emerged from an analysis of variants at cis-eQTLs is striking. It is also notable that the findings from trans-eQTLs are so similar to those of the rat phenotypes (Figure 4) suggesting that cis-eQTLs are atypical. Our simulations indicate, but have not proven, that multiple causal variants are in part to blame. At present, we can only conclude that single causal variants are not always responsible for the genetic signal. Whether the lack of single causal variants at many loci is a general feature of loci influencing complex traits or not remains to be determined.
One simple interpretation of human GWAS is that each locus represents the presence of a single, relatively common, functional variant. Our results indicate that more complex models are required. Such alternative hypotheses exist, in which for example multiple alleles of varying frequency at the same or closely linked loci, contribute to the signal. Identifying the correct model of genetic action is critical for finding causative variants, since incorrect assumptions about the number and mode of action of genetic variants reduce power and can lead to false positive results34. The extent and nature of sequence diversity may be partly responsible for the complex way sequence variation acts at a QTL.
It is sometimes hoped that loci found in the rat could be typed and identified in humans, thus providing a cost-efficient way to find medically relevant genes. We observe some examples where the same loci act in different species, the most notable example being for variation in the ratio of CD4+ to CD8+ T-cells: the locus lies within the MHC in rats, humans35 and mice36 and its molecular nature in mouse has been identified as a deletion in the promoter of the Class II H2-Ea gene36. However formal tests for overlap between rat and mouse at the level of the gene or of a pathway yielded little that was statistically significant. Since the amount of sequence variation segregating within the two HS populations is relatively limited, failure to detect shared loci may be due to sampling. Also, the relatively small number of genes found for each phenotype reduces our power to detect pathways. We suspect that currently it is not possible to accurately assess overlap between the two species.
This study strengthens the rat’s role as a model organism in physiology and disease. Our mapping and sequencing data provide an important resource for addressing many biomedical questions.
DNA libraries for SOLiD sequencing were generated from genomic DNA from samples of the original rats that were used to create the HS population. The libraries were generated using standard protocols (Life Technologies) and had a median insert size of between 109 and 196 bp. All libraries were sequenced with fragment (50 bp) and paired-end (50+35 bp) runs using SOLiD 4 and SOLiD 5500 sequencers to a depth of at least 22x base coverage for each of the eight HS progenitors and for the strain LE/Stm, which was used to estimate error rates in comparison with hand-finished BAC sequence.
Sequence reads were mapped against contigs of the Rnor3.4 rat reference genome assembly (reference strain BN) using BWA v0.5.937 with parameters -c –l 25 –k 2 –n 10. Alignments from different libraries of the same HS progenitor were combined into a single BAM file.
Variant calling was performed independently on each strain. SNPs and short indels (<10bp) were called using a modified Samtools38 pipeline: Only unambiguously mapped reads were used. Sites with coverage below 4 or over 2000 were not used for SNP calling. Read bases with base-quality below 30 were ignored. Duplicate reads starting at the same position and mapped to the same strand as another read were discarded as likely PCR artifacts. Each of the called alleles had to be supported by at least one read where the variant mapped within the seed part of the read (first 25 bases). Non-reference alleles called with fewer than 3 reads were set to missing. Variable sites with more than 2 alleles within one founder were set to missing. The remaining variants were considered to be homozygous non-reference alleles (frequency of non-reference call > 2/3) or heterozygous alleles (frequency between 1/3 and 2/3) – however, we set to missing the small number of heterozygote calls as these were likely to be artefacts, for example due to unknown duplications. We later attempted to call all the missing genotypes by imputation (see below).
Copy number variants were called using depth-of-coverage approach implemented in DWAC-Seq v. 0.56 (https://github.com/Vityay/DWAC-Seq) using default parameters. Structural variants (SVs) were called using discordant pair mapping implemented in 1-2-3-SV v. 1.0 (https://github.com/Vityay/1-2-3-SV), requiring unambiguous mapping of both paired tags and at least 4 tag pairs per SV event. SV calls from these tools were merged. Prediction of the functional effect of each variant was performed by Variant Effect Predictor tool VEP 2.1 tool39.
We defined inaccessible regions of the HS rat genomes in a similar way as was done for mouse genomes4. A base was considered as accessible if it did not overlap simple, tandem repeats or low complexity sequence (defined by Dust, source: Ensembl release 66; http://www.ensembl.org), was not covered by more than 150 reads, and average mapping quality was at least 40. Nucleotide positions within 15bp of indels were also considered as inaccessible for SNP calling.
Thirteen BACs from the strain LE/Stm were sequenced using capillary methods, assembled and manually edited, producing a total of 2.1 Mb finished sequence. BAC sequences were aligned using BLAT40 For each BAC a single contiguous alignment was obtained, which was used to extract single base changes (SNPs) short indels (1-10 bp) and structural variants (100 bp and above). False positive and false negative rates were estimated from 1.9 Mb of genome sequence syntenic between BACs and genome assembly, excluding low quality BAC sequence (as defined by the BAC finishing team) and inaccessible regions (as defined above). False positive and false negative rates within this 1.9Mb were estimated from the discordance between our allele calls and those in the BACs.
Low false positive rates were independently confirmed by analysis of a randomly selected subset of 96 SNPs and 96 indels using PCR-based resequencing. Oligonucleotide primers were selected to amplify 300 bp fragments around the candidate polymorphism. When amplification was successful (SNPs: 84, indels: 80), amplicons were sequenced on an Applied Biosystems ABI 3730XL sequencer using Big-Dye terminator and analyzed with Polyphred software manually.
For CNVs and SVs 184 variants were selected and PCR primers were designed in such way that the presence or absence (depending on the variation type) of a PCR product could confirm the presence of the variation. After PCR, samples were run on agarose gel and analyzed manually. Of the 184 amplicons, 93 gave a PCR product. Of these 93, a group of 39 variants that were predicted SVs in the NIH-HS founders, were also confirmed by PCR in BN/NHsdMcwi indicating that these are probably assembly errors in the current reference genome (Rn3.4). Of the remaining 54 variants, 53 gave a banding pattern according to our expectance and in one case the predicted variation type was not correctly predicted.
Genotypes and genome accessibility data for HS rats (this study) and HS mice4 were used to characterize patterns of nucleotide diversity in these two panels. We partitioned each genome into non-overlapping windows such that each window contained 100 kb of accessible sequence (defined relative to the rat BN strain or mouse C57BL6 strain). The number of sequence differences per window was calculated for all windows and for all possible pairs of strains.
We found the spatial distribution of pairwise differences in the rat progenitors was bimodal with modes at 0 and 150 SNPs per 100 kb window (Figure 1b). Based on this distribution we defined a region of low nucleotide diversity between two strains as consecutive windows with nucleotide diversity below 13 SNPs per 100kb window.
The rat NIH-HS originates from a colony established in the 1980s in NIH3. Since its creation, the stock has been bred using a rotational outbreeding regime in order to minimize the extent of inbreeding, drift, and fixation.
A full description of the phenotyping protocol is given in Supplementary Note (“Phenotyping”).
All procedures were carried out in accordance with the Spanish legislation on “Protection of Animals Used for Experimental and Other Scientific Purposes” and the European Communities Council Directive (86/609/EEC) on this subject. The experimental protocol was approved by the Autonomous University of Barcelona Ethics committee (permit CEEAH 697).
The phenotype data were uploaded to a database (Integrated Genotyping System41) in batches over the three years of data collection. All relevant covariates were evaluated for their effect on each measure. The final set of covariates and transformations applied to each phenotype, as well as the number of data points for each measure, are given in Supplementary Table 1.
The RATDIV array was developed as a general SNP genotyping array, applicable both to the rat HS project and other populations of laboratory rats. Full descriptions of the development of the rat array and of the selection of the 265,551 SNPs used in this study are given in the Supplementary Note (sections “Development of the rat array” and “Selection of SNPs for this study” respectively).
Linkage disequilibrium (LD) between SNPs in the rat and mouse HS was calculated using PLINK42 from the genotypes called for the 261K autosomal rat SNPs and 12K autosomal mouse SNPs10. In the rat HS, eight regions with very high interchromosomal LD were identified, and excluded from subsequent analyses (Supplemental Table 2). Using UCSC liftover tool43, these regions mapped in the new rat reference genome assembly (RGSC 5.0) to the regions with which they were in high LD in the current assembly (Rnor3.4).
All genetic analysis was performed using R44. We used the R HAPPY package14 to calculate the descent probabilities from the eight HS founders for each animal at each of 265,551 inter-markers intervals, and then averaged these probabilities over 90kb windows, so that we eventually worked with 24,196 probability matrices. The density of the 265k SNPs was much greater than the density of recombinants in the HS, so the averaging did not cause any reduction in mapping resolution (most QTLs are mapped to intervals over a Mb wide, containing over ten 90kb intervals).
HS rats with different levels of relatedness were used in this study, including siblings, half sibs, cousins, uncles, great-uncles, etc. This unequal genome-wide genetic similarity means that correlations exist in the HS between distant markers. These long-range correlations (as opposed to short-range correlations due to physical linkage) can be responsible for false associations if not accounted for. We used two methods to control for unequal relatedness: Resample Model Averaging (as implemented in BAGPHENOTYPE 13) for non normally distributed phenotypes, and Mixed Models for normally distributed phenotypes. Information on the scope and the performance of the methods is given in the Supplementary Note (section “Comparison between mixed models and resample model averaging”). Because most of the phenotypes were normally distributed and the merge analysis was run in the mixed model framework, we present the mixed models briefly here. They were implemented in R so that haplotype mapping could be carried out using the descent probabilities output by HAPPY14. The model used to test for association between the ancestral haplotypes segregating at a locus L and phenotypic variation was:
where yi is the phenotypic value of the rat i, βc the regression coefficient of covariate c and xic the value of the covariate c in rat i. Importantly, the covariates include a dummy intercept term. TLs is the deviation in phenotypic value that results from carrying one copy of a haplotype from strain s at locus L and PLi(s) the expected number of haplotypes of type s carried by rat i at locus L output by HAPPY14. ui and εi are random effects, with cov(ui,uj) = σg2Ki,j and cov(εi,εj) = σe2Ii,j where σg2 and σe2 are estimated in the null model (no locus effect, TLs=0) using the R package EMMA12. K is the genetic covariance matrix, and is estimated from the genome-wide genotypic data using identity by state (IBS, the proportion of shared alleles between any two animals). The IBS matrix was calculated using the R package EMMA12. I is the identity matrix. The total covariance matrix V= σg2K + σe2I can be factorized as V=A2. Writing the equation (1) in matrix form,
Pre-multiplying (2) by A−1 gives a transformed equation
in which the variance-covariance structure of the random term A−1(u + ε) is now proportional to a diagonal matrix and so can be fitted as a standard linear model.
Calculations of the significance thresholds (when the phenotype was analysed with mixed models), inclusion probability thresholds (when it was analysed by resample model averaging), and confidence intervals are described in the Supplementary Note (section “QTL calling”).
Merge analysis is a form of imputation appropriate to HS-type populations whose genomes are mosaics of known haplotypes. Merge analysis asks two questions at each imputed variant: is the variant associated with the phenotype? (a standard test of association), and is it as significant as the test of haplotype association in the locality of the variant? We implemented merge analysis6 in a mixed models framework by comparing the models:
where V is a sequence variant in interval L, and MV is the merge matrix for the variant, formed by summing those columns of PL that carry the same allele at V (each column of PL represents one founder strain). This can be computed efficiently by defining a matrix BV that encodes the columns to be merged such that MV = PVBV. This test is applied at every variable site in the catalogue of single nucleotide variants that segregate between the 8 HS founders. From a statistical point of view, there is no difference between two variants with the same strain distribution pattern at a locus; they will give the same merge analysis result.
Because the two models (2), (3) are nested, the best possible fit (in terms of variance explained) is obtained with the haplotype model (2). If the QTL arises from variation at a single variant V, the fit of the merge model (3) for variant V will be as good as the fit of (2) and its significance will be greater due to the fewer number of degrees of freedom (for a diallelic variant, the degrees of freedom is 1 rather than 7 for the haplotype model). The merge model is fitted by multiplying by A−1 as before.
For each QTL lacking variants with a merge logP exceeding the haplotype logP, we looked for unobserved causal variants that might not have been sequenced. We simulated candidate variants with every possible strain distribution pattern (SDPs; 127 possible SDPs for diallelic variants, 1,094 possible SDPs when allowing for 3 alleles). The simulated variants were repeated within each QTL interval.
To investigate the hypothesis that failure to detect candidate variants by merge analysis reflected a complex architecture of the QTLs, we simulated QTLs arising from a single causal variant, QTLs arising from multiple causal variants within the same locus and/or multiple causal variants at linked loci, and QTLs arising from haplotypic effects not reducible to individual variants. In all cases the phenotypes were simulated from three components: a genetic random effect explaining 20% of phenotypic variation, uncorrelated errors explaining 75% of phenotypic variation, and a single QTL explaining 5% of phenotypic variation. When multiple causal variants were simulated, each explained the same proportion of phenotypic variation (5% divided by the number of causal variants). The effect sizes calculated a posteriori could be quite different from their target values due to correlations between the different components of the simulated phenotypes. For the simulations reported in Figure 4a, either a single causal variant was simulated, or nine causal variants in three linked loci (each locus within 2Mb of the central locus and distant by at least 200kb from each other locus). Alternatively, the probabilities PL were used to simulate irreducible QTLs. We analysed each simulation by merge analysis and when logPhaplotype was between 4 and 6 (to have a similar distribution of logP values to that of the rat QTLs) we calculated d = max logPmerge – max logPhaplotype. We compared the distribution of d from the different simulation sets to determine the likely genetic architecture of the QTLs.
Hippocampus expression levels in 460 HS mice measured using 12 thousand probes of Illumina Mouse WG-6 v1 BeadArrays 24 were mapped to the mouse ancestral haplotypes in the mixed model framework. QTLs were called in the same way as for the rat QTLs but using a confidence interval of 8Mb and a significance threshold of 4. cis-eQTLs were defined as within 2Mb of the beginning of the probe, and trans-eQTLs as those QTLs on a different chromosome from that of the probe or more than 10Mb away from it. Merge analysis was carried out at each eQTL and the difference between the maximum merge logP and the maximum haplotype logP was calculated.
To assess the potential effect of mutations on protein structure, homology models of target proteins were constructed and analysed. Amino acid sequences of target proteins were retrieved from Ensembl or UniProt databases45 and analysed using HHPred46 web server to identify structures with similar amino acid sequences in the Protein Data Bank17 for homology modelling with MODELLER47. Potential location of the mutation-bearing side chains (buried or surface exposed) and effect on the structure-function (e.g. disturbed hydrophobic core) was evaluated manually in PyMOL (The PyMOL Molecular Graphics System, Version 18.104.22.168 Schrödinger, LLC.).
Heritability is defined as the ratio of the genetic variance component by the sum of the variance components estimated in the null mixed model (covariates but no QTL).
Effect sizes are defined as the ratio between the fitted sum of squares and the total sum of squares in a model with covariates and without genetic random component. Joint effect sizes are defined as the ratio between the fitted sum of squares and the total sum of squares in a model without genetic random component including covariates and all the QTLs called for a given phenotype. Including the genetic random component would underestimate most of the effect sizes because part of the variance would have been attributed to it. Thus the QTL effect sizes reported are probably overestimates.
The number of genes under each QTL confidence interval was calculated using Ensembl protein coding genes and genes coding for micro RNAs (downloaded from BioMart48).
The calculation of the overlap between RGD and rat HS QTLs, and between mouse and rat HS QTLs is given in the Supplementary Note (section “Overlap between sets of QTLs”).
Kyoto Encyclopedia of Genes and Genomes pathways were retrieved using R KEGG.db package. We used INRICH 49 to find enrichment of pathways in the mouse and rat phenotypic QTLs (as defined by the 90% confidence interval) called at a low significance threshold (20th percentile of the extreme value distribution). We report the empirical and corrected p-values reported by INRICH.
We are grateful to T. Serikawa (Kyoto University) for the LE/Stm BAC clones. The Human Genome Sequencing Center sequence production teams at the Baylor College of Medicine produced the Sanger sequencing data for the eight sequenced strains used to define the RATDIV SNP genotyping array (see ref. 8 for a list of Baylor College of Medicine HGSC sequencing contributors). We thank E. Redei for providing the NIH-HS rat colony. The funders we would like to acknowledge are as follows: the European Union's Seventh Framework Programme (FP7/2007- 2013) under grant agreement HEALTH-F4-2010-241504 (EURATRANS); The Wellcome Trust (090532/Z/09/Z, 083573/Z/07/Z, 089269/Z/09/Z); The Swedish Research Council (grant K2008-66X-20776-01-4); the Harald and Greta Jeanssons Foundation; The Swedish Association for Persons with Neurological Disabilities; the Åke Wibergs Foundation; the Åke Löwnertz Foundation; Karolinska Institutet funds; the European Union's Sixth Framework Programme EURATools (grant LSHG-CT-2005-019015); the Bibbi and Nils Jensens Foundation; the Söderbergs Foundation; and the Knut and Alice Wallenbergs Foundation. We also thank the Ministerio de Ciencia e Innovación (reference PSI2009-10532 and the Formación de Personal Investigador fellowship to C.M.-C.); the Fundació La Maratò TV3 (reference 092630); the Direcció General de la Recerca (reference 2009SGR-0051); and the British Heart Foundation (BHFRG/07/005/23633). T.J.A. and S.S.A. acknowledge funding from the Imperial BHF Centre of Research Excellence. M. Johannesson acknowledges support from Prof. Nanna Svartz Foundation, The Swedish Rheumatism Association and The King Gustaf V 80th Anniversary Foundation. D. Gauguier acknowledges support from the Institute of Cardiometabolism and Nutrition (ICAN; ANR-10-IAHU-05). T.M. and E.Y.J. acknowledge support from Cancer Research UK (A10976) and the UK Medical Research Council (G9900061). T.F., D.L.K. and I.A. acknowledge support from the U.S. National Institutes of Health (R01 AR047822).
Mapping data are available at http://mus.well.ox.ac.uk/gscandb/rat (see Supplementary Note (“Guidelines to explore the genome scans and integrated sequence data”) for directions on how to explore the sequence data at each QTL).
Variant calls and inaccessible regions are available at http://www.hubrecht.eu/research/cuppen/suppl_data.html.
Accession numbers: Sequence data for the eight HS founders are available from EBI SRA archive under accession ERP001923. The LE/Stm BAC sequences are available in the NCBI Trace Archive (accessions FO181540, FO181541, FO117626, FO181542, FO117624, FO181543, FO117625, FO117627, FO117628, FO117629, FO117630, FO117631, and FO117632)