system of alleles in APOE appears to play a crucial role in risk for LOAD. In fact, as much as 50% of the population risk for LOAD could be attributable to ε4. For the past decade, however, other loci in APOE and surrounding genes have also been associated with risk for LOAD. For instance, variation in the 5′ region of APOE has been shown to alter the expression of the gene, to produce population-level variation in metabolic lipid levels [3], and to have a weak impact on risk for LOAD [9]. Among the APOE promoter SNPs, the −491 A, −427 C and −219 T variants have a higher frequency in AD cases than in controls in some, but not all, studies [25]. Nonetheless, the impact of other loci has proven difficult to define because of their known or potential LD with SNP 33, which defines the ε3/ε4 dichotomy. Surprisingly, only a few studies have undertaken a comprehensive assessment of the patterns of LD in the APOE region, and none of those has used molecular methods to ensure accuracy of phased chromosomes. We provide a comprehensive assessment of LD by using molecular haplotyping to phase 21 SNPs in APOE, and complementary statistical methods for an additional 29 SNPs surrounding APOE.
In our Caucasian sample, the association with AD was significant for APOE promoter SNPs -491 (SNP 21) and +113 (SNP 25). However, when ε4 count is incorporated into the model and after Bonferroni correction, no locus has a significant, independent effect on AD status. As for SNPs outside of APOE, one locus in our comprehensive analysis, a synonymous SNP in the TOMM40 gene, accounts for increased risk for developing AD. Again, when ε4 status is accounted for in the model, no single SNP explains a significant portion of the risk. Therefore, while tight LD between APOE and TOMM40 raises the possibility that the latter locus may contribute to the risk for developing AD, ε4 remains the most likely LOAD allele in the region.
Within APOE itself, we genotyped 21 previously reported SNPs, but found only 16 to be polymorphic in our sample of 550 individuals, of which 11 had MAF > 0.02. These 11 SNPs cover 4,802 bp of genomic sequence. Thus, within APOE, a SNP with MAF > 0.02 occurs every 437 bp, on average. This density of SNPs is slightly higher than what is observed, on average, in completely sequenced genes in general [32]. For the ε2/ε3/ε4 system of alleles, we found that ε4 is embedded in 12 different haplotype backgrounds; ε3 is embedded in 20 different haplotype backgrounds; while ε2 is embedded in only 3 backgrounds. Because there were 298 haplotypes bearing ε4 and 741 bearing ε3, there is proportionately more variety in ε4-bearing haplotypes than in ε3-bearing haplotypes (on average, 24.8 copies per ε4-haplotype versus 37.1 copies per ε3-haplotype). This observation is consistent with the conjecture that ε4 is ancestral to ε3, based on analyses of other primates [33], all of which carry the ε4 allele. Nonetheless, the fact that ε3 is now far more common in human populations worldwide led to the conjecture that ε3 has been under positive selection since its introduction in early humans [19]. Consistent with our observation that variation in APOE is at least as large as that seen in other genes, however, Fullerton et al. could find no statistical evidence for selection, which would be expected to reduce regional variation.
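The averages quoted in this paragraph follow directly from the counts given in the text. The short sketch below is only an arithmetic check, not part of the original analysis; the function names are illustrative.

```python
def mean_copies_per_background(n_haplotypes: int, n_backgrounds: int) -> float:
    """Average number of sampled chromosomes per distinct haplotype background."""
    return n_haplotypes / n_backgrounds


def mean_snp_spacing_bp(span_bp: int, n_snps: int) -> float:
    """Average number of base pairs per SNP across a genomic span."""
    return span_bp / n_snps


# Counts taken from the text above.
e4_copies = mean_copies_per_background(298, 12)  # about 24.8
e3_copies = mean_copies_per_background(741, 20)  # about 37.05 (quoted as 37.1)
spacing = mean_snp_spacing_bp(4802, 11)          # about 436.5 (quoted as 437)

print(f"e4: {e4_copies:.2f} copies per background")
print(f"e3: {e3_copies:.2f} copies per background")
print(f"one SNP with MAF > 0.02 every {spacing:.1f} bp")
```

Fewer copies per background for ε4 than for ε3 is what the text describes as proportionately more variety among ε4-bearing haplotypes.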
If ‘haplotype block’ structure is measured by the distribution of haplotypes, our analyses suggest most of the SNPs in APOE exist in a single block. In fact, only 5 haplotypes account for over 75% of the chromosomes in the sample. On the other hand, if LD is measured pairwise by r² (Supplementary Fig. 1), or even by multivariate assessment of LD based on pairwise r² (Supplementary Fig. 2), our analyses suggest much less LD. This contrast underscores the superiority of assessing multivariate LD directly, such as by analysis of the distribution of haplotypes.
To make the drawback of pairwise LD more concrete, we offer a simple example. Imagine there exists (or historically existed) a population in which there are five linked SNPs, with alleles named ‘1’ and ‘2’. Alleles at the loci are independent and thus all 32 possible haplotypes occur. From this population a sample is drawn to found a new population. The sample contains only four haplotypes (Table 4), each of which occurs with probability ¼. As can be seen in Table 4, although the haplotype distribution is limited, the founder haplotypes set up a peculiar pattern of pairwise LD, regardless of the measure of LD used (see Devlin and Risch [21] for discussion). Pairs of adjacent loci are pairwise independent, while more distant pairs of loci are either in absolute LD or independent. While artificial, this scenario makes two points: pairwise LD can fail to capture higher-level LD, even in very simple instances (known in statistics as Simpson’s paradox), and comparisons of pairwise LD across and within genomic regions potentially confound an evolutionary parameter of interest, namely the recombination rate, with founder effect. This confounding will be most important for recently founded populations, but we suspect it is also important for other populations, such as those of European and Asian descent.
Table 4. Heuristic example of the failure of pairwise LD to capture higher-level LD. The four haplotypes occur with equal probability ¼ in the population. Pairwise LD, as measured by r² (but true also of any measure of LD reviewed by Devlin and Risch 1995) ...
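The pattern described in the heuristic example can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical set of four equally frequent founder haplotypes chosen to match the verbal description (adjacent loci independent, more distant loci either in absolute LD or independent). These are not necessarily the haplotypes of Table 4, and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical founder haplotypes over five loci, each with frequency 1/4.
# Assumption for illustration only; not taken from Table 4.
HAPLOTYPES = ["11111", "12121", "21212", "22222"]


def r_squared(haps, i, j):
    """Pairwise r^2 between loci i and j for equally frequent haplotypes."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n    # frequency of allele 1 at locus i
    p_j = sum(h[j] == "1" for h in haps) / n    # frequency of allele 1 at locus j
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    d = p_ij - p_i * p_j                        # disequilibrium coefficient D
    return d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))


def pairwise_r2(haps):
    """Map each pair of loci (numbered from 1) to its r^2."""
    n_loci = len(haps[0])
    return {
        (i + 1, j + 1): r_squared(haps, i, j)
        for i, j in combinations(range(n_loci), 2)
    }


if __name__ == "__main__":
    for pair, value in pairwise_r2(HAPLOTYPES).items():
        print(pair, value)
```

With these assumed haplotypes every adjacent pair has r² = 0, yet only 4 of the 32 possible haplotypes exist, so a summary built from adjacent-pair r² alone would miss the strong haplotype-level structure.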
Our experimental design over-samples for individuals diagnosed with LOAD. Devlin and Risch [21] and Devlin et al. have shown that various measures of pairwise LD can be biased in the face of this over-sampling. Due to this bias, one might expect the patterns of LD to differ substantially between the LOAD and control samples. Instead we see similar patterns for both samples (Supplementary Fig. 1), although the controls show somewhat stronger LD. These patterns are probably due to the fact that SNPs 33 and 35, defining the ε2/ε3/ε4 system, are not in high LD with many other genotyped SNPs in the region. If they were tightly linked, we would expect more divergent patterns in the two samples.
As described in more detail by North et al., the pattern of LD in the region has implications for the power to detect the association between ε2/ε3/ε4 and LOAD, assuming that this system of alleles was not genotyped but other SNPs in the region were. Two cross-currents complicate predictions about detection. LD as measured by r² is not large, yet this is the natural measure for power due to its direct connection to the chi-square statistic [35]. On the other hand, the strength of the association between LOAD and the ε2/ε3/ε4 system is substantial. Assume that the odds ratio for ε4 in the LOAD versus control samples is about 2.0 and that the frequency of ε4 in the population is 0.12. Detecting the association with ε4 with 80% power at a significance level of 0.05 would then require roughly N = 100 individuals diagnosed with LOAD and an equal number of controls. Detecting the ε4 association by genotyping a locus in LD with it would require samples of size roughly N/r². Even if r² were as small as 0.1, the required sample size for 80% power under these assumptions is only about 1000 cases and an equal number of controls.
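The order of magnitude of these sample sizes can be reproduced with a standard normal-approximation calculation. The sketch below is one way to do so and is not necessarily the calculation used by the authors. It assumes a two-sided allelic test comparing two proportions, two chromosomes per individual, and a control allele frequency equal to the population frequency of 0.12; all function names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(control_freq: float, odds_ratio: float) -> float:
    """Allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def individuals_per_group(control_freq: float, odds_ratio: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Cases (and equally many controls) needed when the risk allele is typed directly.

    Assumes a two-sided two-proportion z-test on allele counts.
    """
    p0 = control_freq
    p1 = case_allele_freq(p0, odds_ratio)
    p_bar = (p0 + p1) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    return ceil(alleles_per_group / 2)  # two chromosomes per individual


def individuals_via_marker(n_direct: int, r2: float) -> int:
    """Approximate sample size when only a marker in LD (given r^2) is typed."""
    return ceil(n_direct / r2)


if __name__ == "__main__":
    n = individuals_per_group(control_freq=0.12, odds_ratio=2.0)
    print("direct genotyping:", n, "cases and as many controls")
    print("marker with r^2 = 0.1:", individuals_via_marker(n, 0.1))
```

Under these assumptions the direct calculation gives a figure on the order of 100 individuals per group, and dividing by r² = 0.1 scales it roughly tenfold, in line with the numbers quoted in the text.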
Scanning the region within and around APOE, there is only one set of SNPs that shows large LD, as measured by r², with the ε2/ε3/ε4 system of alleles (Supplementary Fig. 1). These loci fall in TOMM40, roughly 15 kb 5′ of APOE. SNPs within this region (SNPs 8-12) have some of the strongest genetic associations with the risk of AD in our Caucasian AD samples (Supplementary Table). TOMM40 encodes a subunit of the multisubunit translocase of the outer mitochondrial membrane, the TOM complex [36], which plays a role in protein transport into mitochondria. In fact, the TOM40 protein forms the critical pore and actively sorts protein for sub-mitochondrial locations [37]. Structural abnormalities and oxidative stress of the mitochondria are known to increase risk for AD, and defects in mitochondrial energy metabolism have been observed in AD [38]. This raises the possibility that part of the liability to LOAD commonly ascribed to ε4 might have been caused by TOMM40, on the basis of its strong LD. However, contrasting the effects of all 50 loci in this region on the risk of AD, with and without conditioning on ε4 status, our findings diminish the possibility that TOMM40 and other loci near APOE have a major effect on the risk of LOAD in Caucasians.
Our results support the idea that associations can be detected at SNPs near a complex disease gene when the causative mutations are essentially monophyletic, as for APOE ε4. However, a high density of SNPs will be necessary to ensure the detection of such associations with causative disease changes. Our study provides an excellent scenario to support this point of view. TOMM40 has functional implications for AD pathogenesis and shows strong genetic association with LOAD. If the APOE ε4-defining SNP (SNP 33 [rs429358]) had not been genotyped and analyzed in the study, one could have mistakenly selected TOMM40 as the candidate gene for LOAD. Enormous research effort could thus be wasted on studying the wrong gene. Moreover, our study further demonstrates that haplotype-based analysis can provide additional information with respect to tests of significance and fine localization of the most critical causative variants.