|Home | About | Journals | Submit | Contact Us | Français|
The ε4 allele of APOE confers a two- to four-fold increased risk for late-onset Alzheimer’s disease (LOAD), but LOAD pathology does not all fit neatly around APOE. It is conceivable that genetic variation proximate to APOE contributes to LOAD risk. Therefore, we investigated the degree of linkage disequilibrium (LD) for a comprehensive set of 50 SNPs in and surrounding the APOE using a substantial Caucasian sample of 1100 chromosomes. SNPs in APOE were further molecular haplotyped to determine their phases. One set of SNPs in TOMM40, roughly 15 Kb upstream of APOE, showed intriguing LD with the ε4 allele, and were strongly associated with the risk for developing AD. However, when all the SNPs were entered into a logit model, only the effect of APOE ε4 remained significant. These observations diminish the possibility that loci in the TOMM40 may have a major effect on the risk of LOAD in Caucasians.
Genetic variation in apolipoprotein E gene (APOE) confers risk for both coronary artery disease (CAD)  and Alzheimer’s disease (AD) . The connection of this gene to CAD is apparent because the ApoE protein, as a component of serum lipoprotein particles, binds to cell surface receptors, mediates lipoprotein uptake, and thus has direct effects on lipid metabolism. In addition, both functional and regulatory variation in APOE account for population-level variation in metabolic lipid levels . The connection between ApoE and AD pathogenesis is more obscure. The major APOE risk for AD is generally assumed to come from the ε2/ε3/ε4 haplotype system with the ε4 allele increasing risk for both disorders and the ε2 allele is protective . However, recent estimates of heritability of AD range from 57-78% , with ε4 alleles accounting for only roughly 50% of that heritability.
The ε2/ε3/ε4 haplotype system is defined by 2 non-synonymous single nucleotide polymorphisms (SNPs) in APOE exon 4. One is a C/T SNP (rs429358) that encodes either arginine (C) or cysteine (T) in ApoE at amino acid 112. The second site defining this haplotype system is a C/T SNP (rs7412), which again encodes arginine (C) or cysteine (T) at ApoE amino acid 158. The allelic compositions of the commonly investigated haplotypes are TT for ε2, TC for ε3, and CC for ε4. The effects of these coding variants on ApoE function are well-defined . Regulation of APOE expression is controlled by cis-acting elements both within the gene and in flanking sequences. The function of these regulatory elements could potentially be influenced by genetic variation. Variation in the 5’ promoter region of APOE alters its expression [7, 8], and some of these variants may be associated with AD , although their impact appears to be minor. In fact, it has proven difficult to provide good estimates of the effect of the 5’ regulatory variation on risk for AD, in part because these SNPs could be in linkage disequilibrium (LD) with ε4. There are numerous non-coding SNPs within and immediately adjacent to APOE that may influence measures of lipid metabolism .
Variation just outside of the 3’untranslated region of APOE has also been reported to have a minor impact on risk for LOAD . Again, LD potentially confounds interpretation of the association. Downstream regulatory elements include two copies of a multienhancer that control expression of APOE in adipocytes, macrophages and astrocytes (ME1 and ME2) [12, 13], two copies of a hepatic control region (HCR1 and HCR2) [14, 15] enhancer that control expression in the liver, and a potential brain control region 42 kb from APOE that may control expression in brain neurons and microglia . It is unknown whether genetic variation in or near these elements controls APOE expression. What of the genes and genetic variants outside but potentially in LD with ε4 of APOE? For example, based on results from cladistic analyses, Templeton  argued that variation in APOC1 causes risk to LOAD. Also unknown is whether there are additional cis-elements upstream of the APOE promoter that contribute to this risk.
Because of the strong association between the ε2/ε3/ε4 variants of APOE and risk for AD, we investigated the LD structure of APOE and its surrounding region. Of particular interest are SNPs in potential regulatory regions both within and flanking APOE that could modify the risk associated with the ε2/ε3/ε4 haplotypes. Also, APOE is an excellent model to explore the ability of genome-wide association methods to detect a causative gene when risk for a common disease is determined by a single, monophyletic, common variant therein. To characterize the APOE-region LD structure, we genotyped 50 SNPs in and surrounding APOE, with particular reference to the ε2, ε3, and ε4 system of alleles. The 550 Caucasian samples genotyped were collected to evaluate the genetic basis of AD. Within APOE itself, we report on 21 SNPs that were molecularly phased using the methods reported in Yu et al. . This set of SNPs overlaps substantially with those assayed by Fullerton et al. . Outside of APOE, we relied on statistical methods to infer haplotypes or assess LD. In our study population, we assessed whether selection on ε4 (or possibly other loci in the region) alters the pattern of LD found in the AD sample relative to that found in the control sample. Significant LD was observed between ε4 and SNPs spanning 50 kb, a region containing multiple genes. Because of the LD patterns observed, it is difficult to distinguish the impact of ε4 from highly correlated SNPs in the region. In terms of detecting APOE as an AD risk gene, there are numerous SNPs in the region that would detect AD risk, but interestingly, they are not necessarily in APOE. In fact, some of the SNPs with the greatest power to detect risk are in the adjacent gene TOMM40, while many others much closer to ε4 would not be useful.
Genotype data of fifty loci (Fig. 1, Supplementary Table) in the APOE region were generated from a clinical sample of 550 Caucasians. Among the subjects, 193 individuals had a clinical diagnosis of LOAD, 125 individuals had diagnoses of other neurodegenerative disorders, and 232 individuals were controls. For our combined samples, 16 of the 21 APOE loci were polymorphic, and 11 of 21 had a minor allele frequency (MAF) > 0.05 (Supplementary Table). The remaining 29 loci, those outside of APOE, were selected to have MAF > 0.02 in the samples. After Bonferroni correction for multiple testing, two loci violated Hardy Weinberg (HW) assumptions in the LOAD sample (SNP 8 [rs6857] and SNP 11 [rs11556505]); a different locus violated HW in the control sample (SNP 1 [rs2965118]); and all three violated HW when the data are combined to produce the full sample (data not shown). None of the loci within APOE violated HW in any of the samples. SNPs in APOE were further haplotyped by molecular methods to determine their phases. We used the Allele Discriminating Long and Accurate PCR Haplotyping (ADLAPH) method  to produce unambiguous molecular haplotypes for 21 loci in APOE (SNPs 17-37, Fig. 1, Supplementary Table) from our study samples.
Of the 550 individuals genotyped and 1,100 haplotypes determined for 21 APOE loci, 12 were singleton haplotypes. We reasoned that some of these singleton haplotypes could result from genotyping error because an error would have a non-negligible probability of producing a novel, albeit pseudo-haplotype. To address the question of the rate of genotyping error, we performed two analyses, one computational and the other molecular (for details, see Methods section). We estimate a per-locus error rate of 0.0003 with an upper 95% confidence interval of 0.0006.
If the APOE loci were in linkage equilibrium, then we would expect hundreds of different haplotypes. Because these SNPs fall within a 5300 bp region, we do not expect such a substantial set of haplotypes. Only 35 unique haplotypes are observed in the distribution, and the distribution is distinctly skewed toward a few common haplotypes (Table 1). Five haplotypes account for over 75% of the haplotype distribution, and 13 haplotypes account for over 95% of the haplotype distribution. The common haplotypes are common in both the LOAD and control samples, although individual haplotypes clearly differ in frequency (Table 1), especially with respect to presence/absence of ε4. After accounting for errors, nine singleton haplotypes occur in the 1100 haplotypes. These singletons are evenly distributed across the samples: for the LOAD sample, the relative frequency of singletons is 0.008; for the control sample, it is 0.006; and for the miscellaneous sample it is 0.008.
In the total sample, 61 chromosomes carry ε2 on three different haplotypes (Table 1); 741 chromosomes carry ε3 on 20 different haplotypes; and 298 chromosomes carry ε4 on 12 different haplotypes. Contrasting the case and control samples (Table 1), the case sample contains 11, 210 and 165 haplotypes bearing ε2/ε3/ε4 alleles (n = 386), respectively, whereas the control sample contains 39, 353 and 72 haplotypes bearing ε2/ε3/ε4 alleles (n = 464).
To analyze the “haplotype-block” structure of APOE, as measured by the diversity of the haplotype distribution , we restricted the data to the 11 SNPs with MAF > 0.02 (see Supplementary Table). Results from this analysis are congruent with the restricted distribution of haplotypes in Table 1, suggesting that all but the last locus of APOE form a single haplotype block. Analyses for a recombination hotspot are more ambiguous: some runs of Phase 2 find no evidence for a recombination hotspot within these loci (mean posterior likelihood of 1.0 across the region); whereas other runs place a hotspot in the vicinity of loci 21-23 (rs449647 and rs769446, mean posterior likelihood of 276.7). The pattern of pairwise LD in the gene – as measured by a common metric D’  is compatible with the haplotype structure of the gene (Supplementary Figure 1), suggesting complete LD for most pairs of loci.
The complementary results of limited haplotype diversity, substantial D’, and limited evidence for recombination hotspots suggest that recombination has not been a major evolutionary factor within APOE. By contrast, another common measure of pairwise LD, namely r2 or Δ2  – is not uniformly large; instead pairwise LD varies substantially and is often small (Fig. 2). This measure is also a function of recombination, but is sensitive to a host of other factors, including homoplasy [21, 22]. Homoplasy in APOE has been demonstrated by Templeton et al. , who described the occurrence of a particular mutation on more than one haplotype background in the APOE region [22, 23].
In this subsection we focus largely on the ε2/ε3/ε4 system of alleles, because of their presumed central role in the risk for LOAD. We also use the data from individuals diagnosed with LOAD and contrast those data to the controls. Fig. 2 shows the pairwise LD between the ε2/ε3/ε4 system and other loci. Because of the relative rarity of ε2, we centered the LD analysis on SNP 33, which defines the ε4 versus ε3 dichotomy. When measured by r2, LD between alleles at SNP 33 and alleles outside of APOE shows notable variation (Fig. 2), much like the loci in APOE. Only one SNP in APOE shows a substantial r2 with SNP 33, SNP 28, whereas a much larger number of SNPs 5’ of APOE show substantial r2 (Fig. 2). LD tends to be unpredictable near the ε3/ε4 locus, but seems essentially absent at more substantial distances (Supplementary Table , Fig 2). The pattern is similar for both the LOAD and control samples. When LD is measured by |D’| (Fig 2), however, the values for most loci around SNP 33 are large (close to or equal 1), and are predictably small only at substantial distances from SNP 33 (Supplementary Table, Fig 2). Again the pattern is similar for both the LOAD and control samples. We do not report hotspot analysis for the larger region because we could not discern a consistent pattern in the results; it appears that numerous regions show some evidence for elevated recombination rates.
To summarize multivariate LD across these loci, as well as identify tag SNPs, we used the hierarchical cluster methods proposed by Rinaldo et al. . Tag SNPs were selected using a bound of 0.8. Cluster analysis identifies only a few substantial clusters, and shows similar clustering features for both the LOAD and control samples (Supplementary Figure 2). SNPs spanning roughly 50 Kb and covering APOC4, APOC2 and CLPTM1 show substantial joint LD (SNPs 41-50 in Supplementary Fig. 2). This cluster, however, has no noteworthy correlation with SNPs 33 and 35, which define the ε2/ε3/ε4 system. In terms of clustering by LD, SNPs 33 and 35 do not fall in the same cluster regardless of whether the LOAD or control samples were evaluated. Not only are SNPs 33 and 35 largely independent, they do not cluster strongly with other SNPs in APOE, with the exception of SNP 28, which clusters with SNP 33. Interestingly, SNP 33 clusters tightly with SNPs in TOMM40, especially SNPs 10-12 (Fig. 1), which are separated from SNP 33 by roughly 16 Kb. SNPs 10 and 11 occur in TOMM40 exon 3 and 4 respectively. SNP 35, defining the ε2 versus ε3 dichotomy, shows modest clustering with SNP 5, roughly 38 Kb away, in the second intron of PVRL2.
To contrast our results with data from HapMap, we downloaded the data from release #21a (phase II Jan07 on NCBI B35 assembly, dbSNP b125; see HapMap in Electronic-Database Information). This version contains 219 SNPs spanning the same region we evaluated. Notably, neither SNP 33 (rs429358) defining the ε3/ε4 dichotomy nor SNP 35 (rs7412) defining ε2/ε3 dichotomy has genotypes in the CEU population of this version of HapMap. Nonetheless, genotypes for two other SNPs that we have shown are highly correlated with SNP 33 (r2 > 0.5) are contained therein, namely SNPs 8 (rs6857) and 16 (rs10119), as well as others SNPs that are more moderately correlated. Because genotypes of SNPs 8, 16 and 33 are available in both Yoruba (YRI) and Japan (JPT) samples in HapMap, we evaluated these SNPs’ correlation in the two ethnic groups. Results indicated that both SNPs 8 and 16 do not correlate well with SNP 33 (r2 = 0.0 and 0.029 respectively for SNPs 8 and 16 in YRI, and r2 = 0.08 and 0.101 in JPT). Therefore, LD patterns in this region do not appear to be consistent across the different ethnic groups, although since SNP 8 was one of the SNPs which deviated from Hardy Weinberg equilibrium, we cannot exclude the possibility that there are as yet unrecognized genotyping artifacts, which could affect this conclusion.
Notably, the HapMap CEU data recapitulate the pattern of LD and clustering in Caucasians shown in Supplementary Fig. 2. Strong clustering emerges over a 50 Kb region, which contains the genes APOC1, APOC2, APOC4 and CLPTM1, while the roughly 60 Kb proximal shows no substantial clustering (data not shown). Clustering the HapMap CEU SNPs by H-clust reveals another SNP that should be highly correlated with SNP 33, specifically rs2075650, which has an r2 of roughly 0.85 with SNP 8. Tag SNP selection using H-clust and the HapMap data always draws a SNP highly correlated with SNP 33, for a wide range of stringency of SNP selection, even to r2 = 0.20 (data not shown). Therefore a genome-wide or local association scan built from HapMap data would very likely detect association to LOAD for regional SNPs, assuming the scan were adequately powered.
As has been demonstrated for many different samples, the distribution of ε4 alleles (Table 2) differs substantially between AD and control samples (Chi-square = 78.6605, df = 2, p-value = < 2.2e-16). When evaluated by sex, the distribution of ε4 alleles did not differ between men and women diagnosed with AD (Chi-square = 0.23, df = 2, p-value = 0.89) or between men and women in the control samples (Chi-square = 2.01, df = 1, p-value = 0.16). Analysis of age at AD diagnosis, performed using a Cox Proportional Hazards model, shows time to diagnosis is strongly dependent on ε4 count (β = -1.07; SE = 0.111; z = 9.67; p ~ 0.0). Survival analysis models with gender and ε4/gender interaction reveal no additional significant covariates.
Within APOE, 11 SNPs were sufficiently polymorphic for inference using MHA  analysis. MHA uses inferred evolutionary relationships among haplotypes, specified in the form of a cladogram, to structure the tests of association. The network relating all haplotypes with one step mutations is given in Fig. 3a; only one haplotype is not connected in this network, and it differs from three other haplotypes by 2 mutational steps. Evolutionary rules  break the cycles in the network to produce a cladogram (Fig. 3b) of haplotypes connected by one-step mutations. The cladogram has the plausible feature of clustering ε4 and ε3-containing haplotypes, but the three ε2-containing haplotypes are implausibly separated in the evolutionary space. The latter is of little concern given the modest impact of ε2, and in fact the results do not differ if the haplotypes are grouped (results not shown). We simplify the cladogram for statistical inference by consolidating rare haplotypes with more common haplotypes, because the impact on risk of rare haplotypes – even if it were substantially different from adjoining haplotypes – could not be distinguished statistically.
We also performed a cladistic analysis using eHap for ε2/ε3/ε4 haplotypes as a simple contrast of haplotypes on the cladogram ε4 – ε3 – ε2. When ε4 is contrasted to ε3, while estimating the effect of ε2 as a nuisance parameter, the contrast is highly significant, and thus the nodes remain distinct (Chi-square = 79.75; DF = 1; p ~ 0). When ε2 is contrasted to ε3, while estimating the effect of ε4 as a nuisance parameter, the contrast is not quite significant, and thus the nodes collapse for purposes of estimation (Chi-square = 3.48; DF = 1; p = 0.062). The latter result is typical for samples of this size; the small protective effect of ε2 is evident in the odds, but it is not significant. The odds of AD for the combined set ε2 and ε3 haplotypes is roughly fivefold less than that for ε4.
Using MHA we find a cluster of high-risk haplotypes, all of which contain ε4 alleles, and another cluster of low risk haplotypes that contain either ε2 or ε3 alleles. Thus, for these data and MHA, we find no evidence that SNPs elsewhere in APOE, such as in its regulatory region, have a significant impact on risk for AD. We would reach the same conclusions fitting AD status to two SNP models in which one SNP is always represented by ε4 allele count, as we show now.
We imagine two scenarios, one in which one analyzes association with AD status without knowledge of ε4 status, and one in which one appropriately conditions on ε4 count. Forty SNPs were informative enough in the AD and control samples to produce valid tests (see Supplementary Table). It is apparent that, in the absence of information about ε4 count, many loci in the region show significant association with AD status even after Bonferroni correction (Fig. 4). In fact, of the 40 informative loci, 28 SNPs have p ≤ 0.05 and 12 are less than the Bonferroni cut-off of 0.00125. Multiple SNPs in TOMM40 and APOE, and at least one SNP in LU, PVRL2, APOC1, APOC4 and CLPTM1 were associated with AD risk. In our sample, the association with AD was significant (p < 0.05) for APOE SNPs −491 (SNP 21) and +113 (SNP 25), but not for −427 (SNP 23), −219 (SNP 24) and +5361 (SNP 37). However, when ε4 count is incorporated into the model and after Bonferroni correction, no locus has a significant, independent effect on AD status (Fig. 4).
MHA for logical units across this region, such as genes, produces the same conclusion (data not shown). In general, when MHA is performed without conditioning on ε4 count, certain portions of the cladogram do not ‘collapse’ into a single node, suggesting some haplotypes were different in their impact on AD risk. However, when ε4 count is introduced as a covariate, the cladograms always collapsed into a single node, consistent with the null hypothesis that haplotypes had no impact on risk for AD.
Of these results, the association of SNPs in TOMM40 with AD, especially SNP 10 (rs157581), is arguably most intriguing. The C allele of SNP 10 in TOMM40 is in very strong LD with the C allele of SNP 33 in APOE, which defines the ε4 allele. The correspondence is so strong (Table 3) that one might wonder how the statistical models favor ε4 as the risk allele. But there is information to distinguish the effects. For an additive (allele) logit model, the odds ratio for presence of ε4 versus the status of AD is estimated to be 4.1; whereas the odds ratio for AD status using the alleles of SNP 10 is 2.88. Moreover, when variables representing both SNPs are entered into the logit model, either with or without an additional interaction term between the two SNPs, only the effect of APOE ε4 count is significant.
Perhaps the effects of the TOMM40 SNP 10 alleles are not additive, but they are expressed recessively. Consider individuals who are doubly homozygous for C alleles (at TOMM40 and APOE), homozygous only for C alleles at SNP 35 of APOE, homozygous only for C alleles at SNP 10 of TOMM40 or homozygous for neither. The case/control counts for these multilocus genotype classes are 24/1, 5/0, 8/9, and 144/217. From the contrast of 5/0 to 8/9, it appears that risk solely or predominantly arises from ε4 homozygotes, because the SNP 10 CC homozygotes are about equally likely to occur in case and control individuals who are not homozygous for ε4.
The ε2/ε3/ε4 system of alleles in APOE appears to play a crucial role in risk for LOAD. In fact, as much as 50% of the population risk for LOAD could be attributable to ε4 alone . For the past decade, however, other loci in APOE and surrounding genes have also been associated with risk for LOAD. For instance, variation in the 5’ region of APOE has been shown to alter the expression of the gene, to produce population-level variation in metabolic lipid levels , and to have a weak impact on risk for LOAD . Among the APOE promoter SNPs, the −491 A, −427 C and −219 T variants have a higher frequency in AD cases than in controls in some, but not all, studies [25-30]. Nonetheless, the impact of other loci has proven difficult to define because of their known or potential LD with SNP 33, which defines the ε3/ε4 dichotomy. Surprisingly, only a few studies have undertaken a comprehensive assessment of the patterns of LD in the APOE region , and none of those have used molecular methods to ensure accuracy of phased chromosomes. We provide a comprehensive assessment of LD by using molecular haplotyping to phase 21 SNPs in APOE itself , and complementary statistical methods for additional 29 SNPs surrounding APOE.
In our Caucasian sample, the association with AD was significant for APOE promoter SNPs -491 (SNP 21) and +113 (SNP 25). However, when ε4 count is incorporated into the model and after Bonferroni correction, no locus has a significant, independent effect on AD status. As for SNPs outside of APOE, one locus in our comprehensive analysis, a synonymous SNP in the TOMM40 gene, accounts for increased risk for developing AD. Again, when ε4 status is accounted for in the model, no single SNP explains a significant portion of the risk. Therefore, while tight LD between APOE and TOMM40 raises the possibility that the latter locus may contribute to the risk for developing AD, ε4 remains the most likely LOAD allele in the region.
Within APOE itself, we genotyped 21 previously reported SNPs, but found only 16 to be polymorphic in our samples of 550 individuals, of which 11 had MAF > 0.02. These 11 SNPs cover 4,802 bp of genomic sequence. Thus, within APOE, a SNP with MAF > 0.02 occurs every 437 bp, on average. This density of SNPs is slightly higher than what is observed, on average, from completely-sequenced genes in general . For the ε2/ε3/ε4 system of alleles, we found that ε4 is embedded in 12 different haplotype backgrounds; ε3 is embedded in 20 different haplotype backgrounds; while ε2 is embedded in only 3 backgrounds (Table 1). Because there were 298 haplotypes bearing ε4 and 741 bearing ε3, there is proportionately more variety in ε4–bearing haplotypes than ε3–bearing haplotypes (on average, 24.8 copies per ε4-haplotype versus 37.1 copies per ε3-haplotype). This observation is consistent with the conjecture that ε4 is ancestral to ε3, based on analyses of other primates , all of which carry the ε4 allele. Nonetheless, the fact that ε3 is now far more common in human populations worldwide led to conjecture that ε3 has been under positive selection since its introduction in early humans . Consistent with our observation that variation in APOE is at least as large as that seen in other genes, however, Fullerton et al.  could find no statistical evidence for selection, which would be expected to reduce regional variation.
If ‘haplotype block’ structure is measured by the distribution of haplotypes, our analyses suggest most of the SNPs in APOE exist in a single block. In fact only 5 haplotypes account for over 75% of the chromosomes in the sample. On the other hand, if LD is measured pairwise by r2 (Supplementary Fig. 1), or even by multivariate assessment of LD based on pairwise r2 (Supplementary Fig. 2), our analyses suggest much less LD. This result suggests that this contrast underscores the superiority of assessing multivariate LD, such as by analysis of the distribution of haplotypes.
To make the drawback of pairwise LD more concrete, we offer a simple example. Imagine there exists (or historically existed) a population in which there are five linked SNPs, with alleles named ‘1’ and ‘2’. Alleles at the loci are independent and thus all 32 possible haplotypes occur. From this population a sample is drawn to found a new population. The sample contains only four haplotypes (Table 4), each of which occurs with probability = ¼. As can be seen in Table 4, while the haplotype distribution is limited, the founder haplotypes set up a peculiar pattern of pairwise LD, regardless of the measure of LD used (see Devlin and Risch  for discussion). Pairs of adjacent loci are pairwise independent, while more distant pairs of loci are either in absolute LD or independent. While artificial, this scenario makes two points: pairwise LD can fail to capture higher-level LD, even in very simple instances (known in statistics as Simpson’s paradox) and comparisons of pairwise LD across and within genomic regions potentially confound an evolutionary parameter of interest, namely the recombination rate, with founder effect. This confounding will be most important for recently-founded populations, but we suspect it is also important for other populations, such as those of European and Asian decent.
Our experimental design over-samples for individuals diagnosed with LOAD. Devlin and Risch  and Devlin et al.  have shown that various measures of pairwise LD can be biased in the face of this over-sampling. Due to this bias, one might expect the patterns of LD to differ substantially between the LOAD and control samples. Instead we see similar patterns for both samples (Fig. 2, Supplementary Figs. 1 and 2), although the controls show somewhat stronger LD. These patterns are probably due to the fact SNPs 33 and 35, defining the ε2/ε3/ε4 system, are not in high LD with many other genotyped SNPs in the region. If they were tightly linked, we would expect more divergent patterns in the two samples.
As described in more detail by North et al. , the pattern of LD in the region has implications for the power to detect the association between ε2/ε3/ε4 and LOAD, assuming that this system of alleles was not genotyped but other SNPs in the region were. Two cross-currents complicate predictions about detection. As seen in Fig. 2, LD as measured by r2 is not large, yet this is the natural measure for power due to its direct connection to the chi-square statistic . On the other hand, the strength of the association between LOAD and the ε2/ε3/ε4 system is substantial. Assuming an odds ratio for ε4 versus ε3 in the LOAD versus control samples is about 2.0, and the frequency of ε4 in the population is 0.12. To detect the association with ε4 with 80% power at a significance level of 0.05 would require roughly 100 individuals diagnosed with LOAD and an equal number of controls. To detect the ε4 association by genotyping a locus in LD would require samples of size of roughly N/r2 . Even if r2 were as small as 0.1, the required sample size for 80% power under these assumptions is only about 1000 cases and controls.
Scanning the region within and around APOE, there is only one set of SNPs that show large LD, as measured by r2, with ε2/ε3/ε4 system of alleles (Fig. 2, Supplementary Figs. 1 and 2). These loci fall in TOMM40, roughly 15 Kb 5’ of APOE. SNPs within this region (SNPs 8-12, Fig. 1) have some of the strongest genetic association with the risk of AD in our Caucasians AD samples (Supplementary Table). TOMM40 encodes a subunit of the multisubunit translocase of the outer mitochondrial membrane, the TOM complex , which plays a role in protein transport into mitochondria. In fact, the TOM40 protein forms the critical pore and actively sorts protein for sub-mitochondrial locations . Because structural abnormalities and oxidative stress of the mitochondria are known to increase risk for AD, and defects in mitochondrial energy metabolism have been observed in AD [38-41]. This raises the possibility that part of the liability to LOAD commonly ascribed to ε4 might have been caused by TOMM40, on the basis of its strong LD. However, contrasting the effects of all 50 loci in this region on the risk of AD, with and without conditioning on ε4 status, our findings diminish the possibility that TOMM40 and other loci near APOE may have a major effect on the risk of LOAD in Caucasians.
Our results support the idea that associations can be detected at SNPs near a complex disease gene when the causative mutations are essentially monophyletic, as for APOE ε4. However, high density of SNPs will be necessary to ensure the detection of such association with causative disease changes. Our study provided an excellent scenario to support this point of view. Because TOMM40 has functional implication in the AD pathogenesis and it shows strong genetic association with LOAD. If the APOE ε4-defining SNP (SNP 33 [rs429358]) was not genotyped and analyzed in the study, one could have mistakenly selected the TOMM40 to be the candidate gene for LOAD. Thus, enormous research effort could be in vain by studying the incorrect genes. Moreover, our study further demonstrated that the haplotype based analysis can provide additional information with respect to tests of significance and fine localization of the most critical causative variants.
Human subjects were collected by the University of Washington Alzheimer’s Disease Research Center. All were unrelated individuals of European ancestry. The samples consist of 193 individuals diagnosed with LOAD, 232 similarly-aged subjects with no cognitive impairment, and 125 individuals with various other neurodegenerative disorders, including possible LOAD, dementia with Lewy bodies, Parkinson disease, progressive supranuclear palsy, and frontotemporal dementia.
Fifty potentially-variable sites were genotyped in this study, as mapped in Fig. 1. Twenty-one of these SNPs fall in APOE and its potential 5’ regulatory region, which covers roughly 5300 bp of genomic sequence, and were genotyped by primer extension assays using the SNuPE assay reagents . SNPs within APOE were selected according to the study of Fullerton et al.  and described in detail previously . An additional 29 SNPs were chosen to evaluate other genes/genomic elements that were proximate to APOE and plausibly could affect risk to LOAD. Sixteen fall in roughly 114 Kb 5’ to APOE; the remaining 13 fall in roughly 82 Kb 3’ to APOE (Fig. 1, Supplementary Table). These proximate SNPs were genotyped by TaqMan allele discrimination assays (Applied Biosystems, CA).
By our computational analysis, we wished to estimate the probability that a single error introduced into a naturally-occurring haplotype – defined to be a haplotype that occurred at least twice in the sample – would produce a pseudo-haplotype instead of a naturally-occurring haplotype. To estimate this probability PSH we iteratively performed this experiment: (1) randomly draw a haplotype from the distribution of naturally-occurring haplotypes; (2) randomly select one of the L = 11 polymorphic loci; (3) change the selected base pair to its complement; and (4) determine whether the resultant haplotype was also in the naturally-occurring haplotype list. Performing this experiment a million times yielded the estimated probability of producing a pseudo-haplotype by error, which was PSH = 0.792. If this experiment were performed using all variable loci, L = 16, it would yield a slightly higher probability estimate; if more than one locus were altered on a haplotype, the estimated probability would be substantially larger.
To determine if any of the singleton haplotypes were pseudo-haplotypes, we started with the genomic DNA from the 12 samples containing singleton haplotypes. These samples were scored by direct sequencing instead of primer extension reactions, which allowed us to generate completely independent results from the previous experiments. Among the 12 samples, 9 were consistent with the previous results. Three subjects, however, showed inconsistency at a single SNP. Two of these errors were clerical, occurring when the data were entered by hand; the other error was due to a rare SNP that disrupted one of the priming sites for primer extension reaction. Thus from our data we would estimate the probability of drawing a singleton haplotype with a single error, PE,S to be 3/1100 ≈ 0.00272, with 6 out of 12 singleton chromosomes representing an upper 96% confidence interval on the number of errors, given binomial sampling, to give a 96% upper confidence interval of ≈ 0.00544. Two of these errors occurred for SNPs with MAF > 0.02.
If we assume that errors are independent across loci on a haplotype and across haplotypes, it is straightforward to develop an estimator for the probability ε of an error on an individual SNP, namely ε ≈ PE,S / (L* PSH). Taking L = 11, and plugging in our estimates obtained from the molecular and computational analyses, we estimate a per-locus error rate of 0.00031. Two observations also follow from these calculations: the probability of haplotypes, natural or pseudo, with 2 or more errors on them is negligible; roughly 4 other haplotypes are expected to be erroneous, but they mimic naturally-occurring haplotypes and cannot be corrected.
To produce molecular haplotypes for 21 loci in APOE from our study samples, we used the Allele Discriminating Long and Accurate PCR Haplotyping (ADLAPH) method described in Yu et al. . Briefly, ADLAPH combines allele-discriminating primers and long-range PCR amplification to amplify long genomic fragments from only 1 of the 2 chromosome homologues of a particular subject. The phase-separated long-range PCR product is then genotyped by standard methods to yield one haplotype. Contrast with the original diploid genotypes is then carried out to provide the complementary haplotype. Comparisons between molecular and computational haplotyping methods have been previously discussed in our other studies [18, 42]. For a small region with tight LD (such as the entire APOE gene), the computational methods do not differs substantially in its estimates of haplotype distributions . However, when a larger region without tight LD was analyzed, the molecular haplotypes increased the linkage information by as much as 9% over the unphased SNPs . In this example, marker phase resolution via molecular haplotyping led to modest increases in the evidence for linkage in these data. Yet, larger gains may be possible in datasets with greater inherent phase ambiguity, such as in studies with larger numbers of markers, more polymorphic markers, or weaker LD between markers.
Haplotype frequencies comprised of SNPs outside of APOE were inferred by using maximum likelihood as implemented in the eHap program  (see CompGen in Electronic Database). To account for the phase-known haplotypes of APOE, we recoded haplotypes as alleles and the eHap program was specifically tailored to account for absolute haplotypes. Single SNP and haplotype-based statistical analyses were performed by using the eHap program . eHap relates haplotypes to phenotypes by using likelihood techniques that account for haplotype uncertainty. The program offers a flexible set of hypothesis tests, including goodness-of-fit or omnibus tests and specified contrasts of association between haplotypes and phenotypes.
To estimate haplotypes at all 50 loci and to infer regions of greater than expected frequency of recombination (recombination hotspots), we used Phase (Version 2.0) [44-46]. Phase uses Bayesian methods for inference, based on the assumption that the evolutionary relationships among haplotypes can be imputed from their degrees of similarity. LD block structure was defined in the sense of Rinaldo et al. , namely blocks are regions of limited haplotype diversity. To identify blocks, we used Entropy Blocker (CompGen). Using output from Phase 2, its algorithm identifies those regions that exhibit substantial multi-locus disequilibrium, ranging over a substantial number of SNPs, while allowing one or more SNPs to separate blocked regions or adjacent blocks. The model computes the likelihood of the data minus a penalty for model complexity, using the criteria that blocks should have very low haplotype diversity and the LD with SNP’s outside a block should be small. Entropy Blocker was also used to visualize pairwise LD.
To select ‘tagging SNPs,’ we used H-clust  (CompGen). The algorithm in H-clust identifies highly correlated sets of SNPs and chooses a SNP within each correlated cluster to represent the cluster. Input data are multilocus genotypes, which are transformed into a per-locus count of the minor alleles (0, 1 or 2). This transformed matrix of multilocus genotypes is then itself transformed to a correlation matrix, from which clusters of SNPs are identified by hierarchical clustering. Within each cluster, the SNP that is most highly correlated with other SNPs in the cluster is chosen as its tag SNP.
We use measured haplotype analyses or MHA  to evaluate haplotype associations with AD. MHA uses inferred evolutionary relationships among haplotypes, specified in the form of a cladogram (an unrooted evolutionary tree), to structure the tests of association. Procedures for MHA are described in Templeton et al.  and Seltman et al. . To perform this cladistic analysis, the cladogram is divided into subgroups (clades): individual haplotypes occurring as leaves (terminal nodes) on the tree represent 0-step clades; 1-step clades are produced by moving backward one mutational step from the 0-step clades toward internal nodes; and then this procedure is repeated to produce the 2-step clades and so forth. For inference, a series of 1 degree-of-freedom tests are performed in a sequential fashion based on the clades, from zero step clades onward. At each step in the algorithm, a full model is fit. The full model is the same within each step, but changes between steps, conditional upon the results of the previous step, with the goal of testing whether clades differ in their impact on phenotype, in this case risk of AD. MHA has been used in a variety of settings [47-50].
Results for MHA are reported in detail only for the molecularly-haplotyped SNPs in APOE. For SNPs outside of APOE, we performed MHA analyses for SNPs occurring in logical clusters, such as genes, and we fit additive logit models based on the count of alleles at the locus of interest; for both kinds of analyses, we ‘conditioned’ on ε4 genotype by entering the count of ε4 alleles as a covariate in the models.
Supported by grants from Alzheimer’s Association (IIRG-03-4750), NIA (AG24486), NIA (AG05136), NIA (AG21544), and NIMH (MH57881).
Electronic Database Information CompGen: Computational Genetics Lab at the University of Pittsburgh, http://wpicr.wpic.pitt.edu/WPICCompGen/
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.