|Home | About | Journals | Submit | Contact Us | Français|
Association methods based on haplotype similarity (HS) can overcome power and stability issues encountered in standard haplotype analyses. Current HS methods can be generally classified into evolutionary and two-sample approaches. We propose a new regression-based HS association method for case-control studies that incorporates covariate information and combines the advantages of the two classes of approaches by using inferred ancestral haplotypes. We first estimate the ancestral haplotypes of case individuals and then, for each individual, an ancestral-haplotype-based similarity score is computed by comparing that individual’s observed genotype with the estimated ancestral haplotypes. Trait values are then regressed on the similarity scores. Covariates can easily be incorporated into this regression framework. To account for the bias in the raw p-values due to the use of case data in constructing ancestral haplotypes, as well as to account for variation in ancestral haplotype estimation, a permutation procedure is adopted to obtain empirical p-values. Compared with the standard haplotype score test and the multilocus T2 test, our method improves power when neither the allele frequency nor linkage disequilibrium between the disease locus and its neighboring SNPs is too low and is comparable in other scenarios. We applied our method to the Genetic Analysis Workshop 15 simulated SNP data and successfully pinpointed a stretch of SNPs that covers the fine-scale region where the causal locus is located.
Association analysis of case-control studies is becoming the primary approach for localizing disease loci, especially for detecting genes with modest effects on a disease. To assess the association between genetic variants and disease, one can either test individual SNPs or the haplotypes of closely linked SNPs. Although studies of their relative efficiency revealed varying conclusions, it is generally appreciated that haplotype-based analyses have greater power when SNPs are in strong multilocus LD with the disease locus (Akey et al., 2001; Nielsen et al. 2004; Zaitlen et al. 2007), and are helpful in identifying rare causal variants (The International HapMap Consortium, 2003; de Bakker et al. 2005). However, the practical potential of haplotype-based analysis may not be fully realized due to difficulties balancing the dimensionality of the haplotypes with the amount of information they contain (e.g., Thomas et al. 2003; Browning, 2006). This high dimensionality can often lead to a loss of power.
Many approaches have been proposed to tackle the dimensionality problem implicit in haplotype-based methods. Most methods can be broadly grouped into three categories: haplotype clustering, haplotype smoothing, and haplotype similarity. Haplotype clustering (Seltman et al. 2003; Molitor et al. 2003a; Durrant et al. 2004; Tzeng, 2005; Tzeng et al. 2006; Browning, 2006) attempts to decrease the dimensionality of the problem by clustering evolutionarily close haplotypes into groups. Haplotype smoothing (Thomas et al., 2001; Thomas et al. 2003; Molitor et al. 2003b; Schaid, 2004; Tzeng and Zhang, 2007) pools information across haplotypes by introducing a correlation structure on the effect of similar haplotypes. Both haplotype clustering and smoothing use measures of haplotype similarity in forming clusters or pooling haplotype effects, but the focus of the analysis is on the clusters or pooled effects. In contrast, haplotype similarity methods (e.g., Houwen et al. 1994; McPeek and Strahs, 1999; Su et al. 2008) focus directly on the similarity metric and look for unusual sharing of chromosomal segments within homogeneous trait groups. In this study, we focus on the haplotype-similarity approach and introduce a method that aims to capture the merits of current similarity methods while also being able to incorporate covariate information.
The rationale behind similarity methods is that, around a causative disease locus, affected individuals are more likely sharing the same haplotype descended from a common ancestor and, hence, will tend to have more similar haplotypes than randomly sampled individuals. Depending on how this concept is implemented, similarity methods can be divided into two categories: evolutionary based approaches and two-sample based approaches. Evolutionary based approaches tend to use cases only and identify regions of increased similarity by comparing to levels expected from a genealogical process under the null (Durham and Feingold, 1997; Service et al. 1999). This approach takes direct advantage of the decay process of haplotype sharing--the driving mechanism for haplotype similarity, but its assumptions concerning evolutionary history can be hard to verify (McPeek and Strahs, 1999; Morris et al. 2002; Morris et al. 2003; Morris, 2005; Su et al. 2008). In contrast, the two-sample based methods use empirical sharing comparisons among case and control haplotypes (Van der Meulen and te Meerman, 1997; Tzeng et al. 2003a; Tzeng et al. 2003b; Allen and Satten 2009). Although the use these empirical sharing comparisons among case and control samples bypasses the need to model the evolutionary process, many of these methods limit similarity calculations to concordant samples (i.e., case-case similarity and control-control similarity) and do not use information obtained from measures of dissimilarity between cases and controls. Using the discordant information has been shown to significantly improve the power of similarity-based tests (Sha et al. 2007; Nolte et al. 2007). Finally, the existing haplotype-similarity methods do not incorporate covariate information. This makes them less attractive for studying complex traits where covariate adjustments can be crucial.
In this article, we address these issues by proposing a new method for analyzing data from case-control studies that combines aspects of both evolutionary and two-sample approaches to haplotype similarity. Our method follows the two-sample framework, while quantifying similarity in a spirit consistent with evolutionary-based approaches. Specifically, we estimate ‘ancestral haplotypes’ of cases using the multilocus decay-of-haplotype-sharing (DHS) model of McPeek and Strahs (1999). We then define the ‘ancestral haplotype similarity (AHS) scores’ as the degree of similarity between these ancestral haplotypes of cases and the haplotypes of an individual. The use of AHS scores allows us to utilize all sample information in association testing. It also makes it straightforward to extend similarity methods from the 2-sample to a regression framework, potentially allowing for both covariates and quantitative traits to be incorporated in the haplotype similarity analysis. Finally, though we refer to ‘ancestral haplotypes’, our method does not rely on the actual reconstruction of the evolutionary history of the sample – we merely use the estimated sequences as summaries of the case haplotypes and use them to calculate individual similarity scores. Thus the validity of our approach is not affected by our ability to reconstruct the actual ancestral haplotypes of the sample.
In the following sections, we illustrate the proposed AHS method, and present simulation results of type I errors and power. The results are compared to the standard haplotype score test of Schaid et al. (2002) and the multilocus T2 test of Xiong et al. (2002). We also applied our method to the simulated SNP data for Rheumatoid Arthritis (RA) from the Genetic Analysis Workshop (GAW) 15 and examined whether this proposed method can detect the DR and D loci that affect the risk of RA.
Our proposed method can be described in the following three major steps. First, we estimate the ancestral haplotypes of cases from the genotypes of case participants using the DHS method of McPeek and Strahs (McPeek and Strahs, 1999; Strahs and McPeek, 2003). Next, the inferred ancestral haplotypes are used as reference haplotypes to compute the AHS scores. The AHS score for an individual is obtained by directly comparing an individual's unphased genotypes to the inferred ancestral haplotypes. Finally, the phenotype of interest is regressed onto the AHS scores and covariates. A significant coefficient of the AHS scores indicates genetic association, as there is a different amount of sharing among the control haplotypes compared to the case haplotypes. The declaration of significance is based on the empirical p-values generated from permutation. Using permutation testing allows us to account for the estimation variation that is induced by the estimated ancestral haplotypes, and to correct the bias introduced by using only the case sample in determining the ancestral haplotypes. Below we describe each of the steps in detail.
We use the DHS method (McPeek and Strahs, 1999; Strahs and McPeek, 2003) to infer the ancestral case haplotypes. The DHS method models the decay process of haplotype sharing for haplotypes descended from certain common ancestors, taking into account recombination, mutation, and background LD. The model allows for multiple origins of case haplotypes, and incorporates haplotype correlation due to shared ancestry by applying a correction factor to the star-shaped genealogy of case haplotypes. The method has been implemented in the software of DHSMAP (Decay of Haplotype Sharing Mapping), and can use phased (haplotype) (McPeek and Strahs, 1999) or unphased (genotype) data (Strahs and McPeek, 2003).
A similarity score for each individual is computed based on the amount of similarity between that individual's haplotypes and the estimated ancestral haplotypes. To tackle the phase unknown problem, we adopt the count statistic of Tzeng et al. (2003a), and measure the level of similarity using the number of matching alleles between an individual’s haplotypes and ancestral haplotype. As shown in Schaid (2004) and Tzeng et al. (2009), such quantity can be obtained by counting the matching alleles between the observed genotypes and the ancestral haplotypes. In the case of multiple estimated ancestral haplotypes, we calculate the overall AHS score by taking a weighted average of the ancestral specific AHS score, with weights dependent on the likelihood of the ancestral haplotypes. Specifically, define Lj the likelihood value obtained from DHSMAP for the ancestral haplotype j, and sij the count of the matching alleles between the genotype of individual i and ancestral haplotype j. The weight for ancestral haplotype j is , and the overall AHS score for person i is .
To test association between haplotypes and disease status, we use logistic regression and write
where Di denotes disease status (=1 for case participants and 0 otherwise), Xi is the design matrix of covariates including an intercept, γ is a vector of parameters, and β is a parameter that governs the effect of AHS on disease status. The null hypothesis of no association corresponds to β=0. To address the bias and extra variability introduced into the procedure by using estimated ancestral haplotypes, we use permutation to obtain empirical p-values.
When covariates divide the sample into a relatively small number of categories, permutation datasets can be obtained by randomly shuffling individuals’ phenotypes within each covariate category, so as to maintain the potential association between the phenotype and the covariates. When the categories formed in this way are sparsely populated, or when covariates are continuous, the stratification score approach of Epstein et al. (2007) can be used to form (typically five) strata based on the quantiles of 0 Xi, where 0 is the estimator of γ obtained by fitting (1) with β = 0. Permutation can then be carried out with these strata. Alternatively, a parametric bootstrap procedure can be used to generate data under the null hypothesis. First, obtain 0as before. Next, define Gi the genotype of subject i, and sample (Xi,Gi) with replacement from the original data. Finally, use (1) with β = 0 and γ = 0 to generate Di. Repeat this sampling process until the required number of cases and controls is generated.
For each permutated dataset, we repeat the entire procedure of estimating the ancestral haplotypes, computing similarity scores and fitting the regression model. Empirical p-values are computed as the proportion that the absolute values of β estimates in the permutation datasets are greater than the absolute value of β estimate in the original dataset. All analysis procedures are implemented using R software (http://www.r-project.org/).
Simulations are performed to evaluate the power and type I error of the AHS method. Haplotype data are generated based on the HapMap 2 CEU population. There are in total thirty trio families in the original CEU data set. To create an unrelated random population, we select two parents from each family to form an unrelated sample pool and randomly draw cases and controls from this pool with replacement. We focus on the shortest chromosome 22 to facilitate data processing.
We consider six simulation scenarios, as listed in Table 1, with different minor allele frequency (MAF) of the causal variant (0.1 and 0.4) and different LD between the causal variant and its surrounding SNPs (high for average R2 > 0.8, moderate for average R2 ≈ 0.5, and low for average R2 < 0.2). Given a causal locus, the three neighboring markers (one marker to its left and two to its right) form a haplotype of size 3 (Figure 1). We then randomly draw two haplotypes with replacement and convert it to the unphased genotype for an individual. The observed genotype of an individual does not include the causal locus, though the trait value is determined by the genotype of the causal marker using the logistic model: Logit [ Pr ( D=1 | G, E ) ] =γ0 + γ1 × E + β × G, where D is the disease status coded as 0 and 1; E is for a binary covariate; G is the genotype of the causal marker coded as 0, 1 and 2. We randomly generate the value for covariate E from Bernoulli distribution with probability 0.5. Then, conditional on G and E, we use the logistic formula given above to determine the disease status of an individual. We repeat this process until obtain the desired sample size (100 cases and 100 controls or 200 cases and 200 controls). The logistic model parameters are set as γ1 = 1.0 and γ0 = −4 which corresponds to a population disease prevalence of 0.03. For type I error analysis, β is set to be 0 = ln(1.0), and for power analysis, β = ln(1.5). For each simulated dataset, we obtain the empirical p-value by analyzing 104 permutation datasets. In the power analysis, we compare the AHS method to the standard haplotype score test of Schaid et al. (2002) and the multilocus T2 test of Xiong et al. (2002). The McNemar’s test (McNemar, 1947) is used to examine whether the power difference is significant between the proposed method and the two benchmark methods.
The Genetic Analysis Workshop (GAW) has provided several data sets biannually since 1982. In 2006, GAW15 provided the simulated Rheumatoid Arthritis (RA) SNP data which includes the HLA DR locus and the D locus on chromosome 6. Both loci directly affect the risk of RA but Locus D has a low minor allele frequency of 0.0083. According to the simulation information provided by GAW, smoking status is an important covariate which affects RA risk.
We select one child from each affected sib-pair family to create a case pool. We randomly select N individuals from this case pool and N controls from all the controls in the simulated data. We consider N = 100 and 200 in the analysis. Among all the SNPs located on chromosome 6, we select an 8cM region containing 103 SNPs covering the HLA DR locus located at 49.45cM and D locus located at 54.57cM. We use three-SNP sliding windows of haplotypes to scan this region. For each sliding window, we estimate ancestral haplotypes of the cases using DHSMAP and calculate the AHS score for each individual. We then regresse the disease status on the ASH score, with smoking status included as a covariate in the regression. We obtaine the empirical p-value by 106 permutations.
Table 2 displays the type I error rates of the AHS tests which are calculated on the basis of 1000 simulation replicates having either 100 cases and 100 controls or 200 cases and 200 controls. The values in Table 2 are close to the nominal level 0.05, indicating that our permutation procedure controls type I error. The power analysis for the nominal level 0.05 is shown in Table 3 and Table 4; the results are obtained on the basis of 500 simulation replicates each having either 100 cases and 100 controls, or 200 cases and 200 controls. As expected, if the disease locus in low LD with its nearby markers and the sample size is small, there is no statistical power to detect the association for each method, and hence the power is around the nominal level. If the disease locus is in moderate or high LD with surrounding SNPs, we notice that the power performance depends on the disease allele frequency. When compared to the standard regression analysis, the AHS method exhibits no significant power difference if the disease allele frequency is low, although the power figures of the standard regression are often slightly higher. If the disease allele frequency is high, the power of the AHS method is significantly higher. A similar pattern is observed when compare the AHS method to the T2 test, except that the AHS method has a significantly or slightly higher power across different LD levels and disease allele frequencies. As the T2 test can not adjust for covariates, such power improvement of the AHS method may be from its use of a smaller degrees of freedom and its incorporation of the covariates in the analysis.
Using a dataset having 100 cases and 100 controls, the AHS method identifies a fine-scale region of 1cM containing 6 SNPs that are significantly associated with RA disease status. Figure 2 shows the profile of empirical p-values over chromosomal locations. The empirical p-values peak near 49.45cM where the true causal HLA DR locus lies. We also plot p-values from the score test in Figure 2, showing that both methods are capable of detecting the HLA DR signal.
Using a dataset having 200 cases and 200 controls, the AHS method identifies two very close regions. One region is exactly the same region as described above. The other one is a 1cM region containing 8SNPs that is highly linked with the first region. However, for the sample sizes we consider, neither methods detect the second disease loci, the D locus, located 5.12cM downstream from the DR locus on chromosome 6. This may be due to the extremely low minor allele frequency of the D locus (0.0083).
In this work, we propose a new haplotype similarity method that combines the evolutionary and two-sample (statistical) approaches to association analysis using haplotypes. Our analysis is based on reconstruction of ancestral haplotypes of cases. For this reason, our approach utilizes the population genetic insight that, if a susceptibility locus is recent, then case participants should share a more recent common ancestor than the control participants. We then use the similarity between an individual’s haplotypes and the case ancestral haplotypes in a regression formulation, which gives a simple testing scheme and also allows us to account for covariates that may affect disease status. Because of the bias introduced by constructing ancestral haplotypes among case participants only, and to account for the variability in the reconstruction of the ancestral haplotypes, a permutation procedure is used to evaluate significance. We have shown that our procedure has the proper size, and also has improved power compared with the standard haplotype analysis and the multilocus T2 test. We define the similarity score based on the number of matching alleles between compared haplotypes, which can be calculated using unphased genotype data, thereby avoiding the estimation of haplotype phase for each individual.
Unlike many haplotype sharing approaches that use a two-sample testing framework with similarity scores of pairwise comparisons, we use a regression model with similarity scores of single individuals. We note here that a regression approach based on pairwise comparison data is also possible (Tzeng et al. 2009), in which the focus is the study of the correlation between the trait similarity and haplotype similarity. This approach also allows for covariates. It may be of interest to understand the relative efficacy of method of this type and the AHS methods. These different features also suggest how to choose a similarity based method in practice. Most of the two-sample based approaches are easy to compute, suitable for binary traits, but often impractical to adjust for covariates. The proposed AHS method is suitable for binary traits, can incorporate covariates in the analysis, but is relatively more computational intensive. Finally the gene-trait similarity regression of Tzeng et al. (2009) is applicable to quantitative traits or binary traits, can model covariates, but also has higher computational cost than the two-sample based approaches.
In our proposed method we use the decay of haplotype sharing method (DHSMAP) to infer the ancestral haplotypes of cases. However, concerns may arise as to whether the estimated ancestral haplotypes are actually the true ancestral disease haplotypes, especially for complex diseases whose etiology tends to involve complicated interplay between genetic and environmental factors. However, what the AHS method requires is that the estimated ancestral haplotypes be representative of the case haplotypes, so that the similarity scores of each individual reflect similarity to case haplotypes. Further, because we use permutation to ensure proper size, any changes in choice of ancestral haplotypes will only affect the power of the test, not its size. We have been able to observe power gain using the ancestral disease haplotypes obtained from DHS, and further power improvement is possible by using more precisely reconstructed ancestral disease haplotypes.
The authors thank the GAW grant, R01-GM031575, for providing the GAW 15 simulated data to be used in this study. Support for generation of the GAW 15 simulated data is provided from NIH grants 5RO1-HL049609-14, 1R01-AG021917-01A1, the University of Minnesota, and the Minnesota Supercomputing Institute. Y.L. is partially supported by NSF DMS 0504726. J.Y.T is partially supported by NSF DMS 0504726 and NIH R01MH084022-01A1.
Disclaimer: The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention.