We have developed a powerful gene-based association test that combines information from nearby markers to test genetic association for both quantitative and binary traits. Our method is based on optimal aggregation of information from genotyped markers in a gene. Since different markers often associate with the trait locus at different levels, to appropriately apportion their contributions, we assigned a weight to each marker that is proportional to the amount of information it captures about the underlying trait locus. We described a weighting scheme that allows the weights to be estimated from a reference dataset such as the HapMap data, other publicly available dense SNP data, or resequencing data for a subset of the study subjects. The virtue of our test lies in its ability to borrow strength from nearby markers while reducing the degrees of freedom. Through simulations and real data analysis based on a wide range of LD structures, we demonstrated that it can increase the power to detect genetic association compared with several commonly used multi-point association tests.
Our weighting scheme shares some similarity with recently proposed imputation-based tests in that these approaches use external LD information (Li et al.; Marchini et al.; Nicolae, 2006). Unlike the imputation-based methods, which test each marker individually, our method tests the whole gene as a unit and thus can borrow strength from adjacent markers while reducing the degrees of freedom.
We propose to assign weights to markers based on pairwise LD coefficients estimated from a reference dataset. Alternatively, one could consider using high-dimensional LD information among all markers. For example, following our analytical derivations in Section 2, given genotypes at m markers, for a quantitative trait Y, it can be shown that
Therefore, one could define a ‘predictive score’ S_T for each subject, where the high-dimensional LD information P can be estimated from a reference dataset. This is similar in spirit to the methods proposed by Nicolae (2006) and Zaitlen et al. However, despite the increased computational complexity, our preliminary results indicate that such an analysis does not offer a power gain over our approach, probably because estimates of multi-marker LD suffer from greater data sparseness and higher variability than pairwise LD measures, especially when the estimates are based on a relatively small sample such as the HapMap. In addition, as shown by our simulation results, TUNA (Nicolae, 2006) is less powerful than ATOM in the situations that we considered.
Our approach is quite general and can be applied to a wide range of applications including both candidate-gene and genome-wide association studies. For candidate-gene studies, one could use our test to obtain a global P-value for a gene, which can be used to facilitate comparisons across studies that have genotyped different sets of markers in the same gene. For genome-wide association studies, with increased marker density in future chips, single-marker analysis might suffer from severe over-correction for multiple comparisons. An alternative approach might be to divide the genome into regions, screen the genome first by multi-point association tests such as our method, and then follow up significant regions through more thorough analysis. In such an approach, the total number of tests during the screening phase would remain the same regardless of how densely the genotyping was carried out.
We recognize that our method relies on permutations for significance assessment, which is time-consuming for genome-wide association studies. However, this problem can be addressed by dividing the genome into smaller subsets and running our method for each subset on a node in a high-speed computing cluster. Moreover, we can adopt an inverse-sampling method for empirical P-value estimation based on the procedures described by North et al. and Hauser et al. This procedure is based on the Poisson approximation to the binomial distribution for small P-values. In this procedure, we can set a high maximum number of permutations but reach that maximum only for small P-values, thus increasing computational efficiency.
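The early-stopping idea can be illustrated with a minimal sketch. It assumes a generic sequential stopping rule: permutation stops once a fixed number of permuted statistics reach the observed value, so only small P-values consume the full permutation budget. The exact estimators of North et al. and Hauser et al. may differ. The function names, the toy statistic `max_sq_corr`, the simulated data and the default `min_exceed=10` are all hypothetical choices made for illustration and are not part of our method.

```python
import numpy as np

def max_sq_corr(genotypes, trait):
    """Toy gene-level statistic (an assumption, not the test statistic of the paper):
    the largest squared correlation between any marker and the trait."""
    g = genotypes - genotypes.mean(axis=0)
    y = trait - trait.mean()
    num = (g.T @ y) ** 2
    den = (g ** 2).sum(axis=0) * (y ** 2).sum()
    # Monomorphic markers have zero variance; assign them a statistic of 0.
    r2 = np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den > 0)
    return float(r2.max())

def adaptive_permutation_pvalue(stat_fn, genotypes, trait,
                                min_exceed=10, max_perm=100_000, seed=0):
    """Permute the trait until `min_exceed` permuted statistics reach the
    observed one, or until `max_perm` permutations have been run.
    Returns (empirical P-value, number of permutations used)."""
    rng = np.random.default_rng(seed)
    observed = stat_fn(genotypes, trait)
    exceed = 0
    for n_perm in range(1, max_perm + 1):
        if stat_fn(genotypes, rng.permutation(trait)) >= observed:
            exceed += 1
            if exceed == min_exceed:
                # Stopped early: the P-value is clearly not small.
                return exceed / n_perm, n_perm
    # Budget exhausted: use the usual (hits + 1) / (permutations + 1) estimate.
    return (exceed + 1) / (max_perm + 1), max_perm

# Simulated example data (hypothetical): 200 subjects, 5 markers.
rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.3, size=(200, 5))
null_trait = rng.normal(size=200)
assoc_trait = 2.0 * genotypes[:, 0] + rng.normal(size=200)

p_null, n_null = adaptive_permutation_pvalue(max_sq_corr, genotypes, null_trait, max_perm=2000)
p_assoc, n_assoc = adaptive_permutation_pvalue(max_sq_corr, genotypes, assoc_trait, max_perm=2000)
print(f"null trait:       P = {p_null:.4f} after {n_null} permutations")
print(f"associated trait: P = {p_assoc:.4f} after {n_assoc} permutations")
```

Under this rule a gene with no signal typically stops after a modest number of permutations, whereas a strongly associated gene uses the full budget, which is where the computational saving comes from.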
Our method assumes that the reference dataset, which is used for estimating the weights, has an LD structure similar to that of the study sample. Although this assumption is not completely realistic, various studies have demonstrated genetic similarities across different populations (Conrad et al.; Willer et al.). In general, we would recommend that investigators first compare the LD patterns between the study sample and the reference dataset, and use our proposed method only when the LD patterns are similar.
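One simple way to carry out such a comparison is sketched here, under the assumption that genotypes are coded as 0/1/2 allele counts and that LD is summarized by the squared correlation between genotype counts at each pair of markers. The function names, the two summary measures and the simulated data are illustrative only. We do not prescribe a cutoff for 'similar', so the judgment is left to the investigator.

```python
import numpy as np

def pairwise_r2(genotypes):
    """Squared Pearson correlation between allele counts for all marker pairs
    (rows are subjects, columns are markers). Assumes no monomorphic markers."""
    return np.corrcoef(genotypes, rowvar=False) ** 2

def ld_similarity(study, reference):
    """Compare pairwise LD over the same markers in two samples.
    Returns (correlation of r^2 values, mean absolute difference in r^2)."""
    upper = np.triu_indices(study.shape[1], k=1)  # each marker pair once
    a = pairwise_r2(study)[upper]
    b = pairwise_r2(reference)[upper]
    return float(np.corrcoef(a, b)[0, 1]), float(np.mean(np.abs(a - b)))

def simulate_genotypes(n, m, rng, freq=0.4, copy_prob=0.8):
    """Hypothetical simulator: each allele copies its neighbour with
    probability `copy_prob`, which induces LD that decays with distance."""
    hap = np.empty((2 * n, m), dtype=int)
    hap[:, 0] = rng.random(2 * n) < freq
    for j in range(1, m):
        fresh = rng.random(2 * n) < freq
        copy = rng.random(2 * n) < copy_prob
        hap[:, j] = np.where(copy, hap[:, j - 1], fresh)
    return hap[:n] + hap[n:]  # pair haplotypes into 0/1/2 genotypes

rng = np.random.default_rng(0)
study = simulate_genotypes(500, 6, rng)      # stand-in for the study sample
reference = simulate_genotypes(120, 6, rng)  # stand-in for a small reference panel

r2_corr, r2_mad = ld_similarity(study, reference)
print(f"correlation of pairwise r^2: {r2_corr:.3f}")
print(f"mean absolute difference:    {r2_mad:.3f}")
```

A high correlation together with a small mean absolute difference suggests that weights estimated from the reference panel are likely to carry over to the study sample.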
Although our method was developed for analysis of markers within a gene or a region, it is readily extendable to pathway-based analysis (Wang et al.). For example, if the genes in the same pathway increase the disease risk additively, then we can calculate the scores for markers within each gene as described in Section 2 and then obtain the principal components across all genes in the same pathway. We can then test for association between the pathway-based principal components and the trait of interest. Our simulation results for two-locus models demonstrate that the proposed approach has the potential to perform well when multiple disease variants are present.
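The pathway-level extension could be sketched as follows. This is a rough illustration under several assumptions: the per-gene scores are taken as given (here simulated, one column per gene, rather than computed as in Section 2), components are retained until they explain 90% of the variance, and the R-squared from regressing the trait on the retained components serves as the test statistic, with significance assessed by permutation. None of these specific choices is prescribed by our method, and all function names are hypothetical.

```python
import numpy as np

def pathway_pcs(gene_scores, var_explained=0.9):
    """Principal components of the subject-by-gene score matrix, keeping the
    smallest number of components that reaches `var_explained` (assumed cutoff)."""
    centered = gene_scores - gene_scores.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(frac, var_explained)) + 1
    return u[:, :k] * s[:k]

def r_squared(pcs, trait):
    """R^2 from the least-squares regression of the trait on the components."""
    x = np.column_stack([np.ones(len(trait)), pcs])
    beta, *_ = np.linalg.lstsq(x, trait, rcond=None)
    resid = trait - x @ beta
    total = trait - trait.mean()
    return float(1.0 - (resid @ resid) / (total @ total))

def pathway_permutation_pvalue(gene_scores, trait, n_perm=999, seed=0):
    """Permutation P-value for association between pathway components and trait."""
    rng = np.random.default_rng(seed)
    pcs = pathway_pcs(gene_scores)
    observed = r_squared(pcs, trait)
    hits = sum(r_squared(pcs, rng.permutation(trait)) >= observed
               for _ in range(n_perm))
    return float((hits + 1) / (n_perm + 1))

# Simulated example (hypothetical): 300 subjects, 4 genes acting additively.
rng = np.random.default_rng(2)
gene_scores = rng.normal(size=(300, 4))
trait = 0.5 * gene_scores.sum(axis=1) + rng.normal(size=300)

pcs = pathway_pcs(gene_scores)
p_pathway = pathway_permutation_pvalue(gene_scores, trait)
print(f"components kept: {pcs.shape[1]}, pathway P-value: {p_pathway:.3f}")
```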
In summary, we have developed a novel multi-marker association test by optimally weighting genotyped markers using LD information from a reference dataset. The standard approach for detecting genetic association has been the single-marker approach, which assesses the marginal effect of each marker separately. But this strategy may not be the most powerful if each marker only contributes small to moderate amounts of association information or if there is allelic heterogeneity. By optimally weighting the genotyped markers, our method efficiently captures association signals in the region and thus improves the power of detecting association. With the wide application of large-scale genotyping in current genetics studies, we believe that our method will provide a powerful multi-marker approach to identifying disease loci.
Funding: National Institutes of Health (grant R01HG004517, to M.L. and C.L.).