PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of biostsLink to Publisher's site
 
Biostatistics. 2016 April; 17(2): 221–234.
Published online 2015 September 28. doi:  10.1093/biostatistics/kxv035
PMCID: PMC4834946

Group association test using a hidden Markov model

Summary

In the genomic era, group association tests are of great interest. Due to the overwhelming number of individual genomic features, the power of testing for association of a single genomic feature at a time is often very small, as are the effect sizes for most features. Many methods have been proposed to test association of a trait with a group of features within a functional unit as a whole, e.g. all SNPs in a gene, yet few of these methods account for the fact that generally a substantial proportion of the features are not associated with the trait. In this paper, we propose to model the association for each feature in the group as a mixture of features with no association and features with non-zero associations to explicitly account for the possibility that a fraction of features may not be associated with the trait while other features in the group are. The feature-level associations are first estimated by generalized linear models; the sequence of these estimated associations is then modeled by a hidden Markov chain. To test for global association, we develop a modified likelihood ratio test based on a log-likelihood function that ignores higher order dependency plus a penalty term. We derive the asymptotic distribution of the likelihood ratio test under the null hypothesis. Furthermore, we obtain the posterior probability of association for each feature, which provides evidence of feature-level association and is useful for potential follow-up studies. In simulations and data application, we show that our proposed method performs well when compared with existing group association tests especially when there are only few features associated with the outcome.

Keywords: Finite mixture model, Genome-wide association study, Modified likelihood ratio test

1. Introduction

With the fast growth in high-throughput sequencing technologies, identifying important features that are associated with a disease trait in a genome-wide study has become possible. Notable examples include rare variant association in human genome sequencing studies (Morgenthaler and Thilly, 2007; Morris and Pan, 2010; Wu and others, 2009), and sieve analyses in HIV vaccine trials comparing viral sequences between vaccines and placebo recipients (Rolland and others, 2012). Due to the overwhelming burden of multiple testing, testing for association between disease and individual features often lack power. To aggregate individual weak associations and reduce the number of tests, numerous methods have been proposed to perform a global association test within a functional unit, e.g. a gene or a pathway. Motivated by, though certainly not limited to, rare-variant association testing, we propose in this paper a novel group association test named “hidden Markov model variable detection method” (HMVD).

Existing association detection methods belong to two general classes: one class is to aggregate individual summary statistics like p-values or Z statistics for each feature. For example, Fisher (1932) construct test statistics based on sum of the p-value for the each feature. Tippett (1931) tests based on minimum of the p-value's for all features. More recent work along includes Chen and others (2012) and Cheung and others (2012) among others. For example, SigmaP (Cheung and others, 2012) tests for association by constructing test statistics using a weighted linear combination of the individual p-values.

The other class involves multiple regression models for the trait and all features together, with different levels of constraints to curb the number of parameters. One group of such tests, called Burden tests, simply collapses multiple features into one explanatory variable. For example, the cohort allelic sum test (Morgenthaler and Thilly, 2007) collapses variants within one region to a new indicator variable, which is one if a subject has at least one rare variant among that region, and zero otherwise. Along this line, Morris and Pan (2010) proposed to count the number of rare variants within a region and Li and Leal (2008) proposed the combined multivariate and collapsing test. These burden tests usually works well if a large portion of the variants are associated with the outcome. Wu and others (2009) proposed the sequence kernel association test (SKAT) by testing for the heterogeneity of single-nucleotide polymorphism (SNP) associations. They assumed that the SNP associations follow a normal distribution with zero mean, and tested for whether the distribution of associations has zero variance. SKAT is more powerful than the Burden tests when a small portion of variants are associated with the outcome or when the association of the SNPs go in both directions. Lee and others (2013) combined SKAT and Burden tests and introduced an “optimal” test that works well under both of the two aforementioned settings.

One limitation of those aforementioned methods is that they do not explicitly distinguish associated features from functionally neutral features. Thus, in settings with a substantial portion of neutral features, they may not be optimal in power performance. Recently, Logsdon and others (2014) proposed a variational Bayes discrete mixture (VBDM) method that incorporates the mixture of associated and neutral variants in the modeling. However, due to the approximation nature of the variational Bayes method, it does not work for dichotomous traits and logistic regression. Other authors have explored model selection techniques in group association tests, including a stepwise variable-selection method (Hoffmann and others, 2010) and Bayesian model selection (Capanu and Begg, 2011; Liang and Xiong, 2013).

In this paper, we propose a two-step procedure to conduct group association tests: first we summarize evidence of individual associations using equation M1 statistics or, equivalently, the observed association in a generalized linear model. We next model each feature-level association by a latent indicator variable indexing whether the feature is associated with the outcome. The sequence of estimated individual associations are described probabilistically by a hidden Markov model (HMM). This HMM allows for a flexible structure on the sequence of hidden states yet it is computationally manageable through a modified Baum–Welsh algorithm.

The primary goal of the proposed method is to test the global null hypothesis that there is no association between the set of features and the outcome. Under the alternative hypothesis, each equation M2 statistic or estimated association follows a mixture of two normal distributions: one is centered at zero (the null), the other is centered at the alternative. The mixture reduces to a single normal distribution under the null, and therefore we aim to test whether the number of mixture components is one. A common issue when testing for the number of mixture components is that the model is non-identifiable under the null (Lo and others, 2001). Let equation M3 be the proportion of the second components and equation M4 be the parameter for the second component. Then under the null when equation M5, equation M6 will be non-identifiable. As a result, the standard likelihood ratio test does not have the usual chi-square distribution. Chernoff and Lander (1995) and Sen and Ghosh (1985) among others find that the asymptotic distribution of the LRT involves the supremum of a Gaussian process. Recently, Chen and others (2001) proposed to add a equation M7 term to the log-likelihood function to keep equation M8 away from 0 or 1, where equation M9 is a user defined positive constant. They show that the parameters are consistently estimated under the null. In our method, we apply a similar approach by adding a equation M10 term to the log-likelihood function, so the support of p is limited to equation M11. Correlated equation M12 statistics are modelled by a hidden Markov chain, so that the proportion of null features and posterior probabilities can be computed by a extension of the Baum–Welsh algorithm. For hypothesis testing, we propose a modified likelihood ratio test ignoring the higher order correlation between features with the aforementioned penalty term. We show that the asymptotic distribution under the null hypothesis follows a equation M13 distribution.

Another contribution of our method is its “model selection” feature. On top of its ability to test for overall association, HMVD also provides a posterior probability for each feature, summarizing how likely it is marginally associated with the outcome. This property is desirable since it allows one to identify important features amongst a pool of candidate features, thereby providing guidance for follow-up studies.

2. Methods

In this paper, we focus on testing whether a set of equation M14 predictors (or features) (equation M15) is related to an outcome variable equation M16 while being adjusted for extra confounding variables (covariates) equation M17. For equation M18 independent subjects, let equation M19 be an equation M20 vector, equation M21 be an equation M22 matrix, and let equation M23 be an equation M24 matrix. We assume that our samples are obtained either from a case–control study or a cohort study; equation M25 can be either a dichotomized disease variable or a continuous response. The traditional way to test association of a group of features is to fit a multiple regression model, equation M26, where equation M27 is the coefficient for equation M28, equation M29 is the coefficient for equation M30. The usual F/equation M31 statistic to test equation M32 will have limited power when equation M33 is large.

Rather than fitting one model for all equation M34 features, which can be cumbersome when equation M35 is large, we fit a separate model for each feature:

equation M36
(2.1)

where equation M37 is the equation M38th column of equation M39. For the equation M40th feature, the estimated association equation M41 and its estimated variance equation M42 can be obtained through model (2.1). We assume that within a set of features, only some of these features are associated with the outcome. Let equation M43, equation M44, be an indicator whether feature equation M45 is associated with the outcome, and assume all associated features share the same effect size equation M46, we have

equation M47
(2.2)

where equation M48. Note that without any rescaling of the features, equation M49 as equation M50 increases. The assumption that all associated features have the same effect size may seem restrictive. However, we note that in many applications where all features are jointly tested these features are “similar”, e.g. they are all sequence variants within the same gene that may disrupt the function of that gene in a similar manner. As such, we believe that this assumption can be a reasonable approximation. We note that it is also possible to a priori rescale some of the features equation M51, for example to give them the same variance, or to incorporate some prior believes about effect sizes. In our simulation study, we will see that our method maintains correct control of the type I error and has good power, even if the assumption of the same effect size is not satisfied.

Our testing procedure has two parts: the “generic” regression model that relates equation M52 to equation M53 and equation M54 (2.1), and the HMM that relates the sequence of estimated association equation M55 (2.2). The goal here is to develop a test for association of a group of features using the second model. For this test to be valid, the actual form of (2.1) can be flexible: we require only an estimate equation M56 for equation M57 features and their estimated covariance matrix. We can also apply our procedure to a set of test statistics equation M58, equation M59, specifying equation M60, for equation M61. Such “weighted version”, can be obtained when each feature equation M62 is divided by the (unweighted) equation M63. For the weighted version, model (2.2) gets replaced by equation M64 where the covariance matrix of the equation M65 equals the correlation matrix for equation M66. In the remainder of this section, we show results for the unweighted version of our procedure. The derivations for the weighted version are completely analogous and therefore omitted.

Set equation M67 and let the corresponding estimating equation be equation M68, which is a function of equation M69. Then the joint distribution of equation M70 is

equation M71

where equation M72 is the last diagonal element of equation M73. Denote the variance by equation M74; its estimate equation M75 can be estimated from the data.

2.1. Hidden Markov model

To describe the likelihood of observed data, we need to impose some structure on the sequence of latent variables equation M76's to use (2.2) for group testing. The simplest structure is to assume that the equation M77's are independent, and that each follows a Bernoulli distribution with equation M78: that is, each feature has an equal “chance” of being associated with equation M79, independent of the other features. In this paper, we assume that the equation M80 follow a stationary Markov chain with transition matrix equation M81, equation M82, where equation M83 and equation M84. We assume that the features can be ordered in a meaningful way, and that whether one feature is associated with the response may be informative on whether the next feature is associated with the response. For example, for genetic data, there is a natural order of variants along the chromosome. The Markovian structure implies that for neighboring SNPs, if one is associated with the outcome the other one is more likely to be associated with the outcome. We note that when equation M85, the Markov chain model is equivalent to the simple Bernoulli model with equation M86.

We define the emission probabilities, i.e. the conditional distribution of equation M87 given equation M88 and the hidden states up to equation M89 (equation M90) as equation M91 Let equation M92 and equation M93. When equation M94 is sufficiently large, equation M95 is a multivariate Gaussian distribution given all the hidden states. We obtain the likelihood of equation M96 by summing over all hidden states: equation M97 where equation M98. Contrary to the usual first-order HMMs, the emission probability of a feature depends on the hidden states of the feature itself and all prior features. Therefore, the full likelihood of equation M99 involves summation over all possible combinations of the prior hidden states for each equation M100, which is computationally infeasible. Some simplification is needed. We consider two strategies for accounting for the dependency between features: for the hypothesis testing purpose, we develop a modified likelihood ratio test that ignores the higher order dependency and we prove that the resulting test is still valid. For estimating the posterior probability of an individual feature, we use a modified Baum–Welsh algorithm that reduced the dependency in emission probabilities to the first order only, thereby substantially reducing the burden of computation.

We formulate a model similar to the standard hidden Markov model (HMM) with one key difference. In the standard HMM, all observations equation M101 are assumed to be conditionally independent, that is, it is assumed that equation M102. These assumptions are usually not satisfied in our setting, since the effects from neighboring features (e.g. SNPs) are likely to be correlated. We generalize the HMM by allowing first-order conditional dependence between neighboring features and ignoring any higher order correlations.

In particular, we assume that equation M103 where equation M104, equation M105 and equation M106. For simplicity, we define equation M107 for equation M108, equation M109 and equation M110. Then the likelihood function, by ignoring higher order conditional correlations, can be expressed as

equation M111
(2.3)

2.2. Hypothesis testing

The primary goal of this paper is group testing for association. Assume that the Markov chain equation M112 is stationary, then equation M113. Testing for association between outcome and features is equivalent to testing for the composite null hypothesis

equation M114

This set-up has two complications. First, the parameters are not identifiable under equation M115; when equation M116, equation M117 vanishes in the likelihood, and when equation M118, equation M119 vanishes. Second, there are no existing results about the distribution for the likelihood ratio test statistics under the HMM setting described here. To eliminate the identifiability problem, one approach is to add a small penalty term, that pushes one component of the null parameters away from its boundary value (e.g. Chen and others, 2001; Fu and others, 2009). In the HMM setting, tests based on the marginal distribution of equation M120, ignoring dependency, have been proposed (Dannemann and Holzmann, 2008). The rational for this approximation is that the number of states in the HMM is equivalent to the number of components for each equation M121. Under equation M122, when the HMM only has one state, marginally equation M123 follows a normal distribution with mean 0. Under the alternative, when the HMM has two states, marginally equation M124 follows a mixture of four normal distributions, with the mixture probabilities being equation M125, equation M126, equation M127, and equation M128.

In our method, equation M129 has a distribution that is a mixture of four normal distributions:

equation M130
(2.4)

Here equation M131 denotes the normal probability density function evaluated with mean equation M132 and variance equation M133. Then the likelihood function for equation M134, ignoring higher order dependency, is

equation M135
(2.5)

We propose the following modified log-likelihood function:

equation M136
(2.6)

and we construct a likelihood ratio test based on this expression.

As discussed in Chen and others (2001), the best choice of the tuning parameter equation M137 depends on the model. In our simulation study we find that the value of equation M138 affects both the type I error and the power: setting equation M139 too large will reduce the power while setting equation M140 too small will lead to some inflation of the type I error. We found that equation M141 yields a good balance between type I error and power.

Interestingly, maximizing (2.6) will still give us a consistent estimator of equation M142. Related results on the consistency for MLEs while assuming independence have been discussed in Chandler and Bate (2007) and Zhou and others (2001).

Note both equation M143 and equation M144 are of order equation M145, it is natural to make the following transformation and rewrite the log-likelihood function. Define equation M146, equation M147, and equation M148. Then equation (2.4) can be written as equation M149 and the modified log-likelihood function can be rewritten as

equation M150
(2.7)

Now define the modified likelihood ratio test (MLRT) statistics as

equation M151
(2.8)

where equation M152 is the covariance matrix for equation M153, equation M154. equation M155 is the sum of all diagonal elements of equation M156 and equation M157 is the sum of all elements in equation M158. Then the following two theorems show the consistency of the estimator of equation M159 and the asymptotic distribution of equation M160 under the null hypothesis.

Theorem 2.1. —

If Conditions 1–5 (given in Supplementary material available at Biostatistics online) hold, the maximum modified likelihood estimator is consistent under the null.

Theorem 2.2. —

If Conditions 1–5 hold, then under the null hypothesis the modified likelihood ratio test statistics defined in equation (2.8) follows a equation M161 distribution as equation M162 and equation M163.

The proofs of the theorems are given in Supplementary material available at Biostatistics online. For data with limited sample size and small number of features, we observed small inflation of type I errors. We refer to the p-values that are obtained using Theorem 2.2 as approximate p-values. In cases where accurate p-values are needed, we propose the following adaptive permutation strategy. For each test, we obtain equation M164 and equation M165 as described in Section 2 and the approximate p-value equation M166. Under the null hypothesis, equation M167 follows a multivariate normal distribution with mean equation M168 and covariance equation M169, denoted by equation M170. Given a significance level equation M171, the adaptive permutation tests proceeds as follows. The first step is to determine whether a permutation test is needed. If equation M172 is large enough compared with equation M173, then there is no need to conduct a permutation test. In both the power comparison and type I error calculation in Section 3, we do not initiate the permutation test if equation M174. Otherwise, we set the number of permutations equation M175, where equation M176 is the greatest integer that is less than equation M177 and equation M178 is a user defined parameter. Throughout the paper, we set equation M179 to be 10. For permutation equation M180, we generate equation M181 from equation M182 and calculate the corresponding approximate p-value equation M183, equation M184. Then a test is claimed to be significant at level equation M185 if equation M186. The proposed adaptive permutation procedure is similar to the method introduced in Besag and Clifford (1991).

2.3. Estimation

Recall that the likelihood function based on the HMM structure, while ignoring higher order dependency, is equation M187. To deal with the composite null issue discussed in the previous section, we add a penalty term to the above likelihood function and obtain the following penalized version:

equation M188
(2.9)

We can estimate equation M189 by maximizing expression (2.9). Maximization can be done using a modified Baum–Welsh algorithm. A detailed description of this algorithm is given in Supplementary material available at Biostatistics online.

3. Simulation study

In this section, we conduct simulation study to evaluate the performance of the proposed HMVD method in settings with continuous and binary features and in settings with continuous and binary outcomes.

3.1. Type I error control for finite sample

We perform HMVD tests under the null hypothesis and calculate type I error rates in various settings. For each setting, the empirical type I error rate is obtained based on independent simulated datasets. Sample sizes are equation M190 and equation M191. The outcome equation M192 is generated either from a standard normal distribution or from a Bernoulli distribution with equation M193.

For continuous features, we generated 20 and 40 features with two different covariance matrix structures: independent features or features with an “AR1” correlation structure equation M194. The marginal variance of each feature was one for all simulations.

For discrete features, we generate genetic data mimicking real SNP data. We use HAPGEN2 (Spencer and others, 2009) and the CEU sample from the Hapmap project (data release #24) to generate genotypes of the Von Willebrand Factor (VWF) gene on chromosome 12. To generate rare variants, we only select those variants with minor allele frequency (MAF) equation M195. We select the first 20 or 40 variants to be consistent with the setting for continuous features.

For each simulation, we calculate the proportion of times the group of features is declared as significantly associated with the outcome at three thresholds (equation M196, equation M197, and equation M198). For equation M199, we generated equation M200 simulated datasets and for equation M201 and equation M202, we generated equation M203 simulated datasets to test for the type I error rates. Results for independent features and discrete features are given in Tables Tables112. Overall, the type I error rates are well controlled. The p-values are conservative for small sample size when outcome is binary and features are generated based on VWF gene. This is due to the conservative standard error obtained using logistic regression for small sample size when MAFs are small. Results for features generated from “AR1” correlation structure are given in Supplementary material available at Biostatistics online. The type I error rates are also well controlled.

Table 1.
Empirical type I error rates for continuous features with identity covariance matrix
Table 2.
Empirical type I error rates for discrete features generated from the VWF gene

3.2. Power comparison

To study the power of our proposed approach, we compare HMVD with other popular group testing methods: SKAT (Wu and others, 2009), SKAT-O (Lee and others, 2013), VBDM (Logsdon and others, 2014), SigmaP (Cheung and others, 2012), and the Burden test (Morris and Pan, 2010). We examine the power for the different methods across a variety of settings. The features that are associated with the response are selected independently with probability equation M262 (sparse signal) or 0.2 (denser signal). The number of features is again set to be either 20 or 40. Note that for equation M263 and equation M264 in equation M26512% of the simulations there is no signal, so for this setting the maximum power is around 88%. The sample size is set to be 4000 for all simulations. Comparisons are made for both continuous and ordinal features, and for continuous and binary outcomes equation M266.

For continuous outcomes, the data are generated from the model equation M267 where equation M268 are taken iid normal with mean 0 and variance 4. For binary outcomes, the data are generated from the model equation M269.

The effect sizes equation M270 are generated under two scenarios: In the first scenario, we assume that the features associated with the outcome share the same effect size, i.e., equation M271 or equation M272 (when our model is correct); In the second scenario we assume that the associated features have different effect sizes, such that equation M273 or equation M274. This scenario is used to study the robustness of HMVD against model misspecification. We note that in the second scenario on average 31% of the truly associated equation M275 will have effects that are in the opposite direction from the other equation M276. Under each scenario, the associated features are chosen randomly and we report the percentages of tests that are significant at level equation M277 over 1000 runs.

We generate continuous features equation M278 using the same covariance matrix structures as we used for the simulations for evaluating type I error. The effect size equation M279 take the value equation M280, or equation M281. These effect sizes are chosen such that the resulting powers cover a reasonable range. Figure Figure11 shows the comparison for the independent covariance structure. The top 2 panels are for continuous outcomes. All associated features share the common effect in the first panel and have different effect sizes in the second panel. The bottom 2 panels are for binary outcomes with common and different effect sizes. Results for features generated from “AR1” correlation structure are given in Supplementary material available at Biostatistics online.

Fig. 1.
Power comparison for continuous features with independent covariance structure. Top 2 panels are for continuous outcomes and bottom 2 panels are for binary outcomes.

For ordinal features, we compare power based on data generated using the VWF gene, as described in previous subsection. The MAF spectrum for selected SNPs in VWF gene is given in Appendix of Supplementary material available at Biostatistics online. The effect size equation M282 takes value equation M283 and equation M284. Results for ordinal features are given in Figure Figure22 with the same layout as Figure Figure11.

Fig. 2.
Power comparison for simulations based on simulated rare variants (MAF equation M285) of VWF gene. Top 2 panels are for continuous outcomes and bottom 2 panels are for binary outcomes.

The proposed method and VBDM yield the greatest power for the continuous outcome binary predictor case (Figure (Figure2).2). For all other set ups, HMVD has the best power, even for settings with model misspecification. The setting with “equation M286” and the setting with “equation M287” provide equivalent numbers of truly associated predictors but with different numbers of noise features (predictors that are not associated with the outcome). The effect of more noise features can be observed by comparing column 2 and column 3 in each figure. Noise features have the strongest effect on the Burden test: the power is greatly reduced when the number of noise features increases. Noise features have some effect on all other methods, though the effect on HMVD is minimal in many cases, suggesting that HMVD has the ability to better discern signals from noise. As mentioned earlier, model misspecification does not affect the performance of the proposed method. Another interesting observation is that our method show substantial advantages when the features are generated independently. This is plausible since we first conduct a variable by variable regression for each feature. The correlation structure is not explicitly modeled in the likelihood ratio test. We observe a slightly smaller advantage when all features are correlated. This is expected since we are conducting tests based on estimated effects: when all features are positively correlated, the effective sample size is reduced, which affects the power.

4. Data analysis

We apply our method to the same dataset from the Exome Sequencing Project (ESP) African American (AA) participants as studied in Logsdon and others (2014) and Johnsen and others (2013), which study the relationship between variants in the Von Willebrand Factor (VWF) gene and Von Willebrand Factor level. The Von Willebrand Factor (VWF) is a blood glycoprotein. It plays important roles in platelet adhesion. Low levels of VWF are associated with higher risk of Von Willebrand disease (VWD), a type of bleeding disorder. Johnsen and others (2013) and Logsdon and others (2014) showed that VWF missense variants are associated with VWF levels within the AA population.

The dataset consists of 2487 AA subjects from the CARe consortium who were imputed using the 1000-genomes reference panel as part of the ESP. Details of the imputation are in Auer and others (2012). There are a total of 30 imputed missense VWF variants. Two pairs of the variants are highly correlated with correlation equation M288; we remove one from each pair to reduce the collinearity and end up with 28 imputed missense VWF variants. We apply our method to study the association between 28 imputed missense VWF variants and log-transformed VWF level. In this section, we apply both weighted and unweighted version of our method for testing association. We obtain strong evidence of association with a equation M289 p-value of equation M290 and equation M291 and permutation based p-value equation M292 and equation M293 for the weighted and unweighted procedures, respectively. SKAT and SKAT-O also yield significant results, but the significance levels are not as strong as those for HMVD. For the VBDM method of Logsdon and others (2014), the weighted version found a significant association; however, the unweighted version failed to establish association. In Table Table3,3, p-values using different methods are summarized for comparison. It should also be noted that all other methods seem to be sensitive to the choice of weights.

Table 3.
Aggregate association test results for association between imputed missense variants within the VWF gene and log-transformed VWF levels in African Americans in the CARe consortium

Our method identifies four low-frequency variants that have a strong association with the outcome (with posterior probabilities equation M3040.99) and all the other variants seems to contribute little to the association (with posterior probabilities close to 0); see Table Table4.4. Three of the four variants selected by HMVD are Arg2287Trp, Ser1486Leu, and Val1439Met, which concord perfectly with the results reported in Johnsen and others (2013). Interestingly, in Johnsen and others's report, the effects of these three variants are of similar size, so it is not surprising that the unweighted version of our test establishes a significant effect (the weighted version provided similar results). The variants that show strong evidence of association are shown in bold.

Table 4.
VWF variants statistics and posterior probabilities. Variants with posterior probabilities greater than 0.5 are highlighted in bold face

5. Discussion

In this paper, we propose HMVD, a general group association test. By introducing a hidden indicator variable, our method explicitly accounts for the fact that a portion of variables might not be associated with the outcome. Compared with other group association tests HMVD yields increased power in a variety of scenarios. Our method is especially powerful when only a small fraction of all candidate features are associated with the outcome. Under both the HMM framework and the marginal independent likelihood framework, we obtain a posterior probability for each predictor being associated with the outcome. These probabilities serve as evidence of association for follow-up studies. We establish asymptotic distribution results of the HMVD test statistics for any generalized linear model.

As is evident from our example, HMVD is not very sensitive to the choice of weighting scheme, while other methods may provide different conclusions based on the choices of weights. This property is desirable since the “true” weight is never known, and ideally choice of weights should not affect the conclusions.

There are several possible extensions of our approach. First, in this paper, we assume all the associated features to have effects in one direction. Extensions can be made to allow both features with positive association and features with negative association. Second, although we focus on group-wise association test, our framework can be easily extended to other settings such as the identification of gene–environment interaction, or group-wise testing when features have entirely different scales.

Acknowledgments

The authors thank Benjamin Logsdon and the authors wish to acknowledge the support of the National Heart, Lung, and Blood Institute (NHLBI) and the contributions of the research institutions, study investigators, field staff, and study participants in creating this resource for biomedical research. Funding for GO ESP was provided by NHLBI grants RC2 HL-103010 (HeartGO), RC2 HL-102923 (LungGO), and RC2 HL-102924 (WHISP). The exome sequencing was performed through NHLBI grants RC2 HL-102925 (BroadGO) and RC2 HL-102926 (SeattleGO). In addition, this research was supported by National Institute of Health grants R01 HG-006124, P01 CA-53996, and R01 HL-114901. Conflict of Interest: None declared.

References

  • Auer P. L., Johnsen J. M., Johnson A. D., Logsdon B. A., Lange L. A., Nalls M. A., Zhang G., Franceschini N., Fox K., Lange E. M. and others (2012). Imputation of exome sequence variants into population-based samples and blood-cell-trait-associated loci in african americans: NHLBI go exome sequencing project. American Journal of Human Genetics 91, 794–808. [PubMed]
  • Besag J., Clifford P. (1991). Sequential monte carlo p-values. Biometrika 78, 301–304.
  • Capanu M., Begg C. B. (2011). Hierarchical modeling for estimating relative risks of rare genetic variants: properties of the pseudo-likelihood method. Biometrics 67, 371–380. [PMC free article] [PubMed]
  • Chandler R. E., Bate S. (2007). Inference for clustered data using the independence log-likelihood. Biometrika 94, 167–183.
  • Chen H., Chen J., Kalbfleisch J. D. (2001). A modified likelihood ratio test for homogeneity in finite mixture models. Journal of Royal Statistical Society, Series B 63, 19–29.
  • Chen L. S., Hsu L., Gamazon E. R., Cox N. J., Nicolae D. L. (2012). An exponential combination procedure for set-based association tests in sequencing studies. American Journal of Human Genetics 91, 977–986. [PubMed]
  • Chernoff H., Lander E. (1995). Asymptotic distribution of the likelihood ratio test that a mixture of two binomials is a single binomial. Journal of Statistical Planning and Inference 25, 579–586.
  • Cheung Y. H., Wang G., Leal S. M., Wang S. (2012). A fast and noise-resilient approach to detect rare-variant associations with deep sequencing data for complex disorders. Genetic Epidemiology 36, 675–685. [PubMed]
  • Fisher R. A. (1932) Statistical Methods for Research Workers, 4th edition NewYork: Springer.
  • Fu Y., Chen J., Kalbfleisch J. D. (2009). Modified likelihood ratio test for homogeneity in a two-sample problem. Statistica Sinica 19, 1603–1619.
  • Hoffmann T. J., Marini N. J., Witte J. S. (2010). Comprehensive approach to analyzing rare genetic variants. PLoS ONE 5, e13584. [PMC free article] [PubMed]
  • Johnsen J. M., Auer P. L., Morrison A. C., Jiao S., Wei P., Haessler J., Fox K., McGee S. R., Smith J. D., Carlson C. S. and others (2013). Common and rare von willebrand factor (VWF) coding variants, VWF levels, and factor VIII levels in african americans: the NHLBI exome sequencing project. Blood 122, 590–597. [PubMed]
  • Lee S., Wu M. C., Lin X. (2013). Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775. [PMC free article] [PubMed]
  • Li B., Leal S. M. (2008). Optimal tests for rare variant effects in sequencing association studies. The American Journal of Human Genetics 3, 311–321. [PubMed]
  • Liang F., Xiong M. (2013). Bayesian detection of disease–associated rare variants under posterior consistency. PLoS ONE 8, e69633. [PMC free article] [PubMed]
  • Lo Y., Mendell N. R., Rubin D. B. (2001). Testing the number of components in a normal mixture. Biometrika 88, 767–778.
  • Logsdon B. A., Dai J. Y., Auer P. L., Johnsen J. M., Ganesh S. K., Smith N. L., Wilson J. G., Tracy R. P., Lange L. A., Jiao S. and others (2014). A variational Bayes discrete mixture test for rare variant association. Genetic Epidemiology 38, 21–30. [PMC free article] [PubMed]
  • Morgenthaler S., Thilly W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST). Mutation Research 615, 28–56. [PubMed]
  • Morris A. P., Pan W. (2010). An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genetic Epidemiology 34, 188–193. [PMC free article] [PubMed]
  • Rolland M., Edlefsen P. T., Larsen B. B., Tovanabutra S., Sanders-Buell E., Hertz T., deCamp A. C., Carrico C., Menis S., Magaret C. A. and others (2012). Increased HIV-1 vaccine efficacy against viruses with genetic signatures in env v2. Nature 490, 417–420. [PMC free article] [PubMed]
  • Sen P. K., Ghosh J. K. (1985). On the asymptotic performance of the log likelihood ratio statistic for the mixture model and related results. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer Volume II, 789–806.
  • Spencer C., Su Z., Donnelly P., Marchini J. (2009). Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genetics 5, e1000477. [PMC free article] [PubMed]
  • Tippett L. H. C. (1931) The Methods of Statistics. London: Williams and Norgate.
  • Wu M. C., Lee S., Cai T., Li Y., Boehnke M., Lin X. (2009). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89, 82–93. [PubMed]
  • Zhou H., Xie M., Simpson D. G., Weinberg C. R. (2001). A generalized likelihood ratio approach for cluster-correlated data from human fertility studies. The Indian Journal of Statistics 63, 56–68.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press