|Home | About | Journals | Submit | Contact Us | Français|
In the genomic era, group association tests are of great interest. Due to the overwhelming number of individual genomic features, the power of testing for association of a single genomic feature at a time is often very small, as are the effect sizes for most features. Many methods have been proposed to test association of a trait with a group of features within a functional unit as a whole, e.g. all SNPs in a gene, yet few of these methods account for the fact that generally a substantial proportion of the features are not associated with the trait. In this paper, we propose to model the association for each feature in the group as a mixture of features with no association and features with non-zero associations to explicitly account for the possibility that a fraction of features may not be associated with the trait while other features in the group are. The feature-level associations are first estimated by generalized linear models; the sequence of these estimated associations is then modeled by a hidden Markov chain. To test for global association, we develop a modified likelihood ratio test based on a log-likelihood function that ignores higher order dependency plus a penalty term. We derive the asymptotic distribution of the likelihood ratio test under the null hypothesis. Furthermore, we obtain the posterior probability of association for each feature, which provides evidence of feature-level association and is useful for potential follow-up studies. In simulations and data application, we show that our proposed method performs well when compared with existing group association tests especially when there are only few features associated with the outcome.
With the fast growth in high-throughput sequencing technologies, identifying important features that are associated with a disease trait in a genome-wide study has become possible. Notable examples include rare variant association in human genome sequencing studies (Morgenthaler and Thilly, 2007; Morris and Pan, 2010; Wu and others, 2009), and sieve analyses in HIV vaccine trials comparing viral sequences between vaccines and placebo recipients (Rolland and others, 2012). Due to the overwhelming burden of multiple testing, testing for association between disease and individual features often lack power. To aggregate individual weak associations and reduce the number of tests, numerous methods have been proposed to perform a global association test within a functional unit, e.g. a gene or a pathway. Motivated by, though certainly not limited to, rare-variant association testing, we propose in this paper a novel group association test named “hidden Markov model variable detection method” (HMVD).
Existing association detection methods belong to two general classes: one class is to aggregate individual summary statistics like p-values or Z statistics for each feature. For example, Fisher (1932) construct test statistics based on sum of the p-value for the each feature. Tippett (1931) tests based on minimum of the p-value's for all features. More recent work along includes Chen and others (2012) and Cheung and others (2012) among others. For example, SigmaP (Cheung and others, 2012) tests for association by constructing test statistics using a weighted linear combination of the individual p-values.
The other class involves multiple regression models for the trait and all features together, with different levels of constraints to curb the number of parameters. One group of such tests, called Burden tests, simply collapses multiple features into one explanatory variable. For example, the cohort allelic sum test (Morgenthaler and Thilly, 2007) collapses variants within one region to a new indicator variable, which is one if a subject has at least one rare variant among that region, and zero otherwise. Along this line, Morris and Pan (2010) proposed to count the number of rare variants within a region and Li and Leal (2008) proposed the combined multivariate and collapsing test. These burden tests usually works well if a large portion of the variants are associated with the outcome. Wu and others (2009) proposed the sequence kernel association test (SKAT) by testing for the heterogeneity of single-nucleotide polymorphism (SNP) associations. They assumed that the SNP associations follow a normal distribution with zero mean, and tested for whether the distribution of associations has zero variance. SKAT is more powerful than the Burden tests when a small portion of variants are associated with the outcome or when the association of the SNPs go in both directions. Lee and others (2013) combined SKAT and Burden tests and introduced an “optimal” test that works well under both of the two aforementioned settings.
One limitation of those aforementioned methods is that they do not explicitly distinguish associated features from functionally neutral features. Thus, in settings with a substantial portion of neutral features, they may not be optimal in power performance. Recently, Logsdon and others (2014) proposed a variational Bayes discrete mixture (VBDM) method that incorporates the mixture of associated and neutral variants in the modeling. However, due to the approximation nature of the variational Bayes method, it does not work for dichotomous traits and logistic regression. Other authors have explored model selection techniques in group association tests, including a stepwise variable-selection method (Hoffmann and others, 2010) and Bayesian model selection (Capanu and Begg, 2011; Liang and Xiong, 2013).
In this paper, we propose a two-step procedure to conduct group association tests: first we summarize evidence of individual associations using statistics or, equivalently, the observed association in a generalized linear model. We next model each feature-level association by a latent indicator variable indexing whether the feature is associated with the outcome. The sequence of estimated individual associations are described probabilistically by a hidden Markov model (HMM). This HMM allows for a flexible structure on the sequence of hidden states yet it is computationally manageable through a modified Baum–Welsh algorithm.
The primary goal of the proposed method is to test the global null hypothesis that there is no association between the set of features and the outcome. Under the alternative hypothesis, each statistic or estimated association follows a mixture of two normal distributions: one is centered at zero (the null), the other is centered at the alternative. The mixture reduces to a single normal distribution under the null, and therefore we aim to test whether the number of mixture components is one. A common issue when testing for the number of mixture components is that the model is non-identifiable under the null (Lo and others, 2001). Let be the proportion of the second components and be the parameter for the second component. Then under the null when , will be non-identifiable. As a result, the standard likelihood ratio test does not have the usual chi-square distribution. Chernoff and Lander (1995) and Sen and Ghosh (1985) among others find that the asymptotic distribution of the LRT involves the supremum of a Gaussian process. Recently, Chen and others (2001) proposed to add a term to the log-likelihood function to keep away from 0 or 1, where is a user defined positive constant. They show that the parameters are consistently estimated under the null. In our method, we apply a similar approach by adding a term to the log-likelihood function, so the support of p is limited to . Correlated statistics are modelled by a hidden Markov chain, so that the proportion of null features and posterior probabilities can be computed by a extension of the Baum–Welsh algorithm. For hypothesis testing, we propose a modified likelihood ratio test ignoring the higher order correlation between features with the aforementioned penalty term. We show that the asymptotic distribution under the null hypothesis follows a distribution.
Another contribution of our method is its “model selection” feature. On top of its ability to test for overall association, HMVD also provides a posterior probability for each feature, summarizing how likely it is marginally associated with the outcome. This property is desirable since it allows one to identify important features amongst a pool of candidate features, thereby providing guidance for follow-up studies.
In this paper, we focus on testing whether a set of predictors (or features) () is related to an outcome variable while being adjusted for extra confounding variables (covariates) . For independent subjects, let be an vector, be an matrix, and let be an matrix. We assume that our samples are obtained either from a case–control study or a cohort study; can be either a dichotomized disease variable or a continuous response. The traditional way to test association of a group of features is to fit a multiple regression model, , where is the coefficient for , is the coefficient for . The usual F/ statistic to test will have limited power when is large.
Rather than fitting one model for all features, which can be cumbersome when is large, we fit a separate model for each feature:
where is the th column of . For the th feature, the estimated association and its estimated variance can be obtained through model (2.1). We assume that within a set of features, only some of these features are associated with the outcome. Let , , be an indicator whether feature is associated with the outcome, and assume all associated features share the same effect size , we have
where . Note that without any rescaling of the features, as increases. The assumption that all associated features have the same effect size may seem restrictive. However, we note that in many applications where all features are jointly tested these features are “similar”, e.g. they are all sequence variants within the same gene that may disrupt the function of that gene in a similar manner. As such, we believe that this assumption can be a reasonable approximation. We note that it is also possible to a priori rescale some of the features , for example to give them the same variance, or to incorporate some prior believes about effect sizes. In our simulation study, we will see that our method maintains correct control of the type I error and has good power, even if the assumption of the same effect size is not satisfied.
Our testing procedure has two parts: the “generic” regression model that relates to and (2.1), and the HMM that relates the sequence of estimated association (2.2). The goal here is to develop a test for association of a group of features using the second model. For this test to be valid, the actual form of (2.1) can be flexible: we require only an estimate for features and their estimated covariance matrix. We can also apply our procedure to a set of test statistics , , specifying , for . Such “weighted version”, can be obtained when each feature is divided by the (unweighted) . For the weighted version, model (2.2) gets replaced by where the covariance matrix of the equals the correlation matrix for . In the remainder of this section, we show results for the unweighted version of our procedure. The derivations for the weighted version are completely analogous and therefore omitted.
Set and let the corresponding estimating equation be , which is a function of . Then the joint distribution of is
where is the last diagonal element of . Denote the variance by ; its estimate can be estimated from the data.
To describe the likelihood of observed data, we need to impose some structure on the sequence of latent variables 's to use (2.2) for group testing. The simplest structure is to assume that the 's are independent, and that each follows a Bernoulli distribution with : that is, each feature has an equal “chance” of being associated with , independent of the other features. In this paper, we assume that the follow a stationary Markov chain with transition matrix , , where and . We assume that the features can be ordered in a meaningful way, and that whether one feature is associated with the response may be informative on whether the next feature is associated with the response. For example, for genetic data, there is a natural order of variants along the chromosome. The Markovian structure implies that for neighboring SNPs, if one is associated with the outcome the other one is more likely to be associated with the outcome. We note that when , the Markov chain model is equivalent to the simple Bernoulli model with .
We define the emission probabilities, i.e. the conditional distribution of given and the hidden states up to () as Let and . When is sufficiently large, is a multivariate Gaussian distribution given all the hidden states. We obtain the likelihood of by summing over all hidden states: where . Contrary to the usual first-order HMMs, the emission probability of a feature depends on the hidden states of the feature itself and all prior features. Therefore, the full likelihood of involves summation over all possible combinations of the prior hidden states for each , which is computationally infeasible. Some simplification is needed. We consider two strategies for accounting for the dependency between features: for the hypothesis testing purpose, we develop a modified likelihood ratio test that ignores the higher order dependency and we prove that the resulting test is still valid. For estimating the posterior probability of an individual feature, we use a modified Baum–Welsh algorithm that reduced the dependency in emission probabilities to the first order only, thereby substantially reducing the burden of computation.
We formulate a model similar to the standard hidden Markov model (HMM) with one key difference. In the standard HMM, all observations are assumed to be conditionally independent, that is, it is assumed that . These assumptions are usually not satisfied in our setting, since the effects from neighboring features (e.g. SNPs) are likely to be correlated. We generalize the HMM by allowing first-order conditional dependence between neighboring features and ignoring any higher order correlations.
In particular, we assume that where , and . For simplicity, we define for , and . Then the likelihood function, by ignoring higher order conditional correlations, can be expressed as
The primary goal of this paper is group testing for association. Assume that the Markov chain is stationary, then . Testing for association between outcome and features is equivalent to testing for the composite null hypothesis
This set-up has two complications. First, the parameters are not identifiable under ; when , vanishes in the likelihood, and when , vanishes. Second, there are no existing results about the distribution for the likelihood ratio test statistics under the HMM setting described here. To eliminate the identifiability problem, one approach is to add a small penalty term, that pushes one component of the null parameters away from its boundary value (e.g. Chen and others, 2001; Fu and others, 2009). In the HMM setting, tests based on the marginal distribution of , ignoring dependency, have been proposed (Dannemann and Holzmann, 2008). The rational for this approximation is that the number of states in the HMM is equivalent to the number of components for each . Under , when the HMM only has one state, marginally follows a normal distribution with mean 0. Under the alternative, when the HMM has two states, marginally follows a mixture of four normal distributions, with the mixture probabilities being , , , and .
In our method, has a distribution that is a mixture of four normal distributions:
Here denotes the normal probability density function evaluated with mean and variance . Then the likelihood function for , ignoring higher order dependency, is
We propose the following modified log-likelihood function:
and we construct a likelihood ratio test based on this expression.
As discussed in Chen and others (2001), the best choice of the tuning parameter depends on the model. In our simulation study we find that the value of affects both the type I error and the power: setting too large will reduce the power while setting too small will lead to some inflation of the type I error. We found that yields a good balance between type I error and power.
Interestingly, maximizing (2.6) will still give us a consistent estimator of . Related results on the consistency for MLEs while assuming independence have been discussed in Chandler and Bate (2007) and Zhou and others (2001).
Note both and are of order , it is natural to make the following transformation and rewrite the log-likelihood function. Define , , and . Then equation (2.4) can be written as and the modified log-likelihood function can be rewritten as
Now define the modified likelihood ratio test (MLRT) statistics as
where is the covariance matrix for , . is the sum of all diagonal elements of and is the sum of all elements in . Then the following two theorems show the consistency of the estimator of and the asymptotic distribution of under the null hypothesis.
If Conditions 1–5 (given in Supplementary material available at Biostatistics online) hold, the maximum modified likelihood estimator is consistent under the null.
If Conditions 1–5 hold, then under the null hypothesis the modified likelihood ratio test statistics defined in equation (2.8) follows a distribution as and .
The proofs of the theorems are given in Supplementary material available at Biostatistics online. For data with limited sample size and small number of features, we observed small inflation of type I errors. We refer to the p-values that are obtained using Theorem 2.2 as approximate p-values. In cases where accurate p-values are needed, we propose the following adaptive permutation strategy. For each test, we obtain and as described in Section 2 and the approximate p-value . Under the null hypothesis, follows a multivariate normal distribution with mean and covariance , denoted by . Given a significance level , the adaptive permutation tests proceeds as follows. The first step is to determine whether a permutation test is needed. If is large enough compared with , then there is no need to conduct a permutation test. In both the power comparison and type I error calculation in Section 3, we do not initiate the permutation test if . Otherwise, we set the number of permutations , where is the greatest integer that is less than and is a user defined parameter. Throughout the paper, we set to be 10. For permutation , we generate from and calculate the corresponding approximate p-value , . Then a test is claimed to be significant at level if . The proposed adaptive permutation procedure is similar to the method introduced in Besag and Clifford (1991).
Recall that the likelihood function based on the HMM structure, while ignoring higher order dependency, is . To deal with the composite null issue discussed in the previous section, we add a penalty term to the above likelihood function and obtain the following penalized version:
We can estimate by maximizing expression (2.9). Maximization can be done using a modified Baum–Welsh algorithm. A detailed description of this algorithm is given in Supplementary material available at Biostatistics online.
In this section, we conduct simulation study to evaluate the performance of the proposed HMVD method in settings with continuous and binary features and in settings with continuous and binary outcomes.
We perform HMVD tests under the null hypothesis and calculate type I error rates in various settings. For each setting, the empirical type I error rate is obtained based on independent simulated datasets. Sample sizes are and . The outcome is generated either from a standard normal distribution or from a Bernoulli distribution with .
For continuous features, we generated 20 and 40 features with two different covariance matrix structures: independent features or features with an “AR1” correlation structure . The marginal variance of each feature was one for all simulations.
For discrete features, we generate genetic data mimicking real SNP data. We use HAPGEN2 (Spencer and others, 2009) and the CEU sample from the Hapmap project (data release #24) to generate genotypes of the Von Willebrand Factor (VWF) gene on chromosome 12. To generate rare variants, we only select those variants with minor allele frequency (MAF) . We select the first 20 or 40 variants to be consistent with the setting for continuous features.
For each simulation, we calculate the proportion of times the group of features is declared as significantly associated with the outcome at three thresholds (, , and ). For , we generated simulated datasets and for and , we generated simulated datasets to test for the type I error rates. Results for independent features and discrete features are given in Tables Tables11–2. Overall, the type I error rates are well controlled. The p-values are conservative for small sample size when outcome is binary and features are generated based on VWF gene. This is due to the conservative standard error obtained using logistic regression for small sample size when MAFs are small. Results for features generated from “AR1” correlation structure are given in Supplementary material available at Biostatistics online. The type I error rates are also well controlled.
To study the power of our proposed approach, we compare HMVD with other popular group testing methods: SKAT (Wu and others, 2009), SKAT-O (Lee and others, 2013), VBDM (Logsdon and others, 2014), SigmaP (Cheung and others, 2012), and the Burden test (Morris and Pan, 2010). We examine the power for the different methods across a variety of settings. The features that are associated with the response are selected independently with probability (sparse signal) or 0.2 (denser signal). The number of features is again set to be either 20 or 40. Note that for and in 12% of the simulations there is no signal, so for this setting the maximum power is around 88%. The sample size is set to be 4000 for all simulations. Comparisons are made for both continuous and ordinal features, and for continuous and binary outcomes .
For continuous outcomes, the data are generated from the model where are taken iid normal with mean 0 and variance 4. For binary outcomes, the data are generated from the model .
The effect sizes are generated under two scenarios: In the first scenario, we assume that the features associated with the outcome share the same effect size, i.e., or (when our model is correct); In the second scenario we assume that the associated features have different effect sizes, such that or . This scenario is used to study the robustness of HMVD against model misspecification. We note that in the second scenario on average 31% of the truly associated will have effects that are in the opposite direction from the other . Under each scenario, the associated features are chosen randomly and we report the percentages of tests that are significant at level over 1000 runs.
We generate continuous features using the same covariance matrix structures as we used for the simulations for evaluating type I error. The effect size take the value , or . These effect sizes are chosen such that the resulting powers cover a reasonable range. Figure Figure11 shows the comparison for the independent covariance structure. The top 2 panels are for continuous outcomes. All associated features share the common effect in the first panel and have different effect sizes in the second panel. The bottom 2 panels are for binary outcomes with common and different effect sizes. Results for features generated from “AR1” correlation structure are given in Supplementary material available at Biostatistics online.
For ordinal features, we compare power based on data generated using the VWF gene, as described in previous subsection. The MAF spectrum for selected SNPs in VWF gene is given in Appendix of Supplementary material available at Biostatistics online. The effect size takes value and . Results for ordinal features are given in Figure Figure22 with the same layout as Figure Figure11.
The proposed method and VBDM yield the greatest power for the continuous outcome binary predictor case (Figure (Figure2).2). For all other set ups, HMVD has the best power, even for settings with model misspecification. The setting with “” and the setting with “” provide equivalent numbers of truly associated predictors but with different numbers of noise features (predictors that are not associated with the outcome). The effect of more noise features can be observed by comparing column 2 and column 3 in each figure. Noise features have the strongest effect on the Burden test: the power is greatly reduced when the number of noise features increases. Noise features have some effect on all other methods, though the effect on HMVD is minimal in many cases, suggesting that HMVD has the ability to better discern signals from noise. As mentioned earlier, model misspecification does not affect the performance of the proposed method. Another interesting observation is that our method show substantial advantages when the features are generated independently. This is plausible since we first conduct a variable by variable regression for each feature. The correlation structure is not explicitly modeled in the likelihood ratio test. We observe a slightly smaller advantage when all features are correlated. This is expected since we are conducting tests based on estimated effects: when all features are positively correlated, the effective sample size is reduced, which affects the power.
We apply our method to the same dataset from the Exome Sequencing Project (ESP) African American (AA) participants as studied in Logsdon and others (2014) and Johnsen and others (2013), which study the relationship between variants in the Von Willebrand Factor (VWF) gene and Von Willebrand Factor level. The Von Willebrand Factor (VWF) is a blood glycoprotein. It plays important roles in platelet adhesion. Low levels of VWF are associated with higher risk of Von Willebrand disease (VWD), a type of bleeding disorder. Johnsen and others (2013) and Logsdon and others (2014) showed that VWF missense variants are associated with VWF levels within the AA population.
The dataset consists of 2487 AA subjects from the CARe consortium who were imputed using the 1000-genomes reference panel as part of the ESP. Details of the imputation are in Auer and others (2012). There are a total of 30 imputed missense VWF variants. Two pairs of the variants are highly correlated with correlation ; we remove one from each pair to reduce the collinearity and end up with 28 imputed missense VWF variants. We apply our method to study the association between 28 imputed missense VWF variants and log-transformed VWF level. In this section, we apply both weighted and unweighted version of our method for testing association. We obtain strong evidence of association with a p-value of and and permutation based p-value and for the weighted and unweighted procedures, respectively. SKAT and SKAT-O also yield significant results, but the significance levels are not as strong as those for HMVD. For the VBDM method of Logsdon and others (2014), the weighted version found a significant association; however, the unweighted version failed to establish association. In Table Table3,3, p-values using different methods are summarized for comparison. It should also be noted that all other methods seem to be sensitive to the choice of weights.
Our method identifies four low-frequency variants that have a strong association with the outcome (with posterior probabilities 0.99) and all the other variants seems to contribute little to the association (with posterior probabilities close to 0); see Table Table4.4. Three of the four variants selected by HMVD are Arg2287Trp, Ser1486Leu, and Val1439Met, which concord perfectly with the results reported in Johnsen and others (2013). Interestingly, in Johnsen and others's report, the effects of these three variants are of similar size, so it is not surprising that the unweighted version of our test establishes a significant effect (the weighted version provided similar results). The variants that show strong evidence of association are shown in bold.
In this paper, we propose HMVD, a general group association test. By introducing a hidden indicator variable, our method explicitly accounts for the fact that a portion of variables might not be associated with the outcome. Compared with other group association tests HMVD yields increased power in a variety of scenarios. Our method is especially powerful when only a small fraction of all candidate features are associated with the outcome. Under both the HMM framework and the marginal independent likelihood framework, we obtain a posterior probability for each predictor being associated with the outcome. These probabilities serve as evidence of association for follow-up studies. We establish asymptotic distribution results of the HMVD test statistics for any generalized linear model.
As is evident from our example, HMVD is not very sensitive to the choice of weighting scheme, while other methods may provide different conclusions based on the choices of weights. This property is desirable since the “true” weight is never known, and ideally choice of weights should not affect the conclusions.
There are several possible extensions of our approach. First, in this paper, we assume all the associated features to have effects in one direction. Extensions can be made to allow both features with positive association and features with negative association. Second, although we focus on group-wise association test, our framework can be easily extended to other settings such as the identification of gene–environment interaction, or group-wise testing when features have entirely different scales.
The authors thank Benjamin Logsdon and the authors wish to acknowledge the support of the National Heart, Lung, and Blood Institute (NHLBI) and the contributions of the research institutions, study investigators, field staff, and study participants in creating this resource for biomedical research. Funding for GO ESP was provided by NHLBI grants RC2 HL-103010 (HeartGO), RC2 HL-102923 (LungGO), and RC2 HL-102924 (WHISP). The exome sequencing was performed through NHLBI grants RC2 HL-102925 (BroadGO) and RC2 HL-102926 (SeattleGO). In addition, this research was supported by National Institute of Health grants R01 HG-006124, P01 CA-53996, and R01 HL-114901. Conflict of Interest: None declared.