Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2918520

Formats

Article sections

Authors

Related links

Genet Epidemiol. Author manuscript; available in PMC 2010 August 9.

Published in final edited form as:

PMCID: PMC2918520

NIHMSID: NIHMS206892

Division of Cancer Epidemiology and Genetics, National Cancer Institute, Executive Plaza South, Bethesda, MD, 20982

The publisher's final edited version of this article is available at Genet Epidemiol

See other articles in PMC that cite the published article.

Case-control genome-wide association studies provide a vast amount of genetic information that may be used to investigate secondary phenotypes. We study the situation in which the primary disease is rare and the secondary phenotype and genetic markers are dichotomous. An analysis of the association between a genetic marker and the secondary phenotype based on controls only is valid, whereas standard methods that also use cases result in biased estimates and highly inflated type I error if there is an interaction between the secondary phenotype and the genetic marker on the risk of the primary disease. Here we present an adaptively weighted method that combines the case and control data to study the association, while reducing to the controls only analysis if there is strong evidence of an interaction. The possibility of such an interaction and the misleading results for standard methods, but not for the adaptively weighted or controls only approaches, are illustrated by data from a case-control study of colorectal adenoma, in which the secondary phenotype is smoking. Simulations and asymptotic theory indicate that the adaptively weighted method can reduce the mean square error for estimation with a pre-specified SNP and increase the power to discover a new association in a genome-wide study, compared to an analysis of controls only. Further experience with genome-wide studies is needed to determine when methods that assume no interaction and gain precision and power, thereby can be recommended, and when methods such as the adaptively weighted or controls only approaches are needed to guard against the possibility of non-zero interactions.

The genome-wide association study (GWAS) is a powerful tool to identify genetic associations with a disease. A GWAS may also provide information on secondary phenotypes that are measured for all subjects. Added value could be gained from a GWAS by studying the association between genes and the secondary phenotypes.

When the disease in a GWAS is rare, the controls could be regarded as a random sample from the general population, and they could be used for estimating the association with a secondary phenotype. However, if the disease is associated with both the SNP and the secondary phenotype, including cases may introduce bias in the estimation of association, even when the disease is rare.

Lin and Zeng [2009] studied a likelihood-based method that combines the cases and controls efficiently to analyze secondary phenotypes in GWASs. This method is comprehensive because it covers both quantitative and dichotomous secondary phenotypes and both rare and non-rare disease. However, the maximum likelihood estimates are only unbiased and statistically efficient under the assumption that there is no interaction between gene and secondary phenotype effects on disease risk. In their paper, they stated that all standard methods based on controls only, cases only, and the combination of cases and controls yield unbiased estimates if the disease is rare. However, this is not true if there is an interaction between gene and secondary phenotype effects on disease risk.

The purpose of this study was to develop a procedure that can use cases as well as controls to increase efficiency without introducing bias. We focused on the important special situation in which the original disease is rare and the secondary phenotype and genetic marker are dichotomous. We investigated the performance of standard methods and the maximum likelihood estimation (MLE) method discussed in Lin and Zeng [2009] while allowing for the interaction. We found that the MLE method is equivalent to the analysis of controls only and does not provide additional efficiency, if one includes an interaction in the model.

We proposed an adaptively weighted method that combines the case and control data to estimate the association with reduced mean square error, based on a balance between bias and variance. In the presence of strong interaction, this method reduces to the controls only analysis. Both simulated and real data examples suggest an advantage of the proposed adaptively weighted estimator in both estimation and gene discovery. This research also provides guidance on the validity of the various proposed approaches for analyzing secondary phenotypes.

The paper is organized as follows. In the Methods Section, we describe the study setting and data and present the different analytic methods, including our adaptively weighted method. In the Results Section, we illustrate the performance of different methods on colorectal adenoma case-control data, for which there is an interaction between gene and secondary phenotype (smoking) on disease risk. In simulations, we evaluate the properties of estimates and tests from the different methods. We also illustrate the promising power of the adaptively weighted method in large-scale SNP discovery GWAS. Conclusions are in the Discussion Section and technical details are in the Appendix .

We consider the simple but important scenario of an unmatched case-control study with a rare disease, dichotomous genetic marker *G*, and dichotomous secondary phenotype *X*. Let *D* = 1 or 0 denote the diseased or non-diseased state for each individual. Let *G* = 1 or 0 according as an individual carries at least one SNP allele of interest or not. Let *X* = 1 or 0 denote whether the individual has or does not have the secondary trait. *n*_{0} and *n*_{1} are the number of controls and cases, respectively. The data can be represented as in Table 1.

Let **r _{0}** = (

One can use the logistic regression model for the dichotomous secondary phenotype

(1)

When the disease is rare, the dependency of *D* on *G* and *X* can be modeled as

(2)

We are interested in the inference regarding *β*_{1}, which represents the log odds ratio of *G* and *X* in the general population, namely exp(*β*_{1}) = OR_{GX}.

The controls can be regarded as a random sample of the general population if disease is rare, and therefore the odds ratio of *G* and *X* among the controls estimates the odds ratio in the population, namely . The MLE of *β*_{1} using controls only is given by

This estimator is nearly unbiased if the disease is rare. From standard asymptotic theory, . The chi-square test of association is based on the Wald statistic .

The estimator of *β*_{1} using cases only is given by

This is not unbiased unless *δ*_{12} = 0 in model (2), because the odds ratio of *G* and *X* among the cases is

(3)

The variance estimate of the case estimator is , which can be used in the Wald statistic .

When there is no interaction between *G* and *X* in model (2), both cases and controls can be used to estimate *β*_{1}. Combining the case and control estimates by inverse variance weighting leads to

where . By ignoring the variation in and , we estimate , which can be used in the Wald statistic . Although is more efficient than , it is only unbiased when *δ*_{12} = 0.

Lin and Zeng [2009] used the maximum likelihood estimation method to analyze secondary phenotype data based on the retrospective likelihood function,

(4)

Using the rare disease assumption with a dichotomous secondary phenotype, we found the following results from this likelihood:

Using the saturated disease model (2), the maximum likelihood estimator , and the maximum likelihood method does not use any information from the cases. (See Appendix for proof.)

Lin and Zeng [2009] assumed *δ*_{12} = 0 in model (2). Under this assumption, and for a rare disease, has only very slightly smaller asymptotic variance than . For example, for *δ*_{12} = 0 and *β*_{1} = 0.25, when *n*_{1} = *n*_{0} = 1, 000, and ; when *n*_{1} = *n*_{0} = 10, 000, and .

To capture some of the efficiency of or while avoiding the bias in these estimates that results when *δ*_{12} ≠ 0, we proposed an estimator that down-weights the contribution from cases as the evidence against *δ*_{12} = 0 increases.

Motivated by an empirical Bayes shrinkage estimator for gene-environment interaction [Mukherjee and Chatterjee, 2008], we proposed the following estimate:

(5)

where estimates the interaction.

From (5), as , , and as , . Thus adaptively combines the weighted and control only estimators based on the interaction estimate .

We estimated the variance of by rewriting (5) in term of and as

(6)

By noting that and are independent and neglecting the variability of and , we obtained the following variance estimate by Taylor expansion,

This estimator is used to construct Wald statistic .

Colorectal adenoma is a precursor of colorectal cancer. Colorectal adenoma is positively associated with smoking and negatively associated with haplotypes of a gene NAT2 that promotes rapid acetylation of carcinogens [Moslehi and others, 2006]. Here we code NAT2 = 1 for haplotypes corresponding to rapid acetylation and NAT2 = 0 for other haplotypes. NAT2 is also important in the metabolism of smoking-related carcinogens. Moslehi and others [2006] analyzed a case-control study of advanced colorectal adenoma in the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial at the National Cancer Institute. Table 2 shows the observed cell counts for this case-control study. From marginal tables constructed from Table 2, one sees that the odds ratios relating colorectal adenoma to NAT2 and smoking are 0.93 and 1.41 respectively, demonstrating the protective effect of the rapid acetylator phenotype that corresponds to NAT2=1. The odds ratios relating NAT2 to smoking are 1.85 in controls but 0.378 in cases, indicating a protective interaction between smoking and NAT2 on the risk of colorectal adenoma. Adjusting for age and sex, we found with p-value= 0.0032. Ignoring this interaction will introduce bias in analyzing the secondary phenotype, smoking. Estimates of the log odds ratio associating NAT2 with smoking are shown for the five methods in Section 2 in Table 3. The adaptively weighted estimate is positive and similar to the control only estimate, while the case only, weighted and MLE estimates are negative. This example is chosen to illustrate the possibility that interactions might exist. In this case, NAT2 rapid acetylation helps eliminate carcinogens from smoking, thereby reducing the risk of colorectal adenoma from smoking. Typically, there might be no or only weak interactions, but this example illustrates the need for methods that are not misleading when interactions are present.

We used Monte Carlo simulation to evaluate the performance of different estimators of the effects of a pre-selected SNP. We fixed the probabilities of carrying one or two alleles of interest, *G*, and the secondary phenotype, *X*, as *P*(*G* = 1) = 0.3 and *P*(*X* = 1) = 0.3. We let *β*_{1} = 0 and 0.25 under the null and alternative hypotheses, respectively. Fixing these three values, one can determine the distribution of *G* and *X* in the population by solving three equations. Under the rare disease assumption, this solution also approximates the control cell probability vector **p _{0}**. For the disease-risk model (2), we set

Figure 1 shows the relative biases (left panels), the coverage probabilities of 95% confidence intervals for *β*_{1} (middle panels), and mean squared errors (MSE) for the AW and CO methods (right panels). Figure 2 displays the type I error for all methods and the power for testing **H _{0}** :

Relative biases (RBias), coverage probabilities (CP) of 95% confidence intervals and mean squared errors (MSE) for different estimators for two values of *β*_{1}:(1) *β*_{1} = 0; (2) *β*_{2} = 0.25.

The AW and CO methods perform very well. Their estimates are virtually unbiased and their confidence intervals have proper coverage probabilities (Figure 1). The AW method has smaller MSE near *δ*_{12} = 0 than the CO method (Figure 1), but not for values of *δ*_{12} away from 0. The CO method maintains nominal size (Figure 2). The AW method maintains near nominal type I error, although there is a very slight increase in size above 0.05 near |*δ*_{12}| = 0.4. The power of AW exceeds that of CO for *δ*_{12} > 0, but is less than that of CO for *δ*_{12} < 0 (Figure 2).

The CA, W and MLE methods, are seriously biased (Figure 1), have substantially subnominal coverage (Figure 1), and have above nominal Type I error (Figure 2) for *δ*_{12} ≠ 0.

In the previous section we studied estimation and hypothesis testing for a single pre-specified SNP. Here we investigate the size and power of different methods in large-scale SNP discovery GWAS.

By replacing the {**r _{0}**,

Assuming there were *N* = 500, 000 independent SNP genetic markers, we controlled the experimentwise significance by setting *α* = 0.05/(5 × 10^{5}) = 10^{−7}. It may be reasonable to suppose that *δ*_{12} = 0 for a large proportion of SNPs. We assume 99% of SNPs have *δ*_{12} = 0, and 1% of *δ*_{12} are independently distributed as *N*(0, (log(2)/2)^{2}), which implies that about 95% of nonzero *δ*_{12} values reside in [−log(2), log(2)]. We evaluated the genome-wide type I error and power of the various Wald tests averaged over the mixture distribution of *δ*_{12}. For each method we estimated the genome-wide type I error under the null from the formula

(7)

and the power under the alternative *β*_{1} = 0.25 can be calculated from

(8)

where *f*(*x*) is the normal density N[0, {log(2)/2}^{2}]. The integrals in equations (7) and (8) are approximated by drawing 100,000 *δ*_{12} from N[0, {log(2)/2}^{2}] and averaging the corresponding values of for equation (7) and of for equation (8). The non-centrality *λ* depends on *δ*_{12} and are recomputed for each realization.

The genome-wide type I error and power for different methods are presented in Table 4 for numbers of cases and controls *n* = 1000, *n* = 5000 and *n* = 10,000. The genome-wide type I error is 1.0 for *W*_{CA}, *W*_{W}, and *W*_{MLE}, indicating that these tests should not be used even if only a small proportion of SNPs have *δ*_{12} ≠ 0. Both *W*_{CO} and *W*_{AW} statistics have near nominal genome-wide type I error. However, for sample size *n*_{1} = *n*_{0} = 5, 000 or 10, 000, the power of *W*_{AW} greatly exceeds that for *W*_{CO}. For example, when *n*_{0} = *n*_{1} = 10,000, the power of *W*_{AW} is almost 98%, while that of *W*_{CO} is only 50%. Thus, substantial power gains can be achieved with *W*_{AW}. Unreported simulations from a mixture distribution with 20% non-zero *δ*_{12}s also show that *W*_{AW} has greater power than CO but a very slight excess in size (e.g. size= 0.053) was observed for *W*_{AW}.

Instead of setting fixed critical values for declaring an association statistically significant, one can rank p-values or chi-square statistics to select promising SNPs [Gail et. al., 2008]. In this section, we study the detection probability which is the probability that the test statistic for a specified disease-associated SNP will be among the top *T* chi-square values for all SNPs. We estimated the detection probability with 1, 000 simulated replicates. In each replicate, we generated data for *N* = 500, 000 SNPs with the same parameters used in Section 3.2, except that 10 disease-associated SNPs had nonzero *β*_{1}, whereas the 499,990 remaining null SNPs had *β*_{i} = 0. All SNPs had interactions *δ*_{12} drawn from a mixture distribution in which 99% of SNPs have *δ*_{12} = 0, and 1% of *δ*_{12} are independently distributed as *N*(0,*σ*^{2}). We conducted four independent simulations for the combinations of *β*_{1} = {0.25, 0.69} and *σ*^{2} = {(log(2)/2)^{2}, (log(2))^{2}}.

When the associated SNPs have small log odds ratios with the secondary phenotype, i.e. *β*_{1} = 0.25 (upper portion of Table 5), all the methods have low detection probabilities for *T* ≤ 100.. When *T* ≤ 100 and the variability of *δ*_{12} is large, the W and MLE methods have detection probabilities near zero, because some of the SNPs with interactions have large chi-square values than the disease-associated SNPs. The CO and AW methods perform slightly better. When *T* is 10, 000, (2% of the total SNPs), the detection probabilities of all methods increase and the W and MLE methods outperform the CO and AW methods. When the associated SNPs have a stronger association with the secondary phenotype, i.e. *β*_{1} = 0.69 (lower portion of Table 5), and with *σ*^{2} = {log(2)/2}^{2}, W and MLE have smaller detection probabilities than CO and AW for *T* = 10, but for *T* = 100 or 10, 000, the detection probabilities of W and MLE exceed those of CO and AW. When the variability of *δ*_{12} is larger, with *σ*^{2} = {log(2)}^{2}, the W and MLE methods have smaller detection probabilities than CO and AW, especially for *T* = 10 and 100. The CA method always has the lowest detection probability. As expected, CO is not affected by increasing variability of *δ*_{12}; AW is also robust to increasing variability of *δ*_{12}.

For a rare disease and dichotomous secondary endpoint and genetic marker, we have investigated whether and how to use data from diseased subjects to study the association between a genetic marker and secondary phenotype. We considered both estimating and testing the null hypothesis of no association for a pre-specified SNP, and for discovering an association in a GWAS with either hypothesis testing approaches or approaches based on ranking the chi-square statistics. In the absence of an interaction *δ*_{12} in model (2), each of the five methods we considered leads to valid inference, and the W and MLE methods are particularly efficient. In the presence of interaction (*δ*_{12} ≠ 0), the CO method controls the type I error perfectly, and the AW has proper size for rare interactions (1%) and only modestly supra-nominal type I error for common interactions (20%). The CA, W and MLE methods do not control type I error and cannot be recommended if it is plausible that *δ*_{12} ≠ 0. The AW method has lower MSE for a pre-selected SNP and greater power than the CO method, which is achieved at the cost of a slight increase in type I error. We showed that the MLE method reduces to the CO method if the model allows for non-null *δ*_{12}.

Under the assumption of rare disease and dichotomous genetic marker and phenotype, the CO method is robust in that it maintains the unbiasedness and nominal type I error despite any interaction effect. The W and MLE methods fully utilize both controls and cases and are most efficient for estimation. When there is no interaction effect, both weighted and MLE are almost twice as efficient as the control only method in estimation and have around 70% more power than the control only method. We prefer the weighted method because it is nearly fully efficient and its estimator is non-iterative. Thus, there are no problems of convergence as can arise with the MLE method. However, even a small interaction effect causes large bias and highly inflated type I error for the CA, W, and MLE methods. The AW method strikes a balance between the robust CO method and the W method. It maintains the unbiasedness and near nominal type I error across most values of *δ*_{12}, although it has moderately inflated type I error when *δ*_{12} is not far from zero. If *δ*_{12} is near zero, estimates based on AW have smaller MSE than those from CO for a prespecified SNP. Under a mixture distribution for *δ*_{12} which was chosen to allow most *δ*_{12} values to be zero, the AW method achieved an important gain in power compared to CO. The detection probabilities of the W and MLE methods degrade when ranking SNPs in the presence of increasing variability of *δ*_{12}. However, CO and AW methods maintain their detection probabilities and are robust to increasing variable *δ*_{12}.

Jiang, Scott and Wild [2006] discussed methods for analyzing secondary phenotypes in case-control studies. Their fully non-parametric approach (SPML1) corresponds to MLE under our model (2) with *δ*_{12} included, which is equivalent to the CO method for inference on *β*_{1}. Assuming *δ*_{12} = 0 corresponds to possibly misspecified parametric modeling (SPML2) in their notation. MLE under SPML2 was described by Lin and Zeng [2009], who also covered non rare diseases and both dichotomous and continuous secondary phenotypes. If *δ*_{12} ≠ 0, the MLE method of Lin and Zeng does not control the type I error, as indicated by our results in Section 3.2 and 3.3 and in the discussion of model misspecification for SPML2 by Jiang et al. [2006].

In unreported analysis, we evaluated the performance of the various methods using prostate cancer data from the Cancer Genetic Markers of Susceptibility (CGEMS) study. We conducted a genome-wide scan on the association between the secondary phenotype, body mass index BMI (1, if BMI≤ 25; 0, else), and the 516,564 SNPs from 22 autosomal chromosomes. We estimated the distribution of *δ*_{12}, and found no evidence that the variance of across SNPs exceeded that which would be expected from the multinomial sampling error alone. Thus, we did not find evidence that *δ*_{12} ≠ 0 for some SNPs. Under such situation, W and MLE are two most efficient methods, and both identified SNP rs7575639 with a genome-wide significant *p* < 10^{−7}. The 20 SNPs with smallest p-values selected by the W and MLE methods were identical, with only slight differences in ranking. For 11 SNPs, spurious results resulted from convergence problems for MLE. Only careful scrutiny of the extreme values for these SNPs revealed the problem with MLE. For this reason, we recommend the numerically stable W method instead of MLE.

Kraft [2007] argued that it is unlikely for both the secondary phenotype and genetic marker to affect the original case-control disease risk, let alone for there to be an interaction. In terms of equation (2), he suggested that either *δ*_{1} or *δ*_{2} would usually be zero and implicitly that *δ*_{12} would be zero. If this is so, one could use the W method and gain precision and power thereby. More experience is needed with GWASs to see if the W or MLE methods yield many false positive results as a consequence of *δ*_{12} ≠ 0, or if their detection probabilities for ranking promising SNPs are degraded by the presence of interaction effects. Our work makes it clear that spurious positive findings may result from such an interaction, and that one can protect against such findings by using the control only or adaptively weighted approaches.

The authors thank Dr. Kai Yu for helpful suggestions. This work was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics of the National Cancer Institute. This research was partially supported by the NIH Gene Environment Initiative (GEI) program. This research utilized the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Maryland, USA (http://biowulf.nih.gov).

Start from the retrospective likelihood function,

1. Using the saturated disease model and notations in Table 1, we obtained the log-likelihood function,

(9)

where *P*_{0} = P(*X* = 0). In this log-likelihood function, there are six unknown parameters {*β*_{0}, *β*_{1}, *P*_{0}, *δ*_{1}, *δ*_{2}, *δ*_{12}}. By differentiating (9) with respect to each parameter and setting the derivatives to zero, we obtain six equations. Solving them, we obtained the following analytic solutions: , , , , , and . Thus , proving the assertion in Section 2.3.

2. Using the reduced disease model with *δ*_{12} = 0, we have the following log-likelihood function:

(10)

In this loglikelihood function, there are five unknown parameters {*β*_{0}, *β*_{1}, *P*_{0}, *δ*_{1} , *δ*_{2}}. By differentiating (10) with respect to each parameter and setting the derivatives to zero, we obtain five equations. There are no explicit solutions for these parameters, except for *P*_{0}. Before solving them numerically, we simplified them to:

(11)

(12)

(13)

(14)

(15)

Equations (11) - (13) depend on {*β*_{0}, *β*_{1}, *δ*_{1}} only. Using SAS PROC MODEL, we obtained . Substituting these values and from (15) into (14), we solved for . Using these solutions we can estimate the variance of from the observed matrix of second derivatives of the log likelihood and compare them with corresponding estimates from the log likelihood (9).

[1] Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Stat Med. 2006;25:1323–1339. [PubMed]

[2] Gail MH, Pfeiffer RM, Wheeler W, Pee D. Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics (Oxford, England) 2008;9(2):201–15. [PubMed]

[3] Kraft P. Analyses of genome-wide association scans for additional outcomes. Epidemiology. 2007;18:838. [PubMed]

[4] Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009;33:256–265. [PMC free article] [PubMed]

[5] Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfield J, Hein DW, Hayes RB. Cigarette smoking n-acetyltransferase genes and the risk of advanced colorectal adenoma. Pharmacogenomics. 2006;7:819–829. [PubMed]

[6] Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of casecontrol studies: an empirical bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics. 2008;64(3):685–94. [PubMed]

[7] Mukherjee B, Ahn J, Gruber S, Rennert G, Moreno V, Chatterjee N. Tests for gene-enviroment interaction from case-control data: a novel study of type I error, power and designs. Genet Epidemiol. 2008;32:615–626. [PubMed]

[8] Satten GA, Kupper LL. Inferences about exposure-disease associations using probability-of-exposure information. J Am Sta Assoc. 1993;88:200–208.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |