Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Genet Epidemiol. Author manuscript; available in PMC 2010 August 9.
Published in final edited form as:
PMCID: PMC2918520

Using Cases to Strengthen Inference on the Association between Single Nucleotide Polymorphisms and a Secondary Phenotype in Genome-Wide Association Studies


Case-control genome-wide association studies provide a vast amount of genetic information that may be used to investigate secondary phenotypes. We study the situation in which the primary disease is rare and the secondary phenotype and genetic markers are dichotomous. An analysis of the association between a genetic marker and the secondary phenotype based on controls only is valid, whereas standard methods that also use cases result in biased estimates and highly inflated type I error if there is an interaction between the secondary phenotype and the genetic marker on the risk of the primary disease. Here we present an adaptively weighted method that combines the case and control data to study the association, while reducing to the controls only analysis if there is strong evidence of an interaction. The possibility of such an interaction and the misleading results for standard methods, but not for the adaptively weighted or controls only approaches, are illustrated by data from a case-control study of colorectal adenoma, in which the secondary phenotype is smoking. Simulations and asymptotic theory indicate that the adaptively weighted method can reduce the mean square error for estimation with a pre-specified SNP and increase the power to discover a new association in a genome-wide study, compared to an analysis of controls only. Further experience with genome-wide studies is needed to determine when methods that assume no interaction and gain precision and power, thereby can be recommended, and when methods such as the adaptively weighted or controls only approaches are needed to guard against the possibility of non-zero interactions.

Keywords: adaptively weighted, case-control study, genome-wide association study, maximum likelihood, secondary phenotype


The genome-wide association study (GWAS) is a powerful tool to identify genetic associations with a disease. A GWAS may also provide information on secondary phenotypes that are measured for all subjects. Added value could be gained from a GWAS by studying the association between genes and the secondary phenotypes.

When the disease in a GWAS is rare, the controls could be regarded as a random sample from the general population, and they could be used for estimating the association with a secondary phenotype. However, if the disease is associated with both the SNP and the secondary phenotype, including cases may introduce bias in the estimation of association, even when the disease is rare.

Lin and Zeng [2009] studied a likelihood-based method that combines the cases and controls efficiently to analyze secondary phenotypes in GWASs. This method is comprehensive because it covers both quantitative and dichotomous secondary phenotypes and both rare and non-rare disease. However, the maximum likelihood estimates are only unbiased and statistically efficient under the assumption that there is no interaction between gene and secondary phenotype effects on disease risk. In their paper, they stated that all standard methods based on controls only, cases only, and the combination of cases and controls yield unbiased estimates if the disease is rare. However, this is not true if there is an interaction between gene and secondary phenotype effects on disease risk.

The purpose of this study was to develop a procedure that can use cases as well as controls to increase efficiency without introducing bias. We focused on the important special situation in which the original disease is rare and the secondary phenotype and genetic marker are dichotomous. We investigated the performance of standard methods and the maximum likelihood estimation (MLE) method discussed in Lin and Zeng [2009] while allowing for the interaction. We found that the MLE method is equivalent to the analysis of controls only and does not provide additional efficiency, if one includes an interaction in the model.

We proposed an adaptively weighted method that combines the case and control data to estimate the association with reduced mean square error, based on a balance between bias and variance. In the presence of strong interaction, this method reduces to the controls only analysis. Both simulated and real data examples suggest an advantage of the proposed adaptively weighted estimator in both estimation and gene discovery. This research also provides guidance on the validity of the various proposed approaches for analyzing secondary phenotypes.

The paper is organized as follows. In the Methods Section, we describe the study setting and data and present the different analytic methods, including our adaptively weighted method. In the Results Section, we illustrate the performance of different methods on colorectal adenoma case-control data, for which there is an interaction between gene and secondary phenotype (smoking) on disease risk. In simulations, we evaluate the properties of estimates and tests from the different methods. We also illustrate the promising power of the adaptively weighted method in large-scale SNP discovery GWAS. Conclusions are in the Discussion Section and technical details are in the Appendix .



We consider the simple but important scenario of an unmatched case-control study with a rare disease, dichotomous genetic marker G, and dichotomous secondary phenotype X. Let D = 1 or 0 denote the diseased or non-diseased state for each individual. Let G = 1 or 0 according as an individual carries at least one SNP allele of interest or not. Let X = 1 or 0 denote whether the individual has or does not have the secondary trait. n0 and n1 are the number of controls and cases, respectively. The data can be represented as in Table 1.

Table 1
Data for an unmatched case-control study with dichotomous genetic marker and secondary phenotype

Let r0 = (r000, r001, r010, r011) and r1 = (r100, r101, r110, r111) denote the case and control cell frequency vectors, respectively. Let p0 = (p000, p001, p010, p011), where p011 = 1−p000p001p010, and p1 = (p100, p101, p110, p111), where p111 = 1−p100p101p110, denote the unknown true cell probabilities in the underlying case and control populations, respectively. The observed cell frequencies can be viewed as realizations from two independent multinomial distributions, namely, r0 ~ Multinomial(n0,p0) and r1 ~ Multinomial(n1, p1).

One can use the logistic regression model for the dichotomous secondary phenotype

equation M1

When the disease is rare, the dependency of D on G and X can be modeled as

equation M2

We are interested in the inference regarding β1, which represents the log odds ratio of G and X in the general population, namely exp(β1) = ORGX.



The controls can be regarded as a random sample of the general population if disease is rare, and therefore the odds ratio of G and X among the controls estimates the odds ratio in the population, namely equation M3. The MLE of β1 using controls only is given by

equation M4

This estimator is nearly unbiased if the disease is rare. From standard asymptotic theory, equation M5. The chi-square test of association is based on the Wald statistic equation M6.


The estimator of β1 using cases only is given by

equation M7

This is not unbiased unless δ12 = 0 in model (2), because the odds ratio of G and X among the cases is

equation M8

The variance estimate of the case estimator is equation M9, which can be used in the Wald statistic equation M10.


When there is no interaction between G and X in model (2), both cases and controls can be used to estimate β1. Combining the case and control estimates by inverse variance weighting leads to

equation M11

where equation M12. By ignoring the variation in equation M13 and equation M14, we estimate equation M15, which can be used in the Wald statistic equation M16. Although equation M17 is more efficient than equation M18, it is only unbiased when δ12 = 0.


Lin and Zeng [2009] used the maximum likelihood estimation method to analyze secondary phenotype data based on the retrospective likelihood function,

equation M19

Using the rare disease assumption with a dichotomous secondary phenotype, we found the following results from this likelihood:

Result 1

Using the saturated disease model (2), the maximum likelihood estimator equation M20, and the maximum likelihood method does not use any information from the cases. (See Appendix for proof.)

Result 2

Lin and Zeng [2009] assumed δ12 = 0 in model (2). Under this assumption, and for a rare disease, equation M21 has only very slightly smaller asymptotic variance than equation M22. For example, for δ12 = 0 and β1 = 0.25, when n1 = n0 = 1, 000, equation M23 and equation M24; when n1 = n0 = 10, 000, equation M25 and equation M26.


To capture some of the efficiency of equation M27 or equation M28 while avoiding the bias in these estimates that results when δ12 ≠ 0, we proposed an estimator that down-weights the contribution from cases as the evidence against δ12 = 0 increases.

Motivated by an empirical Bayes shrinkage estimator for gene-environment interaction [Mukherjee and Chatterjee, 2008], we proposed the following estimate:

equation M29

where equation M30 estimates the interaction.

From (5), as equation M31, equation M32, and as equation M33, equation M34. Thus equation M35 adaptively combines the weighted and control only estimators based on the interaction estimate equation M36.

We estimated the variance of equation M37 by rewriting (5) in term of equation M38 and equation M39 as

equation M40

By noting that equation M41 and equation M42 are independent and neglecting the variability of equation M43 and equation M44, we obtained the following variance estimate by Taylor expansion,

equation M45

This estimator is used to construct Wald statistic equation M46.



Colorectal adenoma is a precursor of colorectal cancer. Colorectal adenoma is positively associated with smoking and negatively associated with haplotypes of a gene NAT2 that promotes rapid acetylation of carcinogens [Moslehi and others, 2006]. Here we code NAT2 = 1 for haplotypes corresponding to rapid acetylation and NAT2 = 0 for other haplotypes. NAT2 is also important in the metabolism of smoking-related carcinogens. Moslehi and others [2006] analyzed a case-control study of advanced colorectal adenoma in the Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial at the National Cancer Institute. Table 2 shows the observed cell counts for this case-control study. From marginal tables constructed from Table 2, one sees that the odds ratios relating colorectal adenoma to NAT2 and smoking are 0.93 and 1.41 respectively, demonstrating the protective effect of the rapid acetylator phenotype that corresponds to NAT2=1. The odds ratios relating NAT2 to smoking are 1.85 in controls but 0.378 in cases, indicating a protective interaction between smoking and NAT2 on the risk of colorectal adenoma. Adjusting for age and sex, we found equation M47 with p-value= 0.0032. Ignoring this interaction will introduce bias in analyzing the secondary phenotype, smoking. Estimates of the log odds ratio associating NAT2 with smoking are shown for the five methods in Section 2 in Table 3. The adaptively weighted estimate is positive and similar to the control only estimate, while the case only, weighted and MLE estimates are negative. This example is chosen to illustrate the possibility that interactions might exist. In this case, NAT2 rapid acetylation helps eliminate carcinogens from smoking, thereby reducing the risk of colorectal adenoma from smoking. Typically, there might be no or only weak interactions, but this example illustrates the need for methods that are not misleading when interactions are present.

Table 2
Data for studying NAT2 and smoking association in the colorectal adenoma case-control study*
Table 3
Odds ratio estimates associating NAT2 with smoking in the colorectal adenoma case-control study


We used Monte Carlo simulation to evaluate the performance of different estimators of the effects of a pre-selected SNP. We fixed the probabilities of carrying one or two alleles of interest, G, and the secondary phenotype, X, as P(G = 1) = 0.3 and P(X = 1) = 0.3. We let β1 = 0 and 0.25 under the null and alternative hypotheses, respectively. Fixing these three values, one can determine the distribution of G and X in the population by solving three equations. Under the rare disease assumption, this solution also approximates the control cell probability vector p0. For the disease-risk model (2), we set μ = −10 to reflect the rare disease and let δ1 = δ2 = 0 and δ12 vary from −2 to 2. The case cell probability vector p1 can be determined by combining p0 with the disease risk model [Satten and Kupper, 1993]. For each set of simulation parameters, we generated 10, 000 datasets with 1, 000 cases and 1, 000 controls from two independent multinomial distributions corresponding to the case and control populations. We estimated β1 by the adaptively weighted (AW) method, the controls only (CO) method, the cases only (CA) method, the weighted (W) method and the MLE method.

Figure 1 shows the relative biases (left panels), the coverage probabilities of 95% confidence intervals for β1 (middle panels), and mean squared errors (MSE) for the AW and CO methods (right panels). Figure 2 displays the type I error for all methods and the power for testing H0 : β1 = 0 against H1 : β1 = 0.25 for the CO and AW methods. All calculations were based on n0 = n1 = 1, 000 and α = 0.05.

Figure 1
Relative biases (RBias), coverage probabilities (CP) of 95% confidence intervals and mean squared errors (MSE) for different estimators for two values of β1:(1) β1 = 0; (2) β2 = 0.25.
Figure 2
Type I error rates and power of association tests at the 5% nominal significance level.

The AW and CO methods perform very well. Their estimates are virtually unbiased and their confidence intervals have proper coverage probabilities (Figure 1). The AW method has smaller MSE near δ12 = 0 than the CO method (Figure 1), but not for values of δ12 away from 0. The CO method maintains nominal size (Figure 2). The AW method maintains near nominal type I error, although there is a very slight increase in size above 0.05 near |δ12| = 0.4. The power of AW exceeds that of CO for δ12 > 0, but is less than that of CO for δ12 < 0 (Figure 2).

The CA, W and MLE methods, are seriously biased (Figure 1), have substantially subnominal coverage (Figure 1), and have above nominal Type I error (Figure 2) for δ12 ≠ 0.



In the previous section we studied estimation and hypothesis testing for a single pre-specified SNP. Here we investigate the size and power of different methods in large-scale SNP discovery GWAS.

By replacing the {r0, r1} in the different methods with the expected cell counts {n0p0, n1p1}, we obtained the asymptotic mean and variance for each estimate and hence the non-centrality, λ, of the corresponding one degree freedom chi-squared distribution for the Wald statistic, denoted as equation M48. Under the null hypothesis, when δ12 = 0, each method has λ = 0; when δ12 ≠ 0, every method has nonzero λ except the control only method. Under the alternative, all the λs are non-zero. We can compute the asymptotic size and power for testing one given SNP from the formula: equation M49 with the corresponding λ under the null and alternative hypothesis for each method respectively and equation M50 being the quantile of the central chi-square distribution, .

Assuming there were N = 500, 000 independent SNP genetic markers, we controlled the experimentwise significance by setting α = 0.05/(5 × 105) = 10−7. It may be reasonable to suppose that δ12 = 0 for a large proportion of SNPs. We assume 99% of SNPs have δ12 = 0, and 1% of δ12 are independently distributed as N(0, (log(2)/2)2), which implies that about 95% of nonzero δ12 values reside in [−log(2), log(2)]. We evaluated the genome-wide type I error and power of the various Wald tests averaged over the mixture distribution of δ12. For each method we estimated the genome-wide type I error under the null from the formula

equation M51

and the power under the alternative β1 = 0.25 can be calculated from

equation M52

where f(x) is the normal density N[0, {log(2)/2}2]. The integrals in equations (7) and (8) are approximated by drawing 100,000 δ12 from N[0, {log(2)/2}2] and averaging the corresponding values of equation M53 for equation (7) and of equation M54 for equation (8). The non-centrality λ depends on δ12 and are recomputed for each realization.

The genome-wide type I error and power for different methods are presented in Table 4 for numbers of cases and controls n = 1000, n = 5000 and n = 10,000. The genome-wide type I error is 1.0 for WCA, WW, and WMLE, indicating that these tests should not be used even if only a small proportion of SNPs have δ12 ≠ 0. Both WCO and WAW statistics have near nominal genome-wide type I error. However, for sample size n1 = n0 = 5, 000 or 10, 000, the power of WAW greatly exceeds that for WCO. For example, when n0 = n1 = 10,000, the power of WAW is almost 98%, while that of WCO is only 50%. Thus, substantial power gains can be achieved with WAW. Unreported simulations from a mixture distribution with 20% non-zero δ12s also show that WAW has greater power than CO but a very slight excess in size (e.g. size= 0.053) was observed for WAW.

Table 4
Genome-wide type I error and power for δ12 from a mixture distribution for various sample sizes with genome-wide significance 0.05


Instead of setting fixed critical values for declaring an association statistically significant, one can rank p-values or chi-square statistics to select promising SNPs [Gail et. al., 2008]. In this section, we study the detection probability which is the probability that the test statistic for a specified disease-associated SNP will be among the top T chi-square values for all SNPs. We estimated the detection probability with 1, 000 simulated replicates. In each replicate, we generated data for N = 500, 000 SNPs with the same parameters used in Section 3.2, except that 10 disease-associated SNPs had nonzero β1, whereas the 499,990 remaining null SNPs had βi = 0. All SNPs had interactions δ12 drawn from a mixture distribution in which 99% of SNPs have δ12 = 0, and 1% of δ12 are independently distributed as N(0,σ2). We conducted four independent simulations for the combinations of β1 = {0.25, 0.69} and σ2 = {(log(2)/2)2, (log(2))2}.

When the associated SNPs have small log odds ratios with the secondary phenotype, i.e. β1 = 0.25 (upper portion of Table 5), all the methods have low detection probabilities for T ≤ 100.. When T ≤ 100 and the variability of δ12 is large, the W and MLE methods have detection probabilities near zero, because some of the SNPs with interactions have large chi-square values than the disease-associated SNPs. The CO and AW methods perform slightly better. When T is 10, 000, (2% of the total SNPs), the detection probabilities of all methods increase and the W and MLE methods outperform the CO and AW methods. When the associated SNPs have a stronger association with the secondary phenotype, i.e. β1 = 0.69 (lower portion of Table 5), and with σ2 = {log(2)/2}2, W and MLE have smaller detection probabilities than CO and AW for T = 10, but for T = 100 or 10, 000, the detection probabilities of W and MLE exceed those of CO and AW. When the variability of δ12 is larger, with σ2 = {log(2)}2, the W and MLE methods have smaller detection probabilities than CO and AW, especially for T = 10 and 100. The CA method always has the lowest detection probability. As expected, CO is not affected by increasing variability of δ12; AW is also robust to increasing variability of δ12.

Table 5
Associated SNP detection probability for various numbers of selected SNPs, T. δ12 = 0 with probability 0.99 and with probability 0.01, δ12 is drawn from a normal distribution with mean 0 and variance σ2


For a rare disease and dichotomous secondary endpoint and genetic marker, we have investigated whether and how to use data from diseased subjects to study the association between a genetic marker and secondary phenotype. We considered both estimating and testing the null hypothesis of no association for a pre-specified SNP, and for discovering an association in a GWAS with either hypothesis testing approaches or approaches based on ranking the chi-square statistics. In the absence of an interaction δ12 in model (2), each of the five methods we considered leads to valid inference, and the W and MLE methods are particularly efficient. In the presence of interaction (δ12 ≠ 0), the CO method controls the type I error perfectly, and the AW has proper size for rare interactions (1%) and only modestly supra-nominal type I error for common interactions (20%). The CA, W and MLE methods do not control type I error and cannot be recommended if it is plausible that δ12 ≠ 0. The AW method has lower MSE for a pre-selected SNP and greater power than the CO method, which is achieved at the cost of a slight increase in type I error. We showed that the MLE method reduces to the CO method if the model allows for non-null δ12.

Under the assumption of rare disease and dichotomous genetic marker and phenotype, the CO method is robust in that it maintains the unbiasedness and nominal type I error despite any interaction effect. The W and MLE methods fully utilize both controls and cases and are most efficient for estimation. When there is no interaction effect, both weighted and MLE are almost twice as efficient as the control only method in estimation and have around 70% more power than the control only method. We prefer the weighted method because it is nearly fully efficient and its estimator is non-iterative. Thus, there are no problems of convergence as can arise with the MLE method. However, even a small interaction effect causes large bias and highly inflated type I error for the CA, W, and MLE methods. The AW method strikes a balance between the robust CO method and the W method. It maintains the unbiasedness and near nominal type I error across most values of δ12, although it has moderately inflated type I error when δ12 is not far from zero. If δ12 is near zero, estimates based on AW have smaller MSE than those from CO for a prespecified SNP. Under a mixture distribution for δ12 which was chosen to allow most δ12 values to be zero, the AW method achieved an important gain in power compared to CO. The detection probabilities of the W and MLE methods degrade when ranking SNPs in the presence of increasing variability of δ12. However, CO and AW methods maintain their detection probabilities and are robust to increasing variable δ12.

Jiang, Scott and Wild [2006] discussed methods for analyzing secondary phenotypes in case-control studies. Their fully non-parametric approach (SPML1) corresponds to MLE under our model (2) with δ12 included, which is equivalent to the CO method for inference on β1. Assuming δ12 = 0 corresponds to possibly misspecified parametric modeling (SPML2) in their notation. MLE under SPML2 was described by Lin and Zeng [2009], who also covered non rare diseases and both dichotomous and continuous secondary phenotypes. If δ12 ≠ 0, the MLE method of Lin and Zeng does not control the type I error, as indicated by our results in Section 3.2 and 3.3 and in the discussion of model misspecification for SPML2 by Jiang et al. [2006].

In unreported analysis, we evaluated the performance of the various methods using prostate cancer data from the Cancer Genetic Markers of Susceptibility (CGEMS) study. We conducted a genome-wide scan on the association between the secondary phenotype, body mass index BMI (1, if BMI≤ 25; 0, else), and the 516,564 SNPs from 22 autosomal chromosomes. We estimated the distribution of δ12, and found no evidence that the variance of equation M55 across SNPs exceeded that which would be expected from the multinomial sampling error alone. Thus, we did not find evidence that δ12 ≠ 0 for some SNPs. Under such situation, W and MLE are two most efficient methods, and both identified SNP rs7575639 with a genome-wide significant p < 10−7. The 20 SNPs with smallest p-values selected by the W and MLE methods were identical, with only slight differences in ranking. For 11 SNPs, spurious results resulted from convergence problems for MLE. Only careful scrutiny of the extreme values for these SNPs revealed the problem with MLE. For this reason, we recommend the numerically stable W method instead of MLE.

Kraft [2007] argued that it is unlikely for both the secondary phenotype and genetic marker to affect the original case-control disease risk, let alone for there to be an interaction. In terms of equation (2), he suggested that either δ1 or δ2 would usually be zero and implicitly that δ12 would be zero. If this is so, one could use the W method and gain precision and power thereby. More experience is needed with GWASs to see if the W or MLE methods yield many false positive results as a consequence of δ12 ≠ 0, or if their detection probabilities for ranking promising SNPs are degraded by the presence of interaction effects. Our work makes it clear that spurious positive findings may result from such an interaction, and that one can protect against such findings by using the control only or adaptively weighted approaches.


The authors thank Dr. Kai Yu for helpful suggestions. This work was supported by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics of the National Cancer Institute. This research was partially supported by the NIH Gene Environment Initiative (GEI) program. This research utilized the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Maryland, USA (


Start from the retrospective likelihood function,

equation M57

1. Using the saturated disease model equation M58 and notations in Table 1, we obtained the log-likelihood function,

equation M59

where P0 = P(X = 0). In this log-likelihood function, there are six unknown parameters {β0, β1, P0, δ1, δ2, δ12}. By differentiating (9) with respect to each parameter and setting the derivatives to zero, we obtain six equations. Solving them, we obtained the following analytic solutions: equation M60, equation M61, equation M62, equation M63, equation M64, and equation M65. Thus equation M66, proving the assertion in Section 2.3.

2. Using the reduced disease model equation M67 with δ12 = 0, we have the following log-likelihood function:

equation M68

In this loglikelihood function, there are five unknown parameters {β0, β1, P0, δ1 , δ2}. By differentiating (10) with respect to each parameter and setting the derivatives to zero, we obtain five equations. There are no explicit solutions for these parameters, except for P0. Before solving them numerically, we simplified them to:

equation M69

equation M70

equation M71

equation M72

equation M73

Equations (11) - (13) depend on {β0, β1, δ1} only. Using SAS PROC MODEL, we obtained equation M74. Substituting these values and equation M75 from (15) into (14), we solved for equation M76. Using these solutions we can estimate the variance of equation M77 from the observed matrix of second derivatives of the log likelihood and compare them with corresponding estimates from the log likelihood (9).


[1] Jiang Y, Scott AJ, Wild CJ. Secondary analysis of case-control data. Stat Med. 2006;25:1323–1339. [PubMed]
[2] Gail MH, Pfeiffer RM, Wheeler W, Pee D. Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics (Oxford, England) 2008;9(2):201–15. [PubMed]
[3] Kraft P. Analyses of genome-wide association scans for additional outcomes. Epidemiology. 2007;18:838. [PubMed]
[4] Lin DY, Zeng D. Proper analysis of secondary phenotype data in case-control association studies. Genet Epidemiol. 2009;33:256–265. [PMC free article] [PubMed]
[5] Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfield J, Hein DW, Hayes RB. Cigarette smoking n-acetyltransferase genes and the risk of advanced colorectal adenoma. Pharmacogenomics. 2006;7:819–829. [PubMed]
[6] Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of casecontrol studies: an empirical bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics. 2008;64(3):685–94. [PubMed]
[7] Mukherjee B, Ahn J, Gruber S, Rennert G, Moreno V, Chatterjee N. Tests for gene-enviroment interaction from case-control data: a novel study of type I error, power and designs. Genet Epidemiol. 2008;32:615–626. [PubMed]
[8] Satten GA, Kupper LL. Inferences about exposure-disease associations using probability-of-exposure information. J Am Sta Assoc. 1993;88:200–208.