Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2684820

Formats

Article sections

Authors

Related links

Genet Epidemiol. Author manuscript; available in PMC 2010 April 1.

Published in final edited form as:

Genet Epidemiol. 2009 April; 33(3): 256–265.

doi: 10.1002/gepi.20377PMCID: PMC2684820

NIHMSID: NIHMS94335

Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina

Address for Correspondence: Danyu Lin, Ph.D., Department of Biostatistics, University of North Carolina, McGavran-Greenberg Hall, CB #7420, Chapel Hill, North Carolina 27599-7420, Phone: (919) 843-5134, Fax: (919) 966-3804, Email: ude.cnu.soib@nil

See other articles in PMC that cite the published article.

Case-control association studies often collect extensive information on secondary phenotypes, which are quantitative or qualitative traits other than the case-control status. Exploring secondary phenotypes can yield valuable insights into biological pathways and identify genetic variants influencing phenotypes of direct interest. All publications on secondary phenotypes have used standard statistical methods, such as least-squares regression for quantitative traits. Because of unequal selection probabilities between cases and controls, the case-control sample is not a random sample from the general population. As a result, standard statistical analysis of secondary phenotype data can be extremely misleading. Although one may avoid the sampling bias by analyzing cases and controls separately or by including the case-control status as a covariate in the model, the associations between a secondary phenotype and a genetic variant in the case and control groups can be quite different from the association in the general population. In this article, we present novel statistical methods that properly reflect the case-control sampling in the analysis of secondary phenotype data. The new methods provide unbiased estimation of genetic effects and accurate control of false-positive rates while maximizing statistical power. We demonstrate the pitfalls of the standard methods and the advantages of the new methods both analytically and numerically. The relevant software is available at our website.

There is a proliferation of genomewide association (GWA) studies worldwide. These studies usually employ the case-control design, which consists of a sample of cases (i.e. diseased individuals) and a sample of controls (i.e., disease-free individuals). Most GWA studies measure a variety of quantitative or qualitative traits other than the disease trait that defines the case-control status. Exploring these secondary phenotypes can discover genetic variants influencing previously unstudied phenotypes and provide important clues about causal pathways. Although the main interest of case-control association studies lies in the comparison of cases and controls, analysis of secondary phenotype data may supplement the case-control comparison in the initial reports or become the primary focus of subsequent publications.

Indeed, recent months have seen an explosion of publications on genetic variants influencing human quantitative traits, such as height [Weedon et al., 2007; Sanna et al., 2008; Weedon et al., 2008; Lettre et al., 2008; Gudbjartsson et al., 2008], body mass index (BMI) [Frayling et al., 2007; Loos et al., 2008], and lipid levels [Saxena et al., 2007; Willer et al., 2008; Kathiresan et al., 2008]. The data for those publications came mostly from case-control association studies of complex diseases (e.g., diabetes, cancer and hypertension). The most recent publications were based on meta-analysis of multiple GWA studies involving thousands or tens of thousands of individuals.

All publications on quantitative traits have relied on standard linear regression analysis (i.e., classical least-squares estimation under the linear model). Five types of analysis have been conducted to assess the effects of SNPs on quantitative traits using data from case-control association studies: (1) controls only; (2) cases only; (3) combined sample of cases and controls; (4) meta-analysis of cases and controls; (5) joint analysis of cases and controls adjusted for the disease status. Methods (1) and (2) are restricted to controls and cases, respectively. Method (3) analyzes cases and controls together and ignores the disease status. In method (4), cases and controls are analyzed separately and the results are combined by a meta-analytic procedure. Method (5) analyzes cases and controls together and includes the disease status as a covariate in the model.

None of the aforementioned analysis methods is statistically correct. Because cases and controls are selected at different rates from their respective subpopulations, the case-control sample does not constitute a random sample of the general population. As a result, the population association between a SNP and a secondary trait can be distorted in the case-control sample. Thus, method (3) can be grossly misleading, especially when the secondary trait is strongly correlated with the disease and the sampling rates are very different between cases and controls. The other four methods are conditional on the disease status and are thus unaffected by the biased case-control sampling. However, the associations between a SNP and a secondary trait in the case and control groups can be quite different from the association in the general population if the secondary trait is correlated with the disease.

To illustrate the above points, we consider a dichotomous secondary trait taking the values 0 and 1 with equal probability and a SNP with minor allele frequency (MAF) of 0.2. Assume that the disease rate is 10%, the odds ratio of disease with the SNP is 1.5 under the dominant mode of inheritance, and the odds ratio of disease with the secondary trait is 2. If the SNP is unrelated to the secondary trait in the general population (i.e., the odds ratio is 1), then the odds ratios of the secondary trait with the SNP will be approximately 0.975 in the case and control groups, and the odds ratio in the combined sample of cases and controls with an equal number of cases and controls will be approximately 1.045; see Table I for details. In other words, there exist (spurious) associations between the SNP and the secondary trait in the case and control groups, as well as in the combined sample, despite the absence of association at the population level. This phenomenon implies that methods (1), (2) and (3) are all biased; methods (4) and (5) are also biased because they combine the biased results from the case and control groups. Because GWA studies typically have large sample sizes, even modest levels of bias can lead to grossly inflated false-positive rates, especially in meta-analysis.

Probability distributions of disease status, genotype score and secondary trait in a case-control setting

The fact that method (3) is biased seems to contradict with the universally accepted practice of performing standard logistic regression on case-control data. Standard logistic regression analysis of case-control data indeed yields correct maximum likelihood estimates of odds ratios, although the estimate of the disease rate is generally biased [Prentice and Pyke, 1979]. This remarkable result, however, applies only to the primary disease outcome that defines the case-control status, and not to a secondary outcome that is correlated with the disease.

We have developed valid and efficient statistical methods to analyze secondary phenotype data in case-control association studies. Our methods are based on likelihood functions that properly reflect the case-control sampling. The corresponding maximum likelihood estimates are approximately unbiased and normally distributed. Furthermore, the estimates are statistically efficient in that they have the smallest variances among all valid estimates, and the corresponding association tests are the most powerful among all valid tests.

In the next section, we describe in more detail the ingredients of the new methods and the pitfalls of standard methods, particularly methods (1)-(5). In the subsequent Results section, we use Monte Carlo simulation to quantify the bias of standard methods in common situations and to evaluate the operating characteristics of the new methods. We relegate the technical details about the new and standard methods to Appendices A and B, respectively.

Let *D* denote the case-control status (1 = disease; 0 = no disease) and *Y* denote the secondary phenotype. Also, let *X* denote the genotype score for a SNP of interest. Under the additive mode of inheritance, *X* is the number of minor alleles; under the dominant (recessive) model, *X* indicates, by the values 1 versus 0, whether or not the individual carries at least one minor allele (two minor alleles). We use a generalized linear model to formulate the effects of *X* on *Y*, and write the conditional density of *Y* given *X* as *P*(*Y*|*X*). If *Y* is a quantitative trait, we use the linear regression model, which specifies that the conditional distribution of *Y* given *X* is normal with mean *β*_{0} + *β*_{1}*X* and variance *σ*^{2}. If *Y* is a dichotomous trait, we use the logistic regression model, under which

$$P\left(Y=1|X\right)=\frac{{e}^{{\beta}_{0}+{\beta}_{1}X}}{1+{e}^{{\beta}_{0}+{\beta}_{1}X}}.$$

We relate *Y* and *X* to *D* through the logistic regression model

$$P\left(D=1|X,Y\right)=\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}X+{\gamma}_{2}Y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}X+{\gamma}_{2}Y}}.$$

We are mainly interested in *β*_{1}.

For a case-control study with a total of *n* subjects, the data consist of (*D _{i}, Y_{i}, X_{i}*) (

$$\underset{i=1}{\overset{n}{\mathrm{\Pi}}}\phantom{\rule{0.2em}{0ex}}{\left\{\frac{P\left({D}_{i}=1|{X}_{i},{Y}_{i}\right)P\left({Y}_{i}|{X}_{i}\right)P\left({X}_{i}\right)}{P\left({D}_{i}=1\right)}\right\}}^{{D}_{i}}\phantom{\rule{0.2em}{0ex}}{\left\{\frac{P\left({D}_{i}=0|{X}_{i},{Y}_{i}\right)P\left({Y}_{i}|{X}_{i}\right)P\left({X}_{i}\right)}{P\left({D}_{i}=0\right)}\right\}}^{1-{D}_{i}},$$

where *P*(*D _{i}* = 1) = Σ

The estimation of *γ*_{0} may be unstable, especially for dichotomous *Y*. When the disease is rare such that *P*(*D* = 1|*X, Y*) ≈ *e*^{γ0+γ1X+γ2Y}, the parameter *γ*_{0} cancels in the numerator and denominator of the likelihood function. When the disease is not rare but the disease rate is known approximately, we can incorporate the information about the disease rate into the estimation. We can include environmental covariates in the models for *Y* and for *D*, but then the probability distribution of continuous environmental covariates will enter into the likelihood function as a high-dimensional nuisance parameter. We eliminate such nuisance parameters through the profile-likelihood approach. The interested readers are referred to Appendix A for the theoretical and computational details of the new methods.

As described in the Introduction section, standard statistical methods have been applied to the secondary phenotype data from case-control studies in five different ways, which we will refer to as methods (1)-(5). Methods (1)-(3) are based on the prospective likelihood function Π* _{i} P*(

We have conducted a thorough investigation into the properties of the five standard methods. We state here the main conclusions while relegating the details to Appendix B. If the secondary phenotype is not related to the case-control status, or more precisely, *D* is independent of *Y* given *X* (i.e., *γ*_{2} = 0), then all five methods are valid. If the SNP is not associated with the case-control status, or more precisely, *D* is independent of *X* given *Y* (i.e., *γ*_{1} = 0), then all five methods yield correct estimates of odds ratios for dichotomous traits, but the least-squares estimates for quantitative traits produced by the five methods are biased unless *β*_{1} = 0 or *γ*_{2} = 0. When the disease is rare in that the probability of disease is virtually 0, all standard methods except (3) are approximately valid.

The fact that standard methods are generally invalid unless *γ*_{1} = 0 or *γ*_{2} = 0 is disconcerting. Most secondary phenotypes are strongly correlated with the case-control status, so that *γ*_{2} ≠ 0. Thus, any SNPs that are associated with the case-control status will tend to be detected as being associated with secondary phenotypes by standard methods even when the latter associations do not exist. When the associations truly exist, all five methods may produce estimates that are biased toward the null and thus reduce statistical power.

We conducted extensive simulation studies to assess the performance of the standard and new methods in the analysis of secondary quantitative traits. We considered a SNP with MAF of 0.3 and additive mode of inheritance. For the model of the secondary quantitative trait, we set *β*_{0} = *σ*^{2} = 1, and let *β*_{1} = 0 and −0.12 under the null and alternative hypotheses, respectively; *β*_{1} = −0.12 means that each copy of the minor allele decreases the trait value by 12% of its standard deviation. For the disease-risk model, we set *γ*_{2} = log 2, varied the value of *γ*_{1} from 0 to log 2, and chose the value of *γ*_{0} to yield a disease rate of 1% or 5%. Note that *γ*_{1} pertains to the change in the log odds ratio of disease for each copy of the minor allele of the SNP and *γ*_{2} to the change in the log odds ratio of disease for one standard-derivation increase of the quantitative trait. The choice of *γ*_{2} = log 2 represents a strong, but not uncommon, association between the secondary phenotype and the disease. For each combination of simulation parameters, we generated 1,000,000 data sets with 1,000 cases and 1,000 controls. We compared the new method to standard methods (1)-(3), i.e., the least-squares methods based on the data from the control group, the case group, and the combined sample of cases and controls. We focused on methods (1)-(3) because methods (4) and (5) are closely related (1) and (2).

We summarize the key results in Figures 1 and and2.2. Figure 1 shows the biases of effect estimates and the coverage probabilities of 99% confidence intervals for *β*_{1} as a function of the odds ratio of disease with the SNP (i.e., *e*^{γ1}) under the alternative hypothesis (i.e., *β*_{1} = −0.12). Figure 2 displays the type I error and power for testing the null hypothesis of *β*_{1} = 0 at the nominal significance level of 1%. We restricted the horizontal axis to 1.5 because the odds ratios that have been observed in GWA studies thus far are mostly less than 1.5 although the odds ratios may be higher in candidate genes studies.

Relative biases of effect estimates and coverage probabilities of 99% confidence intervals for four analysis methods: least-squares methods based on controls only, cases only and combined sample of cases and controls versus the new method. The relative **...**

Type I error rates and power of association tests at the 1% nominal significance level for four analysis methods: least-squares methods based on controls only, cases only and combined sample of cases and controls versus the new method. The odds ratio **...**

The new method performs very well. The effect estimates are virtually unbiased, and the variance estimates accurately reflect the true variations (latter data not shown). A a result, the confidence intervals have proper coverage probabilities, and the association tests have correct type I error rates. The power of the new method is always above 80%.

The least-squares method based on the combined sample of cases and controls, i.e., method (3), can be very wrong, especially when the SNP is strongly related to the disease. The effect estimates are biased, the confidence intervals have poor coverage probabilities, and the type I error is inflated. The problems associated with this method are more severe for rarer diseases. The type I error rates are 8 times and 5 times the nominal significance level under the disease rates of 1% and 5%, respectively, when the odds ratio of disease with the SNP is 1.3. This strategy may be more powerful or less powerful than the new method, dependent on how the SNP affects the disease status and the secondary phenotype. Figure 2 shows that this strategy can be substantially less powerful than the new method.

When the disease is rare, say less than 1%, the least-squares methods based on controls only and cases only, i.e., methods (1) and (2), are appropriate in that the effect estimates are approximately unbiased, the confidence intervals have adequate coverage probabilities, and the association tests have reasonable type I error rates. These two methods, however, use half of the study subjects and are thus much less powerful than the new method. For relatively common diseases, methods (1) and (2) yield biased effect estimates, improper confidence intervals and inflated type I error, especially when the SNP is strongly related to the disease. When the disease rate is 5% and the odds ratio of disease with the SNP is 1.5, the type I error rates for methods (1) and (2) are about 1.3% and 1.8%, respectively.

All the aforementioned results pertain to MAF of 0.3, *γ*_{2} of log 2, 1,000 cases and 1,000 controls, and nominal significance level of 1%. We also considered other combinations of simulation parameters. The new methods continued to perform well. The performance of standard methods became worse as MAF, *γ*_{2} or sample size was increased and as the nominal significance level was lowered. In addition, the performance of methods (1) and (2) became worse as the disease rate was increased.

Figure 3 displays the type I error rates of the five standard methods for MAF of 0.2, disease rate of 7%, 2,000 cases and 2,000 controls, and nominal significance level of 10^{−6}. The type I error rates for all five methods increase rapidly with increasing values of *γ*_{1} and *γ*_{2} and are seriously inflated even with moderate values of *γ*_{1} and *γ*_{2}. Under *γ*_{1} = log 1.3 and *γ*_{2} = 0.5, the type I error rates of methods (1)–(5) are, respectively, 1.7, 2.4, 20, 3.2 and 3.2 times the nominal significance level; under *γ*_{1} = log 1.2 and *γ*_{2} = 1, the five type I error rates are 1.9, 6.2, 25, 6.7 and 7 times the nominal significance level. Controls-only analysis has the least inflation of type I error, followed by cases-only analysis. Meta-analysis of cases and controls has higher type I error than cases-only and controls-only analyses, mainly because it has twice the sample size. The joint analysis of cases and controls with the disease status as a covariate in the linear model has slightly higher inflation of type I error than the meta-analysis. The analysis of the combined sample without adjusting for the disease status, i.e., method (3), performs much worse than the other four methods.

The purpose of this article is two-fold: to evaluate the standard methods and to develop better methods for the analysis of secondary phenotype data in case-control association studies. We have demonstrated both analytically and numerically that all the standard methods can have severely inflated type I error and reduced power in practical situations. The new methods provide accurate control of the type I error and have the highest efficiency/power among all valid methods. We have developed efficient numerical algorithms to implement the new methods and posted the software online. With our software, analysis of a GWA study (with 500K-1,000K SNPs) can be completed in a few hours.

As mentioned in the Introduction section, the recent publications on human height, BMI and lipid levels relied on standard linear regression analysis of quantitative trait data from case-control studies of complex diseases. We are not challenging the conclusions of those publications because some of the initial results were validated in cross-sectional or cohort data, but the effect estimates and *P*-values reported in the papers might be questionable. Using valid and efficient statistical methods in the initial scans would reduce the number of false positives and increase the number of true positives among the initial results and thus enhance the success rates of validation efforts. It would be even more important to use proper statistical methods in any validation analysis.

When the disease is rare, all standard methods except (3) are approximately valid. Our simulation studies revealed that this assumption is problematic when the disease rate is appreciably higher than 1%. As shown in Figures 1 and and2,2, methods (1) and (2) have serious problems when the disease rate is 5%. We recommend to use the rare disease assumption only when the disease rate is less than 2%.

Many of the complex diseases currently under study are relatively common (5-10%), so the results of Figure 3 are particularly relevant. If the disease is type 2 diabetes, then *γ*_{2} is close to 1 for BMI and triglyceride levels and close to 0 for height. Thus, standard linear regression analysis of BMI and triglyceride levels in case-control studies of type 2 diabetes would be more biased than that of height. The most recent publications on quantitative traits were based on meta-analyses of tens of thousands of subjects, so the inflation of type I error would be much more profound than what is seen in Figure 3.

Although only a small fraction of SNPs are truly associated with a complex disease, any SNPs that are associated with the disease in the observed data (mostly by chance) will tend to be spuriously associated with secondary traits in the case and control groups as well as in the combined sample; therefore, all five standard methods can cause large-scale increases of false positive results in GWA studies. Indeed, the quantile-quantile plots in several publications [Weedon et al., 2007; Weedon et al., 2008; Gudbjartsson et al., 2008] showed substantial deviations of observed statistics from the expected (after correcting for population stratification), even at the 0.01 level of significance.

This research was supported by the National Institutes of Health. The authors thank Chad He for his assistance in preparing the figures.

We provide theoretical and computational details for the new statistical methods. In the main text of this article, *X* pertains to a single SNP. In this appendix, we expand *X* to contain all genetic and environmental factors of interest (including gene-environment interactions). We use *P _{θ}*(

$$P\left(D=1|X,Y\right)=\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}},$$

where *γ*_{1} is now a vector.

The data consist of (*D _{i}, Y_{i}, X_{i}*) (

$${\begin{array}{c}\prod _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\left\{\frac{{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)P\left({X}_{i}\right){e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}}/(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}})}{P\left({D}_{i}=1\right)}\right\}\\ \times \underset{i=1}{\overset{n}{\mathrm{\Pi}}}\phantom{\rule{0.2em}{0ex}}{\left\{\frac{{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)P\left({X}_{i}\right)/\left(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}}\right)}{P\left({D}_{i}=0\right)}\right\}}^{1-{D}_{i}},\end{array}}^{{D}_{i}}$$

where *P*(*x*) is the density of *X*, and

$$P\left(D=0\right)={\mathit{\int}}_{y,x}{P}_{\theta}\left(y|x\right)P\left(x\right)/\left(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}\right)\mathit{dydx}.$$

For discrete components, the integration is replaced by the summation.

Before deriving estimation methods, we need to determine which parameters are estimable or identifiable. We wish to show that two sets of parameters {*γ*_{0}, *γ*_{1}, *γ*_{2}, *θ*, *P*(*x*)} and {*γ*_{0}^{*}, *γ*_{1}^{*}, *γ*_{2}^{*}, *θ**, *P**(*x*)} are identical if they yield the same likelihood. It is natural to assume that *P _{θ}*(

We first consider the case in which the disease is rare such that *P*(*D* = 1|*X, Y*) ≈ e^{γ0+γ1TX+γ2Y} and *P*(*D* = 0|*X, Y*) ≈ 1. Then the likelihood function becomes

$$\prod _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}{\left\{\frac{{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)P\left({X}_{i}\right){e}^{{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}}}{{\mathit{\int}}_{y,x}{P}_{\theta}\left(y|x\right)P\left(x\right){e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}\mathit{dydx}}\right\}}^{{D}_{i}}\times \prod _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}{\{{P}_{\theta}({Y}_{i}|{X}_{i})P({X}_{i})\}}^{1-{D}_{i}}.$$

We let *D* = 0 to obtain *P _{θ}*(

Next, we consider the case in which the disease rate is known. Since *P*(*D* = 1) is known,

$${P}_{\theta}\left(Y|X\right)P\left(X\right)\frac{{e}^{D\left({\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y\right)}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}={P}_{{\theta}^{\ast}}\left(Y|X\right){P}^{\ast}\left(X\right)\frac{{e}^{D\left({\gamma}_{0}^{\ast}+{\gamma}_{1}^{\ast \mathrm{T}}X+{\gamma}_{2}^{\ast}Y\right)}}{1+{e}^{{\gamma}_{0}^{\ast}+{\gamma}_{1}^{\ast \mathrm{T}}X+{\gamma}_{2}^{\ast}Y}}.$$

Summing over *D* = 0 and 1 yields *P _{θ}*(

The remaining case is when the disease is not rare and the disease rate is unknown. It can be shown that *γ*_{1} = *γ*_{1}^{*}, *γ*_{2}= *γ*_{2}^{*}, and

$${P}_{\theta}\left(y|x\right)P\left(x\right)={c}_{0}{P}_{{\theta}^{\ast}}\left(y|x\right){P}^{\ast}\left(x\right)\frac{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}^{\ast}+{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}},$$

where *c*_{0} is a constant [Roeder et al., 1996]. In most cases, the above equation implies *θ* = *θ**. (In particular, we can show that *θ* = *θ** for continuous traits satisfying the linear regression model; this is also true for dichotomous traits satisfying the logistic regression model if *γ*_{2} = 0 or if *γ*_{2} ≠ 0 and there is a continuous component of *X* that is related to *D*.) Then

$$P\left(x\right)={c}_{0}{P}^{\ast}\left(x\right)\frac{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}^{\ast}+{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}}.$$

If *γ*_{2} ≠ 0, then the above equation clearly implies *γ*_{0}= *γ*_{0}^{*} and *P*(*x*) = *P**(*x*), so all parameters are identifiable. If *γ*_{2} = 0, we have

$$P\left(x\right)={c}_{0}{P}^{\ast}\left(x\right)\frac{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}x}}{1+{e}^{{\gamma}_{0}^{\ast}+{\gamma}_{1}^{\mathrm{T}}x}},$$

so *γ*_{1}, *γ*_{2} and *θ* are identifiable while *P*(*x*)/(1 + *e*^{γ0+γ1Tx}) is identifiable up to some constant.

In all cases, we estimate the parameters by maximizing the likelihood functions. Because *P*(*x*) is potentially high-dimensional, we use the profile likelihood approach. We provide estimation procedures separately for the three cases discussed above.

We first consider the rare-disease case. Write *p _{i}* =

$$\frac{1}{{p}_{i}}-{n}_{1}\frac{{\mathit{\int}}_{y}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left(y|{X}_{i}\right){e}^{{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}\mathit{dy}}{{\mathit{\int}}_{y,x}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left(y|x\right)P\left(x\right){e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}\mathit{dydx}}-\lambda =0,$$

where *λ* is the Lagrange multiplier for the constraint Σ_{i} *p _{i}* = 1, and

$${p}_{i}=\frac{1}{n-{n}_{1}+{n}_{1}\xi \phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left(y|{X}_{i}\right){e}^{{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}dy\u2019}$$

where

$$\xi ={\left\{{\mathit{\int}}_{y,x}{P}_{\theta}\left(y|x\right)P\left(x\right){e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}\mathit{dydx}\right\}}^{-1}.$$

By the arguments of Lin and Zeng [2006], maximizing the likelihood function is equivalent to maximizing

$$\begin{array}{c}\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\left\{\mathrm{log}\phantom{\rule{0.2em}{0ex}}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)+{D}_{i}({\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i})\right\}+{n}_{1}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\xi \\ -\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left\{1-\frac{{n}_{1}}{n}+\frac{{n}_{1}}{n}\xi \phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right){e}^{{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}\mathit{dy}\right\},\end{array}$$

which is called the profile log-likelihood function for *γ*_{1}, *γ*_{2}, *θ* and *ξ*.

When the disease rate is known, we maximize

$$\prod _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\left\{{P}_{\theta}\left({Y}_{i}|{X}_{i}\right){p}_{i}\frac{{e}^{{D}_{i}\left({\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}\right)}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}}}\right\}$$

subject to the constraints that Σ* _{i} p_{i}* = 1 and

$$\sum _{i=1}^{n}{p}_{i}\phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}={\xi}_{0},$$

where *ξ*_{0} is the known value of *P*(*D* = 1). By using the Lagrange multipliers, we see that the estimate for *p _{i}* satisfies

$$\frac{1}{{p}_{i}}-\lambda \phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}-\stackrel{\sim}{\lambda}=0.$$

The two constraints imply *λξ*_{0} + = n; therefore,

$${p}_{i}={\left\{\lambda \phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}+\left(n-\lambda {\xi}_{0}\right)\right\}}^{-1},$$

where *λ* satisfies Σ* _{i} p_{i}* = 1. Thus, the profile log-likelihood function for

$$\begin{array}{c}\sum _{i=1}^{n}\left\{\mathrm{log}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)+{D}_{i}({\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i})-\mathrm{log}(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}})\right\}\\ -\sum _{i=1}^{n}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left\{\lambda {\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}+(n-\lambda {\xi}_{0})\right\},\end{array}$$

where *λ* is determined by the equation

$$\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}{\left\{\lambda \phantom{\rule{0.2em}{0ex}}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}+(n-\lambda {\xi}_{0})\right\}}^{-1}=1$$

under the constraint that each term in the summation is positive.

Finally, we deal with the case in which the disease is not rare and the disease rate is unknown. If *γ*_{2} = 0, we can use standard statistical methods; see Appendix B. Suppose that *γ*_{2} ≠ 0. By introducing the Lagrange multiplier *λ*, we differentiate the log-likelihood function to obtain

$$\frac{1}{{p}_{i}}-\frac{{n}_{1}}{P\left(D=1\right)}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)P\left(D=1|{X}_{i},y\right)\mathit{dy}-\frac{{n}_{0}}{P\left(D=0\right)}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)P\left(D=0|{X}_{i},y\right)\mathit{dy}-\lambda =0,$$

where *n*_{1} and *n*_{0} are the numbers of cases and controls. It is easy to see that *λ* = 0, so

$${p}_{i}={\left\{\frac{{n}_{1}}{\xi}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)P\left(D=1|{X}_{i},y\right)\mathit{dy}+\frac{{n}_{0}}{\left(1-\xi \right)}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)P\left(D=0|{X}_{i},y\right)\mathit{dy}\right\}}^{-1},$$

where *ξ* = *P*(*D* = 1). Plugging this expression into the log-likelihood function yields the profile log-likelihood function for *γ*_{0}, *γ*_{1}, *γ*_{2}, *θ* and *ξ*

$$\begin{array}{c}\sum _{i=1}^{n}\phantom{\rule{0.2em}{0ex}}\left\{{D}_{i}({\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i})-\mathrm{log}(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}{Y}_{i}})+\mathrm{log}\phantom{\rule{0.2em}{0ex}}{P}_{\theta}\left({Y}_{i}|{X}_{i}\right)\right\}\\ -\sum _{i=1}^{n}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\left\{\frac{{n}_{1}}{\xi}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}+\frac{{n}_{0}}{\left(1-\xi \right)}{\mathit{\int}}_{y}{P}_{\theta}\left(y|{X}_{i}\right)\frac{1}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}{X}_{i}+{\gamma}_{2}y}}\mathit{dy}\right\}\\ -{n}_{1}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\phantom{\rule{0.2em}{0ex}}\xi -{n}_{0}\phantom{\rule{0.2em}{0ex}}\mathrm{log}\left(1-\xi \right).\end{array}$$

In all three cases, we maximize the profile log-likelihood functions via the Newton-Raphson algorithm or optimization algorithms. By the arguments used in the proofs of Theorems 2 and 3 of Lin and Zeng [2006], we can show that the maximum likelihood estimators are consistent and asymptotically normal. In addition, the limiting covariance matrix attains the efficiency bound and can be consistently estimated by the inverse of the negative Hessian matrix of the profile log-likelihood function.

We investigate the validity of standard statistical methods. Let *P* denote the true probability, and let *P _{obs}* denote the observed probability under the case-control design. Since

$${P}_{\mathit{obs}}\left(Y|X\right)=\frac{{P}_{\mathit{obs}}\left(Y,X|D=1\right){P}_{\mathit{obs}}\left(D=1\right)+{P}_{\mathit{obs}}\left(Y,X|D=0\right){P}_{\mathit{obs}}\left(D=0\right)}{{P}_{\mathit{obs}}\left(X\right)}$$

and *P _{obs}*(

$${P}_{\mathit{obs}}\left(Y|X\right)=\left\{\frac{{P}_{\mathit{obs}}\left(D=1\right)}{P\left(D=1\right)}\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}+\frac{{P}_{\mathit{obs}}\left(D=0\right)}{P\left(D=0\right)}\frac{1}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}\right\}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X\right)}.$$

(1)

Likewise,

$${P}_{\mathit{obs}}\left(Y|X,D=0\right)=\frac{1}{P\left(D=0\right)}\frac{1}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X|D=0\right)},$$

(2)

and

$${P}_{\mathit{obs}}\left(Y|X,D=1\right)=\frac{1}{P\left(D=1\right)}\frac{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X|D=1\right)}.$$

(3)

If *P _{obs}*(

$${P}_{\mathit{obs}}\left(X\right)=P\left(X|D=0\right)\{{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X}{P}_{\mathit{obs}}\left(D=1\right)/P\left(D=1\right)+{P}_{\mathit{obs}}\left(D=0\right)/P\left(D=0\right)\}P\left(D=0\right).$$

Similarly, *P*(*X*) = *P*(*X* |*D* = 0)(*e*^{γ0+γ1TX} + 1)*P*(*D* = 0). Plugging these two expressions into equation (1), we see *P _{obs}*(

$$\frac{{P}_{\mathit{obs}}\left(D=1\right)}{P\left(D=1\right)}=\frac{{P}_{\mathit{obs}}\left(D=0\right)}{P\left(D=0\right)}.$$

This is false since *P _{obs}*(

What happens if *D* is independent of *X* conditional on *Y*, i.e., *γ*_{1} = 0? We first consider dichotomous traits. By switching the roles of *X* and *Y* in the above derivation, we see *P _{obs}*(

$${\mathit{\int}}_{y}\frac{y}{1+{e}^{{\gamma}_{0}+{\left({\gamma}_{1}+{\gamma}_{2}{\beta}_{1}\right)}^{\mathrm{T}}X+{\gamma}_{2}y}}{P}_{\mathit{obs}}\left(y\right)\mathit{dy}=0$$

for all *X*, implying *γ*_{1} = −*γ*_{2}*β*_{1}. Thus, standard regression analysis is biased under *γ*_{1} = 0 unless *β*_{1} = 0 or *γ*_{2} = 0. This explains the patterns of bias seen in Figure 2.

We now investigate the bias of including the case-control status as a predictor in a regression model. It follows from equations (2) and (3) that for dichotomous traits,

$$\mathrm{log}\frac{{P}_{\mathit{obs}}\left(Y=1|X,D\right)}{{P}_{\mathrm{obs}}\left(Y=0|X,D\right)}={\beta}_{0}+{\beta}_{1}^{\mathrm{T}}X+{\gamma}_{2}D-\mathrm{log}\frac{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}}}{1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X}},$$

and for continuous traits,

$${E}_{\mathit{obs}}\left[Y|X,D=0\right]={\mathit{\int}}_{y}yP\left(y|X\right){\left(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}y}\right)}^{-1}dy/{c}_{0}\left(X\right),$$

$${E}_{\mathit{obs}}\left[Y|X,D=1\right]=\frac{{\beta}_{0}+{\beta}_{1}^{\mathrm{T}}X-{\mathit{\int}}_{y}yP\left(y|X\right){\left(1+{e}^{{\gamma}_{0}+{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}y}\right)}^{-1}dy}{1-{c}_{0}\left(X\right)},$$

where *c*_{0}(*X*) = *∫*_{y} *P*(*y*|*X*)(1 + *e*^{γ0+γ1TX+γ2y})^{−1}*dy*. Thus, the logistic regression of *Y* on *X* and *D* is not a valid analysis of the effects of *X* on *Y* in the general population unless *γ*_{1} = 0 or *γ*_{2} = 0, whereas the linear regression of *Y* on *X* and *D* generally yields biased estimates of the effects of *X* on *Y* in the general population unless *γ*_{1} = 0 and *γ*_{2} = 0.

Finally, we consider rare disease. When the disease is rare, equation (1) becomes

$${P}_{\mathit{obs}}\left(Y|X\right)\approx \left\{\frac{{P}_{\mathit{obs}}\left(D=1\right)}{\mathit{\int}{e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}P\left(y|x\right)P\left(x\right)\mathit{dydx}}{e}^{{\gamma}_{1}^{\mathrm{T}}X+{\gamma}_{2}Y}+\frac{{P}_{\mathit{obs}}\left(D=0\right)}{P\left(D=0\right)}\right\}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X\right)}.$$

Clearly, *P _{obs}*(

$${P}_{\mathit{obs}}\left(Y|X,D=0\right)\approx \frac{1}{P\left(D=0\right)}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X|D=0\right)}\approx P\left(Y|X\right).$$

Thus, standard analysis based on controls only is approximately valid. On the other hand, equation (3) becomes

$${P}_{\mathit{obs}}\left(Y|X,D=1\right)\approx \frac{1}{\mathit{\int}{e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}P\left(y|x\right)P\left(x\right)\mathit{dydx}}{e}^{{\gamma}_{1}^{\mathrm{T}}x+{\gamma}_{2}y}\phantom{\rule{0.2em}{0ex}}P\left(Y|X\right)\frac{P\left(X\right)}{{P}_{\mathit{obs}}\left(X|D=1\right)}.$$

Thus, *P _{obs}*(

$$\frac{{P}_{\mathit{obs}}\left(Y=1|X,D=1\right)}{{P}_{\mathit{obs}}\left(Y=0|X,D=1\right)}\approx {e}^{{\gamma}_{2}}\frac{P\left(Y=1|X\right)}{P\left(Y=0|X\right)},$$

so standard logistic regression based on cases only yields approximately correct estimates of odds ratios. If *Y* is continuous satisfying the linear model, then *P _{obs}* (

- Frayling TM, Timpson NJ, Weedon MN, Zeggini E, Freathy RM, Lindgren CM, Perry JR, Elliott KS, Lango H, Rayner NW, et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science. 2007;316:889–894. [PMC free article] [PubMed]
- Gudbjartsson DF, Walters GB, Thorleifsson G, Stefansson H, Halldorsson BV, Zusmanovich P, Sulem P, Thorlacius S, Gylfason A, Steinberg S, et al. Many sequence variants affecting diversity of adult human height. Nat Genet. 2008;40:609–615. [PubMed]
- Kathiresan S, Melander O, Guiducci C, Surti A, Burtt NP, Rieder MJ, Cooper GM, Roos C, Voight BF, Havulinna AS, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat Genet. 2008;40:189–197. [PMC free article] [PubMed]
- Lettre G, Jackson AU, Gieger C, Schumacher FR, Berndt SI, Sanna S, Eyheramendy S, Voight BF, Butler JL, Guiducci C, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nat Genet. 2008;40:584–591. [PMC free article] [PubMed]
- Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. J Am Stat Ass. 2006;101:89–118. with discussion.
- Loos RJF, Lindgren CM, Li S, Wheeler E, Zhao JH, Prokopenko I, Inouye M, Freathy RM, Attwood AP, Beckmann JS, et al. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat Genet. 2008;40:768–775. [PMC free article] [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–411.
- Roeder K, Carroll RJ, Lindsay BG. A semiparametric mixture approach to case-control studies with errors in covariables. J Am Stat Ass. 1996;91:722–732.
- Sanna S, Jackson AU, Nagaraja R, Willer CJ, Chen WM, Bonnycastle LL, Shen H, Timpson N, Lettre G, Usala G. Common variants in the GDF5-UQCC region are associated with variation in human height. Nat Genet. 2008;40:198–203. [PMC free article] [PubMed]
- Saxena R, Voight BF, Lyssenko V, Burtt NP, de Bakker PIW, Chen H, Roix JJ, Kathiresan S, Hirschhorn JN, Daly MJ, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science. 2007;316:1331–1336. [PubMed]
- Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JRB, Stevens S, Hall AS, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat Genet. 2008;40:575–583. [PMC free article] [PubMed]
- Weedon MN, Lettre G, Freathy RM, Lindgren CM, Voight BF, Perry JR, Elliott KS, Hackett R, Guiducci C, Shields B, et al. A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet. 2007;39:1245–1250. [PMC free article] [PubMed]
- Willer CJ, Sanna S, Jackson AU, Scuteri A, Bonnycastle LL, Clarke R, Heath SC, Timpson NJ, Najjar SS, Stringham HM, et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat Genet. 2008;40:161–169. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |