Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2883271

Formats

Article sections

- Abstract
- 1 Introduction
- 2 Association Analysis for Single Nucleotide Polymorphisms (SNPs)
- 3 Association Analysis for Imputed SNPs
- 4 Haplotypes
- 5 Discussion
- References

Authors

Related links

Stat Sci. Author manuscript; available in PMC 2010 June 10.

Published in final edited form as:

Stat Sci. 2009 November 1; 24(4): 489–502.

PMCID: PMC2883271

NIHMSID: NIHMS184154

NILANJAN CHATTERJEE, Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS. Rockville MD 20852, U.S.A;

NILANJAN CHATTERJEE: vog.hin.liam@nrettahc; YI-HAU CHEN: wt.ude.acinis.tats@nehchy; SHENG LUO: ude.cmt.htu@ouL.T.gnehS; RAYMOND J. CARROLL: ude.umat.tats@llorrac

See other articles in PMC that cite the published article.

Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the “retrospective” likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg-Equilibrium (HWE), gene-gene and gene-environment independence. In this article, we review these modern methods and contrast them with the more classical approaches through two types of applications (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights to existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data.

Case-control study designs are now widely used to study the role of genetic susceptibility in the etiology of rare complex diseases. Typically, a case-control study involves recruiting all or a large fraction of the diseased subjects (cases) that arise in an underlying study base and then sampling a comparable number of healthy subjects (controls), ideally from the exact same study base, and possibly matched with the cases by some socio-demographic characteristics such as race, age and gender. Biological samples and questionnaire data collected on the sampled subjects are then used to determine their genetic susceptibility, such as SNP genotypes and history of some non-genetic (environmental) exposures. For rare diseases such as cancers, case-control studies are cost efficient compared to a cross-sectional or prospective cohort studies because they dramatically reduce the number of non-diseased subjects to study.

In general, the standard method for analysis of case-control data is the prospective logistic regression ignoring the retrospective nature of the underlying design. The validity of this approach relies on the classic results by Cornfield (1956) who showed the equivalence of prospective- and retrospective odds-ratios. The efficiency of the approach was established in two other classic papers by Andersen (1970) and Prentice and Pyke (1979), who showed that the prospective analysis of case-control data yields the proper maximum-likelihood estimates of the odds ratio parameters of the logistic model under a “semiparametric” setup that allows the distribution of the underlying covariates to remain completely unrestricted. More recently, it has been shown that even in the presence of missing data and measurement error in covariates, the “prospective” treatment of case-control data can yield proper maximum-likelihood estimates as long as the distribution of the underlying covariates is allowed to remain unrestricted (Roeder et al., 1996).

A special feature for studies in genetic epidemiology is that it is often reasonable to assume certain models for the population distribution of the genetic and environmental covariates of interest. The Hardy-Weinberg-Equilibrium (HWE) law, for example, which implies a simple relationship between *allele* and *genotype* frequencies at a given chromosomal locus, is a natural model for a random mating, large, stable population in the absence of new genetic mutations, inbreeding and selective survivorship among genotypes (see Hartl and Clark, 2007, Chapter 3). Genes which are physically apart and hence are not expected to be in linkage disequilibrium (LD) are also expected to be independently distributed in a homogeneous population. It is often also natural to assume a subject’s genetic susceptibility, a factor which is determined at birth, is independent of his/her subsequent environmental exposures. A pertinent question then is what is the most appropriate method for analysis of case-control data in genetic epidemiology where some natural model assumptions exist for the distribution of genetic and environmental factors in the underlying population.

We will assume data on some genetic (*G*) and environmental (*E*) exposures are collected in a case-control study involving *N*_{0} controls (*D* = 0) and *N*_{1} cases (*D* = 1). If one ignores the retrospective nature of the case-control design, one can conduct inference based on the prospective-likelihood

$${L}^{P}=\prod _{i=1}^{N}\text{pr}({D}_{i}\mid {G}_{i},{E}_{i}),$$

(1)

where *N* = *N*_{1} + *N*_{0}. The fundamental likelihood for case-control data, however, known as the “retrospective” likelihood, is given by

$${L}^{R}=\prod _{i=1}^{N}\text{pr}({G}_{i},{E}_{i}\mid {D}_{i}),$$

(2)

In the absence of any missing data, it is evident from the classical theory that the prospective-likelihood (1) provides a valid way of testing and estimation of the odds-ratio association parameters of the underlying logistic regression model. In fact, the prospective-likelihood yields the same maximum-likelihood estimates for the odds-ratio association parameters that could be obtained by maximization of the proper retrospective likelihood (2) while allowing pr(*G*, *E*), the joint distribution of *G* and *E*, to remain completely non-parametric. Under constraints on pr(*G*, *E*), however, the retrospective likelihood would not yield the same maximum-likelihood estimator as that from the prospective likelihood. More importantly, the retrospective-likelihood can exploit various population genetics model assumptions such as HWE, gene-gene and gene-environment independence to gain major efficiency over the prospective-likelihood for inference on various association and interaction parameters. At the same time, if the underlying model assumptions are violated, then the use of the retrospective likelihood can lead to serious bias for both testing and estimation procedures. In the presence of missing data, a further complication is that the use of the prospective likelihood may not be even strictly valid in certain settings, such as that described in Section 4 for estimation of haplotype effects, where for the purpose of identifiability *L ^{P}* also requires some modeling assumptions, thus destroying its equivalence with

In this article, we will review some modern developments for analysis of case-control studies in genetic epidemiology using the prospective- and retrospective-likelihoods. We will describe the methods primarily through two different types of applications: (a) association testing for genotyped and imputed single nucleotide polymorphisms (SNP) (Section 2 and Section 3) and (b) estimation of haplotype effects and haplotype-environment interactions in the presence of phase ambiguity (Section 4). In each section, we aim to provide new intuitive insights into the alternative methods by constructions of various score tests (Section 2 and 3) and pseudo-likelihoods (Section 4). As a byproduct, in Section 3, we also propose a novel “retrospective” method for association testing for untyped SNPs which can easily use any external algorithm for imputation of genotypes. In each section, we will use numerical examples to illustrate the bias and efficiency trade-off between the alternative methods. We will conclude the article with a discussion and recommendations for practical data analysis.

The genotype information for an individual SNP in a case-control study can be represented by the 2 *×* 3 contingency table defined by cross-tabulation of case-control and genotype status. Let *D* be the indicator of case (*D* = 1) or control (*D* = 0) status and let *G* be the number of minor alleles carried by an individual (*G* = 0, 1, 2). Let *n _{dg}* denote the number of subjects with genotype

$$\text{pr}(D=1\mid G)=\frac{exp\{\alpha +{\beta}^{\top}m(G)\}}{1+exp\{\alpha +{\beta}^{\top}m(G)\}},$$

(3)

where the function *m*(·) is chosen in a suitable way to reflect an assumed mode of genetic effect. If, for example, *G* denotes the count for the minor allele at a SNP locus, then one can chose *m*(*G*) = *G*, *m*(*G*) = *I*(*G* ≥ 1) or *m*(*G*) = *I*(*G* = 2) to model the effect of the minor allele as additive (in the logistic scale), dominant or recessive. One can also consider a saturated model by allowing *m*(*G*) to be a vector of two dummy variables associated with heterozygous (*G* = 1) and homozygous variant (*G* = 2) genotypes and *β* to be the corresponding log-odds-ratios. The prospective analysis of case-control data yields an asymptotically unbiased estimate for the genotype-odds-ratio parameters *β*, but not for the intercept parameter *α*.

The score-function for *β* under the prospective-likelihood (1) can be written as

$${U}_{PL}=\sum _{i=1}^{{N}_{1}+{N}_{0}}m({G}_{i})\{{D}_{i}-\text{pr}(D=1\mid {G}_{i})\}.$$

Under the null hypothesis, *β* = 0, we can estimate *p* = pr(*D* = 1*|G _{i}*) as =

$$\begin{array}{l}{U}_{PL}^{0}=\sum _{i=1}^{{N}_{1}}m({G}_{i})-\frac{{N}_{1}}{{N}_{1}+{N}_{0}}\sum _{i=1}^{{N}_{1}+{N}_{0}}m({G}_{i})\\ =\frac{{N}_{1}{N}_{0}}{{N}_{1}+{N}_{0}}\left\{\frac{1}{{N}_{1}}\sum _{i=1}^{{N}_{1}}m({G}_{i})-\frac{1}{{N}_{0}}\sum _{i={N}_{1}+1}^{{N}_{1}+{N}_{0}}m({G}_{i})\right\},\end{array}$$

which is proportional to the difference between the empirical mean of *m*(*G*) in the cases (*D* = 1) and in the controls (*D* = 0). We suppose without loss of generality that the indices for the cases are {*i* = 1, …, *N*_{1}} and those for the controls are {*i* = *N*_{1} + 1, …, *N*_{1} + *N*_{0}}. If, for example, we assume *m*(*G*) = *G*, i.e. the additive effect, then
${U}_{PL}^{0}$ corresponds to the numerator of the Cochran-Armitage trend test (**?**, Chapter 7) that is widely used for single-SNP association testing. More generally, a “prospective” score-test can be constructed under any genetic model based on
${U}_{PL}^{0}$ and its variance under the null hypothesis of no association that be estimated by

$${V}_{PL}^{0}=\frac{{N}_{1}{N}_{0}}{{N}_{1}+{N}_{0}}{V}_{m(G)}$$

where *V _{m}*

The retrospective likelihood, *L ^{R}*, for the genotype data of a single-SNP can be written as the product of two sets of multinomial probabilities:

$${L}^{R}={L}_{1}\times {L}_{0}=\prod _{g=0}^{2}{p}_{1g}^{{n}_{1g}}\times \prod _{g=0}^{2}{p}_{0g}^{{n}_{0g}},$$

where *p _{dg}* = pr(

$${p}_{1g}=\frac{{\psi}_{g}(\beta ){p}_{0g}}{{\sum}_{g=0}^{2}{\psi}_{g}(\beta ){p}_{0g}},$$

(4)

where *ψ _{g}*(

Now suppose we are willing to assume that HWE holds in the underlying population and that the disease is rare so that HWE also holds approximately in the control population. In the retrospective likelihood *L ^{R}*, we can write the genotype probabilities for the controls as a function of the frequency,

$${p}_{00}(f)={(1-f)}^{2},{p}_{01}(f)=2f(1-f),{p}_{02}={f}^{2}.$$

It is easy to show that the score-function for *β* associated with the retrospective likelihood can be written as

$${U}_{RL}=\sum _{i=1}^{{N}_{1}}[m({G}_{i})-{E}_{\mathit{HWE},f}\{m(G)\mid D=1\}],$$

which under the null hypothesis of no association reduces to

$${U}_{RL}^{0}=\sum _{i=1}^{{N}_{1}}[m({G}_{i})-{E}_{\mathit{HWE},f}\{m(G)\}],$$

(5)

Moreover, under the null hypothesis, the allele frequency *f* can be substituted for by its maximum-likelihood estimate

$$\widehat{f}=\frac{{n}_{+1}+2{n}_{+2}}{2N},$$

(6)

where *n*_{+}* _{g}* denotes the frequency for genotype

$${V}_{RL}^{0}={N}_{1}\left\{{V}_{m(G)}-\frac{{N}_{1}}{2N}\widehat{f}(1-\widehat{f})C(\widehat{f})C{(\widehat{f})}^{\top}\right\},$$

where

$$C(f)={cov}_{\mathit{HWE},f}\left\{m(G),\frac{G-2f}{f(1-f)}\right\}=\sum _{g}m(g)\frac{g-2f}{f(1-f)}{p}_{0g}(f).$$

By the Cauchy-Schwartz inequality, ${V}_{RL}^{0}\ge {V}_{PL}^{0}$ asymptotically, implying that the retrospective score test is asymptotically more powerful than its prospective counterpart when the assumption of HWE is valid.

Chen and Chatterjee (2007) compared the performance of 2 d.f. Wald-tests of association based on the retrospective and prospective likelihoods. They observed major gains in power for the test based on the retrospective-likelihood for the detection of non-multiplicative effects, e.g., recessive effects. Notice that if we assume an additive model, i.e. *m*(*G*) = *G*, then the prospective and retrospective score-functions
${U}_{RL}^{0}$ and
${U}_{PL}^{0}$ become identical because in this case
${E}_{\mathit{HWE},\widehat{f}}\{m(G)\}=2\widehat{f}={\sum}_{i=1}^{N}{G}_{i}/N$. The larger the departure of the effect of a SNP from the additive form, the greater the gain in efficiency for the retrospective method. Application of retrospective methods for association testing, however, requires caution because of their sensitivity to the underlying model assumption. In particular, it can be seen from the formula of
${U}_{RL}^{0}$ that the unbiasedness of that score function crucially depends on the assumption of HWE being correct for the underlying population. Satten and Epstein (2004) and Chen and Chatterjee (2007) have noted that even modest violation of HWE can cause serious inflation in Type-I error in association tests based on the retrospective likelihood.

Luo et al. (2009) considered an empirical-Bayes type shrinkage estimation approach to develop a 2 d.f. single-SNP association test that can gain power by exploiting the model assumptions of HWE for the underlying population and yet is resistant to bias when the model assumptions are violated. The method involves estimation of genotype-specific disease odds ratio parameters by data-adaptive “shrinkage” of a “prospective” model-free estimator that does not require the HWE assumption towards a “retrospective” model-based estimator that directly exploits the HWE constraints. The amount of “shrinkage” is sample-size and data-adaptive, so that in large samples the method has no bias whether the assumption of HWE holds or not and yet the method can gain efficiency by shrinking the analysis towards HWE, but only to the extent that the data validate the assumptions. In what follows, we provide some insight into the empirical-Bayes method through the construction of a score-test. For numerical illustration, however, we will focus on the Wald test as originally developed in Luo et al. (2009).

Let
$\overline{m}(G)={({N}_{1}+{N}_{0})}^{-1}{\sum}_{i=1}^{{N}_{1}+{N}_{0}}m({G}_{i})$ and
${s}_{\overline{m}(G)}^{2}={({N}_{1}+{N}_{0})}^{-1}{\sum}_{i=1}^{{N}_{0}+{N}_{1}}{\{m({G}_{i})-\overline{m}(G)\}}^{2}$ denote the sample mean and variance for the function *m*(*G*), respectively. Further, let = (*G*) − *E _{,HWE}m*(

$${U}_{EB}^{0}=\sum _{i=1}^{{N}_{1}}[m({G}_{i})-{E}_{EB}\{m(G)\}],$$

(7)

where *E _{EB}* {

$${E}_{EB}\{m(G)\}=\frac{{s}_{\overline{m}(G)}^{2}/N}{{s}_{\overline{m}(G)}^{2}/N+{\widehat{\tau}}^{2}}{E}_{\mathit{HWE},\widehat{f}}\{m(G)\}+\frac{{\widehat{\tau}}^{2}}{{s}_{\overline{m}(G)}^{2}/N+{\widehat{\tau}}^{2}}\overline{m}(G).$$

Thus, *E _{EB}* {

We illustrate the performance of alternative 2 d.f. single SNP association tests using data from the Cancer Genetics Markers of Susceptibility (CGEMS) study (Yeager et al., 2007; Hunter et al., 2007; Thomas et al., 2008), an NCI enterprize initiative to conduct multi-stage whole-genome association studies to identify susceptibility genes giving rise to increased risks of prostate and breast cancers. In this article, we will focus on data from the initial scan for the prostate cancer study, involving genotype data on about 550, 000 SNPs from 1, 172 cases and 1, 157 controls. The details of the CGEMS study design and the results from the initial scan and subsequent replication studies can be found at the web site https://caintegrator.nci.nih.gov/cgems/.

We consider 449, 698 SNPs from 22 non-sex chromosomes with minor allele frequencies larger than 0.05. Table 1 displays the empirical proportions of the number of SNPs that are found to be significant at different nominal significance levels using 2 d.f. tests based on three different methods: (a) prospective; (b) retrospective and (c) empirical-Bayes (see Luo et al. (2009) for more details). For a well-designed study and a robust analytic method, the empirical proportions are expected to be fairly close to the nominal significant levels, given that vast majority of the SNPs are likely to be not associated with the disease. In Table 1, we observe that the empirical proportions of significant SNPs found by the prospective method closely follows the nominal significance levels. In contrast, the corresponding proportions for the retrospective test deviate severely from the nominal values in the range of *α* ≤ 10^{−3}, indicating significantly inflated type-I error due to the violation of HWE for many SNPs. The last column of Table 1 shows that the empirical-Bayes procedure essentially corrects for all the bias of the retrospective method due to the violation of the HWE assumption.

The empirical proportions of significant SNPs detected by different methods at different nominal significance levels in the CGEMS prostate cancer study.

Next, we conducted a simulation study to investigate the performance of various tests in ranking a true susceptibility locus in a genome-wide association study (GWAS) that include hundreds of thousands of “null” SNPs. To generate realistic linkage disequilibrium patterns, we simulated GWAS data mimicking the CGEMS study itself. Given MAF among controls and the disease-genotype odds-ratio parameters for a chosen susceptibility locus, we simulate genotype data at that locus for the cases and controls separately from the corresponding multinomial distributions. Given the genotype data at the susceptibility locus for a case or a control, we simulate genotype data for the remainder of the SNPs by assigning the whole genotype profile for a randomly selected subject from the controls of the CGEMS study who have the same genotype data at the given susceptibility locus as the sampled subject in our simulation study. This algorithm, as originally described by Yu et al. (2009), assumes that given the genotypes for the susceptibility locus, the risk of the disease is independent of all the remaining SNPs. We simulated 50 data sets with approximately 550 cases and 550 controls. For each data set, we tested for association for each of the approximately 450,000 SNPs using the prospective, retrospective and empirical-Bayes methods. The rank of the disease-associated SNP is obtained by sorting all the p-values in ascending order.

Table 2 displays the median ranks obtained by three methods for a true disease-associated SNP that has a recessive effect with a log-odds-ratio of *β* = log(3). As expected, the ranks of all tests decrease as the MAF increases. Comparing the ranks of different tests at a specific MAF, we can see that the standard prospective method generally has the lowest power in the sense that it assigns much higher rank to the susceptibility SNP than the two other tests. When MAF=0.1, we observe that the pure retrospective method performs the best in the sense that it assigns the lowest rank to the susceptibility SNPs among all the methods. In contrast when MAF≥ 0.2, we observe that the empirical-Bayes procedure assigns considerable lower rank to the susceptibility SNP than the pure retrospective method. Intuitively, the results can be explained from the fact that the retrospective method yields low p-values for many null SNPs due to the violation of the HWE assumption (see Table 1) and thus dilutes the rank of the real susceptibility SNP.

The forms of the prospective- and retrospective-scores suggest how they can be modified easily for SNPs that may not have been directly genotyped, but can be “imputed” conditional on neighboring SNPs and estimates of linkage disequilibrium from HapMap or other similar data bases. Let (*G*) denote the neighboring genotype information for an untyped SNP-locus with unobserved genotype *G*. The prospective score for such an untyped SNP can be defined by taking the conditional expectation of the “complete data” score-function
${U}_{PL}^{0}$ given the observed data, i.e the neighboring genotype information. More formally, the prospective score for an untyped SNP can be written as

$${U}_{PL}^{0u}=\frac{{N}_{1}{N}_{0}}{{N}_{1}+{N}_{0}}\left[\frac{1}{{N}_{1}}\sum _{i=1}^{{N}_{1}}E\{m(G)\mid \mathcal{N}({G}_{i})\}-\frac{1}{{N}_{0}}\sum _{i=1}^{{N}_{0}}E\{m(G)\mid \mathcal{N}({G}_{i})\}\right],$$

(8)

where the conditional expectations are taken with respect to a suitable imputation model such as those described by Nicolae (2006), Marchini et al. (2007) and others. The retrospective score for an untyped SNP can be similarly defined by the conditional expectation of the “complete data” retrospective score-function
${U}_{RL}^{0}$ given the observed data (*G*) in the form

$${U}_{RL}^{0u}=\sum _{i=1}^{{N}_{1}}[E\{m(G)\mid \mathcal{N}({G}_{i})\}-{E}_{\mathit{HWE},f}\{m(G)\}].$$

(9)

Notice that in the retrospective score function, the contribution of the term *E _{HWE,f}* {

$$\widehat{f}=\frac{{\sum}_{i=1}^{{N}_{0}+{N}_{1}}\{I({G}_{i}=1)+2I({G}_{i}=2)\}}{2N}.$$

Thus, given an imputation model, we can estimate the allele frequency *f* as

$${\widehat{f}}^{u}=\frac{{\sum}_{i=1}^{{N}_{0}+{N}_{1}}\text{pr}\{G=1\mid \mathcal{N}({G}_{i})\}+2\text{pr}\{G=2\mid \mathcal{N}({G}_{i})\}}{2N}.$$

(10)

We further need the variances for ${U}_{PL}^{0u}$ and ${U}_{RL}^{0u}$ under the null hypothesis to obtain the corresponding score tests. The variance of ${U}_{PL}^{0u}$ can be estimated as

$${V}_{PL}^{0u}=\frac{{N}_{1}{N}_{0}}{{N}_{1}+{N}_{0}}{V}_{E\{m(G)\mid \mathcal{N}(G)\}},$$

where *V _{E}*

$${({U}_{PL}^{0u})}^{\top}{\{{V}_{PL}^{0u}\}}^{-}{U}_{RL}^{0u},$$

where the superscripts T and – denote transpose and generalized inverse, respectively. Asymptotically, this statistic follows a chi-squared distribution under the null hypothesis of *β* = 0, with the degrees of freedom given by the dimension of *m*(*G*). The variance of the retrospective score
${U}_{RL}^{0u}$, after adjusting for the estimation of the allele frequency *f* by given by (10) and can be estimated by

$${V}_{RL}^{0u}={N}_{1}\left[{V}_{E\{m(G)\mid \mathcal{N}(G)\}}+\frac{{N}_{1}}{2N}\left\{\frac{{V}_{E\{G\mid \mathcal{N}(G)\}}}{2}C(\widehat{f})C{(\widehat{f})}^{\top}-QC{(\widehat{f})}^{\top}-C(\widehat{f}){Q}^{\top}\right\}\right],$$

where *Q* is pooled-sample covariance between *E*{*m*(*G*)*|N*(*G _{i}*)} and

$${V}_{PL}^{0u}=\sum _{i=1}^{{N}_{1}+{N}_{0}}{\stackrel{\sim}{U}}_{RL,i}^{0u}{({\stackrel{\sim}{U}}_{RL,i}^{0u})}^{\top}$$

where the efficient score

$${\stackrel{\sim}{U}}_{RL,i}^{0u}={D}_{i}[E\{m(G)\mid \mathcal{N}({G}_{i})\}-{E}_{\mathit{HWE},\widehat{f}}\{m(G)\}]-\frac{{N}_{1}}{2N}C(\widehat{f})[E\{G\mid \mathcal{N}({G}_{i})\}-2\widehat{f}].$$

The retrospective-score test is then based on the test statistic given by

$${({U}_{RL}^{0u})}^{\top}{\{{V}_{RL}^{0u}\}}^{-}{U}_{RL}^{0u},$$

which again follows a chi-squared distribution asymptotically under the null hypothesis, with the degrees of freedom given by the dimension of *m*(*G*). In both the prospective- and retrospective-sore tests given above, we obtain the conditional probability Pr{*G|*(*G _{i}*)} directly from some external reference database, e.g. HapMap, a strategy similar to the proposal of Nicolae (2006).

We now demonstrate the potential power advantages that might be achieved by imputing the untyped SNP, using numerical studies following two scenarios as in Tables 1 and 2 of Nicolae (2006). In Scenario 1, the untyped SNP can be perfectly predicted by the genotypes of the typed SNPs, namely the ${R}_{s}^{2}=1$ (see Stram et al., 2004, for definition), while in Scenario 2 the untyped SNP is moderately predicted by the genotypes of the typed SNPs with ${R}_{s}^{2}=0.39$. The SNP profiles together with the haplotype frequencies estimated from HapMap CEU samples in the two scenarios are summarized in Tables 3 and and4.4. Also listed in Tables 3 and and44 are the haplotype frequencies we actually used to simulate the SNP data for the case-control sample, which moderately deviate from those seen in the HapMap CEU sample to reflect the potential discrepancy between the HapMap and study samples. The haplotype pair for each person is generated according to HWE.

The SNP profiles and haplotype frequencies for the region considered in Scenario 1, where the untyped SNP can be perfectly predicted by genotyped SNPs *A*_{1}, …, *A*_{4} (
${R}_{s}^{2}=1$). Also listed are the haplotype frequencies estimated from the CEU sample **...**

The SNP profiles and haplotype frequencies for the region considered in Scenario 2, where the untyped SNP is moderately predicted by genotyped SNPs *A*_{1}, …, *A*_{3} (
${R}_{s}^{2}=0.39$). Also listed are the haplotype frequencies estimated from the CEU sample **...**

We simulated the case-control status by the logistic regression model (3), where the genetic determinant *G* is given by the minor allele count of the untyped SNP, and the function *m*(·) is given by the recessive, dominant, or additive genetic mode. The intercept *α* = −3.0, which yields an overall disease rate around 5%. Each analysis is based on a case-control sample with 1000 cases and 1000 controls. The simulation results are based on 1000 (3000) repetitions for evaluation of test power (size). All the tests are performed at a significance level of 0.01. The score tests are performed using the correct genetic model, and the retrospective-score tests are based on the robust sandwich-type variance estimates; results based on model-based variance estimates are quite similar and are omitted. When performing the prospective- and retrospective-score tests with imputed genotypes for the untyped SNP, we use thed haplotype frequency estimates from HapMap to obtain the conditional probabilities Pr{*G|*(*G _{i}*)}, even though the case-control sample is actually from a population with moderately different haplotype frequencies. To see the degree of recovery of missing information achieved by imputation, we also perform the prospective-and retrospective-score tests based on the true genotypes at the untyped SNP. In addition, we perform the multi-marker Hotelling’s

Results for this simulation study are presented in Tables 5 (Scenario 1) and 6 (Scenario 2). It is seen that the score tests with imputed genotypes have size matching reasonably well with the nominal value of 1%, even though the imputation is based on haplotype frequencies that are obtained from the HapMap data and are different from the true frequencies. From the results regarding power, we see that imputing the untyped SNP in either the prospective- or the retrospective-score test can achieve substantial power gains as compared with the Hotelling’s *T*^{2} test based only on genotyped SNPs. The relative power improvement gained by imputation can still be quite remarkable even when the accuracy for predicting the untyped SNP using the genotyped SNPs is only of a moderate level (Scenario 2, where
${R}_{s}^{2}=0.39$). On the other hand, the prediction accuracy does affect the degree of recovery of the missing information that may be achieved by imputation: in Scenario 1, with perfect prediction of the untyped SNP, the tests using imputed genotypes do attain the full power we would obtain if the tests were based on the true genotype of the untyped SNP. In Scenario 2, with moderate prediction of the untyped SNP, imputation of the untyped SNP can recover partial but not full power. It is worth remembering that, with exact data, the retrospective-score test is usually more powerful than the prospective-score under the dominant or recessive model, and the two tests are essentially equivalent under the additive model. Here we observe the same phenomena when the prospective- and retrospective-score tests are based on imputed genotypes.

As we noted earlier, when exact genotype data are available, the retrospective-score test is more sensitive to violation of the HWE assumption than the prospective-score test; i.e., the former is usually biased while the latter still remains unbiased when HWE does not hold. To assess the robustness properties for the prospective- and retrospective-score tests with imputed genotype data, we performed a further simulation study where the SNP haplotypes are still given as in Tables 3 and and4,4, but the haplotype pair *H*^{di} = (*h _{a}*,

We conclude this section with a discussion on the two types of association analyses recently developed for untyped SNPs: the full likelihood approach (Lin et al., 2008) and the two-stage approach (Nicolae, 2006; Marchini et al., 2007). The full likelihood approach uses a retrospective likelihood for the case-control sample and a likelihood for the external (such as HapMap) data, by which the imputation and association analysis are simultaneously performed in a one-stage manner. Conversely, the two-stage approach performs the imputation and association analysis separately: imputing missing genotypes in the first stage and then performing association analysis in the second stage. In the imputation stage of the two-stage approach, one can apply existing powerful external imputation algorithms such as Nicolae (2006) and Marchini et al. (2007), and hence the two-stage approach is convenient to implement. There has been some debate on the efficiency difference between the two approaches (Marchini and Howie, 2008; Lin and Hu, 2008). Our simulation results (Tables 5 and and6)6) suggest that the efficiency difference between the full likelihood and the two-stage approaches may be mostly due to the use of different likelihoods (prospective vs. retrospective) and not so much due to the use of one-stage versus two-stage analysis. In this section, we have shown that one can still use a retrospective likelihood even in a two-stage approach with powerful imputation performed at the first stage.

Although single-SNP association tests are often the primary methods for genome-wide association scans, many secondary or “downstream” analyses are often useful for detailed characterization of the risk of the disease associated with specific genomic regions of interest. One popular technique is *haplotype-based association analysis*, which involves studying the association of a disease with a genomic region in terms of the underlying “haplotypes”, the combination of alleles at multiple loci along individual homologous chromosomes. Originally, haplotype-based association analysis was considered a powerful technique for “indirect” association testing in situations where a causal SNP may not have been genotyped, but the haplotypes defined by multiple typed SNPs could serve as a good “surrogate” for the causal variant. With the advent of various imputation methods, although haplotype analysis has become less relevant for such indirect association testing, it remains a useful tool for parsimonious characterization of disease-risk associated with multiple, possibly interacting, loci within a given region. Moreover, it is conceivable that for some regions, the haplotypes, and not the individual SNPs, are functional units and thus for these regions stronger signals of associations could be detected by performing haplotype based regression analysis.

A technical problem for haplotype-based regression analysis is that typically the haplotype information for the study subjects is not directly observable. Instead, locus-specific genotype data are observed, which contain information on the pair of alleles a subject carries, but does not provide the “phase information”, that is which combinations of alleles appear across multiple loci along the individual chromosomes. In general, the genotype data of a subject will be phase-ambiguous whenever the subject is heterozygous at two or more loci. Statistically, the lack of phase information can be viewed as a special missing data problem.

For example, suppose A/a and B/b denote the major/minor alleles in two bi-allelic loci. A particular haplotype pair, called a diplotype, is the pair of alleles that are inherited from one’s parents. One such haplotype pair would be (*AB*) − (*ab*), and disease risk can be associated with the number of copies of particular haplotypes that one inherits. Unfortunately, the diplotypes are not observable directly, but instead we can observe only the unordered or combined genotypes, in this case (*Aa*) at the first locus and (*Bb*) at the second locus, i.e., (*AaBb*). However, when observing only the genotypes, the actual haplotype pair is unknown, or “phase ambiguous”, because the haplotype pair (*Ab*)−(*aB*) has the same set of unordered genotypes. Confronted with the unordered set of genotypes (*AaBb*), we know that the actual haplotype pair is either (*AB*) − (*ab*) or (*Ab*) − (*aB*), but we must use probability models to take into account the phase ambiguity when performing statistical inference.

In Section 2 we described “model-free” prospective and “model-based” efficient retrospective methods for analyzing SNP data, and we also described empirical-Bayes methods that data-adaptively move between the two. Just as in SNP data, for haplotype data there are also model-free and model-based methods, and accompanying empirical-Bayes methods.

A variety of methods have been developed for haplotype-based analysis of case-control data using the logistic regression model (Zhao et al., 2003; Lake et al., 2003; Epstein and Satten, 2003; Satten and Epstein, 2004; Spinka et al., 2005; Lin and Zeng, 2006; Chatterjee et al., 2006; Chen et al., 2009). Consider a general risk model similar to (3) but with the addition of environmental factors (*E*) and written in terms of the diplotypes, denoted as *H*^{di}:

$$\text{pr}(D=1\mid {H}^{\text{di}},E)=\frac{exp\{\alpha {+}^{\top}m({H}^{\text{di}},E,\beta )\}}{1+exp\{\alpha {+}^{\top}m({H}^{\text{di}},E,\beta )\}},$$

(11)

where the function *m*(·) is chosen in a suitable way to reflect an assumed mode of genetic effect. For example, suppose we are interested in the particular haplotype *h*_{*}= (*ab*). A model that assumes an additive effect of this haplotype would have *m*(*H*^{di} = *h*^{di}, *E*) linear in the number of copies of the haplotype *h*_{*}.

The data setup then is that we have observations on environmental exposure (*E*), genotypes *G* and cases and controls *D*. What is missing is the underlying diplotype *H*^{di}. The retrospective likelihood is still (2), but the risk of disease depends on the diplotype *H*^{di} and not otherwise on the genotype.

While models such as (11) seem straightforward enough for random samples, in retrospective samples a problem arises because of the phase ambiguity. In particular, all components of *β* may not be identifiable if the distribution of (*H*^{di}, *E*) is left completely unrestricted (Epstein and Satten, 2003; Lin and Zeng, 2006). Thus, to make progress, some type of distributional assumptions are needed. Here we will distinguished between two approaches, both of them retrospective in nature but with different distributional assumptions. The first we call “model-free” in that very little is actually assumed about the haplotype distribution. If haplotypes were observable, this method reduces to ordinary prospective logistic regression, while in the rare disease case with phase ambiguity, the method reduces to that of Zhao et al. (2003). The second approach, which we call “model-based”, makes much stronger assumptions about the haplotype distribution, and reduces to the efficient retrospective approach of Chatterjee and Carroll (2005) if haplotypes were observable. The model-free method will thus be more robust but less efficient than the model-based method.

The model-based method (Spinka et al., 2005) has three aspects.

- (A.1) Haplotypes and the environment are assumed independent in the population.
- (A.2) The diplotypes are assumed to be in HWE in the population, so that$$\begin{array}{r}\text{pr}({H}^{\text{di}}={h}^{\text{di}}=({h}_{a},{h}_{b})\mid E)=q\{{h}^{\text{di}}=({h}_{a},{h}_{b}),\theta \}={\theta}_{a}^{2}\phantom{\rule{0.38889em}{0ex}}\text{if}\phantom{\rule{0.38889em}{0ex}}{h}_{a}={h}_{b}\\ =2{\theta}_{a}{\theta}_{b}\phantom{\rule{0.38889em}{0ex}}\text{if}\phantom{\rule{0.38889em}{0ex}}{h}_{a}\ne {h}_{b},\end{array}$$where
*θ*denotes the population frequency for the haplotype_{s}*h*._{s} - (A.3) The distribution of the environmental variable
*E*is left completely nonparametric.

The methodology Spinka et al. (2005) used to construct their profile likelihood was a nonparametric maximum likelihood estimator over the unknown distribution of *E*. However, there is an alternative derivation, one that is both more intuitive and much easier to work out. Indeed, it is a not sufficiently well known fact that for most purposes a case-control study can be viewed as a prospective study with missing data. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme where the selection probability for a subject given his/her disease status *D* = *d* is proportional to *N _{d}*/pr(

$${L}_{\text{model}}(D,G,E,\mathrm{\Omega})=\frac{{\sum}_{{h}^{\text{di}}\in {\mathcal{H}}_{G}}q({h}^{\text{di}},\theta )exp[D\{\kappa +m({h}^{\text{di}},E,\beta )\}]}{{\sum}_{s=0}^{1}{\sum}_{{h}^{\text{di}}}q({h}^{\text{di}},\theta )exp[s\{\kappa +m({h}^{\text{di}},E,\beta )\}],}$$

(12)

where *p _{d}* =

The two important model assumptions in the model-based estimator are (A.1) and (A.2). Although because of identifiability some model assumptions must be made, they can be weakened tremendously, as follows (Chen, Chatterjee and Carroll, 2009):

- (B.1) The haplotype and the environment are independent in the population given the genotype
*G*. - (B.2) There population distribution for the diplotypes given the genotype
*G*, called*q*_{free}(*h*^{di}*|G*,*θ*), can be derived assuming HWE.

Following the same alternative sampling scheme as described in Section **??**, or by doing a nonparametric maximum likelihood analysis, we can compute pr(*D* = 1*|G*, *E*, *δ* = 1) under assumptions (B.1), (B.2) and (A.3) to be

$${L}_{\text{free}}(D,G,E,\mathrm{\Omega})=\frac{{\sum}_{{h}^{\text{di}}\in {\mathcal{H}}_{G}}{q}_{\text{free}}({h}^{\text{di}}\mid G,\theta )exp[D\{\kappa +m({h}^{\text{di}},E,\beta )\}]}{{\sum}_{s=0}^{1}{\sum}_{{h}^{\text{di}}\in {\mathcal{H}}_{G}}{q}_{\text{free}}({h}^{\text{di}}\mid G,\theta )exp[s\{\kappa +m({h}^{\text{di}},E,\beta )\}],}$$

(13)

To see why the likelihood *L*_{free} requires far weaker assumptions than *L*_{model}, note that *L*_{free} requires the haplotype-environment independence and HWE assumption only to specify the conditional distribution pr(*H ^{di}|G*,

Note that *L*_{free}(*D*, *G*, *E*, Ω) will contain little information on *θ* since it conditions on *G*. Thus, when implementing methods based on this likelihood, Chen et al. (2009) proposed to replace the score function for *θ* by the estimating function for *θ* based on the genotype data from the controls and assuming that the haplotypes are in HWE in the population.

In Section 4.2.2, we constructed a profile likelihood under strong assumptions leading to an efficient method that will not be robust to violations of the two major assumptions. Conversely, in Section 4.2.3 we computed a profile likelihood leading to much more robust inference, but at the cost of a steep loss of efficiency. Similarly to Section 2.3, here we briefly review a fully sample size- and data-adaptive empirical-Bayes method that Chen et al. (2009) described for gaining efficiency when warranted but is still robust.

Let _{free} and _{model} be the model-free and model-based estimates, with *j ^{th}* components

$$\begin{array}{l}{\widehat{\beta}}_{j,\text{EB}}={\widehat{\beta}}_{j,\text{free}}+{W}_{j}({\widehat{\beta}}_{j,\text{model}}-{\widehat{\beta}}_{j,\text{free}});\\ W=\frac{{v}_{j}}{{v}_{j}+{({\widehat{\beta}}_{j,\text{free}}-{\widehat{\beta}}_{j,\text{model}})}^{2}}.\end{array}$$

(14)

The intuition behind (14) is that if the model fails, (_{j,}_{model} − _{j,}_{free}) will be large relative to *v _{i}*, which as a variance is proportional to

Chen et al. (2009) illustrate application of the different methods in two case-control data examples. The examples were chosen in such a way that from a priori biologic grounds one would expect the gene-environment independence assumption to hold in one case, but not in the other. The two examples together illustrate how the different shrinkage estimators adapt to alternative scenarios of gene-environment distribution.

Researchers now increasingly use the Cochran-Armitage trend test as the primary method for single-SNP association testing in the GWAS. The test is known to have robust power for the detection of effect of susceptibility SNPs under a range of realistic modes of inheritance that give rise to some sort of monotone relationship between disease-risk and allele count. As noted in Section 2, the retrospective and prospective methods have very similar, if not identical, power under the trend model and thus either could be used as the primary method for analysis of GWAS data. The trend test, however, can perform very poorly for the detection of SNPs for which the minor allele has a recessive effect. Thus, it is often recommended that a test under the recessive mode of inheritance be conducted as a secondary step to detect SNPs with recessive effects that may be missed by the primary trend test of association. The use of the retrospective method can be potentially beneficial at this stage. One, however, has to be cautious about creation of false positive results due to the violation of the HWE assumption. We recommend that if a retrospective method is to be used for potential power gain, then it should be used in conjunction with the empirical-Bayes type shrinkage estimation. Our numerical investigations suggest that such a method can indeed be more powerful than the conventional “prospective” methods without creating excess false positives, see Tables 1 and and22.

In this article, although we focus on association tests involving bi-allelic SNPs, the same issues are relevant for genetic association tests involving loci with more than two alleles. In particular, one can gain efficiency for analysis of case-control data by assuming HWE or other natural population-genetic models (Satten and Epstein, 2004; Lin and Zeng, 2006) to specify multi-allelic genotype frequency for the underlying population. The sensitivity of the methods to underlying model assumption can be reduced by appropriate shrinkage estimation techniques.

The difference between prospective and retrospective methods become more relevant for studies of gene-gene and gene-environment interactions, a topic that we have not directly addressed in this article. In particular, retrospective methods, such as the case-only analysis (Piegorsch et al., 1994), which assumes gene-gene or/and gene-environment independence for the underlying population, can gain dramatic power for testing and estimation of odds-ratio interaction parameters in the logistic regression model. Given that standard case-control analyses often have poor power for detection of multiplicative interactions due to small numbers of cases or controls in cells of crossing exposures, practitioners often find it is tempting to use the more powerful retrospective methods. The assumption of gene-environment independence, however, can be violated, either due to direct casual association between gene and environment or indirect association due to effects of family history and hidden population stratification. The assumption of gene-gene independence between physically distant genes can also be violated due to population stratification. Thus, we believe the development of shrinkage (Mukherjee and Chatterjee, 2008; Chen, Chatterjee and Carroll, 2009) and other type of data-adaptive techniques (Li and Conti, 2009) has been valuable for robust inference in case-control studies of genetic epidemiology.

Chatterjee’s research was supported by a gene-environment initiative grant from the National Heart Lung and Blood Institute (RO1HL091172-01) and by the Intramural Research Program of the National Cancer Institute. Chen’s research was supported by the National Science Council of ROC (NSC 95-2118-M-001-022-MY3). Carroll’s research was supported by a grant from the National Cancer Institute (CA57030) and by Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).

NILANJAN CHATTERJEE, Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS. Rockville MD 20852, U.S.A.

YI-HAU CHEN, Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan R.O.C.

SHENG LUO, Division of Biostatistics, University of Texas School of Public Health, Houston, TX 77030.

RAYMOND J. CARROLL, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, U.S.A.

- Andersen EB. Asymptotic properties of conditional maximum-likelihood estimators. Journal of the Royal Statistical Society, Series B. 1970;32:283–301.
- Chapman JM, Cooper JD, Todd JA, Clayton DG. Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power. Human Heredity. 2003;56:18–31. [PubMed]
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation in case-control studies of gene-environment interactions. Biometrika. 2005;92:399–418.
- Chatterjee N, Chen J, Spinka C, Carroll RJ. Comment on the paper likelihood based inference on haplotype effects in genetic association studies by d y lin and d zeng. Journal of American Statistical Association. 2006;101:108–110.
- Chen J, Chatterjee N. Exploiting hardy-weinberg equilibrium for efficient screening of single snp associations from case-control studies. Human Heredity. 2007;63:196–204. [PubMed]
- Chen YH, Chatterjee N, Carroll RJ. Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. Journal of American Statistical Association. 2009 page To appear. [PMC free article] [PubMed]
- Cornfield J. A statistical problem arising from retrospective studies. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability.1956.
- Epstein MP, Satten GA. Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics. 2003;73:1316–1329. [PubMed]
- Hartl DL, Clark AG. Principles of Population Genetics. 4 Sinauer Associates; 2007.
- Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, Wacholder S, Wang Z, Welch R, Hutchinson A, Wang J, Yu K, Chatterjee N, et al. A genome-wide association study identifies alleles in fgfr2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics. 2007;39:870–874. [PMC free article] [PubMed]
- Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. Estimation and tests of haplotype-environment interaction when linkage phase is ambiguous. Human Heredity. 2003;55:56–65. [PubMed]
- Li D, Conti DV. Detecting gene-environment interactions using a combined case-only and case-control approach. American Journal of Epidemiology. 2009;169:497–504. [PMC free article] [PubMed]
- Lin DY, Hu Y. Reply to marchini and howie. American Journal of Human Genetics. 2008;83:539–540.
- Lin DY, Hu Y, Huang BE. Simple and efficient analysis of disease association with missing genotype data. American Journal of Human Genetics. 2008;82:444–45. [PubMed]
- Lin DY, Zeng D. Likelihood-based inference on haplotype effects in genetic association studies. Journal of the American Statistical Association. 2006;101(473):89–104.
- Luo S, Mukherjee B, Chen J, Chatterjee N. Shrinkage estimation for robust and efficient screening of single-snp sssociation from case-control genome-wide association studies. Genetic Epidemiology. 2009 To apprear. [PMC free article] [PubMed]
- Marchini J, Howie B. Comparing algorithms for genotype imputation. American Journal of Human Genetics. 2008;83:535–539. [PubMed]
- Marchini J, Howie B, Myers S, GM, PD A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics. 2007;39:906–913. [PubMed]
- Mukherjee B, Chatterjee N. Exploiting gene-environment independence for analysis of case-control studies: an empirical bayes approach to trade off between bias and efficiency. Biometrics. 2008;64:685–694. [PubMed]
- Nicolae DL. Testing untyped alleles (tuna)-applications to genome-wide association studies. Genetic Epidemiology. 2006;30:718–727. [PubMed]
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine. 1994;13:153–162. [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–412.
- Roeder K, Carroll RJ, Lindsay BG. A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association. 1996;91:722–732.
- Satten GA, Epstein MP. Comparison of prospective and retrospective methods for haplotype inference in case-control studies. Genetic Epidemiology. 2004;27(3):192–201. [PubMed]
- Satten GA, Kupper LL. Conditional regression analysis of the exposure- disease odds ratio using known probability-of-exposure values. Biometrics. 1993;49:429–440. [PubMed]
- Spinka C, Carroll RJ, Chatterjee N. Analysis of case-control studies of genetic and environmental factors with missing genetic information and haplotype-phase ambiguity. Genetic Epidemiology. 2005;29:108–127. [PMC free article] [PubMed]
- Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A, et al. Multiple novel loci identified in a genome-wide association study of prostate cancer. Nature Genetics. 2008;40:310–315. [PubMed]
- Xiong M, Zhao J, Berwinkle E. Generalized t2 test for genome association studies. American Journal of Human Genetics. 2002;70:1257–1268. [PubMed]
- Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics. 2007;39:645–649. [PubMed]
- Yu K, Li QWBA, Pfeiffer R, Rosenberg P, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of p-values. Genetic Epidemiology. 2009 To appear. [PMC free article] [PubMed]
- Zhao LP, Li SS, Khalid NA. Method for the assessment of disease associations with single-nucleotide polymorphism haplotypes and environmental variables in case-control studies. American Journal of Human Genetics. 2003;72:1231–1250. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |