Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2679507

Formats

Article sections

- Abstract
- 1. INTRODUCTION
- 2. A MODEL-BASED AND A MODEL-FREE ESTIMATOR
- 3. EMPIRICAL BAYES-TYPE SHRINKAGE ESTIMATORS
- 4. PENALIZED LIKELIHOODS
- 5. SIMULATIONS
- 6. CASE-CONTROL STUDIES OF COLORECTAL ADENOMA AND PROSTATE CANCER
- 7. DISCUSSION
- REFERENCES

Authors

Related links

J Am Stat Assoc. Author manuscript; available in PMC 2009 May 8.

Published in final edited form as:

J Am Stat Assoc. 2009 March 1; 104(485): 220–233.

doi: 10.1198/jasa.2009.0104PMCID: PMC2679507

NIHMSID: NIHMS108121

Yi-Hau Chen is Associate Research Member with the Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan, Republic of China (E-mail: wt.ude.acinis.tats@nehchy). Nilanjan Chatterjee is Chief and Principle Investigator with the Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institute of Health, Department of Health and Human Services, Rockville Maryland 20852 (E-mail: vog.hin.liam@nrettahc). Raymond J. Carroll is Distinguished Professor with the Department of Statistics, Texas A&M University, College Station, Texas 77843−3143 (E-mail: ude.umat.tats@llorrac).

See other articles in PMC that cite the published article.

Case-control association studies often aim to investigate the role of genes and gene-environment interactions in terms of the underlying haplotypes (i.e., the combinations of alleles at multiple genetic loci along chromosomal regions). The goal of this article is to develop robust but efficient approaches to the estimation of disease odds-ratio parameters associated with haplotypes and haplotype-environment interactions. We consider “shrinkage” estimation techniques that can adaptively relax the model assumptions of Hardy-Weinberg-Equilibrium and gene-environment independence required by recently proposed efficient “retrospective” methods. Our proposal involves first development of a novel retrospective approach to the analysis of case-control data, one that is robust to the nature of the gene-environment distribution in the underlying population. Next, it involves shrinkage of the robust retrospective estimator toward a more precise, but model-dependent, retrospective estimator using novel empirical Bayes and penalized regression techniques. Methods for variance estimation are proposed based on asymptotic theories. Simulations and two data examples illustrate both the robustness and efficiency of the proposed methods.

Haplotypes, the combinations of alleles at multiple loci along individual homologous chromosomes, define the functional units of a gene through which the underlying protein product is made (Clark 2004). Association studies based on haplotypes, which can capture interloci interactions as well as “indirect association” due to linkage disequilibrium (LD) with unobserved causal variants, can be a powerful approach to the discovery and characterization of the genetic basis of complex diseases (Schaid 2004). Thus, in recent years, there has been tremendous interest in developing methods for haplotype-based regression analysis of genetic epidemiologic data. A technical problem has been that traditional epidemiologic studies only collect locus-specific genotype data, which does not provide the “phase information”, that is, which alleles appear at multiple loci along the individual chromosomes. Statistically, the lack of phase information can be viewed as a special missing data problem.

For logistic regression analysis of unmatched case-control studies, two classes of methods have evolved. The “prospective” methods (Schaid, Rowland, Tines, Jacobson, and Polalnd 2002; Zhao, Li, and Khalid 2003; Lake et al. 2003) ignore the underlying retrospective nature of the case-control design. These methods are considered robust in the sense that they depend very weakly on the underlying assumptions of Hardy-Weinberg equilibrium (HWE) and gene-environment (*G*-*E*) independence, although the assumptions cannot be totally avoided because of the phase ambiguity problem. In contrast, “retrospective” methods (Epstein and Satten 2003; Stram et al. 2003; Satten and Epstein 2004; Spinka, Carroll, and Chatterjee 2005; Lin and Zeng 2006), which properly account for case-control sampling, can fully exploit the assumptions of HWE and *G*-*E* independence to gain major efficiency over the prospective methods. It is often debatable which of the two types of methods is more suitable for a particular study. Prospective estimates of haplotype-effects and haplotype-environment interactions involving relatively rare haplotypes often tend to be very imprecise. Retrospective methods can produce much more precise estimates of those parameters, but concern often remains about the potential for bias because of the possible violation of the underlying assumptions, a potential we see in our simulations.

The potential for bias in retrospective methods can be reduced by flexible modeling approaches that relax the underlying assumptions. Alternative population genetic models that can relax the HWE assumptions have been used for retrospective haplotype analysis of case-control data (Satten and Epstein 2004; Lin and Zeng 2006). It has been also shown that the assumption of *G*-*E* independence can be relaxed to a large extent by assuming haplotypes are independent of *E* given unphased genotypes, but allowing the conditional distribution of *E* given the unphased genotypes to remain completely unrestricted (Lin and Zeng 2006). These solutions, which can alleviate the concern about bias, are not completely satisfactory. First, models for relaxing the HWE assumption can capture only certain types of departures from the underlying constraints for the diplotype distribution and may not be able to model phenomena such as excess heterozygosity. Second, even if a completely nonparametric model for the *G*-*E* distribution is available, we may still be able to gain efficiency in analysis of case-control data by exploiting the fact that HWE and G-E independence often do hold, approximately if not exactly. In the existing methods, if one uses a very general model for the distribution of *G*-*E*, then the concern about bias will be minimized, but inevitably efficiency will be lost.

Our main objective is to develop methods for haplotype-based analysis of case-control studies, which can gain efficiency by exploiting model assumptions of HWE and *G*-*E* independence for the underlying population and yet are resistant to bias when those model assumptions are violated. The basic idea involves shrinkage of a “model-free” estimator that is robust to HWE and *G*-*E* independence toward a “model-based” estimator that directly exploits those assumptions. The amount of “shrinkage” is sample size and data adaptive so that in large samples the method has no bias whether the assumptions of HWE and *G*-*E* independence hold, and yet the method can gain efficiency by shrinking the analysis toward HWE and *G*-*E* independence, but only to the extent the data validates the assumptions.

There are several novel aspects of our proposal. First, in Section 2.2, we propose a novel retrospective likelihood approach to haplotype analysis of case-control data that is robust to the nature of the gene-environment distribution in the underlying population. Second, in Section 3, we develop an empirical Bayes (EB)-type shrinkage estimation approach and a rigorous asymptotic theory for it. The key difficulty is that the problem is semiparametric, in that there are infinite-dimensional nuisance parameters associated with the joint distribution of the gene and the environment. Our method overcomes this difficulty by focusing only on the parameters of interest. In Section 4, we develop a penalized likelihood approach and asymptotic theory for it. The penalized likelihood involves shrinkage, not of a parameter or set of parameters to zero as is usually done, but to a model-based estimator, and also overcomes the problem of infinite-dimensional nuisance parameters. Effectively, we try to shrink the difference of the model-free and model-based estimators toward zero. In Sections 5 and 6, we use simulation studies and two real data examples to illustrate that unlike the existing haplotype-based regression methods, whose utility depends crucially on specific model assumptions, the proposed shrinkage methods adapt themselves to a wide range of situations.

Finally, although our scientific focus of this article is haplotype-based case-control studies, this article makes a far more general contribution. Using modern shrinkage and penalization techniques to combine assumption-laden and assumption-free methods in semiparametric problems with infinite dimensional nuisance parameters is an idea that transcends genetic association studies. We hope that our article will lead to further research in this more general area.

Let **H** = (**H**_{a}, **H**_{b}) denote the diplotype status (haplotype pair) for a subject at *M* loci of interest within a genomic region. Given **H** and a set of environmental covariates, **X**, assume that the risk of a binary disease outcome *D* is given by the logistic regression model

$$\text{logit}\left\{{\mathrm{pr}}_{\beta}(D=1\mid \mathbf{H},\mathbf{X})\right\}={\beta}_{0}+m(\mathbf{H},\mathbf{X};{\beta}_{1}),$$

(1)

where *m*(·) is a known but arbitrary function that specifies the log-odds-ratio of the disease as a function of **H** and **X** in terms of a set of regression parameters *β*_{1} . In (1), the effect of a diplotype can be further specified in terms of the effect of the constituent haplotypes assuming *dominant, recessive*, or *additive* modes of penetrance (Wallenstein, Hodge, and Weston 1998). Let **G** = (*g*_{1}, . . ., *g*_{M}) denote the unphased genotype data for the *M* loci. As explained earlier, the genotype data **G** could be consistent with multiple diplotypes because of phase ambiguity. We denote ${\mathcal{H}}_{\mathbf{G}}$ to be the set of all possible diplotypes that are consistent with the genotype data **G**. Let *F*(**X, G**) be the cumulative distribution function for **X** and **G** in the underlying population.

Assume data on **G** and **X** are collected in a case-control study for *N*_{0} controls (*D* = 0) and *N*_{1} cases (*D* = 1). Let *N* = *N*_{0} + *N*_{1}. The fundamental likelihood for case-control data, known as the “retrospective” likelihood, is given by

$$\begin{array}{cc}\hfill & {L}_{\mathrm{Haplo}}^{R}=\prod _{i=1}^{N}\mathrm{pr}({\mathbf{G}}_{i},{\mathbf{X}}_{i}\mid {D}_{i})\hfill \\ \hfill \phantom{\rule{1em}{0ex}}& =\prod _{i=l}^{N}\frac{\left\{\sum _{\mathbf{H}\in {\mathcal{H}}_{{\mathbf{G}}_{i}}}{\mathrm{pr}}_{\beta}({D}_{i}\mid {\mathbf{H}}_{i},{\mathbf{X}}_{i})\mathrm{pr}({\mathbf{H}}_{i}\mid {\mathbf{X}}_{i},{\mathbf{G}}_{i})\right\}dF({\mathbf{X}}_{i},{\mathbf{G}}_{i})}{{\int}_{X}\sum _{G}\left\{\sum _{\mathbf{H}\in {\mathcal{H}}_{{\mathbf{G}}_{i}}}{\mathrm{pr}}_{\beta}({D}_{i}\mid \mathbf{H},\mathbf{X})\mathrm{pr}(\mathbf{H}\mid \mathbf{X},\mathbf{G})\right\}dF(\mathbf{X},\mathbf{G})},\hfill \end{array}$$

(2)

where the last expression follows by Bayes theorem and the identity that

$$\mathrm{pr}(D\mid \mathbf{X},\mathbf{G})=\sum _{\mathbf{H}\in {\mathcal{H}}_{G}}{\mathrm{pr}}_{\beta}(D\mid \mathbf{H},\mathbf{X})\mathrm{pr}(\mathbf{H}\mid \mathbf{X},\mathbf{G}),$$

First, let us consider obtaining a “model-based” estimator for *β*_{1} under the assumption that **H** and **X** are independent and that the distribution of **H** follows HWE in the underlying population. Under these assumptions, the joint density function for **X** and **G** is

$$dF(\mathbf{X},\mathbf{G})=dF\left(\mathbf{X}\right)\times q\left(\mathbf{G}\right),$$

( 3 )

where *F*(**X**) is the marginal distribution function for $\mathbf{X},q\left(\mathbf{G}\right)={\sum}_{\mathbf{H}\in {\mathcal{H}}_{G}}\mathrm{pr}\left(\mathbf{H}\right)$, and pr(**H**) is the population frequency of the diplotype **H**. Under HWE for haplotypes, we have

$$\begin{array}{cc}\hfill {\mathrm{pr}}_{\theta}\{\mathbf{H}=({\mathbf{h}}_{a},{\mathbf{h}}_{b})\}& ={\theta}_{\alpha}^{2}\phantom{\rule{1em}{0ex}}\text{if}\phantom{\rule{thickmathspace}{0ex}}{h}_{a}={h}_{b}\hfill \\ \hfill & =2{\theta}_{a}{\theta}_{b}\phantom{\rule{1em}{0ex}}\text{if}\phantom{\rule{thickmathspace}{0ex}}{h}_{a}\ne {h}_{b},\hfill \end{array}$$

where *θ _{s}* denotes the population frequency for the haplotype

$$\mathrm{pr}(\mathbf{H}\mid \mathbf{X},\mathbf{G})=\mathrm{pr}(\mathbf{H}\mid \mathbf{G})=I(\mathbf{H}\in {\mathcal{H}}_{\mathbf{G}}){\mathrm{pr}}_{\theta}\left(\mathbf{H}\right)\u2215\sum _{{\mathbf{H}}_{\ast}\in {\mathcal{H}}_{G}}{\mathrm{pr}}_{\theta}\left({\mathbf{H}}_{\ast}\right).$$

( 4 )

Spinka et al. (2005) showed how to estimate ** β** and

Now consider obtaining a “model-free” estimator. Unfortunately, in the presence of phase ambiguity, ** β** and

To see why (4) involves very mild assumptions, note that if there were no phase ambiguity (i.e., **G** = **H**), then this formulation does not impose any restriction on the population distribution of the covariates of the logistic regression model (1). In this case, it follows from classical theory (Prentice and Pyke 1979) that the retrospective maximum-likelihood estimate of *β*_{1} could be obtained using standard prospective logistic regression analysis, the validity of which does not depend on assumptions for the covariate distribution. In contrast, the validity of Lin and Zeng's estimator does depend on the assumed diplotype distribution.

In the presence of phase ambiguity, violation of HWE and *H*-*X* independence for the underlying population would imply that the corresponding constraint for the distribution pr(**H**|**X, G**) will also not hold. However, in typical association studies involving tightly linked loci, the problem of phase ambiguity tends to be modest and the misspecification of the conditional distribution in such situations will have a fairly small influence on inference on the regression parameters in our model-free framework, see Section 5 for simulation results.

We next develop an algorithm for obtaining the semiparametric maximum likelihood estimator of *β*_{1}, one that maximizes the retrospective likelihood (2) under the assumption (4), allowing *F*(**G, X**) to remain completely nonparametric. We consider a profile-likelihood approach analogous to that described in Chatterjee and Carroll (2005), Spinka et al. (2005) and Chatterjee and Chen (2007). In particular, assuming that the nonparametric maximum likelihood estimator of *F*(**X, G**) allows masses only on the observed data points, following arguments analogous to Spinka et al. we can show that the estimator of *β*_{1}, which maximizes the retrospective likelihood (2), can be obtained from an alternative pseudo-likelihood of the data where the contribution of each subject is given by

$${L}_{\text{free}}(D,\mathbf{G},\mathbf{X},\Omega )=\frac{\sum _{\mathbf{h}\in {\mathcal{H}}_{G}}{q}_{\text{free}}(\mathbf{h}\mid \mathbf{G},\theta )\mathcal{M}(D,\mathbf{h},\mathbf{X},\mathbf{G},\Omega )}{\sum _{s=0}^{1}\sum _{\mathbf{h}\in {\mathcal{H}}_{G}}{q}_{\text{free}}(\mathbf{h}\mid \mathbf{G},\theta )\mathcal{M}(s,\mathbf{h},\mathbf{X},\mathbf{G},\Omega )},$$

( 5 )

where *q*_{free}(**h**|**G, θ**) denotes pr(

$$\mathcal{M}(d,\mathbf{h},\mathbf{x},\mathbf{g},\Omega )=\frac{\mathrm{exp}\left[d\{\kappa +m(\mathbf{h},\mathbf{X},{\beta}_{1})\}\right]}{1+\mathrm{exp}\{{\beta}_{0}+m(\mathbf{h},\mathbf{X},{\beta}_{1})\}},$$

*p _{d}* =

Note that *L*_{free}(*D*, **G, X, Ω**) will contain little information on ** θ**, because it conditions on

We note that the profile-likelihood estimator Spinka et al. (2005) derived for maximization of (2) under the assumptions of HWE and *H*-*X* independence corresponds to a pseudo-likelihood where the contribution of each subject is given by

$${L}_{\text{model}}(D,\mathbf{G},\mathbf{X},\Omega )=\frac{\sum _{\mathbf{h}\in {\mathcal{H}}_{G}}q(\mathbf{h};\theta )\mathcal{M}(D,\mathbf{h},\mathbf{X},\mathbf{G},\Omega )}{\sum _{s=0}^{1}\sum _{\mathbf{h}}q(\mathbf{h};\theta )\mathcal{M}(s,\mathbf{h},\mathbf{X},\mathbf{G},\Omega ),}$$

( 6 )

where *q*(**h; θ**) denotes pr(

Interestingly, the pseudo-likelihoods for both the “model-based” and the “model-free” estimators can be derived as a proper likelihood under an alternative sampling design, wherein a case-control study can be viewed as a prospective study with missing data. Consider a sampling scenario where each subject from the underlying population is selected into the case-control study using a Bernoulli sampling scheme where the selection probability for a subject given his or her disease status *D* = *d* is proportional to *μ*_{d} = *N _{d}*/pr(

$$\begin{array}{cc}\hfill {L}_{\text{model}}(D,\mathbf{G},\mathbf{X},\Omega )& =\mathrm{pr}(D,\mathbf{G}\mid \mathbf{X},R=1)\phantom{\rule{1em}{0ex}}\text{and}\hfill \\ \hfill {L}_{\text{free}}(D,\mathbf{G},\mathbf{X},\Omega )& =\mathrm{pr}(D\mid \mathbf{G},\mathbf{X},R=1).\hfill \end{array}$$

The proof for the former identity can be found in Spinka et al. (2005) and that for the latter identity follows similarly. Thus, in this alternative sampling scheme, the difference between the model-based and model-free estimators corresponds to whether the genotype data has been conditioned out of the likelihood or not.

As discussed in Chatterjee and Carroll (2005), in a case-control sample, it may be difficult to estimate the intercept parameter *b*_{0} even when the haplotype-environment independence assumption is imposed. However, the estimation of *b*_{0} can be avoided by imposing the rare-disease assumption so that the parameters in effect are (*k, β*_{1}, ** θ**). In the following discussion, we will adopt this convention and redefine the regression parameters

Once we obtain two estimators of ** β**, we propose to combine them using EB-type weighting in the spirit of Mukherjee and Chatterjee (2008). Previously, we have developed a general theory for obtaining such weighted estimators when the departure of the population distribution of the risk-factors from the underlying models can be indexed by a finite set of parameters. In the current setting, however, the departure of the nonparametric density

Let *β*_{free} and *β*_{model} denote the asymptotic limit of model-free and model-based estimators proposed previously. Note that when HWE and (*G*-*E*) independence hold, we have ** ψ** =

$$\mathbf{K}=\mathbf{V}{(\mathbf{V}+\widehat{\psi}{\widehat{\psi}}^{\u22ba})}^{-1},$$

where **V** is the (estimated) variance-covariance matrix of $\widehat{\psi}$. By this logic, we can construct an EB-type estimator

$${\widehat{\beta}}_{\mathrm{EB}}={\widehat{\beta}}_{\text{free}}+\mathbf{K}({\widehat{\beta}}_{\text{model}}-{\widehat{\beta}}_{\text{free}}),$$

( 7 )

We observe that Equation (7) suggests a general way of constructing simple shrinkage estimators. The shrinkage factor **K** determines the amount of shrinkage of the model-free estimator toward its model-based counterpart, with the two extremes being **K** = **I** (identity matrix) implying ${\widehat{\beta}}_{\mathrm{EB}}={\widehat{\beta}}_{\text{model}}$ and **K** = 0 implying ${\widehat{\beta}}_{\mathrm{EB}}={\widehat{\beta}}_{\text{free}}$. If the estimator is to be approximately consistent in large samples, whether the HWE and (*G*-*E*) independence assumptions hold or not, the matrix **K** should go to zero, at least when *β*_{model} ≠ *β*_{free}, as sample size increases. Moreover, if **K** goes to zero at a suitable rate, ${\widehat{\beta}}_{\mathrm{EB}}$ can be asymptotically equivalent to the model-free estimator, but in finite samples and when the difference between *β*_{model} and *β*_{free} is small, which might often be the case in practice, then ${\widehat{\beta}}_{\mathrm{EB}}$ can still have better finite sample performance in terms of the bias-variance trade-off. Simulation results in Section 5 will illustrate this feature. It is intuitive that **K** should be such that more weight should be given to ${\widehat{\beta}}_{\text{model}}$ or ${\widehat{\beta}}_{\text{free}}$ depending on the bias of the model-based estimator, a quantity that can be estimated empirically as $\widehat{\psi}={\widehat{\beta}}_{\text{model}}-{\widehat{\beta}}_{\text{free}}$. In Sections 5 and 6, we will also consider one such alternative EB-type shrinkage estimator where we choose **K** to be a diagonal matrix with the *i*th diagonal element being ${k}_{i}={v}_{i}\u2215({v}_{i}+{\widehat{\psi}}_{i}^{2})$, where * ν_{i}* is the

When the assumptions of HWE or *H*-*X* independence or both are violated so that *β*_{model} ≠ *β*_{free}, it is easy to show that the EB estimator is asymptotically equivalent to ${\widehat{\beta}}_{\text{free}}$, and hence is consistent for ** β** (see Appendix C). Nevertheless, utilizing the bias-variance trade-offs between ${\widehat{\beta}}_{\text{model}}$ and ${\widehat{\beta}}_{\text{free}}$, the EB estimator we propose can have substantially better finite sample properties than the model-free estimator

Now we consider the asymptotic theory for the EB-type estimator (7) when HWE and (*G*-*E*) independence hold, which implies *β*_{model} = *β*_{free} (i.e., the model-based estimator is consistent). Let **Ψ**_{model}(*D*, **G, X, Ω**_{model}) and **Ψ**_{free} (*D*, **G, X, Ω**_{free}) be the individual score/estimating functions for the model-based and model-free estimators. These are the derivatives of the logarithms of (5) and (6), respectively, with respect to the parameters, with the exception that in the model-free case, the score for ** θ** is as described in Section 2.3.1.

Let ${\mathcal{I}}_{\text{model}}$ be minus the expectation of ${N}^{-1}{\sum}_{i=1}^{N}\partial {\Psi}_{\text{model}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\Omega}_{\text{model}})\u2215\partial {\Omega}_{\text{model}}^{\u22ba}$ and let ${\mathcal{I}}_{\text{free}}$ be defined analogously. Let mean convergence in distribution to a random variable having the same distribution as the right hand side. It is a consequence of Theorems 2 and 3 in the Appendix B that ${\{{N}^{1\u22152}{({\widehat{\beta}}_{\text{model}}-{\beta}_{\text{model}})}^{\u22ba},{N}^{1\u22152}{({\widehat{\beta}}_{\text{free}}-{\beta}_{\text{free}})}^{\u22ba}\}}^{\u22ba}\Rightarrow {({\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Z}}_{\text{free}}^{\u22ba})}^{\u22ba}\sim \text{Normal}(0,{\Sigma}_{\beta})$, where

$${\Sigma}_{\beta}=\mathrm{cov}\left\{\begin{array}{c}\hfill \left({\mathbf{I}}_{p}\phantom{\rule{thinmathspace}{0ex}}0\right){\mathcal{I}}_{\text{model}}^{-1}{N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{model}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\Omega}_{\text{model}})\hfill \\ \hfill \left({\mathbf{I}}_{p}\phantom{\rule{thinmathspace}{0ex}}0\right){\mathcal{I}}_{\text{free}}^{-1}{N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\Omega}_{\text{free}})\hfill \end{array}\right\},$$

with **I**_{p} the identity matrix of size *p* = dim(** β**), and

$$\begin{array}{c}\hfill {N}^{1\u22152}({\widehat{\beta}}_{\mathrm{EB}}-{\beta}_{\text{free}})\Rightarrow [\mathcal{M}({\mathcal{Z}}_{\text{model}},{\mathcal{Z}}_{\text{free}}),\hfill \\ \hfill \phantom{\rule{1em}{0ex}}\phantom{\rule{1em}{0ex}}\mathbf{I}-\mathcal{M}({\mathcal{Z}}_{\text{model}},{\mathcal{Z}}_{\text{free}})]{({\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Z}}_{\text{free}}^{\u22ba})}^{\u22ba}.\hfill \end{array}$$

( 8 )

Of course, the limit distribution in (8) is not normally distributed, a phenomenon that is expected for many model-average estimators at the null or reduced model (Hjort and Claeskens 2003; Claeskens and Carroll 2007). However, simulation studies show that the lack of normality, while real, is not serious in practice in our context. The quantile-quantile plots shown in Figure 1 for comparing the distribution of the EB-type estimates from a set of simulations with the normal distribution illustrate this fact (see Section 5 for details about simulations). Moreover, the EB estimator in this situation can be more efficient than the model-free estimator not only in finite samples, but also in large samples (Tables 2 and and3,3, the blocks with *f* = 0, *γ*_{1,3} = 0). Such a phenomenon of “super-efficiency”, first observed by Hodges (e.g., see Lehmann 1983, pp. 405−406), is expected for many shrinkage estimators. Finally, we note that variance estimators we propose in the case of *β*_{model} ≠ *β*_{free} provide remarkably good approximations for the variance of the EB estimators even when *β*_{model} = *β*_{free}, although in the latter case the derivation of the asymptotic variances based on the *δ*-method is not strictly accurate, but it is acceptable in practice.

A further consideration of (7) suggests an entirely different approach to combining assumption-laden and assumption-free methods in semiparametric problems with infinite dimensional nuisance parameters. Specifically, setting **K*** = **I** – **K**, we can rewrite (7) as

$${\widehat{\beta}}_{\mathrm{EB}}={\widehat{\beta}}_{\text{model}}+{\mathbf{K}}^{\ast}({\widehat{\beta}}_{\text{free}}-{\widehat{\beta}}_{\text{model}}).$$

( 9 )

Because **K*** = 0 leads to the model-based estimator and **K*** = **I** leads to the model-free estimator, one interpretation of (9) is that one is shrinking the difference between the model-free and model-based estimators toward zero. With **K*** = 0 being interpreted as full shrinkage, a more useful interpretation of (9) is that one is shrinking the model-free estimator toward the model-based estimator when it is appropriate to do so.

With this idea in mind, we now turn to penalized likelihoods, which are also used for constructing shrinkage estimators. However, most of these proposals shrink parameters to zero. Based on the intuition of the EB-type estimators described in the previous paragraph, we instead propose shrinking the model-free estimator toward the model-based estimator (or equivalently, shrinking their difference to zero) by maximization of a penalized log-likelihood of the form

$$\begin{array}{cc}\hfill {l}_{p}=& \sum _{i=1}^{N}\phantom{\rule{thickmathspace}{0ex}}\mathrm{log}\phantom{\rule{thickmathspace}{0ex}}\left\{{L}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},\Omega )\right\}\hfill \\ \hfill & -\sum _{j=1}^{p}{\lambda}_{j}P({\beta}_{j},{\widehat{\beta}}_{\text{model},j}),\hfill \end{array}$$

( 10 )

where ${\widehat{\beta}}_{\text{model},j}$ denotes the estimate of *β _{j}* from the model-based likelihood

There are two unique features of (10). First, we propose shrinkage directly with respect to the focus parameters of interest. Given that we are trying to exploit *G*-*E* independence and HWE, which are assumptions about nuisance parameters, it may seem more natural that one maximizes a penalized likelihood where the penalties are given to the nuisance parameters. However, in a semiparametric model involving infinite dimensional nuisance parameters, this could become a challenging task. By formulating the problem with respect to the focus parameters themselves, however, we can simply work with the profile likelihoods that are completely free of the nuisance parameters. A second interesting feature of our proposal is that we are shrinking parameters, not toward some fixed values as is usually done, but toward the estimates obtained from an alternative model.

One can use both *L*_{1} (LASSO) and *L*_{2} (ridge) penalty functions, with the former given by $P({\beta}_{j},{\widehat{\beta}}_{\text{model},j})=\mid {\beta}_{j}-{\widehat{\beta}}_{\text{model},j}\mid $ and the latter by $P({\beta}_{j},{\widehat{\beta}}_{\text{model},j})={({\beta}_{j}-{\widehat{\beta}}_{\text{model},j})}^{2}$. For logistic regression models, *L*_{2} penalized likelihoods can be maximized by iteratively reweighted ridge regressions and hence can be implemented conveniently. An interesting feature of the *L*_{1} penalty is that it can produce “sparse” solutions [i.e., we can have certain estimates exactly equal to those obtained from the HWE and *G*-*E* independence model (Tibshirani 1996)]. More detailed discussion of the respective properties for the two types of penalty functions can be found in Hastie, Tibshirani, and Friedman (2001).

Two issues are important for practical implementation of the penalized likelihood estimation: (a) computation and (b) the choice of the penalty parameters *λ _{j}*. In the Appendix D we deal with issue (a) and in what follows we deal with issue (b).

Both *L*_{1} and *L*_{2} penalized estimation can be interpreted as Bayesian posterior mode estimation (Tibshirani 1996), with the prior given by the Laplace and the normal distributions, respectively. Note that the penalty parameter *λ _{j}* in the penalized log-likelihood (10) has a 1−1 correspondence to the variance

We conducted simulation studies to examine the performance of the EB-type and penalized likelihood shrinkage estimators. We implemented two EB estimators, termed EB1 and EB2, one corresponding to “multivariate shrinkage” with $\mathbf{K}=\mathbf{V}{(\mathbf{V}+\widehat{\psi}{\widehat{\psi}}^{\u22ba})}^{-1}$, where **V** is the estimated variance-covariance of $\widehat{\psi}={\widehat{\beta}}_{\text{model}}-{\widehat{\beta}}_{\text{free}}$, and the other corresponding to “componentwise shrinkage” with $\mathbf{K}=\mathrm{diag}\{{v}_{i}\u2215({v}_{i}+{\widehat{\psi}}_{i}^{2})\}$, where ${\widehat{\psi}}_{i}$ is the *i*th component of $\widehat{\psi}$ and *ν*_{i} is the *i*th diagonal component of **V**. We also implemented two penalized likelihood methods, PL1 and PL2, corresponding respectively to the *L*_{1} and *L*_{2} penalties, with the penalty parameters chosen as described in Section 4.1. Haplotype data were simulated using the haplotype patterns and frequencies (see Table 1) for five Single Nucleotide Polymorphisms (SNPs) along a diabetes susceptibility region on chromosome 22, reported in the Finland-United States Investigation of Non-Insulin-Dependent Diabetes Mellitus (FUSION) study (Epstein and Satten 2003). The environmental covariate *X* is a binary variable, with pr(*X* = 1) = 0.3.

In our simulations, given the environmental covariate *X*, we generated a 5-SNP diplotype for a subject according to the model

$$\mathrm{log}\left[\frac{\mathrm{pr}\{\mathbf{H}=({\mathbf{h}}_{{j}_{1}},{\mathbf{h}}_{{j}_{2}})\mid X\}}{\mathrm{pr}\{\mathbf{H}=({\mathbf{h}}_{1},{\mathbf{h}}_{1})\mid X\}}\right]={\gamma}_{0{j}_{1}{j}_{2}}+({\gamma}_{1,{j}_{1}}+{\gamma}_{1,{j}_{2}})X,$$

( 11 )

where *j*_{1}, *j*_{2} = 1, . . ., 7 index haplotypes, and the diplotype (*h*_{1}, *h*_{1}) is chosen as the reference. The parameters {*γ*_{0j1j2}} are specified so that the marginal (unconditional on *X*) diplotype frequencies are given as

$$\mathrm{pr}\{\mathbf{H}=({\mathbf{h}}_{{j}_{1}},{\mathbf{h}}_{{j}_{2}})\}=\{\begin{array}{cc}2(1-f){\theta}_{{j}_{1}}{\theta}_{{j}_{2}}\hfill & {j}_{1}\ne {j}_{2}\hfill \\ f{\theta}_{{j}_{1}}+(1-f){\theta}_{{j}_{1}}^{2}\hfill & {j}_{1}={j}_{2}\hfill \end{array}\phantom{\}}$$

( 12 )

where the *θ _{j}* are the haplotype frequencies given in Table 1, and

$$\mathrm{logit}\phantom{\rule{thickmathspace}{0ex}}\mathrm{pr}(D=1\mid \mathbf{H},X)={\beta}_{0}+{\beta}_{H}\mathcal{H}+{\beta}_{X}X+{\beta}_{HX}\mathcal{H}X,$$

( 13 )

where $\mathcal{H}$ is coded as indicating whether a subject carries at least one copy of the causal haplotype “01100” (*j* = 3) (i.e., a dominant genetic model), or indicating whether a subject carries two copies of this haplotype (i.e., a recessive model). The parameter values (*β*_{0}, *β _{H}*,

In the simulations we compare the shrinkage estimators with the model-free and model-based estimators. We also consider the estimator ${\widehat{\beta}}_{\mathrm{HWD}}$ accounting for the Hardy-Weinberg disequilibrium of the form (12) but not accounting for *H*-*X* dependence, which is proposed in Lin and Zeng (2006).

Tables 2 and and33 display simulation results for the dominant and recessive models, respectively. We take the sample sizes to be *N*_{1} = *N*_{2} = 150, 300, or 600 for the dominant model and *N*_{1} = *N*_{0} = 300, 600 or 1,000 for the recessive model. To save space, we report only the results for *β _{H}* and

When the assumptions of HWE and *H*-*X* independence are met (*f* = 0, *γ*_{1,3} = 0), all the different estimators are essentially unbiased. In this situation, the model-based estimator can be remarkably more efficient than the model-free estimator, especially when the genetic effect is recessive. The different shrinkage estimators give up some efficiency relative to the model-based estimator, but all of them, with the occasional exception of EB1 (multivariate empirical-Bayes shrinkage), achieve major gains in efficiency over the model-free estimators in small samples (*N*_{1} = *N*_{0} = 150 or 300). Moreover, EB2 (component-wise empirical-Bayes shrinkage) consistently retains substantial gains over the model-free estimator even for larger sample sizes such as *N*_{1} = *N*_{0} = 600 or *N*_{1} = *N*_{0} = 1,000. The estimator ${\widehat{\beta}}_{\mathrm{HWD}}$ assuming Hardy-Weinberg disequilibrium performs similarly to the model-based estimator.

When the assumptions of HWE or *H*-*X* independence or both are violated, the model-based estimator yields very large bias for the genetic main effect or the gene-environment interaction parameter or both. The model-free estimator, although it depends weakly on these same assumptions, has negligible bias. As a result, in these situations, the mean squared error (MSE) for the model-based estimator is often much larger than that of the model-free estimator when the violation of model assumptions is severe or the sample size is large. Note that when only the assumption of HWE is violated but *H*-*X* independence still holds (*f* = 0.05, *γ*_{1,3} = 0), the estimator ${\widehat{\beta}}_{\mathrm{HWD}}$ performs best among all the estimators considered, which is as expected because ${\widehat{\beta}}_{\mathrm{HWD}}$ assumes exactly the true model in this setting. When the *H*-*X* independence is violated, ${\widehat{\beta}}_{\mathrm{HWD}}$ performs similarly to the model-based estimator and has large bias and MSE. All of the shrinkage estimators adapt to the situation quite nicely and reduce the bias dramatically. The MSE of the shrinkage estimators usually remains much smaller than that of the model-free estimator when the sample size is small, whereas the performances of the shrinkage and model-free estimators become quite similar in large samples. Overall, among the various shrinkage estimators, the PL1 produces the smallest MSE in most cases under the dominant model, whereas the EB2 and PL2 produce the smallest MSE in most cases under the recessive model. The magnitude of the bias for all shrinkage estimators is similar, with that for the EB2 and PL1 estimators being the largest in most cases.

Variance estimators for the EB and penalized estimators are given in the Appendices C and F. In Tables given in the supplementary material and available from the first author, we have studied the performance of these variance estimators. We observed for each shrinkage estimator that the mean of the estimated variances over simulations was quite close to the empirical variance of the shrinkage estimator. The variance estimators work remarkably well even when HWE and (*G*-*E*) independence assumptions are met, although in this situation the application of the *δ*-method is indeed not technically correct due to nonnormality of the limiting distribution.

In this section, we discuss results from two case-control data examples. The examples were chosen in such a way that from a priori biological grounds one would expect the gene-environment independence assumption to hold in one example, but probably not in the other. The two examples taken together illustrate how the different shrinkage estimators adapt to alternative scenarios for the gene-environment distribution.

The first application involves a case-control study of colorectal adenoma, a precursor of colorectal cancer. The study sample includes 628 prevalent advanced adenoma cases and 635 gender-matched controls, selected from the screening arm of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial at the National Cancer Institute, USA (Gohagan, Prorok, Hayes, and Kramer 2000; Moslehi et al. 2006). One of the main objectives of this study is to assess whether the smoking-related risk of colorectal adenoma may be modified by certain haplotypes in *NAT2*, a gene known to be important in the metabolism of smoking related carcinogens. In addition, because *NAT2* is involved in the smoking metabolism pathway, potentially it can influence an individual's addiction to smoking itself, causing the gene-environment independence assumption to be violated.

The second application involves a case-control study of prostate cancer. The sample includes 749 prostate cancer cases and 781 controls, also selected from the screening arm of the PLCO trial described earlier. The main objective of the study is to examine the relationship between risk of prostate cancer and [25(OH)D], a serum level biomarker of vitamin D, that reflects both dietary and sunlight exposures. The anticancer effect of vitamin D is hypothesized due to the ability of prostate cells to convert [25(OH)D)] into 1,25-dihydroxy-vitamin D [1,25(OH)2D], the most active form of this vitamin, which regulates the gene transcription of many proteins involving cellular differentiation, proliferation, and apoptosis via the vitamin D receptor (VDR). It is thus of interest to see whether genetic polymorphisms in the VDR gene modify the relationship between [25(OH)D)] and risk of prostate cancer. Given the “downstream” role of the VDR gene in the vitamin-D pathway, it is very unlikely that these polymorphisms actually could influence the level of the [25(OH)D] itself. Thus, the gene-environment independence assumption in this application is likely to be valid.

In the colorectal adenoma study, genotype data were available on six SNPs, for which seven common haplotypes (with estimated frequency > 0.5%) are inferred by the EM algorithm of Li, Khalid, Carlson, and Zhao (2003). In the prostate cancer study, genotype data were available on three SNPs, for which four common haplotypes (with estimated frequency > 0.5%) are inferred by the expectation maximization (EM) algorithm. Our association analysis is performed by fitting the logistic regression model (13). The haplotype covariate vector $\mathcal{H}$ is coded by a multiplicative rule (i.e., the *h*th component of $\mathcal{H}$ is the number of haplotypes *h* carried by a subject). The most frequent haplotype is chosen as the reference haplotype, and those with haplotype frequency < 2% are grouped into the reference.

For the colorectal adenoma study, the environmental covariate vector **X** includes the variables Age (age in years), Female (indicator for female gender), Smk1 and Smk2 (indicators respectively for former and current smokers). All the environmental variables are centered to have mean zero. The haplotype-environment interaction terms include the interactions of the haplotype “101010”, a key haplotype of interest because of its known role in the metabolism of smoking related carcinogens (Chang-Claude et al. 2002), with Smk1 and Smk2. For the prostate cancer study, the environmental covariate vector **X** includes the age level (categorized into four groups), center (nine groups), and the [25(OH)D)] level (nmol/L). We consider the interaction of the haplotypes “000” and “001” with the standardized [25(OH)D)] level (standardized by its sample standard deviation), which are the most promising interactions according to preliminary analysis.

Table 4 shows the results for the two applications, based on the association analyses introduced in Sections 2, 3, and 4. For simplicity, we report only results for the interaction parameters of main interest (the full models we fitted to the two datasets are described in Section 6.1). Also, we report only results from the model-free and model-based analyses, the component-wise empirical Bayes shrinkage analysis, and the *L*_{1} penalized likelihood analysis (with the penalty parameters chosen by matching them with the estimated prior variances, as discussed in Section 4.1). The results from the multivariate empirical Bayes shrinkage analysis are quite similar to those from the component-wise empirical Bayes shrinkage analysis, and a similar relationship holds between the *L*_{1} and the *L*_{2} penalized likelihood analyses. The standard errors are obtained by the formulas given in the Appendices C and F, and the *p*-values are obtained by normal approximation.

Results for haplotype-environment interaction analyses of the colorectal adenoma study (Example 1) and the prostate cancer study (Example 2). The full models fitted are described in Section 6.1

The major conclusions we draw from Table 4 are as follows. For the colorectal adenoma study, where we expect that the haplotype-environment independence may be violated and hence the model-based analysis may be biased, we do observe a large discrepancy between the point estimates from the model-free and the model-based analyses. The empirical Bayes and the penalized likelihood methods we propose nicely adapt to the situation and produce results much closer to the model-free analysis than the model-based analysis. In summary, carriers of the haplotype “101010” tend to have lower smoking related risk than noncarriers. This agrees with previous laboratory and epidemiologic studies that have identified the haplotype “101010”, known as *NAT2***4*, as a rapid metabolizer for smoking related carcinogens (e.g., see Moslehi et al. 2006).

For the prostate cancer study, where we expect haplotype-environment independence to hold, the model-free and the model-based analyses do produce very similar point estimates, revealing that both of them may be unbiased. However, the model-based analysis is more efficient than the model-free analysis. We note that in this example the component-wise empirical Bayes analysis and the *L*_{1} penalized likelihood analysis produce results quite close to the model-based analysis, especially in the interaction of the [25(OH)D)] level with haplotype “000”, and thus retain the efficiency gain. In summary, we can conclude from the model-based and the proposed shrinkage methods that the haplotype “000” may significantly modify the risk of prostate cancer associated with the [25(OH)D)] level.

In this work we first examined retrospective likelihood methods for haplotype-based case-control studies. A “model-free” retrospective likelihood was proposed, which, in contrast to the “model-based” retrospective likelihood (Spinka et al. 2005) that assumes HWE and haplotype-environment independence to specify the joint distribution of diplotypes and environmental exposures, utilizes the same assumptions only to specify the *conditional* diplotype distribution given environmental exposures and unphased genotypes. Our simulation studies show that such retrospective likelihood analysis leads to inference that is very robust to violation of the HWE and (*G*-*E*) independence assumptions. With the rare disease assumption, the proposed “model-free” retrospective likelihood is closely related to the “prospective likelihood” used by Zhao et al. (2003) and Lake et al. (2003).

We further considered a number of alternative shrinkage estimation techniques for adaptive relaxation of the HWE and *G*-*E* independence assumptions from the model-based retrospective likelihood. These different shrinkage estimators were constructed by different ways of shrinking the model-free estimator toward the model-based estimator, with the aim of achieving better finite sample properties in terms of the bias-efficiency trade-off. Our simulation studies reveal that the proposed shrinkage estimators can dramatically reduce the bias of “model-based” retrospective methods in the presence of violation of model assumptions. On the other hand, when the model assumptions are satisfied, exactly or approximately, the shrinkage estimators can retain a considerable gain in efficiency over the “model-free” estimators. Thus, overall, the proposed techniques provide a natural way of trading off between bias and efficiency in the type of problems we studied in this article. Two empirical illustrations were described; one where the assumption of *G*-*E* independence is likely violated and one where it likely holds, and analysis results from these examples are consistent with those from the simulations.

We also have studied asymptotic theory for the EB and penalized shrinkage estimators. We have shown that the proposed EB estimator is approximately $\sqrt{N}\text{-consistent}$ whether the underlying assumptions of HWE and *G*-*E* independence hold or not. Further, when the assumptions of HWE or *G*-*E* independence assumptions or both are violated, we have shown that the EB estimator has an asymptotic normal distribution, the variance of which can be reliably estimated in a simple closed form using the *δ* method. On the other hand, when the assumptions of HWE and *G*-*E* independence are met, the proposed EB estimator can be viewed as a model-average estimator that converges to a nonnormal distribution. In practice, however, we found that this limiting nonnormal distribution can be well approximated by a normal distribution, the variance of which can still be reliably estimated using the same estimator derived for the case when the model assumptions fail. All the previously mentioned properties are also satisfied by the penalized estimators for suitable choice of the penalty parameters.

In this article, we have compared the performance of the alternative shrinkage estimators in terms of mean squared errors of the estimators. In large genetic association studies, however, one is often first interested in testing as opposed to estimation. It is important to note that the misspecification of HWE or (*G*-*E*) independence or both causes bias even under the null hypothesis of no association or no interaction. Thus, from both the testing and estimation points of view, it is important to account for the impact of model misspecification. Although in this article we do not study the properties of shrinkage estimators from a testing point of view, a recent study has reported that EB-type shrinkage estimators indeed performs very well for large scale association testing (Mukherjee et al. 2008).

The advantages in bias, efficiency, computational simplicity, and availability of a unified approach to inference make the proposed shrinkage procedures very appealing in genetic association studies. Further, using modern shrinkage and penalization techniques to combine assumption-laden and assumption-free methods in semiparametric problems is an idea that transcends genetic association studies. We hope that this work will lead to further research in this general area.

Chen's research was supported by the National Science Council of ROC (NSC 95−2118-M-001−022-MY3). Chatterjee's research was supported by a gene-environment initiative grant from the National Heart Lung and Blood Institute (RO1HL091172−01) and by the Intramural Research Program of the National Cancer Institute. Carroll's research was supported by grants from the National Cancer Institute (CA57030, CA104620) and by Award Number KUS-CI-016−04, made by King Abdullah University of Science and Technology (KAUST). The authors thank the editor, associate editor, and referees for their helpful comments.

In this section, we use the subscripts “alt” and “cc” to denote the expectation operators under the alternative (see Section 2.3.3) and the case-control sampling schemes. We drop the subscript “free” for the parameters.

Let **Ψ**_{free}(*D*, **G, X, Ω**) be the estimating function we consider for the “model-free” estimation, where the estimating function for ** β** is obtained by the score log

The following result will be used to justify using **Ψ**_{free}(*D*, **G, X, Ω**) as an unbiased estimating function in the case-control sampling scheme.

*Theorem 1*. For any function *K*(*D*, **G, X**),

$${E}_{\mathrm{alt}}\{K(D,\mathbf{G},\mathbf{X})\mid R=1\}={E}_{\mathrm{cc}}\left\{K(D,\mathbf{G},\mathbf{X})\right\}.$$

(A.1)

A.1 PROOF OF THEOREM 1

By definition,

$$\begin{array}{c}\hfill {E}_{\mathrm{cc}}\left\{K(D,\mathbf{G},\mathbf{X})\right\}={N}^{-1}\sum _{i=1}^{N}E\{K({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i})\mid {D}_{i}\}\hfill \\ \hfill =\sum _{d=0}^{1}{p}_{d}E\{K(d,\mathbf{G},\mathbf{X})\mid D=d\}.\hfill \end{array}$$

Further, note that pr_{alt}(*D* = *d*|*R* = 1) *p _{d}*. Recall that the distribution of

$$\begin{array}{cc}\hfill {E}_{\mathrm{alt}}\{K(D,\mathbf{G},\mathbf{X})\mid R=1\}& ={E}_{\mathrm{alt}}[{E}_{\mathrm{alt}}\{K(D,\mathbf{G},\mathbf{X})\mid D,R=1\}\mid R=1]\hfill \\ \hfill & ={E}_{\mathrm{alt}}[{E}_{\mathrm{alt}}\{K(D,\mathbf{G},\mathbf{X})\mid D\}\mid R=1]\hfill \\ \hfill & ={E}_{\mathrm{alt}}[E\{K(D,\mathbf{G},\mathbf{X})\mid D\}\mid R=1]\hfill \\ \hfill & =\sum _{d=0}^{1}{p}_{d}E\{K(D,\mathbf{G},\mathbf{X})\mid D=d\},\hfill \end{array}$$

completing the proof. In the previous argument, the second line uses the fact that the distribution of *R* is independent of (**H, G, X**) given *D*, whereas the third line notes that probability calculations about the distribution of (**H, G, X**) given *D* are the same in either sampling scheme.

Theorem 1 in Appendix A shows that expectations in the case-control sampling scheme are the same as expectations in the alternative sampling scheme, and hence that **Ψ**_{free}(·) is an unbiased estimating function [i.e., ${N}^{-1}{\sum}_{i=1}^{N}E\{{\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\Omega}_{\text{free}})\mid {D}_{i}\}=0$]. Let **I**_{free} be the information matrix,

$${\mathcal{I}}_{\text{free}}=-{N}^{-1}\sum _{i=1}^{N}E\{\partial {\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\Omega}_{\text{free}})\u2215\partial {\Omega}_{\text{free}}^{\u22ba}\mid {D}_{i}\},$$

which can be estimated as

$${\widehat{\mathcal{I}}}_{\text{free}}=-{N}^{-1}\sum _{i=1}^{N}\partial {\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\widehat{\Omega}}_{\text{free}})\u2215\partial {\Omega}_{\text{free}}^{\u22ba}.$$

Define ${\Lambda}_{\text{free}}={\sum}_{d}{p}_{d}E\{{\Psi}_{\text{free}}(d,\mathbf{G},\mathbf{X},{\Omega}_{\text{free}})\mid D=d\}E\{{\Psi}_{\text{free}}^{\u22ba}(d,\mathbf{G},\mathbf{X},{\Omega}_{\text{free}})\mid D=d\}$ where the expectations are estimated as ${N}_{d}^{-1}{\sum}_{i=1}^{N}I({D}_{i}=d){\Psi}_{\text{free}}(d,{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\widehat{\Omega}}_{\text{free}})$. Also define ${\mathcal{I}}_{\text{free}}^{\ast}={\sum}_{d}{p}_{d}E\{{\Psi}_{\text{free}}(d,\mathbf{G},\mathbf{X},{\Omega}_{\text{free}}){\Psi}_{\text{free}}^{\u22ba}(d,\mathbf{G},\mathbf{X},{\widehat{\Omega}}_{\text{free}})\mid D=d\}$, which can be estimaed by

$${\widehat{\mathcal{I}}}_{\text{free}}^{\ast}={N}^{-1}\sum _{i=1}^{N}{\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\widehat{\Omega}}_{\text{free}}){\Psi}_{\text{free}}^{\u22ba}({D}_{i},{\mathbf{G}}_{i}{\mathbf{X}}_{i},{\widehat{\Omega}}_{\text{free}}),$$

Then because of Theorem 1, we have the following result.

*Theorem 2*. As N_{0}, *N*_{1} → ∞,

$$\begin{array}{cc}\hfill {N}^{1\u22152}({\widehat{\Omega}}_{\text{free}}-{\Omega}_{\text{free}})=& {N}^{-1\u22152}\sum _{i=1}^{N}{\mathcal{I}}_{\text{free}}^{-1}{\Psi}_{\text{free}}\hfill \\ \hfill & \times ({D}_{i},{\mathit{G}}_{i},{\mathit{X}}_{i},{\Omega}_{\text{free}})+{o}_{p}\left(1\right).\hfill \end{array}$$

(A.2)

In particular this means that ${N}^{1\u22152}({\widehat{\Omega}}_{\text{free}}-{\Omega}_{\text{free}})\Rightarrow \text{Normal}\{0,{\mathcal{I}}_{\text{free}}^{-1}({\mathcal{I}}_{\text{free}}^{\ast}-{\Lambda}_{\text{free}}){\mathcal{I}}_{\text{free}}^{-1}\}$. In addition, the estimates ${\widehat{\mathcal{I}}}_{\text{free}},{\widehat{\mathcal{I}}}_{\text{free}}^{\ast}$, and ${\widehat{\Lambda}}_{\text{free}}$ are consistent for ${\mathcal{I}}_{\text{free}},{\mathcal{I}}_{\text{free}}^{\ast}$ and **Λ**_{free}.

For the sake of completeness, we note that a similar expansion was derived by Spinka, et al. (2005) when HWE and (*G*-*E*) independence hold.

*Theorem 3*. As *N*_{0}, *N*_{1} → ∞,

$$\begin{array}{cc}\hfill {N}^{1\u22152}({\widehat{\Omega}}_{\text{model}}-{\Omega}_{\text{model}})=& {N}^{-1\u22152}\sum _{i=1}^{N}{\mathcal{I}}_{\text{model}}^{-1}{\Psi}_{\text{model}}\hfill \\ \hfill & \times ({D}_{i},{\mathit{G}}_{i},{\mathit{X}}_{i},{\Omega}_{\text{free}})+{o}_{p}\left(1\right).\hfill \end{array}$$

(A.3)

In particular, this means that ${N}^{1\u22152}({\widehat{\Omega}}_{\text{model}}-{\Omega}_{\text{model}})\Rightarrow \text{Normal}\{0,{\mathcal{I}}_{\text{model}}^{-1}({\mathcal{I}}_{\text{model}}-{\Lambda}_{\text{model}}){\mathcal{I}}_{\text{model}}^{-1}\}$. In addition, the estimates ${\widehat{\mathcal{I}}}_{\text{model}}$ and ${\widehat{\Lambda}}_{\text{model}}$ are consistent for **I**_{model} and **Λ**_{model} (all the matrices here are defined analogously to their counterparts defined in Theorem 2).

Here we focus on ${\widehat{\beta}}_{\mathrm{EB}1}$; the asymptotic theory for ${\widehat{\beta}}_{\mathrm{EB}2}$ can be derived analogously (the definitions for the EB1 and EB2 empirical Bayes-type estimators is given in Section 5.1). The asymptotic theory is simple, because when *β*_{model} ≠ *β*_{free}, the EB shrinkage estimators are asymptotically equivalent to the model-free estimator (see later). Our main goal here is to develop for the EB shrinkage estimator an approximate covariance matrix estimator that is more accurate in finite samples than the covariance matrix for the model-free estimator. Note that, although the latter also serves as an estimator for variance of the EB estimator due to the asymptotic equivalence, it is often too conservative in finite samples (simulation data not shown).

Let $\widehat{\beta}={({\widehat{\beta}}_{\text{model}}^{\u22ba},{\widehat{\beta}}_{\text{free}}^{\u22ba})}^{\u22ba}$, and ${\widehat{\Sigma}}_{\widehat{\beta}}$ denote the estimated variance-covariance of $\widehat{\beta}$. Let ${\widehat{\Psi}}_{\text{model},i}={\Psi}_{\text{model}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},{\widehat{\Omega}}_{\text{model}})$, and ${\widehat{\Psi}}_{\text{free},i}$ be defined analogously. The block-diagonal terms of ${\widehat{\Sigma}}_{\widehat{\beta}}$ are obtained directly from Theorems 2 and 3, and, by Theorems 2 and 3, the off block-diagonal terms are given by

$${N}^{-2}\left({\mathbf{I}}_{p}\phantom{\rule{1em}{0ex}}0\right){\widehat{\mathcal{I}}}_{\text{model}}^{-1}\left(\sum _{i=1}^{N}{\widehat{\Psi}}_{\text{model},i}{\widehat{\Psi}}_{\text{free},i}^{\u22ba}-\widehat{\Lambda}\right){\widehat{I}}_{\text{free}}^{-1}{\left({\mathbf{I}}_{p}\phantom{\rule{1em}{0ex}}0\right)}^{\u22ba}$$

and its transpose, where **I*** _{p}* is the identity matrix of size

To get an approximate estimated covariance matrix, which is more accurate than that of the model-free estimator in small samples, we use the following calculations. Define

$${\stackrel{~}{\beta}}_{\mathrm{EB}1}={\beta}_{\text{free}}+\mathbf{V}{(\mathbf{V}+\psi {\psi}^{\u22ba})}^{-1}\psi $$

and note that by a first order Taylor expansion, we can write

$$\sqrt{N}({\widehat{\beta}}_{\mathrm{EB}1}-{\stackrel{~}{\beta}}_{\mathrm{EB}1})={\mathcal{G}}_{1}\times \sqrt{N}(\widehat{\beta}-{\beta}_{\ast})+{o}_{p}\left(1\right),$$

where the *p* × 2*p* matrix **G**_{1} is given as

$$\begin{array}{c}\hfill {\mathcal{G}}_{1}=(-\frac{2\psi {\psi}^{\u22ba}{\mathbf{V}}^{-1}}{{(1+{\psi}^{\u22ba}{\mathbf{V}}^{-1}\psi )}^{2}}+\frac{1}{(1+{\psi}^{\u22ba}{\mathbf{V}}^{-1}\psi )}{\mathbf{I}}_{p}\phantom{)}\hfill \\ \hfill \phantom{(}\frac{{\psi}^{\u22ba}{\mathbf{V}}^{-1}\psi}{(1+{\psi}^{\u22ba}{\mathbf{V}}^{-1}\psi )}{\mathbf{I}}_{p}+\frac{2\psi {\psi}^{\u22ba}{\mathbf{V}}^{-1}}{{(1+{\psi}^{\u22ba}{\mathbf{V}}^{-1}\psi )}^{2}}).\hfill \end{array}$$

Of course, because **V** = *O _{p}*(

$${\mathcal{G}}_{2}=\left(\mathrm{diag}\left\{\frac{{v}_{j}({v}_{j}-{\psi}_{j}^{2})}{{({v}_{j}+{\psi}_{j}^{2})}^{2}}\right\}{\mathbf{I}}_{p}-\mathrm{diag}\left\{\frac{{v}_{j}({v}_{j}-{\psi}_{j}^{2})}{{({v}_{j}+{\psi}_{j}^{2})}^{2}}\right\}\right).$$

The penalized likelihood estimators ${\widehat{\beta}}_{\mathrm{PL}1}$ and ${\widehat{\beta}}_{\mathrm{PL}2}$ are obtained from (10) with *L*_{1} (LASSO) and *L*_{2} (ridge) penalties, respectively. Their implementation involves solving the corresponding score functions for (10); note that the score function used for ** θ** in

Following Fan and Li (2001, Section 3.3), we use a unified Newton-Raphson algorithm for implementing these estimators, where the penalty function is approximated locally by a quadratic function, which is exact for the *L*_{2} penalty. Specifically, write the penalty function *P*(*β _{j}*,

$$\mathrm{d}P\left(\mid {b}_{j}^{\ast}\mid \right)\u2215\mathrm{d}{b}_{j}\approx \{\stackrel{.}{P}\left(\mid {b}_{j}^{\ast}\mid \right)\u2215\mid {b}_{j}^{\ast}\mid \}{b}_{j}^{\ast},$$

where the superscript dot denotes derivative. Let

$${\Psi}_{\text{free}}\left(\Omega \right)=\sum _{i=1}^{N}{\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},\Omega ),$$

where **Ψ**_{free}(·) is defined in Section 3.1, and

$${\stackrel{\u2012}{\mathcal{I}}}_{\text{free}}\left(\Omega \right)=-\partial {\Psi}_{\text{free}}^{\u22ba}\left(\Omega \right)\u2215\partial \Omega .$$

Let ${\widehat{\Omega}}_{\text{model}}$ be the “model-based” estimates for **Ω** obtained from (6). Then at a current parameter estimate **Ω*** = (*β**, *θ**), the score function for the penalized likelihood can be approximated by

$${\Psi}_{\text{free}}\left({\Omega}^{\ast}\right)-\Gamma \left({\Omega}^{\ast}\right)({\Omega}^{\ast}-{\widehat{\Omega}}_{\text{model}}),$$

where the matrix **Γ**(**Ω***) is diagonal whose diagonal elements are ${\lambda}_{j}\stackrel{.}{P}\left(\mid {b}_{j}^{\ast}\mid \right)\u2215\mid {b}_{j}^{\ast}\mid $ for those corresponding to *β*_{j}, and are 0 for those corresponding to ** θ**, becaue we do not penalize the estimates for haplotype frequency parameters

$$\begin{array}{cc}\hfill \widehat{\Omega}=& {\widehat{\Omega}}^{\ast}+{\{{\stackrel{\u2012}{\mathcal{I}}}_{\text{free}}\left({\Omega}^{\ast}\right)+\Gamma \left({\Omega}^{\ast}\right)\}}^{-1}\hfill \\ \hfill & \times \left\{{\Psi}_{\text{free}}\left({\Omega}^{\ast}\right)-\Gamma \left({\Omega}^{\ast}\right)({\Omega}^{\ast}-{\widehat{\Omega}}_{\text{model}})\right\}.\hfill \end{array}$$

In the case where the *L*_{1} penalty is used, when $\widehat{\beta}\ast j$ becomes close to ${\widehat{\beta}}_{\text{model},j}$ (e.g., the absolute difference < 10^{−5}), we set ${\widehat{\beta}}_{j}={\widehat{\beta}}_{\text{model},j}$ and set the corresponding diagonal element of **Γ**(**Ω**) to a large value (e.g., 10^{5}). We have found that this algorithm is very stable and fast. A sandwich-type variance estimator is proposed in Appendix F for the resulting penalized estimator.

We first consider the case that *β*_{model} ≠ *β*_{free}. In this case, if *λ** _{j}* → 0 as

Next, as in Section 3.1, we consider the case where HWE and *G*-*E* independence hold, so that *β*_{model} = *β*_{free} and *θ*_{model} = *θ*_{free}, and hence **Ω**_{model} = **Ω**_{free}. In this case, we will show that the penalized estimator, though it has the same limit value as the model-free estimator, is a different estimator from the model-free estimator. In fact, from the asymptotic distribution for the penalized estimator derived later, it is seen that the penalized estimator is a model-average estimator, and hence is more efficient than the model-free estimator.

Consider the *L*_{2}-penalized estimator first. With a slightly expanded notation we have

$$\begin{array}{cc}\hfill & {N}^{1\u22152}\left({\widehat{\beta}}_{\text{free}}-{\beta}_{\text{free}},{\widehat{\theta}}_{\text{free}}-{\theta}_{\text{free}},{\widehat{\beta}}_{\text{model}}-{\beta}_{\text{model}},{\widehat{\theta}}_{\text{model}}-{\Omega}_{\text{model}}\right)\hfill \\ \hfill & \phantom{\rule{thickmathspace}{0ex}}=\left[\begin{array}{c}\hfill {\mathcal{I}}_{\text{free}}^{-1}\left({\Omega}_{\text{free}}\right){N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{free},i}\left({\Omega}_{\text{free}}\right)\hfill \\ \hfill {\mathcal{I}}_{\text{model}}^{-1}\left({\Omega}_{\text{model}}\right){N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{model},i}\left({\Omega}_{\text{model}}\right)\hfill \end{array}\right]+{o}_{p}\left(1\right)\hfill \\ \hfill & \phantom{\rule{thickmathspace}{0ex}}\Rightarrow {({\mathcal{Z}}_{\text{free}}^{\u22ba},{\mathcal{Q}}_{\text{free}}^{\u22ba},{\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Q}}_{\text{model}}^{\u22ba})}^{\u22ba}\sim \text{Normal}(0,{\Sigma}_{\text{all}}),\hfill \end{array}$$

say, where **Ψ**_{free},*i*(**Ω**_{free}) = **Ψ**_{free}(*D _{i}*,

Define $\mathcal{U}\left\{{N}^{1\u22152}({\widehat{B}}_{\text{free}}-{\widehat{\beta}}_{\text{model}})\right\}=\mathrm{diag}[({v}_{j}\u22152){\left\{{N}^{1\u22152}({\widehat{\beta}}_{\text{free},j}-{\widehat{\beta}}_{\text{model},j})\right\}}^{-2},0,\dots ,0]$. Then the *L*_{2}-penalized estimator as described in Section 4.1, ${\widehat{\Omega}}_{\mathrm{PL}2}$, solves

$$\begin{array}{cc}\hfill 0=& {N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{free},i}\left({\widehat{\Omega}}_{\mathrm{PL}2}\right)-\mathcal{U}\left\{{N}^{1\u22152}({\widehat{\beta}}_{\text{free}}-{\widehat{\Omega}}_{\text{model}})\right\}{N}^{1\u22152}\hfill \\ \hfill & \times ({\widehat{\Omega}}_{\mathrm{PL}2}-{\widehat{\Omega}}_{\text{model}}).\hfill \end{array}$$

By Taylor series, we see that

$$\begin{array}{cc}\hfill 0=& {N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{free},i}\left({\Omega}_{\text{free}}\right)\hfill \\ \hfill & -\mathcal{U}\left\{{N}^{1\u22152}({\widehat{\beta}}_{\text{free}}-{\widehat{\beta}}_{\text{model}})\right\}{N}^{1\u22152}({\Omega}_{\text{free}}-{\widehat{\Omega}}_{\text{model}})\hfill \\ \hfill -& \left[{\mathcal{I}}_{\text{free}}+\mathcal{U}\left\{{N}^{1\u22152}({\widehat{\beta}}_{\text{free}}-{\widehat{\beta}}_{\text{model}})\right\}\right]{N}^{1\u22152}({\widehat{\Omega}}_{\mathrm{PL}2}-{\Omega}_{\text{free}})+{o}_{p}\left(1\right).\hfill \end{array}$$

Dropping the dependence of **I**_{free} on **Ω**_{free}, remembering that the leading term in the previous expression has the limit distribution ${\mathcal{I}}_{\text{free}}{({\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Q}}_{\text{free}}^{\u22ba})}^{\u22ba}$ and then solving, we see that

$$\begin{array}{cc}\hfill {N}^{1\u22152}({\widehat{\Omega}}_{\mathrm{PL}2}-{\Omega}_{\text{free}})& \Rightarrow {\{{\mathcal{I}}_{\text{free}}+\mathcal{U}({\mathcal{Z}}_{\text{model}}-{\mathcal{Z}}_{\text{free}})\}}^{-1}\hfill \\ \hfill & \times [{\mathcal{I}}_{\text{free}}{({\mathcal{Z}}_{\text{free}}^{\u22ba},{\mathcal{Q}}_{\text{free}}^{\u22ba})}^{\u22ba}\hfill \\ \hfill & +\mathcal{U}({\mathcal{Z}}_{\text{model}}-{\mathcal{Z}}_{\text{free}}){({\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Q}}_{\text{model}}^{\u22ba})}^{\u22ba}]\hfill \\ \hfill & +{o}_{p}\left(1\right).\hfill \end{array}$$

Note that this limit distribution, just as that for the EB estimators given in (8), is a type of distribution for a model-average estimator. It is then seen that, when HWE and *G*-*E* independence hold, the limit distribution for ${\widehat{\beta}}_{\mathrm{PL}2}$ is in principle not a normal distribution. Our simulations, however, show that the departure from normality in this case is b not large.

Now consider the asymptotic distribution for the *L*_{1}-penalized estimator, ${\widehat{\Omega}}_{\mathrm{PL}1}$, when HWE and *G*-*E* independence hold, with the penalty parameter ** λ** being chosen as in Section 4.1. Let $\mathcal{R}({N}^{1\u22152}{\widehat{\beta}}_{\text{free}},{N}^{1\u22152}{\widehat{\beta}}_{\text{model}})=\mathrm{diag}[{({v}_{j}\u22152)}^{1\u22152}{\mid {N}^{1\u22152}({\widehat{\beta}}_{\text{free},j}-{\widehat{\beta}}_{\text{model},j}){\mathcal{B}}_{j}^{\ast}\mid}^{-1},0,\dots ,0]$, where ${\mathcal{B}}_{j}^{\ast}={N}^{1\u22152}({\widehat{\beta}}_{\mathrm{PL}1,j}-{\widehat{\beta}}_{\text{model},j})$. By Taylor expansion we have

$$\begin{array}{cc}\hfill 0=& {N}^{-1\u22152}\sum _{i=1}^{N}{\Psi}_{\text{free},i}\left({\Omega}_{\text{free}}\right)-\mathcal{R}({\mathcal{Z}}_{\text{model}},{\mathcal{Z}}_{\text{free}}){N}^{1\u22152}\hfill \\ \hfill & ({\Omega}_{\text{free}}-{\widehat{\Omega}}_{\text{model}})\hfill \\ \hfill -& \{{\mathcal{I}}_{\text{free}}+\mathcal{R}({\mathcal{Z}}_{\text{model}},{\mathbf{Z}}_{\text{free}})\}{N}^{1\u22152}({\widehat{\Omega}}_{\mathrm{PL}1}-{\Omega}_{\text{free}})+{o}_{p}\left(1\right),\hfill \end{array}$$

which implies that

$$\begin{array}{cc}\hfill {N}^{1\u22152}\{{\widehat{\Omega}}_{\mathrm{PL}1}-{\Omega}_{\text{free}}\}& \Rightarrow {({\mathcal{I}}_{\text{free}}+\mathcal{R}({\mathcal{Z}}_{\text{model}},{\mathcal{Z}}_{\text{free}}))}^{-1}\hfill \\ \hfill & [{\mathcal{I}}_{\text{free}}{({\mathcal{Z}}_{\text{free}}^{\u22ba},{\mathcal{Q}}_{\text{free}}^{\u22ba})}^{\u22ba}\phantom{]}\hfill \\ \hfill & +\phantom{[}\mathcal{R}({\mathcal{Z}}_{\text{model}},{\mathcal{Z}}_{\text{free}}){({\mathcal{Z}}_{\text{model}}^{\u22ba},{\mathcal{Q}}_{\text{model}}^{\u22ba})}^{\u22ba}]\hfill \\ \hfill & +{o}_{p}\left(1\right),\hfill \end{array}$$

which is again a form of the distribution for a model-average estimator, and hence is in general not normally distributed.

The major purpose of this section is to derive accurate and easily computed approximate variance estimates for the penalized estimators, using sandwich ideas. Here we assume that *β*_{model} ≠ *β*_{free}, and the penalty parameters *λ*_{j}s are chosen so that *λ*_{j} → 0 as *N* → ∞. As discussed in Appendix E, the penalized estimator in this case is asymptotically normal with mean **Ω**_{PL} = **Ω**_{free}.

Let

$${\stackrel{\u2012}{\Psi}}_{\text{free}}\left(\Omega \right)=\sum _{i=1}^{N}{\Psi}_{\text{free}}({D}_{i},{\mathbf{G}}_{i},{\mathbf{X}}_{i},\Omega )\equiv \sum _{i=1}^{N}{\Psi}_{\text{free},i}\left(\Omega \right),$$

and ${\stackrel{\u2012}{\mathcal{I}}}_{\text{free}}\left(\Omega \right)=-\partial {\stackrel{\u2012}{\Psi}}_{\text{free}}\left(\Omega \right)\u2215\partial {\Omega}^{\u22ba}$, where **Ψ**_{model,i}(**Ω**_{model}) and **Ψ**_{free,i}(**Ω**_{free}) are defined as in the Appendix E. In Appendix D we approximated the score function by

$${\stackrel{\u2012}{\Psi}}_{\mathrm{PL}}\equiv {\Psi}_{\text{free}}\left(\Omega \right)-\Gamma \left(\Omega \right)(\Omega -{\widehat{\Omega}}_{\text{model}}),$$

(A.4)

where the matrix **Γ**(**Ω**) is diagonal with diagonal elements ${\lambda}_{j}\stackrel{.}{P}\left(\mid {b}_{j}\mid \right)\u2215\mid {b}_{j}\mid ,{b}_{j}={\beta}_{j}-{\widehat{\beta}}_{\text{model},\mathrm{j}}$, for those corresponding to *β _{j}*, and 0 for those corresponding to

Recall that in our proposal the penalty parameters ** λ** may be functions of $\widehat{\psi}={\widehat{\beta}}_{\text{model}}-{\widehat{\beta}}_{\text{free}}$. Let

$$\begin{array}{cc}\hfill {\mathcal{Z}}_{\text{model},i}\left({\Omega}_{\text{model}}\right)& ={\mathcal{I}}_{\text{model}}^{-1}{\Psi}_{\text{model},i}\left({\Omega}_{\text{model}}\right),\hfill \\ \hfill {\mathcal{Z}}_{\text{free},i}\left({\Omega}_{\text{model}}\right)& ={\mathcal{I}}_{\text{free}}^{-1}{\Psi}_{\text{free},i}\left({\Omega}_{\text{free}}\right).\hfill \end{array}$$

Accordingly, we can further write **Ψ**_{PL} = Σ_{i }**Ψ**_{PL,i}(**Ω**) +*O _{p}*(

$$\begin{array}{cc}\hfill {\Psi}_{\mathrm{PL},i}\left(\Omega \right)=& {\Psi}_{\text{free},i}\left(\Omega \right)+{N}^{-1}\Gamma \left(\Omega \right)\{{\mathcal{Z}}_{\text{model},i}-(\Omega -{\Omega}_{\text{model}})\}\hfill \\ \hfill & -{N}^{-1}\mathcal{P}(\Omega ,{\Omega}_{\text{model}},{\Omega}_{\text{free}})\{{\mathcal{Z}}_{\text{model},i}\left({\Omega}_{\text{model}}\right)\hfill \\ \hfill & -{\mathcal{Z}}_{\text{free},i}\left({\Omega}_{\text{free}}\right)\},\hfill \end{array}$$

where the matrix **P**(**Ω, Ω**_{model}, **Ω**_{free}) is diagonal with diagonal elements $(\mathrm{d}{\lambda}_{j}\u2215\mathrm{d}{\psi}_{j})\stackrel{.}{P}\left(\mid {b}_{j}\mid \right){b}_{j}\u2215\mid {b}_{j}\mid ,{b}_{j}={\beta}_{j}-{\beta}_{\text{model},\phantom{\rule{thickmathspace}{0ex}}\mathrm{j}}$, for elements corresponding to *β _{j}*, and 0 for elements corresponding to

Denote by ${\widehat{\Omega}}_{\mathrm{PL}},{\widehat{\Omega}}_{\text{model}}$, and ${\widehat{\Omega}}_{\text{free}}$ the penalized, model-based and model-free estimators, respectively, and ${\widehat{\Lambda}}_{\mathrm{PL}}={\sum}_{d}{p}_{d}\widehat{E}\{{\widehat{\Psi}}_{\mathrm{PL}}\mid D=d\}\widehat{E}\{{\widehat{\Psi}}_{\mathrm{PL}}^{\u22ba}\mid D=d\}$, where $\widehat{E}({\widehat{\Psi}}_{\mathrm{PL}}\mid D=d)\equiv {N}_{d}^{-1}{\sum}_{i=1}^{N}I({D}_{i}=d){\Psi}_{\mathrm{PL},i}\left({\widehat{\Omega}}_{\mathrm{PL}}\right)$. Then a sandwich-type variance estimator for the penalized likelihood estimator can be obtained as

$$\begin{array}{cc}\hfill & {\left\{{\stackrel{\u2012}{\mathcal{I}}}_{\text{free}}\left({\widehat{\Omega}}_{\mathrm{PL}}\right)+\Gamma \left({\widehat{\Omega}}_{\mathrm{PL}}\right)\right\}}^{-1}\left(\sum _{i=1}^{N}{\Psi}_{\mathrm{PL},i}{\left({\widehat{\Omega}}_{\mathrm{PL}}\right)}^{\otimes 2}-{\widehat{\Lambda}}_{\mathrm{PL}}\right)\hfill \\ \hfill & \phantom{\rule{1em}{0ex}}\times {\left\{{\stackrel{\u2012}{\mathcal{I}}}_{\text{free}}\left({\widehat{\Omega}}_{\mathrm{PL}}\right)+\Gamma \left({\widehat{\Omega}}_{\mathrm{PL}}\right)\right\}}^{-1},\hfill \end{array}$$

where ${\mathbf{A}}^{\otimes \phantom{\rule{thickmathspace}{0ex}}2}$ for a vector **A**.

- Chang-Claude J, Kropp S, Jäger B, Bartsch H, Risch A. Differential Effect of NAT2 on the Association between Active and Passive Smoke Exposure and Breast Cancer Risk. Cancer Epidemiology, Biomarkers & Prevention. 2002;11:698–704. [PubMed]
- Chatterjee N, Carroll RJ. Semiparametric Maximum Likelihood Estimation in Case-Control Studies of Gene-Environment Interactions. Biometrika. 2005;92:399–418.
- Chatterjee N, Chen Y-H. Maximum Likelihood Inference on a Mixed Conditionally and Marginally Specified Regression Model for Genetic Epidemiologic Studies with Two-Phase Sampling. (Series B).Journal of the Royal Statistical Society. 2007;69:123–142.
- Claeskens G, Carroll RJ. Post-Model Selection Inference in Semiparametric Models. Biometrika. 2007;94:249–265.
- Clark AG. The Role of Haplotypes in Candidate Gene Studies. Genetic Epidemiology. 2004;27:321–333. [PubMed]
- Epstein MP, Satten GA. Inference on Haplotype Effects in Case-Control Studies Using Unphased Genotype Data. American Journal of Human Genetics. 2003;73:1316–1329. [PubMed]
- Fan J, Li R. Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Gohagan JK, Prorok PC, Hayes RB, Kramer BS. The Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial of the National Cancer Institute: History, Organization, and Status. Controlled Clinical Trials. 2000;21(Suppl):251S–272S. [PubMed]
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag; New York: 2001.
- Hjort NL, Claeskens G. Frequentist Model Average Estimators. Journal of the American Statistical Association. 2003;98:879–899. (with discussion)
- Lake SL, Lyon H, Tantisira K, Silverman EK, Weiss ST, Laird NM, Schaid DJ. Estimation and Tests of Haplotype-Environment Interaction When Linkage Phase Is Ambiguous. Human Heredity. 2003;55:56–65. [PubMed]
- Lehmann EL. Theory of Point Estimation. John Wiley and Sons; New York: 1983.
- Li SS, Khalid N, Carlson C, Zhao LP. Estimating Haplotype Frequencies and Standard Errors for Multiple Single Nucleotide Polymorphisms. Biostatistics (Oxford, England) 2003;4:513–522. [PubMed]
- Lin DY, Zeng D. Likelihood-Based Inference on Haplotype Effects in Genetic Association Studies. Journal of the American Statistical Association. 2006;101:89–118. (with discussion)
- Moslehi R, Chatterjee N, Church TR, Chen J, Yeager M, Weissfield J, Hein DW, Hayes RB. Cigarette Smoking, N-acetyltransferase Genes and the Risk of Advanced Colorectal Adenoma. Pharmacogenomics. 2006;7:819–829. [PubMed]
- Mukherjee B, Ahn J, Gruber SB, Rennert G, Moreno V, Chatterjee N. Tests for Gene-Environment Interaction from Case-Control Data: A Novel Study of Type I Error, Power and Designs. Genetic Epidemiology. 2008 Epub PMID: 18473390. [PubMed]
- Mukherjee B, Chatterjee N. Exploiting Gene-Environment Independence for Analysis of Case-Control Studies: A Shrinkage Approach to Trade Off Between Bias and Efficiency. Biometrics. 2008;64:685–694. [PubMed]
- Prentice RL, Pyke R. Logistic Disease Incidence Models and Case-Control Studies. Biometrika. 1979;66:403–411.
- Satten GA, Epstein MP. Comparison of Prospective and Retrospective Methods for Haplotype Inference in Case-Control Studies. Genetic Epidemiology. 2004;27:192–201. [PubMed]
- Schaid DJ. Evaluating Associations of Haplotypes with Traits. Genetic Epidemiology. 2004;27:348–364. [PubMed]
- Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. Score Tests for Association Between Traits and Haplotypes When Linkage Phase Is Ambiguous. American Journal of Human Genetics. 2002;70:425–434. [PubMed]
- Spinka C, Carroll RJ, Chatterjee N. Analysis of Case-Control Studies of Genetic and Environmental Factors With Missing Genetic Information and Haplotype-Phase Ambiguity. Genetic Epidemiology. 2005;29:108–127. [PMC free article] [PubMed]
- Stram DO, Pearce CL, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC. Modeling and E-M Estimation of Haplotype-Specific Relative Risks from Genotype Data for a Case-Control Study of Unrelated Individuals. Human Heredity. 2003;55:179–190. [PubMed]
- Tibshirani RJ. Regression Shrinkage and Selection via the LASSO. (Series B).Journal of the Royal Statistical Society. 1996;58:267–288.
- Wallenstein S, Hodge SE, Weston A. Logistic Regression Model for Analyzing Extended Haplotype Data. Genetic Epidemiology. 1998;15:173–181. [PubMed]
- Zhao LP, Li SS, Khalid NA. Method for the Assessment of Disease Associations with Single Nucleotide Polymorphism Haplotypes and Environmental Variables in Case-Control Studies. American Journal of Human Genetics. 2003;72:1231–1250. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |