Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2892338

Formats

Article sections

- SUMMARY
- 1. Introduction
- 2. Sampling scheme and likelihood
- 3. A profile likelihood based Newton–Raphson algorithm to compute SPMLE
- 4. Estimated likelihood
- 5. Simulation
- 6. Data application
- 7. Discussion
- Supplementary Material
- REFERENCES

Authors

Related links

Biometrics. Author manuscript; available in PMC 2010 June 25.

Published in final edited form as:

Published online 2008 May 13. doi: 10.1111/j.1541-0420.2008.01046.x

PMCID: PMC2892338

NIHMSID: NIHMS107483

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

Recent results for case-control sampling suggest when the covariate distribution is constrained by gene-environment independence, semiparametric estimation exploiting such independence yields a great deal of efficiency gain. We consider the efficient estimation of the treatment-biomarker interaction in two-phase sampling nested within randomized clinical trials, incorporating the independence between a randomized treatment and the baseline markers. We develop a Newton–Raphson algorithm based on the profile likelihood to compute the semiparametric maximum likelihood estimate (SPMLE). Our algorithm accommodates both continuous phase-one outcomes and continuous phase-two biomarkers. The profile information matrix is computed explicitly via numerical differentiation. In certain situations where computing the SPMLE is slow, we propose a maximum estimated likelihood estimator (MELE), which is also capable of incorporating the covariate independence. This estimated likelihood approach uses a one-step empirical covariate distribution, thus is straightforward to maximize. It offers a closed-form variance estimate with limited increase in variance relative to the fully efficient SPMLE. Our results suggest exploiting the covariate independence in two-phase sampling increases the efficiency substantially, particularly for estimating treatment-biomarker interactions.

In clinical and epidemiological studies, the effect of an intervention is often influenced by variables reflecting individual susceptibility. In pharmacogenetic studies, there is a growing body of evidence supporting interactions between genetic polymorphisms and antihypertensive treatments (Arnett et al., 2005). Identifying the effect-modifying genotypes, or other types of biomarkers helps to disclose the etiology of diseases, and to understand the mechanism of the intervention effect. Since bioassays are often expensive and there may be many candidate markers, it is common to measure biomarkers only in a case-control sample from the study cohort (Breslow and Day, 1980), or in a stratified case-control sample if additional covariates are involved in forming the strata (White, 1982; Scott and Wild, 1991). This often constitutes a two-phase outcome-dependent sampling design, in that the first-phase data contain the response ascertained for every subject, and perhaps a collection of “cheap” covariates (for example, treatment assignment, demographic factors); the second-phase data contain the biomarker data for a case-control subsample. If the phase-one cohort is a randomized clinical trial, the treatment assignment is independent of the phase-two biomarkers measured from baseline-stored blood. This paper pertains to the potential efficiency gain when exploiting this independence in two-phase randomized trials.

Statistical methods for two-phase sampling have been studied by many authors. When the covariate partially observed in the second phase is discrete, Ibrahim (1990) uses a weighted EM algorithm to estimate the parameters in a generalized linear model. Difficulty arises when the missing covariates are continuous: numerical integration or Monte Carlo methods are needed when a parametric covariate distribution is assumed (Ibrahim, Chen and Lipsitz, 1999). To avoid model misspecification and ease the computation, a number of pseudo-likelihood methods have been proposed, including the inverse probability weighted estimator (Flanders and Greenland, 1991; Lipsitz, Ibrahim and Zhao, 1999), conditional likelihood (Breslow and Cain, 1988), estimated likelihood (Pepe and Fleming, 1991; Carroll and Wand, 1991), mean score estimators (Reilly and Pepe, 1995), and pseudoscore estimators (Chatterjee, Chen and Breslow, 2003). These approaches yield a consistent estimator of the regression parameters, yet they are not efficient in general. Robins, Rotnitzky and Zhao (1994) introduced a class of semiparametric estimators based on inverse probability weighted estimating equations, and obtained an efficient estimator that attains the semiparametric variance bound. However, the implementation is difficult. When the first-phase data can be reduced to discrete stratum labels, a profile likelihood approach has been proposed to obtain the semiparametric maximum likelihood estimates, with the covariate distributions left completely nonparametric (Scott and Wild, 1997; Lawless, Kalbfleisch and Wild, 1999).

Recent work in case-control studies suggests that exploiting the gene-environment independence improves the estimation efficiency of regression parameters substantially (Chatterjee and Carroll, 2005; Chatterjee and Chen, 2007). Moreover, for rare diseases, the gene-environment interaction in a logistic regression can be estimated by the odds ratio between the gene and the environmental variable in cases only (Piegorsch, Weinberg and Taylor, 1994; Umbach and Weinberg, 1997). Despite the efficiency advantage, these analyses are generally sensitive to departures from the gene-environment independence assumption (Albert et al., 2001). In two-phase sampling nested within randomized trials, however, there exists indisputable independence between the treatment and baseline covariates by design, including markers ascertained in the phase-two sample. Ignoring such design-based independence is a waste of information. We present two examples that motivate our research:

**Example 1:** The randomized clinical trial of estrogen plus progestin in the Women’s Health Initiative (WHI) was terminated early in 2002 because of an increased risk of stroke, breast cancer, and cardiovascular diseases in the treatment arm (Rossouw et al., 2002). To determine whether the adverse effect of conjugated equine estrogen and medroxyprogesterone acetate on stroke was modified by selected baseline blood biomarkers and genotypes, WHI analyzed baseline blood samples from cases and controls. Twenty-nine biomarkers were measured, encompassing inflammatory markers, lipid levels, thrombosis factors, blood cell counts and several Single Nucleotide Polymorphisms (SNPs). All markers except the SNPs yield continuous measurements. More recently in a similar attempt, the WHI is undertaking a genome-wide association study in which hundreds of thousands of SNPs are sequenced in a sample.

**Example 2:** The Genetics of Hypertension-Associated Treatment (GenHAT) Study is a large-scale, double-blind, randomized trial attempting to test the interaction between the insertion/deletion polymorphism in antiotension-converting enzyme and four antihypertensive treatments (Arnett et al., 2005). No significant association was found. In a hypothetical secondary study, a number of SNPs on several genes in renin-angiotensin-aldosterone system are further investigated. The outcome is the blood pressure (BP) change after 6 months treatment. While the outcome and the treatment are collected for every participant, genetic variants are only measured in 3 outcome-dependent subsamples: one from the stratum with BP change lower than 10% percentile, one from the stratum with BP change higher than 90% percentile, and one from the stratum with the rest of participants.

Both studies employ two-phase sampling to identify the effect-modifying biomarkers. The independence between the treatment and the biomarkers is dictated by randomization. The outcome of interest can be categorical (stroke in **Example 1**), or continuous (BP change in **Example 2**). In fact many clinical outcomes are continuous, such as cholesterol levels, HIV viral load, and child’s IQ. It is important to account for continuous outcomes in analysis. In addition, biomarkers can be categorical (genotype) or continuous (blood assay); and as many as hundreds of thousands may be measured (**Example 1**). To our knowledge, the efficient estimation exploiting covariate independence in two-phase randomized trials has not been addressed. Chatterjee and Chen (2007) extended the profile technique to allow gene-environment independence in studies with two-phase sampling. However, their method only considers binary outcomes. For continuous outcomes, the profile approach often entails a substantial loss of efficiency, since it has to reduce the outcome to discrete stratum labels (Chatterjee, Chen and Breslow, 2003). Moreover, computing the variance matrix when exploiting covariate independence using the profile technique can be algebraically cumbersome.

In this article, we propose two semiparametric methods that exploit covariate independence in two-phase sampling. They are semiparametric since the distribution of the missing covariates is treated nonparametrically, while the association between the outcome and covariates remains parametric. Both methods use full information from continuous outcomes and provide a straightforward computation of variance estimates. We first develop a profile likelihood based Newton–Raphson algorithm that can be used to compute the semiparametric maximum likelihood estimate (SPMLE). The independence between covariates is incorporated transparently. The novelty is that, instead of replacing the high dimensional nuisance parameters by a few low dimensional parameters as in Scott and Wild (1997), we profile out the nuisance parameters completely, and explicitly compute the information matrix through numerical differentiation, thus generating the variance as a by-product. When very many biomarkers are investigated, computing the SPMLE for every marker may be time-consuming and unnecessary. This is particularly the case if the disease is relatively rare and there are many covariates to be adjusted. We develop a maximal estimated likelihood estimator (MELE) that is much faster to compute and accounts for the covariate independence, yet it does not lose much efficiency. In essence it plugs a consistent empirical estimator of the covariate distribution in the likelihood. The amount of variance reduction by imposing the independence can be derived explicitly.

In Section 2, we define the sampling scheme and likelihood. In Section 3, we describe the profile Newton–Raphson algorithm that computes the SPMLE and the estimated variances. In Section 4, we derive the estimated likelihood and asymptotic distribution of the MELE. In Section 5, we assess the finite sample properties of the proposed estimators in simulations. In Section 6, we show an application to a biomarker dataset from WHI. We end with a discussion. Technical details are available as on-line supplementary materials.

Let *Y* be an outcome of interest, *X* be the treatment that is randomized, and *Z* be a collection of covariates which includes the expensive biomarker, and potentially other important predictors. Throughout this work, *Y* and *Z* can be continuous or categorical, but *X* is assumed to be categorical. Suppose without missing data, *N* subjects with i.i.d. random variables (*Y _{i}*,

Let *V* = {*i*: *R _{i}* = 1} and = {

$$L(\beta ,G)={\displaystyle \prod _{i\in V}{f}_{\beta}({y}_{i}|{x}_{i},{z}_{i}){g}_{{z}_{i}|{x}_{i}}{\displaystyle \prod _{j\in \overline{V}}\left({\displaystyle \sum _{{z}_{l}\in {\u01b5}_{{x}_{j}}}{g}_{{z}_{k}|{x}_{j}}{f}_{\beta}({y}_{j}|{x}_{j},{z}_{l})}\right),}}$$

(1)

$${L}^{\perp}(\beta ,G)={\displaystyle \prod _{i\in V}{f}_{\beta}({y}_{i}|{x}_{i},{z}_{i}){g}_{{z}_{i}}{\displaystyle \prod _{j\in \overline{V}}\left({\displaystyle \sum _{{z}_{l}\in \u01b5}{g}_{{z}_{l}}{f}_{\beta}({y}_{j}|{x}_{j},{z}_{l})}\right),}}$$

(2)

respectively. Throughout this article, we use the superscript ^{} to indicate expressions that employ the independence between *X* and *Z*. Intuitively, imposing independence shrinks the dimension of the nuisance parameter *G*, since we only need to estimate the marginal distribution of *Z*, thereby improving the estimation of β. Note that the dimension of *g*_{z|x} and *g _{z}* increases with the number of the phase-two subjects for continuous

To maximize likelihoods with a Euclidian parameter θ, a Newton–Raphson algorithm iteratively updates by θ^{(m+1)} = θ^{(m)} + *I*(θ^{(m)})*S*(θ^{(m)}) until convergence, where *S*(θ^{(m)}) is the score function and *I*(θ^{(m)}) is the observed information evaluated at the current θ^{(m)}. Near the solution the convergence of Newton–Raphson algorithm is always fast (exponential), and it automatically leads to variance estimates. When maximizing the semiparametric likelihoods (1) and (2), the interest is in inference on β, not in the infinite dimensional nuisance parameter *g*. A profile likelihood can be derived in which *g* is maximized first for a fixed β, then maximized with respect to β using a Newton–Raphson algorithm.

Specifically, let _{p}(β, *ĝ*_{β}) denote the profile log likelihood for β, where *ĝ*_{β} is the maximizer of *g* given β. Let *S _{p}*(β,

This algorithm relies on a fast computation of *ĝ*_{β(m)} for any fixed β^{(m)}. By introducing a Lagrange multiplier that respects the fact that *g* sum to 1, we can show that *ĝ*_{β}(*z*|*x*) satisfies

$$\sum _{i\in V}{1}_{[{x}_{i}=x,{z}_{i}=z]}+{\displaystyle \sum _{j\in \overline{V}}\frac{{\displaystyle {\sum}_{{z}_{k}\in {\u01b5}_{x}}{f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})g({z}_{k}|{x}_{j}){1}_{[{x}_{j}=x,{z}_{k}=z]}}}{{\displaystyle {\sum}_{{z}_{k}\in {\u01b5}_{x}}g({z}_{k}|x){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k}){1}_{[{x}_{j}=x]}}}={N}_{x}g(z|x),}$$

(3)

where ${N}_{x}={\displaystyle {\sum}_{l=1}^{N}{1}_{[{x}_{l}=x]}}$, the total number of subjects with the covariate value *x*. Note that *X* is discrete and observed for everyone. The second term of (3) on the left hand side is essentially *g*(*z*|*y _{j}*,

After *ĝ*_{β}(*z*|*x*) is computed, *S _{p}*(β,

$${S}_{p}(\beta ,{\hat{g}}_{\beta})=\left\{{\displaystyle \sum _{i=1}^{N}\frac{\partial}{\partial {\hat{g}}_{\beta}}\ell (\beta ,{\hat{g}}_{\beta}|{x}_{i},{y}_{i},{r}_{i}{z}_{i})}\right\}\frac{\partial {\hat{g}}_{\beta}}{\partial \beta}+{\displaystyle \sum _{i=1}^{N}\frac{\partial}{\partial \beta}\ell (\beta ,{\hat{g}}_{\beta}|{x}_{i},{y}_{i},{r}_{i}{z}_{i},{\hat{g}}_{\beta}).}$$

(4)

Since *ĝ*_{β} is the maximizer for every fixed β, ${\sum}_{i=1}^{N}\frac{\partial}{\partial {\hat{g}}_{\beta}}\ell (\beta ,{\hat{g}}_{\beta}|{x}_{i},{y}_{i},{r}_{i}{z}_{i})=0$. Hence the first term of (4) equals to 0. Therefore

$${S}_{p}(\beta ,{\hat{g}}_{\beta})={\displaystyle \sum _{i\in V}S({y}_{i}|{x}_{i},{z}_{i})+{\displaystyle \sum _{j\in \overline{V}}{\displaystyle \sum _{{z}_{k}\in {\u01b5}_{{x}_{j}}}\frac{{\hat{g}}_{\beta}({z}_{k}|{x}_{j}){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})}{{\displaystyle {\sum}_{{z}_{k}\in {\u01b5}_{{x}_{j}}}{\hat{g}}_{\beta}({z}_{k}|{x}_{j}){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})}}S({y}_{j}|{x}_{j},{z}_{k}).}}}$$

(5)

Once the profile score is computed, we use numerical differentiation to approximate the profile information matrix:

$${I}_{p}(\beta ,{\hat{g}}_{{\beta}^{(m)}})=\frac{\partial {S}_{p}({\beta}^{(m)},{\hat{g}}_{{\beta}^{(m)}})}{\partial {\beta}^{(m)}}=\frac{{S}_{p}({\beta}^{(m)}+\epsilon ,{\hat{g}}_{{\beta}^{(m)}+\epsilon})-{S}_{p}({\beta}^{(m)}-\epsilon ,{\hat{g}}_{{\beta}^{(m)}-\epsilon})}{2\epsilon}.$$

This involves perturbing each element of β^{(m)} in both directions, computing two new *ĝ*_{β}, and evaluating their profile scores. A highly accurate *I _{p}* can be achieved with ε in the order of 1/

When *X* *Z*, we only need to maximize *g* on the pooled sample space *Ƶ*, rather than on the restricted sample space *Ƶ _{x}*. We can now compute ${\hat{g}}_{\beta}^{\perp}(z|x)$ using

$${\hat{g}}_{\beta}^{\perp}(z|x)={\hat{g}}_{\beta}(z)\approx \frac{{{\displaystyle {\sum}_{i\in V}1}}_{[{z}_{i}=z]}}{N-\left({\displaystyle {\sum}_{j\in \overline{V}}{\displaystyle {\sum}_{{z}_{k}\in \u01b5}\frac{{f}_{\beta}({y}_{j}|{x}_{j},{z}_{k}){1}_{[{z}_{k}=z]}}{{\displaystyle {\sum}_{{z}_{k}\in \u01b5}{\hat{g}}_{\beta}({z}_{k}){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})}}}}\right)},$$

and (5) becomes

$${S}_{p}^{\perp}(\beta ,{\hat{g}}_{\beta})={\displaystyle \sum _{i\in V}{S}_{\beta}({y}_{i}|{x}_{i},{z}_{i})+{\displaystyle \sum _{j\in \overline{V}}{\displaystyle \sum _{{z}_{k}\in \u01b5}\frac{{\hat{g}}_{\beta}({z}_{k}){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})}{{\displaystyle {\sum}_{{z}_{k}\in \u01b5}{\hat{g}}_{\beta}({z}_{k}){f}_{\beta}({y}_{j}|{x}_{j},{z}_{k})}}{S}_{\beta}({y}_{j}|{x}_{j},{z}_{k}).}}}$$

(6)

A slight modification of the Newton–Raphson algorithm suffices.

Starting from a naive estimator of β, for example the inverse probability weighted estimator (Flanders and Greenland, 1991; Lipsitz, Ibrahim and Zhao, 1999), the profile Newton–Raphson algorithm usually takes 3–4 iteration to achieve 1e-5 accuracy. At convergence, we obtain the variance estimates of by inverting the information matrix of the profile likelihood *I _{p}*(,

The main computational burden of obtaining the SPMLE lies in updating *ĝ*_{} and the numerical differentiation to get *I _{p}*(,

When the phase-two sampling is outcome-dependent, a consistent estimator of *G*(*z*|*x*) can be formulated as a weighted average of the empirical distribution of *Z*|*x* in each stratum *S _{k}* (Hu and Lawless, 1996; Lawless, Kalbfleisch and Wild, 1999). Observe that

$$G(z|x)={\displaystyle \sum _{k=1}^{K}\text{Pr}(Z<z|{S}_{k},x)\text{Pr}({S}_{k}|x).}$$

Because the probability of observing *Z* is assumed constant in each stratum, we observe that Pr(*Z* < *z*|*S _{k}*,

Therefore, we obtain an empirical estimate of *G*(*z*|*x*) using *Z* in the validation data,

$$\hat{G}(z|x)={\displaystyle \sum _{k=1}^{K}\frac{{N}_{kx}}{{N}_{x}}{\displaystyle \sum _{i\in {S}_{k}}\frac{{1}_{[{z}_{i}<z,{x}_{i}=x,{r}_{i}=1]}}{{\displaystyle {\sum}_{i\in {S}_{k}}{1}_{[{x}_{i}=x,{r}_{i}=1]}}},}}$$

where ${N}_{kx}={\displaystyle {\sum}_{i=1}^{N}{1}_{[i\in {S}_{k},{x}_{i}=x]},{N}_{x}=}{\displaystyle {\sum}_{i=1}^{N}{1}_{[{x}_{i}=x]}}$. The estimated likelihood for incomplete data becomes

$${\hat{f}}_{\beta}(y|x)={\displaystyle \sum _{k=1}^{K}\frac{{N}_{kx}}{{N}_{x}}{\displaystyle \sum _{i\in {S}_{k}}\frac{{f}_{\beta}(y|x,{z}_{i}){1}_{[{x}_{i}=x,{r}_{i}=1]}}{{\displaystyle {\sum}_{i\in {S}_{k}}{1}_{[{x}_{i}=x,{r}_{i}=1]}}}.}}$$

Let * _{N}*(β) denote the estimated likelihood. We want to maximize

$${\hat{L}}_{N}(\beta )={\displaystyle \prod _{i\in V}{f}_{\beta}({y}_{i}|{x}_{i},{z}_{i}){\displaystyle \prod _{j\in \overline{V}}{\hat{f}}_{\beta}({y}_{j}|{x}_{j}).}}$$

(7)

Using *X* *Z* we can improve the estimation of the likelihood. Following the same derivation for *Ĝ*(*z*|*x*) with *X* *Z*, the estimated empirical distribution is simplified to

$${\hat{G}}^{\perp}(z|x)=\hat{G}(z)={\displaystyle \sum _{k=1}^{K}\frac{{N}_{k}}{N}{\displaystyle \sum _{i\in {S}_{k}}\frac{{1}_{[{z}_{i}<z,{r}_{i}=1]}}{{\displaystyle {\sum}_{i\in {S}_{k}}{1}_{[{r}_{i}=1]}}},}}$$

where ${N}_{k}={\displaystyle {\sum}_{i=1}^{N}{1}_{[i\in {S}_{k}]}}$. Essentially we are able to use all observed *Z* to estimate the empirical distribution, not constrained by a particular *x*. Hence the estimated likelihood becomes

$${\hat{f}}_{\beta}^{\perp}(y|x)={\displaystyle \sum _{k=1}^{K}\frac{{N}_{k}}{N}{\displaystyle \sum _{i\in {S}_{k}}\frac{{f}_{\beta}(y|x,{z}_{i}){1}_{[{r}_{i}=1]}}{{\displaystyle {\sum}_{i\in {S}_{k}}{1}_{[{r}_{i}=1]}}}}.}$$

It is immediate that applying the independence assumption reduces the variability of the estimated likelihood, $\text{Var}[{\hat{f}}_{\beta}^{\perp}({y}_{j}|{x}_{j})]<\text{Var}[{\hat{f}}_{\beta}({y}_{i}|{x}_{j})]$, because the former involves an average of more terms. The estimated likelihood using the independence assumption is

$${\hat{L}}^{\perp}(\beta )={\displaystyle \prod _{i\in V}{f}_{\beta}({y}_{i}|{x}_{i},{z}_{i}){\displaystyle \prod _{j\in \overline{V}}{\hat{f}}_{\beta}^{\perp}({y}_{j}|{x}_{j}).}}$$

(8)

Let denote the estimator maximizing (7), and ^{} the estimator maximizing (8). We describe the asymptotic properties of ^{} in the following two theorems. The derivation of the large sample properties is somewhat similar to Weaver and Zhou (2005). The emphasis here is on the efficiency gain when using ^{} instead of . Let ρ_{V} be the probability of a subject falling in the validation sample, and let ρ_{k} be the probability of a subject falling in the stratum *k*. We assume ρ_{V} and ρ_{k} are strictly positive, so that ${\sum}_{i=1}^{N}{1}_{[{r}_{i}=1]}/N\to {\rho}_{V}>0$ and *N _{k}*/

$$\begin{array}{c}\sqrt{N}(\tilde{\beta}-\beta ){\to}_{d}N\phantom{\rule{thinmathspace}{0ex}}(0,I{(\beta )}^{-1}+I{(\beta )}^{-1}\mathrm{\Sigma}I{(\beta )}^{-1}).\hfill \\ \sqrt{N}({\tilde{\beta}}^{\perp}-\beta ){\to}_{d}N\phantom{\rule{thinmathspace}{0ex}}(0,I{(\beta )}^{-1}+I{(\beta )}^{-1}{\mathrm{\Sigma}}^{\perp}I{(\beta )}^{-1}),\hfill \end{array}$$

where

$$\begin{array}{cc}\hfill I(\beta )& ={\rho}_{V}E[-\frac{{\partial}^{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}{f}_{\beta}(Y|X,Z)}{\partial {\beta}^{2}}]+(1-{\rho}_{V})E[-\frac{{\partial}^{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}{f}_{\beta}(Y|X)}{\partial {\beta}^{2}}],\hfill \\ \hfill \mathrm{\Sigma}& ={\displaystyle \sum _{k=1}^{K}{\displaystyle \sum _{\{x\}}\frac{\text{Pr}({S}_{k},X=x)\phantom{\rule{thinmathspace}{0ex}}{[\text{Pr}(R=0|X=x)]}^{2}}{{p}_{k}}{\text{Var}}_{\mathit{Z}}\phantom{\rule{thinmathspace}{0ex}}[{\mathrm{E}}_{\mathit{Y}}[W|x,z,R=0]|x,z\in {S}_{k}],}}\hfill \\ \hfill {\mathrm{\Sigma}}^{\perp}& ={\displaystyle \sum _{k=1}^{K}\frac{{\rho}_{k}{(1-{\rho}_{V})}^{2}}{{p}_{k}}{\text{Var}}_{\mathit{Z}}\phantom{\rule{thinmathspace}{0ex}}[{\mathrm{E}}_{\mathit{Y},\mathit{X}}[W|z,R=0]|z\in {S}_{k}],}\hfill \\ \hfill W& =\frac{\partial {f}_{\beta}(Y|X,Z)/\partial \beta}{{f}_{\beta}(Y|X)}-\frac{\partial {f}_{\beta}(Y|X)/\partial \beta}{{f}_{\beta}^{2}(Y|X)}{f}_{\beta}(Y|X,Z).\hfill \end{array}$$

We assume the usual regularity conditions for maximum likelihood holds for *f*_{β}(*Y*|*X*, *Z*) and *f*_{β}(*Y*|*X*) (Cox and Hinkely, 1974, Chapter 9) and that the sampling probability in each stratum is strictly positive. The proof of existence, uniqueness and consistency of an estimated likelihood estimator follows the results of Foutz (1977). The key step is that the second derivative of the estimated likelihood converges to a positive definite information matrix *I*(β), i.e., $-\frac{1}{N}\frac{{\partial}^{2}\text{log}\phantom{\rule{thinmathspace}{0ex}}\hat{L}(\beta )}{\partial {\beta}^{2}}{\to}_{p}I(\beta )$. Consistent estimators for *I*(β), Σ and Σ^{} can be formulated using empirical terms. See on-line supplementary material. Unlike the SPMLE in Section 3, *Î*() can be computed explicitly. Therefore and ^{} can be obtained much faster than the SPMLE. Clearly Σ^{} < Σ. The asymptotic variance reduction when exploiting the independence is *I*(β)^{−1}(Σ − Σ^{})*I*(β)^{−1}.

We conducted a series of simulations to evaluate the proposed estimators, and to investigate the efficiency gain when exploiting covariate independence. We consider a two-phase sampling scheme with the following features: the outcome *Y* may be binary or continuous; a binary covariate *X* indexing the treatment assignment, which takes the distribution form of Bernoulli(0.5); both *Y* and *X* are observed for everyone, thus form the phase-one data; in the second phase, a case-control sample is identified from the cohort by Bernoulli sampling, independent of *X*. A continuous biomarker is measured for subjects in the case-control sample; In all simulations, *X* is independent of *Z*.

We generated data for 10,000 subjects using the logistic model

$$\text{logit}(\text{Pr}(y=1|x,z))=\text{exp}({\beta}_{0}+{\beta}_{1}x+{\beta}_{2}z+{\beta}_{3}xz)$$

Set β_{0} = −3, β_{1} = 0.2, β_{2} = 0.1, and vary β_{3} to achieve different amounts of interaction between *X* and *Z*. We generated the biomarker data (*Z*) from a log normal distribution with a ceiling of 10 for *Z*, that is, *Z* ~ min(*e*^{N(0,1)}, 10). The Bernoulli sampling probabilities for cases and controls are such that on average 800 cases and 800 controls were selected; 5,000 datasets are simulated. Table 1 summarizes the biases and the sample variances of the various estimators. The complete-case estimator (CC) only uses the subjects with complete data in a logistic regression, ignoring the sampling. This estimator is consistent for the slope parameters (Prentice and Pyke, 1979). The weighted estimator (WE) uses the estimated sampling probabilities from the 4 strata defined by *Y* and *X* to weight the estimation equations. SPMLE and SPMLE exploiting independence (SPMLE^{}) are computed as in Section 3. MELE and MELE^{} are computed as in Section 4. All estimators are unbiased for the three slope parameters. The efficiency is relative to the SPMLE ignoring independence. Among the methods ignoring independence, the SPMLE achieves the lowest variance. The complete-case estimator yields the same efficiency as SPMLE in estimating β_{2} and β_{3}, but it has a much lower efficiency in estimating β_{1} since it does not utilize the phase-one data. The WE is the worst in terms of estimating the interaction, while the MELE is very close to the SPMLE in efficiency, especially when β_{3} is small. The trend observed here is consistent with Lawless, Kalbfleisch and Wild (1999). Interestingly, when using the independence (MELE^{} and SPMLE^{}), the sampling variances drop markedly. The efficiency gain in estimating β_{3} ranges from over 100% (β_{3} = 0) to 20% (β_{3} = 1). The efficiency gain in the main effects are 10%–50% depending on the effect size. In the parameter values we considered, the estimated likelihood approach performs closely to the semiparametric likelihood approach. We expect the relative efficiency of the MELE to decrease when parameter values get larger (Lawless, Kalbfleisch and Wild, 1999).

Binary Y: a comparison of weighted estimation equation estimator (WE), semiparametric maximum likelihood estimator (SPMLE), the maximal estimated likelihood estimator (MELE), with (^{}) and without exploiting independence assumption in 5000 simulations. **...**

Table 2 evaluates the validity of the estimated variances for the SPMLE, MELE, SPMLE^{} and MELE^{} based on 5000 simulations. The mean of the estimated variances agree very well to the corresponding sample variances. The empirical 95% coverage probabilities are close to the nominal 95%. We also studied many different parameter choices, as well as the situation where *Z* is categorical. The results are similar to those in Table 1 and and22 (results not shown).

We simulated a continuous outcome variable *Y* ~ *N*(μ, σ^{2}), with μ = −1 + 0.2*x* + 0.1*z* + β_{3}*xz*, σ^{2} = 1, *X* and *Z* as in Section 5.1. In the first phase, 2000 subjects are generated with *X* and *Y* observed. To create the phase-one strata, *Y* is divided at its 90th quantile. Individuals in the upper stratum are “cases”, and those in the lower stratum are “controls”. All cases and a random sample of about 200 controls enter the second phase and their *Z*s are measured. In Table 3 we compare the efficiencies of WE (using the estimated sampling probability), MELE, MELE^{}, SPMLE, and SPMLE^{}. We also include the SPMLE with reduced phase-one data by Lawless, Kalbfleisch and Wild (1999) (referred to as “LKW”) to see how much information is lost in categorizing the phase-one continuous outcome. Note both the WE and the LKW reduce the phase-one data: the WE estimates the sampling probabilities using observed counts in each strata and uses it as if it is fixed in the weighted log-likelihood; the LKW treats the phase one data as stratum labels, but maximizes the likelihood with respect to β and *g* simultaneously, therefore the weight is updated in the iterations.

Continuous Y: a comparison of the weighted estimator (WE), the reduced semiparametric maximum likelihood estimator as in the (Lawless, Kalbfleisch and Wild, 1999) (LKW), the semiparametric maximum likelihood estimator (SPMLE), the maximal estimated likelihood **...**

The biases of all estimators are negligible and therefore omitted in Table 3. As expected, the WE and the LKW have much bigger variances than the SPMLEs and the MELEs because the latter use more information from the first phase. Exploiting the independence between *X* and *Z* yields an additional efficiency gain, ranging from 10% to 50%. Consistent with the results in Table 1, the efficiency gain decreases with β_{3}. The estimated likelihood approach yields a second-best performance with respect to efficiency, next to the likelihood based methods. Note when β_{3} = 1 the performance of WE is better than that of LKW. This may be caused by the relatively small sample size (2000 phase-one subjects, 400 phase-two subjects). To improve the performance of the LKW method, one can create more strata for the phase-one data (poststratification) hence gain more precision (Lawless, Kalbfleisch and Wild, 1999). However it will still be inferior to methods using continuous outcomes from the first place.

We took the WHI biomarker study as in Example 1 to illustrate our methods. The aforementioned 29 biomarkers were picked by WHI investigators as markers that are possibly associated with either stroke, venous thrombotic disease, or myocardial infarction (MI). A comprehensive analysis of these samples is published (Kooperberg et al., 2007). In our terminology, this biomarker study is a two-phase study. The first-phase data consist of the randomized treatment assignment and stroke outcome for 16,608 study participants. The second phase consists of 124 cases and 504 controls from whom blood was analyzed. All blood biomarkers are continuous and were logarithm (base 10) transformed. To eliminate potential confounding factors, we included a number of important clinical characteristics in the second phase data, such as age, physical activity levels, diabetes, hypertension, systolic and diastolic blood pressure, and waist:hip ratio. We are interested in the interaction between the hormone treatment and each of biomarkers in a logistic regression adjusting for both main effects and aforementioned clinical characteristics. We compared four different methods to analyze two-phase data: the standard method ignoring the missing data (complete-case only) and the independence between the treatment and biomarkers, namely complete-case analysis, the proposed estimated likelihood estimator with or without exploiting independence, and the proposed SPMLE with or without exploiting covariate independence.

Table 4 shows the estimates of the interaction between the treatment and the thrombosis biomarkers using the aforementioned five methods. The model used in this table is

$$\text{logit}(P(\text{stroke}|\phantom{\rule{thinmathspace}{0ex}}\dots ))={\beta}_{0}+{\beta}_{1}\mathit{\text{HT}}+{\beta}_{2}\text{log}(B)+{\beta}_{3}\mathit{\text{HT}}\phantom{\rule{thinmathspace}{0ex}}\times \phantom{\rule{thinmathspace}{0ex}}\text{log}(B)+{\displaystyle \sum _{i}{\beta}_{i+3}{X}_{i},}$$

where stroke is the event of a WHI participant having a stroke, *HT* an indicator whether this participant was assigned to Hormone Therapy, *B* the (continuous) biomarker, and the *X _{i}* are other confounding factors. For ease of exposition, we only show the results of one class of biomarkers. Without exploiting the independence, there is essentially no improvement in efficiency over the complete-case analysis using either the MELE or the SPMLE. On the other hand, the standard errors of the two proposed methods exploiting independence is markedly smaller than the standard complete-case analysis and two other semiparametric methods. We found three thrombosis markers show a markedly increased significance. In particular, the p-values for PAP (plasmin-antiplasmin complex) are significant after adjusting for 29 tests by Bonferroni correction. A similar pattern of variance reduction, though in a smaller magnitude, is observed for the main effects - the treatment effect and the biomarker effect. The results of the MELE and the SPMLE exploiting independence are mostly indistinguishable. This is probably because the stroke is rather rare event (124 cases out of 16608 participants in this study) in the study cohort, the one-step estimated covariate distribution in the MELE is close to the iteratively estimated one in the SPMLE. These results demonstrate the advantage of our methods: exploiting the independence between the randomized treatment and the biomarker improves the power of detecting the interaction between them.

With the rapid advance of biotechnologies such as gene chips and proteomics, it is increasingly popular to employ some form of two-phase sampling to measure the expensive biomarker data for a sample of the study cohort, while collecting the cheap covariates and outcomes for everyone. When the phase-one cohort is a randomized clinical trial, there is independence between the treatment and baseline covariates. In this article, we show that exploiting this independence substantially improves the estimation efficiency, particularly for treatment-biomarker interactions. This is especially relevant to many pharmacogenetic studies where drug-genotype interactions are under investigation. In an observational two-phase study, however, unless we have strong a *priori* evidence about the gene-environment independence, we should exert extra caution when exploiting it in estimation as deviation from the independence can draw a substantial bias (Albert et al., 2001).

Beyond exploiting the independence caused by randomization, our methods have general merits in semiparametric estimation. Existing methods to compute the SPMLE, such as the EM algorithm and the profile likelihood, all have limitations. We propose a profile likelihood based Newton–Raphson algorithm that computes the SPMLE for a wide range of data forms including continuous outcomes, as long as the phase-one covariate is discrete and the phase-two data is missing at random. An innovation in our algorithm is the usage of numerical differentiation to compute the profile information matrix, thus avoiding the complicated algebraic derivation of Lawless, Kalbfleisch and Wild (1999). In situations where computing the SPMLE is time-consuming, the proposed estimated likelihood approach may help. A contribution of this article is to derive the asymptotics for the estimated likelihood in two-phase sampling and work out the efficiency gain when using the covariate independence. In our simulations and data application, the MELE performs almost as good as the SPMLE. Simulations in Lawless, Kalbfleisch and Wild (1999) suggest that the relative efficiency of MELE may decline when the effect size increases. In genetic association studies with many markers, the majority of markers have no effect and some may have weak effect, the estimated likelihood will be time-efficient in screening for a subset of interesting markers. The SPMLE can be used subsequently to get a more precise estimate for the biggest hits.

Semiparametric efficient estimators can be alternatively derived in the framework of augmented inverse probability weighted estimators (Robins, Rotnitzky and Zhao, 1994). When parametric models involved are correctly specified, the estimates are asymptotically equivalent to the SPMLE derived under the likelihood framework, but are harder to implement (Carpenter, Kenward and Vansteelandt, 2006). The augmented inverse probability weighted approach has its appeal in that it remains consistent if either the selection probability or the conditional distribution of missing data given observed data is correctly modeled. In randomized clinical trials, when baseline covariates (*Z*) to be adjusted are continuous and high-dimensional, it is almost impossible to correctly specify the parametric distribution of *Y*|*X*, *Z*. Then the SPMLE assuming the wrong model of *Y*|*X*, *Z* will not be consistent. However, since the sampling probabilities are precisely controlled by the investigators, it is possible to construct a doubly-robust estimator that exploits the independence introduced by randomization, yet still yields consistent estimators of treatment-biomarker interactions. Indeed, Robins and Ritov (1997) shows that in this setting any estimator that fails to use the knowledge of sampling probabilities can perform poorly in moderate samples. It remains interesting for future methodological studies to exploit the independence in the framework of doubly-robust estimators.

This work was supported in part by NIH grant R01 CA 74841, P01 CA53996 and U01 CA125489. The Women’s Health Initiative program is funded by the National Heart, Lung, and Blood Institute, US Department of Health and Human Services. The authors are grateful to the associate editor and the referee for their helpful comments which led to improved clarity of our presentation.

SUPPLEMENTARY MATERIALS

Web Appendices referenced in Section 3 and 4 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

- Albert PS, Ratnasinghe D, Tangrea J, Wacholder S. Limitations of the case-only design for identifying gene-environment interactions. American Journal of Epidemiology. 2001;154:587–693. [PubMed]
- Arnett DK, Davis BR, Ford CE, Boerwinkle E, Leiendecker-Foster C, B MM, Black H, Eckfeldt JH. Pharmacogenetic association of the angiotensin-converting enzyme insertion/deletion polymorphism on blood pressure and cardiovascular risk in relation to antihypertensive treatment: the genetics of hypertension-associated treatment (GenHAT) study. Circulation. 2005;111:3374–3383. [PubMed]
- Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika. 1988;75:11–20.
- Breslow NE, Day NE. Statistical Methods in Cancer Research I. The Analysis of Case-control Studies. Lyon, France: International Agency for Research on Cancer; 1980.
- Breslow NE, Robins JM, Wellner JM. Large sample theory for semiparametric regression models with two-phase, outcome-dependent sampling. Annals of Statistics. 2003;31:1110–1139.
- Carpenter JR, Kenward MG, Vansteelandt S. A comparison of multiple imputation and doubly robust estimation for analyses with missing data. Journal of the Royal Statistical Society, Ser. A. 2006;169:571–584.
- Carroll RJ, Wand MP. Semiparametric estimation in logistic measurement error models. Journal of the Royal Statistical Society, Ser. B. 1991;53:573–585.
- Chatterjee N, Carroll RJ. Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika. 2005;92:399–418.
- Chatterjee N, Chen YH. Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. Journal of the Royal Statistical Society, Ser. B. 2005;69:123–142.
- Chatterjee N, Chen YH, Breslow NE. A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association. 2003;98:158–168.
- Cox DR, Hinkley DV. Theoretical Statistics. New York: Chapman and Hall; 1974.
- Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Statistics in Medicine. 1991;10:739–747. [PubMed]
- Foutz RV. On the unique consistent solution to the likelihood equations. Journal of the American Statistical Association. 1977;72:147–148.
- Hu XJ, Lawless JF. Estimation from truncated lifetime data with supplementary information on covariates and censoring times. Biometrika. 1996;83:747–761.
- Ibrahim JG. Incomplete data in generalized linear models. Journal of the American Statistical Association. 1990;85:765–769.
- Ibrahim JG, Chen M, Lipsitz SR. Monte carlo em for missing covariates in parametric regression models. Biometrics. 1999;55:591–596. [PubMed]
- Kooperberg C, Cushman M, Hsia J, Robinson JG, Aragaki AK, Lynch JK, Baird AE, Johnson KC, Kuller LH, Beresford SA, Rodriguez B. Can biomarkers identify women at increased stroke risk? PLoS clinical trails. 15:e28. [PMC free article] [PubMed]
- Lawless JF. Likelihood and pseudo likelihood estimation based on response-biased observation. Proc. Georgia Symp. Estimation Functions. Hayward: Institute of Mathematical Statistics; 1997.
- Lawless JF, Kalbfleisch JD, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Ser. B. 1999;61:413–438.
- Lipsitz SR, Ibrahim JG, Zhao LP. A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. Journal of the American Statistical Association. 1999;94:1147–1160.
- Murphy SA, van der Vaart AW. On profile likelihood (with discussion) Journal of the American Statistical Association. 2000;95:449–485.
- Pepe MS, Fleming TR. A non-parametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association. 1991;86:108–113.
- Piegorsch WW, Weinberg CR, Taylor JA. Non-hierarchical logistic models and case-only designs for assessing susceptibility in population based case-control studies. Statistics in Medicine. 1994;13:153–162. [PubMed]
- Prentice RL, Pyke R. Logistic disease incidence models and case-controls studies. Biometrika. 1979;66:403–411.
- Reilly M, Pepe MS. A mean score method for missing and auxiliary covariate data in regression models. Biometrika. 1995;82:299–314.
- Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
- Robins JM, Ritov Y. Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine. 1997;16:285–319. [PubMed]
- Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, Jackson RD, Beresford SA, Howard BV, Johnson KC, Kotchen JM, Ockene J. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the women’s health initiative randomized controlled trial. Journal of American Medical Association. 2002;288:321–333. [PubMed]
- Rubin DB. Inference and missing data. Biometrika. 1976;63:581–592.
- Scott AJ, Wild CJ. Fitting logistic regression models in stratified case-control studies. Biometrics. 1991;47:497–510.
- Scott AJ, Wild CJ. Fitting regression models to case-control data by maximum likelihood. Biometrika. 1997;84:57–71.
- Umbach DM, Weinberg CR. Designing and analyzing case-control studies to exploit independence of genotype and exposure. Statistics in Medicine. 1997;16:1731–1743. [PubMed]
- Weaver MA, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent sampling. Journal of the American Statistical Association. 2005;100:459–469.
- White JE. A two-stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115:119–128. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |