


Article sections

- Summary
- 1. Introduction
- 2. Semiparametric Model Framework
- 3. Estimating Functions for Treatment Parameters Using Auxiliary Covariates
- 4. Implementation of Improved Estimators
- 5. Improved Hypothesis Tests
- 6. Simulation Studies
- 7. Applications
- 8. Discussion
- References


Biometrics. Author manuscript; available in PMC 2008 October 28.

Published in final edited form as:

Published online 2008 January 11. doi: 10.1111/j.1541-0420.2007.00976.x

PMCID: PMC2574960

NIHMSID: NIHMS45063

Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695−8203, U.S.A.

The publisher's final edited version of this article is available at Biometrics


The primary goal of a randomized clinical trial is to make comparisons among two or more treatments. For example, in a two-arm trial with continuous response, the focus may be on the difference in treatment means; with more than two treatments, the comparison may be based on pairwise differences. With binary outcomes, pairwise odds ratios or log-odds ratios may be used. In general, comparisons may be based on meaningful parameters in a relevant statistical model. Standard analyses for estimation and testing in this context typically are based on the data collected on response and treatment assignment only. In many trials, auxiliary baseline covariate information may also be available, and it is of interest to exploit these data to improve the efficiency of inferences. Taking a semiparametric theory perspective, we propose a broadly applicable approach to adjustment for auxiliary covariates to achieve more efficient estimators and tests for treatment parameters in the analysis of randomized clinical trials. Simulations and applications demonstrate the performance of the methods.

In randomized clinical trials, the primary objective is to compare two or more treatments on the basis of an outcome of interest. Along with treatment assignment and outcome, baseline auxiliary covariates may be recorded on each subject, including demographic and physiological characteristics, prior medical history, and baseline measures of the outcome. For example, the international Platelet Glycoprotein IIb/IIIa in Unstable Angina: Receptor Suppression Using Integrilin Therapy (PURSUIT) study (Harrington, 1998) in subjects with acute coronary syndromes compared the anti-coagulant Integrilin plus heparin and aspirin to heparin and aspirin alone (control) on the basis of the binary endpoint death or myocardial infarction at 30 days. Similarly, AIDS Clinical Trials Group (ACTG) 175 (Hammer et al., 1996) randomized HIV-infected subjects to four antiretroviral regimens with equal probabilities, and an objective was to compare measures of immunological status under the three newer treatments to those under standard zidovudine (ZDV) monotherapy. In both studies, in addition to the endpoint, substantial auxiliary baseline information was collected.

Ordinarily, the primary analysis is based only on the data on outcome and treatment assignment. However, if some of the auxiliary covariates are associated with outcome, precision may be improved by “adjusting” for these relationships (e.g., Pocock et al., 2002), and there is an extensive literature on such covariate adjustment (e.g., Senn, 1989; Hauck, Anderson, and Marcus, 1998; Koch et al., 1998; Tangen and Koch, 1999; Lesaffre and Senn, 2003; Grouin, Day, and Lewis, 2004). Much of this work focuses on inference on the difference of two means and/or on adjustment via a regression model for mean outcome as a function of treatment assignment and covariates. In the special case of the difference of two treatment means, Tsiatis et al. (2007) proposed an adjustment method that follows from application of the theory of semiparametrics (e.g., van der Laan and Robins, 2003; Tsiatis, 2006) by Leon, Tsiatis, and Davidian (2003) to the related problem of “pretest-posttest” analysis, from which the form of the “optimal” (most precise) estimator for the treatment mean difference, adjusting for covariates, emerges readily. This approach separates estimation of the treatment difference from the adjustment, which may lessen concerns over bias that could result under regression-based adjustment because of the ability to inspect treatment effect estimates obtained simultaneously with different combinations of covariates and “to focus on the covariate model that best accentuates the estimate” (Pocock et al., 2002, p. 2925).

In this paper, we expand on this idea by developing a broad framework for covariate adjustment in settings with two or more treatments and general outcome summary measures (e.g., log-odds ratios) by appealing to the theory of semiparametrics. The resulting methods seek to use the available data as efficiently as possible while making as few assumptions as possible. In Section 2, we present a semiparametric model framework involving parameters relevant to making general treatment comparisons. Using the theory of semiparametrics, we derive the class of estimating functions for these parameters in Section 3 and in Section 4 demonstrate how these results lead to practical estimators. This development suggests a general approach to adjusting any test statistic for making treatment comparisons to increase efficiency, described in Section 5. Performance of the proposed methods is evaluated in simulation studies in Section 6 and is shown in representative applications in Section 7.

Denote the data from a *k*-arm randomized trial, *k* ≥ 2, as (*Y _{i}*,

Let *β* denote a vector of parameters involved in making treatment comparisons under a specified statistical model. For example, in a two-arm trial, for a continuous real-valued response *Y*, a natural basis for comparison is the difference in means for each treatment, *E*(*Y* | *Z* = 2) − *E*(*Y* | *Z* = 1), represented directly as *β*_{2} in the model

$$E(Y \mid Z) = \beta_1 + \beta_2\, I(Z = 2) \tag{1}$$

In a three-arm trial, we may consider the model

$$E(Y \mid Z) = \beta_1\, I(Z = 1) + \beta_2\, I(Z = 2) + \beta_3\, I(Z = 3) \tag{2}$$

In contrast to (1), we have parameterized (2) equivalently in terms of the three treatment means rather than differences relative to a reference treatment, and treatment comparisons may be based on pairwise contrasts among elements of *β*. For binary outcome *Y* = 0 or 1, where *Y* = 1 indicates the event of interest, we may consider for a *k*-arm trial

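$$\operatorname{logit}\{E(Y \mid Z)\} = \beta_1 + \beta_2\, I(Z = 2) + \cdots + \beta_k\, I(Z = k) \tag{3}$$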

where logit(*p*) = log{*p*/(1 − *p*)}; *β* = (*β*_{1}, . . . , *β** _{k}*)

If *Y _{i}* is a vector of continuous longitudinal responses

(4)

where *β* = (*β*_{1}, *β*_{2})* ^{T}*, and

(5)

leaving remaining features of the distribution of *Y* given *Z* unspecified. For binary *Y _{ij}*, the marginal model logit{

In all of (1)-(5), *β* (*p* × 1) is a parameter involved in making treatment comparisons in a model describing aspects of the conditional distribution of *Y* given *Z* and is of central interest. In addition to *β*, models like (4) and (5) depend on a vector of parameters *γ*, say; e.g., in (4), ; and *γ* = *α* in (5). In general, we define *θ* = (*β** ^{T}* ,

For these and similar models, consistent, asymptotically normal estimators for *θ*, and hence for *β* and functions of its elements reflecting treatment comparisons, based on the data (*Y _{i}*,

As noted in Section 1, the standard approach in practice for covariate adjustment, thus using all of (*Y _{i}*,

(6)

extension to *k* > 2 treatments is immediate. See Tsiatis et al. (2007, Section 3) for discussion of related estimators for *β*_{2} in the particular case of (1). If (6) is the correct model for *E*(*Y* | *X*, *Z*), then and *β*_{2} in (1) coincide, and, moreover, the OLS estimator for in (6) is a consistent estimator for *β*_{2} that is generally more precise than the usual unadjusted estimator, even if (6) is not correct (e.g., Yang and Tsiatis, 2001). For binary *Y*, covariate adjustment is often carried out based on the logistic regression model

(7)

where the MLE of is taken as the adjusted estimator for the log-odds ratio *β*_{2} in (3) with *k* = 2. In (7), is the log-odds ratio conditional on *X*, assuming this quantity is constant for all *X*. This assumption may or may not be correct; even if it were, is generally different from *β*_{2} in (3). Tsiatis et al. (2007, Section 2) discuss this point in more detail.
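The distinction between a log-odds ratio defined conditional on *X* and the unconditional log-odds ratio can be checked numerically. The following is a minimal sketch under stated assumptions, not a computation from this article: it assumes a single standard normal covariate and a conditional logistic model with hypothetical intercept `a`, treatment coefficient `b`, and covariate coefficient `c` (names and values are ours), and it averages over *X* by the trapezoid rule to obtain the implied unconditional log-odds ratio.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def marginal_log_odds_ratio(a, b, c, half_width=8.0, steps=4000):
    """Unconditional log-odds ratio (arm 2 vs arm 1) implied by the assumed
    conditional model logit P(Y=1 | X, Z) = a + b*I(Z=2) + c*X, X ~ N(0, 1).

    The expectation over X is approximated by the trapezoid rule on
    [-half_width, half_width].
    """
    h = 2.0 * half_width / steps
    p = [0.0, 0.0]  # P(Y=1 | Z=1), P(Y=1 | Z=2) after averaging over X
    mass = 0.0      # integrated normal density, used to renormalize
    for j in range(steps + 1):
        x = -half_width + j * h
        w = h * (0.5 if j in (0, steps) else 1.0)
        dens = w * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        mass += dens
        p[0] += dens * expit(a + c * x)
        p[1] += dens * expit(a + b + c * x)
    p = [v / mass for v in p]

    def logit(q):
        return math.log(q / (1.0 - q))

    return logit(p[1]) - logit(p[0])


# Conditional log-odds ratio is b = 1 in both calls.
print(marginal_log_odds_ratio(0.0, 1.0, 2.0))  # strong covariate effect
print(marginal_log_odds_ratio(0.0, 1.0, 0.0))  # no covariate effect
```

With a strong covariate effect the unconditional value is pulled toward zero relative to `b`, whereas with `c = 0` the two quantities agree; this is the sense in which the two parameters generally differ even when the conditional model holds.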

To derive alternative methods, we begin by describing our assumed semiparametric statistical model for the full data (*Y*, *X*, *Z*), which is a characterization of the class of all joint densities for (*Y*, *X*, *Z*) that could have generated the data. We seek methods that perform well over as large a class as possible; thus, we assume that densities in this class involve no restrictions beyond the facts that *Z* ⫫ *X* (that is, *Z* is independent of *X*), guaranteed by randomization; that *π*_{g} =

Under the above conditions, we assume that all joint densities for (*Y*, *X*, *Z*) may be written, in obvious notation, as *p _{Y,X,Z}*(

(8)

(9)

The joint density involves an additional, possibly infinite-dimensional nuisance parameter *ψ*, needed to include in the class all joint densities satisfying (i) and (ii). Here, *p _{X}*(

We now derive consistent, asymptotically normal estimators for *θ*, and hence *β*, in a given *p*_{Y|Z} (*y*|*z*; *θ*, *η*) and using the iid data (*Y _{i}*,

When the data on auxiliary covariates *X* are not taken into account, estimating functions for *θ* based only on (*Y*, *Z*) in models like those in (1)-(5) leading to consistent, asymptotically normal estimators are well known. For example, the OLS estimator for *θ* = *β* in the linear regression model (1) may be obtained by considering the estimating function

$$m(Y, Z; \theta) = \{1, I(Z = 2)\}^{T}\{Y - \beta_1 - \beta_2\, I(Z = 2)\} \tag{10}$$

and solving the estimating equation in *θ*. The OLS estimator for *β*_{2} so obtained equals the usual difference in sample means. Likewise, with *θ* = *β* = (*β*_{1}, . . . , *β** _{k}*)

(11)

The estimating functions (10) and (11) are unbiased; i.e., have mean zero assuming that (1) and (3), respectively, are correct. Under regularity conditions, unbiased estimating functions lead to consistent, asymptotically normal estimators (e.g., Carroll et al., 2006, Section A.6).
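As a small check of the statement that the OLS estimator for *β*_{2} in model (1) equals the difference in sample means, the sketch below solves the estimating equation directly. It assumes the estimating function for (1) is {1, *I*(*Z* = 2)}^{T}{*Y* − *β*_{1} − *β*_{2}*I*(*Z* = 2)}; the function names and the simulated data are ours and purely illustrative.

```python
import random


def solve_two_arm_ols(y, z):
    """Solve sum_i (1, I(Z_i=2))^T {Y_i - b1 - b2*I(Z_i=2)} = 0 for (b1, b2).

    The equation is linear in (b1, b2): [[n, n2], [n2, n2]] (b1, b2)^T =
    (sum Y, sum I(Z=2) Y)^T, solved here by Cramer's rule.
    """
    n = len(y)
    n2 = sum(1 for zi in z if zi == 2)
    sy = sum(y)
    sy2 = sum(yi for yi, zi in zip(y, z) if zi == 2)
    det = n * n2 - n2 * n2
    b1 = (sy * n2 - n2 * sy2) / det
    b2 = (n * sy2 - n2 * sy) / det
    return b1, b2


def estimating_function_sum(y, z, b1, b2):
    """Both components of sum_i m(Y_i, Z_i; theta) at theta = (b1, b2)."""
    s0 = s1 = 0.0
    for yi, zi in zip(y, z):
        ind = 1.0 if zi == 2 else 0.0
        resid = yi - b1 - b2 * ind
        s0 += resid
        s1 += ind * resid
    return s0, s1


# Illustrative data only (not from the trials discussed in the article).
rng = random.Random(1)
z = [1 if rng.random() < 0.5 else 2 for _ in range(100)]
y = [rng.gauss(10.0 + 1.5 * (zi == 2), 2.0) for zi in z]

b1, b2 = solve_two_arm_ols(y, z)
mean1 = sum(yi for yi, zi in zip(y, z) if zi == 1) / z.count(1)
mean2 = sum(yi for yi, zi in zip(y, z) if zi == 2) / z.count(2)
print(b2, mean2 - mean1)                      # the two values agree
print(estimating_function_sum(y, z, b1, b2))  # both components are ~0
```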

Our key result is that, given a semiparametric model *p _{Y,X,Z}*(

(12)

where *a _{g}*(

Full advantage of this result may be taken by identifying the optimal estimating function within class (12), that is, the one for which the elements of the corresponding estimator for *θ* have smallest asymptotic variance. This estimator for *β* thus yields the greatest efficiency gain over the unadjusted estimator among all estimators with estimating functions in class (12) and hence more efficient inferences on treatment comparisons. By standard arguments for M-estimators (e.g., Stefanski and Boos, 2002), an estimator for *θ* corresponding to an estimating function of form (12) is consistent and asymptotically normal with asymptotic covariance matrix

(13)

where *θ*_{0} is the true value of *θ*, , and . . Thus, to find the optimal estimating function, one need only consider in (13) and determine *a _{g}*(

(14)

In the case of *β*_{2} in (1), (14) yields the optimal estimator in (16) of Tsiatis et al. (2007).

The optimal estimator in class (12) solving (14) depends on the conditional expectations *E*{*m*(*Y*, *Z*; *θ*) | *X _{i}*,

(1) Solve the original estimating equation to obtain the unadjusted estimator . For each subject *i*, obtain the values for each *g* = 1, . . . , *k*.

(2) Note that the are (*r* × 1). For each treatment group *g* = 1, . . . , *k* separately, based on the *r*-variate “data” for *i* in group *g*, develop a parametric regression model , where ; i.e., such that *q _{gu}*(

(15)

and solve for *θ* to obtain the final, adjusted estimator . We recommend substituting , in (15).

The foregoing three-step algorithm applies to very general *m*(*Y*, *Z*; *θ*). Often,

(16)

for some *A*(*Z*, *θ*) with *r* rows and some *f*(*Z*, *θ*), as in (10) and (11). Here, a simpler, “direct” implementation strategy is possible. Note that *E*{*m*(*Y*, *Z*; *θ*) | *X*, *Z* = *g*} = *A*(*g*, *θ*){*E*(*Y* |*X*, *Z* = *g*) − *f*(*g*; *θ*)}; thus, for each *g* = 1, . . . , *k*, based on the data (*Y _{i}*,

Several observations follow from semiparametric theory. Although we advocate representing *E*{*m*(*Y*, *Z*; *θ*) | *X*, *Z* = *g*} or *E*(*Y* |*X*, *Z* = *g*), *g* = 1, . . . , *k*, by parametric models, consistency and asymptotic normality of hold regardless of whether or not these models are correct specifications of the true *E*{*m*(*Y*, *Z*; *θ*) | *X*, *Z* = *g*} or *E*(*Y* |*X*, *Z* = *g*). Thus, the proposed methods are not parametric, as their validity does not depend on parametric assumptions. The theory also shows that, in either implementation strategy, if the *q** _{g}* are specified and fitted via OLS as described above, then, by an argument similar to that in Leon et al. (2003, Section 4), is guaranteed to be relatively more efficient than the corresponding unadjusted estimator. Moreover, under these conditions, although
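To make the "direct" strategy concrete for the simplest case, the sketch below estimates two treatment means with covariate adjustment and compares the Monte Carlo variance of the adjusted and unadjusted differences. It rests on assumptions that are ours, since the augmented equations are not legible in this text: the augmented estimating equation for the mean of arm *g* is taken to be Σ_{i}[*I*(*Z_{i}* = *g*)(*Y_{i}* − *β_{g}*) − {*I*(*Z_{i}* = *g*) − π̂_{g}}{*q_{g}*(*X_{i}*) − *β_{g}*}] = 0, with *q_{g}* a straight-line OLS fit of *Y* on a single covariate within arm *g*. The data-generating model and all function names are hypothetical.

```python
import random
import statistics


def fit_line(x, y):
    """OLS intercept and slope of y on a single covariate x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def augmented_means(y, x, z, arms=(1, 2)):
    """Covariate-adjusted arm means under the assumed augmented equation."""
    n = len(y)
    est = {}
    for g in arms:
        idx = [i for i in range(n) if z[i] == g]
        n_g = len(idx)
        pi_g = n_g / n  # estimated randomization probability
        # Regression model for E(Y | X, Z = g), fitted in arm g only.
        b0, b1 = fit_line([x[i] for i in idx], [y[i] for i in idx])
        q = [b0 + b1 * xi for xi in x]  # predictions for ALL subjects
        ybar_g = sum(y[i] for i in idx) / n_g
        aug = sum(((z[i] == g) - pi_g) * q[i] for i in range(n)) / n_g
        est[g] = ybar_g - aug
    return est


def simulate_trial(n, rng, effect=0.5, slope=2.0):
    """Hypothetical two-arm trial with one prognostic covariate."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = [1 if rng.random() < 0.5 else 2 for _ in range(n)]
    y = [1.0 + effect * (zi == 2) + slope * xi + rng.gauss(0.0, 1.0)
         for xi, zi in zip(x, z)]
    return y, x, z


def compare_variances(reps=300, n=200, seed=2008):
    """Monte Carlo variance of unadjusted and adjusted mean differences."""
    rng = random.Random(seed)
    unadj, adj = [], []
    for _ in range(reps):
        y, x, z = simulate_trial(n, rng)
        ybar = {g: statistics.fmean([yi for yi, zi in zip(y, z) if zi == g])
                for g in (1, 2)}
        est = augmented_means(y, x, z)
        unadj.append(ybar[2] - ybar[1])
        adj.append(est[2] - est[1])
    return statistics.variance(unadj), statistics.variance(adj), statistics.fmean(adj)


var_unadj, var_adj, mean_adj = compare_variances()
print(var_unadj, var_adj, mean_adj)
```

In this hypothetical setup the covariate explains most of the outcome variation, so the adjusted difference should show a clearly smaller Monte Carlo variance while remaining centered near the simulated effect of 0.5. Note that the regression for each arm is fitted using that arm's data only, so the treatment comparison never enters the model-building step.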

Because in either implementation strategy solving (15) is an M-estimator, the sandwich method (e.g., Stefanski and Boos, 2002) may be used to obtain a sampling covariance matrix for , from which standard errors for functions of may be derived. This matrix is of form (13), with expectations replaced by sample averages evaluated at the estimates and *a _{g}*(
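For readers unfamiliar with the sandwich method, the following sketch computes it for the unadjusted two-arm case only. It assumes the generic M-estimation form *A*^{−1}*B*(*A*^{−1})^{T}/*n*, with *A* the average of −∂*m*/∂*θ*^{T} and *B* the average of *mm*^{T} (as in Stefanski and Boos, 2002), applied to the estimating function {1, *I*(*Z* = 2)}^{T}{*Y* − *β*_{1} − *β*_{2}*I*(*Z* = 2)}. It is not the augmented covariance matrix (13), and the helper names are ours.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def matmul2(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose2(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]


def sandwich_var_beta2(y, z):
    """Sandwich variance estimate for b2, the difference in arm means."""
    n = len(y)
    ind = [1.0 if zi == 2 else 0.0 for zi in z]
    n2 = sum(ind)
    n1 = n - n2
    mean1 = sum(yi for yi, d in zip(y, ind) if d == 0.0) / n1
    mean2 = sum(yi for yi, d in zip(y, ind) if d == 1.0) / n2
    b1, b2 = mean1, mean2 - mean1  # root of the estimating equation
    p = n2 / n
    # "Bread": average of -dm/dtheta^T, which does not depend on Y here.
    A = [[1.0, p], [p, p]]
    # "Meat": average of m m^T evaluated at the estimate.
    B = [[0.0, 0.0], [0.0, 0.0]]
    for yi, d in zip(y, ind):
        resid = yi - b1 - b2 * d
        m = (resid, d * resid)
        for r in range(2):
            for c in range(2):
                B[r][c] += m[r] * m[c] / n
    a_inv = inv2(A)
    cov = matmul2(matmul2(a_inv, B), transpose2(a_inv))
    return cov[1][1] / n


# Illustrative data only.
y = [4.1, 5.3, 3.8, 6.0, 7.2, 6.5, 8.1, 5.9, 7.7]
z = [1, 1, 1, 1, 2, 2, 2, 2, 2]
print(sandwich_var_beta2(y, z) ** 0.5)  # standard error of the difference
```

In this special case the result reduces to the sum over arms of the within-arm residual sum of squares divided by the squared arm size, which is the familiar unequal-variance formula for a difference of two means up to the choice of divisor.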

The regression models *q** _{g}* in either implementation, which are the mechanism by which covariate adjustment is incorporated, are determined separately by treatment group and are developed independently of reference to the adjusted estimator . Thus, estimation of

The principles in Section 3 may be used to construct more powerful tests of null hypotheses of no treatment effects by exploiting auxiliary covariates. The key is that, under a general null hypothesis *H*_{0} involving s degrees of freedom, a usual test statistic *T** _{n}*, say, based on the data (

(17)

where is a (*s*×1) function of (*Y*, *Z*), discussed further below, such that , with denoting expectation under *H*_{0}; and .

When the notion of “treatment effects” may be formulated in terms of *β* in a model like (1)-(5), the null hypothesis is typically of the form *H*_{0} : *Cβ* = 0, where *C* is a (*s*×*p*) contrast matrix. For example, in (2), *C* is (2 × 3) with rows (1, −1, 0) and (1, 0, −1). When inference on *H*_{0} is based on a Wald test of the form , where is the unadjusted estimator corresponding to an estimating function *m*(*Y*, *Z*; *θ*), and is an estimator for the covariance matrix of . Here, *B* is the (*p* × *r*) matrix equal to the first *p* rows of , and *θ*_{0} is the value of *θ* under *H*_{0}.
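As an illustration of the unadjusted Wald test with the contrast matrix given for model (2), the sketch below tests equality of three arm means. It assumes the standard Wald form (*Cβ̂*)^{T}(*C*Σ̂*C*^{T})^{−1}(*Cβ̂*), with Σ̂ the diagonal matrix of estimated variances of the three sample means; this form and the function name are our assumptions, and the data are made up. With *s* = 2 degrees of freedom the chi-square tail probability has the closed form exp(−*T*/2).

```python
import math


def wald_test_equal_means(groups):
    """Wald test of H0: C beta = 0 for three arm means.

    C has rows (1, -1, 0) and (1, 0, -1); beta is estimated by the sample
    means, whose covariance matrix is diagonal because arms are independent.
    Returns the statistic T and its chi-square (2 df) p-value.
    """
    means = [sum(g) / len(g) for g in groups]
    # Estimated variance of each sample mean.
    var = [sum((v - m) ** 2 for v in g) / (len(g) - 1) / len(g)
           for g, m in zip(groups, means)]
    C = [(1, -1, 0), (1, 0, -1)]
    d = [sum(c * m for c, m in zip(row, means)) for row in C]  # C beta-hat
    # M = C Sigma-hat C^T, a symmetric 2x2 matrix.
    M = [[sum(a * b * v for a, b, v in zip(r1, r2, var)) for r2 in C]
         for r1 in C]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # d^T M^{-1} d written out for the 2x2 case.
    T = (M[1][1] * d[0] ** 2 - 2.0 * M[0][1] * d[0] * d[1]
         + M[0][0] * d[1] ** 2) / det
    p_value = math.exp(-T / 2.0)  # chi-square survival function, 2 df
    return T, p_value


# Illustrative data only.
arms = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [5.0, 6.0, 7.0, 9.0]]
T, p = wald_test_equal_means(arms)
print(T, p)
```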

In other situations, the null hypothesis may not refer to a parameter like *β* in a given model. For example, the null hypothesis for a *k*-arm trial may be *H*_{0} : *S*_{1}(*u*) = · · · = *S** _{k}*(

To motivate the proposed more powerful tests, we consider the behavior of *T _{n}* in (17) under a sequence of local alternatives

(18)

(19)

where . The second term in (19) has mean zero by randomization under *H*_{0} or any alternative. Accordingly, it follows under the sequence of alternatives *H*_{1n} that converges in distribution to a random vector, so that in (18) has an asymptotic distribution with noncentrality parameter *τ*^{T}Σ^{*−1}*τ*.

These results suggest that, to maximize the noncentrality parameter and thus power, we wish to find the particular Σ* , , say, that makes non-negative definite for all Σ*, which is equivalent to making non-negative definite for all Σ*. This corresponds to finding the optimal choice of *a _{g}*(

These developments suggest an implementation strategy analogous to that in Section 4:

(1) For the test statistic *T _{n}*, determine and substitute sample quantities for any unknown parameters to obtain . E.g., for

(20)

As *μ* is unknown, is obtained by substituting for *μ*. We recommend substituting for *π** _{g}*,

(2) For each treatment group *g* = 1, . . . , *k* separately, treating the for subjects *i* in group *g* as *s*-variate “data,” develop a regression model by representing each component *q _{gu}*(

(3) Using the predicted values from step (2), form

(21)

and substitute these values into (18). Estimate Σ* in (18) by .

Compare the resulting test statistic to the distribution. As in Section 4, there is no effect asymptotically of estimating *ζ** _{g}* and

The approach of Tangen and Koch (1999) to modifying the Wilcoxon test for two treatments is in a similar spirit to this general approach.

We report results of several simulations, each based on 5000 Monte Carlo data sets. Tsiatis et al. (2007, Section 6) carried out extensive simulations in the particular case of (1); thus, we focus here on estimation of quantities other than differences of treatment means.

In the first set of simulations, we considered *k* = 2, a binary response *Y*, and

$$\operatorname{logit}\{E(Y \mid Z)\} = \beta_1 + \beta_2\, I(Z = 2) \tag{22}$$

so that *β*_{2} is the log-odds ratio for treatment 2 relative to treatment 1, the parameter of interest; and *θ* = *β* = (*β*_{1}, *β*_{2})* ^{T}* . For each scenario, we generated

- Aug. 1 , fit by OLS
- Aug. 2 , fit by OLS
- Aug. 3 fit by IRWLS
- Aug. 4 , fit by IRWLS
- Aug. 5 by OLS with forward selection
- Aug. 6 by IRWLS with forward selection

where “true” means that *c** _{g}*(

Table 1 shows modest to considerable gains in efficiency for the proposed estimators, depending on the strength of the association. The estimators are unbiased, and associated confidence intervals achieve the nominal level. In contrast, the usual adjustment based on (7) leads to biased estimation of *β*_{2}, considerable efficiency loss, and unreliable intervals. This is a consequence of the fact that *β*_{2} is an unconditional measure of treatment effect while is defined conditional on *X*; this distinction does not matter when the model for *Y* is linear but is important when it is nonlinear, as is (7) (see, e.g., Robinson et al., 1998).

In the second set of simulations, we again took *k* = 2 and focused on *β*_{2}, the difference in treatment slopes in the linear mixed model (4). In each scenario, we generated for each *i* = 1, . . . , *n* = 200 *Z _{i}* as Bernoulli with

We carried out simulations based on 10,000 Monte Carlo data sets involving *k* = 3 and the Kruskal-Wallis test. For each data set, we generated for each of *n* = 200 or 400 subjects *Z* with *P*(*Z* = *g*) = 1/3, *g* = 1, 2, 3, and (*Y*, *X*) with joint distribution of (*Y*, *X*) given *Z* bivariate normal with mean {*β*_{1}*I*(*Z* = 1) + *β*_{2}*I*(*Z* = 2), 0}* ^{T}* and covariance matrix vech(1,

We consider data from 5,710 patients in the PURSUIT trial introduced in Section 1 and focus on the log-odds ratio for Integrilin relative to control. The 35 baseline auxiliary covariates are listed in Web Appendix D.

The unadjusted estimate of the log-odds ratio based on (22), , is −0.174 with standard error 0.073. To calculate the augmented estimator based on (22), we used the direct implementation strategy and took , *g* = 1, 2, with *c** _{g}*(

We consider data on 2139 subjects from ACTG 175, discussed in Section 1, where the *k* = 4 treatments were zidovudine (ZDV) monotherapy (*g* = 1), ZDV+didanosine (ddI, *g* = 2), ZDV+zalcitabine (*g* = 3), and ddI monotherapy (*g* = 4). The continuous response is CD4 count (cells/mm^{3}, *Y*) at 20±5 weeks, and we focus on the four treatment means, with the same 12 auxiliary covariates considered by Tsiatis et al. (2007, Section 5).

We consider the extension of model (2) to *k* = 4 treatments, so that *θ* = *β* = (*β*_{1}, . . . , *β*_{4})* ^{T}*,

We also carried out the standard unadjusted three-degree-of-freedom Wald test for *H*_{0} : *β*_{1} = *β*_{2} = *β*_{3} = *β*_{4} and Kruskal-Wallis test for *H*_{0} : *S*_{1}(*u*) = · · · = *S*_{4}(*u*) = *S*(*u*), as well as their adjusted counterparts using *c** _{gu}*(

See Web Appendix D for further results for these data.

We have proposed a general approach to using auxiliary baseline covariates to improve the precision of estimators and tests for general measures of treatment effect and general null hypotheses in the analysis of randomized clinical trials by using semiparametric theory.

We identify the optimal estimating function involving covariates within the class of such estimating functions based on a *given m*(*Y*, *Z*; *θ*). For differences of treatment means or measures of treatment effect for binary outcomes, this estimating function in fact leads to the efficient estimator for the treatment effect. In more complicated models, e.g., repeated measures models, we do not identify the optimal estimating function among *all* possible. Our experience in other problems suggests that gains over the methods here would be modest.

The use of model selection techniques, such as forward selection in our simulations, to determine covariates to include in the augmentation term models should have no effect asymptotically on the properties of the estimators for *θ*. However, such effects may be evident in smaller samples, requiring a “correction” to account for failure of the asymptotic theory to represent faithfully the uncertainty due to model selection. Investigation of how approaches to inference after model selection (e.g., Hjort and Claeskens, 2003; Shen, Huang and Ye, 2004) may be adapted to this setting would be a fruitful area for future research.

This work was supported by NIH grants R37 AI031789, R01 CA051962, and R01 CA085848.

Supplementary Materials

Web Appendices A–D, referenced in Sections 2, 3, and 7, are available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.


- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition. Chapman and Hall/CRC; Boca Raton: 2006.
- Grouin JM, Day S, Lewis J. Adjustment for baseline covariates: An introductory note. Statistics in Medicine. 2004;23:697–699.
- Hammer SM, Katzenstein DA, Hughes MD, Gundaker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M, Hirsch MS, Merigan TC, AIDS Clinical Trials Group Study 175 Study Team. A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine. 1996;335:1081–1089.
- Harrington RA, PURSUIT Investigators. Inhibition of platelet glycoprotein IIb/IIIa with eptifibatide in patients with acute coronary syndromes without persistent ST-segment elevation. New England Journal of Medicine. 1998;339:436–443.
- Hauck WW, Anderson S, Marcus SM. Should we adjust for covariates in nonlinear regression analyses of randomized trials? Controlled Clinical Trials. 1998;19:249–256.
- Hjort NL, Claeskens G. Frequentist model average estimators. Journal of the American Statistical Association. 2003;98:879–899.
- Koch GG, Tangen CM, Jung JW, Amara IA. Issues for covariance analysis of dichotomous and ordered categorical data from randomized clinical trials and non-parametric strategies for addressing them. Statistics in Medicine. 1998;17:1863–1892.
- Leon S, Tsiatis AA, Davidian M. Semiparametric efficient estimation of treatment effect in a pretest-posttest study. Biometrics. 2003;59:1046–1055.
- Lesaffre E, Senn S. A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. Statistics in Medicine. 2003;22:3586–3596.
- Pocock SJ, Assmann SE, Enos LE, Kasten LE. Subgroup analysis, covariate adjustment, and baseline comparisons in clinical trial reporting: Current practice and problems. Statistics in Medicine. 2002;21:2917–2930.
- Robinson LD, Dorroh JR, Lein D, Tiku ML. The effects of covariate adjustment in generalized linear models. Communications in Statistics, Theory and Methods. 1998;27:1653–1675.
- SAS Institute, Inc. SAS Online Doc 9.1.3. SAS Institute, Inc.; Cary, NC: 2006.
- Senn S. Covariate imbalance and random allocation in clinical trials. Statistics in Medicine. 1989;8:467–475.
- Shen X, Huang HC, Ye J. Inference after model selection. Journal of the American Statistical Association. 2004;99:751–762.
- Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002;56:29–38.
- Tangen CM, Koch GG. Nonparametric analysis of covariance for hypothesis testing with logrank and Wilcoxon scores and survival-rate estimation in a randomized clinical trial. Journal of Biopharmaceutical Statistics. 1999;9:307–338.
- Tsiatis AA. Semiparametric Theory and Missing Data. Springer; New York: 2006.
- Tsiatis AA, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine. 2007 in press.
- van der Laan MJ, Robins JM. Unified Methods for Censored Longitudinal Data and Causality. Springer; New York: 2003.
- van der Vaart AW. Asymptotic Statistics. Cambridge University Press; Cambridge: 1998.
- Yang L, Tsiatis AA. Efficiency study for a treatment effect in a pretest-posttest trial. The American Statistician. 2001;56:29–38.
