
Biometrics. Author manuscript; available in PMC 2014 February 2.

Published in final edited form as:

Published online 2011 July 15. doi: 10.1111/j.1541-0420.2011.01642.x

PMCID: PMC3909656

NIHMSID: NIHMS306262

Russell T. Shinohara, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA;

The publisher's final edited version of this article is available at Biometrics


## Summary

Pilot phases of a randomized clinical trial often suggest that a parametric model may be an accurate description of the trial's longitudinal trajectories. However, parametric models are often not used for fear that they may invalidate tests of null hypotheses of equality between the experimental groups. Existing work has shown that, for some types of data, when certain parametric models are used, validity for testing the null is preserved even if the parametric models are incorrect. Here, we provide a broader and easier-to-check characterization of parametric models that can be used to (a) preserve nonparametric validity of testing the null hypothesis, i.e., even when the models are incorrect, and (b) increase power compared to the non- or semiparametric bounds when the models are close to correct. We demonstrate our results in a clinical trial of depression in Alzheimer's patients.

## 1. Introduction

When analyzing data from randomized clinical trials, investigators often have information about the relative appropriateness of certain parametric models from pilot phases or existing literature. More specifically, suppose one is interested in assessing whether there is a difference in average trajectories between a treatment arm and a control arm. Previous observations that such trajectories are curvilinear over time would suggest that a parametric model could approximate the actual underlying trajectories well.

For example, in the Depression in Alzheimer's Disease Study (DIADS; Lyketsos et al., 2003), depressive symptoms are studied longitudinally after initiation of an antidepressant regimen or placebo. It has been observed that such treatments generally result in an initial improvement in symptoms that reaches a plateau within a matter of weeks (e.g., Mulsant et al., 2001). This curvilinear shape indicates that a parametric model of quadratic curves for the mean outcome over time would be close to the actual trajectories. This in turn means that a test between a treatment and a control arm based on such parametric models might have higher power than a nonparametric test.

Unfortunately, researchers tend not to use parametric models when analyzing data from such trials. This is understandably a result of hesitation about the validity of parametric tests when these models are misspecified. More specifically, the behavior of the type I error of hypothesis tests for RCTs based on misspecified parametric models had not been carefully studied until recently. Early work by Gail, Tan, and Piantadosi (1988) examined test validity in a special case of misspecified generalized linear models (see Discussion). For linear models, Robins (2004) examined the behavior of hypothesis testing based on misspecified models in this context. Rosenblum and van der Laan (2009) shed further light on this problem by showing that there exist classes of possibly misspecified models that still lead to valid tests. These results, however, have been specific to testing for differences in means in particular subclasses of generalized linear models.

We derive a criterion that characterizes a broader class of parametric models through which non-parametrically robust hypothesis tests are obtainable. For example, we show in Section 4 that a large class of longitudinal parametric models can also be used to construct non-parametrically valid tests. Furthermore, the criterion that we propose is easy to verify, as it has a geometric symmetry interpretation. This is important because these parametric model-based tests (a) preserve nonparametric validity of testing the null hypothesis, i.e., even when the models are incorrect, and (b) increase power compared to the non- or semiparametric bounds when the models are close to correct (see Section 6). In the next section, we present the setting and notation for the remainder of the work. In Section 3 we give our main characterization result. In Section 4 we show that the classes characterized in Rosenblum and van der Laan (2009) are a subset of the class characterized by the more general symmetry criterion. In Section 5 we give an application to the DIADS trial, and we conclude with a discussion.

## 2. Scientific setting and goal

We consider a randomized clinical trial (RCT) that compares an outcome *Y* between two treatments, *a* = 0, 1. Specifically, for each of *i* = 1, ..., *n* patients, we measure the assigned treatment *A* and the outcome *Y*. We also allow that pre-treatment covariate information *X* is measured; *X* is not used for randomization but can be used for analysis. We wish inference statements to generalize to a reference population from which we can assume that the *n* patients are a representative random sample.

We denote by *p*_{a}^{true}(*y*; *x*) the true conditional distribution *pr*(*Y* = *y* | *A* = *a*, *X* = *x*) for randomized arms *a* = 0, 1. We wish to test the null hypothesis

*H*_{0} : *p*_{0}^{true}(*y*; *x*) = *p*_{1}^{true}(*y*; *x*) (1)

as functions of *y*, *x*. More specifically, our goal is to test *H*_{0} with tests that are developed based on parametric models but are non-parametrically valid. This will be useful when pilot phases of a clinical trial have suggested that a parametric model may be an accurate description of the trial's data, such as shapes of longitudinal trajectories.

Typically, we would represent a parametric model for the RCT with covariates by a collection of distributions {*P*(*Y* = *y* | *A* = *a, X* = *x, θ*), for *θ* ∈ Θ} over a parameter space Θ. Here, however, it will help give further intuition to our results if instead we use a different representation. Every parameter value *θ* ∈ Θ gives rise simultaneously to one distribution for the arm *A* = 0 and another for *A* = 1, namely, to the vector of distributions (*P*(*Y* = *y* | *A* = 0, *X* = *x, θ*), *P*(*Y* = *y* | *A* = 1, *X* = *x, θ*)), which we denote by (*p*_{0}(*y*; *x*; *θ*), *p*_{1}(*y*; *x*; *θ*)). As the parameter *θ* varies over Θ, we therefore represent an arbitrary parametric model by the set of vectors

*S* = {(*p*_{0}(*y*; *x*; *θ*), *p*_{1}(*y*; *x*; *θ*)) : *θ* ∈ Θ}, (2)

or, more briefly, {(*p*_{0}, *p*_{1})}, where we have omitted the indices for *y, x, θ*. In words, *S* is a set whose members are the vectors of the distributions for the two arms of the RCT that are generated by a parameter value. We allow that the model *S* may be incorrect in the sense that *S* may not contain (*p*_{0}^{true}, *p*_{1}^{true}).

## 3. A symmetry criterion for non-parametric validity of parametric tests

Rosenblum and van der Laan (2009) consider regression models in the setting described above and show that a class of models with a particular form induces valid hypothesis tests, independently of whether or not the specified model is correct. We claim that this property holds for a more general class of models, characterized by the following criterion:

**Criterion 1:** If (*p*_{0}, *p*_{1}) is in *S*, then the null pairs

(*p*_{0}, *p*_{0}) and (*p*_{1}, *p*_{1}) are also in *S*. (3)

In terms of the parameter-based, but longer, notation, Criterion 1 is described as follows. For a given value of *θ*, which defines (*p*_{0}(*y*; *x*; *θ*), *p*_{1}(*y*; *x*; *θ*)) as an allowed pair of distributions for the treatment arms *a* = 0 and *a* = 1 in the model *S*, there exist two parameter values, say *θ*′ and *θ*″, for which: the null pair (*p*_{0}(*y*; *x*; *θ*), *p*_{0}(*y*; *x*; *θ*)) can be written as (*p*_{0}(*y*; *x*; *θ*′), *p*_{1}(*y*; *x*; *θ*′)) and so belongs to *S* with parameter value *θ*′; and the null pair (*p*_{1}(*y*; *x*; *θ*), *p*_{1}(*y*; *x*; *θ*)) can be written as (*p*_{0}(*y*; *x*; *θ*″), *p*_{1}(*y*; *x*; *θ*″)) and so belongs to *S* with parameter value *θ*″.

More intuitively, Criterion 1 can be depicted visually using the Kullback-Leibler (KL) distance (the negative of the KL information; Kullback and Leibler, 1951), as in Figures 1(a)-1(b). The axes in these plots are the component-wise KL distance from the true null distribution in each arm, which is convenient for emphasizing the symmetric nature of the criterion. In Figure 1(a), Criterion 1 is satisfied, but in 1(b) it fails to hold. In simpler language, this criterion requires that if the model allows a distribution *p* for one of the arms, then it must allow that the null hypothesis (*p*, *p*) may be true. This criterion is reasonable: with the goal of comparing treatment arms, it would be difficult to justify a model that does not allow for such a null hypothesis.

Depiction of the two scenarios in which Criterion 1 is satisfied (left) and not satisfied (right) by the class of models *S*.

Under the regularity condition that *π*_{0}*E*{|log *p*_{0}(*Y*_{i}; *X*_{i}; *θ*)| | *A*_{i} = 0} and *π*_{1}*E*{|log *p*_{1}(*Y*_{i}; *X*_{i}; *θ*)| | *A*_{i} = 1} are finite for all *θ*, where *π*_{a} denotes the randomization probability of arm *a*, we have the following result.

**Result 1:** If Criterion 1 is satisfied, we have that under the null hypothesis (1), the limit of the log-likelihood function is maximized at a null pair of distributions in the model, say (*p**, *p**).

If, in addition, conditions A1-A6 of White (1982) hold, then we have that (*p**, *p**) is the unique maximizer of the limiting log-likelihood and that the MLE of the contrast between *p*_{0} and *p*_{1} is asymptotically normal with mean 0. The result is shown in Appendix A. In what follows, we assume the above regularity conditions.

The above result is important because, although the researcher does not control the correctness of the parametric model *S*, the researcher fully controls and can select *S* to satisfy Criterion 1. The latter thus ensures that under the true null *H*_{0}, any contrast (e.g., difference in means or medians) between the maximum likelihood estimates, say (*p̂*_{0}, *p̂*_{1}), is asymptotically also null. The uncertainty of the contrast between the maximum likelihood estimates (*p̂*_{0}, *p̂*_{1}) should be estimated robustly, for example, using a bootstrap (see Section 5).
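To make the bootstrap step concrete, the following is a minimal sketch (not the authors' code) of estimating a contrast from a working linear model and attaching a robust bootstrap standard error to it. The data, working model, and sample sizes are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_contrast(y, a, x):
    # Working linear model y ~ 1 + a + x fit by least squares (the MLE
    # under a homoscedastic normal working model); the coefficient of a
    # is the model-based difference in means between arms at fixed x.
    design = np.column_stack([np.ones_like(y), a, x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta[1]

def bootstrap_se(y, a, x, n_boot=500):
    # Robust (nonparametric bootstrap) standard error of the contrast.
    n = len(y)
    draws = [fit_contrast(y[idx], a[idx], x[idx])
             for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return float(np.std(draws, ddof=1))

# Simulated trial in which the null holds (same outcome law in both arms):
n = 200
a = rng.integers(0, 2, n)
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

c_hat = fit_contrast(y, a, x)
se = bootstrap_se(y, a, x)
```

Because the variance is estimated by resampling rather than from the model, the standard error stays trustworthy even if the working model is wrong, which is the point of the result above.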

## 4. Relation with established literature

Result 1 of the last section generalizes the models of Rosenblum and van der Laan (2009), and of Gail, Tan, and Piantadosi (1988) (the latter a special case of the former; see Discussion), that can be used as a basis for a valid test. Specifically, Rosenblum and van der Laan (2009) considered the null hypothesis to be on the mean regressions in each arm,

*H*_{0} : *μ*_{0}(*x*) = *μ*_{1}(*x*) for all *x*, (4)

where *μ*_{a}(*x*) = *E*(*Y* | *A* = *a*, *X* = *x*), and showed that tests based on the working model for *Y* being a generalized linear model are robust to that model being incorrect. We can now see that that result follows from geometric symmetry arguments similar to the ones for Criterion 1 and Result 1. To see this, denote by *μ*_{a}(·, *β*) the modeled mean regression for arm *a*, and represent the working model by the set of pairs *S*_{means} = {(*μ*_{0}(·, *β*), *μ*_{1}(·, *β*))}. The analogous criterion is:

**Criterion 2:** If (*μ*_{0}(·, *β*), *μ*_{1}(·, *β*)) is in *S*_{means}, then the null pairs

(*μ*_{0}(·, *β*), *μ*_{0}(·, *β*)) and (*μ*_{1}(·, *β*), *μ*_{1}(·, *β*)) (5)

must also be members in *S*_{means}.

Note that, for the above null pairs to be in the model *S*_{means}, we mean that for any given *β*, the left null pair of (5) can be rewritten as (*μ*_{0}(·, *β*′), *μ*_{1}(·, *β*′)) for some parameter value *β*′; and the right null pair of (5) can be rewritten as (*μ*_{0}(·, *β*″), *μ*_{1}(·, *β*″)) for some parameter value *β*″.

**Result 2:** If Criterion 2 is satisfied, we have that under the null hypothesis (4), the limit of the log-likelihood function is maximized at a parameter value at which the two arms' modeled mean regressions are equal.

In the Web Appendix, we prove Result 2 and also show that the generalized linear models described in Rosenblum and van der Laan (2009) satisfy Criterion 2. Criterion 2 is similar to Criterion 1 in its statement and function. The difference is in the null hypotheses (4 and 1, respectively). Criterion 2 requires symmetry in the mean structures allowed in the model but is limited to generalized linear models, whereas Criterion 1 requires symmetry with respect to the distribution, and is applicable to any parametric model. To show the generality of Criterion 1, we continue with two examples that demonstrate the ease of checking its conditions.

For a first example, consider the simple normal linear regression with homoscedastic variance *σ*^{2} and mean *E*(*Y* | *X* = *x, A* = *a*) modeled as

*β*_{0} + *β*_{A}*a* + *β*_{X}*x*, (6)

where *β* = (*β*_{0}, *β*_{A}, *β*_{X}) is unrestricted. Each value of *β* generates the pair of mean functions

[*μ*_{0}(*x*, *β*), *μ*_{1}(*x*, *β*)] = [*β*_{0} + *β*_{X}*x*, (*β*_{0} + *β*_{A}) + *β*_{X}*x*]. (7)

Therefore it is easily seen that Criterion 1 holds, because the null pair distributions with means

(*β*_{0} + *β*_{X}*x*, *β*_{0} + *β*_{X}*x*) and ((*β*_{0} + *β*_{A}) + *β*_{X}*x*, (*β*_{0} + *β*_{A}) + *β*_{X}*x*) (8)

and with the same *σ*^{2} are also allowed models in *S*; the first pair is the null model that chooses the coefficient of *A* to be 0 and the intercept to be *β*_{0}; the latter pair can be re-written as [(*β*_{0} + *β*_{A}) + 0 · *a* + *β*_{X}*x*] in both arms, i.e., the null model with intercept *β*_{0} + *β*_{A}, coefficient of *A* equal to 0, and the same *β*_{X}.
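The symmetry just described can be checked numerically. The sketch below (illustrative only; the parameter values are arbitrary) verifies that both null pairs in (8) are reproduced by parameter values allowed in model (6):

```python
import numpy as np

def mu(a, x, beta):
    # Working mean model (6): E(Y | A=a, X=x) = b0 + bA*a + bX*x.
    b0, bA, bX = beta
    return b0 + bA * a + bX * x

beta = (2.0, -1.5, 0.7)                          # arbitrary parameter value
beta_null_0 = (beta[0], 0.0, beta[2])            # coefficient of A set to 0
beta_null_1 = (beta[0] + beta[1], 0.0, beta[2])  # intercept absorbs bA

x = np.linspace(-3.0, 3.0, 50)
# Null pair (mu_0, mu_0): both arms' modeled means equal mu(0, x, beta).
assert np.allclose(mu(0, x, beta_null_0), mu(0, x, beta))
assert np.allclose(mu(1, x, beta_null_0), mu(0, x, beta))
# Null pair (mu_1, mu_1): both arms' modeled means equal mu(1, x, beta).
assert np.allclose(mu(0, x, beta_null_1), mu(1, x, beta))
assert np.allclose(mu(1, x, beta_null_1), mu(1, x, beta))
```

The two reparameterizations are exactly the *θ*′ and *θ*″ required by Criterion 1 for this model.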

As a second example, it is useful to consider a study measuring the outcome *Y* longitudinally, say at times *t* = 0, ..., *T*, yielding values *Y*_{t} respectively. For such an outcome, consider a multivariate normal model with means

*E*(*Y*_{t} | *A* = *a*) = *μ*_{a}(*t*, **β**) = *β*_{a,0} + *β*_{a,1}*t* + *β*_{a,2}*t*^{2}, (9)

and unknown variance covariance matrices var(*Y* | *A* = *a*) = Σ_{a}, where Σ_{a=0} and Σ_{a=1} are left unrestricted. In the representation of Section 2, each parameter value generates a pair of multivariate normal distributions with means and covariance matrices

[(*μ*_{0}(*t*, **β**), Σ_{a=0}), (*μ*_{1}(*t*, **β**), Σ_{a=1})]. (10)

Therefore, the null pairs [*μ*_{0}(*t*, **β**), *μ*_{0}(*t*, **β**)] and [*μ*_{1}(*t*, **β**), *μ*_{1}(*t*, **β**)], each with a common covariance matrix, correspond to the pairs

[(*μ*_{0}(*t*, **β**), Σ_{a=0}), (*μ*_{0}(*t*, **β**), Σ_{a=0})] and [(*μ*_{1}(*t*, **β**), Σ_{a=1}), (*μ*_{1}(*t*, **β**), Σ_{a=1})]. (11)

Because the parameters **β**_{a=0}, **β**_{a=1}, Σ_{a=0}, Σ_{a=1} are unrestricted, it follows that the last two pairs are also in the model, so Criterion 1 is satisfied.

## 5. Example: the DIADS trial

Although major depression is a significant cause of morbidity in patients with Alzheimer's disease (AD), reports concerning the treatment of such a condition are conflicting. Forty-four community-dwelling older adults who were diagnosed with probable AD and had experienced a major depressive episode were randomized to sertraline (*A* = 1) or placebo (*A* = 0) in the Depression in Alzheimer's Disease Study (DIADS). Details on inclusion and exclusion criteria, along with a more detailed description of the trial, are available in Lyketsos et al. (2003).

In order to assess the effect of sertraline on depression, we consider the Cornell Scale for Depression in Dementia (CSDD; Alexopoulos et al., 1988), which was measured at baseline (*t* = 0) and at *t* = 3, 6, 9, and 12 weeks after enrollment. The observed data are depicted in the left panel of Figure 2, where the thicker lines denote the observed means in each treatment arm.

The observed CSDD measurements (black for placebo; grey for sertraline arm) in the DIADS trial and the fitted means (dotted curves) from the nonparametric (MANCOVA, left) and parametric (quadratic, right) models.

We consider testing the null hypothesis *H*_{0} of (1) against the alternative hypothesis that the distributions are different, using two models. For both models, we estimate a common quantity, the difference in means between treatment and placebo at each time past baseline, i.e., *δ*_{t} = *E*(*Y*_{t} | *A* = 1) − *E*(*Y*_{t} | *A* = 0) for *t* = 3, 6, 9, 12. From each model's estimates of these contrasts we form a Wald-type statistic, denoted *W*^{nonpar} and *W*^{param} for the two models respectively, with variance estimated by the bootstrap.
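Comparing the vector of time-specific mean differences *δ*_{t} to zero is typically done with a Wald-type statistic W = δ′V⁻¹δ, referred to a chi-squared distribution. A minimal sketch follows; the numbers are hypothetical placeholders, not DIADS estimates:

```python
import numpy as np

def wald_statistic(delta_hat, V_hat):
    # W = delta' V^{-1} delta, compared with a chi-squared distribution
    # whose degrees of freedom equal the length of delta.
    delta_hat = np.asarray(delta_hat, dtype=float)
    return float(delta_hat @ np.linalg.solve(V_hat, delta_hat))

# Hypothetical mean differences at weeks 3, 6, 9, 12 and a (bootstrap)
# covariance estimate; V_hat = diagonal + 0.2 added to every entry,
# which keeps it positive definite.
delta_hat = np.array([-1.0, -2.0, -2.5, -3.0])
V_hat = np.diag([1.2, 1.5, 1.6, 1.8]) + 0.2
W = wald_statistic(delta_hat, V_hat)
```

With four time points, W would be compared to the chi-squared distribution with 4 degrees of freedom.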

The first model is the nonparametric version of the MANCOVA, in which we represent the mean Cornell scores *Y*_{t} at time *t* as

*E*(*Y*_{t} | *A* = *a*) = *β*_{a,t}, for *t* = 0, 3, 6, 9, 12, (12)

and unknown variance covariance matrices var(*Y* | *A* = *a*) = Σ_{a}, where Σ_{a=0} and Σ_{a=1} are left unrestricted.

Prior to DIADS, pilot studies had already suggested that the mean Cornell scores on sertraline show an initial benefit that then starts reaching a plateau (e.g., Mulsant et al., 2001). This suggests that the simple model in (9), which allows for a quadratic trajectory in time for the mean in each arm, could represent parsimoniously the DIADS trajectories for the time frame of 12 weeks. Moreover, because model (9) satisfies Criterion 1, we know that under the nonparametric *H*_{0} of (1), the limits of the MLEs of **β**_{a=0} and **β**_{a=1} are the same fixed vector, say **β***. Thus, under *H*_{0}, the estimated contrasts from this model converge to zero, and the test based on *W*^{param} should be nonparametrically valid.

From the theoretical part of the paper, we know that because this parametrically derived test satisfies Criterion 1, it should be nonparametrically valid under the null (1). Also, it should have better power than the nonparametric test to detect alternatives of diminishing drug benefit that are well described by the trajectories (9). We evaluated these two properties in the motivating study of DIADS.

First, in order to check that the tests are valid in data like those in DIADS, we estimated the type I error of the above two tests in the distribution that results from simulating 1,000 placebo and sertraline arms by sampling from the observed placebo arm only. This creates studies of the same size as the one we have, and enforces the null hypothesis with distribution equal to that of the observed placebo arm, which does not necessarily satisfy the parametric model (9). In this realistic example, the empirical type I error was 5% for both *W*^{nonpar} and *W*^{param}.
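The null-enforcing simulation can be sketched as follows. This is an illustration only: a simple two-sample z-test stands in for the statistics W, and the placebo data are simulated rather than the DIADS values.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimated_type1_error(placebo, n_treat, n_sim=1000, z_crit=1.96):
    # Build synthetic trials in which BOTH arms are resampled from the
    # observed placebo arm, so the null (1) holds by construction,
    # whatever the true placebo distribution is.
    n_plac = len(placebo)
    rejections = 0
    for _ in range(n_sim):
        y0 = rng.choice(placebo, size=n_plac, replace=True)
        y1 = rng.choice(placebo, size=n_treat, replace=True)
        diff = y1.mean() - y0.mean()
        se = np.sqrt(y0.var(ddof=1) / n_plac + y1.var(ddof=1) / n_treat)
        rejections += abs(diff / se) > z_crit
    return rejections / n_sim

placebo = rng.normal(10.0, 4.0, size=24)   # hypothetical placebo scores
err = estimated_type1_error(placebo, n_treat=20)
```

Because the resampling never uses the parametric model, an empirical error rate near the nominal level is evidence of nonparametric validity.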

Next, both models were fitted to the DIADS data and the fitted means are depicted as thick dashed lines in Figure 2. Estimates of the variance covariance matrices were obtained from 500 bootstrap samples. The significance levels (p-values) for a treatment effect were 0.10 for the nonparametrically derived test *W*^{nonpar} and 0.04 for the robust parametrically derived test *W*^{param}.

Finally, we compared the two tests in terms of power to detect the empirical effects seen in the study. Specifically, in order to assess power, a bootstrap within arms was used to resample 1000 datasets with the same number of individuals in each of the treatment arms as the observed DIADS trial. For each of these resampled datasets, the MANCOVA and quadratic models were fit and standard errors were estimated (via a further bootstrap of the resampled individuals). The power was then calculated as the proportion of times each model rejected the null hypothesis of no treatment effect. These simulations estimated the power to be 61% for the nonparametrically derived test *W*^{nonpar} and 69% for the robust parametrically derived test *W*^{param}.
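The within-arm bootstrap for power can be sketched in the same style; again a simple z-test stands in for the model-based statistics, and the arm data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def estimated_power(y0_obs, y1_obs, n_sim=1000, z_crit=1.96):
    # Resample each arm from its own observed data (within-arm bootstrap),
    # which preserves the empirical treatment effect, and record how
    # often the test rejects the null of no treatment effect.
    n0, n1 = len(y0_obs), len(y1_obs)
    rejections = 0
    for _ in range(n_sim):
        y0 = rng.choice(y0_obs, size=n0, replace=True)
        y1 = rng.choice(y1_obs, size=n1, replace=True)
        diff = y1.mean() - y0.mean()
        se = np.sqrt(y0.var(ddof=1) / n0 + y1.var(ddof=1) / n1)
        rejections += abs(diff / se) > z_crit
    return rejections / n_sim

# Hypothetical arms with a moderate separation in means:
y0_obs = rng.normal(10.0, 4.0, size=24)
y1_obs = rng.normal(7.0, 4.0, size=20)
power = estimated_power(y0_obs, y1_obs)
```

Unlike the placebo-only resampling above, this scheme keeps each arm's own distribution, so the rejection rate estimates power against the empirically observed effect.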

The power of both tests converges to 1 with increasing effects and increasing sample sizes. The effect size at the end of this study was relatively large (67%). Thus we expect that the relative gains in power between the two methods should be larger for smaller effect sizes and smaller for larger sample sizes. A more comprehensive study of power is of interest for further work.

## 6. Discussion

We have demonstrated that for testing the null hypothesis of equivalence between treatment arms, a wide class of parametric models provides testing with nonparametric validity. We provided a simple symmetry characterization of such classes, giving investigators an easy way to harness the efficiency of such parametric models while maintaining robustness properties traditionally considered reserved for nonparametric methods.

The work of Gail, Tan and Piantadosi (1988) also considered testing hypotheses using a class of misspecified generalized linear models. They calculated a score statistic from the residuals of a model fit omitting the treatment effect, and demonstrated that standardizing this statistic with the model-based variance from the misspecified model yields Type I error rates above the nominal level. Indeed, the validity of the test was restored when robust variance estimation based on permutation of residuals was employed. Although our tests are different from those of Gail, Tan and Piantadosi, their models are a special case of the models considered by Rosenblum and van der Laan (2009) and of those satisfying our Criterion 1. In this sense, the new result therefore extends the class of possibly misspecified models that can be used to derive nonparametrically valid tests.

One can also use permutation tests with the general models satisfying Criterion 1, now with any estimated contrast, say *ĉ*, between the two arms' distributions *p*_{0} and *p*_{1}. Specifically, one can easily find the permutation distribution of *ĉ* by calculating *ĉ* for a large number of permutations of the treatment labels, and then comparing the observed value of the estimated contrast to that reference distribution. If there is a true effect, though, this mixing of the two arms' data may yield a large variance in the reference distribution of the estimated contrast, leading to low power, which would be a tradeoff for using exact tests.
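Such a permutation test can be sketched as follows, with the difference in sample means standing in for a generic estimated contrast *ĉ*; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_pvalue(y, a, contrast, n_perm=2000):
    # Reference distribution of the estimated contrast under random
    # reassignment of the treatment labels; the +1 terms give the
    # standard finite-sample (Monte Carlo) permutation p-value.
    obs = contrast(y, a)
    count = sum(abs(contrast(y, rng.permutation(a))) >= abs(obs)
                for _ in range(n_perm))
    return (1 + count) / (1 + n_perm)

def mean_difference(y, a):
    return y[a == 1].mean() - y[a == 0].mean()

# Hypothetical data with a real treatment effect:
y = np.concatenate([rng.normal(0.0, 1.0, 30), rng.normal(1.2, 1.0, 30)])
a = np.repeat([0, 1], 30)
p = permutation_pvalue(y, a, mean_difference)
```

Any other contrast of the fitted arm-specific distributions could be passed in place of `mean_difference`, at the cost of refitting the model for each permutation.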

It is reasonable to surmise that, under our Criterion 1 and an adaptation of White's (1982) conditions A1-A6 to the permutation test setting, the Wald test statistic for treatment effect using a sandwich or bootstrap variance will be asymptotically equivalent under the null hypothesis to the Gail et al. score test statistic, and would have an asymptotic standard normal null distribution. A rigorous investigation of this issue is a potential topic for future work.

Although Criterion 1 is quite general, there are more general conditions that ensure model robustness. An example of such a condition is:

**Criterion 3:** Let (*p*_{0}, *p*_{1}) be any pair in *S*; then *S* also contains a null pair (*p*, *p*) at which the limit of the log-likelihood under the true null distribution is at least as large as at (*p*_{0}, *p*_{1}).

The same proof as for Criterion 1 is valid assuming the more general Criterion 3. This criterion is quite difficult to interpret, however, as it depends on the true distribution of the data. As such, it is of little practical import, but it illustrates the general nature of the robustness phenomenon.

Our results use the regularity conditions of White (1982). The conditions are similar in spirit to those ensuring the usual consistency and normality properties of the MLE, but are adapted to misspecified models with the assistance of the Kullback-Leibler distance. If these conditions are not met, there can indeed be multiple maximizers of the limiting log-likelihood. This can be addressed by defining the MLE (*p̂*_{0}, *p̂*_{1}) of interest in the study sample to be the maximizer that is closest to a null distribution in *S* in terms of the KL distance. Under the true null, we expect that even under considerably weaker conditions this MLE (*p̂*_{0}, *p̂*_{1}) will converge to a null distribution in *S*, although the more technical parts of this problem will be explored in future work.

The results from this paper may be easily extended beyond the case of maximum likelihood estimation. The validity of the test from Result 1 holds in more general estimating equation settings as long as the limit of the objective function is of the form of (15) (Appendix A) under the null hypothesis. Examples of such models include generalized estimating equations (Liang and Zeger, 1986), which are used routinely in the analysis of data from clinical trials.

It is also important to note the relation of our work to semiparametric methods that use covariates (e.g., Tsiatis et al. (2008)). Within a semiparametric model, say *S*_{semipar}, an efficient semiparametric estimator has the variance of the least favorable parametric submodel allowed in *S*_{semipar}. A parametric model that satisfies Criterion 1 and is close to correct can therefore yield tests with power exceeding that semiparametric bound while, by the results above, remaining valid under the null even when the model is incorrect.

Information from prior pilot studies or other scientific knowledge, although important, may not be critical for parametric-based procedures to be valid nonparametrically. This is suggested by work by Frangakis and Rubin (2001) and van der Laan et al. (2007), who examine how observed data from the study at hand can be used for choosing between a parametric-based versus a semi- or nonparametric-based estimator. To preserve nonparametric validity, these types of choice procedures are superefficient and not regular in the theoretical statistical sense, and require additional study.

We thank the Editor, Associate Editor and Reviewers for constructive comments, Michael Rosenblum for inspiring discussions, and the NIH (R01DA023879) for partial financial support. Russell Shinohara is supported by the Epidemiology and Biostatistics of Aging Training Grant T32AG000247 from the National Institute on Aging.

## Appendix A: Proof of Result 1

For a pair (*p*_{0}, *p*_{1}) of distributions allowed in the parametric model *S*, the log likelihood of a random sample of *i* = 1, ..., *n* individuals randomly assigned to either *A*_{i} = 0 or 1 is proportional to Σ_{i: *A*_{i}=0} log *p*_{0}(*Y*_{i}; *X*_{i}) + Σ_{i: *A*_{i}=1} log *p*_{1}(*Y*_{i}; *X*_{i}), and therefore proportional to

(*n*_{0}/*n*) {*n*_{0}^{−1} Σ_{i: *A*_{i}=0} log *p*_{0}(*Y*_{i}; *X*_{i})} + (*n*_{1}/*n*) {*n*_{1}^{−1} Σ_{i: *A*_{i}=1} log *p*_{1}(*Y*_{i}; *X*_{i})}, (13)

where *n*_{a} is the number of patients in treatment arm *a*. As *n* grows, *n*_{a}/*n* converges to the randomization probability *π*_{a} and, by the law of large numbers, (13) converges to

*π*_{0}*E*{log *p*_{0}(*Y*_{i}; *X*_{i}) | *A*_{i} = 0} + *π*_{1}*E*{log *p*_{1}(*Y*_{i}; *X*_{i}) | *A*_{i} = 1}. (14)

Assume now that the null hypothesis (1), that the true distributions are equal, holds; then the operations *E*(log(·) | *A*_{i} = 0) and *E*(log(·) | *A*_{i} = 1) are the same, and (14) becomes

*E*{*π*_{0} log *p*_{0} + *π*_{1} log *p*_{1}}, (15)

where we have omitted the arguments *Y*_{i}, *X*_{i} with no loss of generality.

Let us now assume Criterion 1 from the main section, and suppose that a maximizer of (15) is a non-null pair (*p*_{0}, *p*_{1}), i.e., with *p*_{0} ≠ *p*_{1}. Then there are two cases: (a) either *E*(log *p*_{0}) = *E*(log *p*_{1}), or (b) one of *E*(log *p*_{0}), *E*(log *p*_{1}) is larger. If (a) is true, then the null pair (*p*_{0}, *p*_{0}), which by Criterion 1 is also in the model, gives the same value of the functional (15) and so is also a maximizer (the same is true for the null pair (*p*_{1}, *p*_{1})). If (b) is true, then suppose *E*(log *p*_{0}) is the larger of the two. Then, we can see that the null pair (*p*_{0}, *p*_{0}) will actually give a value that is greater than the maximum, which would be a contradiction. So, (b) cannot be true, and so from (a) we know that the limit of the log likelihood (15) is maximized at a null pair of distributions in the model, say (*p**, *p**), which proves Result 1.
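The dichotomy used in this proof can be illustrated numerically: for any candidate pair, the better of its two null pairs attains a value of the limit (15) at least as large. A small sketch with discrete distributions (the probability values are arbitrary):

```python
import numpy as np

# True null distribution q (same in both arms) and two candidate model
# distributions p0, p1 on a three-point outcome space.
q  = np.array([0.2, 0.5, 0.3])
p0 = np.array([0.1, 0.6, 0.3])
p1 = np.array([0.4, 0.4, 0.2])
pi0, pi1 = 0.5, 0.5   # randomization probabilities

def limit_loglik(pa, pb):
    # Limit (15) of the log-likelihood under the true null q for the
    # candidate pair (pa, pb): pi0*E_q[log pa] + pi1*E_q[log pb].
    return pi0 * (q * np.log(pa)).sum() + pi1 * (q * np.log(pb)).sum()

mixed = limit_loglik(p0, p1)
best_null = max(limit_loglik(p0, p0), limit_loglik(p1, p1))
# The better of the two null pairs matches or beats the mixed pair,
# which is exactly the case (a)/(b) argument in the proof.
assert best_null >= mixed
```

Since (15) is a convex combination of E{log *p*_{0}} and E{log *p*_{1}}, it can never exceed the larger of the two, which is what the assertion checks.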

If, in addition, we have regularity conditions A1-A6 of White (1982), then we have that the null pair (*p**, *p**) is the unique maximizer of (15), and, with arguments analogous to White (1982), we get that the MLE of the contrast between *p*_{0} and *p*_{1} is asymptotically normal with mean 0.

First, let us consider the form of the mean function of a generalized linear model with the robustness property proposed by Rosenblum and van der Laan, that is,

*μ*_{a}(*x*, *β*) = *g*^{−1}{Σ_{j} *β*_{j} *g*_{j}(*a*, *x*)}, (16)

where *g* is the link function and each *g*_{j}(*a*, *x*) is equal to an *f*_{j}(*x*) or to *a* · *f*_{j}(*x*) for functions {*f*_{j}} of the covariates, with the property that whenever a treatment term *a* · *f*_{j}(*x*) is included in the model, the corresponding main effect *f*_{j}(*x*) is included as well.

Suppose (*μ*_{0}(·, *β*), *μ*_{1}(·, *β*)) is in *S*_{means}. The null pair (*μ*_{0}(·, *β*), *μ*_{0}(·, *β*)) is in *S*_{means} because it is obtained by setting to 0 the coefficients of the terms involving *a*. For the null pair (*μ*_{1}(·, *β*), *μ*_{1}(·, *β*)), note that for both arms *a* = 0, 1,

*μ*_{1}(*x*, *β*) = *μ*_{a}(*x*, *β**), (17)

where we can define *β** component-wise by inspection to match the definition of (16) (i.e., the component of *β** for each main effect *f*_{j}(*x*) is the sum of the original coefficients of *f*_{j}(*x*) and *a* · *f*_{j}(*x*), and the components of *β** for the treatment terms are 0). Hence the second null pair is also in *S*_{means}, and Criterion 2 holds.

## Supplementary Material

Web Appendices referenced in Section 4 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org.

Russell T. Shinohara, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA.

Constantine E. Frangakis, Department of Biostatistics, Johns Hopkins University, Baltimore, MD, USA.

Constantine G. Lyketsos, Department of Psychiatry, Johns Hopkins Bayview Hospital, Baltimore, MD, USA.

## References

- Alexopoulos G, Abrams R, Young R, Shamoian C. Cornell scale for depression in dementia. Biological Psychiatry. 1988;23:271–284. [PubMed]
- Diggle P, Heagerty P, Liang K, Zeger S. Analysis of longitudinal data. Oxford Statistical Science Series. 2003
- Frangakis C, Rubin D. Rejoinder to Discussions on Addressing an Idiosyncrasy in Estimating Survival Curves Using Double Sampling in the Presence of Self-Selected Right Censoring. Biometrics. 2001;57:351–353. [PubMed]
- Gail M, Tan W, Piantadosi S. Tests for no treatment effect in randomized clinical trials. Biometrika. 1988;75:57–64.
- Kullback S, Leibler R. On information and sufficiency. Annals of Mathematical Statistics. 1951;22:79–86.
- Liang K, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Lyketsos C, DelCampo L, Steinberg M, Miles Q, Steele C, Munro C, Baker A, Sheppard J-M, Frangakis C, Brandt K, Rabins P. Treating depression in Alzheimer disease: Efficacy and safety of sertraline therapy, and the benefits of depression reduction: The DIADS. Archives of General Psychiatry. 2003;60:737–746. [PubMed]
- McCullagh P, Nelder J. Generalized linear models. Chapman & Hall; 1991.
- Moore K, van der Laan M. Application of time-to-event methods in the assessment of safety in clinical trials. In Design and Analysis of Clinical Trials with Time-to-Event Endpoints. Chapman and Hall/CRC Biostatistics Series. 2009
- Mulsant B, Pollock B, Nebes R, Miller M, Sweet R, Stack J, Houck P, Bensasi S, Maxumdar S, Reynolds C. A twelve-week, double-blind, randomized comparison of nortriptyline and paroxetine in older depressed inpatients and outpatients. American Journal of Geriatric Psychiatry. 2001;9:406–414. [PubMed]
- Robins J. Optimal structural nested models for optimal sequential decisions. Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data. 2004
- Rosenblum M, van der Laan M. Using regression models to analyze randomized trials: Asymptotically valid hypothesis tests despite incorrectly specified models. Biometrics. 2009;65:937–945. [PMC free article] [PubMed]
- Tsiatis A, Davidian M, Zhang M, Lu X. Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: A principled yet flexible approach. Statistics in Medicine. 2008;27:4658–4677. [PMC free article] [PubMed]
- van der Laan M, Polley E, Hubbard A. Super learner. Statistical applications in genetics and molecular biology. 2007;6:25. [PubMed]
- Van der Vaart A. Asymptotic statistics. Cambridge University Press; Cambridge: 1998.
- White H. Maximum likelihood estimation of misspecified models. Econometrica. 1982;50:1–25.
