
Article sections

- Abstract
- 1. Introduction
- 2. Causal Effects of the Factors
- 3. 2^{k} Factorial Two-Stage Designs
- 4. 2^{k−m} Factorial Two-Stage Design
- 5. Examples
- 6. Simulation Results
- 7. Discussion
- References


J Am Stat Assoc. Author manuscript; available in PMC 2010 June 28.

Published in final edited form as:

J Am Stat Assoc. 2009 March 1; 104(458): 391–408.

doi: 10.1198/jasa.2009.0119

PMCID: PMC2892819

NIHMSID: NIHMS99566

S. A. Murphy, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, Email: samurphy@umich.edu;


Dynamic treatment regimes are time-varying treatments that individualize sequences of treatments to the patient. The construction of dynamic treatment regimes is challenging because a patient will be eligible for some treatment components only if he has not responded (or has responded) to other treatment components. In addition, there are usually a number of potentially useful treatment components and combinations thereof. In this article, we propose new methodology for identifying promising components and screening out negligible ones. First, we define causal factorial effects for treatment components that may be applied sequentially to a patient. Second, we propose experimental designs that can be used to study the treatment components. Surprisingly, modifications can be made to (fractional) factorial designs, more commonly found in the engineering statistics literature, for screening in this setting. Furthermore, we provide an analysis model that can be used to screen the factorial effects. We demonstrate the proposed methodology using examples motivated by the literature and also via a simulation study.

*Dynamic treatment regimes* (Robins, 1986, 1987; Lavori et al., 2000; Murphy et al., 2001) are time-varying treatments increasingly employed in the treatment and management of chronic, relapsing disorders such as drug dependence, HIV infection and depression (Brooner and Kidorf, 2002; Rush et al., 2003; Lisziewicz and Lori, 2002). These regimes individualize the treatment level and type via decision rules that input patient outcomes collected during treatment and output recommended treatment alterations. A two stage dynamic treatment regime specifies how to select the treatment combination by a sequence of decision rules, one per stage, {*d*_{1}, *d*_{2}} where the decision rule *d*_{1} takes the information available initially, say *O*_{1} and outputs stage 1 treatments *A*_{1} and the decision rule *d*_{2} takes the information available at the beginning of the stage 2, say {*O*_{1}, *A*_{1}, *O*_{2}}, and recommends stage 2 treatments, *A*_{2}. The time order is *O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}, *O*_{3} where *O*_{3} is observed at the end of stage 2. The primary outcome is *Y* = *f*(*O*_{1}, *A*_{1}, *O*_{2}, *A*_{2}, *O*_{3}) for *f* known.

Dynamic treatment regimes are also often multi-component (i.e., multiple factor) treatments: components may include not only the treatments for the primary disorder but also adjunctive treatments for co-occurring problems, behavioral therapies to improve adherence and delivery mechanisms. Presently dynamic treatment regimes are constructed via expert opinion and then evaluated in randomized two group trials. Observational studies are typically used to assist experts in constructing the dynamic treatment regime. Observational analyses attempt to infer causal relations from non-randomized comparisons and are subject to bias when the compared groups differ in ways other than by type of treatment (Rubin, 1974, 1978). While there have been great advances in the epidemiological, statistical and other literatures in developing methods that precisely specify the assumptions under which this bias may be eliminated, randomized studies, when possible, remain the optimal approach to reducing bias (Rubin, 1974, 1978). The field of experimental design provides an approach to using randomization in the construction of multi-component treatments.

The focus of this article is on screening experiments for two stage dynamic treatment regimes. Screening experiments are used to whittle down the large number of treatment components resulting in a few promising dynamic treatment regimes. The formulation and analysis of screening experiments for multi-component dynamic treatment regimes requires the combination of ideas from causal inference and factorial experimental design.

Unfortunately, in settings such as substance abuse, mental health and HIV research, the classical design and analysis of screening experiments *cannot* be directly imported. This is because some stage 2 factors are only relevant for patients who responded (or did not respond) to prior stage 1 treatment factors. For this reason, in Section 2 we carefully define factorial effects so that these effects have a causal interpretation. In Section 3 we provide 2^{k} experimental designs and associated analysis methods that can be used to screen these effects. In Section 4 we consider 2^{k−m} fractional factorial two-stage designs. Section 5 presents examples, Section 6 reports simulation results and Section 7 concludes with a discussion.

Screening factors often occurs via the assessment of factorial effects in ANOVA decompositions (see Box, Hunter and Hunter, 1978; Wu and Hamada, 2000; or Montgomery and Jennings, 2006) for the conditional mean of the primary outcome. In our setting special care is required in defining effects. Consider a simple case where there is only one stage 1 factor and two stage 2 factors. Each factor takes on levels in {−1, 1}. Suppose *N* subjects are each randomized equally to the two levels of the stage 1 factor *A*_{1}. Then an indicator of early response is observed; *R* is 1 if the subject is responding following assignment of *A*_{1} and 0 otherwise (there is no *O*_{1} and *O*_{2} = *R*). For simplicity, we refer to individuals with *R* = 1, 0 as responders and non-responders, respectively. At stage 2, responders are randomized equally between the {−1, 1} levels of the factor ${A}_{2}^{(1)}$, and non-responders are randomized equally between the {−1, 1} levels of the factor ${A}_{2}^{(0)}$ (the superscripts (1), (0) indicate stage 2 factors for responders and non-responders, respectively). At the end of the study the primary outcome *Y* (coded so that high values are preferable) is observed. Assuming subject observations are independent and identically distributed, a “factorial” decomposition is

$$\begin{array}{l}E[Y\mid {A}_{1},R,{A}_{2}^{(1)},{A}_{2}^{(0)}]={\eta}_{0}+{\eta}_{1}{A}_{1}+{\gamma}_{1}R+{\gamma}_{2}R{A}_{1}+\\ {\beta}_{1}R{A}_{2}^{(1)}+{\beta}_{2}R{A}_{1}{A}_{2}^{(1)}+{\alpha}_{1}(1-R){A}_{2}^{(0)}+{\alpha}_{2}(1-R){A}_{1}{A}_{2}^{(0)}\end{array}$$

(1)

with the interpretation that the coefficients {*η*_{1}, *β*_{1}, *β*_{2}, *α*_{1}, *α*_{2}} represent factorial effects. *R* is included in the above decomposition since ${A}_{2}^{(1)}$ can only be assigned to subjects with *R* = 1 and similarly for ${A}_{2}^{(0)}$.
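As a concrete illustration of fitting the decomposition (1), the sketch below simulates a two-stage trial of this form and recovers the coefficients by least squares. All generative values (response rates, effect sizes) are hypothetical and chosen for illustration only; this is not one of the article's examples.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical generative model for the simple trial described above.
A1 = rng.choice([-1, 1], size=N)              # stage 1 factor
R = rng.binomial(1, 0.5 + 0.1 * A1)           # early response indicator
A2r = rng.choice([-1, 1], size=N)             # stage 2 factor, responders
A2nr = rng.choice([-1, 1], size=N)            # stage 2 factor, non-responders
Y = (1.0 + 0.3 * A1 + 0.5 * R * A2r
     + 0.2 * (1 - R) * A2nr + rng.normal(size=N))

# Design matrix matching decomposition (1):
# 1, A1, R, R*A1, R*A2r, R*A1*A2r, (1-R)*A2nr, (1-R)*A1*A2nr.
X = np.column_stack([
    np.ones(N), A1, R, R * A1,
    R * A2r, R * A1 * A2r,
    (1 - R) * A2nr, (1 - R) * A1 * A2nr,
])
eta0, eta1, gam1, gam2, beta1, beta2, alpha1, alpha2 = np.linalg.lstsq(
    X, Y, rcond=None)[0]
```

With this confounding-free generative model the fit recovers the generative coefficients; the bias discussed next arises only once *R* and *Y* share a common cause.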

If we interpret the coefficients {*η*_{1}, *β*_{1}, *β*_{2}, *α*_{1}, *α*_{2}} as factorial effects then to screen the factors *A*_{1}, ${A}_{2}^{(1)},{A}_{2}^{(0)}$ we will base our inference on these coefficients. However, this can lead to erroneous conclusions concerning the usefulness of *A*_{1}, via *η*_{1}, for at least two reasons. First, in the medical/behavioral fields there is a plethora of both known and unknown common causes of *R* and *Y* (hence *R* is prognostic for *Y*). Consider the following example. Let *U*, a Bernoulli random variable with success probability ${\scriptstyle \frac{1}{2}}$, represent an unknown common cause of both the outcome *Y* and the early response *R*; *U* might be an unknown genetic factor. Suppose that *Y* = *δ*_{0} + *δ*_{1}*U* + *ε* where *ε* (mean zero, finite variance) is independent of (*U*, *R*, *A*_{1}, ${A}_{2}^{(0)},{A}_{2}^{(1)}$). Thus there is *no* effect of the stage 1 factor *A*_{1} or of the stage 2 factors (${A}_{2}^{(0)},{A}_{2}^{(1)}$) on the outcome *Y*. Next suppose that $P[R=1\mid U,{A}_{1}]=U\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{{q}_{1}+{q}_{2}}{2}}+{\scriptstyle \frac{{q}_{1}-{q}_{2}}{2}}{A}_{1}]+(1-U)\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{{q}_{3}+{q}_{4}}{2}}+{\scriptstyle \frac{{q}_{3}-{q}_{4}}{2}}{A}_{1}]$ where each *q*_{j} ∈ [0, 1]. Note that

$$\begin{array}{l}E[Y\mid {A}_{1},R=0,{A}_{2}^{(0)},{A}_{2}^{(1)}]={\delta}_{0}+{\delta}_{1}E[U\mid {A}_{1},R=0]\\ ={\delta}_{0}+\frac{{\delta}_{1}}{2}\left(\frac{1-{q}_{1}}{2-{q}_{1}-{q}_{3}}+\frac{1-{q}_{2}}{2-{q}_{2}-{q}_{4}}\right)\\ +\frac{{\delta}_{1}}{2}\left(\frac{1-{q}_{1}}{2-{q}_{1}-{q}_{3}}-\frac{1-{q}_{2}}{2-{q}_{2}-{q}_{4}}\right){A}_{1}\end{array}$$

Thus ${\eta}_{1}={\scriptstyle \frac{{\delta}_{1}}{2}}\left({\scriptstyle \frac{1-{q}_{1}}{2-{q}_{1}-{q}_{3}}}-{\scriptstyle \frac{1-{q}_{2}}{2-{q}_{2}-{q}_{4}}}\right)$, which may differ from the true effect of zero. That is, *η*_{1} in (1) reflects biases that occur because we are conditioning on *R*, which is both an outcome of *A*_{1} and a prognostic variable for *Y*. Further discussion of this bias can be found in Robins and Greenland (1994) and Robins and Wasserman (1997).
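This bias is easy to reproduce numerically. In the sketch below (hypothetical values for *δ*_{0}, *δ*_{1} and *q*_{1}, …, *q*_{4}), *A*_{1} has no effect on *Y*, yet regressing *Y* on *A*_{1} among non-responders yields a distinctly nonzero *η*_{1} matching the analytic expression above.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 400_000
d0, d1 = 1.0, 2.0                      # delta_0, delta_1 (hypothetical)
q1, q2, q3, q4 = 0.8, 0.4, 0.3, 0.6    # hypothetical response probabilities

U = rng.binomial(1, 0.5, size=N)       # unobserved common cause of R and Y
A1 = rng.choice([-1, 1], size=N)
pR = (U * ((q1 + q2) / 2 + (q1 - q2) / 2 * A1)
      + (1 - U) * ((q3 + q4) / 2 + (q3 - q4) / 2 * A1))
R = rng.binomial(1, pR)
Y = d0 + d1 * U + rng.normal(size=N)   # A1 has NO effect on Y

# Among non-responders, the coefficient on A1 is the eta_1 of model (1).
m = R == 0
X = np.column_stack([np.ones(m.sum()), A1[m]])
eta1_hat = np.linalg.lstsq(X, Y[m], rcond=None)[0][1]

# Analytic value from the display above: nonzero despite the null effect.
eta1 = d1 / 2 * ((1 - q1) / (2 - q1 - q3) - (1 - q2) / (2 - q2 - q4))
```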

Second, even if there are no unknown common causes of *R* and *Y*, *η*_{1} does not reflect the overall impact of *A*_{1}, as *A*_{1} may impact *Y* at least partially through its effect on *R* (Parmigiani, 2002; Murphy, 2005). For both reasons, clinical trial guidelines (ICH E9, 1999) warn against conditioning on outcomes such as side effects or other post-randomization covariates when comparing one treatment to another.

These two issues preclude the use of (1) for screening. To our knowledge the former issue does not arise in the traditional experimental design literature, even in clinical trials with multiple stages. For example, adaptive experimental designs (Hu and Rosenberger, 2006; Zacks, 1996; Patel, 1962) also involve multiple stages of (sequential) decision making, but each stage involves different subjects; that is, “sequential” there means decisions on a sequence of subjects, not sequential decisions for each subject. Since subjects can generally be assumed to respond independently (given membership in the same population of subjects), and the stages involve different subjects, the above issues do not occur.

In this section a model in terms of factorial effects for each fixed pattern of factor levels is developed; in the next subsection randomization of the factor levels is incorporated. This model will be in terms of potential outcomes (Rubin, 1978; Robins, 1986, 1987); the parameters in the model will be the causal effects. The idea is that each subject has multiple potential outcomes, one per factor level combination. However only one of the potential outcomes, that is, the outcome associated with the subject’s assigned factor levels is observed; the remaining potential outcomes are missing. The potential outcome framework facilitates a precise definition of the causal factorial effects; as will be seen this is somewhat nuanced for stage 1 factorial effects. Also this framework facilitates the statement of the conditions under which estimators of the causal effects of a factor can be obtained.

Suppose there are *p*_{1} stage 1 factors, *p*_{12} stage 2 factors for responders and *p*_{02} stage 2 factors for non-responders. As is generally the case in screening, we assume each factor has two levels (see Section 3 for discussion), coded by values ±1. Let *a*_{1} denote a vector of stage 1 factor levels, ${a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$ denote vectors of stage 2 factor levels for responders and ${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime}$ denote vectors of stage 2 factor levels for non-responders. Let ${R}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\in \{0,1\}$ and ${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ denote the stage 1 response and the primary outcome, respectively, that would be observed if the subject were assigned the dynamic treatment regime: “provide treatment *a*_{1} in stage 1 and then, if an early response occurs, provide ${a}_{2}^{(1)}$, otherwise provide ${a}_{2}^{(0)}$.” Note that there are 2^{p_1+p_{02}+p_{12}} different dynamic treatment regimes; thus, there are 2^{p_1+p_{02}+p_{12}} potential stage 1 outcomes and 2^{p_1+p_{02}+p_{12}} potential primary outcomes for each subject.

Throughout we make two assumptions that hold if the stage 2 treatment is not revealed to the subject/clinical staff until after the occurrence of the stage 1 outcome. These assumptions will enable us to reduce the number of groups of subjects in the experimental design (see the next section).

Assumption I. Assume that ${R}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}={R}_{{a}_{1},{{a}_{2}^{(0)}}^{\prime},{{a}_{2}^{(1)}}^{\prime}}$ for all {*a*_{1}, ${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$}.

In words, this is the assumption that a subject’s stage 1 outcome remains the same regardless of the stage 2 treatment assignments. Below we use the notation ${R}_{{a}_{1}}$ for this common value.

Assumption II. Assume that ${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}{R}_{{a}_{1}}={Y}_{{a}_{1},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)}}{R}_{{a}_{1}}$ and ${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}(1-{R}_{{a}_{1}})={Y}_{{a}_{1},{a}_{2}^{(0)},{{a}_{2}^{(1)}}^{\prime}}(1-{R}_{{a}_{1}})$ for all {*a*_{1}, ${a}_{2}^{(0)},{{a}_{2}^{(0)}}^{\prime},{a}_{2}^{(1)},{{a}_{2}^{(1)}}^{\prime}$}.

For example, this assumption states that the primary outcome of a subject who does not respond in stage 1 does not depend on what this subject would have been assigned had he/she responded in stage 1. These assumptions may be violated if, as part of the protocol, subjects are informed in stage 1 of which stage 2 treatment they will be offered if they do not respond at stage 1. It may happen that subjects, who know that they will be assigned a rather burdensome stage 2 treatment upon non-response, will try to proactively avoid this burdensome treatment by adhering more faithfully to their assigned stage 1 treatment than would otherwise be the case.

In the sequel, stage 1 effects refer to all effects involving *only* stage 1 factors and the stage 2 effects refer to all effects involving *at least* one stage 2 factor. The stage 2 causal effects are defined via a saturated linear regression model for

$$E[{Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\mid {R}_{{a}_{1}}]$$

(2)

in *a*_{1}, ${a}_{2}^{(1)}{R}_{{a}_{1}},{a}_{2}^{(0)}(1-{R}_{{a}_{1}})$ and their higher order interactions. For example, suppose that there is one stage 1 factor and two stage 2 factors (*p*_{1} = *p*_{02} = *p*_{12} = 1); then, similar to (1), we have

$$\begin{array}{r}E[{Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}\mid {R}_{{a}_{1}}]={\eta}_{0}+{\eta}_{1}{a}_{1}+{\gamma}_{0}{R}_{{a}_{1}}+{\gamma}_{1}{R}_{{a}_{1}}{a}_{1}+{\beta}_{1}{R}_{{a}_{1}}{a}_{2}^{(1)}+{\beta}_{2}{R}_{{a}_{1}}{a}_{1}{a}_{2}^{(1)}+\\ {\alpha}_{1}(1-{R}_{{a}_{1}}){a}_{2}^{(0)}+{\alpha}_{2}(1-{R}_{{a}_{1}}){a}_{1}{a}_{2}^{(0)}.\end{array}$$

(3)

where the *β*_{j} (*α*_{j}) are the stage 2 causal effects for responders (non-responders, respectively). For example,

$$\begin{array}{l}2{\beta}_{1}=\frac{1}{2}\left(E[{Y}_{1,{a}_{2}^{(0)},1}\mid {R}_{1}=1]-E[{Y}_{1,{a}_{2}^{(0)},-1}\mid {R}_{1}=1]\right)\\ +\frac{1}{2}\left(E[{Y}_{-1,{a}_{2}^{(0)},1}\mid {R}_{-1}=1]-E[{Y}_{-1,{a}_{2}^{(0)},-1}\mid {R}_{-1}=1]\right).\end{array}$$

Recall that under the ignorability assumption II, the quantities above do not depend on the value of ${a}_{2}^{(0)}$. The coefficients in the above linear regression are 1/2 the size of the usual factorial effects (Wu and Hamada, 2000, pg. 113). (Note there is some discrepancy in the literature in how factorial effects are defined; for example Byar and Piantadosi’s (1985) definition of a main effect is one-half the size of Wu and Hamada’s definition). Throughout this work we use the regression formulation and refer to the regression coefficients as factorial effects.
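A minimal numeric check of this coefficient convention (the data here are illustrative, not taken from the article):

```python
import numpy as np

# Four runs of a single two-level factor, +/-1 coded (illustrative data).
A = np.array([-1, -1, 1, 1])
Y = np.array([2.0, 4.0, 9.0, 11.0])

# Classical main effect in the Wu & Hamada sense: difference of level means.
main_effect = Y[A == 1].mean() - Y[A == -1].mean()   # 10 - 3 = 7

# Regression coefficient under +/-1 coding is half of that difference.
X = np.column_stack([np.ones(len(A)), A])
b = np.linalg.lstsq(X, Y, rcond=None)[0][1]          # 3.5
```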

As discussed at the beginning of this section, the usual definition of the stage 1 effect (for example, via *η*_{1} in (3)) can lead to erroneous conclusions. To avoid this problem we do not condition on ${R}_{{a}_{1}}$; instead, the stage 1 causal effects are defined via a saturated linear regression model for

$$E\left[{\left(\frac{1}{2}\right)}^{{p}_{12}}\sum _{{a}_{2}^{(1)}}{Y}_{{a}_{1},{a}_{2}^{(1)}}{R}_{{a}_{1}}+{\left(\frac{1}{2}\right)}^{{p}_{02}}\sum _{{a}_{2}^{(0)}}{Y}_{{a}_{1},{a}_{2}^{(0)}}(1-{R}_{{a}_{1}})\right]$$

(4)

in *a*_{1}. Again suppose that *p*_{1} = *p*_{02} = *p*_{12} = 1 then the linear regression is simple:

$$E\left[\frac{1}{2}\sum _{{a}_{2}^{(1)}}{Y}_{{a}_{1},{a}_{2}^{(1)}}{R}_{{a}_{1}}+\frac{1}{2}\sum _{{a}_{2}^{(0)}}{Y}_{{a}_{1},{a}_{2}^{(0)}}(1-{R}_{{a}_{1}})\right]={\phi}_{1}+{\phi}_{2}{a}_{1}.$$

Solving for *φ*_{2} we obtain,

$$\begin{array}{l}2{\phi}_{2}=\frac{1}{2}E\left[\sum _{{a}_{2}^{(1)}}{Y}_{1,{a}_{2}^{(1)}}{R}_{1}-\sum _{{a}_{2}^{(1)}}{Y}_{-1,{a}_{2}^{(1)}}{R}_{-1}\right]\\ +\frac{1}{2}E\left[\sum _{{a}_{2}^{(0)}}{Y}_{1,{a}_{2}^{(0)}}(1-{R}_{1})-\sum _{{a}_{2}^{(0)}}{Y}_{-1,{a}_{2}^{(0)}}(1-{R}_{-1})\right];\end{array}$$

*φ*_{2} is the causal main effect of *A*_{1}.

In general the causal main effect of one of the *p*_{1} stage 1 factors is the average of contrasts between two means of *Y* one for each level of the stage 1 factor, marginal over the distribution of *R* (and all other intermediate outcomes in *O*_{2} as well); the average is over the remaining factor levels (with all other factors taking levels ±1 with equal probability). The averaging/marginalization with respect to a uniform distribution over the other factor levels is consistent with the definition of factorial effects in experimental design (Wu and Hamada, 2000; Byar and Piantadosi, 1985). Note the definition of the stage 2 treatment effects also involves marginalization (except the stage 2 treatments are nested within the outcome *R* and thus the contrasts are between means conditional on *R*).

Suppose we have experimental data in which the factors are randomized according to some distribution on {−1, 1}^{p_1+p_{02}+p_{12}}. The following provides definitions of the factorial effects in terms of the resulting data distribution. Let *A*_{1} denote the random vector of *p*_{1} stage 1 factor levels, ${A}_{2}^{(1)}$ denote the random vector of *p*_{12} stage 2 factor levels for responders and ${A}_{2}^{(0)}$ denote the random vector of *p*_{02} stage 2 factor levels for non-responders. Lower case letters denote realizations of these random vectors. In this case {*A*_{1}, ${A}_{2}^{(0)},{A}_{2}^{(1)}$} is clearly independent of the collection {${R}_{{a}_{1}}$, ${Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ : *a*_{1} ∈ {−1, 1}^{p_1}, ${a}_{2}^{(0)}\in {\{-1,1\}}^{{p}_{02}},{a}_{2}^{(1)}\in {\{-1,1\}}^{{p}_{12}}$}. On each subject we observe *A*_{1}, *R*, ${A}_{2}^{(0)},{A}_{2}^{(1)}$, *Y* (when *R* = 1, the factor levels given by ${A}_{2}^{(1)}$ are assigned and when *R* = 0, the factor levels given by ${A}_{2}^{(0)}$ are assigned). We connect the potential outcomes to the observed data via the consistency assumption (Robins, 1997): assume that for all values of {*a*_{1}, ${a}_{2}^{(0)},{a}_{2}^{(1)}$}, if *A*_{1} = *a*_{1}, ${A}_{2}^{(0)}={a}_{2}^{(0)},{A}_{2}^{(1)}={a}_{2}^{(1)}$, then $Y={Y}_{{a}_{1},{a}_{2}^{(0)},{a}_{2}^{(1)}}$ and *R* = ${R}_{{a}_{1}}$.

The randomization of the factor levels plus the consistency and ignorability assumptions imply that *E*[${R}_{{a}_{1}}$] = *E*[*R* | *A*_{1} = *a*_{1}] and that the conditional means of the potential outcomes equal the corresponding conditional means computed from the observed data; these identities are used below.

The discussion above implies that in the simple setting where there is only one stage 1 factor and two stage 2 factors, the *β* (*α*) parameters in (1) represent the stage 2 causal effects for the responders (non-responders, respectively). A similar statement cannot be made for the stage 1 effects. Instead recall that the stage 1 causal effects are given by a saturated linear model for the mean in (4). This mean can be rewritten as

$${\left(\frac{1}{2}\right)}^{{p}_{12}}\sum _{{a}_{2}^{(1)}}E\left[{Y}_{{a}_{1},{a}_{2}^{(1)}}\mid {R}_{{a}_{1}}=1\right]E[{R}_{{a}_{1}}]+{\left(\frac{1}{2}\right)}^{{p}_{02}}\sum _{{a}_{2}^{(0)}}E\left[{Y}_{{a}_{1},{a}_{2}^{(0)}}\mid {R}_{{a}_{1}}=0\right]E[1-{R}_{{a}_{1}}].$$

Again using randomization and the consistency assumption this mean can be reexpressed as

$$\begin{array}{l}{\left(\frac{1}{2}\right)}^{{p}_{12}}\sum _{{a}_{2}^{(1)}}E\left[Y\mid {A}_{1}={a}_{1},R=1,{A}_{2}^{(1)}={a}_{2}^{(1)}\right]E[R\mid {A}_{1}={a}_{1}]\\ +{\left(\frac{1}{2}\right)}^{{p}_{02}}\sum _{{a}_{2}^{(0)}}E\left[Y\mid {A}_{1}={a}_{1},R=0,{A}_{2}^{(0)}={a}_{2}^{(0)}\right]\phantom{\rule{0.16667em}{0ex}}(1-E[R\mid {A}_{1}={a}_{1}]).\end{array}$$

(5)

Thus the stage 1 causal effects are given by the coefficients in a saturated linear model for the above sum. The formula (5) is a version of Robins’ “G-computation” formula (Robins, 1986); this formula frequently arises in causal inference for time-varying treatments. A more intuitive definition of the stage 1 effects is obtained if the stage 2 factors are independently randomized according to discrete uniform distributions on {−1, 1}. Then (5) simplifies to *E*[*Y* | *A*_{1} = *a*_{1}]; a saturated linear model for this conditional mean provides the definition of the stage 1 effects in this special setting.
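A plug-in estimator of the G-computation formula (5) can be sketched as follows; the generative model and all its values are hypothetical. Because the stage 2 factors are uniformly randomized here, the plug-in estimate of the stage 1 main effect agrees with the simple contrast based on *E*[*Y* | *A*_{1} = ±1].

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000

# Hypothetical generative model (values chosen for illustration only).
A1 = rng.choice([-1, 1], size=N)
R = rng.binomial(1, 0.5 + 0.2 * A1)
A2r = rng.choice([-1, 1], size=N)    # assigned when R == 1
A2nr = rng.choice([-1, 1], size=N)   # assigned when R == 0
Y = 0.4 * A1 + 0.6 * R * A2r + 0.3 * (1 - R) * A2nr + rng.normal(size=N)

def g_formula(a1):
    """Plug-in version of (5): average the stratum means over the uniformly
    weighted stage 2 levels, then weight by the estimated response rate."""
    s = A1 == a1
    p = R[s].mean()
    resp = sum(0.5 * Y[s & (R == 1) & (A2r == a2)].mean() for a2 in (-1, 1))
    nonr = sum(0.5 * Y[s & (R == 0) & (A2nr == a2)].mean() for a2 in (-1, 1))
    return p * resp + (1 - p) * nonr

# Stage 1 causal main effect: half the contrast between a1 = +1 and a1 = -1.
phi2_hat = (g_formula(1) - g_formula(-1)) / 2

# With uniformly randomized stage 2 factors, (5) reduces to E[Y | A1 = a1].
simple = (Y[A1 == 1].mean() - Y[A1 == -1].mean()) / 2
```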

Somewhat surprisingly the stage 1 and stage 2 causal effects are terms in one model for the conditional mean. Define

$$\begin{array}{l}{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]={\left(\frac{1}{2}\right)}^{{p}_{12}}\sum _{{a}_{2}^{(1)}}E\left[Y\mid {A}_{1},R=1,{A}_{2}^{(1)}={a}_{2}^{(1)}\right]R\\ +{\left(\frac{1}{2}\right)}^{{p}_{02}}\sum _{{a}_{2}^{(0)}}E\left[Y\mid {A}_{1},R=0,{A}_{2}^{(0)}={a}_{2}^{(0)}\right](1-R).\end{array}$$

This definition is used to make the next formulae concise; note (5) is equal to $E\left[{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]\mid {A}_{1}={a}_{1}\right]$. Consider the telescoping sum for the conditional mean, $E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]$:

$$\left(E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]-{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]\right)$$

(6)

$$\begin{array}{c}+\left({E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]-E\phantom{\rule{0.16667em}{0ex}}[{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]\mid {A}_{1}]\right)\\ +E\left[{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R]\mid {A}_{1}\right].\end{array}$$

(7)

The first term (6) contains the stage 2 causal effects. The last term (7) is just another way to write (5) and thus contains all stage 1 causal effects. The middle term is composed entirely of nuisance parameters and can be rewritten as *R* − *E*[*R*|*A*_{1}] times

$${E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R=1]-{E}_{{A}_{2}\sim U}[E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]\mid {A}_{1},R=0].$$

(8)

Thus in the example of one stage 1 factor and two stage 2 factors (one for responders, the other for non-responders), we replace the linear regression in (1) by

$$\begin{array}{l}E[Y\mid {A}_{1},R,{A}_{2}^{(R)}]={\phi}_{1}+{\phi}_{2}{A}_{1}\\ +(R-E[R\mid {A}_{1}])\phantom{\rule{0.16667em}{0ex}}({\psi}_{1}+{\psi}_{2}{A}_{1})\\ +{\beta}_{1}R{A}_{2}^{(1)}+{\beta}_{2}R{A}_{2}^{(1)}{A}_{1}+{\alpha}_{1}(1-R){A}_{2}^{(0)}+{\alpha}_{2}(1-R){A}_{2}^{(0)}{A}_{1},\end{array}$$

where the first row corresponds to (7), the second row to (8) times (*R* − *E*[*R*|*A*_{1}]) and the last row to (6). Note that even though we call the parameters *ψ*_{1}, *ψ*_{2} and the stage 1 response rate *E*[*R*|*A*_{1}] nuisance parameters, these parameters may be of independent interest. This is certainly the case for the stage 1 response rate; additionally, *ψ*_{1}, *ψ*_{2} reflect the ability of the stage 1 response to predict the primary outcome. Note that the nuisance parameters *ψ*_{1}, *ψ*_{2} would be absent from the above formula only if the residual *R* − *E*[*R*|*A*_{1}] were identically zero, that is, only if the stage 1 response *R* were a deterministic function of *A*_{1}.

Our goal is to screen out inactive factors. Usually we consider two levels for each factor. The two levels should be selected to be sufficiently disparate that an effect can be detected, yet not so disparate as to be unethical. In settings where there is likely to be a downturn in the primary outcome at high levels of the factor, we select the higher level small enough so that the downturn is thought to be insufficient to eliminate the effect. If necessary, subsequent trials can be used to more fully explore the dose-response relationship.

Suppose that there are two stage 1 factors, *A*_{1} = {*A*_{11}, *A*_{12}}, one stage 2 factor for responders, ${A}_{2}^{(1)}$, and one for non-responders, ${A}_{2}^{(0)}$. Consider the experimental design in Table 1. Each row in the design of Table 1 provides the factor levels assigned to a group of subjects. The column labeled ${A}_{2}^{(1)}={A}_{2}^{(0)}$ designates the identical factor level settings for ${A}_{2}^{(1)}$ and ${A}_{2}^{(0)}$. Note that once responder (non-responder) status is known for each subject, we can view the design as two 2^{3} full factorial designs, one for responders to stage 1 treatment and the other for non-responders to stage 1 treatment.

Each row of the design corresponds to a group of subjects, all of whom are assigned the same dynamic treatment regime. For example, the subjects in the first group are assigned the dynamic treatment regime: “Provide *A*_{11} = +1, *A*_{12} = +1; if they respond, assign ${A}_{2}^{(1)}=+1$ in stage 2 and if they do not respond, assign ${A}_{2}^{(0)}=+1$ in stage 2” (note that a dynamic treatment regime must specify the stage 2 treatments for both the responders and the non-responders). For convenience we use the term “stacked” to indicate situations in which the factor level settings of two factors are identical across all rows of the experimental design; factors ${A}_{2}^{(1)}$ and ${A}_{2}^{(0)}$ are stacked. The stacking is advantageous because investigators only have to implement a 2^{3} design, yet under the ignorability assumptions we have information on all 2^{4} dynamic treatment regimes. For example, even though no row of the design corresponds to “assign *A*_{11} = +1, *A*_{12} = +1 and if response assign ${A}_{2}^{(1)}=+1$ but if non-response assign ${A}_{2}^{(0)}=-1$,” we may combine the responders in row 1 and the non-responders in row 2 to produce a new group of subjects assigned this dynamic treatment regime.
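The stacking and combining-of-rows argument can be checked mechanically. The sketch below builds a hypothetical 2^{3} stacked design analogous to Table 1, verifies the balance property, and confirms that responder/non-responder combinations of rows cover all 2^{4} regimes.

```python
from itertools import product

# Hypothetical 2^3 stacked design analogous to Table 1: columns are
# (A11, A12, A2(1), A2(0)), with A2(1) and A2(0) "stacked" (always equal).
rows = [(a11, a12, a2, a2) for a11, a12, a2 in product((1, -1), repeat=3)]

# Balance: each factor column is at +1 in exactly half of the rows.
for col in range(4):
    assert sum(r[col] == 1 for r in rows) == len(rows) // 2

# Combining the responders of one row with the non-responders of another row
# sharing the same stage 1 levels yields every regime (a11, a12, a2r, a2nr).
covered = {(a11, a12, a2r, b2nr)
           for (a11, a12, a2r, _d) in rows
           for (b11, b12, _e, b2nr) in rows
           if (a11, a12) == (b11, b12)}
```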

As in Table 1, all designs considered here possess the property that for each factor, half of the rows in the design are set at the +1 level and half are set at the −1 level. This property will play a crucial role in ascertaining the aliasing between effects. Note, however, that due to the difficulties in recruiting subjects inherent in clinical trials, the groups of subjects may not be exactly equal in size; moreover, even if there are equal numbers of subjects in each group, the number of subjects assigned a particular level of ${A}_{2}^{(1)},{A}_{2}^{(0)}$ depends on the response rate.

As before, suppose there are *p*_{1} stage 1 factors, *p*_{12} stage 2 factors for responders and *p*_{02} stage 2 factors for non-responders. Here we consider screening the stage 1 and stage 2 effects using data from a 2^{k} factorial two-stage experiment in which the stage 2 factors for responders are stacked with those for non-responders (so that *p*_{12} = *p*_{02} and *k* = *p*_{1} + *p*_{12}).

Define **X**_{1} as the random vector composed of a 1, all stage 1 factors and their two-way and higher order products (2^{p_1} terms), and ${\mathbf{X}}_{2}^{(1)}$ as the random vector composed of all stage 2 factors for responders, their two-way and higher order products, and all products of these with members of **X**_{1} (2^{p_1+p_{12}} − 2^{p_1} terms); ${\mathbf{X}}_{2}^{(0)}$ is defined similarly. The decomposition in (6)–(8) can be written via the linear model

$$E[Y\mid {A}_{1},R,{A}_{2}]={\mathbf{X}}_{1}^{T}\phi +(R-p({\mathbf{X}}_{1})){\mathbf{X}}_{1}^{T}\psi +R{\mathbf{X}}_{2}^{{(1)}^{T}}\beta +(1-R){\mathbf{X}}_{2}^{{(0)}^{T}}\alpha $$

(9)

where *β* and *α* are the vectors of all stage 2 causal effects for responders and non-responders, respectively, *ψ* is the vector of nuisance effects, *φ* is the vector of stage 1 causal effects and *p*(**X**_{1}) = *E*[*R*|*A*_{1}] (note **X**_{1} is a function of *A*_{1}).

Recall that a parameter is identifiable if different values of the parameter lead to different distributions of {*A*_{1}, *R*, *A*_{2}, *Y*} (van der Vaart, 1998). Identifiability in this setting is straightforward.

*Lemma 1*. Assume that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. The parameters *φ*, *ψ*, *β* and *α* in (9) are identifiable.

The above lemma is a direct consequence of two facts. First, the conditional means $E[R\mid {A}_{1}={a}_{1}],E[Y\mid {A}_{1}={a}_{1},R=1,{A}_{2}^{(1)}={a}_{2}^{(1)}]$ and $E[Y\mid {A}_{1}={a}_{1},R=0,{A}_{2}^{(0)}={a}_{2}^{(0)}]$, for all values of {*a*_{1}, ${a}_{2}^{(1)},{a}_{2}^{(0)}$}, are identifiable. Second, the expectation of ${\left[{\mathbf{X}}_{1}^{T},\phantom{\rule{0.38889em}{0ex}}(R-p({\mathbf{X}}_{1})){\mathbf{X}}_{1}^{T},\phantom{\rule{0.38889em}{0ex}}R{\mathbf{X}}_{2}^{{(1)}^{T}},\phantom{\rule{0.38889em}{0ex}}(1-R){\mathbf{X}}_{2}^{{(0)}^{T}}\right]}^{T}$ times its transpose is block diagonal with blocks $E[{\mathbf{X}}_{1}{\mathbf{X}}_{1}^{T}],\phantom{\rule{0.38889em}{0ex}}E[{\mathbf{X}}_{1}p({\mathbf{X}}_{1})\phantom{\rule{0.16667em}{0ex}}(1-p({\mathbf{X}}_{1})){\mathbf{X}}_{1}^{T}],\phantom{\rule{0.38889em}{0ex}}E\left[{\mathbf{X}}_{2}^{(1)}p({\mathbf{X}}_{1}){\mathbf{X}}_{2}^{{(1)}^{T}}\right]$ and $E\left[{\mathbf{X}}_{2}^{(0)}(1-p({\mathbf{X}}_{1})){\mathbf{X}}_{2}^{{(0)}^{T}}\right]$ (recall *p*(**X**_{1}) = *E*[*R*|*A*_{1}]). These blocks are invertible under the condition specified in Lemma 1. The assumption of Lemma 1 is not necessary for identifiability of every parameter; it can be weakened to *P*[*E*[*R*|*A*_{1}] > 0] = 1 for identifiability of *β* (with a similar statement for *α*). The details of the proof are omitted.

To utilize (9) in screening, estimate *p*(**X**_{1}) by forming the average of *R* for each value of **X**_{1} to obtain $\widehat{p}({\mathbf{X}}_{1})$. Conduct a linear regression of *Y* on {**X**_{1}, (*R* − $\widehat{p}({\mathbf{X}}_{1})$)**X**_{1}, $R{\mathbf{X}}_{2}^{(1)},(1-R){\mathbf{X}}_{2}^{(0)}$} to obtain $\widehat{\phi}$, $\widehat{\psi}$, $\widehat{\beta}$, $\widehat{\alpha}$. Note that in contrast to classical screening analyses, these estimators are not orthogonal. This is because homogeneous variance is not assumed (e.g., the conditional variance of *Y* may vary by factor levels), the group sizes may not be identical and the response rates will likely vary by the stage 1 factor levels.

*Lemma 2*. Assume that the variance of *Y* is finite and that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Then as *N* → ∞, {$\sqrt{N}(\widehat{\phi}-\phi ),\sqrt{N}(\widehat{\beta}-\beta ),\sqrt{N}(\widehat{\alpha}-\alpha )$} converges in distribution to a multivariate normal. If $P\left[\mathit{Var}(Y\mid {A}_{1},R,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R,{A}_{2}^{(R)})\right]=1$, then the estimators $\widehat{\phi}$, $\widehat{\beta}$ and $\widehat{\alpha}$ are locally semiparametric efficient.

In many settings all responders will be provided the same treatment. In this case there are no stage 2 factors for responders (no *β*’s), and *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 can be replaced by *P*[*E*[*R*|*A*_{1}] < 1] = 1 (similar statements can be made if all non-responders are provided the same treatment). See Tsiatis (2006, Chapter 4) for an introduction to and definition of local semiparametric efficiency. Note the assumption on the conditional variances holds under the consistency and ignorability assumptions. This lemma is a special case of Lemma 4 below. The formula for the variance-covariance matrix is provided in Appendix A.1 along with an asymptotically consistent estimator and the proof of Lemma 4.

A better use of resources for screening is to use a 2^{k−m} fractional factorial design, which uses only a 1/2^{m} fraction of the 2^{k} possible combinations of factor levels.

This design has one-half as many rows (groups of subjects) as the full factorial design in Table 1; see the examples in Section 6 for realistic designs. Usually a 2^{k−m} design is selected by first ascertaining plausible working assumptions concerning the factorial effects. Wu and Hamada (2000) provide principles that can be used to guide these working assumptions in the absence of scientific knowledge (e.g., often it is plausible that three way and higher order effects are negligible). Next, the aliasing associated with candidate 2^{k−m} designs is examined, and a design is chosen for which, under the working assumptions, the effects of interest are not aliased with one another.

As was the case for full factorial designs, we interpret a fractional factorial design as a design in which each subject is assigned with equal probability to one of the 2^{k−m} possible combinations of the factor levels. Thus (*A*_{1}, *A*_{2}) takes one of 2^{k−m} equally likely values.

We use a large sample notion of aliasing. Effects will be aliased if only their sum can be identified (operationalized here to mean that only asymptotically consistent estimators of their sum are available). This is a weaker concept than the finite sample concept of aliasing as discussed, for example, in Wu and Hamada (2000); in that setting, effects are aliased if we are only able to obtain an unbiased estimator of their sum.

Defining words are generally used to ascertain finite sample aliasing. Consider the design in Table 2. The defining word for this design is
$1={A}_{11}{A}_{12}{A}_{2}^{(1)}$ (the product of the ±1’s across the three columns is equal to 1). Equivalently
$1={A}_{11}{A}_{12}{A}_{2}^{(0)}$ since the levels of
${A}_{2}^{(1)}$ are equal to the levels of
${A}_{2}^{(0)}$. Similarly
${A}_{2}^{(1)}={A}_{11}{A}_{12}$ (the product of the ±1’s in columns 1 and 2 is equal to the ±1’s in column 3). In a standard 2^{3–1} design, the defining word
$1={A}_{11}{A}_{12}{A}_{2}^{(0)}$ means that each main effect is aliased with a two way interaction (e.g. A_{11} is aliased with
${A}_{12}{A}_{2}^{(0)}$, etc.). It turns out that even though the screening analysis model for a dynamic treatment regime is based on a nonstandard model (9), we will, nonetheless, be able to use the defining words to ascertain the large sample aliasing. This means that we can use commonly available designs such as those provided by Wu and Hamada (2000) to design screening studies for dynamic treatment regimes. Lemma 3, below, provides conditions under which the defining words can be used to determine the aliasing of effects.
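The defining word can be checked mechanically. The sketch below is ours (not part of the original article): it constructs the ±1 columns of the Table 2 design and verifies that the product across the three columns is identically 1, equivalently that the stage 2 column equals *A*_{11}*A*_{12}.

```python
import numpy as np

# The four rows of the 2^(3-1) design in Table 2: columns A11, A12 and
# the stage 2 factor column A2, generated by the relation A2 = A11 * A12.
A11 = np.array([1, 1, -1, -1])
A12 = np.array([1, -1, 1, -1])
A2 = A11 * A12

# The product across the three columns is +1 in every row,
# which is exactly the defining word 1 = A11 A12 A2.
assert np.all(A11 * A12 * A2 == 1)

# Equivalently, the A2 column coincides with the A11 x A12 interaction
# column, so the main effect of A2 shares a column with that interaction.
assert np.array_equal(A2, A11 * A12)
print(np.column_stack([A11, A12, A2]))
```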

As in Section 3 we construct the vector
$\left[{\mathbf{X}}_{1}^{T},{\left({\mathbf{X}}_{2}^{(1)}\right)}^{T},{\left({\mathbf{X}}_{2}^{(0)}\right)}^{T}\right]$ from (*A*_{1}, *A*_{2}). This vector takes on 2^{k}^{−}* ^{m}* equally likely vector values. Construct the matrix
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\right]$ with each row corresponding to one of these vector values. The defining words identify the identical columns in
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\right]$ and in
$\left[{\stackrel{\sim}{\mathbf{X}}}_{1},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\right]$. Consider the design in Table 2; here

$${\stackrel{\sim}{\mathbf{X}}}_{1}=\left(\begin{array}{rrrr}\hfill 1& \hfill 1& \hfill 1& \hfill 1\\ \hfill 1& \hfill 1& \hfill -1& \hfill -1\\ \hfill 1& \hfill -1& \hfill 1& \hfill -1\\ \hfill 1& \hfill -1& \hfill -1& \hfill 1\end{array}\right),\phantom{\rule{0.38889em}{0ex}}{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}={\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}=\left(\begin{array}{rrrr}\hfill 1& \hfill 1& \hfill 1& \hfill 1\\ \hfill -1& \hfill -1& \hfill 1& \hfill 1\\ \hfill -1& \hfill 1& \hfill -1& \hfill 1\\ \hfill 1& \hfill -1& \hfill -1& \hfill 1\end{array}\right).$$

(10)

Labeling the columns of ${\stackrel{\sim}{\mathbf{X}}}_{1}$ as {1, *A*_{11}, *A*_{12}, *A*_{11}*A*_{12}} and the columns of
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(j)}$ by {
${A}_{2}^{(j)},{A}_{11}{A}_{2}^{(j)},{A}_{12}{A}_{2}^{(j)},{A}_{11}{A}_{12}{A}_{2}^{(j)}$} we see that the defining word,
$1={A}_{11}{A}_{12}{A}_{2}^{(j)}$ (*j* = 0, 1) identifies the identical columns.
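The construction in (10) and the resulting column identifications can be reproduced numerically. The sketch below (variable names are ours) builds the two matrices and recovers exactly the column pairings predicted by the defining word:

```python
import numpy as np
from itertools import product

# Rows of the Table 2 design: A2 = A11 * A12 (defining word 1 = A11 A12 A2).
rows = [(a11, a12, a11 * a12) for a11, a12 in product([1, -1], repeat=2)]
A11, A12, A2 = (np.array(col) for col in zip(*rows))

# Stage 1 matrix: columns {1, A11, A12, A11*A12}.
X1 = np.column_stack([np.ones(4, int), A11, A12, A11 * A12])
# Stage 2 matrix: columns {A2, A11*A2, A12*A2, A11*A12*A2}.
X2 = np.column_stack([A2, A11 * A2, A12 * A2, A11 * A12 * A2])

labels1 = ["1", "A11", "A12", "A11*A12"]
labels2 = ["A2", "A11*A2", "A12*A2", "A11*A12*A2"]

# Every stage 2 column coincides with some stage 1 column; the pairing
# is exactly what the defining word 1 = A11 A12 A2 predicts.
pairs = [(l1, l2)
         for i, l1 in enumerate(labels1)
         for j, l2 in enumerate(labels2)
         if np.array_equal(X1[:, i], X2[:, j])]
print(pairs)
assert ("A11", "A12*A2") in pairs and ("A12", "A11*A2") in pairs
assert ("A11*A12", "A2") in pairs and ("1", "A11*A12*A2") in pairs
```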

Assume that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Make the *Formal* assumption:

For each identical column in both _{1} and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and for each identical column in both _{1} and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ either

- (3a) the associated nuisance effect(s) (*ψ* parameters) in (9) are zero, or
- (3b) the associated stage 2 effect(s) (*β* or *α* parameters) in (9) are zero.

Then the defining words can be used to ascertain aliasing.

To make Lemma 3 concrete, consider the analog of (9) for the design in Table 2:

$$\begin{array}{l}E[Y\mid {A}_{1},R,{A}_{2}]={\phi}_{1}+{\phi}_{2}{A}_{11}+{\phi}_{3}{A}_{12}+{\phi}_{4}{A}_{11}{A}_{12}+\\ (R-p({\mathbf{X}}_{1}))({\psi}_{1}+{\psi}_{2}{A}_{11}+{\psi}_{3}{A}_{12}+{\psi}_{4}{A}_{11}{A}_{12})+\\ R({\beta}_{1}{A}_{2}^{(1)}+{\beta}_{2}{A}_{11}{A}_{2}^{(1)}+{\beta}_{3}{A}_{12}{A}_{2}^{(1)}+{\beta}_{4}{A}_{11}{A}_{12}{A}_{2}^{(1)})+\\ (1-R)({\alpha}_{1}{A}_{2}^{(0)}+{\alpha}_{2}{A}_{11}{A}_{2}^{(0)}+{\alpha}_{3}{A}_{12}{A}_{2}^{(0)}+{\alpha}_{4}{A}_{11}{A}_{12}{A}_{2}^{(0)})\end{array}$$

(11)

The matrices in (10) permit a variety of formal assumptions. For example, we might make the formal assumption that *ψ*_{2} = *ψ*_{3} = *ψ*_{4} = 0 and *β*_{4} = *α*_{4} = 0. Then Lemma 3 allows us to read off the large sample aliasing from the defining word,
$1={A}_{11}{A}_{12}{A}_{2}^{(j)}$, *j* = 0, 1. That is each main effect is aliased with a two-way interaction between the remaining predictors (e.g. *A*_{11} is aliased with
${A}_{12}{A}_{2}^{(0)}$ and so on). This design is most useful if, according to the working assumptions, there are no two-way interactions.

See Appendix A.1 for a proof of Lemma 3. As before, if there are only stage 2 factors for responders (non-responders) then we need only assume *P*[*E*[*R*|*A*_{1}] > 0] = 1 (respectively *P*[*E*[*R*|*A*_{1}] < 1] = 1). In the next section we provide an algorithm based on this proof for determining the aliasing and constructing the predictors in the screening analysis. We only use 2^{k−m} designs for which either (3a) or (3b) holds for any columns common to the stage 1 matrix ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and the stage 2 matrices ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$.

Of course we cannot fit the model (9), or in the case of our example, (11), since the 2^{k−m} design does not provide enough treatment combinations to identify all of the parameters. Instead we proceed as follows.

- Step 1: Construct ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$. Eliminate duplicate columns in each of ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ and the associated predictors from ${\mathbf{X}}_{1}\phantom{\rule{0.16667em}{0ex}}({\mathbf{X}}_{2}^{(1)},{\mathbf{X}}_{2}^{(0)})$, retaining only the predictor that, according to the working assumptions, may have a non-negligible effect. Name the resulting vector ${\mathbf{U}}_{1}\phantom{\rule{0.16667em}{0ex}}({\mathbf{U}}_{2}^{(1)},{\mathbf{U}}_{2}^{(0)})$.
- Step 2: If any *β* (*α*) parameters are assumed zero (via assumption 3b), delete the predictors associated with these parameters in ${\mathbf{U}}_{2}^{(1)}$ (respectively ${\mathbf{U}}_{2}^{(0)}$); similarly delete the associated columns in ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$.
- Step 3: If some of the nuisance, *ψ*, parameters are assumed to be zero (via assumption 3a), then create **Z**_{3} from **U**_{1} by eliminating predictors in **U**_{1} with an assumed zero-valued *ψ* parameter; otherwise **Z**_{3} = **U**_{1}. At this point we can rewrite (9) as $$E[Y\mid {A}_{1},R,{A}_{2}]={\mathbf{U}}_{1}^{T}{\phi}^{\u2033}+(R-p({\mathbf{U}}_{1})){\mathbf{Z}}_{3}^{T}{\psi}^{\u2033}+R{\mathbf{U}}_{2}^{{(1)}^{T}}{\beta}^{\u2033}+(1-R){\mathbf{U}}_{2}^{{(0)}^{T}}{\alpha}^{\u2033}$$ where the double primes on the regression coefficients indicate that each entry in *φ*″, *ψ*″, *β*″, *α*″ may be a sum of stage 1 effects, nuisance effects, stage 2 effects for responders and stage 2 effects for non-responders, respectively. Here the response rate, *E*[*R*|*A*_{1}], is written as *p*(**U**_{1}).
- Step 4: If assumption 3b is made (e.g. some stage 2 effect, *β*, *α*, parameters are assumed to be zero) then the defining words have identified columns common to at least two of the three matrices ${\stackrel{\sim}{\mathbf{X}}}_{1}$, ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$. If there are one or more columns common to all three matrices then for each such column delete an associated predictor from one of **U**_{1}, ${\mathbf{U}}_{2}^{(1)}$ or ${\mathbf{U}}_{2}^{(0)}$, resulting in **Z**_{1}, ${\mathbf{Z}}_{2}^{(1)}$ or ${\mathbf{Z}}_{2}^{(0)}$. The choice of which predictor to eliminate determines the aliasing as follows. If the omitted predictor is a stage 1 variable then the coefficient of the stage 2 predictor for responders (non-responders) in the regression (12) is an estimator of the sum of the stage 1 and stage 2 effects, *β*′ = *φ*″ + *β*″ (*α*′ = *φ*″ + *α*″, respectively). If the omitted predictor is a stage 2 variable for responders (non-responders) then the coefficient of the stage 1 predictor in the regression (12) is an estimator of the sum *φ*′ = *φ*″ + *β*″ (*φ*′ = *φ*″ + *α*″, respectively) and the coefficient of the stage 2 predictor for non-responders (responders) is an estimator of the difference between the two stage 2 effects, *α*′ = *α*″ − *β*″ (*β*′ = *β*″ − *α*″, respectively). We obtain $$E[Y\mid {A}_{1},R,{A}_{2}]={\mathbf{Z}}_{1}^{T}{\phi}^{\prime}+(R-p({\mathbf{U}}_{1})){\mathbf{Z}}_{3}^{T}{\psi}^{\prime}+R{\mathbf{Z}}_{2}^{{(1)}^{T}}{\beta}^{\prime}+(1-R){\mathbf{Z}}_{2}^{{(0)}^{T}}{\alpha}^{\prime}.$$ (12)

Using the working assumptions the regression coefficients can be labeled as corresponding to particular stage 1 effects, nuisance effects, stage 2 effects for responders and stage 2 effects for non-responders.
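The duplicate-column elimination in Step 1 can be sketched in code. The illustration below is ours: it applies the step to a 2^{4−1} fraction in four stage 1 factors (generator *T* = *SBC*, the stage 1 fraction used in Design 1 of Section 5) and groups identical product columns, showing which effects must share a single predictor.

```python
import numpy as np
from itertools import combinations, product

# 2^(4-1) fraction for stage 1 factors S, B, C, T with generator T = S*B*C
# (defining word 1 = SBCT).
runs = [(s, b, c, s * b * c) for s, b, c in product([1, -1], repeat=3)]
S, B, C, T = (np.array(col) for col in zip(*runs))
factors = {"S": S, "B": B, "C": C, "T": T}

# Compute the +-1 column for every subset of factors (the empty subset
# gives the all-ones intercept column).
cols = {}
for r in range(len(factors) + 1):
    for names in combinations(sorted(factors), r):
        col = np.ones(len(runs), int)
        for n in names:
            col = col * factors[n]
        cols["*".join(names) or "1"] = col

# Group effects whose columns are identical; each group may keep only
# one predictor in the screening regression (algorithm Step 1).
groups = {}
for name, col in cols.items():
    groups.setdefault(tuple(col), []).append(name)
aliased = sorted(sorted(g) for g in groups.values() if len(g) > 1)
print(aliased)

# With defining word 1 = SBCT, every effect is aliased with its
# complement, e.g. B*C with S*T and the intercept with S*B*C*T.
assert ["B*C", "S*T"] in aliased
assert ["1", "B*C*S*T"] in aliased
```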

Consider once again the example from Table 2 with the formal assumptions that *ψ*_{2} = *ψ*_{3} = *ψ*_{4} = 0 and *β*_{4} = *α*_{4} = 0 (see (11)). After step 3, we have **U**_{1} = {1, *A*_{11}, *A*_{12}, *A*_{11}*A*_{12}}, **Z**_{3} = {1},
${\mathbf{U}}_{2}^{(j)}=\{{A}_{2}^{(j)},{A}_{11}{A}_{2}^{(j)},{A}_{12}{A}_{2}^{(j)}\}$, *j* = 0, 1. In step 4 we select one of several analysis models (each corresponding to a different form of aliasing). Suppose we eliminate *A*_{11}*A*_{12} in **U**_{1} to form **Z**_{1} and eliminate
${A}_{11}{A}_{2}^{(1)},{A}_{12}{A}_{2}^{(1)}$ from
${\mathbf{U}}_{2}^{(1)}$ to form
${\mathbf{Z}}_{2}^{(1)}({\mathbf{Z}}_{2}^{(0)}={\mathbf{U}}_{2}^{(0)})$, then according to step 4 the conditional mean becomes

$$\begin{array}{l}E[Y\mid {A}_{1},R,{A}_{2}]={\phi}_{1}^{\prime}+{\phi}_{2}^{\prime}{A}_{11}+{\phi}_{3}^{\prime}{A}_{12}+(R-p({\mathbf{U}}_{1})){\psi}_{1}^{\prime}+\\ R{\beta}_{1}^{\prime}{A}_{2}^{(1)}+(1-R)\phantom{\rule{0.16667em}{0ex}}({\alpha}_{1}^{\prime}{A}_{2}^{(0)}+{\alpha}_{2}^{\prime}{A}_{11}{A}_{2}^{(0)}+{\alpha}_{3}^{\prime}{A}_{12}{A}_{2}^{(0)})\end{array}$$

where ${\phi}_{1}^{\prime}={\phi}_{1},{\phi}_{2}^{\prime}={\phi}_{2}+{\beta}_{3},{\phi}_{3}^{\prime}={\phi}_{3}+{\beta}_{2},{\psi}_{1}^{\prime}={\psi}_{1},{\beta}_{1}^{\prime}={\beta}_{1}+{\phi}_{4},{\alpha}_{1}^{\prime}={\alpha}_{1}+{\phi}_{4},{\alpha}_{2}^{\prime}={\alpha}_{2}-{\beta}_{2}$ and ${\alpha}_{3}^{\prime}={\alpha}_{3}-{\beta}_{3}$ from (11).
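These aliasing relations can be verified numerically: generate the eight cell means (four design rows × *R* ∈ {0, 1}) from model (11) under the formal assumptions, solve the saturated reduced model exactly, and compare. The sketch below is ours; it uses arbitrary true parameter values and, for simplicity, a constant response rate.

```python
import numpy as np

rng = np.random.default_rng(0)
# True parameters of (11) under the formal assumptions
# psi2 = psi3 = psi4 = 0 and beta4 = alpha4 = 0.
phi = rng.normal(size=4)            # phi1..phi4
psi1 = rng.normal()
beta = rng.normal(size=3)           # beta1..beta3
alpha = rng.normal(size=3)          # alpha1..alpha3
p = 0.6                             # response rate (constant for simplicity)

rows = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
X, y = [], []
for a11, a12 in rows:
    a2 = a11 * a12                  # defining word 1 = A11 A12 A2
    for r in (1, 0):
        mean = (phi[0] + phi[1] * a11 + phi[2] * a12 + phi[3] * a11 * a12
                + (r - p) * psi1
                + r * (beta[0] * a2 + beta[1] * a11 * a2 + beta[2] * a12 * a2)
                + (1 - r) * (alpha[0] * a2 + alpha[1] * a11 * a2
                             + alpha[2] * a12 * a2))
        # Reduced (saturated) model: Z1 = {1, A11, A12}, Z3 = {1},
        # Z2(1) = {A2}, Z2(0) = {A2, A11*A2, A12*A2}.
        X.append([1, a11, a12, r - p, r * a2,
                  (1 - r) * a2, (1 - r) * a11 * a2, (1 - r) * a12 * a2])
        y.append(mean)

sol = np.linalg.solve(np.array(X, float), np.array(y))
phi1p, phi2p, phi3p, psi1p, beta1p, alpha1p, alpha2p, alpha3p = sol
print(np.round(sol, 3))

# Recovered coefficients are the aliased sums of the true effects.
assert np.isclose(phi1p, phi[0])              # intercept unaliased
assert np.isclose(phi2p, phi[1] + beta[2])    # phi2' = phi2 + beta3
assert np.isclose(phi3p, phi[2] + beta[1])    # phi3' = phi3 + beta2
assert np.isclose(beta1p, beta[0] + phi[3])   # beta1' = beta1 + phi4
assert np.isclose(alpha1p, alpha[0] + phi[3]) # alpha1' = alpha1 + phi4
assert np.isclose(psi1p, psi1)                # nuisance effect unaliased
```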

The steps in the algorithm follow the steps in the proof of identifiability for Lemma 3; see the proof in Appendix A.1 for a detailed explanation. At this point the total number of entries in {**Z**_{1}, **Z**_{3},
${\mathbf{Z}}_{2}^{(1)},{\mathbf{Z}}_{2}^{(0)}$} is at most 2(2^{k−m}), the number of cells in the conditional mean model (the 2^{k−m} treatment combinations crossed with *R* = 0, 1); when equality holds, the model (12) is saturated.

Assume that the variance of *Y* is finite and that *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1. Then as *N* → ∞, the multivariate distribution of {
$\sqrt{N}({\widehat{\phi}}^{\prime}-{\phi}^{\prime}),\sqrt{N}({\widehat{\beta}}^{\prime}-{\beta}^{\prime}),\sqrt{N}({\widehat{\alpha}}^{\prime}-{\alpha}^{\prime})$} converges to a multivariate normal distribution. Furthermore, if the model in (12) is saturated then the estimators ${\widehat{\phi}}^{\prime}$, ${\widehat{\beta}}^{\prime}$, ${\widehat{\alpha}}^{\prime}$ are semiparametric efficient.

The proof, and an asymptotically consistent estimator for the variance-covariance matrix, is provided in Appendix A.1. In general the model (12) will be saturated. Exceptions occur when more formal assumptions are made than is necessary (see example 3 in Section 5 for an illustration), or when, according to the design, there are one or more stage 2 factors (only for responders or only for non-responders) with levels completely crossed with the levels of all other factors (e.g. these stage 2 factors are randomized independently of the remaining factors and they are not stacked with other factors in the design). For completeness Appendix A.1 also supplies an estimating function yielding semiparametric efficient estimators for use in the case when (12) is not saturated. As was the case for Lemmas 2 and 3, if all responders are provided the same treatment then there are no *β*′ parameters and *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 can be replaced by *P*[*E*[*R*|*A*_{1}] < 1] = 1 (similar statements can be made if all non-responders are provided the same treatment).

The primary goal is to assess the activity of all stage 1 and stage 2 main effects, that is, to test whether each main effect is zero. To choose the sample size *N*, we make the following rough approximations. We assume that our formal assumptions, if any, are correct and that the residual variance is equal across the 2^{k−m} rows in the design, say *σ*^{2}. For a two-sided test of size *α* with power 1 − *β* to detect a main effect of size Δ, this leads to

$$N=\frac{{2}^{k-m}{({z}_{\beta}+{z}_{\alpha /2})}^{2}}{min({p}_{\mathit{min}},1-{p}_{\mathit{max}})\phantom{\rule{0.16667em}{0ex}}{(\mathrm{\Delta}/\sigma )}^{2}}$$

where *p _{min}* and *p _{max}* are the minimum and maximum response rates, *E*[*R*|*A*_{1}], across the 2^{k−m} rows of the design.
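The sample size formula is easy to evaluate; the sketch below is ours and uses illustrative settings that mirror Section 6 (16 rows, power .9, two-sided size .1, response rates from .55 to .73), not values taken from the article's tables.

```python
from math import ceil
from statistics import NormalDist

def sample_size(n_rows, snr, p_min, p_max, power=0.9, alpha=0.1):
    """Total N from the formula in Section 4.3:
    N = n_rows * (z_beta + z_{alpha/2})^2 / (min(p_min, 1 - p_max) * SNR^2)."""
    z = NormalDist()
    z_beta = z.inv_cdf(power)          # z_beta for power 1 - beta
    z_half = z.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return ceil(n_rows * (z_beta + z_half) ** 2
                / (min(p_min, 1 - p_max) * snr ** 2))

# Illustration: 16 rows (a 2^(5-1) design), SNR = Delta/sigma.
for snr in (0.25, 0.35):
    print(snr, sample_size(16, snr, 0.55, 0.73))
```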

Suppose there are stage 2 components for both responders and non-responders. If the signal-to-noise ratio (*SNR* = Δ/*σ*) is very large then the recommended row size will be so small that the chance of a group (corresponding to a row in the design) occurring with no responders or no non-responders is too high. Given the response rate *p* and group size *n* = *N*/2^{k−m}, the chance that a group of this type will occur is approximately 1 − (1 − *p^{n}* − (1 − *p*)*^{n}*)^{2^{k−m}}; when this chance is too high, the group size *n* should be increased.
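Under the rough approximation that responses are independent Bernoulli(*p*) within each group, this probability is easy to compute; the sketch below is ours (in practice *p* varies by row).

```python
def prob_degenerate_group(p, n, n_rows):
    """Probability that at least one of the n_rows groups of size n has
    no responders or no non-responders, assuming responses are
    independent Bernoulli(p) within each group."""
    per_group = p ** n + (1 - p) ** n   # all responders or all non-responders
    return 1 - (1 - per_group) ** n_rows

# With a response rate of .73 and 16 groups, small groups are risky:
for n in (5, 10, 20, 40):
    print(n, round(prob_degenerate_group(0.73, n, 16), 4))
```

The probability falls quickly with the group size, which is why the group sizes in Section 6 are adjusted upwards when the SNR is large.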

In this section we illustrate several 2^{k−m} designs and associated working and formal assumptions. The examples below are 2^{5−1} and 2^{6−2} designs.

Consider a study with four stage 1 factors: *S* (speciality or general practice clinic), *B* (adjunctive therapy to improve adherence: yes/no), *C* (counseling: intensive/less intensive) and *T* (level of staff training: high/low). All individuals are offered medication. There is one stage 2 factor for early non-responders *F*_{2} (switch or augment medication) and one stage 2 factor for early responders *G*_{2} (telephone disease management: yes/no). Suppose that the working assumptions are that the main effects of the factors and the two way interactions *CG*_{2}, *TG*_{2}, *BF*_{2}, *TF*_{2} and *BC* may be active with the remaining effects assumed negligible. Recall, these working assumptions are used to guide the choice of the 2^{k}^{−}* ^{m}* design, but are not necessarily assumed true in the screening analysis.

In Design 1, a 2^{4–1} design is constructed for the stage 1 factors and then the levels of the stage 2 factor are crossed with this design. In Design 2 the stage 2 factor is aliased with the four way interaction between the stage 1 factors. Both of these designs are 2^{5–1} designs. In Design 3 below we allow for an additional stage 2 factor which is applicable only to the non-responders; here a 2^{6–2} design is discussed.

This design crosses a 2^{4–1} design with defining word *SBCT* = 1 with the stage 2 factor levels to produce a 2^{5–1} design; no formal assumptions are made. The defining word indicates that the stage 1 and stage 2 matrices ${\stackrel{\sim}{\mathbf{X}}}_{1}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) do not share columns and thus stage 2 effects are not aliased with nuisance effects (or stage 1 effects). As a result the aliasing follows directly from the defining word (e.g. the two way interaction *BC* is aliased with the two way interaction *ST*, and since *SBCTG*_{2} = *G*_{2}, the two way interaction *TG*_{2} is aliased with the interaction *SBCG*_{2}, and so on). Note that this design is consistent with the working assumptions in that all of the interesting effects are aliased with effects that are thought to be negligible.

The screening analysis uses the model in (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*, *SB*, *SC*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2},SB{G}_{2},SC{G}_{2},BC{G}_{2}]$ and similarly
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},SB{F}_{2},SC{F}_{2},BC{F}_{2}]$. This is a saturated model. Recall that due to the nonorthogonality of the predictors, omitting predictors from this model is equivalent to making formal assumptions. An advantage of this design is that no negligibility assumptions on the nuisance parameters (*ψ*’s) need be made. However, this design aliases *BC* with *ST*; a better design would avoid aliasing two way interactions.
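Design 1's structure can be checked in code (a sketch; the factor coding is ours): crossing the 2^{4−1} stage 1 fraction with the stage 2 levels yields 16 runs in which the stage 2 column is orthogonal to every stage 1 product column, so stage 2 effects are not aliased with stage 1 or nuisance effects.

```python
import numpy as np
from itertools import combinations, product

# Design 1: a 2^(4-1) fraction in S, B, C, T (generator T = S*B*C, i.e.
# defining word 1 = SBCT) crossed with the +-1 levels of the stage 2
# column (G2 for responders and F2 for non-responders share this column).
stage1 = [(s, b, c, s * b * c) for s, b, c in product([1, -1], repeat=3)]
runs = [(s, b, c, t, a2) for (s, b, c, t) in stage1 for a2 in (1, -1)]
S, B, C, T, A2 = (np.array(col) for col in zip(*runs))
assert len(runs) == 16  # a 2^(5-1) design

# Because the stage 2 levels are fully crossed with the stage 1 fraction,
# the A2 column is orthogonal to every product of stage 1 factors.
factors = {"S": S, "B": B, "C": C, "T": T}
for r in range(1, 5):
    for names in combinations(factors, r):
        col = np.ones(16, int)
        for name in names:
            col = col * factors[name]
        assert int(col @ A2) == 0
```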

Suppose that we are willing to make the formal assumption that there are no three way or higher order stage 2 interactions (*α* and *β* regression coefficients of interactions between a stage 2 factor and two or more stage 1 factors are 0), and that there are no four way or higher nuisance interactions involving *R* and stage 1 factors (*ψ* regression coefficients of interactions between *R* and three or more stage 1 factors are 0). Consider a 2^{5–1} design with defining word *SBCTF*_{2} = 1 with *F*_{2} and *G*_{2} stacked (so another expression for the defining word is *SBCTG*_{2} = 1). The design is provided in Table 3. The defining words indicate that under the formal assumptions, none of the stage 1 main effects or two-way interactions are aliased. Potentially active stage 2 effects, such as *CG*_{2}, can also be estimated since the nuisance effects associated with the same column in the stage 1 matrix ${\stackrel{\sim}{\mathbf{X}}}_{1}$ (here the four way interaction between *R* and *SBT*) are negligible. And potentially active nuisance effects such as the three way interaction between *R* and *SB* can be estimated since these nuisance effects are associated with the same column in the stage 2 matrices
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ (or
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$) as negligible stage 2 effects (here *CTG*_{2} and *CTF*_{2}).

To screen the factors using data from design 2, we use the model in (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*, *SB*, *SC*, *ST*, *BT*, *CT*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2}]$ and similarly
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2}]$. This is a saturated model. Note that the defining word indicates the aliasing; for example the *β*′ coefficient of *G*_{2} is actually estimating the sum of the main effect of *G*_{2} and the four-way interaction *SBCT*.

In order to illustrate some of the more subtle considerations, consider the inclusion of an additional stage 2 factor *H*_{2} (additional behavioral contingency to improve long term medication adherence) that is only assigned to non-responders (*R* = 0). Further suppose the formal assumptions are that there are no three way or higher nuisance interactions involving *R* and stage 1 factors and that three way and higher stage 2 causal effects are negligible. The working assumptions are that all effects except the main effects of the factors and the two way interactions *TG*_{2}, *TF*_{2}, *CG*_{2}, *BF*_{2}, *F*_{2}*H*_{2} and *BC* are expected to be negligible. Consider the 2^{6–2} design, with defining words *SCTF*_{2} = 1 and *BTF*_{2}*H*_{2} = 1 and *F*_{2} and *G*_{2} stacked (so we could express the above with a *G*_{2} instead of *F*_{2}; note the product of these two defining words yields 1 = *SBCH*_{2}.)
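The defining words of Design 3 can be verified by generating the 16 runs from the implied generators *F*_{2} = *SCT* and *H*_{2} = *BTF*_{2} (a sketch; our code):

```python
import numpy as np
from itertools import product

# Design 3 sketch: a 2^(6-2) design in S, B, C, T, F2 (stacked with G2)
# and H2, built from the generators implied by the defining words
# S*C*T*F2 = 1 (so F2 = S*C*T) and B*T*F2*H2 = 1 (so H2 = B*T*F2).
runs = []
for s, b, c, t in product([1, -1], repeat=4):
    f2 = s * c * t
    h2 = b * t * f2
    runs.append((s, b, c, t, f2, h2))
S, B, C, T, F2, H2 = (np.array(col) for col in zip(*runs))
assert len(runs) == 16  # 2^(6-2) = 16 treatment combinations

# Both defining words hold in every run...
assert np.all(S * C * T * F2 == 1)
assert np.all(B * T * F2 * H2 == 1)
# ...and their product gives the third word 1 = S*B*C*H2.
assert np.all(S * B * C * H2 == 1)
```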

In contrast to designs 1 and 2, not only the experimental design but also the choice of the regression model determines the aliasing (we use step 4 of the algorithm in Section 4.2 to determine the aliasing). Indeed we have a choice of several regression models that can be used to screen the factors each corresponding to different aliasing. Using the steps in Section 4.2 we see that one possibility is to use the model in (12) with **Z**_{1} = [1, *S, B, C, T, SB, BC, BT, SBC, SBT, BCT*], **Z**_{3} = [1, *S, B, C, T*],
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},S{G}_{2},B{G}_{2},C{G}_{2},T{G}_{2}]$ and
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},S{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},{H}_{2},S{H}_{2},C{H}_{2},{F}_{2}{H}_{2}]$. Using the defining words, we can deduce the aliasing. For example since the defining words indicate that *G*_{2} = *F*_{2} = *SCT* and *SCT* has been omitted from **Z**_{1}, we have that the *β*′ coefficient of *G*_{2} is the sum of the main effect of *G*_{2} and the three way interaction *SCT*; similarly the *α*′ coefficient of *F*_{2} is the sum of the main effect of *F*_{2} and the three way interaction *SCT*. Note that the defining words indicate that *BF*_{2} = *BG*_{2} = *TH*_{2} = *SBCT* and we included only *BF*_{2}, *BG*_{2} in the regression; thus the *α*′ coefficient of *BF*_{2} is the sum of the two way interactions *BF*_{2}, *TH*_{2} and the four way interaction *SBCT*, whereas the *β*′ coefficient of *BG*_{2} is estimating the sum of the two way interaction *BG*_{2} and the four way interaction *SBCT*. In general, given the formal assumptions, the defining words along with the choice of **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}$ and
${\mathbf{Z}}_{2}^{(0)}$ result in the aliasing:
${\beta}_{{G}_{2}}^{\prime}={\beta}_{{G}_{2}}+{\phi}_{\mathit{SCT}},{\beta}_{S{G}_{2}}^{\prime}={\beta}_{S{G}_{2}}+{\phi}_{CT},{\beta}_{B{G}_{2}}^{\prime}={\beta}_{B{G}_{2}}+{\phi}_{\mathit{SBCT}},{\beta}_{C{G}_{2}}^{\prime}={\beta}_{C{G}_{2}}+{\phi}_{ST},{\beta}_{T{G}_{2}}^{\prime}={\beta}_{T{G}_{2}}+{\phi}_{SC},{\alpha}_{{F}_{2}}^{\prime}={\alpha}_{{F}_{2}}+{\phi}_{\mathit{SCT}},{\alpha}_{S{F}_{2}}^{\prime}={\alpha}_{S{F}_{2}}+{\phi}_{CT},{\alpha}_{B{F}_{2}}^{\prime}={\alpha}_{B{F}_{2}}+{\alpha}_{T{H}_{2}}+{\phi}_{\mathit{SBCT}},{\alpha}_{C{F}_{2}}^{\prime}={\alpha}_{C{F}_{2}}+{\phi}_{ST},{\alpha}_{T{F}_{2}}^{\prime}={\alpha}_{T{F}_{2}}+{\alpha}_{B{H}_{2}}+{\phi}_{SC}$ (for clarity the associated predictor is given as the subscript on the parameter). The remaining parameters are unaliased (e.g. the remaining (*φ*′, *β*′, *α*′) parameters are equal to the corresponding (*φ*, *β*, *α*) parameters).

Another possible screening analysis would use (12) with **Z**_{1} = [1, *S*, *B*, *C*, *T*, *SB*, *BC*, *BT*, *CT*, *SBC*, *SBT*, *BCT*] and
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},B{F}_{2},C{F}_{2},T{F}_{2},{H}_{2},S{H}_{2},C{H}_{2}]$ but leave **Z**_{3} and
${\mathbf{Z}}_{2}^{(1)}$ as is; that is the predictor *CT* is added to **Z**_{1} and the predictor *SF*_{2} is removed from
${\mathbf{Z}}_{2}^{(0)}$. In this case the aliasing is the same as before except
${\phi}_{CT}^{\prime}={\phi}_{CT}+{\alpha}_{S{F}_{2}},{\beta}_{S{G}_{2}}^{\prime}={\beta}_{S{G}_{2}}-{\alpha}_{S{F}_{2}}$ (there is no longer an
${\alpha}_{S{F}_{2}}^{\prime}$ parameter).

Both of the screening analysis models discussed above have 30 parameters and hence are not saturated models. If desired we could fit a saturated model by adding two nuisance interactions, *SBT* and *BCT*, to **Z**_{3}. These two nuisance interactions were needlessly assumed to be negligible in the formal assumptions. For example, from the defining words, note that the column associated with *SBT* is the same as the column associated with the three way stage 2 interactions *SF*_{2}*H*_{2}, *TCH*_{2} and *BCF*_{2}, all of which were assumed to be negligible. That is, we made both assumptions 3a and 3b in Lemma 3 instead of one or the other. Estimation of these two effects acts as a check on the formal assumptions, since under the formal assumptions these two nuisance effects are zero.

Extensive simulations were conducted to evaluate the proposed analysis, to examine the impact of violations of the formal assumptions and to examine the impact of rows with no responders (or no non-responders) on the analysis. We are particularly interested in mental health settings such as the treatment of major depression and substance abuse, in which response (absence of symptoms) rates to initial treatment are around 50 to 70%; as a result the simulations below use initial response rates in this range. Note that in actual practice, when response (non-response) rates are low, usually only factors for non-responders (respectively, responders) are investigated.

The findings were as follows. First, when the formal assumptions hold, the standard errors performed well and the Type 1 error is as planned. In general the sample size calculations depend on the group with the smallest proportion of non-responders (or responders). As a result the sample size calculations are conservative and the power to detect active stage 1 main effects and stage 2 main effects is higher than the nominal value. Simulations with both normal and non-normal error distributions, such as a t-distribution with 3 degrees of freedom, were also conducted; this did not substantially alter the results. When the formal assumptions are violated (e.g. effects assumed negligible in the formal assumptions are not negligible) there is a surprising degree of robustness; that is, the bias is of smaller magnitude than would be expected. Also, when one or more rows contain only responders, fitting an analysis model that omitted some active effects led to relatively robust estimators of the remaining effects. This robustness is discussed at greater length following simulation 2. We present three simulations that exemplify the above findings.

The simulations below use Design 2 (see Table 3). In this example the formal assumptions are that there are no three way or higher order stage 2 interactions and there are no four way or higher order nuisance interactions involving *R* and stage 1 factors. In the simulations provided below, *R* is generated using a response rate given by logit(*E*[*R*|*A*_{1}]) = .6 + .1*S* + .1*B* + .1*C* + .1*T*. This results in response rates varying from .55 to .73 across the 16 rows; this wide range of early response rates is extreme for the mental health/substance abuse fields, however it allows us to illustrate the issues. *Y* is normally distributed with residual variance, *σ*^{2}, set to 1. We present results with the signal-to-noise ratio (SNR = effect size/residual standard deviation, Δ/*σ*) equal to .25 or .35 units per standard deviation. In the simulations, active main effects are equal to SNR × *σ*, active two way interactions are equal to SNR × .5 × *σ* and active three way interactions are set equal to SNR × .25 × *σ*. These settings are consistent with the Hierarchical Ordering Principle (Wu and Hamada, 2000, pg. 112), which states that lower order effects are more likely to be important than higher order effects and effects of the same order are equally likely to be important (this principle is used in the absence of domain knowledge indicating otherwise). Throughout, all main effects and the interactions *SB*, *SC*, *SG*_{2}, *TG*_{2} and *TF*_{2} are active; thus the working assumptions are incorrect (the working assumptions are that only the main effects of the factors and the two way interactions *TG*_{2}, *TF*_{2}, *CG*_{2}, *BF*_{2} and *BC* are likely to be active).
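The response-rate model above can be reproduced directly. The following sketch is ours; the outcome model is deliberately simplified to a single illustrative main effect, and stage 2 assignment is omitted for brevity.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Response model used in the simulations:
# logit(E[R | A1]) = .6 + .1*S + .1*B + .1*C + .1*T.
rates = [expit(0.6 + 0.1 * (s + b + c + t))
         for s, b, c, t in product([1, -1], repeat=4)]
lo, hi = min(rates), max(rates)
print(round(lo, 3), round(hi, 3))   # response rates across the 16 rows
assert 0.54 < lo < 0.56 and 0.72 < hi < 0.74

# One simulated group per row: responder status R and a normal outcome Y
# with residual variance 1. The effect size here is illustrative only
# (a main effect of S with SNR = .35).
n = 50
data = []
for s, b, c, t in product([1, -1], repeat=4):
    p = expit(0.6 + 0.1 * (s + b + c + t))
    R = rng.binomial(1, p, size=n)
    Y = 0.35 * s + rng.standard_normal(n)
    data.append((s, b, c, t, R, Y))
```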

We used the formula in Section 4.3 to determine the sample size required to detect a given SNR at 90% power with a 10% Type 1 error rate for *p _{min}* = .55 and *p _{max}* = .73.

Note that there are 6 effects that according to the working model should be negligible and are actually negligible. Given an overall error rate of .1, the empirical error rate should be around .011 and this is the case, as can be seen from the rows labeled *ST*, *BT*, *CT*, *BG*_{2}, *SF*_{2} and *CF*_{2}. Similarly the empirical Type 1 error rate for the interesting effects should be around .1 and this can be seen from the rows labeled *BC*, *CG*_{2} and *BF*_{2}. The high power in detecting main effects is due to the conservatism used in selecting the group size. Recall that the response rates per group range from a low of .55 to a high of .73. The group sizes are then chosen to achieve power of .9 to detect stage 2 effects for non-responders when the non-response rate is .27. Since only one of the groups exhibits this low non-response rate, the group sizes are conservative, resulting in higher power. In simulations with a constant response rate across all 16 groups (not shown here) the power to detect a main effect for stage 2 non-responders varied from around .88 to .94 depending on the simulation.

In the next simulation, also of 1000 simulated data sets, the formal assumptions are violated; there are active three way stage 2 effects for responders. The results in Table 5 are surprisingly good; the results in this table would be expected *if* the interactions involving *R* were zero (no nuisance interactions). To see this, recall the defining words for this design are 1 = *SBCTG*_{2} (equivalently 1 = *SBCTF*_{2}). Thus, for example, the column in the stage 2 matrix
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ associated with *CTF*_{2} is the same as the column associated with *SB* in ${\stackrel{\sim}{\mathbf{X}}}_{1}$. If the coefficient of the interaction (*R* − *E*[*R*|*A*_{1}])*SB* were zero (it is not) then we would expect that estimators of stage 1 interactions such as *SB* will estimate this stage 1 interaction plus the response rate times the effect of *CTG*_{2} plus the non-response rate times the effect of *CTF*_{2}. In this case, this is .175 + response rate × (.0875) + non-response rate × (0). If we use the average response rate (here .64), this yields .231, which is close to the average estimated effect for *SB*. Similar statements can be made for the remaining stage 1 two way interactions. This robustness is explained by the fact that the response rates vary only from .55 to .73 across the groups. As discussed in Section 4, the less the response rates vary, the closer the predictors are to being orthogonal. Thus the estimators of the effects are approximately uncorrelated (the off diagonal elements of the correlation matrix are small; in this simulation the maximum correlation in absolute value is .12 and the average of the absolute value of the correlations is .03). Thus as long as the response rates do not vary greatly (as is the case in many areas of mental health and substance abuse) we can expect this robustness. Similar results hold when the formal assumptions concerning the nuisance effects are violated (not shown here).

Also this simulation, by chance, did not result in any samples with one or more groups containing only responders. However this can happen and did happen in other simulations (not shown). Indeed when SNR = .35 the group sizes had to be adjusted upwards to ensure that the probability that all groups have some non-responders is not too low (low is arbitrarily chosen to be .01; see Section 4.3). Even with this adjustment, a simulation size of 1000 samples will sometimes include some samples in which groups with no non-responders occur. If in a given data set some groups (e.g. corresponding to rows in the design) contain no non-responders, we fit a model using the working assumptions, that is, **Z**_{1} = [1, *S*, *B*, *C*, *T*, *BC*], **Z**_{3} = **Z**_{1},
${\mathbf{Z}}_{2}^{(1)}=[{G}_{2},C{G}_{2},T{G}_{2}]$ and
${\mathbf{Z}}_{2}^{(0)}=[{F}_{2},B{F}_{2},T{F}_{2}]$.

The working assumptions are incorrect, thus the omission of active effects biases the estimated effects. To examine the degree of bias we conducted a simulation in which the groups corresponding to the rows of the design were sufficiently small that the chance of one or more groups with all responders was high; in effect we simulated many samples, saving only those that had one or more groups with all responders. Table 6 reports the results for 729 such samples.

As with simulation 2, this simulation demonstrates robustness of the analysis method to the omission of active effects. Note that both the bias of the estimators and the quality of the standard errors is poorest for the stage 2 effects for non-responders. This is not surprising as many rows will have few non-responders.

An attractive alternative to the use of randomized clinical trials in constructing dynamic treatment regimes is the use of observational data in which the components are not randomized. Here the variation in treatment is usually due to patient/clinician preference, patient adherence, component availability, etc. Observational data has the great advantage that it often already exists and/or is less expensive to obtain than data from the screening designs discussed here. However, observational data suffers from two substantial drawbacks. First, inference regarding causal relations in the absence of randomization requires untestable assumptions (Rubin, 1978; Robins, 1992) concerning why individuals receive, or are offered, differing treatments. Second, even if we are comfortable with the required assumptions, because we do not control the treatment levels in the observational study we have uncontrolled aliasing. That is, when our modeling assumptions do not hold, the form of the aliasing may be complex and difficult to ascertain. In contrast, the screening designs considered here not only improve our ability to make causal inferences but also provide a greater understanding of the aliasing that occurs due to incorrect modeling assumptions. However there are many open problems. Four such problems follow.

First there is the question of how to construct optimal experimental designs. A first thought might be to utilize the maximum resolution criterion or its refinement, the minimum aberration criterion (see Wu and Hamada, 2000). Note that the design in example 1 has only resolution IV whereas the design in example 2 has higher resolution (V). It is not necessarily true that the example 2 design is to be preferred, since it requires formal assumptions whereas the example 1 design does not. More work is required to appropriately incorporate the role of the formal assumptions into methods for comparing these designs.
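As a generic illustration (not a procedure from this article), the resolution of a two-level fractional factorial can be computed directly from its defining words: form the defining contrast subgroup generated by the words and take the length of its shortest nonempty word. The factor labels and generators below are hypothetical examples.

```python
def resolution(generators):
    """Resolution of a two-level fractional factorial: the length of the
    shortest nonempty word in the defining contrast subgroup. Each word is a
    frozenset of factor indices; mod-2 multiplication of words is symmetric
    difference of the index sets."""
    words = {frozenset()}
    for g in generators:
        words |= {w ^ frozenset(g) for w in words}
    return min(len(w) for w in words if w)

# A 2^(4-1) design with defining word I = ABCD (factors 0..3): resolution IV.
print(resolution([{0, 1, 2, 3}]))             # -> 4
# A 2^(5-2) design with I = ABD = ACE (words ABD, ACE, BCDE): resolution III.
print(resolution([{0, 1, 3}, {0, 2, 4}]))     # -> 3
```

The same subgroup also gives the full word-length pattern needed for the minimum aberration comparison.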

Second, as discussed in Step 4 of the Algorithm in Section 4.2, some designs permit multiple analysis models; because of the non-orthogonality of the predictors, different models result in tests with different powers. A strategy for choosing among these different analysis models is needed.

Third, designs with a smaller number of groups (e.g. treatment combinations) might be considered. To do so, we would begin by eliminating negligible effects from a potential model via formal and working assumptions. Next, we could attempt to find, say, a D-optimal design (e.g., see Wu and Hamada, 2000) that would permit estimation of the remaining effects. This would yield designs that have fewer than $2^{k-m}$ groups. However, such designs rarely have the nice aliasing structure found with $2^{k-m}$ designs.
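The D-optimality idea can be sketched with a generic greedy candidate-selection illustration (not the authors' procedure): pick runs from a candidate set so as to maximize det(X′X) for the model matrix X of the effects to be estimated. The candidate set, model, and run count below are hypothetical.

```python
import numpy as np
from itertools import product

# Candidate runs: the full 2^3 factorial; model: intercept + 3 main effects.
levels = np.array(list(product([-1.0, 1.0], repeat=3)))
X_full = np.column_stack([np.ones(len(levels)), levels])

def greedy_d_optimal(X, n_runs, ridge=1e-6):
    """Greedily add the candidate row that most increases det(X_s' X_s).
    The small ridge keeps the criterion well defined before X_s has full rank."""
    chosen = []
    for _ in range(n_runs):
        best_i, best_det = None, -np.inf
        for i in range(len(X)):
            G = X[chosen + [i]].T @ X[chosen + [i]]
            d = np.linalg.det(G + ridge * np.eye(X.shape[1]))
            if d > best_det:
                best_i, best_det = i, d
        chosen.append(best_i)
    return chosen

rows = greedy_d_optimal(X_full, 4)   # 4 runs to estimate 4 effects
Xs = X_full[rows]
print(np.linalg.det(Xs.T @ Xs))      # an orthogonal half-fraction attains 4^4 = 256
```

Here the greedy search recovers a regular $2^{3-1}$ half-fraction; for irregular run sizes the selected design generally lacks that clean aliasing structure, which is the point made above.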

Lastly, this paper has not discussed potential secondary analyses that might be used with data from a 2^{k}^{−}* ^{m}* design. Such analyses might consider how best to utilize time-varying patient covariates in the construction of the decision rules. The methods of Murphy (2003) and Robins (2004) require generalization as these methods require randomization or, at a minimum, stochastic variation in assigned stage 2 factors.

We acknowledge support for this project from National Institutes of Health grants R01 MH080015 and P50 DA10075.

First assume there are factors for both responders and non-responders. We have for *R* = 0, 1,

$$E[\mathbf{Y}\mid R]={\stackrel{\sim}{\mathbf{X}}}_{1}\phi +(R-{\mathbf{D}}_{\mathbf{p}}){\stackrel{\sim}{\mathbf{X}}}_{1}\psi +R{\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\beta +(1-R){\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\alpha $$

(13)

where the defining words specify the common columns in ${\stackrel{\sim}{\mathbf{X}}}_{1}$,
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}$ and
${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}$ (${\mathbf{D}}_{\mathbf{p}}$ is a diagonal matrix with the row response probabilities $p_i$ on the diagonal).

Next, if assumption 3a was used then some of the entries in *ψ*′ are known to be zero. Delete the associated columns from ${\stackrel{\sim}{\mathbf{X}}}_{1}$ to form ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and delete these entries from *ψ*′. If assumption 3b was made then follow the same procedure for
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and *β*′ and similarly for
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ and *α*′.

At this point there are no duplicate columns in ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ or in ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ or in
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ or in
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Furthermore there is no column common to both ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and no column common to both ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. This is because such a column would have been assumed to have either a *ψ* coefficient or *α*, *β* coefficients equal to zero by assumptions 3a), 3b) respectively. There may, however, be one or more columns which are common to all three matrices ${\stackrel{\sim}{\mathbf{Z}}}_{1}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Note the above discussion implies these columns are not in ${\stackrel{\sim}{\mathbf{Z}}}_{3}$. To obtain regression coefficients that are identifiable, we need to eliminate each common column from one of these three matrices. To see this consider one such common column, **z**, which is, say, the *i*th column of ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, the *j*th column of
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and the *k*th column of
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. This column contributes
$\mathbf{z}{\phi}_{i}^{\prime}+R\mathbf{z}{\beta}_{j}^{\prime}+(1-R)\mathbf{z}{\alpha}_{k}^{\prime}$ to *E*[**Y**|*R*]. We can write this sum in three ways corresponding to how we delete this column from one of the three matrices:

$$\begin{array}{l}\mathbf{z}{\phi}_{i}^{\prime}+R\mathbf{z}{\beta}_{j}^{\prime}+(1-R)\mathbf{z}{\alpha}_{k}^{\prime}=R\mathbf{z}({\beta}_{j}^{\prime}+{\phi}_{i}^{\prime})+(1-R)\mathbf{z}({\alpha}_{k}^{\prime}+{\phi}_{i}^{\prime})\\ =\mathbf{z}({\phi}_{i}^{\prime}+{\beta}_{j}^{\prime})+(1-R)\mathbf{z}({\alpha}_{k}^{\prime}-{\beta}_{j}^{\prime})\\ =\mathbf{z}({\phi}_{i}^{\prime}+{\alpha}_{k}^{\prime})+R\mathbf{z}({\beta}_{j}^{\prime}-{\alpha}_{k}^{\prime}).\end{array}$$

If we delete **z** from ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ then we will be able to identify the sum of the associated stage 1 and stage 2 effects; if we delete **z** from
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for responders and also the difference between the stage 2 effect for non-responders and the stage 2 effect for responders; if we delete **z** from
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for non-responders and also the difference between the stage 2 effect for responders and the stage 2 effect for non-responders.
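The three decompositions can be checked numerically for both values of *R*; the coefficient values in this short script are arbitrary illustrations, not estimates from the article.

```python
# Check that the three re-parameterizations of phi_i + R*beta_j + (1-R)*alpha_k
# agree for both values of R (the common column z factors out, so it is dropped).
phi_i, beta_j, alpha_k = 0.4, 1.1, -0.6   # arbitrary illustrative coefficients
for r in (0, 1):
    base = phi_i + r * beta_j + (1 - r) * alpha_k
    drop_z1 = r * (beta_j + phi_i) + (1 - r) * (alpha_k + phi_i)
    drop_z21 = (phi_i + beta_j) + (1 - r) * (alpha_k - beta_j)
    drop_z20 = (phi_i + alpha_k) + r * (beta_j - alpha_k)
    assert abs(base - drop_z1) < 1e-12
    assert abs(base - drop_z21) < 1e-12
    assert abs(base - drop_z20) < 1e-12
print("all three decompositions agree for R = 0, 1")
```

Whichever matrix loses the column, the remaining coefficients identify the sums and differences listed above.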

Note the assumption that *P*[*E*[*R*|*A*_{1}] > 0] = 1 (*P*[*E*[*R*|*A*_{1}] < 1] = 1) implies that the $2^{k-m}\times 1$ vector $E[\mathbf{Y}\mid R=1]$ (respectively $E[\mathbf{Y}\mid R=0]$) is well defined. From (13),

$$\begin{array}{l}E[\mathbf{Y}\mid R=1]={\stackrel{\sim}{\mathbf{Z}}}_{1}{\phi}^{\prime}+{\mathbf{D}}_{1-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}{\beta}^{\prime}\\ E[\mathbf{Y}\mid R=0]={\stackrel{\sim}{\mathbf{Z}}}_{1}{\phi}^{\prime}-{\mathbf{D}}_{\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}{\alpha}^{\prime}.\end{array}$$

Multiply all terms in both of these equations by
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}$ and subtract the lower equation from the top. Note that ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ does not share a column with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(i)}$ so
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(i)}=0$ for *i* = 0, 1 and that
${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{3}$ is the identity matrix. We obtain
${\psi}^{\prime}={\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}(E[\mathbf{Y}\mid R=1]-E[\mathbf{Y}\mid R=0])$. So *ψ*′ is identifiable. A little more matrix algebra yields

$$\begin{array}{l}{\stackrel{\sim}{\mathbf{Z}}}_{1}^{T}\left(E[\mathbf{Y}\mid R=1]-{\mathbf{D}}_{1-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}\right)={\phi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{1}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}{\beta}^{\prime}\\ {\stackrel{\sim}{\mathbf{Z}}}_{1}^{T}\left(E[\mathbf{Y}\mid R=0]+{\mathbf{D}}_{\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}{\psi}^{\prime}\right)={\phi}^{\prime}+{\stackrel{\sim}{\mathbf{Z}}}_{1}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}{\alpha}^{\prime}.\end{array}$$

Identifiability of *φ*′, *α*′ and *β*′ follows from the fact that there are no columns common to all three of the matrices.
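The identification of *ψ*′ from the responder/non-responder contrast can be illustrated numerically: with design columns scaled to unit norm (so ${\stackrel{\sim}{\mathbf{Z}}}_{3}^{T}{\stackrel{\sim}{\mathbf{Z}}}_{3}=I$), the ${\mathbf{D}}_{\mathbf{p}}$ terms cancel because ${\mathbf{D}}_{1-\mathbf{p}}+{\mathbf{D}}_{\mathbf{p}}=I$. The column assignments, coefficients, and response probabilities below are hypothetical.

```python
import numpy as np

# Rows of a 2^2 design, factors a, b in {-1, +1}; columns scaled to unit norm.
a = np.array([-1.0, -1.0, 1.0, 1.0])
b = np.array([-1.0, 1.0, -1.0, 1.0])
c = lambda v: v / 2.0                    # each scaled column has norm 1

Z1 = np.column_stack([c(np.ones(4)), c(a)])  # stage 1 columns (phi')
Z3 = np.column_stack([c(b)])                 # column carrying psi'
Z21 = np.column_stack([c(a * b)])            # responders' stage 2 column (beta')
Z20 = np.column_stack([c(a * b)])            # non-responders' stage 2 column (alpha')

phi = np.array([1.0, 0.5])
psi, beta, alpha = np.array([0.3]), np.array([0.7]), np.array([-0.2])
p = np.array([0.4, 0.5, 0.6, 0.7])           # response probability in each row

EY1 = Z1 @ phi + np.diag(1 - p) @ Z3 @ psi + Z21 @ beta
EY0 = Z1 @ phi - np.diag(p) @ Z3 @ psi + Z20 @ alpha

psi_hat = Z3.T @ (EY1 - EY0)                 # D_{1-p} + D_p = I, other columns orthogonal
print(psi_hat)
```

The printed value recovers *ψ*′ = 0.3 because every other column is orthogonal to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$.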

If there are no factors for responders (non-responders) remove ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(1)}\beta $ (respectively ${\stackrel{\sim}{\mathbf{X}}}_{2}^{(0)}\alpha $) from (13), then follow the above arguments.

We assume there are factors for both responders and non-responders. Similar arguments can be used when there are only factors for one or the other. Recall that a $2^{k-m}$ design is used.

Define
$\mathbf{Z}(\eta )=\left[(R-{\mathbf{U}}_{1}^{T}\eta ){\mathbf{Z}}_{3}^{T},{\mathbf{Z}}_{1}^{T},R{\mathbf{Z}}_{2}^{{(1)}^{T}},(1-R){\mathbf{Z}}_{2}^{{(0)}^{T}}\right]$. Also let **E*** _{N}* be expectation with respect to the empirical distribution of the data. Then solve

$$\begin{array}{l}0={\mathbf{E}}_{N}\phantom{\rule{0.16667em}{0ex}}[\mathbf{Z}(\eta )\phantom{\rule{0.16667em}{0ex}}(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})]\\ 0={\mathbf{E}}_{N}\phantom{\rule{0.16667em}{0ex}}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\eta )]\end{array}$$

(14)

for *γ*′, *η* to obtain ${\widehat{\gamma}}^{\prime}$, $\widehat{\eta}$. The second equation is the normal equation for a regression of *R* on **U**_{1}. Recall the model used for the conditional distribution of *R* is saturated, so weighting by the variance of *R* would not alter the results; in fact the value of
${u}_{1}^{T}\widehat{\eta}$ will be equal to the observed combined response rate for the rows consistent with *u*_{1}. The asymptotic arguments are standard exercises, so we provide only an outline. First, standard asymptotic arguments are sufficient to show convergence in probability of ${\widehat{\gamma}}^{\prime}$ to *γ*′ and of $\widehat{\eta}$ to *η*. Furthermore define ${\mathrm{\sum}}_{\gamma}=E[\mathbf{Z}(\eta )\mathbf{Z}{(\eta )}^{T}]$.
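The estimating equations (14) amount to two ordinary least squares fits done in sequence: regress *R* on **U**_{1} to obtain $\widehat{\eta}$, then regress *Y* on $\mathbf{Z}(\widehat{\eta})$ to obtain ${\widehat{\gamma}}^{\prime}$. A minimal simulation sketch; the design (one stage 1 factor, one stage 2 factor), the regressor layout, and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical design: stage 1 factor a1 and stage 2 factor a2, both +/-1.
a1 = rng.choice([-1.0, 1.0], n)
U1 = np.column_stack([np.ones(n), a1])          # saturated model for E[R | A1]
eta_true = np.array([0.5, 0.1])                 # P(R=1 | a1) = 0.5 + 0.1*a1
R = rng.binomial(1, U1 @ eta_true).astype(float)
a2 = rng.choice([-1.0, 1.0], n)

gamma_true = np.array([0.3, 1.0, 0.5, 0.8, -0.4])  # (psi', phi', beta', alpha')

def Z(eta):
    # Columns: (R - U1'eta) * Z3, then Z1, then R * Z2^(1), then (1-R) * Z2^(0).
    return np.column_stack([R - U1 @ eta, np.ones(n), a1, R * a2, (1 - R) * a2])

Y = Z(eta_true) @ gamma_true + rng.normal(size=n)

# Second equation of (14): normal equations of R on U1.
eta_hat = np.linalg.lstsq(U1, R, rcond=None)[0]
# First equation of (14): normal equations of Y on Z(eta_hat).
gamma_hat = np.linalg.lstsq(Z(eta_hat), Y, rcond=None)[0]
print(np.round(gamma_hat, 2))
```

With *n* = 2000 the estimates land close to the generating values, as the consistency argument sketched above predicts.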

Below we show that Σ* _{γ}* is invertible. Again standard arguments can be used to show that

$${\mathrm{\sum}}_{\gamma}\sqrt{N\phantom{\rule{0.16667em}{0ex}}}({\widehat{\gamma}}^{\prime}-{\gamma}^{\prime})=\sqrt{N}({\mathbf{E}}_{N}-E)\phantom{\rule{0.16667em}{0ex}}[\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})]+{o}_{P}(1).$$

where $\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})=\mathbf{Z}(\eta )\phantom{\rule{0.16667em}{0ex}}(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})-{\mathrm{\sum}}_{\gamma ,\eta}{\mathrm{\sum}}_{\eta ,\eta}^{-1}{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\eta )$. Thus the asymptotic variance-covariance matrix of $\sqrt{N}({\widehat{\gamma}}^{\prime}-{\gamma}^{\prime})$ is

$${\mathrm{\sum}}_{\gamma}^{-1}E\left[[\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})]\phantom{\rule{0.38889em}{0ex}}{[\ell ({\gamma}^{\prime},\eta ,{\mathrm{\sum}}_{\gamma ,\eta},{\mathrm{\sum}}_{\eta ,\eta})]}^{T}\right]{\mathrm{\sum}}_{\gamma}^{-1}.$$

A consistent estimator of the asymptotic variance-covariance matrix is obtained by replacing Σ* _{γ}*, Σ* _{γ,η}*, Σ* _{η,η}* and the expectation *E* by their empirical counterparts.

Construct the matrices ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, ${\stackrel{\sim}{\mathbf{Z}}}_{3}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ as in the proof of Lemma 3 above. Note each row is equal to one of the 2^{k}^{−}* ^{m}* equally probable realizations of
$\left[{\mathbf{Z}}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\mathbf{Z}}_{3}^{T},{\left({\mathbf{Z}}_{2}^{(1)}\right)}^{T},{\left({\mathbf{Z}}_{2}^{(0)}\right)}^{T}\right]$. Define

$$\stackrel{\sim}{\mathbf{V}}=\left[\begin{array}{cccc}{\mathbf{D}}_{1-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}& 0\\ {\mathbf{D}}_{-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& 0& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}\end{array}\right]$$

where the 0’s denote conforming matrices with all entries equal to zero. Then ${2}^{k-m}{\mathrm{\sum}}_{\gamma}$ equals

$${\stackrel{\sim}{\mathbf{V}}}^{T}\left[\begin{array}{cc}{\mathbf{D}}_{\mathbf{p}}& 0\\ 0& {\mathbf{D}}_{1-\mathbf{p}}\end{array}\right]\stackrel{\sim}{\mathbf{V}}.$$

We show that $\stackrel{\sim}{\mathbf{V}}$ is of full rank. This combined with the assumption that the response probabilities are bounded away from both 0 and 1 will imply that Σ* _{γ}* is invertible.

Note that $\stackrel{\sim}{\mathbf{V}}$ has no more than $2({2}^{k-m})$ columns (it has $2({2}^{k-m})$ rows).

Next $\stackrel{\sim}{\mathbf{V}}={\stackrel{\sim}{\mathbf{V}}}_{1}+{\stackrel{\sim}{\mathbf{V}}}_{2}$ where

$${\stackrel{\sim}{\mathbf{V}}}_{1}=\left[\begin{array}{cccc}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}& 0\\ 0& {\stackrel{\sim}{\mathbf{Z}}}_{1}& 0& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}\end{array}\right],\phantom{\rule{0.38889em}{0ex}}{\stackrel{\sim}{\mathbf{V}}}_{2}=\left[\begin{array}{cccc}{\mathbf{D}}_{-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& 0& 0& 0\\ {\mathbf{D}}_{-\mathbf{p}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& 0& 0& 0\end{array}\right]$$

${\stackrel{\sim}{\mathbf{V}}}_{1}$ is easily shown to be of full rank, once one identifies the common columns in ${\stackrel{\sim}{\mathbf{Z}}}_{3}$, ${\stackrel{\sim}{\mathbf{Z}}}_{1}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Note by assumption there are no columns common to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ or to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. And there are no columns common to all three of ${\stackrel{\sim}{\mathbf{Z}}}_{1}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$. Thus ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ is composed of columns common with ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ and columns unique to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$. Similarly ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ is composed of columns that are common with ${\stackrel{\sim}{\mathbf{Z}}}_{3}$, columns common with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}$ and columns common with
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$; ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ has no unique columns (if it did, this column should have been added to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ when ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ was constructed). The inner product of a column with itself is $2^{k-m}$ and the inner product of any other two columns is zero. These facts imply that the inverse of
${\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}$ is easily found. Let the first $q_3$ rows of ${2}^{k-m}{\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}$ be

$$\left[\begin{array}{cccc}2{I}_{{q}_{3c}\times {q}_{3c}}& 0& -{I}_{{q}_{3c}\times {q}_{3c}}& 0\\ 0& {I}_{{q}_{3u}\times {q}_{3u}}& 0& 0\end{array}\right]$$

where *q*_{3}* _{c}* is the number of columns common to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ and ${\stackrel{\sim}{\mathbf{Z}}}_{1}$ and $q_{3u}$ is the number of columns unique to ${\stackrel{\sim}{\mathbf{Z}}}_{3}$ (so $q_3 = q_{3c}+q_{3u}$).

Next, because ${\stackrel{\sim}{\mathbf{V}}}_{1}$ is of full rank and square (thus invertible), we can write

$$\stackrel{\sim}{\mathbf{V}}={\stackrel{\sim}{\mathbf{V}}}_{1}\left({I}_{2({2}^{k-m})\times 2({2}^{k-m})}+{\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}{\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{2}\right).$$

Using the formula for the first *q*_{3} rows of
${\left({\stackrel{\sim}{\mathbf{V}}}_{1}^{T}{\stackrel{\sim}{\mathbf{V}}}_{1}\right)}^{-1}$ we find that the term in parentheses is proportional to a matrix of the form

$$\left[\begin{array}{cc}{I}_{{q}_{3}\times {q}_{3}}& 0\\ \mathbf{U}& {I}_{2({2}^{k-m})-{q}_{3}\times 2({2}^{k-m})-{q}_{3}}\end{array}\right]$$

where **U** is of dimension $(2({2}^{k-m})-{q}_{3})\times {q}_{3}$. This matrix is lower triangular with ones on the diagonal, hence invertible, and so $\stackrel{\sim}{\mathbf{V}}$ is of full rank.

First we derive the semiparametric efficient score. Then we discuss how this score can be used to produce a locally semiparametric efficient estimator of *γ*′. Lastly we prove that in the saturated model the estimators are semiparametric efficient and that under an additional assumption on the variances (this assumption is provided in Lemma 2) certain unsaturated models will also yield semiparametric efficient estimators.

The semiparametric model is given by

$$Y=\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}+\epsilon $$

where *E*[*ε*|*A*_{1}, *R*, *A*_{2}] = 0, *R* has a Bernoulli distribution with success rate
$E[R\mid {A}_{1},{A}_{2}]=E[R\mid {A}_{1}]={\mathbf{U}}_{1}^{T}\eta $, and *γ*′ belongs to ${R}^{{q}_{\gamma}}$. The likelihood of a single observation is proportional to

$${f}_{\epsilon}(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}\mid {A}_{1},R,{A}_{2})\phantom{\rule{0.16667em}{0ex}}{({\mathbf{U}}_{1}^{T}\eta )}^{R}{(1-{\mathbf{U}}_{1}^{T}\eta )}^{1-R}\prod _{i=1}^{{2}^{k-m}}{p}_{i}^{1\{({A}_{1},{A}_{2})=({a}_{1i},{a}_{2i})\}}.$$

The nuisance parameters are the densities ${f}_{\epsilon}(\cdot \mid {a}_{1},r,{a}_{2})$ and the probabilities $\{{p}_{i}\}$.

To derive the efficient score function for {*γ*′, *η*} we calculate the score vector for {*γ*′, *η*} and then form the projection of this score vector on the nuisance tangent space (see Theorem 4.1 of Tsiatis (2006)). The residual is the efficient score. Let $\mathcal{H}$ be the Hilbert space of *q*-dimensional, mean zero, finite variance functions of {*A*_{1}, *R*, *A*_{2}, *Y*} equipped with the inner product
$<{h}_{1},{h}_{2}>=E[{h}_{1}^{T}{h}_{2}]$. The tangent space for a particular parameter is the subset of $\mathcal{H}$ formed by the mean square closure of all functions that are score functions for the parameter in a parametric submodel containing the true value of that parameter. Define

$$\begin{array}{l}{\mathrm{\Lambda}}_{1}=\{{h}_{1}(\epsilon ,{a}_{1},r,{a}_{2})\in \mathcal{H}:E[{h}_{1}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]=0\}\\ {\mathrm{\Lambda}}_{2}=\{{h}_{2}(\epsilon ,{a}_{1},r,{a}_{2})\in \mathcal{H}:E[\epsilon {h}_{2}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]=0\}\\ {\mathrm{\Lambda}}_{3}=\{{h}_{3}({a}_{1},r,{a}_{2})\in \mathcal{H}:E[{h}_{3}({A}_{1},R,{A}_{2})\mid {A}_{1},{A}_{2}]=0\}\\ {\mathrm{\Lambda}}_{4}=\{{h}_{4}({a}_{1},{a}_{2})\in \mathcal{H}:E[{h}_{4}({A}_{1},{A}_{2})]=0\}\end{array}$$

From Tsiatis (Section 4.5, 2006) we have that the nuisance tangent space for the densities ${f}_{\epsilon}(\cdot \mid {a}_{1},r,{a}_{2})$ and the probabilities $\{{p}_{i}\}$ is

$$\mathrm{\Lambda}=({\mathrm{\Lambda}}_{1}\cap {\mathrm{\Lambda}}_{2})\oplus {\mathrm{\Lambda}}_{4}.$$

Second, Tsiatis (Section 4.5, 2006) proves that

$$({\mathrm{\Lambda}}_{1}\cap {\mathrm{\Lambda}}_{2})\oplus {\mathrm{\Lambda}}_{3}\oplus {\mathrm{\Lambda}}_{4}={\mathrm{\Lambda}}_{2}.$$

Since
$\mathcal{H}={\mathrm{\Lambda}}_{2}\oplus {\mathrm{\Lambda}}_{2}^{\perp}$, we have that the nuisance tangent space, $\mathrm{\Lambda}=({\mathrm{\Lambda}}_{1}\cap {\mathrm{\Lambda}}_{2})\oplus {\mathrm{\Lambda}}_{4}$, is the subset of Λ_{2} which is orthogonal to Λ_{3}, that is,
$\mathrm{\Lambda}={\mathrm{\Lambda}}_{2}\cap {\mathrm{\Lambda}}_{3}^{\perp}$. It follows that ${\mathrm{\Lambda}}^{\perp}$ is the direct sum of
${\mathrm{\Lambda}}_{2}^{\perp}$ and the part of Λ_{2} that is contained in Λ_{3}. Since ${\mathrm{\Lambda}}_{3}\subset {\mathrm{\Lambda}}_{2}$ we have
${\mathrm{\Lambda}}^{\perp}={\mathrm{\Lambda}}_{2}^{\perp}\oplus {\mathrm{\Lambda}}_{3}$.

Recall the efficient score is the residual of the projection of the score vector for {*γ*′, *η*} onto the nuisance tangent space, Λ. Equivalently the efficient score is the projection of the score vector for {*γ*′, *η*} onto ${\mathrm{\Lambda}}^{\perp}$. In this case the latter is easier to compute. Tsiatis (Section 4.5, 2006) proves that

$${\mathrm{\Lambda}}_{2}^{\perp}=\{\epsilon g({a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}\forall \phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}\text{arbitrary}\phantom{\rule{0.16667em}{0ex}}q-\text{dimensional}\phantom{\rule{0.16667em}{0ex}}\text{functions}\phantom{\rule{0.16667em}{0ex}}\text{of}\phantom{\rule{0.16667em}{0ex}}\{{a}_{1},r,{a}_{2}\}\}.$$

Because *R* is a binary random variable, members of Λ_{3} also have a simple form: *h*_{3}(*A*_{1}*, R, A*_{2}) = *g*′(*A*_{1}*, A*_{2})(*R*−*E*[*R*|*A*_{1}*, A*_{2}]) for *g*′, a *q*-dimension function of {*A*_{1}*, A*_{2}}.

We have that the efficient score must be of the form, *εg*(*A*_{1}, *R*, *A*_{2})+*g*′(*A*_{1}, *A*_{2})(*R*− *E*[*R*|*A*_{1}, *A*_{2}]) where both *g* and *g*′ are *q*-dimensional. Assume that *E*[*ε*^{2}|*A*_{1}, *R*, *A*_{2}] is positive. Then it is easy to see that the projection of an *h*(*ε*, *A*_{1}, *R*, *A*_{2}) onto
${\mathrm{\Lambda}}_{2}^{\perp}$ is

$$\mathrm{\Pi}\phantom{\rule{0.16667em}{0ex}}(h(\epsilon ,{A}_{1},R,{A}_{2})\mid {\mathrm{\Lambda}}_{2}^{\perp})=E[h\epsilon \mid {A}_{1},R,{A}_{2}]\frac{\epsilon}{E[{\epsilon}^{2}\mid {A}_{1},R,{A}_{2}]}.$$

Similarly, assuming that $E[R\mid {A}_{1},{A}_{2}]\in (0,1)$,

$$\mathrm{\Pi}\phantom{\rule{0.16667em}{0ex}}(h(\epsilon ,{A}_{1},R,{A}_{2})\mid {\mathrm{\Lambda}}_{3})=E[h(R-E[R\mid {A}_{1},{A}_{2}])\mid {A}_{1},{A}_{2}]\frac{R-E[R\mid {A}_{1},{A}_{2}]}{E[R\mid {A}_{1},{A}_{2}](1-E[R\mid {A}_{1},{A}_{2}])}.$$
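The projection onto Λ_{3} can be checked exactly for a single (*a*_{1}, *a*_{2}) cell, since *R* is binary and expectations reduce to two-term sums; the success probability and the function *h* below are arbitrary illustrations.

```python
# Exact check of the projection onto Lambda_3 within one (a1, a2) cell.
p = 0.3                                     # E[R | a1, a2], arbitrary
h = {0: 2.0, 1: -1.5}                       # an arbitrary square-integrable h(r)

E = lambda f: (1 - p) * f(0) + p * f(1)     # expectation over R ~ Bernoulli(p)

num = E(lambda r: h[r] * (r - p))           # E[h * (R - E[R])]
proj = lambda r: num * (r - p) / (p * (1 - p))
resid = lambda r: h[r] - proj(r)

# The residual h - proj(h) is orthogonal to every element g*(R - p) of Lambda_3.
print(E(lambda r: resid(r) * (r - p)))      # ~0 up to floating point error
```

Indeed the residual is constant in *r* (it equals *E*[*h*]), which is exactly orthogonality to Λ_{3} within the cell.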

If we denote the scores for *γ*′ and *η* by ${S}_{{\gamma}^{\prime}}$ and ${S}_{\eta}$ respectively, then note the identities

$$\begin{array}{l}1=\int {f}_{\epsilon}(y-z{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}dy\\ 0=\int (y-z{(\eta )}^{T}{\gamma}^{\prime}){f}_{\epsilon}(y-z{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}dy\\ 0=\sum _{r=0,1}(r-{u}_{1}^{T}\eta )\phantom{\rule{0.16667em}{0ex}}{({u}_{1}^{T}\eta )}^{r}\phantom{\rule{0.16667em}{0ex}}{(1-{u}_{1}^{T}\eta )}^{1-r}\end{array}$$

for all values of *a*_{1}, *r*, *a*_{2}, *η*, *γ*′ (recall *u* and *z* are functions of these). Differentiating both sides of the first two equations with respect to *γ*′ we obtain

$$\begin{array}{l}0=\int {S}_{{\gamma}^{\prime}}(y-z{(\eta )}^{T}{\gamma}^{\prime},{a}_{1},r,{a}_{2}){f}_{\epsilon}(y-z{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}dy\\ 0=-z(\eta )+\int (y-z{(\eta )}^{T}{\gamma}^{\prime}){S}_{{\gamma}^{\prime}}(y-z{(\eta )}^{T}{\gamma}^{\prime},{a}_{1},r,{a}_{2}){f}_{\epsilon}(y-z{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}dy.\end{array}$$

That is, $0=E[{S}_{{\gamma}^{\prime}}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]$ and $\mathbf{Z}(\eta )=E[\epsilon {S}_{{\gamma}^{\prime}}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]$. Similarly, differentiating with respect to *η* we obtain

$$\begin{array}{l}0=({z}_{3}^{T}{\psi}^{\prime}){u}_{1}+\int (y-z{(\eta )}^{T}{\gamma}^{\prime}){S}_{\eta}(y-z{(\eta )}^{T}{\gamma}^{\prime},{a}_{1},r,{a}_{2}){f}_{\epsilon}(y-z{(\eta )}^{T}{\gamma}^{\prime}\mid {a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}dy\\ 0=-{u}_{1}+\sum _{r=0,1}(r-{u}_{1}^{T}\eta ){S}_{\eta}({a}_{1},r,{a}_{2})\phantom{\rule{0.16667em}{0ex}}{({u}_{1}^{T}\eta )}^{r}\phantom{\rule{0.16667em}{0ex}}{(1-{u}_{1}^{T}\eta )}^{1-r}.\end{array}$$

That is,
$-({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\mathbf{U}}_{1}=E[\epsilon {S}_{\eta}(\epsilon ,{A}_{1},R,{A}_{2})\mid {A}_{1},R,{A}_{2}]$ and
${\mathbf{U}}_{1}=E[(R-{\mathbf{U}}_{1}^{T}\eta ){S}_{\eta}({A}_{1},R,{A}_{2})\mid {A}_{1},{A}_{2}]$. Putting these results together we obtain the efficient score for (*γ*′, *η*):

$${S}_{\mathit{eff}}(Y,{A}_{1},R,{A}_{2};{\gamma}^{\prime},\eta ,\sigma )=\left(\begin{array}{c}\mathbf{Z}(\eta ){\scriptstyle \frac{Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}\\ -({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}+{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\eta}{{\mathbf{U}}_{1}^{T}\eta (1-{\mathbf{U}}_{1}^{T}\eta )}}\end{array}\right)$$

where
${\sigma}_{{A}_{1},R,{A}_{2}}^{2}=E[{(Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime})}^{2}\mid {A}_{1},R,{A}_{2}]$. The variance-covariance matrix of ${S}_{\mathit{eff}}$, denoted ${\mathrm{\sum}}_{\mathit{eff}}$, is

$$\left(\begin{array}{cc}E[\mathbf{Z}(\eta ){\sigma}_{{A}_{1},R,{A}_{2}}^{-2}\mathbf{Z}{(\eta )}^{T}]& -E\phantom{\rule{0.16667em}{0ex}}[\mathbf{Z}(\eta ){\sigma}_{{A}_{1},R,{A}_{2}}^{-2}({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\mathbf{U}}_{1}^{T}]\\ -E[{\mathbf{U}}_{1}({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\sigma}_{{A}_{1},R,{A}_{2}}^{-2}\mathbf{Z}{(\eta )}^{T}]& E[{\mathbf{U}}_{1}({({\mathbf{U}}_{1}^{T}\eta (1-{\mathbf{U}}_{1}^{T}\eta ))}^{-1}+{({\mathbf{Z}}_{3}^{T}{\psi}^{\prime})}^{2}{\sigma}_{{A}_{1},R,{A}_{2}}^{-2})\phantom{\rule{0.16667em}{0ex}}{\mathbf{U}}_{1}^{T}]\end{array}\right)$$

Although we did not show this above, the smoothness assumptions imply that the estimators $\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}$ are regular, asymptotically linear estimators (see Tsiatis (2006) for definitions). If ${\mathrm{\sum}}_{\mathit{eff}}$ were singular then no regular, asymptotically linear estimator could exist (Chamberlain, 1986). Thus ${\mathrm{\sum}}_{\mathit{eff}}$ is invertible.

To construct a semiparametric efficient estimator we employ the efficient score, *S _{eff}*, as follows. As before, we assume the model for the conditional distribution of *R* is saturated. Solve

$$\begin{array}{l}0={\mathbf{E}}_{N}\left[\mathbf{Z}(\eta )\frac{Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}\right]\\ 0={\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\psi}^{\prime}){\mathbf{U}}_{1}\frac{Y-\mathbf{Z}{(\eta )}^{T}{\gamma}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}+{\mathbf{U}}_{1}\frac{R-{\mathbf{U}}_{1}^{T}\eta}{{\mathbf{U}}_{1}^{T}\eta (1-{\mathbf{U}}_{1}^{T}\eta )}\right]\end{array}$$

(15)

for *γ*′, *η* to obtain
${\widehat{\gamma}}_{\mathit{eff}}^{\prime}$, ${\widehat{\eta}}_{\mathit{eff}}$. Standard arguments can be used to show that

$$\begin{array}{l}{\mathrm{\sum}}_{\mathit{eff}}\sqrt{N}\left({({({\widehat{\gamma}}_{\mathit{eff}}^{\prime})}^{T},{\widehat{\eta}}_{\mathit{eff}}^{T})}^{T}-{({({\gamma}^{\prime})}^{T},{\eta}^{T})}^{T}\right)\\ =\sqrt{N}({\mathbf{E}}_{N}-E)\phantom{\rule{0.16667em}{0ex}}[{S}_{\mathit{eff}}(Y,{A}_{1},R,{A}_{2};{\gamma}^{\prime},\eta ,\sigma )]+{o}_{P}(1)\end{array}$$

as *N* → ∞.
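Holding *η* at $\widehat{\eta}$, the first equation of (15) is the normal equation of a weighted least squares fit with weights $1/{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}$. A minimal sketch of that update; the design, the variance pattern, and all parameter values are hypothetical, and for simplicity the true cell variances stand in for the estimates ${\widehat{\sigma}}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000
# Hypothetical design: one stage 1 factor a1 and one stage 2 factor a2, both +/-1.
a1 = rng.choice([-1.0, 1.0], n)
U1 = np.column_stack([np.ones(n), a1])
R = rng.binomial(1, 0.5 + 0.2 * a1).astype(float)
a2 = rng.choice([-1.0, 1.0], n)

eta_hat = np.linalg.lstsq(U1, R, rcond=None)[0]
Z = np.column_stack([R - U1 @ eta_hat, np.ones(n), a1, R * a2, (1 - R) * a2])

sigma2 = np.where(R == 1, 1.0, 4.0)        # residual variance differs by response status
gamma_true = np.array([0.3, 1.0, 0.5, 0.8, -0.4])
Y = Z @ gamma_true + rng.normal(scale=np.sqrt(sigma2))

# Weighted least squares with weights 1/sigma^2: the gamma' block of (15).
W = 1.0 / sigma2
gamma_w = np.linalg.solve(Z.T @ (Z * W[:, None]), Z.T @ (W * Y))
print(np.round(gamma_w, 2))
```

In a full implementation the two equations of (15) are solved jointly, with $\widehat{\sigma}^{2}$ estimated from residuals; this sketch isolates the weighting step only.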

In general the model (12) should be saturated; this is because each predictor associated with an effect (or one of its aliases) will occur in two vectors: in **Z**_{1} and **Z**_{3}, or in **Z**_{1} and
${\mathbf{Z}}_{2}^{(1)}$, or in **Z**_{1} and
${\mathbf{Z}}_{2}^{(0)}$, or in
${\mathbf{Z}}_{2}^{(1)}$ and
${\mathbf{Z}}_{2}^{(0)}$. (Similarly (9) will generally be saturated as well.) An exception to this pattern (in which case the model will not be saturated) occurs if there are unnecessary formal assumptions; that is, both assumptions 3a) and 3b) are made concerning the effects associated with a particular column in Lemma 3 rather than just one of these. In this case the “Algorithm for Constructing the Regressors for Use in the Screening Analysis” of Section 4.2 will remove too many predictors from **Z**_{3}.

Assuming that no unnecessary formal assumptions are made, the only other reason why (12) (or (9)) will not be a saturated model is if there is a stage 2 factor for only responders (or only non-responders) and the design specifies that this factor’s levels are crossed with the levels of the remaining factors (i.e. this factor is independent of the remaining factors). In this latter case effects involving this stage 2 factor can only be aliased with other effects involving this stage 2 factor. As a result each predictor associated with effects of this stage 2 factor (or one of its aliases) can only be included once and only in ${\mathbf{Z}}_{2}^{(1)}$. However under the assumption that $P[\mathit{Var}(Y\mid {A}_{1},R,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R,{A}_{2}^{(R)})]=1$ we show that in this non-saturated case, $\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}$ is locally semiparametric efficient. The adjective “locally” is used because the semiparametric efficiency holds if this assumption holds and otherwise not.

First we prove that $\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}$ is semiparametric efficient when (12) is a saturated model. For simplicity suppose there are stage 2 factors for both responders and non-responders. Assume *P*[0 < *E*[*R*|*A*_{1}] < 1] = 1 and *P*[*Var*(*Y*|*A*_{1}, *R*, *A*_{2}) > 0] = 1. This means that
$P[\{{min}_{{a}_{1},r,{a}_{2}}{\widehat{\sigma}}_{{a}_{1},r,{a}_{2}}>0\}\cap \{{min}_{{u}_{1}}\mid {u}_{1}^{T}\widehat{\eta}\mid \phantom{\rule{0.16667em}{0ex}}\mid 1-{u}_{1}^{T}\widehat{\eta}\mid >0\}]\to 1$ as *N* → ∞. Thus we assume that in the sample the proportion of responders in each of the $2^{k-m}$ groups is neither one nor zero. In this case we prove that
$\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}=\{{\widehat{\gamma}}_{\mathit{eff}}^{\prime},{\widehat{\eta}}_{\mathit{eff}}\}$.

First since
${\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]=0$ we have that
${\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\widehat{\eta}}{{\mathbf{U}}_{1}^{T}\widehat{\eta}(1-{\mathbf{U}}_{1}^{T}\widehat{\eta})}}\right]$ is zero as well. To see this suppose that there are *ν* unique values of **U**_{1} in the design (*ν* ≤2* ^{k−m}*). Then,

$$\begin{array}{l}{\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]=\sum _{i=1}^{\nu}{\mathbf{E}}_{N}[{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}{\mathbf{u}}_{1i}(R-{\mathbf{u}}_{1i}^{T}\widehat{\eta})]\\ =\sum _{i=1}^{\nu}{\mathbf{E}}_{N}[{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]\phantom{\rule{0.16667em}{0ex}}{\mathbf{u}}_{1i}\left(\frac{{\mathbf{E}}_{N}[R{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]}{{\mathbf{E}}_{N}[{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]}-{\mathbf{u}}_{1i}^{T}\widehat{\eta}\right)\\ ={\stackrel{\sim}{\mathbf{U}}}_{1}^{T}\mathbf{D}\left(\overline{\mathbf{R}}-{\stackrel{\sim}{\mathbf{U}}}_{1}\widehat{\eta}\right)\end{array}$$

where **Ũ**_{1} is a *ν* × *ν* matrix with rows corresponding to the *ν* realizations of **U**_{1}, **D** is a diagonal matrix with *i*th entry equal to ${\mathbf{E}}_{N}[{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]$, and $\overline{\mathbf{R}}$ is the vector with *i*th entry ${\mathbf{E}}_{N}[R{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]/{\mathbf{E}}_{N}[{1}_{{\mathbf{U}}_{1}={\mathbf{u}}_{1i}}]$. Since the model for $E[R\mid {A}_{1}]$ is saturated, ${\mathbf{u}}_{1i}^{T}\widehat{\eta}$ equals the observed response rate for the rows consistent with ${\mathbf{u}}_{1i}$, so $\overline{\mathbf{R}}-{\stackrel{\sim}{\mathbf{U}}}_{1}\widehat{\eta}=0$.
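The key fact used here, that a saturated fit of *R* on **U**_{1} reproduces the observed response rate in each group, can be verified in a short simulation; the single-factor design and the response probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
a1 = rng.choice([-1.0, 1.0], n)
U1 = np.column_stack([np.ones(n), a1])   # saturated in the single stage 1 factor
R = rng.binomial(1, 0.5 + 0.2 * a1).astype(float)

eta = np.linalg.lstsq(U1, R, rcond=None)[0]
for v in (-1.0, 1.0):
    fitted = np.array([1.0, v]) @ eta
    print(np.isclose(fitted, R[a1 == v].mean()))  # prints True for each group
```

Because the fitted values equal the group means exactly, the residual $\overline{\mathbf{R}}-{\stackrel{\sim}{\mathbf{U}}}_{1}\widehat{\eta}$ vanishes, with or without inverse-variance weighting.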

Next, since ${\mathbf{E}}_{N}[\mathbf{Z}(\widehat{\eta})(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})]=0$, both
${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta}){\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ and
${\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\sigma}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ will be zero as well. To see this we use a matrix formulation. First let

$$\widehat{\mathbf{V}}=\left[\begin{array}{cccc}{\mathbf{D}}_{1-\widehat{\mathbf{p}}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)}& 0\\ {\mathbf{D}}_{-\widehat{\mathbf{p}}}{\stackrel{\sim}{\mathbf{Z}}}_{3}& {\stackrel{\sim}{\mathbf{Z}}}_{1}& 0& {\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}\end{array}\right]$$

where the 0’s denote conforming matrices with all entries equal to zero and ${\stackrel{\sim}{\mathbf{Z}}}_{1}$, ${\stackrel{\sim}{\mathbf{Z}}}_{3}$,
${\stackrel{\sim}{\mathbf{Z}}}_{2}^{(1)},{\stackrel{\sim}{\mathbf{Z}}}_{2}^{(0)}$ were defined in the proof of Lemma 4 above. Since the model is saturated, $\widehat{\mathbf{V}}$ is a square matrix of dimension $2({2}^{k-m})$. Using the same proof as used in Lemma 4 to show that $\stackrel{\sim}{\mathbf{V}}$ is full rank, we can show that $\widehat{\mathbf{V}}$ is full rank as well (and hence invertible). Define ${\mathbf{D}}_{1}$ (${\mathbf{D}}_{0}$) to be the diagonal matrix whose *i*th entry is the sample proportion of subjects in row *i* who are responders (non-responders), and let ${\mathbf{E}}_{N}[{\mathbf{Y}}_{1}]$ (${\mathbf{E}}_{N}[{\mathbf{Y}}_{0}]$) denote the vector of sample mean outcomes among responders (non-responders) in each row.

We can write $0={\mathbf{E}}_{N}[\mathbf{Z}(\widehat{\eta})(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})]$ as

$$0={\widehat{\mathbf{V}}}^{T}\left(\begin{array}{cc}{\mathbf{D}}_{1}& 0\\ 0& {\mathbf{D}}_{0}\end{array}\right)\left(\left(\begin{array}{c}{\mathbf{E}}_{N}[{\mathbf{Y}}_{1}]\\ {\mathbf{E}}_{N}[{\mathbf{Y}}_{0}]\end{array}\right)-\widehat{\mathbf{V}}{\widehat{\gamma}}^{\prime}\right).$$

(16)

Thus in the saturated model, ${\widehat{\gamma}}^{\prime}$ satisfies

$$0=\left(\begin{array}{c}{\mathbf{E}}_{N}[{\mathbf{Y}}_{1}]\\ {\mathbf{E}}_{N}[{\mathbf{Y}}_{0}]\end{array}\right)-\widehat{\mathbf{V}}{\widehat{\gamma}}^{\prime}.$$

Since we can write
${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta}){\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ and
${\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}{\scriptstyle \frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}}}\right]$ as matrices times the above difference, these two terms are zero as well. Combining the above with the results for $\widehat{\eta}$, we have that when the model is saturated and the sample is sufficiently large so that the response rates in each of the $2^{k-m}$ groups are neither zero nor one,
${\widehat{\gamma}}^{\prime}={\widehat{\gamma}}_{\mathit{eff}}^{\prime}$ and $\widehat{\eta}={\widehat{\eta}}_{\mathit{eff}}$.

Next we prove that in the non-saturated model, $\{{\widehat{\gamma}}^{\prime},\widehat{\eta}\}$ will be locally semiparametric efficient if 1) no unnecessary formal assumptions are made (that is, only one of assumptions 3a and 3b, not both, is made concerning the effects associated with a particular column in Lemma 3) and 2) there is a stage 2 factor for responders (or non-responders) that is independent of the remaining factors. Assume that
$\mathit{Var}(Y\mid {A}_{1},R=r,{A}_{2})=\mathit{Var}(Y\mid {A}_{1},R=r,{A}_{2}^{(r)})$ for *r* = 0, 1. Under these assumptions on the residual variance we estimate ${\sigma}_{{A}_{1},1,{A}_{2}^{(1)}}^{2}$ by the sample variance of the residual $(Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime})$ for responders for each combination of groups (rows) in the design with a unique value of $\{{A}_{1},{A}_{2}^{(1)}\}$, and similarly for non-responders.

As in the above both
${\mathbf{E}}_{N}\left[{\mathbf{U}}_{1}{\scriptstyle \frac{R-{\mathbf{U}}_{1}^{T}\widehat{\eta}}{{\mathbf{U}}_{1}^{T}\widehat{\eta}(1-{\mathbf{U}}_{1}^{T}\widehat{\eta})}}\right]$ and
${\mathbf{E}}_{N}[{\mathbf{U}}_{1}(R-{\mathbf{U}}_{1}^{T}\widehat{\eta})]$ are identically zero. Next consider the first equation in (14) and (15) (in the latter ${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}$ is replaced by ${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}$).

For clarity, assume there is only one more stage 2 factor for responders than non-responders and that this stage 2 factor is randomized independently of all other factors. This means that while there are $2^{k-m}$ unique realizations of $\{{A}_{1},{A}_{2}^{(1)}\}$ there are only $2^{k-m-1}$ unique realizations of $\{{A}_{1},{A}_{2}^{(0)}\}$. Now (16) continues to hold but, although a full rank matrix, $\widehat{\mathbf{V}}$ is no longer square; its dimensions are $2({2}^{k-m})\times ({2}^{k-m}+{2}^{k-m-1})$. Collapsing the duplicated non-responder rows yields a square, invertible matrix ${\widehat{\mathbf{V}}}^{\prime}$ and

$$0={\left({\widehat{\mathbf{V}}}^{\prime}\right)}^{T}\left(\begin{array}{cc}{\mathbf{D}}_{1}& 0\\ 0& {\mathbf{D}}_{0}^{\prime}\end{array}\right)\left(\left(\begin{array}{c}{\mathbf{E}}_{N}[{\mathbf{Y}}_{1}]\\ {\mathbf{E}}_{N}{[{\mathbf{Y}}_{0}]}^{\prime}\end{array}\right)-{\widehat{\mathbf{V}}}^{\prime}{\widehat{\gamma}}^{\prime}\right).$$

As before, the full column rank of ${\widehat{\mathbf{V}}}^{\prime}$ implies that the last term in parentheses is equal to zero. Because we can express both

$${\mathbf{E}}_{N}\left[\mathbf{Z}(\widehat{\eta})\frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}}\right]\phantom{\rule{0.38889em}{0ex}}\text{and}\phantom{\rule{0.16667em}{0ex}}{\mathbf{E}}_{N}\left[-({\mathbf{Z}}_{3}^{T}{\widehat{\psi}}^{\prime}){\mathbf{U}}_{1}\frac{Y-\mathbf{Z}{(\widehat{\eta})}^{T}{\widehat{\gamma}}^{\prime}}{{\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}}\right]\phantom{\rule{0.16667em}{0ex}}$$

as matrices times the above difference, these two terms are zero as well. Note that we would not have been able to do this had we used the estimator
${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}^{(R)}}^{2}$ instead of
${\widehat{\sigma}}_{{A}_{1},R,{A}_{2}}^{2}$ in (15). Combining the above with the results for $\widehat{\eta}$, we have that when the sample is sufficiently large so that the response rates in each of the groups defined by unique values of {*A*_{1},
${A}_{2}^{(1)}$} and {*A*_{1},
${A}_{2}^{(0)}$} are neither zero nor one, then
${\widehat{\gamma}}^{\prime}={\widehat{\gamma}}_{\mathit{eff}}^{\prime}$ and ${\widehat{\psi}}^{\prime}={\widehat{\psi}}_{\mathit{eff}}^{\prime}$.

Here we provide a simple example of how the aliasing becomes difficult to interpret when the formal assumptions do not hold. Suppose there are two factors at stage 1 and one stage 2 factor for responders. We are interested in the main effects of these factors and we are willing to make the working assumption that there are no interactions. Suppose we are willing to make the formal assumption that all effects of *R* (including interactions between *R* and the stage 1 factors) are negligible. We use the experimental design in Table 2 with defining word *I* = *A*_{11}*A*_{12}*A*_{2} (*A*_{2} is
${A}_{2}^{(0)}$ here).
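
The aliasing implied by this defining word can be checked by direct enumeration; the following small sketch is ours, not part of the paper's methodology:

```python
import itertools

# Enumerate the 2^(3-1) half-fraction generated by the defining word,
# i.e. every run satisfies A2 = A11 * A12, with levels coded -1/+1.
design = [(a11, a12, a11 * a12)
          for a11, a12 in itertools.product([-1, 1], repeat=2)]

# The A2 main-effect column coincides with the A11 x A12 interaction
# column in all four runs, so these two effects are aliased and cannot
# be separated from data generated by this fraction.
a2_column = [a2 for _, _, a2 in design]
interaction_column = [a11 * a12 for a11, a12, _ in design]
assert a2_column == interaction_column
```

With the formal assumptions in force, this aliasing is harmless; the example below shows what is identified when they fail.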

The conditional mean of *Y* given *A*_{1}, *R*, *A*_{2} is given by (12). Suppose that it is known that *ψ*_{j} = 0 for *j* > 1; then

$$\begin{array}{l}E[Y\mid {A}_{11}={a}_{11},{A}_{12}={a}_{12},R=r,{A}_{2}={a}_{11}{a}_{12}]=\\ ({\phi}_{1}+{\psi}_{1})+({\phi}_{2}+{\eta}_{2}{\psi}_{1}){a}_{11}\\ +({\phi}_{3}+{\eta}_{3}{\psi}_{1}){a}_{12}+({\phi}_{4}+{\eta}_{4}{\psi}_{1}){a}_{11}{a}_{12}\\ +({\psi}_{1}+{\beta}_{1}{a}_{11}{a}_{12}+{\beta}_{2}{a}_{12}+{\beta}_{3}{a}_{11}+{\beta}_{4})r\end{array}$$

for *a*_{11} ∈ {−1, 1}, *a*_{12} ∈ {−1, 1} and *r* ∈ {0, 1}. So from data resulting from the experimental design in Table 2 we can identify the eight quantities: *φ*_{1} + *ψ*_{1}, *φ*_{2} + *η*_{2}*ψ*_{1}, *φ*_{3} + *η*_{3}*ψ*_{1}, *φ*_{4} + *η*_{4}*ψ*_{1}, *ψ*_{1} + *β*_{4}, *β*_{1}, *β*_{2} and *β*_{3}. The above nonlinearity in the parameters makes it difficult to interpret the identifiable quantities involving the *φ*_{j}'s. More complex designs result in similarly uninterpretable quantities. In summary, to maintain interpretability, we consider experimental designs that

S. A. Murphy, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, Email: samurphy@umich.edu.

D. Bingham, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6, Email: dbingham@stat.sfu.ca.

1. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley & Sons; 1978.

2. Byar DP, Piantadosi S. Factorial designs for randomized clinical trials. Cancer Treat Rep. 1985 Oct;69(10):1055–63. [PubMed]

3. Brooner RK, Kidorf M. Using behavioral reinforcement to improve methadone treatment participation. Science and Practice Perspectives. 2002;1:38–48. [PMC free article] [PubMed]

4. Chamberlain G. Asymptotic efficiency in semi-parametric models with censoring. Journal of Econometrics. 1986;32:189–218.

5. Collins LM, Murphy SA, Strecher V. The Multiphase Optimization Strategy (MOST) and the Sequential Multiple Assignment Randomized Trial (SMART): New Methods for More Potent e-Health Interventions. American Journal of Preventive Medicine. 2007;32(5S):S112–118. [PMC free article] [PubMed]

6. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ. Background and Rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) Study. Psychiatric Clinics of North America. 2003;26(3):457–494. [PubMed]

7. Fries A, Hunter WG. Minimum Aberration 2^{k−p} Designs. Technometrics. 1980;22:601–608.

8. Hu F, Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Hoboken, NJ: John Wiley & Sons, Inc; 2006.

9. ICH E9. ICH Harmonised Tripartite Guideline, Statistical Principles for Clinical Trials. Statistics in Medicine. 1999;18:1905–1942. [PubMed]

10. Lavori PW, Dawson R, Rush AJ. Flexible treatment strategies in chronic disease: Clinical and research implications. Biological Psychiatry. 2000;48:605–614. [PubMed]

11. Montgomery DC, Jennings CL. An Overview of Industrial Screening Experiments. In: Dean A, Lewis S, editors. Screening Methods for Experimentation in Industry, Drug Discovery, and Genetics. New York: Springer; 2006.

12. Murphy SA. Optimal Dynamic Treatment Regimes. Journal of the Royal Statistical Society, Series B (with discussion). 2003;65(2):331–366.

13. Murphy SA. An Experimental Design for the Development of Adaptive Treatment Strategies. Statistics in Medicine. 2005;24:1455–1481. [PubMed]

14. Murphy SA, van der Laan MJ, Robins JM, CPPRG. Marginal mean models for dynamic regimes. Journal of the American Statistical Association. 2001;96:1410–1423. [PMC free article] [PubMed]

15. Newey WK. Semiparametric efficiency bounds. Journal of Applied Econometrics. 1990;5(2):99–135.

16. Parmigiani G. Modeling in Medical Decision Making: A Bayesian Approach. Wiley & Sons, Ltd; England: 2002.

17. Patel MS. Group-Screening with More Than Two Stages. Technometrics. 1962;4(2):209–217.

18. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7:1393–1512.

19. Robins JM. Addendum to “A new approach to causal inference in mortality studies with sustained exposure periods - Application to control of the healthy worker survivor effect” Computers and Mathematics with Applications. 1987;14:923–945.

20. Robins JM. Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika. 1992;79:321–34.

21. Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent Variable Modeling and Applications to Causality. Lecture Notes in Statistics, Vol. 120. New York: Springer-Verlag; 1997. pp. 69–117.

22. Robins JM, Greenland S. Adjusting for differential rates of PCP prophylaxis in high- versus low-dose AZT treatment arms in an AIDS randomized trial. Journal of the American Statistical Association. 1994;89:737–749.

23. Robins JM, Wasserman L. Estimation of effects of sequential treatments by reparameterizing directed acyclic graphs. In: Geiger D, Shenoy P, editors. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann; 1997. pp. 409–42.

24. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004.

25. Rubin DB. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology. 1974;66(5):688–701.

26. Rubin DB. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics. 1978;6:34–58.

27. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.

28. Wu CFJ, Hamada M. Experiments: Planning, Analysis, and Parameter Design Optimization. New York: John Wiley & Sons, Inc; 2000.

29. Zacks S. Adaptive Designs for Parametric Models. In: Ghosh S, Rao CR, editors. Handbook of Statistics. Vol. 13. Elsevier Science B.V; Amsterdam: 1996.