Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
J Am Stat Assoc. Author manuscript; available in PMC 2010 June 28.
Published in final edited form as:
J Am Stat Assoc. 2009 March 1; 104(458): 391–408.
doi:  10.1198/jasa.2009.0119
PMCID: PMC2892819

Screening Experiments for Developing Dynamic Treatment Regimes


Dynamic treatment regimes are time-varying treatments that individualize sequences of treatments to the patient. The construction of dynamic treatment regimes is challenging because a patient will be eligible for some treatment components only if he has not responded (or has responded) to other treatment components. In addition there are usually a number of potentially useful treatment components and combinations thereof. In this article, we propose new methodology for identifying promising components and screening out negligible ones. First, we define causal factorial effects for treatment components that may be applied sequentially to a patient. Second we propose experimental designs that can be used to study the treatment components. Surprisingly, modifications can be made to (fractional) factorial designs - more commonly found in the engineering statistics literature -for screening in this setting. Furthermore we provide an analysis model that can be used to screen the factorial effects. We demonstrate the proposed methodology using examples motivated in the literature and also via a simulation study.

Keywords: Multi-stage Decisions, Experimental Design, Causal Inference

1. Introduction

Dynamic treatment regimes (Robins, 1986, 1987; Lavori et al., 2000; Murphy et al., 2001) are time-varying treatments increasingly employed in the treatment and management of chronic, relapsing disorders such as drug dependence, HIV infection and depression (Brooner and Kidorf, 2002; Rush et al., 2003; Lisziewicz and Lori, 2002). These regimes individualize the treatment level and type via decision rules that input patient outcomes collected during treatment and output recommended treatment alterations. A two stage dynamic treatment regime specifies how to select the treatment combination by a sequence of decision rules, one per stage, {d1, d2} where the decision rule d1 takes the information available initially, say O1 and outputs stage 1 treatments A1 and the decision rule d2 takes the information available at the beginning of the stage 2, say {O1, A1, O2}, and recommends stage 2 treatments, A2. The time order is O1, A1, O2, A2, O3 where O3 is observed at the end of stage 2. The primary outcome is Y = f(O1, A1, O2, A2, O3) for f known.

Dynamic treatment regimes are also often multi-component (i.e., multiple factor) treatments: components may include not only the treatments for the primary disorder but also adjunctive treatments for co-occurring problems, behavioral therapies to improve adherence and delivery mechanisms. Presently dynamic treatment regimes are constructed via expert opinion and then evaluated in randomized two group trials. Observational studies are typically used to assist experts in constructing the dynamic treatment regime. Observational analyses attempt to infer causal relations from non-randomized comparisons and are subject to bias when the compared groups differ in ways other than by type of treatment (Rubin, 1974, 1978). While there have been great advances in the epidemiological, statistical and other literatures in developing methods that precisely specify the assumptions under which this bias may be eliminated, randomized studies, when possible, remain the optimal approach to reducing bias (Rubin, 1974, 1978). The field of experimental design provides an approach to using randomization in the construction of multi-component treatments.

The focus of this article is on screening experiments for two stage dynamic treatment regimes. Screening experiments are used to whittle down the large number of treatment components resulting in a few promising dynamic treatment regimes. The formulation and analysis of screening experiments for multi-component dynamic treatment regimes requires the combination of ideas from causal inference and factorial experimental design.

Unfortunately, in settings such as substance abuse, mental health and HIV research, the classical design and analysis of screening experiments cannot be directly imported. This is because some stage 2 factors are only relevant for patients who responded (or did not respond) to prior stage 1 treatment factors. For this reason, in Section 2, we carefully define factorial effects so that these effects have a causal interpretation. In Section 3 we provide 2k experimental designs and associated analysis methods that can be used to screen these effects. In Section 4, we consider 2km designs. In this section we provide an approach to ascertain the aliasing of the factorial effects. Section 5 provides examples illustrating a variety of designs. In Section 6 a simulation study is used to evaluate the robustness of the proposed methods. Section 7 concludes with a summary and discussion of open problems.

2. Causal Effects of the Factors

Screening factors often occurs via the assessment of factorial effects in ANOVA decompositions (see Box, Hunter and Hunter, 1978; Wu and Hamada, 2000 or Montgomery and Jennings, 2006) for the conditional mean of the primary outcome. In our setting special care is required in defining effects. Consider a simple case where there is only one stage 1 factor and two stage 2 factors. Each factor takes on levels in {−1, 1}. Suppose N subjects are each randomized equally to the two levels of the stage one factor A1. Then an indicator of early response is observed; R is 1 if the subject is responding following assignment of A1 and 0 otherwise (there is no O1 and O2 = R). For simplicity, we refer to individuals with R = 1, 0 as responders, non-responders, respectively. At stage 2, responders are randomized equally between the {−1, 1} levels of the factor A2(1), and non-responders are randomized equally between the {−1, 1} levels of the factor A2(0) (the superscripts, (1), (0), indicate stage 2 factors for responders, non-responders, respectively). At the end of the study the primary outcome Y (coded so that high values are preferable) is observed. Assuming subject observations are independent and identically distributed, a “factorial” decomposition is


with the interpretation that the coefficients {η1, β1, β2, α1, α2} represent factorial effects. R is included in the above decomposition since A2(1) can only be assigned to subjects with R = 1 and similarly for A2(0).

If we interpret the coefficients {η1, β1, β2, α1, α2} as factorial effects then to screen factors A1, A2(1),A2(0) we will base our inference on these coefficients. However, this can lead to erroneous conclusions concerning the usefulness of A1, via η1, for at least two reasons. First in the medical/behavioral fields there are a plethora of both known and unknown common causes of R and Y (hence R is prognostic for Y). Consider the following example. Let U, a Bernoulli random variable with success probability 12, represent an unknown common cause of both the outcome Y and early response R. U might be an unknown genetic factor. Suppose that Y = δ0 + δ1U + ε where ε (mean zero, finite variance) is independent of (U, R, A1, A2(0),A2(1)). Thus there is no effect of the stage 1 factor A1 or of the stage 2 factors ( A2(0),A2(1)) on the outcome Y. Next suppose that P[R=1U,A1]=U[q1+q22+q1q22A1]+(1U)[q3+q42+q3q42A1] where each qj [set membership][0, 1]. Note that A1 may impact the early response (q1q2 ≠ 0 or q3q4 ≠ 0). We obtain


Thus η1=δ12(1q12q1q31q22q2q4) which may differ from the true effect of zero. That is, η1 in (1) reflects biases that occur because we are conditioning on R which is both an outcome of A1 and a prognostic variable for Y. Further discussion of this bias can be found in Robins and Greenland (1994) and Robins and Wasserman (1997).

Second even if there are no unknown common causes of R and Y, η1 does not reflect the overall impact of A1 as A1 may impact Y at least partially by its effect on R (Parmigiani, 2002; Murphy, 2005). For both reasons, clinical trial guidelines (ICH E9, 1999) warn against conditioning on outcomes such as side effects or other post randomization covariates in comparing one treatment to another.

These two issues preclude the use of (1) for screening. To our knowledge the former issue does not arise in the traditional experimental design literature, even in clinical trials with multiple stages. For example, adaptive experimental designs (Hu and Rosenberger, 2006, Zacks, 1996, Patel, 1962) also involve multiple stages of (sequential) decision making, but each stage involves different subjects; that is “sequential” there means decisions on a sequence of subjects not sequential decisions on each subject. Since subjects can be generally assumed to respond independently (given membership in the same population of subjects), and the stages involve different subjects, the above issues do not occur.

2.1 Causal Effects in Terms of Potential Outcomes

In this section a model in terms of factorial effects for each fixed pattern of factor levels is developed; in the next subsection randomization of the factor levels is incorporated. This model will be in terms of potential outcomes (Rubin, 1978; Robins, 1986, 1987); the parameters in the model will be the causal effects. The idea is that each subject has multiple potential outcomes, one per factor level combination. However only one of the potential outcomes, that is, the outcome associated with the subject’s assigned factor levels is observed; the remaining potential outcomes are missing. The potential outcome framework facilitates a precise definition of the causal factorial effects; as will be seen this is somewhat nuanced for stage 1 factorial effects. Also this framework facilitates the statement of the conditions under which estimators of the causal effects of a factor can be obtained.

Suppose there are p1 stage 1 factors, p12 stage 2 factor levels for responders and p02 stage 2 factor for non-responders. As is generally the case in screening, we assume each factor has two levels (see Section 3 for discussion), coded by values ±1. Let a1 denote a vector of stage 1 factor levels, a2(1),a2(1) denote vectors of stage 2 factor levels for responders and a2(0),a2(0) denote vectors of stage 2 factor levels for non-responders. Let Ra1,a2(0),a2(1){0,1} and Ya1,a2(0),a2(1) denote the stage 1 response and the primary outcome, respectively, that would be observed if the subject were assigned the dynamic treatment regime: \provide treatment a1 in stage 1 and then if an early response occurs provide a2(1) otherwise provide a2(0).” Note that there are 2p1+p02+p12 different dynamic treatment regimes, thus, there are 2p1+p02+p12 potential stage 1 outcomes and 2p1+p02+p12 potential primary outcomes for each subject.

Throughout we make two assumptions that hold if the stage 2 treatment is not revealed to the subject/clinical staff until after the occurrence of the stage 1 outcome. These assumptions will enable us to reduce the number of groups of subjects in the experimental design (see the next section).

Ignorability Assumption I

Assume that Ra1,a2(0),a2(1)=Ra1,a2(0),a2(1) for all {a1, a2(0),a2(0),a2(1),a2(1)}.

In words, this is the assumption that a subject’s stage 1 outcome remains the same regardless of the stage 2 treatment assignments. Below we use the notation, Ra1, instead of Ra1,a2(0),a2(1).

Ignorability Assumption II

Assume that Ya1,a2(0),a2(1)Ra1=Ya1,a2(0),a2(1)Ra1 and Ya1,a2(0),a2(1)(1Ra1)=Ya1,a2(0),a2(1)(1Ra1) for all {a1, a2(0),a2(0),a2(1),a2(1)}.

For example, this assumption states that the primary outcome of a subject who does not respond in stage 1 does not depend on what this subject would have been assigned had he/she responded in stage 1. These assumptions may be violated if, as part of the protocol, subjects are informed in stage 1 of which stage 2 treatment they will be offered if they do not respond at stage 1. It may happen that subjects, who know that they will be assigned a rather burdensome stage 2 treatment upon non-response, will try to proactively avoid this burdensome treatment by adhering more faithfully to their assigned stage 1 treatment than would otherwise be the case.

In the sequel, stage 1 effects refer to all effects involving only stage 1 factors and the stage 2 effects refer to all effects involving at least one stage 2 factor. The stage 2 causal effects are defined via a saturated linear regression model for


in a1, a2(1)Ra1,a2(0)(1Ra1) and their higher order interactions. For example suppose that there is one stage 1 factor and two stage 2 factors (p1 = p02 = p12 = 1) then similar to (1) we have


where αj, βj, j = 1, 2 are the stage 2 causal effects; in particular α1, β1 are the stage 2 main effects and α2, β2 are the stage 2 interaction effects. If we solve for αj, βj, j = 1, 2, we see that the causal effects are contrasts of conditional means involving potential outcomes; for example,


Recall that under the ignorability assumption II, the quantities above do not depend on the value of a2(0). The coefficients in the above linear regression are 1/2 the size of the usual factorial effects (Wu and Hamada, 2000, pg. 113). (Note there is some discrepancy in the literature in how factorial effects are defined; for example Byar and Piantadosi’s (1985) definition of a main effect is one-half the size of Wu and Hamada’s definition). Throughout this work we use the regression formulation and refer to the regression coefficients as factorial effects.

As discussed at the beginning of this section the usual definition of the stage 1 effect (for example, via η1 in (3)) can lead to erroneous conclusions. To avoid this problem we do not condition on Ra1; instead we marginalize over Ra1 to obtain causal effects. Consistent with ignorability assumption II, we use the notation Ya1,a2(0) if Ra1= 0 and Ya1,a2(1) if Ra1= 1. We define the stage 1 causal factorial effects via a saturated linear regression model for


in a1. Again suppose that p1 = p02 = p12 = 1 then the linear regression is simple:


Solving for [var phi]2 we obtain,


[var phi]2 is the causal main effect of A1.

In general the causal main effect of one of the p1 stage 1 factors is the average of contrasts between two means of Y one for each level of the stage 1 factor, marginal over the distribution of R (and all other intermediate outcomes in O2 as well); the average is over the remaining factor levels (with all other factors taking levels ±1 with equal probability). The averaging/marginalization with respect to a uniform distribution over the other factor levels is consistent with the definition of factorial effects in experimental design (Wu and Hamada, 2000; Byar and Piantadosi, 1985). Note the definition of the stage 2 treatment effects also involves marginalization (except the stage 2 treatments are nested within the outcome R and thus the contrasts are between means conditional on R).

2.2 Effects in Terms of Conditional Expectations Involving Randomized Factors

Suppose we have experimental data in which the factors are randomized according to some distribution on {−1, 1}p1+p02+p12. The following provides definitions of the factorial effects in terms of the resulting data distribution. Let A1 denote the random vector of p1 stage 1 factor levels, A2(1) denote the random vector of p12 stage 2 factor levels for responders and A2(0) denote the random vector of p02 stage 2 factor levels. Lower case letters denote realizations of these random vectors. In this case {A1, A2(0),A2(1)} is clearly independent of the collection {Ra1, Ya1,a2(0),a2(1), a1 [set membership] {−1, 1}p1, a2(0){1,1}p02,a2(1){1,1}p12}. On each subject we observe A1, R, A2(0),A2(1), Y (when R = 1, factor levels given by A2(1) are assigned and when R = 0, factor levels given by A2(0) are assigned). We connect the potential outcomes to the observed data via the consistency assumption (Robins, 1997): assume that for all values of {a1, a2(0),a2(1)}, if A1 = a1, A2(0)=a2(0),A2(1)=a2(1) then Y=Ya1,a2(0),a2(1) and R = Ra1.

The randomization of factor levels plus the consistency and ignorability assumptions imply that E[Ra1] = E[Ra1|A1= a1] = E[R|A1 = a1]. Similarly for each r [set membership] {0, 1}, P[Ra1= r, A1 = 1 A2(r)=a2(r)] = P[R = r, A1 = 1, A2(r)=a2(r)]. Assuming this probability is nonzero, E[Ya1,a2(0),a2(1)Ra1=r] is equal to E[Ya1,a2(0),a2(1)A1=a1,Ra1=r,A2(r)=a2(r)], which by the consistency and ignorability assumptions is in turn equal to E[YA1=a1,R=r,A2(r)=a2(r)]. As a result, the stage 2 causual effects in (2) are simply the coefficients of terms involving a2(0) in a saturated linear model for conditional mean E[YA1=a1,R=0,A2(0)=a2(0)] and the coefficients of terms involving a2(1) in a saturated linear model for conditional mean E[YA1=a1,R=1,A2(1)=a2(1)].

The discussion above implies that in the simple setting where there is only one stage 1 factor and two stage 2 factors the β (α) parameters in (1) represent the stage 2 causal effects for the responders (non-responders, respectively). A similar statement can not be made for the stage 1 effects. Instead recall that the stage 1 causal effects are given by a saturated linear model for the mean in (4). This mean can be rewritten as


Again using randomization and the consistency assumption this mean can be reexpressed as


Thus the stage 1 causal effects are given by the coefficients in a saturated linear model for the above sum. The above formula (5) is a version of Robins “G-computation” formula (Robins, 1986); this formula frequently arises in causal inference for time varying treatments. A more intuitive definition for the stage 1 effects obtains if the stage 2 factors are independently randomized according to discrete uniform distributions on {−1, 1}. Then (5) simplifies to E[Y |A1 = a1]; a saturated linear model for this conditional mean provides the definition of the stage 1 effects in this special setting.

Somewhat surprisingly the stage 1 and stage 2 causal effects are terms in one model for the conditional mean. Define


This definition is used to make the next formulae concise; note (5) is equal to E[EA2U[E[YA1,R,A2(R)]A1,R]A1=a1]. Consider the telescoping sum for the conditional mean, E[YA1,R,A2(R)]:



The first term (6) contains the stage 2 causal effects. The last term (7) is just another way to write (5) and thus contains all stage 1 causal effects. The middle term is composed entirely of nuisance parameters and can be rewritten as RE[R|A1] times


Thus in the example of one stage 1 factor and two stage 2 factors (one for responders, the other for non-responders), we replace the linear regression in (1) by


where the first row corresponds to (7), the second row to (8) times (RE[R|A1]) and the last row to (6). Note that even though we call the parameters, ψ1, ψ2 and the stage 1 response rate, E[R|A1], nuisance parameters, these parameters may be of independent interest. This is certainly the case for the stage 1 response rate; additionally ψ1, ψ2 reflect the ability of the stage 1 response to predict the primary outcome. Note that the nuisance parameters, ψ1, ψ2, would be absent from the above formula only if the response indicator, RE[R|A1] were zero, that is, only if the stage 1 response, R, is a deterministic function of A1.

3. 2k Factorial Two-Stage Designs

Our goal is to screen out inactive factors. Usually we consider two levels for each factor. The two levels should be selected to be sufficiently disparate so that we can obtain an effect yet not be unethical. In settings where there is likely to be a downturn in the primary outcome at high levels of the factor, we select the higher level small enough so that the downturn is thought to be insufficient to eliminate the effect. If necessary, subsequent trials can be used to more fully explore the dose-response.

Suppose that there are two stage 1 factors A1 = {A11, A12} and one stage 2 factor for responders A2(1) and one for non-responders, A2(0). Consider the experimental design in Table 1. Each row in the design of Table 1 provides the factor levels assigned to a group of subjects. The column labeled A2(1)=A2(0) designates the identical factor level settings fo A2(1) and A2(0). Note that once responder (non-responder) status is known for each subject, we can view the design as two 23 full factorial designs, one for responders to stage 1 treatment and the other for non-responders to stage 1 treatment.

Table 1
A 23 Factorial Two Stage Design1

Each row of the design corresponds to a group of subjects, all of whom are assigned the same dynamic treatment regime. For example, the subjects in the first group are assigned the dynamic treatment regime: “Provide A11 = +1, A12 = +1, if they respond, assign A2(1)=+1 in stage 2 and if they do not respond, assign A2(0)=+1 in stage 2” (note that a dynamic treatment regime must specify the stage 2 treatments for both the responders and the non-responders). For convenience we use the term “stacked” to indicate situations in which the factor level settings of two factors are identical across all rows of the experimental design; factors A2(1) and A2(0) are stacked. The stacking is advantageous because investigators only have to implement a 23 design yet under the ignorability assumptions, we have information on all 24 dynamic treatment regimes. For example even though no row of the design corresponds to “assign A11 = +1, A12 = +1 and if response assign A2(1)=+1 but if non-response assign A2(0)=1 we may combine the responders in row 1 and the non-responders in row 2 together to produce a new group of subjects assigned this dynamic treatment regime.

As in Table 1, all designs considered here possess the property that for each factor half of the rows in the design are set at the +1 level and half of the rows are set at the −1 level. This property will play a crucial role in ascertaining the aliasing between effects. Note however that due to the difficulties in recruiting subjects inherent in clinical trials, the groups of subjects may not be exactly equal in size, and second even if there are equal numbers of subjects in each group, the number of subjects assigned a particular level of A2(1),A2(0) depends on the response rate.

As before suppose there are p1 stage 1 factors, p12 stage 2 factors for responders and p02 stage 2 factors for non-responders. Here we consider screening the stage 1 and stage 2 effects using data from a 2k factorial two-stage experiment in which k = p1 + max(p12, p02). In contrast to classical analyses in experimental design (analyses that assume normality and are exact for small group sizes), we consider analyses that are justified by large sample theory (large group sizes) and thus are approximate for small group sizes. With this viewpoint we interpret a full factorial design as a design in which each subject is assigned with equal probability to one of the 2k possible combinations of the factor levels. In large samples this randomization results in approximately equal sized groups assigned to each row of the design. Denote the collection of stage 1, stage 2 factors by A1, respectively A2. We have that (A1, A2) has a discrete uniform distribution on {−1, 1}k where A2(1) is the first p12 entries in A2 (these will be the settings of the stage 2 factors for responders) and A2(0) is the first p02 entries in A2 (these will correspond to the settings of the stage 2 factors for non-responders). The data consists of N i.i.d. copies of {A1, R, A2, Y}. There are 2k unique values of (A1, A2).

Define X1 as the random vector composed of a 1, all stage 1 factors and their two-way and higher order products (2p1 terms), X2(1) as the random vector composed of all stage 2 factors for responders, their two-way and higher order products and the all products of these combinations with members of X1 (2p1+p12−2p1 terms). X2(0) is defined similarly. The decomposition in (6 – 8) can be written via a linear model


where β and α are the vectors of all stage 2 causal effects for responders and non-responders, respectively, ψ is the vector of nuisance effects, [var phi] is the vector of stage 1 causal effects and p(X1) = E[R|A1] (note X1 is a function of A1).

Recall that a parameter is identifiable if different values of the parameter lead to different distributions of {A1, R, A2, Y} (van der Vaart, 1998). Identifiability in this setting is straightforward.

Lemma 1

Assume that P[0 < E[R|A1] < 1] = 1. The parameters, [var phi], ψ, β and α in (9) are identifiable.

The above lemma is a direct consequence of two facts. First the conditional means, E[RA1=a1],E[YA1=a1,R=1,A2(1)=a2(1)] and E[YA1=a1,R=0,A2(0)=a2(0)] for all values of {a1, a2(1),a2(0)}, are identifiable. Second the expectation of [X1T,(Rp(X1))X1T,RX2(1)T,(1R)X2(0)T]T times its transpose is given by a block diagonal with blocks, E[X1X1T],E[X1p(X1)(1p(X1))X1T],E[X2(1)p(X1)X2(1)T] and E[X2(0)(1p(X1))X2(0)T] (recall p(X1) = E[R|A1]). These blocks are invertible under the conditions specified in Lemma 1. The assumption of Lemma 1 is not necessary for identifiability of [var phi] and can be weakened to only P[E[R|A1] > 0] = 1 for identifiability of β (with a similar statement for α). The details of the proof are omitted.

To utilize (9) in screening estimate p(X1) by forming the average of R for each value of X1 to obtain [p with hat](X1). Conduct a linear regression of Y on {X1, (R[p with hat](X1))X1, RX2(1),(1R)X2(0)} to obtain [var phiv with circumflex], [psi], [beta], [alpha]. Note that in contrast to classical screening analyses, these estimators are not orthogonal. This is because homogenous variance is not assumed (e.g. the conditional variance of Y may vary by factor levels), the groups sizes may not be identical and because the responder rates will likely vary by the stage 1 factor levels.

Lemma 2

Assume that the variance of Y is finite and that P[0 < E[R|A1] < 1] = 1. Then as N → ∞, { N(φ^φ),N(β^β),N(α^α)} converges in distribution to a multivariate normal. If P[Var(YA1,R,A2)=Var(YA1,R,A2(R))]=1, then the estimators [beta], [alpha] and [var phiv with circumflex] are locally semiparametric efficient.

In many settings all responders will be provided the same treatment. In this case there are no [beta], β and P[0 < E[R|A1] < 1] = 1 can be replaced by P [E[R|A1] > 0] = 1 (similar statements can be made if all non-responders are provided the same treatment). See Tsiatis (Chapter 4, 2006) for an introduction to and definition of locally semiparametric efficiency. Note the assumption on the conditional variances holds under the consistency and ignorability assumptions. This lemma is a special case of Lemma 4 below. The formula for the variance-covariance matrix is provided in Appendix A.1 along with an asymptotically consistent estimator and the proof of Lemma 4.

4. 2km Factorial Two Stage Design

A better use of resources for screening is use a 2km fractional factorial design, which uses only a 1/2m fraction of the groups in a 2k design. An example of a 23–1 design for the setting in which there are two stage 1 factors and one stage 2 factor each for responders/non-responders is

This design has one-half as many rows (groups of subjects) as the full factorial design in Table 1; see the examples in Section 6 for realistic designs. Usually a 2km design is selected by first ascertaining plausible working assumptions concerning the factorial effects. Wu and Hamada (2000) provide principles that can be used to guide these working assumptions in the absence of scientific knowledge (e.g. often it is plausible that three way and higher order effects are negligible). Next, the aliasing associated with candidate 2km designs is ascertained; parameters, that is, effects, are aliased when only their sum can be identified. Note that the aliasing occurring here is different from the confounding that occurs in causal inference as the latter is unplanned and due to potentially unknown variables whereas the former is the result of the planned study design and is known. Lastly the design with the aliasing structure that is most consistent with working assumptions is selected. See Wu and Hamada, (2000) for simple approaches to creating 2km designs.

As was the case for full factorial designs, we interpret a fractional factorial design as a design in which each subject is assigned with equal probability to one of the 2km possible combinations of the factor levels. Thus (A1, A2) has a discrete uniform distribution across the factor levels given by the 2km rows of the design (recall that A2(1)(A2(0)) is set to the first p12 (p02, respectively) entries in A2). The experimental data consists of N i.i.d. copies of {A1, R, A2, Y}.

4.1 Identification of Effects

We use a large sample notion of aliasing. Effects will be aliased if only their sum can be identified (operationalized here to mean that only asymptotically consistent estimators of their sum are available). This is a weaker concept than the finite sample concept of aliasing as discussed, for example, in Wu and Hamada (2000); in that setting, effects are aliased if we are only able to obtain an unbiased estimator of their sum.

Defining words are generally used to ascertain finite sample aliasing. Consider the design in Table 2. The defining word for this design is 1=A11A12A2(1) (the product of the ±1’s across the three columns is equal to 1). Equivalently 1=A11A12A2(0) since the levels of A2(1) are equal to the levels of A2(0). Similarly A2(1)=A11A12 (the product of the ±1’s in columns 1 and 2 is equal to the ±1’s in column 3). In a standard 23–1 design, the defining word 1=A11A12A2(0) means that each main effect is aliased with a two way interaction (e.g. A11 is aliased with A12A2(0), etc.). It turns out that even though the screening analysis model for a dynamic treatment regime is based on a nonstandard model (9), we will, nonetheless, be able to use the defining words to ascertain the large sample aliasing. This means that we can use commonly available designs such as those provided by Wu and Hamada (2000) to design screening studies for dynamic treatment regimes. Lemma 3, below, provides conditions under which the defining words can be used to determine the aliasing of effects.

Table 2
A 23–1 Fractional Factorial Two Stage Design

As in Section 3 we construct the vector [X1T,(X2(1))T,(X2(0))T] from (A1, A2). This vector takes on 2km equally likely vector values. Construct the matrix [X1,X2(1),X2(0)] with each row corresponding to one of these vectors values. The defining words identify the identical columns in [X1,X2(1)] and in [X1,X2(0)]. Consider the design in Table 2; here


Labeling the columns of [X1 by as {1, A11, A12, A11A12} and the columns of X2(j) by { A2(j),A11A2(j),A12A2(j),A11A12A2(j)} we see that the defining word, 1=A11A12A2(j) (j = 0, 1) identifies the identical columns.

Lemma 3

Assume that P[0 < E[R|A1] < 1] = 1. Make the Formal assumption:

For each identical column in both X1 and X2(1) and for each identical column in both X1 and X2(0) either

  • (3a) the associated nuisance effect(s) (ψ parameters) in (9) are zero or
  • (3b) the associated stage 2 effects (β, α parameters) in (9) are zero.

Then the defining words can be used to ascertain aliasing.

To make Lemma 3 concrete, consider the analog of (9) for the design in Table 2:


The matrices in (10) permit a variety of formal assumptions. For example, we might make the formal assumption that ψ2 = ψ3 = ψ4 = 0 and β4 = α4 = 0. Then Lemma 3 allows us to read off the large sample aliasing from the defining word, 1=A11A12A2(j), j = 0, 1. That is each main effect is aliased with a two-way interaction between the remaining predictors (e.g. A11 is aliased with A12A2(0) and so on). This design is most useful if, according to the working assumptions, there are no two-way interactions.

See the Appendix A.1 for a proof of Lemma 3. As before if there are only stage 2 factors for responders (non-responders) then we need only assume P[E[R|A1] > 0] = 1 (respectively P[E[R|A1] < 1] = 1). In the next section we provide an algorithm based on this proof for determining the aliasing and constructing the predictors in the screening analysis. We only use 2km designs for which either (3a) or (3b) holds for any common stage 1 and stage 2 columns in X1 and X2 of Lemma 3. The use of designs in which this formal assumption is violated produces aliasing that is not easily discernable by the defining words (see Section A.2 in the Appendix for an example). Note that we can choose a design in which X1 and X2 do not share a column (see example 1 in Section 5) and thus formal assumptions are not required. The formal assumptions, in addition to the working assumptions, are used to select the appropriate 2km design. The design is selected so that according to the working assumptions all but one of the effects in each set of aliased effects is thought to be negligible.

4.2 Screening Effects

Of course we cannot fit the model (9), or in the case of our example, (11), since the 2km design does not provide Y outcomes for all 2k values of {A1, A2} (the omitted outcomes correspond to the omitted rows in the design). To permit a regression model we rewrite (9) in terms of aliased effects. Using the results of Lemma 3, we reduce the number of columns in X1, X2(1) and X2(0). Note removing columns from these matrices is equivalent to eliminating predictors in X1, X2(1) and X2(0). The construction of the regressors for the screening analysis is straightforward except when formal assumption (3b) is made; see step 4 below.

Algorithm for Constructing the Regressors

  1. Construct X1, X2(1) and X2(0). Eliminate duplicate columns in each of X1, X2(1) and X2(0) and associated predictors from X1(X2(1),X2(0)), retaining only the predictor that, according to the working assumptions, may have a non-negligible effect. Name the resulting vector U1(U2(1),U2(0)).
  2. If any β (α) parameters are assumed zero (via assumption 3b) delete the predictors associated with these parameters in U2(1) (respectively U2(0)); similarly delete the associated columns in X2(1),X2(0).
  3. If some of the nuisance, ψ, parameters are assumed to be zero (via assumption 3a) then create Z3 from U1 by eliminating predictors in U1 with an assumed zero valued ψ parameter. Otherwise Z3 = U1. At this point we can rewrite (9) as
    where the primes on the regression coefficients indicate each entry in ψ″, ψ″, β″, α″ may be a sum of stage 1 effects, nuisance effects, stage 2 effects for responders and stage 2 effects for non-responders, respectively. Here the response rate, E[R|A1] is written as p(U1).
  4. If assumption 3b is made (e.g. some stage 2 effects, β, α, parameters are assumed to be zero) then the defining words have identified columns common to at least two of the three matrices X1, X2(0),X2(0). If there are one or more columns common to all three matrices then for each common column delete an associated predictor from one of either U1, U2(0) or U2(0) resulting in Z1, Z2(0) or Z2(0). The choice of which predictor to eliminate determines the aliasing as follows. If the omitted predictor is a stage 1 variable then the coefficient of the stage 2 predictor for responders (non-responders) in the regression (12) is an estimator of the sum of the stage 1 and stage 2 effects for responders β′ = [var phi]″ + β″ (α′ = [var phi]″ + α″, respectively). If the omitted predictor is a stage 2 variable for responders (non-responders) then the coefficient of the stage 1 predictor in the regression (12) is an estimator of sum [var phi]′ = [var phi]″ + β″ ([var phi]′ = [var phi]″ + α″, respectively) and the coefficient of the stage 2 predictor for non-responders is an estimator of the difference between the stage 1 and stage 2 effects α′ = α″ − β″ (β″ = β″ − α″, respectively). We obtain

Using the working assumptions the regression coefficients can be labeled as corresponding to particular stage 1 effects, nuisance effects, a stage 2 effects for responders and stage 2 effects for non-responders.

Consider once again the example from Table 2 with the formal assumptions that ψ2 = ψ3 = ψ4 = 0 and β4 = α4 = 0 (see (11)). After step 3, we have U1 = {1, A11, A12, A11A12}, Z3 = {1}, U2(j)={A2(j),A11A2(j),A12A2(j)}, j = 0, 1. In step 4 we select one of several analysis models (each corresponding to a different form of aliasing). Suppose we eliminate A11A12 in U1 to form Z1 and eliminate A11A2(1),A12A2(1) from U2(1) to form Z2(1)(Z2(0)=U2(0)), then according to step 4 the conditional mean becomes


where φ1=φ1,φ2=φ2+β3,φ3=φ3+β2,ψ1=ψ1,β1=β1+φ4,α1=α1+φ4,α2=α2β3 and α3=α3β2 from (11).

The steps in the algorithm follow the steps in the proof of identifiability for Lemma 3; see the proof in Appendix A.1 for a detailed explanation. At this point the total number of entries in {Z1, Z3, Z2(1),Z2(0)} is at most 2 (2km)and the total number of predictors in U1 is at most 2km. To conduct the screening analysis we estimate p(U1) for each value of U1 in the design by the proportion of responders assigned that value to obtain [p with hat] (U1) ([p with hat] (U1) can be formed from a saturated regression of R on U1). The screening analysis is a regression of Y on {Z1,(Rp^(U1))Z3,RZ2(1),(1R)Z2(0)} thus obtaining [var phiv with circumflex]′, [psi]′, [beta]′, [alpha]′. See Section 5 for further concrete examples.

Lemma 4

Assume that the variance of Y is finite and that P[0 < E[R|A1] < 1] = 1. Then as N → ∞, the multivariate distribution of { N(φ^φ),N(β^β),N(α^α)} converges to an multivariate normal distribution. Furthermore if the model in (12) is saturated then the estimators [var phiv with circumflex]′, [beta]′, [alpha]′ are semiparametric efficient.

The proof, and an asymptotically consistent estimator for the variance-covariance matrix, is provided in Appendix A.1. In general the model (12) will be saturated. Exceptions occur when more formal assumptions are made than is necessary (see example 3 in Section 5 for an illustration). Or when according to the design, there are one or more stage 2 factors (only for responders or only for non-responders) with levels completely crossed with the levels of all other factors (e.g. these stage 2 factors are randomized independently of the remaining factors and they are not stacked with other factors in the design). For completeness Appendix A.1 also supplies an estimating function yielding semiparametric efficient estimators for use in the case when (12) is not saturated. As was the case for Lemmas 2 and 3, if all responders are provided the same treatment then there are no [beta]′, β′ and P[0 < E[R|A1] < 1] = 1 can be replaced by P[E[R|A1] < 1] = 1 (similar statements can be made if all non-responders are provided the same treatment).

4.3 Sample Size Considerations

The primary goal is to assess the activity of all stage 1 and stage 2 main effects, that is to test if each main effect is zero. To choose the sample size N, we make the following rough approximations. We assume that our formal assumptions, if any, are correct and that the residual variance is equal across the 2km rows in the design, say σ2. If the the stage 1 responder rates were equal (say p) then the asymptotic variance-covariance matrix of the estimated effects would be a diagonal matrix with diagonal elements equal to σ2Np for the stage 2 effects for responders, σ2N(1p) for the stage 2 effects for non-responders and σ2N for the stage 1 effects. We act as if this is roughly the case. If there are stage 2 factors for both responders and non-responders then the sample size is determined by the smaller of σ2p and σ2(1p). The sample size to detect a main effect of size Δ with power 1 − β using a two-sided test with Type one error rate α is


where pmin, pmax are minimal and maximal response rates (across the 2km rows) and zu is the (1 − u)th standard normal percentile.

Suppose there are stage 2 components for both responders and non-responders. If the signal-to-noise ratio (SNR = Δ/σ) is very large then the recommended row size will be so small that the chance of a group (corresponding to a row in the design) occurring with no responders or no non-responders is too high. Given the response rate p and group size n = N/2km, the chance that a group of this type will occur is 1 −(1 −pn −(1 − p)n)2km. To reduce the chance of such groups we find the smallest n for which the above probability is smaller than a cutoff (in the simulations we use .01). To set the group size we use the maximum of the above number times 2km and N. Note that if there are no stage 2 components for responders (respectively non-responders) then 1 − pn −(1− p)n in the above formula may be replaced by 1−pn (respectively by 1 − (1 − p)n).

5. Examples

In this section we illustrate several 2km designs and associated working and formal assumptions. The examples below are 2km generalizations of the clinical studies STAR*D (Fava et al., 2003) and ExTENd (Oslin, personal communication) in which the stage 2 factors vary by whether the patient exhibits an early response. Three different designs, associated with different formal assumptions are illustrated.

Consider a study with four stage 1 factors: S (speciality or general practice clinic), B (adjunctive therapy to improve adherence: yes/no), C (counseling: intensive/less intensive) and T (level of staff training: high/low). All individuals are offered medication. There is one stage 2 factor for early non-responders F2 (switch or augment medication) and one stage 2 factor for early responders G2 (telephone disease management: yes/no). Suppose that the working assumptions are that the main effects of the factors and the two way interactions CG2, TG2, BF2, TF2 and BC may be active with the remaining effects assumed negligible. Recall, these working assumptions are used to guide the choice of the 2km design, but are not necessarily assumed true in the screening analysis.

In Design 1, a 24–1 design is constructed for the stage 1 factors and then the levels of the stage 2 factor are crossed with this design. In Design 2 the stage 2 factor is aliased with the four way interaction between the stage 1 factors. Both of these designs are 25–1 designs. In the Design 3 below we allow for an additional stage 2 factor which is applicable only for the non-responders. Here a 26–2 design is discussed.

Design 1

This design crosses a 24–1 design with defining word SBCT = 1 with the stage 2 factor levels to produce a 25–1 design; no formal assumptions are made. The defining word indicates that the stage 1 and stage 2 matrices X1 and X2(1) (or X2(0)) do not share columns and thus stage 2 effects are not aliased with nuisance effects (or stage 1 effects). As a result the aliasing follows directly from the defining word (e.g. the two way interaction BC is aliased with the two way interaction ST, and since SBCTG2 = G2, the two interaction TG2 is aliased with the interaction SBCG2, and so on). Note that this design is consistent with the working assumptions in that all of the interesting effects are aliased with effects that are thought to be negligible.

The screening analysis uses the model in (12) with Z1 = [1, S, B, C, T, BC, SB, SC], Z3 = Z1, Z2(1)=[G2,SG2,BG2,CG2,TG2,SBG2,SCG2,BCG2] and similarly Z2(0)=[F2,SF2,BF2,CF2,TF2,SBF2,SCF2,BCF2]. This is a saturated model. Recall that due to the nonorthogonality of the predictors, omitting predictors from this model is equivalent to making formal assumptions. An advantage of this design is that no negligibility assumptions on the nuisance parameters (ψ’s) need be made. However, this design aliases BC with ST; a better design would avoid aliasing two way interactions.

Design 2

Suppose that we are willing to make the formal assumption that there are no three way or higher order stage 2 interactions (α and β regression coefficients of interactions between a stage 2 factor and two or more stage 1 factors are 0), and that there are no four way or higher nuisance interactions involving R and stage 1 factors (ψ regression coefficients of interactions between R and three or more stage 1 factors are 0). Consider a 25–1 design with defining word SBCTF2 = 1 with F2 and G2 stacked (so another expression for the defining word is SBCTG2 = 1). The design is provided in Table 3. The defining words indicate that under the formal assumptions, none of the stage 1 main effects or two-way interactions are aliased. Potentially active stage 2 effects, such as CG2, can also be estimated since the nuisance effects associated with the same column in the stage 1 matrix X1, (here the four way interaction between R and SBT) are negligible. And potentially active nuisance effects such as the three way interaction between R and SB can be estimated since these nuisance effects are associated with the same column in the stage 2 matrices X2(1) (or X2(0)) as negligible stage 2 effects (here CTG2 and CTF2).

Table 3
Design 2:

To screen the factors using data from design 2, we use the model in (12) with with Z1 = [1, S, B, C, T, BC, SB, SC, ST, BT, CT], Z3 = Z1, Z2(1)=[G2,SG2,BG2,CG2,TG2] and similarly Z2(0)=[F2,SF2,BF2,CF2,TF2]. This is a saturated model. Note that the defining word indicates the aliasing, for example the [beta]′ coefficient of G2 is actually estimating the sum of main effect of G2 and the four-way interaction SCBT.

Design 3

In order to illustrate some of the more subtle considerations, consider the inclusion of an additional stage 2 factor H2 (additional behavioral contingency to improve long term medication adherence) that is only assigned to non-responders (R = 0). Further suppose the formal assumptions are that there are no three way or higher nuisance interactions involving R and stage 1 factors and that three way and higher stage 2 causal effects are negligible. The working assumptions are that all effects except the main effects of the factors and the two way interactions TG2, TF2, CG2, BF2, F2H2 and BC are expected to be negligible. Consider the 26–2 design, with defining words SCTF2 = 1 and BTF2H2 = 1 and F2 and G2 stacked (so we could express the above with a G2 instead of F2; note the product of these two defining words yields 1 = SBCH2.)

In contrast to designs 1 and 2, not only the experimental design but also the choice of the regression model determines the aliasing (we use step 4 of the algorithm in Section 4.2 to determine the aliasing). Indeed we have a choice of several regression models that can be used to screen the factors each corresponding to different aliasing. Using the steps in Section 4.2 we see that one possibility is to use the model in (12) with Z1 = [1, S, B, C, T, SB, BC, BT, SBC, SBT, BCT], Z3 = [1, S, B, C, T], Z2(1)=[G2,SG2,BG2,CG2,TG2] and Z2(0)=[F2,SF2,BF2,CF2,TF2,H2,SH2,CH2,F2H2]. Using the defining words, we can deduce the aliasing. For example since the defining words indicate that G2 = F2 = SCT and SCT has been omitted from X1, we have that the β′ coefficient of G2 is the sum of the main effect of G2 and the three way interaction SCT; similarly the α′ coefficient of F2 is the sum of the main effect of F2 and the three way interaction SCT. Note that the defining words indicate that BF2 = BG2 = TH2 = SBCT and we included only BF2, BG2 in the regression thus the α′ coefficient of BF2 is the sum of the two way interactions BF2, TH2 and the four way interaction SBCT whereas the β′ coefficient of BG2 is estimating the sum of the two way interaction BG2 and the four way interaction SBCT. In general, given the formal assumptions, the defining words along with the choice of Z1, Z2(1) and Z2(0) result in the aliasing: βG2=βG2+φSCT,βSG2=βSG2+φCT,βBG2=βBG2+φSBCT,βCG2=βCG2+φST,βTG2=βTG2+φSC,βF2=βF2+φSCT,βSF2=βSF2+φCT,βBF2=βBF2+βTH2+φSBCT,βCF2=βCF2+φST,βTF2=βTF2+βBH2+φSC (for clarity the associated predictor is given as the subscript on the parameter). The remaining parameters are unaliased (e.g. the ([var phi]′, β′, α′) parameters are equal to the corresponding ([var phi], β, α) parameters).

Another possible screening analysis would use (12) with Z1 = [1, S, B, C, T, SB, BC,BT, CT, SBC, SBT, BCT] and Z2(0)=[F2,BF2,CF2,TF2,H2,SH2,CH2] but leave Z3 and Z2(1) as is; that is the predictor CT is added to Z1 and the predictor SF2 is removed from Z2(0). In this case the aliasing is the same as before except φCT=φCTαSF2,βSG2=βSG2αSF2 (there is no longer a αSF2 parameter).

Both of the two screening analysis models discussed above have 30 parameters and hence are not saturated models. If desired we could fit a saturated model by adding two nuisance interactions to Z3, SBT and BCT. These two nuisance interactions were needlessly assumed to be negligible in the formal assumptions. For example, from the defining words, note that the column associated with SBT is the same as the column associated with the three way stage 2 interactions SF2H2, TCH2 and BCF2 all of which were assumed to be negligible. That is we made both assumption 3a and 3b in Lemma 3 instead of one or the other. Estimation of these two effects acts as a check on the formal assumptions since under the formal assumptions these two nuisance effects are zero.

6. Simulation Results

Extensive simulations were conducted so as to evaluate the proposed analysis, examine the impact of violations of the formal assumptions and examine the impact of rows with no responders (or no non-responders) on the analysis. We are particularly interested in mental health settings such as the treatment of major depression and substance abuse in which response (absence of symptoms) rates to initial treatment are around 50 to 70%; as a result the simulations below use initial response rates in this range. Note that in actual practice, when response (non-response) rates are low, usually only factors for non-responders (respectively, responders) are investigated.

The findings were as follows. First when the formal assumptions hold, the standard errors performed well and the Type 1 error is as planned. In general the sample size calculations depend on the group with the smallest proportion of non-responders (or responders). As a result the sample size calculations are conservative and the power to detect active stage 1 main effects and stage 2 main effects is higher than the nominal value. Simulations with both normal and non-normal error distributions such as a t-distribution with 3 degrees of freedom were also conducted; this did not substantially alter the results. When the formal assumptions are violated (e.g. effects assumed negligible in the formal assumptions are not negligible) there is a surprising degree of robustness; that is, the bias is of smaller magnitude than would be expected. Also when one or more rows contain only responders, fitting an analysis model that omitted some active effects led to relatively robust estimators of the remaining effects. This robustness is discussed at greater length following simulation 2. We present three simulations that exemplify the above findings.

The simulations below use Design 2 (see Table 3). In this example the formal assumptions are that there are no three way or higher order stage 2 interactions and there are no four way or higher order nuisance interactions involving R and stage 1 factors. In the simulations provided below, R is generated using a response rate given by logit E[R|F1]) = .6 + .1S + .1B + .1C + .1T. This results in response rates varying from .55 to .73 across the 16 rows; this wide range of early response rates is extreme for the mental health/substance abuse fields however it allows us to illustrate the issues. Y is normally distributed with residual variance, σ2, set to 1. We present results with the signal-to-noise ratio (SNR= effect size/residual variance) equal to .25 or .35 units per standard deviation. In the simulations, active main effects are equal to the SNR× σ, active two way interactions are equal to SNR×.5×σ and active three way interactions are set equal to SNR× .25 × σ. These settings are consistent with the Hierarchical Ordering Principle (Wu and Hamada, 2000, pg. 112) which states that the lower order effects are more likely to be important than higher order effects and effects of the same order are equally likely to be important (this principle is used in the absence of domain knowledge indicating otherwise). Throughout all main effects and the interactions SB, SC, SG2, TG2 and TF2 are active; thus the working assumptions are incorrect (the working assumptions are that only the main effects of the factors and the two way interactions TG2, TF2, CG2, BF2 and BC are likely to be active).

We used the formula in Section 4.3 to determine the sample size required to detect a given SNR at 90% power with a 10% Type 1 error rate for pmin = .55 and pmax = .73. Following the recommendations in Collins et. al. (2007) interesting effects are tested using a Type 1 error rate of .1 (no correction for multiple tests) and the remaining effects are tested using a overall Type 1 error rate of .1 (using a Bonferroni correction). In all simulations we estimate the variance-covariance matrix of the effects using the formula given in Appendix A.1 (near the beginning of the proof of Lemma 4 in Appendix A.1). Table 4 provides the results from 1000 data sets; the formal assumptions hold for these data sets.

Table 4
Simulation 1: Formal Assumptions Hold1

Note that there are 6 effects that according to the working model should be negligible and are actually negligible. Given an overall error rate of .1, the empirical error rate should be around .011 and this is the case as can be seen by the rows labeled ST, BT, CT, BG2, SF2 and CF2. Similarly the empirical Type 1 error rate for the interesting effects should be around .1 and this can be seen by rows labeled BC, CG2 and BF2. The high power in detecting main effects is due to the conservatism used in selecting the group size. Recall that the response rates per group range from a low of .55 to an high of .73. The group sizes are chosen then to achieve the power .9 to detect stage 2 effects for non-responders when the non-response rate is .27. Since only one of the groups exhibits this low non-response rate, the group sizes are conservative, resulting in higher power. In simulations with a constant response rate across all 16 groups (not shown here) the power to detect a main effect for stage 2 non-responders varied from around .88 to .94 depending on the simulation.

In the next simulation, also of 1000 simulated data sets, the formal assumptions are violated; there are active three way stage 2 effects for responders. The results in Table 5 are surprisingly good; the results in this table would be expected if the interactions involving R were zero (no nuisance interactions). To see this, recall the defining words for this design are 1 = SBCTG2 (equivalently 1 = SBCTF2). Thus, for example the column in the stage 2 matrix X2(0) matrix associated with CTF2 is the same as the column associated with SB in X1. If the coefficient of the interaction (RE[R|A1])SB were zero (it is not) then we would expect that estimators of the stage 1 interactions such as SB will estimate this stage 1 interaction plus the product of the response rate times the effect of CTG2 plus the non-response rate times the effect of CTF2. In this case, this is .175+ response rate(.0875)+ non-response rate(0). If we use the average non-response rate (here .64), this yields .231 which is close to the average estimated effect for SB. Similar statements can be made for remaining stage 1 two way interactions. This robustness is explained by the fact that the response rates vary only from .55 to .73 across the groups. As discussed in Section 4, the less the response rates vary, the closer the predictors are to being orthogonal. Thus the estimators of the effects are approximately uncorrelated (the off diagonal elements of the correlation matrix are small{in this simulation the maximum correlation in absolute value is .12 and the average of the absolute value of the correlations is .03). Thus as long as the response rates do not vary greatly (as is the case in many areas of mental health and substance abuse) we can expect this robustness. Similar results hold when the formal assumptions concerning the nuisance effects are violated (not shown here).

Table 5
Simulation 2: Formal Assumptions are Violated1

Also this simulation, by chance, did not result in any samples with one or more groups containing only responders. However this can happen and did happen in other simulations (not shown). Indeed when SNR= .35 the group sizes had to be adjusted upwards to ensure that the probability that all groups have some non-responders is not too low (low is arbitrarily chosen to be .01; see Section 4.3). Even with this adjustment, a simulation size of 1000 samples will sometimes include some samples in which groups with no responders occur. If in a given data set some groups (e.g. corresponding to rows in the design) contain no responders we fit a model using the working assumptions, that is, Z1 = [1, S, B, C, T, BC] Z3 = Z1, Z2(1)=[F2,BF2,TF2] and Z2(0)=[G2,CG2,TG2].

The working assumptions are incorrect thus the omission of active effects biases the estimated effects. To examine the degree of bias we conducted a simulation in which the groups corresponding to the rows of the design were sufficiently small so that the chance of one or more groups with all responders was likely; in effect we simulated many samples saving only those that had one or more groups with all responders. Table 6 reports the results for 729 such samples.

Table 6
Simulation 3: One or More Cells with No Non-responders1

As with simulation 2, this simulation demonstrates robustness of the analysis method to the omission of active effects. Note that both the bias of the estimators and the quality of the standard errors is poorest for the stage 2 effects for non-responders. This is not surprising as many rows will have few non-responders.

7. Discussion

An attractive alternative to the use of randomized clinical trials in constructing dynamic treatment regimes is the use of observational data in which the components are not randomized. Here the variation in treatment is usually due to patient/clinician preference, patient adherence, component availability, etc. Observational data has the great advantage in that it often already exists and/or is less expensively obtained than data from the screening designs discussed here. However note that observational data suffers from two substantial drawbacks. First inferences regarding causal relations in the absence of randomization requires untestable assumptions (Rubin, 1978; Robins, 1992) concerning why individuals receive, or are offered, differing treatments. Second, even if we are comfortable with the required assumptions, then because we do not control the treatment levels in the observational study we have uncontrolled aliasing. That is, when our modeling assumptions do not hold, the form of the aliasing may be complex and difficult to ascertain. In contrast the screening designs considered here not only improve our ability to make causal inferences but also provide a greater understanding of the aliasing that occurs due to incorrect modeling assumptions. However there are many open problems. Four such problems follow.

First there is the question of how to construct the optimal experimental designs. A first thought might be to utilize the maximum resolution criterion or its refinement, the minimum aberration criterion (see Wu and Hamada, 2000). Note that the design in example 1 has only resolution IV whereas the design in example 2 has higher resolution (V). It is not necessarily true that the example 2 design is to be preferred since it requires formal assumptions whereas the example 1 design does not. More work is required to appropriately incorporate the roll of the formal assumptions into methods for comparing these designs.

Second as discussed in Step 4 of the Algorithm in Section 4.2 some designs permit multiple analysis models; because of the non-orthogonality of the predictors, different models result in tests with different powers. A strategy for choosing among these different analysis models is needed.

Third, designs with a smaller number of groups (e.g. treatment combinations) might be considered. To do so, we would begin by eliminating negligible effects from a potential model via formal and working assumptions. Next, we could attempt to find, say, a D-optimal design (e.g., see Wu and Hamada, 2000) that would permit estimation of the remaining effects. This would yield designs that have fewer than 2km groups. However, such designs rarely have the nice aliasing structure found with 2km factorial designs. Furthermore, the designs proposed here frequently permit us to assess the working assumptions, unlike most D-optimal designs (see design 3 in Section 5). We view this as an important advantage.

Lastly, this paper has not discussed potential secondary analyses that might be used with data from a 2km design. Such analyses might consider how best to utilize time-varying patient covariates in the construction of the decision rules. The methods of Murphy (2003) and Robins (2004) require generalization as these methods require randomization or, at a minimum, stochastic variation in assigned stage 2 factors.


We acknowledge support for this project from National Institute of Health grants RO1 MH080015 and P50 DA10075.


A.1 Lemmas

Proof of Lemma 3

First assume there are factors for both responders and non-responders. We have for R = 0, 1,


where the defining words specify the common columns in X1, X2(1) and X2(0) (Dp is a diagonal matrix with p on the diagonal). Note that the unique columns in {X1, X2(1)} and in {X1, X2(0)} each number at most 2km. We group terms in the above display so that only unique columns remain. First if a column occurs k times in X1 then delete k −1 of the columns and sum all terms in ψ and [var phi] associated with the column. Each sum becomes the coefficient of the remaining column; all of the remaining entries in ψ, [var phi] remain the same. This process is repeated until all columns are unique resulting in a matrix Z1 and two vectors of coefficients ψ′, [var phi]′ for which X1ψ = Z1ψ′ and X1[var phi] = Z1[var phi]′. We follow a similar procedure for the two stage 2 matrices resulting in X2(1)β=Z2(1)β and X2(0)α=Z2(0)α.

Next if assumption 3a was used then some of the entries in ψ′ are known to be zero. Delete the associated columns from Z1 to form Z3 and delete these entries from ψ′. If assumption 3b was made then follow the same procedure for Z2(1) and β′ and similarly for Z2(0) and α′.

At this point there are no duplicate columns in Z1 or in Z3 or in Z2(1) or in Z2(0). Furthermore there is no column common to both Z3 and Z2(1) and no column common to both Z3 and Z2(0). This is because such a column would have been assumed to either have a ψ coefficient or α, β coefficients equal to zero by assumptions 3a), 3b) respectively. There may, however, be one or more columns which are common to all three, Z1, Z2(1),Z2(0), matrices. Note the above discussion implies these columns are not in Z3. To obtain regression coefficients that are identifiable, we need to eliminate each common column from one of these three matrices. To see this consider one such common column, z which is, say, the ith column of Z1, the jth column of Z2(1) and the kth column of Z2(0). This column contributes zφi+Rzβj+(1R)zαk to E[Y|R]. We can write this sum in three ways corresponding to how we delete this column from one of the three matrices:


If we delete z from Z1 then we will be able to identify the sum of the associated stage 1 and stage 2 effects; if we delete z from Z2(1) then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for responders and also the difference between the stage 2 effect for non-responders and the stage 2 effect for responders; if we delete z from Z2(0) then we will be able to identify the sum of the associated stage 1 effect and the stage 2 effect for non-responders and also the difference between the stage 2 effect for responders and the stage 2 effect for non-responders.

Note the assumption that P[E[R|A1] > 0] = 1 (P [E[R|A1] < 1] = 1) implies that the 2km ×1 vector E[Y|R = 1], respectively E[Y|R = 0], is identifiable. Furthermore the vector of response rates p is identifiable. We have


Multiply all terms in both of these equations by Z3T and subtract the lower equation from the top. Note that Z3 does not share a column with Z2(i) so Z3TZ2(i)=0 for i = 0, 1 and that Z3TZ3 is the identity matrix. We obtain ψ=Z3T(E[YR=1]E[YR=0]). So ψ′ is identifiable. A little more matrix algebra yields


Identifiability of [var phi]′, α′ and β′ follows from the fact that there are no columns common to all three of the matrices.

In there are no factors for responders (non-responders) remove X2(1)β (respectively X2(0)α) from (13). Then follow the above arguments.

Proof of Lemma 4

We assume there are factors for both responders and non-responders. Similar arguments can be used when there are only factors for one or the other. Recall a 2km design is used; that is {A1, A2} have a discrete uniform distribution over the rows of the 2km design. Recall that [p with hat] (U1) denotes the observed response proportion for each unique value of U1. Let η be the parameter in the bernoulli distribution of R given A1 parameterized as p(U1)=U1Tη. Let γ′ = [(ψ′)T, ([var phi]′)T, (β′)T, (α′)T]T (placing ψ′ first is for convenience in the invertibility proof below). Note that {Z3, Z1, Z2(1),Z2(0)} in (12) are simply functions of A1, A2. Z1, Z3 are subvectors of U1.

Define Z(η)=[(RU1Tη)Z3T,Z1T,RZ2(1)T,(1R)Z2(0)T]. Also let EN be expectation with respect to the empirical distribution of the data. Then solve


for γ′, η to obtain [gamma with circumflex]′, [eta w/ hat]. The second equation is the normal equation for a regression of R on U1. Recall the model used for the conditional distribution of R is saturated so weighting by the variance of R would not alter the results; in fact the value of u1Tη^ will be equal to the observed combined response rate for the rows consistent with u1. The asymptotic arguments are standard exercises. As a result we provide only an outline. First, standard asymptotic arguments are sufficient to show convergence in probability of [gamma with circumflex]′ to γ′ and [eta w/ hat] to η. Furthermore define Σγ= E[Z(η)Z(η)T], γ,η=E[Z(η)(Z3Tψ)U1T] and η,η=E[U1U1T] and the corresponding estimators by [Sigma]γ = EN[Z([eta w/ hat])Z([eta w/ hat])T], ^γ,η=EN[Z(η^)(Z3Tψ^)U1T] and ^η,η=EN[U1U1T]. Note that Ση,η is the identity matrix due to the orthogonality of design (Σ^η,η may not be the identity since in finite samples, the groups sizes may be unequal). It is easily shown that [Sigma]γ, [Sigma]γ,ηand [Sigma]η,η converge in probability (as the sample size N → ∞) Σγ, Σγ,η and Σηη.

Below we show that Σγ is invertible. Again standard arguments can be used to show that


where (γ,η,γ,η,η,η)=Z(η)(YZ(η)Tγ)γ,ηη,η1U1(RU1Tη). Thus the asymptotic variance-covariance matrix of N(γ^γ) is


A consistent estimator of the asymptotic variance-covariance matrix is provided by replacing Σγ by [Sigma]γ, Σγ,η by [Sigma]γ,η, Ση,η by [Sigma]η,η, η by [eta w/ hat], γ′ by [gamma with circumflex]′ and the operator E by EN in the above. To improve the accuracy of the variance estimators when samples are small, we recommend adjusting the formula for the number of estimated parameters. That is multiply the estimated variance-covariance matrix by N/(Nqγqη) where qγ is the dimension of γ′ and qη is the dimension of η.

Proof that Σγ is invertible

Construct the matrices Z(1), Z(3), Z2(1),Z2(0) as in the proof of Lemma 3 above. Note each row is equal to one of the 2km equally probable realizations of [Z1TZ3T,(Z2(1))T,(Z2(0))T]. Define p to be a vector of the 2km response probabilities (e.g. the ith entry is given by ui1Tη where ui1 is the value of U1 corresponding to the ith row of the 2km design). Define Dp to be a diagonal matrix with diagonal elements equal to −p. Define


where the 0’s denote conforming matrices with all entries equal to zero. Then 2kmΣγ can be written as


We show that V is of full rank. This combined with the assumption that the response probabilities are bounded away from both 0 and 1 will imply that Σγ is invertible.

Note that V will have less than 2(2km) columns (it has 2(2km) rows) if there are unequal numbers of factors for responders and non-responders and/or unnecessary formal assumptions have been made. We can add columns to V by adding back in columns to Z3 associated with nuisance effects that did not necessarily need to be assumed zero (see example 3 for an illustration). Suppose there are fewer stage 2 factors for responders than for non-responders; this means that some of the stage 2 factors for non-responders are not stacked with a stage 2 factor for responders. Add the columns in Z2(0) associated with these non-stacked factors to Z2(1). A similar procedure can be followed in there are fewer stage 2 factors for non-responders than responders. If we prove this augmented 2(2km) × 2(2km) matrix V is of full rank then certainly any subset of columns is also full rank. Hence it suffices to prove that V is of full rank in the case when it is square (has 2(2km) columns).

Next V = V1 + V2 where


V1 is easily shown to be of full rank, once one identifies common columns in Z3, Z1, Z2(1),Z2(0). Note by assumption there are no columns common to Z3 and Z2(1) or Z3 and Z2(0). And there are no columns common to all three Z(1), Z2(1),Z2(0). Thus Z3 is composed of columns common with Z1 and columns unique to Z3. Similarly Z1 is composed of columns that are common with Z3, columns common with Z2(1) and columns common with Z2(0). Z1 has no unique columns(if it did then this column should have been added to Z3 when V was augmented). The inner product of a column with itself is 2km and the inner product of any other two columns is zero. These facts imply that the inverse of V1TV1 is easily found. Let q3 be the number of columns in Z3. For the purposes of this proof we need only note that the first q3 rows of (V1TV1)1 are proportional to the matrix:


where q3c is the number of common columns between Z3 and Z1 and q3u is the number of columns unique to Z3. In this we reordered the columns of Z3 and Z1 so that the common columns appear first.

Next because V1 is of full rank and square (thus invertible) we can write,


Using the formula for the first q3 rows of (V1TV1)1 we find that the term in parentheses is proportional to a matrix of the form


where U is of dimension 2(2km)−q3×q3. This square matrix has nonzero determinant and is thus invertible. V is invertible.


First we derive the semiparametric efficient score. Then we discuss how this score can be used to produce a locally semiparametric efficient estimator of γ′. Lastly we prove that in the saturated model the estimators are semiparametric efficient and that under an additional assumption on the variances (this assumption is provided in Lemma 2) certain unsaturated models will also yield semiparametric efficient estimators.

Semiparametric Efficient Score

The semiparametric model is given by


where E[ε|A1, R, A2] = 0, R has a Bernoulli distribution with success rate, E[RA1,A2]=E[RA1]=U1Tη and γ′ belongs R (qγ ≤ 2km) and η belongs to R (qη is equal to the number of rows in the 2km design corresponding to unique levels of the stage 1 factors). Put q = qγ+qη. Recall that Z(η)=[(RU1Tη)Z3T,Z1T,RZ2(1)T,(1R)Z2(0)T] and γ′ = [(ψ′)T, ([var phi]′)T, (β′)T, (α′)T]T. The density for a single observation is given by


The nuisance parameters are the densities fε(·|a1, r, a2) where each is a mean zero density with respect to Lebesgue measure and the pi’s. Actually we know that pi = 1=2km, however the efficient score is the same whether we know the pi’s or not. This is because {A1, A2} is ancillary (Newey, 1990). All of the expectations in this subsection are with respect to this density. As is customary, to derive the efficient score, we make regularity and smoothness conditions on the likelihood of the semiparametric model; these conditions are given in definition A.1 of the appendix in Newey (1990). This model is only slightly different from the semiparametric restricted moment model treated at great length by Tsiatis (Ch. 4, 2006); in particular it differs since a parameter in the Bernoulli distribution of R (e.g. η) also appears in Z(η). In the following we use Tsiatis’s (2006) terminology and where possible use his results.

To derive the efficient score function for {γ′, η} we calculate the score vector for {γ′, η} and then form the projection of this score vector on the nuisance tangent space (see Theorem 4.1 of Tsiatis (2006)). The residual is the efficient score. Let H is the Hilbert space of q-dimensional, mean zero, finite variance functions of {A1, R, A2, Y} equipped with the inner product, <h1,h2>=E[h1Th2]. The tangent space for a particular parameter is the subset of H formed by mean square closure of all functions that are score functions for the parameter in a parametric submodel containing the true value of that parameter. Define


From Tsiatis (Section 4.5, 2006) we have that the tangent space for the densities fε(ε, a1, r, a2) is Λ1 ∩ Λ2 and that the tangent space for the distribution of {A1, A2} (e.g. the pi’s) is given by Λ4. Because the pi’s are variationally independent of each of the fε(ε|a1, r, a2) densities the nuisance tangent space is given by the direct sum of two orthogonal spaces,


Second, Tsiatis (Section 4.5, 2006) proves that


Since H=Λ2Λ2, we have that the nuisance tangent space, Λ = (Λ1 [union or logical sum]Λ2) [plus sign in circle]Λ4 is the subset of Λ2 which is orthogonal to Λ3, that is, Λ=Λ2Λ3. It follows that Λ[perpendicular] is the direct sum of Λ2 and the part of Λ2 that is contained in Λ3. Since Λ3 [subset or is implied by] Λ2 we have Λ=Λ2Λ3.

Recall the efficient score is the residual of the projection of the score vector for {γ′, η} onto the nuisance tangent space, Λ. Equivalently the efficient score is the projection of the score vector for {γ′, η} onto Λ[perpendicular]. In this case the latter is easier to compute. Tsiatis (Section 4.5, 2006) proves that


Because R is a binary random variable, members of Λ3 also have a simple form: h3(A1, R, A2) = g′(A1, A2)(RE[R|A1, A2]) for g′, a q-dimension function of {A1, A2}.

We have that the efficient score must be of the form, εg(A1, R, A2)+g′(A1, A2)(RE[R|A1, A2]) where both g and g′ are q-dimensional. Assume that E[ε2|A1, R, A2] is positive. Then it is easy to see that the projection of an h(ε, A1, R, A2) [set membership] H onto Λ2 is


Similarly, assuming that E[R|A1, A2] [set membership] (0, 1),


If we denote the scores for γ′, η, by Sγ, Sη, respectively we see need only calculate E[Sγε|A1, R, A2], E[Sγ(RE[R|A1, A2])|A1, A2] and similar quantities for Sη. Following Tsiatis (2006) we do this indirectly without explicitly calculating Sγ and Sη. We have


for all values of a1, r, a2, η, γ′ (recall u and z are functions of these). Differentiating both sides of the first two equations with respect to γ′ we obtain


That is, 0 = E[Sγ(ε, A1, R, A2)|A1, R, A2] and Z(η) = E[εSγ(ε, A1, R, A2)|A1, R, A2]. The former implies that E[Sγ(ε, A1, R, A2)(RηTU1)|A1, A2] = 0. Differentiating both sides of each of the last two equations with respect to η results in:


That is, (Z3Tψ)U1=E[εSη(ε,A1,R,A2)A1,R,A2] and U1=E[(RU1Tη)Sη(A1,R,A2)A1,A2]. Putting these results together we obtain the efficient score for (γ′, η):


where σA1,R,A22=E[(YZ(η)Tγ)2A1,R,A2]. The variance-covariance matrix of Seffeff = E[Seff (Y, A1, R, A2; γ′, η, σ)Seff (Y, A1, R, A2; γ′, η, σ)]) is given by


Although we did not show this above, the smoothness assumptions imply that the estimators {[gamma with circumflex]′, [eta w/ hat]} are regular, asymptotically linear estimators (see Tsiatis (2006) for definitions). If Σeff were singular then no regular, asymptotically linear estimators can exist (Chamberlain, 1986). Thus Σeff must be invertible. Furthermore for any regular and asymptotically linear estimator, say (([gamma with tilde]′)T, [eta w/ tilde]T), and a q-dimensional vector, a, we have that a lower bound on the asymptotic variance of N((γ)T,ηT)((γ)T,ηT))aisaTeff1a.

A Semiparametric Efficient Estimator

To construct a semiparametric efficient estimator we employ the efficient score, Seff as follows. As before we assume that P[0 < E[R|A1] < 1] = 1. We also assume that P [V ar(Y |A1, R, A2) > 0] = 1. First use the method described in the body of the paper (and above in this Appendix) to produce consistent estimators [gamma with circumflex]′, [eta w/ hat]. Next set σ^A1,1,A22(σ^A1,0,A22) to be the sample variance of the residual (YZ([eta w/ hat])T[gamma with circumflex]′) for responders (respectively, non-responders) for each group (each row) of the design. Solve


for γ′, η to obtain γ^eff, [eta w/ hat]eff. Standard arguments can be used to show that


as N → ∞

Semiparametric Efficiency of {[gamma with circumflex]′, [eta w/ hat]}

In general the model(12) should be saturated; this is because each predictor associated with an effect (or one of its aliases) will occur in two vectors, either in Z1 and Z3 or in Z1 and Z2(1) or in Z1 and Z2(0) or in Z2(1) and Z2(0) (Similarly (9) will generally be saturated as well). An exception to this pattern (and thus the model will not be saturated) occurs if there are unnecessary formal assumptions; that is both assumptions 3a) and 3b) are made concerning the effects associated with a particular column in Lemma 3 rather than just one of these. In this case the “Algorithm for Constructing the Regressors for Use in the Screening Analysis” of Section 4.2 will remove too many predictors from Z3.

Assuming that no unnecessary formal assumptions are made, the only other reason why (12) (or 9) will not be a saturated model is if there is a stage 2 factor for only responders (or only non-responders) and the design specifies that this factor’s levels are crossed with the levels of the remaining factors (e.g. this factor is independent of the remaining factors). In this latter case effects involving this stage 2 factor can only be aliased with other effects involving this stage 2 factor. As a result each predictor associated with effects of this stage 2 factor (or one of its aliases) can only be included once and only in Z2(1). However under the assumption that P[Var(YA1,R,A2)=RVar(YA1,R,A2(R))]=1 we show that in this non-saturated case, {[gamma with circumflex]′, [eta w/ hat]} is locally semiparametric efficient. The adjective, “locally” is used because the semiparametric efficiency holds if this assumption holds and otherwise not.

First we prove that {[gamma with circumflex]′, [eta w/ hat]} is semiparametric efficient when (12) is a saturated model. For simplicity suppose there are stage 2 factors for both responders and nonresponders. Assume P[0 < E[R|A1] < 1] = 1 and P [V ar(Y |A1, R, A2) > 0] = 1. This means that the probability of P[{mina1,r,a2σ^a1,r,a2>0}{minu1u1Tη^1u1Tη^>0}]1 as N → ∞. Thus we assume in the sample the proportion of responders in each of the 2km groups is not equal to one or zero. In this case we prove that {γ^,η^}={γ^eff,η^eff}.

First since EN[U1(RU1Tη^)]=0 we have that EN[U1RU1Tη^U1Tη^(1U1Tη^)] is zero as well. To see this suppose that there are ν unique values of U1 in the design (ν ≤2k−m). Then,


where Ũ1 is a ν × ν matrix with columns corresponding to the ν realizations of U1, D is a diagonal matrix with ith entry equal to EN [R1 U1=u1i] and R is the ν dimensional vector of observed response proportions. Since EN[U1(RU1Tη^)]=0,U1T and D are invertible, we have that [eta w/ hat] satisfies R = Ũ1[eta w/ hat]. Since we can write EN[U1RU1Tη^U1Tη^(1U1Tη^)] in a similar matrix formulation (only the diagonal matrix is altered) we see that this formula is equal to zero as well.

Next since EN [Z([eta w/ hat]) YZ([eta w/ hat])T[gamma with circumflex]′ = 0, both EN[Z(η^)YZ(η^)Tγ^σA1,R,A22] and EN[(Z3Tψ^)U1YZ(η^)Tγ^σA1,R,A22] will be zero as well. To see this we use a matrix formulation. First let EN [Y1] be a 2km dimensional vector in which the ith entry corresponds to the sample average of Y R for the ith group (row in design) divided by the sample average of R in the ith group. Similarly define EN [Y0] be a 2km dimensional vector in which the ith entry is the sample average of Y (1 − R) for the ith group (row in design) divided by the sample average of 1 − R in the ith group. Define a vector of response proportions of the same dimension, say [p with hat] (the ith entry corresponds to a u1iT[eta w/ hat]. With these definitions we construct an empirical version of V defined in the proof of Lemma 4. Define


where the 0’s denote conforming matrices with all entries equal to zero and Z1, Z3, Z2(1),Z2(0) were defined in the proof of Lemma 4 above. Since the model is saturated, V is a square matrix of dimension 2(2km). Using the same proof as used in Lemma 4 to show that V is full rank, we can show that V is full rank as well (and hence invertible). Define D1 to be a diagonal matrix with ith diagonal entry equal to the proportion of responders in the ith group multiplied by the group size. Similarly define D0 to be a diagonal matrix with ith diagonal entry equal to the proportion of non-responders in the ith group multiplied by the group size.

We can write 0 = EN [Z([eta w/ hat]) YZ([eta w/ hat])T [gamma with circumflex]′)] as


Thus in the saturated model, [gamma with circumflex]


Since we can write EN[Z(η^)YZ(η^)Tγ^σ^A1,R,A22] and EN[(Z3Tψ^)U1YZ(η^)Tγ^σ^A1,R,A22] as matrices times the above difference we have these two terms are zero as well. Combining the above with the results for [eta w/ hat], we have that when the model is saturated and the sample is sufficiently large so that the response rates in each of the 2km groups are neither zero or one, γ^=γ^eff and [eta w/ hat] = [eta w/ hat]eff.

Next we prove that in the non-saturated model, {[gamma with circumflex]′, [eta w/ hat]} will be locally semi-parametric efficient if 1) no unnecessary formal assumptions are made (assumptions 3a and 3b are made concerning the effects associated with a particular column in Lemma 3 rather than just one of these) and 2) there is a stage 2 factor for responders (or non-responders) that is independent of the remaining factors. Assume that Var(YA1,R=r,A2)=Var(YA1,R=r,A2(r)) for r = 0, 1. Under these assumptions on the residual variance we estimate σA1,1;A2 by the sample variance of the residual (YZ([eta w/ hat])T [gamma with circumflex]′) for responders for each combination of groups (rows) in the design with a unique value of {A1, A2(1)} and similarly for non-responders. Denote the estimators by σ^A1,r,A2(r) for r = 0, 1. A locally semiparametric estimator can be found by solving (15) (with [sigma with hat]A1,R,A(R) instead of [sigma with hat]A1,R,A2).

As in the above both EN[U1RU1Tη^U1Tη^(1U1Tη^)] and EN[U1(RU1Tη^)] are identically zero. Next consider the first equation in (14) and (15) (in the latter [sigma with hat]A1,R,A2 is replaced by [sigma with hat]A1,R,A(R)).

For clarity, assume there is only one more stage 2 factor for responders than non-responders and that this stage 2 factor is randomized independently of all other factors. This means that while there are 2km−1 unique realizations of {A1, A2(1)} there are only 2km−1 unique realizations of {A1, A2(0)}. Now (16) continues to hold but V although a full rank matrix, is no longer square. It’s dimensions are 2 (2km) × 2km + 2km−1. A closer inspection of V reveals that in the lower 2km rows there are only 2km−1 unique rows, each has a one double. Thus if we remove these 2km−1 doubles from V (say to form V′) we now have a square matrix which is full rank. We can re-express (16) as follows. Let D0 and EN [Y0] be defined as before. The latter vector is of dimension 2km. Define EN [Y0]′ to be 2km−1 dimension with the ith entry corresponds to the sample average of Y (1 − R) for the ith unique realization of {A1, A2(0)} divided by the sample average of R in the ith realization. Define D0 to be a diagonal matrix with ith diagonal entry equal to proportion of responders in the group formed by the ith unique realization of {A1, A2(0)} multiplied by the group size. Now we can reexpress (16) as


As before the invertibility of V′ implies that the last term in parentheses is equal to zero. Because we can express both


as matrices times the above difference we have these two terms are zero as well. Note we would not have been able to do this had we used the estimator σ^A1,R,A2(R)2 instead of σ^A1,R,A22 in (15). Combining the above with the results for [eta w/ hat], we have that when the sample is sufficiently large so that the response rates in each of the groups defined by unique values of {A1, A2(1)} and {A1, A2(0)} are neither zero or one then γ^=γ^eff and [eta w/ hat] = [eta w/ hat]eff.

A.2 Aliasing and The Formal Assumptions

Here we provide a simple example of how the aliasing becomes difficult to interpret when the formal assumptions do not hold. Suppose there are two factors at stage 1 and one stage 2 factor for responders. We are interested in the main effects of these factors and we are willing to make the working assumption that there are no interactions. Suppose we are willing to make the formal assumption that all effects of R (including interactions between R and the stage 1 factors) are negligible. We use the experimental design in Table 2 with defining word 1 = A11A12A2 (A2 is A2(0) here).

The conditional mean of Y given A1, R, A2 is given by (12). Suppose that it is known that ψj = 0, j = 2, 3, 4. However we do not know whether any of the remaining parameters are zero or not. The experimental design in Table 2 permits the identification of the η parameters in p(A1) = η1 + η2A11 + η3A12 + η4A11A12 and the eight conditional means E[Y |A11 = a11, A12 = a12, R = r, A2 = a11a12] (recall that A2 = A11A12 in this design). The conditional means are


for a11 [set membership] {−1, 1}, a12 [set membership] {−1, 1} and r [set membership] {0, 1}. So from data resulting from the experimental design in Table 2 we can identify the eight quantities such as: [var phi]1 + ψ1, [var phi]2 + η2ψ1, [var phi]3 + η3ψ1, [var phi]4+η4ψ1, ψ1 + β4, β1, β2 and β3. The above nonlinearity in parameters makes it difficult to interpret the identifiable quantities involving the [var phi]j’s. More complex designs result in similarly uninterpretable quantities. In summary, to maintain interpretability, we consider experimental designs that do not alias a potentially active stage 2 effect with a potentially active nuisance effect.

Contributor Information

S. A. Murphy, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, ude.hcimu@yhprumas.

D. Bingham, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, V5A 1S6, ac.ufs.tats@mahgnibd.


1. Box GEP, Hunter WG, Hunter JS. Statistics for Experimenter: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley & Sons; 1978.
2. Byar DP, Piantadosi S. Factorial designs for randomized clinical trials. Cancer Treat Rep. 1985 Oct;69(10):1055–63. [PubMed]
3. Brooner RK, Kidorf M. Using behavioral reinforcement to improve methadone treatment participation. Science and Practice Perspectives. 2002;1:38–48. [PMC free article] [PubMed]
4. Chamberlain G. Asymptotic efficiency in semi-parametric models with censoring. Journal of Econometrics. 1986;32:189–218.
5. Collins LM, Murphy SA, Strecher V. The Multiphase Optimization Strategy (MOST) and the Sequential Multiple Assignment Randomized Trial (SMART): New Methods for More Potent e-Health Interventions. American Journal of Preventive Medicine. 2007;32(5S):S112–118. [PMC free article] [PubMed]
6. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ. Background and Rationale for the Sequenced Treatment Alternative to Relieve Depression (STAR*D) Study. Psychiatric Clinics of North America. 2003;26(3):457–494. [PubMed]
7. Fries A, Hunter WG. Minimum Aberration 2k−p Designs. Technomet-rics. 1980;22:601–608.
8. Hu F, Rosenberger WF. The Theory of Response-Adaptive Randomization in Clinical Trials. Hoboken, NJ: John Wiley & Sons, Inc; 2006.
9. ICH E9. ICH Harmonised Tripartite Guideline, Statistical Principles for Clinical Trials. Statistics in Medicine. 1999;18:1905–1942. [PubMed]
10. Lavori PW, Dawson R, Rush AJ. Flexible treatment strategies in chronic disease: Clinical and research implications. Biological Psychiatry. 2000;48:605–614. [PubMed]
11. Montgomery DC, Jennings CL. An Overview of Industrial Screening Experiments. In: Dean A, Lewis S, editors. Screening Methods for Experimentation in Industry, Drug Discovery, and Genetics. NewYork: Springer; 2006.
12. Murphy SA. Optimal Dynamic Treatment Regimes. Journal of the Royal Statistical Society, Series B (with discussion) 2003;65(2):331–366.
13. Murphy SA. An Experimental Design for the Development of Adaptive Treatment Strategies. Statistics in Medicine. 2005;24:1455–1481. [PubMed]
14. Murphy SA, van der Laan MJ, Robins JM. CPPRG. marginal mean models for dynamic regimes . JASA. 2001;96:1410–1423. [PMC free article] [PubMed]
15. Newey WK. Semiparametric efficiency bounds . Journal of Applied Econometrics. 1990;5(2):99–135.
16. Parmigiani G. Modeling in Medical Decision Making: A Bayesian Approach. Wiley & Sons, Ltd; England: 2002.
17. Patel MS. Group-Screening with More Than Two Stages . Technometrics. 1962;4(2):209–217.
18. Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods-application to control of the healthy worker survivor effect. Computers and Mathematics with Applications. 1986;14:1393–1512.
19. Robins JM. Addendum to “A new approach to causal inference in mortality studies with sustained exposure periods - Application to control of the healthy worker survivor effect” Computers and Mathematics with Applications. 1987;14:923–945.
20. Robins JM. Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika. 1992;79:321–34.
21. Robins JM. Causal Inference from complex longitudinal data. Latent Variable Modeling and Applications to Causality. In: Berkane M, editor. Lecture Notes in Statistics. Vol. 120. Springer-Verlag, Inc; New York: 1997. pp. 69–117.
22. Robins JM, Greenland S. Adjusting for differential rates of PCP prophylaxis in high- versus low-dose AZT treatment arms in an AIDS randomized trial. Journal of the American Statistical Association. 1994;89:737–749.
23. Robins JM, Wasserman L. Estimation of effects of sequential treatments by reparameterizing directed acyclic graphs . In: D Geiger, P Shenoy., editors. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann; San Francisco: 1997. pp. 409–42.
24. Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty P, editors. Proceedings of the Second Seattle Symposium on Biostatistics. New York: Springer; 2004.
25. Rubin DB. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology. 1974;66(5):688–701.
26. Rubin DB. Bayesian inference for causal effects: The role of randomization. The Annals of Statistics. 1978;6:34–58.
27. Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006 .
28. JCF Wu, Hamada M. Experiments: Planning, Analysis, and Parameter Design Optimization. New York: John Wiley & Sons, Inc; 2000.
29. Zacks S. Adaptive Designs for Parametric Models. In: Ghosh S, Rao CR, editors. Handbook of Statistics. Vol. 13. Elsevier Science B.V; Amsterdam: 1996.