
Article sections

- Abstract
- 1 Introduction
- 2 Inference via g-estimation and exceptional laws
- 3 Bias-correction at exceptional laws: Zeroing Instead of Plugging In (ZIPI)
- 4 Discussion
- References


Scand Stat Theory Appl. Author manuscript; available in PMC 2010 September 22.

Published in final edited form as:

Scand Stat Theory Appl. 2009 September 22; 37(1): 126–146.

doi: 10.1111/j.1467-9469.2009.00661.x

PMCID: PMC2880540

NIHMSID: NIHMS152153

Optimal dynamic regimes: bias correction

Erica E. M. Moodie, Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, Purvis Hall, 1020 Pine Ave W., Montreal, QC H3A 1A2 Canada, Email: erica.moodie@mcgill.ca


A dynamic regime provides a sequence of treatments that are tailored to patient-specific characteristics and outcomes. In 2004 James Robins proposed g-estimation using structural nested mean models for making inference about the optimal dynamic regime in a multi-interval trial. The method provides clear advantages over traditional parametric approaches. Robins’ g-estimation method always yields consistent estimators, but these can be asymptotically biased under a given structural nested mean model for certain longitudinal distributions of the treatments and covariates, termed exceptional laws. In fact, under the null hypothesis of no treatment effect, every distribution constitutes an exceptional law under structural nested mean models which allow for interaction of current treatment with past treatments or covariates. This paper provides an explanation of exceptional laws and describes a new approach to g-estimation which we call Zeroing Instead of Plugging In (ZIPI). ZIPI provides nearly identical estimators to recursive g-estimators at non-exceptional laws while providing substantial reduction in the bias at an exceptional law when decision rule parameters are not shared across intervals.

In a study aimed at estimating the mean effect of a treatment on a time-dependent outcome, dynamic treatment regimes are an obvious choice to consider. A dynamic regime is a function that takes treatment and covariate history as arguments; the value returned by the function is the treatment to be given at the current time. A dynamic regime thus provides a sequence of treatments that are tailored to patient-specific characteristics and outcomes. A recently proposed method of inference (Robins, 2004) is g-estimation using structural nested mean models: this method is semi-parametric and robust to certain types of mis-specification, and in this regard it is superior to traditional parametric approaches. However, under specific data distributions which Robins has termed *exceptional laws*, g-estimators of the parameters of optimal structural nested mean models are asymptotically biased and exhibit non-regular behaviour.

The aim of this paper is to provide a means of reducing the asymptotic bias of g-estimators in the presence of exceptional laws. In the optimal dynamic regimes literature, exceptional laws are discussed only by Robins (2004), who proposes a method of detecting these laws and improving the coverage of confidence intervals, but not of correcting the bias of point estimates.

Structural nested mean models and g-estimation are discussed in section 2.1, followed by a detailed explanation of exceptional laws in g-estimation in section 2.2. Section 3.2 describes a method of reducing the asymptotic bias due to exceptional laws which we call *Zeroing Instead of Plugging In* (ZIPI), with theoretical properties derived in section 3.3. Sections 3.4 and 3.5 compare ZIPI to g-estimation and to the Iterative Minimization for Optimal Regimes (IMOR) method proposed by Murphy (2003). Section 4 concludes.

We consider the estimation of treatment effect in a longitudinal study. All of the methods discussed in this paper may be extended to a finite number of treatment intervals. We focus on the two-interval case where a single covariate is used to determine the optimal rule at each interval.

Notation is adapted from Murphy (2003) and Robins (2004). Treatments are taken at *K* fixed times, *t*_{1}, …, *t*_{K}.

In the two-interval case, treatments are taken at two fixed times, *t*_{1} and *t*_{2}, with outcome measured at *t*_{3}. The covariates measured prior to treatment at the beginning of the first and second intervals, i.e. at *t*_{1} and *t*_{2}, are *L*_{1} and *L*_{2}, respectively. In particular, *L*_{1} represents baseline covariates and *L*_{2} includes time-varying covariates which may depend on treatment received at *t*_{1}. The treatment given subsequent to observing *L*_{j} is *A*_{j}, *j* = 1, 2.

Throughout this paper, models are explained in terms of *potential outcomes*: the value of a covariate or the outcome that would result if a subject were assigned to a treatment possibly different from the one that they actually received. In the two-interval case we denote by *L*_{2}(*a*_{1}) a subject’s potential covariate at the beginning of the second interval if treatment *a*_{1} is taken by that subject, and *Y*(*a*_{1}, *a*_{2}) denotes the potential end-of-study outcome if regime (*a*_{1}, *a*_{2}) is followed.

Potential outcomes adhere to the *axiom of consistency*: *L*_{2}(*a*_{1}) = *L*_{2} whenever treatment *a*_{1} is actually received, and *Y*(*a*_{1}, *a*_{2}) = *Y* whenever (*a*_{1}, *a*_{2}) is received. That is, the actual and counterfactual covariates (or outcome) are equal when the potential regime is the regime actually received.

We use the following definitions: an estimator ${\hat{\psi}}_{n}$ of the parameter ψ^{†} is $\sqrt{n}$-*consistent* if $\sqrt{n}({\hat{\psi}}_{n}-{\psi}^{\u2020})={O}_{p}(1)$ with respect to any distribution *P* in the model. Similarly, if under *P*, $\sqrt{n}({\hat{\psi}}_{n}-{\psi}^{\u2020})\stackrel{D}{\to}T$ and *E*|*T*| < ∞, then *E*(*T*) is the $(\sqrt{n})$ *asymptotic bias* of ${\hat{\psi}}_{n}$ under *P*, which we denote ${\text{AsyBias}}_{P}({\hat{\psi}}_{n})$. If for some law *P* in the model ${\text{AsyBias}}_{P}({\hat{\psi}}_{n})=0$, then ${\hat{\psi}}_{n}$ is said to be $(\sqrt{n})$ *asymptotically unbiased* under *P*. (We will usually suppress the $\sqrt{n}$ and the *P* in our notation.) As shall be seen, g-estimators are $\sqrt{n}$-consistent under all laws; however, they are not asymptotically unbiased under certain distributions of the data which, following Robins (2004), we term ‘exceptional laws’ (see section 2.2).

To estimate the effect of a dynamic regime, we require the *stable unit treatment value assumption* (Rubin, 1978), which states that a subject’s outcome is not influenced by other subjects’ treatment allocation. In addition we assume that there are *no unmeasured confounders* (Robins, 1997), i.e. for any regime *ā*_{K},

$${A}_{j}\phantom{\rule{thinmathspace}{0ex}}\u2568\phantom{\rule{thinmathspace}{0ex}}({L}_{j+1}({\mathit{\u0101}}_{j}),\dots ,{L}_{K}({\mathit{\u0101}}_{K-1}),Y({\mathit{\u0101}}_{K}))\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{j},{\mathit{\u0100}}_{j-1}={\mathit{\u0101}}_{j-1}.$$

In a trial in which sequential randomization is carried out, the latter assumption always holds. In two intervals for any *a*_{1}, *a*_{2} we have that *A*_{1} ╨ (*L*_{2}(*a*_{1}), *Y*(*a*_{1}, *a*_{2})) | *L*_{1} and *A*_{2} ╨ *Y*(*a*_{1}, *a*_{2}) | (*L*_{1}, *A*_{1} = *a*_{1}, *L*_{2}), that is, conditional on history, treatment received in any interval is independent of any future potential outcome.

Without further assumptions, the optimal regime may only be estimated from among the set of *feasible* regimes (Robins, 1994), i.e. those treatment regimes that have positive probability; in order to make non-parametric inference about a regime we require some subjects to have followed it.

Robins (2004, pp. 209–214) produced a number of estimating equations for finding optimal regimes using structural nested mean models (SNMM) (Robins, 1994). We use a subclass of SNMMs called the *optimal blip-to-zero* functions, denoted by γ_{j}(*l̄*_{j}, *ā*_{j}), defined as the expected difference in outcome when using the “zero” regime instead of treatment *a*_{j} at interval *j*, given that optimal treatment is received in all subsequent intervals:

$$\begin{array}{c}{\gamma}_{1}({l}_{1},{a}_{1})=E[Y({a}_{1},{d}_{2}^{\mathit{\text{opt}}}({l}_{1},{a}_{1},{L}_{2}({a}_{1})))-Y({0}_{1},{d}_{2}^{\mathit{\text{opt}}}({l}_{1},{0}_{1},{L}_{2}({0}_{1})))\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1}={l}_{1}],\hfill \\ {\gamma}_{2}({\overline{l}}_{2},{\mathit{\u0101}}_{2})=E[Y({a}_{1},{a}_{2})-Y({a}_{1},{0}_{2})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}({\overline{L}}_{2},{A}_{1})=({\overline{l}}_{2},{a}_{1})].\hfill \end{array}$$

At the last (here, the second) interval there are no subsequent treatments, so the blip γ_{2}(·) is simply the expected difference in outcomes for subjects having taken treatment *a*_{2} as compared to the zero regime, 0_{2}, among subjects with history *l̄*_{2}, *a*_{1}. At the first interval, the blip γ_{1}(·) is the expected conditional difference between two counterfactual outcomes, both with optimal treatment given in the second interval: the outcome if treatment *a*_{1} is given in the first interval, and the outcome if the zero regime is given in the first interval.

We will use ψ to denote the parameters of the optimal blip function model, and ψ^{†} to denote the true values. For example, a linear blip ${\gamma}_{j}({\overline{l}}_{j},{\mathit{\u0101}}_{j};\psi )={a}_{j}({\psi}_{j0}+{\psi}_{j1}{l}_{j}+{\psi}_{j2}{l}_{j}^{2}+{\psi}_{j3}{a}_{j-1})$ implies that, conditional on prior treatment and covariate history, the expected effect of treatment *a*_{j} on outcome, provided optimal treatment is subsequently given, is quadratic in the current covariate *l*_{j} and depends additively on the previous treatment *a*_{j−1}.

Non-linear SNMMs are possible and may be preferable for continuous doses of treatment. For example, the SNMMs corresponding to the models in Murphy (2003) for both continuous and binary treatments are quadratic functions of treatment. It is equally possible to specify SNMMs where parameters are common across intervals. For instance, in a two-interval setting, the following blips may be specified: γ_{1}(*l*_{1}, *a*_{1}) = *a*_{1}(ψ_{0} + ψ_{1}*l*_{1}) and γ_{2}(*l̄*_{2}, *ā*_{2}) = *a*_{2}(ψ_{0} + ψ_{1}*l*_{2}). In this example, the same parameters, ψ_{0} and ψ_{1}, are used in each interval *j* and thus are said to be *shared*. The implied optimal decision rules are *I*[ψ_{0} + ψ_{1}*l*_{j} > 0] for intervals *j* = 1, 2.
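As a minimal sketch, a shared-parameter rule is a single thresholding function applied to the current covariate at every interval; the numeric parameter values below are illustrative only, not taken from any example in this paper:

```python
def d_opt(l, psi0=1.0, psi1=-0.5):
    # Shared-parameter optimal rule I[psi0 + psi1 * l > 0].
    # psi0 and psi1 are hypothetical illustrative values.
    return int(psi0 + psi1 * l > 0)

# The same rule is applied to L1 in interval 1 and to L2 in interval 2.
print([d_opt(l) for l in (0.0, 1.0, 3.0)])  # -> [1, 1, 0]
```

Note that the rule is discontinuous in its parameters: an arbitrarily small change in psi0 near the threshold flips the recommended treatment.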

Robins (2004, p. 208) proposes finding the parameters ψ of the optimal blip-to-zero function via g-estimation. For two intervals, let

$$\begin{array}{c}{H}_{1}(\psi )=Y-{\gamma}_{1}({L}_{1},{A}_{1};\psi )+[{\gamma}_{2}({\overline{L}}_{2},({A}_{1},{d}_{2}^{\mathit{\text{opt}}}({L}_{1},{A}_{1},{L}_{2}({A}_{1})));\psi )-{\gamma}_{2}({\overline{L}}_{2},{\mathit{\u0100}}_{2};\psi )],\hfill \\ {H}_{2}(\psi )=Y-{\gamma}_{2}({\overline{L}}_{2},{\mathit{\u0100}}_{2};\psi ).\hfill \end{array}$$

*H*_{1}(ψ) and *H*_{2}(ψ) are equal in expectation, conditional on prior treatment and covariate history, to the potential outcomes $Y({0}_{1},{d}_{2}^{\mathit{\text{opt}}}({L}_{1},{0}_{1},{L}_{2}({0}_{1})))$ and *Y*(*A*_{1}, 0_{2}), respectively. For the purpose of constructing an estimating procedure, we must specify a function *S*_{j}(*A*_{j}) of treatment, of the same dimension as ψ_{j}, and define

$${U}_{j}(\psi )=\{{H}_{j}(\psi )-E[{H}_{j}(\psi )\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{j},{\mathit{\u0100}}_{j-1}]\}\{{S}_{j}({A}_{j})-E[{S}_{j}({A}_{j})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{j},{\mathit{\u0100}}_{j-1}]\}.$$

(1)

For distributions that are not “exceptional” (see the definition in the next section), if $U(\psi )={\displaystyle {\sum}_{j=1}^{2}{U}_{j}(\psi )}$, then *E*[*U*(ψ^{†})] = 0, so that *U* yields an unbiased estimating equation from which consistent estimators of ψ^{†} may be found. Robins proves that estimates found by solving (1) are consistent provided *either* the expected counterfactual model, *E*[*H*_{j}(ψ) | *L̄*_{j}, *Ā*_{j−1}; ς_{j}], *or* the treatment model, *p*_{j}(*a*_{j} | *l̄*_{j}, *ā*_{j−1}; α_{j}), is correctly specified; for this reason, equation (1) is said to be doubly-robust.

A less efficient, singly-robust version of equation (1) simply omits the expected counterfactual model:

$${U}_{j}^{*}(\psi )={H}_{j}(\psi )\{{S}_{j}({A}_{j})-E[{S}_{j}({A}_{j})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{j},{\mathit{\u0100}}_{j-1}]\}.$$

(2)

Estimates found via equation (2) are consistent provided the model for treatment allocation, *p*_{j}(*a*_{j} | *l̄*_{j}, *ā*_{j−1}; α_{j}), is correctly specified.

Exact solutions to equations (1) and (2) can be found when optimal blips are linear in ψ and parameters are not shared between intervals. For details of the recursive procedure, see Robins (2004) or Moodie *et al.* (2007). An algorithm for solving the doubly-robust estimating equation (1) in the two-interval case is as follows, using ${\mathrm{\mathbb{P}}}_{n}$ to denote the empirical average operator:

- Estimate the nuisance parameters of the treatment model at time 2; i.e., estimate α_{2} in *p*_{2}(*a*_{2} | *l̄*_{2}, *A*_{1}; α_{2}).
- Assume a linear model for the expected counterfactual, *E*[*H*_{2}(ψ_{2}) | *L̄*_{2}, *A*_{1}; ς_{2}]. Express the least squares estimate ${\hat{\varsigma}}_{2}({\psi}_{2})$ of the nuisance parameter ς_{2} explicitly as a function of the data and the unknown parameter ψ_{2}.
- To find ${\hat{\psi}}_{2}$, solve ${\mathrm{\mathbb{P}}}_{n}{U}_{2}({\psi}_{2})=0$; i.e. solve
  $${\mathrm{\mathbb{P}}}_{n}\{{H}_{2}({\psi}_{2})-E[{H}_{2}({\psi}_{2})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{2},{A}_{1};{\hat{\varsigma}}_{2}({\psi}_{2})]\}\{{S}_{2}({A}_{2})-E[{S}_{2}({A}_{2})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{\overline{L}}_{2},{A}_{1};{\hat{\alpha}}_{2}]\}=0.$$
- Estimate the nuisance parameters of the treatment model at time 1; i.e. estimate α_{1} in *p*_{1}(*a*_{1} | *L*_{1}; α_{1}). Plug ${\hat{\psi}}_{2}$ into *H*_{1}(ψ_{1}, ψ_{2}) so that only ψ_{1} is unknown.
- Assuming a linear model for *E*[*H*_{1}(ψ_{1}, ${\hat{\psi}}_{2}$) | *L*_{1}; ς_{1}], the least squares estimate ${\hat{\varsigma}}_{1}({\psi}_{1},{\hat{\psi}}_{2})$ of ς_{1} can again be expressed directly in terms of ψ_{1}, ${\hat{\psi}}_{2}$ and the data.
- Solve ${\mathrm{\mathbb{P}}}_{n}{U}_{1}({\psi}_{1},{\hat{\psi}}_{2})=0$ to find ${\hat{\psi}}_{1}$; i.e. solve
  $${\mathrm{\mathbb{P}}}_{n}\{{H}_{1}({\psi}_{1},{\hat{\psi}}_{2})-E[{H}_{1}({\psi}_{1},{\hat{\psi}}_{2})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1};{\hat{\varsigma}}_{1}({\psi}_{1},{\hat{\psi}}_{2})]\}\{{S}_{1}({A}_{1})-E[{S}_{1}({A}_{1})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1};{\hat{\alpha}}_{1}]\}=0.$$

The distribution function of the observed longitudinal data is *exceptional* with respect to a blip function specification if, at some interval *j*, there is a positive probability that the true optimal decision is not unique (Robins 2004, p. 219). Suppose, for example, that the blip function is linear, depending only on the current covariate and the previous treatment: γ_{j}(*l̄*_{j}, *ā*_{j}; ψ) = *a*_{j}(ψ_{j0} + ψ_{j1}*l*_{j} + ψ_{j2}*a*_{j−1}). The law is then exceptional if ψ_{j0} + ψ_{j1}*L*_{j} + ψ_{j2}*A*_{j−1} = 0 with positive probability: on that event, treating and not treating yield the same expected outcome, and the optimal decision is not unique.

Exceptional laws may easily arise in practice: in the case of no treatment effect, for a blip function that includes at least one component of treatment and covariate history, *every distribution is an exceptional law*. More generally, it may be the case that a treatment has no effect in a sub-group of the population under study. For example, in a study of a reduced-salt diet on blood pressure, the treatment (diet) will have little or no impact on those individuals in the study who are not salt-sensitive. The examples presented in the following sections focus on blip functions with discrete covariates; however, asymptotic bias may also be observed when variables included in the modelled blip function are continuous but do not affect response (i.e., the true coefficient is zero).

Robins (2004, p. 308) uses a simple scenario to explain the source of the asymptotic bias at exceptional laws. Define the function (*x*)_{+} ≡ *I*(*x* > 0)*x* = max(0, *x*). Suppose an i.i.d. sample *X*_{1},…,*X*_{n} is drawn from a normal distribution with mean η and variance σ², and consider estimating (η)_{+} by $({\overline{X}}_{n})_{+}$. When η = 0, $\sqrt{n}(({\overline{X}}_{n})_{+}-{(\eta )}_{+})$ converges in distribution to (σ*Z*)_{+} for *Z* standard normal, which has mean σ/$\sqrt{2\pi}$ > 0: the estimator is $\sqrt{n}$-consistent but asymptotically biased, because the non-smooth map (·)_{+} is applied to an asymptotically normal estimator exactly at the point where it is not differentiable.
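A short simulation sketch illustrates Robins' scenario at η = 0: the scaled expectation $\sqrt{n}\,E[({\overline{X}}_{n})_{+}]$ settles near σ/$\sqrt{2\pi}$ ≈ 0.399 rather than 0 (the sample size, number of replicates, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 400, 1.0, 20_000
# Sample mean of each of `reps` replicate samples of size n, with eta = 0.
xbar = rng.normal(0.0, sigma, (reps, n)).mean(axis=1)
# Monte Carlo estimate of sqrt(n) * E[(Xbar_n)_+].
scaled_bias = np.sqrt(n) * np.maximum(xbar, 0.0).mean()
print(scaled_bias)   # near sigma / sqrt(2*pi), i.e. about 0.399, not 0
```

The same experiment with η bounded away from 0 gives a scaled bias tending to 0, matching the claim that the bias is confined to the kink of (·)_{+}.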

Now consider the usual two-interval set-up, with linear optimal blips γ_{1}(*l*_{1}, *a*_{1}) = *a*_{1}(ψ_{10} + ψ_{11}*l*_{1}) and γ_{2}(*l̄*_{2}, *a*_{1}, *a*_{2}) = *a*_{2}(ψ_{20} + ψ_{21}*l*_{2} + ψ_{22}*a*_{1} + ψ_{23}*l*_{2}*a*_{1}). Let *g*_{2}(ψ_{2}) = ψ_{20} + ψ_{21}*l*_{2} + ψ_{22}*a*_{1} + ψ_{23}*l*_{2}*a*_{1}. The g-estimator of ${\psi}_{2}^{\u2020}$ is consistent and asymptotically normal, so ${g}_{2}({\hat{\psi}}_{2})\stackrel{P}{\to}{g}_{2}({\psi}_{2}^{\u2020})$. The sign of ${g}_{2}({\psi}_{2}^{\u2020})$ determines the true optimal treatment at the second interval: ${d}_{2}^{\mathit{\text{opt}}\u2020}=I[{g}_{2}({\psi}_{2}^{\u2020})>0]$; the estimated rule is $\stackrel{\uff3e}{{d}_{2}^{\mathit{\text{opt}}}}=I[{g}_{2}({\hat{\psi}}_{2})>0]$. The g-estimating equation solved for ψ_{1} at the first interval contains:

$$\begin{array}{cc}{H}_{1}({\psi}_{1},{\hat{\psi}}_{2})\hfill & =Y-{\gamma}_{1}({L}_{1},{A}_{1};{\psi}_{1})+[{\gamma}_{2}({\overline{L}}_{2},({A}_{1},\stackrel{\uff3e}{{d}_{2}^{\mathit{\text{opt}}}});{\hat{\psi}}_{2})-{\gamma}_{2}({\overline{L}}_{2},({A}_{1},{A}_{2});{\hat{\psi}}_{2})]\hfill \\ \hfill & =Y-{\gamma}_{1}({L}_{1},{A}_{1};{\psi}_{1})+[(\stackrel{\uff3e}{{d}_{2}^{\mathit{\text{opt}}}}-{A}_{2})({\hat{\psi}}_{20}+{\hat{\psi}}_{21}{L}_{2}+{\hat{\psi}}_{22}{A}_{1}+{\hat{\psi}}_{23}{L}_{2}{A}_{1})]\hfill \\ \hfill & =Y-{\gamma}_{1}({L}_{1},{A}_{1};{\psi}_{1})+I[{g}_{2}({\hat{\psi}}_{2})>0]{g}_{2}({\hat{\psi}}_{2})-{A}_{2}{g}_{2}({\hat{\psi}}_{2})\hfill \\ \hfill & =Y-{\gamma}_{1}({L}_{1},{A}_{1};{\psi}_{1})+{({g}_{2}({\hat{\psi}}_{2}))}_{+}-{A}_{2}{g}_{2}({\hat{\psi}}_{2}).\hfill \end{array}$$

So

$$E[{H}_{1}({\psi}_{1}^{\u2020},{\hat{\psi}}_{2})]\ge E[Y-{\gamma}_{1}({L}_{1},{A}_{1};{\psi}_{1}^{\u2020})+{({g}_{2}({\psi}_{2}^{\u2020}))}_{+}-{A}_{2}{g}_{2}({\psi}_{2}^{\u2020})]=E[{H}_{1}({\psi}_{1}^{\u2020},{\psi}_{2}^{\u2020})].$$

The quantity $[{\gamma}_{2}({\overline{l}}_{2},({a}_{1},{d}_{2}^{\mathit{\text{opt}}\u2020});{\psi}_{2})-{\gamma}_{2}({\overline{l}}_{2},({a}_{1},{a}_{2});{\psi}_{2})]$ in *H*_{1}(ψ), or with more intervals, $\sum _{k>j}[{\gamma}_{k}({\overline{l}}_{k},({\mathit{\u0101}}_{k-1},{d}_{k}^{\mathit{\text{opt}}\u2020});{\psi}_{k})-{\gamma}_{k}({\overline{l}}_{k},{\mathit{\u0101}}_{k};{\psi}_{k})]$ in *H*_{j}(ψ), is not everywhere differentiable in the later-interval parameters: it contains the term $({g}_{k}({\psi}_{k}))_{+}$, which has a kink at ${g}_{k}=0$. At an exceptional law this kink is hit with positive probability, and plugging in ${\hat{\psi}}_{k}$ then produces asymptotic bias, just as in the $({\overline{X}}_{n})_{+}$ scenario above.

In this paper, we consider only binary treatments. However, the generalization to a larger number of discrete treatments is trivial and the issue of asymptotic bias remains. When the blip models are linear, asymptotic bias may be observed when the assumed blip models contain only discrete variables or when the true effect of the continuously-distributed covariates included in the assumed model is zero.

In the case of continuous treatments taking values in an interval, one would typically assume a non-linear blip model, since otherwise the optimal treatment will always be either the maximum or minimum value; see Robins (2004), p. 218. Here again, the optimal treatment at time 2, and hence the estimating equation at time 1, *U*_{1}(ψ_{1}, ψ_{2}), is not everywhere differentiable with respect to ψ_{2}; consequently, asymptotic bias will be present at exceptional laws.

Consider a concrete example: the treatment of mild psoriasis in women aged 20–40, with randomly assigned treatments *A*_{1} (topical vitamin D cream) and *A*_{2} (oral retinoids), and covariates/response *L*_{1}, a visual analog scale of itching; *L*_{2}, a discrete symptom severity score where 1 indicates severe itching; and *Y*, a continuous scale measuring relief and acceptability of side effects. The data were generated as follows: treatment in each interval takes on the value 0 or 1 with equal probability; *L*_{1} ~ N(0, 3); *L*_{2} ~ round(N(0.75, 0.5)); and *Y* ~ N(1, 1) − |γ_{1}(*l*_{1}, *a*_{1})| − |γ_{2}(*l̄*_{2}, *ā*_{2})| (see, for example, Murphy, 2003, p. 339). Suppose that the blip function is linear, taking ${\psi}_{1}^{\u2020}=(1.6,0.0)$ and ${\psi}_{2}^{\u2020}=(-3.0,2.0,1.0,0.0)$. Whenever ${L}_{2}={A}_{1}=1$, ${\psi}_{20}^{\u2020}+{\psi}_{21}^{\u2020}{L}_{2}+{\psi}_{22}^{\u2020}{A}_{1}+{\psi}_{23}^{\u2020}{L}_{2}{A}_{1}=-3+2\xb71+1\xb71+0\xb71=0$ even though not all components of ${\psi}_{2}^{\u2020}$ equal zero, so the law is exceptional. Table 1 summarizes the results of the simulations, including the bias of the first-interval g-estimates (absolute and relative to standard error) and the coverage of Wald and score confidence intervals for *n* = 250 and 1000. Note that the optimal rule in the first interval is $I[{\psi}_{10}^{\u2020}+{\psi}_{11}^{\u2020}{L}_{1}>0]=I[{\psi}_{10}^{\u2020}>0]$, so that the sign of the blip parameter ${\psi}_{10}^{\u2020}$ determines the optimal decision. The bias in the first-interval estimate of ${\psi}_{10}^{\u2020}$ is not negligible. Coverage is incorrect in smaller samples, in part because of slow convergence of the estimated covariance matrix, but perhaps also because of the exceptional law. Our results are consistent with the general observation that score intervals provide coverage closer to the nominal level than Wald intervals.
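The arithmetic behind this exceptional law can be checked by enumerating $g_2(\psi_2^{\u2020})$ over representative values of (*L*_{2}, *A*_{1}); a small sketch:

```python
# Blip parameters (psi20, psi21, psi22, psi23) from the example above.
psi2 = (-3.0, 2.0, 1.0, 0.0)

def g2(l2, a1):
    # Second-interval decision function g2 = psi20 + psi21*L2 + psi22*A1 + psi23*L2*A1.
    return psi2[0] + psi2[1] * l2 + psi2[2] * a1 + psi2[3] * l2 * a1

# Representative values of the rounded severity score and binary treatment.
table = {(l2, a1): g2(l2, a1) for l2 in (0, 1, 2) for a1 in (0, 1)}
for (l2, a1), val in table.items():
    print(f"L2={l2}, A1={a1}: g2 = {val:+.1f}")
# g2 = 0 exactly at (L2, A1) = (1, 1): there the optimal second-interval
# decision is not unique, so this positive-probability event makes the
# law exceptional.
```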

Summary statistics of g-estimates and Robins’ (2004) proposed method of detecting exceptional laws from 1000 datasets of sample sizes 250 and 1000.

In this example, the bias, though evident, is not large relative to the true value of the parameter. In such cases, the bias may have little impact on the estimation of the optimal decision rules, since the estimated optimal rules are of the form $I[{\hat{\psi}}_{10}+{\hat{\psi}}_{11}{L}_{1}>0]$. As the next example shows, this will not always be the case.

Suppose now that, in fact, for every patient the optimal outcome does not depend on whether the topical vitamin D cream (*A*_{1}) is applied, so that ${\psi}_{10}^{\u2020}={\psi}_{11}^{\u2020}=0$ and the optimal treatment at *t*_{1} is not unique. The second-interval parameters remain as above. In this case, the bias in the estimates of the first-interval blip function parameters results in a treatment rule which will lead us to apply the topical vitamin D cream, even though this is unnecessary. Table 2 demonstrates that in approximately 10% of the simulated datasets, one or both of ${\hat{\psi}}_{10}$ and ${\hat{\psi}}_{11}$ were found to be significantly different from 0 in independent, 0.05-level tests. When the tests falsely rejected one or both of the null hypotheses ${\psi}_{10}^{\u2020}=0$ and ${\psi}_{11}^{\u2020}=0$, the estimated optimal decision rule in the first interval would lead to unnecessary treatment for half of the individuals in a similar population! The non-smooth (and implicit) nature of the optimal decision rule parameterization (Murphy, 2003; Robins *et al.*, 2008, §4.4.1) is such that even small bias in the parameter estimates can lead to serious consequences in the resulting decision rule. Henceforth, we focus on the bias in the blip parameter estimates, keeping in mind that poor estimates of these can lead to treatment being applied unnecessarily, due to the indirect parameterization of the optimal decision rules.

Robins (2004, pp. 313–315) proves that under the scenario of an exceptional law such as the above, estimates ${\hat{\psi}}_{1}$ are $\sqrt{n}$-consistent but are neither asymptotically normal nor asymptotically unbiased. For a graphical interpretation of asymptotic bias and consistency coinciding, we examine bias maps; these plots depict the bias of a parameter as sample size and another parameter are varied. Bias maps throughout this article focus on the absolute bias in ${\hat{\psi}}_{10}$ as a function of sample size and one of the second-interval parameters, ${\psi}_{20}^{\u2020},{\psi}_{21}^{\u2020},{\psi}_{22}^{\u2020}$, or ${\psi}_{23}^{\u2020}$. The plots represent the average bias over 1000 data sets, computed over a range of 2 units (on a 0.1 unit grid) for each parameter at sample sizes of 250, 300, …, 1000. Note that there are several combinations of parameters ${\psi}_{2}^{\u2020}$ that lead to exceptional laws: as well as (−3.0, 2.0, 1.0, 0.0) if *p*(*A*_{1} = *L*_{2} = 1) > 0, exceptional laws occur with (−2.0, 2.0, 1.0, 0.0) and (−3.0, 3.0, 1.0, 0.0) if *p*(*A*_{1} = 0, *L*_{2} = 1) > 0.

The results of Robins (2004) imply that the estimators of ${\psi}_{10}^{\u2020}$ and ${\psi}_{11}^{\u2020}$ are asymptotically unbiased and consistent, provided the parameters do not change with *n* and the law is not exceptional. Consistency may be visualized by looking at a horizontal cross-section of any one of the bias maps in Figure 1: eventually, the sample size will be great enough that the bias of the first-interval estimates is smaller than any fixed, positive number. However, if we consider a sequence of data-generating processes $\{{\psi}_{(n)}^{\u2020}\}$ in which the ψ_{2}'s decrease with increasing *n*, so that ${g}_{2}({\psi}_{2(n)}^{\u2020})$ is *O*(*n*^{−1/2}), then AsyBias(${\hat{\psi}}_{1}$) > 0. Contours of constant bias can be found along the lines on the bias map traced by plotting ${g}_{2}({\psi}_{2}^{\u2020})=k{n}^{-1/2}$ against *n*. Note that the asymptotic bias is bounded. In finite samples, both the proximity to the exceptional law and the sample size determine the bias of the estimator (Figure 1).

Absolute bias of ${\hat{\psi}}_{10}$ as sample size varies at and near exceptional laws. In each figure, one of ψ_{20}, ψ_{21}, ψ_{22}, ψ_{23} is varied while the remaining are fixed at −3.0, 2.0, 1.0, or 0.0, respectively.

In Figure 2, *L*_{2} is not rounded (representing, say, a visual analog scale of itching rather than a severity score), so that $P({g}_{2}({\psi}_{2}^{\u2020})=0)=0$ and the law is not exceptional. Virtually no small- or large-sample bias is observed for *n* ≥ 250. For a law not to be exceptional, it is sufficient for the blip function to contain a continuous covariate (with no point-masses in its distribution function) that interacts with treatment to affect the response, i.e., that has a non-zero coefficient in the blip function. However, in finite samples, to avoid the bias that arises through truncation, it is also necessary that the derivative of the blip function with respect to a continuous covariate *L*_{j}, $\frac{\partial}{\partial {L}_{j}}{\gamma}_{j}({\overline{L}}_{j},{\mathit{\u0100}}_{j})$, be large (that is, greater than *kn*^{−1/2} for some *k*) whenever ${\gamma}_{j}({\overline{L}}_{j},{\mathit{\u0100}}_{j})$ is in a neighbourhood of zero, so that a unique optimal regime can be distinguished from those that are not unique.

As before, we express the blip as ${\gamma}_{j}({\overline{L}}_{j},{\mathit{\u0100}}_{j};{\psi}_{j})={A}_{j}{g}_{j}({\overline{L}}_{j},{\mathit{\u0100}}_{j-1};{\psi}_{j})$; for linear blips, ${g}_{j}={\psi}_{j}\xb7{X}_{j}$, where *X*_{j} denotes the vector of covariate and treatment terms appearing in the *j*th blip.

Robins (2004, p. 315) calculates the asymptotic bias of ${\hat{\psi}}_{1}$ found using equation (2) on two intervals for linear blips in a special case of exceptional law; the calculation below proceeds along the same lines.

Since ${\hat{\psi}}_{1}$ solves $\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[{U}_{1}^{*}({\psi}_{1},{\hat{\psi}}_{2})]=0$, it follows that

$$\begin{array}{c}\sqrt{n}({\hat{\psi}}_{1}-{\psi}_{1}^{\u2020})\hfill \\ \text{}=-{\left(E\phantom{\rule{thinmathspace}{0ex}}\left[\frac{\partial}{\partial {\psi}_{1}}{U}_{1}^{*}({\psi}_{1}^{\u2020},{\hat{\psi}}_{2})\right]\right)}^{-1}\sqrt{n}{\mathrm{\mathbb{P}}}_{n}({U}_{1}^{*}({\psi}_{1}^{\u2020},{\hat{\psi}}_{2}))+{o}_{p}(1)\hfill \\ \text{}=-{\left(E\phantom{\rule{thinmathspace}{0ex}}\left[\frac{\partial}{\partial {\psi}_{1}}{U}_{1}^{*}({\psi}_{1}^{\u2020},{\hat{\psi}}_{2})\right]\right)}^{-1}\sqrt{n}{\mathrm{\mathbb{P}}}_{n}({U}_{1}^{*}({\psi}_{1}^{\u2020},{\psi}_{2}^{\u2020})+{\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020}))+{o}_{p}(1)\hfill \\ \text{}=-{(E\phantom{\rule{thinmathspace}{0ex}}[{A}_{1}\{{S}_{1}({A}_{1})-E[{S}_{1}({A}_{1})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1}]\}{X}_{1}^{\prime}])}^{-1}\sqrt{n}{\mathrm{\mathbb{P}}}_{n}({U}_{1}^{*}({\psi}_{1}^{\u2020},{\psi}_{2}^{\u2020})+{\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020}))+{o}_{p}(1),\hfill \end{array}$$

where ${\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020})\equiv {U}_{1}^{*}({\psi}_{1},{\hat{\psi}}_{2})-{U}_{1}^{*}({\psi}_{1},{\psi}_{2}^{\u2020})$; here the superscript *g* indicates that the quantity is used to determine bias of g-estimators. As indicated by the notation, ${\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020})$ does not depend on ψ_{1} when blips are linear.

Since $E[{U}_{1}^{*}({\psi}_{1}^{\u2020},{\psi}_{2}^{\u2020})]=0$, this term does not contribute to the asymptotic bias, so it is sufficient to focus on $\sqrt{n}{\mathrm{\mathbb{P}}}_{n}({\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020}))$. First observe the following:

$$\begin{array}{c}\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[{\mathrm{\Delta}}_{1}^{g}({\hat{\psi}}_{2},{\psi}_{2}^{\u2020})]\hfill \\ \text{}=\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[{U}_{1}^{*}({\psi}_{1},{\hat{\psi}}_{2})-{U}_{1}^{*}({\psi}_{1},{\psi}_{2}^{\u2020})]\hfill \\ \text{}=\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[\{{S}_{1}({A}_{1})-E[{S}_{1}({A}_{1})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1}]\}(({({\hat{\psi}}_{2}\xb7{X}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{X}_{2})}_{+})-{A}_{2}(({\hat{\psi}}_{2}-{\psi}_{2}^{\u2020})\xb7{X}_{2}))]\hfill \\ \text{}=\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[\{{S}_{1}({A}_{1})-E[{S}_{1}({A}_{1})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1}]\}({({\hat{\psi}}_{2}\xb7{X}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{X}_{2})}_{+})]+{o}_{p}(1).\hfill \end{array}$$

The last step follows because

$$\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[\{{S}_{1}({A}_{1})-E[{S}_{1}({A}_{1})\phantom{\rule{thinmathspace}{0ex}}|\phantom{\rule{thinmathspace}{0ex}}{L}_{1}]\}{A}_{2}(({\hat{\psi}}_{2}-{\psi}_{2}^{\u2020})\xb7{X}_{2})]\stackrel{P}{\to}0$$

by Slutsky’s Theorem, as ${\hat{\psi}}_{2}$ is consistent and asymptotically normal.

We now partition the space of the covariates *X*_{2} according to whether ${\psi}_{2}^{\u2020}\xb7{X}_{2}$ is positive, negative or zero, considering each case separately. We also introduce the definition *h*(*X*_{1}) ≡ {*S*_{1}(*A*_{1}) − *E*[*S*_{1}(*A*_{1}) | *L*_{1}]} in order to simplify the notation:

$$\begin{array}{c}\sqrt{n}{\mathrm{\mathbb{P}}}_{n}[h({X}_{1})({({\hat{\psi}}_{2}\xb7{X}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{X}_{2})}_{+})I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0)]\hfill \\ \text{}=\sqrt{n}\phantom{\rule{thinmathspace}{0ex}}E\phantom{\rule{thinmathspace}{0ex}}[h({X}_{1})({({\psi}_{2}\xb7{X}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{X}_{2})}_{+})I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0)]{|}_{{\psi}_{2}={\hat{\psi}}_{2}}+{o}_{p}(1)\hfill \\ \text{}=E\phantom{\rule{thinmathspace}{0ex}}[h({X}_{1})I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0){X}_{2}]\xb7(\sqrt{n}({\hat{\psi}}_{2}-{\psi}_{2}^{\u2020}))+{o}_{p}(1)\hfill \end{array}$$

and thus this case contributes no asymptotic bias. The first equality here follows by Lemma 19.24 in van der Vaart (1998): the set of functions

$$({x}_{1},{x}_{2})\mapsto h({x}_{1})({({\psi}_{2}\xb7{x}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{x}_{2})}_{+})I({\psi}_{2}^{\u2020}\xb7{x}_{2}>0)$$

with ψ_{2} ranging over a compact set satisfy a Lipschitz condition and are thus *P*-Donsker (this assumes finite second moments, see Examples 19.7 and 19.25); the equality then follows from the Lemma cited because

$${E\phantom{\rule{thinmathspace}{0ex}}\left[{(h({X}_{1})({({\psi}_{2}\xb7{X}_{2})}_{+}-{({\psi}_{2}^{\u2020}\xb7{X}_{2})}_{+})I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0))}^{2}\right]\phantom{\rule{thinmathspace}{0ex}}|}_{{\psi}_{2}={\hat{\psi}}_{2}}\stackrel{P}{\to}0$$

via the continuous mapping theorem (again assuming finite second moments). The second equality follows since ${\hat{\psi}}_{2}\stackrel{P}{\to}{\psi}_{2}^{\u2020}$, hence

$$E\phantom{\rule{thinmathspace}{0ex}}[h({X}_{1})(I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0){({\psi}_{2}\xb7{X}_{2})}_{+})]{|}_{{\psi}_{2}={\hat{\psi}}_{2}}\stackrel{P}{\to}E\phantom{\rule{thinmathspace}{0ex}}[h({X}_{1})(I({\psi}_{2}^{\u2020}\xb7{X}_{2}>0)({\psi}_{2}\xb7{X}_{2}))]{|}_{{\psi}_{2}={\hat{\psi}}_{2}},$$

again by the continuous mapping theorem. Turning to the second term:

$$\begin{aligned}
&\sqrt{n}\,\mathbb{P}_n\bigl[h(X_1)\bigl((\hat{\psi}_2 \cdot X_2)_+ - (\psi_2^\dagger \cdot X_2)_+\bigr)I(\psi_2^\dagger \cdot X_2 < 0)\bigr]\\
&\quad= \sqrt{n}\,E\bigl[h(X_1)\bigl((\psi_2 \cdot X_2)_+ - (\psi_2^\dagger \cdot X_2)_+\bigr)I(\psi_2^\dagger \cdot X_2 < 0)\bigr]\big|_{\psi_2 = \hat{\psi}_2} + o_p(1)\\
&\quad= \sqrt{n}\,E\bigl[h(X_1)I(\psi_2^\dagger \cdot X_2 < 0)(\psi_2 \cdot X_2)_+\bigr]\big|_{\psi_2 = \hat{\psi}_2} + o_p(1) \stackrel{P}{\to} 0,
\end{aligned}$$

thus this term also does not contribute to the asymptotic bias. As in the previous case, the first equality here follows from a similar application of Lemma 19.24 in van der Vaart (1998); the second again follows from the continuous mapping theorem. The third term is as follows:

$$\begin{aligned}
&\sqrt{n}\,\mathbb{P}_n\bigl[h(X_1)\bigl((\hat{\psi}_2 \cdot X_2)_+ - (\psi_2^\dagger \cdot X_2)_+\bigr)I(\psi_2^\dagger \cdot X_2 = 0)\bigr]\\
&\quad= \sqrt{n}\,E\bigl[h(X_1)\bigl((\psi_2 \cdot X_2)_+ - (\psi_2^\dagger \cdot X_2)_+\bigr)I(\psi_2^\dagger \cdot X_2 = 0)\bigr]\big|_{\psi_2 = \hat{\psi}_2} + o_p(1)\\
&\quad= E\bigl[h(X_1)\bigl(\sqrt{n}(\psi_2 - \psi_2^\dagger) \cdot X_2\bigr)_+ I(\psi_2^\dagger \cdot X_2 = 0)\bigr]\big|_{\psi_2 = \hat{\psi}_2} + o_p(1);
\end{aligned}$$

again by applying Lemma 19.24. Thus

$$\text{AsyBias}\bigl(\mathbb{P}_n[U_1^*(\psi_1^\dagger, \hat{\psi}_2)]\bigr) = E\bigl[h(X_1)(2\pi)^{-1/2}(X_2' \Sigma_2 X_2)^{1/2} I(\psi_2^\dagger \cdot X_2 = 0)\bigr]$$

where Σ_{2} is such that $\sqrt{n}(\hat{\psi}_2 - \psi_2^\dagger) \stackrel{D}{\to} \mathcal{N}(\mathbf{0}, \Sigma_2)$, and we have used the formula for the expectation of a truncated normal distribution. Thus

$$\text{AsyBias}(\hat{\psi}_1) = -\left(E\left[\frac{\partial}{\partial \psi_1} U_1^*(\psi_1^\dagger, \psi_2^\dagger)\right]\right)^{-1} E\bigl[h(X_1)(2\pi)^{-1/2}(X_2' \Sigma_2 X_2)^{1/2} I(\psi_2^\dagger \cdot X_2 = 0)\bigr], \tag{3}$$

which is bounded.
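The truncated-normal mean used to obtain this bias expression can be checked numerically. The sketch below (plain NumPy; the value of σ is illustrative) verifies by Monte Carlo that, for Z ~ N(0, σ²), E[max(Z, 0)] = σ(2π)^{−1/2}, which is the source of the (2π)^{−1/2}(X₂′Σ₂X₂)^{1/2} factor above:

```python
import numpy as np

# For Z ~ N(0, sigma^2), E[max(Z, 0)] = sigma / sqrt(2*pi): the
# truncated-normal expectation behind the (2*pi)^{-1/2} factor in (3).
rng = np.random.default_rng(0)
sigma = 1.7                               # illustrative value
z = rng.normal(0.0, sigma, size=2_000_000)
mc = np.mean(np.maximum(z, 0.0))          # Monte Carlo estimate
exact = sigma / np.sqrt(2.0 * np.pi)      # closed form
print(mc, exact)
```

With two million draws, the Monte Carlo standard error is well below 0.01, so the two printed values agree closely.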

For more than two intervals, the asymptotic bias calculation is more complex. Consider a three-interval problem in which non-unique optimal rules exist at the last interval. The asymptotic bias calculation at the second (middle) interval proceeds as above, so that the asymptotic bias in $\hat{\psi}_2$ is

$$-\left(E\left[\frac{\partial}{\partial \psi_2} U_2^*(\psi_1^\dagger, \psi_2^\dagger, \psi_3^\dagger)\right]\right)^{-1} E\bigl[I[g_3(\bar{L}_3, \bar{A}_2; \psi_3^\dagger) = 0]\{S_2(A_2) - E[S_2(A_2) \,|\, \bar{L}_2, A_1]\}(2\pi)^{-1/2}(X_3' \Sigma_3 X_3)^{1/2}\bigr].$$

Note that in the recursive estimation setting, $U_2^*(\psi_1, \psi_2, \psi_3) = U_2^*(\psi_2, \psi_3)$. The asymptotic bias calculation at the first interval can be found by examining $\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger, \hat{\psi}_3, \psi_3^\dagger) = U_1^*(\psi_1, \hat{\psi}_2, \hat{\psi}_3) - U_1^*(\psi_1, \psi_2^\dagger, \psi_3^\dagger)$. This function is more complex than the one derived above because both the regular estimator $\hat{\psi}_3$ and the asymptotically biased, non-regular estimator $\hat{\psi}_2$ are plugged in.

Doubly-robust g-estimation is more efficient than singly-robust estimation and has the advantage that blip functions which do not allow for covariate and treatment interactions do not lead to exceptional laws. The doubly-robust g-estimating equation (1) can be expressed in terms of the singly-robust equation (2):

$$U_j(\psi) = U_j^*(\psi) - E[H_j(\psi) \,|\, \bar{L}_j, \bar{A}_{j-1}]\{S_j(A_j) - E[S_j(A_j) \,|\, \bar{L}_j, \bar{A}_{j-1}]\}.$$

If either the model $E[H_j(\psi) \,|\, \bar{L}_j, \bar{A}_{j-1}]$ or the propensity score model $E[S_j(A_j) \,|\, \bar{L}_j, \bar{A}_{j-1}]$ is correctly specified, the estimating equation has mean zero at the true parameter value and the resulting estimator remains consistent; hence the term doubly-robust.

It is possible that if the propensity score model is incorrectly specified but the model for $E[H_2(\psi_2) \,|\, \bar{L}_2, A_1]$ is correct, then additional bias terms might arise. However, this scenario is unlikely in practice since, with linear blips, correct specification of the model for $E[H_2(\psi_2) \,|\, \bar{L}_2, A_1]$ is complex (see Moodie *et al.* 2007, Web Appendix).

It is not necessary for all parameters to equal zero for a law to be exceptional; thus a simple F-test is not a sufficient check. Robins (2004, p. 317) suggests the following steps for the case in which parameters are not shared across intervals:

- Compute g-estimates of ψ_{j} for *j* = 1,…,*K* and Wald confidence intervals about each parameter.
- At each interval *j*, compute $|d_j^{\text{opt}} \in \text{CI}|_i$ as the number of optimal rules possible at time *j* for each subject *i* under all values ψ_{j} in the confidence set. For example, if all values of ψ_{j} in the confidence set give the same optimal rule for subject *i*, then $|d_j^{\text{opt}} \in \text{CI}|_i = 1$.
- If the fraction of $\{|d_j^{\text{opt}} \in \text{CI}|_1, |d_j^{\text{opt}} \in \text{CI}|_2, \dots, |d_j^{\text{opt}} \in \text{CI}|_n\}$ that is greater than 1, denoted by $p_{op,j}$, is small (e.g., $p_{op,j} < 0.05$), then the law at interval *j* is likely not exceptional and inference based on Wald confidence sets is reliable for earlier intervals *m* < *j*.
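To make the counting step concrete, consider a linear blip with binary treatment, so that the optimal rule for subject *i* is $I[\psi_j \cdot x_i > 0]$ and $|d_j^{\text{opt}} \in \text{CI}|_i$ is either 1 or 2. The sketch below assumes the confidence set is the box formed by the componentwise Wald intervals (an assumption of this illustration; the confidence set need not be a box), in which case the rule is unique exactly when the linear form $\psi_j \cdot x_i$ keeps one sign over the box. Function names are illustrative:

```python
import numpy as np

def n_optimal_rules(x, lo, hi):
    """Number of distinct optimal rules I[psi . x > 0] for one subject as
    psi ranges over the box [lo, hi] (componentwise Wald CI limits).
    A linear form attains its extremes over a box coordinatewise, so its
    min/max are sums of per-coordinate min/max of lo_k*x_k and hi_k*x_k."""
    lo_val = np.minimum(lo * x, hi * x).sum()
    hi_val = np.maximum(lo * x, hi * x).sum()
    # Sign constant over the whole box => a single optimal rule.
    return 1 if (lo_val > 0 or hi_val < 0) else 2

def prop_ambiguous(X, lo, hi):
    """Fraction p_{op,j} of subjects with |d_j^opt in CI|_i > 1."""
    return np.mean([n_optimal_rules(x, lo, hi) > 1 for x in X])

# Toy check: psi = (psi_0, psi_1), covariate vector x_i = (1, L_i).
lo = np.array([0.5, -0.1])    # illustrative lower CI limits
hi = np.array([1.5,  0.1])    # illustrative upper CI limits
X = np.array([[1.0,   2.0],   # psi . x stays positive -> unique rule
              [1.0, -20.0]])  # sign flips over the box -> ambiguous
print(prop_ambiguous(X, lo, hi))  # 0.5
```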

The idea behind this approach is that if there are few instances for which $|d_j^{\text{opt}} \in \text{CI}|_i > 1$, then the confidence set is far from values of ψ_{j} that, in combination with the distribution of covariate and treatment history, would produce an exceptional law. In our simulations, Robins' method detects exceptional laws in all samples in which they are present (Table 1), so that score confidence intervals are recommended in all cases. In general, the score confidence intervals improve coverage at both time intervals, particularly in smaller samples.

This method may save the analyst from having to compute a more computationally-intensive confidence set when one is not needed, but could itself be quite time-consuming. We suggest here some additional guidelines:

- When parameters are not shared, exceptional laws at the *k*^{th} interval do not affect the regularity of estimates for ψ_{j}, *j* > *k*; hence it is sufficient to consider $|d_j^{\text{opt}} \in \text{CI}|_i$ only for the intervals *k*,…,*K*.
- If, at every interval, the Wald confidence interval for the parameter of at least one continuous variable (with no "spikes" or point-masses) excludes zero, the law is likely not exceptional.
- If, at any interval, the hypothesis of no effect of treatment is not rejected (i.e., the Wald confidence set for the vector ψ_{j} includes the vector **0**), the law is likely exceptional.

If a law is exceptional, Wald confidence intervals will not have the correct coverage; score confidence intervals will still be uniformly of the correct level (Robins 2004, p. 222). However, the asymptotic bias remains.

In this section, we propose a modification to g-estimation (when there is no parameter sharing) that detects exceptional laws and reduces bias in the presence of an exceptional law. The method relies on the supposition that – at any interval *j* – the sample of study participants is drawn from two populations, ${\mathcal{M}}_{j}^{0}$ and ${\mathcal{M}}_{j}^{1}$, where those who are members of ${\mathcal{M}}_{j}^{0}$ do not have a unique optimal treatment decision in interval *j*, while there is only a single optimal treatment for all those belonging to ${\mathcal{M}}_{j}^{1}$. Let ${m}_{j}^{0}$ denote the members of the sample drawn from ${\mathcal{M}}_{j}^{0}$, and define ${m}_{j}^{1}$ similarly. The method that we propose attempts to determine an individual’s membership in ${\mathcal{M}}_{j}^{0}$ or ${\mathcal{M}}_{j}^{1}$ and then treats the data collected on individuals from the two groups differently. In practice, ${m}_{j}^{0}$ and ${m}_{j}^{1}$ are unknown and must be estimated.

As seen in section 2.2, bias enters the g-estimating equation of $\psi_1^\dagger$ through the inclusion of the upwardly-biased estimate of $I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) > 0]\, g_2(\bar{L}_2, A_1; \psi_2^\dagger)$ in *H*_{1}(ψ) when $g_2(\bar{L}_2, A_1; \psi_2^\dagger)$ is at or close to zero. Our algorithm searches for individuals for whom it is likely that $g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0$ and then uses the "better guess" of zero for $I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) > 0]\, g_2(\bar{L}_2, A_1; \psi_2^\dagger)$ instead of the estimate obtained by plugging in $\hat{\psi}_2$ for $\psi_2^\dagger$. We call this *Zeroing Instead of Plugging In* (ZIPI). Specifically, ZIPI is performed by starting at the last interval:

- Compute $\hat{\psi}_K$ using the usual g-estimation (singly- or doubly-robust). Set *j* = *K*.
- Estimate $m_j^0, m_j^1$ by calculating the 95% (or 80%, or 90%, etc.) Wald confidence interval $W_{ji}$ about $g_j(\bar{l}_{j,i}, \bar{a}_{j-1,i}; \hat{\psi}_j)$ for each individual *i* in the sample. Let $\hat{m}_j^0 = \{i \,|\, 0 \in W_{ji}\}$ and $\hat{m}_j^1 = \{i \,|\, 0 \notin W_{ji}\}$.
- For individual *i*, set
$$H_{j-1,i}^0 = Y_i - \gamma_{j-1}(\bar{L}_{j-1,i}, \bar{A}_{j-1,i}; \psi_{j-1}) + \sum_{k:k \ge j} I[i \in \hat{m}_k^1]\bigl[\gamma_k(\bar{L}_{k,i}, (\bar{A}_{k-1,i}, d_{k,i}^{\text{opt}}); \hat{\psi}_k) - \gamma_k(\bar{L}_{k,i}, \bar{A}_{k,i}; \hat{\psi}_k)\bigr].$$
That is, if at any interval *k*, *k* ≥ *j*, the confidence interval about $g_k(\bar{l}_{k,i}, \bar{A}_{k-1,i}; \hat{\psi}_k)$ includes zero, then $\gamma_k(\bar{L}_{k,i}, (\bar{A}_{k-1,i}, d_{k,i}^{\text{opt}}); \psi_k) - \gamma_k(\bar{L}_{k,i}, \bar{A}_{k,i}; \psi_k)$ is set to 0 in $H_{j-1,i}^0$; otherwise it is estimated by plugging in $\hat{\psi}_k$.
- Estimate $\hat{\psi}_{j-1}$ by g-estimation, using $H_{j-1,i}^0$ in place of the usual $H_{j-1,i}$.
- Repeat steps 2–4 with *j* replaced by *j* − 1 until all parameters have been estimated.
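In the two-interval case, the classification and zeroing steps amount to dropping the second-interval blip correction for any subject whose Wald interval for $g_2$ covers zero. A minimal sketch, with illustrative names and inputs (in practice the per-subject estimate of $g_2$ and its standard error come from the interval-2 g-estimation fit):

```python
import numpy as np

Z95 = 1.959963984540054  # standard normal 0.975 quantile (95% Wald CI)

def zipi_h0(y, gamma1, g2_hat, se_g2, correction2, z=Z95):
    """Illustrative sketch of the ZIPI zeroing step, two-interval case.
    y:           outcomes Y_i
    gamma1:      first-interval blip gamma_1(L_1; psi_1), per subject
    g2_hat:      plug-in estimate of g_2(Lbar_2, A_1; psi_2), per subject
    se_g2:       its standard error, per subject
    correction2: gamma_2 at optimal minus at observed treatment, per subject
    The correction is zeroed for subjects whose Wald CI for g_2 covers 0
    (estimated membership in m_2^0); otherwise it is plugged in as in
    ordinary recursive g-estimation."""
    in_m0 = np.abs(g2_hat) <= z * se_g2        # 0 inside the Wald CI
    return y - gamma1 + np.where(in_m0, 0.0, correction2)

# Tiny example: subject 0 has g_2 indistinguishable from 0, subject 1 not.
y = np.array([2.0, 2.0]); gamma1 = np.array([0.5, 0.5])
g2 = np.array([0.05, 1.0]); se = np.array([0.1, 0.1])
corr = np.array([0.7, 0.7])
print(zipi_h0(y, gamma1, g2, se, corr))   # [1.5 2.2]
```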

The last interval ZIPI estimator, $\hat{\psi}_K$, is equal to the usual recursive g-estimator and thus is always consistent and asymptotically normal. Table 3 compares g-estimation with ZIPI using different confidence levels to select individuals who are likely to have non-unique optimal rules, keeping to the simulation scenario used in Table 1 and Figure 1. The bias reduction obtained by using 0 instead of the estimate $I[g_j(\bar{L}_j, \bar{A}_{j-1}; \hat{\psi}_j) > 0]\, g_j(\bar{L}_j, \bar{A}_{j-1}; \hat{\psi}_j)$ can be seen in Table 3.

Table 3. Parameter estimates under exceptional laws: summary statistics of g-estimation using Zeroing Instead of Plugging In (ZIPI) and the usual g-estimates from 1000 datasets of sample sizes 250, 500 and 1000, using 80%, 90%, and 95% CIs for ZIPI. Second interval …

In finite samples, some misclassification of individuals into $m_j^0$ and $m_j^1$ will always occur if the type I error is held fixed and both sets are non-empty. In this context, a type I error consists of falsely classifying subject *i* as having a unique rule when in fact he does not: $I(i \in \hat{m}_j^1 \,|\, i \in m_j^0)$. Correspondingly, a type II error is the incorrect classification of subject *i* as having a non-unique optimal regime when in truth his optimal rule is unique: $I(i \in \hat{m}_j^0 \,|\, i \in m_j^1)$. An unfortunate consequence is that, in regions of the parameter space moderately close to exceptional laws, ZIPI introduces bias where none exists under ordinary recursive g-estimation (Figure 3); see section 3.3 for further explanation of this phenomenon. As sample size increases, both the magnitude of the bias and the region of "moderate proximity" to the exceptional law, where bias is observed, decrease. The maximum bias observed using ZIPI is smaller than that observed using the usual g-estimation: 0.033 with 80% confidence interval (CI) selection and 0.040 with 95% CI selection, as compared to 0.060 with g-estimation (Figure 1, Figure 3–Figure 4).

Absolute bias of $\hat{\psi}_{10}$ as sample size varies at and near exceptional laws using Zeroing Instead of Plugging In (ZIPI) when 95% confidence intervals suggest exceptional laws. In each figure, one of ψ_{20}, ψ_{21}, ψ_{22}, …

Absolute bias of $\hat{\psi}_{10}$ as sample size varies at and near exceptional laws using Zeroing Instead of Plugging In (ZIPI) when 80% confidence intervals suggest exceptional laws. In each figure, one of ψ_{20}, ψ_{21}, ψ_{22}, …

ZIPI falls into the class of pre-test estimators that are frequently used in a variety of statistical settings (variable or model selection is a common example). The ZIPI procedure involves a large number of correlated significance tests, since the algorithm requires testing all individuals at all intervals but the first. Control of the type I error under multiple testing has been well studied; however, many methods of controlling the family-wise error rate significantly reduce the power of the tests. The false discovery rate (FDR) of Benjamini & Hochberg (1995) provides a good alternative that has been shown to work well for correlated tests (Benjamini & Yekutieli, 2001). However, even in better-understood scenarios, the optimal choice of threshold is a complicated problem; see, for example, Giles *et al.* (1992) or Kabaila & Leeb (2006).
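For reference, the Benjamini-Hochberg step-up rule applied to a vector of per-subject p-values would look as follows (a generic sketch of the standard procedure, not code from the paper):

```python
import numpy as np

def bh_reject(pvals, q=0.20):
    """Benjamini-Hochberg step-up procedure: reject the k hypotheses with
    the smallest p-values, where k is the largest index such that the
    k-th smallest p-value satisfies p_(k) <= k*q/m. Controls the FDR at
    level q for independent (and certain positively dependent) tests."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# Example: with q = 0.20 and m = 6, the four smallest p-values are rejected.
print(bh_reject([0.001, 0.008, 0.039, 0.041, 0.28, 0.60], q=0.20))
```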

ZIPI estimators are similar to ordinary g-estimators at non-exceptional laws, with the added benefit of reduced asymptotic bias at exceptional laws.

*Under non-exceptional laws, ZIPI estimators converge to the usual recursive g-estimators and therefore are asymptotically consistent, unbiased, and normally distributed*.

The proof is trivial. Of greater interest is the performance of ZIPI estimators under exceptional laws. We begin by assuming that $m_j^0$ and $m_j^1$ are known for all *j*.

*When the distribution is exceptional and* ${m}_{j}^{0},{m}_{j}^{1}$ *are known at every interval, ZIPI estimators are asymptotically consistent, unbiased, and normally distributed*.

The statement is proved for the singly-robust estimating equation [equation (2)] in a two-interval context. The generalization to *K* intervals or to the use of equation (1) follows readily.

Decompose the usual first-interval g-estimation function into two components:

$$\mathbb{P}_n(U_1^*(\psi_1, \hat{\psi}_2)) = \frac{1}{n}\sum_{i=1}^n U_{1i}^*(\psi_1, \hat{\psi}_2) = \frac{1}{n}\sum_{i \in m_2^0} U_{1i}^*(\psi_1, \hat{\psi}_2) + \frac{1}{n}\sum_{i \in m_2^1} U_{1i}^*(\psi_1, \hat{\psi}_2).$$

For $i \in m_2^1$, $g_2(\bar{l}_{2,i}, a_{1,i}; \psi_2^\dagger) \ne 0$ and so $E\left[\frac{1}{n}\sum_{i \in m_2^1} U_{1i}^*(\psi_1^\dagger, \hat{\psi}_2)\right] = 0$. For $i \in m_2^0$, on the other hand, $E[U_{1i}^*(\psi_1^\dagger, \hat{\psi}_2)] \ne 0$ since

$$H_{1,i}(\psi_1, \hat{\psi}_2) = Y_i - \gamma_1(L_{1,i}; \psi_1) + \bigl[(\hat{d}_{2,i}^{\text{opt}} - A_{2,i})(\hat{\psi}_{20} + \hat{\psi}_{21} L_{2,i} + \hat{\psi}_{22} A_{1,i} + \hat{\psi}_{23} L_{2,i} A_{1,i})\bigr]$$

is upwardly biased for ${H}_{1,i}({\psi}_{1},{\psi}_{2}^{\u2020})$, as described in section 2.2 and by Robins (2004).

Now let ${H}_{1,i}^{0}({\psi}_{1})={Y}_{i}-{\gamma}_{1}({L}_{1,i};{\psi}_{1})$ if subject *i* belongs to ${m}_{2}^{0}$ and ${H}_{1,i}^{0}({\psi}_{1})={H}_{1,i}({\psi}_{1},{\hat{\psi}}_{2})$ otherwise. The ZIPI estimating equation [corresponding to the g-estimating equation (2)] is

$$\begin{aligned}
\mathbb{P}_n\bigl(U_1^{0*}(\psi_1, \hat{\psi}_2)\bigr) = \frac{1}{n}\Bigl\{&\sum_{i \in m_2^0} I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0]\, H_{1,i}^0(\psi_1)\{S_1(A_1) - E[S_1(A_1) \,|\, L_1]\}\\
&+ \sum_{i \in m_2^1} \bigl(1 - I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0]\bigr) H_{1,i}^0(\psi_1)\{S_1(A_1) - E[S_1(A_1) \,|\, L_1]\}\Bigr\}.
\end{aligned}$$

$E[{U}_{1i}^{0*}({\psi}_{1},{\hat{\psi}}_{2})]=0$, and therefore ZIPI estimators calculated under known ${m}_{2}^{0},{m}_{2}^{1}$ are consistent and asymptotically unbiased for any ψ in the parameter space and at any law, exceptional or otherwise.

Of course, $m_j^0$ and $m_j^1$ are not known and must always be estimated. Typically, some misclassification will occur. Recall that a type I error is that of using the estimate $\hat{\psi}_2$ rather than 0 in *H*_{1,i}(ψ_{1}), and a type II error is that of using 0 when a plug-in estimate would have been desired. We now calculate the asymptotic bias and compare this to the result of section 2.2 to demonstrate that even when $m_j^0$ and $m_j^1$ are estimated, ZIPI exhibits less asymptotic bias than recursive g-estimation. As with ordinary g-estimation, generalization of the expression for the asymptotic bias for *K* > 2 intervals is not trivial because of the need to use both regular and asymptotically biased plug-in estimates.

*When the distribution law is exceptional and* $m_2^0, m_2^1$ *are estimated, two-interval ZIPI estimators have smaller asymptotic bias than the usual recursive g-estimators provided estimates are not shared across intervals and* $\inf_{\bar{L}_2, A_1}\{|g_2(\bar{L}_2, A_1; \psi_2^\dagger)|\} \setminus \{0\} = \mu > 0$.

Since $|{g}_{2}({\overline{L}}_{2},{A}_{1};{\psi}_{2}^{\u2020})|\ge \mu $ for some μ > 0 for all subjects with a unique optimal rule at the second interval, it follows that there exists a sample size such that the power to detect whether ${g}_{2}({\overline{L}}_{2,i},{A}_{1,i};{\psi}_{2}^{\u2020})\ne 0$ is effectively equal to 1 if classification of individuals into ${m}_{2}^{0},{m}_{2}^{1}$ is performed using 1 − ε confidence intervals for fixed type I error ε. However, 100ε% of the people with non-unique treatment rules will be falsely classified as belonging to ${\mathcal{M}}_{2}^{1}$.

Assume linear blips, so that $\psi_2^\dagger \cdot X_2 = 0$ when laws are exceptional, where the true values of ψ_{1}, ψ_{2} are denoted by $\psi_1^\dagger, \psi_2^\dagger$. Define $\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger) = U_1^{0*}(\psi_1, \hat{\psi}_2) - U_1^{0*}(\psi_1, \psi_2^\dagger)$, a quantity in ZIPI estimation analogous to $\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger)$ in the usual g-estimation procedure. Since, asymptotically, the only form of misclassification error in ZIPI is type I, $\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger)$ consists of three components:

- If $i \in m_j^0$ and $i \in \hat{m}_j^0$, then $\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger) = 0$;
- If $i \in m_j^0$ and $i \in \hat{m}_j^1$, then
$$\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger) = \{S_1(A_1) - E[S_1(A_1) \,|\, L_1]\} \cdot \bigl(I[(\hat{\psi}_2 - \psi_2^\dagger) \cdot X_2 > 0] - A_2\bigr)(\hat{\psi}_2 - \psi_2^\dagger) \cdot X_2;$$
- If $i \in m_j^1$, then
$$\begin{aligned}
\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger) = \{&S_1(A_1) - E[S_1(A_1) \,|\, L_1]\}\\
&\times \bigl\{\bigl(I[g_2(\bar{L}_2, A_1; \hat{\psi}_2) > 0] - A_2\bigr)\, g_2(\bar{L}_2, A_1; \hat{\psi}_2) - \bigl(I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) > 0] - A_2\bigr)\, g_2(\bar{L}_2, A_1; \psi_2^\dagger)\bigr\}.
\end{aligned}$$

Similar to the calculation of section 2.2, the asymptotic bias of $\hat{\psi}_1$ is proportional to $E[\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger)]$. The classification procedure in ZIPI incorrectly fails to use zeros for 100ε% of the sample drawn from $\mathcal{M}_2^0$. Therefore we rewrite $\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger)$ in terms of $\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger)$ of recursive g-estimation:

$$\begin{aligned}
E[\sqrt{n}\,\mathbb{P}_n \Delta_1^z(\hat{\psi}_2, \psi_2^\dagger)] = \;&\epsilon\, E[\sqrt{n}\,\mathbb{P}_n I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0]\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger)]\\
&+ E[\sqrt{n}\,\mathbb{P}_n I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) \ne 0]\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger)].
\end{aligned}$$

It follows directly that the asymptotic bias of first-interval ZIPI estimates at exceptional laws is

$$-\epsilon\left(E\left[\frac{\partial}{\partial \psi_1} U_1^*(\psi_1^\dagger, \psi_2^\dagger)\right]\right)^{-1} E\bigl[I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0]\{S_1(A_1) - E[S_1(A_1) \,|\, L_1]\}(2\pi)^{-1/2}(X_2' \Sigma_2 X_2)^{1/2}\bigr].$$

In finite samples both type I *and* type II errors will be observed, which leads to additional bias in the ZIPI estimates. When the sample size is not sufficient for the power to equal 1, $\Delta_1^z(\hat{\psi}_2, \psi_2^\dagger)$ is decomposed into *four* components (two of which are zero in expectation). If the type II error is denoted by β then, approximately, for two intervals the bias will be proportional to

$$\begin{aligned}
&\epsilon\, \mathrm{P}\bigl(g_2(\bar{L}_2, A_1; \psi_2^\dagger) = 0\bigr) E[\Delta_1^g(\hat{\psi}_2, \psi_2^\dagger)] + \beta\, \mathrm{P}\bigl(g_2(\bar{L}_2, A_1; \psi_2^\dagger) \ne 0\bigr)\\
&\quad\times E\bigl[\{S_1(A_1) - E[S_1(A_1) \,|\, L_1]\}\bigl(I[g_2(\bar{L}_2, A_1; \psi_2^\dagger) > 0] - A_2\bigr)\, g_2(\bar{L}_2, A_1; \psi_2^\dagger)\bigr].
\end{aligned}$$

When covariates are integer-valued, as in our simulations, there is good separation of $g_2(\bar{L}_2, A_1; \hat{\psi}_2)$ between those people with unique rules and those without; in this instance, simulations suggest that choosing a relatively large (e.g., 0.20) type I error provides a good balance between bias reduction *at* and *near* exceptional laws (Figure 3–Figure 4). In general, we expect that it is preferable to have low type II error rather than low type I error. When a type I error is made, the ZIPI procedure incorrectly assumes that an individual has a unique regime and fails to plug in a zero, making the ZIPI estimate numerically more similar to the usual g-estimate, which is consistent though asymptotically biased at exceptional laws. When a type II error is made, the individual is incorrectly believed to have non-unique optimal rules at the second stage. $H_{1,i}^0(\psi_1, \hat{\psi}_2)$ and $H_{1,i}(\psi_1, \hat{\psi}_2)$ will then be equal only if individual *i* happened to receive the optimal treatment at the second interval, and so the impact of the misclassification resulting from type II error will depend on the distribution of treatments at the second interval.

With ZIPI g-estimation shown, both theoretically and in simulation, to be asymptotically less biased than recursive g-estimation when parameters are not shared between intervals, it is a natural choice in that setting. The assumption of unshared parameters is appropriate in many cases, even when the number of intervals is large. For example, when studying the effect of breastfeeding on infant growth, blips at different intervals are not expected to share parameters, as effects are thought to vary by developmental stage of infancy (Kramer *et al.*, 2004). However, it is plausible that blip functions for a treatment given over several intervals for a chronic condition may share parameters. The question then arises as to whether ZIPI outperforms simultaneous (non-recursive) g-estimation.

If parameters are shared across intervals, that is, ψ_{j} = ψ_{j′} for some *j* ≠ *j*′, recursive methods may still be used by pretending that parameters are unshared, applying a recursive form of g-estimation (the usual or ZIPI), and then averaging the interval-specific estimates of ψ^{†}. Such inverse covariance weighted recursive g-estimators have the advantage of being easier to compute than simultaneous estimators and are also doubly-robust under non-exceptional laws (Robins 2004, p. 221); note that the definition of exceptional laws is the same whether or not parameters are assumed to be shared. Simultaneous g-estimation, on the other hand, estimates the shared parameters directly, without the need to average estimates from different intervals.
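For a scalar shared parameter, inverse covariance weighting reduces to the familiar inverse-variance weighted mean of the interval-specific estimates (the vector case replaces reciprocals of variances by inverse covariance matrices). A minimal sketch with illustrative numbers:

```python
import numpy as np

def ivw_combine(estimates, variances):
    """Inverse-variance weighted average of interval-specific estimates of
    a shared scalar parameter, together with the variance of the combined
    estimator. (The paper's inverse covariance weighting is the matrix
    analogue of this scalar rule.)"""
    est = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)  # weights = 1/variance
    combined = np.sum(w * est) / np.sum(w)
    return combined, 1.0 / np.sum(w)

# Two interval-specific estimates of a shared parameter, the more precise
# estimate receiving the larger weight:
est, var = ivw_combine([1.0, 2.0], [0.5, 1.0])
print(est, var)   # 1.333... and 0.333...
```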

Robins (2004, p. 221–222) recommends simultaneous g-estimation for non-exceptional laws when there are few subjects who change treatment from one interval to the next: with few changes of treatment at a particular interval *j*, the estimates $\hat{\psi}_j$ are extremely variable. If there are a moderate number of treatment changes from one interval to the next, (recursive) ZIPI estimation may be useful for the purposes of bias reduction in large samples. Figure 5 compares ZIPI with 80% confidence intervals used to estimate membership in $m_j^0$ or $m_j^1$, recursive g-estimation, and simultaneous g-estimation, for varying sample sizes, for data as before: *L*_{1} ~ (0, 3); *L*_{2} ~ round((0.75, 0.5)); *Y* ~ (1, 1) − |γ_{1}(*L*_{1}, *A*_{1})| − |γ_{2}($\bar{L}_2$, *Ā*_{2})|, but we now impose stationarity so that γ_{1}(*L*_{1}, *A*_{1}) = *A*_{1}(ψ_{0} + ψ_{1}*L*_{1}) and γ_{2}($\bar{L}_2$, *Ā*_{2}) = *A*_{2}(ψ_{0} + ψ_{1}*L*_{2} + ψ_{22}*A*_{1} + ψ_{23}*L*_{2}*A*_{1}). Note the sharing of ψ_{0} and ψ_{1} between the blips of both intervals.

Absolute bias in $\hat{\psi}_0$ as *n* varies about exceptional laws assuming stationarity: inverse-covariance weighted ZIPI estimates with 80% CIs, inverse-covariance weighted recursive g-estimates, and simultaneous g-estimates [left, center, and …

In our simulations, simultaneous and recursive g-estimation are similar in terms of both bias and variability. Relative to simultaneous g-estimation, the recursive methods are easier to implement and do not suffer from the possibility of multiple roots under linear blip functions. At non-exceptional laws, ZIPI produces nearly identical estimates to the doubly-robust recursive g-estimation which, as noted above, is locally efficient. In our simulations, ZIPI’s performance at exceptional laws varied by situation, performing much better when ψ_{1} = 2 as compared to when ψ_{1} = 3 (Figure 5), indicating that inverse-covariance weighted ZIPI estimators are not always less biased than simultaneous g-estimators.

It is worth noting that the asymptotic bias due to exceptional laws is also present in the Iterative Minimization for Optimal Regimes (IMOR) method (Murphy, 2003). This is not surprising, since IMOR and g-estimation are very closely related (Moodie *et al.*, 2007). As with the usual g-estimation, ZIPI outperforms IMOR when parameters are not shared across intervals (Table 4). Nevertheless, when parameters are shared and the sample is not very large, IMOR may prove to be less biased at exceptional laws than ZIPI, as in the previous subsection.

In this paper, we have discussed the asymptotic bias of optimal decision rule parameter estimators in the presence of exceptional laws. We also presented suggestions to reduce the computational burden of trying to detect exceptional laws using the method of Robins (2004). We introduced Zeroing Instead of Plugging In, a recursive form of g-estimation, that is to be preferred over recursive g-estimation whenever parameters are not shared between intervals if there is a possibility of laws being exceptional.

The ZIPI algorithm provides an all-or-nothing adjustment, using weights of zero or one for each individual depending on the estimated unique-rule class membership. Other approaches to bias reduction were considered, including weighting by each individual's probability of having a unique optimal rule, but the improvement in the estimates was unremarkable. We will pursue other methods, such as weighted averaging of the ZIPI and usual g-estimates, in the pursuit of a method that works well with both shared and unshared parameters. The efficacy of the ZIPI procedure has not been tested in the case of non-linear blip functions for, say, estimating the optimal rules for a binary outcome; this, too, will be investigated.

Future work will also include determination of the optimal choice of threshold in ZIPI; simulations suggest that when there is reasonably good separation of people with unique rules and those without, a relatively large probability of type I error (e.g., 0.20) works well. When the distinction between those individuals with unique rules and those without is smaller, controlling the type I error in the class-membership testing through the false-discovery rate or a comparable procedure may be desirable. We will also investigate using the ZIPI procedure with type I error that systematically decreases with increasing sample size while maintaining high power to detect individuals with unique optimal rules.

Oetting *et al.* (2008) have performed the most comprehensive study to date of sample size requirements for a two-interval, two-treatment sequential multiple assignment randomized trial (SMART) (Collins *et al.*, 2007). In that study, sample size calculations were based on testing for a difference between the best and next-best strategy by considering the difference in mean response under the two treatment strategies. Earlier sample size calculations for testing between strategies were derived by Murphy (2005). To date, no thorough study has been performed of the sample size requirements for testing optimal strategies selected via g-estimation, and so the impact of parameter sharing on sample size and efficiency is not well understood. However, it is expected that the 'reward' for the additional assumption of shared parameters would be increased efficiency (under correct specification). A future direction of research will be the investigation of the properties of g-estimators and required sample sizes under both shared and unshared parameter assumptions.

We conclude by observing that one of g-estimation’s primary points of appeal is that, unlike traditional approaches such as dynamic programming, its estimates are asymptotically unbiased under the null hypothesis of no treatment effect even when the law of the complete longitudinal data is unknown, provided the true law is not exceptional. However, for a blip function that includes at least one component of treatment or covariate history, every distribution is an exceptional law under the hypothesis of no treatment effect, and therefore g-estimation is asymptotically biased under the null for all such blip functions. Thus, when the primary concern is a valid test of the null hypothesis of no treatment effect, we may consider fitting a “null” blip model which does not include any covariates (using the doubly robust estimator), proceeding to richer blip models only if the hypothesis of no treatment effect is rejected.

Erica Moodie is supported by a University Faculty Award and a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). Thomas Richardson acknowledges support from NSF grant DMS 0505865 and NIH grant R01 AI032475. We wish to thank Dr. James Robins for his insights and valuable discussion. We also thank Dr. Moulinath Banerjee, the associate editor, and the anonymous referees for helpful comments.

Erica E. M. Moodie, Department of Epidemiology, Biostatistics, and Occupational Health, McGill University.

Thomas S. Richardson, Department of Statistics, University of Washington.

- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 1995;57:289–300.
- Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 2001;29:1165–1188.
- Collins LM, Murphy SA, Strecher V. The Multiphase Optimization STrategy (MOST) and the Sequential Multiple Assignment Randomized Trial (SMART): new methods for more potent e-health interventions. American Journal of Preventive Medicine. 2007;32(5S):S112–S118.
- Giles DEA, Lieberman O, Giles JA. The optimal size of a preliminary test of linear restrictions in a misspecified regression model. J. Amer. Statist. Assoc. 1992;87:1153–1157.
- Kabaila P, Leeb H. On the large-sample minimal coverage probability of confidence intervals after model selection. J. Amer. Statist. Assoc. 2006;101:619–629.
- Kramer MS, Guo T, Platt RW, Vanilovich I, Sevkovskaya Z, Dzikovich I, Michaelsen KF, Dewey K. Feeding effects on growth during infancy. The Journal of Pediatrics. 2004;145:600–605.
- Moodie EEM, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007;63:447–455.
- Murphy SA. Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 2003;65:331–366.
- Murphy SA. An experimental design for the development of adaptive treatment strategies. Stat. Med. 2005;24:1455–1481.
- Oetting AI, Levy JA, Weiss RD, Murphy SA. Statistical methodology for a SMART design in the development of adaptive treatment strategies. In: Shrout P, editor. Causality and psychopathology: finding the determinants of disorders and their cures. Arlington VA: American Psychiatric Publishing, Inc.; 2008.
- Robins JM. Correcting for non-compliance in randomized trials using structural nested mean models. Comm. Statist. Theory Methods. 1994;23:2379–2412.
- Robins JM. Causal inference from complex longitudinal data. In: Berkane M, editor. Latent variable modeling and applications to causality. New York: Springer-Verlag; 1997. pp. 69–117.
- Robins JM. Optimal structural nested models for optimal sequential decisions. In: Lin DY, Heagerty PJ, editors. Proceedings of the Second Seattle Symposium on Biostatistics; New York: Springer-Verlag; 2004. pp. 189–326.
- Robins JM, Orellana L, Rotnitzky A. Estimation and extrapolation of optimal treatment and testing strategies. Stat. Med. 2008;27:4678–4721.
- Rubin DB. Bayesian inference for causal effects: The role of randomization. Ann. Statist. 1978;6:34–58.
- van der Vaart A. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998.
