Suppose that study subjects in a population of interest are randomly assigned to either a standard treatment denoted by

*G* = 0 or an experimental treatment denoted by

*G* = 1. Let

**U** denote a vector of baseline covariates,

*Y* denote the response variable, and

*Y*_{(k)} denote the potential outcome of an individual if he/she were assigned to treatment group

*G* =

*k*. Without loss of generality, we assume that a larger value of

*Y* is beneficial. Now, suppose that for subjects with

**U** =

**u**, we are interested in estimating the treatment difference

Our data consist of

*n* random vectors {(

*Y*_{i},

**U**_{i},

*G*_{i}),

*i* = 1,…,

*n*}, where we assume that {(

*Y*_{i},

**U**_{i}):

*G*_{i} =

*k*} are

*n*_{k} = ∑

_{i = 1}^{n}*I*(

*G*_{i} =

*k*) independent and identically distributed random vectors, for

*k* = 0,1. Furthermore, the ratio

*n*_{1}/

*n* goes to a constant in the open interval (0,1) as

*n*→

*∞*. To estimate

(

**u**), one may consider a nonparametric function procedure. In practice, however, when

**u** is not univariate, generally it is difficult, if not impossible, to estimate

(

**U**) well nonparametrically.

To reduce the complexity of the high-dimensional problem, conventionally one utilizes a parametric or semiparametric procedure to estimate

(

**u**). Here, we consider the following generalized linear “working” model to approximate the mean function of

*Y*_{(k)}:

where

**Z**, a

*p*×1 vector, is a function of

**U** with first column being 1,

*g*_{k} is a known, strictly increasing smooth link function, and

**β**_{k} is an unknown vector of regression coefficients. Note that in (2.1), we only model the mean function of

*Y*. To estimate the parameter vector

**β**_{k} in (2.1) without distribution assumptions about the response, one may use a solution

to the following estimating equation:

Using the arguments given in

Tian *and others* (2007), one can show that

converges to a deterministic vector

even when Model (2.1) is incorrectly specified. This stability property is crucial for developing our new procedure. It follows that for a given

**u**, a parametric estimator for

(

**u**) is

Note that when Model (2.1) is correctly specified,

is the true parameter for Model (2.1) and

is a consistent estimator of

(

**u**).

Let

**U**^{0} be the baseline covariate vector for a future subject from the study population. If this subject is treated by treatment

*k*, the response is

*Y*_{(k)}^{0},

*k* = 0,1. For

**U**^{0} =

**u**^{0}, we may use

to decide which treatment this specific subject should be treated with. The adequacy of such a decision heavily depends on the appropriateness of Model (2.1). On the other hand,

may be used as an index scoring system for grouping future subjects with potentially similar treatment differences. That is, we divide the future population into many strata based on the score

such that patients in the same subset

have the same parametric score value

*v*. To employ such scoring system in clinical practice, one could create strata by segmenting all possible values of

into small intervals.

Now, for any given

*v*, consider subjects in the subset Ω

_{v}. We propose to consistently estimate the average treatment difference for this subgroup,

based on a local likelihood approach (

Tibshirani and Hastie, 1987,

Fan and Gijbels, 1996), where

*μ*_{k}(

*v*) =

*E*(

*Y*_{(k)}^{0}|

**U**^{0}Ω

_{v}),

*k* = 0,1, and the expectation is taken with respect to the data and (

*Y*_{(k)}^{0},

**U**^{0}). Specifically, we obtain the root

to the local weighted estimating equation,

where

where

*h* is the smoothing parameter chosen such that

*h* =

*O*_{p}(

*n*^{ − ν}) with 1/5 <

*ν* < 1/2,

*K*_{h}(

*x*) =

*K*(

*x*/

*h*)/

*h*,

*K*(

*x*) is a symmetric kernel function with a finite support,

and

*ψ*(·) is a suitably chosen, nondecreasing function. Here,

*g*(

*x*) =

*x* if the response

*Y* is continuous and

*g*(

*x*) = exp(

*x*)/{1 + exp(

*x*)} if

*Y* is binary. Transforming the score via an appropriately chosen function,

*ψ*(·) could potentially improve the performance of the kernel estimation (Wand

*and others*, 1991, Park

*and others*, 1997). The

*μ*_{k}(

*v*) can then be estimated by

This estimator corresponds to the local linear least square estimator for continuous

*Y* and local linear logistic likelihood estimator for binary

*Y*. Subsequently, we estimate

as

In Section A of the

Supplementary Material available at

*Biostatistics* online, we show that

is uniformly consistent for

, for

*v* = [

*ρ*_{l},

*ρ*_{r}], where

is an interval properly contained in the support of

.

To construct confidence interval for

, the average treatment difference among the stratum Ω

_{v}, we show in Section B of the

Supplementary Material available at

*Biostatistics* online that for large

*n*, the distribution of

can be approximated by the conditional distribution of a mean-zero normal variable

given the data, where

is a random sample from the standard normal variable and is independent of the data,

is obtained by replacing

in

with

, for

*k* = 0,1, where for any function

*g*,

*g*(·) denotes its derivative. Note that

is a solution to the counterpart of estimating equation

(2.2), which is perturbed with the same set of

in (2.4). Also, note that

are the only random quantities in (2.4), whose distribution can be approximated easily by simulating

repeatedly. A (1 −

*α*) pointwise confidence interval estimates for

can be constructed via this large-sample approximation, which is

. Here,

is the standard error estimate of

and

*d* is the upper (1 −

*α*/2) percentile of the standard normal.

To select subgroups that may benefit from the new treatment, one may aim to identify regions of

*v* such that

is above or below a certain threshold value. However, as for any subgroup analysis, it is crucial to control for the global error rate. For the present case, we may control for the overall error rate by constructing a simultaneous confidence band for

. Specifically, to make inference about the treatment differences over a range of

*v*, one may construct a simultaneous confidence band for

A conventional way to obtain such a confidence band is based on a sup-type statistic

Although

does not converge weakly as a process in

*v*, we show in Section C of the

Supplementary Material available at

*Biostatistics* online that a standardized version of

converges in distribution to a proper random variable. In practice, for large

*n*, one can approximate the distribution of

by

, the sup of the absolute value of

divided by

, with (2.4) perturbed by the same set of perturbation variables

for all

*v*. It follows that the 100(1 −

*α*)% simultaneous confidence interval for

is

where the cutoff point

*γ* is chosen such that pr

The simultaneous interval estimators for

can be used to suggest treatment options for patient subgroups. For example, if there is little concern regarding the cost of the new treatment relative to the standard treatment, then one may suggest the new treatment to subgroups in

On the other hand, if the new treatment is expensive or has serious adverse risk, only patients with

where

*c*_{0} is a preselected cutoff value representing adequate benefit. Making decisions based on the simultaneous confidence intervals ensures that there is no inflated type I error associated with testing whether

exceeds or is above a certain threshold value for a range of

*v*. This is particularly important if

is a nonmonotone function crossing zero more than once.

As for any nonparametric functional estimation problem, the choice of the smoothing parameter

*h* for

is crucial for making inferences about

. We may choose a single smooth parameter for 2 nonparametric function estimates,

and

. Since each study subject is assigned to a single-treatment group, therefore, we cannot use the standard cross-validation method with the integrated mean square error criterion to choose

*h* for

. A reasonable, feasible alternative to choose an “optimal” smooth parameter is minimizing an average squared distance between the observed “cumulative” treatment difference, and their predicted counterparts in the validation samples under the

-fold cross-validation setting. Noting that

we propose to use empirical estimates of

*E*(

*Y*_{i}|

*G*_{i} = 1,

**U**_{i} ≤

**u**) −

*E*(

*Y*_{i}|

*G*_{i} = 0,

**U**_{i} ≤

**u**) as the observed cumulative treatment difference and the empirical counterpart of

as the predicted cumulative treatment difference based on the proposed procedure. Here and in the sequel, for any 2 vectors

**U** = (

*U*_{1},…,

*U*_{q})

^{T} and

**u** = (

*u*_{1},…,

*u*_{q})

^{T},

**U** ≤

**u** denotes {

*U*_{j} ≤

*u*_{j},

*j* = 1,…,

*q*}. Specifically, we randomly split the data into

disjoint subsets of about equal sizes, denoted by {

_{ι},

*ι* = 1,…,

}. For each

*ι*, we use all observations not in

_{ι} to obtain estimators for the parametric index score and

with a given

*h*. Let the resulting estimators be denoted by

and

, respectively. We then use the observations from

_{ι} to calculate the empirical integrated cumulative square error

where

*I*(·) is the indicator function and

is an empirical weight function such as the empirical distribution function. Last, we sum (2.8) over

*ι* = 1,…,

, and then choose

*h* by minimizing this sum. It is important to note that since the bandwidth is selected by minimizing a quantity which is composed of cumulative predicted errors, the order of such optimal bandwidth is expected to be

*n*^{ − 1/3} (

Bowman *and others*, 1998). Thus, the selected bandwidth satisfies the condition required for the resulting functional estimator

with the data-dependent smooth parameter to have the above desirable large-sample properties.