We conducted a simulation study to assess the small sample performance of our proposed estimator. The data were generated from a linear regression model of the form:

where

*X*,

*Z*, and ς were generated independently from standard normal distribution. Thus, the conditional distribution of

*Y* given

*X* and

*Z* is normal with mean β

_{0} + β

_{1}*X* + β

_{2}*Z* and variance 4. Let

*W* =

*X* +

*ϵ*, where

*ϵ* was generated from a zero-mean normal distribution with variance σ

^{2}. Note that the value of σ

^{2} indicates the strength of information contained in

*W* for

*X*. We set σ = 1 in simulation, which represents a moderate association between the

*W* and

*X*. Here, we take

*S* =

*W*.

Suppose there are

*N* subjects available at the first stage. Let

*a*_{i} and

*b*_{i} denote the

*i*/3 percentile of

*Y* and

*W*, respectively, for

*i* = 1,2. First, we use the method depicted in to obtain the second stage samples for the 2-stage OADS design. Then the size of the validation set is

Second, while selecting the same SRS sample of size

*n*_{0}, we also select the 2 supplemental ODS samples in the stratum

*A*_{1} of size

*n*_{1} +

*n*_{2} and stratum

*A*_{3} of size

*n*_{3} +

*n*_{4}, respectively, to mimic the design proposed by

Weaver and Zhou (2005). Note that the sizes of validation set

*V* obtained at the second stage through the above 2 sampling designs are the same.

Having obtained the data under the 2-stage OADS design, we denote the proposed estimator by

. We also denote the reduced proposed estimator by

for the 2-stage ODS design with (

*Y*,

*Z*,

*W*) observed at the first stage. We compare estimators

and

with some competing estimators. The first estimator, denoted by

, is the inverse probability weighted estimator (

Horvitz and Thompson, 1952) based on the 2-stage OADS design. The second estimators to be compared, as discussed in the Section 2.1, are the estimator

for the 2-stage ODS design with (

*Y*,

*Z*) observed at the first stage and, similarly, the estimator

for the 2-stage ODS design with only

*Y* observed at the first stage and (

*X*,

*Z*) observed at the second stage. The bandwidth

is used for these estimators involving kernel smoothing, where

is the sample standard error of {

*W*_{i},

*i**V*_{k}}. Finally, as a benchmark, we also consider the efficient linear regression estimator, denoted by

, which is a hypothetical situation in which all subjects at the first stage have

*X* observed, and the ordinary linear regression estimator, denoted by

, from a simple random sample of the same size as the validation set at the second stage. Note that the efficiency difference for methods β

_{Y1}, β

_{Y2}, β

_{P1}, and β

_{P2} should be attributed to the study design instead of estimating procedure. However, β

_{P2} and β

_{W} are different estimating procedures under the same 2-stage OADS design.

For narrative simplicity, we define an allocation function denoted by allocation(μ,ν) to allocate the validation set of size μ + 4ν at the second stage, which means that *n*_{0} = μ and *n*_{1} = *n*_{2} = *n*_{3} = *n*_{4} = ν under the 2-stage OADS design as illustrated in . Under the 2-stage ODS design, allocation(μ,ν) means SRS sample of size μ and 2 supplemental ODS samples in the stratum *A*_{1} of size 2ν and in stratum *A*_{3} of size 2ν are allocated. We also investigate the impact on the parameter estimation of different allocations of total validation sample size between the SRS sample and the supplemental OADS (ODS) samples at the second stage, with (*N*,β_{0},β_{1},β_{2}) = (1500,0.5,0.3,0.5) fixed.

For each simulation configuration, 1000 replicated samples were generated and the results were presented in . Under the model studied, we make the following observations on the estimator

, the parameter of interest. Note that the estimator

works quite well in all scenarios. First, all the methods in all the scenarios yield consistent estimators, the variance estimators accurately reflect the true variations, and the confidence intervals have proper coverage probabilities. Second, the proposed estimators

and

are more efficient than the estimators

and

, which indicates that taking auxiliary information into consideration indeed gains substantial estimation efficiency. Furthermore,

is more efficient than

. This fits our expectation since

not only utilizes the auxiliary in the stratification (i.e. study design) but also incorporates it into the estimation procedure, while

uses it just in the estimation procedure. On the other hand, although the precision of estimator

and that of

are almost the same in the scenarios considered, the efficiency gains of

over

are apparent due to the fact that the covariate

*Z* is observed for all subjects in

. The estimator

is less efficient than

since

just utilizes the second-stage sample and sampling probability under the 2-stage OADS design. Third, when we increase the size of the validation set from |

*V*| = 240 to |

*V*| = 360, more accurate estimators (including

,

,

,

,

, and

) are obtained as expected. Here, we consider 3 different ways to add the additional 120 samples to the validation set |

*V*| = 240. It can be seen that more efficiency gains are achievable through the way from allocation(120,30) to allocation(180,45), that is, putting half of the additional 120 samples to the SRS part and the other half to the OADS part averagely, than that from allocation(120,30) to allocation(240,30), that is, putting the additional 120 samples to the SRS part. Efficiency gains are also achieved through the way from allocation(120,30) to allocation(120,60), which puts the additional 120 samples to the OADS part evenly. These different allocation patterns indicate that adding the additional sample to both the SRS part and the supplemental OADS part or the supplemental OADS part is better than to the SRS part only. Finally, under the allocation(120,60), when the cutpoints vary from the

to

, that is, when the product sample space

×

is stratified by more extreme cutpoints, more precise estimators (including

,

,

, and

) are obtained, and the efficiency advantage of

over

becomes more obvious. We also investigate the effect of the strength of

*W* for

*X*, represented by σ, on the efficiency of estimator

, under the methods considered. Please see Figure A.1 in the

supplementary material available at

*Biostatistics* online.

| **Table 1.**Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage for each data set^{†} |

It should be noted that in above simulation results, the covariate

*X* was generated independently from

*Z*. Therefore, we took

*S* =

*W* and then adopted a univariate kernel smoothing method to estimate the function

*g*(

*X*|

*Z*,

*W*) =

*g*(

*X*|

*W*) nonparametrically. As suggested by one of the referees, here we intend to investigate our proposed estimators when

*g*(

*X*|

*W*) is specified parametrically instead of being estimated by kernel smoothing. Note that in our above simulation setups

*g*(

*X*|

*W*) is a normal density function with mean

*W* and variance 2. The resultant estimate is denoted by

. Furthermore, we also consider this estimate in the misspecified situation in which the

*X* was generated from the model

but the working model remains to be

*X* =

*W* +

*ϵ*. The related results are formulated in . Obviously, when

*g*(

*X*|

*W*) is correctly specified, the estimate

outperforms the nonparametric methods. However, when

*g*(

*X*|

*W*) is misspecified, the estimate

is biased with low coverage probability while the nonparametric smoothing estimates, including our proposed estimates

and

, still work well.

| **Table 2.**Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage and allocation pattern *allocation*(*120, 60*) at the second stage under the cutpoints for each data set^{†} |

On the other hand, as suggested by another referee, in some practice,

*d*, the dimension of

*W*, could be greater than one, and then multivariate kernel smoothing method would be involved. Hence, it is of practical importance to see how sensitive the resulting inference on the parameters of interest is with regard to the dimension

*d* of kernel smoothing. We explore this issue with some modifications of the simulation models, where we generate

*Z* from model

*Z* =

*W*^{2} +

*ϵ*_{2}, where

*W* and

*ϵ*_{2} are both generated independently from a standard normal distribution. We keep the remaining parametric simulation settings unchanged. We use 2 dimensional product standard normal kernels to estimate

*g*(

*X*|

*Z*,

*W*) with bandwidth matrix diag(

*h*_{1},

*h*_{2}), where

,

*h*_{2} is defined in a similar pattern, and

is the sample standard error of {

*Z*_{i},

*i**V*_{k}}. The corresponding estimates are listed in . It can be seen that when the dimension of kernel smoothing

*d* equals 2, the resultant estimates of β

_{1} of main interest are slightly biased with low coverage probability except for the inverse probability estimate

. Even then, our proposed estimators

and

outperform

and

.

| **Table 3.**Simulation study for the proposed estimators. Results are based on 1000 replicated data sets with 1500 subjects at the first stage and allocation pattern *allocation*(*120, 60*) at the second stage under the cutpoints for each data set with *S* = (*Z*, *W*)^{†} (more ...) |