PMCCPMCCPMCC

Search tips
Search criteria 

Advanced

 
Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
 
J Off Stat. Author manuscript; available in PMC 2017 August 3.
Published in final edited form as:
J Off Stat. 2016 March; 32(1): 231–256.
Published online 2016 March 10. doi:  10.1515/JOS-2016-0011
PMCID: PMC5542708
NIHMSID: NIHMS875786

Synthetic Multiple-Imputation Procedure for Multistage Complex Samples

Abstract

Multiple imputation (MI) is commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. For multistage stratified clustered designs, this requires dummy variables to represent strata as well as primary sampling units (PSUs) nested within each stratum in the imputation model. Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are many strata in the sample design. Complexity only increases when sampling weights need to be modeled. This article develops a general-purpose analytic strategy for population inference from complex sample designs with item-level missingness. In a simulation study, the proposed procedures demonstrate efficient estimation and good coverage properties. We also consider an application to accommodate missing body mass index (BMI) data in the analysis of BMI percentiles using National Health and Nutrition Examination Survey (NHANES) III data. We argue that the proposed methods offer an easy-to-implement solution to problems that are not well-handled by current MI techniques. Note that, while the proposed method borrows from the MI framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data, but rather as an analytic strategy in and of itself.

Keywords: Finite population Bayesian bootstrap, Haldane prior, stratified sample, clustered sample, sample weights

1. Introduction

Stratified multistage sampling is the most common type of sample design for large-scale surveys conducted by the U.S. federal statistical agencies. This type of sample design combines the advantages of both stratification (for statistical efficiency) and cluster sampling (for cost and logistical efficiency). Under this design, the primary sampling units (PSUs) are stratified in such a way that they are homogeneous with respect to a stratum-level aggregate of the variable(s) of interest. To permit a maximum degree of stratification and thus variance reduction, it is common practice to define a large number of strata where only a small number of PSUs are selected in each stratum.

Multiple imputation (MI) (Rubin 1976, 1987) is a method commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. Reiter et al. (2006) demonstrated the importance of simultaneously accounting for stratum effects and clustering effects in multiple imputation. They showed that when design features were ignored in the imputation model, biases would occur on the estimated parameter, even if a design-based analysis method was applied to the imputed data. Current MI methods typically include dummy variables to represent strata as well as PSUs nested within each stratum in the imputation model. When necessary, they also identify statistically significant interactions between these dummies with other covariates through routine variable selection procedures such as stepwise regression (Reiter et al. 2006; Schenker et al. 2006). Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are hundreds of strata in the sample design and the sample in each stratum consequently becomes sparse. For example, the Census Bureau’s Current Population Survey design groups 1,768 nonself-representing PSUs into 220 strata.

Possibly a better strategy is to consider clusters as random effects while treating strata as either fixed (using dummies) or random effects. However, many of the popular software packages that implement multiple imputation (e.g., SAS MI procedure, R packages mice or mi, and IVEware) cannot be adapted easily to such an approach. While a few recent software modules (such as R package pan and MLwiN module REALCOM-IMPUTE) have started to incorporate mixed effects or multilevel modeling for imputation purposes, they typically assume normal or latent normal distribution for variables with missing data. Their performances for missing categorical variables (binary in particular) are unclear. Moreover, there has been only little research that formally investigates their use in incorporating strata as well as clusters.

To circumvent these problems with fully parametric model-based imputation techniques, we develop a two-step semiparametric MI method. The idea is to separate the need to account for complex sample designs from the treatment of missing data. In the first step, sample designs are “reversed” through synthetic population data generation using a weighted finite population Bayesian bootstrap (FPBB) (Cohen 1997; Little and Zheng 2007; Dong et al. 2014). In the second step, missing values are imputed in the created synthetic population based on a parametric model that assumes identically independently distributed (IID) data elements. To account for stratum effects, we combine a replication variance estimation method (Efron 1979; Kovar et al. 1988; Rao and Wu 1988; Rao et al. 1992; Rust and Rao 1996) with the weighted FPBB. Under a standard missing at random (MAR) assumption (Little and Rubin 2002), our method requires neither complicated modeling of strata and clusters nor design-based analyses of the imputed data. Note that while the proposed method borrows from the multiple-imputation framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data. Rather, it is intended an alternative analytic strategy for population inference from complex sample design data with item-level missingness.

In this article, we focus on the estimation of two quantities: quantile estimates for a continuous variable, and estimates of rare proportions and their associated logistic regression estimates. We consider a stratified two-stage sample design and investigate a full range of quantiles including tail behaviors. While design-based methods for quantile estimation from complex survey data have been developed (Francisco and Fuller 1991; Woodruff 1952), quantile estimation after imputation is less commonly addressed in the literature. (A recent exception that considers nonparametric fractional imputation outside of the complex sample design setting is Yang et al. 2013.) This is the case despite the rapid development and increasing popularity of MI. We also consider MI for incomplete binary variables, with a focus on rare outcomes. It is well known that maximum-likelihood estimation of logistic regression models typically suffers from small sample bias, the degree of which is strongly dependent on the number of sample cases in the less frequent of the two categories (King and Zeng 2001). Thus when the dependent binary variable represents the occurrence of rare events, the logistic regression coefficients can be substantially biased even with a simple IID data structure. Random effects logistic models are commonly used for fitting clustered binary data; however, these models rely heavily on asymptotic theory assumptions, which may not be met in sparse samples. All these issues might extend naturally to the missing-data context. As shown by Zhao and Yucel (2009), sequential MI for binary data missing completely at random in a multilevel setting suffers from severe bias and poor coverage in estimating probabilities that are close to 0 or 1, particularly when the intraclass correlation is high.

The objectives of this article are: i) to develop a two-step synthetic MI method as a way to simultaneously account for stratification, clustering, and unequal inclusion probability; and ii) to demonstrate the effectiveness of the new method with respect to quantile estimation and logistic regression for binary rare events data as compared with existing fully parametric imputation strategies. Section 2 discusses the imputation strategies under three different models: simple random sample, fixed effects for clusters/strata, and random effects for cluster/strata. Section 3 introduces the newly proposed procedure and the MI inference rules for quantile estimation under this method. Section 4 presents a Monte Carlo simulation study as the validation tool to assess the repeated sampling properties of MI under the various approaches. Section 5 applies different MI procedures to the analysis of body mass index on youth data from the third National Health and Nutrition Examination Survey (NHANES III). Some concluding remarks follow in Section 6. We focus on the two-PSU-per-stratum design in this chapter, although the methods we develop can accommodate any number of PSUs per stratum.

1.1. Fully Parametric Imputation Methods for the Two-PSU-per-Stratum Design

Here we briefly describe fully parametric multiple-imputation techniques with complex sample design features incorporated to different degrees. We assume the missing data Yi is a member of the exponential family, and that there are fully observed covariates Xi (a (p + 1)-dimension vector) such that g(E(Yi|Xi)) = Xiβ for a known link function g(·) (e.g., g(u) = log(u/(1 − u)) for binary outcomes (logistic regression), g(u) = log(u) for count outcomes (Poisson regression), or g(u) = u for continuous outcomes (Gaussian regression)).

1.1.1. Standard Regression Model Assuming SRS

Based on the maximum-likelihood estimates [beta] and the associated asymptotic covariance matrix V ([beta]) for the generalized linear model g(E(Yi|Xi)) = Xiβ, the posterior predictive distribution of the parameters can be constructed, which is then used to impute the missing values (Rubin 1987, 169–170). Point and variance estimates of the regression parameters can then be obtained using the usual MI combining rules (Rubin 1987, 76). For the pth component of the regression parameter:

β^p=1Mm=1Mβ^p(m),
(1)

V^(β^p)=1Mm=1MV^(β^p(m))+M+1M(M-1)m=1M(β^p(m)-β^p)2
(2)

and

(β^p-βp)V^(β^p).tν,ν=(M-1)(1+m=1Mβ^p(m)(M+1)(M-1)m=1M(β^p(m)-β^p)2)2
(3)

where m = 1, …, M imputations are taken from draws widely separated to practically eliminate autocorrelation. Multivariate combining rules for the joint distribution of [beta] are available as well (Schafer 1997, 112–118).

1.1.2. Fixed-Effects Model (FX_APR)

Compared to the predictive model using standard generalized linear regression, we can add dummy variables indicating stratum and cluster memberships to account for stratification and clustering effects. Note that we also need to include the log transformation of sampling weight as a predictor if the missing-data mechanism depends on weights to make the imputation model truly appropriate. The model takes the following form:

g(E(YiXi))=Xiβ+Diγ+Eiη+[ζlog(wi)],
(4)

where Di is a 1 × (H − 1) row vector of dummies representing the H strata, and Ei is a 1 × Q row vector of dummies representing the clusters nested within each stratum. Note that Q = ΣhQhH, where Qh is the number of clusters in each stratum; in the case of the two-PSU-per-stratum case, Q = H. The parameter space under this model is expanded as θ = (β,γ,η,ζ), and the steps for imputation are similar to those in the SRS setting.

1.1.3. Mixed-Effects Model (RE_APR)

As there are only two PSUs selected from each stratum, it is not feasible to model clusters as random effects separately within each stratum. Here we pool all Q + H clusters in the sample and model them using a single random-effect term. The imputation model is specified as follows:

g(E(YjXj))=Xiβ+Diγ+ui+[ζlog(wj)],
(5)

where ui~N(0,σu2) is a random intercept term representing cluster effects, for i = 1, …,(Q + H), and σu2 denotes the between cluster variance. Other terms are as previously defined. (In the two-PSU-per-stratum case, Q + H = 2H.)

2. Synthetic MI Using the Weighted FPBB for Stratified Samples

In this section, we develop the two-step multiple-imputation methodology for a stratified two-stage sample design where a combination of complex sampling techniques are considered, namely, stratification, clustering, and unequal inclusion probability. We develop methods for an unrestricted number of clusters per stratum, but for our simulations and application we focus on the special case of two PSUs selected per stratum, which mimics the form of a public-use dataset that is commonly released for analyses.

2.1. Synthetic Data Generation to Account for Complex Sample Designs

Consider a finite population P, which is stratified into H strata with Nh PSUs in the hth stratum, and hence the population size of PSUs is h=1HNh=N. For the h th stratum, select nh PSUs with/without replacement from some probability sampling plan, independently across strata, and hence the total sample size of PSUs is h=1Hnh=n. Subsampling of mhi elements (treated as the ultimate sampling units in this example) from a total of Mhi is then conducted within the i th sampled PSU of the h th stratum for i = 1, …, nh, h = 1,2, …, H. Hence the overall sample size and population size of elements are h=1Hi=1nkmhi=h=1Hmh=m and h=1Hi=1NkMhi=h=1HMh=M, respectively, where mh and Mh are the sample size and population size of elements for the h th stratum, respectively. The population consists of four types of survey variables: a single outcome Y, a single covariate X, a design matrix Z = [S,C,w] including the stratum indicators (S), the cluster indicator (C) and the sample weight (w), and the response indicator R. Let D = (Ds, Dns) = {(Yhij, Xhij, Zhij, Rhij), h = 1, …, H, i = 1, …, Nh, j = 1, …, Mhi} denote the population of values measured on the survey variables, which is divided into the sampled component (Ds) and the nonsampled (Dns) component.

We generate synthetic populations using a two-stage procedure. The first stage accommodates stratification and clustering and the second weighting. We have two broad approaches. The first, which we term SYN1, assumes that first-stage (cluster-level) and second-stage (element-level) sample weights are available for the analysis and implements a weighted FPBB at each level to generate the synthetic population. The second, which we term SYN2, assumes that only final weights are available for the analysis; it uses a Bayesian bootstrap to account for stratification and clustering at the first stage and the weighted FPBB to account for the final weight at the second stage.

2.1.1. Double-Weighted Finite Population Bayesian Bootstrap (SYN1)

For the h th stratum, let ts,h and tns,h index the sampled and nonsampled clusters, respectively, and {b 1, …,b q, …,b rh, q = 1, …,rh} be the rh (1 ≤ rhNh) distinct matrices of real numbers each of dimension browq×bcolq with no row vectors in common. Each cluster in the stratum can take the form of one of b qs. Let thi = q when the i th cluster takes on the values of b q, for i = 1, …, Nh. Assume nh = rh and mhi = ||bts,hi|| (the number of distinct row vectors in bts,hi) for convenience of exposition. Let w (i) be the sample weight of the i th sampled cluster in the h th stratum which equals b q, for i = 1, …, nh. Also let wts,hi,Ds,h(j) be the sample weight of the j th sampled element in the i th sampled cluster which equals bkts,hi, for j = 1, …, mhi. Finally, let cts,h (q) and ctns,h (q) be the number of sampled and nonsampled clusters that equal b q, and cth,Ds,hhi(k) and cth,Dns,hhi(k) be the number of sampled and nonsampled elements that equal bkts,hi.

It can be shown (cf. Zhou 2014) that, within a stratum h, the Polya posterior for the counts of distinct unobserved elements Dns,h is given by

p(Dns,hDs,h)={q=1rh{Γ(wth(q))/Γ(wts,h(q))}}{Γ(Nh)/Γ(nh)}×{k=1mh{Γ(wth,Dns,h(k))/Γ(wts,h,Ds,h(k))}}{Γ(Mh)/Γ(mh)},
(6)

where wth(q)=wts,h(q)+Ctns,h(q) and wth,Dns,h(k)=wts,h,Ds,h(k)+cth,Dns,hhi(k), for mh=k=1mkcth,Ds,hhi(k) and mh=Mh-mh=k=1mkcth,Dns,hhi(k). The full posterior is then given by the product of the posteriors within each stratum, since these strata are independent and all strata in the population are in the sample:

p(DnsDs)=h=1Hp(Dns,hDs,h).
(7)

A Monte Carlo procedure to simulate from this posterior distribution is then given as follows:

  1. Draw the Nhnh nonsampled clusters in the population based on the Polya posterior distribution independently for each stratum. Each of the sampled clusters is resampled with probability
    ςhi=wts,h(i)-1+lhi,k-1×(Nh-nhnh)Nh-nh+(k-1)×(Nh-nhnh),k=1,,Nh-nh+1,
    (8)
    where lhi,k−1 is the number of times that the i th cluster in the h th stratum has been resampled at the (k − 1)th resampling, and w ts,h(i) is the weight for the i th sampled cluster in the h th stratum which is normalized to sum up to the total number of clusters, that is, i=1nkwts,h(i)=Nh.
  2. From Step 1, form a population of clusters { c11,c12,,c1n,c11,c12,,c1N1-n1,,cH1,cH2,,cHnH,cH1,cH2,,cHNH-nH}. Record the number of times each of the clusters from the original sample appears in the FPBB population of clusters, denoted by τhi, i = 1, …, nh, h = 1, …, H., and h=1Hi=1nhτhi=N. Then update the within cluster element-level conditional weights as follows: wjhi=wjhi×τhi, i = 1, …, nh, h = 1, …, H, where wj|hi is the inverse of the conditional probability that element j is selected given cluster i in stratum h is selected. Now pool all elements from these clusters together and treat them as a single FPBB sample (i.e., as if they have no stratum or cluster boundaries). Note that this FPBB sample has the same sample size m=h=1Hi=1nhmhi as the original sample, but different sampling weights. We then once more apply the weighted FPBB to these pooled elements to generate Mm units from the m units in the FPBB sample. We resample from each of the resampled clusters Mm elements, cycling through Mm times and resampling with probability
    λjhi=wjhi-1+lhij,k-1×(M-mm)M-m+(k-1)×(M-mm),k=1,,(M-m+1),
    (9)
    where lhij,k is the number of times that the j th element in the i th cluster in the h th stratum has been resampled at the k th resampling, and wj|hi is the updated conditional weight for the j th element in the i th cluster in the h th stratum. Again, they are normalized to sum up to the total number of units in the entire population, that is, h=1Hi=1nhj=1mhiwjhi=M. Thus we create a single synthetic population. Repeat Step 2 B times to obtain B FPBB synthetic populations.
  3. Repeat Steps 1–2 L times to obtain L bootstrap samples, yielding L × B FPBB populations P(lb)syn=(P(lb)obssyn,P(lb)missyn), l = 1, …, L, b = 1,…B, each of which consists of both responding elements and nonresponding elements on a vector of variables {Y,X,Z,R}.

2.1.2. Bootstrap –– Weighted Finite Population Bayesian Bootstrap (SYN2)

Because we often do not know the first- and second-stage weights in public-use datasets, we consider an alternative to the procedure proposed in Subsection 2.1.1. Rather than obtaining a sample of clusters from a draw from a Polya posterior, we use replication methods (Rust and Rao 1996) to capture the cluster-level sampling variance. The final sampling weights instead of the adjusted element-level conditional weights are then used directly as input in the second-stage weighted FPBB. We use Rao and Wu’s (1988) rescaling bootstrap, which is a generalized extension of McCarthy and Snowden’s (1985) “with replacement bootstrap”. Once the PSUs have been sampled, we continue with the weighted FPBB approach to complete the synthetic population data generation. The proposed procedure is as follows:

  1. Select a sample of nh=nh-1 PSUs from the parent sample in each stratum via SRSWR sampling;
  2. Apply the “ultimate cluster principle” (Wolter 2007), that is, once a PSU is taken into the bootstrap replicate, all elements in that PSU are taken into the replicate also. Thus we obtain our first bootstrap sample;
  3. Repeat the previous steps L times to obtain L bootstrap samples {Boot_l, l = 1, …,L};
  4. Within each bootstrap sample, update the element-level sampling weights as:
    whij=whij×(τhinhnh)={=nhnh-1whij,iftheithPSUselectedinthebootstrapsample=0,iftheithPSUnotselectedinthebootstrapsample
    As whij itself implicitly carries over the strata and PSU information in addition to unequal inclusion probability, we can drop the subscripts hi henceforth by pooling all elements in the bootstrap sample regardless of which stratum and PSU they originally came from. Normalize wj s to sum up to m:j=1mwj=m, where m * is the bootstrap sample size.
  5. For the l th bootstrap sample, l = 1, …,L, apply the weighted FPBB algorithm to create an entire population D=(Dns,Ds) based on the posterior predictive distribution of elements in the nonsampled population Dns = {(Yj, Xj, Zj, Rj), j = m* + 1, …, M } given the elements in the bootstrap sample Ds={(Yj,Xj,Zj,Rj),j=1,,m}.
    Operationally, we draw a Polya sample of size M * = Mm* from mult (M *; λ1, …, λK) where the selection probability λk, k = 1, …, K is a function of wj:
    λk=wj-1+1j,k-1×(Mm)M+(k-1)×(Mm),k=1,,M+1,
    (10)

Repeat Step (v) for B times to obtain L × B FPBB populations.

2.2. Imputation of the Synthesized Populations

Once the set of FPBB synthetic populations Psyn={P(b)(l),l=1,,L,b=1,,B}, where P(b)(l)=(Y(b)mis(l),P(b)obs(l)) are created using either the SYN1 method or the SYN2 method, we generate imputations Pimp={P(ba)(l),l=1,,L,b=1,,B,a=1,,A} from the posterior predictive distribution p(Y(b)mis(l)Y(b)obs(l)) based on a parametric model that does not condition on sample design features, that is, a model taking a form similar to the SRS model given in Subsection 2.1. We consider imputations based on the covariate (X) only (SYN1_srs or SYN2_srs) or imputations that include the log of the sample weights in the linear predictors (SYN1_lwt or SYN2_lwt).

To obtain the MI inference, denote the observed set of synthetic populations by PR={P(b)obs(l),b=1,,B,l=1,,L} and the imputed set of synthetic populations by PR¯={Y(ba)mis(l),l=1,,L,b=1,,B,a=1,,A}. The MI point estimator for the population statistic of interest Q (mean, regression estimator, quantile) is then given by the mean of the lba th point estimators:

Q^MI=1LBAlbbQ^lba.
(11)

The MI variance estimator is:

V^MI=(1+L-1)VL=(1+L-1)1L-1l(Q^l-Q^MI)2,whereQ^l=1BAbaQ^lba.
(12)

We then construct the 95% interval estimate for quantiles based on t reference distribution with degrees of freedom equal to min {vcom = Σh nhH, vsyn = L − 1. These results arise from the fact that, by the standard Rubin (1987) MI combining rules, we have

QPimp.tL-1(Q¯L,(1+L-1)VL),
(13)

where Q¯L=1LlQ(l),VL=1L-1l(Q(l)-Q¯L)2, and Q(l)=limBA1BAbaQ^LBA. Replacing Q(l) with its finite simulation estimator Ql replaces QL with QMI and gives the results above. A complete theoretical justification for (13) is provided in Dong et al. (2014) and Zhou (2014). Some intuition of the result can be gained by noting that the generation of the synthetic population sets the within imputation variance to 0 so that the posterior variance of Q can be obtained using the between-bootstrap variance only. Moreover, (11) assumes that E(qba) = Q – a result guaranteed by our Bayesian bootstrap estimator if the imputation model is also correct – as well as a sufficiently large sample size for the t approximation is reasonable.

Lo (1988) showed that the variance estimator for the FPBB mean in a simple random sample setting should be inflated by the factor ( n+1n-1). In the double-weighted FPBB (SYN1) setting, a small sample correction to the variance estimate thus needs to be used when the number of clusters per stratum is small. When nh = a is a constant across all strata, we use nh+1nh-1(1+L-1)VL; otherwise we suggest n¯h+1n¯h-1(1+L-1)VL, where [n with macron]h = H−1 Σhnh.

The Appendix provides the sample R code used to conduct the analyses in the application in Section 4 and can easily be adapted to other settings.

3. Simulation Study

We conducted a simulation study to investigate the performance of the proposed method for incorporating stratified cluster-sampling effects in multiple imputation. We targeted three population statistics: 1) population quantiles, 2) proportions of binary event data, and 3) logistic regression parameters relating the covariate to the binary data. The simulation is a 2 × 2 factorial design based on the following factors:

  1. keeping the first-stage sampling plan constant, we let the subsampling rate f2 of elements within sampled clusters be
    1. independent of or
    2. dependent on the stratum effects, and
  2. assume that
    1. the missingness on the Y-variable (continuous or binary) depends only on the covariate (X) (MAR_X), or
    2. depends on both X and the final sampling weight W(MAR_X,W).

We focus on a two-PSU-per-stratum sample design, both because it is a common design, especially in public-use settings, and because it is a “limiting case” in terms of the number of PSUs per stratum. In addition to the two variants of our synthetic MI estimators, we consider standard parametric MI under the SRS, appropriate fixed-effect (FX_APR), and appropriate random-effect (RE_APR) models.

3.1. Data Generation

Let i be the index for strata, j be the index for clusters, and k be the index for elements. Suppose there are 50 strata in the population. First, the number of PSUs in each stratum is randomly determined according to a uniform distribution, that is, Ci ~ Unif(2,54), i = 1, …, 50; second, the number of population elements within PSUs is randomly generated as Nij ~ Unif(20,80), i = 1, …,50, j = 1, …,Ci. Thus we obtain a population of size N = 67385. The complete data for four survey variables Y = (Y1,Y2,Y3,Y4)T are generated from a superpopulation model according to a two-step process, In the first step, Y1 and Y2 are randomly selected from a bivariate linear mixed-effects model; let N2(·) denote a bivariate normal distribution function:

(Y1ijkY2ijk)~N2(μ,),whereμ=[β1+Si+u1ij+ε1ijkβ2+u2ij+ε2ijk],=[σ11σ12σ12σ22].
(14)

Let β1 = β2 = 15 be the fixed covariate effects, Si=i5 be the fixed stratum effects, and let [u1iju2ij]T and [ε1ijkε2ijk]T be the random cluster effects and random error terms drawn from two independent bivariate normal distributions: N2(0,Σu) and N2(0,Σε). Elements of Σu are set as: σu12=4,σu22=1, σu1u2 = 0.2, and elements of Σε are set as: σε12=4,σε22=3, σε1ε2 = 1.732. This results in conditional intraclass correlations (ICC) of Y1 and Y2 as ρY1= 0.5 and ρY2= 0.25 (note that the unconditional ICC for the two variables may be smaller than these values). In the second step, a random-effects logistic regression model (Anderson and Aitkin 1985; Stiratelli, et al. 1984) is used to simulate two binary outcome variables Y3 and Y4 as a function of Y2. Under this model, a random effect is added to the linear part of the logistic regression model for each element in the cluster. The conditional mean of Y3ijk and Y4ijk is

πijk=E(Y.ijkY2ijk,u.ij)=Pr(Y.ijk=1Y2ijk,u.ij)=eα0+α1Si+α2Y2ijk+u.ij1+eα0+α1Si+α2Y2ijk+u.ij,
(15)

where u3ij ~ N(0,62), u4ij ~ N(0,102) and α = (α0,α1,α2)T is the vector of fixed covariate effects. We fix α2 = 1.5 and vary α0 and α1 to obtain two different binary variables Y3ijk and Y4ijk, with either moderate (α0 = −5,α1 = −1.5) or rare probabilities (α0 = −8,α1 = −6). Given u3ij, the Y3ijk s in the cluster are independent Bernoulli variables, that is, Y3ijk|u3ij ~ Bern(πijk).

Figure 1 shows the correlations between variables in the simulated population, with the different shades of grey representing different degrees of association between any of the two variables. The darker shades indicate higher correlation. All survey outcome variables (Y1,Y3,Y4) have a moderate to strong (0.2~0.8) stratum effect (H or strID) and clustering effect (U1,U3,U4), indicating that accounting for these effects in the analysis of missing data is essential.

Fig. 1
Correlation between variables in the simulated population (darker shades = higher correlation)

3.2. Sample Design

Within each stratum, we draw a two-stage cluster sample according to the following procedure: first, we draw a sample of two PSUs without replacement with probability proportional to the cluster size f1ij=2NijjNij. Second, we sample elements from each sampled cluster using two different subsampling schemes:

  1. sampling probability independent of Si which is defined in (14): SRS with an equal sampling fraction of f2k|ij = 1/5; and
  2. sampling probability related to Si: SRS with varying sampling fractions across strata, that is f2k|ij = expit(−0.8 − 0.12*Si), where expit(x) = 1/(1 + e−1(x)).

An average of 1,122 elements are selected in each of the 200 simulation replications. The distributions of sampling weights are shown in Figure 2. The distributions of sampling weights under the two subsampling schemes are generally very similar with somewhat more skewness under subsampling scheme 2.

Fig. 2
Distribution of weights under the two subsampling schemes

3.3. Imposing Missingness

Throughout the simulation study, we assume that Y2 is always completely observed and we impose missing values on Y1, Y3, and Y4 independently according to the following deletion function conditional on Y2 and/or log transformation of the weight:

Pr(R=0Y2,W)=exp(λ0+λ1Y2+λ2log(W))1+exp(λ0+λ1Y2+λ2log(W)),
(16)

where R is the response indicator and W is the overall sample weight. Setting λ2 = 0, we obtain the first MAR mechanism (i.e., MAR_X, note that we treat Y2 as a covariate X here), under which we further set λ0 = 3.42, λ1 = −0.2 and λ0 = −2.58, λ1 = 0.2 for deleting values on Y1 and Y3, Y4, respectively. Setting λ2 = −0.6, we obtain the second MAR mechanism (i.e., MAR_X,W), under which we fix λ1 = 0.2 and set two values on λ0 (= −0.274 or −0.33) for deleting values independently on all three outcome variables under subsampling scheme 1 and subsampling scheme 2, respectively. All deletion functions result in approximately 40% missingness on each variable.

3.4. Parametric Multiple Imputation

Both simple random sample SRS (including SRS, SYN1_srs and SYN2_srs) and fixed-effects model FX_APR can be implemented in R (R Core Team 2013) using mice routines; for the logistic model associated with the binary outcome, the method ‘logreg’ must be specified. We use the pan package in R for the mixed-effects imputation (RE_APR) for the missing continuous outcome; logistic mixed-effects imputation is programmed in SAS for the missing binary outcome, as there is no missing-data software package readily available for use.

3.5. Parameters of Interest and Inference

We focus on inference for the following population parameters: the mean of the continuous variable Y1, the mean of the binary variables Y3 and Y4 (i.e., Bernoulli proportions), linear regression coefficients of Y1 on Y2, logistic regression coefficients of Y3 (or Y4) on Y2, and the population percentiles of the continuous variable Y1.

Weighted analyses and sandwich variance estimators accounting for strata and clusters are used to estimate smooth statistics (including proportions and regression parameters) under the three fully parametric MI methods. For estimating quantiles of the distribution of a continuous survey variable, we construct the sample-weighted point estimator with confidence intervals based on the test-inversion method (Francisco and Fuller 1991). We chose the test-inversion method instead of Woodruff’s method (Woodruff 1952) despite the computational intensity, because the literature suggests that it may outperform Woodruff in heavily stratified samples or in small-to-moderate-sized samples (Kovar et al. 1988). Based on the ath imputed dataset, the empirical distribution function can be written as

F^(a)(y)=[SRwhijI(yhijobs<y)+SR¯whijI(yhij(a)<y)]Swhij,
(17)

where SR and SR are subsets of the sample data S, consisting of respondents and nonrespondents respectively. The estimator F(y) and its associated estimated variance v(F(y)) can then be obtained using the variance estimator proposed by Francisco and Fuller (1991) together with standard Rubin combining rules as previously described. The sample γth quantile estimator thus is qγ = (F)−1(γ), with 95% asymptotic confidence interval (CI) given by

[L,U]=[[F^]-1(γ-t0.025var(F^(qγ))),[F^]-1(γ-t0.025var(F^(qγ)))].
(18)

3.6. Results

Table 1 compares the average width × 10−2 and average coverage rates of the 95% CI of q(α), where α = 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.95, corresponding to seven selected population quantiles. Among all methods considered, the SRS imputation model yields the poorest coverage. This results from the compounding effects of biases and variance underestimation, due to ignoring stratum effects and clustering effects respectively. As we increase the dependence of both the sampling mechanism and response mechanism on stratum effects and sampling weights, the performance of SRS becomes even worse, as exhibited by the markedly increased RelBias and decreased coverage rates. In addition, ignoring stratum and/or weight effects that are highly relevant to either mechanism seems to impact the median and second and third quartiles more than the tail quantiles under SRS, as evident in the relatively lower coverage rates in the right part of Table 1.

Table 1
Comparison of average width × 10−2 and 95% CI coverage rates of q(α) for α = 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, and 0.95.

The FX_APR model (Reiter et al. 2006; Rubin 1996; Schenker et al. 2006), generally performs fairly well in our simulation study with respect to the estimation of population quantiles. There is some modest underestimation of the small percentile quartiles with the second-stage sampling constant. The RE_APR model also performs well, with the exception of moderate to high overcoverage when the second-stage sampling probability is associated with the stratum mean and the missingness mechanism.

In contrast, our synthetic MI (SYN2 in particular) compares favorably with all of its competitors, and in most cases yields results comparable to the RE_APR, which is regarded as a “gold standard” as it is compatible with the data-generating mechanism (Meng 1994). There is some undercoverage when the stratified double-weighted FPBB estimator (SYN1) is used, perhaps due to the fact that the Lo small-sample adjustment is not as accurate when nh = 2. However, use of a stratified bootstrap-weighted FPBB estimator (SYN2) generally eliminates this issue. Although an imputation model assuming SRS suffices for the synthetic MI method in most scenarios, we need to include the sampling weight as a predictor when the outcome Y and the response indicator R are strongly associated with each other through the sampling mechanism I, as is the case with the second subsampling scheme, when both the missingness indicator and the second-stage sampling rate are functions of the stratum mean.

Tables 2 and and33 compare the absolute relative bias relbias=100×|θ^-θcomplete|θcomplete%, RMSE and 95% nominal CI coverage for the estimated mean/proportions of Y1, Y3 and Y4 and the slopes of the three outcome variables on Y2, respectively. (θcomplete is the estimated parameter with complete data, and [theta w/ hat] is the estimated parameter under one of the different MI methods.) As in the estimation of the quantiles, the SRS imputation model is biased and has poor coverage as it ignores stratum and cluster effects. Again, dependence of subsampling on stratum effects and dependence of response on sampling weights damage the performance of SRS even further.

Table 2
Comparison of RelBias, RMSE and 95% CI coverage rates for the mean of Y1 and proportions of Y3 and Y4, Population true value: Y1 = 20.4, PY3 = 0.608, PY4 = 0.117
Table 3
Comparison of RelBias, RMSE and 95% CI coverage rates for the regression coefficients of Y1, Y3 and Y4 on Y2, Population true value: β1,Y1|Y2 = 0.488, β1,Y3|Y2 = 0.227, β1,Y4|Y2 = 0.083

FX_APR generally performs well in estimating the mean of a continuous variable (Y1) and a regular binary variable (Y3) with moderate probability as well as the slopes. However, it fails for proportion estimation for rare events data (Y4), yielding biased point estimates and less than nominal coverage throughout all scenarios. One interpretation might be that overfitting occurs when too many dummies are included to account for fixed strata and cluster effects, yielding dummy variables where all observed cases are 0 or 1. In this case, “complete separation” yields unstable coefficient estimates, damaging the predictive efficacy when the fitted model is used for drawing missing values. The problem is particularly prominent when the logistic fixed-effects imputation model is used along with the current sampling design, where an average of only ten elements are selected per PSU within each stratum; this results in even more substantial biases on Y4 than the SRS model. (Use of a Bayesian approach with an informative prior of the form t1(0,2.5) on the fixed-effect parameters using the mi function in R (Gelman et al. 2008) reduced but did not remove the impact of complete separation. A relative bias of 12–13% remained for the estimation of of Y4 under the MAR_X missingness mechanism, with 95% nominal coverage of 89%, while a relative bias of 17–22% remained under the MAR_X,W mechanism, with nominal coverage of 84%.) The random-effects model RE_APR more effectively avoids the overfitting issue through shrinkage effects: note that under RE_APR, we pooled all PSUs from all strata as if there were no strata bounds, and the stratum effects can be thought as being implicitly modeled in the random intercept term (uj = Ih + uh(j)).

As in the quantile estimation setting, our synthetic MI compares favorably with all of its competitors, and in most cases yields comparable results to the RE_APR for estimation of means and logistic regression parameters. In the case of rare events data, our proposed new method increases the analytical size through generating synthetic population data thus is even superior to RE_APR, consistently yielding negligible biases and close to nominal coverage. The impact of ignoring the weights in the imputation (under MAR_X,W mechanism) is less than in the quantile estimation setting, with the exception of the estimation of the continuous mean Y1, where including the weight is required to obtain approximately correct coverage.

A disadvantage of the method lies in its relative inefficiency for estimating nonlinear parameters (regression coefficients) (e.g., the synthetic MI results in unbiased point estimates but a larger RMSE than the two model-based MI methods). This is typical in that nonparametric methods cannot typically compete with their fully parametric counterparts under the correct model, and is a tradeoff made to improve robustness to model misspecification.

4. Application to NHANES III

We apply our method to the National Health and Nutrition Examination Survey (NHANES) III (1988–1994), which is designed to provide national estimates of the health and nutritional status of the civilian noninstitutionalized population of the United States aged two months and older (National Center for Health Statistics 1996). The data are obtained from a stratified, multistage area probability sampling design with oversampling of certain age and ethnicity groups. For confidentiality and computational reasons, the public-use data provides two pseudo-PSUs per stratum. Another unique feature of NHANES is that data are collected through both interview and actual physical examinations of the sampled persons. Both unit- and item-level nonresponse occurs in both components of the survey, and there is a particularly high missing rate on the body mass index (BMI) measure for youth data in the physical examination component (30%). As a popular measure of overweight status and obesity, the percentiles of BMI for children and youths are of particular interest for public health reasons. The upper percentiles and the lower percentiles are also closely monitored for overweight and underweight status, respectively. As a result, we restrict our analysis sample to children and youths from two months to 16 years of age. The Appendix provides the sample R code used to conduct the analyses below.

We estimate population quantiles (from 0.05 to 0.95 with an increment of 0.05 along with two extreme percentiles: 0.03 and 0.97) of BMI for children and youths by gender. We also estimate the proportion of such a population being covered by health insurance, overall and by race. To assure congenial inference, we include the following variables that are either of primary interest in the substantive analysis or are important predictors for BMI measures in the imputation model: age, gender, race, education, mother’s BMI, father’s BMI and family income (Yuan and Little 2007). We compared three different methods in our treatment of the missing data:

  1. complete case analysis (CC) with design-based estimation;
  2. fully parametric model-based MI using design-based estimation, within which we apply both an imputation model assuming SRS and the appropriate model conditional on all three sample design features (i.e., dummy variables indicating cluster and stratum memberships as well as the log transformation of sampling weights); and
  3. our proposed finite population Bayesian bootstrap method (using SYN2 since we do not have separate weights for the first and second stages of sampling), and including the log of the weight in the imputation model.

Estimates of the median BMI and the proportion of children with health insurance are given in Table 4. The CC method appears to overestimate the median of both the BMI measure and health-insurance coverage for the full sample and race domains relative to the MI approaches, and yields the widest confidence intervals or largest standard errors as a result of decreased sample size. Then again, the median of BMI obtained from synthetic MI is quite similar to that from the model-based MI, while demonstrating some advantages in efficiency by yielding shorter intervals. The generally lower health-insurance coverage estimates under the synthetic MI relative to model-based MI might be attributable to the fact that the synthetic MI are able to capture certain interactions between the sample design variables and the regular covariate matrix which are not explicitly modeled in the fully model-based MI.

Table 4
Alternative methods in estimating the median of BMI and the health-insurance coverage rate, for full sample and by gender and race, respectively

Figure 3 displays a visual comparison of the percentile estimation for the three methods under consideration. We look at how those methods perform in three different percentile ranges by gender domains: the middle percentiles from 0.5 to 0.75, the upper percentiles from 0.90 to 0.97 and the lower percentiles from 0.03 to 0.1. We chose these percentile ranges because the extreme lower and upper percentiles of BMI are typically used to monitor under- and overweight for children and youths, and there is evidence that gender difference exists in these BMI percentile ranges (particularly when age is considered, i.e., growth patterns in BMI). In general, both MI methods result in very similar BMI estimates, and they are lower than those obtained from CC analysis. This makes sense since our comparison of the distributions of age for complete cases and for missing cases on the BMI measure revealed that younger children are more susceptible to missingness, and therefore CC analysis tends to overestimate BMI by excluding those younger missing cases. The inclusion of the age variable as a predictor in the imputation model corrects such an overestimation. The magnitude of this correction for boys is bigger than that for girls in estimating the lower percentiles (0.03, 0.05). When examining a report on BMI-for-age percentiles by gender released by the Center for Disease Control and Prevention (http://www.cdc.gov/nchs/data/series/sr_11/sr11_246.pdf), we find that baby boys (corresponding to the lower quantiles here) have a relatively higher BMI, which might be at least part of the explanation.

Fig. 3
Comparison of methods for quantile estimation of BMI, by gender

5. Discussion

While multiple imputation has become a popular option for the analysis of missing data, some issues remain unresolved in its practical application to complex sample survey data. The complex features of sampling compounded with nonresponse in survey data often result in a rather complicated data structure, which prevents the straightforward application of the standard MI techniques (such as a multivariate normal model assuming simple random sampling). In this article, we develop a general-purpose approach to account for various design features in a highly stratified two-stage sample using a two-step synthetic MI framework. We have focused on evaluating the performance of the new method compared with existing methods with respect to several missing-data issues frequently encountered in large population-based socioeconomic and epidemiological studies. These include: i) accommodating stratification and multistage sampling in the imputation process; ii) the employment of nonstandard or non-normal imputation models for estimating probabilities of rare events; and iii) the estimation of population quantiles with multiply imputed data. (For examples that consider alternative sample designs, such as independent unequal probability of selection designs, or cluster and weighted designs without stratification, as well as estimators of quantities such as means and linear regression parameters, see Zhou (2014).

Although multiple imputation is technically valid only for maximum-likelihood estimates (Kim et al. 2006), we demonstrate that the coverage properties of the proposed method are fairly good for nonsmooth statistics. Specifically, our stratified variations of the weighted Polya posterior exhibits robustness to the loss function for estimating the upper and lower tails of the distribution function where even the appropriate model-based method (i.e., FX_APR) fails. In contrast with existing fully parametric MI methods, most of which perform poorly when applied to rare outcome binary data, the proposed method yields quite stable parameter estimates regardless of the rarity of the outcome. An alternative approach for MI estimation of quantiles that relies on estimating the CDF using a smooth regression curve is given by Wei et al. (2012), and could be used at the second-stage imputation step after the weighted finite population Bayesian bootstrap has been implemented.

It is worth stressing that our method requires only the most straightforward form of imputation modeling and combining rules for inference. This is because the effects of the complex sample design and the effect of estimating the nuisance parameters in imputation (e.g., regression parameters when the main quantity of interest is a quantile of Y) are both correctly reflected in the replication variance estimation given the design-reversed and multiply imputed synthetic populations. Any higher-level and nonlinear interactions in the covariate data, including those with the weights, clusters, or strata, will automatically be captured in the synthesizing step. However, when the imputation is conducted parametrically, as it is here, such design-variable interactions will still need to be considered if they are associated with the missingness mechanism, although the impact of misspecification will generally be attenuated. Similarly, not-missing-at-random mechanisms that are dependent on the missing values are not accommodated in this framework. Finally, we note that assuming SRS for imputation results in correct inference only at the population level: correct inference for domain estimation requires that the domains be included in the imputation model. For example, if variables X and Y are positively correlated in stratum A but negatively correlated in stratum B, this interaction will be correctly averaged over for the population inference using weighted FPBB, but if this interaction is of direct interest, it will be attenuated unless incorporated in the imputation model for the synthetic population. Further, imputing under SRS does not absolve the imputer from correctly modeling the data. To give a trivial example, assume data are sampled from two strata denoted by Z = {1,2}, where P(Z = 1) = P(Z = 2) = .5 in the population, and Y|Z = 1 ~ N(5,1) and Y|Z = 2 ~ N(−5,1), and stratum 1 is oversampled with P(I|Z = 1) [proportional, variant] .8 The method proposed here will correct the imbalance between the strata, and assuming a two-component normal mixture model will allow imputations of Y that maintain the correct marginal distribution of Y with equal-sized components. This will allow for correct estimation of percentiles, whereas simply assuming a unimodal normal distribution will only consistently estimate the mean. Correct estimation of percentiles within the strata will require also conditioning on the strata, as mentioned above. We note that one advantage of the proposed method is that, with design issues cleared out of the way, more focus can be given to developing missing-data models.

We also note that the method developed here does not allow for the release of a small number of multiply imputed datasets to be combined using the standard Rubin rules. It would be possible to publically release all L × B × A multiply imputed datasets to be analyzed using the methods developed here, although this would typically involve hundreds to thousands of datasets. Methods to allow a more modest release, with minimal impact on inference, are a topic for future research.

Future research will investigate the inferential properties of the proposed method in situations where auxiliary information on all population units is available, using a constrained version of the Polya posterior. Two other possible research directions include: (i) extending the two-step synthetic MI framework to deal with unit nonresponse problems, and (ii) extending it to deal with generating synthetic data for disclosure risk limitation.

Acknowledgments

This work was supported in part by Grant Number R01CA129101 from the National Cancer Institute. The authors would like to thank Rod Little, Brady West, and Richard Valliant, along with the Associate Editor and three reviewers, for their review and helpful comments.

Appendix: R Code for Using the Proposed Two-step MI Method on NHANES III

require(survey)
require(mice)
require(polyapost)
set.seed(seed #)
syn_bmi < -function(dt, N, Bt1, Bt2, Mt){
##Step 1: Generate synthetic populations with missing data;
#Stage 1: Create bootstrap samples from the parent sample;
              dsgn <- svydesign(ids = ~ predcl, strata = ~ pstrat, nest = TRUE, data = dat, weights = ~ predwt)
              dsgn.RW<-as.svrepdesign(design = dsgn, type = “subbootstrap”, replicates = Bt1)
              dim(dsgn.RW$repweights)
              repwt<-as.matrix(dsgn.RW$repweights)
              repwt[repwt = =0]<-NA
              dim(repwt)
              #set up arrays to hold point estimates from bootstrap samples;
              btm<-matrix(0,nrow = Bt1,ncol = 3)
              btqt<-matrix(0,nrow = Bt1,ncol = 21)
              btqtm<-matrix(0,nrow = Bt1,ncol = 21)
              btqtf<-matrix(0,nrow = Bt1,ncol = 21)
              for (j in 1:Bt1){
                          st.bb < -cbind(dat,repwt[,j])
                          #delete those units with zero weights for each bootstrap sample;
                          st.BB < -na.omit(st.bb)
                          #recode those 999 back to NA so that the mice package can be used for imputation;
                          st.BB$pybmi[st.BB$pybmi = = 999] < -NA
                          #need to calculate the replicate weights;
                          Samwt < -st.BB[,9]*st.BB[,13]
                          #normalize again the adjusted weights;
                          Samwts < -Samwt*N/sum(Samwt)
                          np < -nrow(st.BB)
                          ids < -seq(np)
                          ns < -N-np
##Stage 2: Create unweighted synthetic populations within each bootstrap sample;
#Set up arrays to hold point estimates from imputed unweighted synthetic populations;
            fbm < -matrix(0,nrow = Bt2,ncol = 3)
            fbqt < -matrix(0,nrow = Bt2,ncol = 21)
            fbqtm < -matrix(0,nrow = Bt2,ncol = 21)
            fbqtf < -matrix(0,nrow = Bt2,ncol = 21)
            for(boott in 1:Bt2){
                        l < -vector()
                        smp < -wtpolyap(ids, Samwts, ns)
                        #input the adjusted weights in the weighted Polya sampling algorithm;
                        for (k in 1:np){
                        l < -c(l,length(smp[smp = = k]))
                        }
                 #check if the vector of l sums up to the number of synthetic population size;
                 sum(l);
                 predY1 < -c(rep(st.BB[,1],l)) #bmi
                 predY2 < -c(rep(st.BB[,2],l)) #race
                 predY3 < -c(rep(st.BB[,3],l)) #gender
                 predY4 < -c(rep(st.BB[,4],l)) #income
                 predY5 < -c(rep(st.BB[,5],l)) #education
                 predY6 < -c(rep(st.BB[,6],l)) #mother’s bmi
                 predY7 < -c(rep(st.BB[,7],l)) #father’s bmi
                 predY8 < -c(rep(st.BB[,8],l)) #age
                 predwt1 < -c(rep(st.BB[,9],l))
                 predlwt < -log(predwt1) #log of sample weight
                 predCID < -c(rep(st.BB[,12],l)) #cluster ID
                 predSTID < -c(rep(st.BB[,11],l)) #stratum ID
##Step 2: Multiple imputation of the unweighted synthetic populations;
            #use the imputation model including log of weight as a predictor (syn_lwt);
            temp1 < -data.frame(cbind(predY1, predY2, predY3, predY4, predY5, predY6, predY7, predY8, predlwt))
            temp1_imp < -mice(temp1,method = “norm”, m = Mt)
            ml < -complete(temp1_imp, ‘long’)
            ml$bmit < -exp(ml$predY1) #back transform bmi to its normal scale
            mlmale < -subset(ml, predY3 = = 1)
            mlfem < -subset(ml, predY3 = = 2)
            multm < -cbind(as.vector(by(ml$bmit,ml$.imp,mean)),
            as.vector(by(mlmale$bmit,mlmale$.imp,mean)),
            as.vector(by(mlfem$bmit,mlfem$.imp,mean)))
            multqt < -sapply(with(ml,by(ml,.imp,function(x)quantile(x$bmit, c(0.03,seq(0.05,0.95,0.05),0.97)))),as.vector)
            multqtm < -sapply(with(mlmale,by(mlmale,.imp,function(x)quantile(x$bmit, c(0.03,seq(0.05,0.95,0.05),0.97)))),as.vector)
            multqtf < -sapply(with(mlfem,by(mlfem,.imp,function(x)quantile(x$bmit, c(0.03,seq(0.05,0.95,0.05),0.97)))),as.vector)
                  fbm[boott,] < -t(apply(multm,2,mean))
                  fbqt[boott,] < -t(apply(multqt,1,mean))
                  fbqtm[boott,] < -t(apply(multqtm,1,mean))
                  fbqtf[boott,] < -t(apply(multqtf,1,mean))
                  print(boott)
                  }
            btm[j,] < -t(apply(fbm,2,mean))
            btqt[j,] < -t(apply(fbqt,2,mean))
            btqtm[j,] < -t(apply(fbqtm,2,mean))
            btqtf[j,] < -t(apply(fbqtf,2,mean))
            print(j)
            }
            smpm < -apply(btm,2,mean)
            smpv < -(1) 1/Bt1)*apply(btm,2,var)
            smpse < -sqrt(smpv)
            smpqt < -apply(btqt,2,mean)
            smpqtv < -(1) 1/Bt1)*apply(btqt,2,var)
            smpqtse < -sqrt(smpqtv)
            smpqtm < - apply(btqtm,2,mean)
            smpqtvm < -(1) 1/Bt1)*apply(btqtm,2,var)
            smpqtsem < -sqrt(smpqtvm)
            smpqtf < -apply(btqtf,2,mean)
            smpqtvf < -(1) 1/Bt1)*apply(btqtf,2,var)
            smpqtsef < -sqrt(smpqtvf)
tt < -cbind(smpqt,smpqtm,smpqtf,smpqtse,smpqtsem,smpqtsef)
ss < -cbind(smpm,smpse)
write.table(tt,file = “D:/Dissertation/paper3/nhanes/synbmiqt_lwt.csv”,row. names = FALSE,sep = “,”)
write.table(ss,file = “D:/Dissertation/paper3/nhanes/synbmimn_lwt.csv”, row.names = FALSE,sep = “,”)
}
##Example##
syn_bmi(dt = dt, N = 100000, Bt1 = 50, Bt2 = 5, Mt = 5)
dt < -read.csv(“D:/Dissertation/paper3/nhanes/synbmi.csv”)
#Set the synthetic population size about 10 times the sample size;
N < -100000
#Normalize the weights to sum up to the assumed synthetic population size;
dt[,“predwt”] < -dt[,“predwt”]*N/sum(dt[,“predwt”])
sum(dt$predwt)
#Recode the missing values to 999;
dat[is.na(dat)] < -999

References

  • Anderson D, Aitkin M. Variance Component Models With Binary Response: Interviewer Variability. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 1985;47:203–210.
  • Cohen MP. Proceedings of the Section on Survey Research Methods. American Statistical Association (ASA); Anaheim, CA: 1997. The Bayesian Bootstrap and Multiple Imputation for Unequal Probability Sample Designs; pp. 635–638. 1997.
  • Dong Q, Elliott MR, Raghunathan TE. A Nonparametric Method to Generate Synthetic Populations to Adjust for Complex Sample Design. Survey Methodology. 2014;40:29–46.
  • Efron B. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics. 1979;7:1–26.
  • Francisco CA, Fuller WA. Quantile Estimation With a Complex Survey Design. Annals of Statististics. 1991;19:454–469.
  • Kim JK, Brick MJ, Fuller WA, Kalton G. On the Bias of the Multiple-Imputation Variance Estimator in Survey Sampling. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 2006;68:509–521. http://dx.doi.org/10.1111/j.1467-9868.2006.00546.x.
  • King G, Zeng L. Logistic Regression in Rare Events Data. Political Analysis. 2001;9:137–163.
  • Kovar JG, Rao JNK, Wu CFJ. Bootstrap and Other Methods to Measure Errors in Survey Estimates. Canadian Journal of Statistics. 1988;16:25–45.
  • Little RJ, Rubin DB. Statistical Analysis with Missing Data. 2. New York: Wiley and Sons, New York; 2002.
  • Little RJ, Zheng H. The Bayesian Approach to the Analysis of Finite Population Surveys. Bayesian Statistics. 2007;8:283–302.
  • Lo AY. A Bayesian Bootstrap for a Finite Population. The Annals of Statistics. 1988;16:1684–1695.
  • McCarthy PJ, Snowden CB. The Bootstrap and Finite Population Sampling. Vital and Health Statistics. U.S. Government Printing Office; Washington: 1985. Data Evaluation and Methods Research, Series 2, No. 95. Public Health Service Publication 85–1369. [PubMed]
  • Meng XL. Multiple Imputation Inferences With Uncongenial Sources of Input. Statistical Science. 1994;9:538–558. http://dx.doi.org/10.1214/ss/1177010269.
  • National Center for Health Statistics. Analytic And Reporting Guidelines: The Third National Health and Nutrition Examination Survey, NHANES III (1988–94) National Center for Health Statistics, Centers for Disease Control and Prevention; Hyattsville, Maryland: 1996. [accessed May 22, 2014]. Available at: http://www.cdc.gov/nchs/data/nhanes/nhanes3/nh3gui.pdf.
  • Rao JNK, Wu CFJ. Resampling Inference With Complex Survey Data. Journal of the American Statistical Association. 1988;83:231–241. http://dx.doi.org/10.2307/2288945.
  • Rao JNKCF, Wu J, Yue K. Some Recent Work on Resampling Methods for Complex Surveys. Survey Methodology. 1992;18:209–217.
  • Reiter JP, Raghunathan TE, Kinney SK. The Importance of Modeling the Sampling Design in Multiple Imputation for Missing Data. Survey Methodology. 2006;32:143–149.
  • Rubin DB. Inference and Missing Data. Biometrika. 1976;63:581–592.
  • Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 1987.
  • Rubin DB. Multiple Imputation After 18)Years. Journal of the American Statistical Association. 1996;91:473–489. http://dx.doi.org/10.2307/2291635.
  • Rust K, Rao JNK. Variance Estimation for Complex Estimators in Sample Surveys. Statistics in Medical Research. 1996;5:381–397.
  • Schafer JL. Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.
  • Schenker N, Raghunathan TE, Chiu P, Makuc DM, Zhang G, Cohen AJ. Multiple Imputation of Missing Income Data in the National Health Interview Survey. Journal of the American Statistical Association. 2006;101:924–933. http://dx.doi.org/10.1198/016214505000001375.
  • Stiratelli R, Laird N, Ware J. Random-Effects Models for Serial Observations With Binary Response. Biometrics. 1984;40:961–971. http://dx.doi.org/10.2307/2531147. [PubMed]
  • Wei Y, Ma Y, Carroll RJ. Multiple Imputation in Quantile Regression. Biometrika. 2012;99:423–438. http://dx.doi.org/10.1093/biomet/ass007. [PMC free article] [PubMed]
  • Wolter KM. Introduction to Variance Estimation. New York: Springer; 2007.
  • Woodruff R. Confidence Interval for Medians and Other Position Measures. Journal of the American Statistical Association. 1952;47:635–646. http://dx.doi.org/10.1080/01621459.1952.10483443.
  • Yang S, Kim JK, Shin DW. Imputation Methods for Quantile Estimation under Missing at Random. Statistics and Its Interface. 2013;6:369–377.
  • Yuan Y, Little RJ. Parametric and Semiparametric Model-Based Estimates of the Finite Population Mean for Two-Stage Cluster Samples With Item Nonresponse. Biometrics. 2007;63:1172–1180. http://dx.doi.org/10.1111/j.1541-0420.2007.00816.x. [PubMed]
  • Zhao E, Yucel RM. Performance of Sequential Imputation Method in Multilevel Applications. Proceedings of the Section on Survey Research Methods; August; Washington D.C. American Statistical Association ASA; 2009. pp. 2800–2810.
  • Zhou H. Unpublished PhD Thesis. 2014. Accounting for Complex Sample Designs in Multiple Imputation Using the Finite Population Bayesian Bootstrap.