Search tips
Search criteria 


Logo of nihpaAbout Author manuscriptsSubmit a manuscriptHHS Public Access; Author Manuscript; Accepted for publication in peer reviewed journal;
Stat Med. Author manuscript; available in PMC 2012 September 28.
Published in final edited form as:
Stat Med. 2012 September 28; 31(22): 2441–2456.
Published online 2011 November 16. doi:  10.1002/sim.4359
PMCID: PMC3432177

Outcome-dependent sampling for longitudinal binary response data based on a time-varying auxiliary variable


Outcome-dependent sampling (ODS) study designs are commonly implemented with rare diseases or when prospective studies are infeasible. In longitudinal data settings, when a repeatedly measured binary response is rare, an ODS design can be highly efficient for maximizing statistical information subject to resource limitations that prohibit covariate ascertainment of all observations. This manuscript details an ODS design where individual observations are sampled with probabilities determined by an inexpensive, time-varying auxiliary variable that is related but is not equal to the response. With the goal of validly estimating marginal model parameters based on the resulting biased sample, we propose a semi-parametric, sequential offsetted logistic regressions (SOLR) approach. The SOLR strategy first estimates the relationship between the auxiliary variable and the response and covariate data by using an offsetted logistic regression analysis where the offset is used to adjust for the biased design. Results from the auxiliary variable model are then combined with the known or estimated sampling probabilities to formulate a second offset that is used to correct for the biased design in the ultimate target model relating the longitudinal binary response to covariates. Because the target model offset is estimated with SOLR, we detail asymptotic standard error estimates that account for uncertainty associated with the auxiliary variable model. Motivated by an analysis of the BioCycle Study (Gaskins et al., Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition 2009; 90(4): 1061–1069) that aims to describe the relationship between reproductive health (determined by luteinizing hormone levels) and fiber consumption, we examine properties of SOLR estimators and compare them with other common approaches.

Keywords: outcome-dependent sampling, bias sampling, study design, generalized estimating equations, longitudinal data analysis, binary data

1. Introduction

The BioCycle Study [1, 2] examined relationships between hormone levels and measures of oxidative stress over the course of the menstrual cycle. Because of costs associated with exposure and outcome ascertainment (i.e., blood draws are necessary for both), study coordinators were able to draw samples eight times per cycle for each of two cycles per participant. To maximize observed hormonal variation, the BioCycle Study design sought to ascertain samples at key time points. On a standard 28-day cycle, these times correspond to days 2 (during menses), 7 (middle of the follicular phase), 12, 13, and 14 (peri-ovulation), 18 (progesterone elevation), 22 (progesterone peak), and 27 (before menstruation). However, a major challenge to the BioCycle Study was that cycle lengths vary between and within women, so that the use of a fixed-day sampling approach would miss many of the key phases. The peri-ovulation phase is of particular interest since this is the time at which many of the hormone concentrations exhibit rapid change (see Figure 1 of reference [2]). For example, luteinizing hormone (LH) levels remain low for the entire menstrual cycle, except that they surge just before ovulation. Because there is a very short interval during which LH levels are high, BioCycle Study coordinators developed a strategy to improve the likelihood of observing the LH peak. In particular, they provided participants an at-home fertility monitor that required a simple urine sample. Beginning on day 6 of the cycle, the monitor requested 10 consecutive days of testing, and on the day the monitor indicated peak fertility (i.e., an LH surge is likely to be occurring) and the 2 days following, the participant was to visit the study clinic for peri-ovulation blood draws. We consider the results of the at-home urine exam as a surrogate marker for days when LH levels are likely to be high (‘high-risk’ days), and on all other days, LH levels are likely low (‘low-risk’ days).

Our analytical goal was to examine the relationship between high LH (>20 IU/L) versus low LH (≤20 IU/L) levels and dietary fiber intake while adjusting for patient features such as time-fixed demographics and the time-varying energy intake. Gaskins et al. [3] examined the relationship between the incidence of anovulatory cycles and daily fiber intake and found a positive association. We, on the other hand, only considered ovulatory cycles. Although serum was collected on all eight of the key time points, fiber intake was ascertained on one of the three peri-ovulation (high-risk) visit days and on three other (low-risk) cycle days (2, 7, and 22 from a standard cycle). To illucidate the study design, Figure 1 shows observed data from one observed cycle. In this case, the cycle was 30 days long, and LH was measured on days 2, 8, 14, 15, 16, 19, 25, and 29 with the highest value occurring on day 14. This is also the only day on which LH was greater than 20 IU/L. The at-home urine monitor was used on consecutive days from day 6 to day 15. On day 14, the monitor indicated peak fertility, which triggered the peri-ovulation visits occurring on days 14, 15, and 16. Fiber intake data for this cycle were then collected on days 2, 8, and 25 (low-risk visits) and on day 16 (high-risk visit). Since the design sampled with probability approximately 0.33 (i.e., 1/3) high-risk days and with probability approximately 0.12 (i.e., 3/25 on a standard cycle) low-risk days, the sample was biased and over-represented by days with high LH levels. We therefore anticipated that analyses that ignore the design will yield invalid inferences for target parameters. In this manuscript, we describe methods for valid inferences in the presence of outcome-dependent sampling (ODS) designs where sampling depends on a time-varying auxiliary variable that is related but not equal to the response variable.

Figure 1
BioCycle Study design: data collected for one menstrual cycle.

The BioCycle Study used an ODS design to increase the efficiency of parameter estimates that describe exposure–outcome relationships for a given cost of measurement. ODS is ubiquitious in epidemiological research since, in many circumstances, prospective studies are infeasible or are logistically impossible. The most commonly used ODS design is the case-control study [4,5], and from its principle of targetted sampling based on the most informative or interesting features of the response distribution, many other designs have emerged and are commonly used (e.g., the case-cohort design [6] and the case-crossover design [7,8]). Related to the BioCycle Study design, although applied to a univariate outcome, Lee, McMurchy, and Scott [9] discussed analyses under a design where sampling is based on an auxiliary variable (Z) that is related to the response variable (Y) and where interest is in the marginal regression relationship between the response and the covariates X, [E(Y | X)]. Chen and Breslow [10] showed that the Horvitz–Thompson based estimator with sampling probabilities estimated from the data is semi-parametrically efficient for this design. For longitudinal data, Park and Kim [11] and Pfeiffer et al. [12] proposed extensions to the case-cohort design, and other authors [1317] have proposed subject-level sampling-based designs for longitudinal or correlated data. The design discussed here can be thought of as a longitudinal data extension to the design discussed in [9, 10], and we further extend the design to address two-phase, stratified sampling [18,19]. Additionally, rather than sampling at the subject level, we sample observations on the basis of a time-varying auxiliary variable or jointly on an auxiliary variable and a target covariate. Although analyses of longitudinal data in the presence of unplanned outcome-dependent follow-up using inverse probability and intensity weighted approaches has been discussed in, for example, [2023], our analytical approach to the planned observation-level ODS is most similar to the sequential offsetted logistic regression (SOLR) approach of Schildcrout and Rathouz [16] that was applied to subject-level ODS. Briefly, in the SOLR approach, the first regression captures the relationship between the auxiliary variable and the response and covariates by using an offsetted logistic regression with the offset based on the known or estimated sampling probabilities. The target parameters for the regression of the response on covariates are then estimated using a second offsetted logistic regression with the results from the auxiliary variable regression model used to compute the bias-correcting offset term.

This manuscript is organized as follows. Section 2 describes the target population model, the study design that is intended to maintain high efficiency while adhering to the constraints introduced by limited study resources, and the pseudo-population model induced by the design. In Section 3, we discuss implementation of an analytical protocol using SOLR, and in particular we describe asymptotic variance estimators. We evaluate the operating characteristics of the proposed estimator in comparison with two other analysis strategies in Section 4, and in Section 5, we revisit the BioCycle Study. We conclude with a discussion in Section 6.

2. Model and study design

In this section, we discuss the marginal population model that defines the parameters that are the primary inferential targets, and we introduce an ODS study design that can be implemented to reduce study costs while permitting highly efficient parameter estimation. We then describe the conditional model induced by the proposed sampling design which is the mean model for the pseudo-population represented by the outcome-dependent sample (e.g., conditional on being sampled). We show that, although the induced conditional regression differs from the target model, the relationship is mathematically tractable. Finally, we detail valid analysis methods and assumptions that permit estimation of the marginal target parameters based on the biased sample. We also discuss a slightly different sampling approach that more closely characterizes the BioCycle Study design and show the conditions under which the induced pseudo-population model is equivalent to the one induced by our basic design.

Consider a study where we wish to make inference about a target population structure based on the marginal mean from a logistic regression model given by


where i [set membership] {1, 2, …, N} denotes subject, j denotes observation time within subjects, Yij is a binary response variable {I(LHij > 20) in the Biocycle Study}, and Xij is a vector of time-varying and time-invariant covariates. The binary response vector, Yi, is of length ni, the subject-specific design matrix Xi is of dimension ni × p, and Xij is a vector of length p. The within-subject dependence structure {i.e., corr(Yij, Yi k|Xi)} is not itself of interest but, as is well known, if modeled correctly, can lead to improved estimation efficiency for β.

2.1. Study design

When ascertainment costs associated with Yi are high and/or Yij = 1 is a rare event, sampling based on a relatively cheaply and easily ascertained auxiliary variable, Zi, that is related to Yi, can increase response variation between and within individuals and can therefore lead to vastly improved estimation precision. In addition, two-phase, stratified sampling based also on a cheaply and easily ascertained confounding variable can lead to further efficiency improvements for key parameters. Schildcrout and Rathouz [16] discussed a subject-level sampling design where sampling was based on a time-invariant auxiliary variable Zi that was related to all elements of the response vector Yi or jointly on Zi and on a potential confounder, X1i. In contrast, the design proposed here invokes observation-level sampling based on a time-varying auxiliary variable, Zij, or possibly jointly on Zij and a time-varying or time-invariant confounding variable, X1,ij contained in Xij.

Let Sij be an indicator of whether subject i was observed at time j, and let Si be the vector of such indicator variables for subject i over all possible times, j [set membership] {1, … ni }. Under this setup, where sampling is based exclusively on (Zij, X1,ij), Sij [perpendicular](Zi, Yij, Xij)|(Zij, X1,ij). We therefore denote observation-specific sampling probabilities pr (Sij = 1|Zi, Yij, Xij) = pr (Sij = 1|Zij, X1,ij) with π(Zij, X1,ij). With a rare response, sampling probabilities will be relatively high for π(1, X1,ij) and low for π(0, X1,ij), and considerations regarding the choice of X1,ij will depend upon its relationship with the target covariates. In general, X1,ij is used to increase observed variation in target covariates which leads to improved information for regression parameters.

2.2. Pseudo-population model

Whereas the original cohort is representative of the population, the data observed under the proposed study design is representative of a pseudo-population that is compositionally different from the target population. With an effective sampling strategy, the prevalence of the outcome will be higher in the pseudo-population than in the population. That is, pr (Yij = 1|Xij, Sij = 1) > pr (Yij = 1|Xij), and so analyses that ignore the study design, in many scenarios, will be invalid. However, if the design is properly accounted for in analyses, then valid and efficient estimates can result. To show this, we will describe the pseudo-population model that is induced by the ODS design from the population model, and we will demonstrate how it can be calculated from the known sampling probabilities, π(Zij, X1,ij), and from a model for the auxiliary variable, Zij given the data (Yij, Xij).

Let μijspr(Yij=1Xij,Sij=1) be the marginal mean model in the pseudo-population rep-resented by the observed ODS data. Via Bayes’ theorem, we may relate the pseudo-population and population mean models with


where τij (y, Xij) = pr (Sij = 1|Xij, Yij = y). If the value of τij (1, Xij)/τij (0, Xij) is known, we may estimate population model parameters using a logistic regression analysis approach by fitting


to the sampled data (Yij, Xij). However, because with the proposed design, sampling is based on (Zij, X1,ij), the offset term (i.e., the first term on the right-hand side of Equation 3) is not known. It must therefore be estimated using known sampling probabilities and by further fitting a model for the distribution of auxiliary variable Zij. We write the probability of an observation being sampled conditional on the data (Yij, Xij) under the proposed design as a marginalization of the joint probability pr (Sij = 1, Zij |Yij, Xij) over the distribution of the auxiliary variable Zij, that is,


Although π(zij, X1,ij) is known, the auxiliary variable model in the population, λijp(Yij,Xij)pr(Zij=1Yij,Xij) is unknown and must itself be estimated. To estimate λijp(Yij,Xij), we correct for the biased sampling design, and following the same approach that relates the pseudo-population and population means for the target model, we write the pseudo-population auxiliary variable model as a function of the population auxiliary variable model,


where λijs(Yij,Xi)=pr(Zij=1Yij,Xij,Sij=1) is the pseudo-population mean. We may estimate parameters in λijp by fitting an offsetted logistic regression model to the pseudo-population data by using the known log{π(1, Xij)/π(0, Xij)} as the offset. Once λijp has been estimated, the calculation of the offset in (3) is straightforward by noting that


In Section 3, we detail the sequential offsetted logistic strategy that can be used for parameter estimation, and we outline standard error calculations that can acknowledge the added uncertainty associated with estimating the offset term in Equation (6).

2.3. Independent outcome-dependent sampling versus within-subject, dependent outcome-dependent sampling

The design we propose can be referred to as independent ODS because sampling is independent across observations, and sampling probabilities do not depend upon other data arising from the same subject or on whether other observations had been sampled from the same subject. We therefore observe variable numbers of observations from each subject in the original cohort, and depending upon the outcome-dependent sample size, we may sample no observations from some of the subjects. In contrast to independent sampling, in some scenarios, including the BioCycle Study, investigators may wish to sample a set number of observations from each individual. This ‘within-subject’ ODS design can be more challenging to analyze because the probability that observation j from subject i is sampled depends not only on (Zij, X1,ij) but also on other elements zij = {zi k: k [set membership] (1, … ni), kj} in zi. However, we show that if a conditionally independent sampling assumption is plausible, the calculation of τij (y, Xij) in Equation (4) applies to within-subject ODS designs and specifically to the BioCycle Study.

Because observation-specific sampling probabilities under within-subject ODS generally depend upon the entire auxiliary variable vector, Zi, or on some function of it, we write sampling probabilities conditional on the data (Yij, Xij) as


where An external file that holds a picture, illustration, etc.
Object name is nihms377204ig1.jpg(zi) denotes all 2ni possible values for the subject-specific auxiliary variable vector Zi. When ni is large, this calculation can be computationally cumbersome, and it is further complicated because pr (zi |Xij, Yij) must be calculated from the full multivariate distribution [Zi |Xij, Yij]. However, if we assume conditionally independent sampling, that is, pr (Sij = 1 | zij, zij′, X1,ij) = pr (Sij = 1 | zij, X1,ij), then the expression simplifies to the form given in Equation (4), viz,


The conditionally independent sampling assumption will hold approximately under within-subject ODS if the number of observations with Zij = 1 and Zij = 0 does not vary from subject to subject (i.e., Zi is balanced). The extent to which there is variation is the extent to which the assumption is violated. The impact of violation of the assumption is that Sij is no longer independent of (Yij, Xij) given (Zij, X1,ij), which is crucial to unbiased estimation. In the BioCycle Study dataset analyzed here, of the 445 total cycles, Zij = 1 three times in 442 (99%) of them and only two times in the other three. Cycle lengths ranged from 22 to 36 days with an interdecile range of 25 to 32 days. So, whereas Zi is not perfectly balanced in the data, it is only unbalanced to a small degree.

3. Estimation

In this section, we propose a SOLR protocol that permits consistent estimation of target population parameters, β using the ODS strategy described in Section 2.2. We then discuss standard error calculations that appropriately account for the correlated data and the additional uncertainty due to estimation of the offset term in (3).

The SOLR strategy described in succeeding paragraphs makes the independence working covariance weighting assumption, although the extension to covariance weighting is straightforward. In many cases, independence working covariance weighting can lead to efficiency losses for time-varying covariate coefficients when compared with a properly specified working covariance weighting scheme. However, because our study design samples on the basis of an auxiliary variable Zi that is closely related to the response Yi, the correlation structure in the pseudo-population is different from the correlation structure in the target population, and in some cases the correlation can be reduced substantially.

It is also important to note that we do not make any assumptions regarding whether the cross-sectional population and pseudo-population conditional means μijp=pr(YijXij) and μijs=pr(YijXij,Sij=1)\ discussed in Section 2 are equal to their full covariate conditional mean counterparts, pr(Yij |Xi) and pr(Yij |Xi, Si), respectively. When the cross-sectional conditional mean is not equal to the full covariate conditional mean, we may obtain biased results for time-varying covariate coefficients when a non-independence working covariance structure is used [24, 25]. Therefore, in estimating the means cross-sectionally, we recommend independence working covariance weighting to ensure unbiased parameter estimates.

3.1. Sequential offsetted logistic regressions parameter estimation

We begin the SOLR protocol with λijp(Yij,Xij), which we take to be a logistic regression model of the auxiliary variable Zij on (Yij, Xij). Let Wij = g(Yij, Xij) contain Yij, all relevant time-specific covariate information contained in Xij, and possibly interactions between Yij and Xij so that pr (Zij = 1|Yij, Xij) = logit−1(Wij γ). In most circumstances, if Zij is an effective sampling variable, then Yij will contain the majority of the information about Zij, and so Wij could be quite parsimonious. The auxiliary variable population model is then given by


and estimates of γ are used to calculate estimates of λijP used in (6); however, because the data represent the pseudo-population rather than the target population, following Equation (5), we estimate γ by fitting the following offsetted logistic regression model to the pseudo-population data


where the sampling fraction ratio π(1, Xij)/π(0, Xij) offset term is fixed by design. Once γ has been estimated, we plug λ^ijP into (6) to calculate the biased sampling correction, offset term Bij = Bij (Xij) = log{τij (1, Xij)/τij (0, Xij)} in the pseudo-population model


Estimates of β based on an offsetted logistic regression model will then be consistent for the population model parameters in Equation (1).

3.2. Standard error calculations

Following Schildcrout and Rathouz [16], standard error calculations may be simpler when thinking of the SOLR protocol as a stacked estimation approach where the estimating equations in (12) are solved jointly for (γ, β)


In fact, Σi Σj Tij(γ) = 0 is first solved for γ. The estimate [gamma with circumflex] is then plugged into the estimate of Bij in μijS, and Σi Σj Uij([gamma with circumflex]) = 0 is solved for β. The asymptotic variance for ([gamma with circumflex],[beta]) that properly acknowledges the ODS design is given by




Wy,ij = g(y, Xij) is the design matrix (dimension 1× p) for subject i at time j in the auxiliary variable model when Yij = y. It is a flexible function of y, Xij and possibly interactions,


and r = π(1, Xij/π(0; Xij).

3.3. Alternatives to sequential offsetted logistic regressions estimation

Inverse probability of being sampled weighted estimators [26] are commonly used in survey or missing data scenarios in order to make inferences about population parameters from biased samples. ODS induces biased samples, but the selection mechanism is defined by the sampling routine, and so the probability of being sampled is known. Cai et al. [13] proposed inverse probability of being sampled weighted estimation when conducting an ODS analysis with cluster-based sampling; however, the general strategy of weighting contributions of an estimating equation by the inverse probability of having been observed can also apply to the independent ODS design described here. For example, we may estimate parameters from the population model μijp by solving ijπ(Zij,X1,ij)-1Xijt(Yij-μijs)=0 for β (e.g., [20,23]). Robust standard errors [27,28] can be used to obtain asymptotically valid standard error estimates.

Alternatively, in certain scenarios, a naive analysis of the selected sample may lead to valid estimated regression coefficients, although intercept estimates will generally be biased. For example, when sampling is based only on Zij, then arguments outlined in Park and Kim [11] can be generalized to suggest that estimates corresponding to log odds ratios may be unbiased when we ignore the study design entirely as long as independence weighting and robust standard errors are used. That is, we may simply solve ijXijt(Yij-μijs)=0 for β and use robust standard errors for estimation uncertainty.

4. Finite sample, simulation study

We have proposed a SOLR estimation strategy for marginal longitudinal regression models using ODS based on a time-varying auxiliary variable Zij and possibly on a time-varying or time-invariant regression model variable Xij. Here, we examine operating characteristics of the estimators and compare them with those obtained from simpler estimation strategies described in Section 3.3 that either ignore the study design and simply use independence weighting GEE (GEE-I) or that use an independence, inverse probability weighting (IPW) approach where the probability weights are based on empirical sampling probabilities (IPW-GEE-I). We now describe the models that were used to generate the data, the sampling strategies, and the specific estimation methods. All results from simulations are based on 1600 simulated datasets.

In order to conduct simulations that were relevent for the motivating BioCycle Study data, we chose specific data-generating details such as the cluster size and the strength of association between the auxiliary and outcome that are similar to the characteristics of BioCycle. For i [set membership] {1, …, 300} subjects, we generated ni ~ uniform(30, 40) observations per subject based on the following marginalized, first-order transition and latent variable model (MM-TLV) [29],


where μijp=pr(Yij=1Xij) is the population (marginal) mean model that describes the relationship between the binary response variable and covariates and μijcpr(Yij=1Xij,Yij-1,Ui) is the conditional mean or response dependence model that describes the relationships among the responses. Together, the marginal and conditional means specify a complete multivariate probability distribution [Yi |Xi] (assuming that Yij [coproduct operator] Yi,j −2, Yi,j −3, … |Yi,j −1, Ui). The implicit parameter Δij links the marginal and conditional means and has been described in a number of earlier manuscripts [2932]. For our purposes, μijc was used to induce correlation among repeated observations within a subject, and the parameter α = (αy, αu) determines the strength of the relationship among response values. In this MM-TLV, α = (1, 1), and so we have induced a common longitudinal dependence structure that contains a short-range serial component and long-range, random intercept component. Thus, the log odds ratio associated with Yij−1 being 1 as opposed to 0 is 1, and the variance component, which should be interpreted in a manner similar to the variance component in a standard generalized linear mixed model with a random intercept model, is also 1.

The vector Xij = (1, X1,ij, X2,i, X3,ij) contains the covariates of interest, and the regression parameter vector, β = (β0, β1, β2, β3)t = (−2.0, 0.5, 0.5, 0.5)t is the estimation target. X1,ij is time-varying and generated from a standard normal distribution. X2,i is a subject-level binary variable with pr (X2,i = 1) = 0.1, and X3,ij is a time-varying covariate that was generated from a normal distribution with mean −0.5 + X1,ij + X2,i and standard deviation equal to 0.5. Thus, X3,ij is associated with within-subject and between-subject response features and can therefore be thought of as a confounder for each of their relationships with the Yij. The population model is fixed throughout all simulations, and it induces approximately a 15% prevalence of positive response across all subjects and times.

Even though there are approximately 300 · 35 = 10, 500 potentially observed Yij values, we assume that study resources only permit ascertainment of 1200 of them, and because the pr (Yij = 1) is relatively low, ODS designs are likely to improve estimation efficiency over simple random sampling. Furthermore, because the prevalence of X2,i is also low and is correlated with X3,ij, stratified ODS (Str-ODS) that depends on (Zij, X3,ij) may further improve efficiency of parameter estimates.

The auxiliary variable Zij was generated under the model logit logit(λijp)=Wijγ with covariate vector Wij = (1, Yij, X1,ij, X2,i, X3,ij), and the value of γ varied across simulation scenarios. It included values (−2, 3, 0, 0, 0) and (−2, 5, 0, 0, 0) to induce a moderate and a strong Zij ~ Yij relationship, respectively, and (−2, 3, 0, 0, 1) and (−2, 3, 1, 1, 0) to allow covariates to affect the value of Zij and therefore to examine the impact of misspecifying this model for SOLR analysis. It is worth noting that we have generated Zij in such a way that does not include longitudinal correlation in the auxiliary vector beyond that induced from the linear predictor (which includes Yij).

We consider ODS on the basis of Zij and Str-ODS on the basis of (Zij, X3,ij) to improve efficiency. Assume nz is the number of observed Zij = z values across subjects and times, z [set membership] {0, 1}, and nz,v is the number of observed {Zij, I(X3,ij > 0.5)} = (z, v) values, v [set membership] {0, 1}. At each replication, we sampled 600 observations in each of the two z strata under ODS and 300 observations from each of the four (z, v) strata. In the latter scenario, we oversampled high values of X3,ij in order to observe greater variability in X2,i and X3,ij, and possibly in X1,ij. Whereas the sample size in the ODS was fixed across replications, nz and nz,v were not. For analyses, the empirical sampling probabilities for ODS and Str-ODS used in both SOLR and IPW-GEE-I estimation strategies were set to π(z) = 600/nz and π(z, v) = 300/n(z,v), respectively.

We examined three estimation approaches: GEE-I, IPW-GEE-I, and SOLR. The first two were described in Section 3.3, and SOLR was described in Section 3. For the SOLR approach, we examined three specific variations differentiated by how they estimated λijp=pr(Zij=1Wij). In SOLR-I(Y), SOLR-I(Y+X), and SOLR-I(Y*X), the assumed Wij was (1, Yij), (1, Yij, X1,ij, X2,i, X3,ij), and(1, Yij, X1,ij, X2,i, X3,ij, Yij X1,ij, Yij X2,i, Yij X3,ij), respectively. This permits a cursory exploration of a potential bias-variance tradeoff associated with the richness of the auxiliary variable model. If Zij |Yij, Xij is a function of covariates, then by using SOLR-I(Y), we may observe bias due to violation of the assumption that Sij [perpendicular] (Yij, Xij) | (Zij, X1,ij). In contrast, if the auxiliary variable only depends upon Yij, then using the SOLR-I(Y*X) which estimates several extra parameters unnecessarily may lead to efficiency losses.

GEE-I and IPW-GEE-I are easily implemented with standard software. Although weighting is required for IPW-GEE-I, the sampling probabilities depend on the study design, and these probabilities are known. The SOLR estimators are slightly more challenging because an additional level of modeling is required, namely λijp. By considering validity and efficiency of the estimators, we shed light on circumstances under which combinations of designs and estimation strategies might be favored over others.

4.1. Validity

Table I shows results from the simulations regarding the validity of the estimation strategies under a number of scenarios. In all scenarios studied, the naive, GEE-I estimator of β1, β2, and β3 was valid with approximately 95% coverage as long as (i) Zij depended upon Yij but not covariates (X1,ij, X2,i, X3,ij) and (ii) ODS sampling was conducted but not Str-ODS. Under Str-ODS, we observe bias in estimates of β3 with GEE-I, the extent of which increases with the strength of the Zij ~ Yij relationship. For example, under γ = (−2, 3, 0, 0, 0), the average GEE-I estimate of β3 was 0.32 with coverage 0.72, as opposed to the true value of 0.5, and under γ = (−2, 5, 0, 0, 0), it was 0.22 with coverage 0.44. In a similar way, bias was observed with GEE-I for the parameters corresponding to covariates that had a non-zero effect on Zij. For example, under γ = (−2, 3, 1, 1, 0) and ODS sampling, the average GEE-I estimate of β2 and β3 was 0.40 and 0.36, respectively. As expected, IPW-GEE-I was valid under all scenarios, and the SOLR estimators were valid when properly specified. However, under the scenarios when a covariate was associated with the auxiliary Zij, although this relationship was not acknowledged during analyses, biases were observed in SOLR estimates of the target coefficients. For example, with γ = (−2, 3, 0, 0, 1), the average SOLR-I(Y) estimate of β3 was 0.42, whereas the SOLR-I(Y+X) and SOLR-I(Y*X) estimates were both 0.50. Similarly, with γ = (−2, 3, 1, 1, 0), the average SOLR-I(Y) estimates of β2 and β3 were 0.40 and 0.35, respectively, which are very similar to those obtained with GEE-I.

Table I
Average parameter estimates and 95% coverage probabilities for the SOLR, IPW-GEE-I, and GEE-I procedures across 1600 replicate studies of 300 subjects with 30 to 40 observations possible per person, but with resources available to sample, in 1200 observations ...

The reason for the observed biases, either because of ignoring a stratified design with GEE-I or by misspecifying λijp with GEE-I or SOLR, results from misspecification of the offset term log{τij (1, X ij)/τij (0, Xij)}. Specifically, when properly specified, this sampling ratio term is independent of covariate values; however, the presence of a misspecification by ignoring, say, the effect of X3,ij on Zij leads to a residual, non-linear function of X3,ij in the offset. It is this dependence that leads to bias. The linear predictor then takes the form f(X3,ij) + β0 + β1X1,ij + β2X2,i + β3X3,ij, which contains linear and non-linear effects of X3,ij. Even though bias will occur in estimates of β3, valid inferences persisted for (β1, β2) in the limited number of scenarios we studied.

4.2. Efficiency

We now discuss the impact of the study design and the estimation strategy on statistical efficiency. Table II shows empirical standard errors. In addition to the other designs and analysis strategies already discussed, we also include a random sampling approach wherein 1200 observations were randomly sampled at each replication, and GEE-I and GEE-E (i.e., GEE with exchangeable working covariance weighting) were used to fit the sampled data.

Table II
Empirical standard errors across 1600 replications for SOLR, GEE-I, and IPW-GEE-I estimators.

4.2.1. Study design

From Table II, we see that well conducted analyses using ODS designs can lead to substantial efficiency improvements relative to random sampling, and further improvements can be made via two-phase or stratified, Str-ODS designs. Focusing on SOLR estimation with a parsimonious model for Zij (e.g., Wij = (1, Yij)) or with SOLR-I(Y), the empirical standard error of (β1, β2, β3) was approximately (0.158, 0.322, 0.141) under γ = (−2, 3, 0, 0, 0) and ODS sampling, which was similar to the standard errors obtained under γ = (−2, 5, 0, 0, 0). We saw modest efficiency improvements with Str-ODS, where the empirical standard error was (0.152, 0.289, 0.132) under γ = (−2, 3, 0, 0, 0). With random sampling and with exchangeable working covariance weighting, GEE-E, however, the empirical standard error was higher and was equal to (0.197, 0.357, 0.179), implying asymptotic relative efficiencies of (59.5%, 65.5%, 54.4%) compared with Str-ODS.

4.2.2. Estimation approach

The IPW-GEE-I estimation strategy is only slightly more difficult to implement than GEE-I, it is easier to implement than SOLR, and it will be valid as long as the inverse probability weights reflect the true sampling scheme; however, it can be highly inefficient relative to SOLR and to GEE-I. Inefficiency of inverse probability weighted estimators has also been observed in other biased sampling settings (e.g., [38]). The efficiency loss is particularly evident under weaker Zij ~ Yij relationships. For example, under ODS and Str-ODS designs, when γ = (−2, 3, 0, 0, 0), IPW-GEE-I was only slightly more efficient than random sampling. With ODS, the empirical standard error for (β1, β2, β3) was (0.192, 0.351, 0.170). However, with a stronger Zij ~ Yij relationship, the relative efficiency of IPW-GEE-I to the other estimators improved. For example, with γ = (−2, 5, 0, 0, 0) and ODS sampling, the IPW-GEE-I empirical standard errors were (0.171, 0.328, 0.155), which is still less efficient although more competitive with the other estimators. GEE-I and the SOLR-I(Y) were nearly equally efficient in all scenarios studied. We note that by fitting a rich model for λijp using SOLR-I(Y*X), we observed moderate efficiency losses relative to SOLR-I(Y) and SOLR-I(Y+X) under ODS with γ =(−2, 3, 0, 0, 0); however, with γ = (−2, 5, 0, 0, 0), all three SOLR estimators were approximately equally efficient.

4.3. Within-subject outcome-dependent sampling

The SOLR analysis procedure that we have discussed is tailored to independent ODS designs. Subject identifiers are ignored when sampling, which depends only upon the value Zij or (Zij, X3,ij). In contrast to this design, some studies, including BioCycle, may instead seek to sample a fixed number of observations per subject in order to ensure that all subjects are equally represented in the ODS. In our simulation study, we implemented one such dependent ODS design where four observations from each subject were sampled with probability linked to the value of Zij. Each time an observation was to be drawn within an individual, those with value Zij = 1 and Zij = 0 were assigned sampling probability weights equal to 600/n1 and 600/n0, respectively. The value of the offset log{π(1)/π(0)} used for the SOLR and IPW-GEE-I estimation strategies was based on the observed counts after sampling had been completed. For example, in a given replication, the sample may contain 570 observations with Zij = 1 and 630 observations with Zij = 0. The π(1) and π(0) values used in analyses were 570/n1 and 630/n0, respectively. Table III shows the results from the simulation study. When the cluster sizes were uniformly distributed between 30 and 40 observations per person and when the MM-TLV correlation structure contained serial and long-range components (αy, αu) = (1, 1) as in earlier simulations, we observed small biases in the estimates of the time-invariant covariate coefficients, β2, on the order of 6% to 10%. The estimates of β1 and β3 remained unbiased. As described in Section 2.3, the small bias in β2 was caused by the imbalance in Zi across subjects, which induced a small violation in the conditionally independent sampling assumption described in Section 2.3. To examine the impact of imbalance in Zi, we also simulated larger cluster sizes ni = n = 100 with no long-range correlation (i.e., αu = 0), which results in far less imbalance in Zi. As we can see in Table III, the biases seen with smaller clusters vanish. We can also see in the empirical standard error portion of the table (comparing with those in Table II) that there does not appear to be a substantial efficiency gain or loss with within-subject sampling as opposed to independent ODS.

Table III
Average parameter estimates, coverage probabilities, and empirical standard errors using the SOLR, IPW-GEE-I, and GEE-I estimation strategies under within-subject ODS with four observations sampled per subject and with γ = (−2, 3, 0, 0, ...

5. BioCycle Study data analysis

As discussed earlier, the BioCycle Study implemented an ODS design to maximize observed hormonal variation while adhering to resource limitations and minimizing participant burden. Our analytical goal was to examine the relationship between high (> 20 IU/L) versus low (≤ 20 IU/L) LH values and fiber intake, while adjusting for energy intake and patient characteristics including age and body mass index. In particular, as women must typically reach a certain threshold level of LH in order to trigger ovulation in a given cycle, we are interested in the impact of fiber intake on high versus low LH. Patient characteristics and experiences are shown in Table IV. Of the 245 participants included in the analysis, 199 were observed for two cycles, and 46 were observed for one cycle. The median age and body mass index were 25 years and 23 kg/m2, respectively, and most subjects were observed at each of the four key time points. In approximately 6% of observed days, patients were observed with high LH. High LH was observed in approximately 22% (96 out of 434) of the peri-ovulation visits (e.g., when Zij = 1) and in approximately 1.0% (13 out of 1284 days) of the other clinic visits (e.g., when Zij = 0).

Table IV
BioCycle Study patient characteristics.

In many longitudinal data analysis settings, it may be appropriate to decompose time-varying covariates into two components: one that varies exclusively between individuals and the other that varies exclusively within individuals. For example, a covariate xij observed in subject i at time j may be decomposed into the mean of the subject’s covariate series [x with macron]i and time-specific deviations from the mean value xij[x with macron]i. The former varies exclusively between individuals, and the latter varies exclusively within individuals (i.e., it is mean balanced). As discussed in references [3336], the omission of this decomposition involves an implicit and often implausible assumption that the impact of [x with macron]i and xij[x with macron]i on the outcome are equal to one another. In addition, the impact of confounding may differ across the two contributors (between-subject and within-subject) to variation in time-varying covariates. In the absence of the decomposition, the estimated effect is a mixture of the between-subject and within-subject effects, one of which may be susceptible to confounding. For our purposes, we decompose fiber intake, fiberij, into fiber¯ij and fiberij-fiber¯i, and estimate these effects separately.

The first step in SOLR analysis is to specify a model for the auxiliary variable conditional on covariates and sampled outcomes. For the BioCycle data, we consider a range of models from the simplest that only regresses Z on Y to a rich model that allows X, Y, and interactions between X and Y. Empirically, the data suggest that Yij is the major predictor of Zij with a log odds ratio of 3.00 (SE = 0.281, p < 0.001) in an additive model including covariates. Covariate effects are much weaker yet between-subject fiber (measured in 5-g units) is significantly associated with Zij having a log odds ratio of 0.056 (SE = 0.024, p = 0.02), and African American race has an estimated coefficient of 0.159 (SE = 0.066, p = 0.02). Therefore, the data suggest that consideration of covariates in the auxiliary model is warranted to obtain valid inference for the longitudinal response. However, none of the covariates appeared to modify the relationship between Y and Z.

Results from the BioCycle Study, shown in Table V, indicate that between-subject changes in fiber intake were inversely related to the incidence of high LH concentrations. In particular, a 5-g/day increase (~ one apple or two slices of whole grain bread) in average fiber intake was associated with approximately a 33% to 35% decrease in the odds (log odds ratio estimates ranging from −0.421 to −0.394) of high LH. All estimators except IPW-GEE-I were approximately equally efficient for the between-subject fiber intake variable with standard errors ranging from 0.121 to 0.129. The standard error estimate based on IPW-GEE-I was 0.142. Consistent with the results from the simulation study in Section 4, the SOLR-I(Y*X) or IPW-GEE-I yielded the largest standard errors among the estimators. In addition to fiber intake, race was highly associated with LH levels. For example, on the basis of the SOLR-I(Y+X) estimator, the odds of high LH levels were 59% (coefficient estimate −0.884) lower in African Americans compared with whites.

6. Discussion

We have discussed ODS from a cohort study with longitudinal follow-up on a binary response variable and when marginal models are of interest. The design includes a time-varying auxiliary variable related to response, and sampling is conducted at the level of the individual observations, as opposed to subjects, on the basis of this auxiliary variable. We described a SOLR estimation strategy that can be used for analyses, and we showed that, when properly specified, it can be far more efficient than random sampling and inverse probability of sample weighting. While ignoring the study design can in some scenarios lead to valid and efficient estimates, the SOLR approach with a properly specified model for the auxiliary variable is as efficient and is able to address stratified sampling (which itself can lead to efficiency improvements) and complex auxiliary variable models [Zij | Xi, Yij].

For the SOLR approach, we suggested a relatively simplified specification of the pseudo-population model. In particular, we suggested μijs=pr(YijXij,Sij=1). The more general specification of μijs=pr(YijXij,Si) that could be estimated with non-independence working covariance weighting can be applied with subject-level or within-subject sampling, with prospective or retrospective designs, and with sampling that depends upon the response or on an auxiliary variable that is related to the response. For example, Schildcrout and Rathouz (2010) used a variation of this approach for subject-level sampling directly from a broader population. We believe the general SOLR strategy to be very broad and to be applicable in a number of settings. With large clusters, simplifying assumptions are likely to be required, and appropriate analysis will depend upon the design and data features. An alternative analytical approach for our proposed design would be to use likelihood-based methods (e.g., Bayesian or maximum likelihood). Whereas such approaches would require joint, longitudinal modeling of [Yi, Zi |Xi, Si], they are certainly possible and are an area of future research.

Table V
BioCycle Study results: parameter estimates (standard errors) corresponding to log odds ratios from five analytical approaches.


This project was partially funded by the NIH grant R01 HL094786 from the National Heart, Lung, and Blood Institute, the Long-Range Research Initiative of the American Chemistry Council, and the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health.


1. Wactawski-Wende J, Schisterman EF, Hovey KM, Howards PP, Browne RW, Hediger M, Liu A, Trevisan M. BioCycle Study: design of the longitudinal study of the oxidative stress and hormone variation during the menstrual cycle. Paediatric and Perinatal Epidemiology. 2009;23:171–184. [PMC free article] [PubMed]
2. Howards PP, Schisterman EF, Wactawski-Wende J, Reschke JE, Frazer AA, Hovey KM. Timing clinic visits to phases of the menstrual cycle by using a fertility monitor: the BioCycle Study. American Journal of Epidemiology. 2009;169:105–112. [PMC free article] [PubMed]
3. Gaskins AJ, Mumford SL, Zhang C, Wactawski-Wende J, Hovey KM, Whitcomb BW, Howards PP, Perkins NJ, Yeung E, Schisterman EF. BioCycle Study Grp. Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition. 2009;90(4):1061–1069. doi: 10.3945/ajcn.2009.27990. [PubMed] [Cross Ref]
4. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.
5. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–412.
6. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.
7. Navidi W. Bidirectional case-crossover designs for exposures with time trends. Biometrics. 1998;54:596–605. [PubMed]
8. Maclure M. The case-crossover design—a method for studying transient effects on the risk of acute events. American Journal of Epidemiology. 1991;133(2):144–153. [PubMed]
9. Lee AJ, McMurchy L, Scott AJ. Re-using data from case-control studies. Statistics in Medicine. 1997;16:1377–1389. [PubMed]
10. Chen J, Breslow NE. Semiparametric efficient estimation for the auxiliary outcome problem with the conditional mean model. The Canadian Journal of Statistics/La Revue Canadienne de Statistique. 2004;32(4):359–372.
11. Park E, Kim Y. Analysis of longitudinal data in case-control studies. Biometrika. 2004;91(2):321–330.
12. Pfeiffer RM, Ryan L, Litonjua A, Pee D. A case-cohort design for assessing covariate effects in longitudinal studies. Biometrics. 2005;61(4):982–991. [PubMed]
13. Cai J, Qaqish B, Zhou H. Marginal analysis for cluster-based case-control studies. Sankhyā, Series B. 2001;63(3):326–337.
14. Neuhaus JM, Scott AJ, Wild CJ. Family-specific approaches to the analysis of case-control family data. Biometrics. 2006;62(2):488–494. [PubMed]
15. Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749. [PMC free article] [PubMed]
16. Schildcrout JS, Rathouz PJ. Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis. Biometrics. 2010;66:365–373. [PMC free article] [PubMed]
17. Schildcrout J, Heagerty P. Outcome-dependent sampling from existing cohorts with longitudinal binary response data: study planning and analysis. Biometrics. 2011 doi: 10.1111/j.1541-0420.2011.01582.x. [PMC free article] [PubMed] [Cross Ref]
18. White J. A 2 stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115(1):119–128. [PubMed]
19. Breslow N, Cain K. Logistic-regression for 2-stage case-control data. Biometrika. 1988;75(1):11–20.
20. Robins J, Rotnitzky A, Zhao L. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90(429):106–121.
21. Lin D, Ying Z. Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association. 2001;96(453):103–113.
22. Lin H, Scharfstein D, Rosenheck R. Analysis of longitudinal data with irregular, outcome-dependent follow-up. Journal of the Royal Statistical Society Series B- Statistical Methodology. 2004;66(Part 3):791–813.
23. Buzkova P, Lumley T. Semiparametric modeling of repeated measurements under outcome-dependent follow-up. Statistics in Medicine. 2009;28(6):987–1003. [PubMed]
24. Pepe M, Anderson G. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communication in Statistics-Simulation and Computation. 1994;23(4):939–951.
25. Schildcrout JS, Heagerty PJ. Regression analysis of longitudinal binary data with time-dependent environmental covariates: bias and efficiency. Biostatistics. 2005;6:633–652. [PubMed]
26. Robins J, Rotnitzky A, Zhao A. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.
27. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130. [PubMed]
28. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
29. Schildcrout JS, Heagerty PJ. Marginalized models for moderate to long series of longitudinal binary response data. Biometrics. 2007;63:322–331. [PubMed]
30. Heagerty PJ. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics. 2002;58:342–351. [PubMed]
31. Miglioretti D, Heagerty P. Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics. 2004;5(3):381–398. doi: 10.1093/biostatistics/kxg042. [PubMed] [Cross Ref]
32. Lee K, Daniels MJ. Marginalized models for longitudinal ordinal data with application to quality of life studies. Statistics in Medicine. 2008;27(21):4359–4380. doi: 10.1002/sim.3352. [PMC free article] [PubMed] [Cross Ref]
33. Scott A, Holt D. The effect of 2-stage sampling on ordinary least-squares methods. Journal of the American Statistical Association. 1982;77(380):848–854.
34. Neuhaus J, Kalbfleisch J. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics. 1998;54(2):638–645. [PubMed]
35. Sheppard L. Insights on bias and information in group-level studies. Biostatistics. 2003;4(2):265–278. [PubMed]
36. Schildcrout JS, Sheppard L, Lumley T, Slaughter JC, Koenig JQ, Shapiro GG. Ambient air pollution and asthma exacerbations in children: an eight-city analysis. American Journal of Epidemiology. 2006;164:505–517. [PubMed]
37. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. 2. Oxford University Press; USA: 2002.
38. Lawless J, Kalbfleisch J, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society Series B- Statistical Methodology. 1999;61(Part 2):413–438.