Stat Med. Author manuscript; available in PMC 2012 September 28.

Published in final edited form as:

Stat Med. 2012 September 28; 31(22): 2441–2456.

Published online 2011 November 16. doi: 10.1002/sim.4359. PMCID: PMC3432177

NIHMSID: NIHMS377204

Jonathan S. Schildcrout,^{a,}^{*}^{†} Sunni L. Mumford,^{b} Zhen Chen,^{b} Patrick J. Heagerty,^{c} and Paul J. Rathouz^{d}

Outcome-dependent sampling (ODS) study designs are commonly implemented with rare diseases or when prospective studies are infeasible. In longitudinal data settings, when a repeatedly measured binary response is rare, an ODS design can be highly efficient for maximizing statistical information subject to resource limitations that prohibit covariate ascertainment of all observations. This manuscript details an ODS design where individual observations are sampled with probabilities determined by an inexpensive, time-varying auxiliary variable that is related but is not equal to the response. With the goal of validly estimating marginal model parameters based on the resulting biased sample, we propose a semi-parametric, *sequential offsetted logistic regressions* (SOLR) approach. The SOLR strategy first estimates the relationship between the auxiliary variable and the response and covariate data by using an offsetted logistic regression analysis where the offset is used to adjust for the biased design. Results from the auxiliary variable model are then combined with the known or estimated sampling probabilities to formulate a second offset that is used to correct for the biased design in the ultimate target model relating the longitudinal binary response to covariates. Because the target model offset is estimated with SOLR, we detail asymptotic standard error estimates that account for uncertainty associated with the auxiliary variable model. Motivated by an analysis of the BioCycle Study (Gaskins et al., Effect of daily fiber intake on reproductive function: the BioCycle Study. *American Journal of Clinical Nutrition* 2009; 90(4): 1061–1069) that aims to describe the relationship between reproductive health (determined by luteinizing hormone levels) and fiber consumption, we examine properties of SOLR estimators and compare them with other common approaches.

The BioCycle Study [1, 2] examined relationships between hormone levels and measures of oxidative stress over the course of the menstrual cycle. Because of costs associated with exposure and outcome ascertainment (i.e., blood draws are necessary for both), study coordinators were able to draw samples eight times per cycle for each of two cycles per participant. To maximize observed hormonal variation, the BioCycle Study design sought to ascertain samples at key time points. On a standard 28-day cycle, these times correspond to days 2 (during menses), 7 (middle of the follicular phase), 12, 13, and 14 (peri-ovulation), 18 (progesterone elevation), 22 (progesterone peak), and 27 (before menstruation). However, a major challenge to the BioCycle Study was that cycle lengths vary between and within women, so that the use of a fixed-day sampling approach would miss many of the key phases. The peri-ovulation phase is of particular interest since this is the time at which many of the hormone concentrations exhibit rapid change (see Figure 1 of reference [2]). For example, luteinizing hormone (LH) levels remain low for the entire menstrual cycle, except that they surge just before ovulation. Because there is a very short interval during which LH levels are high, BioCycle Study coordinators developed a strategy to improve the likelihood of observing the LH peak. In particular, they provided participants an at-home fertility monitor that required a simple urine sample. Beginning on day 6 of the cycle, the monitor requested 10 consecutive days of testing, and on the day the monitor indicated peak fertility (i.e., an LH surge is likely to be occurring) and the 2 days following, the participant was to visit the study clinic for peri-ovulation blood draws. We consider the results of the at-home urine exam as a surrogate marker for days when LH levels are likely to be high (‘high-risk’ days), and on all other days, LH levels are likely low (‘low-risk’ days).

Our analytical goal was to examine the relationship between high LH (>20 IU/L) versus low LH (≤20 IU/L) levels and dietary fiber intake while adjusting for patient features such as time-fixed demographics and the time-varying energy intake. Gaskins et al. [3] examined the relationship between the incidence of anovulatory cycles and daily fiber intake and found a positive association. We, on the other hand, only considered ovulatory cycles. Although serum was collected at all eight of the key time points, fiber intake was ascertained on one of the three peri-ovulation (high-risk) visit days and on three other (low-risk) cycle days (2, 7, and 22 from a standard cycle). To elucidate the study design, Figure 1 shows observed data from one cycle. In this case, the cycle was 30 days long, and LH was measured on days 2, 8, 14, 15, 16, 19, 25, and 29 with the highest value occurring on day 14. This is also the only day on which LH was greater than 20 IU/L. The at-home urine monitor was used on consecutive days from day 6 to day 15. On day 14, the monitor indicated peak fertility, which triggered the peri-ovulation visits occurring on days 14, 15, and 16. Fiber intake data for this cycle were then collected on days 2, 8, and 25 (low-risk visits) and on day 16 (high-risk visit). Since the design sampled high-risk days with probability approximately 0.33 (i.e., 1/3) and low-risk days with probability approximately 0.12 (i.e., 3/25 on a standard cycle), the sample was biased and over-represented days with high LH levels. We therefore anticipated that analyses that ignore the design will yield invalid inferences for target parameters. In this manuscript, we describe methods for valid inferences in the presence of outcome-dependent sampling (ODS) designs where sampling depends on a time-varying auxiliary variable that is related but not equal to the response variable.
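As a quick check of the arithmetic behind these two sampling fractions, the following Python sketch computes them and their ratio. It assumes a standard 28-day cycle with 3 high-risk and 25 low-risk days, as described in the text; the variable names are ours and are purely illustrative.

```python
import math

# Sampling fractions on a standard 28-day cycle, as described in the text:
# fiber intake is recorded on 1 of the 3 peri-ovulation (high-risk) days
# and on 3 of the 25 remaining (low-risk) days.
p_high = 1 / 3
p_low = 3 / 25

ratio = p_high / p_low        # relative over-sampling of high-risk days
log_ratio = math.log(ratio)   # the same ratio on the log scale

print(f"p_high = {p_high:.2f}, p_low = {p_low:.2f}")
print(f"ratio = {ratio:.2f}, log ratio = {log_ratio:.2f}")
```

High-risk days are therefore sampled almost three times as often as low-risk days, which is the imbalance that the analysis must correct for.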

The BioCycle Study used an ODS design to increase the efficiency of parameter estimates that describe exposure–outcome relationships for a given cost of measurement. ODS is ubiquitous in epidemiological research since, in many circumstances, prospective studies are infeasible or are logistically impossible. The most commonly used ODS design is the case-control study [4,5], and from its principle of targeted sampling based on the most informative or interesting features of the response distribution, many other designs have emerged and are commonly used (e.g., the case-cohort design [6] and the case-crossover design [7,8]). Related to the BioCycle Study design, although applied to a univariate outcome, Lee, McMurchy, and Scott [9] discussed analyses under a design where sampling is based on an auxiliary variable (*Z*) that is related to the response variable (*Y*) and where interest is in the marginal regression relationship between the response and the covariates **X**, [

This manuscript is organized as follows. Section 2 describes the target population model, the study design that is intended to maintain high efficiency while adhering to the constraints introduced by limited study resources, and the pseudo-population model induced by the design. In Section 3, we discuss implementation of an analytical protocol using SOLR, and in particular we describe asymptotic variance estimators. We evaluate the operating characteristics of the proposed estimator in comparison with two other analysis strategies in Section 4, and in Section 5, we revisit the BioCycle Study. We conclude with a discussion in Section 6.

In this section, we discuss the marginal population model that defines the parameters that are the primary inferential targets, and we introduce an ODS study design that can be implemented to reduce study costs while permitting highly efficient parameter estimation. We then describe the conditional model induced by the proposed sampling design, which is the mean model for the pseudo-population represented by the outcome-dependent sample (i.e., conditional on being sampled). We show that, although the induced conditional regression differs from the target model, the relationship is mathematically tractable. Finally, we detail valid analysis methods and assumptions that permit estimation of the marginal target parameters based on the biased sample. We also discuss a slightly different sampling approach that more closely characterizes the BioCycle Study design and show the conditions under which the induced pseudo-population model is equivalent to the one induced by our basic design.

Consider a study where we wish to make inference about a target population structure based on the marginal mean from a logistic regression model given by

$${\mu}_{ij}^{p}\equiv pr({Y}_{ij}=1\mid {\mathit{X}}_{ij})={\text{logit}}^{-1}({\mathit{X}}_{ij}\mathit{\beta}),$$

(1)

where *i* ∈ {1, 2, …, *N*} denotes subject, *j* denotes observation time within subjects, *Y_{ij}* is a binary response variable {

When ascertainment costs associated with *Y** _{i}* are high and/or

Let *S _{ij}* be an indicator of whether subject

Whereas the original cohort is representative of the population, the data observed under the proposed study design is representative of a pseudo-population that is compositionally different from the target population. With an effective sampling strategy, the prevalence of the outcome will be higher in the pseudo-population than in the population. That is, *pr* (*Y _{ij}* = 1|

Let ${\mu}_{ij}^{s}\equiv pr({Y}_{ij}=1\mid {\mathit{X}}_{ij},{S}_{ij}=1)$ be the marginal mean model in the pseudo-population represented by the observed ODS data. Via Bayes’ theorem, we may relate the pseudo-population and population mean models with

$$\frac{{\mu}_{ij}^{s}}{1-{\mu}_{ij}^{s}}=\frac{{\tau}_{ij}(1,{\mathit{X}}_{ij})}{{\tau}_{ij}(0,{\mathit{X}}_{ij})}\cdot \frac{{\mu}_{ij}^{p}}{1-{\mu}_{ij}^{p}},$$

(2)

where *τ _{ij}* (

$${\mu}_{ij}^{s}={\text{logit}}^{-1}\left[\log \left\{\frac{{\tau}_{ij}(1,{\mathit{X}}_{ij})}{{\tau}_{ij}(0,{\mathit{X}}_{ij})}\right\}+{\mathit{X}}_{ij}\mathit{\beta}\right]$$

(3)

to the sampled data (*Y _{ij},*

$${\tau}_{ij}(y,{\mathit{X}}_{ij})=\sum _{{z}_{ij}\in \{0,1\}}\pi ({z}_{ij},{X}_{1,ij})\cdot pr({z}_{ij}\mid {\mathit{X}}_{ij},{Y}_{ij}=y).$$

(4)

Although *π*(*z _{ij}, X*

$$\frac{{\lambda}_{ij}^{s}({Y}_{ij},{\mathit{X}}_{ij})}{1-{\lambda}_{ij}^{s}({Y}_{ij},{\mathit{X}}_{ij})}=\frac{\pi (1,{\mathit{X}}_{ij})}{\pi (0,{\mathit{X}}_{ij})}\cdot \frac{{\lambda}_{ij}^{p}({Y}_{ij},{\mathit{X}}_{ij})}{1-{\lambda}_{ij}^{p}({Y}_{ij},{\mathit{X}}_{ij})},$$

(5)

where
${\lambda}_{ij}^{s}({Y}_{ij},{\mathit{X}}_{ij})=pr({Z}_{ij}=1\mid {Y}_{ij},{\mathit{X}}_{ij},{S}_{ij}=1)$ is the pseudo-population mean. We may estimate parameters in
${\lambda}_{ij}^{p}$ by fitting an offsetted logistic regression model to the pseudo-population data by using the known log{*π*(1, *X** _{ij}*)/

$$\frac{{\tau}_{ij}(1,{\mathit{X}}_{ij})}{{\tau}_{ij}(0,{\mathit{X}}_{ij})}=\frac{1-[1-\pi (1,{\mathit{X}}_{ij})/\pi (0,{\mathit{X}}_{ij})]\cdot {\lambda}_{ij}^{p}(1,{\mathit{X}}_{ij})}{1-[1-\pi (1,{\mathit{X}}_{ij})/\pi (0,{\mathit{X}}_{ij})]\cdot {\lambda}_{ij}^{p}(0,{\mathit{X}}_{ij})}.$$

(6)
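One way to see where Equation (6) comes from is to expand the two-term sum in Equation (4). The sketch below assumes that ${\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})$ denotes $pr({Z}_{ij}=1\mid {\mathit{X}}_{ij},{Y}_{ij}=y)$ in the target population, and it writes the sampling probabilities as $\pi (z,{\mathit{X}}_{ij})$ to match the notation of Equations (5) and (6):

$$\begin{array}{rl}{\tau}_{ij}(y,{\mathit{X}}_{ij})& =\pi (1,{\mathit{X}}_{ij})\cdot {\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})+\pi (0,{\mathit{X}}_{ij})\cdot \{1-{\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})\}\\ & =\pi (0,{\mathit{X}}_{ij})\cdot \left[1-\left\{1-\frac{\pi (1,{\mathit{X}}_{ij})}{\pi (0,{\mathit{X}}_{ij})}\right\}\cdot {\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})\right].\end{array}$$

Taking the ratio of this expression at $y=1$ and $y=0$ cancels the common factor $\pi (0,{\mathit{X}}_{ij})$ and leaves the right-hand side of Equation (6), so only the ratio of sampling probabilities, and not their absolute levels, enters the offset.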

In Section 3, we detail the sequential offsetted logistic strategy that can be used for parameter estimation, and we outline standard error calculations that can acknowledge the added uncertainty associated with estimating the offset term in Equation (6).
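The agreement between Equations (4) and (6) can also be checked numerically. The following Python sketch is illustrative only: the function names are ours, the sampling fractions are taken from the design description in the Introduction, and the two values of pr(Z = 1 | Y = y) are made up for the example.

```python
def tau(pi1, pi0, lam):
    """Equation (4) for binary Z: pr(S = 1 | X, Y = y), where
    lam = pr(Z = 1 | X, Y = y) and pi1, pi0 = pr(S = 1 | Z = 1), pr(S = 1 | Z = 0)."""
    return pi1 * lam + pi0 * (1.0 - lam)


def tau_ratio(pi1, pi0, lam1, lam0):
    """Equation (6): tau(1, X) / tau(0, X) written in terms of r = pi1 / pi0."""
    r = pi1 / pi0
    return (1.0 - (1.0 - r) * lam1) / (1.0 - (1.0 - r) * lam0)


# Sampling fractions from the design description; lam1 and lam0 are
# hypothetical values of pr(Z = 1 | Y = 1) and pr(Z = 1 | Y = 0).
pi1, pi0 = 1 / 3, 3 / 25
lam1, lam0 = 0.90, 0.20

direct = tau(pi1, pi0, lam1) / tau(pi1, pi0, lam0)
via_eq6 = tau_ratio(pi1, pi0, lam1, lam0)
print(direct, via_eq6)  # the two values agree up to floating-point error
```

Because the ratio exceeds 1 whenever the auxiliary variable is positively associated with the response and high-risk observations are over-sampled, the offset in Equation (3) is positive in that setting.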

The design we propose can be referred to as independent ODS because sampling is independent across observations, and sampling probabilities do not depend upon other data arising from the same subject or on whether other observations had been sampled from the same subject. We therefore observe variable numbers of observations from each subject in the original cohort, and depending upon the outcome-dependent sample size, we may sample no observations from some of the subjects. In contrast to independent sampling, in some scenarios, including the BioCycle Study, investigators may wish to sample a set number of observations from each individual. This ‘within-subject’ ODS design can be more challenging to analyze because the probability that observation *j* from subject *i* is sampled depends not only on (*Z _{ij}, X*

Because observation-specific sampling probabilities under within-subject ODS generally depend upon the entire auxiliary variable vector, *Z** _{i}*, or on some function of it, we write sampling probabilities conditional on the data (

$$\begin{array}{l}{\tau}_{ij}(y,{\mathit{X}}_{ij})=\sum _{\mathcal{P}({\mathit{z}}_{i})}pr({S}_{ij}=1\mid {\mathit{z}}_{i},{\mathit{X}}_{ij},{Y}_{ij}=y)\cdot pr({\mathit{z}}_{i}\mid {\mathit{X}}_{ij},{Y}_{ij}=y)\\ =\sum _{\mathcal{P}({\mathit{z}}_{i})}pr({S}_{ij}=1\mid {z}_{ij},{\mathit{z}}_{{ij}^{\prime}},{\mathit{X}}_{1,ij})\cdot pr({z}_{ij},{\mathit{z}}_{{ij}^{\prime}}\mid {\mathit{X}}_{ij},{Y}_{ij}=y)\end{array}$$

(7)

where $\mathcal{P}({\mathit{z}}_{i})$ denotes all 2

$$\begin{array}{l}{\tau}_{ij}(y,{\mathit{X}}_{ij})=\sum _{{z}_{ij}\in \{0,1\}}\pi ({z}_{ij},{X}_{1,ij})\cdot pr({z}_{ij}\mid {\mathit{X}}_{ij},{Y}_{ij}=y)\cdot \sum _{\mathcal{P}({\mathit{z}}_{{ij}^{\prime}})}pr({\mathit{z}}_{{ij}^{\prime}}\mid {z}_{ij},{\mathit{X}}_{ij},{Y}_{ij}=y)\\ =\sum _{{z}_{ij}\in \{0,1\}}\pi ({z}_{ij},{X}_{1,ij})\cdot pr({z}_{ij}\mid {\mathit{X}}_{ij},{Y}_{ij}=y).\end{array}$$

(8)

The conditionally independent sampling assumption will hold approximately under within-subject ODS if the number of observations with *Z _{ij}* = 1 and

In this section, we propose a *SOLR* protocol that permits consistent estimation of target population parameters, **β**, using the ODS strategy described in Section 2.2. We then discuss standard error calculations that appropriately account for the correlated data and the additional uncertainty due to estimation of the offset term in (3).

The SOLR strategy described in succeeding paragraphs makes the independence working covariance weighting assumption, although the extension to covariance weighting is straightforward. In many cases, independence working covariance weighting can lead to efficiency losses for time-varying covariate coefficients when compared with a properly specified working covariance weighting scheme. However, because our study design samples on the basis of an auxiliary variable *Z** _{i}* that is closely related to the response

It is also important to note that we do not make any assumptions regarding whether the cross-sectional population and pseudo-population conditional means
${\mu}_{ij}^{p}=pr({Y}_{ij}\mid {\mathit{X}}_{ij})$ and
${\mu}_{ij}^{s}=pr({Y}_{ij}\mid {\mathit{X}}_{ij},{S}_{ij}=1)$ discussed in Section 2 are equal to their full covariate conditional mean counterparts, *pr*(*Y _{ij}* |

We begin the SOLR protocol with
${\lambda}_{ij}^{p}({Y}_{ij},{\mathit{X}}_{ij})$, which we take to be a logistic regression model of the auxiliary variable *Z _{ij}* on (

$${\lambda}_{ij}^{p}({Y}_{ij},{\mathit{X}}_{ij})={\text{logit}}^{-1}({\mathit{W}}_{ij}\gamma ),$$

(9)

and estimates of *γ* are used to calculate estimates of
${\lambda}_{ij}^{p}$ used in (6); however, because the data represent the pseudo-population rather than the target population, following Equation (5), we estimate *γ* by fitting the following offsetted logistic regression model to the pseudo-population data

$$\text{logit}\{{\lambda}_{ij}^{s}({Y}_{ij},{\mathit{X}}_{ij})\}=\log \left\{\frac{\pi (1,{\mathit{X}}_{ij})}{\pi (0,{\mathit{X}}_{ij})}\right\}+{\mathit{W}}_{ij}\gamma ,$$

(10)

where the sampling fraction ratio *π*(1, *X** _{ij}*)/

$$\text{logit}({\mu}_{ij}^{s})={B}_{ij}+{\mathit{X}}_{ij}\mathit{\beta}.$$

(11)

Estimates of *β* based on an offsetted logistic regression model will then be consistent for the population model parameters in Equation (1).

Following Schildcrout and Rathouz [16], standard error calculations may be simpler when thinking of the SOLR protocol as a stacked estimation approach where the estimating equations in (12) are solved jointly for (*γ,*
** β**)

$$\left[\begin{array}{c}\mathit{T}(\gamma )\\ \mathit{U}(\gamma ,\mathit{\beta})\end{array}\right]=\sum _{i}\sum _{j}\left[\begin{array}{c}{\mathit{T}}_{ij}(\gamma )\\ {\mathit{U}}_{ij}(\gamma ,\mathit{\beta})\end{array}\right]\equiv \sum _{i}\sum _{j}\left[\begin{array}{c}{\mathit{W}}_{ij}^{t}({Z}_{ij}-{\lambda}_{ij}^{s})\\ {\mathit{X}}_{ij}^{t}({Y}_{ij}-{\mu}_{ij}^{s})\end{array}\right]=\mathbf{0}.$$

(12)

In fact, Σ* _{i}* Σ

$$\text{Var}(\widehat{\gamma},\widehat{\beta})={\widehat{\mathit{I}}}^{-1}\widehat{\mathit{Q}}{\widehat{\mathit{I}}}^{-{1}^{\prime}}$$

where

$$\begin{array}{l}\mathit{Q}=\sum _{i}{\left[\sum _{j}\left\{\begin{array}{c}{\mathit{T}}_{ij}(\gamma )\\ {\mathit{U}}_{ij}(\gamma ,\beta )\end{array}\right\}\right]}^{\otimes 2},\quad \mathit{I}=\left(\begin{array}{cc}{\mathit{I}}_{TT}& \mathit{0}\\ {\mathit{I}}_{UT}& {\mathit{I}}_{UU}\end{array}\right),\\ {\mathit{I}}_{TT}=-E\left\{\frac{\partial \mathit{T}(\gamma )}{\partial \gamma}\right\}=\sum _{i}\sum _{j}{\lambda}_{ij}^{s}(1-{\lambda}_{ij}^{s})\cdot {\mathit{W}}_{ij}^{\otimes 2},\\ {\mathit{I}}_{UU}=-E\left\{\frac{\partial \mathit{U}(\gamma ,\beta )}{\partial \beta}\right\}=\sum _{i}\sum _{j}{\mu}_{ij}^{s}(1-{\mu}_{ij}^{s})\cdot {\mathit{X}}_{ij}^{\otimes 2},\\ {\mathit{I}}_{UT}=-E\left\{\frac{\partial U(\gamma ,\beta )}{\partial \gamma}\right\}=\sum _{i}\sum _{j}{\mu}_{ij}^{s}(1-{\mu}_{ij}^{s})\cdot {\mathit{X}}_{ij}^{t}\left(\frac{\partial {B}_{ij}}{\partial {\gamma}^{t}}\right),\\ \left(\frac{\partial {B}_{ij}}{\partial {\gamma}^{t}}\right)=-{\mathit{W}}_{1,ij}^{t}{F}_{ij}(1)+{\mathit{W}}_{0,ij}^{t}{F}_{ij}(0).\end{array}$$

*W** _{y,ij}* =

$${F}_{ij}(y)=\frac{(1-r){\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})\{1-{\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})\}}{1-(1-r){\lambda}_{ij}^{p}(y,{\mathit{X}}_{ij})},$$

and *r* = *π*(1, *X** _{ij}*/
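The derivative of the offset with respect to *γ* can be checked against finite differences. The Python sketch below is illustrative and rests on several assumptions that the text above does not spell out in full: *B* is taken to be the log of the ratio in Equation (6), *r* is taken to be the sampling ratio π(1)/π(0), and the design rows `w1` and `w0` are taken to be *W* evaluated at *Y* = 1 and *Y* = 0. The numerical values are made up.

```python
import numpy as np


def expit(a):
    return 1.0 / (1.0 + np.exp(-a))


def offset_B(gamma, w1, w0, r):
    """Log of the ratio in Equation (6) for one observation."""
    lam1, lam0 = expit(w1 @ gamma), expit(w0 @ gamma)
    return np.log((1.0 - (1.0 - r) * lam1) / (1.0 - (1.0 - r) * lam0))


def F(lam, r):
    """F(y) from the text, written as a function of lambda^p(y, X) and r."""
    return (1.0 - r) * lam * (1.0 - lam) / (1.0 - (1.0 - r) * lam)


def grad_B(gamma, w1, w0, r):
    """Analytic derivative of B with respect to gamma: -w1 F(1) + w0 F(0)."""
    lam1, lam0 = expit(w1 @ gamma), expit(w0 @ gamma)
    return -w1 * F(lam1, r) + w0 * F(lam0, r)


# Hypothetical values: W = (1, Y, x) with x = 0.3, and r = 0.8 / 0.1.
gamma = np.array([-2.5, 4.0, 0.5])
w1 = np.array([1.0, 1.0, 0.3])   # design row with Y = 1
w0 = np.array([1.0, 0.0, 0.3])   # design row with Y = 0
r = 0.8 / 0.1

# Central finite differences, one coordinate of gamma at a time.
eps = 1e-6
numeric = np.array([
    (offset_B(gamma + eps * e, w1, w0, r) - offset_B(gamma - eps * e, w1, w0, r)) / (2 * eps)
    for e in np.eye(gamma.size)
])
analytic = grad_B(gamma, w1, w0, r)
print(numeric)
print(analytic)
```

Agreement between the two vectors supports the form of the derivative used in the off-diagonal block of the information matrix, which is the term that propagates uncertainty in the estimated *γ* into the standard errors for *β*.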

Inverse probability of being sampled weighted estimators [26] are commonly used in survey or missing data scenarios in order to make inferences about population parameters from biased samples. ODS induces biased samples, but the selection mechanism is defined by the sampling routine, and so the probability of being sampled is known. Cai et al. [13] proposed inverse probability of being sampled weighted estimation when conducting an ODS analysis with cluster-based sampling; however, the general strategy of weighting contributions of an estimating equation by the inverse probability of having been observed can also apply to the independent ODS design described here. For example, we may estimate parameters from the population model
${\mu}_{ij}^{p}$ by solving
${\sum}_{i}{\sum}_{j}\pi {({Z}_{ij},{X}_{1,ij})}^{-1}{\mathit{X}}_{ij}^{t}({Y}_{ij}-{\mu}_{ij}^{s})=\mathbf{0}$ for ** β** (e.g., [20,23]). Robust standard errors [27,28] can be used to obtain asymptotically valid standard error estimates.
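For comparison, the weighted estimating equation can be solved with a few lines of Newton-Raphson. The Python sketch below is a hedged illustration on simulated, independent observations with sampling that depends on *Z* only; the function name `fit_weighted_logit` and all parameter values are hypothetical, and robust standard errors are not computed here.

```python
import numpy as np


def expit(a):
    return 1.0 / (1.0 + np.exp(-a))


def fit_weighted_logit(X, y, w, n_iter=50):
    """Solve sum_k w_k * X_k^t (y_k - expit(X_k beta)) = 0 by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ beta)
        score = X.T @ (w * (y - mu))
        info = (X * (w * mu * (1.0 - mu))[:, None]).T @ X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta


rng = np.random.default_rng(7)
n = 200_000
beta_true = np.array([-3.0, 0.5])   # hypothetical target parameters
pi1, pi0 = 0.8, 0.1                 # known sampling probabilities by Z

# Cohort data with an auxiliary variable Z that depends on Y only.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.binomial(1, expit(X @ beta_true))
z = rng.binomial(1, expit(-2.5 + 4.0 * y))

# Outcome-dependent sample and inverse-probability weights.
pi = np.where(z == 1, pi1, pi0)
s = rng.binomial(1, pi).astype(bool)
weights = 1.0 / pi[s]

beta_ipw = fit_weighted_logit(X[s], y[s], weights)
print("true:", beta_true)
print("IPW :", beta_ipw)
```

Each sampled observation stands in for 1/π observations from the cohort, so observations with *Z* = 0 carry weight 10 in this example while those with *Z* = 1 carry weight 1.25. That spread in the weights is one intuitive reason for the efficiency loss of weighted estimators reported later in the simulation study.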

Alternatively, in certain scenarios, a naive analysis of the selected sample may lead to valid estimated regression coefficients, although intercept estimates will generally be biased. For example, when sampling is based only on *Z _{ij}*, then arguments outlined in Park and Kim [11] can be generalized to suggest that estimates corresponding to log odds ratios may be unbiased when we ignore the study design entirely as long as independence weighting and robust standard errors are used. That is, we may simply solve
${\sum}_{i}{\sum}_{j}{\mathit{X}}_{ij}^{t}({Y}_{ij}-{\mu}_{ij}^{s})=\mathbf{0}$ for

We have proposed a SOLR estimation strategy for marginal longitudinal regression models using ODS based on a time-varying auxiliary variable *Z _{ij}* and possibly on a time-varying or time-invariant regression model variable

In order to conduct simulations that were relevant to the motivating BioCycle Study data, we chose specific data-generating details, such as the cluster size and the strength of association between the auxiliary variable and the outcome, that are similar to the characteristics of BioCycle. For subjects *i* ∈ {1, …, 300}, we generated *n_{i}* ~ uniform(30, 40) observations per subject based on the following marginalized, first-order transition and latent variable model (MM-TLV) [29],

$$\text{logit}({\mu}_{ij}^{p})={\mathit{X}}_{ij}\mathit{\beta},\quad \text{logit}({\mu}_{ij}^{c})={\mathrm{\Delta}}_{ij}+{\alpha}_{y}{Y}_{ij-1}+{U}_{i},\quad {U}_{i}\sim N(0,{\alpha}_{u}^{2})$$

(13)

where
${\mu}_{ij}^{p}=pr({Y}_{ij}=1\mid {\mathit{X}}_{ij})$ is the population (marginal) mean model that describes the relationship between the binary response variable and covariates and
${\mu}_{ij}^{c}\equiv pr({Y}_{ij}=1\mid {\mathit{X}}_{ij},{Y}_{ij-1},{U}_{i})$ is the conditional mean or response dependence model that describes the relationships among the responses. Together, the marginal and conditional means specify a complete multivariate probability distribution [*Y** _{i}* |

The vector *X** _{ij}* = (1,

Even though there are approximately 300 · 35 = 10,500 potentially observed *Y_{ij}* values, we assume that study resources only permit ascertainment of 1200 of them, and because the

The auxiliary variable *Z_{ij}* was generated under the model
$\text{logit}({\lambda}_{ij}^{p})={\mathit{W}}_{ij}\gamma $ with covariate vector

We consider ODS on the basis of *Z _{ij}* and Str-ODS on the basis of (

We examined three estimation approaches: GEE-I, IPW-GEE-I, and SOLR. The first two were described in Section 3.3, and SOLR was described in Section 3. For the SOLR approach, we examined three specific variations differentiated by how they estimated
${\lambda}_{ij}^{p}=pr({Z}_{ij}=1\mid {\mathit{W}}_{ij})$. In SOLR-I(Y), SOLR-I(Y+X), and SOLR-I(Y*X), the assumed *W** _{ij}* was (1,

GEE-I and IPW-GEE-I are easily implemented with standard software. Although weighting is required for IPW-GEE-I, the sampling probabilities depend on the study design, and these probabilities are known. The SOLR estimators are slightly more challenging because an additional level of modeling is required, namely ${\lambda}_{ij}^{p}$. By considering validity and efficiency of the estimators, we shed light on circumstances under which combinations of designs and estimation strategies might be favored over others.

Table I shows results from the simulations regarding the validity of the estimation strategies under a number of scenarios. In all scenarios studied, the naive, GEE-I estimator of *β*_{1}, *β*_{2}, and *β*_{3} was valid with approximately 95% coverage as long as (i) *Z _{ij}* depended upon

Average parameter estimates and 95% coverage probabilities for the SOLR, IPW-GEE-I, and GEE-I procedures across 1600 replicate studies of 300 subjects with 30 to 40 observations possible per person, but with resources available to sample, in 1200 observations **...**

The reason for the observed biases, either because of ignoring a stratified design with GEE-I or by misspecifying
${\lambda}_{ij}^{p}$ with GEE-I or SOLR, results from misspecification of the offset term log{*τ _{ij}* (1,

We now discuss the impact of the study design and the estimation strategy on statistical efficiency. Table II shows empirical standard errors. In addition to the other designs and analysis strategies already discussed, we also include a random sampling approach wherein 1200 observations were randomly sampled at each replication, and GEE-I and GEE-E (i.e., GEE with exchangeable working covariance weighting) were used to fit the sampled data.

From Table II, we see that well conducted analyses using ODS designs can lead to substantial efficiency improvements relative to random sampling, and further improvements can be made via two-phase or stratified, Str-ODS designs. Focusing on SOLR estimation with a parsimonious model for *Z _{ij}* (e.g.,

The IPW-GEE-I estimation strategy is only slightly more difficult to implement than GEE-I, it is easier to implement than SOLR, and it will be valid as long as the inverse probability weights reflect the true sampling scheme; however, it can be highly inefficient relative to SOLR and to GEE-I. Inefficiency of inverse probability weighted estimators has also been observed in other biased sampling settings (e.g., [38]). The efficiency loss is particularly evident under weaker *Z _{ij}* ~

The SOLR analysis procedure that we have discussed is tailored to independent ODS designs. Subject identifiers are ignored when sampling, which depends only upon the value *Z _{ij}* or (

As discussed earlier, the BioCycle Study implemented an ODS design to maximize observed hormonal variation while adhering to resource limitations and minimizing participant burden. Our analytical goal was to examine the relationship between high (> 20 IU/L) versus low (≤ 20 IU/L) LH values and fiber intake, while adjusting for energy intake and patient characteristics including age and body mass index. In particular, as women must typically reach a certain threshold level of LH in order to trigger ovulation in a given cycle, we are interested in the impact of fiber intake on high versus low LH. Patient characteristics and experiences are shown in Table IV. Of the 245 participants included in the analysis, 199 were observed for two cycles, and 46 were observed for one cycle. The median age and body mass index were 25 years and 23 kg/m^{2}, respectively, and most subjects were observed at each of the four key time points. In approximately 6% of observed days, patients were observed with high LH. High LH was observed in approximately 22% (96 out of 434) of the peri-ovulation visits (e.g., when *Z _{ij}* = 1) and in approximately 1.0% (13 out of 1284 days) of the other clinic visits (e.g., when

In many longitudinal data analysis settings, it may be appropriate to decompose time-varying covariates into two components: one that varies exclusively between individuals and the other that varies exclusively within individuals. For example, a covariate *x _{ij}* observed in subject
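A common way to carry out such a decomposition, and the one assumed in this sketch, is to use the subject-specific mean as the between-individual component and the deviation from that mean as the within-individual component. The Python example below uses made-up fiber values for three hypothetical subjects; the variable names are ours.

```python
import numpy as np

# Hypothetical long-format data: one row per observed day.
subject = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
fiber = np.array([10.0, 14.0, 12.0, 20.0, 24.0, 8.0, 9.0, 7.0, 8.0])

# Map each row to its subject index, then average within subject.
ids, inverse = np.unique(subject, return_inverse=True)
subject_mean = np.bincount(inverse, weights=fiber) / np.bincount(inverse)

between = subject_mean[inverse]   # varies only between individuals
within = fiber - between          # varies only within individuals

for sid, b, w in zip(subject, between, within):
    print(sid, b, w)
```

The two components sum to the original covariate, and the within-individual component averages to zero for every subject, so the two regression coefficients can be interpreted separately.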

The first step in SOLR analysis is to specify a model for the auxiliary variable conditional on covariates and sampled outcomes. For the BioCycle data, we consider a range of models from the simplest that only regresses *Z* on *Y* to a rich model that allows *X*, *Y*, and interactions between *X* and *Y*. Empirically, the data suggest that *Y _{ij}* is the major predictor of

Results from the BioCycle Study, shown in Table V, indicate that between-subject changes in fiber intake were inversely related to the incidence of high LH concentrations. In particular, a 5-g/day increase (~ one apple or two slices of whole grain bread) in average fiber intake was associated with approximately a 33% to 35% decrease in the odds (log odds ratio estimates ranging from −0.421 to −0.394) of high LH. All estimators except IPW-GEE-I were approximately equally efficient for the between-subject fiber intake variable with standard errors ranging from 0.121 to 0.129. The standard error estimate based on IPW-GEE-I was 0.142. Consistent with the results from the simulation study in Section 4, the SOLR-I(Y*X) or IPW-GEE-I yielded the largest standard errors among the estimators. In addition to fiber intake, race was highly associated with LH levels. For example, on the basis of the SOLR-I(Y+X) estimator, the odds of high LH levels were 59% (coefficient estimate −0.884) lower in African Americans compared with whites.
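The percentage changes in odds quoted above follow directly from the reported log odds ratios. The short Python sketch below reproduces the conversion; the helper name `pct_change_in_odds` is ours.

```python
import math


def pct_change_in_odds(log_odds_ratio):
    """Percent change in the odds implied by a log odds ratio."""
    return 100.0 * (math.exp(log_odds_ratio) - 1.0)


reported = [
    ("fiber, 5 g/day (most negative estimate)", -0.421),
    ("fiber, 5 g/day (least negative estimate)", -0.394),
    ("race, SOLR-I(Y+X)", -0.884),
]

for label, log_or in reported:
    print(f"{label}: {pct_change_in_odds(log_or):.1f}% change in odds")
```

The fiber estimates correspond to decreases of roughly 33% to 35% in the odds of high LH, and the race coefficient corresponds to a decrease of roughly 59%, in line with the figures in the text.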

We have discussed ODS from a cohort study with longitudinal follow-up on a binary response variable and when marginal models are of interest. The design includes a time-varying auxiliary variable related to response, and sampling is conducted at the level of the individual observations, as opposed to subjects, on the basis of this auxiliary variable. We described a SOLR estimation strategy that can be used for analyses, and we showed that, when properly specified, it can be far more efficient than random sampling and inverse probability of sample weighting. While ignoring the study design can in some scenarios lead to valid and efficient estimates, the SOLR approach with a properly specified model for the auxiliary variable is as efficient and is able to address stratified sampling (which itself can lead to efficiency improvements) and complex auxiliary variable models [*Z _{ij}* |

For the SOLR approach, we suggested a relatively simplified specification of the pseudo-population model. In particular, we suggested
${\mu}_{ij}^{s}=pr({Y}_{ij}\mid {\mathit{X}}_{ij},{S}_{ij}=1)$. The more general specification of
${\mu}_{ij}^{s}=pr({Y}_{ij}\mid {\mathit{X}}_{ij},{\mathit{S}}_{i})$ that could be estimated with non-independence working covariance weighting can be applied with subject-level or within-subject sampling, with prospective or retrospective designs, and with sampling that depends upon the response or on an auxiliary variable that is related to the response. For example, Schildcrout and Rathouz [16] used a variation of this approach for subject-level sampling directly from a broader population. We believe the general SOLR strategy is applicable in a broad range of settings. With large clusters, simplifying assumptions are likely to be required, and appropriate analysis will depend upon the design and data features. An alternative analytical approach for our proposed design would be to use likelihood-based methods (e.g., Bayesian or maximum likelihood). Whereas such approaches would require joint, longitudinal modeling of [*Y*_{i},*Z** _{i}* |

This project was partially funded by the NIH grant R01 HL094786 from the National Heart, Lung, and Blood Institute, the Long-Range Research Initiative of the American Chemistry Council, and the Intramural Research Program of the *Eunice Kennedy Shriver* National Institute of Child Health and Human Development, National Institutes of Health.

1. Wactawski-Wende J, Schisterman EF, Hovey KM, Howards PP, Browne RW, Hediger M, Liu A, Trevisan M. BioCycle Study: design of the longitudinal study of the oxidative stress and hormone variation during the menstrual cycle. Paediatric and Perinatal Epidemiology. 2009;23:171–184.

2. Howards PP, Schisterman EF, Wactawski-Wende J, Reschke JE, Frazer AA, Hovey KM. Timing clinic visits to phases of the menstrual cycle by using a fertility monitor: the BioCycle Study. American Journal of Epidemiology. 2009;169:105–112.

3. Gaskins AJ, Mumford SL, Zhang C, Wactawski-Wende J, Hovey KM, Whitcomb BW, Howards PP, Perkins NJ, Yeung E, Schisterman EF. BioCycle Study Grp. Effect of daily fiber intake on reproductive function: the BioCycle Study. American Journal of Clinical Nutrition. 2009;90(4):1061–1069. doi: 10.3945/ajcn.2009.27990.

4. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.

5. Prentice RL, Pyke R. Logistic disease incidence models and case-control studies. Biometrika. 1979;66:403–412.

6. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73:1–11.

7. Navidi W. Bidirectional case-crossover designs for exposures with time trends. Biometrics. 1998;54:596–605.

8. Maclure M. The case-crossover design: a method for studying transient effects on the risk of acute events. American Journal of Epidemiology. 1991;133(2):144–153.

9. Lee AJ, McMurchy L, Scott AJ. Re-using data from case-control studies. Statistics in Medicine. 1997;16:1377–1389.

10. Chen J, Breslow NE. Semiparametric efficient estimation for the auxiliary outcome problem with the conditional mean model. The Canadian Journal of Statistics/La Revue Canadienne de Statistique. 2004;32(4):359–372.

11. Park E, Kim Y. Analysis of longitudinal data in case-control studies. Biometrika. 2004;91(2):321–330.

12. Pfeiffer RM, Ryan L, Litonjua A, Pee D. A case-cohort design for assessing covariate effects in longitudinal studies. Biometrics. 2005;61(4):982–991.

13. Cai J, Qaqish B, Zhou H. Marginal analysis for cluster-based case-control studies. Sankhyā, Series B. 2001;63(3):326–337.

14. Neuhaus JM, Scott AJ, Wild CJ. Family-specific approaches to the analysis of case-control family data. Biometrics. 2006;62(2):488–494.

15. Schildcrout JS, Heagerty PJ. On outcome-dependent sampling designs for longitudinal binary response data with time-varying covariates. Biostatistics. 2008;9:735–749.

16. Schildcrout JS, Rathouz PJ. Longitudinal studies of binary response data following case-control and stratified case-control sampling: design and analysis. Biometrics. 2010;66:365–373.

17. Schildcrout J, Heagerty P. Outcome-dependent sampling from existing cohorts with longitudinal binary response data: study planning and analysis. Biometrics. 2011. doi: 10.1111/j.1541-0420.2011.01582.x.

18. White J. A 2 stage design for the study of the relationship between a rare exposure and a rare disease. American Journal of Epidemiology. 1982;115(1):119–128. [PubMed]

19. Breslow N, Cain K. Logistic-regression for 2-stage case-control data. Biometrika. 1988;75(1):11–20.

20. Robins J, Rotnitzky A, Zhao L. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association. 1995;90(429):106–121.

21. Lin D, Ying Z. Semiparametric and nonparametric regression analysis of longitudinal data. Journal of the American Statistical Association. 2001;96(453):103–113.

22. Lin H, Scharfstein D, Rosenheck R. Analysis of longitudinal data with irregular, outcome-dependent follow-up. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 2004;66(Part 3):791–813.

23. Buzkova P, Lumley T. Semiparametric modeling of repeated measurements under outcome-dependent follow-up. Statistics in Medicine. 2009;28(6):987–1003.

24. Pepe M, Anderson G. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Communications in Statistics - Simulation and Computation. 1994;23(4):939–951.

25. Schildcrout JS, Heagerty PJ. Regression analysis of longitudinal binary data with time-dependent environmental covariates: bias and efficiency. Biostatistics. 2005;6:633–652.

26. Robins J, Rotnitzky A, Zhao L. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association. 1994;89:846–866.

27. Zeger SL, Liang KY. Longitudinal data analysis for discrete and continuous outcomes. Biometrics. 1986;42:121–130.

28. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.

29. Schildcrout JS, Heagerty PJ. Marginalized models for moderate to long series of longitudinal binary response data. Biometrics. 2007;63:322–331.

30. Heagerty PJ. Marginalized transition models and likelihood inference for longitudinal categorical data. Biometrics. 2002;58:342–351.

31. Miglioretti D, Heagerty P. Marginal modeling of multilevel binary data with time-varying covariates. Biostatistics. 2004;5(3):381–398. doi: 10.1093/biostatistics/kxg042.

32. Lee K, Daniels MJ. Marginalized models for longitudinal ordinal data with application to quality of life studies. Statistics in Medicine. 2008;27(21):4359–4380. doi: 10.1002/sim.3352.

33. Scott A, Holt D. The effect of 2-stage sampling on ordinary least-squares methods. Journal of the American Statistical Association. 1982;77(380):848–854.

34. Neuhaus J, Kalbfleisch J. Between- and within-cluster covariate effects in the analysis of clustered data. Biometrics. 1998;54(2):638–645.

35. Sheppard L. Insights on bias and information in group-level studies. Biostatistics. 2003;4(2):265–278.

36. Schildcrout JS, Sheppard L, Lumley T, Slaughter JC, Koenig JQ, Shapiro GG. Ambient air pollution and asthma exacerbations in children: an eight-city analysis. American Journal of Epidemiology. 2006;164:505–517.

37. Diggle P, Heagerty P, Liang KY, Zeger S. Analysis of Longitudinal Data. 2nd ed. Oxford University Press; 2002.

38. Lawless J, Kalbfleisch J, Wild CJ. Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society, Series B: Statistical Methodology. 1999;61(Part 2):413–438.
