Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2978757

Formats

Article sections

- Summary
- 1. Introduction
- 2. Notation and assumptions
- 3. Plug-in methods
- 4. Normality-based methods
- 5. Simulation experiments
- 6. A real example
- 7. Uneven pooling
- 8. Discussion
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2012 June 1.

Published in final edited form as:

Published online 2010 July 21. doi: 10.1111/j.1541-0420.2010.01464.x

PMCID: PMC2978757

NIHMSID: NIHMS216084

Biostatistics and Bioinformatics Branch, Division of Epidemiology, Statistics, and Prevention Research, *Eunice Kennedy Shriver* National Institute of Child Health and Human Development, Bethesda, Maryland, U.S.A

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

It has become increasingly common in epidemiologic studies to pool specimens across subjects in order to achieve accurate quantitation of biomarkers and certain environmental chemicals. In this paper, we consider the problem of fitting a binary regression model when an important exposure is subject to pooling. We take a regression calibration approach and derive several methods, including plug-in methods that use a pooled measurement and other covariate information to predict the exposure level of an individual subject, and normality-based methods that make further adjustments by assuming normality of calibration errors. Within each class we propose two ways to perform the calibration (covariate augmentation and imputation). These methods are shown in simulation experiments to effectively reduce the bias associated with the naive method that simply substitutes a pooled measurement for all individual measurements in the pool. In particular, the normality-based imputation method performs reasonably well in a variety of settings, even under skewed distributions of calibration errors. The methods are illustrated using data from the Collaborative Perinatal Project.

In this paper we consider the problem of estimating the relationship of a binary outcome with a collection of covariates, including an important exposure that is measured on pooled specimens from different subjects and not for the individual subjects. Our interest in this problem is driven by the following epidemiologic study. The Upstate New York Infant Development Screening Program (Upstate KIDS) is an ongoing study that enrolls infants at three to five months of age and follows them for three years. The study cohort consists of approximately 6000 children in Upstate New York, of which 1500 will be conceived by infertility treatment (CIT). Each CIT child will be matched with three non-CIT children for residence area and purality of birth. Of particular interest to us is a new component of the study that examines the roles of inflammation, infection and exposure to environmental chemicals, such as polychlorinated biphenyls (PCBs), in infant and child growth and neurodevelopment. Exposure information for this new component will be obtained from neonatal blood spots on the Guthrie cards to be collected at birth as mandated by state law. Each card contains five blood spots and each spot can be punched (sampled) for about 10–12 times. Given the limited amounts of specimens available, assay sensitivity is a serious issue with many of the biomarkers and chemicals to be measured from the Guthrie cards. For many of them, it is anticipated that more than 12 punches may be required to obtain reliable assay results. Thus, pooling samples across individuals may be the only reasonable option for conducting this component of the study. Such pooled exposure measurements present a statistical methodological challenge which is the subject of this paper.

The statistical implications of pooling measurements across subjects have been considered for biomarker studies (Vexler et al., 2006; Schisterman and Vexler, 2008; Schisterman et al., 2010), diagnostic testing (Faraggi et al., 2003; Liu and Schisterman, 2003; Mumford et al., 2006; Vexler et al., 2008), and longitudinal outcomes (Albert and Shih, 2009). In the Upstate KIDS study, the variables subject to pooling (e.g., PCB) will likely be treated as covariates in regression models for describing and explaining developmental problems. Weinberg and Umbach (1999) propose a case-control design where exposure measurements are pooled for cases and controls separately, and point out that a logistic regression model for the individual subjects implies a logistic regression model of the same form for pools of subjects with all covariates summed within pools. It is not immediately clear how to extend their approach to the present cohort study. In the present setting, it is straightforward to deal with continuous outcomes for which a linear regression model can be assumed. For then we can simply average all variables (both the outcome and the covariates) in the linear regression model across subjects in the same pool, in the same way that a pooled PCB measurement is obtained, and the pool averages will follow the same linear regression model as for the original measurements, with the same regression coefficients and a smaller error variance (depending on the size of the pool). It seems more challenging to deal with other types of outcomes. In this paper, we focus on binary regression models even though much of the discussion extends easily to other models (e.g., regression models for count data).

The statistical problem here could be cast in terms of errors in variables (Carroll et al., 2006) by treating a pooled PCB measurement as an imprecise measurement of the PCB level of each individual subject in the pool. On the other hand, the pooling problem has a special structure that merits separate consideration. In this paper, we derive and compare several methods for the pooling problem that can be regarded as variants of regression calibration, a general approach to measurement error models (Carroll et al., 2006, Chapter 4). We set up the notation and state basic assumptions in the next section. Then we adapt the regression calibration approach to the pooling problem and derive some specific methods in Sections 3 and 4. The methods are compared through simulations in Section 5 and illustrated with a real example in Section 6. We then consider some more flexible pooling schemes in Section 7 and conclude the paper with a discussion in Section 8.

Let us write *X* for a continuous exposure variable subject to pooling (e.g., PCB), *Y* for a binary outcome of interest (e.g., abnormal psychomotor development), and ** Z** for a vector of other covariates to adjust for. Suppose that

(1)

where *G* is a known function such as the probit function, and
is a vector of unknown regression parameters we would like to estimate. The problem is that *X* is not measured for each individual subject, but rather for pooled specimens only. The pooling mechanism will be discussed shortly.

Following the regression calibration approach (Carroll et al., 2006), we would like to “calibrate” model (1) with the available information that is predictive of *X* and then recover ** β** from the calibrated model. This can be implemented with the following calibration model:

(2)

where ** V** is an optional collection of variables that may help to predict

(3)

where we assume that *ε* is independent of ** W** with E(

It is important to note that the information in ** V**, if any, is strictly for predicting

(4)

that is, ** V** is conditionally independent of

Our consideration of the pooling mechanism will proceed in steps, starting with simplistic assumptions and then moving to more realistic and flexible pooling schemes. To begin, assume that the subjects are pooled into groups of identical size *m*, that subjects in the same group contribute equally to the pooled measurement of *X*, and that the pooling is completely random with respect to all variables mentioned above. The last assumption (random pooling) will be relaxed at the end of this section, and the others in Section 7 (uneven pooling). Let us use the subscripts *ij* to denote the *j*th subject in the *i*th pool, and write
for the measured value of *X* in the *i*th pool. The observed data can then be written as {* _{i}*, (

(5)

(6)

where *X** _{i}* = (

Although we previously made a random pooling assumption, this assumption can be relaxed through covariate-dependent pooling. Specifically, since equations (5) and (6) are both conditional on **W*** _{i}*, we can, and will, allow covariate-dependent pooling where the grouping of subjects may depend on their

Without observing the true values of the *X _{ij}*, one might be tempted to substitute

(7)

where *f*(*x _{ij}*|

Without additional assumptions we cannot evaluate the integral in (7) exactly. We could, however, approximate the integrand (as a function of *x _{ij}*) through a first-order Taylor series expansion around the conditional mean E(

where *Ġ* is the first derivative of *G*. Integrating this with respect to the conditional density *f*(*x _{ij}*|

(8)

where the second step follows by substituting (5). This is actually the exact model for the identity link, i.e., when *G* is the identity function. For other link functions, the quality of the approximation depends on the magnitude of the calibration error, which is determined by *σ*^{2}: the smaller *σ*^{2} is, the better the approximation tends to work. Methods based on model (8) will be referred to as plug-in methods. It is shown in Appendix B that the *Y _{ij}* in the same pool are generally correlated with each other given (

We propose two ways to configure the covariates for GEE analysis of model (8):

Fit the model with the augmented covariate vector
and then focus on the first three covariate components for inference about ** β**, essentially treating (

Combine the last two terms in model (8) and fit the model with the covariate vector
, where
and _{w} is an estimate of *α*_{w} obtained from a least squares regression of * _{i}* on
. The estimate

(9)

which is immediate from (3). Note that * _{ij}* is the predicted value of

For the augmentation method, inference about ** β** can be based directly on standard errors from the GEE analysis, bearing in mind the approximate nature of model (8). For the imputation method, standard asymptotic theory can be used to account for the data-dependent nature of the

(10)

where *G** _{i}*(

In this section, we consider how to improve the approximate model (8) with a normality assumption. In addition to the previous assumptions concerning model (3), we assume that the error term *ε* is normally distributed. The normality assumption can be checked through model (9), and it implies that, given (* _{i}*,

where *f*(·|·) denotes a generic conditional density or mass function. The full likelihood for (** α**,

The idea may be easier to communicate for the log link. In this case, *G* is the exponential function, and we can evaluate the integral in (7) explicitly as

(11)

where
is identical to ** β** except for the intercept:
. This follows from the fact that E{exp(

(12)

where
and Φ is the standard normal distribution function. Thus the calibrated model is also a probit regression, and *β*_{probit} is a rescaling of ** β** with all components attenuated by the same factor. As a result of the attenuation, |

(13)

where
. As in the probit case, *β*_{logit} is an attenuation of ** β** by a factor that further depends on

Each of the three models (11)–(13) can be fit with GEEs, using either the augmentation method or the imputation method described earlier. In each case, the “working” regression parameter vector *β*_{link}, where the subscript could be any of the three link functions considered above, is zero if and only if ** β** =

(14)

where _{probit,}* _{x}* is the GEE estimate and

Under the assumed models, the estimators derived from models (11)–(13) are consistent (approximately in the logit case) and asymptotically normal. For the augmentation method, this follows directly from the delta method. For the imputation method, we could use a joint estimating function argument as in Section 3, followed by the delta method. Robust variance estimates from an augmented GEE analysis are directly applicable to *β** _{z}* and

The proposed methods are evaluated in finite samples using simulated data. In the following experiments, *Z* and *V* (both scalars) are generated independently as standard normal variables, and *X* follows the calibration model (2) with ** W** = (

Empirical bias and standard deviation (SD) for estimating
under random pooling with normal calibration errors (see Section 5 for details)

Empirical bias and standard deviation (SD) for estimating
under random pooling with skewed calibration errors (see Section 5 for details)

Empirical bias and standard deviation (SD) for estimating
under stratified pooling with normal calibration errors (see Section 5 for details)

Each simulated sample is analyzed using the following six methods. FD is a logistic regression analysis based on the full data including the value of *X* for each individual subject. This is not a competing method for pooled data, but it serves as an indicator for the amount of information in the data generated. NV is the naive approach that simply substitutes * _{i}* for the unobserved values

The methods are compared in terms of empirical bias and standard deviation for estimating each component of
. Table 1 presents the results for normally distributed errors in the calibration model (2). The first two scenarios, with *σ* = 0.5 and *β _{x}* = 1, clearly demonstrate the relevance of the proposed methods, which effectively remove most of the bias associated with the naive approach. (Given the observed variability, an empirical bias of 0.03 would be significantly different from 0, and so would a bias difference of 0.06 between any two methods.) As expected, the regression calibration estimators are more variable than the FD and NV estimators. In the latter case, the increase in variability seems a reasonable price to pay for bias reduction. The imputation estimators, designed to improve upon their augmentation counterparts for better efficiency, appear to have achieved that objective. Another general observation is that the normality-based methods tend to be more variable than the plug-in methods, which is not surprising given the additional correction involved in the former approach. It may also be of interest to note that the intercept

The real question concerning the normality-based methods is how they perform when the calibration error *ε* is not normally distributed. Our numerical experience with logistic and Laplace distributions (results not shown) suggests that bias levels similar to those seen in Table 1 should be expected under symmetric, unimodal distributions. Therefore, a skewed distribution for *ε* is chosen for the next set of simulations. Specifically, we adopt a shifted exponential model where *σ*^{−1}*ε* + 1 is exponentially distributed with mean 1. The results are presented in Table 2. Since the distribution of *X* is no longer symmetric, here we also consider a negative intercept *β*_{1} = −1 in addition to the values (0, 1) in Table 1. With *σ* = 0.5, Table 2 shows that all four regression calibration methods perform well with negligible bias relative to the naive approach. These methods become more biased when *σ* = 1, but they clearly outperform the naive approach. Despite serious misspecification of the error distribution, the normality-based methods still preform relatively well, with bias usually less than 0.1.

Table 3 illustrates the effects of stratified pooling in the case of normal calibration errors. Here the pooling is stratified by the signs of *Z* and *V* and is implemented as follows. Once a list of 400 subjects is generated randomly, it is ordered by the signs of *Z* and *V* into four blocks, each of which is homogeneous with respect to the signs of these variables. We then go through the list and group every four subjects into a pool. Because the sizes of the strata are random, there will be at most three heterogeneous pools which consist of subjects from different strata. In Table 3, the naive method is clearly less biased than in Table 1, which is reasonable to expect because the *X _{ij}* in a homogeneous pool will tend to be similar to each other and therefore to

We have also experimented with different sample sizes and pool sizes, and the results are qualitatively similar to those reported here.

The Upstate KIDS study motivating this research is still ongoing, and the component concerning PCB exposure is in the design stage, so the data are not yet available to illustrate and compare the various methods. Here we apply the methods to a similar dataset analyzed and reported by Daniels et al. (2003), who evaluated the association between prenatal PCB exposure and children’s neurodevelopment at eight months. The data were collected as part of the Collaborative Perinatal Project (CPP), a large cohort study that followed the growth and neurologic development of approximately 55000 children born in 1959 through 1966 in the United States. The PCB measurements in the CPP study were based on third-trimester maternal serum samples, and not on neonatal blood spots as in the Upstate KIDS study. However, Rogan et al. (1986) have shown that PCB levels tend to be strongly correlated between maternal and cord serum at birth (*r* > 0.7). The Bayley Scales of Infant Development was used to measure children’s mental and psychomotor development, and the scores were standardized to have population mean 100 and standard deviation 16.

From Daniels et al. (2003), a simple random sample of 1065 children is available for our analysis. Daniels et al. (2003) have found that the association of PCB exposure with psychomotor scores varies by study center, which implies an interaction between ** Z** and

It is easy to pool the subjects artificially, average the PCB measurements in each pool, and then apply the various methods for pooled data. However, the results based on a single pooled dataset could be arbitrary and difficult to interpret. Thus it is more informative to repeat this pooling process many times in evaluating the different approaches. Table 4 gives a summary based on 1000 repetitions using the same labels for the different methods as in the previous section. Here the regression calibration methods are implemented with the mother’s race, education and cholesterol level as predictors for her PCB level. Each method produces, over repeated pooling, 1000 estimates of the log-odds ratio (*β _{x}*) relating low PDI to PCB, which are then averaged and compared with the full-data estimate. We also calculate standard deviations (SDs) in an analogous fashion. The pool size is either 3 or 6, and the pooling process is either completely random or stratified by the discrete components of

We have assumed so far that the pool size is constant and that the weighting within each pool is uniform. Real life is more complicated than this. In the Upstate KIDS study, for example, the number of punches available from each subject is variable and difficult to predict, and consequently, the number of subjects to pool together cannot be fixed, and the subjects in the same pool need not contribute equally. We now consider more flexible pooling schemes that allow uneven pooling. We write *m _{i}* for the size of the

Except for a more flexible pooling scheme, we retain the previous assumptions including (1)–(4) as well as the normality assumption in Section 4. Some implications of the new pooling scheme should be noted before we consider inference for specific link functions. First, estimation of the calibration model is now based on the relationship

(15)

where
and
. Since * _{i}* has variance
, which varies across pools, a weighted least squares procedure that assigns weight
to the

This follows from the joint normal distribution of
given **W*** _{i}*.

For the identity link, it is easy to see that

where
. This suggests an imputation method that replaces each *X _{ij}* with

For the log link, it can be shown as before that

(16)

In this case, ** β** could be estimated from a standard GEE analysis regressing

For the probit link, we have

Because the denominator in Φ(·) also depends on (*i*, *j*), it is no longer feasible to simply “absorb” it into the regression coefficients and apply the conversion formulae of Section 4. A custom GEE program could be developed to fully utilize the special structure of the problem. Once again, the logit case can be handled by numerical integration or by the aforementioned approximation by the probit link. In the latter case, the approximate model is given by

It has become increasingly common in epidemiologic studies to pool specimens for quantitative measurement of biomarkers and environmental chemicals, often due to limits on assay sensitivity or sample availability. These pooled measurements often provide valuable information about an important exposure, which should be used effectively and efficiently to recover the individual-level regression relationship. Here we take a regression calibration approach to this problem and derive several methods for binary outcome data. Theoretical considerations and numerical results suggest that the regression calibration methods generally perform better than the naive method that simply substitutes a pooled measurement for all individual measurements in the pool. In particular, the normality-based imputation method appears quite effective in a variety of situations, even with non-normal calibration errors. The methods extend easily to variable pool sizes and unequal weights within pools.

Some issues remain to be addressed. It has been assumed here that pooling is uninformative (possibly dependent on ** W** but independent of everything else given

This research was supported by the Intramural Research Program of the National Institutes of Health, *Eunice Kennedy Shriver* National Institute of Child Health and Human Development. The authors would like to thank Drs. Enrique Schisterman and Mary Hediger for helpful discussions, Drs. Matthew Longnecker, Zhen Chen and Mark Klebanoff for providing the CPP data, and the Associate Editor and an anonymous referee for insightful comments that improved the manuscript.

A conditioning argument yields

(A.1)

Since the pooling depends only on the ** W**’s, we have that, given

(A.2)

where the second step follows from assumptions (4) and (1). To evaluate (A.1), we can integrate (A.2) with respect to the conditional distribution of (*X _{ij}*|

For distinct subjects *j*_{1}, *j*_{2} in the *i*-th pool, it can be argued as in Appendix A that

which follows from assumptions (1)–(4). Because *X*_{ij1} and *X*_{ij2} are stochastically dependent given (* _{i}*,

hence the correlation.

- Albert PS, Shih JH. Modeling batched Gaussian longitudinal data subject to informative dropout. 2009. Unpublished manuscript.
- Burr D. On errors-in-variables in binary regression-Berkson case. Journal of the American Statistical Association. 1988;83:739–743.
- Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. 2. London: Chapman & Hall; 2006.
- Carroll RJ, Spiegelman CH, Lan KKG, Kent KT, Abbott RD. On error-in-variables for binary regression models. Biometrika. 1984;71:19–25.
- Daniels JL, Longnecker MP, Klebanoff MA, Gray KA, Brock JW, Zhou H, Chen Z, Needham LL. Prenatal exposure to low-level polychlorinated biphenyls in relation to mental and motor development at 8 months. American Journal of Epidemiology. 2003;157:485–492. [PubMed]
- Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine. 2003;22:2515–2527. [PubMed]
- Fearn T, Hill DC, Darby SC. Measurement error in the explanatory variable of a binary regression: Regression calibration and integrated conditional likelihood in studies of residential radon and lung cancer. Statistics in Medicine. 2008;27:2159–2167. [PubMed]
- Follman DA, Hunsberger SA, Albert PS. Repeated probit regression when covariates are measured with error. Biometrics. 1999;55:403–409. [PubMed]
- Johnson NL, Kotz S. Distributions in Statistics. Vol. 2. Boston: Houghton-Mifflin; 1970.
- Liang K-Y, Liu X-H. Estimating equations in generalized linear models with measurement error. In: Godambe AP, editor. Estimating Functions. Oxford: Clarendon Press; 1991.
- Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
- Liu A, Schisterman EF. Comparison of diagnostic accuracy of biomarkers with pooled assessments. Biometrical Journal. 2003;45:631–644.
- Mumford SL, Schisterman EF, Vexler A, Liu A. Pooling biospecimens and limits of detection: effects on ROC curve analysis. Biostatistics. 2006;7:585–598. [PubMed]
- Rogan WJ, Gladen BC, McKinney JD, Carreras N, Hardy P, Thullen J, Tingelstad J, Tully M. Polychlorinated biphenyls (PCBs) and dichlorodiphenyl dichloroethene (DDE) in human milk: effects of maternal factors and previous lactation. American Journal of Public Health. 1986;76:172–177. [PubMed]
- Schisterman EF, Vexler A. To pool or not to pool, from whether to when: applications of pooling to biospecimens subject to a limit of detection. Pediatric and Perinatal Epidemiology. 2008;22:486–496. [PMC free article] [PubMed]
- Schisterman EF, Vexler A, Mumford SL, Perkins NJ. Hybrid pooled-unpooled design for cost-efficient measurement of biomarkers. Statistics in Medicine. 2010;29:597–613. [PMC free article] [PubMed]
- Stefanski LA, Boos DD. The calculus of M-estimation. The American Statistician. 2002;56:29–38.
- Thoresen M, Laake P. A simulation study of measurement error correction methods in logistic regression. Biometrics. 2000;56:868–872. [PubMed]
- Tosteson TD, Stefanski LA, Schafer DW. A measurement-error model for binary and ordinal regression. Statistics in Medicine. 1989;8:1139–1147. [PubMed]
- van der Vaart AW. Asymptotic Statistics. Cambridge: Cambridge University Press; 1998.
- Vexler A, Liu A, Schisterman EF. Efficient design and analysis of biospecimens with measurements subject to detection limit. Biometrical Journal. 2006;48:780–791. [PubMed]
- Vexler A, Schisterman EF, Liu A. Estimation of ROC curves based on stably distributed biomarkers subject to measurement error and pooling mixtures. Statistics in Medicine. 2008;27:280–296. [PMC free article] [PubMed]
- Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726. [PubMed]
- Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44:1049–1060. [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |