

Article sections

- Summary
- 1. Introduction
- 2. Evaluation of designs for biospecimens measurements with a DT: which design yields more numerical information?
- 3. Pooled–unpooled design
- 4. Maximum Likelihood Estimation (MLE)
- 5. Pooling and random error in design
- 6. Discussion
- References

Paediatr Perinat Epidemiol. Author manuscript; available in PMC 2009 September 23.

Published in final edited form as:

PMCID: PMC2749284

NIHMSID: NIHMS133531

Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, National Institutes of Health, Rockville, MD, USA

Correspondence: Enrique F. Schisterman, Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, 6100 Executive Boulevard, 7B03N, Rockville MD 20852, USA. E-mail: schistee/at/mail.nih.gov

The publisher's final edited version of this article is available at Paediatr Perinat Epidemiol


Pooling of biological specimens has been utilised as a cost-efficient sampling strategy, but cost is not the only limiting factor in biomarker development and evaluation. We examine the effect of different sampling strategies for biospecimens used in exposure assessment when values below a detection threshold (DT) cannot be measured. The paper compares the use of pooled samples with that of a randomly selected sample from a cohort in order to evaluate the efficiency of parameter estimates.

The proposed approach shows that a pooling design is more efficient than a random sample strategy under certain circumstances. Moreover, because pooling minimises the amount of information lost below the DT, the use of pooled data is preferable (in a context of a parametric estimation) to using all available individual measurements, for certain values of the DT. We propose a combined design, which applies pooled and unpooled biospecimens, in order to capture the strengths of the different sampling strategies and overcome instrument limitations (i.e. DT). Several Monte Carlo simulations and an example based on actual biomarker data illustrate the results of the article.

The use of biomarkers for exposure assessment is common in epidemiology. The power gained by using a large sample of individuals must be weighed against the cost of performing many assays. After reproducibility and variability are established for the biomarker, financial constraints usually limit further evaluation to small sets of samples. For example, the cost of a single assay measuring polychlorinated biphenyls (PCBs) is up to $1000 so only small studies have been able to examine, for example, whether PCBs are associated with cancer or endometriosis.^{1}^{,}^{2} However, the imprecision of the results limits the conclusions that can be drawn for the suggested association.

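Currently, two different approaches have been suggested to evaluate expensive biomarkers. Suppose we have biological specimens from a patient population *A* of size *N*, *A* = {*A*_{1}, *A*_{2}, …, *A*_{N}}, with test results *X* = {*X*_{1}, *X*_{2}, …, *X*_{N}}. Under the random sampling approach, only a randomly selected subset of the specimens is assayed individually.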

Alternatively, a pooling strategy may be employed where two or more specimens are physically combined into a single ‘pooled’ unit for analysis. Thus, a greater portion of the population is assayed for the same price compared with the random sampling approach. The amount of information per assay increases so the number of assays needed to achieve equivalent information decreases.^{3}^{–}^{6} Formally, the samples from patient population *A* are randomly combined into *n* = *N*/*p* pooled specimens of size *p*. The *n* pooled assays are considered the average of the contributing individual results, i.e.

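$$X_i^{(p)} = \frac{1}{p}\sum_{j=1}^{p} X_{k_{ji}}, \qquad i = 1, \dots, n, \qquad (1.1)$$

where {*k*_{1i}, …, *k*_{pi}} are the indices of the *p* individual specimens combined into the *i*-th pool.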

The concept of pooling biospecimens can be utilised in population-based epidemiological studies to explore the relationship between biomarker levels and outcome. The method’s primary goal is to establish distributional parameters for a specific biomarker. Consequently, pooling can be seen as a primary tool for case–control and cohort studies exploring discrete outcomes. The technique has been explored extensively in the literature, starting with publications related to cost-efficient syphilis testing of World War II recruits.^{7} Weinberg and Umbach introduced pooling to estimate odds ratios for case–control studies.^{8} Faraggi *et al.*^{3} and Liu and Schisterman^{4}^{,}^{5} examined the effect of pooling on inference about the area under the Receiver Operating Characteristic curve.

Cost is not the only limiting factor in biomarker evaluation. Instrument sensitivity may also be problematic. Another common complexity arises when some participants have levels below the detection threshold (DT).^{9} Under these circumstances, biomarker values at or above the DT are measured and reported, but values below the DT are unobservable. Formally, instead of *X*, we observe *Z* = {*Z*_{1}, *Z*_{2}, …}, such that

$$Z_i = \begin{cases} X_i, & X_i \ge d, \\ \text{N/A}, & X_i < d, \end{cases} \qquad (1.2)$$

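where *d* is the value of the DT. Similarly, for the pooling design, we observe *Z*^{(p)} = {*Z*_{1}^{(p)}, …, *Z*_{n}^{(p)}}, where

$$Z_i^{(p)} = \begin{cases} X_i^{(p)}, & X_i^{(p)} \ge d, \\ \text{N/A}, & X_i^{(p)} < d. \end{cases} \qquad (1.3)$$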

A variety of approaches have been used to analyse data with a lower DT. Substitution of *d*/2 or *d*/√2 for observations below the DT has been previously described.^{9}^{–}^{11} These values are based on the assumption of a normal (*d*/2) or lognormal (*d*/√2) distribution.^{12} Lubin and colleagues proposed multiple imputation based on bootstrapping when the exposure distribution function is known.^{13} Recent work shows that substitution of *E*(*X* | *X* < *d*) for data below the DT allows for unbiased estimation of linear and, under certain conditions, logistic regression parameters.^{12} Schisterman and colleagues have shown that unbiased estimates may also be obtained non-parametrically if data below the DT are replaced by zero for no intercept models and by an estimator of *E*(*X* | *X* ≥ *d*) for intercept models.^{12}

The main objective of this paper is to examine parameter estimation and efficiency of the pooling approach compared with the random sampling approach for assays with a DT. In Section 2, we compare the numerical (quantifiable) information available and efficiency of each sampling scheme. In Section 3, we propose a mixed (unpooling–pooling) design, which takes advantage of the strengths of each approach. In Section 4, we present maximum likelihood techniques to be utilised with the different designs. In Section 5 we illustrate methods to account for pooling and random measurement error and in Section 6 we present our conclusions.

The efficiency of pooling and that of random sampling are compared to determine which design yields more numerical information. Efficiency, here, weighs the available information against the inherent limitations of each design. For clarity, we assume that *X* has a normal distribution. However, the conclusions from this section hold for most commonly used distributions, including the gamma.

Figure 1a plots the density function of the normally distributed biomarker *X* with a DT at *d* = −1. The shaded area corresponds to values of *X* below the DT, where missing values would be reported. The unshaded area corresponds to reportable numerical values of *X*. In this case, as Pr{*X*_{1} < −1} ≈ 0.16, the expected proportion of observations below the DT is approximately 16%. Pooling the specimens reduces the effective variance of biomarker *X*, i.e. by definition (1.1) the variance of the pooled samples is Var(*X*)/*p* and the mean is 0.^{3} For the pooled samples, assuming *p* = 2, Pr{*X*_{1}^{(2)} < −1} ≈ 0.08, so approximately 8% of the pooled observations are below the DT as shown in Fig. 1c. Thus, the expected number of unobserved test results from the random sample design is about twice (0.16/0.08 ≈ 2) the expected number of N/As under the pooling strategy.
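The two tail probabilities quoted above can be checked numerically. The following is a minimal sketch, assuming a standard normal biomarker (mean 0, variance 1), which is what Pr{*X*_{1} < −1} ≈ 0.16 implies; the function name `prob_below_dt` is illustrative and does not come from the paper.

```python
from math import sqrt
from statistics import NormalDist


def prob_below_dt(d, mu=0.0, sigma=1.0, p=1):
    """Pr{pooled value < d} when each pool averages p i.i.d. N(mu, sigma^2) values."""
    # Averaging p independent values keeps the mean and divides the variance by p.
    pooled = NormalDist(mu, sigma / sqrt(p))
    return pooled.cdf(d)


unpooled = prob_below_dt(-1.0, p=1)  # about 0.16
pooled = prob_below_dt(-1.0, p=2)    # about 0.08
print(round(unpooled, 3), round(pooled, 3), round(unpooled / pooled, 2))
```

Calling `prob_below_dt(1.0, p=2)` and `prob_below_dt(1.0, p=1)` shows the reverse ordering when the DT lies above the mean.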

Normally distributed data constrained by a detection threshold (shaded area represents unobserved data). DT, detection threshold.

The rationale for pooling in this case is to take advantage of the statistical properties of averages through physical implementation, i.e. the value of pooled specimens is the mean of the individual biomarker values.

The pooling strategy provides more numerical observations than the random sampling approach with equivalent initial sample size, because the pooled distribution *X*^{(p)} with variance Var(*X*)/*p* is more concentrated around the expectation μ = 0. Hence, the pooling strategy is more efficient than the random sample in estimating the mean and variance. Note that if *d* = −∞ the maximum likelihood estimators of μ based on full data *Z* and pooled data *Z*^{(p)} have equal efficiency.^{3} In Situation 1, the ratio of the expected number of numerically observed test results of set *Z* to that of set *Z*^{(p)} is *p* Pr{*X*_{1} ≥ *d*}/Pr{*X*_{1}^{(p)} ≥ *d*} < *p*, i.e. the number of numerically observed results from *Z*^{(p)} increases relative to the observed numerical elements of *Z*. Moreover, although *N* Pr{*X*_{1} ≥ *d*} > (*N*/*p*) Pr{*X*_{1}^{(p)} ≥ *d*}, we cannot conclude that the observed pooled data *Z*^{(p)} has less numerical information than the full data set *Z*.

Consider an example with four unpooled individual specimens: *X*_{1} = 3.1, *X*_{2} = 3.5, *X*_{3} = 4.0 and *X*_{4} = 2.0. If DT = 3.0, *Z* = {3.1, 3.5, 4.0, N/A}, where N/A signifies a value below the DT. If *p* = 2, the pooled samples include only two numerical observations:

$$X_1^{(2)} = \frac{3.1 + 3.5}{2} = 3.3, \qquad X_2^{(2)} = \frac{4.0 + 2.0}{2} = 3.0,$$

yielding *Z*^{(2)} = {3.3, 3.0}. In this example, the value of *X*_{4} is not ignored by the pooled data, which are less affected by the DT than the full sample.
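The four-specimen example can be reproduced in a few lines. This is a sketch only: the helper names `pool` and `observe` are hypothetical, `None` stands in for N/A, and consecutive specimens are pooled together (rather than randomly) so that the pairs match the example.

```python
def pool(values, p):
    """Average consecutive groups of p values, as in definition (1.1)."""
    if len(values) % p != 0:
        raise ValueError("number of specimens must be a multiple of p")
    return [sum(values[i:i + p]) / p for i in range(0, len(values), p)]


def observe(values, d):
    """Apply the detection threshold: values below d are reported as None (N/A)."""
    return [v if v >= d else None for v in values]


x = [3.1, 3.5, 4.0, 2.0]
d = 3.0
z = observe(x, d)                   # individual assays: X_4 is lost
z_pooled = observe(pool(x, 2), d)   # pooled assays with p = 2: both are observed
print(z)
print([None if v is None else round(v, 2) for v in z_pooled])
```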

Figures 1b and 1d depict the case in which the DT lies above the mean of *X*, where the situation previously described is reversed. If the DT is located at 1, for example, more pooled samples than unpooled samples will have values below the DT. As shown in Fig. 1b, the amount of unobserved data (shaded area) is smaller in the unpooled data than in the pooled data. Hence, in terms of the number of numerical observations, pooling is beneficial when the DT is below the mean and detrimental when it is above the mean.

Nevertheless, in Situation 2, the pooling strategy might still be more efficient than random sampling. Intuitively, the pooled observations might be more informative than the unpooled observations because each pooled observation is based on more than one test result.

When the DT is much greater than the mean biomarker value, the pooling strategy is completely inefficient because the pooled data are based upon substantially less numerical information than a random sample of unpooled data.

In order to demonstrate the conclusions from Situations 1–3 with respect to sample size, we generated a random sample *X*_{1}, …, *X*_{N}

Consider the situation where an assay is relatively inexpensive and could be measured for every participant. As previously stated, numerical values are not assigned when *X* is below the DT. However, knowledge of the data below the DT is important for inference. Richardson and Ciampi suggested imputing *E*(*X* | *X* < *d*) for values below the DT.^{14} Because cost is not an impediment in this example, we propose to assay the individual specimens and then pool the specimens and assay the pooled samples as well. As described in Section 2, when an individual specimen with a value less than the DT is pooled, the pooled sample may have a numerical result. Therefore, the individual’s *X* may be back-calculated (reconstructed) using the pooled results and the individual results from the other samples in the pool. The combination of pooling results with traditional unpooled measurements can produce numerical results for the maximum number of study participants, including some below the DT.

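In this discussion we use *p* = 2 without loss of generality. Consider an individual *k* with an unpooled value *X*_{k} < *d*, so that *Z*_{k} = N/A, whose specimen is pooled with that of an individual *l* with *X*_{l} ≥ *d*. If the pooled value *X*^{(2)} = (*X*_{k} + *X*_{l})/2 is itself at or above the DT, it is numerically observed, and the unobserved individual value can be reconstructed as *X*_{k} = 2*X*^{(2)} − *X*_{l}.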

By combining the unpooled and pooled strategies, the values of some observations below the DT can be calculated, allowing *E*(*X* | *X* < *d*) to be non-parametrically estimated using the method proposed by Richardson and Ciampi.^{14} We call the proposed technique the pooled–unpooled resampling design.
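The back-calculation step of this design can be sketched as follows for *p* = 2. The function name and the convention that specimens 2*i* and 2*i* + 1 form pool *i* are assumptions made for illustration; `None` again marks an N/A result. The data reuse the four-specimen example from Section 2, and pooling error is ignored.

```python
def reconstruct_pairs(z, z_pooled):
    """Recover below-DT individual values from pools of size 2.

    z        : individual results (None = N/A); specimens 2i and 2i+1 form pool i
    z_pooled : pooled results (None = N/A)
    Returns a copy of z with reconstructed values filled in where possible.
    """
    out = list(z)
    for i, pooled in enumerate(z_pooled):
        a, b = z[2 * i], z[2 * i + 1]
        if pooled is None:
            continue                       # pool itself is below the DT
        if a is None and b is not None:
            out[2 * i] = 2 * pooled - b    # X_k = 2 * X^(2) - X_l
        elif b is None and a is not None:
            out[2 * i + 1] = 2 * pooled - a
        # if both are N/A only their sum is known, so neither is reconstructed
    return out


z = [3.1, 3.5, 4.0, None]   # individual assays with DT = 3.0
z_pooled = [3.3, 3.0]       # pooled assays of (X1, X2) and (X3, X4)
print(reconstruct_pairs(z, z_pooled))
```

Here the value of *X*_{4}, unobserved in the individual assay, is recovered from the second pool and its partner *X*_{3}.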

Having introduced the pooled–unpooled hybrid design, we can utilise maximum likelihood to estimate unknown parameters of a biomarker’s distribution. We can consider a biomarker that follows *X*_{i} ~ *N*(μ, σ^{2}).

Efficiency of the maximum likelihood estimators: $\hat{\mu}$ and $\hat{\sigma}$ are plotted in graphs (a) and (b) respectively. Curves (------), (——) and (·········) correspond to databases ...

The estimates of μ and σ based on the pooled data are more efficient than those based upon the random sample for *d* < μ. However, if *d* ≫ μ, then the pooling strategy is not recommended.
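To make the estimation step concrete, the sketch below fits a normal distribution to data left-censored at the DT. It uses an EM iteration, a technique swapped in here for illustration and not necessarily the numerical procedure used in the paper; it converges to the maximiser of the censored-data likelihood (density terms for observed values, Pr{*X* < *d*} for each N/A). The function name, seed and simulation settings are assumptions.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal


def censored_normal_mle(observed, n_censored, d, n_iter=500, tol=1e-10):
    """MLE of (mu, sigma) for N(mu, sigma^2) data left-censored at d, via EM.

    observed   : numerically observed values (all >= d), must be non-empty
    n_censored : number of N/A results (known only to be < d)
    """
    n = len(observed) + n_censored
    mu = sum(observed) / len(observed)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in observed) / len(observed)) or 1.0
    for _ in range(n_iter):
        alpha = (d - mu) / sigma
        # E-step: mean and variance of X given X < d under the current parameters
        lam = STD.pdf(alpha) / max(STD.cdf(alpha), 1e-300)
        e1 = mu - sigma * lam
        var1 = sigma ** 2 * (1.0 - alpha * lam - lam ** 2)
        # M-step: complete-data normal MLE with the censored moments plugged in
        new_mu = (sum(observed) + n_censored * e1) / n
        ss = sum((x - new_mu) ** 2 for x in observed)
        ss += n_censored * (var1 + (e1 - new_mu) ** 2)
        new_sigma = math.sqrt(ss / n)
        converged = abs(new_mu - mu) + abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


random.seed(1)
d = -1.0
x = [random.gauss(0.0, 1.0) for _ in range(2000)]   # true mu = 0, sigma = 1
obs = [v for v in x if v >= d]
mu_hat, sigma_hat = censored_normal_mle(obs, len(x) - len(obs), d)
print(round(mu_hat, 3), round(sigma_hat, 3))
```

For pooled data the same function estimates the standard deviation of the pooled values, so under definition (1.1) the estimate would be multiplied by √*p* to recover σ for individual specimens.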

We can estimate parameters based on data following a gamma distribution in a similar manner.^{16} The gamma-shape parameter of the pooled data is *p*× the shape parameter of unpooled data and the scale parameter of the pooled data is 1/*p*× the scale parameter of unpooled data.^{3} The conclusions for the gamma case are similar to the normal.
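The gamma parameter relationship can be verified with a short simulation. This is a sketch under assumed, purely illustrative parameter values (shape 2, scale 3, *p* = 2); `pooled_gamma_params` is a hypothetical helper name.

```python
import random
from statistics import fmean, pvariance


def pooled_gamma_params(shape, scale, p):
    """Shape and scale of the average of p i.i.d. Gamma(shape, scale) values."""
    # A sum of p i.i.d. gammas is Gamma(p * shape, scale); dividing by p rescales the scale.
    return p * shape, scale / p


shape, scale, p = 2.0, 3.0, 2
pooled_shape, pooled_scale = pooled_gamma_params(shape, scale, p)

# Monte Carlo check: pooled averages should match the implied mean and variance.
rng = random.Random(7)
pools = [fmean(rng.gammavariate(shape, scale) for _ in range(p)) for _ in range(20000)]
print(pooled_shape * pooled_scale, round(fmean(pools), 2))            # mean: theory vs simulation
print(pooled_shape * pooled_scale ** 2, round(pvariance(pools), 2))   # variance: theory vs simulation
```

The mean is unchanged by pooling while the variance is divided by *p*, mirroring the normal case.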

The likelihood function in Section 4.1 is composed of two parts: one related to N/A-observed data (where *X* < *d*) and a second for numerically observed data (where *X* ≥ *d*); estimation for pooled–unpooled resampled data involves three kinds of data. The first sample (*S*_{1}) has only N/A elements. Test results in this sample were initially below the DT and have not been reconstructed by applying the pooling resampling. Thus, for all *k* = 1, …, *N*, we have

(3.1)

The second sample (*S*_{2}) has reconstructed elements. Test results in this sample were initially below the DT and have been reconstructed by applying the pooling resampling. Therefore, elements of set *S*_{2} have distribution function

(3.2)

The last sample (*S*_{3}), as in Section 4.1, includes the numerically observed data. The likelihood function is a product of the densities that correspond to (3.1), (3.2) and the case, where numerical results were initially observed. We describe the likelihood in detail in Appendix formula (A.3).

To illustrate the proposed method, we generated a random sample {*X*_{1}, …, *X*_{N}

Cholesterol measurements were collected for 10 normal volunteers at a medical centre. The mean and standard deviation of total cholesterol were estimated to be 200.73 and 51.72, respectively. The specimens were then randomly paired and the pooled specimens were assayed. For the purpose of demonstration, we artificially created a threshold (DT = 150) such that some numerical values could not be observed. In Table 1, we show the individual and pooled cholesterol values with and without the DT.

In this example, 20% of the individual measurements are below the threshold, whereas no pooled observations are below the DT. Applying the maximum likelihood method, the asymptotically unbiased mean and standard deviation were estimated to be 196.13 and 56.38, respectively, from unpooled data with the DT. Although more costly, by assaying both pooled and unpooled specimens, we can reasonably estimate values below the DT (Table 1). Moreover, using both the reconstructed data and the unpooled data above the DT, the mean and the standard deviation are estimated to be 198.99 and 53.44, respectively.

Although definition (1.1) shows the theoretical notation for pooled data, practically, pooling biological specimens can lead to additive pooling errors. In this section we use the maximum likelihood method from Section 4 and revise definition (1.1) to

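$$X_i^{(p)} = \frac{1}{p}\sum_{j=1}^{p} X_{k_{ji}} + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (5.1)$$

where the index sets {*k*_{1i}, …, *k*_{pi}} are as in definition (1.1) and the pooling errors *ε*_{1}, …, *ε*_{n} are independent *N*(0, σ_{ε}^{2}) distributed random variables. Definition (5.1) accounts for the pooling errors, which were ignored by definition (1.1). In order to investigate the robustness of our approach to pooling errors, we executed Monte Carlo (MC) simulations. Formally, we assumed that only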

- Random sampling: We randomly chose *X*_{1}, …, *X*_{N/p} from the full sample and observed *Z*_{1}, …, *Z*_{N/p} because of the DT. The mean of *X*_{i} was estimated using the likelihood approach on the truncated data {*Z*_{1}, …, *Z*_{N/p}}.
- Pooling: We randomly chose biospecimen sets of size *p*, with pooled values given by (5.1), and observed *Z*_{1}^{(p)}, …, *Z*_{N/p}^{(p)} by (1.3). Again, the mean of *X*_{i} was estimated using the MLE based on {*Z*_{1}^{(p)}, …, *Z*_{N/p}^{(p)}}.
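The data-generating step of the two designs can be sketched as below. All numerical settings (standard normal biomarker, *d* = −1, *p* = 2, σ_{ε} = 0.5, *N* = 20000, the seed) are assumptions chosen for illustration, not the paper's simulation settings, and the pooling error is assumed to be normal with mean zero and independent of the biomarker values. The function names are hypothetical.

```python
import random


def random_sample_design(x, p, d, rng):
    """Assay N/p randomly chosen individual specimens; below-DT results become None."""
    chosen = rng.sample(x, len(x) // p)
    return [v if v >= d else None for v in chosen]


def pooling_design(x, p, d, sigma_eps, rng):
    """Assay N/p random pools of size p with additive N(0, sigma_eps^2) pooling error."""
    shuffled = list(x)
    rng.shuffle(shuffled)
    pooled = []
    for i in range(0, len(shuffled) - p + 1, p):
        value = sum(shuffled[i:i + p]) / p + rng.gauss(0.0, sigma_eps)
        pooled.append(value if value >= d else None)
    return pooled


rng = random.Random(2024)
N, p, d = 20000, 2, -1.0
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
z = random_sample_design(x, p, d, rng)
z_pooled = pooling_design(x, p, d, sigma_eps=0.5, rng=rng)
frac_na = sum(v is None for v in z) / len(z)
frac_na_pooled = sum(v is None for v in z_pooled) / len(z_pooled)
print(round(frac_na, 3), round(frac_na_pooled, 3))
```

Both designs use the same number of assays (*N*/*p*). Under these assumed settings the pooled values have variance σ²/*p* + σ_{ε}², so pooling error erodes, but here does not remove, the reduction in N/A results.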

The accuracy of the estimators ($\hat{\mu}$, $\hat{\sigma}_X$) of (μ, σ_{X}) is indicated by their MC variances. We assumed a biomarker distribution

Logarithm of the Monte Carlo estimators of *E*($\hat{\mu}$ − μ)^{2} and *E*($\hat{\sigma}_X$ − σ_{X})^{2} [graphs (a) and (b) respectively], where the pooling error is in effect. Curves (------), (——) and (·········) ...

The figure suggests that the conclusions in Sections 2 and 4 are correct for μ-estimation up to σ_{ε} ≤ σ

In addition to pooling error, studies are also subject to random measurement error arising from random instrument variability, which depends on instrument calibration. One can account for random measurement error in the pooled or random sample designs through the use of standard techniques previously developed in the literature.^{18}^{,}^{19} These techniques include utilising error models, regression calibration models, validation studies or replication data to estimate and adjust for random measurement error. In addition, while not explicitly described here, standard information reported by a laboratory, such as the coefficient of variation for the biomarker and the reliability of the assay, can be included in these models.

In this paper, we examine pooling and random sampling as strategies to evaluate biospecimens with a DT. These types of data are common in epidemiological research and include two types of values: numerical and non-numerical (i.e. N/A). Because numerical values yield more information than missing data, it is a goal of any researcher to minimise the number of N/A observations. Accordingly, we have explored theoretical methods as well as simulations where a pooling design is more efficient than a random sample. In addition, we show that the efficiency of the pooling design is dictated by the location of the DT but is independent of the distributional assumptions (e.g. gamma, t-distribution, lognormal). For all distributions, there is a range of DTs where the pooling strategy is more efficient than a random sample because the inference-based pooling design provides more numerical information. In fact, in some cases pooling is more efficient than using the full sample. We showed that whenever *EX* > *d* (i.e. more than 50% of the values are expected to lie above the DT) pooling is always the most efficient sampling strategy, but other factors, such as the underlying distribution, must be considered when *EX* < *d*.

Certainly, a preliminary analysis of biospecimens with incomplete measurements, such as a test to see if *EX* > *d*, is appropriate. Towards this end, the unpooled–pooled strategy proposed in Section 3 is not only helpful for the evaluation of pooling errors but can also be applied to a first-stage data study. In addition, the efficiency of MLEs under each design can be evaluated.

Cost has been the main motivation for pooling biological specimens or for randomly selecting a subset of individuals to be assayed. However, we have shown that in some cases, even when the full data are available, estimations based on pooled data increase efficiency over the use of individual measurements when the assay has a DT. This is because of the greater number of observations above the DT under pooling, which can then be used in the estimation procedure. However, using unpooled data allows, for example, distributional assumptions to be tested, and the location of the DT and the expected number of observations below the DT to be estimated. In addition, one is able to stratify the pooled samples by confounders in order to retain confounding and covariate information in the pooled samples. To take advantage of the strengths of each of these approaches, we proposed a pooled–unpooled resampling design. According to this design, in the first stage all the patient population (or a random sample of them) is measured individually, and in the second stage, the patient population is pooled in groups of size *p* and these pooled samples are assayed. By employing this approach, we are able to reconstruct data that were unobserved in the first stage due to the DT.

This simple approach that we propose captures the strengths of the statistical properties of the distribution function of the averages by physically grouping biological specimens in order to overcome the instrument limitations.

Following Gupta’s method,^{15} we obtain the MLE based on a sample with observations subject to a DT:

(A.1)

where *f*_{Ω} is the density function of *X*^{Ω}; *N*_{Ω} is the size of set Ω; *k*_{Ω} is the number of N/A elements of set Ω; (Ω = *Z*, *X*^{Ω} = *X*_{1}) or (Ω = {*Z _{i}*,

Thus, *L*(Ω) is a function of unknown parameters μ and σ, say *L*(Ω) = *L*(μ, σ; Ω).

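The target estimators $\hat{\mu}$, $\hat{\sigma}$ of μ, σ are numerical solutions of the system

$$\frac{\partial \log L(\mu, \sigma; \Omega)}{\partial \mu} = 0, \qquad \frac{\partial \log L(\mu, \sigma; \Omega)}{\partial \sigma} = 0.$$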

The variances of the considered estimators ($\hat{\mu}$, $\hat{\sigma}$) of (μ, σ) can be found by inverting the Fisher information matrix. Using Gupta,^{15} we obtain, depending on the pooled/unpooled database,

(A.2)

where
; φ(*u*) is the standard normal density function
and if ($\hat{\mu}$, $\hat{\sigma}$) are based on

In accord with Section 3.2, we have

(A.4)

where *m*_{1} and *m*_{2} are number of elements of data *S*_{1} and *S*_{2} respectively; by applying (3.1), (3.2) and convolution transforms, we obtain

Now, the maximum likelihood estimators are .

The general maximum likelihood function is

(A.3)

where *m*_{1} and *m*_{2} are number of N/As of sets {*Z _{i}*,

1. Laden F, Neas LM, Spiegelman D, Hankinson SE, Willett WC, Ireland K, et al. Predictors of plasma concentrations of DDE and PCBs in a group of U.S. women. Environmental Health Perspectives. 1999;107:75–81.

2. Louis GM, Weiner JM, Whitcomb BW, Sperrazza R, Schisterman EF, Lobdell DT, et al. Environmental PCB exposure and risk of endometriosis. Human Reproduction. 2005;20:279–285.

3. Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine. 2003;22:2515–2527.

4. Liu A, Schisterman EF. Sample size and power calculation in comparing diagnostic accuracy of biomarkers with pooled assessments. Journal of Applied Statistics. 2004;31:41–51.

5. Liu A, Schisterman E. Comparison of diagnostic accuracy of biomarkers with pooled assessments. Biometrical Journal. 2003;45:631–644.

6. Schisterman EF, Perkins NJ, Liu A, Bondell H. Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16:73–81.

7. Keeler E, Berwick D. Effects of pooled samples. Health Laboratory Science. 1976;13:121–128.

8. Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726.

9. Helsel D. Nondetects and Data Analysis: Statistics for Censored Environmental Data. Hoboken, NJ: John Wiley & Sons, Inc; 2005.

10. Finkelstein M, Verma D. Exposure estimation in the presence of nondetectable values: another look. AIHAJ. 2001;62:195–198.

11. Hornung R, Reed L. Estimation of average concentration in the presence of nondetectable values. Applied Occupational and Environmental Hygiene. 1990;5:46–51.

12. Schisterman EF, Vexler A, Whitcomb BW, Liu A. The limitations due to exposure detection limits for regression models. American Journal of Epidemiology. 2006;163:374–383.

13. Lubin JH, Colt JS, Camann D, Davis S, Cerhan JR, Severson RK, et al. Epidemiologic evaluation of measurement data in the presence of detection limits. Environmental Health Perspectives. 2004;112:1691–1696.

14. Richardson DB, Ciampi A. Effects of exposure measurement error when an exposure variable is constrained by a lower limit. American Journal of Epidemiology. 2003;157:355–363.

15. Gupta AK. Estimation of the mean and standard deviation of a normal population from a censored sample. Biometrika. 1952;39:260–273.

16. Chapman DG. Estimating the parameters of a truncated gamma distribution. Annals of Mathematical Statistics. 1956;27:498–506.

17. Schisterman EF, Faraggi D, Reiser B, Trevisan M. Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. American Journal of Epidemiology. 2001;154:174–179.

18. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Boca Raton, FL: Chapman & Hall; 1995.

19. Gustafson P. Measurement Error and Misclassification in Statistics and Epidemiology. Boca Raton, FL: Chapman & Hall; 2004.
