The use of biomarkers for exposure assessment is common in epidemiology. The power gained by using a large sample of individuals must be weighed against the cost of performing many assays. After reproducibility and variability are established for the biomarker, financial constraints usually limit further evaluation to small sets of samples. For example, a single assay measuring polychlorinated biphenyls (PCBs) can cost up to $1000, so only small studies have been able to examine whether PCBs are associated with, for instance, cancer or endometriosis.1,2 However, the imprecision of the results from such small studies limits the conclusions that can be drawn about the suggested associations.
Currently, two different approaches have been suggested to evaluate expensive biomarkers. Suppose we have biological specimens from a patient population $A$ of size $N$, $A = \{A_1, A_2, \ldots, A_N\}$, with test results $X = \{X_1, X_2, \ldots, X_N\}$. One approach selects a random sample of the patient population, $A^{(r)} = \{A_{k_1}, A_{k_2}, \ldots, A_{k_n}\} \subseteq A$, where $n$ ($\le N$) is determined by a power calculation and $\{k_i,\ i = 1, \ldots, n\}$ is a subsequence of the set $\{1, 2, 3, \ldots, N\}$. Assays are performed only on this subset of specimens, giving the observed results $\{X_{k_1}, X_{k_2}, \ldots, X_{k_n}\}$.
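The random sampling design just described can be illustrated with a minimal Python sketch. The function name `random_sample_design`, the seed, and the twelve test results are hypothetical and chosen only for illustration; indices are 0-based here, whereas the text uses $1, \ldots, N$.

```python
import random

def random_sample_design(results, n, seed=0):
    """Select n of the N specimens at random; return (indices, assayed results)."""
    rng = random.Random(seed)
    # k_1 < k_2 < ... < k_n: a subsequence of the specimen indices (0-based)
    indices = sorted(rng.sample(range(len(results)), n))
    return indices, [results[k] for k in indices]

# Hypothetical population of N = 12 test results (arbitrary units)
X = [2.1, 3.4, 1.8, 5.0, 4.2, 2.9, 3.7, 1.5, 4.8, 3.1, 2.6, 3.9]

k, X_r = random_sample_design(X, n=4)
print("assayed specimens:", k)
print("observed results: ", X_r)
```

In practice $n$ would come from a power calculation rather than being fixed by hand as it is here.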
Alternatively, a pooling strategy may be employed in which two or more specimens are physically combined into a single ‘pooled’ unit for analysis. Thus, a greater portion of the population is assayed for the same price compared with the random sampling approach. Because the amount of information per assay increases, fewer assays are needed to achieve equivalent information.3–6 Formally, the samples from patient population $A$ are randomly combined into $n = N/p$ pooled specimens of size $p$. The $n$ pooled assay results are taken to be the averages of the contributing individual results, i.e.

$$X^{(p)}_j = \frac{1}{p} \sum_{i=1}^{p} X_{k_{ji}}, \qquad j = 1, \ldots, n,$$

where $\{k_{1i},\ i = 1, \ldots, p\}, \ldots, \{k_{ni},\ i = 1, \ldots, p\}$ are disjoint subsequences of the set $\{1, 2, 3, \ldots, N\}$. This formal definition of pooled data is commonly accepted for methodological analyses and practical applications of the pooling design.3–6

The concept of pooling biospecimens can be utilised in population-based epidemiological studies to explore the relationship between biomarker levels and outcome. The method’s primary goal is to establish distributional parameters for a specific biomarker. Consequently, pooling can be seen as a primary tool for case–control and cohort studies exploring discrete outcomes. The technique has been explored extensively in the literature, starting with publications related to cost-efficient syphilis testing of World War II recruits.7 Weinberg and Umbach introduced pooling to estimate odds ratios for case–control studies.8 Faraggi et al.3 and Liu and Schisterman4,5 examined the effect of pooling on inference about the area under the Receiver Operating Characteristic curve.
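The formal pooling design, in which each pooled assay is treated as the average of $p$ individual results, can be sketched in Python as follows. This is a simulation under stated assumptions, not the authors' procedure: the function name `pool_specimens`, the normal distribution of the individual results, and the values of $N$ and $p$ are all hypothetical.

```python
import random
import statistics

def pool_specimens(results, p, seed=0):
    """Randomly combine results into n = N // p pools of size p.

    Returns (index groups, pooled averages). Indices are 0-based.
    """
    rng = random.Random(seed)
    order = list(range(len(results)))
    rng.shuffle(order)
    n = len(results) // p  # leftover specimens (if N % p != 0) are dropped
    groups = [order[j * p:(j + 1) * p] for j in range(n)]  # disjoint index sets
    pooled = [statistics.fmean([results[k] for k in g]) for g in groups]
    return groups, pooled

# Simulated individual results (assumed normal purely for illustration)
rng = random.Random(42)
N, p = 1200, 4
X = [rng.gauss(10.0, 2.0) for _ in range(N)]

groups, X_pooled = pool_specimens(X, p)
print("number of pooled assays:", len(X_pooled))
print("mean (individual, pooled):",
      round(statistics.fmean(X), 3), round(statistics.fmean(X_pooled), 3))
print("variance (individual, pooled):",
      round(statistics.variance(X), 3), round(statistics.variance(X_pooled), 3))
```

When every specimen is used and all pools have the same size, the mean of the pooled results equals the mean of the individual results, while for independent results the variance of a pooled result is roughly $1/p$ of the individual variance. This is the sense in which each pooled assay carries more information about the mean than a single individual assay.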
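Cost is not the only limiting factor in biomarker evaluation. Instrument sensitivity may also be problematic. Another common complexity arises when some participants have levels below the detection threshold (DT).9 Under these circumstances, biomarker values at or above the DT are measured and reported, but values below the DT are unobservable. Formally, instead of $X$, we observe $Z = \{Z_1, Z_2, \ldots\}$, such that

$$Z_i = \begin{cases} X_i, & X_i \ge d, \\ \text{not observed}, & X_i < d, \end{cases}$$

where $d$ is the value of the DT. Similarly, for the pooling design, we observe $Z^{(p)} = \{Z^{(p)}_1, Z^{(p)}_2, \ldots\}$, where

$$Z^{(p)}_j = \begin{cases} X^{(p)}_j, & X^{(p)}_j \ge d, \\ \text{not observed}, & X^{(p)}_j < d, \end{cases}$$

and $X^{(p)}_j$ denotes the $j$th pooled result, i.e. the average of the $p$ individual results contributing to pool $j$.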
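To make the effect of a DT on the two designs concrete, the sketch below applies the same threshold to simulated individual and pooled results. The helper `apply_dt`, the normal distribution, and the values of $N$, $p$ and $d$ are assumptions made for illustration only; unobserved values are represented by `None`.

```python
import random
import statistics

def apply_dt(values, d):
    """Report a value if it is at or above the DT d; otherwise None (not observed)."""
    return [x if x >= d else None for x in values]

rng = random.Random(1)
N, p, d = 1200, 4, 8.0                         # hypothetical sizes and DT
X = [rng.gauss(10.0, 2.0) for _ in range(N)]   # simulated individual results

# Pooled results: averages of consecutive groups of p
# (the simulated values are already in random order)
X_pooled = [statistics.fmean(X[j:j + p]) for j in range(0, N, p)]

Z = apply_dt(X, d)
Z_pooled = apply_dt(X_pooled, d)

frac_below = Z.count(None) / len(Z)
frac_below_pooled = Z_pooled.count(None) / len(Z_pooled)
print(f"fraction below DT: individual {frac_below:.3f}, pooled {frac_below_pooled:.3f}")
```

In this simulation the DT lies below the mean, so averaging pulls pooled results away from the threshold and fewer of them fall below it. For a DT above the mean of a symmetric distribution the opposite would be expected, so the sketch should not be read as a general statement that pooling reduces the proportion of unobserved values.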
A variety of approaches have been used to analyse data with a lower DT. Substitution of $d/2$ or $d/\sqrt{2}$ for observations below the DT has been previously described.9–11 These values are based on the assumption of a normal ($d/2$) or lognormal ($d/\sqrt{2}$) distribution.12 Lubin and colleagues proposed multiple imputation based on bootstrapping when the exposure distribution function is known.13 Recent work shows that substitution of $E(X \mid X < d)$ for data below the DT allows for unbiased estimation of linear and, under certain conditions, logistic regression parameters.12 Schisterman and colleagues have shown that unbiased estimates may also be obtained non-parametrically if data below the DT are replaced by zero for no-intercept models and by an estimator of $E(X \mid X \ge d)$ for intercept models.12

The main objective of this paper is to examine parameter estimation and efficiency of the pooling approach compared with the random sampling approach for assays with a DT. In Section 2, we compare the numerical (quantifiable) information available and the efficiency of each sampling scheme. In Section 3, we propose a mixed (unpooling–pooling) design, which takes advantage of the strengths of each approach. In Section 4, we present maximum likelihood techniques to be utilised with the different designs. In Section 5, we illustrate methods to account for pooling and random measurement error, and in Section 6 we present our conclusions.
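As a small illustration of the substitution rules for values below the DT discussed earlier in this section, the following Python sketch compares the sample mean after each substitution with the mean of the full simulated data. It addresses estimation of a mean only, not the regression results cited above. The lognormal distribution, the DT value, and the helper `substitute` are assumptions, and $E(X \mid X < d)$ is computed from the hidden simulated values, which is possible only in a simulation.

```python
import math
import random
import statistics

rng = random.Random(2)
d = 0.5                                                   # hypothetical DT
X = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]   # simulated true values

# Conditional mean of the values below the DT; in practice this would have to be
# estimated, e.g. from an assumed distribution, because these values are not observed.
cond_mean = statistics.fmean([x for x in X if x < d])

def substitute(values, d, fill):
    """Replace every value below the DT d with the constant fill."""
    return [x if x >= d else fill for x in values]

fills = {
    "d/2": d / 2,
    "d/sqrt(2)": d / math.sqrt(2),
    "E(X | X < d)": cond_mean,
}
true_mean = statistics.fmean(X)
means = {name: statistics.fmean(substitute(X, d, fill)) for name, fill in fills.items()}

print(f"true mean: {true_mean:.4f}")
for name, m in means.items():
    print(f"{name:>13}: {m:.4f}  (difference {m - true_mean:+.4f})")
```

Substituting the conditional mean reproduces the sample mean exactly, since it preserves the sum of the values below the DT, whereas the fixed substitutes $d/2$ and $d/\sqrt{2}$ generally shift it by an amount that depends on the underlying distribution.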