

Stat Med. Author manuscript; available in PMC 2011 February 28.
Published in final edited form as:
Stat Med. 2010 February 28; 29(5): 597–613.
doi:  10.1002/sim.3823
PMCID: PMC2821989

Hybrid Pooled—Unpooled Design for Cost-Efficient Measurement of Biomarkers


Evaluating biomarkers in epidemiological studies can be expensive and time consuming. Many investigators use techniques such as random sampling or pooling biospecimens in order to cut costs and save time on experiments. Commonly, analyses based on pooled data are strongly restricted by distributional assumptions that are challenging to validate because of the pooled biospecimens. Random sampling provides data that can be easily analyzed. However, random sampling methods are not optimal cost-efficient designs for estimating means. We propose and examine a cost-efficient hybrid design that involves taking a sample of both pooled and unpooled data in an optimal proportion in order to efficiently estimate the unknown parameters of the biomarker distribution. In addition, we find that this design can be utilized to estimate and account for different types of measurement and pooling error, without the need to collect validation data or repeated measurements. We show an example where application of the hybrid design leads to minimization of a given loss function based on variances of the estimators of the unknown parameters. Monte Carlo simulation and biomarker data from a study on coronary heart disease are used to demonstrate the proposed methodology.

Keywords: Maximum likelihood, Measurement error, Pooling, Random sampling, Receiver operating characteristics, Sampling design

1. Introduction

Epidemiological studies increasingly take advantage of new and sophisticated biomarkers to more effectively measure exposures and predictors of interest. However, assaying biomarkers can be expensive and labor intensive, and even the least expensive individual assays may be infeasible to analyze in large cohorts. As a result, it is advantageous to find cost-efficient strategies for sampling. Dorfman [1], Faraggi et al. [2], Mumford et al. [3], Schisterman and Vexler [4], Vexler et al. [5], Weinberg and Umbach [6], and Zhang and Gant [7] all discuss pooling and random sampling as two approaches commonly used by investigators to reduce overall cost. Simple random sampling involves choosing and testing a random subset of available samples. In this case, several individual biospecimens are ignored; however, estimation of the operating characteristics of these individual biospecimens is straightforward. Pooling, on the other hand, involves randomly grouping and physically mixing individual biological samples. Assays are then performed on the smaller number of pooled samples. Thus, pooling reduces the number of measurements without ignoring any individual biospecimens. Theoretically, the measurements of pooled observations are the average of the individual measurements, and have lent themselves well to normality and gamma assumptions in the literature [3;5;8].

Faraggi et al. [2] describe that in certain situations the pooling design is more efficient than random sampling because it is applied to a greater number of individual biospecimens. However, Mumford et al. [3] and Vexler et al. [5] describe the tradeoff between random sampling and pooling and describe the conditions where random sampling is more efficient than pooling. For example, pooling is much less efficient in situations where the arithmetic mean is not an efficient estimator, e.g. in the case of log-normally distributed data. Moreover, when data are subject to a limit of detection, pooling cannot be recommended as the efficient design when more than 50% of the data are censored.

For a normally distributed biomarker, Faraggi et al. [2] show that when the mean is the only parameter of interest, the most cost-efficient strategy would be to pool all of the biospecimens and measure only one sample. That approach obviously does not provide a reasonable estimate of the variance. When estimating the variance of a biomarker, random sampling is simple and efficient and pooling is unnecessary per Liu and Schisterman [9]. Since we are most often concerned with estimation of both the mean and the variance in order to make inferences, it seems a natural progression to consider a hybrid design, one that utilizes both pooling and random sampling.

Previous authors have explored the effects of measurement error under both pooling and random sampling designs. In both designs, measurement error can be caused by problems related to instrumentation, unstable environmental conditions, operator subjectivity, etc. which Fuller [10], Coffin and Sukhatme [11], Schisterman et al. [12], and Perkins and Schisterman [13] discuss in more detail. Both designs are similarly affected by measurement error because the biospecimens, pooled and unpooled, are measured via the same process. Dunn [14], Freedman et al. [15], and Thurigen et al. [16] describe methods requiring repeated measurements or validation data to correct for measurement error. Faraggi [17], Liu et al. [18], Reiser [19], and Schisterman et al. [12] show the impact of measurement error on estimation. However, obtaining and assaying these replications can be difficult because of cost or other limitations [20]. In general, when no repeated measurements are available, additive measurement errors cannot be evaluated.

A form of error specific to the pooling design lies in the physical process of pooling biospecimens (e.g. effects of temperature, instrument, and technician variability) where the measurements of pooled observations deviate from the average of the individual measurements. We refer to this specific type of measurement error as pooling error. Obviously, measurements of individual specimens would be unaffected by pooling error as they are not pooled. As with other forms of error, pooling error may have a detrimental impact on effect estimation.

In this paper, we consider a hybrid sampling design, explore situations where pooling error and measurement error are present, and efficiently estimate the mean and variance of a normally distributed biomarker while accounting for these errors (Section 2). Additionally, this design allows for estimation of measurement error without requiring repeated sampling. Monte Carlo simulations are used to illustrate the developed techniques (Section 3). The presentation is limited to normally distributed data, but could be extended to other distribution functions as well. The optimal proportions of pooled and unpooled samples can be determined with respect to the given loss function. This methodology is exemplified in Section 4, by estimating the area under the receiver operating characteristics (ROC) curve, where the loss function of interest is the variance of the maximum likelihood estimator (MLE) of the area under the curve (AUC). (See references by Faraggi and Reiser [21] and Shapiro [22] for details regarding ROC curves and the AUC.) The optimal proportion of pooled samples that minimizes the loss function is found. An application to cholesterol biomarker data from a study on coronary heart disease is presented in Section 5. In Section 6, we conclude with a discussion of our results.

2. Pooled—Unpooled Hybrid Design

In this section, MLEs are developed under the hybrid design, utilizing both pooled and unpooled samples, for several potential scenarios including no error, measurement error, pooling error, and measurement and pooling error. In each case, it is necessary to know the number of individual biological specimens available (N) and the number of assays that can be afforded (n) due to budget and/or time limitations.

2.1 Review of Pooled Design and Simple Random Sampling Design

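Suppose that independent identically distributed (iid) biomarker levels $X_i \sim N(\mu_x, \sigma_x^2)$ can be observed. A random sample of size n can be drawn from the original population of N, and the corresponding MLEs,

$$\hat{\mu}_x = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma}_x^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2,$$

and their variances can be directly obtained, where specifically $\mathrm{var}(\hat{\mu}_x) = \sigma_x^2 / n$.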

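Pooled samples are obtained by randomly grouping individual samples into groups of size p, physically combining specimens of the same group and testing each group as a single observation. Assuming that the pooled sample measurements reflect the individual samples' average, we denote pooled samples of group size p by

$$Z_i = \frac{1}{p}\sum_{j=(i-1)p+1}^{ip} X_j, \qquad i = 1, \ldots, n.$$

By the additive property of the normal distribution, we obtain

$$Z_i \sim N\left(\mu_x, \frac{\sigma_x^2}{p}\right).$$

The MLEs based on the pooled data are

$$\hat{\mu}_x = \bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad \hat{\sigma}_x^2 = \frac{p}{n}\sum_{i=1}^{n} (Z_i - \bar{Z})^2,$$

where the variance of the estimator of μ is simply $\mathrm{var}(\hat{\mu}_x) = \sigma_x^2 / (np)$.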

One can show that the σ-estimators based on {Z1, …, Zn} and {X1, …, Xn} are identically distributed, as demonstrated by Liu and Schisterman [9]. However, in this simplest case involving pooling, the variance of the mean estimator is always smaller than that based on random sampling ($\sigma_x^2/(np) < \sigma_x^2/n$ for p > 1).
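The comparison between the two designs can be checked numerically. The following is a minimal Monte Carlo sketch, assuming error-free measurements and illustrative values (σx² = 1, n = 50 assays, pools of p = 4); the function name and defaults are hypothetical and not taken from the paper.

```python
import random
import statistics


def simulate_mean_variances(sigma2=1.0, n=50, p=4, reps=2000, seed=1):
    """Monte Carlo variance of the mean estimator: random sampling vs pooling.

    Illustrative assumption: no measurement or pooling error, true mean 0.
    """
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    random_means, pooled_means = [], []
    for _ in range(reps):
        # Random sampling: assay n individual specimens.
        x = [rng.gauss(0.0, sd) for _ in range(n)]
        random_means.append(statistics.fmean(x))
        # Pooling: assay n pools, each the average of p individual specimens.
        z = [statistics.fmean([rng.gauss(0.0, sd) for _ in range(p)])
             for _ in range(n)]
        pooled_means.append(statistics.fmean(z))
    return statistics.pvariance(random_means), statistics.pvariance(pooled_means)


v_random, v_pooled = simulate_mean_variances()
print("random sampling:", v_random, "theory:", 1.0 / 50)        # sigma2 / n
print("pooling:        ", v_pooled, "theory:", 1.0 / (50 * 4))  # sigma2 / (n p)
```

Both designs use the same number of assays; the pooled design simply draws on p times as many specimens.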

2.2 Hybrid Design with Measurement Error

In this section, we consider the case where both pooled and unpooled samples are similarly affected by measurement error. Let us assume that the measurement error $e_i^M$ satisfies $e_i^M \sim N(0, \sigma_M^2)$. The actual measurements of the unpooled and pooled samples are iid random variables $X'_i = X_i + e_i^M$ and $Z'_i = Z_i + e_i^M = \frac{1}{p}\sum_{j=(i-1)p+1}^{ip} X_j + e_i^M$, respectively, so $X'_i \sim N(\mu_x, \sigma_x^2 + \sigma_M^2)$ and $Z'_i \sim N(\mu_x, \sigma_x^2/p + \sigma_M^2)$.

Next, consider a hybrid sample S, which is a combination of both pooled and unpooled samples, i.e. S is defined as $\{X_1, \ldots, X_{[\alpha n]}, Z_1, \ldots, Z_{[(1-\alpha)n]}\}$, where α ∈ [0,1] represents the proportion that comes from a random sample. Note that αn + (1−α)np ≤ N, but here we assume that αn + (1−α)np = N if α < 1, and that N and n are fixed based on budget and time constraints and the size of the cohort. In this paper we choose to optimize α, though since α and p are connected by a one-to-one mapping, it is equivalent to optimize over p.
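The one-to-one mapping between α and p follows from solving αn + (1−α)np = N for p, which gives p = (N − αn)/((1−α)n). A minimal sketch of this bookkeeping is given below; the function names are illustrative only.

```python
def pool_size(N, n, alpha):
    """Pool size p implied by alpha*n + (1 - alpha)*n*p = N (requires alpha < 1)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    return (N - alpha * n) / ((1.0 - alpha) * n)


def specimens_used(n, alpha, p):
    """Number of individual specimens consumed by a hybrid sample."""
    return alpha * n + (1.0 - alpha) * n * p


# With N = 3000 specimens and n = 1000 assays, half of them unpooled:
p = pool_size(3000, 1000, 0.5)
print(p)                              # 5.0
print(specimens_used(1000, 0.5, p))   # 3000.0
```

In general the implied p is not an integer, so in practice it has to be rounded, and the number of specimens actually used then differs from N.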

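The hybrid design with measurement error provides data $S' = \{X'_1, \ldots, X'_{[\alpha n]}, Z'_1, \ldots, Z'_{[(1-\alpha)n]}\}$. The MLEs of $\mu_x$, $\sigma_x^2$, and $\sigma_M^2$ based on S' can be derived from the following equations:

$$\hat{\mu}_x = \frac{\dfrac{\alpha n \bar{X}'}{\hat{\sigma}_x^2 + \hat{\sigma}_M^2} + \dfrac{(1-\alpha) n \bar{Z}'}{\hat{\sigma}_x^2/p + \hat{\sigma}_M^2}}{\dfrac{\alpha n}{\hat{\sigma}_x^2 + \hat{\sigma}_M^2} + \dfrac{(1-\alpha) n}{\hat{\sigma}_x^2/p + \hat{\sigma}_M^2}}, \qquad \hat{\sigma}_x^2 = \frac{\sum_{i=1}^{[\alpha n]} (X'_i - \hat{\mu}_x)^2}{\alpha n} - \hat{\sigma}_M^2, \qquad \hat{\sigma}_M^2 = \frac{p}{p-1}\left(\frac{\sum_{i=1}^{[(1-\alpha)n]} (Z'_i - \hat{\mu}_x)^2}{(1-\alpha) n} - \frac{\sum_{i=1}^{[\alpha n]} (X'_i - \hat{\mu}_x)^2}{\alpha n p}\right),$$

where $\bar{X}'$ denotes the mean of $X'_1, \ldots, X'_{[\alpha n]}$, and $\bar{Z}'$ denotes the mean of $Z'_1, \ldots, Z'_{[(1-\alpha)n]}$. These estimates are asymptotically distributed as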


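$$(\hat{\mu}_x, \hat{\sigma}_x^2, \hat{\sigma}_M^2)^T \sim N\left((\mu_x, \sigma_x^2, \sigma_M^2)^T, \Sigma\right),$$

where Σ is the inverse of the Fisher Information Matrix, I, leading to the following variances:

$$\mathrm{var}(\hat{\mu}_x) = \left[\frac{\alpha n}{\sigma_x^2 + \sigma_M^2} + \frac{(1-\alpha) n}{\sigma_x^2/p + \sigma_M^2}\right]^{-1}, \qquad \mathrm{var}(\hat{\sigma}_x^2) = \frac{2p^2}{(p-1)^2}\left[\frac{(\sigma_x^2 + \sigma_M^2)^2}{\alpha n} + \frac{(\sigma_x^2/p + \sigma_M^2)^2}{(1-\alpha) n}\right], \qquad \mathrm{var}(\hat{\sigma}_M^2) = \frac{2}{(p-1)^2}\left[\frac{(\sigma_x^2 + \sigma_M^2)^2}{\alpha n} + \frac{p^2 (\sigma_x^2/p + \sigma_M^2)^2}{(1-\alpha) n}\right].$$
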
(For details, see Appendix A1.)
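For design purposes it can help to tabulate the asymptotic variances over a grid of α. The sketch below assumes the model of this section (unpooled measurements with variance σx² + σM², pooled measurements with variance σx²/p + σM²) and uses closed forms obtained by inverting the corresponding Fisher information; they are offered as a working assumption rather than a transcription of the paper's displayed formulas. The function name is hypothetical.

```python
def hybrid_variances_measurement_error(N, n, alpha, sigma2_x, sigma2_m):
    """Asymptotic variances of (mu_x, sigma_x^2, sigma_M^2) estimators.

    Assumes 0 < alpha < 1 and a pool size p > 1 implied by
    alpha*n + (1 - alpha)*n*p = N.
    """
    n_x = alpha * n                      # number of unpooled assays
    n_z = (1.0 - alpha) * n              # number of pooled assays
    p = (N - n_x) / n_z                  # implied (possibly non-integer) pool size
    a = sigma2_x + sigma2_m              # variance of an unpooled measurement
    b = sigma2_x / p + sigma2_m          # variance of a pooled measurement
    var_mu = 1.0 / (n_x / a + n_z / b)
    var_s2x = 2.0 * p**2 / (p - 1.0)**2 * (a**2 / n_x + b**2 / n_z)
    var_s2m = 2.0 / (p - 1.0)**2 * (a**2 / n_x + p**2 * b**2 / n_z)
    return var_mu, var_s2x, var_s2m


for alpha in (0.1, 0.5, 0.9):
    print(alpha, hybrid_variances_measurement_error(3000, 1000, alpha, 1.0, 0.1))
```

Under these assumptions the variance of the mean estimator grows with α, while the variance of the σx² estimator is smallest at an interior value of α.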

Proposition 2.2.1

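The values α = 0 (pooled samples only) and

$$\alpha = \frac{N^2\sigma_x^4 + N(n+N)\sigma_x^2\sigma_M^2 + nN\sigma_M^4}{N^2\sigma_x^4 + 2N^2\sigma_x^2\sigma_M^2 + (2nN - n^2)\sigma_M^4}$$

minimize $\mathrm{var}(\hat{\mu}_x)$ and $\mathrm{var}(\hat{\sigma}_x^2)$, respectively. (Proof in Appendix A2.)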
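As a quick illustration, the closed-form proportion in Proposition 2.2.1 can be evaluated directly. The sketch below assumes the expression is the ratio of the two polynomials in σx² and σM² whose coefficients are listed in Appendix A2; the function name is hypothetical.

```python
def optimal_alpha_for_variance(N, n, sigma2_x, sigma2_m):
    """Proportion of unpooled assays minimizing var of the sigma_x^2 estimator.

    Coefficients follow the description of alpha_{0,1} in Appendix A2.
    """
    sx, sm = sigma2_x, sigma2_m
    numerator = N**2 * sx**2 + N * (n + N) * sx * sm + n * N * sm**2
    denominator = N**2 * sx**2 + 2 * N**2 * sx * sm + (2 * n * N - n**2) * sm**2
    return numerator / denominator


print(optimal_alpha_for_variance(3, 1, 1.0, 1.0))        # 0.75
print(optimal_alpha_for_variance(3000, 1000, 1.0, 0.1))
print(optimal_alpha_for_variance(3000, 1000, 1.0, 1.44))
```

With no measurement error (σM² = 0) the expression reduces to α = 1, in agreement with the earlier observation that pooling is unnecessary for variance estimation in that case.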

Commonly, $\mathrm{var}(\hat{\mu}_x)$ and $\mathrm{var}(\hat{\sigma}_x^2)$ are components of a given loss function that we would like to minimize with respect to α. In this case, it is generally not recommended for α to be exactly equal to 0 or 1. Formal notations and proof schemes mentioned in the Appendices can be used to obtain the optimal α under a stated loss function. In Section 4, we show the application of this methodology for the case where the loss function of interest is the variance of the AUC estimator, a function of both $\mathrm{var}(\hat{\mu}_x)$ and $\mathrm{var}(\hat{\sigma}_x^2)$.

2.3 Hybrid Design with Pooling Error

Pooling error is a specific type of measurement error that occurs as a result of the process of physically pooling the samples together and may introduce additional variability. We denote this error as $e_i^p$ and assume $e_i^p \sim N(0, \sigma_p^2)$. Subsequently, we assume that the pooled observations have the form $Z_i = \frac{1}{p}\sum_{j=(i-1)p+1}^{ip} X_j + e_i^p$, where $Z_i \sim N(\mu_x, \sigma_x^2/p + \sigma_p^2)$. The maximum likelihood estimators of $\mu_x$, $\sigma_x^2$, and $\sigma_p^2$, and the corresponding Fisher Information Matrix, based on the hybrid design with pooling error only, are described in detail in Appendices A3 and A4. In contrast with Section 2.1, $\hat{\mu}_x$ based on only pooled samples subject to pooling error, {Z1, …, Zn}, can be less efficient than that based on the random sample {X1, …, Xn}. In this case, only the pooled samples and not the random samples are subject to error, making the equations slightly simpler.
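The efficiency comparison for the mean can be made concrete. With n pooled assays the mean estimator has variance (σx²/p + σp²)/n, against σx²/n for n unpooled assays, so pooling loses its advantage once σp² exceeds σx²(1 − 1/p). A minimal sketch with illustrative function names:

```python
def var_mean_pooled(sigma2_x, sigma2_p, n, p):
    """Variance of the mean of n pooled assays subject to pooling error."""
    return (sigma2_x / p + sigma2_p) / n


def var_mean_random(sigma2_x, n):
    """Variance of the mean of n unpooled assays (no pooling error)."""
    return sigma2_x / n


def pooling_is_more_efficient(sigma2_x, sigma2_p, p):
    """True when the pooled-only mean estimator has the smaller variance."""
    return sigma2_x / p + sigma2_p < sigma2_x


# Pools of size 2 with unit biomarker variance:
print(pooling_is_more_efficient(1.0, 0.3, 2))  # True:  0.5 + 0.3 < 1
print(pooling_is_more_efficient(1.0, 0.7, 2))  # False: 0.5 + 0.7 > 1
```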

The pooling-error-only case has vast applicability where research is based on well established biomarkers with thoroughly developed and precise measurement techniques. This is more often than not the case, as most studies assume no or a negligible amount of measurement error. However, the extra handling and processing that is required to pool samples and realize the efficiency benefits of the hybrid design could cause a non-negligible level of additional variability.

2.4 Measurement Error and Pooling Error

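In order to allow for both pooling and measurement errors, the hybrid sample must consist of three different samples to be able to estimate all of the variance components (i.e. $\sigma_x^2$, $\sigma_p^2$, and $\sigma_M^2$); otherwise the terms are not identifiable. Specifically, these three samples include one unpooled and two groups of pooled samples with different pooling group sizes, say p1 and p2 with p1 ≠ p2. The observed sample S" consists of:

$$S'' = \{X'_1, \ldots, X'_{[\alpha_1 n]},\; Z'_1, \ldots, Z'_{[\alpha_2 n]},\; Z''_1, \ldots, Z''_{[(1-\alpha_1-\alpha_2) n]}\},$$

where the $Z'_i$ are pooled in groups of size p1, the $Z''_i$ are pooled in groups of size p2, α1, α2 ∈ [0,1], and α1n + α2np1 + (1−α1−α2)np2 = N. Thus, the MLEs of $\mu_x$, $\sigma_x^2$, $\sigma_p^2$, and $\sigma_M^2$ based on S" can be obtained from the system of equations: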

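$$\hat{\mu}_x = \frac{\alpha_1 n \bar{X}' A_1 A_2 + \alpha_2 n \bar{Z}' (\hat{\sigma}_x^2 + \hat{\sigma}_M^2) A_2 + (1-\alpha_1-\alpha_2) n \bar{Z}'' (\hat{\sigma}_x^2 + \hat{\sigma}_M^2) A_1}{\alpha_1 n A_1 A_2 + \alpha_2 n (\hat{\sigma}_x^2 + \hat{\sigma}_M^2) A_2 + (1-\alpha_1-\alpha_2) n (\hat{\sigma}_x^2 + \hat{\sigma}_M^2) A_1},$$

$$\hat{\sigma}_x^2 = \frac{\sum_{i=1}^{[\alpha_1 n]} (X'_i - \hat{\mu}_x)^2}{\alpha_1 n} - \hat{\sigma}_M^2, \qquad \hat{\sigma}_M^2 = \frac{p_1}{p_1 - 1}\left(\frac{\sum_{i=1}^{[\alpha_2 n]} (Z'_i - \hat{\mu}_x)^2}{\alpha_2 n} - \frac{\sum_{i=1}^{[\alpha_1 n]} (X'_i - \hat{\mu}_x)^2}{\alpha_1 n p_1} - \hat{\sigma}_p^2\right),$$

$$\hat{\sigma}_p^2 = \frac{\sum_{i=1}^{[(1-\alpha_1-\alpha_2) n]} (Z''_i - \hat{\mu}_x)^2}{(1-\alpha_1-\alpha_2) n} - \frac{\sum_{i=1}^{[\alpha_1 n]} (X'_i - \hat{\mu}_x)^2}{\alpha_1 n p_2} - \hat{\sigma}_M^2\,\frac{p_2 - 1}{p_2},$$

where $A_1 = \hat{\sigma}_x^2/p_1 + \hat{\sigma}_p^2 + \hat{\sigma}_M^2$ and $A_2 = \hat{\sigma}_x^2/p_2 + \hat{\sigma}_p^2 + \hat{\sigma}_M^2$.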

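These estimates asymptotically satisfy

$$(\hat{\mu}_x, \hat{\sigma}_x^2, \hat{\sigma}_p^2, \hat{\sigma}_M^2)^T \sim N\left((\mu_x, \sigma_x^2, \sigma_p^2, \sigma_M^2)^T, \Sigma\right),$$

where Σ is developed in Appendix A5. Thus, for example, we obtain that

$$\mathrm{var}(\hat{\mu}_x) = \left[\frac{\alpha_1 n}{\sigma_x^2 + \sigma_M^2} + \frac{\alpha_2 n}{V_1} + \frac{(1-\alpha_1-\alpha_2) n}{V_2}\right]^{-1},$$

where $V_1 = \sigma_x^2/p_1 + \sigma_p^2 + \sigma_M^2$ and $V_2 = \sigma_x^2/p_2 + \sigma_p^2 + \sigma_M^2$.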

For fixed N, n, p1 and p2, α1 and α2 that minimize the variance of the mean are unique and can be calculated via conditional maximization. However, for other loss functions the solutions may not necessarily be unique.
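To make the dependence on α1 and α2 concrete, the variance of the mean estimator in the three-sample design can be computed as the inverse of the summed precisions of the three subsamples. This is a sketch under that assumption (the mean being asymptotically uncorrelated with the variance components for normal data); the function name and argument order are illustrative.

```python
def var_mean_three_sample(n, alpha1, alpha2, p1, p2, sigma2_x, sigma2_p, sigma2_m):
    """Asymptotic variance of the mean estimator for the three-sample hybrid design."""
    if alpha1 < 0 or alpha2 < 0 or alpha1 + alpha2 > 1:
        raise ValueError("alpha1 and alpha2 must be nonnegative and sum to at most 1")
    v0 = sigma2_x + sigma2_m                   # unpooled measurements
    v1 = sigma2_x / p1 + sigma2_p + sigma2_m   # pools of size p1
    v2 = sigma2_x / p2 + sigma2_p + sigma2_m   # pools of size p2
    information = (alpha1 * n / v0
                   + alpha2 * n / v1
                   + (1.0 - alpha1 - alpha2) * n / v2)
    return 1.0 / information


# Error-free check: 50 singles, 25 pools of 2, 25 pools of 4 carry the
# information of 50 + 50 + 100 = 200 specimens.
print(var_mean_three_sample(100, 0.5, 0.25, 2, 4, 1.0, 0.0, 0.0))
```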

3. Monte Carlo Simulations

In this section, Monte Carlo simulations are used to demonstrate the properties of the proposed methodology.

3.1 Measurement Error only

Normally distributed true biomarker levels, N = 3000 and n = 1000, were generated with $\mu = 0$ and $\sigma_x^2 = 1$, which were subsequently combined with several scenarios of random measurement error, $\sigma_M^2 = 0.1$, 0.4, and 1.44. Hybrid samples were formed from uncorrelated pooled and unpooled measurements in proportions of α = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9.
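A single replicate of such a simulation might look as follows. This is a minimal sketch, not the authors' code: the pool size is rounded to the nearest integer, and the likelihood equations of Section 2.2 are solved by a simple fixed-point iteration on the mean, which is an assumed numerical scheme. All function names are hypothetical.

```python
import random
import statistics


def simulate_hybrid(N, n, alpha, mu, sigma2_x, sigma2_m, rng):
    """Generate unpooled (x) and pooled (z) measurements with measurement error."""
    n_x = round(alpha * n)
    n_z = n - n_x
    p = max(2, round((N - n_x) / n_z))       # rounded pool size
    sd_x, sd_m = sigma2_x ** 0.5, sigma2_m ** 0.5
    x = [rng.gauss(mu, sd_x) + rng.gauss(0.0, sd_m) for _ in range(n_x)]
    z = [statistics.fmean([rng.gauss(mu, sd_x) for _ in range(p)])
         + rng.gauss(0.0, sd_m) for _ in range(n_z)]
    return x, z, p


def fit_hybrid(x, z, p, iterations=50):
    """Unrestricted estimates of (mu_x, sigma_x^2, sigma_M^2) by fixed-point iteration."""
    mean_x, mean_z = statistics.fmean(x), statistics.fmean(z)
    mu = statistics.fmean(x + z)             # starting value
    for _ in range(iterations):
        s_x = statistics.fmean([(v - mu) ** 2 for v in x])   # estimates sigma_x^2 + sigma_M^2
        s_z = statistics.fmean([(v - mu) ** 2 for v in z])   # estimates sigma_x^2/p + sigma_M^2
        # Precision-weighted mean of the two subsample means.
        mu = ((len(x) * mean_x / s_x + len(z) * mean_z / s_z)
              / (len(x) / s_x + len(z) / s_z))
    sigma2_m = p / (p - 1) * (s_z - s_x / p)  # may be negative in small samples
    sigma2_x = s_x - sigma2_m
    return mu, sigma2_x, sigma2_m


rng = random.Random(0)
x, z, p = simulate_hybrid(3000, 1000, 0.5, 0.0, 1.0, 0.1, rng)
print(p, fit_hybrid(x, z, p))
```

Repeating this over many replicates and taking the empirical variance of each estimate gives Monte Carlo variances that can be set against the theoretical ones.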

Table 1 shows the results of these simulations and it seems clear that each estimator has Monte Carlo variance that is approximately equal to the derived theoretical variance. Table 1 confirms that the most efficient strategy for mean estimation utilizes the most pooled samples (α = 0.1). In these simulations, the variance is minimized at small values of α but not always at the smallest value of α (0.2 for $\sigma_M^2 = 0.1$, 0.3 for $\sigma_M^2 = 0.4$, and 0.1 for $\sigma_M^2 = 1.44$). This discrepancy is due to the fact that the theoretical variance is calculated using a pooling size that does not have to be an integer value, but for the simulation, the value of p is rounded to the nearest integer value. One impact of the rounding is that it could potentially use more samples than are available. See Figure 1a.

Figure 1
a. Variance of the mean estimator for both rounded and unrounded values of p, for the case of N=3000, n=1000, $\sigma_x^2 = 1$, and $\sigma_M^2 = 0.1$.
Table 1
Monte Carlo simulation results for measurement error only case. N=3000, n=1000, $\sigma_x^2 = 1$, 2500 repetitions.

For variance estimation, when measurement error exists, strategies that utilize more pooled samples are more efficient. Variance estimation is not as affected by rounding p as mean estimation. See Figure 1b. In addition, when measurement error is present, estimation is only possible when both pooled and unpooled samples are used. In the case where α = 0.1, there is probably not a sufficient sample of unpooled data for proper estimation.

As with all maximum likelihood estimation, caution should be used when applying these methods due to questions of robustness, specifically regarding $\mathrm{var}(\hat{\sigma}_x^2)$ used for optimization. Variance estimates are restricted to nonnegative values; however, in the rare cases when unrestricted MLEs may result in $\hat{\sigma}^2 < 0$, we apply techniques used by Schisterman et al. and replace $\hat{\sigma}^2$ by a very small number [12]. Specifically, we replace $\hat{\sigma}_M^2$ with $1.4 \times \hat{\sigma}_x^2$, $1.2 \times \hat{\sigma}_x^2$, $1.1 \times \hat{\sigma}_x^2$, $1.01 \times \hat{\sigma}_x^2$, $1.005 \times \hat{\sigma}_x^2$, or $1.001 \times \hat{\sigma}_x^2$, and evaluate the appropriateness of the various alternatives. In addition, we compare these results to that of replacing the variance with $\hat{\sigma}_M^2 = \frac{1}{N}\left|\frac{1}{(1-\alpha) n}\sum_{i=1}^{[(1-\alpha) n]} (Z'_i - \hat{\mu}_x)^2 - \frac{\hat{\sigma}_x^2}{p}\right|$, as recommended by Vonesh and Carter [23]. In our case, the choice of replacement value did not have a large impact on the results.

3.2 Pooling Error Only

Again, normally distributed data were generated with $\mu = 0$ and $\sigma_x^2 = 1$, but with varying pooling errors ($\sigma_p^2 = 0.3$, 0.7). The number of individual biospecimens available for testing was varied (N = 300, 200), as was the number of assays performed (n = 150, 100, 50). Hybrid samples were formed from uncorrelated pooled and unpooled measurements in proportions of α = 0.25, 0.5, and 0.75. For values of α that do not imply integer values of p, the value of p in the simulation was rounded to the nearest integer value.

Table 2 displays the results of these simulations where it is clear that the Monte Carlo variance for the estimators is approximately equal to the derived theoretical variance. Table 2 also reinforces Proposition 2.3.1 when estimating the mean. Three cases are shown to demonstrate each of the possible optimal values of α. Case 1 corresponds to the situation where the pooling error is small in relation to the variability in the sample. Here the variance is minimized when more pooled samples are used (α is closest to zero). In Case 2, the pooling error is greater than the sampling error and the variance is minimized when fewer pooled samples are used (α closest to 1). Case 3 is an example of a situation where the formula for the optimal α is used, and the simulation matches what we would expect. As expected for estimation of the variance, the simulation results show that for each different set of parameters, the sample which utilizes the greatest proportion of random samples (in this case where α = 0.75) is always the most efficient.

Table 2
Monte Carlo simulation results for pooling error only case. 5000 repetitions. μ = 0.

In these cases, tests for normality (e.g. Kolmogorov-Smirnov, Shapiro-Wilk) did not reject the hypothesis that the maximum likelihood estimators obtained across the Monte Carlo repetitions are normally distributed.

4. An Application to AUC Estimation

A common measure of a biomarker's ability to discriminate between those with and without disease is the AUC, as Faraggi and Reiser [21], Shapiro [22], and Wieand et al. [24] describe. Let X and Y denote diseased and healthy individuals' test scores, respectively, and let AUC = Pr(X > Y). Here, $X \sim N(\mu_x, \sigma_x^2)$ and $Y \sim N(\mu_y, \sigma_y^2)$ lead to AUC = Φ(δ), where $\delta = (\mu_x - \mu_y)/\sqrt{\sigma_x^2 + \sigma_y^2}$. Estimating AUC requires estimates of the mean and variance of both populations, so the optimal sampling strategy requires finding the proportion of pooled samples that minimizes the variance of the AUC estimator (the loss function of interest here). Without loss of generality and for demonstration purposes, we assume that only pooling error is in effect. The appropriate variances of our estimators from Section 2.3 (Appendix A3) are applied to the loss function of interest.
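The binormal AUC is straightforward to evaluate once the four parameters are available. A minimal sketch using only the standard library (the function names are illustrative); the inputs in the usage line are the parameter estimates reported in Section 5.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binormal_auc(mu_x, sigma2_x, mu_y, sigma2_y):
    """AUC = Pr(X > Y) for independent normal cases X and controls Y."""
    delta = (mu_x - mu_y) / math.sqrt(sigma2_x + sigma2_y)
    return normal_cdf(delta)


# Case and control estimates from the example in Section 5.
print(round(binormal_auc(223.5, 1662.4, 203.7, 859.4), 2))  # 0.65
```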

Specifically, one can show that the variance of the AUC estimator is:




(See [3] for details.)

Assume observations are of the form $S_x$ and $S_y$ described in Section 2.3. Applying the variances of the μ and σ estimators found in Section 2.3 (Appendix A3), we can directly find the values of α that minimize the variance of the AUC estimator,

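$$\alpha_x = \frac{\sigma_x^4(\mu_x - \mu_y)^2(\sigma_x^2 - \sigma_{zx}^2) \pm \sigma_x^3\sigma_{zx}(\mu_x - \mu_y)(\sigma_x^2 + \sigma_y^2)\sqrt{2(\sigma_x^2 - \sigma_{zx}^2)}}{(\sigma_x^2 - \sigma_{zx}^2)\left\{\sigma_x^2(\mu_x - \mu_y)^2(\sigma_x^2 - \sigma_{zx}^2) - 2\sigma_{zx}^2(\sigma_x^2 + \sigma_y^2)^2\right\}},$$

$$\alpha_y = \frac{\sigma_y^4(\mu_x - \mu_y)^2(\sigma_y^2 - \sigma_{zy}^2) \pm \sigma_y^3\sigma_{zy}(\mu_x - \mu_y)(\sigma_x^2 + \sigma_y^2)\sqrt{2(\sigma_y^2 - \sigma_{zy}^2)}}{(\sigma_y^2 - \sigma_{zy}^2)\left\{\sigma_y^2(\mu_x - \mu_y)^2(\sigma_y^2 - \sigma_{zy}^2) - 2\sigma_{zy}^2(\sigma_x^2 + \sigma_y^2)^2\right\}},$$

where $\sigma_{zx}^2 = \sigma_x^2/p + \sigma_{px}^2$, $\sigma_{zy}^2 = \sigma_y^2/p + \sigma_{py}^2$, $\mathrm{var}(e_x^p) = \sigma_{px}^2$, and $\mathrm{var}(e_y^p) = \sigma_{py}^2$, where $e_x^p$ and $e_y^p$ denote the pooling errors of the pooled samples for cases and controls, respectively.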
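The admissible root of the expression for the optimal proportion can be computed in a numerically convenient form. The sketch below rests on several assumptions that are not stated in this form in the text: the pool size p is held fixed, the variance of the mean estimator is the inverse of the summed precisions, the variance of the σx² estimator is 2σx⁴/(αn) (only unpooled assays inform σx² when pooling error is unknown), and the AUC variance follows a first-order delta-method expansion. Under these assumptions the first-order condition has a single positive root, written here without the ± sign. Function names are hypothetical.

```python
import math


def optimal_alpha_auc(delta_mu, sigma2, sigma2_z, sigma2_total):
    """Unpooled proportion for one group minimizing the AUC-variance loss.

    delta_mu     : mu_x - mu_y
    sigma2       : biomarker variance in this group
    sigma2_z     : variance of a pooled measurement, sigma2/p + pooling error
    sigma2_total : sigma_x^2 + sigma_y^2
    """
    d = sigma2 - sigma2_z
    if d <= 0.0:
        # Pooled assays are no more precise than unpooled ones: use no pools.
        return 1.0
    sd, sd_z = math.sqrt(sigma2), math.sqrt(sigma2_z)
    numerator = abs(delta_mu) * sd**3
    denominator = abs(delta_mu) * sd * d + sigma2_total * sd_z * math.sqrt(2.0 * d)
    return min(1.0, numerator / denominator)


def alpha_dependent_loss(alpha, n, delta_mu, sigma2, sigma2_z, sigma2_total):
    """Part of var(delta-hat) contributed by one group, as a function of alpha."""
    var_mu = 1.0 / (alpha * n / sigma2 + (1.0 - alpha) * n / sigma2_z)
    var_s2 = 2.0 * sigma2**2 / (alpha * n)
    return var_mu / sigma2_total + delta_mu**2 * var_s2 / (4.0 * sigma2_total**3)


# Illustrative inputs: case-group estimates from Section 5 with p = 2.
s2x, s2px, s2y = 1662.4, 283.4, 859.4
alpha_cases = optimal_alpha_auc(223.5 - 203.7, s2x, s2x / 2 + s2px, s2x + s2y)
print(alpha_cases)
```

Comparing `alpha_dependent_loss` at the returned value and at neighboring values of α gives a simple check that the root is a minimum.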

In general, for a loss function corresponding to the variance of our parameters, we can find the optimal value of α which minimizes the loss function and can be evaluated as a function of unknown parameters. Section 5 illustrates this theoretical result with a practical example.

5. Example

A study of coronary heart disease at the Cedars-Sinai Medical Center investigated the ability of a cholesterol biomarker to discriminate between individuals who recently survived a myocardial infarction (cases) and those with no previous history of myocardial infarction (controls). Biomarker measurements were collected on 40 cases and 40 controls and blood samples were pooled in groups of 2, keeping cases and controls separate, and then re-measured. We therefore have individual measurements for 40 cases and 40 controls, as well as measurements for 20 samples of two pooled cases and 20 samples of two pooled controls. This allowed us to organize hybrid samples by taking combinations of individual and pooled measurements. The mean (±SD) was 205.5 (±42.3) for unpooled control samples and 207.9 (±35.2) for pooled control samples. The mean (±SD) was 226.8 (±41.7) for unpooled case samples and 223.2 (±37.3) for pooled case samples. The standard deviation of the differences between pooled measurements and the average of corresponding individual measurements was 6.7 for cases and 7.4 for controls.

For the purposes of this example, we assume that only pooling error is in effect. Under these assumptions, we apply the results from Section 4 for determining the optimal proportions of pooled and unpooled samples to maximize efficiency when estimating the AUC. Using the empirical means and variances, we find that α = 0.4 for the cases and α = 0.35 for the controls minimize the variance of the AUC estimator when only 20 assays are tested (n=20). In other words, it is recommended that hybrid samples include 8 individual and 12 pooled (p=2) samples for the cases and 7 individual and 13 pooled samples for the controls to minimize the variance of the AUC estimator. More control samples should be pooled since the estimated pooling error is slightly lower among the controls than the cases. Using the developed methodology, we are also able to estimate the amount of pooling error. In particular, we took a hybrid sample of both cases and controls using the optimal levels of α. The hybrid sample was distributed as follows (mean (±SD)): unpooled control samples 201.0 (±31.5), pooled control samples 205.1 (±30.8), unpooled case samples 229.4 (±43.1), and pooled case samples 220.8 (±34.8). The MLEs were then estimated using the proposed methodology, and we found that $\hat{\mu}_x = 223.5$, $\hat{\sigma}_x^2 = 1662.4$, $\hat{\sigma}_{px}^2 = 283.4$, $\hat{\mu}_y = 203.7$, $\hat{\sigma}_y^2 = 859.4$, and $\hat{\sigma}_{py}^2 = 449.3$. The AUC estimator was found to be 0.65 with a variance of 0.0050.

By implementation of the hybrid design in a small simulation study testing only 20 samples, we were able to reduce the variability of the AUC estimator by 30% as compared to using only a random sample (hybrid design: AÛC = 0.621, Var(AÛC) = 0.0049 ; random sample: AÛC = 0.641, Var(AÛC) = 0.0071). As compared to the full sample, the hybrid design used only half the number of samples with only a 37% loss of efficiency, whereas the random sample lost 96% (full sample: AÛC = 0.640, Var(AÛC) = 0.0036).

6. Discussion

We present methodology for a cost-efficient hybrid sampling design that combines both pooled and unpooled samples in optimal proportions to efficiently estimate the unknown parameters of the biomarker distribution. Maximum likelihood estimators and their asymptotic distributions are given for various situations where errors are present. Application of this design also makes possible estimation of measurement error without repeated sampling. The hybrid design is similar to repeated measures but also retains, and in some cases improves, the efficiency of estimating μx and σx2. The information on repeated measures can only be used to estimate σM2, whereas use of the hybrid design can be used to estimate all parameters of interest. Any outside information available on the source and quantity of the errors can improve implementation of the hybrid design and subsequent estimation.

Misspecification of the measurement error model is of course a concern for estimation of α, and in turn for estimation of the main parameters of interest. Researchers often have substantial knowledge on the measurement of similar biomarkers, or biomarkers measured in a similar manner, and the laboratory conditions under which pooling will be performed (e.g. mechanical versus technician). This knowledge aids model specification. As mentioned, the pooling–error-only model is especially appropriate in cases where research is based on well-established and precise measurement techniques. In practice, it is best to allow for both types of error, especially if no prior knowledge of the biomarkers and the measurement process exists.

It should be noted that owing to the complexity of estimating the biomarker distribution from observed convoluted data, generalized likelihood or nonparametric inferences based only on pooled data may not be feasible, since the distribution of the observed averages involves convolution of p random variables of a biomarker distribution, as discussed by Vexler et al. [8]. In several cases, like those mentioned above, the hybrid design can be based on pre-specified proportions of pooled and unpooled samples (e.g. α = 0.5). The definition and approximation of optimal values of the proportions with respect to a given loss function can also belong to statements of problems that precede the hybrid design execution. In this article, several rules are analyzed which aid in determining the optimal proportion of unpooled samples. These rules are dependent on parameters (e.g. $\sigma_x^2$, $\sigma_p^2$) that are unknown at the design stage. It is usually assumed that the experimenter has a priori knowledge regarding the distribution of measurements, and good initial values of the unknown parameters are chosen based on this knowledge. This information might be available from previous related experiments, preliminary studies, or pre-testing of some kind. This information could also be taken from a subsample of the N available samples, with some modifications to the estimators. Since the unknown parameters are evaluated in previous experimentation, one might argue correctly that uni-stage designs are in fact the second stage of a less formal multistage experiment. Thus, following the design literature (e.g. Liu and Chi [25], Bellhouse et al. [26], Sitter and Wu [27], and Racine et al. [28]) the problem of approximating the optimal proportions of pooled and unpooled samples can be considered by using Bayesian methods, learning samples with a fixed size, or two-stage design techniques, etc.
Two-stage designs can, in particular, be based on sequential procedures that make relevant test decisions using the first stage, utilizing a non pre-specified number of observations. Note also that the relationship between the bias of estimation of optimal values of α and the corresponding bias of main parameter estimation can be a major factor in defining the appropriate size of the learning sample. Our limited simulation results show that reasonable estimates of the optimal values of α can be obtained from relatively small to moderate-sized learning samples. In fact, learning samples with sizes proportional to the square root of the number of measured assays (i.e. $\propto n^{1/2}$) (Liu and Chi [25], Bellhouse et al. [26], Sitter and Wu [27], and Racine et al. [28]) provide estimators of $\mu_x$ and $\sigma_x^2$ with variances comparable to those of the $\mu_x$ and $\sigma_x^2$ estimators considered in previous sections. Consider, for example, the situations depicted in Table 1 of Section 3.1. To approximate the optimal evaluation of $\sigma_x^2$, the values of α from Proposition 2.2.1, having form

$$\alpha = \frac{N^2\sigma_x^4 + N(n+N)\sigma_x^2\sigma_M^2 + nN\sigma_M^4}{N^2\sigma_x^4 + 2N^2\sigma_x^2\sigma_M^2 + (2nN - n^2)\sigma_M^4},$$

are estimated based on learning samples with different sizes and the 50/50 proportion of numbers of pooled and unpooled measurements. (The learning samples, in these cases, should also have pooled and unpooled data to estimate $\sigma_M^2$.) Then, the hybrid samples are formed and the parameters estimated. Table 3 presents the corresponding Monte Carlo results.

Table 3
Monte Carlo simulation results for Two-Stage Design. N=3000, n=1000, $\sigma_x^2 = 1$, 2500 repetitions. Learning samples were based on n' measurements from N' bioassays (α=0.5, p=2).

Thus, comparing the Monte Carlo estimators of $\mathrm{var}(\hat{\sigma}_x^2)$ in Table 1 and Table 3, we observe that the proposed estimators are quite robust to the choice of the learning sample size that is proportional to $n^{1/2}$.

In this article, we demonstrate how to apply the hybrid design to estimate the AUC, but the methodology could also be applied to other functions such as hypothesis testing, regression, etc. In the present paper, we analyze normally distributed biomarkers, and the efficiency of the pooling design depends in large part on the distribution of interest. However, other common biomarker distributions could be evaluated in a similar manner. For nonparametric cases, the proposed estimators can be considered as the least square estimators that asymptotically have the variances presented in this paper. We consider the situation where only measurement error or only pooling error is present and the number of available biospecimens (N) is fixed. Hence, the pooling size is defined by the recommended optimal proportion (α) and the number of samples we can afford to test (n). In a similar manner, we could analyze the situation where the pooling size is fixed. However, when both measurement and pooling errors are present, p1 and p2 are considered fixed beforehand to allow for the optimization of α1. The proposed estimators could also be modified to allow for differential errors that are a function of the pooling group size or levels of the biomarker, or errors due to mixing in non-identical proportions.

In conclusion, the hybrid sampling design combines the benefits of both pooling and random sampling under strict cost limitations, while making possible the estimation of measurement error without repeated samples. Only a small to moderate-sized training sample is needed in order to successfully implement the hybrid design.


The authors thank the editor and reviewers for their valuable comments. This work was supported by the Intramural Research Program of the National Institutes of Health, Eunice Kennedy Shriver National Institute of Child Health and Human Development.


A1. Hybrid Design with Measurement Error

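The log likelihood function based on S' has the form:

$$\ell(\mu_x, \sigma_x, \sigma_M) = -\frac{n}{2}\ln(2\pi) - \frac{\alpha n}{2}\ln(\sigma_x^2 + \sigma_M^2) - \frac{\sum_{i=1}^{\alpha n} (X'_i - \mu_x)^2}{2(\sigma_x^2 + \sigma_M^2)} - \frac{(1-\alpha) n}{2}\ln\left(\frac{\sigma_x^2}{p} + \sigma_M^2\right) - \frac{\sum_{i=1}^{(1-\alpha) n} (Z'_i - \mu_x)^2}{2(\sigma_x^2/p + \sigma_M^2)}.$$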

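The derivatives of the log likelihood function are

$$\begin{aligned}
\frac{\partial \ell}{\partial \mu_x} &= \frac{\sum_{i=1}^{\alpha n} (X'_i - \mu_x)}{\sigma_x^2 + \sigma_M^2} + \frac{\sum_{i=1}^{(1-\alpha) n} (Z'_i - \mu_x)}{\sigma_x^2/p + \sigma_M^2},\\
\frac{\partial \ell}{\partial \sigma_x} &= -\frac{\alpha n \sigma_x}{\sigma_x^2 + \sigma_M^2} + \frac{\sigma_x \sum_{i=1}^{\alpha n} (X'_i - \mu_x)^2}{(\sigma_x^2 + \sigma_M^2)^2} - \frac{(1-\alpha) n \sigma_x}{p(\sigma_x^2/p + \sigma_M^2)} + \frac{\sigma_x \sum_{i=1}^{(1-\alpha) n} (Z'_i - \mu_x)^2}{p(\sigma_x^2/p + \sigma_M^2)^2},\\
\frac{\partial \ell}{\partial \sigma_M} &= -\frac{\alpha n \sigma_M}{\sigma_x^2 + \sigma_M^2} + \frac{\sigma_M \sum_{i=1}^{\alpha n} (X'_i - \mu_x)^2}{(\sigma_x^2 + \sigma_M^2)^2} - \frac{(1-\alpha) n \sigma_M}{\sigma_x^2/p + \sigma_M^2} + \frac{\sigma_M \sum_{i=1}^{(1-\alpha) n} (Z'_i - \mu_x)^2}{(\sigma_x^2/p + \sigma_M^2)^2}.
\end{aligned}$$

Setting the first derivatives equal to zero and solving yields the following system of equations

$$\hat{\mu}_x = \frac{\dfrac{\alpha n \bar{X}'}{\hat{\sigma}_x^2 + \hat{\sigma}_M^2} + \dfrac{(1-\alpha) n \bar{Z}'}{\hat{\sigma}_x^2/p + \hat{\sigma}_M^2}}{\dfrac{\alpha n}{\hat{\sigma}_x^2 + \hat{\sigma}_M^2} + \dfrac{(1-\alpha) n}{\hat{\sigma}_x^2/p + \hat{\sigma}_M^2}}, \qquad \hat{\sigma}_x^2 = \frac{\sum_{i=1}^{\alpha n} (X'_i - \hat{\mu}_x)^2}{\alpha n} - \hat{\sigma}_M^2, \qquad \hat{\sigma}_M^2 = \frac{p}{p-1}\left(\frac{\sum_{i=1}^{(1-\alpha) n} (Z'_i - \hat{\mu}_x)^2}{(1-\alpha) n} - \frac{\sum_{i=1}^{\alpha n} (X'_i - \hat{\mu}_x)^2}{\alpha n p}\right).$$
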
The second derivatives of the log likelihood function are

2μx2=αn(σx2+σM2)(1α)n(σx2p+σM2),2σx2=αn(σx2+σM2)+2αnσx2(σx2+σM2)2+iαn(X'iμx)2(σx2+σM2)24σx2iαn(X'iμx)2(σx2+σM2)3  (1α)np(σx2p+σM2)+2(1α)nσx2p2(σx2p+σM2)2+i(1α)n(Z'iμx)2p(σx2p+σM2)24σx2i(1α)n(Z'iμx)2,p2(σx2p+σM2)32σM2=αn(σx2+σM2)+2αnσM2(σx2+σM2)2+iαn(X'iμx)2(σx2+σM2)24σM2iαn(X'iμx)2(σx2+σM2)3(1α)n(σx2p+σM2)+2(1α)nσM2(σx2p+σM2)2  +i(1α)n(Z'iμx)2(σx2p+σM2)24σM2i(1α)n(Z'iμx)2(σx2p+σM2)3,2μxσx=2σxiαn(X'iμx)(σx2+σM2)22σxi(1α)n(Z'iμx)p(σx2p+σM2)2,2μxσM=2σMiαn(X'iμx)(σx2+σM2)22σMi(1α)n(Z'iμx)(σx2p+σM2)2,2σxσM=2αnσxσM(σx2+σM2)24σxσMiαn(X'iμx)2(σx2+σM2)3+2(1α)nσxσMp(σx2p+σM2)24σxσMi(1α)n(Z'iμx)2p(σx2p+σM2)3

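Since these estimates are MLEs, we have $\sqrt{n}\,(\hat{\mu}_x - \mu_x,\; \hat{\sigma}_x^2 - \sigma_x^2,\; \hat{\sigma}_M^2 - \sigma_M^2)^T \sim N(0, I^{-1})$ asymptotically, where I is the Fisher Information Matrix, i.e. the negative expected value of the second derivatives divided by n as n → ∞.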




A2. The proof of Proposition 2.2.1

Since $p = \dfrac{N - \alpha n}{(1-\alpha) n}$, we can write


and hence


Thus, Var([mu]x) is a strictly increasing function. Therefore, the optimal alpha level is then found where α = 0, which corresponds to taking a pooled sample.

Since $p = \frac{N-\alpha n}{(1-\alpha)n}$, we have


and hence


In this case, $\frac{\partial}{\partial \alpha}\mathrm{Var}(\hat{\sigma}_x^2) = 0$


There are two possible solutions:


We can show that the first solution $\alpha_{0,1}$ is strictly less than 1. Write


Now, if we compare the coefficients of each term of $\alpha_{0,1}$, we can show that the numerator is less than the denominator and therefore the solution $\alpha_{0,1}$ is less than 1. Specifically, the coefficients of $\sigma_x^4$ in the numerator and denominator are equal, the coefficients of $\sigma_x^2\sigma_M^2$ satisfy $N(n+N) < 2N^2$, and the coefficients of $\sigma_M^4$ satisfy $nN < 2nN - n^2$; hence $\alpha_{0,1} \le 1$.



Here the coefficients of $\sigma_x^4$ in the numerator and denominator are equal, the coefficients of $\sigma_x^2\sigma_M^2$ satisfy $3N^2 - nN > 2N^2$, and the coefficients of $\sigma_M^4$ satisfy $N(2N-n) > n(2N-n)$; hence $\alpha_{0,2} > 1$.
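The coefficient comparisons used for the two solutions all rest on one condition. As a brief worked check, assuming $0 < n < N$ (fewer assays than available specimens, which is the setting in which the hybrid design is of interest), each inequality reduces as follows:

```latex
\begin{align*}
N(n+N) < 2N^2 &\iff nN < N^2 \iff n < N,\\
nN < 2nN - n^2 &\iff n^2 < nN \iff n < N,\\
3N^2 - nN > 2N^2 &\iff N^2 > nN \iff n < N,\\
N(2N-n) > n(2N-n) &\iff (N-n)(2N-n) > 0, \text{ which holds when } n < N.
\end{align*}
```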

Therefore, the optimal α is $\alpha_{0,1}$.

The proof of Proposition 2.2.1 is complete.

A3. Hybrid Design with Pooling Error

The log likelihood function based on sample S is


To derive the maximum likelihood estimators of μx, σx, and σp, we calculate:


Setting these first derivatives equal to zero and solving yields the following system of equations


where $\bar{X}$ denotes the mean of $X_1, \dots, X_{[\alpha n]}$ and $\bar{Z}$ denotes the mean of $Z_1, \dots, Z_{[(1-\alpha)n]}$. Consider the second derivatives


Since these estimators are MLEs, the classical asymptotic properties lead to the result $\sqrt{n}\,[\hat{\mu}_x-\mu_x,\ \hat{\sigma}_x^2-\sigma_x^2,\ \hat{\sigma}_p^2-\sigma_p^2]^T \sim N(0, I^{-1})$, where the Fisher information matrix, $I$, is the negative expected value of the second derivatives divided by $n$ as $n \to \infty$, i.e.


It is clear that asymptotically these estimators are distributed as


where Σ is the inverse of the Fisher Information Matrix.

A4. The Proof of Proposition 2.3.1

The optimal proportion of pooled and unpooled samples can be chosen in accord with a given loss function. We consider this loss function to be the variance of the mean estimator, the variance of the variance estimator, or a linear combination of both. When we are only interested in estimation of the mean, the proportion of pooled samples that minimizes the variance of the mean estimator depends on the size of the pooling error in relation to the variability of the sample.

Proposition 2.3.1

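In the context of the variance $\mathrm{var}(\hat{\mu}_x)$ of the mean estimator, the optimal proportion α of unpooled samples satisfies the rule:

$$
\text{Optimal } \alpha =
\begin{cases}
0, & \text{if } \dfrac{\sigma_p^2}{\sigma_x^2} \le \dfrac{N-2n}{N},\\[2ex]
1, & \text{if } \dfrac{\sigma_p^2}{\sigma_x^2} \ge 1,\\[2ex]
\dfrac{\sigma_p^2 N + \sigma_x^2(2n-N)}{n(\sigma_x^2+\sigma_p^2)}, & \text{otherwise};
\end{cases}
$$

whereas, in the context of the variance $\mathrm{var}(\hat{\sigma}_x^2)$ of the variance estimator, the optimal proportion is α = 1.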

As the pooling error grows relative to the variability in the sample, it is recommended to measure a larger proportion of specimens individually, as in random sampling. The existence of pooling error limits the range over which the pooling design is efficient. In accordance with the distribution of the maximum likelihood estimators, one can show that if we are concerned only with estimation of the variance, then the random sampling strategy is always recommended: the variance of the estimator $\hat{\sigma}_x^2$ is minimized when only a random sample is taken (α = 1).
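The rule in Proposition 2.3.1 for the mean estimator translates directly into a short function. The following is a minimal sketch; the function name `optimal_alpha` and the example values of N, n, and the variances are illustrative assumptions, not values from the study.

```python
def optimal_alpha(N, n, sigma_x2, sigma_p2):
    """Optimal proportion of unpooled samples for estimating the mean.

    N: number of available biospecimens, n: number of affordable assays,
    sigma_x2: biomarker variance, sigma_p2: pooling error variance.
    """
    ratio = sigma_p2 / sigma_x2
    if ratio <= (N - 2 * n) / N:
        return 0.0   # pooling error is small: pool everything
    if ratio >= 1:
        return 1.0   # pooling error dominates: random sampling only
    return (sigma_p2 * N + sigma_x2 * (2 * n - N)) / (n * (sigma_x2 + sigma_p2))


# Illustrative values: 100 specimens, 20 assays, unit biomarker variance.
for sigma_p2 in (0.5, 0.8, 1.5):
    print(sigma_p2, optimal_alpha(100, 20, 1.0, sigma_p2))
```

With these example inputs the threshold $(N-2n)/N$ equals 0.6, so a pooling error variance of 0.5 falls in the fully pooled case, 0.8 gives an interior proportion, and 1.5 falls in the random sampling case.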


We want to find the minimum of the variance of the mean estimator with respect to α. In accordance with the asymptotic distribution of $\hat{\mu}$ (see Appendix A1), we have


Since $p = \frac{N-\alpha n}{(1-\alpha)n}$, we can write the variance as a function of α


and hence


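Since the denominator of $\frac{\partial}{\partial \alpha}\mathrm{Var}(\hat{\mu})$ is positive, the behavior of $\mathrm{Var}(\hat{\mu})$ depends on one factor of the numerator, say $Y = \sigma_x^2(N-2n+\alpha n) - \sigma_p^2(N-\alpha n)$. Specifically, if $Y$ is positive then the derivative is also positive, and hence $\mathrm{Var}(\hat{\mu})$ increases in α. Denote $Y = \alpha A + B$, $\alpha \in [0,1]$, with $A = n(\sigma_x^2+\sigma_p^2)$ and $B = \sigma_x^2(N-2n) - \sigma_p^2 N$.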

Clearly $A > 0$, and hence $Y$ increases in α. Since the line $Y = \alpha A + B$, $\alpha \in [0,1]$, can lie only 1) above the axis on $[0,1]$, 2) below the axis on $[0,1]$, or 3) across the axis on $[0,1]$, we have three cases:

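  1. If $B \ge 0$, i.e. $\frac{\sigma_p^2}{\sigma_x^2} \le \frac{N-2n}{N}$, then the derivative $\frac{\partial}{\partial \alpha}\mathrm{Var}(\hat{\mu})$ is strictly positive and $\mathrm{Var}(\hat{\mu})$ increases as α increases, so the optimal $\alpha = \arg\min \mathrm{Var}(\hat{\mu})$ is found at α = 0;
  2. If $A + B \le 0$, i.e. $\sigma_p^2 \ge \sigma_x^2$, then $\frac{\partial}{\partial \alpha}\mathrm{Var}(\hat{\mu})$ is strictly negative and the optimal α is found at α = 1, because $\mathrm{Var}(\hat{\mu})$ decreases as α increases from 0 to 1;
  3. Otherwise, the minimum is attained where $\alpha A + B = 0$, so that $\alpha = -\frac{B}{A} = \frac{\sigma_p^2 N + \sigma_x^2(2n-N)}{n(\sigma_x^2+\sigma_p^2)}$.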

This completes the proof of Proposition 2.3.1.
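The three cases of the proof can be checked against a brute-force search. The sketch below is an illustration under stated assumptions, not the authors' computation: it takes the variance of the mean estimator to be the inverse of the information for μ under the model in which an unpooled value has variance $\sigma_x^2$ and a pooled value has variance $\sigma_x^2/p + \sigma_p^2$, treats p as continuous, and uses arbitrary design values. The function names are hypothetical.

```python
def var_mu_pooling_error(N, n, alpha, sigma_x2, sigma_p2):
    """Inverse information for mu under pooling error (assumed form)."""
    info = alpha * n / sigma_x2                        # unpooled contribution
    if alpha < 1:                                      # pooled contribution
        p = (N - alpha * n) / ((1 - alpha) * n)
        info += (1 - alpha) * n / (sigma_x2 / p + sigma_p2)
    return 1.0 / info


def grid_argmin(N, n, sigma_x2, sigma_p2, steps=2000):
    """Alpha on an evenly spaced grid of [0, 1] with the smallest variance."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid,
               key=lambda a: var_mu_pooling_error(N, n, a, sigma_x2, sigma_p2))


def closed_form(N, n, sigma_x2, sigma_p2):
    """Root of alpha*A + B = 0, clipped to [0, 1] as in the three cases."""
    A = n * (sigma_x2 + sigma_p2)
    B = sigma_x2 * (N - 2 * n) - sigma_p2 * N
    return min(1.0, max(0.0, -B / A))


for sigma_p2 in (0.5, 0.8, 1.5):   # one example value per case of the proof
    print(sigma_p2,
          grid_argmin(100, 20, 1.0, sigma_p2),
          closed_form(100, 20, 1.0, sigma_p2))
```

Clipping $-B/A$ to $[0,1]$ reproduces cases 1 and 2, since $B \ge 0$ gives a non-positive root and $A + B \le 0$ gives a root of at least 1.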

A5. Measurement Error and Pooling Error

In this case, the log likelihood function based on S'' has the form of


Setting the first derivatives with respect to $\mu_x$, $\sigma_x^2$, $\sigma_p^2$, and $\sigma_M^2$ equal to zero and solving yields the system of equations for the maximum likelihood estimators.

The maximum likelihood approach provides that




To avoid cumbersome notation, we omit the explicit forms of the second derivatives, which can be calculated directly.

References List

1. Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440.
2. Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine. 2003;22:2515–2527. [PubMed]
3. Mumford SL, Schisterman EF, Vexler A, Liu A. Pooling biospecimens and limits of detection: effects on ROC curve analysis. Biostatistics. 2006;7:585–598. [PubMed]
4. Schisterman EF, Vexler A. To pool or not to pool, from whether to when: applications of pooling to biospecimens subject to a limit of detection. Paediatric and Perinatal Epidemiology. 2008;22:486–496. [PMC free article] [PubMed]
5. Vexler A, Liu A, Schisterman EF. Efficient design and analysis of biospecimens with measurements subject to detection limit. Biometrical Journal. 2006;48:780–791. [PubMed]
6. Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726. [PubMed]
7. Zhang SD, Gant TW. Effect of pooling samples on the efficiency of comparative studies using microarrays. Bioinformatics. 2005;21:4378–4383. [PubMed]
8. Vexler A, Schisterman EF, Liu A. Estimation of ROC curves based on stably distributed biomarkers subject to measurement error and pooling mixtures. Statistics in Medicine. 2008;27:280–296. [PMC free article] [PubMed]
9. Liu A, Schisterman EF. Comparison of Diagnostic Accuracy of Biomarkers With Pooled Assessments. Biometrical Journal. 2003;45:631–644.
10. Fuller W. Measurement Error Models. New York: John Wiley and Sons; 1987.
11. Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors. Biometrics. 1997;53:823–837. [PubMed]
12. Schisterman EF, Faraggi D, Reiser B, Trevisan M. Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. American Journal of Epidemiology. 2001;154:174–179. [PubMed]
13. Perkins NJ, Schisterman EF. The Youden Index and the optimal cut-point corrected for measurement error. Biometrical Journal. 2005;47:428–441. [PubMed]
14. Dunn G. Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. New York: Oxford University Press; 1989.
15. Freedman LS, Fainberg V, Kipnis V, Midthune D, Carroll RJ. A new method for dealing with measurement error in explanatory variables of regression models. Biometrics. 2004;60:172–181. [PubMed]
16. Thurigen D, Spiegelman D, Blettner M, Heuer C, Brenner H. Measurement error correction using validation data: a review of methods and their applicability in case-control studies. Statistical Methods in Medical Research. 2000;9:447–474. [PubMed]
17. Faraggi D. The effect of random measurement error on receiver operating characteristic (ROC) curves. Statistics in Medicine. 2000;19:61–70. [PubMed]
18. Liu K, Stamler J, Dyer A, McKeever J, McKeever P. Statistical methods to assess and minimize the role of intra-individual variability in obscuring the relationship between dietary lipids and serum cholesterol. Journal of Chronic Diseases. 1978;31:399–418. [PubMed]
19. Reiser B. Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Statistics in Medicine. 2000;19:2115–2129. [PubMed]
20. Dunson DB, Weinberg CR. Modeling human fertility in the presence of measurement error. Biometrics. 2000;56:288–292. [PubMed]
21. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Statistics in Medicine. 2002;21:3093–3106. [PubMed]
22. Shapiro DE. The interpretation of diagnostic tests. Statistical Methods in Medical Research. 1999;8:113–134. [PubMed]
23. Vonesh EF, Carter RL. Efficient inference for random-coefficient growth curve models with unbalanced data. Biometrics. 1987;43:617–628. [PubMed]
24. Wieand S, Gail MH, James BR, James KL. A family of non-parametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989;76:585–592.
25. Liu Q, Chi GY. On sample size and inference for two-stage adaptive designs. Biometrics. 2001;57:172–177. [PubMed]
26. Bellhouse DR, Thompson ME, Godambe VP. Two-stage sampling with exchangeable prior distributions. Biometrika. 1977;64:97–103.
27. Sitter RR, Wu CF. Two-stage design of quantal response studies. Biometrics. 1999;55:396–402. [PubMed]
28. Racine A, Grieve AP, Fluhler H, Smith AFM. Bayesian Methods in Practice: Experiences in the Pharmaceutical Industry. Applied Statistics. 1986;35:93–150.