
Stat Med. Author manuscript; available in PMC 2011 February 28.

Published in final edited form as:

PMCID: PMC2821989

NIHMSID: NIHMS161385

Division of Epidemiology, Statistics and Prevention Research, *Eunice Kennedy Shriver* National Institute of Child Health and Human Development, NIH/DHHS, 6100 Executive Boulevard, Rockville, Maryland 20852, U.S.A.

Evaluating biomarkers in epidemiological studies can be expensive and time consuming. Many investigators use techniques such as random sampling or pooling biospecimens in order to cut costs and save time on experiments. Commonly, analyses based on pooled data are strongly restricted by distributional assumptions that are challenging to validate because of the pooled biospecimens. Random sampling provides data that can be easily analyzed. However, random sampling methods are not optimal cost-efficient designs for estimating means. We propose and examine a cost-efficient hybrid design that involves taking a sample of both pooled and unpooled data in an optimal proportion in order to efficiently estimate the unknown parameters of the biomarker distribution. In addition, we find that this design can be utilized to estimate and account for different types of measurement and pooling error, without the need to collect validation data or repeated measurements. We show an example where application of the hybrid design leads to minimization of a given loss function based on variances of the estimators of the unknown parameters. Monte Carlo simulation and biomarker data from a study on coronary heart disease are used to demonstrate the proposed methodology.

Epidemiological studies increasingly take advantage of new and sophisticated biomarkers to more effectively measure exposures and predictors of interest. However, assaying biomarkers can be expensive and labor intensive, and even the least expensive individual assays may be infeasible to analyze in large cohorts. As a result, it is advantageous to find cost-efficient strategies for sampling. Dorfman [1], Faraggi *et al.* [2], Mumford *et al.* [3], Schisterman and Vexler [4], Vexler *et al.* [5], Weinberg and Umbach [6], and Zhang and Gant [7] all discuss pooling and random sampling as two approaches commonly used by investigators to reduce overall cost. Simple random sampling involves choosing and testing a random subset of available samples. In this case, several individual biospecimens are ignored; however, estimation of the operating characteristics of these individual biospecimens is straightforward. Pooling, on the other hand, involves randomly grouping and physically mixing individual biological samples. Assays are then performed on the smaller number of pooled samples. Thus, pooling reduces the number of measurements without ignoring any individual biospecimens. Theoretically, the measurements of pooled observations are the average of the individual measurements, and have lent themselves well to normality and gamma assumptions in the literature [3,5,8].

Faraggi *et al.* [2] show that in certain situations the pooling design is more efficient than random sampling because it is applied to a greater number of individual biospecimens. However, Mumford *et al.* [3] and Vexler *et al.* [5] describe the tradeoff between random sampling and pooling and give conditions under which random sampling is more efficient than pooling. For example, pooling is much less efficient in situations where the arithmetic mean is not an efficient estimator, e.g. in the case of log-normally distributed data. Moreover, when data are subject to a limit of detection, pooling cannot be recommended as the efficient design when more than 50% of the data are censored.

For a normally distributed biomarker, Faraggi *et al.* [2] show that when the mean is the only parameter of interest, the most cost-efficient strategy would be to pool all of the biospecimens and measure only one sample. That approach obviously does not provide a reasonable estimate of the variance. When estimating the variance of a biomarker, random sampling is simple and efficient and pooling is unnecessary per Liu and Schisterman [9]. Since we are most often concerned with estimation of both the mean and the variance in order to make inferences, it seems a natural progression to consider a hybrid design, one that utilizes both pooling and random sampling.

Previous authors have explored the effects of measurement error under both pooling and random sampling designs. In both designs, measurement error can be caused by problems related to instrumentation, unstable environmental conditions, operator subjectivity, etc. which Fuller [10], Coffin and Sukhatme [11], Schisterman *et al.* [12], and Perkins and Schisterman [13] discuss in more detail. Both designs are similarly affected by measurement error because the biospecimens, pooled and unpooled, are measured via the same process. Dunn [14], Freedman *et al.* [15], and Thurigen *et al.* [16] describe methods requiring repeated measurements or validation data to correct for measurement error. Faraggi [17], Liu *et al.* [18], Reiser [19], and Schisterman *et al.* [12] show the impact of measurement error on estimation. However, obtaining and assaying these replications can be difficult because of cost or other limitations [20]. In general, when no repeated measurements are available, additive measurement errors cannot be evaluated.

A form of error specific to the pooling design lies in the physical process of pooling biospecimens (e.g. effects of temperature, instrument, and technician variability) where the measurements of pooled observations deviate from the average of the individual measurements. We refer to this specific type of measurement error as pooling error. Obviously, measurements of individual specimens would be unaffected by pooling error as they are not pooled. As with other forms of error, pooling error may have a detrimental impact on effect estimation.

In this paper, we consider a hybrid sampling design, explore situations where pooling error and measurement error are present, and efficiently estimate the mean and variance of a normally distributed biomarker while accounting for these errors (Section 2). Additionally, this design allows for estimation of measurement error without requiring repeated sampling. Monte Carlo simulations are used to illustrate the developed techniques (Section 3). The presentation is limited to normally distributed data, but could be extended to other distribution functions as well. The optimal proportions of pooled and unpooled samples can be determined with respect to the given loss function. This methodology is exemplified in Section 4 by estimating the area under the receiver operating characteristic (ROC) curve, where the loss function of interest is the variance of the maximum likelihood estimator (MLE) of the area under the curve (*AUC*). (See references by Faraggi and Reiser [21] and Shapiro [22] for details regarding ROC curves and the *AUC*.) The optimal proportion of pooled samples that minimizes the loss function is found. An application to cholesterol biomarker data from a study on coronary heart disease is presented in Section 5. In Section 6, we conclude with a discussion of our results.

In this section, MLEs are developed under the hybrid design, utilizing both pooled and unpooled samples, for several potential scenarios including no error, measurement error, pooling error, and measurement and pooling error. In each case, it is necessary to know the number of individual biological specimens available (*N*) and the number of assays that can be afforded (*n*) due to budget and/or time limitations.

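Suppose that independent identically distributed (*iid*) biomarker levels *X*_{i} ~ *N*(μ_{x}, σ_{x}^{2}), *i* = 1, …, *N*, can be observed. A random sample of size *n* can be drawn from the original population of *N*, and corresponding MLEs,

$$\hat{\mu}_x = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \hat{\sigma}_x^2 = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2,$$

and their variances can be directly obtained, where specifically

$$\mathrm{Var}(\hat{\mu}_x) = \frac{\sigma_x^2}{n}, \qquad \mathrm{Var}(\hat{\sigma}_x^2) = \frac{2(n-1)\,\sigma_x^4}{n^2}.$$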

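Pooled samples are obtained by randomly grouping individual samples into groups of size *p*, physically combining specimens of the same group and testing each group as a single observation. Assuming that the pooled sample measurements reflect the individual samples’ average, we denote pooled samples of group size *p* by

$$Z_i = \frac{1}{p}\sum_{j=(i-1)p+1}^{ip} X_j, \qquad i = 1, \ldots, n.$$

By the additive property of the normal distribution, we obtain

$$Z_i \sim N\!\left(\mu_x, \frac{\sigma_x^2}{p}\right).$$

Obviously, the MLEs based on the pooled data are

$$\hat{\mu}_x = \bar{Z} = \frac{1}{n}\sum_{i=1}^{n} Z_i, \qquad \hat{\sigma}_x^2 = \frac{p}{n}\sum_{i=1}^{n} \left(Z_i - \bar{Z}\right)^2,$$

where the variance of the estimator of μ is simply

$$\mathrm{Var}(\hat{\mu}_x) = \frac{\sigma_x^2}{np}.$$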

One can show that the σ-estimators based on {*Z*_{1}, …, *Z*_{n}} and {*X*_{1}, …, *X*_{n}} have the same variance.
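The idealized, error-free pooling described in this section can be illustrated with a short simulation. This is only a sketch under stated assumptions: the helper names `make_pools` and `pooled_mles`, the cohort size, the pool size, and the values μ_x = 5 and σ_x = 1 are all hypothetical choices made for illustration, and pooling is modeled as an exact average of the pooled specimens.

```python
import random
import statistics

def make_pools(values, p):
    """Average consecutive groups of size p (idealized, error-free pooling)."""
    n_pools = len(values) // p
    return [sum(values[i * p:(i + 1) * p]) / p for i in range(n_pools)]

def pooled_mles(pools, p):
    """MLEs of (mu_x, sigma_x^2) from pool averages Z_i ~ N(mu_x, sigma_x^2 / p)."""
    mu_hat = statistics.fmean(pools)
    # Var(Z_i) = sigma_x^2 / p, so the pooled sample variance is rescaled by p.
    sigma2_hat = p * sum((z - mu_hat) ** 2 for z in pools) / len(pools)
    return mu_hat, sigma2_hat

rng = random.Random(0)
N, p = 3000, 3                                  # hypothetical cohort and pool sizes
x = [rng.gauss(5.0, 1.0) for _ in range(N)]     # hypothetical mu_x = 5, sigma_x = 1
z = make_pools(x, p)                            # n = N / p = 1000 assays
mu_hat, sigma2_hat = pooled_mles(z, p)
```

Because every specimen contributes to some pool, `mu_hat` coincides with the mean of all *N* individual values even though only *N*/*p* assays are run, which is the source of the efficiency gain for mean estimation.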

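In this section, we consider the case where both pooled and unpooled samples are similarly affected by measurement error. Let us assume the measurement error is ε^{m} ~ *N*(0, σ_{m}^{2}), independent of the biomarker levels. The actual measurements of the unpooled and pooled samples are the *iid* random variables *X*'_{i} = *X*_{i} + ε^{m}_{i} and *Z*'_{i} = *Z*_{i} + ε^{m}_{i}, respectively, so *X*'_{i} ~ *N*(μ_{x}, σ_{x}^{2} + σ_{m}^{2}) and *Z*'_{i} ~ *N*(μ_{x}, σ_{x}^{2}/*p* + σ_{m}^{2}).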

Next, consider a hybrid sample *S*, which is a combination of both pooled and unpooled samples, i.e. *S* is defined as {*X*_{1}, …, *X*_{[αn]}, *Z*_{1}, …, *Z*_{[(1−α)n]}}, where α ∈ [0,1] represents the proportion of the *n* assays that comes from a random (unpooled) sample. Note that α*n* + (1−α)*np* ≤ *N*, but here we assume that α*n* + (1−α)*np* = *N* if α < 1, and that *N* and *n* are fixed based on budget and time constraints and the size of the cohort. In this paper we choose to optimize over α, though since α and *p* are connected by a one-to-one mapping, it is equivalent to optimize over *p*.
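The one-to-one mapping between α and *p* follows directly from the constraint α*n* + (1−α)*np* = *N*. A minimal sketch of both directions of the mapping is given below; the function names are hypothetical, and the pool size is left as a real number (not rounded to an integer).

```python
def pool_size(alpha, N, n):
    """Pool size p implied by alpha*n + (1 - alpha)*n*p = N, for 0 <= alpha < 1."""
    if not 0 <= alpha < 1:
        raise ValueError("alpha must lie in [0, 1); alpha = 1 means no pooling")
    return (N - alpha * n) / ((1 - alpha) * n)

def alpha_from_pool_size(p, N, n):
    """Proportion of unpooled assays implied by pool size p > 1 (inverse mapping)."""
    if p <= 1:
        raise ValueError("pool size must exceed 1")
    return (n * p - N) / (n * (p - 1))

# With N = 3000 specimens and n = 1000 assays:
p_half = pool_size(0.5, 3000, 1000)                 # 500 unpooled + 500 pools of 5
alpha_back = alpha_from_pool_size(5, 3000, 1000)    # recovers alpha = 0.5
```

For example, with *N* = 3000 and *n* = 1000, α = 0 corresponds to 1000 pools of size 3, while α = 0.9 requires the remaining 100 assays to absorb 2100 specimens in pools of size 21.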

The hybrid design with measurement error provides data *S*' = {*X*'_{1}, …, *X*'_{[αn]}, *Z*'_{1}, …, *Z*'_{[(1−α)n]}}. The MLEs of μ_{x}, σ_{x}^{2}, and σ_{m}^{2} based on *S*' can be derived from the following equations:

where *X̄*' denotes the mean of *X*'_{1}, …, *X*'_{[αn]}, and *Z̄*' denotes the mean of *Z*'_{1}, …, *Z*'_{[(1−α)n]}. These estimates are asymptotically distributed as

where Σ is the inverse of the Fisher Information Matrix, *I*, leading to the following variances:

(For details, see Appendix A1.)
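As an illustration of how the variance of the mean estimator depends on α in the measurement-error case, the sketch below uses the standard inverse-variance (Fisher information) result for two independent normal samples with a common mean, assuming *X*' ~ *N*(μ_x, σ_x² + σ_m²) and *Z*' ~ *N*(μ_x, σ_x²/*p* + σ_m²) with *p* determined by α, *N*, and *n*. The function name is hypothetical, and the expression is written from these assumptions; it is not copied from Appendix A1.

```python
def var_mean_measurement_error(alpha, N, n, sigma2_x, sigma2_m):
    """Asymptotic variance of the common-mean MLE for a hybrid sample.

    Assumes alpha*n unpooled assays with variance sigma2_x + sigma2_m and
    (1 - alpha)*n pooled assays with variance sigma2_x / p + sigma2_m,
    where p = (N - alpha*n) / ((1 - alpha)*n) is not rounded.
    """
    info = alpha * n / (sigma2_x + sigma2_m)        # information from unpooled assays
    if alpha < 1:
        p = (N - alpha * n) / ((1 - alpha) * n)
        info += (1 - alpha) * n / (sigma2_x / p + sigma2_m)  # pooled assays
    return 1.0 / info

alphas = [k / 10 for k in range(1, 10)]
variances = [var_mean_measurement_error(a, 3000, 1000, 1.0, 0.1) for a in alphas]
```

For *N* = 3000, *n* = 1000, σ_x² = 1, and σ_m² = 0.1, the computed variance grows steadily as α moves from 0.1 to 0.9, in line with the statement that pooled-only sampling (α = 0) is best for estimating the mean.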

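The values α = 0 (pooled samples only) and α = α_{0,1} (given in Appendix A2) minimize the variance of the mean estimator and the variance of the variance estimator, respectively. (Proof in Appendix A2.)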

Commonly, var(μ̂_{x}) and var(σ̂_{x}^{2}) are components of a given loss function that we would like to minimize with respect to α. In this case, it is generally not recommended for α to be exactly equal to 0 or 1. Formal notations and proof schemes mentioned in the Appendices can be used to obtain the optimal α under a stated loss function. In Section 4, we show the application of this methodology for the case where the loss function of interest is the variance of the *AUC* estimator, a function of both var(μ̂_{x}) and var(σ̂_{x}^{2}).

Pooling error is a specific type of measurement error that occurs as a result of the process of physically pooling the samples together and may introduce additional variability. We denote this error as ε^{p} and assume ε^{p} ~ *N*(0, σ_{p}^{2}). Subsequently, we assume that the pooled observations have the form *Z*^{p}_{i} = *Z*_{i} + ε^{p}_{i}, where *Z*^{p}_{i} ~ *N*(μ_{x}, σ_{x}^{2}/*p* + σ_{p}^{2}). The maximum likelihood estimators of μ_{x}, σ_{x}^{2}, and σ_{p}^{2} and the corresponding Fisher Information Matrix, based on the hybrid design with pooling error only, are described in detail in Appendix A3 and Appendix A4. In contrast with section 2.1, * _{x}* based on only pooled samples subject to pooling error, {

The pooling-error-only case has vast applicability where research is based on well established biomarkers with thoroughly developed and precise measurement techniques. This is more often than not the case, as most studies assume no or a negligible amount of measurement error. However, the extra handling and processing that is required to pool samples and realize the efficiency benefits of the hybrid design could cause a non-negligible level of additional variability.

In order to allow for both pooling and measurement errors, the hybrid sample must consist of three different samples to be able to estimate all of the variance components (i.e. σ_{x}^{2}, σ_{m}^{2}, and σ_{p}^{2}); otherwise the terms are not identifiable. Specifically, these three samples include one unpooled and two groups of pooled samples with different pooling group sizes, say *p*_{1} and *p*_{2} with *p*_{1} ≠ *p*_{2}. The observed sample *S*" consists of:

where α_{1}, α_{2} ∈ [0,1], α_{1}*n* + α_{2}*np*_{1} + (1−α_{1}−α_{2})*np*_{2} = *N*. Thus, the MLEs of μ_{x}, σ_{x}^{2}, σ_{m}^{2}, and σ_{p}^{2} based on *S*" can be obtained from the system of equations:

where

These estimates asymptotically satisfy

where Σ is developed in Appendix A5. Thus, for example, we obtain that

where

For fixed *N*, *n*, *p*_{1} and *p*_{2}, α_{1} and α_{2} that minimize the variance of the mean are unique and can be calculated via conditional maximization. However, for other loss functions the solutions may not necessarily be unique.
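With three sub-samples, the constraint α_1*n* + α_2*np*_1 + (1−α_1−α_2)*np*_2 = *N* leaves only one free proportion once *N*, *n*, *p*_1, and *p*_2 are fixed. The sketch below solves the constraint for α_2 given α_1 and checks feasibility; the function name and the numerical values are hypothetical and serve only to illustrate the bookkeeping.

```python
def remaining_proportions(alpha1, N, n, p1, p2):
    """Solve alpha1*n + alpha2*n*p1 + (1 - alpha1 - alpha2)*n*p2 = N for alpha2.

    Returns (alpha2, alpha3), the proportions of assays run on pools of size
    p1 and p2. Raises ValueError when no valid split exists.
    """
    if p1 == p2:
        raise ValueError("pool sizes must differ")
    alpha2 = (N - alpha1 * n - (1 - alpha1) * n * p2) / (n * (p1 - p2))
    alpha3 = 1 - alpha1 - alpha2
    if alpha2 < 0 or alpha3 < 0:
        raise ValueError("no feasible allocation for these inputs")
    return alpha2, alpha3

# Hypothetical setting: N = 3000, n = 1000, pools of size 2 and 5, 20% unpooled.
alpha2, alpha3 = remaining_proportions(0.2, 3000, 1000, 2, 5)
used = 0.2 * 1000 + alpha2 * 1000 * 2 + alpha3 * 1000 * 5   # specimens consumed
```

In this hypothetical setting, 200 unpooled assays, 400 pools of size 2, and 400 pools of size 5 use exactly the 3000 available specimens.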

In this section, Monte Carlo simulations are used to demonstrate the properties of the proposed methodology.

Normally distributed true biomarker levels, *N* = 3000 and *n* = 1000, were generated with , which were subsequently combined with several scenarios of random measurement error, . Hybrid samples were formed from uncorrelated pooled and unpooled measurements in proportions of α = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9.

Table 1 shows the results of these simulations and it seems clear that each estimator has Monte Carlo variance that is approximately equal to the derived theoretical variance. Table 1 confirms that the most efficient strategy for mean estimation utilizes the most pooled samples (α = 0.1). In these simulations, the variance is minimized at small values of α but not always at the smallest value of α . This discrepancy is due to the fact that the theoretical variance is calculated using a pooling size that does not have to be an integer value, but for the simulation, the value of *p* is rounded to the nearest integer value. One impact of the rounding is that it could potentially use more samples than are available. See Figure 1a.
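A reduced-scale version of this kind of simulation can be sketched as follows. The block assumes additive normal measurement error, rounds the pool size to an integer as described above, and solves the likelihood equations by a simple fixed-point iteration on the two group variances. The function names, the iteration scheme, the seed, and the small sizes (*N* = 300, *n* = 100, 400 repetitions, μ_x = 0) are assumptions chosen for speed and do not reproduce the settings behind Table 1.

```python
import random
import statistics

def simulate_hybrid(rng, alpha, N, n, mu, sigma2_x, sigma2_m):
    """Draw one hybrid sample: unpooled X' and pooled Z', both with measurement error."""
    n_x = round(alpha * n)                  # number of unpooled assays
    n_z = n - n_x                           # number of pooled assays
    p = max(2, round((N - n_x) / n_z))      # pool size rounded to an integer
    sd_x, sd_m = sigma2_x ** 0.5, sigma2_m ** 0.5
    xs = [rng.gauss(mu, sd_x) + rng.gauss(0.0, sd_m) for _ in range(n_x)]
    zs = [sum(rng.gauss(mu, sd_x) for _ in range(p)) / p + rng.gauss(0.0, sd_m)
          for _ in range(n_z)]
    return xs, zs, p

def hybrid_mles(xs, zs, p, iters=25):
    """Fixed-point solution of the normal likelihood equations (unconstrained)."""
    x_bar, z_bar = statistics.fmean(xs), statistics.fmean(zs)
    mu = statistics.fmean(xs + zs)
    for _ in range(iters):
        v1 = sum((x - mu) ** 2 for x in xs) / len(xs)   # estimates sigma2_x + sigma2_m
        v2 = sum((z - mu) ** 2 for z in zs) / len(zs)   # estimates sigma2_x/p + sigma2_m
        w1, w2 = len(xs) / v1, len(zs) / v2
        mu = (w1 * x_bar + w2 * z_bar) / (w1 + w2)      # inverse-variance weighted mean
    sigma2_x = p * (v1 - v2) / (p - 1)      # may be negative in small samples
    sigma2_m = (p * v2 - v1) / (p - 1)      # may be negative in small samples
    return mu, sigma2_x, sigma2_m

rng = random.Random(1)
alpha, N, n = 0.5, 300, 100                 # hypothetical reduced-scale setting
mu_hats = []
for _ in range(400):
    xs, zs, p = simulate_hybrid(rng, alpha, N, n, 0.0, 1.0, 0.1)
    mu_hats.append(hybrid_mles(xs, zs, p)[0])

mc_var = statistics.pvariance(mu_hats)                       # Monte Carlo variance
th_var = 1.0 / (50 / (1.0 + 0.1) + 50 / (1.0 / p + 0.1))     # inverse information
```

Comparing `mc_var` with `th_var` mirrors the comparison of Monte Carlo and theoretical variances in the text; the variance-component estimates are returned without truncation, so negative values can occur and would need separate handling.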

Figure 1a. Variance of the mean estimator for both rounded and unrounded values of *p*, for the case of *N* = 3000, *n* = 1000, σ_{x}^{2} = 1, and σ_{m}^{2} = 0.1.

Table 1. Monte Carlo simulation results for the measurement-error-only case. *N* = 3000, *n* = 1000, σ_{x}^{2} = 1, 2500 repetitions.

For variance estimation, when measurement error exists, strategies that utilize more pooled samples are more efficient. Variance estimation is not as affected by rounding *p* as mean estimation. See Figure 1b. In addition, when measurement error is present, estimation is only possible when both pooled and unpooled samples are used. In the case where α = 0.1, there is probably not a sufficient sample of unpooled data for proper estimation.

As with all maximum likelihood estimation, caution should be used when applying these methods due to questions of robustness, specifically regarding used for optimization. Variance estimates are restricted to nonnegative values; however, in the rare cases when the unrestricted MLEs result in σ̂^{2} < 0, we apply techniques used by Schisterman *et al.* [12] and replace σ̂^{2} by a very small number. Specifically, we replace and evaluate the appropriateness of the various alternatives. In addition, we compare these results with those obtained by replacing the variance with , as recommended by Vonesh and Carter [23]. In our case, the choice of replacement value did not have a large impact on the results.

Again, normally distributed data were generated with, but with varying pooling errors . The number of individual biospecimens available for testing was varied (*N* = 300, 200), as was the number of assays performed (*n* = 150, 100, 50). Hybrid samples were formed from uncorrelated pooled and unpooled measurements in proportions of α = 0.25, 0.5, and 0.75. For values of α that do not imply integer values of *p*, the value of *p* in the simulation was rounded to the nearest integer value.

Table 2 displays the results of these simulations where it is clear that the Monte Carlo variance for the estimators is approximately equal to the derived theoretical variance. Table 2 also reinforces Proposition 2.3.1 when estimating the mean. Three cases are shown to demonstrate each of the possible optimal values of α. Case 1 corresponds to the situation where the pooling error is small in relation to the variability in the sample. Here the variance is minimized when more pooled samples are used (α is closest to zero). In Case 2, the pooling error is greater than the sampling error and the variance is minimized when fewer pooled samples are used (α closest to 1). Case 3 is an example of a situation where the formula for the optimal α is used, and the simulation matches what we would expect. As expected for estimation of the variance, the simulation results show that for each different set of parameters, the sample which utilizes the greatest proportion of random samples (in this case where α = 0.75) is always the most efficient.

In these cases, tests for normality (e.g. Kolmogorov Smirnov, Shapiro-Wilk) did not reject the hypothesis regarding the normal distribution of the maximum likelihood estimators repeated by Monte Carlo simulations.

A common measure of a biomarker’s ability to discriminate between those with and without disease is the *AUC*, as Faraggi and Reiser [21], Shapiro [22], and Wieand *et al.* [24] describe. Let *X* and *Y* denote the test scores of diseased and healthy individuals, respectively, so that *AUC* = Pr(*X* > *Y*). Here, *X* ~ *N*(μ_{x}, σ_{x}^{2}) and *Y* ~ *N*(μ_{y}, σ_{y}^{2}) lead to *AUC* = Φ(δ), where δ = (μ_{x} − μ_{y})/(σ_{x}^{2} + σ_{y}^{2})^{1/2} and Φ is the standard normal distribution function. Estimating *AUC* requires estimates of the mean and variance of both populations, so the optimal sampling strategy requires finding the proportion of pooled samples that minimizes the variance of the *AUC* estimator (the loss function of interest here). Without loss of generality and for demonstration purposes, we assume that only pooling error is in effect. The appropriate variances of our estimators from Section 2.3 (Appendix A3) are applied to the loss function of interest.
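For normally distributed test scores, the relation *AUC* = Pr(*X* > *Y*) = Φ(δ) can be evaluated with the standard library alone. The sketch below uses the standard binormal expression δ = (μ_x − μ_y)/√(σ_x² + σ_y²); the function names are hypothetical, and the inputs are the unpooled case and control summary statistics reported in the example of Section 5, plugged in purely for illustration.

```python
import math

def normal_cdf(t):
    """Standard normal distribution function, written via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def binormal_auc(mu_x, sigma_x, mu_y, sigma_y):
    """AUC = Pr(X > Y) for independent X ~ N(mu_x, sigma_x^2), Y ~ N(mu_y, sigma_y^2)."""
    delta = (mu_x - mu_y) / math.sqrt(sigma_x ** 2 + sigma_y ** 2)
    return normal_cdf(delta)

# Unpooled summary statistics from the coronary heart disease example:
# cases 226.8 (SD 41.7), controls 205.5 (SD 42.3).
auc = binormal_auc(226.8, 41.7, 205.5, 42.3)
```

With these inputs δ is about 0.36 and the resulting *AUC* is about 0.64, which agrees with the full-sample value quoted in the example.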

Assume observations are of the form described in Section 2.3. Applying the variances of the μ and σ estimators found in Section 2.3 (Appendix A3), we can directly find the value of α that minimizes the variance of the *AUC* estimator,

where where *x _{p}* and

In general, for a loss function corresponding to the variance of our parameters, we can find the optimal value of α which minimizes the loss function and can be evaluated as a function of unknown parameters. Section 5 illustrates this theoretical result with a practical example.

A study of coronary heart disease at the Cedars-Sinai Medical Center investigated the ability of a cholesterol biomarker to discriminate between individuals who recently survived a myocardial infarction (cases) and those with no previous history of myocardial infarction (controls). Biomarker measurements were collected on 40 cases and 40 controls and blood samples were pooled in groups of 2, keeping cases and controls separate, and then re-measured. We therefore have individual measurements for 40 cases and 40 controls, as well as measurements for 20 samples of two pooled cases and 20 samples of two pooled controls. This allowed us to organize hybrid samples by taking combinations of individual and pooled measurements. The mean (±SD) was 205.5 (±42.3) for unpooled control samples and 207.9 (±35.2) for pooled control samples. The mean (±SD) was 226.8 (±41.7) for unpooled case samples and 223.2 (±37.3) for pooled case samples. The standard deviation of the differences between pooled measurements and the average of corresponding individual measurements was 6.7 for cases and 7.4 for controls.

For the purposes of this example, we assume that only pooling error is in effect. Under these assumptions, we apply the results from Section 4 for determining the optimal proportions of pooled and unpooled samples to maximize efficiency when estimating the *AUC*. Using the empirical means and variances, we find that α = 0.4 for the cases and α = 0.35 for the controls to minimize the variance of the *AUC* estimator when only 20 assays are tested (*n*=20). In other words, it is recommended that hybrid samples include 8 individual and 12 pooled (*p*=2) samples for the cases and 7 individual and 13 pooled samples for the controls to minimize the variance of the *AUC* estimator. More control samples should be pooled since the estimated pooling error is slightly lower among the controls than the cases. Using the developed methodology, we are also able to estimate the amount of pooling error. In particular, we took a hybrid sample of both cases and controls using the optimal levels of α. The hybrid sample was distributed as follows (mean (±SD)): unpooled control samples 201.0 (±31.5), pooled control samples 205.1 (±30.8), unpooled case samples 229.4 (±43.1), and pooled case samples 220.8 (±34.8). The MLEs were then estimated using the proposed methodology, and we found that . The *AUC* estimator was found to be 0.65 with a variance of 0.0050.

By implementation of the hybrid design in a small simulation study testing only 20 samples, we were able to reduce the variability of the *AUC* estimator by 30% as compared to using only a random sample (hybrid design: *AÛC* = 0.621, *Var*(*AÛC*) = 0.0049 ; random sample: *AÛC* = 0.641, *Var*(*AÛC*) = 0.0071). As compared to the full sample, the hybrid design used only half the number of samples with only a 37% loss of efficiency, whereas the random sample lost 96% (full sample: *AÛC* = 0.640, *Var*(*AÛC*) = 0.0036).
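The efficiency comparisons above are ratios of the reported variances. The arithmetic can be checked as follows, using only the rounded variances quoted in the text; the variable names are hypothetical.

```python
# Reported variances of the AUC estimator (rounded values from the text).
var_hybrid, var_random, var_full = 0.0049, 0.0071, 0.0036

# Relative reduction in variance from using the hybrid design instead of a random sample.
reduction_vs_random = 1 - var_hybrid / var_random

# Relative increase in variance compared with the full sample.
loss_hybrid = var_hybrid / var_full - 1
loss_random = var_random / var_full - 1
```

These evaluate to roughly 0.31, 0.36, and 0.97. The small differences from the quoted 30%, 37%, and 96% are presumably due to rounding of the reported variances.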

We present methodology for a cost-efficient hybrid sampling design that combines both pooled and unpooled samples in optimal proportions to efficiently estimate the unknown parameters of the biomarker distribution. Maximum likelihood estimators and their asymptotic distributions are given for various situations where errors are present. Application of this design also makes possible estimation of measurement error without repeated sampling. The hybrid design is similar to repeated measures but also retains, and in some cases improves, the efficiency of estimating μ_{x} and σ_{x}^{2}. The information on repeated measures can only be used to estimate σ_{m}^{2}, whereas the hybrid design can be used to estimate all parameters of interest. Any outside information available on the source and quantity of the errors can improve implementation of the hybrid design and subsequent estimation.

Misspecification of the measurement error model is of course a concern for estimation of α, and in turn for estimation of the main parameters of interest. Researchers often have substantial knowledge of the measurement of similar biomarkers, or biomarkers measured in a similar manner, and of the laboratory conditions under which pooling will be performed (e.g. mechanical versus technician). This knowledge aids model specification. As mentioned, the pooling-error-only model is especially appropriate in cases where research is based on well-established and precise measurement techniques. In practice, it is best to allow for both types of error, especially if no prior knowledge of the biomarkers and the measurement process exists.

It should be noted that owing to the complexity of estimating the biomarker distribution from observed convoluted data, generalized likelihood or nonparametric inferences based only on pooled data may not be feasible, since the distribution of the observed averages involves convolution of *p* random variables of a biomarker distribution as discussed by Vexler *et al.* [8]. In several cases, like those mentioned above, the hybrid design can be based on pre-specified proportions of pooled and unpooled samples (e.g. α = 0.5). Defining and approximating the optimal proportions with respect to a given loss function can also be treated as a problem to be solved before the hybrid design is executed. In this article, several rules are analyzed which aid in determining the optimal proportion of unpooled samples. These rules are dependent on parameters (e.g. ) that are unknown at the design stage. It is usually assumed that the experimenter has *a priori* knowledge regarding the distribution of measurements, and good initial values of the unknown parameters are chosen based on this knowledge. This information might be available from previous related experiments, preliminary studies, or pre-testing of some kind. This information could also be taken from a subsample of the *N* available samples, with some modifications to the estimators. Since the unknown parameters are evaluated in previous experimentation, one might argue correctly that uni-stage designs are in fact the second stage of a less formal multistage experiment. Thus, following the design literature (e.g. Liu and Chi [25], Bellhouse *et al.* [26], Sitter and Wu [27], and Racine *et al.* [28]), the problem of approximating the optimal proportions of pooled and unpooled samples can be considered by using Bayesian methods, learning samples with a fixed size, or two-stage design techniques, etc.
Two-stage designs can, in particular, be based on sequential procedures that make relevant test decisions using the first stage, utilizing a non pre-specified number of observations. Note also that the relationship between the bias of estimation of optimal values of α and the corresponding bias of main parameter estimation can be a major factor in defining the appropriate size of the learning sample. Our limited simulation results show that reasonable estimates of the optimal values of α can be obtained from relatively small to moderate-sized learning samples. In fact, learning samples with sizes proportional to the square root of the number of measured assays (i.e. *n*^{1/2}) (Liu and Chi [25], Bellhouse *et al.* [26], Sitter and Wu [27], and Racine *et al.* [28]) provide estimators of with variances comparable to those of the estimators considered in previous sections. Consider, for example, the situations depicted in Table 1 of Section 3.1. To approximate the optimal evaluation of , the values of α from Proposition 2.2.1, having form

are estimated based on learning samples with different sizes and the 50/50 proportion of numbers of pooled and unpooled measurements. (The learning samples, in these cases, should also have pooled and unpooled data to estimate .) Then, the hybrid samples are formed and the parameters estimated. Table 3 presents the corresponding Monte Carlo results.

Table 3. Monte Carlo simulation results for the two-stage design. *N* = 3000, *n* = 1000, σ_{x}^{2} = 1, 2500 repetitions. Learning samples were based on *n*' measurements from *N*' bioassays (α = 0.5, *p* = 2).

Thus, comparing the Monte Carlo estimators of in Table 1 and Table 3, we observe that the proposed estimators are quite robust to the choice of the learning sample size that is proportional to (*n*^{1/2}).

In this article, we demonstrate how to apply the hybrid design to estimate the *AUC*, but the methodology could also be applied to other functions such as hypothesis testing, regression, etc. In the present paper, we analyze normally distributed biomarkers, and the efficiency of the pooling design depends in large part on the distribution of interest. However, other common biomarker distributions could be evaluated in a similar manner. For nonparametric cases, the proposed estimators can be considered as the least square estimators that asymptotically have the variances presented in this paper. We consider the situation where only measurement error or only pooling error is present and the number of available biospecimens (*N*) is fixed. Hence, the pooling size is defined by the recommended optimal proportion (α) and the number of samples we can afford to test (*n*). In a similar manner, we could analyze the situation where the pooling size is fixed. However, when both measurement and pooling errors are present, *p*_{1} and *p*_{2} are considered fixed beforehand to allow for the optimization of α_{1}. The proposed estimators could also be modified to allow for differential errors that are a function of the pooling group size or levels of the biomarker, or errors due to mixing in non-identical proportions.

In conclusion, the hybrid sampling design combines the benefits of both pooling and random sampling under strict cost limitations, while making possible the estimation of measurement error without repeated samples. Only a small to moderate-sized training sample is needed in order to successfully implement the hybrid design.

The authors thank the editor and reviewers for their valuable comments. This work was supported by the Intramural Research Program of the National Institutes of Health, *Eunice Kennedy Shriver* National Institute of Child Health and Human Development.

The log likelihood function based on *S*' has the form of:

The derivatives of the log likelihood function are

Setting the first derivatives equal to zero and solving yields the following system of equations

The second derivatives of the log likelihood function are

Since these estimates are MLE’s, we have where , where

Hence,

Since , we can write

and hence

Thus, *Var*(μ̂_{x}) is a strictly increasing function of α. Therefore, the optimal α is found at α = 0, which corresponds to taking only pooled samples.

While , we have

and hence

In this case,

There are two possible solutions:

We can show that the first solution α_{0,1} is strictly less than 1. Write

Now, if we consider the coefficients for each term of α_{0,1}, we can show that the numerator is less than the denominator and therefore the solution α_{0,1} is less than 1, i.e. since the coefficients for in the numerator and denominator are equivalent, the coefficients for are *N* (*n* + *N*) < 2*N*^{2}, the coefficients for are *nN* < 2*nN* − *n*^{2}, and hence α_{0,1} ≤ 1.

Consider

Here the coefficients for in the numerator and denominator are equivalent, the coefficients for are 3*N*^{2} −*nN* > 2*N*^{2}, the coefficients for are *N*(2*N* − *n*) > *n*(2*N* − *n*), and hence α_{0,2} > 1.

Therefore, the optimal alpha is α_{0,1}.

The proof of Proposition 2.2.1 is complete.

The log likelihood function based on sample *S* is

To derive the maximum likelihood estimators of μ_{x}, σ_{x}, and σ_{p}, we calculate:

Setting these first derivatives equal to zero and solving yields the following system of equations

where denotes the mean of *X*_{1}, …, *X*_{[αn]} and denotes the mean of *Z*_{1}, …, *Z*_{[(1−α)n]}. Consider the second derivatives

Since these estimates are MLEs, we use the classical asymptotic properties which lead to the result , where the Fisher Information Matrix, *I*, is the negative expected value of the 2^{nd} derivatives divided by *n* as *n* → ∞, i.e.

It is clear that asymptotically these estimators are distributed as

where Σ is the inverse of the Fisher Information Matrix.

The optimal proportion of pooled and unpooled samples can be chosen in accord with a given loss function. We consider this loss function to be the variance of the mean estimator, the variance of the variance estimator, or a linear combination of both. When we are only interested in estimation of the mean, the proportion of pooled samples that minimizes the variance of the mean estimator depends on the size of the pooling error in relation to the variability of the sample.

In the context of the variance var(μ̂_{x}) of the mean estimator, the optimal proportion α of unpooled samples satisfies the rule:

whereas, in the context of the variance of the variance estimator the optimal proportion is α = 1.

As the pooling error gets larger in relation to the variability in the sample, it is recommended to measure a larger proportion of unpooled (randomly sampled) specimens. The existence of pooling error limits the region where the pooling design is efficient. In accordance with the distribution function of the maximum likelihood estimators, one can show that if we are only concerned with estimation of the variance, then the random sampling strategy is always recommended. The variance of the estimator is minimized when only a random sample is taken (α = 1).
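The dependence of the optimal α on the relative size of the pooling error can be explored numerically. The sketch below assumes unpooled measurements *X* ~ *N*(μ_x, σ_x²), pooled measurements with variance σ_x²/*p* + σ_p², and *p* fixed by α*n* + (1−α)*np* = *N*. The closed-form stationary point in `optimal_alpha` was worked out here from that assumed model by differentiating the inverse information; it is offered as an illustration and is not claimed to be the exact form of the rule stated in the proposition. All function names and parameter values are hypothetical.

```python
def var_mean_pooling_error(alpha, N, n, sigma2_x, sigma2_p):
    """Asymptotic variance of the mean MLE when only pooled assays carry extra error."""
    info = alpha * n / sigma2_x                       # unpooled assays
    if alpha < 1:
        p = (N - alpha * n) / ((1 - alpha) * n)       # unrounded pool size
        info += (1 - alpha) * n / (sigma2_x / p + sigma2_p)
    return 1.0 / info

def optimal_alpha(N, n, sigma2_x, sigma2_p):
    """Stationary point of the variance above, clipped to [0, 1] (derived assumption)."""
    a0 = (N * (sigma2_p - sigma2_x) + 2 * n * sigma2_x) / (n * (sigma2_x + sigma2_p))
    return min(1.0, max(0.0, a0))

def grid_minimizer(N, n, sigma2_x, sigma2_p, steps=1000):
    """Brute-force check of the minimizer on an evenly spaced grid of alpha values."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda a: var_mean_pooling_error(a, N, n, sigma2_x, sigma2_p))

# Three hypothetical scenarios with N = 300, n = 100, sigma_x^2 = 1:
small = optimal_alpha(300, 100, 1.0, 0.1)    # small pooling error
large = optimal_alpha(300, 100, 1.0, 1.5)    # pooling error exceeds sigma_x^2
middle = optimal_alpha(300, 100, 1.0, 0.5)   # intermediate pooling error
```

In these scenarios the minimizer is α = 0 when the pooling error is small, α = 1 when σ_p² exceeds σ_x², and an interior value (about 1/3) in between, which is the same three-way pattern described for the simulation cases.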

We want to find the minimum of the variance of the mean estimator with respect to α. In accordance with the asymptotic distribution of μ̂_{x} (see Appendix A1), we have

Since , we can write the variance as a function of α

and hence

Since the denominator of the derivative is positive, the behavior of *Var*(μ̂_{x}) depends on the sign of one part of the numerator, say *Y*. Specifically, if *Y* is positive then the derivative is also positive, and hence *Var*(μ̂_{x}) increases with α. Denote

Here *A* > 0, and hence *Y* increases with α. Since the line *Y* = α*A* + *B*, α ∈ [0,1], can lie only 1) above the α-axis on [0,1], 2) below the α-axis on [0,1], or 3) across the α-axis within [0,1], we have three cases:

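- If *B* ≥ 0, then the derivative is positive for α > 0, so *Var*(μ̂_{x}) increases as α increases and the optimal α = arg min *Var*(μ̂_{x}) is found at α = 0;
- If *A* + *B* ≤ 0, then the derivative is negative for α < 1, so *Var*(μ̂_{x}) decreases as α increases from 0 to 1 and the optimal α is found at α = 1;
- Otherwise (*B* < 0 < *A* + *B*), the derivative changes sign from negative to positive at the root of α*A* + *B* = 0, so the optimum is the interior point α = −*B*/*A*.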

This completes the proof of Proposition 2.2.1.
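The three cases of the proof amount to clipping the root of *Y* = α*A* + *B* to the interval [0, 1]. The following minimal Python sketch shows that decision rule. It assumes *A* > 0 and *B* have already been computed from the model parameters (their explicit expressions are not reproduced here), and the function name is hypothetical.

```python
def optimal_unpooled_proportion(A, B):
    """Return the alpha in [0, 1] that minimizes the variance of the mean
    estimator, given that dVar/dalpha has the sign of Y = alpha * A + B.

    Assumption: A and B are supplied by the caller; A must be positive.
    """
    if A <= 0:
        raise ValueError("the proof assumes A > 0")
    if B >= 0:
        # Y >= 0 on [0, 1]: the variance is non-decreasing, so pool everything.
        return 0.0
    if A + B <= 0:
        # Y <= 0 on [0, 1]: the variance is non-increasing, so pool nothing.
        return 1.0
    # Y crosses zero inside (0, 1): the variance is minimized at the root.
    return -B / A


print(optimal_unpooled_proportion(2.0, 1.0))   # case 1: alpha = 0.0
print(optimal_unpooled_proportion(1.0, -3.0))  # case 2: alpha = 1.0
print(optimal_unpooled_proportion(4.0, -1.0))  # case 3: alpha = 0.25
```

The input values in the usage lines are arbitrary numbers chosen to hit each branch, not quantities from the study.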

In this case, the log likelihood function based on *S*'' has the form of

Setting the first derivatives equal to zero and solving yields the system of equations regarding the maximum likelihood estimators.

The maximum likelihood approach provides that

where

To avoid cumbersome notation, we omit the explicit forms of the second derivatives, which can be calculated directly.

1. Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440.

2. Faraggi D, Reiser B, Schisterman EF. ROC curve analysis for biomarkers based on pooled assessments. Statistics in Medicine. 2003;22:2515–2527.

3. Mumford SL, Schisterman EF, Vexler A, Liu A. Pooling biospecimens and limits of detection: effects on ROC curve analysis. Biostatistics. 2006;7:585–598.

4. Schisterman EF, Vexler A. To pool or not to pool, from whether to when: applications of pooling to biospecimens subject to a limit of detection. Paediatric and Perinatal Epidemiology. 2008;22:486–496.

5. Vexler A, Liu A, Schisterman EF. Efficient design and analysis of biospecimens with measurements subject to detection limit. Biometrical Journal. 2006;48:780–791.

6. Weinberg CR, Umbach DM. Using pooled exposure assessment to improve efficiency in case-control studies. Biometrics. 1999;55:718–726.

7. Zhang SD, Gant TW. Effect of pooling samples on the efficiency of comparative studies using microarrays. Bioinformatics. 2005;21:4378–4383.

8. Vexler A, Schisterman EF, Liu A. Estimation of ROC curves based on stably distributed biomarkers subject to measurement error and pooling mixtures. Statistics in Medicine. 2008;27:280–296.

9. Liu A, Schisterman EF. Comparison of Diagnostic Accuracy of Biomarkers With Pooled Assessments. Biometrical Journal. 2003;45:631–644.

10. Fuller W. Measurement Error Models. New York: John Wiley and Sons; 1987.

11. Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors. Biometrics. 1997;53:823–837.

12. Schisterman EF, Faraggi D, Reiser B, Trevisan M. Statistical inference for the area under the receiver operating characteristic curve in the presence of random measurement error. American Journal of Epidemiology. 2001;154:174–179.

13. Perkins NJ, Schisterman EF. The Youden Index and the optimal cut-point corrected for measurement error. Biometrical Journal. 2005;47:428–441.

14. Dunn G. Design and Analysis of Reliability Studies: The Statistical Evaluation of Measurement Errors. New York: Oxford University Press; 1989.

15. Freedman LS, Fainberg V, Kipnis V, Midthune D, Carroll RJ. A new method for dealing with measurement error in explanatory variables of regression models. Biometrics. 2004;60:172–181.

16. Thurigen D, Spiegelman D, Blettner M, Heuer C, Brenner H. Measurement error correction using validation data: a review of methods and their applicability in case-control studies. Statistical Methods in Medical Research. 2000;9:447–474.

17. Faraggi D. The effect of random measurement error on receiver operating characteristic (ROC) curves. Statistics in Medicine. 2000;19:61–70.

18. Liu K, Stamler J, Dyer A, McKeever J, McKeever P. Statistical methods to assess and minimize the role of intra-individual variability in obscuring the relationship between dietary lipids and serum cholesterol. Journal of Chronic Diseases. 1978;31:399–418.

19. Reiser B. Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Statistics in Medicine. 2000;19:2115–2129.

20. Dunson DB, Weinberg CR. Modeling human fertility in the presence of measurement error. Biometrics. 2000;56:288–292.

21. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Statistics in Medicine. 2002;21:3093–3106.

22. Shapiro DE. The interpretation of diagnostic tests. Statistical Methods in Medical Research. 1999;8:113–134.

23. Vonesh EF, Carter RL. Efficient inference for random-coefficient growth curve models with unbalanced data. Biometrics. 1987;43:617–628.

24. Wieand S, Gail MH, James BR, James KL. A family of non-parametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika. 1989;76:585–592.

25. Liu Q, Chi GY. On sample size and inference for two-stage adaptive designs. Biometrics. 2001;57:172–177.

26. Bellhouse DR, Thompson ME, Godambe VP. Two-stage sampling with exchangeable prior distributions. Biometrika. 1977;64:97–103.

27. Sitter RR, Wu CF. Two-stage design of quantal response studies. Biometrics. 1999;55:396–402.

28. Racine A, Grieve AP, Fluhler H, Smith AFM. Bayesian Methods in Practice: Experiences in the Pharmaceutical Industry. Applied Statistics. 1986;35:93–150.
