Let’s assume that a rigorously determined list of 1000 candidate plasma protein biomarkers for detecting prostate cancer has been produced, that the discovery process was well orchestrated using a study design that avoided bias and well-characterized technologies with low FDRs, and that information on the likelihood of clinical relevance was also included in the prioritization or discovery of these candidates (e.g., markers were generated that correlate with clinical outcome). We lack sufficient resources to build ELISAs for all 1000 candidates, yet we would like to perform verification studies for as many of them as possible to maximize our chances of finding the subset of the most clinically promising candidates for validation studies. We therefore need a novel experimental approach, and potentially a new assay method, capable of measuring 1000 candidates.
Let’s use known information about the PSA biomarker to define the boundaries of performance that will be required of our new approach to candidate verification. As discussed above, despite its widespread use, the performance of the PSA marker is marginal at best. Hence, we will use its performance characteristics to set the minimum standard that we will accept in our ongoing search for new markers for the detection of prostate cancer. In other words, we do not want to aim to discover markers that perform worse than PSA; we will only aim to discover markers that perform as well as or better than PSA.
Using the empirical distribution of PSA levels described from a previous population study [16], we will simulate different experimental scenarios for verification studies. We will consider two levels of candidate credentialing as part of the verification stage:
- Level one credentialing: demonstrating that the mean plasma levels of a given candidate are significantly different between a population of cases and a population of controls. Once the mean levels have been determined in the two populations, a statistical test (e.g., t-test) can be applied to assess the significance of any difference between cases and controls. (For the PSA example, the average level in the cancer group (7.63 ng/mL) is about 3.8-fold higher than the average level in the normal group (2.01 ng/mL); see Table 1 for statistical power.)
| Table 1. Power calculations for case-control studies using PSA as an example |
- Level two credentialing: pilot measurement of the performance characteristics (i.e., sensitivity and specificity) of the candidate marker in the desired clinical setting to estimate its likelihood of success in a subsequent larger clinical validation study.
As we will demonstrate, it is useful to divide verification into these two levels of credentialing because the technology requirements (sample throughput, assay precision, assay multiplexing) differ between them. For example, we will argue that although biomarker candidates must be measured in individual patient samples for level two credentialing, a pooling strategy is possible for level one credentialing. Pooling is potentially advantageous since pooling plasma samples from multiple individuals provides an opportunity to reduce sample numbers (and hence throughput requirements), reduce the sample volumes required from individual clinical samples, and reduce the cost of verification. Reduced throughput requirements are a major advantage early on in verification, since this allows us to accommodate workflows that are too cumbersome and imprecise for validation studies, but that may provide a fast and relatively cheap way to screen a large number of candidates (see below).
Table 1 shows the results of simulating different experimental scenarios for level one and level two credentialing. The statistical power for detecting PSA as a potential biomarker in plasma is calculated for various assay precisions (coefficient of variation, CV), numbers of samples (N), and numbers of replicate assays performed. Here statistical power is defined as the probability of detecting a biomarker with our assay given that the marker is differentially expressed between cases and controls; ideally, our experimental design should have as high a statistical power as possible (to avoid false negatives), and at minimum 90%. For level one credentialing, two study designs are considered: one using pooled plasma from multiple individuals and another using individual plasma samples (i.e., not pooled). Several important conclusions can be drawn regarding the performance requirements of our ultimate verification workflow.
| Table 1 panel titles: (B) Level 1 credentialing (verification) in a heterogeneous disease population in which the target biomarker is elevated in only S% of cases; (C) Level 2 credentialing (verification) in a homogeneous disease population |
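To make the definition of statistical power above concrete, the following is a minimal Monte Carlo sketch of the individual-sample design for level one credentialing. It is not the simulation used to generate Table 1. Only the two mean PSA levels come from the text. The log-normal shape of the person-to-person variation, its spread (`bio_cv`), the multiplicative assay noise model, the t-test on log-transformed levels, and all function names are assumptions made for illustration.

```python
import math
import random
import statistics

CASE_MEAN, CONTROL_MEAN = 7.63, 2.01  # mean PSA in ng/mL, taken from the text


def lognormal(rng, mean, cv):
    """Draw one value from a log-normal with the given arithmetic mean and CV."""
    sigma2 = math.log(1.0 + cv * cv)
    return rng.lognormvariate(math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2))


def t_two_sided_p(t, df, steps=400):
    """Two-sided Student-t p-value via Simpson integration of the density.

    |t| is capped at 50, beyond which the p-value is negligible here.
    """
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = min(abs(t), 50.0)
    h = upper / steps
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)


def power_individual(n, assay_cv, replicates, bio_cv=1.0, alpha=0.05,
                     trials=1000, seed=1):
    """Fraction of simulated studies in which an equal-variance t-test on log
    levels separates n cases from n controls (n >= 2) at level alpha."""
    rng = random.Random(seed)
    hits = 0
    for _trial in range(trials):
        groups = []
        for mean in (CASE_MEAN, CONTROL_MEAN):
            levels = []
            for _person in range(n):
                # Person-to-person spread: ASSUMED log-normal with CV = bio_cv.
                true_level = lognormal(rng, mean, bio_cv)
                # Each replicate adds ASSUMED multiplicative assay noise.
                reads = [lognormal(rng, true_level, assay_cv)
                         for _rep in range(replicates)]
                levels.append(math.log(statistics.fmean(reads)))
            groups.append(levels)
        cases, controls = groups
        pooled_var = (statistics.variance(cases)
                      + statistics.variance(controls)) / 2.0
        t = ((statistics.fmean(cases) - statistics.fmean(controls))
             / math.sqrt(2.0 * pooled_var / n))
        hits += t_two_sided_p(t, 2 * n - 2) < alpha
    return hits / trials


if __name__ == "__main__":
    for cv, reps in ((0.2, 1), (0.8, 1), (0.8, 3)):
        print(f"N=10, CV={cv}, replicates={reps}:",
              power_individual(10, cv, reps, trials=500))
```

The estimates depend strongly on the assumed `bio_cv`, so the printed values should not be expected to reproduce the figures in Table 1. The sketch is meant only to show how N, assay CV, and the number of replicates enter the calculation.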
For level one credentialing, if we want ≥90% power to detect PSA as a potential biomarker worthy of further study (i.e., mean plasma levels significantly differ between case and control populations), we find that:
- We can achieve our goal using either pooled samples or by analyzing individual samples.
- Pooling is advantageous whenever the sample size is limited since the impact of sample size is much less for pooled samples than for individual analyses.
- Pooling is also advantageous wherever the cost and/or throughput of each experiment are limiting. For example, if we fix the CV = 0.5 and N = 10, the pooled strategy would require 20 experimental runs (ten replicates each of one case pool and one control pool) to achieve 96.8% power to detect PSA as a potentially useful marker. In contrast, the individual analyses strategy would require 60 experimental runs (three replicates each of ten cases and ten controls) to achieve comparable power (94.2%).
- If adequate numbers of clinical samples are available, and if the cost and throughput of experiments are not a concern, then an individual sample design should be chosen for two reasons: (a) analyses of individual samples will ultimately be required for clinical validation of candidate markers, so that operating characteristics (e.g., sensitivity and specificity) can be determined; (b) a drawback of pooling is that it reduces bimodal populations to a single, more homogeneous population. If there is heterogeneity in the disease population (e.g., molecular subtypes of cancer such as hormone-responsive vs. hormone-resistant cancer) such that the target biomarker is elevated in only a subpopulation (S), the marker could be lost by pooling, depending on the size of the subpopulation. This is demonstrated in Table 1, where we repeat the simulation assuming two disease subtypes within the cases, with or without prior knowledge of these subtypes. If we have no prior knowledge of disease subtypes in our case population and sample numbers are small (<200 cases), a pooling strategy works best even if there are two subtypes of disease in the case population. For example, if a given biomarker is elevated in only 20% of the case population and we fix the assay CV = 0.2 and N = 50, a pooling strategy would give us a 75.6% chance of detecting the marker in the subpopulation (assuming five replicate measurements per pool = 10 measurements total), whereas an individual strategy would give us only a 37.7% chance of detecting the marker in the subpopulation (assuming one measurement per sample = 100 assays total). Although this is initially counterintuitive, the enlarged variation in the case population (due to the presence of two subtypes) makes the individual strategy less favorable until the sample size becomes larger (N > 200).
It is also noteworthy that at sample sizes <200, pooling provides not only greater statistical power but also lower throughput requirements, as in the example just discussed (10 vs. 100 assays required). If we do have prior knowledge about the two subtypes of cases in our study population (i.e., the clinical samples are annotated to allow us to identify the two subpopulations), then the individual strategy is clearly most advantageous.
- It is apparent from Table 1 that precision (CV) has a major impact on our statistical power. Hence, to maximize our capacity to test candidates, it is imperative that we optimize our verification assay workflows to ensure the highest precision possible and institute Standard Operating Procedures (SOPs) to ensure that we consistently achieve high precision. For example, in the individual sample analyses of a homogeneous disease population where N = 10 and CV = 0.2, 20 sample analyses (one replicate for each of ten cases + ten controls) must be performed to achieve power = 93%; in contrast, 60 sample analyses (three replicates for each of ten cases + ten controls) would be required to achieve comparable power (92.2%) using a platform with CV = 0.8. Thus, a platform with CV = 0.2 has 3× greater capacity to test candidates than a platform with CV = 0.8.
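The run counts quoted in the list above follow from simple bookkeeping: runs = groups × samples (or pools) per group × replicates. The short sketch below reproduces that arithmetic. The `Design` class and its field names are hypothetical helpers introduced here for illustration, and the power values themselves are not computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    """One verification design; all field names are illustrative."""
    name: str
    samples_per_group: int  # pools or individuals measured per group
    replicates: int         # repeat assays per pool or individual
    groups: int = 2         # cases and controls

    @property
    def runs(self) -> int:
        return self.groups * self.samples_per_group * self.replicates


# CV = 0.5, N = 10: pooled vs. individual designs with comparable power
pooled = Design("pooled", samples_per_group=1, replicates=10)
individual = Design("individual", samples_per_group=10, replicates=3)
print(pooled.runs, individual.runs)  # 20 60

# N = 10, individual samples: CV = 0.2 needs 1 replicate, CV = 0.8 needs 3
precise = Design("CV = 0.2", samples_per_group=10, replicates=1)
noisy = Design("CV = 0.8", samples_per_group=10, replicates=3)
print(noisy.runs / precise.runs)     # 3.0
```

For a fixed budget of assay runs, the ratio of run counts is the ratio of the number of candidates that can be tested, which is the sense in which the more precise platform has threefold greater capacity.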
Based on the above considerations, in order to achieve level one credentialing (for a candidate typified by PSA and a homogeneous disease population), we will need plasma samples from 20 cases and 20 well-matched controls. Additionally, we need to devise an assay technology with the following characteristics:
- capacity to test 1000 candidate biomarker proteins over a several-month period;
- sensitivity ≤ nanogram per milliliter in plasma;
- assay CV ≤ 0.5;
- throughput to run up to ten replicate measurements per candidate.
If the target biomarker is elevated in only a subset (S) of the case population of which we have no prior knowledge, our requirements are more stringent. In this scenario, we will need >100 samples (depending on the prevalence of the disease subtype in which the biomarker is present) and assay CV ≤ 0.2 to detect a marker elevated in at least 20% of the case population.
For level two credentialing, our goal is to identify the subset of candidate markers most likely to meet the minimally acceptable sensitivity and specificity in a given clinical setting. Hence, we must perform a pilot study to characterize the distribution of the marker in the population, allowing us to estimate its sensitivity and specificity. The success of this step relies on how accurately the sensitivity and specificity can be measured; therefore, assay precision (CV) again plays an important role. As we can see from Table 1, a large CV will result in underestimation of sensitivity and specificity. For example, for PSA the actual sample sensitivity is 73.9% and the sample specificity is 88% [16]. In our simulation, when CV = 0.5 the estimated sensitivity is about 65%, which is 8.9 percentage points lower than the true sensitivity of PSA based on the population study [16]. In addition to using a precise assay, a larger number (100s–1000s) of individual patient samples will be needed compared with level one credentialing. For example, for the estimation of sensitivity, around 1000 cases and 1000 controls would be needed to get a 90% confidence interval (CI) spanning less than 5% (i.e., CI = (x − 2.5%, x + 2.5%)), and more than 5000 cases and 5000 controls would be needed to get a CI spanning less than 2% (i.e., CI = (x − 1%, x + 1%)). These requirements can also be viewed from another angle. In order to have 90% power to identify PSA as a good candidate marker worthy of follow-up (i.e., 70% sample sensitivity and 85% sample specificity), we would need 500 cases and 500 controls with a CV = 0.15. By comparison, this would require a couple of thousand cases and controls with a CV = 0.25.
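The order of magnitude of the sample sizes quoted for a given CI width can be checked with the standard normal-approximation (Wald) interval for a proportion. This is a sketch under stated assumptions rather than the authors' calculation: it treats sensitivity as a simple binomial proportion estimated from the cases alone, ignores assay imprecision entirely, and the function name `cases_needed` is hypothetical.

```python
import math
from statistics import NormalDist


def cases_needed(sensitivity, ci_width, confidence=0.90):
    """Smallest number of cases for which the Wald interval
    p +/- z * sqrt(p * (1 - p) / n) has total width <= ci_width."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    half_width = ci_width / 2.0
    return math.ceil(z * z * sensitivity * (1.0 - sensitivity)
                     / (half_width * half_width))


if __name__ == "__main__":
    psa_sensitivity = 0.739  # sample sensitivity of PSA quoted in the text
    for width in (0.05, 0.02):
        n = cases_needed(psa_sensitivity, width)
        print(f"90% CI width {width:.0%}: {n} cases")
```

Under these simplifications the required number of cases is somewhat under 1000 for a 5% wide interval and a little over 5000 for a 2% wide interval, in line with the figures quoted above. Assay noise, which the approximation leaves out, would push the requirements higher.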
Based on the above considerations, in order to achieve level two credentialing for a marker similar to PSA, we will need plasma samples from a minimum of 500 cases and 500 well-matched controls. Additionally, we need to devise an assay technology with the following characteristics:
- capacity to test hundreds of candidate biomarker proteins over a period of a few months;
- sensitivity ≤ nanogram per milliliter in plasma;
- assay CV ≤ 0.2;
- throughput to run up to 1000 measurements per candidate biomarker.
Note that level two credentialing is still just a pilot study using limited-throughput assays to determine whether a candidate is trending toward utility and is therefore worth the investment of developing a high-throughput, clinical-grade assay. True clinical validation, however, will require an even larger-scale case-control or cohort study in order to carefully examine the impact of other covariates on the proposed marker test, to determine the positive predictive values and false referral probabilities in real practice, and to compare or combine the new test with existing clinical tests. Although candidates showing promise in pilot level two credentialing studies may still not pass the test of ultimate clinical validation, level two credentialing is important because it allows us to advance only the most promising candidates to clinical validation trials, thereby saving time, money, and clinical specimens and helping to maximize our return on investment.
It should be noted that the power calculations described in Table 1 are based on known distributions of PSA levels in the cancer and the normal populations. Hence, these results can be generalized to other biomarkers showing similar population distributions, but markers with vastly different distributions would require new calculations based on the specific behavior of that marker. Because the population variation of markers yet to be discovered is unknown, it is useful to look at a well-studied example such as PSA for general guidance in planning verification studies.