The PRoBE study design includes four key components. These components relate to clinical context and outcomes, criteria for measuring biomarker performance, the biomarker test itself, and the size of the study. Items pertaining to each of the components are listed in Boxes 1–4.
Box 1. Components of design relating to clinical context.
- Define the target population and clinical setting intended for use of the biomarker.
- Define subject inclusion and/or exclusion criteria and process for enrollment.
- Define the setting for specimen collection.
- Ensure adequate generality in the population studied.
- Define the outcome of interest.
- Specify procedures for ascertaining and measuring the outcome.
- Ensure prospective specimen collection before outcome ascertainment.
- Describe case patients (seek positive biomarker results in case patients) and subsets of case patients that are of interest.
- Describe control subjects (seek negative biomarker results in control subjects) and subsets of control subjects that are of interest.
- All subjects in the population must fit into a case or control category.
- Random selection of case patients and control subjects.
- Consider matching of control subjects to case patients on factors related to the biomarker only if scientific questions of interest can be addressed with matched data. Be aware of scientific limitations that result from matching.
Box 2. Components of design relating to performance criteria.*
True- and False-Positive Rates
- Define TPR as the proportion of case patients with positive results.
- Define FPR as the proportion of control subjects with positive results.
- Does time between obtaining the specimen and the occurrence of an outcome impact on criteria for defining case subjects and control subjects? If so, provide corresponding time-dependent TPR and FPR definitions.
- Are there subgroups of case patients and control subjects that are of interest? If so, will TPR and FPR be calculated separately for each subgroup? Which subgroups are of primary interest?
Minimally Acceptable Performance
- What are minimally acceptable values (or ranges) for the key TPR and FPR parameters in this clinical application?
- Does a classification method currently exist?
- What is the performance of that method?
- Will the biomarker alone be compared with the current classifier or will it be combined with the current method (ie, head-to-head comparison or evaluation of increment in performance)?
- What are target levels for comparative performance of methods?
*TPR = true-positive rate; FPR = false-positive rate.
Box 4. Components of design relating to study size.
- Recall minimally acceptable performance criteria (Box 2).
- Define anticipated performance levels.
- Provide rationale preferably with evidence from pilot data.
- Calculate case and control sample sizes (see Supplementary Material, available online).
- Plan for prospective collection from a cohort until sufficient numbers of case patients and control subjects are enrolled.
- Plan for early termination of the study if appropriate.
For what population and in what clinical setting is the biomarker intended? The context for clinical application should drive the study design (Box 1). Once the context has been defined, a random cohort of subjects from the target population is enrolled, pertinent clinical data are collected, and biologic specimens are collected and stored. A rigorous protocol for subject enrollment and specimen collection ensues. Generalizability of study results is an important consideration in pivotal evaluations, a concept that is well appreciated in therapeutic clinical trials. This consideration motivates the design of a protocol that is simple enough for general use, that includes several institutions in the study, and that has inclusion and exclusion criteria that provide an adequately heterogeneous population for the clinical application.
The outcome or condition that the biomarker is to classify must be defined and the procedures for evaluating it must be specified. These procedures may simply involve follow-up, or they may be invasive and/or costly. If procedures do not exist to measure the outcome of interest, study objectives may need to be modified to use an alternative clinically relevant but measurable outcome (8).
The purpose of the biomarker is to distinguish those with a bad outcome, whom we call case patients, from those with a good outcome, whom we call control subjects. Case patients are those in whom we expect the biomarker to be positive, and control subjects are those in whom we expect the biomarker to be negative. Sometimes a positive biomarker result is defined as the absence or low level of that biomarker. Subgroups of case patients and control subjects may be of interest. For example, histology and disease stage can define subgroups of cancer case patients. Benign disease and normal healthy organ tissue define two subtypes of control subjects. All subjects in the target population must fit into a precisely defined case or control category.
The design requires that random selections be made from the population for each category. In our experience, random selections of the relevant case patients and control subjects can be achieved only with a prospective cohort of subjects from the target population, with collection and storage of specimens before determination of outcome. After outcome data become available (ie, when case or control designations are determined), random sets of case patients and control subjects are selected retrospectively and their specimens are retrieved from storage.
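The retrospective selection step can be sketched in a few lines. This is a minimal illustration only: the function name, the dictionary layout (subject identifier mapped to case status), and the fixed seed are assumptions made for the example and are not part of the PRoBE description.

```python
import random


def select_case_control(outcomes, n_cases, n_controls, seed=0):
    """Draw random sets of case patients and control subjects from a cohort.

    outcomes: dict mapping subject id -> True (case) or False (control),
              known only after outcome ascertainment (hypothetical layout).
    Returns two lists of subject ids whose stored specimens would be retrieved.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    case_ids = sorted(sid for sid, is_case in outcomes.items() if is_case)
    control_ids = sorted(sid for sid, is_case in outcomes.items() if not is_case)
    # random.sample raises ValueError if the cohort has too few subjects
    return rng.sample(case_ids, n_cases), rng.sample(control_ids, n_controls)


# Hypothetical cohort of 100 subjects in which every tenth subject is a case.
cohort = {f"S{i:03d}": (i % 10 == 0) for i in range(100)}
cases, controls = select_case_control(cohort, n_cases=5, n_controls=20)
```

Sampling is done separately within the case and control categories, so every subject in a category has the same chance of selection regardless of disease severity or documentation.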
Classic confounding arises when case patients differ from control subjects on factors that are related to the biomarker and those factors are themselves predictive of disease. For example, case patients may be older than control subjects and, if the biomarker varies with age, some of the apparent difference in biomarker values between case patients and control subjects may simply be due to age discrepancies. Choosing control subjects to match case patients on such factors eliminates this sort of confounding. However, there are major disadvantages to matching that are not always appreciated (9). First, matched control subjects are no longer representative of the control population, making interpretation of false-positive rates problematic. Second, a simple analysis that compares matched control subjects with case patients can attenuate biomarker performance; that is, performance in the matched study as a whole appears worse than in subpopulations or strata in which matching factors are constant. Third, a matched study requires a covariate-adjusted analysis (10), which is conceptually an analysis that stratifies according to covariates. This is a disadvantage because a stratified analysis is more complicated to implement and interpret than an analysis that does not stratify by covariates. Interestingly, a matched design is most efficient for this stratified (ie, covariate-adjusted) analysis.
It should be noted that examining marker performance within covariate strata is not the same as including covariates in a statistical model for the outcome (9). The latter approach is concerned with performance of the combination of covariates and markers for classifying outcome. A serious problem with matching is that it actually precludes direct evaluation of the combination of markers and covariates. As a consequence, one cannot evaluate the increment in performance gained by combining a marker with the covariates compared with the covariates alone for classification. In summary, one must carefully consider whether a covariate-stratified measure of performance is of primary interest in the clinical application. If it is, then matching is recommended because of its statistical efficiency. However, in many settings, a covariate-stratified measure of performance will not be of primary interest, and the major complexities that matching introduces into the analysis argue for avoiding it.
What do we want the biomarker to achieve? Performance criteria must be set to provide a yardstick for measuring the success or failure of the biomarker. The PRoBE design revolves around determining whether these criteria are met (Box 2).
Performance measures and acceptable levels for these measures depend on clinical context. Consider first the diagnostic setting, in which the biomarker is to identify people with disease as being positive. The true-positive rate and the false-positive rate are the typical performance measures of interest. Also known as the sensitivity, the true-positive rate is the proportion of diseased people correctly detected as having disease by use of the marker. The false-positive rate (which equals 1 − specificity) is the proportion of nondiseased people incorrectly detected as having disease by use of the marker. Minimally acceptable values for both the true-positive rate and false-positive rate must be agreed upon at the design stage of the study. Current medical practice and effects of subsequent procedures or interventions resulting from positive and negative marker results impact on target values for true-positive rate and false-positive rate.
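The two definitions can be made concrete with a short calculation. The sketch below assumes a dichotomous biomarker result and uses made-up data for nine subjects; the function name and data layout are illustrative only.

```python
def tpr_fpr(is_case, is_positive):
    """Return (TPR, FPR) from parallel lists of booleans.

    TPR: proportion of case patients with a positive biomarker result.
    FPR: proportion of control subjects with a positive biomarker result.
    """
    case_results = [pos for case, pos in zip(is_case, is_positive) if case]
    control_results = [pos for case, pos in zip(is_case, is_positive) if not case]
    if not case_results or not control_results:
        raise ValueError("need at least one case patient and one control subject")
    tpr = sum(case_results) / len(case_results)
    fpr = sum(control_results) / len(control_results)
    return tpr, fpr


# Made-up data: 4 case patients followed by 5 control subjects.
is_case = [True, True, True, True, False, False, False, False, False]
is_positive = [True, True, True, False, True, False, False, False, False]
tpr, fpr = tpr_fpr(is_case, is_positive)  # 3 of 4 cases, 1 of 5 controls positive
```

Note that the two rates use different denominators (case patients for the TPR, control subjects for the FPR), so neither depends on the ratio of case patients to control subjects selected for the study.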
In a diagnostic study, the consequences of missing a case patient with invasive cancer may be fatal, which argues for a high true-positive rate. For example, a biomarker might be developed to guide women with suspicious lesions for breast cancer to undergo biopsy examination or not (for details of this study conducted by the Early Detection Research Network, see Supplementary Material, available online). Women with invasive cancer that is, under current protocols, detected with a biopsy examination should continue to be recommended for biopsy examination (ie, a very high true-positive rate is required). However, under this current practice, the false-positive rate is 100%, in the sense that all women with suspicious lesions who do not have invasive cancer are also subjected to biopsy examination. Study investigators consider that a reduction in the false-positive rate of even 25% would be beneficial because it would result in 25% fewer women undergoing unnecessary biopsy examination. In other words, a false-positive rate of 75% is considered minimally acceptable by study investigators.
For general population screening, in contrast, the false-positive rate must be very low to avoid huge numbers of people undergoing unnecessary costly medical procedures. It has been argued that for ovarian cancer screening, the false-positive rate should not exceed 2%. Because the goal of general population screening is to detect disease early, the proportion of case patients detected at some relevant time before the appearance of clinical disease is the appropriate true-positive rate performance measure; that is, the true-positive rate is a function of the time lag between marker measurement and subsequent diagnosis of disease that would occur in the absence of screening. For example, detecting even 20% of invasive ovarian cancers at least 1 year before clinical diagnosis would be enormously beneficial if such cancers could be successfully treated at that stage. Thus, the minimally acceptable false-positive and true-positive rates in ovarian cancer screening might be a false-positive rate of at most 2% and a true-positive rate (1 year before clinical diagnosis of invasive disease) of at least 20%.
Performance criteria may vary with subgroups of case patients and control subjects. For example, the false-positive rate for women with normal ovaries should be at most 2%, but a much larger false-positive rate may be acceptable in control women with benign ovarian disease. One might require a higher true-positive rate for a disease histology that is likely to be successfully treated than for a disease that is not.
Performance criteria for prognostic markers bear similarities to those for screening markers. For example, the time between biomarker measurement and outcome must be considered. A biomarker may be more sensitive to outcomes, such as disease recurrence, that occur soon after marker measurement than to those that occur later. Moreover, it may be more important to identify subjects with subsequent events who are likely to be successfully treated at the time the marker is measured. Patients with events occurring very soon after marker measurement may not benefit. The false-positive rate is the fraction of control subjects with false-positive results. How do we define control subjects for prognostic markers? A landmark time T after biomarker measurement may be chosen as 1 year or 5 years, with control subjects defined as those who are event free at time T. Subjects who die of other causes before time T or who have other catastrophic events are included in the main control group in this approach. An alternative approach is to consider those subjects as a second control group and to calculate two false-positive rates, one for subjects event free at time T (the main control group) and one for subjects who have other events before time T (the secondary control group).
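The landmark definitions above can be written as a small classification rule. This is a sketch under simplifying assumptions that are ours and not the text's: follow-up is complete through the landmark time for every subject (no censoring), and times are recorded in years since biomarker measurement. The function and argument names are hypothetical.

```python
def classify_at_landmark(event_time, other_event_time, landmark):
    """Assign a subject to a group under the two-control-group approach.

    event_time:       years to the outcome of interest, or None if not observed.
    other_event_time: years to death from another cause (or another
                      catastrophic event), or None if not observed.
    landmark:         landmark time T in years (eg, 1 or 5).
    Assumes complete follow-up through T (an assumption of this sketch).
    """
    has_event = event_time is not None and event_time <= landmark
    has_other = other_event_time is not None and other_event_time < landmark
    # The outcome of interest counts only if it precedes any competing event.
    if has_event and (other_event_time is None or event_time <= other_event_time):
        return "case"
    if has_other:
        return "secondary control"
    return "main control"


# Under the first approach in the text, both control labels are pooled
# into a single control group; under the alternative, two false-positive
# rates are calculated, one per control group.
groups = [
    classify_at_landmark(0.5, None, landmark=1),   # recurrence at 6 months
    classify_at_landmark(None, None, landmark=1),  # event free at T
    classify_at_landmark(3.0, None, landmark=1),   # event after T
    classify_at_landmark(None, 0.4, landmark=1),   # death from another cause
]
```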
Sometimes biomarkers and/or predictors already exist for the clinical application (eg, CA-125 for ovarian cancer screening). Comparisons with existing biomarkers must be part of the pivotal classification accuracy study, and minimally acceptable improvements in performance must be specified. When there are difficulties or costs associated with existing markers, the goal may be to replace them with new markers. For example, pathologist assessment of nuclear grade of a ductal carcinoma in situ lesion is a marker for subsequent development of invasive breast cancer (11). One would like to replace this marker with a biomarker that is cheaper, more reliable, and more transportable than expert pathology review. Head-to-head comparisons are appropriate in this setting. However, when existing biomarkers and/or predictors are easily obtained, the objective may be to assess the increment in performance that is achieved by adding the new marker to them. For example, the TRANSBIG prognostic breast cancer study (12) found a relatively small increment in performance by adding the 70-gene signature to readily available data on clinical factors. Finally, if a marker is already part of standard clinical practice, it may be impossible to evaluate the inherent accuracy of a new marker. For example, because prostate-specific antigen testing is routine in the United States, one can now evaluate only certain types of improvements that can be achieved by combining new markers with prostate-specific antigen for prostate cancer screening (13) but not the performance of new markers used alone without prostate-specific antigen.
The Biomarker Test
What is the biomarker? It is defined in part by the biologic specimen, procedures and timing for specimen collection, processing, and storage, all of which must be detailed in the study protocol (Box 3). For example, in the diagnostic breast cancer study, blood is drawn preoperatively and centrifuged at 4°C within 5 hours of collection, serum is removed by pipetting, and aliquots are stored at −80°C. The PRoBE design ensures that these procedures are blinded to the patient's outcome status and to any information related to the outcome that is not available at the time of specimen collection. Retrieval of the specimen (eg, a serum aliquot) and the assay procedure itself must also be defined and blinded to outcome-related information. Blinding is a key component of the PRoBE design. Appropriate labeling of stored specimens ensures blinding. Only after the study is completed is the blinding broken so that outcome data can be linked with biomarker data.
Box 3. Components of design relating to the biomarker.
- Specify procedures for specimen collection, processing, storage, and retrieval.
- Specify assay procedures and how results are reported.
- Are mechanisms in place to blind specimen handling, assay, and reporting of results to outcome status?
- Is the biomarker data to be combined with other information on the patient in the intended clinical application, including other clinical information, other markers, and previous measurements of the biomarker in the patient?
- The specific algorithm for calculating the combination must be defined (it cannot be developed during evaluation).
Other Biomarkers and Predictors
- If other biomarkers or predictors will be combined or compared with the study biomarker, describe in detail protocols and procedures for obtaining these data.
- Provide assurance that procurement of these items is blinded to patient outcomes.
Ideally, the assay used in the pivotal study should be the assay that is intended for general use. However, development of a commercially available assay is not always practical before the pivotal study. The initial study may therefore use a research assay, with the recognition that, if the study is positive and an alternative assay is developed for widespread use, the alternative assay must be evaluated further, preferably with the same specimens that were used in the pivotal study.
It is now widely appreciated that the assessment of biomarker performance must be separated from biomarker discovery. In discovery research, if a biomarker is selected from a set of candidates because of its apparent good performance, its performance in those samples is biased in an overoptimistic direction. This is a statistical phenomenon that reflects elements of random variation in the particular samples chosen, specimen handling, and assay procedures. If the analysis were repeated with different specimens, the results would vary. The biomarkers that perform best in one dataset might not have the best performance in another. To estimate performance without bias, an independent dataset is ideal. Therefore, in the pivotal evaluation, the topic of this commentary, the marker is defined in advance and no selection of markers is involved.
A biomarker test may be defined as a combination of several biomarkers and possibly other predictors. The specific algorithm to combine the biomarker values into a score should be defined in advance of the pivotal evaluation; that is, we regard development of a combination of several biomarkers as part of discovery and not part of the PRoBE design. Statistical techniques such as cross-validation or bootstrapping (14) can sometimes be used to simultaneously discover and evaluate a marker combination, but these techniques require that all steps involved in developing the combination score be completely defined in advance, which is a tall order in practice. We and others (1) consider that instead a separate independent dataset is necessary to evaluate classification accuracy. This dataset may be obtained by splitting a large dataset into two components, one for discovery (the training dataset) and one for pivotal evaluation (the test dataset).
A profile of biomarker values over time may be more indicative of a subject's outcome status than a single measurement. Accordingly, the biomarker test may be defined by an algorithm that combines the subject's historical and current biomarker values. For example, change from the average of two previous annual measurements could define the biomarker test result. The schedule for specimen collection must give rise to sufficient data for doing the calculation. In addition, because the manner in which the calculation is done amounts to defining how the current and historical biomarker data will be combined, the combination algorithm must not be derived during the pivotal evaluation but rather defined in advance of it.
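As an illustration of a prespecified longitudinal algorithm, the example rule mentioned above (change from the average of two previous annual measurements) can be sketched as follows. The positivity threshold of 5.0 and the measurement values are hypothetical; in a pivotal study both the rule and any threshold would be fixed before evaluation.

```python
def change_from_baseline(previous_two, current):
    """Current value minus the mean of the two previous annual measurements."""
    if len(previous_two) != 2:
        raise ValueError("exactly two previous annual measurements are required")
    baseline = sum(previous_two) / 2
    return current - baseline


def longitudinal_test_positive(previous_two, current, threshold):
    """Biomarker test result: positive if the rise from baseline exceeds threshold."""
    return change_from_baseline(previous_two, current) > threshold


# Hypothetical subject: two earlier annual values and the current value.
change = change_from_baseline([10.0, 14.0], 20.0)
result = longitudinal_test_positive([10.0, 14.0], 20.0, threshold=5.0)
```

The requirement of two earlier values makes the dependence on the specimen collection schedule explicit: a subject with fewer than two previous annual specimens has no defined test result under this rule.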
Conclusions must be drawn from the study. A positive conclusion is that the marker meets minimally acceptable performance criteria and a negative conclusion is that it does not (Box 4). In practice, we calculate a confidence interval (or region) for the performance measure (or measures), which is a set of plausible values for the measure given the data, and draw a positive conclusion if all values in the interval are at least minimally acceptable. For example, in the diagnostic breast cancer study, the biomarker performance measure is the false-positive rate that corresponds to a true-positive rate of 0.98, that is, the proportion of noncancers that are biomarker positive when we set the biomarker threshold to ensure that 98% of invasive cancers are positive. We would draw a positive conclusion if, for example, the 95% confidence interval for the false-positive rate was 0.50 to 0.60 because it indicates that, while maintaining biopsy recommendations for at least 98% of invasive breast cancers, only 50%–60% of noncancer lesions will continue to be recommended for biopsy examination. Proportions in this range are even better than the minimally acceptable value of 75% that was specified in the design.
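The decision rule can be sketched with a simple interval for a proportion. The example below uses the Wilson score interval and treats the biomarker threshold as fixed; that is a simplification we adopt for illustration, because an interval for the false-positive rate at an estimated threshold must also account for uncertainty in the threshold. The counts are made up, and the choice of interval method is ours rather than the study's.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score confidence interval for a proportion (default 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def positive_conclusion(false_positives, n_controls, max_acceptable_fpr):
    """Positive conclusion if every plausible FPR is at least minimally acceptable."""
    _, upper = wilson_interval(false_positives, n_controls)
    return upper <= max_acceptable_fpr


# Made-up data: 165 of 300 control subjects are biomarker positive.
lower, upper = wilson_interval(165, 300)
meets_criterion = positive_conclusion(165, 300, max_acceptable_fpr=0.75)
```

Because a lower false-positive rate is better, only the upper limit of the interval needs to be compared with the minimally acceptable value of 75%.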
At the design stage, when only pilot data or other information are available, we anticipate a desirable performance level for the biomarker and make sure that there is a high chance (or power) that a positive conclusion will be drawn from the study, if the true performance of the biomarker in the population is as good as is anticipated. These considerations give rise to procedures for sample size calculations. Note that a positive conclusion is expected only if the marker's performance is better than minimally acceptable (ie, at a desirable level). In statistical jargon, minimally acceptable performance constitutes the null hypothesis that we wish to rule out, whereas the anticipated desirable performance level constitutes the alternative hypothesis. We provide details of sample size calculations in the Supplementary Material.
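To show how null and alternative values translate into a sample size, the sketch below uses the standard normal-approximation formula for a one-sided test of a single proportion. It is a textbook approximation and not the formula given in the Supplementary Material, which also handles estimated thresholds. The one-sided significance level of 0.025, the power of 0.90, and the anticipated true-positive rate of 0.40 are assumptions chosen for the example; the null value of 0.20 echoes the ovarian cancer screening illustration.

```python
import math
from statistics import NormalDist


def sample_size_one_proportion(p_null, p_alt, alpha=0.025, power=0.90):
    """Subjects needed to rule out p_null when the true proportion is p_alt.

    Normal-approximation formula for a one-sided, one-sample test:
        n = ((z_alpha * sqrt(p0 * q0) + z_beta * sqrt(p1 * q1)) / (p1 - p0)) ** 2
    """
    if not (0 < p_null < 1 and 0 < p_alt < 1) or p_null == p_alt:
        raise ValueError("proportions must lie in (0, 1) and differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(p_null * (1 - p_null))
                 + z_beta * math.sqrt(p_alt * (1 - p_alt)))
    return math.ceil((numerator / abs(p_alt - p_null)) ** 2)


# Null (minimally acceptable) TPR of 0.20 versus an anticipated TPR of 0.40.
n_cases = sample_size_one_proportion(0.20, 0.40)
```

The same function applies to the false-positive rate and the number of control subjects, with the null value set at the minimally acceptable rate and the alternative at the anticipated lower rate.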
For an inherently dichotomous biomarker that is either positive or negative, one specifies null (FPR0, TPR0) and alternative (FPR1, TPR1) values for the pair of performance measures, namely the false-positive rate and the true-positive rate. For a continuous biomarker, if some established threshold exists to define a biomarker result as positive, then again null and alternative values for the false-positive and true-positive rates will be specified for the dichotomized marker. More often, however, it will make sense either to set the false-positive rate at a minimally acceptable value, FPR0, and to estimate the corresponding biomarker threshold and true-positive rate from the pivotal study, or to set the true-positive rate at a minimally acceptable value, TPR0, and to estimate the corresponding biomarker threshold and false-positive rate from the pivotal study, as in the breast cancer study (Supplementary Material, available online). For the former, null (TPR0) and alternative (TPR1) true-positive rates are specified and then sample sizes are calculated. In addition, one must ensure that the actual false-positive rate associated with the estimated threshold is close enough to the target false-positive rate value, FPR0, which places further constraints on the number of control subjects, as described in the Supplementary Material (available online). For the latter, sample sizes are based on specified null (FPR0) and alternative (FPR1) false-positive rates, and further requirements on the number of case subjects are made to ensure that the actual true-positive rate associated with the estimated threshold is close enough to the target value, TPR0.
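For a continuous biomarker, fixing the false-positive rate and estimating the threshold from the data can be sketched as follows. The sketch assumes that higher biomarker values indicate disease and that a result is positive when it strictly exceeds the threshold; the values are made up and the function names are illustrative.

```python
import math


def threshold_for_fpr(control_values, target_fpr):
    """Smallest observed threshold whose empirical FPR does not exceed target_fpr.

    A result is called positive when the value is strictly above the threshold.
    """
    if not control_values or not 0 <= target_fpr < 1:
        raise ValueError("need control values and 0 <= target_fpr < 1")
    ordered = sorted(control_values)
    n = len(ordered)
    allowed = math.floor(target_fpr * n)  # most controls allowed to be positive
    return ordered[n - allowed - 1]


def positive_rate(values, threshold):
    """Proportion of values strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)


# Made-up biomarker values.
controls = [1.0, 1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.8, 3.0, 3.5]
cases = [2.5, 2.9, 3.1, 3.6, 4.0]

threshold = threshold_for_fpr(controls, target_fpr=0.20)
fpr = positive_rate(controls, threshold)  # empirical FPR at the threshold
tpr = positive_rate(cases, threshold)     # corresponding TPR estimate
```

With only ten control subjects the estimated threshold is very imprecise, which is the reason the text places further constraints on the number of control subjects when the threshold is estimated in the pivotal study.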
Sample size formulas that are detailed in the Supplementary Material (available online) specify the numbers of case patients and control subjects required. However, specimen collection is performed for a cohort. The formulas will therefore be used in conjunction with projected prevalence or incidence rates in the cohort to calculate total numbers of subjects to be enrolled. Adjustments may be necessary if the actual rates are found to be different from those projected.
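The conversion from required case and control numbers to a total cohort size is simple arithmetic, sketched below. The required numbers and the projected case rate are hypothetical, and the calculation ignores dropout and unusable specimens, which a real plan would need to allow for.

```python
import math


def cohort_size(n_cases, n_controls, projected_case_rate):
    """Total enrollment expected to yield the required cases and controls.

    projected_case_rate: projected prevalence (or cumulative incidence)
    of the outcome in the cohort, strictly between 0 and 1.
    """
    if not 0 < projected_case_rate < 1:
        raise ValueError("projected_case_rate must be between 0 and 1")
    needed_for_cases = math.ceil(n_cases / projected_case_rate)
    needed_for_controls = math.ceil(n_controls / (1 - projected_case_rate))
    # Enrollment must satisfy whichever requirement is larger.
    return max(needed_for_cases, needed_for_controls)


# Hypothetical requirement: 50 case patients and 200 control subjects,
# with 25% of the cohort projected to be case patients.
total = cohort_size(50, 200, projected_case_rate=0.25)
```

Here the control requirement drives enrollment (200 / 0.75 rounds up to 267), whereas with a rare outcome the case requirement usually dominates.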
It is sometimes reasonable to terminate a study early if the analysis of partially accumulated data indicates that the biomarker has poor performance. Data monitoring is commonplace in therapeutic research, in which ethical concerns motivate early termination (16). In biomarker research, evaluation often begins after the cohort is assembled and specimens are stored; therefore, ethical concerns do not motivate early termination. However, preservation of specimens and resources is important. Therefore, if initial results show clearly that a biomarker has poor performance, the study should terminate. Otherwise, the study should continue so that estimates of performance are refined. Standards of practice for study design in therapeutic research (16) stipulate that early termination rules be specified in advance. The same practice should be followed in biomarker research. One simple rule is to terminate at a preplanned data monitoring point if the 95% confidence interval for biomarker performance is below minimally desirable levels. One must be cognizant that allowing studies to terminate early causes bias in estimates of biomarker performance from studies that do not actually terminate early. Statistical methods to adjust for this bias are available (17).
Alternative Designs and Strategies
The most common bias in biomarker research involves systematic differences in subject selection and/or specimen collection between case patients and control subjects. For example, specimens collected from case patients at a treatment center will differ from those collected from healthy control subjects at a blood donation center because of such factors as differences in specimen processing protocols, stress levels, and medication use. The prospective uniform nature of specimen collection for all subjects from a single cohort in the PRoBE design eliminates these systematic biases by ensuring that specimens for case patients and control subjects are collected in exactly the same way.
Another common problem is that the population or clinical setting that is studied is not the setting for which the biomarker is intended. Performance of a biomarker in one setting may not reflect performance in the setting of interest. The PRoBE design avoids this extrapolation bias (18) by requiring that the clinical application be defined and that the study cohort be a random sample from the target population. Inclusion of several institutions in the study increases confidence that the results generalize across institutions.
Simple retrospective case–control studies are notorious for spectrum bias (18). A classic example is when selected case patients tend to have more severe or well-documented disease and selected control subjects are especially healthy, leading to overoptimistic estimates of biomarker performance. The PRoBE design avoids spectrum bias by identifying all subjects in the cohort as either case patients or control subjects and drawing randomly from the subgroups. Another problem with retrospective studies is that knowledge of the subject's outcome status may affect the interpretation of an assay result or the care with which the specimen is handled. This bias is avoided in the PRoBE design by storing specimens before outcome ascertainment and by blinding specimens for retrieval and assay procedures.
In strictly prospective studies, the biomarker value is ascertained for all subjects in a cohort and outcome status is determined subsequently. These studies are also subject to problems. First, they cost more because all samples are assayed instead of only a case–control subset. Second, ethical problems arise when the biomarker value is known but there is uncertainty about how it should affect patient care. Overtreatment is one such ethical concern that has been realized in the context of prostate cancer screening with prostate-specific antigen testing. The retrospective component of the PRoBE design avoids this ethical dilemma. Third, if outcome ascertainment is expensive or invasive, subjects with certain biomarker values may be more likely than those with other values to have the outcome ascertained. Incomplete ascertainment of outcome introduces verification bias, which typically inflates both true- and false-positive rates. Fourth, knowledge of the biomarker value may influence aspects of outcome determination, also leading to bias. The PRoBE design avoids all of these biases by ascertaining the outcome in a uniform manner for all subjects in the cohort and by timing biomarker measurement to occur after outcome assessment.
A major concern in biomarker research is overfitting bias. This bias occurs when a biomarker combination is evaluated with the same dataset that was used to develop it. By requiring completion of all discovery work before the pivotal evaluation, including development of marker combinations, the PRoBE design avoids overfitting bias. When the pivotal evaluation study is constituted as the test set derived from a larger study that is split into training and test components, the sample size calculations for the PRoBE design pertain to the size of the test component only. The threshold for marker positivity may or may not be defined in advance of the PRoBE study. If the threshold is derived from clinical settings that differ from the target setting or is derived from small studies, it is unwise to use that threshold. In contrast to previous approaches (19), the PRoBE design accommodates estimation of the threshold in its sample size recommendations. Moreover, our approach to sample size calculation guarantees a certain power to rule out tests with unacceptable performance. This feature differs from approaches that are based simply on estimating performance with specified precision (19).