Though not directly relevant to modeling, it is important to point out that study design for evaluating biomarker classification performance has unique characteristics. First, the clinical context should be clearly stated; from it follow the definition of the study population, the definitions of cases and controls, the time and setting under which specimens and data are to be collected, the minimum performance criteria used to set up the study hypothesis, and the study size. The key is that the clinical context drives everything else. See Pepe et al. [5] for a detailed description of study design issues for a pivotal biomarker classification validation study. Failure to follow these principles, termed the PRoBE study design standards, could lead to biased study conclusions that are not replicable when the test is used in the real clinical context, with potentially devastating consequences if the test is used for important clinical decisions. The following discussion assumes an appropriate study design has been carried out and that we now have data on multiple biomarkers, or on a biomarker to be combined with existing predictors, to develop a classification model.

The most commonly used modeling method for combining multiple predictors of a binary disease outcome is logistic regression [6]. A logistic model has the form log(p/(1 − p)) = Xβ, where X is a vector of predictors x_0, x_1, x_2, x_3, …; e.g. x_0 is 1 and β_0 is related to disease prevalence, x_1 is the biomarker value, and x_2, x_3, … are other clinical predictors. exp(β_i) is the odds ratio (OR) associated with the i-th predictor. One can use Xβ as a combined "new test" to draw the ROC curve, use a specific threshold for classification, and calculate the sensitivity, specificity, PPV, and NPV associated with that threshold.
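The combined-score idea can be sketched in a few lines. This is a minimal illustration on simulated data, not the authors' analysis; the two marker variables and the threshold are hypothetical:

```python
# Sketch: combine two hypothetical markers with logistic regression and
# use the linear score X*beta as a single "new test". Simulated data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 1000
y = rng.binomial(1, 0.3, n)                      # disease indicator
x1 = rng.normal(1.0 * y, 1.0)                    # marker 1: higher in cases
x2 = rng.normal(0.5 * y, 1.0)                    # marker 2: weaker signal
X = np.column_stack([x1, x2])

model = LogisticRegression().fit(X, y)
score = X @ model.coef_.ravel() + model.intercept_  # X*beta, the combined test

auc = roc_auc_score(y, score)                    # area under the ROC curve

# Classify at one chosen threshold and read off sensitivity/specificity.
threshold = 0.0
pred = score > threshold
sensitivity = pred[y == 1].mean()
specificity = (~pred)[y == 0].mean()
print(f"AUC={auc:.2f} sens={sensitivity:.2f} spec={specificity:.2f}")
```

Varying the threshold traces out the whole ROC curve for the combined score.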

Though logistic regression was developed for epidemiologic association studies, it has a nice property for classification. If the underlying logistic model is correct, or at least Xβ is a monotone function of the true model, then the classifier defined above is optimal in the sense that it has the maximum sensitivity at any given specificity and vice versa, i.e. its ROC(t) curve cannot be improved at any point t [7]. The reason is that, by the famous Neyman–Pearson lemma [8], the likelihood ratio–based decision rule, p(y|D = 1)/p(y|D = 0) > c, is the optimal decision rule. Any monotone increasing function of the likelihood ratio, such as the risk score p(D = 1|y) or the logit transformation of the risk score, log(p(D = 1|y)/(1 − p(D = 1|y))), is also optimal [9, 10]. The last of these is Xβ under the logistic regression model.
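The equivalence of these scores is easy to check numerically: any monotone increasing transform preserves the ranking of subjects and hence the ROC curve. A small sketch on simulated data (one normal marker, prior 0.5, all values hypothetical):

```python
# Sketch: the likelihood ratio, the risk score p(D=1|y), and its logit
# are monotone transforms of one another, so they give identical ROC
# performance. Simulated normal marker, N(1,1) in cases vs N(0,1) in controls.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.5, 2000)
marker = rng.normal(loc=y, scale=1.0)

# For equal-variance normals the likelihood ratio has a closed form.
lr = np.exp(marker - 0.5)                        # p(y|D=1)/p(y|D=0)
prior = 0.5
risk = lr * prior / (lr * prior + (1 - prior))   # p(D=1 | marker)
logit = np.log(risk / (1 - risk))                # logit of the risk score

aucs = [roc_auc_score(y, s) for s in (lr, risk, logit)]
print(aucs)                                      # all three agree
```

The three AUCs coincide exactly because AUC depends only on the ordering of the scores.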

However, a proposed model is almost never exactly true. Consequently, the logistic regression model is in general not optimal for the ROC curve, i.e. not optimal for classification decisions.

Researchers have used the likelihood ratio principle to construct a multi-predictor classifier even when the true model is unknown. The idea is that if we can maximize the ROC curve or the likelihood ratio directly and non-parametrically, the resulting classifier must be optimal. Since maximizing the whole ROC curve is computationally difficult, and since the optimal ROC curve must have the maximal AUC, Pepe et al. [11] used AUC as an objective function to search for biomarker combinations that maximize AUC. In their simulations they found that when the logistic regression model is incorrect, maximizing AUC directly could lead to a much better classifier than the logistic regression model alone, while when the logistic model is correct, the two methods gave similar performance.
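For two markers, direct AUC maximization over linear combinations can be sketched as a one-dimensional search over the mixing angle of the combination. This is an illustrative simplification on simulated data, not the actual algorithm of Pepe et al.:

```python
# Sketch: search over linear combinations score = cos(t)*x1 + sin(t)*x2
# and keep the one with the largest empirical AUC. Simulated data only;
# the published algorithm may differ in detail.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 500
y = rng.binomial(1, 0.5, n)
x1 = rng.normal(0.8 * y, 1.0)
x2 = rng.normal(0.6 * y, 1.0)

best_auc, best_t = 0.0, 0.0
for t in np.linspace(0, np.pi / 2, 91):          # 1-degree grid over angles
    score = np.cos(t) * x1 + np.sin(t) * x2
    auc = roc_auc_score(y, score)
    if auc > best_auc:
        best_auc, best_t = auc, t

print(f"best angle = {np.degrees(best_t):.0f} deg, AUC = {best_auc:.3f}")
```

Because the grid includes the endpoints (each marker alone), the combined AUC can never fall below the better single-marker AUC on the training data.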

If the sample size is sufficiently large, maximizing AUC, when computationally feasible, can indeed lead to the optimal classification model. For a finite sample size, however, it could lead to a classifier that optimizes AUC for the given data but not for the population, i.e. an overfit. Since we do not know in which region of the ROC curve the incorrect modeling has occurred, another approach is to focus on maximizing the relevant region of the ROC curve. Baker [9] used the likelihood ratio to maximize the ROC region with specificity between 98 and 100% in a prostate cancer population screening context.

Baker divided each marker into a number of intervals and formed a grid for two markers (for d markers it would be a d-dimensional grid). Starting from the extreme, i.e. the cell corresponding to the highest level of each marker, which represents the highest specificity, he added new cells one by one, using each cell's likelihood ratio, i.e. the ratio of the proportion of cases to the proportion of controls in the cell, as the selection criterion. The aim is to build a region by combining cells so as to maximize sensitivity while keeping specificity at a high value between 98 and 100%.
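The greedy cell-selection idea can be sketched as follows. This is a simplified illustration on simulated data: it bins two markers into a 5 × 5 grid, scores each cell by its likelihood ratio, and adds cells in decreasing-LR order within a false-positive budget. For simplicity it allows disjoint regions, i.e. it omits the connectivity restrictions discussed below, and Baker's actual algorithm may differ:

```python
# Sketch of a greedy likelihood-ratio grid search for two markers.
# Simulated data only; cell counts, bins, and budget are illustrative.
import numpy as np

rng = np.random.default_rng(3)
cases = rng.normal(1.0, 1.0, size=(300, 2))      # two markers, shifted in cases
controls = rng.normal(0.0, 1.0, size=(300, 2))

# Five intervals per marker, cut at pooled-sample quantiles.
bins = np.quantile(np.vstack([cases, controls]), [0.2, 0.4, 0.6, 0.8], axis=0)

def cell_counts(data):
    counts = np.zeros((5, 5))
    i = np.digitize(data[:, 0], bins[:, 0])      # 0..4 row index
    j = np.digitize(data[:, 1], bins[:, 1])      # 0..4 column index
    np.add.at(counts, (i, j), 1)
    return counts / len(data)                    # proportion of sample per cell

p_case, p_ctrl = cell_counts(cases), cell_counts(controls)
lr = p_case / np.maximum(p_ctrl, 1e-9)           # per-cell likelihood ratio

# Greedily add the highest-LR cells while 1 - specificity stays <= 2%.
order = np.dstack(np.unravel_index(np.argsort(-lr, axis=None), lr.shape))[0]
fpr_budget, sens, fpr = 0.02, 0.0, 0.0
for i, j in order:
    if fpr + p_ctrl[i, j] <= fpr_budget:
        sens += p_case[i, j]
        fpr += p_ctrl[i, j]

print(f"sensitivity={sens:.2f} at specificity={1 - fpr:.2f}")
```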

Such a search can lead to a disjoint region, which is both counter-intuitive biologically (a high value is indicative of disease, a higher value indicative of control, and a still higher value indicative of disease again) and more prone to overfitting the data. Some restrictions should therefore be applied. Baker used *jagged ordered* and *rectangular ordered* regions, but these could be better termed the *or* rule and the *and* rule, respectively.

The *or* rule (jagged ordered rule) predicts disease, in a two-marker situation, if marker 1 is above *a or* marker 2 is above *b*. The *and* rule (rectangular ordered rule) predicts disease if marker 1 is above *a and* marker 2 is above *b*. The search path that maximizes the likelihood ratio is restricted to the given rule, so the decision region is always connected.

We believe that the *or* rule is most often used because of its natural biological appeal. Most cancers are known to be heterogeneous, and there are a number of known or unknown histological or molecular sub-classes for each cancer. If there is a biomarker for each sub-class, then the *or* rule will combine these markers to increase sensitivity without much decrease in specificity.

The *and* rule is more suitable for combining markers that have very high sensitivity but poorer specificity. For example, a tumor and its surrounding tissues always have some inflammation, but other benign diseases could also exhibit inflammation. CA19-9 is elevated in both pancreatic cancer and chronic pancreatitis. PSA is elevated in both prostate cancer and benign prostate disease.

Baker did not discuss the possibility of combining the *or* and *and* rules. It is conceivable that this combination could improve performance. Let's take a hypothetical example. Suppose there are five biomarkers, each elevated in 20% of non-overlapping cases and 10% of non-overlapping controls. Using the *or* rule to combine these five markers leads to a test with 100% sensitivity and 50% specificity. Let's call this combined test marker A.

Another biomarker, B, has 100% sensitivity and 50% specificity. If the distributions of markers A and B are independent in controls, then using the *and* rule to combine them, i.e. predicting disease if marker A is positive *and* marker B is positive, yields a final test with 100% sensitivity and 75% specificity.
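The arithmetic of this hypothetical example can be checked by direct simulation (all markers and proportions are the hypothetical ones stated above):

```python
# Simulate the hypothetical example: five "or"-combined markers form
# marker A (100% sens, 50% spec), then "and" it with independent marker B.
import numpy as np

rng = np.random.default_rng(4)
n_cases, n_ctrl = 100_000, 100_000

# Marker A: every case fires exactly one of the five markers (5 x 20%),
# while 5 x 10% = 50% of controls fire one of them.
a_cases = np.ones(n_cases, dtype=bool)
a_ctrl = rng.integers(0, 10, n_ctrl) < 5         # disjoint 10% slices, 0-4 fire

# Marker B: 100% sensitivity, 50% specificity, independent in controls.
b_cases = np.ones(n_cases, dtype=bool)
b_ctrl = rng.random(n_ctrl) < 0.5

final_cases = a_cases & b_cases                  # "and" rule
final_ctrl = a_ctrl & b_ctrl

sens = final_cases.mean()                        # 1.0 by construction
spec = 1 - final_ctrl.mean()                     # about 0.75
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

The false positives of the final test are the intersection of two independent 50% sets of controls, hence 25%, giving 75% specificity.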

Algorithms

These methods are conceptually simple but difficult to implement when the number of candidate markers, d, is larger than two. One needs to do a grid search in d-dimensional space. There is also the question of how to form the grid, i.e. how many intervals to form for each marker. Too fine a grid leaves insufficient numbers in each cell, making the likelihood ratio statistic for each cell noisy, while too coarse a division might miss an optimal cutoff point. For an example with four markers, 228 controls, and 137 cases, Baker divided each marker into five intervals (quintiles), leading to 5^4 = 625 cells in 4-dimensional space. More research is needed to develop efficient and robust algorithms for larger d, say 4 to 10.

Avoiding overfitting

Even after we restrict the model space, overfitting the data is still a serious threat when the number of candidate markers is large and the sample size is small. The required sample size increases exponentially with the number of candidates, so it is important to keep the number of candidates to a minimum. To do so, one should first understand the performance characteristics and the biology of each biomarker, so that there is a strong rationale for its inclusion. One can use cross-validation to estimate the performance of a combination rule. To do it right, the steps of the cross-validation should be specified in advance; going back and forth and running many cross-validations to select the one combination rule with the best cross-validation performance defeats the purpose of cross-validation.
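The key point, that the entire rule-selection procedure must be repeated inside each fold rather than performed once on the full data, can be sketched as follows. This minimal example on simulated data reduces "rule selection" to choosing the better of two markers:

```python
# Sketch: honest cross-validation of a combination rule. The selection
# step (which marker to use) happens INSIDE each fold, on training data
# only, and is then evaluated once on the held-out data. Simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(5)
n = 600
y = rng.binomial(1, 0.5, n)
X = np.column_stack([rng.normal(0.7 * y, 1.0),   # informative marker
                     rng.normal(0.0, 1.0, n)])   # pure-noise marker

fold_aucs = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Rule selection on training data only...
    train_aucs = [roc_auc_score(y[train], X[train, j]) for j in range(2)]
    best = int(np.argmax(train_aucs))
    # ...then one evaluation on the held-out fold.
    fold_aucs.append(roc_auc_score(y[test], X[test, best]))

print(f"cross-validated AUC = {np.mean(fold_aucs):.2f}")
```

Selecting the marker on the full data first and then cross-validating only the evaluation step would leak information and bias the estimate upward.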

Classification models for prognostic tests

The Cox regression model [12] is the most popular method for combining predictors to model event time outcomes. It has a drawback similar to that of the logistic regression model for binary outcomes: if the underlying model is incorrect, it is entirely unknown whether the resulting classifier has any optimality for classification purposes. It also shares the issue that the resulting classifier does not focus on the relevant performance region of the ROC curve for a specific clinical application. The ROC curve for a prognostic test has a time dimension, e.g. the ROC curve for predicting 10-year prostate cancer mortality among prostate cancer patients. The proportional hazards assumption of Cox regression is unlikely to hold for prognostic markers, as one can imagine a biomarker performing better when measured near the event than when measured far from the event. For this reason, Zheng et al. [13] used a weighted logistic regression model to model event time and found that when the proportional hazards assumption is violated, it performed better than the Cox model for classification purposes. When the proportional hazards assumption holds, both methods have similar classification performance in terms of ROC curves.
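The time dimension of a prognostic ROC curve can be illustrated with a simplified sketch: assuming complete follow-up with no censoring (a strong simplifying assumption; real prognostic data require censoring-aware estimators), the time-dependent ROC at horizon t reduces to an ordinary ROC for the binary outcome "event occurred by time t". All data below are simulated:

```python
# Sketch: time-dependent discrimination of a prognostic marker, assuming
# no censoring so that "event by horizon t" is a fully observed binary
# outcome. Simulated marker and event times only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n = 2000
marker = rng.normal(0, 1, n)
# Higher marker -> shorter time to event (log-linear exponential rates).
event_time = rng.exponential(scale=15 * np.exp(-0.8 * marker))

aucs = {}
for horizon in (5, 10):
    y_t = event_time <= horizon                  # cases by this horizon
    aucs[horizon] = roc_auc_score(y_t, marker)
    print(f"{horizon}-year AUC = {aucs[horizon]:.2f}")
```

The AUC generally differs across horizons, which is exactly why a prognostic ROC curve must be indexed by time.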

Prognostic marker study design has a unique issue of selecting cases and controls over time. Some investigators tend to define controls as those who never had an event during follow-up and cases as those who had an event (died or had disease recur) before a certain time t. This sampling scheme violates the sampling principles for event time studies, in which nested case-control or case-cohort sampling are the preferred methods [14, 15]. The optimal sampling method for prognostic test evaluation, in terms of efficiency and bias in estimating classification performance, e.g. the 10-year ROC(0.02), has not been fully developed.