Stat Med. Author manuscript; available in PMC 2017 September 20.
Published in final edited form as: Stat Med. 2016 September 20; 35(21): 3792–3809. Published online 2016 April 5. doi: 10.1002/sim.6956
PMCID: PMC4965290

Combining Biomarkers Linearly and Nonlinearly for Classification Using the Area Under the ROC Curve


In biomedical studies, it is often of interest to classify or predict a subject's disease status based on a variety of biomarker measurements. A commonly used classification criterion is based on the AUC, the area under the receiver operating characteristic (ROC) curve. Many methods have been proposed to optimize approximated empirical AUC criteria, but there are two limitations to the existing methods. First, most methods are only designed to find the best linear combination of biomarkers, which may not perform well when there is strong nonlinearity in the data. Second, many existing linear combination methods use gradient-based algorithms to find the best marker combination, which often result in sub-optimal local solutions. In this paper, we address these two problems by proposing a new kernel-based AUC optimization method called Ramp AUC (RAUC). This method approximates the empirical AUC loss function with a ramp function and finds the best combination by a difference of convex functions algorithm. We show that as a linear combination method, RAUC leads to a consistent and asymptotically normal estimator of the linear marker combination when the data are generated from a semiparametric generalized linear model, just as the Smoothed AUC method (SAUC) does. Through simulation studies and real data examples, we demonstrate that RAUC outperforms SAUC in finding the best linear marker combinations, and can successfully capture nonlinear patterns in the data to achieve better classification performance. We illustrate our method with a dataset from a recent HIV vaccine trial.

Keywords: AUC, biomarker combination, classification, kernel, ramp loss, ROC curve

1. Introduction

In many areas of biomedical research, biomarkers can play important roles in classifying a subject or a sample into two or more categories or predicting the subject/sample’s probability of being in each category. For example, a key research component in developing a vaccine for an infectious disease is the so-called ‘immune correlates study,’ which seeks to identify immune biomarkers that are associated with the risk of infection [1]. Finding immune correlates helps us understand the biological mechanism of vaccine protection and reduces the cost of future vaccine efficacy trials. In practice, a single biomarker often has limited classification/prediction performance, making it desirable to combine multiple biomarkers for better performance.

In this paper we assume there are two outcome categories: the case/disease group and the control/non-diseased group, denoted by D and D̄, respectively. We consider how to combine multiple markers to achieve the best classification performance. Many classification methods have been developed. We will focus on a class of methods which assign a scalar value, called a score, to each subject. Subjects with scores higher than a threshold are then classified as cases, while subjects with lower scores are classified as controls. This is a flexible class of methods since the score can be expressed as a linear or nonlinear combination of the input variables. It does, however, exclude methods such as the artificial neural network and the classic decision tree.

Several performance criteria have been proposed to evaluate classification/prediction procedures. Two basic criteria are sensitivity and specificity. Sensitivity is the probability of correctly identifying a case subject; specificity is the probability of correctly classifying a control subject. Classification error combines the two by weighting 1-sensitivity and 1-specificity according to the probabilities of cases and controls. Both sensitivity and specificity depend on the choice of threshold. When we vary the threshold from −∞ to ∞, we get pairs of sensitivity and specificity. A plot of sensitivity versus 1-specificity is called the Receiver Operating Characteristic (ROC) curve. See [2] and [3] for a review of the ROC methodology. The area under the ROC curve (AUC) is a commonly used criterion for a model’s overall classification performance. It is threshold-independent and equals the probability that a randomly chosen case score is greater than a randomly chosen control score.
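The pairwise interpretation of AUC lends itself to direct computation. A minimal sketch (function name and inputs are ours; ties are counted as half, a common convention):

```python
def empirical_auc(case_scores, control_scores):
    """Empirical AUC: the fraction of case-control pairs in which the
    case score exceeds the control score (ties count as half)."""
    wins = 0.0
    for ci in case_scores:
        for cj in control_scores:
            if ci > cj:
                wins += 1.0
            elif ci == cj:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))
```

This is the double sum over all case-control pairs that the AUC approach described below works with directly.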

Two broad groups of scoring methods have been proposed for AUC-based classification. The first group of methods optimizes a criterion function that can be written as the sum of n terms, where n is the sample size; the criterion function used in the second group of methods is the sum of O(n^2) terms. We refer to the first group as the single-indexed approach and the second group as the double-indexed approach or the AUC approach. Under this classification, both logistic regression, which optimizes a likelihood function that depends on the marker combination, and the Support Vector Machine (SVM) [4], which optimizes a hinge loss function, are single-indexed methods. The AUC approach, on the other hand, seeks to maximize the empirical AUC, or equivalently, to minimize the empirical AUC loss defined as:

1/(n_D n_D̄) Σ_{p=1}^{n_D n_D̄} {1 − I(η_p^Δ > 0)},   (1)


where n_D and n_D̄ are the sample sizes of the case and control groups, i and j index the case and control groups respectively, x_i and x_j are vectors of biomarkers, β is a vector of coefficients, p indexes the pair formed by the case i_p and the control j_p, and η_p^Δ = (x_{i_p} − x_{j_p})^T β.

When the training data sample size is large, the empirical AUC approaches the population AUC, and the AUC approach is expected to perform better than the single-indexed approach in general when AUC is chosen as the classification performance criterion [5]. When the training data sample size is small to moderate, the scenario we are most concerned with, the single-indexed approach may be more efficient, because the data are not reduced to ranks as in the AUC approach. However, the AUC approach is more robust to outliers than the single-indexed approach. Methods like logistic regression and SVM are sensitive to outliers because the contribution of each individual observation to the criterion function is unbounded. Even though many proposals have been made to bend the influence of outlying observations to obtain robust methods, e.g. an M-estimator type robust logistic regression method by Bianco and Yohai (1996) [6] and robust statistical learning methods [7, 8, 9], these methods all involve nuisance parameters in their criterion functions, which determine the degree of bending. On the other hand, the robustness of the AUC approach is a result of directly optimizing the criterion function used to evaluate the classification procedure, and there is no need to tune a nuisance parameter related to robustness.

The AUC approach is more challenging than the single-indexed approach in terms of optimization. For example, the logistic likelihood function is smooth and convex, but the empirical AUC function is neither. There have been two main groups of AUC-based methods. The majority of the AUC-based methods are the Smoothed AUC (SAUC) methods. These methods use a smooth sigmoid function S_s(x) to approximate the step function 1 − I(x > 0) in the AUC loss (Figure 1a). For example, an approximation based on the normal cumulative distribution function (CDF) Φ, S_s(x) = Φ(−x/s), was used by [10, 11, 12] among others. Another commonly used approximation is the logistic function S_s(x) = 1/{1 + exp(x/s)}, see e.g. [13, 14, 15]. In both the normal CDF and logistic approximations, a scale parameter s controls the closeness of the approximation. Smaller s leads to better approximation but harder optimization. Other smooth approximations have also been proposed, e.g. polynomial approximation [16]. While the SAUC criterion functions are differentiable and optimization is relatively easy, the gradient-based methods used in the optimization have difficulty escaping local optima and often result in sub-optimal solutions.
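For concreteness, the two sigmoid approximations can be written as follows (a sketch with function names of our choosing; Φ(−x/s) is computed via the complementary error function):

```python
import math

def sauc_normal(x, s=1.0):
    # normal-CDF approximation Phi(-x/s) to the step function 1 - I(x > 0)
    return 0.5 * math.erfc((x / s) / math.sqrt(2.0))

def sauc_logistic(x, s=1.0):
    # logistic approximation 1 / {1 + exp(x/s)}
    return 1.0 / (1.0 + math.exp(x / s))
```

Both functions equal 1/2 at x = 0 and approach the step function 1 − I(x > 0) as s shrinks, illustrating the trade-off between approximation quality and optimization difficulty noted above.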

Figure 1
(a) Different approximations of the empirical AUC loss function. (b) RAUC can be represented as a difference of two convex functions.

A second group of methods in the AUC approach is termed the Support Vector Machine-AUC (SVM-AUC) method [17, 18, 19, 20]. These methods replace the step function 1 − I(x > 0) in the AUC loss function by a hinge loss function (1 − x)_+ (Figure 1b), thus seeking to minimize a convex upper bound of the AUC loss function. The resulting problem is a convex optimization problem and can be solved by support vector machine software. Because the contribution of the difference in a pair of case/control scores can grow without bound, SVM-AUC does not approximate the AUC loss function well. This problem is particularly serious when the training data is contaminated with outliers.

Most of the AUC-based methods developed so far focus on identifying linear combinations of markers, with the exception of the boosting approach used in [21]. While providing a simple summary of the relationship between markers, a linear combination is limited in classification performance. One way to improve classifier performance is to map the markers to a feature space through basis expansion, such as polynomials or splines. Choosing the right expansion is, however, nontrivial and requires a substantial amount of experience in data modeling as well as domain knowledge.

In this paper, we propose a new AUC-maximization method for identifying marker combinations. We start by proposing a new loss function in Section 2, which provides a better approximation to the AUC criterion function than the SVM-AUC methods. We study the asymptotic theory for the estimated linear combination using the proposed method. In Section 3, we develop a difference-of-convex functions algorithm to find the best linear combination. The algorithm avoids some of the pitfalls associated with gradient-based methods used in SAUC methods. We further show that the algorithm lends itself naturally to the application of the so-called ‘kernel trick’ ([22], Chapters 5 and 12), which allows it to find the best linear combination in a higher (often infinite) dimensional feature space without needing to explicitly specify the mapping from the input markers to the features. In Section 4, we present simulation studies to study the comparative performance of the proposed method against common existing methods. In Section 5, we illustrate the application of the proposed method with an immune biomarker dataset from a recent HIV vaccine trial conducted in Thailand. We make concluding remarks in Section 6.

2. The Ramp AUC Estimator

To provide an approach to finding AUC-based marker combinations that can be optimized effectively without gradient-based methods and that can handle nonlinear combinations, we propose to approximate the step function 1 − I(x > 0) in the empirical AUC loss (1) by a ramp function:


h_s(x) = 1 if x ≤ −s/2;  1/2 − x/s if −s/2 < x < s/2;  0 if x ≥ s/2,

where s > 0 is a scale parameter (Figure 1). We call Σ_{p=1}^{n_D n_D̄} h_s(η_p^Δ)/(n_D n_D̄) the Ramp AUC (RAUC) loss function. For any given η_p^Δ ≠ 0, h_s(η_p^Δ) moves towards 0 or 1 and becomes exactly 0 or 1 at some point as s → 0. It is useful to keep track of the proportion of the h_s(η_p^Δ) that are exactly 0 or 1, termed the saturation ratio, since it reflects the degree of approximation.
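A minimal sketch of the ramp loss and the saturation ratio (function names are ours; the piecewise-linear form assumes the ramp decreases from 1 at x = −s/2 to 0 at x = s/2):

```python
def ramp(x, s=1.0):
    # ramp approximation to 1 - I(x > 0): equals 1 for x <= -s/2,
    # 0 for x >= s/2, and decreases linearly in between
    return min(1.0, max(0.0, 0.5 - x / s))

def saturation_ratio(eta_diffs, s=1.0):
    # fraction of pairwise score differences whose ramp loss is exactly 0 or 1
    vals = [ramp(e, s) for e in eta_diffs]
    return sum(v in (0.0, 1.0) for v in vals) / len(vals)
```

As s shrinks, more of the η_p^Δ fall outside (−s/2, s/2), so the saturation ratio increases and the ramp loss approaches the 0/1 AUC loss.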

There are identifiability issues with all methods in the AUC approach. The problem with minimizing the empirical AUC loss function arises from the fact that 1 − I(x > 0) = 1 − I(cx > 0), where c is a positive scalar constant. Minimizing the SAUC loss also has an identifiability issue because (i) S_s(|x|) > S_s(c|x|) for c > 1 and (ii) S_s(x) = S_{s=1}(x/s) for both the normal CDF approximation and the logistic approximation. Current SAUC methods use a combination of two procedures to deal with the identifiability problem. First, a constraint is imposed on the coefficients β. Pepe and Thompson [23] suggested choosing an anchor variable x_1 and setting its coefficient β_1 = 1, while Lin et al. [24] proposed to constrain the L1-norm of β to 1. Second, an empirical rule for selecting s is adopted to set s before optimization. Gammerman [25] proposed to choose the largest s such that |η_p^Δ|/s ≥ 5 for 95% of the pairs, in order to ensure a good approximation of SAUC to the empirical AUC loss function; Vexler et al. [11] proposed to let s = (n_D n_D̄)^{-0.1} based on simulation studies; and Lin et al. [24] proposed to have s = n^{-1/3} StdDev(x_i^T β_0/‖β_0‖_2), where n = n_D + n_D̄ and β_0 is an initial estimate of β, out of both empirical and asymptotic considerations.

Minimizing RAUC has similar identifiability problems because h_s(|x|) ≥ h_s(c|x|) for c > 1 and because the ramp function also satisfies h_s(|x|) = h_{s=1}(|x|/s). We propose a new way to address these issues by setting s = 1 and penalizing the L2-norm of β. That is, we seek to minimize a penalized RAUC loss:


Σ_{p=1}^{n_D n_D̄} h(η_p^Δ) + (λ_n/2) ‖β‖_2²,   (2)

This effectively minimizes the RAUC loss while constraining ‖β‖_2² to a constant. But instead of specifying how big ‖β‖_2² needs to be, we use λ_n as a lever to adjust it. When λ_n gets smaller, ‖β‖_2² is constrained at a higher value. Consequently, the saturation ratio increases and the approximation error of the ramp AUC criterion decreases; but, at the same time, the search algorithm becomes more likely to get stuck in suboptimal solutions. A thorough but computationally intensive approach for selecting λ_n is to perform searches under a series of λ_n and choose the one that yields the best empirical AUC. On the other hand, a rule of thumb that works well when finding the best linear combinations is that when a choice of λ_n leads to a saturation ratio of around 75%, the corresponding solution is often close to optimal. In our experience, choosing a smaller λ_n to get a higher saturation ratio frequently results in either negligible improvement or, worse, deterioration in the quality of the solution.
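A sketch of evaluating the penalized criterion for a candidate β (names are ours; the exact scaling of the L2 penalty term is an assumption):

```python
def penalized_rauc_loss(beta, cases, controls, lam):
    """Sum of ramp losses (s = 1) over all case-control pairs plus an
    L2 penalty on the coefficients (penalty scaling is an assumption)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    def ramp(x):
        # ramp with s = 1: 1 for x <= -1/2, 0 for x >= 1/2, linear between
        return min(1.0, max(0.0, 0.5 - x))
    loss = sum(ramp(dot(beta, [a - b for a, b in zip(xi, xj)]))
               for xi in cases for xj in controls)
    return loss + 0.5 * lam * dot(beta, beta)
```

Evaluating this criterion on a grid of λ values, and monitoring the saturation ratio alongside, is one way to apply the 75% rule of thumb described above.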

We prefer constraints on L2-norm over constraints on an anchor variable coefficient because, as has been pointed out by others [15, 24], there are times when it is difficult to select an anchor variable. Constraining ||β||2 also ensures a more uniform approximation of the AUC loss across the parameter space than constraining ||β||1. Suppose each covariate has been standardized to have mean 0 and standard deviation 1. For both SAUC and RAUC, this means the saturation ratio remains relatively constant for different marker combinations under the ||β||2 constraint but not under the ||β||1 constraint. As the SAUC or RAUC tends to increase as the saturation ratio increases, the estimate of the combination is likely to be more biased in the latter case, particularly when the range of saturation ratios is not close to 1.

Consider the statistical model Pr(Y = 1|X) = G(α + X^T β), where Y is a binary outcome and G is an unknown monotonically increasing inverse link function. Let the first derivative of G(·) be denoted by G′(·). Let β be d-dimensional and denote θ(β) = (β_2/β_1, …, β_d/β_1)^T. Denote the true parameters by β_0 and let θ_0 = θ(β_0). Also denote the first element of β_0 by β_{0,1}, which we assume to be nonzero. Sherman [26] proved that maximizing the empirical AUC leads to a consistent and asymptotically normal estimate of θ. Ma and Huang [15] showed this is also true for maximizing the smoothed AUC. We show here, through two theorems, that the estimate of the combination θ̂ = θ(β̂) from minimizing the penalized empirical ramp AUC loss (2) is also consistent and asymptotically normal.

Theorem 1 (Consistency)

Under certain regularity conditions and λ_n = o_p(n²), θ̂ →_p θ_0 as n → ∞.

Theorem 2 (Asymptotic Normality)

Let χ denote the last d − 1 components of X, η_0 = α + X^T β_0, and let g_0 denote the marginal density of η_0. Denote


Suppose assumptions (A1) – (A4) hold and λn = op(1), then


as n → ∞.

The assumptions and proofs of the theorems mirror those of [26] and [15] and can be found in Appendix A. Based on the proofs, it is clear that the asymptotic variance of the RAUC estimator of the best linear combination should be the same as that of the empirical AUC estimator and the smoothed AUC estimator. However, our asymptotic variance formula differs from that of [26] and [15] by a factor of 4. The difference can be traced to the expectation of the second derivative of the kernel of the empirical process in Lemma 1 of Appendix A.

To consistently estimate the asymptotic variance, one can either use numerical derivatives as suggested by Sherman [26] in the context of maximizing empirical AUC or use the following empirical smoothed estimate in the spirit of [15]:


where Π(x) = xxT; S(x) is a smooth sigmoid function, e.g. Φ(−x), with first derivative denoted by S′(x) and second derivative denoted by S″(x); and χi is the last d − 1 components of xi. As observed by Ma and Huang [15], the performance of the empirical variance estimate is less than satisfactory when the sample size is small to medium and the more computationally intensive bootstrap methods are preferred.

3. Optimization and Linear and Nonlinear Marker Combination

3.1. Identification of Linear Marker Combinations

The penalized RAUC loss (2) is a non-convex, non-smooth function. To find the best combination, we write the ramp function h as the difference between two convex functions, h(x) = h_1(x) − h_2(x) (Figure 1b):

h_1(x) = (1/2 − x)_+,  h_2(x) = (−1/2 − x)_+,

where (u)_+ = max(u, 0).
This allows the application of the difference of convex functions algorithm (DCA) [27, 9]. The essence of DCA is to use a series of convex optimization problems to approximate the non-convex problem. This is achieved by iteratively approximating h_2 with a first-order Taylor expansion; there is no need to approximate h_1 because the difference between a convex function and a linear function is still convex. The optimization proceeds as follows:

  • Step 1. Start with an initial guess for η_p^Δ and assign it to η_p^{Δ,0}. For example, we can use a robust logistic regression to get an initial estimate θ̂_rob and let η_p^{Δ,0} ← (x_{i_p} − x_{j_p})^T (1, θ̂_rob^T)^T.
  • Step 2. Solve
    min_β Σ_p {h_1(η_p^Δ) − ĥ_2(η_p^Δ; η_p^{Δ,0})} + (λ_n/2) ‖β‖_2²,   (3)
    where ĥ_2(η_p^Δ; η_p^{Δ,0}) = h_2(η_p^{Δ,0}) + h_2′(η_p^{Δ,0}) η_p^Δ, and h_2′(x) = −I(x < −1/2) is the first derivative of h_2(x) with respect to x, except that it is set to 0 at x = −1/2.
  • Step 3. Set η_p^{Δ,0} ← (x_{i_p} − x_{j_p})^T β̂ and go back to Step 2 until the change in the penalized RAUC loss is less than a pre-specified constant.

In Step 2 of the algorithm above, we solve a convex optimization problem. As h_1 is not a smooth function, the standard approach is to convert the problem to a constrained smooth optimization problem by introducing slack variables ξ_p to replace h_1 and adding two sets of constraints. This leads to:


min_{β, ξ} Σ_p ξ_p − Σ_p h_2′(η_p^{Δ,0}) η_p^Δ + (λ_n/2) ‖β‖_2²  subject to ξ_p ≥ 1/2 − η_p^Δ and ξ_p ≥ 0 for all p,   (4)

where we have dropped terms void of ξ or β. Because the constraints in the above optimization problem are hard to work with, we seek its dual problem [28] via the Lagrange method. The primal Lagrangian is


L_p = Σ_p ξ_p − Σ_p h_2′(η_p^{Δ,0}) η_p^Δ + (λ_n/2) ‖β‖_2² + Σ_p α_p (1/2 − η_p^Δ − ξ_p) − Σ_p γ_p ξ_p,

where α and γ are non-negative Lagrange multipliers corresponding to the two sets of constraints in (4). Setting the derivatives of the Lagrangian with respect to the primal variables β and ξ to 0, we arrive at


β = (1/λ_n) Σ_p {α_p + h_2′(η_p^{Δ,0})} x_p^Δ  and  α_p + γ_p = 1,   (5)

where x_p^Δ = x_{i_p} − x_{j_p}. Plugging these back into L_p, the terms involving ξ_p vanish. After some algebra (see Appendix B for details), we arrive at the dual problem:


max_α −(1/2) α^T Q α + (1/2) b^T α  subject to 0 ≤ α_p ≤ 1 for all p,   (6)

where Q is a square matrix whose [p, p′]th element is ⟨x_p^Δ, x_{p′}^Δ⟩/λ_n, and b = 1 − 2 Q h_2′(η^{Δ,0}). The optimization problem (6) has a set of simple box constraints and can be solved via many quadratic programming methods. Once we solve the dual problem, the solution to the primal problem is found through (5).
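Because the dual has only box constraints, even a simple projected gradient method can solve it. The sketch below assumes the dual maximizes a concave quadratic of the form −(1/2)α^T Q α + (1/2)b^T α over 0 ≤ α_p ≤ 1 (a dedicated QP solver would be used in practice; function name is ours):

```python
def solve_box_qp(Q, b, step=0.1, n_iter=5000):
    """Maximize -0.5 * a^T Q a + 0.5 * b^T a subject to 0 <= a_p <= 1
    by projected gradient ascent: take a gradient step, then clip each
    coordinate back into [0, 1]."""
    n = len(b)
    a = [0.5] * n
    for _ in range(n_iter):
        # gradient of the objective: -Q a + b/2
        grad = [-sum(Q[p][q] * a[q] for q in range(n)) + 0.5 * b[p]
                for p in range(n)]
        # ascent step followed by projection onto the box [0, 1]^n
        a = [min(1.0, max(0.0, a[p] + step * grad[p])) for p in range(n)]
    return a
```

For well-conditioned Q a small fixed step size converges linearly; production code would use a proper quadratic programming routine instead.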

Having found the best combination, to classify a new observation we will use the score:


score(x_new) = (1/λ_n) Σ_p {α_p + h_2′(η_p^{Δ,0})} ⟨x_new, x_p^Δ⟩,

where η^{Δ,0} and α come from the last iteration of DCA. Similar to the SVM literature [29], we call those x_p^Δ with h_2′(η_p^{Δ,0}) + α_p = 0 non-support pairs because they do not contribute to the score. There are usually a great number of non-support pairs, which means that only some of the inner products ⟨x_new, x_p^Δ⟩ need to be evaluated, allowing for faster classification.

3.2. Identification of Nonlinear Marker Combinations

As the best classifiers may not be among linear combinations of the input markers, we wish to enlarge the feature space via basis expansion and find the best linear combination in the enlarged feature space. For example, if we have two markers (x_1, x_2), we go beyond the linear trends and look at the second-order trends as well as the interaction between the two markers. Then we work with the feature space (x_1, x_2, x_1², x_2², x_1x_2). Let φ_i denote the feature vector for subject i. We now show that the algorithm above can be adapted to find the best combination of the φ_i even if φ_i is infinite-dimensional and we do not have explicit knowledge of the mapping from x_i to φ_i, as long as we know how to evaluate the inner product ⟨φ_i, φ_j⟩. In Step 1 we let η_p^{Δ,0} ← 0. In Step 2, the dual problem that needs to be solved has the same form as (6), where Q is now a square matrix whose [p, p′]th element is ⟨φ_p^Δ, φ_{p′}^Δ⟩/λ_n. But ⟨φ_p^Δ, φ_{p′}^Δ⟩ can be written as:


⟨φ_p^Δ, φ_{p′}^Δ⟩ = ⟨φ_{i_p}, φ_{i_{p′}}⟩ − ⟨φ_{i_p}, φ_{j_{p′}}⟩ − ⟨φ_{j_p}, φ_{i_{p′}}⟩ + ⟨φ_{j_p}, φ_{j_{p′}}⟩.

Having solved the dual problem, there is no need to translate it back to the primal space because Step 3 calls for η̂^Δ, which has the expression:


η̂_p^Δ = (1/λ_n) Σ_{p′} {α_{p′} + h_2′(η_{p′}^{Δ,0})} ⟨φ_p^Δ, φ_{p′}^Δ⟩.

That is, even though φ_i may be infinite-dimensional, η̂^Δ always has a finite-dimensional representation. We can classify a new observation based on:


score(x_new) = (1/λ_n) Σ_p {α_p + h_2′(η_p^{Δ,0})} {⟨φ(x_new), φ_{i_p}⟩ − ⟨φ(x_new), φ_{j_p}⟩},

where again η^{Δ,0} and α come from the last iteration of DCA.

This approach of mapping the input vector to a feature space without having to specify the mapping explicitly has been called the ‘kernel trick’ in the SVM literature [30, 22]. Let K be a symmetric continuous function that maps {x_i} × {x_j} to the real line. We refer to K as a kernel. Mercer's theorem [31, 4] tells us that any positive semidefinite kernel K corresponds to the inner product under some mapping φ, i.e. K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. Thus, if we can compute the kernel K, we can find the best linear combination in φ. Two commonly used nonlinear kernels are the dth-degree polynomial kernel K(x_i, x_j) = (1 + ⟨x_i, x_j⟩)^d and the radial basis function (RBF) kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²), where γ is a tuning parameter. A linear kernel K(x_i, x_j) = ⟨x_i, x_j⟩ maps a vector to itself, and this allows a uniform treatment of linear and nonlinear combination methods. To avoid overfitting in the feature space, we treat λ_n as a tuning parameter to find the desirable model complexity.
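A sketch of these kernels, together with the expansion of a pairwise-difference inner product into four kernel evaluations (function names are ours):

```python
import math

def linear_kernel(x, y):
    # <x, y>: the linear kernel maps a vector to itself
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, d=2):
    # d-th degree polynomial kernel (1 + <x, y>)^d
    return (1.0 + linear_kernel(x, y)) ** d

def rbf_kernel(x, y, gamma=1.0):
    # radial basis function kernel exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def pair_diff_inner(k, x_ip, x_jp, x_ipp, x_jpp):
    # <phi(x_ip) - phi(x_jp), phi(x_ip') - phi(x_jp')> expanded into four
    # kernel evaluations, so phi never needs to be computed explicitly
    return (k(x_ip, x_ipp) - k(x_ip, x_jpp)
            - k(x_jp, x_ipp) + k(x_jp, x_jpp))
```

With `linear_kernel`, `pair_diff_inner` reproduces the ordinary inner product of the marker differences; substituting `rbf_kernel` implicitly works in an infinite-dimensional feature space.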

4. Simulation studies

4.1. Linear Marker Combinations

In this simulation study, we examine the performance of RAUC in finding the best linear combination of markers. Two covariates are simulated from a mixture of two bivariate normal random variables


where Δ is a Bernoulli variable with mean π. The outcome variable Y is generated as a Bernoulli random variable with mean specified by logit{Pr(Y = 1)} = 4X_1 − 3X_2 − (X_1X_2)³. When no outliers are simulated, π is set to 0; when outliers are simulated, π is set to 0.05. The sample size is 200 for the training data and 10^4 for the test data. Figure 2 shows two training datasets, one with and one without outliers. The plots suggest that this simulation scenario mimics the real data example that served as a motivating example in [5].

Figure 2
Sample training datasets from Section 4.1. Cases are plotted with filled circles, and controls are plotted with empty circles. Best linear combinations from three methods are shown as lines.

We compare RAUC with three other classification methods. The logistic method fits a logistic regression model to the training data and uses the fitted model to score the test data. For the SAUC method, we choose normal CDF as the sigmoid function, constrain ||β||2 = 1, and choose the scale parameter as prescribed by Lin et al. [24]. Optimization is carried out by adapting the algorithm from Lin et al. The third method we compare to is the SVM method [32, 33, 4] as implemented by Joachims [34]. We choose SVM for comparison here because it is a commonly used classification method and is also a kernel-based method like the RAUC method. In our categorization of classification methods, SVM is more closely aligned with the likelihood approach because the objective function that it maximizes has the same overall shape as the logistic regression method ([22] Fig 10.4). For all methods, we assume the working model to be a linear combination of x1 and x2.

Summary results from 1000 simulations are shown in Table 1. When there are no outliers in the training data, all methods perform similarly. When there are outliers in the training data, the logistic regression method may break down, as illustrated by Figure 2. Table 1 shows that, when the test data does not include outliers, the average test data AUC for the logistic regression method and the SAUC method is 0.639 and 0.667, respectively. Because the AUC measure has a narrow range, going from 0.5 for a useless classification method to 1 for a perfect one, we will look at the relative increase in effective AUC, defined as AUC − 0.5, when comparing two methods. By this measure, SAUC shows a relative increase in effective AUC of (0.667 − 0.5)/(0.639 − 0.5) − 1 = 20% over the logistic regression method. By the same measure, RAUCl, the RAUC method with linear kernel, shows a relative increase in effective AUC of (0.681 − 0.5)/(0.667 − 0.5) − 1 = 8% over SAUC when the test data contains no outliers. When the test data is simulated with outliers, we see a similar pattern of RAUC outperforming SAUC and SAUC outperforming logistic regression. Since the loss functions minimized by RAUC and SAUC both approximate the AUC loss, the difference in performance between the two can be attributed to the difference in optimization algorithms and suggests that the difference-of-convex functions algorithm employed by RAUC may be less likely to get stuck in local optima than the gradient-based algorithm used by SAUC. Finally, as we would expect, the performance of SVMl, the SVM estimator with linear kernel, tracks most closely with the logistic regression method, as it is not robust against outliers.
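The effective-AUC comparison used throughout can be wrapped in a one-line helper (name is ours):

```python
def relative_effective_auc_gain(auc_new, auc_ref):
    # relative increase in effective AUC (AUC - 0.5) of one method over another
    return (auc_new - 0.5) / (auc_ref - 0.5) - 1.0
```

For example, the SAUC-versus-logistic comparison above is `relative_effective_auc_gain(0.667, 0.639)`, about 0.20.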

Table 1
Test data AUC from Section 4.1, mean and se: Monte Carlo mean and standard error. SVMl: SVM with linear kernel. RAUCl: RAUC with linear kernel.

Using the knowledge of logit{Pr(Y = 1)} to classify cases and controls gives us a theoretical AUC that represents an upper bound on the classification performance. From a dataset of 500,000 data points, the theoretical AUC is 0.704 under the ‘no outliers’ scenario and 0.720 under the ‘with outliers’ scenario. Figure A.1 in the Supplementary Materials shows the density curves for the distributions of logit{Pr(Y = 1)} in cases and controls under the two scenarios. Comparing the theoretical AUC to Table 1, we see that a linear combination of X_1 and X_2 can achieve close to the theoretical AUC under the ‘no outliers’ scenario but not under the ‘with outliers’ scenario. This makes sense because when there are no outliers the values of X_1 and X_2 are very close to 0 and the higher-order term (X_1X_2)³ is nearly ignorable. Figure A.2 in the Supplementary Materials plots the joint distributions of (β̂_1, β̂_2) and the distributions of the ratio β̂_2/β̂_1 from RAUCl. The figure shows that under finite samples the distribution of β̂_2/β̂_1 behaves well when there are no outliers in the training data but has a long tail when there are outliers in the training data.

We repeat this simulation study and replace the first component of the mixture distribution of (X1, X2) with a heavy-tailed bivariate double-exponential distribution using the R package LaplacesDemon. Similar results are obtained and the details are described in Supplementary Materials Section D.1.

4.2. Nonlinear Marker Combinations

In this simulation study, we study classifier performance when there is a strong nonlinear pattern in the training data. The radial basis function (RBF) kernel is adopted in RAUC and SVM. We simulate 4 covariates independently from Student's t distribution with 4 degrees of freedom. When no outliers are simulated, the outcome Y is generated as a Bernoulli random variable with mean specified as logit{Pr(Y = 1)|X_1, X_2, X_3, X_4} = 10 × {sin(πX_1) + sin(πX_2) + sin(πX_3) + sin(πX_4)}. When outliers are simulated, the outcome for samples with X_1² + X_2² + X_3² + X_4² > 10 is simulated from a different mean model: 10 × {cos(πX_1) + cos(πX_2) + cos(πX_3) + cos(πX_4)}. The sample size is 200 for the training data and 10^4 for the test data.

The use of the RBF kernel in RAUC and SVM requires careful consideration of the tuning process. The RBF kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) contains a scale parameter γ. In addition, the penalty parameter λ takes on a more important role, as it controls not only the approximation error but also the model complexity, because the implicit feature vector associated with the RBF kernel is infinite-dimensional. We perform a grid search on (λ, γ). The grid points for γ are chosen to be the 50%, 75% and 90% quantiles of the between-class Euclidean distance in the training data [35]. The grid points for λ are chosen more adaptively in the following way. Assuming the markers are all scaled to have variance 1, we start with a default grid of seven values for λ: 2^-3, 2^-2, 2^-1, 1, 2, 2^2, and 2^3. We evaluate the performance of the classifiers at the 3 × 7 lattice. If the optimal (λ̂, γ̂) has a λ̂ that corresponds to either the minimum or maximum tested λ, we expand the lattice to include a new λ* that is another two-fold further from 1 than λ̂. We repeat this process until the optimal λ is neither the minimum nor the maximum tested λ. To evaluate tuning parameters, we generate a tuning dataset of size 10^3. The mean and standard error of the test data AUC over 1000 replications are shown in Table 2.
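The adaptive expansion of the λ grid can be sketched as follows (helper name is ours; grid values are assumed to be powers of 2, so moving "two-fold further from 1" means halving the minimum or doubling the maximum):

```python
def expand_lambda_grid(grid, best_lam):
    """If the selected lambda sits on the boundary of the tested grid,
    add a new value another two-fold further out; otherwise keep the grid."""
    if best_lam == min(grid):
        return sorted(grid + [best_lam / 2.0])
    if best_lam == max(grid):
        return sorted(grid + [best_lam * 2.0])
    return list(grid)
```

The search repeats until the selected λ is an interior point of the grid, as described above.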

Table 2
Test data AUC from Section 4.2, mean and se: Monte Carlo mean and standard error. SVMr: SVM with RBF kernel. RAUCr: RAUC with RBF kernel.

When the training data is free of outliers, neither the logistic regression method nor the SAUC method performs well, due to the strong nonlinearity in the data. Both the RAUC method and the SVM method with RBF kernel perform satisfactorily. As expected, the test data AUC is lower when the test data contains outliers than when it does not.

When the training data contains outliers, we also observe that RAUC and SVM with RBF kernel outperform logistic and SAUC. Furthermore, RAUC with RBF kernel also shows an advantage over SVM with RBF kernel. The relative increase in effective AUC is (0.817 − 0.5)/(0.790 − 0.5) − 1 = 9% when the test data has no outliers, and (0.816 − 0.5)/(0.795 − 0.5) − 1 = 7% when the test data also has outliers.

From a dataset of 500,000 data points, the theoretical AUC is 0.995 under both the ‘no outliers’ and ‘with outliers’ scenarios. Figure B.1 in the Supplementary Materials shows the density curves for the distributions of logit{Pr(Y = 1)} in cases and controls under the two scenarios. Figure B.2 in the Supplementary Materials shows scatterplots of the ‘theoretical combination’, i.e. logit{Pr(Y = 1)}, and the RAUCr-estimated combination for the training data under the two scenarios from one Monte Carlo replicate. The plots show that for the bulk of the 200 data points the correlation between the theoretical and estimated combinations is high, but when the theoretical combination is near 0, the estimated combination may show wide variability.

We repeat this simulation study with (X1, X2, X3, X4) simulated from a correlated multivariate Student’s t distribution, and obtain similar results (details in Supplementary Materials Section D.2).

5. Data example

RV144 is a community-based, randomized, multicenter, double-blind, placebo-controlled HIV vaccine efficacy trial conducted in Thailand from 2003 to 2006 [36]. In the modified intention-to-treat analysis involving 16,395 subjects, the vaccine efficacy was 31.2% with a P-value of 0.04. In an effort to identify immune correlates of infection risk potentially relevant for vaccine-induced protection, Haynes et al. [1] conducted a case-control study using biomarker measurements from blood samples of 41 cases and 205 controls, all from the vaccine arm. Haynes et al. reported an array of immunological response variables measured from these samples. We choose two of them for illustration: IgA, which measures the IgA class of serum antibody binding to the HIV envelope protein gp120, and IL13, which measures the amount of IL13 production by peripheral blood mononuclear cells when stimulated by gp120 peptide. Both variables are continuous and are scaled to have standard deviation 1. We perform repeated random sub-sampling validation for AUC estimation. In each iteration, we randomly split the data into a 4/5 training set and a 1/5 test set, stratified on case status. Test data AUC over 1000 splits is reported in Table 3.
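One repetition of the stratified sub-sampling split can be sketched as follows (function name is ours; cases and controls are split separately so the 4/5 training and 1/5 test sets preserve the case/control ratio):

```python
import random

def stratified_split(case_ids, control_ids, test_frac=0.2, seed=0):
    """Split cases and controls separately into training and test sets,
    preserving the case/control ratio in both."""
    rng = random.Random(seed)

    def split(ids):
        ids = list(ids)
        rng.shuffle(ids)
        n_test = max(1, round(len(ids) * test_frac))
        return ids[n_test:], ids[:n_test]

    case_train, case_test = split(case_ids)
    ctrl_train, ctrl_test = split(control_ids)
    return case_train + ctrl_train, case_test + ctrl_test
```

Repeating this over many random seeds and averaging the test data AUC gives the repeated random sub-sampling estimate reported in Table 3.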

Table 3
Test data AUC for the RV144 data example, mean and se: mean and standard error from repeated random sub-sampling validation.

For methods using the RBF kernel, we need a strategy for choosing the tuning parameters. Because the number of cases is limited, further splitting the training set into a training subset and a tuning subset does not work well. Instead, we adopt the following strategy. We use the same tuning parameter values for all training/test datasets. To find good values, we randomly split the data into a 4/5 training set and a 1/5 tuning set, and perform a grid search to maximize the classification performance on the tuning data. The median value over 1000 such splits is chosen as the tuning parameter to be used in the training/test phase. For SVM, we try both this strategy and generalized approximate cross validation (GACV [37]) and find that our strategy performs slightly better than GACV; hence, we only report results from the former strategy.
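The tuning strategy (grid search on repeated random train/tune splits, then take the median of the per-split winners) might be sketched as below; `fit_and_score` is a hypothetical stand-in for fitting a kernel classifier on the training part and scoring it on the tuning part:

```python
import numpy as np

def select_tuning_parameter(x, y, grid, fit_and_score, n_splits=1000, seed=2):
    """Per split: grid-search the tuning-set score; return the median winner."""
    rng = np.random.default_rng(seed)
    n, best = len(y), []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        cut = round(0.8 * n)                 # 4/5 training, 1/5 tuning
        tr, tu = idx[:cut], idx[cut:]
        scores = [fit_and_score(x[tr], y[tr], x[tu], y[tu], g) for g in grid]
        best.append(grid[int(np.argmax(scores))])
    return float(np.median(best))
```

Taking the median of the per-split winners, rather than averaging scores across splits, makes the selection robust to a few splits where the small tuning set is unrepresentative.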

The results in Table 3 show that, among the methods seeking the best linear combination of markers, the performance of RAUC with the linear kernel is on par with the other methods. The RAUC method detects a nonlinear trend in the relationship between infection and the two markers. Comparing RAUCr and RAUCl, the relative increase in effective AUC is (0.656 − 0.5)/(0.621 − 0.5) − 1 = 29%. To help interpret the nonlinearity, we make scatterplots of IgA and IL13 in Figure A.1 of the Supplementary Materials, where cases are represented by red circles, controls by black triangles, and the classifier decision boundaries (left: RAUCl; right: RAUCr) by shaded or non-shaded background. For RAUCl, IL13 is drawn on the linear scale; for RAUCr, IL13 is drawn on a transformed scale to provide a clearer view. The shape of the decision boundary from RAUCr suggests that the nonlinearity can be summarized by a simple AND/OR rule: (IgA < 1 AND IL13 > c1) OR (IgA > 1 AND IL13 > c1 + δ). The SVM method with nonlinear kernel does not perform as well, which is likely a manifestation of the small-sample performance issues of SVMr.
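The AND/OR rule amounts to a threshold on IL13 that steps up once IgA crosses 1. A sketch, where the values of `c1` and `delta` are placeholders (the text leaves the constants unspecified):

```python
def and_or_rule(iga, il13, c1=0.3, delta=0.4):
    """Case call: IL13 must clear a threshold that steps up once IgA > 1.
    (IgA exactly 1 is assigned the higher threshold by convention.)"""
    threshold = c1 if iga < 1 else c1 + delta
    return il13 > threshold

print(and_or_rule(0.5, 0.5), and_or_rule(1.5, 0.5))   # True False
```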

6. Concluding Remarks

In this paper we propose a new criterion function, the penalized RAUC loss function, for finding marker combinations with optimal classification/prediction performance. By using different kernels, the RAUC method can find both linear and nonlinear best marker combinations. We show that the best linear marker combination is a consistent and asymptotically normal estimator of the true marker combination when the data is generated from a semiparametric generalized linear model.

In the numerical studies, we show that the RAUC method with the linear kernel out-performs SAUC as a linear classification method under some scenarios. While both methods yield consistent estimators of the true marker combination with identical asymptotic variances, the difference-of-convex-functions algorithm employed by RAUC is less likely to get stuck in local optima than the gradient-based algorithm used by SAUC. In the presence of nonlinear patterns, the RAUC method with nonlinear kernels provides a flexible way to derive nonlinear marker combinations for optimizing classification accuracy, with performance better than or comparable to common alternative estimators. An R package implementing the RAUC method is available from the authors upon request.
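The difference-of-convex (DC) iteration can be illustrated on a simplified version of the problem. This sketch is not the authors' implementation: it uses a generic ramp loss r(u) = max(0, 1 − u) − max(0, −u) on case-minus-control difference vectors, decomposes it as a difference of two hinges, linearizes the concave part at the current iterate, and solves each convex subproblem by subgradient steps:

```python
import numpy as np

def dca_ramp_auc(z, lam=0.1, n_outer=15, n_inner=200, lr=0.05, seed=3):
    """Minimize mean_p ramp(w.z_p) + lam*||w||^2 by DC iterations, where
    ramp(u) = max(0, 1-u) - max(0, -u) and each row of z is a
    case-minus-control difference x_i - x_j."""
    rng = np.random.default_rng(seed)
    n, d = z.shape
    w = 0.01 * rng.normal(size=d)
    for _ in range(n_outer):
        # linearize the concave part -max(0, -u) at the current iterate
        lin = (z @ w < 0).astype(float)
        for _ in range(n_inner):
            u = z @ w
            g = -(u < 1).astype(float)      # subgradient of max(0, 1-u)
            # pairs with u < 0 get g + lin = 0: gross outliers stop pulling
            w -= lr * (z.T @ (g + lin) / n + 2 * lam * w)
    return w

# toy pairwise differences: cases centred at (1,1), controls at (-1,-1)
rng = np.random.default_rng(0)
z = (rng.normal(1, 1, (30, 2))[:, None, :]
     - rng.normal(-1, 1, (30, 2))[None, :, :]).reshape(-1, 2)
w = dca_ramp_auc(z, lam=0.01)
print(np.mean(z @ w > 0))   # large majority of pairs correctly ordered
```

The cancellation noted in the inner loop is the source of the robustness: badly misordered pairs sit on the flat part of the ramp and exert no gradient pull, unlike with a smooth sigmoid surrogate.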

Our focus in this paper is to develop classification rules to flexibly combine several candidate biomarkers that have been selected by investigators as potentially useful. Another important aspect of biomarker studies is to use data to help select important markers and features. Variable selection is well developed in regression analysis with linear predictors, where many types of penalties have been proposed and oracle properties established [38]. In AUC analysis with linear marker combinations, the penalization approach to variable selection faces an added layer of complexity due to the fact that a constraint on β is needed to ensure identifiability. Some efforts have been made in this area, e.g. Lin et al. [24] proposed to use a SCAD penalty [39] while maintaining ||β|| = 1. In AUC analysis with nonlinear marker combinations, model selection is further complicated by the fact that the input markers are mapped to a higher-dimensional feature space and parsimony may be sought on the feature level or on the marker level. Feature-level model selection has been studied in nonlinear kernel SVM [40, 41] and marker-level model selection has been explored in Gaussian process regression models [42, 43]. Our framework can be naturally extended to perform marker-level model selection for AUC analysis and is currently under investigation.

Supplementary Material

Supp info


The authors thank Krisztian Sebestyen for programming assistance. We also thank the Editor, the AE and the two referees for their insightful comments that helped improve the manuscript. This work was supported by the cooperative agreement W81XWH-07-2-0067 between the Henry M. Jackson Foundation and the Department of Defense, and National Institute of Allergy and Infectious Diseases grants 1R56AI116369-01A1, R01-GM106177 and UM1AI068635.


Appendix A: Proofs of Theorems 1 and 2

For the proof of consistency, we need two assumptions. (A1) θ0 is an interior point of Θ, which is a compact subset of $\mathbb{R}^{d-1}$. (A2) The support of the covariate vector x is not contained in any proper linear subspace of $\mathbb{R}^d$, and the first component of x has an everywhere positive Lebesgue density conditional on the other components.

Under (A1) and (A2), Han (1987) [44] proved the consistency of the empirical AUC estimator. Denote the AUC loss by $\mathrm{AUC}_n(\theta) = \frac{1}{n_2}\sum_{p=1}^{n_D n_{\bar D}} I(\eta_p^\Delta < 0)$. It suffices to show that $\sup_{\theta \in \Theta} |\mathrm{AUC}_n(\theta) - L_n(\theta)| \to 0$ in probability as $\lambda_n \to 0$, where $L_n(\theta) = \frac{1}{n_2}\sum_{p=1}^{n_D n_{\bar D}} r(\eta_p^\Delta) + \frac{1}{2n_2}\lambda_n \|\beta\|_2^2$. But


As argued in [15], the first term in the last line converges to 0 uniformly in probability when λn/n2 → 0. This completes the proof of consistency.

Let F and f be the cdf and pdf of $T = (X_i - X_j)^T \beta$, where $X_i$ is a random sample from the cases and $X_j$ is a random sample from the controls. Denote the pdf of $[1, \theta(\beta)](X_i - X_j)$ by ψ.


Taking the derivative with respect to β1 and letting β1 → ∞,


Hence, $E\{r(T)\} + \frac{1}{2n_2}\lambda_n\|\beta\|_2^2$, as a function of β1, is minimized at


Denote $s_n = (\lambda_n/n_2)^{1/4}\{(8\theta_0^2)^{-1}\psi(0)\}^{-1/4}$. When $\lambda_n = o_p(1)$, $s_n = o_p(1/\sqrt{n})$. Define $r_n(x) = r(x/s_n)$ and $H_n(\theta) = \frac{1}{n_2}\sum_{i \in D}\sum_{j \in \bar{D}} r_n\{[1, \theta](x_i - x_j)\}$.

Let Z = (Y, X). A key term in the normality proof is


By rn(x) = 1 − rn(−x), τn(Z1, θ) can be simplified to


where in the last line, we introduce S(y, t),


We now state the third assumption needed for normality. (A3) (i) Δ and V exist and Δ is negative definite; (ii) the partial derivatives of g0 exist and are bounded on the support of (Y, X); (iii) for each z, all mixed second partial derivatives of τn(z, ·) exist on a neighborhood of θ0; (iv) there is an integrable function M(z) such that for all z and all θ in a neighborhood of θ0, $\|\nabla^2 \tau_n(z, \theta) - \nabla^2 \tau_n(z, \theta_0)\| \le M(z)\|\theta - \theta_0\|$. We now state and prove a lemma needed for the normality result.

Lemma 1

Under assumption (A3),





Let $u_i$ denote the unit vector in $\mathbb{R}^{d-1}$ whose ith component equals unity, and write $\nabla_i$ for the ith component of $\nabla_1$. By the definition of the derivative, we have


By the definition of rn, we can compute


(8) and (9) follow.

To show (10),


The rest of the normality proof closely tracks the proof of Theorem 4 in [26]. Denote $\Gamma_n(\theta) = H_n(\theta) - H_n(\theta_0) + \frac{1}{2n_2}\lambda_n \hat{\beta}_1^2 \|\theta\|_2^2$. We will show that


uniformly in o(1) neighborhoods of θ0, where Wn converges in distribution to a N(0, V) random vector. By Theorems 1 and 2 of [26] and the fact that $s_n = o_p(1/\sqrt{n})$, the result follows.

For each $(z_1, z_2)$ in $S \otimes S$ and each θ in Θ, define $f_n(z_1, z_2, \theta) = I(y_1 > y_2)[r_n\{(1, \theta)(x_1 - x_2)\} - r_n\{(1, \theta_0)(x_1 - x_2)\}]$. As in [26], let $P_n$ be the empirical measure and $U_n$ the U-process operator, while $Pf_n(u, \cdot, \theta)$ denotes the conditional expectation of $f_n(u, v, \theta)$ given its first argument and $Pf_n(\cdot, v, \theta)$ the conditional expectation given its second argument. Then we can write $H_n(\theta) - H_n(\theta_0) = \Gamma_n(\theta) + P_n g_n(\cdot, \theta) + U_n k_n(\cdot, \cdot, \theta)$, where


We now show that


Note that $2\Gamma_n^0(\theta) = E\{\tau_n(z, \theta) - \tau_n(z, \theta_0)\}$. Expanding $\tau_n(z, \theta)$ about θ0 gives


where θ* lies between θ and θ0. By (A3)(iv), for each z in S and each θ in a neighborhood of θ0,


Thus by Lemma 1,


Next we show that


uniformly over o(1) neighborhoods of θ0, where Wn converges in distribution to a N(0, V) random vector. Note that


Then we have


uniformly over o(1) neighborhoods of θ0, where


By Lemma 1, Wn converges in distribution to a N(0, V) random vector. By a weak law of large numbers, Dn = o(1). (13) follows.

Lastly we show that


Denote $\mathcal{H} = \{I(y_1 > y_2)[r\{(1, \theta)(x_1 - x_2)/s\} - r\{(1, \theta_0)(x_1 - x_2)/s\}] : \theta \in \Theta, s \in [0, 1]\}$. Then $\mathcal{H}$ is Euclidean by Lemma 22(ii) of [45]. By the dominated convergence theorem, $E\{k_n(\cdot, \cdot, \theta)\} \to 0$ as $\theta \to \theta_0$ and $s_n \to 0$. Thus $U_n k_n(\cdot, \cdot, \theta) = o(1)$ by an argument following Theorem 3 of [26].

Together, (12), (13), (14) and $\frac{1}{n_2}\lambda_n \hat{\beta}_1^2 = o_p(1)$ imply (11). This completes the proof.

Appendix B: Derivation of the dual space criterion function and KKT conditions

In Section 3, we take the derivatives of Lp with respect to β and ξ to obtain two sets of equations. Plugging them back into Lp, we have


Thus we have arrived at a dual space formulation of the optimization problem that is a function of α only.

The KKT conditions associated with the above optimization programs are


This set of conditions can be written more succinctly by noting that if $\alpha_p = 0$, then $\xi_p = 0$ and $x_p^T\beta - \frac{1}{2} \ge 0$; if $0 < \alpha_p < 1$, then $\xi_p = 0$ and $x_p^T\beta - \frac{1}{2} = 0$, i.e. $x_p^T\beta = \frac{1}{2}$; if $\alpha_p = 1$, then $x_p^T\beta - \frac{1}{2} = -\xi_p \le 0$. Combining these cases, we have the following expression of the KKT conditions


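The case analysis above can be packaged as a numerical check of the combined KKT conditions for a candidate dual solution; this is a sketch, with the margins and the tolerance `tol` as illustrative assumptions:

```python
def kkt_violations(alpha, margins, tol=1e-8):
    """Count violations of the combined KKT conditions, where
    margins[p] = x_p^T beta - 1/2 and 0 <= alpha[p] <= 1 is required."""
    v = 0
    for a, m in zip(alpha, margins):
        if a < -tol or a > 1 + tol:
            v += 1                      # box constraint violated
        elif a <= tol:                  # alpha_p = 0  =>  margin >= 0
            v += m < -tol
        elif a >= 1 - tol:              # alpha_p = 1  =>  margin <= 0
            v += m > tol
        else:                           # 0 < alpha_p < 1  =>  margin = 0
            v += abs(m) > tol
    return v

print(kkt_violations([0.0, 0.5, 1.0], [0.2, 0.0, -0.3]))   # 0
```

Such a check is useful for verifying that an iterative dual solver has actually converged to a KKT point.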

1. Haynes BF, Gilbert PB, McElrath MJ, Zolla-Pazner S, Tomaras GD, Alam SM, Evans DT, Montefiori DC, Karnasuta C, Sutthent R, et al. Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine. 2012;366(14):1275–1286. [PMC free article] [PubMed]
2. Pepe M. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003. Oxford statistical science series.
3. Zhou X, Obuchowski N, McClish D. Statistical Methods in Diagnostic Medicine. Wiley-Interscience; 2002. Wiley series in probability and statistics.
4. Vapnik V. The Nature of Statistical Learning Theory. Springer Verlag; 1995.
5. Pepe M, Cai T, Longton G. Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics. 2006;62(1):221–229. [PubMed]
6. Bianco A, Yohai V. Robust estimation in the logistic regression model. Robust Statistics, Data Analysis, and Computer Intensive Methods, Lecture Notes in Statistics. 1996;109:17–34.
7. Wu Y, Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association. 2007;102(479):974–983.
8. Liu Y, Shen X. Multicategory ψ-learning. Journal of the American Statistical Association. 2006;101(474):500–509.
9. Liu Y, Shen X, Doss H. Multicategory ψ-learning and support vector machine: computational tools. Journal of Computational and Graphical Statistics. 2005;14(1):219–236.
10. Lloyd C. Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association. 1998;93(444):1356–1364.
11. Vexler A, Liu A, Schisterman E, Wu C. Note on distribution-free estimation of maximum linear separation of two multivariate distributions. Journal of Nonparametric Statistics. 2006;18(2):145–158.
12. Zhou X, Chen B, Xie Y, Tian F, Liu H, Liang X. Variable selection using the optimal ROC curve: An application to a traditional chinese medicine study on osteoporosis disease. Statistics in Medicine. 2012;31(7):628–635. [PubMed]
13. Yan L, Dodier R, Mozer M, Wolniewicz R. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. Proceedings of the Twentieth International Conference on Machine Learning; The AAAI Press; 2003.
14. Herschtal A, Raskutti B. Optimising area under the ROC curve using gradient descent. Proceedings of the Twenty-First International Conference on Machine Learning; ACM; 2004.
15. Ma S, Huang J. Combining multiple markers for classification using ROC. Biometrics. 2007;63(3):751–757. [PubMed]
16. Calders T, Jaroszewicz S. Efficient AUC optimization for classification. Knowledge Discovery in Databases: PKDD 2007. 2007:42–53.
17. Brefeld U, Scheffer T. AUC maximizing support vector learning. Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning; 2005.
18. Rakotomamonjy A. Optimizing area under ROC curve with SVMs. Proceedings of ECAI 04 ROC and Artificial Intelligence Workshop; 2004.
19. Joachims T. A support vector method for multivariate performance measures. Proceedings of the 22nd International Conference on Machine Learning; ACM; 2005. pp. 377–384.
20. Wang Y, Chen H, Li R, Duan N, Lewis-Fernández R. Prediction-based structured variable selection through the receiver operating characteristic curves. Biometrics. 2011;67(3):896–905. [PMC free article] [PubMed]
21. Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics. 2010;11(1):314. [PMC free article] [PubMed]
22. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2. Springer; 2009. Springer Series in Statistics.
23. Pepe M, Thompson M. Combining diagnostic test results to increase accuracy. Biostatistics. 2000;1(2):123–140. [PubMed]
24. Lin H, Zhou L, Peng H, Zhou X. Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics. 2011;39(2):324–343.
25. Gammerman A. Computational Learning and Probabilistic Reasoning. John Wiley & Sons, Inc; 1996.
26. Sherman R. The limiting distribution of the maximum rank correlation estimator. Econometrica: Journal of the Econometric Society. 1993;61(1):123–137.
27. An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by dc algorithms. Journal of Global Optimization. 1997;11(3):253–285.
28. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
29. Burges C. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery. 1998;2(2):121–167.
30. Aizerman A, Braverman E, Rozoner L. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control. 1964;25:821–837.
31. Mercer J. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 1909;209:415–446.
32. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. Proceedings of the fifth annual workshop on Computational learning theory; ACM; 1992. pp. 144–152.
33. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273–297.
34. Joachims T. Learning to Classify Text Using Support Vector Machines. Vol. 668. Springer; Netherlands: 2002.
35. Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares M, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences. 2000;97(1):262. [PubMed]
36. Rerks-Ngarm S, Pitisuttithum P, Nitayaphan S, Kaewkungwal J, Chiu J, Paris R, Premsri N, Namwat C, de Souza M, Adams E, et al. Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. New England Journal of Medicine. 2009;361(23):2209–2220. [PubMed]
37. Wahba G, Lin Y, Zhang H. Margin-like quantities and generalized approximate cross validation for support vector machines. Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop; IEEE; 1999. pp. 12–20.
38. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; Berlin Heidelberg: 2011. Springer Series in Statistics.
39. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
40. Fung G, Mangasarian O. A feature selection Newton method for support vector machine classification. Computational Optimization and Applications. 2004;28(2):185–202.
41. Mangasarian OL, Kou G. Feature selection for nonlinear kernel support vector machines. Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE International Conference on; IEEE; 2007. pp. 231–236.
42. Ni X, Zhang D, Zhang H. Variable selection for semiparametric mixed models in longitudinal studies. Biometrics. 2010;66(1):79–88. [PMC free article] [PubMed]
43. Savitsky T, Vannucci M, Sha N. Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science. 2011;26(1):130–149. [PMC free article] [PubMed]
44. Han A. Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator. Journal of Econometrics. 1987;35(2–3):303–316.
45. Nolan D, Pollard D. u-processes: Rates of convergence. The Annals of Statistics. 1987;15(2):780–799.