Article sections

- Abstract
- 1. Introduction
- 2. The Ramp AUC Estimator
- 3. Optimization and Linear and Nonlinear Marker Combination
- 4. Simulation studies
- 5. Data example
- 6. Concluding Remarks
- Supplementary Material
- References


Stat Med. Author manuscript; available in PMC 2017 September 20.

Published in final edited form as: Stat Med. 2016 September 20; 35(21): 3792–3809. Published online 2016 April 5. doi: 10.1002/sim.6956. PMCID: PMC4965290. NIHMSID: NIHMS768451.

In biomedical studies, it is often of interest to classify or predict a subject’s disease status based on a variety of biomarker measurements. A commonly used classification criterion is the AUC, the area under the receiver operating characteristic (ROC) curve. Many methods have been proposed to optimize approximated empirical AUC criteria, but the existing methods have two limitations. First, most methods are designed only to find the best linear combination of biomarkers, which may not perform well when there is strong nonlinearity in the data. Second, many existing linear combination methods use gradient-based algorithms to find the best marker combination, which often converge to sub-optimal local solutions. In this paper, we address these two problems by proposing a new kernel-based AUC optimization method called Ramp AUC (RAUC). This method approximates the empirical AUC loss function with a ramp function and finds the best combination by a difference of convex functions algorithm. We show that as a linear combination method, RAUC leads to a consistent and asymptotically normal estimator of the linear marker combination when the data are generated from a semiparametric generalized linear model, just as the smoothed AUC (SAUC) method does. Through simulation studies and real data examples, we demonstrate that RAUC outperforms SAUC in finding the best linear marker combinations, and can successfully capture nonlinear patterns in the data to achieve better classification performance. We illustrate our method with a dataset from a recent HIV vaccine trial.

In many areas of biomedical research, biomarkers can play important roles in classifying a subject or a sample into two or more categories or predicting the subject/sample’s probability of being in each category. For example, a key research component in developing a vaccine for an infectious disease is the so-called ‘immune correlates study,’ which seeks to identify immune biomarkers that are associated with the risk of infection [1]. Finding immune correlates helps us understand the biological mechanism of vaccine protection and reduces the cost of future vaccine efficacy trials. In practice, a single biomarker often has limited classification/prediction performance, making it desirable to combine multiple biomarkers for better performance.

In this paper we assume there are two outcome categories: the case/diseased group and the control/non-diseased group, denoted by *D* and *D̄*, respectively. We consider how to combine multiple markers to achieve the best classification performance. Many classification methods have been developed. We will focus on a class of methods that assign a scalar value, called a score, to each subject. Subjects with scores higher than a threshold are then classified as cases, while subjects with lower scores are classified as controls. This is a flexible class of methods, since the score can be expressed as a linear or nonlinear combination of the input variables. It does, however, exclude methods such as the artificial neural network and the classic decision tree.

Several performance criteria have been proposed to evaluate classification/prediction procedures. Two basic criteria are sensitivity and specificity. Sensitivity is the probability of correctly identifying a case subject; specificity is the probability of correctly classifying a control subject. Classification error combines the two by weighting 1-sensitivity and 1-specificity according to the probabilities of cases and controls. Both sensitivity and specificity depend on the choice of threshold. When we vary the threshold from −∞ to ∞, we get pairs of sensitivity and specificity. A plot of sensitivity versus 1-specificity is called the Receiver Operating Characteristic (ROC) curve. See [2] and [3] for a review of the ROC methodology. The area under the ROC curve (AUC) is a commonly used criterion for a model’s overall classification performance. It is threshold-independent and equals the probability that a randomly chosen case score is greater than a randomly chosen control score.

Two broad groups of scoring methods have been proposed for AUC-based
classification. The first group of methods optimizes a criterion function that can be
written as the sum of *n* terms, where *n* is the sample size;
on the other hand, the criterion function used in the second group of methods is the sum of
*O*(*n*^{2}) terms. We refer to the first group as
the single-indexed approach and the second group as the double-indexed approach or the AUC
approach. Under this classification, both logistic regression, which optimizes a likelihood
function that depends on the marker combination, and Support Vector Machine (SVM)
[4], which optimizes a hinge loss
function, are single-indexed methods. On the other hand, the AUC approach seeks to maximize
the empirical AUC, or equivalently, to minimize the empirical AUC loss defined as:

$$\frac{1}{n_D n_{\bar{D}}}\sum_{i=1}^{n_D}\sum_{j=1}^{n_{\bar{D}}}\left[1-I\left\{(\mathbf{x}_i-\mathbf{x}_j)^T\boldsymbol{\beta}>0\right\}\right]\equiv\frac{1}{n_D n_{\bar{D}}}\sum_{p=1}^{n_D n_{\bar{D}}}\left\{1-I\left(\eta_p^{\Delta}>0\right)\right\},$$

(1)

where $n_D$ and $n_{\bar{D}}$ are the numbers of cases and controls, $\mathbf{x}_i$ and $\mathbf{x}_j$ are the marker vectors of the *i*th case and the *j*th control, $\boldsymbol{\beta}$ is the vector of combination coefficients, and $\eta_p^{\Delta}=(\mathbf{x}_{i_p}-\mathbf{x}_{j_p})^T\boldsymbol{\beta}$ indexes the case/control pairs $p=1,\ldots,n_D n_{\bar{D}}$.

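To make (1) concrete, here is a minimal numerical sketch of the empirical AUC loss; the function name and toy data are ours, not from the paper:

```python
import numpy as np

def empirical_auc_loss(X_case, X_control, beta):
    """Fraction of case/control pairs whose case score fails to
    exceed the control score; 1 minus this is the empirical AUC."""
    s_case = X_case @ beta                     # scores for the n_D cases
    s_ctrl = X_control @ beta                  # scores for the n_Dbar controls
    eta = s_case[:, None] - s_ctrl[None, :]    # eta_p^Delta over all pairs
    return float(np.mean(eta <= 0))

rng = np.random.default_rng(0)
X_case = rng.normal(1.0, 1.0, size=(50, 2))    # cases shifted upward
X_ctrl = rng.normal(0.0, 1.0, size=(60, 2))
loss = empirical_auc_loss(X_case, X_ctrl, np.array([1.0, 1.0]))
```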
When the training data sample size is large, the empirical AUC approaches the population AUC, and the AUC approach is expected to perform better than the single-indexed approach in general when AUC is chosen as the classification performance criterion [5]. When the training data sample size is small to moderate, the scenario we are most concerned with, the single-indexed approach may be more efficient, because the data are not reduced to ranks as in the AUC approach. However, the AUC approach is more robust to outliers than the single-indexed approach. Methods like logistic regression and SVM are sensitive to outliers because the contribution of each individual observation to the criterion function is unbounded. Even though many proposals have been made to bend the influence of outlying observations to obtain robust methods, e.g., an M-estimator-type robust logistic regression method by Bianco and Yohai (1996) [6] and robust statistical learning methods [7, 8, 9], these methods all involve nuisance parameters in their criterion functions, which determine the degree of bending. On the other hand, the robustness of the AUC approach is a result of directly optimizing the criterion function used to evaluate the classification procedure, and there is no need to tune a nuisance parameter related to robustness.

The AUC approach is more challenging than the single-indexed approach in terms of optimization. For example, the logistic likelihood function is smooth and convex, but the empirical AUC function is neither. There have been two main groups of AUC-based methods. The majority of the AUC-based methods are the smoothed AUC (SAUC) methods. These methods approximate the step function in the empirical AUC loss with a smooth sigmoid function $S_s(\cdot)$, such as a normal CDF with scale parameter *s* (Figure 1a), and then optimize the smoothed criterion with gradient-based algorithms.

Figure 1. (a) Different approximations of the empirical AUC loss function. (b) RAUC can be represented as a difference of two convex functions.

A second group of methods in the AUC approach is termed the Support Vector
Machine-AUC (SVM-AUC) method [17, 18, 19, 20]. These methods replace the step function 1
− *I*(*x* > 0) in the AUC loss function by a hinge
loss function (1 − *x*)_{+} (Figure 1b), thus seeking to minimize a convex upper bound of the AUC
loss function. The resulting optimization problem is a convex optimization problem and can
be solved by support vector machine software. Because the contribution of the difference in
a pair of case/control scores can increase infinitely, SVM-AUC does not approximate the AUC
loss function well. This problem is particularly serious when the training data is
contaminated with outliers.

Most of the AUC-based methods developed so far focus on identifying linear combinations of markers, with the exception of the boosting approach used in [21]. While providing a simple summary of the relationship among markers, a linear combination is limited in classification performance. One way to improve classifier performance is to map the markers to a feature space through basis expansion, such as polynomials or splines. Choosing the right expansion is, however, nontrivial and requires a substantial amount of experience in data modeling and domain knowledge.

In this paper, we propose a new AUC-maximization method for identifying marker combinations. We start by proposing a new loss function in Section 2, which provides a better approximation to the AUC criterion function than the SVM-AUC methods. We study the asymptotic theory for the estimated linear combination using the proposed method. In Section 3, we develop a difference-of-convex functions algorithm to find the best linear combination. The algorithm avoids some of the pitfalls associated with gradient-based methods used in SAUC methods. We further show that the algorithm lends itself naturally to the application of the so-called ‘kernel trick’ ([22], Chapters 5 and 12), which allows it to find the best linear combination in a higher (often infinite) dimensional feature space without needing to explicitly specify the mapping from the input markers to the features. In Section 4, we present simulation studies to study the comparative performance of the proposed method against common existing methods. In Section 5, we illustrate the application of the proposed method with an immune biomarker dataset from a recent HIV vaccine trial conducted in Thailand. We make concluding remarks in Section 6.

To provide a new approach to finding AUC-based marker combinations that can be optimized effectively without gradient-based methods and can handle nonlinear combinations, we propose to approximate the step function, 1 − *I*(*x* > 0), in the empirical AUC loss (1) by a ramp function:

$$h_s(x)=\begin{cases}1 & \text{if } x/s\le-\tfrac{1}{2}\\[2pt] \tfrac{1}{2}-x/s & \text{if } -\tfrac{1}{2}<x/s<\tfrac{1}{2}\\[2pt] 0 & \text{if } x/s\ge\tfrac{1}{2},\end{cases}$$

where *s* > 0 is a scale parameter (Figure 1). We call $\sum_{p=1}^{n_D n_{\bar{D}}}h_s(\eta_p^{\Delta})/(n_D n_{\bar{D}})$ the Ramp AUC (RAUC) loss function. For any given $\eta_p^{\Delta}\ne 0$, the value of $h_s(\eta_p^{\Delta})$ moves toward 0 or 1 as *s* → 0 and reaches 0 or 1 once $|\eta_p^{\Delta}/s|\ge 1/2$. It is useful to keep track of the proportion of the $h_s(\eta_p^{\Delta})$ equal to 0 or 1, termed the saturation ratio, since it reflects the degree of approximation.
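A short sketch of the ramp function and the RAUC loss with its saturation ratio (function names are ours):

```python
import numpy as np

def ramp(x, s=1.0):
    """h_s(x): equals 1 for x/s <= -1/2, 0 for x/s >= 1/2, linear between."""
    return np.clip(0.5 - np.asarray(x, dtype=float) / s, 0.0, 1.0)

def rauc_loss(eta, s=1.0):
    """RAUC loss over the pairwise differences eta_p^Delta, plus the
    saturation ratio: the proportion of terms already equal to 0 or 1."""
    h = ramp(eta, s)
    return float(h.mean()), float(np.mean((h == 0.0) | (h == 1.0)))
```

As *s* shrinks, the saturation ratio approaches 1 and the RAUC loss approaches the empirical AUC loss.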

There are identifiability issues with all methods in the AUC approach. The problem with minimizing the empirical AUC loss function arises from the fact that 1 − *I*(*x* > 0) = 1 − *I*(*cx* > 0) for any positive scalar constant *c*, so $\boldsymbol{\beta}$ is identifiable only up to scale. Minimizing the SAUC loss has a similar identifiability issue because the scale parameter *s* and the norm of $\boldsymbol{\beta}$ are confounded: rescaling $\boldsymbol{\beta}$ by *c* and *s* by *c* leaves the loss unchanged.

Minimizing RAUC has the same identifiability problem, because $h_s(c\,\eta_p^{\Delta})=h_{s/c}(\eta_p^{\Delta})$. Without loss of generality, we set *s* = 1 and write *h* for $h_1$, and we resolve the scale of $\boldsymbol{\beta}$ by adding an $L_2$ penalty:

$$\min_{\boldsymbol{\beta}}\ \sum_{p=1}^{n_D n_{\bar{D}}}h(\eta_p^{\Delta})+\frac{1}{2}\lambda_n\|\boldsymbol{\beta}\|_2^2.$$

(2)

This effectively minimizes the RAUC loss while constraining $\|\boldsymbol{\beta}\|_2^2$ to a constant. But instead of specifying how big $\|\boldsymbol{\beta}\|_2^2$ needs to be, we use $\lambda_n$ as a lever to adjust it: the larger $\lambda_n$ is, the smaller $\|\widehat{\boldsymbol{\beta}}\|_2$ becomes, and vice versa.

We prefer constraints on the $L_2$-norm over constraints on an anchor variable coefficient because, as has been pointed out by others [15, 24], there are times when it is difficult to select an anchor variable. Constraining $\|\boldsymbol{\beta}\|_2$ treats all markers symmetrically and does not require singling out one marker whose coefficient is fixed.

Consider the statistical model Pr(*Y* = 1 | **X**) = $G(\alpha+\mathbf{X}^T\boldsymbol{\beta}_0)$, a semiparametric generalized linear model in which the link function *G* is unknown and increasing. Let $\theta_0$ denote the vector of coefficients that remains identifiable after scale normalization, and let $\widehat{\theta}$ denote the corresponding RAUC estimator.

*Theorem 1. Under certain regularity conditions and $\lambda_n=o_p(n^2)$, $\widehat{\theta}\stackrel{p}{\to}\theta_0$ as n → ∞.*

*Theorem 2. Let $\mathcal{X}$ denote the last d − 1 components of **X**, $\eta_0=\alpha+\mathbf{X}^T\boldsymbol{\beta}_0$, and let $g_0$ denote the marginal density of $\eta_0$. Denote*

$$\begin{aligned}V&=\mathbb{E}_{\eta_0}\left\{\beta_{0,1}^{2}\,\mathrm{Var}(\mathcal{X}\mid\eta_0)\,(Y-G(\eta_0))^{2}\,g_0(\eta_0)^{2}\right\}\\ \Delta&=-\mathbb{E}_{\eta_0}\left\{\beta_{0,1}^{2}\,\mathrm{Var}(\mathcal{X}\mid\eta_0)\,G'(\eta_0)\,g_0(\eta_0)\right\}.\end{aligned}$$

*Suppose assumptions (A1)–(A4) hold and $\lambda_n$ grows at the rate given in Appendix A. Then*

$$n^{1/2}(\widehat{\theta}-\theta_0)\stackrel{d}{\to}N(0,\Delta^{-1}V\Delta^{-1})$$

*as n → ∞.*

The assumptions and proofs of the theorems mirror those of [26] and [15] and can be found in Appendix A. Based on the proofs, it is clear that the asymptotic variance of the RAUC estimator of the best linear combination should be the same as that of the empirical AUC estimator and the smoothed AUC estimator. However, our asymptotic variance formula differs from that of [26] and [15] by a factor of 4. The difference can be traced to the expectation of the second derivative of the kernel of the empirical process in Lemma 1 of Appendix A.

To consistently estimate the asymptotic variance, one can either use numerical derivatives as suggested by Sherman [26] in the context of maximizing empirical AUC or use the following empirical smoothed estimate in the spirit of [15]:

$$\begin{aligned}\widehat{V}&=\frac{1}{n^{3}}\widehat{\beta}_{1}^{2}\sum_{i=1}^{n_D}\Pi\!\left(\sum_{j=1}^{n_{\bar{D}}}S'\{(\mathbf{x}_i-\mathbf{x}_j)^T\widehat{\boldsymbol{\beta}}\}(\boldsymbol{\chi}_i-\boldsymbol{\chi}_j)\right)+\frac{1}{n^{3}}\widehat{\beta}_{1}^{2}\sum_{j=1}^{n_{\bar{D}}}\Pi\!\left(\sum_{i=1}^{n_D}S'\{(\mathbf{x}_i-\mathbf{x}_j)^T\widehat{\boldsymbol{\beta}}\}(\boldsymbol{\chi}_i-\boldsymbol{\chi}_j)\right)\\ \widehat{\Delta}&=\frac{1}{n^{2}}\widehat{\beta}_{1}^{2}\sum_{i=1}^{n_D}\sum_{j=1}^{n_{\bar{D}}}S''\{(\mathbf{x}_i-\mathbf{x}_j)^T\widehat{\boldsymbol{\beta}}\}(\boldsymbol{\chi}_i-\boldsymbol{\chi}_j)(\boldsymbol{\chi}_i-\boldsymbol{\chi}_j)^T,\end{aligned}$$

where Π(**x**) = **xx**^T, *S* is a smooth sigmoid function such as the standard normal CDF with derivatives *S*′ and *S*″, and $\boldsymbol{\chi}_i$ denotes the last *d* − 1 components of $\mathbf{x}_i$.

The penalized RAUC loss (2) is a
non-convex, non-smooth function. To find the best combination, we write the ramp function
*h* as the difference between two convex functions:
*h*(*x*) =
*h*_{1}(*x*) −
*h*_{2}(*x*) (Figure
1b)

$$h_1(x)=\left(\frac{1}{2}-x\right)_+,\qquad h_2(x)=\left(-\frac{1}{2}-x\right)_+.$$

This allows the application of difference of convex functions algorithm (DCA)
[27, 9]. The essence of DCA is to use a series of convex optimization
problems to approximate the non-convex problem. This is achieved by iteratively
approximating *h*_{2} with a first order Taylor expansion; there is
no need to approximate *h*_{1} because the difference between a
convex function and a linear function is a convex function. The optimization proceeds as
follows:

- Step 1. Start with an initial guess for $\eta_p^{\Delta}$ and assign it to $\eta_p^{\Delta,0}$. For example, we can use a robust logistic regression to get an initial estimate $\widehat{\theta}^{rob}$ and let $\eta_p^{\Delta,0}\leftarrow(\mathbf{x}_{i_p}-\mathbf{x}_{j_p})^T\left(1,\widehat{\theta}^{rob}\right)^T$.
- Step 2. Solve
  $$\widehat{\boldsymbol{\beta}}=\operatorname*{argmin}_{\boldsymbol{\beta}}\ \sum_{p=1}^{n_D n_{\bar{D}}}\left\{h_1(\eta_p^{\Delta})-\widehat{h}_2(\eta_p^{\Delta},\eta_p^{\Delta,0})\right\}+\frac{1}{2}\lambda_n\|\boldsymbol{\beta}\|_2^2,\qquad(3)$$
  where $\widehat{h}_2(\eta_p^{\Delta},\eta_p^{\Delta,0})=h_2(\eta_p^{\Delta,0})+h_2'(\eta_p^{\Delta,0})\,\eta_p^{\Delta}$, and $h_2'(x)=-I(x<-1/2)$ is the first derivative of $h_2(x)$ with respect to *x*, taken to be 0 at *x* = −1/2.
- Step 3. Set $\eta_p^{\Delta,0}\leftarrow(\mathbf{x}_{i_p}-\mathbf{x}_{j_p})^T\widehat{\boldsymbol{\beta}}$ and go back to Step 2 until the change in the penalized RAUC loss is less than a pre-specified constant.
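The three steps can be sketched as follows. This is a simplified illustration: we solve the convex Step 2 subproblem with a generic optimizer rather than via its quadratic programming dual, and the initializer is a plain guess instead of robust logistic regression. In exact arithmetic, each convex surrogate solve makes the penalized RAUC loss non-increasing.

```python
import numpy as np
from scipy.optimize import minimize

def h1(x):  return np.maximum(0.5 - x, 0.0)      # convex part
def h2(x):  return np.maximum(-0.5 - x, 0.0)     # convex part to subtract
def h2p(x): return -(x < -0.5).astype(float)     # h2'(x), taken as 0 at -1/2

def rauc_dca(X_diff, lam=1.0, max_iter=30, tol=1e-8):
    """DCA for the penalized RAUC loss; rows of X_diff are x_p^Delta."""
    beta = np.ones(X_diff.shape[1])              # crude initial guess
    prev = np.inf
    for _ in range(max_iter):
        g = h2p(X_diff @ beta)                   # linearize h2 at eta^{Delta,0}

        def surrogate(b):                        # convex Step 2 objective
            eta = X_diff @ b
            return np.sum(h1(eta) - g * eta) + 0.5 * lam * b @ b

        beta = minimize(surrogate, beta, method="Nelder-Mead").x
        eta = X_diff @ beta
        loss = np.sum(h1(eta) - h2(eta)) + 0.5 * lam * beta @ beta
        if prev - loss < tol:
            break
        prev = loss
    return beta

rng = np.random.default_rng(1)
X_case = rng.normal(1.0, 0.5, size=(15, 2))
X_ctrl = rng.normal(0.0, 0.5, size=(15, 2))
D = (X_case[:, None, :] - X_ctrl[None, :, :]).reshape(-1, 2)
beta_hat = rauc_dca(D, lam=1.0)
```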

In Step 2 of the algorithm above, we solve a convex optimization problem. As $h_1$ and $\widehat{h}_2$ are not smooth functions, the standard approach is to convert it to a constrained smooth optimization problem by introducing slack variables $\xi_p$ to replace $h_1(\eta_p^{\Delta})$:
$$\begin{aligned}&\min_{\boldsymbol{\xi},\boldsymbol{\beta}}\ \sum_{p=1}^{n_D n_{\bar{D}}}\left\{\xi_p-h_2'(\eta_p^{\Delta,0})\,\eta_p^{\Delta}\right\}+\frac{1}{2}\lambda_n\|\boldsymbol{\beta}\|_2^2\\ &\text{subject to }\ \xi_p\ge\frac{1}{2}-\eta_p^{\Delta},\quad\xi_p\ge 0,\end{aligned}$$

(4)

where we have dropped terms free of $\boldsymbol{\xi}$ and $\boldsymbol{\beta}$. The Lagrangian of problem (4) is:
$$L_p=\sum_{p=1}^{n_D n_{\bar{D}}}\left\{\xi_p-h_2'(\eta_p^{\Delta,0})\,\eta_p^{\Delta}\right\}+\frac{1}{2}\lambda_n\|\boldsymbol{\beta}\|_2^2-\sum_p\alpha_p\left(\xi_p+\eta_p^{\Delta}-\frac{1}{2}\right)-\sum_p\gamma_p\xi_p,$$

where $\boldsymbol{\alpha}$ and $\boldsymbol{\gamma}$ are vectors of nonnegative Lagrange multipliers. Setting the derivatives of $L_p$ with respect to $\boldsymbol{\beta}$ and $\xi_p$ to zero gives:
$$\begin{aligned}\boldsymbol{\beta}&=\frac{1}{\lambda_n}\sum_{p=1}^{n_D n_{\bar{D}}}\left\{h_2'(\eta_p^{\Delta,0})+\alpha_p\right\}\mathbf{x}_p^{\Delta}\\ 1&=\alpha_p+\gamma_p.\end{aligned}$$

(5)

Plugging these back into $L_p$, the terms involving $\boldsymbol{\xi}$ cancel, and up to an additive constant the dual problem becomes:
$$\begin{aligned}&\min_{\boldsymbol{\alpha}}\ \boldsymbol{\alpha}^T Q\boldsymbol{\alpha}-\mathbf{b}^T\boldsymbol{\alpha}\\ &\text{subject to }\ 0\le\alpha_p\le 1,\end{aligned}$$

(6)

where *Q* is a square matrix whose $[p,p']^{th}$ element is $\langle\mathbf{x}_p^{\Delta},\mathbf{x}_{p'}^{\Delta}\rangle/\lambda_n$, and $\mathbf{b}=\mathbf{1}-2\times Q\times\mathbf{h}_2'(\boldsymbol{\eta}^{\Delta,0})$. The optimization problem (6) has a set of simple box constraints and can be solved by many quadratic programming methods. Once we solve the dual problem, the solution to the primal problem is recovered through (5).
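Because (6) has only box constraints, a bound-constrained quasi-Newton solver suffices for a sketch; the Q and b below are arbitrary toy values, not from the paper:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(Q, b):
    """min_alpha alpha^T Q alpha - b^T alpha  subject to 0 <= alpha_p <= 1."""
    P = len(b)
    obj  = lambda a: a @ Q @ a - b @ a
    grad = lambda a: 2.0 * Q @ a - b          # gradient for symmetric Q
    res = minimize(obj, np.full(P, 0.5), jac=grad,
                   method="L-BFGS-B", bounds=[(0.0, 1.0)] * P)
    return res.x

rng = np.random.default_rng(2)
A = rng.normal(size=(6, 3))
Q = A @ A.T / 6.0 + 1e-6 * np.eye(6)          # symmetric positive definite
b = rng.normal(size=6)
alpha = solve_dual(Q, b)
```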

Having found the best combination, to classify a new observation we will use the score:

$$\eta_{new}=\sum_{p=1}^{n_D n_{\bar{D}}}\left\{h_2'(\eta_p^{\Delta,0})+\alpha_p\right\}\langle\mathbf{x}_{new},\mathbf{x}_p^{\Delta}\rangle,$$

(7)

where $\boldsymbol{\eta}^{\Delta,0}$ and $\boldsymbol{\alpha}$ come from the last iteration of DCA. Similar to the SVM literature [29], we call those $\mathbf{x}_p^{\Delta}$ with $h_2'(\eta_p^{\Delta,0})+\alpha_p=0$ non-support pairs because they do not contribute to the score. There are usually a great number of non-support pairs, which means only some of the inner products $\langle\mathbf{x}_{new},\mathbf{x}_p^{\Delta}\rangle$ need to be computed when scoring a new observation.

As the best classifiers may not be among linear combinations of the input markers, we wish to enlarge the feature space via basis expansion and find the best linear combination in the enlarged feature space. For example, if we have two markers $(x_1, x_2)$, we go beyond the linear trends and look at the second-order trends as well as the interaction between the two markers. Then we work with the feature space $(x_1, x_2, x_1^2, x_2^2, x_1 x_2)$. Let $\boldsymbol{\varphi}_i$ denote the feature vector for subject *i* and define $\boldsymbol{\varphi}_p^{\Delta}=\boldsymbol{\varphi}_{i_p}-\boldsymbol{\varphi}_{j_p}$. The derivation above carries over with $\mathbf{x}$ replaced by $\boldsymbol{\varphi}$; in particular, the inner products that make up *Q* can be expanded as:
$$\langle {\mathit{\varphi}}_{p}^{\mathrm{\Delta}},{\mathit{\varphi}}_{{p}^{\prime}}^{\mathrm{\Delta}}\rangle =\langle {\mathit{\varphi}}_{{i}_{p}},{\mathit{\varphi}}_{{i}_{{p}^{\prime}}}\rangle +\langle {\mathit{\varphi}}_{{j}_{p}},{\mathit{\varphi}}_{{j}_{{p}^{\prime}}}\rangle -\langle {\mathit{\varphi}}_{{i}_{p}},{\mathit{\varphi}}_{{j}_{{p}^{\prime}}}\rangle -\langle {\mathit{\varphi}}_{{j}_{p}},{\mathit{\varphi}}_{{i}_{{p}^{\prime}}}\rangle .$$

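This expansion means *Q* can be assembled from three kernel blocks (case–case, control–control, case–control) without ever forming the features; a sketch with names of our choosing:

```python
import numpy as np

def pair_gram(K_cc, K_uu, K_cu, lam=1.0):
    """Q[p, p'] = <phi_p^Delta, phi_{p'}^Delta> / lambda for pairs p = (i_p, j_p),
    ordered with the control index varying fastest.
    K_cc: case-case kernel block; K_uu: control-control; K_cu: case-control."""
    nD, nU = K_cu.shape
    Q4 = (K_cc[:, None, :, None]        # K(i_p,  i_p')
          + K_uu[None, :, None, :]      # K(j_p,  j_p')
          - K_cu[:, None, None, :]      # K(i_p,  j_p')
          - K_cu.T[None, :, :, None])   # K(j_p,  i_p')
    return Q4.reshape(nD * nU, nD * nU) / lam
```

With a linear kernel this reproduces the Gram matrix of the explicit difference vectors $\mathbf{x}_p^{\Delta}$.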
Having solved the dual problem, there is no need to translate it back to the primal space, because Step 3 only calls for $\widehat{\boldsymbol{\eta}}^{\Delta}$, which has the expression:

$$\widehat{\boldsymbol{\eta}}^{\Delta}=\left(\boldsymbol{\varphi}_1^{\Delta},\cdots,\boldsymbol{\varphi}_{n_D n_{\bar{D}}}^{\Delta}\right)^T\widehat{\boldsymbol{\beta}}=Q\times\left\{\mathbf{h}_2'(\boldsymbol{\eta}^{\Delta,0})+\widehat{\boldsymbol{\alpha}}\right\}.$$

That is, even though $\boldsymbol{\varphi}_i$ may be infinite-dimensional, $\widehat{\boldsymbol{\eta}}^{\Delta}$ depends on the features only through the inner products collected in *Q*. Likewise, the score for a new observation is:
$$\begin{array}{l}{\eta}^{\mathit{new}}=\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}\{{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta},0})+{\alpha}_{p}\}\phantom{\rule{0.16667em}{0ex}}\langle \mathit{\varphi}\phantom{\rule{0.16667em}{0ex}}({\mathit{x}}_{\mathit{new}}),{\mathit{\varphi}}_{p}^{\mathrm{\Delta}}\rangle \\ =\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}\{{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta},0})+{\alpha}_{p}\}\phantom{\rule{0.16667em}{0ex}}\left\{\langle \mathit{\varphi}\phantom{\rule{0.16667em}{0ex}}({\mathit{x}}_{\mathit{new}}),{\mathit{\varphi}}_{{i}_{p}}\rangle -\langle \mathit{\varphi}\phantom{\rule{0.16667em}{0ex}}({\mathit{x}}_{\mathit{new}}),{\mathit{\varphi}}_{{j}_{p}}\rangle \right\},\end{array}$$

where again
*η*^{Δ,0} and
** α** come from the last iteration of DCA.

This approach of mapping the input vector to a feature space without having to specify the mapping explicitly has been called the ‘kernel trick’ in the SVM literature [30, 22]. Let *K* be a symmetric continuous function that maps $\{\mathbf{x}_i\}\times\{\mathbf{x}_j\}$ to ℝ. If *K* is positive semi-definite, there exists a feature mapping $\boldsymbol{\varphi}$ such that $K(\mathbf{x}_i,\mathbf{x}_j)=\langle\boldsymbol{\varphi}(\mathbf{x}_i),\boldsymbol{\varphi}(\mathbf{x}_j)\rangle$, so every inner product above can be computed as a kernel evaluation without ever forming $\boldsymbol{\varphi}$ explicitly.

In this simulation study, we examine the performance of RAUC in finding the best linear combination of markers. Two covariates are simulated from a mixture of two bivariate normal random variables

$$\begin{pmatrix}X_1\\ X_2\end{pmatrix}=(1-\Delta)\times N\left(0,\,0.2\times\begin{bmatrix}1&0.9\\ 0.9&1\end{bmatrix}\right)+\Delta\times N\left(0,\,2\times\begin{bmatrix}1&0\\ 0&1\end{bmatrix}\right),$$

where Δ is a Bernoulli variable with mean
*π*. The outcome variable *Y* is generated as
Bernoulli random variables with means specified by logit {Pr (*Y*
= 1)} = 4*X*_{1} −
3*X*_{2} − (*X*_{1} −
*X*_{2})^{3}. When no outliers are simulated,
*π* is set to 0; when outliers are simulated,
*π* is set to 0.05. The sample size is 200 for training data and
10^{4} for test data. Figure 2 shows two
training datasets, one with and one without outliers. The plots suggest that this
simulation scenario mimics the real data example that served as a motivating example in [5].
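The data-generating mechanism of this section can be sketched as follows (function name ours):

```python
import numpy as np

def simulate_mixture(n, pi, rng):
    """Section 4.1 generator: bivariate normal mixture covariates with
    outlier probability pi, and a cubic logit for the outcome."""
    delta = rng.binomial(1, pi, size=n)                       # outlier indicator
    bulk = rng.multivariate_normal(
        np.zeros(2), 0.2 * np.array([[1.0, 0.9], [0.9, 1.0]]), size=n)
    outl = rng.multivariate_normal(np.zeros(2), 2.0 * np.eye(2), size=n)
    X = np.where(delta[:, None] == 1, outl, bulk)
    logit = 4 * X[:, 0] - 3 * X[:, 1] - (X[:, 0] - X[:, 1]) ** 3
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, Y

X_train, y_train = simulate_mixture(200, 0.05, np.random.default_rng(3))
```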

Figure 2. Sample training datasets from Section 4.1. Cases are plotted with filled circles, and controls are plotted with empty circles. Best linear combinations from three methods are shown as lines.

We compare RAUC with three other classification methods. The logistic method fits a logistic regression model to the training data and uses the fitted model to score the test data. For the SAUC method, we choose the normal CDF as the sigmoid function and constrain $\|\boldsymbol{\beta}\|_2$ to fix the scale of the combination; the smoothed criterion is then optimized with a gradient-based algorithm. For SVM, we use a linear kernel.
Summary results from 1000 simulations are shown in Table 1. When there are no outliers in the training data, all
methods perform similarly. When there are outliers in the training data, the logistic
regression method may break down as illustrated by Figure
2. Table 1 shows that, when the test data
does not include outliers, the average test data AUC for the logistic regression method
and the SAUC method is 0.639 and 0.667, respectively. Because the AUC measure has a narrow range, going from 0.5 for a useless classification method to 1 for a perfect classification method, we will look at the relative increase in effective AUC, defined as AUC − 0.5, when comparing two methods. By this measure, SAUC shows a relative increase in effective AUC of (0.667 − 0.5)/(0.639 − 0.5) − 1 = 20% over the logistic regression method. By this same measure,
RAUC* ^{l}*, the RAUC method with linear kernel, shows a
relative increase in effective AUC of (0.681 − 0.5)/(0.667 − 0.5)
− 1 = 8% over SAUC when the test data contains no outliers. When
the test data is simulated with outliers, we see a similar pattern of RAUC outperforming
SAUC and SAUC outperforming logistic regression. Since the loss functions minimized by
RAUC and SAUC both approximate the AUC loss, the difference in performance between the two
can be attributed to the difference in optimization algorithm and suggests that the
difference-of-convex functions algorithm employed by RAUC may be less likely to be stuck
in local optima than the gradient-based algorithm used by SAUC. Finally, as we would expect, the performance of SVM^*l* deteriorates when the training data contains outliers, since the hinge loss gives outlying observations unbounded influence.

Table 1. Test data AUC from Section 4.1; mean and se: Monte Carlo mean and standard error. SVM^l: SVM with linear kernel. RAUC^l: RAUC with linear kernel.

Using the knowledge of logit {Pr (*Y* =
1)} to classify cases and controls gives us a theoretical AUC that represents an
upper bound on the classification performance. From a dataset of 500,000 data points, the
theoretical AUC is 0.704 under the ‘no outliers’ scenario and 0.720 under
the ‘with outliers’ scenario. Figure A.1 in the Supplementary Materials shows
the density curves for the distributions of logit {Pr (*Y*
= 1)} in cases and controls under the two scenarios. Comparing the
theoretical AUC to Table 1, we see that a linear
combination of *X*_{1} and *X*_{2} can
achieve close to theoretical AUC under the ‘no outliers’ scenario but not
under the ‘with outliers’ scenario. This makes sense because when there
are no outliers the values of *X*_{1} and
*X*_{2} are very close to 0 and the higher order term
(*X*_{1} − *X*_{2})^{3} is
nearly ignorable. Figure A.2 in the Supplementary Materials plots the joint distributions of $(\widehat{\beta}_1,\widehat{\beta}_2)$ and the distributions of the ratio $\widehat{\beta}_2/\widehat{\beta}_1$ from RAUC^*l*. The figure shows the finite-sample distributions of these estimates.
We repeat this simulation study and replace the first component of the mixture
distribution of (*X*_{1}, *X*_{2}) with a
heavy-tailed bivariate double-exponential distribution using the R package
*LaplacesDemon*. Similar results are obtained and the details are
described in Supplementary Materials
Section D.1.

In this simulation study, we study classifier performance when there is a strong
nonlinear pattern in the training data. The radial basis function (RBF) kernel is adopted
in RAUC and SVM. We simulate 4 covariates independently from Student’s
*t* distribution with 4 degrees of freedom. When no outliers are
simulated, the outcome *Y* is generated as Bernoulli random variables with
means specified as logit {Pr (*Y* = 1)
|*X*_{1}, *X*_{2},
*X*_{3}, *X*_{4}} = 10
× {sin (*πX*_{1}) + sin
(*πX*_{2}) + sin
(*πX*_{3}) + sin
(*πX*_{4})}. When outliers are simulated, the
outcome for samples with $\mid {X}_{1}^{2}+{X}_{2}^{2}+{X}_{3}^{2}+{X}_{4}^{2}\mid \phantom{\rule{0.16667em}{0ex}}>10$ is simulated from a different mean model: 10 ×
{cos (*πX*_{1}) + cos
(*πX*_{2}) + cos
(*πX*_{3}) + cos
(*πX*_{4})}. The sample size is 200 for training
data and 10^{4} for test data.
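The Section 4.2 generator can be sketched similarly (function name ours):

```python
import numpy as np

def simulate_sine(n, outliers, rng):
    """Four independent t_4 covariates; sinusoidal logit, switched to a
    cosine model for points far from the origin when outliers=True."""
    X = rng.standard_t(df=4, size=(n, 4))
    logit = 10.0 * np.sin(np.pi * X).sum(axis=1)
    if outliers:
        far = (X ** 2).sum(axis=1) > 10.0
        logit[far] = 10.0 * np.cos(np.pi * X[far]).sum(axis=1)
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return X, Y

X_tr, y_tr = simulate_sine(200, True, np.random.default_rng(5))
```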

The use of the RBF kernel in RAUC and SVM requires careful consideration of the tuning process. The RBF kernel $K(\mathbf{x}_i,\mathbf{x}_j)=\exp(-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|^2)$ has a tuning parameter $\gamma$, which, together with the penalty parameter, gives each method two tuning parameters to select.
Table 2. Test data AUC from Section 4.2; mean and se: Monte Carlo mean and standard error. SVM^r: SVM with RBF kernel. RAUC^r: RAUC with RBF kernel.

When the training data is free of outliers, neither the logistic regression method nor the SAUC method performs well, due to the strong nonlinearity in the data. Both the RAUC method and the SVM method with RBF kernel perform satisfactorily. As expected, the test data AUC is lower when the test data contains outliers than when it does not.

When the training data contains outliers, we also observe that RAUC and SVM with RBF kernel outperform logistic and SAUC. Furthermore, RAUC with RBF kernel also shows an advantage over SVM with RBF kernel. The relative increase in effective AUC is (0.817 − 0.5)/(0.790 − 0.5) − 1 = 9% when the test data has no outliers, and (0.816 − 0.5)/(0.795 − 0.5) − 1 = 7% when the test data also has outliers.

From a dataset of 500,000 data points, the theoretical AUC is 0.995 under both
‘no outliers’ and ‘with outliers’ scenarios. Figure B.1 in the Supplementary
Materials shows the density curves for the distributions of logit {Pr
(*Y* = 1)} in cases and controls under the two scenarios.
Figure B.2 in the Supplementary
Materials shows scatterplots of ‘theoretical combination’, i.e.
logit{Pr (*Y* = 1)}, and
RAUC* ^{r}* estimated combination for the training data under the
two scenarios from one Monte Carlo replicate. The plots show that for the bulk of the 200
data points the correlation between the theoretical and estimated combinations is high, but
when the theoretical combination is near 0, the estimated combination may show wide
variability.

We repeat this simulation study with (*X*_{1},
*X*_{2}, *X*_{3},
*X*_{4}) simulated from a correlated multivariate
Student’s t distribution, and obtain similar results (details in Supplementary Materials Section D.2).

RV144 is a community-based, randomized, multicenter, double-blind,
placebo-controlled HIV vaccine efficacy trial conducted in Thailand from 2003 to 2006
[36]. In the modified
intention-to-treat analysis involving 16,395 subjects, the vaccine efficacy was
31.2% with a P-value of 0.04. In an effort to identify immune correlates of
infection risk potentially relevant for vaccine-induced protection, Haynes et al.
[1] conducted a case-control study
using biomarker measurements from blood samples of 41 cases and 205 controls, all from the
vaccine arm. Haynes et al. reported an array of immunological response variables measured
from these samples. We choose two of them for illustration: *IgA*, which
measures the IgA class of serum antibody binding to HIV envelope protein gp120, and
*IL13*, which measures the amount of *IL13* production by
peripheral blood mononuclear cells when stimulated by gp120 peptide. Both variables are
continuous variables and are scaled to have standard deviation 1. We perform repeated random
sub-sampling validation for AUC estimation. In each iteration, we randomly split the data into
a 4/5 training set and a 1/5 test set, stratified on case status. The test data AUC over 1000
splits is reported in Table 3.
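
The validation scheme just described can be sketched as follows; `scores_fn` is a hypothetical placeholder that trains any of the compared classifiers on the training indices and returns scores for the test indices:

```python
import numpy as np

def repeated_subsampling_auc(y, scores_fn, n_splits=1000, test_frac=0.2, seed=1):
    """Stratified repeated random sub-sampling validation: cases and controls
    are split separately into 4/5 training and 1/5 test sets, and the
    empirical test-set AUC is averaged over splits."""
    rng = np.random.default_rng(seed)
    cases, controls = np.where(y == 1)[0], np.where(y == 0)[0]
    aucs = []
    for _ in range(n_splits):
        test = np.concatenate([
            rng.choice(cases, int(len(cases) * test_frac), replace=False),
            rng.choice(controls, int(len(controls) * test_frac), replace=False)])
        train = np.setdiff1d(np.arange(len(y)), test)
        s, yt = scores_fn(train, test), y[test]
        # empirical AUC: proportion of case-control pairs ranked correctly
        diffs = s[yt == 1][:, None] - s[yt == 0][None, :]
        aucs.append(np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0))
    return float(np.mean(aucs)), float(np.std(aucs) / np.sqrt(n_splits))
```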

Test data AUC for the RV144 data example: mean and standard error (se) from
repeated random sub-sampling validation.

For methods using the RBF kernel, we need a strategy for choosing the tuning parameters. Because the number of cases is limited, further splitting the training set into a training subset and a tuning subset does not work well. Instead, we adopt the following strategy, using the same tuning parameter values for all training/test datasets. To find good values, we randomly split the data into a 4/5 training set and a 1/5 tuning set, and perform a grid search to maximize the classification performance on the tuning data. The median value over 1000 such splits is chosen as the tuning parameter to be used in the training/test phase. For SVM, we try both this strategy and generalized approximate cross-validation (GACV [37]) and find that our strategy performs slightly better than GACV; hence, we only report results from the former.
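
Schematically, the median-over-splits tuning rule reads as below; `eval_fn` is a hypothetical placeholder that fits the classifier with tuning-parameter value `g` on the 4/5 training portion of the split indexed by `split_seed` and returns the AUC on the 1/5 tuning portion:

```python
import numpy as np

def tune_by_median(grid, eval_fn, n_splits=1000, seed=7):
    """For each random training/tuning split, keep the grid value with the
    best tuning-set performance; return the median of the kept values."""
    rng = np.random.default_rng(seed)
    best = []
    for _ in range(n_splits):
        split_seed = int(rng.integers(1 << 31))
        perf = [eval_fn(g, split_seed) for g in grid]
        best.append(grid[int(np.argmax(perf))])
    return float(np.median(best))
```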

The results in Table 3 show that, among the
methods looking for the best linear combination of markers, the performance of RAUC with
the linear kernel is on par with the other methods. The RAUC method detects a nonlinear trend in the
relationship between infection and the two markers. Comparing
RAUC* ^{r}* and RAUC

In this paper we propose a new criterion function, the penalized RAUC loss function, for finding marker combinations with optimal classification/prediction performance. By using different kernels, the RAUC method can find both linear and nonlinear best marker combinations. We show that the best linear marker combination is a consistent and asymptotically normal estimator of the true marker combination when the data is generated from a semiparametric generalized linear model.
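
To illustrate how a kernel turns the linear score into a nonlinear marker combination, here is a minimal sketch with the Gaussian RBF kernel; the weights `alpha` would come from the RAUC optimization and are supplied by hand here purely for illustration:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian RBF kernel matrix K[i, j] = exp(-gamma * ||X_i - Z_j||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_score(alpha, X_train, x_new, gamma=1.0):
    """Nonlinear marker combination f(x) = sum_i alpha_i K(x_i, x)."""
    return float(rbf_kernel(x_new[None, :], X_train, gamma)[0] @ alpha)
```

With a linear kernel, K(x, z) = x^{T}z, the same expansion collapses to a linear combination of the markers.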

In the numerical studies, we show that the RAUC method with a linear kernel out-performs SAUC as a linear classification method under some scenarios. While both methods yield consistent estimators of the true marker combination with identical asymptotic variance, the difference-of-convex functions algorithm employed by RAUC is less likely to get stuck in local optima than the gradient-based algorithm used by SAUC. In the presence of nonlinear patterns, the RAUC method with nonlinear kernels offers a flexible way to derive nonlinear marker combinations that optimize classification accuracy, with performance better than or comparable to common alternative estimators. An R package implementing the RAUC method is available from the authors upon request.
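
To make the difference-of-convex structure concrete: a ramp function that plateaus at 1 below −1/2, is linear on [−1/2, 1/2], and is 0 above 1/2 (the form appearing in the Appendix) equals the difference of two convex hinge functions. The DC algorithm exploits this by linearizing the subtracted hinge at the current iterate and solving the resulting convex problem. A small numerical check of the decomposition (our own sketch, not the package code):

```python
import numpy as np

def ramp(eta):
    """Ramp loss: 1 for eta < -1/2, (1/2 - eta) on [-1/2, 1/2], 0 above 1/2."""
    return np.clip(0.5 - eta, 0.0, 1.0)

def hinge_upper(eta):   # convex: hinge kinked at +1/2
    return np.maximum(0.0, 0.5 - eta)

def hinge_lower(eta):   # convex: hinge kinked at -1/2 (subtracted in the DC split)
    return np.maximum(0.0, -0.5 - eta)

eta = np.linspace(-3.0, 3.0, 601)
# ramp = convex part minus convex part: the decomposition DCA relies on
assert np.allclose(ramp(eta), hinge_upper(eta) - hinge_lower(eta))
```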

Our focus in this paper is to develop classification rules to flexibly combine
several candidate biomarkers that have been selected by investigators as potentially useful.
Another important aspect of biomarker studies is to use data to help select important
markers and features. Variable selection is well developed in regression analysis with
linear predictors, where many types of penalties have been proposed and oracle properties
established [38]. In AUC analysis
with linear marker combinations, the penalization approach to variable selection faces an
added layer of complexity due to the fact that a constraint on
**β** is needed to ensure identifiability. Some
efforts have been made in this area, e.g. Lin et al. [24] proposed to use a SCAD penalty [39] while maintaining
||

The authors thank Krisztian Sebestyen for programming assistance. We also thank the Editor, the AE and the two referees for their insightful comments that helped improve the manuscript. This work was supported by the cooperative agreement W81XWH-07-2-0067 between the Henry M. Jackson Foundation and the Department of Defense, and National Institute of Allergy and Infectious Diseases grants 1R56AI116369-01A1, R01-GM106177 and UM1AI068635.

For the proof of consistency, we need two assumptions. (A1)
*θ*_{0} is an interior point of Θ,
which is a compact subset of ℝ^{d−1}.
(A2) The support of the covariate vector **x** is not contained
in any proper linear subspace of

Under (A1) and (A2), Han (1987) [44] proved the consistency of the empirical AUC estimator. Denote the
AUC loss by ${\text{AUC}}_{n}\phantom{\rule{0.16667em}{0ex}}(\theta )={\scriptstyle \frac{1}{{n}^{2}}}{\sum}_{p=1}^{{n}_{D}{n}_{\overline{D}}}I\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta}}<0)$. It suffices to show that
sup_{θ∈Θ}
|AUC* _{n}*(

$$\begin{array}{l}\left|{\text{AUC}}_{n}(\theta )-\underset{{\beta}_{1}}{\text{argmin}}{L}_{n}\phantom{\rule{0.16667em}{0ex}}(\beta )\right|=\left|\frac{1}{{n}^{2}}\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}I\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta}}<0)-\frac{1}{{n}^{2}}\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}r\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta}})-\frac{1}{2{n}^{2}}{\lambda}_{n}{\widehat{\beta}}_{1}^{2}{\Vert \theta \Vert}_{2}^{2}\right|\\ \le \frac{1}{{n}^{2}}\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}\mid I\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta}}<0)-r\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{\mathrm{\Delta}})\mid +{({\lambda}_{n}/{n}^{2})}^{1/2}\phantom{\rule{0.16667em}{0ex}}{O}_{p}(1)\\ \le \frac{1}{{n}^{2}}\sum _{p=1}^{{n}_{D}{n}_{\overline{D}}}I\phantom{\rule{0.16667em}{0ex}}\left[(1,\theta )\phantom{\rule{0.16667em}{0ex}}({x}_{{i}_{p}}-{x}_{{j}_{p}})<{({\lambda}_{n}/{n}^{2})}^{1/4}\phantom{\rule{0.16667em}{0ex}}{O}_{p}\phantom{\rule{0.16667em}{0ex}}(1)\right]+{({\lambda}_{n}/{n}^{2})}^{1/2}\phantom{\rule{0.16667em}{0ex}}{O}_{p}(1)\end{array}$$

As argued in [15], the
first term in the last line converges to 0 uniformly in probability when
*λ _{n}*/

Let *F* and *f* be the cdf and pdf for
*T* = (*X _{i}* −

$$\begin{array}{l}\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{r\phantom{\rule{0.16667em}{0ex}}(T)\}=Pr\phantom{\rule{0.16667em}{0ex}}(T<-0.5)+{\int}_{-0.5}^{0.5}\left(\frac{1}{2}-T\right)\phantom{\rule{0.16667em}{0ex}}f\phantom{\rule{0.16667em}{0ex}}(T)\phantom{\rule{0.16667em}{0ex}}dT\\ =\frac{1}{2}\{F\phantom{\rule{0.16667em}{0ex}}(0.5)+F\phantom{\rule{0.16667em}{0ex}}(-0.5)\}-{\int}_{-0.5}^{0.5}Tf\phantom{\rule{0.16667em}{0ex}}(T)\phantom{\rule{0.16667em}{0ex}}dT.\end{array}$$

Take the derivative with respect to *β*_{1}, and
as *β*_{1} → ∞,

$$\begin{array}{l}\frac{\partial}{\partial {\beta}_{1}}\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{r\phantom{\rule{0.16667em}{0ex}}(T)\}={\int}_{0}^{0.5/{\beta}_{1}}T\phantom{\rule{0.16667em}{0ex}}\{\psi \phantom{\rule{0.16667em}{0ex}}(-T)-\psi \phantom{\rule{0.16667em}{0ex}}(T)\}\phantom{\rule{0.16667em}{0ex}}dT\\ =\frac{0.5}{{\beta}_{1}}\times \frac{0.5}{{\beta}_{1}}\left\{\psi \phantom{\rule{0.16667em}{0ex}}\left(-\frac{0.5}{2{\beta}_{1}}\right)-\psi \phantom{\rule{0.16667em}{0ex}}\left(\frac{0.5}{2{\beta}_{1}}\right)\right\}+{o}_{p}\phantom{\rule{0.16667em}{0ex}}({\beta}_{1}^{-3})\\ ={\beta}_{1}^{-3}\phantom{\rule{0.16667em}{0ex}}\left\{-\frac{1}{8}{\psi}^{\prime}\phantom{\rule{0.16667em}{0ex}}(0)+{o}_{p}\phantom{\rule{0.16667em}{0ex}}(1)\right\}.\end{array}$$
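
The final rate follows from a first-order expansion of *ψ* around 0 (an intermediate step spelled out):

$$\psi \left(-\frac{0.5}{2{\beta}_{1}}\right)-\psi \left(\frac{0.5}{2{\beta}_{1}}\right)=-\frac{0.5}{{\beta}_{1}}{\psi}^{\prime}(0)+o({\beta}_{1}^{-1}),$$

so that

$$\frac{0.5}{{\beta}_{1}}\times \frac{0.5}{{\beta}_{1}}\times \left\{-\frac{0.5}{{\beta}_{1}}{\psi}^{\prime}(0)+o({\beta}_{1}^{-1})\right\}={\beta}_{1}^{-3}\left\{-\frac{1}{8}{\psi}^{\prime}(0)+o(1)\right\}.$$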

Hence, $\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{r\phantom{\rule{0.16667em}{0ex}}(T)\}+{\scriptstyle \frac{1}{2{n}^{2}}}{\lambda}_{n}\phantom{\rule{0.16667em}{0ex}}{\Vert \beta \Vert}_{2}^{2}$ as a function of *β*_{1} is
minimized at

$$\begin{array}{ll}\hfill 0& ={\beta}_{1}^{-3}\phantom{\rule{0.16667em}{0ex}}\left\{-\frac{1}{8}{\psi}^{\prime}\phantom{\rule{0.16667em}{0ex}}(0)+{o}_{p}\phantom{\rule{0.16667em}{0ex}}(1)\right\}+\frac{{\lambda}_{n}}{{n}^{2}}{\beta}_{1}{\Vert \theta \Vert}^{2}\hfill \\ \hfill {\beta}_{1}& ={\left(\frac{{\lambda}_{n}}{{n}^{2}}\right)}^{-1/4}\phantom{\rule{0.16667em}{0ex}}{\left\{\frac{1}{8{\Vert \theta \Vert}^{2}}{\psi}^{\prime}\phantom{\rule{0.16667em}{0ex}}(0)+{o}_{p}\phantom{\rule{0.16667em}{0ex}}(1)\right\}}^{1/4}\hfill \end{array}$$

Denote ${s}_{n}={\left({\scriptstyle \frac{{\lambda}_{n}}{{n}^{2}}}\right)}^{1/4}\phantom{\rule{0.16667em}{0ex}}{\left\{{\left(8{\Vert {\theta}_{0}\Vert}^{2}\right)}^{-1}\phantom{\rule{0.16667em}{0ex}}{\psi}^{\prime}\phantom{\rule{0.16667em}{0ex}}(0)\right\}}^{-1/4}$. When *λ _{n}* =

Let *Z* = (*Y*, *X*). A key
term in the normality proof is

$${\tau}_{n}\phantom{\rule{0.16667em}{0ex}}({Z}_{1},\theta )={\mathbb{E}}_{{Y}_{2}}\phantom{\rule{0.16667em}{0ex}}[I({Y}_{1}>{Y}_{2}){r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}]+{\mathbb{E}}_{{Y}_{2}}\phantom{\rule{0.16667em}{0ex}}[I({Y}_{2}>{Y}_{1}){r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{2}-{X}_{1})\}].$$

By *r _{n}*(

$$\begin{array}{l}{\tau}_{n}\phantom{\rule{0.16667em}{0ex}}({Z}_{1},\theta )=\int I\phantom{\rule{0.16667em}{0ex}}({Y}_{1}>{Y}_{2})\phantom{\rule{0.16667em}{0ex}}{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2},{Y}_{2})+\int I({Y}_{2}>{Y}_{1})\phantom{\rule{0.16667em}{0ex}}[1-{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}]\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2},{Y}_{2})\\ =\int \{I\phantom{\rule{0.16667em}{0ex}}({Y}_{1}>{Y}_{2})-I({Y}_{2}>{Y}_{1})\}\phantom{\rule{0.16667em}{0ex}}{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2},{Y}_{2})+\int I({Y}_{2}>{Y}_{1})dF\phantom{\rule{0.16667em}{0ex}}({X}_{2},{Y}_{2})\\ =\int S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},\alpha +{X}_{2}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,\theta )\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2})+\int I({Y}_{1}>{Y}_{2})dF\phantom{\rule{0.16667em}{0ex}}({X}_{2},{Y}_{2}),\end{array}$$

where in the last line, we introduce
*S*(*y*, *t*),

$$\begin{array}{l}S(y,t)={E}_{Y\mid \alpha +{X}^{T}{\beta}_{0}}\phantom{\rule{0.16667em}{0ex}}\{I(y>Y)-I(y<Y)\mid \alpha +{X}^{T}{\beta}_{0}=t\}\\ =\{\begin{array}{ccc}Pr\phantom{\rule{0.16667em}{0ex}}(Y=0\mid \alpha +{X}^{T}{\beta}_{0}=t)& \text{if}& y=1\\ -Pr\phantom{\rule{0.16667em}{0ex}}(Y=1\mid \alpha +{X}^{T}{\beta}_{0}=t)& \text{if}& y=0\end{array}\\ =y-G\phantom{\rule{0.16667em}{0ex}}(t).\end{array}$$

We now state the third assumption needed for the normality. (A3) (i) Δ
and *V* exist and Δ is negative definite; (ii) partial derivative
of *g*_{0} exists and is bounded on the support of
(*Y*, **X**); (iii) for each

Under assumption (A3),

$${\nabla}_{1}{\tau}_{n}(Z,{\theta}_{0})=-{\beta}_{0,1}\phantom{\rule{0.16667em}{0ex}}(\mathcal{X}-{\mathcal{X}}_{0})\phantom{\rule{0.16667em}{0ex}}S(Y,{\eta}_{0}){g}_{0}\phantom{\rule{0.16667em}{0ex}}({\eta}_{0})+{O}_{p}\phantom{\rule{0.16667em}{0ex}}({s}_{n})$$

(8)

*and*

$$\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\left[{\nabla}_{1}{\tau}_{n}(Z,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}{\{{\nabla}_{1}{\tau}_{n}(Z,{\theta}_{0})\}}^{T}\right]=V+{O}_{p}\phantom{\rule{0.16667em}{0ex}}({s}_{n})$$

(9)

$$\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{{\nabla}_{2}{\tau}_{n}(Z,{\theta}_{0})\}=2\mathrm{\Delta}+{O}_{p}\phantom{\rule{0.16667em}{0ex}}({s}_{n}).$$

(10)

Let *u*_{i} denote the
unit vector in

$${\nabla}_{i}^{1}{\tau}_{n}(Z,{\theta}_{0})=\underset{\epsilon \to 0}{lim}{\epsilon}^{-1}\phantom{\rule{0.16667em}{0ex}}[{\tau}_{n}(Z,{\theta}_{0}+\epsilon {u}_{i})-{\tau}_{n}(Z,{\theta}_{0})].$$

By the definition of *r _{n}*, we can compute

$$\begin{array}{l}{\tau}_{n}(Z,{\theta}_{0}+\epsilon {u}_{i})-{\tau}_{n}(Z,{\theta}_{0})=\int [{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,{\theta}_{0}+\epsilon {u}_{i})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}-{r}_{n}\phantom{\rule{0.16667em}{0ex}}\{(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\}]\phantom{\rule{0.16667em}{0ex}}S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},\alpha +{X}_{2}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2})\\ =\int \left[\begin{array}{c}I\phantom{\rule{0.16667em}{0ex}}\{(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})+\epsilon \phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})<-{s}_{n}/2\}\\ -I\phantom{\rule{0.16667em}{0ex}}\{(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})<-{s}_{n}/2\}\\ +I\phantom{\rule{0.16667em}{0ex}}\{\mid (1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})+\epsilon \phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\mid \phantom{\rule{0.16667em}{0ex}}<{s}_{n}/2\}\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{1}{2}}-\{(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})+\epsilon \phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\}]\\ -I\phantom{\rule{0.16667em}{0ex}}\{\mid (1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})\mid \phantom{\rule{0.16667em}{0ex}}<{s}_{n}/2\}\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{1}{2}}-(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})]\end{array}\right]\phantom{\rule{0.16667em}{0ex}}S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},\alpha +{X}_{2}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2})\\ =\int \left[\begin{array}{c}I\phantom{\rule{0.16667em}{0ex}}\left\{-\alpha +{({X}_{1}-{X}_{2})}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}+\epsilon {\beta}_{0,1}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})<-\alpha 
-{\beta}_{0,1}{s}_{n}/2\right\}\\ -I\phantom{\rule{0.16667em}{0ex}}\left\{-\alpha +{({X}_{1}-{X}_{2})}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}<-\alpha -{\beta}_{0,1}{s}_{n}/2\right\}\\ +I\phantom{\rule{0.16667em}{0ex}}\left\{\left|{({X}_{1}-{X}_{2})}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}+\epsilon {\beta}_{0,1}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\right|<\phantom{\rule{0.16667em}{0ex}}\mid {\beta}_{0,1}\mid {s}_{n}/2\right\}[{\scriptstyle \frac{1}{2}}-\{(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})+\epsilon \phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\}]\\ -I\phantom{\rule{0.16667em}{0ex}}\left\{\left|{({X}_{1}-{X}_{2})}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}\right|<\mid {\beta}_{0,1}\mid \phantom{\rule{0.16667em}{0ex}}{s}_{n}/2\right\}[{\scriptstyle \frac{1}{2}}-(1,{\theta}_{0})\phantom{\rule{0.16667em}{0ex}}({X}_{1}-{X}_{2})]\end{array}\right]\phantom{\rule{0.16667em}{0ex}}S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},\alpha +{X}_{2}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({X}_{2})\\ \text{let}\phantom{\rule{0.16667em}{0ex}}t=\alpha +{X}_{2}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}\\ =\int \left[\begin{array}{c}-I\phantom{\rule{0.16667em}{0ex}}\{\alpha +{\beta}_{0,1}{s}_{n}/2+{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}<t<\alpha +{\beta}_{0,1}{s}_{n}/2+{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}+\epsilon {\beta}_{0,1}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\}\\ +I\phantom{\rule{0.16667em}{0ex}}\{\mid t-\alpha +\epsilon {\beta}_{0,1}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\mid \phantom{\rule{0.16667em}{0ex}}<\phantom{\rule{0.16667em}{0ex}}\mid {\beta}_{0,1}\mid {s}_{n}/2\}\phantom{\rule{0.16667em}{0ex}}[{\scriptstyle \frac{1}{2}}-\{(t-\alpha )/{\beta}_{01}+\epsilon \phantom{\rule{0.16667em}{0ex}}({X}_{1,i}-{X}_{2,i})\}]\\ -I\phantom{\rule{0.16667em}{0ex}}\{\mid t-\alpha \mid 
\phantom{\rule{0.16667em}{0ex}}<\phantom{\rule{0.16667em}{0ex}}\mid {\beta}_{0,1}\mid {s}_{n}/2\}\phantom{\rule{0.16667em}{0ex}}\{{\scriptstyle \frac{1}{2}}-(t-\alpha )/{\beta}_{01}\}\end{array}\right]\phantom{\rule{0.16667em}{0ex}}S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},t)\phantom{\rule{0.16667em}{0ex}}{g}_{0}\phantom{\rule{0.16667em}{0ex}}(t\mid {\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}\mathit{dtdF}\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2})\\ =-\epsilon {\beta}_{0,1}\int ({X}_{1,i}-{X}_{2,i})\phantom{\rule{0.16667em}{0ex}}S\phantom{\rule{0.16667em}{0ex}}({Y}_{1},\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}{g}_{0}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}\mid {\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2})+\epsilon {O}_{p}\phantom{\rule{0.16667em}{0ex}}({s}_{n})+{o}_{p}(\mid \epsilon \mid )\end{array}$$

To show (10),

$$\begin{array}{l}{\nabla}_{i}^{2}{\tau}_{n}(Z,{\theta}_{0})=-{\beta}_{0,1}\int {({X}_{1,i}-{X}_{2,i})}^{2}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}{g}_{0}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}\mid {\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}dF\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}\phantom{\rule{0.16667em}{0ex}}\phantom{\rule{0.16667em}{0ex}}\mathrm{b}/\mathrm{c}\phantom{\rule{0.16667em}{0ex}}\mathrm{E}\phantom{\rule{0.16667em}{0ex}}\{S\phantom{\rule{0.16667em}{0ex}}(Y,t)\mid {X}^{\prime}{\beta}_{0}=t\}=0\\ =-{\beta}_{0,1}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\int {({X}_{1,i}-{X}_{2,i})}^{2}\phantom{\rule{0.16667em}{0ex}}p\phantom{\rule{0.16667em}{0ex}}({\eta}_{2}=\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0}\mid {\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}p\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2})\phantom{\rule{0.16667em}{0ex}}d{\mathcal{X}}_{2}\\ =-{\beta}_{0,1}{G}^{\prime}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\int {({X}_{1,i}-{X}_{2,i})}^{2}\phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}p\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2}\mid {\eta}_{2}=\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}d{\mathcal{X}}_{2}\\ =-{\beta}_{0,1}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\int 
({X}_{1,i}^{2}-2{X}_{1,i}{X}_{2,i}+{X}_{2,i}^{2})\phantom{\rule{0.16667em}{0ex}}p\phantom{\rule{0.16667em}{0ex}}({\mathcal{X}}_{2}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}d{\mathcal{X}}_{2}\\ =-{\beta}_{0,1}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}\{{X}_{1,i}^{2}-2{X}_{1,i}\phantom{\rule{0.16667em}{0ex}}\mathrm{E}\phantom{\rule{0.16667em}{0ex}}({X}_{2,i}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})+\mathrm{E}\phantom{\rule{0.16667em}{0ex}}({X}_{2,i}^{2}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\}\\ =-{\beta}_{0,1}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}\left[{\{{X}_{1,i}-\mathrm{E}\phantom{\rule{0.16667em}{0ex}}({X}_{2,i}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\}}^{2}+\mathrm{E}\phantom{\rule{0.16667em}{0ex}}({X}_{2,i}^{2}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})-{\mathrm{E}}^{2}\phantom{\rule{0.16667em}{0ex}}({X}_{2,i}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\right]\\ =-{\beta}_{0,1}{G}^{\prime}\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}g\phantom{\rule{0.16667em}{0ex}}(\alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\phantom{\rule{0.16667em}{0ex}}\left[{\{{X}_{1,i}-\mathrm{E}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}\mid \alpha +{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\}}^{2}+\mathbb{V}\phantom{\rule{0.16667em}{0ex}}({X}_{1,i}\mid \alpha 
+{X}_{1}^{T}\phantom{\rule{0.16667em}{0ex}}{\beta}_{0})\right]\end{array}$$

The rest of the normality proof tracks the proof of Theorem 4 in [26] closely. Denote ${\mathrm{\Gamma}}_{n}(\theta )={H}_{n}(\theta )-{H}_{n}({\theta}_{0})+{\scriptstyle \frac{1}{2{n}^{2}}}{\lambda}_{n}{\widehat{\beta}}_{1}^{2}{\Vert \theta \Vert}_{2}^{2}$. We will show that

$${\mathrm{\Gamma}}_{n}(\theta )=\frac{1}{2}{(\theta -{\theta}_{0})}^{\prime}\mathrm{\Delta}(\theta -{\theta}_{0})+\frac{1}{\sqrt{n}}{(\theta -{\theta}_{0})}^{\prime}{W}_{n}+{o}_{p}({\mid \theta -{\theta}_{0}\mid}^{2})+{O}_{p}({s}_{n}\mid \theta -{\theta}_{0}\mid )+{o}_{p}(\frac{1}{n})$$

(11)

uniformly in *o*(1) neighborhoods of
*θ*_{0}, where *W _{n}*
converges in distribution to a

For each (*z*_{1}, *z*_{2}) in
*S* × *S* and each *θ*
in Θ, define *f _{n}*(

$$\begin{array}{l}{\mathrm{\Gamma}}_{n0}(\theta )=E({f}_{n}({Z}_{1},{Z}_{2},\theta ))\\ {g}_{n}(z,\theta )=P{f}_{n}(z,\cdot ,\theta )+P{f}_{n}(\cdot ,z,\theta )-2{\mathrm{\Gamma}}_{n0}(\theta )\\ {r}_{n}({z}_{1},{z}_{2},\theta )={f}_{n}({z}_{1},{z}_{2},\theta )-P{f}_{n}({z}_{1},\cdot ,\theta )-P{f}_{n}(\cdot ,{z}_{2},\theta )+{\mathrm{\Gamma}}_{n0}(\theta ).\end{array}$$
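
These are the components of the Hoeffding decomposition of the U-process (as in the projection argument of [26]): by construction,

$${U}_{n}{f}_{n}(\cdot ,\cdot ,\theta )={\mathrm{\Gamma}}_{n0}(\theta )+{P}_{n}{g}_{n}(\cdot ,\theta )+{U}_{n}{r}_{n}(\cdot ,\cdot ,\theta ),$$

so (11) follows once the three terms are controlled separately, which is what (12), (13) and (14) establish.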

We now show that

$${\mathrm{\Gamma}}_{n0}(\theta )=\frac{1}{2}{(\theta -{\theta}_{0})}^{\prime}\mathrm{\Delta}(\theta -{\theta}_{0})+{o}_{p}({\mid \theta -{\theta}_{0}\mid}^{2})+{O}_{p}({s}_{n}\mid \theta -{\theta}_{0}\mid )$$

(12)

Note that 2Γ_{n0}(*θ*) = *E*{*τ _{n}*(*z*, *θ*) − *τ _{n}*(*z*, *θ*_{0})}. By a Taylor expansion,

$${\tau}_{n}(z,\theta )-{\tau}_{n}(z,{\theta}_{0})={(\theta -{\theta}_{0})}^{\prime}{\nabla}_{1}{\tau}_{n}(z,{\theta}_{0})+\frac{1}{2}{(\theta -{\theta}_{0})}^{\prime}{\nabla}_{2}{\tau}_{n}(z,{\theta}^{\ast})(\theta -{\theta}_{0}),$$

where *θ*^{*} lies
between *θ* and *θ*_{0}. By
(A3)(iv), for each *z* in *S* and each
*θ* in a neighborhood of
*θ*_{0},

$$\Vert {(\theta -{\theta}_{0})}^{\prime}[{\nabla}_{2}{\tau}_{n}(z,\theta )-{\nabla}_{2}{\tau}_{n}(z,{\theta}_{0})](\theta -{\theta}_{0})\Vert \phantom{\rule{0.16667em}{0ex}}<M(z){\mid \theta -{\theta}_{0}\mid}^{3}.$$

Thus by Lemma 1,

$$\begin{array}{l}2{\mathrm{\Gamma}}_{n0}(\theta )=\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{{\tau}_{n}(z,\theta )-{\tau}_{n}(z,{\theta}_{0})\}\\ ={(\theta -{\theta}_{0})}^{T}\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{{\nabla}_{1}{\tau}_{n}(z,{\theta}_{0})\}+\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\left\{\frac{1}{2}{(\theta -{\theta}_{0})}^{T}{\nabla}_{2}{\tau}_{n}(z,{\theta}^{\ast})(\theta -{\theta}_{0})\right\}\\ ={O}_{p}({s}_{n}\mid \theta -{\theta}_{0}\mid )+{(\theta -{\theta}_{0})}^{T}\mathbb{E}\phantom{\rule{0.16667em}{0ex}}\{(\mathcal{X}-{\mathcal{X}}_{0})\phantom{\rule{0.16667em}{0ex}}S(Y,{\eta}_{0}){g}_{0}({\eta}_{0})\}+{(\theta -{\theta}_{0})}^{T}\mathrm{\Delta}(\theta -{\theta}_{0})+o({\mid \theta -{\theta}_{0}\mid}^{2})\\ ={O}_{p}({s}_{n}\mid \theta -{\theta}_{0}\mid )+{(\theta -{\theta}_{0})}^{T}\mathrm{\Delta}(\theta -{\theta}_{0})+o({\mid \theta -{\theta}_{0}\mid}^{2}).\end{array}$$

Next we show that

$${P}_{n}{g}_{n}(\cdot ,\theta )=\frac{1}{\sqrt{n}}{(\theta -{\theta}_{0})}^{T}{W}_{n}+O({s}_{n}\mid \theta -{\theta}_{0}\mid )+o({\mid \theta -{\theta}_{0}\mid}^{2})$$

(13)

uniformly over *o*(1) neighborhood of
*θ*_{0}, where *W _{n}*
converges in distribution to a

$${g}_{n}(z,\theta )={\tau}_{n}(z,\theta )-{\tau}_{n}(z,{\theta}_{0})-2{\mathrm{\Gamma}}_{n0}(\theta ).$$

Then we have

$$\begin{array}{l}{P}_{n}({g}_{n}(z,\theta ))={P}_{n}({\tau}_{n}(z,\theta )-{\tau}_{n}(z,{\theta}_{0}))-2{\mathrm{\Gamma}}_{n0}(\theta )\\ ={P}_{n}\phantom{\rule{0.16667em}{0ex}}\left({(\theta -{\theta}_{0})}^{\prime}{\nabla}_{1}{\tau}_{n}(z,{\theta}_{0})+\frac{1}{2}{(\theta -{\theta}_{0})}^{\prime}{\nabla}_{2}{\tau}_{n}(z,{\theta}^{\ast})(\theta -{\theta}_{0})\right)-2{\mathrm{\Gamma}}_{n0}(\theta )\\ ={P}_{n}\phantom{\rule{0.16667em}{0ex}}({(\theta -{\theta}_{0})}^{\prime}{\nabla}_{1}{\tau}_{n}(z,{\theta}_{0}))+\frac{1}{2}({(\theta -{\theta}_{0})}^{\prime}[{P}_{n}({\nabla}_{2}{\tau}_{n}(z,{\theta}_{0}))-2\mathrm{\Delta}](\theta -{\theta}_{0}))+O({s}_{n}\mid \theta -{\theta}_{0}\mid )+o({\mid \theta -{\theta}_{0}\mid}^{2})\\ =\frac{1}{\sqrt{n}}{(\theta -{\theta}_{0})}^{\prime}{W}_{n}+\frac{1}{2}{(\theta -{\theta}_{0})}^{\prime}{D}_{n}(\theta -{\theta}_{0})+O({s}_{n}\mid \theta -{\theta}_{0}\mid )+o({\mid \theta -{\theta}_{0}\mid}^{2})\end{array}$$

uniformly over *o*(1) neighborhood of
*θ*_{0}, where

$$\begin{array}{l}{W}_{n}=\sqrt{n}{P}_{n}\phantom{\rule{0.16667em}{0ex}}[(\mathcal{X}-{\mathcal{X}}_{0})\phantom{\rule{0.16667em}{0ex}}S(Y,{\eta}_{0}){g}_{0}\phantom{\rule{0.16667em}{0ex}}({\eta}_{0})],\\ {D}_{n}={P}_{n}\phantom{\rule{0.16667em}{0ex}}\left[(\mathcal{X}-{\mathcal{X}}_{0})\phantom{\rule{0.16667em}{0ex}}{(\mathcal{X}-{\mathcal{X}}_{0})}^{T}\phantom{\rule{0.16667em}{0ex}}S(Y,{\eta}_{0}){g}_{0}\phantom{\rule{0.16667em}{0ex}}({\eta}_{0})\right]-2\mathrm{\Delta}\end{array}$$

By Lemma 1, *W _{n}* converges in distribution to a

Lastly we show that

$${U}_{n}{r}_{n}(\cdot ,\cdot ,\theta )=o(1),$$

(14)

Denote *H* =
{*I*(*y*_{1} > *y*_{2})[*r*{(1, *θ*) (*x*_{1} − *x*_{2})/*s*} − *r*{(1, *θ*_{0}) (*x*_{1} − *x*_{2})/*s*}] :
*θ* ∈ Θ, *s* ∈ [0, 1]}. Then *H* is Euclidean by Lemma 22 (ii)
of [45]. By the dominated
convergence theorem, *E*(*k _{n}*(·,
·,

Together, (12), (13), (14) and ${\scriptstyle \frac{1}{{n}^{2}}}{\lambda}_{n}{\widehat{\beta}}_{1}^{2}={o}_{p}\phantom{\rule{0.16667em}{0ex}}(1)$ imply (11). This completes the proof.

In Section 3, we take the derivative of *L _{p}* with
respect to

$$\begin{array}{l}{L}_{p}=-\sum _{p}\left({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}^{T}\mathit{\beta}+{\alpha}_{p}{\mathit{x}}_{p}^{T}\mathit{\beta}-\frac{1}{2}{\alpha}_{p}\right)+\frac{1}{2}{\mathit{\beta}}^{T}{P}_{\lambda}\mathit{\beta}\\ =-\sum _{p}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})+{\alpha}_{p})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\sum _{{p}^{\prime}}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0})+{\alpha}_{{p}^{\prime}})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}+\sum _{p}\frac{1}{2}{\alpha}_{p}+\frac{1}{2}\sum _{p}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})+{\alpha}_{p})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{P}_{\lambda}{P}_{\lambda}^{-1}\sum _{{p}^{\prime}}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0})+{\alpha}_{{p}^{\prime}})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}\\ =-\frac{1}{2}\sum _{p,{p}^{\prime}}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})+{\alpha}_{p})\phantom{\rule{0.16667em}{0ex}}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0})+{\alpha}_{{p}^{\prime}})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}+\frac{1}{2}\sum _{p}{\alpha}_{p}\\ \propto -\frac{1}{2}\sum _{p,{p}^{\prime}}({\alpha}_{p}{\alpha}_{{p}^{\prime}}+{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})\phantom{\rule{0.16667em}{0ex}}{\alpha}_{{p}^{\prime}}+{\alpha}_{p}{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0}))\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}+\frac{1}{2}\sum _{p}{\alpha}_{p}\\ =-\frac{1}{2}\sum 
_{p,{p}^{\prime}}{\alpha}_{p}{\alpha}_{{p}^{\prime}}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}-\frac{1}{2}\sum _{p,{p}^{\prime}}2{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0})\phantom{\rule{0.16667em}{0ex}}{\alpha}_{p}{\mathit{x}}_{p}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}+\frac{1}{2}\sum _{p}{\alpha}_{p}\\ \propto -{\mathit{\alpha}}^{T}Q\mathit{\alpha}+\sum _{p}\left(1-2\sum _{{p}^{\prime}}{h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{{p}^{\prime}}^{0})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{{p}^{\prime}}^{T}{P}_{\lambda}^{-1}\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}\right)\phantom{\rule{0.16667em}{0ex}}{\alpha}_{p}\\ =-{\mathit{\alpha}}^{T}Q\mathit{\alpha}+{\{\mathbf{1}-2Q{\mathit{h}}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\mathit{\eta}}^{0})\}}^{T}\phantom{\rule{0.16667em}{0ex}}\mathit{\alpha}.\end{array}$$

Thus we have arrived at a dual space formulation of the optimization problem
that is a function of **α** only.

The KKT conditions associated with the above optimization programs are

$$\begin{array}{ll}\hfill {P}_{\lambda}^{-1}\sum _{p}({h}_{2}^{\prime}\phantom{\rule{0.16667em}{0ex}}({\eta}_{p}^{0})+{\alpha}_{p})\phantom{\rule{0.16667em}{0ex}}{\mathit{x}}_{p}& =\mathit{\beta}\hfill \\ \hfill {\xi}_{p}+{\mathit{x}}_{p}^{T}\mathit{\beta}-\frac{1}{2}& \ge 0\hfill \\ \hfill {\xi}_{p}& \ge 0\hfill \\ \hfill 1& \ge {\alpha}_{p}\ge 0\hfill \\ \hfill {\alpha}_{p}\phantom{\rule{0.16667em}{0ex}}\left({\xi}_{p}+{\mathit{x}}_{p}^{T}\mathit{\beta}-\frac{1}{2}\right)& =0\hfill \\ \hfill (1-{\alpha}_{p})\phantom{\rule{0.16667em}{0ex}}{\xi}_{p}& =0\hfill \end{array}$$

This set of conditions can be written more succinctly by noting that if
*α _{p}* = 0, then

$$\begin{array}{ccc}{\mathit{x}}_{p}^{T}\mathit{\beta}-{\scriptstyle \frac{1}{2}}\ge 0& \text{if}& {\alpha}_{p}<1\\ {\mathit{x}}_{p}^{T}\mathit{\beta}-{\scriptstyle \frac{1}{2}}\le 0& \text{if}& {\alpha}_{p}>0\end{array}.$$
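
The dual problem derived above is a box-constrained quadratic program in **α** (maximize −**α**^{T}*Q***α** + *c*^{T}**α** subject to 0 ≤ *α _{p}* ≤ 1). As a schematic stand-in for a QP solver (not the authors' implementation), projected gradient ascent handles programs of this shape:

```python
import numpy as np

def solve_box_qp(Q, c, iters=2000, step=None):
    """Maximize -a^T Q a + c^T a over the box 0 <= a <= 1 by projected
    gradient ascent; the projection is a coordinate-wise clip to [0, 1]."""
    n = len(c)
    if step is None:
        step = 1.0 / (2.0 * np.linalg.norm(Q, 2) + 1e-12)
    a = np.full(n, 0.5)
    for _ in range(iters):
        grad = -(Q + Q.T) @ a + c          # gradient of the concave objective
        a = np.clip(a + step * grad, 0.0, 1.0)
    return a
```

At the solution the box-KKT pattern above is visible: coordinates strictly inside (0, 1) have zero gradient, while coordinates at the bounds have gradients pointing out of the box.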

References

1. Haynes BF, Gilbert PB, McElrath MJ, Zolla-Pazner S, Tomaras GD, Alam SM, Evans DT, Montefiori DC, Karnasuta C, Sutthent R, et al. Immune-correlates analysis of an HIV-1 vaccine efficacy trial. New England Journal of Medicine. 2012;366(14):1275–1286. [PMC free article] [PubMed]

2. Pepe M. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press; 2003. Oxford Statistical Science Series. http://books.google.com/books?id=kMyXEJEtFmkC

3. Zhou X, Obuchowski N, McClish D. Statistical Methods in Diagnostic Medicine. Wiley-Interscience; 2002. Wiley Series in Probability and Statistics. http://books.google.com/books?id=ijN_Dlx7wmoC

4. Vapnik V. The Nature of Statistical Learning Theory. Springer Verlag; 1995.

5. Pepe M, Cai T, Longton G. Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics. 2006;62(1):221–229. [PubMed]

6. Bianco A, Yohai V. Robust estimation in the logistic regression model. Robust Statistics, Data Analysis, and Computer Intensive Methods, Lecture Notes in Statistics. 1996;109:17–34.

7. Wu Y, Liu Y. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association. 2007;102(479):974–983.

8. Liu Y, Shen X. Multicategory *ψ*-learning. Journal of the American Statistical Association. 2006;101(474):500–509.

9. Liu Y, Shen X, Doss H. Multicategory *ψ*-learning and support vector machine: computational tools. Journal of Computational and Graphical Statistics. 2005;14(1):219–236.

10. Lloyd C. Using smoothed receiver operating characteristic curves to summarize and compare diagnostic systems. Journal of the American Statistical Association. 1998;93(444):1356–1364.

11. Vexler A, Liu A, Schisterman E, Wu C. Note on distribution-free estimation of maximum linear separation of two multivariate distributions. Journal of Nonparametric Statistics. 2006;18(2):145–158.

12. Zhou X, Chen B, Xie Y, Tian F, Liu H, Liang X. Variable selection using the optimal ROC curve: An application to a traditional Chinese medicine study on osteoporosis disease. Statistics in Medicine. 2012;31(7):628–635. [PubMed]

13. Yan L, Dodier R, Mozer M, Wolniewicz R. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. Proceedings of the Twentieth International Conference on Machine Learning; The AAAI Press; 2003.

14. Herschtal A, Raskutti B. Optimising area under the ROC curve using gradient descent. Proceedings of the Twenty-First International Conference on Machine Learning; ACM; 2004.
15. Ma S, Huang J. Combining multiple markers for classification using ROC. Biometrics. 2007;63(3):751–757. [PubMed]

16. Calders T, Jaroszewicz S. Efficient AUC optimization for classification. Knowledge Discovery in Databases: PKDD 2007. 2007:42–53.

17. Brefeld U, Scheffer T. AUC maximizing support vector learning. Proceedings of the ICML 2005 Workshop on ROC Analysis in Machine Learning; 2005.

18. Rakotomamonjy A. Optimizing area under ROC curve with SVMs. Proceedings of ECAI 04 ROC and Artificial Intelligence Workshop; 2004.

19. Joachims T. A support vector method for multivariate performance measures. Proceedings of the 22nd International Conference on Machine Learning; ACM; 2005. pp. 377–384.

20. Wang Y, Chen H, Li R, Duan N, Lewis-Fernández R. Prediction-based structured variable selection through the receiver operating characteristic curves. Biometrics. 2011;67(3):896–905. [PMC free article] [PubMed]

21. Komori O, Eguchi S. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics. 2010;11(1):314. [PMC free article] [PubMed]

22. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer; 2009. Springer Series in Statistics.

23. Pepe M, Thompson M. Combining diagnostic test results to increase accuracy. Biostatistics. 2000;1(2):123–140. [PubMed]

24. Lin H, Zhou L, Peng H, Zhou X. Selection and combination of biomarkers using ROC method for disease classification and prediction. Canadian Journal of Statistics. 2011;39(2):324–343.

25. Gammerman A. Computational Learning and Probabilistic Reasoning. John Wiley & Sons, Inc; 1996.

26. Sherman R. The limiting distribution of the maximum rank correlation estimator. Econometrica: Journal of the Econometric Society. 1993;61(1):123–137.

27. An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by DC algorithms. Journal of Global Optimization. 1997;11(3):253–285.

28. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004. http://books.google.com/books?id=mYm0bLd3fcoC

29. Burges C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery. 1998;2(2):121–167.

30. Aizerman A, Braverman E, Rozoner L. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control. 1964;25:821–837.

31. Mercer J. Functions of positive and negative type, and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character. 1909;209:415–446.

32. Boser B, Guyon I, Vapnik V. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory; ACM; 1992. pp. 144–152.

33. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–297.

34. Joachims T. Learning to Classify Text Using Support Vector Machines. Vol. 668. Springer; Netherlands: 2002.

35. Brown M, Grundy W, Lin D, Cristianini N, Sugnet C, Furey T, Ares M, Haussler D. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences. 2000;97(1):262. [PubMed]

36. Rerks-Ngarm S, Pitisuttithum P, Nitayaphan S, Kaewkungwal J, Chiu J, Paris R, Premsri N, Namwat C, de Souza M, Adams E, et al. Vaccination with ALVAC and AIDSVAX to prevent HIV-1 infection in Thailand. New England Journal of Medicine. 2009;361(23):2209–2220. [PubMed]

37. Wahba G, Lin Y, Zhang H. Margin-like quantities and generalized approximate cross validation for support vector machines. Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop; IEEE; 1999. pp. 12–20.

38. Bühlmann P, van de Geer S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer; Berlin Heidelberg: 2011. Springer Series in Statistics. https://books.google.com/books?id=S6jYXmh988UC

39. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.

40. Fung G, Mangasarian O. A feature selection Newton method for support vector machine classification. Computational Optimization and Applications. 2004;28(2):185–202.

41. Mangasarian OL, Kou G. Feature selection for nonlinear kernel support vector machines. Data Mining Workshops, 2007 (ICDM Workshops 2007), Seventh IEEE International Conference on; IEEE; 2007. pp. 231–236.

42. Ni X, Zhang D, Zhang H. Variable selection for semiparametric mixed models in longitudinal studies. Biometrics. 2010;66(1):79–88. [PMC free article] [PubMed]

43. Savitsky T, Vannucci M, Sha N. Variable selection for nonparametric Gaussian process priors: Models and computational strategies. Statistical Science. 2011;26(1):130–149. [PMC free article] [PubMed]

44. Han A. Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator. Journal of Econometrics. 1987;35(2–3):303–316.

45. Nolan D, Pollard D. *U*-processes: Rates of convergence. The Annals of Statistics. 1987;15(2):780–799.
