Enter Your Search:Search tips Search criteria Articles Journal titles Advanced

J Am Stat Assoc. Author manuscript; available in PMC 2010 November 24.
Published in final edited form as:
J Am Stat Assoc. 2010 March 1; 105(489): 424–436.
PMCID: PMC2990887
NIHMSID: NIHMS250286

# Robust Model-Free Multiclass Probability Estimation

Yichao Wu, Assistant Professor, Hao Helen Zhang, Associate Professor, and Yufeng Liu, Associate Professor

## Abstract

Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of probability functions or on the underlying distributions of subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme is employed by solving a series of weighted large-margin classifiers and then systematically extracting the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied for a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate competitive performance of the new approach and compare it with several other existing methods.

Keywords: Fisher consistency, Hard classification, Multicategory classification, Probability estimation, Soft classification, SVM

## 1. INTRODUCTION

Multiclass probability estimation is an important problem in statistics and data mining. Suppose we are given a sample {(xi, yi), i = 1, 2, …, n} consisting of iid observations from some unknown probability distribution P(X, Y), where xi d denotes the input vector, yi {1, 2, …, K} denotes the label, n is the sample size, d is the dimensionality of the input space, and K denotes the number of classes. The main goal is to estimate the conditional probabilities pk(x) = P(Y = k|X = x), k = 1, …, K. This problem is also known as soft classification, since the estimated pk’s can be used to determine the classification boundary among K classes and to predict class labels for future samples collected from the same population.

Traditionally, the probability estimation problem is commonly tackled by regression techniques such as multiple logistic regression, or the density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Agresti and Coull (1998) gave a thorough review on these methods. These methods often make certain model assumptions on the function forms of pk’s (or their transformations) or on the underlying distributions of subclasses. For example, multiple logistic regression assumes that the logarithms of the odd ratios are linear in x,

$logP(Y=k∣X=x)P(Y=1∣X=x)=βk0+xTβk,∀k≥2,$

where class 1 is chosen as the baseline class. On the other hand, both LDA and QDA assume that the covariates X associated with each subclass follow a multivariate Gaussian distribution and construct the probability estimates as

$P(Y=k∣X=x)=φ(x;μk,∑k)αk∑j=1Kφ(x;μj,∑j)αj,k=1,…,K,$

where (x; μ, Σ) is the density function of the multivariate Gaussian distribution associated with the mean μ and the covariance Σ and αk = P(Y = k) is known as the prior probability of class k. These methods are widely used in practice. In many real applications, however, it is difficult to justify the assumption of the linear effects of covariates in the multiple logistic regression. Moreover, it is often difficult to validate the Gaussian assumption for multivariate data. If the distribution is very skewed, some proper transformation is needed to make data approximately Gaussian, which can be nontrivial for multivariate data. These issues become even more challenging for high-dimensional data.

In this article, we propose a new class of model-free methods for estimating multiclass probabilities. The new method does not make any assumption on the forms of pk’s or the distribution for each subclass. Different from the traditional methods, we tackle the soft classification problem by solving a series of hard classification problems and combining these decision rules to construct the probability estimates. The main difference between soft classification and hard classification is their estimation target, with the former directly estimating pk(x)’s and the latter estimating arg maxk=1,…,K pk(x). For many complicated problems, estimation of the classification rule arg maxk=1,…,K pk(x) may be a relatively easier task than estimating the probability functions. Many successful large-margin classifiers such as the support vector machine (SVM) can estimate arg maxk=1,…,K pk(x) with high accuracy without estimating pk(x)’s at all. This motivates us to take advantage of good classification performance of hard classifiers and try to extract the probability information contained in them.

Wang, Shen, and Liu (2008) recently explored class probability estimation for binary large-margin classifiers. In particular, they made use of the property that the theoretical minimizer of a consistent weighted binary large-margin loss function is sign[p1(x) − π], where π (0, 1). Although a particular weighted binary large-margin classifier only estimates whether p1(x) is larger than π or not, one can obtain a good estimate of p1(x) if a sequence of weighted classifiers are calculated for many different π’s. As shown in Wang, Shen, and Liu (2008), this method indeed works well for binary class probability estimation. However, the simultaneous generalization from K = 2 to K ≥ 3 is nontrivial and largely unknown due to the increased level of problem complexity. Wu, Lin, and Weng (2004) proposed a pairwise coupling method for multi-class probability estimation by solving many binary problems. In this article, we develop a new multiclass probability estimation scheme by utilizing the proposed concept of border weight for large-margin classifiers. As a result, the (K − 1)-dimensional probability estimation reduces to the search of border weight. We propose two estimation schemes for probability estimation, the direct scheme and the indirect scheme. Furthermore, we focus on the truncated hinge loss (Wu and Liu 2007) for demonstration of our proposed probability estimation technique. The technique is, however, applicable to other large-margin classifiers as well.

The rest of our article is structured as follows. Section 2 presents the idea of weighted classification and its Fisher consistency properties. Section 3 introduces the main methodology, along with two estimation schemes and the theoretical properties of the resulting probability estimator. Section 4 discusses the computational algorithm and tuning method. Section 5 and Section 6 contain numerous simulated and real examples to illustrate the numerical performance of the new approach, which is followed by the concluding section. The Appendix collects proofs for the theoretical results as well as the derivation of our algorithm.

## 2. WEIGHTED CLASSIFICATION AND FISHER CONSISTENCY

In this section, we give a brief review on an important class of hard classifiers, SVM’s (Cortes and Vapnik 1995; Vapnik 1998). We start with the simple binary classification problems and then discuss the multiclass extensions. We will, in particular, discuss the extension of SVM’s by minimizing a weighted loss function.

### 2.1 Weighted Binary Classification

When K = 2, the class label y is often coded as {−1, +1} for notational convenience. The binary SVM classifier can be fit in the following regularization framework

$minf∈Fn−1∑i=1nH1(yif(xi))+λJ(f),$
(1)

where the function H1(z) = (1 − z)+ max{1 − z, 0} is the so-called hinge loss, J(f) is a penalty term for model complexity, and is some functional space. Let p1(x) = P(Y = 1|X = x). Lin (2002) showed that the SVM solution to Equation (1) targets directly at $sign[p1(x)−12]$. Therefore, sign[(x)] approximates the Bayes classification rule without estimating p1(x).

Since the SVM showed good classification accuracy in many applications, a natural question to ask is whether it is possible to extract any information about p1(x) from the SVM solution. Recently, Wang, Shen, and Liu (2008) proposed training a series of binary SVM’s by minimizing a weighted loss function and then constructing 1(x) by combining multiple SVM classification rules. In particular, by assigning a weight π to all the samples from class −1 and assigning 1 − π to all the samples from class +1, one can solve the regularization problem based on the weighted hinge loss

$minf∈Fn−1[(1−π)∑yi=1H1{yif(xi)}+π∑yi=−1H1{yif(xi)}]+λJ(f),$
(2)

where 0 ≤ π ≤ 1. Wang, Shen, and Liu (2008) showed that the minimizer to Equation (2) is a consistent estimate of sign[p1(x) − π]. Therefore, one can repeatedly solve Equation (2) using different π values, say, 0 = π1 < ··· < πm+1 = 1 and search ĵ such that πĵ and πĵ+1 satisfy sign[p1(x) − πĵ] ≠ sign[p1(x) − πĵ+1]. The probability estimate can be estimated as $p^1(x)=12(πj^+πj^+1)$. More technical details can be found in their article.

### 2.2 Weighted Multiclass Classification

Now consider the multiclass problems with K ≥ 2. In this setup, we code y as {1, 2, …, K}. A classifier seeks the function vector f = (f1, f2, …, fK), where fk is a map from the input domain to (the set of all real numbers) representing the class k; k = 1, …, K. To ensure uniqueness of the solution, a sum-to-zero constraint $∑k=1Kfk=0$ is usually employed. For any new input vector x, its label is estimated via a decision rule ŷ = arg maxk=1,2,…,K fk(x). Clearly, the argmax rule is equivalent to the sign function used in the binary case.

Various loss functions were proposed to extend the binary SVM to multiclass problems, such as Weston and Watkins (1999), Lee, Lin, and Wahba (2004), and Liu (2007). Here we focus on the notion of the 0–1 loss. Note that a point (x, y) is misclassified by f if y ≠ arg maxk fk(x), that is, if min g(f(x), y) ≤ 0, where

$g(f(x),y)={fy(x)−fk(x),k≠y}.$

The quantity min g(f(x), y) is known as the generalized functional margin and can be reduced to yf (x) in the binary case with y {±1} (Liu and Shen 2006). With the generalized functional margin, the 0–1 loss can be expressed as I(min g(f(x), y) ≤ 0). As in the binary case, one can replace the indicator function in the 0–1 loss by some other loss . Typically, to assure that a misclassified sample induces a larger loss than a correctly classified sample, the loss function is nonincreasing and satisfies that ′(0) < 0. Once the loss (·) is given, the decision vector can be obtained by solving the following regularization problem

$minfn−1∑i=1nℓ(ming(f(xi),yi))+λ∑k=1KJ(fk)subjectto∑k=1Kfk(x)=0.$
(3)

Motivated by Wang, Shen, and Liu (2008), we propose a new approach to estimate the class probabilities by solving a series of weighted multiclass problems and then combining multiple classification rules. In the article, we focus on the class of losses based on the functional margin (min g(f(X), Y)), as they provide a natural extension from two-class to multiclass problems. For the weighted learning, we assign a weight 0 ≤ πk ≤ 1 to samples from class k, k = 1, …, K, where π1 + ··· + πK = 1 to assure identifiability. Define the unit K-cube hyperplane as

$AK={(π1,…,πk):∑k=1Kπk=1,πk≥0,k=1,2,…,K}.$

For any given π AK, we can train a weighted hard classifier by minimizing the objective function using a weighted loss function

$minfn−1∑i=1nπyiℓ(ming(f(xi),yi))+λ∑k=1KJ(fk)subjectto∑k=1Kfk(x)=0.$
(4)

Compared with the binary case, extracting the probability information from the constructed classifiers becomes much more challenging for K > 2. In particular, instead of estimating only one probability function as in K = 2, we need to estimate multiple functions p1(x), …, pK−1(x) when K > 2. As a result, a substantially different formulation from the binary case is required for multiclass probability estimation.

In the binary case, the standard SVM is shown to be Fisher-consistent for estimating the Bayes classification rule $sign(p1(x)−12)$. To estimate conditional class probabilities, Wang, Shen, and Liu (2008)’s method requires that the weighted SVM in Equation (2) is Fisher-consistent for estimating weighted Bayes classification rule sign(p1(x) − π). To proceed with the multicategory probability estimation, we need to extend the definition of weighted Fisher-consistency. To construct a good probability estimate from the classification rules, we require that the loss function in Equation (4) is consistent in the following sense.

#### Definition 1

A functional margin based loss is called weighted Fisher-consistent for the weighted classification problem if the minimizer f* of E[πY (min g(f(X), Y))|X = x] satisfies

$argmaxk=1,…,Kfk∗(x)=argmaxk=1,…,Kπkpk(x),∀x,∀π∈AK.$

In a standard multiclass classification problem, the misclassification costs are all equal, i.e., C(Y, f(X)) = I(Yf(X)), and the Bayes rule minimizing E[C(Y, f (X))] is arg maxk=1,2,…,K pk(x). A loss is Fisher-consistent if the decision rule induced from f* = arg min E[ (min g(f(X), Y))|X = x] is the same as the Bayes rule, i.e., $argmaxk=1,…,Kfk∗(x)=argmaxk=1,…,Kpk(x)$ for all x. For a weighted learning problem, the weighted loss E[πY (min g(f(X), Y))] implies that unequal costs Cπ (Y, f (X)) = πY I(Yf (X)) are used for incorrect decisions. It is straightforward to show that the Bayes rule minimizing E[Cπ (Y, f (X)] is arg maxk=1,…,K πkpk(x). In this context, we say is weighted Fisher-consistent if arg maxk=1,…,K f*(x) = arg maxk=1,…,K πkpk(x) for all π and x. This is also known as classification calibrated (Bartlett, Jordan, and Mcauliffe 2006) and infinite-sample consistent (Zhang 2004). Therefore, the weighted Fisher-consistency can be regarded as an equivalent formulation of Fisher-consistency for weighted classification problems.

It turns out that not all functional margin based loss (min g(f(x), y)) satisfying ′ (0) < 0 is weighted Fisher-consistent for multicategory problems, as shown in the next proposition.

#### Proposition 1

Let (·) be a nonincreasing loss function satisfying ′ (0) < 0. For any given positive weights π AK, the minimizer f* of E[πY (min g(f(X), Y))|X = x] has the following properties:

1. If $maxk=1,…,Kπkpk∑k=1Kπkpk>1/2$, then $argmaxkfk∗=argmaxk=1,…,Kπkpk$.
2. If (·) is convex and $maxk=1,…,Kπkpk∑k=1Kπkpk≤1/2$, then f* = 0 is a minimizer.

Proposition 1 suggests that one sufficient condition for the weighted loss πy(min g(f(x), y)) to be weighted Fisher-consistent is $maxkπkpk∑kπkpk>1/2$, i.e., there exists a “dominating” class in the weighted sense. This condition is always satisfied for a binary problem except at the Bayes boundary {x: π1p1(x) = π2p2(x)}, but not for K > 2 as we require $maxk=1,…,Kπkpk∑k=1Kπkpk>1/2$ to hold for all π AK. When K > 2 and $maxk=1,…,Kπkpk∑k=1Kπkpk≤1/2$, f* = 0 can be a minimizer and consequently $argmaxk=1,…,Kfk∗(x)$ is not uniquely determined. As a result, the weighted loss πy(min g(f(x), y)) is not weighted Fisher-consistent in such cases. By Theorem 1, the weighted hinge loss πyH1(min g(f(x), y)) is not weighted Fisher-consistent.

Interestingly, although the weighted loss πy(min g(f(x), y)) may not be weighted Fisher-consistent, the corresponding truncated version can be weighted Fisher-consistent. Specifically, for any (·), we define its truncated loss at a location s ≤ 0 by

$ℓTs(·)=min(ℓ(·),ℓ(s)).$

The following theorem shows that the truncated loss Ts is weighted Fisher-consistent.

#### Theorem 1

Let (·) be a nonincreasing loss function satisfying ′ (0) < 0. Then a sufficient condition for the weighted truncated loss πyTs (min g(f(x), y)) with K > 2 and s ≤ 0 to be weighted Fisher-consistent for estimating arg maxj πjpj is that the truncation location s satisfies $sup{u:u≥−s≥0}ℓ(0)−ℓ(u)ℓ(s)−ℓ(0)≥K−1$. This condition is also necessary if (·) is convex.

#### Remark 1

As pointed out by one referee, the condition ′ (0) < 0 requires differentiability of at 0 and thus excludes nondifferentiable loss functions such as the ψ loss. In the following, we show how the condition can be relaxed for nondifferentiable losses. Note that ′ (0) is used to assure

$(∑k≠kπpπkpk(x))(−K)ℓ′(0)+πkπppkπp(x)Kℓ′(0)<0,$
(5)

when πkπp pkπp (x) > Σkkπp πkpk(x). If ′ (0) does not exist, the term in Equation (5) can be simply replaced by

$(∑k≠kπpπkpk(x))(−K)ℓ′(0−)+πkπppkπp(x)Kℓ′(0+)<0,$
(6)

where ′ (0) and ′ (0+) denote the left and right derivatives, respectively. Therefore, the condition ′ (0) < 0 can be relaxed as ′ (0+) < ′ (0) ≤ 0.

We now use two common loss examples to illustrate how to check the condition and find a proper truncating location s to assure the weighted Fisher-consistency of a truncated loss.

1. The hinge loss, (u) = [1 − u]+: In this case, $sup{u:u≥−s≥0}ℓ(0)−ℓ(u)ℓ(s)−ℓ(0)=1−0−s$ and the condition becomes $s∈[−1K−1,0]$ by noting that s ≤ 0.
2. The logistic loss, (u) = log(1 + eu): In this case, $sup{u:u≥−s≥0}ℓ(0)−ℓ(u)ℓ(s)−ℓ(0)=log2−0log(1+e−s)−log2$, which leads to condition s [− log(2K/(K−1) − 1), 0].

Both of these two loss functions satisfy ′ (0) < 0. Although they are not weighted Fisher-consistent themselves, they can become weighted Fisher-consistent after truncating them at s (with s satisfying the above condition).

Proposition 1 and Theorem 1 are weighted extensions of the results of Wu and Liu (2007). Furthermore, we note that the truncating location s given in Theorem 1 depends on the class number K. The larger K is the more truncation is needed to ensure Fisher consistency. This is due to the fact that the difficulty of no “dominating” class becomes more severe as K increases. The more truncation is the closer the truncated loss is to the 0–1 loss. For the hinge loss H1(u), exponential loss eu and logistic loss log(1 + eu), their truncated versions are guaranteed to be weighted Fisher-consistent for $s∈[−1K−1,0],[log(1−1K),0]$, and [− log(2K/(K−1) − 1), 0], respectively. Note that the ψ loss used in ψ-learning can be viewed as special examples of truncated loss functions (Shen et al. 2003; Liu and Shen 2006). Theoretically, different truncation may give a different performance. Empirically, the numerical examples in Wu and Liu (2007) indicated that minimum truncation appears to work better for the unweighted case. In this article we proceed with the minimum truncation required to achieve a weighted Fisher-consistency. By minimal truncation we mean truncation with the smallest s to make the corresponding truncated loss weighted Fisher-consistent: $s=−1K−1,log(1−1K)$, and − log(2K/(K−1) − 1) for the hinge loss, exponential loss, and logistic loss, respectively. In Figure 1 we plot these three truncated loss functions for K = 3 with the minimal truncation.

Plots of weighted Fisher-consistent truncated loss functions with minimal truncation for K = 3. The left, middle, and right panels correspond to the hinge, exponential, and logistic loss functions. A color version of this figure is available in the electronic ...

## 3. METHODOLOGY

In this section, we derive our methodology for multiclass probability estimation based on hard classifiers. In particular, we propose training a series of weighted classifiers and using them to construct the probability estimates. For demonstration, we focus on the hinge and truncated hinge loss functions. However, our estimation schemes are applicable to general large-margin classifiers.

### 3.1 Direct Scheme for Probability Recovery

Define the truncated hinge loss as

$HTs(u)=min(H1(s),H1(u)),$

where $s=−1K−1$ corresponds to the minimum truncation required by Theorem 1 to guarantee HTs to be weighted Fisher-consistent. Denote π as the solution of the π -weighted truncated-hinge-loss SVM, obtained by solving the following optimization problem

$minfn−1∑i=1nπyiHTs(ming(f(xi),yi))+λ∑k=1KJ(fk)subjectto∑k=1Kfk(x)=0,$
(7)

where π = (π1, …, πK ) AK. By Theorem 1, we have that $argmaxk=1,…,Kf^kπ$ converges to arg maxk=1,…,K πkpk as n → ∞ and λ → 0.

The following proposition gives a key result for estimating the probabilities for each x .

#### Proposition 2

For any given x satisfying mink pk(x) > 0, there exists a unique weight vector π̃ (x) = (π1(x), π2(x), …, πK (x)) AK such that

$π∼1(x)p1(x)=π∼2(x)p2(x)=⋯=π∼K(x)pK(x).$

Proposition 2 shows that for any x with mink pk(x) > 0, there is a unique weight vector so that the corresponding weighted probabilities π̃j(x)pj(x) are identical for all j. We call the point π̃ (x) AK as the border weight for x since the K weighted probabilities meet at this point.

Interestingly, the result in Proposition 2 can help us to estimate the conditional probabilities pk(x). In particular, for a given point x, using the property of weighted Fisher-consistency, the corresponding Bayes rule is arg maxk=1,…,K πkpk for any π AK. Then one can vary the weight vector π AK to search for the border weight. To illustrate this further, we consider a simple case of K = 3. In Figure 2, we plot the classification results of a particular point x for K = 3 when we change the weight vector. In this case, A3 is an equilateral triangle with the three vertices being (1, 0, 0), (0, 1, 0), and (0, 0, 1). Theoretically, for any x, the weighted Bayes rule arg maxk πkpk(x) assigns x to class k when πk is close to 1, and consequently, the whole region A3 can be divided into three subregions R1, R2, and R3 with Rk = {π A3 : k = arg maxj πjpj(x)} for k = 1, 2, 3. Since the vertex (1, 0, 0) represents imposing the weight 1 to points from class 1 and the weight zero to points from the other classes, the region R1 around (1, 0, 0) corresponds to the set of π with prediction arg maxk πkpk(x) = 1. The argument is similar for the other two vertices. Note that there is a special point in the center that borders all three subregions. This is the border weight satisfying π̃1(x)p1(x) = π̃2(x)p2(x) = π̃3(x)p3(x).

A plot of the weighted Bayes classification rule for all combinations of π for a certain point x when K = 3.

To estimate pj(x) it is enough to estimate π̃ (x) because, once the estimate of π̃ (x) is given, we can estimate pk(x) by the following proposition.

#### Proposition 3

For any given x , we assume that its associated border weight is estimated as $π∼^(x)$. Then its class probabilities can be estimated as

$p^k(x)=π∼^k(x)−1π∼^1(x)−1+π∼^2(x)−1+⋯+π∼^K(x)−1,k=1,…K.$

Propositions 2 and 3 suggest that identifying the border weight for each x is a key step to estimating the conditional probabilities pk(x) for k = 1, …, K. To that end, a general scheme is needed to search for π̃ AK for each x. Without loss of generality, we assume for the moment that the tuning parameter λ for Equation (7) is properly chosen. In the following, we outline the probability estimation scheme for general cases.

#### Direct Scheme

1. Define a fine grid of π within AK. Let the grid size be dπ. Any grid point π takes the form (m1dπ, m2dπ, …, mK dπ ) with nonnegative integers m1, m2, …, mK satisfying $∑k=1Kmkdπ=1$.
2. Solve Equation (7) over the above grid using the properly chosen tuning parameter λ.
3. Form all possible K-vertex polyhedrons of (side) length dπ using the available grid points. Here each K-vertex polyhedron corresponds to K adjacent grid points.
4. For any x , identify the K-vertex polyhedron such that its K vertices all belong to K distinct classes. The average of the coordinates corresponding to these vertices is defined as the estimate of the border weight π̃ (x) for x. The probability estimate can then be calculated using Proposition 3.

In the following, we demonstrate how the direct scheme works in the case of K = 3. To search for the border weight in Figure 2, we define a fine grid of π within the triangle as in Figure 3. Let the grid size be dπ with 1/dπ being an integer. To estimate probabilities at any point x, we need to identify some π such that its three neighboring combinations are of the form

Left: Classification rule over a grid of π for a point x for K = 3, where the circle ○ denotes being classified as class 1, the square □ denotes being classified as class 2, and the asterisk * denotes being classified as class ...

${π1(x)=(π1,π2,π3),π2(x)=(π1−dπ,π2+dπ,π3),π3(x)=(π1−dπ,π2,π3+dπ)},$
(8)

which classify Y into three distinct classes as shown on the left panel of Figure 3.

#### 3.1.1 Numerical Challenges in Implementing Direct Scheme

Now we provide some discussion on the numerical coherence of multiple decisions resulting from training multiple weighted classification problems. Let us start with the two-class problems. Assume that π and 1 − π are the costs for the negative and positive classes, then it is known that the minimizer π (x) of Equation (2) gives a consistent estimator of sign[P(Y = +1|X = x) − π]. When an increasing sequence of weights 0 = π1 < π2 < ···< πm+1 = 1 are used, we expect the decision sequence sign[πj (x)] to be monotonically changed for a fixed x due to their consistency properties. Though this is true in theory (or when n goes to infinity), the monotonic property of sign[π] may not always hold in finite sampling situations, mainly due to numerical variations. In this case, the probability p(x) can be estimated by taking the average of π* = min{πj : sign[πj (x)] = −1} and π* = max{πj : sign[πj (x)] = 1} (Wang, Shen, and Liu 2008).

For multiclass problems, a similar issue can occur even more frequently due to the increased complexity of the optimization problem. Take the three-class problem as an example. Each nonnegative weight vector is π = (π1, π2, π3) satisfying $∑k=13πk=1$. It is known that the minimization of Equation (5) satisfies that arg maxk k(x) = arg maxk πkpk(x) asymptotically. This suggests that the weight vectors change (partially) monotonically, the decision rule arg maxk k(x) should satisfy some constraints. For example, for a given x, if π = (π1, π2, π3) satisfies π1p1(x) > max(π2p2(x), π3p3(x)), then we have arg maxk k(x) = 1 asymptotically. Now if the weight is changed to $π′=(π1′,π2′,π3′)=(π1+d,π2−d,π3)$, the inequality $π1′p1(x)>max(π2′p2(x),π3′p3(x))$ still holds, implying that $argmaxkf^k′(x)=1$ asymptotically as well. Though this is true in theory, the relationship $argmaxkf^k(x)=argmaxkf^k′(x)$ does not necessarily hold in finite-sample results. Therefore, in practice, with a finite sample, those three neighboring combinations do not always take the form of Equation (8). Other variations are possible. For example, it can also be of the form

${π1(x)=(π1,π2,π3),π2(x)=(π1−dπ,π2,π3+dπ),π3(x)=(π1,π2−dπ,π3+dπ)},$

as shown on the right panel of Figure 3. This corresponds to the monotonicity violation in the binary case as discussed in the first paragraph of section 2.2 of Wang, Shen, and Liu (2008). Our selection criterion is to select three neighboring combinations corresponding to three distinct classes. The average (π1(x) + π2(x) + π3(x))/3 of these three neighboring combinations, denoted by (x) = (1(x), 2(x), 3(x)), serves as our estimate of the border weight. Using the estimated border weight (x), our estimate is given by

$p^k(x)=π^k(x)−1π^1(x)−1+π^2(x)−1+π^3(x)−1.$

For the finite-sample case, it is possible to have more than one possibility of three neighboring combinations corresponding to three distinct classes. Each possibility leads to one estimated border weight. When this happens, averaging all the estimated border weights is necessary to proceed our probability estimation. The nonuniqueness of border weights encountered in practice adds some challenges in the implementation of the direct scheme. As the number of classes gets larger, this may become more severe. Furthermore, the border weights are identified through counting multiple decisions. The process is discrete and tends to be slow and unstable. These challenges motivate us to develop another scheme that is more continuous and of high stability.

### 3.2 Indirect Scheme for Probability Estimation

In this section, we provide an alternative scheme to recover probabilities. Instead of directly targeting the probabilities as the direct scheme does, the new scheme estimates some continuous functions of probabilities, which can be easier to estimate, and then inverts those functions to recover probabilities.

Note that the total volume (or area when K = 3) of AK is given by

$∫01dπ1∫01−π1dπ2⋯∫01−π1−⋯−πK−2dπK−1K=K(K−1)!.$

The collection of π representing class k is given by Rk = {π : πkpkπjpj for jk}, which can be represented as

$Rk=∪rj1j2⋯jK−1k,$

where $rj1j2⋯jK−1k={π:πj1pj1≤πj2pj2≤⋯≤πjK−1pjK−1≤πkpk}$ and the union is over all permutations (j1, j2, …, jK−1) of {1, 2, …, K} \ k. When K = 3, Figure 4 demonstrates how A3 is partitioned into different parts using the notation $rj1j2⋯jK−1k$ corresponding to Figure 2. For permutation (j1, j2, …, jK−1), the volume (or area) of $rj1j2⋯jK−1k$ is given by

Demonstration of the partition of A3.

$K∫0Bj1dπj1∫pj1πj1/pj2Bj2dπj2⋯∫pjK−2πjK−2/pjK−1BjK−1dπjK−1,$

where $Bj1=(1−∑m=1k−1πjm)(1/pji)/((1/pk)+∑m=iK−1(1/pjm))$; i = 1, …, K − 1 with the convention $∑m=10πjm=0$. We can then sum over the area (volume) according to all permutations (j1, j2, …, jK−1) of {1, 2, …, K} \ k to give a formula for the volume (or area) of Rk.

Naturally the proportion of grid points of π leading to prediction of k, denoted by propk, estimates the volume (or area) ratio of $Volume(Rk)Volume(AK)$, which is a function of p1, p2, …, pK and denoted by hk(p1, p2, …, pK). Then by solving the equation system of K equations

$propk=hk(p1,p2,…,pK);k=1,2,…,K,$
(9)

we can obtain the estimated probabilities. In particular when K = 3, area of A3 is $3/2$ and area of R3 is $32p32(p2p3+p1p3+2p1p2)/((p1+p3)(p2+p3)(p1p2+p1p3+p2p3))$. Consequently $h3(p1,p2,p3)=p32(p2p3+p1p3+2p1p2)/((p1+p3)(p2+p3)(p1p2+p1p3+p2p3))$.

The indirect scheme can be summarized as follows:

#### Indirect Scheme

1. Same as those of the Direct Scheme.
2. For any x , calculate the grid percentage propk for k = 1, 2, …, K.
3. Solve the equation system in Equation (9) to recover the estimation of (p1(x), p2(x), …, pK (x)).

In Section 5, we will illustrate the performance of both schemes. Our empirical results suggest that the indirect scheme is indeed faster and more accurate.

### 3.3 Theoretical Properties

The next theorem establishes the consistency of our class probability estimation.

#### Theorem 2

For any nonincreasing loss function (·) with ′(0) < 0, if the truncation location s is chosen such that $sup{u:u≥−s≥0}ℓ(0)−ℓ(u)ℓ(s)−ℓ(0)≥K−1$. When λ → 0 and the grid size dπ → 0 as n → ∞, our estimate k(x) based on the truncated loss Ts is asymptotically consistent, i.e., k(x) → pk(x) for k = 1, 2, …, K as n → ∞.

The consistency result in Theorem 2 provides theoretical justification of our proposed method. It can be straightforward to extend the consistency to our indirect probability recovery scheme as Theorem 1 implies that propk is consistent for estimating hk(p1, p2, …, pK ) and inversion will inherit the consistency. Although our probability estimation method is model-free, it converges to the true probability asymptotically. As shown in our simulation studies in Section 5, our method indeed provides competitive probability estimation compared to several other existing techniques.

## 4. COMPUTATION ALGORITHMS

As shown on the right panel of Figure 5, the function HTs(·) is not convex, thus solving Equation (7) involves a nonconvex minimization problem. However, we note that HTs( u) can be decomposed as the difference of two convex functions,

The left, middle, and right panels display functions H1(u), Hs(u), and HTs(u), respectively. A color version of this figure is available in the electronic version of this article.

$HTs(u)=min(H1(u),H1(s))=H1(u)−Hs(u),$

where Hs(u) = (su)+. Figure 5 displays the three functions H1(u), Hs(u), and HTs(u).

Using this property of the truncated hinge loss function, we apply the difference convex (d.c.) algorithm (An and Tao 1997; Liu, Shen, and Doss 2005; Wu and Liu 2007) to solve the non-convex optimization problem of the weighted truncated-hinge-loss SVM. The d.c. algorithm solves the nonconvex minimization problem via minimizing a sequence of convex subproblems (see Algorithm 1). We derive the d.c. algorithm for linear learning in Section 4.1 and then generalize it to the case of nonlinear learning via kernel mapping in Section 4.2.

#### Algorithm 1

[The Difference Convex Algorithm for minimizing Q(Θ) = Qvex(Θ) + Qcav(Θ)].

1. Initialize Θ0.
2. Repeat $Θt=1=argminΘ(Qvex(Θ)+〈Qcav′(Θt),Θ−Θt〉)$ until convergence of Θt.

### 4.1 Linear Learning

Let $fk(x)=wkTx+bk$; wk d, bk , and b = (b1, b2, …, bk)T K, where wk = (w1k, w2k, …, wdk)T and W = (w1, w2, …, wK ). With = HTs, Equation (7) becomes

$minW,b12∑k=1K||wk||22+C∑i=1nπyiHTs(ming(f(xi),yi))subjectto∑k=1Kwjk=0;j=1,2,…,d;∑k=1Kbk=0,$
(10)

where the constraints are adopted to avoid the nonidentifiability issue of the solution. Note that Equation (10) is equivalent to the other representation of Equation (4) by setting C = 1/λ. Thus we will use them interchangeably.

Denote Θ as (W, b). Applying the fact that HTs = H1Hs, the objective function in Equation (10) can be decomposed as

$Qs(Θ)=12∑k=1K||wk||22+C∑i=1nπyiH1(ming(f(xi),yi))−C∑i=1nπyiHs(ming(f(xi),yi))=Qvexs(Θ)+Qcavs(Θ),$

where

$Qvexs(Θ)=12∑k=1K||wk||22+C∑i=1nπyiH1(ming(f(xi),yi))$

and

$Qcavs(Θ)=−C∑i=1nπyiHs(ming(f(xi),yi)),$

denote the convex and concave parts, respectively.

Define

$βik={Cπyiifk=argmax(fk′t:k′≠yi),fyit−fkt

where $ft=(f1t(·),f2t,…,fKt)T$ denotes the solution at the tth iteration. It is shown in the Appendix that the dual problem of the convex optimization at the (t + 1)th iteration, given the solution ft at the tth iteration, is as follows

$minα12∑k=1K‖∑i:yi=k∑k′≠yi(αik′−βik′)xiT−∑i:yi≠k(αik−βik)xiT‖22−∑i=1n∑k′≠yiαik′subjectto∑i:yi=k∑k′≠yi(αik′−βik′)−∑i:yi≠k(αik−βik)=0,k=1,2,…,K,0≤∑j≠yiαij≤Cπyi,i=1,2,…,n,αik≥0,i=1,2,…,n;k≠yi.$

This dual problem is a quadratic programming (QP) problem similar to that of the standard SVM and can be solved by many optimization software. Once the solution is obtained, the coefficients wk’s can be recovered as follows,

$wk=∑i:yi=k∑k′≠yi(αik′−βik′)xi−∑i:yi≠k(αik−βik)xi.$
(11)

It is interesting to note that representation of wk’s given in Equation (11) automatically satisfies that $∑k=1Kwjk=0$ for each 1 ≤ jd. Moreover, we can see that coefficients wk’s are determined only by those data points whose corresponding αikβik is not zero for some 1 ≤ kK and these data points are the SV’s of the weighted truncated-hinge-loss SVM. The set of SV’s of the weighted truncated-hinge-loss SVM using the d.c. algorithm is only a subset of the set of SV’s of the original weighted SVM. Basically the weighted truncated-hinge-loss SVM tries to remove points satisfying $fyit−fkt with $k=argmax(fk′t:k′≠yi)$ from the original set of SV’s, and consequently, eliminate the effects of outliers. This provides an intuitive algorithmic explanation of the robustness of the weighted truncated-hinge-loss SVM to outliers. A similar conclusion was provided by Wu and Liu (2007) for the unweighted version.

After the solution of W is derived, b can be obtained via solving either a sequence of KKT conditions as used in the standard SVM or a linear programming (LP) problem. Denote $f∼k(xi)=xiTwk$. Then b can be obtained through the following LP problem:

$minη,bC∑i=1nπyiηi+∑k=1K(∑i:yi=k∑k′≠yiβik′−∑i:yi=kβik)bksubjecttoηi≥0,i=1,2,…,n,ηi≥1−(f∼yi(xi)+byi)+f∼k(xi)+bk,i=1,2,…,n;k≠yi,∑k=1Kbk=0.$

### 4.2 Nonlinear Learning

For nonlinear learning, each decision function fk(x) is represented by hk(x) + bk with hk(x) , where is a reproducing kernel Hilbert space (RKHS). Here the kernel R(·, ·) is a positive definite function mapping from × to . Due to the representer theorem of Kimeldorf and Wahba (1971) (also see Wahba 1999), the nonlinear problem can be reduced to finding finite dimensional coefficients vik’s and hk(x) can be represented as $∑i=1nR(x,xi)vik$; k = 1,2, …, K.

Denote vk = (v1k, v2k, …, vnk)T, V = (v1, v2, …, vK ), and R to be an n × n matrix whose (i1, i2) entry is R(xi1, xi2). Let Ri be the ith column of R and denote the standard basis of the n-dimensional space by ei = (0, 0, …, 1, …, 0)T with 1 for its ith component and 0 for other components.

A similar derivation as in the linear case leads to the following dual problem for nonlinear learning

$minα12∑k=1K〈∑i:yi=k∑k′≠yi(αik′−βik′)Ri−∑i:yi≠k(αik−βik)Ri,∑i:yi=j∑k′≠yi(αik′−βik′)ei−∑i:yi≠k(αik′−βik)ei〉−∑i=1n∑k′≠yiαik′subjectto∑i:yi=k∑k′≠yi(αik′−βik′)−∑i:yi≠k(αik−βik)=0,k=1,2,…,K,0≤∑k≠yiαij≤Cπyi,i=1,2,…,n,αik≥0,i=1,2,…,n;k≠yi,$

where βik’s are defined similarly as in the linear case. After solving the above QP problem, we can recover the coefficients vk’s as follows

$vk=∑i:yi=k∑k′≠yi(αik′−βik′)ei−∑i:yi≠k(αik−βik)ei.$

The intercepts bk’s can be solved using LP as in the linear learning.

### 4.3 Parameter Tuning

So far we assume that we selected the optimal tuning parameter λ. In practice, the tuning parameter selection can be done using an independent validation set or cross-validation. In this article, we choose to select the parameter using an independent set of size ñ. Theoretically speaking, the larger ñ is the better the tuning effect and the better classifier we may obtain. This is related to Shao (1993)’s results on cross-validation in the context of linear model selection: the proportion of tuning set size over the size of all available data points, namely ñ/(n + ñ), should go to 1 as (n + ñ) → ∞ to ensure the asymptotically correct selection. However, how to split the dataset into the training part and the tuning part is always a trade-off between model training and parameter tuning since a large training set is also desired for better model fitting. A commonly accepted procedure is to use one half for training and the other half for tuning, i.e., n = ñ.

Now we detail the approach using an independent tuning set of size ñ. We first obtain probability estimates $p^j(λm)(x∼i)$, j = 1, 2, …, k for any i in the tuning set {(i, i): i = 1, 2, …, ñ} over a grid {λ1, λ2, …, λM} of the tuning parameter. Then we can evaluate the log-likelihood $L(λm)=∑i=1n∼log(p^yi(λm)(x∼i))$ of the tuning set for each λm. Let = arg maxm L(λm). The optimal tuning parameter is selected to be λ.

## 5. SIMULATIONS

In this section we use four simulation examples to illustrate the methodological power of our new multiclass probability estimation scheme by comparing it to some existing methods. We consider five alternative methods: cumulative logit model (CLM), baseline logit model (BLM), kernel multicategory logistic regression (KMLR), classification tree (TREE), and random forest (RF). Both CLM and BLM make certain assumptions on the forms of the transformed probabilities. In particular, the CLM assumes that $log∑i=1kpi(x)1−∑i=1kpi(x)=βk0+xTβk$ for k = 1, 2, …, K − 1 while the BLM assumes that $logpk(x)pK(x)=βk0+xTβk$ for k = 1, 2, …, K − 1. KMLR refers to the one proposed by Zhu and Hastie (2005) with Gaussian kernel $R(x1,x2)=e−||x1−x2||22/σ2$. Ten separate datasets are generated to tune the data width parameter σ among a grid of {1/4, 1/2, 3/4, 1, 5/4, 3/2}σm, where σm is the median pairwise Euclidean distance defined as median{||xixj||: yiyj}. Among these methods, CLM and BLM are essentially parametric models while our methods, KMLR, TREE, and RF are nonparametric. Denote the size of the training set by n. Fivefold cross-validation is used to select the tuning parameter. For the TREE-based method, we use the R package “Tree” and its build-in cross-validation function is used to prune trees with fold number set as 10. Similarly we use the build-in tuning for RF provided in the R package.

In simulations, the true conditional probability functions pk(·), k = 1, 2, …, K are known. To measure the estimation accuracy of the conditional probabilities, we use various scores evaluated on the testing set (of size 10n):

• 1-norm error $110n∑i=110n∑k=1K∣p^k(x¯i)−pk(x¯i)∣$
• 2-norm error $110n∑i=110n∑k=1K(p^k(x¯i)−pk(x¯i))2$
• Empirical generalized Kullback–Leibler (EGKL) loss $110n∑i=110n∑k=1Kpk(x¯i)logpk(x¯i)p^k(x¯i)$.

Here i denotes the predictor vector of the ith observation in the testing set. The average errors over 100 replications and the corresponding standard deviations (in parentheses) are reported. Whenever appropriate, our method is employed with minimal truncation with s = −1/(K − 1). We implement our method using linear learning for Examples 1, 3, and 4 while using the Gaussian kernel $R(x1,x2)=e−||x1−x2||22/σ2$ for Example 2. The grid size dπ is chosen to be 0.02 for our three-class examples and 0.05 for the five-class example. In addition to tuning parameter λ and truncation location s, different grid size gives different performance. See Table 5 in Section 5.2 for the effect of different grid sizes in the discussion following these numerical examples.

Probability estimation errors on the test set for Example 1 with different dπ

### 5.1 Numerical Examples

#### Example 1

We consider a three-class linear learning example. Our data are generated in two steps: (1) Y is uniformly distributed over {1, 2, 3}; (2) Conditional on Y = y, the two-dimension predictor X is generated from N(μ(y), Σ), where μ(y) = (cos(2/3), sin(2/3))T and Σ = 0.72I2 with I2 being the 2 × 2 identity matrix. The sample size n is 400. Table 1 reports the average test errors and the corresponding standard deviations (in parentheses) over 100 replications for various methods. Note that in this example, the BLM specifies the correct parametric model and hence fits the true (oracle) model, while the CLM corresponds to a model misspecification. A tuning of σ in Gaussian kernel for the KMLR selects σm as the best. As shown in Table 1, the oracle BLM performs the best while the CLM performs the worst. Except for the oracle BLM, our method with either the direct probability recovery scheme or the indirect probability recovery scheme consistently outperforms all the other methods with significant improvement. Between these two different probability recovery schemes the indirect scheme works much better. Hence in our later examples, we will only report results of our new method with the indirect scheme. Here the TREE-based methods (TREE or RF) lead to infinity (denoted by Inf in Table 1) for EGKL because it returns zero probability for some point x and some classes. The corresponding standard deviation does not make sense and we denote by NaN, which stands for Not A Number. This is one property of TREE-type methods.

Probability estimation errors on the test set for Example 1

#### Example 2

In this example, we study a three-class nonlinear example. For any x = (x1, x2)T, define $f1(x)=−x1+0.1x12−0.05x22+0.1,f2(x)=−0.2x12+0.1x22−0.2$, and $f3(x)=x1+0.1x12−0.05x22+0.1$. Set $pk(x)=P(Y=k∣X=x)=exp(fk(x))/(∑m=13exp(fm(x)))$ for k = 1, 2, 3. Each pair of data point (x, y) is generated in two steps: we first generate x1 ~ Uniform[−3, 3] and x2 ~ Uniform[−6, 6]; conditional on X = x, the class response Y takes value k with probability pk(x) for k = 1, 2, 3. The sample size is chosen to be n = 100. A similar example was previously used by Zhang et al. (2008).

In this example, we consider basis expansion for the parametric methods CLM and BLM by also including the quadratic terms $x12$ and $x22$. Consequently, the BLM is again the oracle model. Results over 100 repetitions in the same format of Example 1 are reported in Table 2. Column Indirect corresponds to our method with the indirect probability recovery scheme. The tuning of σ in Gaussian kernel selects 5σm/4 and σm/2 as the best for KMLR and our new method, respectively. Similar to Example 1, we again observe that the new method gives smaller errors than all the other methods except RF for the 1-norm error and the oracle.

Probability estimation errors on the test set for Example 2

#### Example 3

In Examples 1 and 2 the BLM takes the true model form, so it is not surprising that the BLM shows better performance than our method. In this example, we design an experiment so that none of the parametric methods corresponds to the oracle. This will provide a fair comparison between them.

The two-dimensional predictor X is uniformly distributed over the disc { $x:x12+x22≤100$}. Define functions $h1(x)=−5x13+5x2,h2(x)=−5x13−5x2$, and h3(x) = 0. Apply a transformation fk(x) = Φ1(T2(hk(x))), where Φ(·) and T2(·) are the cumulative distribution functions of the standard normal distribution and t distribution with degrees of freedom 2, respectively. We set probabilities $pk(x)=P(Y=k∣X=x)=exp(fk(x))/(∑j=13exp(fj(x)))$ for k = 1, 2, 3 as in Example 2. Because of the nonlinear transformation Φ1(T2(·)), BLM is no longer the oracle model. Our multiclass probability with linear kernel is not the oracle model either. The training set size is n = 600. The tuning of KMLR selects σ = σm/4 as the data width parameter. Table 3 shows clearly that our method is consistently better than the BLM and performs best among all the approaches under comparison.

Probability estimation errors on the test set for Example 3

#### Example 4

In this five-class example, the data are generated similarly as in Example 1. Response Y is uniformly distributed over {1, 2, 3, 4, 5}. Conditional on Y = y, the two-dimensional predictor X is generated from N(μ(y), Σ), where μ(y) = (cos(2/5), sin(2/5))T and Σ = 0.72I2. The sample size n is 1000. The tuning of KMLR selects 5σm/4 as the best. Simulation results are reported in Table 4. A similar improvement is observed for our new method.

Probability estimation errors on the test set for Example 4

Among the six procedures considered above, BLM and CLM are parametric methods, while our method, KMLR, TREE, and RF are nonparametric procedures that do not make explicit assumptions on the form of the true probability functions. Our simulated results suggest that, if the parametric assumption is correct, then the associated parametric estimator is essentially the oracle and performs the best among all. This explains why BLM gives the smallest errors in Examples 1, 2, and 4. However, if the parametric assumption is incorrect, then the parametric estimators can perform poorly, as shown for the BLM in Example 3 and the CLM in all the settings. By contrast, model-free methods do not rely on the model assumption and show more robust performance. For complicated problems, some of the nonparametric methods can outperform the parametric ones. As shown in Example 3, our method and RF are the top two performers. Furthermore, it is noticed that our method performs competitively among the three nonparametric procedures.

In practice, sometimes it is difficult to determine or validate the parametric assumption on the function forms, especially when data are complicated or high-dimensional, then a good nonparametric procedure will provide a useful alternative tool for estimating multiclass probabilities.

### 5.2 Empirical Computation Cost

The total computation cost of the proposed procedure is mainly determined by three factors: the computation cost of solving one weighted optimization problem, the number of optimization problems corresponding to different weight vectors, and the scheme for recovering probabilities from multiple decision rules. As shown in the article, each optimization problem involves a nonconvex minimization problem and the proposed DCA-based algorithm seems quite efficient. For example, it takes 0.4827, 5.4086, 0.5118, and 2.3549 s, on average, to solve an individual optimization problem for Examples 1, 2, 3, and 4, respectively. Since Example 2 deals with more complicated nonlinear classification problems it takes a little longer.

The second factor is controlled by the size of dπ. In a three-class problem, if dπ = 0.02, we need to solve 1176 optimization problems; and if dπ = 0.1, we only need to solve 24 problems. The effects of dπ are important: the smaller dπ is the better the estimation result will be. On the other hand, this accuracy gain is obtained at the cost of computational time. To illustrate the effects of dπ on the procedure, we now present the performance of our procedure in Example 1 with different values of dπ = 0.1, 0.04, 0.02, 0.01 used.

From Table 5, it is clear that smaller dπ values, in general, lead to better accuracy in probability estimation. However, this accuracy gain levels off as dπ becomes very small. For example, the accuracy improvement from dπ = 0.1 to dπ = 0.04 is substantial, but the difference among dπ = 0.04, 0.02, 0.01 is quite small. It is worth pointing out that the computational time grows faster as dπ gets smaller. For example, the computation time for dπ = 0.01 is about 25-folds of that for dπ = 0.04 and about four-folds of that for dπ = 0.02. So there is a trade-off between computational cost and estimation accuracy when choosing dπ. In our simulations, we find dπ = 0.02 works pretty well in various three-class problem settings.

To recover the probabilities, we propose two schemes in the article. The numerical results suggest that the indirect scheme is faster and produces better estimation accuracy. In practice, we recommend using the indirect scheme.

To conclude our simulation studies, we plot in Figure 6 a randomly chosen training set from each example to show what our training data look like. Note that the sample size is 1000 for Example 4. However to make Figure 6 look nicer, we only use a random sample of size 200 while plotting the right bottom panel for Example 4.

Plots of a randomly selected training set from each simulation example. The solid lines indicate Bayes boundaries for un-weighted classification.

## 6. REAL DATA

In this section, we apply our new multiclass probability estimation scheme to the wine data by comparing it to those four alternative methods considered in the previous section. The wine data are available online at the University of California, Irvine (UCI) Machine Learning Repository by following the URL http://archive.ics.uci.edu/ml/datasets/Wine. In addition to the categorical response variable Wine Type, it has 13 attributes available. They are Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. All these 13 attributes are continuous. Before applying any probability estimation scheme, we standardize each attribute to have mean zero and standard deviation one. Wine Type belongs to one of three classes with class distribution of 59 in class 1, 71 in class 2, and 48 in class 3. The total number of observations is n = 178. We randomly select 19 observations from class 1, 23 from class 2, and 16 from class 3 to be set aside as the testing set. The remaining 120 observations are used as the training dataset. We randomly divide those 120 observations in the training set into eight folds with each fold containing five observations in class 1, six in class 2, and four in class 3 so that an eight-fold cross-validation is used to select any tuning parameter over a grid if necessary. The tuning selects σm as the best data width parameter for KMLR. For simplicity, our method is implemented with the linear kernel.

For any estimate j(·), we define its log-likelihood over the testing set as $∑i=158logp^yi(xi)$, where (xi, yi), i = 1, 2, …, 58 denotes observations of the testing set. The corresponding test error is defined by $∑i=158I(yi≠argmaxjp^j(xi))$, where I(·) is the indicator function taking value 1 if its argument is true and 0 otherwise. We report both the log-likelihood of the testing set and the test error in Table 6 for all five methods we include for comparison. The same reason as in the simulation examples applies to why both CLM and TREE lead to negative infinity. According to Table 6, our new method with linear learning performs competitively in terms of either the log-likelihood of the testing set or the test error.

Results of Wine Example

## 7. CONCLUSION

In this work, we propose a model-free multiclass probability estimation approach. It is achieved by solving a series of weighted hard classification problems and then combining these decision rules to construct the probability estimates. Both theoretical and numerical results are provided to demonstrate the competitive performance of our estimation procedure. Our numerical results show favorable performance of our new probability estimation procedure in comparison to several other existing approaches.

Our probability estimation procedure requires computation of weighted classifiers over a fine grid of the K-vertex polyhedron. The computational cost can be high when the class number K gets large. To further improve the computational efficiency, one possible solution is to investigate an efficient solution path over the grid. Further investigation is needed.

## Acknowledgments

The authors thank the editor, the associate editor, and two referees for their helpful suggestions that led to significant improvement of the article. The authors are supported in part by NSF grants DMS-0905561 (Wu), DMS-0645293 (Zhang), DMS-0747575 (Liu), and DMS-0606577 (Liu), and NIH/NCI grants R01-CA-085848 (Zhang) and R01-CA-149569 (Liu and Wu).

## APPENDIX

#### Proofs of Proposition 1 and Theorem 1

For any x and any π AK, define $p∼k=πkpk(x)/(∑m=1Kπkpk(x))$. Then k satisfies k ≥ 0 and $∑k=1Kp∼k=1$. Thus, we can treat k as a new conditional class probability, and as a result, Fisher-consistency implies weighted Fisher-consistency in that k covers all possibilities as we vary x in the whole domain. Proofs of Proposition 1 and Theorem 1 will be parallel to the proofs of proposition 2 and theorem 1 of Wu and Liu (2007). Thus, we skip them to save space.

#### Proof of Proposition 2

The unique vector π̃(x) is given by (π̃1(x), π̃2(x), …, π̃K (x)) with $π∼k(x)=1/pk(x)/(∑m=1K1/pm(x))$.

#### Proof of Proposition 3

The result is straightforward by Proposition 2.

#### Proof of Theorem 2

Note that, as n → ∞ and λ → ∞, the penalty consequently does not contribute to the objective function. Thus asymptotically we are solving

$minf1n∑i=1nπyiℓTs(ming(f(xi),yi))subjectto∑k=1Kfk(x)=0,$

for each π.

Note first that, for any x with positive probability density within a small neighborhood B(x, r) = {: ||x|| ≤ r} of radius r > 0 [i.e., PX() > 0 for any B(x, r)], the average of πyiTs(min g(f(xi), yi)) over xi B(x, r) converges to E(πY Ts(min g(f(X), Y))|X = x) as n → ∞ and the radius r shrinks to zero. Thus, by Theorem 1, it is guaranteed that there exists a set of neighboring π1(x) = (π1(x), π2(x), …, πK (x)), π2(x) = (π1(x) − dπ, π2(x) + dπ, …, πK (x)), …, πK (x) = (π1(x) − dπ, π2(x), …, πK (x) + dπ) such that the weighted truncated large-margin classifiers with π1(x), π2(x), …, and πK (x) classify x to class 1, 2, …, and K, respectively, for some π(x) = (π1(x), π2(x), …, πK (x)). Consequently our $π^(x)=∑k=1Kπk(x)/K$ is well defined for any x.

For any π = (π1, π2, …, πK ), $||π||1=∑k=1K∣πk∣$ denotes its 1-norm. Next, we prove consistency by contradiction. If there exists an x such that (x) does not converge to π̃(x) satisfying π̃k(x)pk(x) = π̃k(x)pk(x) for 1 ≤ kk′ ≤ K, then, as dπ → 0 and n → ∞, the classification rules for π1(x), π2(x), …, πK (x) do not union to the set {1, 2, …, K} due to the consistency established in Theorem 1 and the fact that $maxk=1K||πk(x)−π^(x)||1=dπ/K→0$. This violates our criterion on selecting π1(x), π2(x), …, πK (x). As a result (x) → π̃(x) for any x, which in turn implies that k(x) → pk(x) for k = 1, 2, …, K for any x.

#### Derivation of the Dual Problem in Section 4.1

Note that $∂∂wkQcavs(Θ)$ and $∂∂bkQcavs(Θ)$ can be written, respectively, as follows

$−C[∑i:yi=kπyi(−I{ming(f(xi),yi)

where I{A} = 1 if event A is true and 0 otherwise.

Using the definition of βij, we have

$∂∂wkQcavs(Θ)=∑i:yi=k(∑k′≠yiβik′)xiT−∑i:yi≠kβikxiT$

and

$∂∂bkQcavs(Θ)=∑i:yi=k(∑k′≠yiβik′)−∑i:yi≠kβik.$

Applying the first-order approximation to the concave part, the objective function at step (t + 1) becomes

$Qs(Θ)=12∑k=1K||wk||22+C∑i=1nπyiH1(ming(f(xi),yi))+∑k=1K〈∂∂wkQcavs(Θt),wk〉+∑k=1Kbk∂∂bkQcavs(Θt),$

where Θt is the current solution.

Using slack variable ξi’s for the hinge loss function, the optimization problem at step (t + 1) becomes

$minW,b,ξ12∑k=1K||wk||22+C∑i=1nπyiξi+∑k=1K〈∂∂wkQcavs(Θt),wk〉+∑k=1Kbk∂∂bkQcavs(Θt)subjecttoξi≥0,i=1,2,…,n,ξi≥1−[xiTwyi+byi]+[xiTwk+bk],i=1,2,…,n;k≠yi.$

The corresponding Lagrangian is

$L(W,b,ξ)=12∑k=1K||wk||22+C∑i=1nπyiξi−∑i=1nuiξi−∑i=1n∑k′≠yiαik′(xiTwyi+byi−xiTwk′−bk′+ξi−1)+∑k=1K〈∂∂wkQcavs(Θt),wk〉+∑k=1Kbk∂∂bkQcavs(Θt),$
(A.1)

subject to

$∂∂wkL=wkT−[∑i:yi=k∑k′≠yi(αik′−βik′)xiT−∑i:yi≠k(αik−βik)xiT]=0,$
(A.2)

$∂∂bkL=−[∑i:yi=k∑k′≠yi(αik′−βik′)−∑i:yi≠k(αik−βik)]=0,$
(A.3)

$∂∂ξiL=Cπyi−ui−∑k≠yiαik=0,$
(A.4)

where the Lagrangian multipliers are ui ≥ 0 and αik ≥ 0 for any i = 1, 2, …, n, k′ ≠ yi. Substituting Equations (A.2) through (A.4) into Equation (A.1) yields the desired dual problem in Section 4.1.

## References

• Agresti A, Coull B. Approximate Is Better Than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician. 1998;52:119–126.
• An LTH, Tao PD. Solving a Class of Linearly Constrained Indefinite Quadratic Problems by d.c. Algorithms. Journal of Global Optimization. 1997;11:253–285.
• Bartlett PL, Jordan MI, Mcauliffe JD. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association. 2006;101:138–156.
• Cortes C, Vapnik V. Support Vector Networks. Machine Learning. 1995;20:273–297.
• Kimeldorf G, Wahba G. Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95.
• Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association. 2004;99:67–81.
• Lin Y. Support Vector Machines and the Bayes Rule in Classification. Data Minning and Knoweldge Discovery. 2002;6:259–275.
• Liu Y. Fisher Consistency of Multicategory Support Vector Machines. Eleventh International Conference on Artificial Intelligence and Statistics. 2007. pp. 289–296. Available at http://www.stat.umn.edu/~aistat/proceedings/start.htm.
• Liu Y, Shen X. Multicategory ψ-Learning. Journal of the American Statistical Association. 2006;101:500–509.
• Liu Y, Shen X, Doss H. Multicategory ψ-Learning and Support Vector Machine: Computational Tools. Journal of Computational and Graphical Statistics. 2005;14:219–236.
• Shao J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association. 1993;88:486–494.
• Shen X, Tseng G, Zhang X, Wong W. On ψ-Learning. Journal of the American Statistical Association. 2003;98:724–734.
• Vapnik V. Statistical Learning Theory. New York: Wiley; 1998.
• Wahba G. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In: Schoelkopf B, Burges C, Smola A, editors. Advances in Kernel Methods Support Vector Learning. Cambridge, MA: MIT Press; 1999. pp. 69–88.
• Wang J, Shen X, Liu Y. Probability Estimation for Large Margin Classifiers. Biometrika. 2008;95:149–167.
• Weston J, Watkins C. Support Vector Machines for Multi-Class Pattern Recognition. In: Verleysen M, editor. Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99) Bruges; Belgium: 1999. pp. 219–224. Available at http://www.informatik.uni-trier.de/~ley/db/conf/esann/esann1999.html.
• Wu TF, Lin CJ, Weng RC. Probability Estimates for Multi-Class Classification by Pairwise Coupling. Journal of Machine Learning Research. 2004;5:975–1005.
• Wu Y, Liu Y. Robust Truncated-Hinge-Loss Support Vector Machines. Journal of the American Statistical Association. 2007;102:974–983.
• Zhang HH, Liu Y, Wu Y, Zhu J. Variable Selection for the Multicategory SVM via Adaptive Sup-Norm Regularization. Eletronic Journal of Statistics. 2008;2:149–167.
• Zhang T. Statistical Analysis of Some Multi-Category Large Margin Classification Methods. Journal of Machine Learning Research. 2004;5:1225–1251.
• Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205.

 PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers.