- Abstract
- 1. INTRODUCTION
- 2. WEIGHTED CLASSIFICATION AND FISHER CONSISTENCY
- 3. METHODOLOGY
- 4. COMPUTATION ALGORITHMS
- 5. SIMULATIONS
- 6. REAL DATA
- 7. CONCLUSION
- References

J Am Stat Assoc. Author manuscript; available in PMC 2010 November 24.

Published in final edited form as:

J Am Stat Assoc. 2010 March 1; 105(489): 424–436.

doi: 10.1198/jasa.2010.tm09107
PMCID: PMC2990887

NIHMSID: NIHMS250286

Yichao Wu, Assistant Professor, Hao Helen Zhang, Associate Professor, and Yufeng Liu, Associate Professor

Yichao Wu, Hao Helen Zhang: Department of Statistics, North Carolina State University, Raleigh, NC 27695. Yufeng Liu: Department of Statistics and Operations Research, Carolina Center for Genome Sciences, University of North Carolina, Chapel Hill, NC 27599-3260

Yichao Wu: wu@stat.ncsu.edu; Hao Helen Zhang: hzhang2@stat.ncsu.edu; Yufeng Liu: yfliu@email.unc.edu


Classical statistical approaches for multiclass probability estimation are typically based on regression techniques, such as multiple logistic regression, or density estimation approaches, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of the probability functions or on the underlying distributions of subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme solves a series of weighted large-margin classifiers and then systematically extracts the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied to a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate the competitive performance of the new approach and to compare it with several existing methods.

Multiclass probability estimation is an important problem in statistics and data mining. Suppose we are given a sample {(**x**_{i}, *y*_{i}), *i* = 1, 2, …, *n*}, where **x**_{i} denotes the input vector and *y*_{i} ∈ {1, 2, …, *K*} its class label. The goal is to estimate the conditional class probabilities *p*_{k}(**x**) = *P*(*Y* = *k*|**X** = **x**) for *k* = 1, …, *K*.

Traditionally, the probability estimation problem is tackled by regression techniques, such as multiple logistic regression, or density estimation approaches, such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Agresti and Coull (1998) gave a thorough review of these methods. These methods often make certain model assumptions on the functional forms of the *p*_{k}'s (or their transformations) or on the underlying distributions of subclasses. For example, multiple logistic regression assumes that the logarithms of the odds ratios are linear in **x**:

$$log\frac{P(Y=k\mid \mathbf{X}=\mathbf{x})}{P(Y=1\mid \mathbf{X}=\mathbf{x})}={\beta}_{k0}+{\mathbf{x}}^{T}{\mathit{\beta}}_{k},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall k\ge 2,$$

where class 1 is chosen as the baseline class. On the other hand, both LDA and QDA assume that the covariates **X** associated with each subclass follow a multivariate Gaussian distribution and construct the probability estimates as

$$P(Y=k\mid \mathbf{X}=\mathbf{x})=\frac{\phi (\mathbf{x};{\mathit{\mu}}_{k},{\mathbf{\Sigma}}_{k}){\alpha}_{k}}{{\sum}_{j=1}^{K}\phi (\mathbf{x};{\mathit{\mu}}_{j},{\mathbf{\Sigma}}_{j}){\alpha}_{j}},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k=1,\dots ,K,$$

where *φ*(**x**; **μ**, **Σ**) denotes the multivariate Gaussian density with mean **μ** and covariance **Σ**, and *α*_{k} is the prior probability of class *k*. LDA further assumes a common covariance matrix across the subclasses, whereas QDA allows the covariance matrices to differ.

In this article, we propose a new class of model-free methods for estimating multiclass probabilities. The new method does not make any assumption on the forms of the *p*_{k}'s or on the distribution of each subclass. Unlike the traditional methods, we tackle the soft classification problem by solving a series of hard classification problems and combining these decision rules to construct the probability estimates. The main difference between soft classification and hard classification is their estimation target: the former directly estimates the conditional class probabilities *p*_{k}(**x**), while the latter targets only the decision boundary of the Bayes rule.

Wang, Shen, and Liu (2008) recently explored class probability estimation for binary large-margin classifiers. In particular, they made use of the property that the theoretical minimizer of a consistent weighted binary large-margin loss function is sign[*p*_{1}(**x**) − *π*], where *π* ∈ (0, 1). Although a particular weighted binary large-margin classifier only estimates whether *p*_{1}(**x**) is larger than *π* or not, one can obtain a good estimate of *p*_{1}(**x**) if a sequence of weighted classifiers is calculated for many different *π*'s. As shown in Wang, Shen, and Liu (2008), this method indeed works well for binary class probability estimation. However, the generalization from *K* = 2 to *K* ≥ 3 is nontrivial and largely unknown due to the increased level of problem complexity. Wu, Lin, and Weng (2004) proposed a pairwise coupling method for multiclass probability estimation by solving many binary problems. In this article, we develop a new multiclass probability estimation scheme by utilizing the proposed concept of *border weight* for large-margin classifiers. As a result, the (*K* − 1)-dimensional probability estimation reduces to the search for the border weight. We propose two estimation schemes for probability estimation, the direct scheme and the indirect scheme. Furthermore, we focus on the truncated hinge loss (Wu and Liu 2007) to demonstrate the proposed probability estimation technique. The technique is, however, applicable to other large-margin classifiers as well.

The rest of our article is structured as follows. Section 2 presents the idea of weighted classification and its Fisher consistency properties. Section 3 introduces the main methodology, along with two estimation schemes and the theoretical properties of the resulting probability estimator. Section 4 discusses the computational algorithm and tuning method. Section 5 and Section 6 contain numerous simulated and real examples to illustrate the numerical performance of the new approach, which is followed by the concluding section. The Appendix collects proofs for the theoretical results as well as the derivation of our algorithm.

In this section, we give a brief review on an important class of hard classifiers, SVM’s (Cortes and Vapnik 1995; Vapnik 1998). We start with the simple binary classification problems and then discuss the multiclass extensions. We will, in particular, discuss the extension of SVM’s by minimizing a weighted loss function.

When *K* = 2, the class label *y* is often coded as {−1, +1} for notational convenience. The binary SVM classifier can be fit in the following regularization framework

$$\underset{f\in \mathcal{F}}{min}{n}^{-1}\sum _{i=1}^{n}{H}_{1}({y}_{i}f({\mathbf{x}}_{i}))+\lambda J(f),$$

(1)

where the function *H*_{1}(*z*) = (1 − *z*)_{+} = max{1 − *z*, 0} is the so-called hinge loss, *J*(*f*) is a penalty term for model complexity, and $\mathcal{F}$ is some functional space. Let *p*_{1}(**x**) = *P*(*Y* = 1|**X** = **x**). Lin (2002) showed that the SVM solution to Equation (1) targets directly at
$\text{sign}[{p}_{1}(\mathbf{x})-{\scriptstyle \frac{1}{2}}]$. Therefore, sign[*f̂*(**x**)] approximates the Bayes classification rule without estimating *p*_{1}(**x**).

Since the SVM shows good classification accuracy in many applications, a natural question to ask is whether it is possible to extract any information about *p*_{1}(**x**) from the SVM solution. Recently, Wang, Shen, and Liu (2008) proposed training a series of binary SVM's by minimizing a weighted loss function and then constructing *p̂*_{1}(**x**) by combining multiple SVM classification rules. In particular, by assigning a weight *π* to all the samples from class −1 and a weight 1 − *π* to all the samples from class +1, one can solve the regularization problem based on the weighted hinge loss

$$\underset{f\in \mathcal{F}}{min}{n}^{-1}\left[(1-\pi )\sum _{{y}_{i}=1}{H}_{1}\{{y}_{i}f({\mathbf{x}}_{i})\}+\pi \sum _{{y}_{i}=-1}{H}_{1}\{{y}_{i}f({\mathbf{x}}_{i})\}\right]+\lambda J(f),$$

(2)

where 0 ≤ *π* ≤ 1. Wang, Shen, and Liu (2008) showed that the minimizer of Equation (2) is a consistent estimate of sign[*p*_{1}(**x**) − *π*]. Therefore, one can repeatedly solve Equation (2) using different *π* values, say, 0 = *π*_{1} < ··· < *π*_{m} = 1, and estimate *p*_{1}(**x**) from the value of *π* at which the predicted label of **x** flips from +1 to −1.
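A minimal sketch of this binary scheme follows. For illustration only, the fitted weighted classifiers are replaced by the population rule sign[*p*_{1}(**x**) − *π*]; the function name and interface are ours, not from the paper.

```python
def estimate_p1(decisions, pis):
    # decisions[j]: label (+1 or -1) assigned to x by the classifier trained
    # with weight pis[j]; ideally this approximates sign(p1(x) - pis[j]).
    # Estimate p1(x) by the midpoint of the interval where the decision
    # flips from +1 to -1.
    for j in range(len(pis) - 1):
        if decisions[j] == 1 and decisions[j + 1] == -1:
            return (pis[j] + pis[j + 1]) / 2
    return 1.0 if decisions[-1] == 1 else 0.0

# stand-in for the fitted weighted classifiers: the population rule
p1_true = 0.37
pis = [j / 100 for j in range(101)]
decisions = [1 if p1_true > pi else -1 for pi in pis]
p1_hat = estimate_p1(decisions, pis)  # midpoint of the flip interval
```

With finite samples the decision sequence may not flip exactly once; Section 3 discusses how such violations are handled in the multiclass setting.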

Now consider the multiclass problems with *K* ≥ 2. In this setup, we code *y* as {1, 2, …, *K*}. A classifier seeks a function vector **f** = (*f*_{1}, *f*_{2}, …, *f*_{K}), where *f*_{k} represents class *k*, and labels **x** by the rule arg max_{k} *f*_{k}(**x**).

Various loss functions have been proposed to extend the binary SVM to multiclass problems, such as Weston and Watkins (1999), Lee, Lin, and Wahba (2004), and Liu (2007). Here we focus on the notion of the 0–1 loss. Note that a point (**x**, *y*) is misclassified by **f** if *y* ≠ arg max_{k} *f*_{k}(**x**). Define

$$\mathbf{g}(\mathbf{f}(\mathbf{x}),y)=\{{f}_{y}(\mathbf{x})-{f}_{k}(\mathbf{x}),k\ne y\}.$$

The quantity min **g**(**f**(**x**), *y*) is known as the generalized functional margin and reduces to *yf*(**x**) in the binary case with *y* ∈ {±1} (Liu and Shen 2006). With the generalized functional margin, the 0–1 loss can be expressed as *I*(min **g**(**f**(**x**), *y*) ≤ 0). As in the binary case, one can replace the indicator function in the 0–1 loss by some other loss ℓ. Typically, to assure that a misclassified sample induces a larger loss than a correctly classified sample, the loss function is nonincreasing and satisfies ℓ′(0) < 0. Once the loss ℓ(·) is given, the decision vector can be obtained by solving the following regularization problem

$$\begin{array}{l}\underset{\mathbf{f}}{min}{n}^{-1}\sum _{i=1}^{n}\ell (min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))+\lambda \sum _{k=1}^{K}J({f}_{k})\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.16667em}{0ex}}\sum _{k=1}^{K}{f}_{k}(\mathbf{x})=0.\end{array}$$

(3)

Motivated by Wang, Shen, and Liu (2008), we propose a new approach to estimate the class probabilities by solving a series of weighted multiclass problems and then combining multiple classification rules. In this article, we focus on the class of losses based on the functional margin, ℓ(min **g**(**f**(**X**), *Y*)), as they provide a natural extension from two-class to multiclass problems. For the weighted learning, we assign a weight 0 ≤ *π*_{k} ≤ 1 to samples from class *k*, for *k* = 1, 2, …, *K*, and collect the weights in the simplex

$${A}_{K}=\{({\pi}_{1},\dots ,{\pi}_{K}):\sum _{k=1}^{K}{\pi}_{k}=1,{\pi}_{k}\ge 0,k=1,2,\dots ,K\}.$$

For any given **π** ∈ *A*_{K}, we can train a weighted hard classifier by minimizing the following weighted-loss objective

$$\begin{array}{l}\underset{\mathbf{f}}{min}{n}^{-1}\sum _{i=1}^{n}{\pi}_{{y}_{i}}\ell (min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))+\lambda \sum _{k=1}^{K}J({f}_{k})\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{f}_{k}(\mathbf{x})=0.\end{array}$$

(4)
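For concreteness, the weighted data-fit term of (4) can be evaluated as below. This is a minimal sketch: the hinge loss stands in for a generic ℓ, and all names are illustrative rather than from the paper.

```python
def hinge(u):
    # H_1(u) = max(1 - u, 0), the standard hinge loss
    return max(1.0 - u, 0.0)

def weighted_empirical_loss(F, y, pi, loss=hinge):
    # data-fit term of (4): n^{-1} * sum_i pi_{y_i} * loss(min g(f(x_i), y_i)),
    # where min g(f, y) = min_{k != y} (f_y - f_k) is the functional margin
    total = 0.0
    for fi, yi in zip(F, y):
        margin = min(fi[yi] - fi[k] for k in range(len(fi)) if k != yi)
        total += pi[yi] * loss(margin)
    return total / len(y)

F = [[0.5, -0.2, -0.3], [-0.1, 0.4, -0.3]]  # decision vectors, sum to zero
y = [0, 2]                                   # 0-based class labels
pi = (1/3, 1/3, 1/3)                         # equal class weights
obj = weighted_empirical_loss(F, y, pi)      # (0.3 + 1.7) / (3 * 2) = 1/3
```

The second sample is misclassified (negative functional margin), so it contributes the larger hinge value, scaled by its class weight.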

Compared with the binary case, extracting the probability information from the constructed classifiers becomes much more challenging for *K* > 2. In particular, instead of estimating only one probability function as in the case *K* = 2, we need to estimate the multiple functions *p*_{1}(**x**), …, *p*_{K}(**x**) simultaneously, subject to the constraint that they sum to one.

In the binary case, the standard SVM is shown to be Fisher-consistent for estimating the Bayes classification rule
$\text{sign}({p}_{1}(\mathbf{x})-{\scriptstyle \frac{1}{2}})$. To estimate conditional class probabilities, the method of Wang, Shen, and Liu (2008) requires that the weighted SVM in Equation (2) be Fisher-consistent for estimating the weighted Bayes classification rule sign(*p*_{1}(**x**) − *π*). To proceed with multicategory probability estimation, we need to extend the definition of weighted Fisher consistency. To construct a good probability estimate from the classification rules, we require that the loss function in Equation (4) be consistent in the following sense.

A functional-margin-based loss ℓ is called weighted Fisher-consistent for the weighted classification problem if the minimizer **f**^{*} of *E*[*π*_{Y} ℓ(min **g**(**f**(**X**), *Y*))], subject to ${\sum}_{k=1}^{K}{f}_{k}(\mathbf{x})=0$, satisfies

$$\underset{k=1,\dots ,K}{argmax}{f}_{k}^{\ast}(\mathbf{x})=\underset{k=1,\dots ,K}{argmax}{\pi}_{k}{p}_{k}(\mathbf{x}),\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\forall \mathbf{x},\forall \mathit{\pi}\in {A}_{K}.$$

In a standard multiclass classification problem, the misclassification costs are all equal, i.e., *C*(*Y*, **f**(**X**)) = *I*(*Y* ≠ arg max_{k} *f*_{k}(**X**)), and the Bayes rule minimizing *E*[*C*(*Y*, **f**(**X**))] is arg max_{k=1,2,…,K} *p*_{k}(**x**). Weighted Fisher consistency extends this notion to classification with unequal class weights.

It turns out that not every functional-margin-based loss ℓ(min **g**(**f**(**x**), *y*)) satisfying ℓ′(0) < 0 is weighted Fisher-consistent for multicategory problems, as shown in the next proposition.

Let ℓ(·) be a nonincreasing loss function satisfying ℓ′(0) < 0. For any given positive weights **π** ∈ *A*_{K}, the minimizer **f**^{*} of *E*[*π*_{Y} ℓ(min **g**(**f**(**X**), *Y*))] subject to ${\sum}_{k=1}^{K}{f}_{k}(\mathbf{x})=0$ satisfies the following:

- If ${\scriptstyle \frac{{max}_{k=1,\dots ,K}{\pi}_{k}{p}_{k}}{{\sum}_{k=1}^{K}{\pi}_{k}{p}_{k}}}>1/2$, then $arg{max}_{k}{f}_{k}^{\ast}=arg{max}_{k=1,\dots ,K}{\pi}_{k}{p}_{k}$.
- If ℓ(·) is convex and ${\scriptstyle \frac{{max}_{k=1,\dots ,K}{\pi}_{k}{p}_{k}}{{\sum}_{k=1}^{K}{\pi}_{k}{p}_{k}}}\le 1/2$, then **f**^{*} = **0** is a minimizer.

Proposition 1 suggests that one sufficient condition for the weighted loss *π*_{y} ℓ(min **g**(**f**(**x**), *y*)) to be weighted Fisher-consistent is that ${\scriptstyle \frac{{max}_{k=1,\dots ,K}{\pi}_{k}{p}_{k}}{{\sum}_{k=1}^{K}{\pi}_{k}{p}_{k}}}>1/2$, that is, one weighted class dominates all the others.

Interestingly, although the weighted loss *π*_{y} ℓ(min **g**(**f**(**x**), *y*)) may not be weighted Fisher-consistent itself, its truncated version can be. For a truncation location *s* ≤ 0, define the truncated loss

$${\ell}_{{T}_{s}}(\cdot )=min(\ell (\cdot ),\ell (s)).$$

The following theorem gives a sufficient condition under which the truncated loss ℓ_{Ts} is weighted Fisher-consistent.

Let ℓ(·) be a nonincreasing loss function satisfying ℓ′(0) < 0. Then a sufficient condition for the weighted truncated loss *π*_{y} ℓ_{Ts}(min **g**(**f**(**x**), *y*)) to be weighted Fisher-consistent is that ${sup}_{\{u:u\ge -s\ge 0\}}{\scriptstyle \frac{\ell (0)-\ell (u)}{\ell (s)-\ell (0)}}\ge K-1$.

As pointed out by one referee, the condition ℓ′(0) < 0 requires differentiability of ℓ at 0 and thus excludes nondifferentiable loss functions such as the *ψ* loss. In the following, we show how the condition can be relaxed for nondifferentiable losses. Note that ℓ′(0) is used to assure

$$\left(\sum _{k\ne {k}_{\pi p}}{\pi}_{k}{p}_{k}(\mathbf{x})\right)(-K){\ell}^{\prime}(0)+{\pi}_{{k}_{\pi p}}{p}_{{k}_{\pi p}}(\mathbf{x})K{\ell}^{\prime}(0)<0,$$

(5)

when ${\pi}_{{k}_{\pi p}}{p}_{{k}_{\pi p}}(\mathbf{x})>{max}_{k\ne {k}_{\pi p}}{\pi}_{k}{p}_{k}(\mathbf{x})$, where ${k}_{\pi p}=arg{max}_{k}{\pi}_{k}{p}_{k}(\mathbf{x})$. For a loss that is nondifferentiable at 0, Inequality (5) can be replaced by

$$\left(\sum _{k\ne {k}_{\pi p}}{\pi}_{k}{p}_{k}(\mathbf{x})\right)(-K){\ell}^{\prime}({0}^{-})+{\pi}_{{k}_{\pi p}}{p}_{{k}_{\pi p}}(\mathbf{x})K{\ell}^{\prime}({0}^{+})<0,$$

(6)

where ℓ′(0^{−}) and ℓ′(0^{+}) denote the left and right derivatives of ℓ at 0, respectively. Therefore, the condition ℓ′(0) < 0 can be relaxed as ℓ′(0^{+}) < ℓ′(0^{−}) ≤ 0.

We now use two common loss examples to illustrate how to check the condition and find a proper truncating location *s* to assure the weighted Fisher-consistency of a truncated loss.

- The hinge loss, ℓ(*u*) = [1 − *u*]_{+}: In this case, ${sup}_{\{u:u\ge -s\ge 0\}}{\scriptstyle \frac{\ell (0)-\ell (u)}{\ell (s)-\ell (0)}}={\scriptstyle \frac{1-0}{-s}}$, and the condition becomes $s\in [-{\scriptstyle \frac{1}{K-1}},0]$ by noting that *s* ≤ 0.
- The logistic loss, ℓ(*u*) = log(1 + *e*^{−*u*}): In this case, ${sup}_{\{u:u\ge -s\ge 0\}}{\scriptstyle \frac{\ell (0)-\ell (u)}{\ell (s)-\ell (0)}}={\scriptstyle \frac{log2-0}{log(1+{e}^{-s})-log2}}$, which leads to the condition *s* ∈ [− log(2^{K/(K−1)} − 1), 0].

Both of these loss functions satisfy ℓ′(0) < 0. Although they are not weighted Fisher-consistent themselves, they become weighted Fisher-consistent after truncation at *s*, with *s* satisfying the above condition.
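The two admissible intervals can be computed directly. The small helper below is ours (name and interface are illustrative); the formulas are exactly the endpoints derived above.

```python
import math

def min_truncation_location(K, loss="hinge"):
    # most negative admissible truncation point s from the two examples:
    #   hinge:    s >= -1/(K-1)
    #   logistic: s >= -log(2^(K/(K-1)) - 1)
    if loss == "hinge":
        return -1.0 / (K - 1)
    return -math.log(2.0 ** (K / (K - 1)) - 1.0)

min_truncation_location(3)              # -0.5
min_truncation_location(3, "logistic")  # about -0.6035
```

As *K* grows, both endpoints move toward 0, reflecting the point made below that more truncation is required for larger *K*.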

Proposition 1 and Theorem 1 are weighted extensions of the results of Wu and Liu (2007). Furthermore, we note that the truncation location *s* given in Theorem 1 depends on the class number *K*: the larger *K* is, the more truncation is needed to ensure Fisher consistency. This is because the difficulty of having no “dominating” class becomes more severe as *K* increases. The more truncation, the closer the truncated loss is to the 0–1 loss. For the hinge loss *H*_{1}(*u*), the exponential loss *e*^{−*u*}, and the logistic loss log(1 + *e*^{−*u*}), the admissible truncation locations given by Theorem 1 are *s* ∈ [−1/(*K* − 1), 0], *s* ∈ [− log(*K*/(*K* − 1)), 0], and *s* ∈ [− log(2^{K/(K−1)} − 1), 0], respectively; all three intervals shrink toward 0 as *K* grows.

In this section, we derive our methodology for multiclass probability estimation based on hard classifiers. In particular, we propose training a series of weighted classifiers and using them to construct the probability estimates. For demonstration, we focus on the hinge and truncated hinge loss functions. However, our estimation schemes are applicable to general large-margin classifiers.

Define the truncated hinge loss as

$${H}_{{T}_{s}}(u)=min({H}_{1}(s),{H}_{1}(u)),$$

where
$s=-{\scriptstyle \frac{1}{K-1}}$ corresponds to the minimum truncation required by Theorem 1 to guarantee *H*_{Ts} to be weighted Fisher-consistent. For any given weight vector, the weighted truncated-hinge-loss SVM classifier solves

$$\begin{array}{l}\underset{\mathbf{f}}{min}{n}^{-1}\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{{T}_{s}}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))+\lambda \sum _{k=1}^{K}J({f}_{k})\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{f}_{k}(\mathbf{x})=0,\end{array}$$

(7)

where **π** = (*π*_{1}, *π*_{2}, …, *π*_{K}) ∈ *A*_{K} denotes the weight vector.

The following proposition gives a key result for estimating the probabilities for each **x**.

For any given **x** satisfying min_{k} *p*_{k}(**x**) > 0, there exists a weight vector **π̃**(**x**) = (*π̃*_{1}(**x**), *π̃*_{2}(**x**), …, *π̃*_{K}(**x**)) ∈ *A*_{K}, which we call the border weight, such that

$${\stackrel{\sim}{\pi}}_{1}(\mathbf{x}){p}_{1}(\mathbf{x})={\stackrel{\sim}{\pi}}_{2}(\mathbf{x}){p}_{2}(\mathbf{x})=\cdots ={\stackrel{\sim}{\pi}}_{K}(\mathbf{x}){p}_{K}(\mathbf{x}).$$

Proposition 2 shows that for any **x** with min_{k} *p*_{k}(**x**) > 0, there exists a border weight **π̃**(**x**) at which the *K* weighted quantities *π̃*_{k}(**x**)*p*_{k}(**x**) are all equal; that is, the *K* classification regions over the weight simplex meet at **π̃**(**x**).

Interestingly, the result in Proposition 2 can help us estimate the conditional probabilities *p*_{k}(**x**): once the border weight **π̃**(**x**) is identified, the probabilities can be recovered by inverting its components, as formalized in Proposition 3 below. Figure 2 illustrates the weighted Bayes classification rule over the weight simplex when *K* = 3.

Figure 2. A plot of the weighted Bayes classification rule for all combinations of **π** for a certain point **x** when *K* = 3.

To estimate *p*_{j}(**x**) for *j* = 1, …, *K*, we search for the border weight associated with **x** and then invert its components.

For any given **x**, we assume that its associated border weight is estimated as
$\widehat{\stackrel{\sim}{\mathit{\pi}}}(\mathbf{x})$. Then its class probabilities can be estimated as

$${\widehat{p}}_{k}(\mathbf{x})=\frac{{\widehat{\stackrel{\sim}{\pi}}}_{k}{(\mathbf{x})}^{-1}}{{\widehat{\stackrel{\sim}{\pi}}}_{1}{(\mathbf{x})}^{-1}+{\widehat{\stackrel{\sim}{\pi}}}_{2}{(\mathbf{x})}^{-1}+\cdots +{\widehat{\stackrel{\sim}{\pi}}}_{K}{(\mathbf{x})}^{-1}},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k=1,\dots ,K.$$
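The inversion in Proposition 3 is a one-liner; a minimal sketch (the function name is ours):

```python
def probs_from_border_weight(pi_border):
    # Proposition 3: p_k(x) is proportional to the reciprocal of the k-th
    # component of the estimated border weight
    inv = [1.0 / w for w in pi_border]
    total = sum(inv)
    return [v / total for v in inv]

# border weight proportional to (1/0.5, 1/0.3, 1/0.2), i.e., (6, 10, 15)/31
pi_border = [6 / 31, 10 / 31, 15 / 31]
p = probs_from_border_weight(pi_border)  # [0.5, 0.3, 0.2] up to rounding
```

This is exactly the normalization that undoes the equal-products relation *π̃*_{k}(**x**)*p*_{k}(**x**) = const of Proposition 2.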

Propositions 2 and 3 suggest that identifying the border weight for each **x** is the key step in estimating the conditional probabilities *p*_{k}(**x**). The direct estimation scheme proceeds as follows:

- Define a fine grid of **π** within *A*_{K}. Let the grid size be *d*_{π}. Any grid point **π** takes the form (*m*_{1}*d*_{π}, *m*_{2}*d*_{π}, …, *m*_{K}*d*_{π}) with nonnegative integers *m*_{1}, *m*_{2}, …, *m*_{K} satisfying ${\sum}_{k=1}^{K}{m}_{k}{d}_{\pi}=1$.
- Solve Equation (7) over the above grid using the properly chosen tuning parameter *λ*.
- Form all possible *K*-vertex polyhedrons of (side) length *d*_{π} using the available grid points. Here each *K*-vertex polyhedron corresponds to *K* adjacent grid points.
- For any **x**, identify the *K*-vertex polyhedron such that its *K* vertices all belong to *K* distinct classes. The average of the coordinates of these vertices is defined as the estimate of the border weight **π̃**(**x**) for **x**. The probability estimate can then be calculated using Proposition 3.

In the following, we demonstrate how the direct scheme works in the case of *K* = 3. To search for the border weight in Figure 2, we define a fine grid of **π** within the triangle as in Figure 3. Let the grid size be *d*_{π}. We then look for three neighboring grid points of the form

Figure 3. Left: Classification rule over a grid of **π** for a point **x** for *K* = 3, where the circle ○ denotes being classified as class 1, the square □ denotes being classified as class 2, and the asterisk * denotes being classified as class 3.

$$\{{\mathit{\pi}}_{1}(\mathbf{x})=({\pi}_{1},{\pi}_{2},{\pi}_{3}),{\mathit{\pi}}_{2}(\mathbf{x})=({\pi}_{1}-{d}_{\pi},{\pi}_{2}+{d}_{\pi},{\pi}_{3}),{\mathit{\pi}}_{3}(\mathbf{x})=({\pi}_{1}-{d}_{\pi},{\pi}_{2},{\pi}_{3}+{d}_{\pi})\},$$

(8)

which classify *Y* into three distinct classes as shown on the left panel of Figure 3.
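The grid search just described can be sketched for *K* = 3 as follows. This is a toy illustration: the fitted weighted classifiers of (7) are replaced by the population weighted Bayes rule arg max_{k} *π*_{k}*p*_{k}(**x**), and all function names are ours.

```python
def direct_scheme_3class(classify, d=0.02):
    # classify(pi) -> class in {0, 1, 2} predicted under weight vector pi.
    # Scan the weight simplex on a grid of size d; any small triangle whose
    # three vertices receive three distinct labels brackets the border
    # weight (Proposition 2), and averaging such triangles estimates it.
    m = round(1.0 / d)
    borders = []
    for i in range(m):
        for j in range(m - i):
            verts = [(i, j), (i + 1, j), (i, j + 1)]
            pis = [(a * d, b * d, 1.0 - (a + b) * d) for a, b in verts]
            if len({classify(p) for p in pis}) == 3:
                borders.append(tuple(sum(p[k] for p in pis) / 3
                                     for k in range(3)))
    # average all candidate border weights, then invert:
    # p_k proportional to 1 / pi_k (Proposition 3)
    pi_bar = [sum(b[k] for b in borders) / len(borders) for k in range(3)]
    inv = [1.0 / w for w in pi_bar]
    return [v / sum(inv) for v in inv]

# population stand-in for the fitted weighted classifiers
p_true = (0.5, 0.3, 0.2)
p_hat = direct_scheme_3class(
    lambda pi: max(range(3), key=lambda k: pi[k] * p_true[k]))
# p_hat is close to (0.5, 0.3, 0.2), up to the grid resolution d
```

With the population rule, only triangles straddling the junction of the three classification regions receive three distinct labels, so the recovered probabilities are accurate to roughly the grid size.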

Now we provide some discussion on the numerical coherence of the multiple decisions resulting from training multiple weighted classification problems. Let us start with the two-class problem. If *π* and 1 − *π* are the costs for the negative and positive classes, then it is known that the minimizer of the weighted problem (2) consistently estimates sign[*p*_{1}(**x**) − *π*]. Thus, as *π* increases from 0 to 1, the predicted label of a given **x** should change from +1 to −1 at most once. With finite samples, however, this monotonicity in *π* may be violated.

For multiclass problems, a similar issue can occur even more frequently due to the increased complexity of the optimization problem. Take the three-class problem as an example. Each nonnegative weight vector is **π** = (*π*_{1}, *π*_{2}, *π*_{3}) with *π*_{1} + *π*_{2} + *π*_{3} = 1. With finite samples, the three distinct classes near the border weight may be realized not by a triple of the form in Equation (8), but instead by a triple such as

$$\{{\mathit{\pi}}_{1}(\mathbf{x})=({\pi}_{1},{\pi}_{2},{\pi}_{3}),{\mathit{\pi}}_{2}(\mathbf{x})=({\pi}_{1}-{d}_{\pi},{\pi}_{2},{\pi}_{3}+{d}_{\pi}),{\mathit{\pi}}_{3}(\mathbf{x})=({\pi}_{1},{\pi}_{2}-{d}_{\pi},{\pi}_{3}+{d}_{\pi})\},$$

as shown on the right panel of Figure 3. This corresponds to the monotonicity violation in the binary case, as discussed in the first paragraph of Section 2.2 of Wang, Shen, and Liu (2008). Our selection criterion is to select three neighboring combinations corresponding to three distinct classes. The average (**π**_{1}(**x**) + **π**_{2}(**x**) + **π**_{3}(**x**))/3 of these three neighboring combinations, denoted by **π̂**(**x**) = (*π̂*_{1}(**x**), *π̂*_{2}(**x**), *π̂*_{3}(**x**)), serves as our estimate of the border weight. Using the estimated border weight **π̂**(**x**), our probability estimate is given by

$${\widehat{p}}_{k}(\mathbf{x})=\frac{{\widehat{\pi}}_{k}{(\mathbf{x})}^{-1}}{{\widehat{\pi}}_{1}{(\mathbf{x})}^{-1}+{\widehat{\pi}}_{2}{(\mathbf{x})}^{-1}+{\widehat{\pi}}_{3}{(\mathbf{x})}^{-1}}.$$

For the finite-sample case, there may be more than one triple of neighboring combinations corresponding to three distinct classes, each leading to one estimated border weight. When this happens, we average all the estimated border weights before proceeding with probability estimation. The nonuniqueness of border weights encountered in practice adds challenges to the implementation of the direct scheme, and these may become more severe as the number of classes grows. Furthermore, the border weights are identified by counting multiple decisions; this process is discrete and tends to be slow and unstable. These challenges motivate us to develop another scheme that is more continuous and more stable.

In this section, we provide an alternative scheme to recover probabilities. Instead of directly targeting the probabilities as the direct scheme does, the new scheme estimates some continuous functions of probabilities, which can be easier to estimate, and then inverts those functions to recover probabilities.

Note that the total volume (or area when *K* = 3) of *A _{K}* is given by

$${\int}_{0}^{1}d{\pi}_{1}{\int}_{0}^{1-{\pi}_{1}}d{\pi}_{2}\cdots {\int}_{0}^{1-{\pi}_{1}-\cdots -{\pi}_{K-2}}d{\pi}_{K-1}\sqrt{K}=\frac{\sqrt{K}}{(K-1)!}.$$

The collection of **π** representing class *k* is given by *R*_{k} = {**π** ∈ *A*_{K} : *π*_{k}*p*_{k}(**x**) ≥ *π*_{j}*p*_{j}(**x**) for all *j*}, which can be decomposed as

$${R}_{k}=\cup {r}_{{j}_{1}{j}_{2}\cdots {j}_{K-1}}^{k},$$

where
${r}_{{j}_{1}{j}_{2}\cdots {j}_{K-1}}^{k}=\{\mathit{\pi}:{\pi}_{{j}_{1}}{p}_{{j}_{1}}\le {\pi}_{{j}_{2}}{p}_{{j}_{2}}\le \cdots \le {\pi}_{{j}_{K-1}}{p}_{{j}_{K-1}}\le {\pi}_{k}{p}_{k}\}$ and the union is over all permutations (*j*_{1}, *j*_{2}, …, *j*_{K−1}) of the indices {1, …, *K*}\{*k*}. The volume of each piece equals

$$\sqrt{K}{\int}_{0}^{{B}_{{j}_{1}}}d{\pi}_{{j}_{1}}{\int}_{{p}_{{j}_{1}}{\pi}_{{j}_{1}}/{p}_{{j}_{2}}}^{{B}_{{j}_{2}}}d{\pi}_{{j}_{2}}\cdots {\int}_{{p}_{{j}_{K-2}}{\pi}_{{j}_{K-2}}/{p}_{{j}_{K-1}}}^{{B}_{{j}_{K-1}}}d{\pi}_{{j}_{K-1}},$$

where
${B}_{{j}_{i}}=(1-{\sum}_{m=1}^{i-1}{\pi}_{{j}_{m}})(1/{p}_{{j}_{i}})/((1/{p}_{k})+{\sum}_{m=i}^{K-1}(1/{p}_{{j}_{m}}))$; *i* = 1, …, *K* − 1, with the convention
${\sum}_{m=1}^{0}{\pi}_{{j}_{m}}=0$. We can then sum these volumes over all permutations (*j*_{1}, *j*_{2}, …, *j*_{K−1}) to obtain the volume of *R*_{k}; its ratio to the total volume of *A*_{K} defines the proportion *h*_{k}(*p*_{1}, *p*_{2}, …, *p*_{K}).

Naturally, the proportion prop_{k} of grid points of **π** leading to the prediction of class *k* approximates the corresponding relative volume *h*_{k}(*p*_{1}, …, *p*_{K}). By solving the equation system

$${\text{prop}}_{k}={h}_{k}({p}_{1},{p}_{2},\dots ,{p}_{K});\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k=1,2,\dots ,K,$$

(9)

we can obtain the estimated probabilities. In particular, when *K* = 3, the area of *A*_{3} is
$\sqrt{3}/2$ and the area of *R*_{3} is
${\scriptstyle \frac{\sqrt{3}}{2}}{p}_{3}^{2}({p}_{2}{p}_{3}+{p}_{1}{p}_{3}+2{p}_{1}{p}_{2})/(({p}_{1}+{p}_{3})({p}_{2}+{p}_{3})({p}_{1}{p}_{2}+{p}_{1}{p}_{3}+{p}_{2}{p}_{3}))$. Consequently,
${h}_{3}({p}_{1},{p}_{2},{p}_{3})={p}_{3}^{2}({p}_{2}{p}_{3}+{p}_{1}{p}_{3}+2{p}_{1}{p}_{2})/(({p}_{1}+{p}_{3})({p}_{2}+{p}_{3})({p}_{1}{p}_{2}+{p}_{1}{p}_{3}+{p}_{2}{p}_{3}))$.

The indirect scheme can be summarized as follows:

- Same as those of the Direct Scheme.
- For any **x**, calculate the grid percentages prop_{k} for *k* = 1, 2, …, *K*.
- Solve the equation system in Equation (9) to recover the estimates of (*p*_{1}(**x**), *p*_{2}(**x**), …, *p*_{K}(**x**)).

In Section 5, we will illustrate the performance of both schemes. Our empirical results suggest that the indirect scheme is indeed faster and more accurate.
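A sketch of the indirect scheme for *K* = 3 follows, using the closed-form *h*_{3} above; *h*_{1} and *h*_{2} are obtained by permuting the roles of the classes. The grid-search inversion below is a crude stand-in for solving the equation system (9) directly, and all names are ours.

```python
def h3_proportions(p):
    # h_k(p) for K = 3: the fraction of the weight simplex A_3 on which
    # class k is predicted, from the closed-form area of R_3 in the text
    S = p[0] * p[1] + p[0] * p[2] + p[1] * p[2]
    props = []
    for k in range(3):
        a, b = [j for j in range(3) if j != k]
        num = p[k] ** 2 * (p[a] * p[k] + p[b] * p[k] + 2 * p[a] * p[b])
        props.append(num / ((p[a] + p[k]) * (p[b] + p[k]) * S))
    return props

def invert_proportions(prop, step=0.01):
    # least-squares search over a simplex grid: find p whose predicted
    # proportions h(p) best match the observed proportions
    best, best_err = None, float("inf")
    m = round(1.0 / step)
    for i in range(1, m - 1):
        for j in range(1, m - i):
            p = (i * step, j * step, 1.0 - (i + j) * step)
            err = sum((u - v) ** 2 for u, v in zip(h3_proportions(p), prop))
            if err < best_err:
                best, best_err = p, err
    return best

prop = h3_proportions((0.5, 0.3, 0.2))
p_hat = invert_proportions(prop)  # recovers (0.5, 0.3, 0.2) up to the grid
```

Because the regions *R*_{1}, *R*_{2}, *R*_{3} partition *A*_{3}, the three proportions sum to one, which the test below checks numerically.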

The next theorem establishes the consistency of our class probability estimation.

For any nonincreasing loss function ℓ(·) with ℓ′(0) < 0, suppose the truncation location *s* is chosen such that
${sup}_{\{u:u\ge -s\ge 0\}}{\scriptstyle \frac{\ell (0)-\ell (u)}{\ell (s)-\ell (0)}}\ge K-1$. Then, when *λ* → 0 and the grid size *d*_{π} → 0 as *n* → ∞, the probability estimates *p̂*_{k}(**x**), *k* = 1, …, *K*, are consistent.

The consistency result in Theorem 2 provides theoretical justification for our proposed method. It is straightforward to extend the consistency to our indirect probability recovery scheme, as Theorem 1 implies that prop_{k} is consistent for estimating *h*_{k}(*p*_{1}, …, *p*_{K}).

As shown on the right panel of Figure 5, the function *H _{Ts}*(·) is not convex, thus solving Equation (7) involves a nonconvex minimization problem. However, we note that

Figure 5. The left, middle, and right panels display the functions *H*_{1}(*u*), *H*_{s}(*u*), and *H*_{Ts}(*u*), respectively. A color version of this figure is available in the electronic version of this article.

$${H}_{{T}_{s}}(u)=min({H}_{1}(u),{H}_{1}(s))={H}_{1}(u)-{H}_{s}(u),$$

where *H*_{s}(*u*) = (*s* − *u*)_{+}. Both *H*_{1} and *H*_{s} are convex, so *H*_{Ts} is a difference of two convex functions.
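The decomposition can be checked numerically. The sketch below uses *H*_{s}(*u*) = (*s* − *u*)_{+}; function names are ours.

```python
def hinge(u):
    # H_1(u) = (1 - u)_+
    return max(1.0 - u, 0.0)

def hinge_s(u, s):
    # H_s(u) = (s - u)_+
    return max(s - u, 0.0)

def truncated_hinge(u, s):
    # H_Ts(u) = min(H_1(u), H_1(s)); flat at the value H_1(s) for u < s
    return min(hinge(u), hinge(s))

# the d.c. decomposition H_Ts = H_1 - H_s holds pointwise
K = 3
s = -1.0 / (K - 1)
for u in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    assert abs(truncated_hinge(u, s) - (hinge(u) - hinge_s(u, s))) < 1e-12
```

For *u* < *s*, both sides equal the constant *H*_{1}(*s*) = 1 − *s*; for *u* ≥ *s*, *H*_{s} vanishes and both sides equal *H*_{1}(*u*).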

Using this property of the truncated hinge loss function, we apply the difference convex (d.c.) algorithm (An and Tao 1997; Liu, Shen, and Doss 2005; Wu and Liu 2007) to solve the non-convex optimization problem of the weighted truncated-hinge-loss SVM. The d.c. algorithm solves the nonconvex minimization problem via minimizing a sequence of convex subproblems (see Algorithm 1). We derive the d.c. algorithm for linear learning in Section 4.1 and then generalize it to the case of nonlinear learning via kernel mapping in Section 4.2.

Algorithm 1 [The Difference Convex Algorithm for minimizing *Q*(Θ) = *Q*_{vex}(Θ) + *Q*_{cav}(Θ)]

- Initialize Θ_{0}.
- Repeat ${\mathrm{\Theta}}_{t+1}=arg{min}_{\mathrm{\Theta}}({Q}_{\mathit{vex}}(\mathrm{\Theta})+\langle {Q}_{\mathit{cav}}^{\prime}({\mathrm{\Theta}}_{t}),\mathrm{\Theta}-{\mathrm{\Theta}}_{t}\rangle )$ until convergence of Θ_{t}.
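Algorithm 1 can be sketched generically as below. The toy decomposition in the example is ours, chosen only so that the convex step has a closed form; it is not the SVM objective.

```python
def dc_minimize(qvex_argmin, qcav_grad, theta0, tol=1e-8, max_iter=100):
    # Algorithm 1: repeatedly replace the concave part by its tangent at
    # theta_t and minimize the resulting convex surrogate
    theta = theta0
    for _ in range(max_iter):
        g = qcav_grad(theta)      # (sub)gradient of Q_cav at theta_t
        new = qvex_argmin(g)      # argmin_theta Q_vex(theta) + g * theta
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta

# toy d.c. decomposition: Q(t) = (t - 2)^2 + min(0, t)
#   Q_vex(t) = (t - 2)^2 (convex), Q_cav(t) = min(0, t) (concave)
theta = dc_minimize(
    qvex_argmin=lambda g: 2 - g / 2,           # closed-form convex step
    qcav_grad=lambda t: 1.0 if t < 0 else 0.0,  # subgradient of min(0, t)
    theta0=-1.0,
)  # converges to the global minimizer 2.0 of this toy Q
```

Each convex subproblem in the weighted truncated-hinge-loss SVM plays the role of `qvex_argmin` here, with the linearized −*H*_{s} term entering through the *β*_{ik} defined below.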

Let
${f}_{k}(\mathbf{x})={\mathbf{w}}_{k}^{T}\mathbf{x}+{b}_{k}$ with **w**_{k} ∈ $\mathbb{R}^{d}$ and *b*_{k} ∈ $\mathbb{R}$; *k* = 1, 2, …, *K*. For linear learning, the weighted truncated-hinge-loss SVM solves

$$\begin{array}{l}\underset{\mathbf{W},\mathbf{b}}{min}\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{{T}_{s}}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{w}_{jk}=0;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}j=1,2,\dots ,d;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{b}_{k}=0,\end{array}$$

(10)

where the constraints are adopted to avoid the nonidentifiability of the solution. Note that Equation (10) is an equivalent representation of Equation (4) with *C* = 1/*λ*; thus we will use the two forms interchangeably.

Denote Θ = (**W**, **b**). Applying the fact that *H*_{Ts} = *H*_{1} − *H*_{s}, the objective function of Equation (10) can be decomposed as

$$\begin{array}{l}{Q}^{s}(\mathrm{\Theta})=\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{1}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))-C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{s}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))\\ ={Q}_{\mathit{vex}}^{s}(\mathrm{\Theta})+{Q}_{\mathit{cav}}^{s}(\mathrm{\Theta}),\end{array}$$

where

$${Q}_{\mathit{vex}}^{s}(\mathrm{\Theta})=\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{1}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))$$

and

$${Q}_{\mathit{cav}}^{s}(\mathrm{\Theta})=-C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{s}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i})),$$

denote the convex and concave parts, respectively.

Define

$${\beta}_{ik}=\{\begin{array}{ll}C{\pi}_{{y}_{i}}\hfill & \text{if}\phantom{\rule{0.16667em}{0ex}}k=argmax({f}_{{k}^{\prime}}^{t}:{k}^{\prime}\ne {y}_{i}),{f}_{{y}_{i}}^{t}-{f}_{k}^{t}<s\hfill \\ 0\hfill & \text{otherwise},\hfill \end{array}$$

where
${\mathbf{f}}^{t}={({f}_{1}^{t},{f}_{2}^{t},\dots ,{f}_{K}^{t})}^{T}$ denotes the solution at the *t*th iteration. It is shown in the Appendix that the dual problem of the convex optimization at the (*t* + 1)th iteration, given the solution **f**^{t} at the *t*th iteration, is

$$\begin{array}{l}\underset{\mathit{\alpha}}{min}\frac{1}{2}\sum _{k=1}^{K}{\Vert \sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){\mathbf{x}}_{i}^{T}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){\mathbf{x}}_{i}^{T}\Vert}_{2}^{2}-\sum _{i=1}^{n}\sum _{{k}^{\prime}\ne {y}_{i}}{\alpha}_{{ik}^{\prime}}\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}})-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik})=0,\\ k=1,2,\dots ,K,\\ 0\le \sum _{j\ne {y}_{i}}{\alpha}_{ij}\le C{\pi}_{{y}_{i}},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n,\\ {\alpha}_{ik}\ge 0,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k\ne {y}_{i}.\end{array}$$

This dual problem is a quadratic programming (QP) problem similar to that of the standard SVM and can be solved by standard optimization software. Once the solution is obtained, the coefficients **w**_{k} can be recovered as follows:

$${\mathbf{w}}_{k}=\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){\mathbf{x}}_{i}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){\mathbf{x}}_{i}.$$

(11)

It is interesting to note that the representation of the **w**_{k} given in Equation (11) automatically satisfies
${\sum}_{k=1}^{K}{w}_{jk}=0$ for each 1 ≤ *j* ≤ *d*, so the sum-to-zero constraints in Equation (10) hold without being enforced explicitly.

After the solution of **W** is derived, **b** can be obtained via solving either a sequence of KKT conditions as used in the standard SVM or a linear programming (LP) problem. Denote
${\stackrel{\sim}{f}}_{k}({\mathbf{x}}_{i})={\mathbf{x}}_{i}^{T}{\mathbf{w}}_{k}$. Then **b** can be obtained through the following LP problem:

$$\begin{array}{l}\underset{\mathit{\eta},\mathbf{b}}{min}C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{\eta}_{i}+\sum _{k=1}^{K}\left(\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}{\beta}_{{ik}^{\prime}}-\sum _{i:{y}_{i}=k}{\beta}_{ik}\right){b}_{k}\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\eta}_{i}\ge 0,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n,\\ {\eta}_{i}\ge 1-({\stackrel{\sim}{f}}_{{y}_{i}}({\mathbf{x}}_{i})+{b}_{{y}_{i}})+{\stackrel{\sim}{f}}_{k}({\mathbf{x}}_{i})+{b}_{k},\\ i=1,2,\dots ,n;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k\ne {y}_{i},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{b}_{k}=0.\end{array}$$

For nonlinear learning, each decision function *f _{k}*(

Denote **v*** _{k}* = (

A derivation similar to that in the linear case leads to the following dual problem for nonlinear learning:

$$\begin{array}{l}\underset{\mathit{\alpha}}{min}\frac{1}{2}\sum _{k=1}^{K}\langle \sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){\mathbf{R}}_{i}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){\mathbf{R}}_{i},\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){e}_{i}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){e}_{i}\rangle -\sum _{i=1}^{n}\sum _{{k}^{\prime}\ne {y}_{i}}{\alpha}_{{ik}^{\prime}}\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}})-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik})=0,\\ k=1,2,\dots ,K,\\ 0\le \sum _{k\ne {y}_{i}}{\alpha}_{ik}\le C{\pi}_{{y}_{i}},\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n,\\ {\alpha}_{ik}\ge 0,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k\ne {y}_{i},\end{array}$$

where *β _{ik}*’s are defined similarly as in the linear case. After solving the above QP problem, we can recover the coefficients

$${\mathbf{v}}_{k}=\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){e}_{i}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){e}_{i}.$$

The intercepts *b _{k}*’s can be solved using LP as in the linear learning.

So far we have assumed that the optimal tuning parameter *λ* is given. In practice, tuning parameter selection can be done using an independent validation set or cross-validation. In this article, we select the parameter using an independent tuning set of size *ñ*. Theoretically, the larger *ñ* is, the better the tuning and hence the better the resulting classifier. This is related to Shao's (1993) results on cross-validation in the context of linear model selection: the proportion of the tuning set size to all available data points, namely *ñ*/(*n* + *ñ*), should go to 1 as (*n* + *ñ*) → ∞ to ensure asymptotically correct selection. However, splitting the dataset into training and tuning parts is always a trade-off between model fitting and parameter tuning, since a large training set is also desired for better model fitting. A commonly accepted practice is to use one half for training and the other half for tuning, i.e., *n* = *ñ*.

Now we detail the approach using an independent tuning set of size *ñ*. We first obtain probability estimates
${\widehat{p}}_{j}^{({\lambda}_{m})}({\stackrel{\sim}{\mathbf{x}}}_{i})$, *j* = 1, 2, …, *K*, for any **x̃**_{i} in the tuning set {(

In this section we use four simulation examples to illustrate the methodological power of our new multiclass probability estimation scheme by comparing it to some existing methods. We consider five alternative methods: cumulative logit model (CLM), baseline logit model (BLM), kernel multicategory logistic regression (KMLR), classification tree (TREE), and random forest (RF). Both CLM and BLM make certain assumptions on the forms of the transformed probabilities. In particular, the CLM assumes that
$log{\scriptstyle \frac{{\sum}_{i=1}^{k}{p}_{i}(\mathbf{x})}{1-{\sum}_{i=1}^{k}{p}_{i}(\mathbf{x})}}={\beta}_{k0}+{\mathbf{x}}^{T}{\mathit{\beta}}_{k}$ for *k* = 1, 2, …, *K* − 1 while the BLM assumes that
$log{\scriptstyle \frac{{p}_{k}(\mathbf{x})}{{p}_{K}(\mathbf{x})}}={\beta}_{k0}+{\mathbf{x}}^{T}{\mathit{\beta}}_{k}$ for *k* = 1, 2, …, *K* − 1. KMLR refers to the one proposed by Zhu and Hastie (2005) with Gaussian kernel
$R({\mathbf{x}}_{1},{\mathbf{x}}_{2})={e}^{-{\left|\right|{\mathbf{x}}_{1}-{\mathbf{x}}_{2}\left|\right|}_{2}^{2}/{\sigma}^{2}}$. Ten separate datasets are generated to tune the data width parameter *σ* among a grid of {1/4, 1/2, 3/4, 1, 5/4, 3/2}*σ _{m}*, where
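For concreteness, the Gaussian kernel matrix used by KMLR can be computed as follows (a minimal numpy sketch; the helper name and the example points are illustrative):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gram matrix with entries exp(-||x_i - x_j||_2^2 / sigma^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

X = np.array([[0.0, 0.0], [3.0, 4.0]])
R = gaussian_kernel(X, X, sigma=5.0)
print(R)  # diagonal entries are 1; the off-diagonal entry is exp(-25/25) = exp(-1)
```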

In simulations, the true conditional probability functions *p _{k}*(·),

- 1-norm error ${\scriptstyle \frac{1}{10n}}{\sum}_{i=1}^{10n}{\sum}_{k=1}^{K}\mid {\widehat{p}}_{k}({\overline{\mathbf{x}}}_{i})-{p}_{k}({\overline{\mathbf{x}}}_{i})\mid $
- 2-norm error ${\scriptstyle \frac{1}{10n}}{\sum}_{i=1}^{10n}{\sum}_{k=1}^{K}{({\widehat{p}}_{k}({\overline{\mathbf{x}}}_{i})-{p}_{k}({\overline{\mathbf{x}}}_{i}))}^{2}$
- Empirical generalized Kullback–Leibler (EGKL) loss ${\scriptstyle \frac{1}{10n}}{\sum}_{i=1}^{10n}{\sum}_{k=1}^{K}{p}_{k}({\overline{\mathbf{x}}}_{i})log{\scriptstyle \frac{{p}_{k}({\overline{\mathbf{x}}}_{i})}{{\widehat{p}}_{k}({\overline{\mathbf{x}}}_{i})}}$.

Here **x̄**_{i} denotes the predictor vector of the
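The three evaluation criteria above translate directly into code. The sketch below evaluates them for toy probability matrices (rows are evaluation points, columns are classes; the values are made up for illustration):

```python
import numpy as np

def probability_errors(P_true, P_hat):
    """1-norm, 2-norm, and EGKL errors averaged over the evaluation points."""
    m = P_true.shape[0]                          # number of evaluation points (10n in the text)
    l1 = np.abs(P_hat - P_true).sum() / m
    l2 = ((P_hat - P_true) ** 2).sum() / m
    egkl = (P_true * np.log(P_true / P_hat)).sum() / m
    return l1, l2, egkl

P_true = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
P_hat  = np.array([[0.5, 0.4, 0.1], [0.2, 0.5, 0.3]])
l1, l2, egkl = probability_errors(P_true, P_hat)
print(l1, l2, egkl)   # 0.1, 0.01, and a small positive EGKL value
```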

We consider a three-class linear learning example. Our data are generated in two steps: (1) *Y* is uniformly distributed over {1, 2, 3}; (2) conditional on *Y* = *y*, the two-dimensional predictor **X** is generated from N(**μ**(

In this example, we study a three-class nonlinear example. For any **x** = (*x*_{1}, *x*_{2})* ^{T}*, define
${f}_{1}(\mathbf{x})=-{x}_{1}+0.1{x}_{1}^{2}-0.05{x}_{2}^{2}+0.1,\phantom{\rule{0.16667em}{0ex}}{f}_{2}(\mathbf{x})=-0.2{x}_{1}^{2}+0.1{x}_{2}^{2}-0.2$, and
${f}_{3}(\mathbf{x})={x}_{1}+0.1{x}_{1}^{2}-0.05{x}_{2}^{2}+0.1$. Set
${p}_{k}(\mathbf{x})=P(Y=k\mid \mathbf{X}=\mathbf{x})=exp({f}_{k}(\mathbf{x}))/({\sum}_{m=1}^{3}exp({f}_{m}(\mathbf{x})))$ for *k* = 1, 2, 3.
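The generative model of this example translates directly into code. In the sketch below, the design distribution for **X** (a uniform draw on a square) is an assumption, since it is not specified in this excerpt:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """The three component functions f_1, f_2, f_3 from the example."""
    x1, x2 = x[..., 0], x[..., 1]
    f1 = -x1 + 0.1 * x1**2 - 0.05 * x2**2 + 0.1
    f2 = -0.2 * x1**2 + 0.1 * x2**2 - 0.2
    f3 = x1 + 0.1 * x1**2 - 0.05 * x2**2 + 0.1
    return np.stack([f1, f2, f3], axis=-1)

def p(x):
    """True conditional probabilities p_k(x) = exp(f_k)/sum_m exp(f_m)."""
    e = np.exp(f(x))
    return e / e.sum(axis=-1, keepdims=True)

# sample labels at a few design points (uniform design is an assumption)
x = rng.uniform(-3, 3, size=(5, 2))
probs = p(x)
labels = np.array([rng.choice(3, p=pr) + 1 for pr in probs])
print(probs.sum(axis=-1))   # each row sums to 1
```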

In this example, we consider basis expansion for the parametric methods CLM and BLM by also including the quadratic terms
${x}_{1}^{2}$ and
${x}_{2}^{2}$. Consequently, the BLM is again the oracle model. Results over 100 repetitions, in the same format as Example 1, are reported in Table 2. The column Indirect corresponds to our method with the indirect probability recovery scheme. The tuning of *σ* in the Gaussian kernel selects 5*σ*_{m}/4 and

In Examples 1 and 2 the BLM takes the true model form, so it is not surprising that it performs better than our method. In this example, we design an experiment in which none of the parametric methods corresponds to the oracle, providing a fairer comparison.

The two-dimensional predictor **X** is uniformly distributed over the disc {
$\mathbf{x}:{x}_{1}^{2}+{x}_{2}^{2}\le 100$}. Define functions
${h}_{1}(\mathbf{x})=-5{x}_{1}\sqrt{3}+5{x}_{2},{h}_{2}(\mathbf{x})=-5{x}_{1}\sqrt{3}-5{x}_{2}$, and *h*_{3}(**x**) = 0. Apply a transformation *f _{k}*(

In this five-class example, the data are generated similarly as in Example 1. Response *Y* is uniformly distributed over {1, 2, 3, 4, 5}. Conditional on *Y* = *y*, the two-dimensional predictor **X** is generated from N(** μ**(

Among the six procedures considered above, BLM and CLM are parametric methods, while our method, KMLR, TREE, and RF are nonparametric procedures that make no explicit assumptions on the form of the true probability functions. Our simulation results suggest that, when the parametric assumption is correct, the associated parametric estimator is essentially the oracle and performs best among all. This explains why BLM gives the smallest errors in Examples 1, 2, and 4. However, when the parametric assumption is incorrect, the parametric estimators can perform poorly, as shown for the BLM in Example 3 and the CLM in all the settings. By contrast, model-free methods do not rely on the model assumption and show more robust performance. For complicated problems, some of the nonparametric methods can outperform the parametric ones; in Example 3, our method and RF are the top two performers. Furthermore, our method performs competitively among the nonparametric procedures.

In practice, it is sometimes difficult to determine or validate the parametric assumption on the function forms, especially when the data are complicated or high-dimensional; in such cases, a good nonparametric procedure provides a useful alternative tool for estimating multiclass probabilities.

The total computational cost of the proposed procedure is mainly determined by three factors: the cost of solving one weighted optimization problem, the number of optimization problems corresponding to different weight vectors, and the scheme for recovering probabilities from multiple decision rules. As shown in the article, each optimization problem involves a nonconvex minimization, and the proposed DCA-based algorithm appears quite efficient. For example, it takes 0.4827, 5.4086, 0.5118, and 2.3549 s, on average, to solve an individual optimization problem for Examples 1, 2, 3, and 4, respectively. Since Example 2 deals with a more complicated nonlinear classification problem, it takes a little longer.

The second factor is controlled by the size of *d _{π}*. In a three-class problem, if

From Table 5, it is clear that smaller *d _{π}* values, in general, lead to better accuracy in probability estimation. However, this accuracy gain levels off as

To recover the probabilities, we propose two schemes in the article. The numerical results suggest that the indirect scheme is faster and produces better estimation accuracy. In practice, we recommend using the indirect scheme.

To conclude our simulation studies, we plot in Figure 6 a randomly chosen training set from each example to show what the training data look like. Note that the sample size is 1000 for Example 4; for better visualization, the bottom right panel plots only a random sample of size 200 from Example 4.

In this section, we apply our new multiclass probability estimation scheme to the wine data by comparing it to those four alternative methods considered in the previous section. The wine data are available online at the University of California, Irvine (UCI) Machine Learning Repository by following the URL http://archive.ics.uci.edu/ml/datasets/Wine. In addition to the categorical response variable Wine Type, it has 13 attributes available. They are Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. All these 13 attributes are continuous. Before applying any probability estimation scheme, we standardize each attribute to have mean zero and standard deviation one. Wine Type belongs to one of three classes with class distribution of 59 in class 1, 71 in class 2, and 48 in class 3. The total number of observations is *n* = 178. We randomly select 19 observations from class 1, 23 from class 2, and 16 from class 3 to be set aside as the testing set. The remaining 120 observations are used as the training dataset. We randomly divide those 120 observations in the training set into eight folds with each fold containing five observations in class 1, six in class 2, and four in class 3 so that an eight-fold cross-validation is used to select any tuning parameter over a grid if necessary. The tuning selects *σ _{m}* as the best data width parameter for KMLR. For simplicity, our method is implemented with the linear kernel.
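The preprocessing just described (attribute standardization plus a class-stratified train/test split) can be sketched as follows. The snippet uses synthetic stand-in data with the wine data's class sizes rather than the actual UCI file, so the attribute values are placeholders:

```python
import numpy as np

def standardize(X):
    """Center each attribute to mean 0 and scale to standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def stratified_split(y, test_per_class, rng):
    """Return test indices: a fixed number of observations drawn from each class."""
    test = []
    for cls, n_test in test_per_class.items():
        idx = np.flatnonzero(y == cls)
        test.extend(rng.choice(idx, size=n_test, replace=False))
    return np.array(test)

rng = np.random.default_rng(0)
# synthetic stand-in with the wine data's class sizes 59/71/48 and 13 attributes
y = np.repeat([1, 2, 3], [59, 71, 48])
X = rng.normal(loc=5.0, scale=2.0, size=(178, 13))

Xs = standardize(X)
test_idx = stratified_split(y, {1: 19, 2: 23, 3: 16}, rng)
train_idx = np.setdiff1d(np.arange(178), test_idx)
print(len(test_idx), len(train_idx))   # 58 120
```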

For any estimate *p̂*_{j}(·), we define its log-likelihood over the testing set as
${\sum}_{i=1}^{58}log{\widehat{p}}_{{y}_{i}}({\mathbf{x}}_{i})$, where (

In this work, we propose a model-free multiclass probability estimation approach. It is achieved by solving a series of weighted hard classification problems and then combining the resulting decision rules to construct the probability estimates. Both theoretical and numerical results demonstrate the competitive performance of our estimation procedure in comparison with several other existing approaches.

Our probability estimation procedure requires computing weighted classifiers over a fine grid of the *K*-vertex polyhedron. The computational cost can be high when the number of classes *K* gets large. To further improve computational efficiency, one possible direction is to develop an efficient solution path over the grid; this requires further investigation.

The authors thank the editor, the associate editor, and two referees for their helpful suggestions that led to significant improvement of the article. The authors are supported in part by NSF grants DMS-0905561 (Wu), DMS-0645293 (Zhang), DMS-0747575 (Liu), and DMS-0606577 (Liu), and NIH/NCI grants R01-CA-085848 (Zhang) and R01-CA-149569 (Liu and Wu).

For any **x** and any **π** ∈ *A*_{K}, define
${\stackrel{\sim}{p}}_{k}={\pi}_{k}{p}_{k}(\mathbf{x})/({\sum}_{m=1}^{K}{\pi}_{m}{p}_{m}(\mathbf{x}))$. Then

The unique vector ** π̃**(

The result is straightforward by Proposition 2.

Note that, as *n* → ∞ and *λ* → ∞, the penalty does not contribute to the objective function. Thus asymptotically we are solving

$$\begin{array}{l}\underset{\mathbf{f}}{min}\frac{1}{n}\sum _{i=1}^{n}{\pi}_{{y}_{i}}{\ell}_{{T}_{s}}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\sum _{k=1}^{K}{f}_{k}(\mathbf{x})=0,\end{array}$$

for each ** π**.

Note first that, for any **x** with positive probability density within a small neighborhood *B*(**x**, *r*) = {**x̃**: ||**x̃** − **x**|| ≤ *r*} of radius *r* > 0 [i.e., *P*_{**X**}(

For any ** π** = (

Note that ${\scriptstyle \frac{\partial}{\partial {\mathbf{w}}_{k}}}{Q}_{\mathit{cav}}^{s}(\mathrm{\Theta})$ and ${\scriptstyle \frac{\partial}{\partial {b}_{k}}}{Q}_{\mathit{cav}}^{s}(\mathrm{\Theta})$ can be written, respectively, as follows

$$\begin{array}{l}-C\left[\sum _{i:{y}_{i}=k}{\pi}_{{y}_{i}}(-{I}_{\{min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i})<s\}}){\mathbf{x}}_{i}^{T}+\sum _{i:{y}_{i}\ne k}{\pi}_{{y}_{i}}({I}_{\{k=argmax({f}_{{k}^{\prime}}({\mathbf{x}}_{i}):{k}^{\prime}\ne {y}_{i}),{f}_{{y}_{i}}({\mathbf{x}}_{i})-{f}_{k}({\mathbf{x}}_{i})<s\}}){\mathbf{x}}_{i}^{T}\right],\\ -C\left[\sum _{i:{y}_{i}=k}{\pi}_{{y}_{i}}(-{I}_{\{min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i})<s\}})+\sum _{i:{y}_{i}\ne k}{\pi}_{{y}_{i}}({I}_{\{k=argmax({f}_{{k}^{\prime}}({\mathbf{x}}_{i}):{k}^{\prime}\ne {y}_{i}),{f}_{{y}_{i}}({\mathbf{x}}_{i})-{f}_{k}({\mathbf{x}}_{i})<s\}})\right],\end{array}$$

where *I*_{A} = 1 if event *A* is true and 0 otherwise.

Using the definition of *β _{ij}*, we have

$$\frac{\partial}{\partial {\mathbf{w}}_{k}}{Q}_{\mathit{cav}}^{s}(\mathrm{\Theta})=\sum _{i:{y}_{i}=k}\left(\sum _{{k}^{\prime}\ne {y}_{i}}{\beta}_{{ik}^{\prime}}\right){\mathbf{x}}_{i}^{T}-\sum _{i:{y}_{i}\ne k}{\beta}_{ik}{\mathbf{x}}_{i}^{T}$$

and

$$\frac{\partial}{\partial {b}_{k}}{Q}_{\mathit{cav}}^{s}(\mathrm{\Theta})=\sum _{i:{y}_{i}=k}\left(\sum _{{k}^{\prime}\ne {y}_{i}}{\beta}_{{ik}^{\prime}}\right)-\sum _{i:{y}_{i}\ne k}{\beta}_{ik}.$$

Applying the first-order approximation to the concave part, the objective function at step (*t* + 1) becomes

$${Q}^{s}(\mathrm{\Theta})=\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{H}_{1}(min\mathbf{g}(\mathbf{f}({\mathbf{x}}_{i}),{y}_{i}))+\sum _{k=1}^{K}\langle \frac{\partial}{\partial {\mathbf{w}}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t}),{\mathbf{w}}_{k}\rangle +\sum _{k=1}^{K}{b}_{k}\frac{\partial}{\partial {b}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t}),$$

where Θ* _{t}* is the current solution.
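This linearize-the-concave-part-then-minimize update is the generic d.c. algorithm (DCA). A toy one-dimensional illustration of the scheme, entirely separate from the SVM problem above:

```python
import numpy as np

# Minimize q(x) = u(x) - v(x), where u and v are both convex
# (so -v is the concave part). Here u(x) = x^2 and v(x) = 2|x - 1|.
u  = lambda x: x**2
v  = lambda x: 2 * abs(x - 1)
dv = lambda x: 2 * np.sign(x - 1)      # a subgradient of v

x = 3.0                                # initial iterate (Theta_0)
for _ in range(50):
    g = dv(x)                          # linearize the concave part at the current solution
    # convex subproblem: min_x u(x) - g * x, solved in closed form (2x - g = 0)
    x = g / 2.0

print(x, u(x) - v(x))                  # converges to x = -1, q(-1) = -3
```

Here the global minimum of q is at x = −1 (q = −3), and the iteration reaches it in a few steps; in general DCA only guarantees convergence to a stationary point.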

Using slack variables *ξ*_{i} for the hinge loss function, the optimization problem at step (*t* + 1) can be written as

$$\begin{array}{l}\underset{\mathbf{W},\mathbf{b},\mathit{\xi}}{min}\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{\xi}_{i}+\sum _{k=1}^{K}\langle \frac{\partial}{\partial {\mathbf{w}}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t}),{\mathbf{w}}_{k}\rangle +\sum _{k=1}^{K}{b}_{k}\frac{\partial}{\partial {b}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t})\\ \text{subject}\phantom{\rule{0.16667em}{0ex}}\text{to}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}{\xi}_{i}\ge 0,\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}i=1,2,\dots ,n,\\ {\xi}_{i}\ge 1-[{\mathbf{x}}_{i}^{T}{\mathbf{w}}_{{y}_{i}}+{b}_{{y}_{i}}]+[{\mathbf{x}}_{i}^{T}{\mathbf{w}}_{k}+{b}_{k}],\\ i=1,2,\dots ,n;\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}\phantom{\rule{0.38889em}{0ex}}k\ne {y}_{i}.\end{array}$$

The corresponding Lagrangian is

$$L(\mathbf{W},\mathbf{b},\mathit{\xi})=\frac{1}{2}\sum _{k=1}^{K}{\left|\right|{\mathbf{w}}_{k}\left|\right|}_{2}^{2}+C\sum _{i=1}^{n}{\pi}_{{y}_{i}}{\xi}_{i}-\sum _{i=1}^{n}{u}_{i}{\xi}_{i}-\sum _{i=1}^{n}\sum _{{k}^{\prime}\ne {y}_{i}}{\alpha}_{{ik}^{\prime}}({\mathbf{x}}_{i}^{T}{\mathbf{w}}_{{y}_{i}}+{b}_{{y}_{i}}-{\mathbf{x}}_{i}^{T}{\mathbf{w}}_{{k}^{\prime}}-{b}_{{k}^{\prime}}+{\xi}_{i}-1)+\sum _{k=1}^{K}\langle \frac{\partial}{\partial {\mathbf{w}}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t}),{\mathbf{w}}_{k}\rangle +\sum _{k=1}^{K}{b}_{k}\frac{\partial}{\partial {b}_{k}}{Q}_{\mathit{cav}}^{s}({\mathrm{\Theta}}_{t}),$$

(A.1)

subject to

$$\frac{\partial}{\partial {\mathbf{w}}_{k}}L={\mathbf{w}}_{k}^{T}-\left[\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}}){\mathbf{x}}_{i}^{T}-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik}){\mathbf{x}}_{i}^{T}\right]=0,$$

(A.2)

$$\frac{\partial}{\partial {b}_{k}}L=-\left[\sum _{i:{y}_{i}=k}\sum _{{k}^{\prime}\ne {y}_{i}}({\alpha}_{{ik}^{\prime}}-{\beta}_{{ik}^{\prime}})-\sum _{i:{y}_{i}\ne k}({\alpha}_{ik}-{\beta}_{ik})\right]=0,$$

(A.3)

$$\frac{\partial}{\partial {\xi}_{i}}L=C{\pi}_{{y}_{i}}-{u}_{i}-\sum _{k\ne {y}_{i}}{\alpha}_{ik}=0,$$

(A.4)

where the Lagrange multipliers satisfy *u*_{i} ≥ 0 and *α*_{ik} ≥ 0 for *k* ≠ *y*_{i}.

- Agresti A, Coull B. Approximate Is Better Than ‘Exact’ for Interval Estimation of Binomial Proportions. The American Statistician. 1998;52:119–126.
- An LTH, Tao PD. Solving a Class of Linearly Constrained Indefinite Quadratic Problems by d.c. Algorithms. Journal of Global Optimization. 1997;11:253–285.
- Bartlett PL, Jordan MI, Mcauliffe JD. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association. 2006;101:138–156.
- Cortes C, Vapnik V. Support Vector Networks. Machine Learning. 1995;20:273–297.
- Kimeldorf G, Wahba G. Some Results on Tchebycheffian Spline Functions. Journal of Mathematical Analysis and Applications. 1971;33:82–95.
- Lee Y, Lin Y, Wahba G. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. Journal of the American Statistical Association. 2004;99:67–81.
- Lin Y. Support Vector Machines and the Bayes Rule in Classification. Data Mining and Knowledge Discovery. 2002;6:259–275.
- Liu Y. Fisher Consistency of Multicategory Support Vector Machines. Eleventh International Conference on Artificial Intelligence and Statistics. 2007. pp. 289–296. Available at http://www.stat.umn.edu/~aistat/proceedings/start.htm.
- Liu Y, Shen X. Multicategory
*ψ*-Learning. Journal of the American Statistical Association. 2006;101:500–509. - Liu Y, Shen X, Doss H. Multicategory
*ψ*-Learning and Support Vector Machine: Computational Tools. Journal of Computational and Graphical Statistics. 2005;14:219–236. - Shao J. Linear Model Selection by Cross-Validation. Journal of the American Statistical Association. 1993;88:486–494.
- Shen X, Tseng G, Zhang X, Wong W. On
*ψ*-Learning. Journal of the American Statistical Association. 2003;98:724–734. - Vapnik V. Statistical Learning Theory. New York: Wiley; 1998.
- Wahba G. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In: Schoelkopf B, Burges C, Smola A, editors. Advances in Kernel Methods Support Vector Learning. Cambridge, MA: MIT Press; 1999. pp. 69–88.
- Wang J, Shen X, Liu Y. Probability Estimation for Large Margin Classifiers. Biometrika. 2008;95:149–167.
- Weston J, Watkins C. Support Vector Machines for Multi-Class Pattern Recognition. In: Verleysen M, editor. Proceedings of the 7th European Symposium on Artificial Neural Networks (ESANN-99) Bruges; Belgium: 1999. pp. 219–224. Available at http://www.informatik.uni-trier.de/~ley/db/conf/esann/esann1999.html.
- Wu TF, Lin CJ, Weng RC. Probability Estimates for Multi-Class Classification by Pairwise Coupling. Journal of Machine Learning Research. 2004;5:975–1005.
- Wu Y, Liu Y. Robust Truncated-Hinge-Loss Support Vector Machines. Journal of the American Statistical Association. 2007;102:974–983.
- Zhang HH, Liu Y, Wu Y, Zhu J. Variable Selection for the Multicategory SVM via Adaptive Sup-Norm Regularization. Electronic Journal of Statistics. 2008;2:149–167.
- Zhang T. Statistical Analysis of Some Multi-Category Large Margin Classification Methods. Journal of Machine Learning Research. 2004;5:1225–1251.
- Zhu J, Hastie T. Kernel Logistic Regression and the Import Vector Machine. Journal of Computational and Graphical Statistics. 2005;14:185–205.
