Uncertain Artif Intell. Author manuscript; available in PMC 2010 May 26.

Published in final edited form as:

Uncertain Artif Intell. 2008; 2008: 357–365.

PMCID: PMC2876736

NIHMSID: NIHMS100647

Department of Statistics, University of Michigan, Ann Arbor, MI 48104

Eric B. Laber: laber@umich.edu; Susan A. Murphy: samurphy@umich.edu


Confidence measures for the generalization error are crucial when small training samples are used to construct classifiers. A common approach is to estimate the generalization error by resampling and then assume the resampled estimator follows a known distribution to form a confidence set [Kohavi 1995, Martin 1996, Yang 2006]. Alternatively, one might bootstrap the resampled estimator of the generalization error to form a confidence set. Unfortunately, these methods do not reliably provide sets of the desired confidence. The poor performance appears to be due to the lack of smoothness of the generalization error as a function of the learned classifier. This results in a non-normal distribution of the estimated generalization error. We construct a confidence set for the generalization error using a smooth upper bound on the deviation between the resampled estimate and the generalization error. The confidence set is formed by bootstrapping this upper bound. In cases in which the approximation class for the classifier can be represented as a parametric additive model, we provide a computationally efficient algorithm. This method exhibits superior performance across a series of test and simulated data sets.

Measures of uncertainty are particularly critical in the small sample setting where training set to training set variation is likely to be high. One example occurs in medical diagnostics in mental health, in which one may want to classify patients as “low risk” or “high risk” for relapse based on observed medical history. In these settings the signal-to-noise ratio is likely to be small. Thus, both the estimated generalization error and the confidence in this estimate must be considered, as the confidence might be very low due to the noise in the data. In addition, in this setting data must be collected from clinical trials and is consequently expensive to obtain. The cost of these trials, coupled with notoriously low adherence rates, leads to data sets which are too small to admit a test set. Any inference for the generalization error must be derived from a single training set.

In this paper, we introduce a new method for obtaining confidence bounds on the generalization error in the small sample setting. The paper is organized as follows. Section 2 of the paper provides the framework for this research, and describes why this problem is difficult. We also survey several standard approaches to this problem and motivate the need for a new method. In Section 3 we derive a new bound, called the constrained uniform deviation bound (CUD bound hereafter), on the deviation between the estimated and true generalization errors and describe how to use this bound to form confidence sets. Section 4 gives an efficient algorithm for computing the new upper bound when the classifier can be represented as a parametric additive model. Section 5 describes an empirical study of several methods for constructing confidence sets and analyzes the results.

Our objective is to derive a confidence set for the generalization error under minimal assumptions. To this end, we only assume a training set $\mathcal{D}=\{({x}_{i},{y}_{i})\}_{i=1,\dots ,n}$ of iid draws from some unknown distribution $\mathcal{P}$.

Given a training set $\mathcal{D}$ we choose a classifier *f* using some criterion (discussed in detail below). The classifier *f* aims to predict the value of a label *y* given feature *x*. Thus, a natural measure of the performance of *f* is its *generalization error* $\xi (f)={P}_{(X,Y)\sim \mathcal{P}}\{f(X)\ne Y\}$, that is, the probability that *f* will fail to correctly predict label *Y* from feature *X*.

When data are abundant, inference for the generalization error is straightforward. In this setting, one partitions the available data into two sets $\mathcal{D}$ and $\mathcal{T}$, called a learning set and a testing set respectively. Estimation is done by choosing classifier *f* using learning set $\mathcal{D}$ and then using the empirical estimate
${\widehat{\xi}}_{\mathit{test}}(f)=\frac{1}{\#\mathcal{T}}{\sum}_{({x}_{i},{y}_{i})\in \mathcal{T}}\mathcal{I}\{f({x}_{i})\ne {y}_{i}\}$. The exact distribution of ${\widehat{\xi}}_{\mathit{test}}(f)$ is known and confidence sets can be computed numerically [Langford 2005(1)]. Moreover, ${\widehat{\xi}}_{\mathit{test}}(f)$ is an unbiased estimator of $\xi (f)$.

The convenient properties of ${\widehat{\xi}}_{\mathit{test}}(f)$ rely on the existence of an independent test set. In the small sample setting considered here no test set is available, and inference must be based on the training set alone.
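When a test set is available, the exact binomial confidence set mentioned above can be computed by inverting binomial tail probabilities. The following is a minimal sketch (the function names are ours, not the paper's), using bisection rather than a statistics library:

```python
import math

def binom_cdf(k, m, p):
    # P(X <= k) for X ~ Binomial(m, p)
    return sum(math.comb(m, j) * p**j * (1 - p)**(m - j) for j in range(k + 1))

def exact_error_ci(errors, m, delta=0.05):
    """1 - delta confidence interval for the generalization error from
    `errors` mistakes on m independent test points, obtained by
    inverting binomial tail probabilities (Clopper-Pearson style)."""
    def solve(pred):
        lo, hi = 0.0, 1.0
        for _ in range(60):  # bisection on a monotone-in-p condition
            mid = (lo + hi) / 2
            if pred(mid):
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    lower = 0.0 if errors == 0 else solve(
        lambda p: 1 - binom_cdf(errors - 1, m, p) > delta / 2)
    upper = 1.0 if errors == m else solve(
        lambda p: binom_cdf(errors, m, p) < delta / 2)
    return lower, upper
```

For example, 5 errors on 50 test points give an exact 95% interval that contains the point estimate .10 but is noticeably wide, illustrating why small test sets alone give loose inference.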

It is worth pointing out that there exist alternative motivations for seeking confidence sets in the absence of a testing set. The formation of a test set essentially “wastes” observations which are not used in the training of the classifier. The omission of these observations can significantly degrade the performance of the learned classifier. Another motivation is the existence of learning algorithms which assume that training accuracy correctly reflects generalization error when selecting a classifier. Using high quality bounds on the generalization error in the algorithm can boost performance [Langford 2005(2)], particularly when the training set is small.

Much of the work on small sample inference for the generalization error has focused on point estimates via resampling. Popular point estimates include the .632+ estimator given by Efron and Tibshirani [Efron 1995], the *LV*1*O*^{*} estimate given by Weiss [1991], and more recently the bolstered error estimate of Braga-Neto and Dougherty [Dougherty 2004]. A nice survey of generalization error estimates is given by Schiavo and Hand [Schiavo 2000].

In the literature, a variety of general approaches have been suggested for constructing confidence sets for the generalization error in the absence of a test set. We discuss them in turn.

- *Known distribution.* One approach to this problem is to assume the resampled estimator follows a known distribution Φ [Kohavi 1995, Yang 2006]. Typically Φ is taken to be normal, so that a 1 − *δ* confidence set for the estimated generalization error $\widehat{\xi}(f)$ of classifier *f* is given by
$$\widehat{\xi}(f)\pm {\mathrm{\Phi}}^{-1}\left(1-\frac{\delta}{2}\right)\sqrt{\nu}/\sqrt{{n}^{\ast}},$$
where *ν* is an estimate of the variance and *n*^{*} is the effective sample size^{1} used to construct the resampled estimator. This approach mimics the case where $\widehat{\xi}(f)$ is constructed from a training set and a normal approximation would have been justified for the estimator of the generalization error based on an independent test set.
- *Bayesian approach.* This approach mimics the case where $\widehat{\xi}(f)$ is constructed from a training set and then, using an independent test set, a Bayesian approach is used to construct a posterior confidence set. Here we specify a Beta prior *π* on the generalization error *ξ*(*f*) [Martin 1995],
$$\xi (f)\sim \beta (\gamma ,\lambda ).$$
If an independent test set $\mathcal{T}=\{({x}_{i},{y}_{i})\}_{i=1,\dots ,m}$ were available, then the random variables ${z}_{i}=\mathcal{I}(f({x}_{i})\ne {y}_{i})$ would be iid *bern*(*ξ*). The posterior $P(\xi (f)\mid {z}_{1},\dots ,{z}_{m})$ is also a Beta distribution,
$$\xi (f)\mid {z}_{1},\dots ,{z}_{m}\sim \beta (\gamma +m\widehat{\xi}(f),\lambda +m-m\widehat{\xi}(f)).$$
Note that the posterior depends only on the number of misclassified points, $m\widehat{\xi}(f)$, on the test set. Suppose we do not have a test set. Then, given a resampled estimate $\widehat{\xi}(f)$, we can view $(n-{n}^{\ast})\widehat{\xi}(f)$ as the number of misclassified examples, where *n* and *n*^{*} are the full and effective sample sizes. One can think of this as training on a set of size *n*^{*} and testing on a set of size *n* − *n*^{*}. Using this idea, the resampled model is given by
$$\begin{array}{l}\xi (f)\sim \beta (\gamma ,\lambda )\\ \xi (f)\mid {z}_{1},\dots ,{z}_{m}\sim \beta (\nu ,\eta )\\ \nu =\gamma +\widehat{\xi}(f)(n-{n}^{\ast})\\ \eta =\lambda +(1-\widehat{\xi}(f))(n-{n}^{\ast}).\end{array}$$
The posterior given above can be used to construct confidence intervals for *ξ*(*f*).
- *Direct estimation.* In this approach we repeatedly resample the data, and from each resampled dataset we construct a point estimate of the generalization error (this may require further resampling). This process generates a sequence of estimates of the generalization error. A confidence set can be constructed, for example, by taking quantiles of the generated sequence of resampled estimators.
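The resampled Bayesian model just described is straightforward to put into code. The sketch below computes an equal-tailed posterior credible interval, using a crude grid-based Beta quantile rather than a statistics library; the function names and defaults are ours:

```python
import math

def beta_pdf(x, a, b):
    # density of Beta(a, b), computed via log-gamma for stability
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_quantile(q, a, b, grid=20000):
    # crude inverse CDF: accumulate the density over a fine grid
    step = 1.0 / grid
    acc = 0.0
    for i in range(grid):
        acc += beta_pdf((i + 0.5) * step, a, b) * step
        if acc >= q:
            return (i + 1) * step
    return 1.0

def bayes_resampled_interval(xi_hat, n, n_eff, gamma=1.0, lam=1.0, delta=0.05):
    """Equal-tailed credible interval from the resampled Beta model:
    nu = gamma + xi_hat*(n - n_eff), eta = lam + (1 - xi_hat)*(n - n_eff)."""
    nu = gamma + xi_hat * (n - n_eff)
    eta = lam + (1 - xi_hat) * (n - n_eff)
    return beta_quantile(delta / 2, nu, eta), beta_quantile(1 - delta / 2, nu, eta)
```

For instance, with $\widehat{\xi}(f)=.2$, *n* = 50, and *n*^{*} = 30, the effective "test set" has 20 points, so the posterior remains quite diffuse.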

The methods described above can work well in some settings. Distribution-based methods tend to work well when distributional assumptions are met, providing tight confidence sets and accurate coverage. However, these methods can be non-robust to large deviations from distributional assumptions and can suffer from unacceptable type I errors. Direct estimation avoids distributional assumptions but is unstable for small samples because of the non-smooth 0–1 loss. These problems are illustrated in the experiments section of the paper.
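The direct-estimation baseline just described can be sketched as follows; `fit` and `error` are caller-supplied placeholders (not names from the paper) mapping a dataset to a classifier and a (classifier, dataset) pair to an error estimate:

```python
import random

def direct_estimation_ci(data, fit, error, delta=0.05, B=200, seed=0):
    """Direct estimation (sketch): resample the data, refit, re-estimate,
    and take quantiles of the resulting error point estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(B):
        # bootstrap resample of the same size as the original data
        boot = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(error(fit(boot), boot))
    estimates.sort()
    lo = estimates[int((delta / 2) * (B - 1))]
    hi = estimates[int((1 - delta / 2) * (B - 1))]
    return lo, hi
```

Because each point estimate involves the non-smooth 0–1 loss, the resulting quantiles can jump erratically at small *n*, which is exactly the instability noted above.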

Ideally confidence sets for the generalization error should satisfy the following criteria.

- The confidence set should deliver the promised coverage for a wide variety of examples.
- The confidence set should be efficiently computable.
- The confidence set should be theoretically justified.

Given multiple algorithms for constructing confidence sets satisfying the above, algorithms that produce confidence sets of smallest size are preferred. We propose a new approach and compare this approach to the above methods using the first and second criteria in the experiments section.

We construct an upper bound on the deviation between the training error and generalization error of a classifier using ideas from the large literature on bounding the so-called *excess risk* of a selected classifier [Shawe-Taylor 1998, Bartlett and Mendelson 2002]. We also allow the use of a convex surrogate for the 0–1 loss, as in Zhang [2004]. In particular, we use the idea of bounding the supremum of the deviation between the training error and generalization error over a small subset of the approximation space $\mathcal{G}$. This small subset is determined by the convex surrogate loss $\mathcal{L}:(\mathcal{D},f)\mapsto \mathbb{R}$ used to construct the classifier ($\widehat{f}=arg\hspace{0.1em}{min}_{f\in \mathcal{G}}\mathcal{L}(\mathcal{D},f)$). A variety of surrogate loss functions may be used; a common choice is the squared error loss, given by
$\mathcal{L}(\mathcal{D},f)={\scriptstyle \frac{1}{n}}{\sum}_{({x}_{i},{y}_{i})\in \mathcal{D}}{({y}_{i}-f({x}_{i}))}^{2}$. Other examples are the hinge, logistic, exponential, and penalized quadratic losses.

The intuition for the upper bound is as follows. Suppose we knew (we don’t) the limiting value of $\widehat{f}$, say *f*^{*}, as the amount of training data becomes infinite; *f*^{*} need not be the Bayes classifier. Notice that *f*^{*} *does not* depend on the training set. Since we can think of $\widehat{f}$ as being concentrated near *f*^{*}, this motivates the use of *f*^{*} as the anchor for the small subset over which we take the supremum. In particular, we take the supremum over the set of classifiers belonging to a neighborhood of *f*^{*}. The size of this neighborhood should depend on how far $\widehat{f}$ is from *f*^{*} in terms of the difference in loss, $\mathcal{L}(\mathcal{D},\widehat{f})-\mathcal{L}(\mathcal{D},{f}^{\ast})$.

To fix notation, let ${\widehat{\xi}}_{\mathcal{S}}(f)$ be the empirical error of *f* on data set $\mathcal{S}$, that is,

$${\widehat{\xi}}_{\mathcal{S}}(f)=\frac{1}{\#\mathcal{S}}\sum _{({x}_{i},{y}_{i})\in \mathcal{S}}\mathcal{I}\{f({x}_{i})\ne {y}_{i}\}$$

so that ${\widehat{\xi}}_{\mathcal{D}}(\widehat{f})$ is the *training error*. We have the following result.

If α_{n} is any positive function of the size n of the training set and g(x) = (x + 1) ∧ 1 then:

$$\mid {\widehat{\xi}}_{\mathcal{D}}(\widehat{f})-\xi (\widehat{f})\mid \phantom{\rule{0.16667em}{0ex}}\le \underset{f\in \mathcal{G}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},f)))$$

First notice that $g(x)=1$ for all $x\ge 0$. Then, since $\widehat{f}=arg\hspace{0.1em}{min}_{f\in \mathcal{G}}\mathcal{L}(\mathcal{D},f)$, we must have

$$\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},\widehat{f})\ge 0$$

so that,

$$g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},\widehat{f})))=1.$$

The result follows from noticing,

$$\begin{array}{l}\mid {\widehat{\xi}}_{\mathcal{D}}(\widehat{f})-\xi (\widehat{f})\mid \phantom{\rule{0.16667em}{0ex}}=\phantom{\rule{0.16667em}{0ex}}\mid {\widehat{\xi}}_{\mathcal{D}}(\widehat{f})-\xi (\widehat{f})\mid \times g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},\widehat{f})))\\ \le \underset{f\in \mathcal{G}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid \times g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},f))),\end{array}$$

with the last inequality following because $\widehat{f}\in \mathcal{G}$.

The intuition here is that we consider a small neighborhood around *f*^{*}, with *α _{n}* chosen so that the size of the neighborhood shrinks at the same rate at which $\mathcal{L}(\mathcal{D},\widehat{f})$ converges to $\mathcal{L}(\mathcal{D},{f}^{\ast})$. To see the connection with a neighborhood, note that

$$\begin{array}{l}\mid {\widehat{\xi}}_{\mathcal{D}}(\widehat{f})-\xi (\widehat{f})\mid \phantom{\rule{0.16667em}{0ex}}\le \underset{f\in \mathcal{G}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},f)))\\ \le \underset{f\in \mathcal{G}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid \mathcal{I}(\mathcal{L}(\mathcal{D},{f}^{\ast})\ge \mathcal{L}(\mathcal{D},f)-{\alpha}_{n}^{-1})\\ =\underset{f\in \mathcal{W}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid ,\end{array}$$

where,

$$\mathcal{W}=\{f\in \mathcal{G}:\mathcal{L}(\mathcal{D},f)\le \mathcal{L}(\mathcal{D},{f}^{\ast})+{\alpha}_{n}^{-1}\}.$$

From the preceding set of inequalities we see that *g* serves as a continuous approximation to the indicator function. The rate *α _{n}* will depend on the complexity of $\mathcal{G}$ and the training set size *n*.

Let *δ* ∈ (0, 1]. Then if ${Q}_{1-\delta}$ is the 1 − *δ* quantile of

$$\underset{f\in \mathcal{G}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f)-\xi (f)\mid g({\alpha}_{n}(\mathcal{L}(\mathcal{D},{f}^{\ast})-\mathcal{L}(\mathcal{D},f))),$$

then $P(\mid {\widehat{\xi}}_{\mathcal{D}}(\widehat{f})-\xi (\widehat{f})\mid \le {Q}_{1-\delta})\ge 1-\delta$.

Thus, the problem is reduced to computing ${Q}_{1-\delta}$. This is done using the bootstrap.

For any fixed *δ* we estimate ${Q}_{1-\delta}$ with its bootstrap estimate ${\widehat{Q}}_{1-\delta}$. A bootstrap sample from $\mathcal{D}$ is an iid draw of size *n* from the distribution which puts equal mass on each point in $\mathcal{D}$. The bootstrap estimate is constructed by treating the original sample as the true population and each bootstrap draw as a random draw of size *n* from the true population. The mapping between population and bootstrap quantities is straightforward and given in the table below. The reader is referred to [Efron 1994] or [Hall 1992] for details.

| Population | Bootstrap |
| --- | --- |
| $\xi (f)$ | ${\widehat{\xi}}_{\mathcal{D}}(f)$ |
| ${f}^{\ast}$ | $\widehat{f}$ |
| $\mathcal{L}(\mathcal{D},\cdot )$ | $\mathcal{L}({\mathcal{D}}^{(b)},\cdot )$ |
| ${\widehat{\xi}}_{\mathcal{D}}(f)$ | ${\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f)$ |

Using the above, the procedure to compute ${\widehat{Q}}_{1-\delta}$ is as follows.

1. Construct *B* bootstrap datasets ${\mathcal{D}}^{(1)},\dots ,{\mathcal{D}}^{(B)}$ from $\mathcal{D}$.
2. For *b* = 1, …, *B* compute: $${\mathcal{Q}}^{(b)}={\mathit{sup}}_{f\in \mathcal{G}}\mid {\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f)-{\widehat{\xi}}_{\mathcal{D}}(f)\mid \times g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},\widehat{f})-\mathcal{L}({\mathcal{D}}^{(b)},f))).$$
3. Set ${\widehat{Q}}_{1-\delta}$ to the 1 − *δ* sample quantile of ${\mathcal{Q}}^{(1)},\dots ,{\mathcal{Q}}^{(B)}$.

We now substitute ${\widehat{Q}}_{1-\delta}$ into Proposition 1 in place of ${Q}_{1-\delta}$ to construct the confidence set.
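The three steps above can be sketched in a few lines of code. As a simplification (not the paper's exact method), the supremum over $\mathcal{G}$ is approximated by a maximum over a finite list of candidate classifiers; each candidate is a pair of a raw score function (fed to the squared error surrogate) and its sign classifier:

```python
import random

def g(x):
    # the CUD weight g(x) = (x + 1) ∧ 1
    return min(x + 1.0, 1.0)

def err01(clf, data):
    # empirical 0-1 error of a sign classifier
    return sum(clf(x) != y for x, y in data) / len(data)

def sq_loss(raw, data):
    # squared-error surrogate loss L(D, f)
    return sum((raw(x) - y) ** 2 for x, y in data) / len(data)

def cud_quantile(data, candidates, fitted, alpha_n, delta=0.05, B=200, seed=0):
    """Bootstrap estimate of the 1-delta quantile of the CUD bound.
    `candidates` is a finite list of (raw, sign) pairs standing in for
    the class G; `fitted` is the pair chosen on the full data."""
    rng = random.Random(seed)
    n = len(data)
    raw_hat, _ = fitted
    qs = []
    for _ in range(B):
        boot = [data[rng.randrange(n)] for _ in range(n)]
        loss_hat = sq_loss(raw_hat, boot)
        qs.append(max(
            abs(err01(sgn, boot) - err01(sgn, data))
            * g(alpha_n * (loss_hat - sq_loss(raw, boot)))
            for raw, sgn in candidates))
    qs.sort()
    return qs[min(B - 1, int((1 - delta) * B))]
```

The branch and bound algorithm of the next section replaces the finite candidate list with an exact computation of the supremum for parametric additive models.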

Initially the computation of the sup in step 2 above appears to be computationally intensive. Depending on the form of the classifier space $\mathcal{G}$ and loss function $\mathcal{L}(\mathcal{D},\cdot )$, taking this sup may prove computationally infeasible. However, when the loss function $\mathcal{L}(\mathcal{D},\cdot )$ is convex and the classifier is approximated by a parametric additive model, computation is straightforward by way of a branch and bound algorithm, which we discuss next.

In this section we consider the space of classifiers

$$\mathcal{G}=\{f(x)=\mathit{sign}\left(\sum _{i=1}^{p}{\beta}_{i}b({\gamma}_{i},x)\right),{\beta}_{i},{\gamma}_{i}\in \mathbb{R}\forall i\},$$

where *b*(·,·) are basis functions indexed by the *γ _{i}*’s and the *β _{i}* are real-valued coefficients. We estimate the classifier by profiling: $\widehat{\mathit{\gamma}}$ is obtained first, and the loss is then minimized over $\mathit{\beta}$ with $\widehat{\mathit{\gamma}}$ held fixed:

$$\begin{array}{l}\widehat{f}(x)=f(x;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}})=arg\underset{\mathit{\beta},\mathit{\gamma}}{min}\mathcal{L}(\mathcal{D},f(;\mathit{\beta},\mathit{\gamma}))\\ =arg\underset{\mathit{\beta}}{min}\left(\underset{\mathit{\gamma}}{min}\mathcal{L}(\mathcal{D},f(;\mathit{\beta},\mathit{\gamma}))\right)\\ =arg\underset{\mathit{\beta}}{min}\mathcal{L}(\mathcal{D},f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\end{array}$$
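With squared error loss and the basis functions instantiated at $\widehat{\mathit{\gamma}}$, the $\mathit{\beta}$-step is an ordinary least-squares problem. The following is an illustrative sketch (not the paper's implementation) using a small hand-rolled linear solver:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small p x p system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_additive(data, bases):
    """Least-squares fit of beta for fixed basis functions b(gamma_k, .)
    (here already-instantiated callables).  Returns (beta, sign classifier)."""
    p = len(bases)
    X = [[b(x) for b in bases] for x, _ in data]
    y = [yi for _, yi in data]
    # normal equations: (X^T X) beta = X^T y
    XtX = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)]
           for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
    beta = solve(XtX, Xty)
    return beta, lambda x: 1 if sum(bk * b(x) for bk, b in zip(beta, bases)) > 0 else -1
```
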

We now hold $\widehat{\mathit{\gamma}}$ fixed and compute the upper bound as a supremum over $\mathit{\beta}$.

If α_{n} is any positive function of the size n of the training set and g(x) = (x + 1) ∧ 1 then:

$$\mid {\widehat{\xi}}_{\mathcal{D}}(f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\xi (f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))\mid $$

is bounded above by:

$$\underset{\mathit{\beta}\in {\mathbb{R}}^{p}}{sup}\mid {\widehat{\xi}}_{\mathcal{D}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))-\xi (f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\mid \times g({\alpha}_{n}(\mathcal{L}(\mathcal{D},f(;{\mathit{\beta}}^{\ast},{\mathit{\gamma}}^{\ast}))-\mathcal{L}(\mathcal{D},f(;\mathit{\beta},\widehat{\mathit{\gamma}}))))$$

The supremum is difficult to calculate because of the term with the absolute values; it is the absolute difference between two sums of indicator functions and is hence both non-smooth and non-convex. We now discuss how to transform this problem into a series of convex optimization problems with linear constraints.

To begin, suppose ${\mathcal{D}}^{(b)}$ is a bootstrap dataset drawn from $\mathcal{D}$. For any training point (*x _{i}, y_{i}*) in $\mathcal{D}$, let *φ _{i}* denote the number of times it occurs in ${\mathcal{D}}^{(b)}$. Then,

$$\begin{array}{l}n\mid {\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))-{\widehat{\xi}}_{\mathcal{D}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\mid \phantom{\rule{0.16667em}{0ex}}=\left|\sum _{({x}_{i},{y}_{i})\in \mathcal{D}}({\phi}_{i}-1)\mathcal{I}\{f({x}_{i};\mathit{\beta},\widehat{\mathit{\gamma}})\ne {y}_{i}\}\right|\\ =\left|\sum _{({x}_{i},{y}_{i})\in {\mathcal{D}}^{(b)}-\mathcal{D}}({\phi}_{i}-1)\mathcal{I}\{f({x}_{i};\mathit{\beta},\widehat{\mathit{\gamma}})\ne {y}_{i}\}\right|.\end{array}$$

The above term does not depend on points with *φ _{i}* = 1, that is, those points which were selected exactly once in the bootstrap resampling. We let ${\mathcal{D}}^{(b)}-\mathcal{D}$ denote the set of training points (*x _{i}, y_{i}*) with *φ _{i}* ≠ 1.
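The weighted-count identity above is easy to verify numerically. The sketch below (our own helper names) draws bootstrap counts *φ _{i}* and checks that *n* times the deviation equals the weighted sum over misclassified points, which indeed involves only points with *φ _{i}* ≠ 1:

```python
import random

def bootstrap_counts(n, rng):
    # phi_i = number of times point i is drawn in a bootstrap resample of size n
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts

def check_identity(data, classifier, seed=1):
    """Return both sides of
    n * |err_boot - err_train| = |sum_i (phi_i - 1) * I{f(x_i) != y_i}|."""
    rng = random.Random(seed)
    n = len(data)
    phi = bootstrap_counts(n, rng)
    miss = [1 if classifier(x) != y else 0 for x, y in data]
    err_train = sum(miss) / n
    err_boot = sum(c * m for c, m in zip(phi, miss)) / n
    lhs = n * abs(err_boot - err_train)
    rhs = abs(sum((c - 1) * m for c, m in zip(phi, miss)))
    return lhs, rhs
```
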

Notice that the number of points in ${\mathcal{D}}^{(b)}-\mathcal{D}$ is necessarily smaller than *n*, which translates into computational savings. To exploit this, we make use of the following equivalence relation.

Given ${\mathit{\beta}}^{1},{\mathit{\beta}}^{2}\in {\mathbb{R}}^{p}$ and a subset $\mathcal{A}$ of $\mathcal{D}$, we say ${\mathit{\beta}}^{1}\equiv {\mathit{\beta}}^{2}$ on $\mathcal{A}$ if $f({x}_{j};{\mathit{\beta}}^{1},\widehat{\mathit{\gamma}})=f({x}_{j};{\mathit{\beta}}^{2},\widehat{\mathit{\gamma}})$ for every $({x}_{j},{y}_{j})\in \mathcal{A}$. Let ${\mathcal{M}}_{1},\dots ,{\mathcal{M}}_{R}$ denote the equivalence classes induced on ${\mathbb{R}}^{p}$ by this relation with $\mathcal{A}={\mathcal{D}}^{(b)}-\mathcal{D}$. Within each class the bootstrap deviation is constant:

$$\mid {\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))-{\widehat{\xi}}_{\mathcal{D}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\mid \equiv C({\mathcal{M}}_{i})\forall \mathit{\beta}\in {\mathcal{M}}_{i},$$

where *C*(${\mathcal{M}}_{i}$) is a constant. Then, referring back to step 2 of the bootstrap procedure outlined in section 3, we can write

$$\begin{array}{l}{\mathcal{Q}}^{(b)}=su{p}_{\mathit{\beta}\in {\mathbb{R}}^{p}}\mid {\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))-{\widehat{\xi}}_{\mathcal{D}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\mid \times g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}}))))\\ =su{p}_{i}C({\mathcal{M}}_{i})\times \underset{\mathit{\beta}\in {\mathcal{M}}_{i}}{sup}g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}}))))\\ =\underset{i}{sup}C({\mathcal{M}}_{i})\times g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\underset{\mathit{\beta}\in {\mathcal{M}}_{i}}{inf}\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}})))),\end{array}$$

with the last equality following from the monotonicity of *g*. Since $\mathcal{L}$ is convex in $\mathit{\beta}$, this would be a series of convex optimization problems if membership in ${\mathcal{M}}_{i}$ could be expressed through convex constraints. We show next that membership can in fact be expressed through linear constraints.

Let ${\mathit{\beta}}_{{\mathcal{M}}_{i}}$ be a representative member of equivalence class ${\mathcal{M}}_{i}$. Then any other $\mathit{\beta}\in {\mathbb{R}}^{p}$ belongs to ${\mathcal{M}}_{i}$ if and only if

$$f({x}_{j};{\mathit{\beta}}_{{\mathcal{M}}_{i}},\widehat{\mathit{\gamma}})\times \sum _{k=1}^{p}{\beta}_{k}b({\widehat{\gamma}}_{k},{x}_{j})\ge 0$$

for all *x _{j}* such that $({x}_{j},{y}_{j})\in {\mathcal{D}}^{(b)}-\mathcal{D}$. That is, membership in ${\mathcal{M}}_{i}$ is characterized by a set of linear constraints on $\mathit{\beta}$.
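The linear-constraint characterization can be checked directly. In the sketch below (our own helper, not from the paper), each row holds the basis-function values $b({\widehat{\gamma}}_{k},{x}_{j})$ for one point of ${\mathcal{D}}^{(b)}-\mathcal{D}$, and membership reduces to a sign-agreement test:

```python
def same_class(beta_rep, beta, basis_rows):
    """Check the linear constraints: beta lies in the equivalence class
    of beta_rep iff, for every row of basis-function values, the sign of
    beta_rep's linear score times beta's linear score is nonnegative."""
    def score(b, row):
        return sum(bk * rk for bk, rk in zip(b, row))
    return all((1 if score(beta_rep, r) > 0 else -1) * score(beta, r) >= 0
               for r in basis_rows)
```

Each constraint is linear in $\mathit{\beta}$, so the inner infimum of the surrogate loss over ${\mathcal{M}}_{i}$ is a convex program with linear constraints, as claimed.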

In practice, even though *n* is small (e.g. *n* ≤ 50), *R* can be very large (on the order of 2*^{n}* in the worst case), making direct enumeration of all the equivalence classes infeasible. For this reason we use a branch and bound algorithm.

To use branch and bound we must recursively partition the search space. To do this we arbitrarily label the feature vectors in ${\mathcal{D}}^{(b)}-\mathcal{D}$ by ${x}_{1},\dots ,{x}_{d}$. The root of the tree represents all of ${\mathbb{R}}^{p}$; each node at depth *j* splits into two children according to the sign of ${\sum}_{k=1}^{p}{\beta}_{k}b({\widehat{\gamma}}_{k},{x}_{j+1})$, so that the nodes at depth *d* correspond to the non-empty equivalence classes ${\mathcal{M}}_{1},\dots ,{\mathcal{M}}_{R}$.

To complete the branch and bound algorithm we need to define upper and lower bounds on the value of the objective function,

$$\mathcal{O}(\mathit{\beta})=\mid {\widehat{\xi}}_{{\mathcal{D}}^{(b)}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))-{\widehat{\xi}}_{\mathcal{D}}(f(;\mathit{\beta},\widehat{\mathit{\gamma}}))\mid \times g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}})))).$$

In particular, given a region $\mathcal{S}$ of ${\mathbb{R}}^{p}$ defined by a node on the tree representing our partition, we require an upper bound $\mathcal{U}(\mathcal{S})$ on $\mathcal{O}(\mathit{\beta})$ over $\mathcal{S}$, that is,

$$\underset{\mathit{\beta}\in \mathcal{S}}{sup}\mathcal{O}(\mathit{\beta})\le \mathcal{U}(\mathcal{S}).$$

The upper bound $\mathcal{U}(\mathcal{S})$ is constructed by bounding the two terms in $\mathcal{O}(\mathit{\beta})$ separately. An upper bound on the absolute-deviation term follows from the classifications already fixed by the constraints defining $\mathcal{S}$, taking the most favorable assignment for points whose classifications are not yet fixed. The second term is bounded by computing

$$su{p}_{\mathit{\beta}\in \mathcal{S}}g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}})))),$$

which is a convex optimization problem. The upper bound $\mathcal{U}(\mathcal{S})$ is the product of these two upper bounds.

In addition, we require the following lower bound,

$$L(\mathcal{S})\le su{p}_{\mathit{\beta}\in \mathcal{S}}\mathcal{O}(\mathit{\beta}).$$

The lower bound *L*($\mathcal{S}$) is obtained by plugging any feasible point in $\mathcal{S}$ into $\mathcal{O}(\mathit{\beta})$. In practice, a natural choice is the minimizer of the convex problem used to compute $\mathcal{U}(\mathcal{S})$.

This algorithm, running on a standard desktop with a sample size of *n* = 50 and a total of 500 bootstrap samples, constructs a confidence set in only a few minutes. While the underlying problem is NP-hard,^{2} in practice the algorithm is significantly less computationally intensive than evaluating all possible classifications. The reason is that the function *g* allows for significant reduction of the search space by restricting attention to classifiers within a fixed distance of the selected classifier $\widehat{f}$. To see this, notice that in the construction of $\mathcal{U}(\mathcal{S})$, if the term

$$su{p}_{\mathit{\beta}\in \mathcal{S}}g({\alpha}_{n}(\mathcal{L}({\mathcal{D}}^{(b)},f(;\widehat{\mathit{\beta}},\widehat{\mathit{\gamma}}))-\mathcal{L}({\mathcal{D}}^{(b)},f(;\mathit{\beta},\widehat{\mathit{\gamma}})))),$$

is non-positive, we can remove the region $\mathcal{S}$ from our search space since we know $\mathcal{O}(\mathit{\beta})\ge 0$.
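The pruning logic just described is an instance of generic branch and bound. The skeleton below (a sketch with our own names; the paper's regions, bounds, and objective would be plugged in) makes the structure explicit:

```python
def branch_and_bound(root, children, upper, lower):
    """Generic branch and bound for maximizing an objective over a
    recursively partitioned space.  `upper(S)` and `lower(S)` bound the
    supremum of the objective over region S; `children(S)` splits S
    (returning [] at a leaf).  Regions whose upper bound cannot beat the
    incumbent are pruned, exactly as g prunes regions in the CUD bound."""
    best = lower(root)
    stack = [root]
    while stack:
        region = stack.pop()
        if upper(region) <= best:
            continue  # prune: this region cannot improve the incumbent
        best = max(best, lower(region))
        stack.extend(children(region))
    return best
```

As a toy illustration, maximizing a concave function over integer intervals, with endpoint evaluations as bounds, recovers the maximum while pruning most of the tree.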

In this section we describe a set of experiments designed to test the coverage and diameter of confidence sets constructed using the CUD bound. To form a baseline for comparison we also construct confidence sets using a handful of methods found in the literature. These consist of two non-parametric bootstrap methods, a normal approximation using the bootstrap [Kohavi 1995], a normal approximation using CV [Yang 2006], a generalization of a Bayesian approach [Martin 1995], and the inverted binomial [Langford 2005].^{3} An online supplemental appendix provides a full description of these methods and the simulation study, and links to the source code (www.stat.lsa.umich.edu/~laber/UAI2008.html). The confidence sets were constructed for sample sizes of *n* = 30 and *n* = 50 using the datasets given in table 1. All the data sets have binary labels and continuous features. The real datasets can be found at the UCI data repository (www.ics.uci.edu/mlearn). The simulated data sets were designed to investigate the performance of the new procedure in several scenarios of interest.

Datasets used in experiments along with the reason for their inclusion. Here *f*^{*} is the limiting value of the classifier as the amount of training data becomes infinite.

For this series of experiments we fit a linear classifier using squared error loss. That is,

$$\mathcal{G}=\{f(x)=\mathit{sign}(\sum _{i=1}^{p}{\beta}_{i}{\phi}_{i}(x)),{\beta}_{i}\in \mathbb{R}\}$$

where the *φ _{i}* are fixed transformations of the features. The loss function is given by

$$\mathcal{L}(\mathcal{D},f(;\mathit{\beta}))=\sum _{({x}_{i},{y}_{i})\in \mathcal{D}}(\sum _{j=1}^{p}{\beta}_{j}{\phi}_{j}({x}_{i})-{y}_{i}{)}^{2},$$

and the scaling factor *α _{n}* used in *g* was fixed in advance (see the online supplemental appendix for details). The experimental procedure is as follows.

- Randomly divide the data into training set $\mathcal{D}$ and testing set $\mathcal{T}$:
  - If the data set is real, randomly select *n* training points for $\mathcal{D}$ and use the remainder for $\mathcal{T}$.
  - If the data set is simulated, generate *n* data points for $\mathcal{D}$ and generate 5000 data points for $\mathcal{T}$.
- Build classifier $\widehat{f}=arg\hspace{0.1em}{min}_{f\in \mathcal{G}}\mathcal{L}(\mathcal{D},f)$ using training data $\mathcal{D}$.
- For each procedure use the training data and chosen classifier to construct a confidence set with target coverage .950. The confidence interval is centered at $\widehat{\xi}(\widehat{f})$, taken to be the .632 estimate [Efron 1983] in the competing methods and the training error in the CUD bound.^{4}
- Using the approximation $\xi (\widehat{f})\approx {\widehat{\xi}}_{\mathcal{T}}(\widehat{f})$, we record the coverage for each procedure.

The above steps are repeated 1000 times for each data set and the average coverage computed. The results are shown in tables 2 and 3 with the following abbreviations: quantile bootstrap (BS1) [Efron 1994], corrected bootstrap (BS2) [Efron 1994], Yang’s CV estimate (Y) [Yang 2006], Kohavi’s normal approximation (K) [Kohavi 1995], a generalization of a Bayesian approach from Martin (M) [Martin 1996], and the inverted binomial of Langford (L) [Langford 2005].

Average coverage of new method and competitors. Target coverage level was .950, those meeting or exceeding this level are highlighted in orange.

Average diameter of confidence set constructed using new method and competitors. Method with smallest diameter and having correct coverage is highlighted in yellow.

The results for the *n* = 50 case are similar and are given in tables 4 and 5 of the appendix.

The results in tables 2 and 3 are promising for the new method. It was the only method to provide the desired coverage on all ten datasets and, in addition, possessed the smallest diameter for two of them. The CUD bound confidence set is conservative, as must be expected by its construction. However, the diameter of the new confidence set is far from trivial, even for this extremely small sample size.

It is also of interest to learn when competing methods perform poorly. Figure 1 plots the empirical coverage of the top four methods against the mean absolute value of the training error’s kurtosis (standardized central fourth moment minus 3) for each of the nine data sets listed in table 1. The kurtosis for a normal distribution is 0, and we use absolute kurtosis as a measure of deviation from normality. Figure 1 shows a trend of decreasing coverage as deviation from normality increases for the three distribution-based methods. This suggests these methods may not be robust to non-normal training error. In particular, we see that the method K (which has the strongest normality assumptions) is the most sensitive to deviation from normality. In contrast, the CUD bound method has stable, if conservative, coverage, regardless of the normality of the training error.
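The deviation-from-normality measure used in Figure 1 is simple to compute; a minimal sketch (our function name):

```python
def excess_kurtosis(xs):
    # standardized central fourth moment minus 3 (zero for a normal distribution)
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 4 for x in xs) / (n * var ** 2) - 3
```

For example, a two-point distribution concentrated on ±1 has excess kurtosis −2, the most platykurtic value possible.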

In this paper we introduced a new method for constructing confidence sets for the generalization error in the small sample classification setting. This bound is derived under the minimal assumption of a training set of *iid* draws from some unknown distribution. In addition, we provided a computationally efficient algorithm for constructing this new confidence set when the classifier is approximated by a parametric additive model. In preliminary experiments, the new confidence set exhibited superior coverage while maintaining a small diameter on a catalog of test and simulated data sets. Moreover, it was demonstrated that confidence sets based on distributional assumptions may not be robust to deviation from normality in the training samples.

Much work remains to be done. First, we need to provide theoretical guarantees for the level of confidence. We conjecture that a correct choice of *α _{n}* will depend on the complexity of the approximation space used to construct the classifier. It is well known that bootstrap estimators perform better when estimating smooth functions. Selecting *α _{n}*, and possibly the function *g*, to balance the smoothness of the CUD bound against its size is another direction for future work.

^{1}Here we use effective sample size to mean the expected number of unique points used in estimation during a single step of a resampling procedure.

^{2}Update: In the case of linear classification with squared error loss, current work in progress has shown that this runs in polynomial time, on the order of ${n}^{\mathit{VC}}$ (with *VC* the VC-dimension of the classifier space).

^{3}In this approach we are plugging in a resampled estimate and the effective sample size. That is, we are acting as if we had a test set.

^{4}The competing methods experience a significant reduction in performance if they are centered at the training error.

1. Bartlett P, Jordan M, McAuliffe J. Convexity, classification, and risk bounds. J Amer Statist Assoc. 2006 Mar;101(473):138–156.

2. Bartlett P, Mendelson S. Rademacher and gaussian complexities: Risk bounds and structural results. J Mach Learn Res. 2002;3:463–482.

3. Braga-Neto Ulisses, Dougherty Edward. Bolstered error estimation. Pattern Recognition. 2004;37:1267–1281.

4. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J Amer Statist Assoc. 1983;78(382):316–331.

5. Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman & Hall/CRC; May, 1994.

6. Efron B, Tibshirani R. Cross-validation and the bootstrap: Estimating the error of a prediction rule. Technical report. 1995

7. Hall Peter. Springer Series in Statistics. Springer; New York: 1992. The Bootstrap and Edgeworth Expansion.

8. Kohavi Ron. A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI; 1995. pp. 1137–1145.

9. Langford J. Combining train set and test set bounds. ICML; 2002.

10. Langford J. Tutorial on practical prediction theory for classification. J Machine Learning Research. 2005;6:273–306.

11. Martin J Kent, Hirschberg Daniel S. Small sample statistics for classification error rates II: Confidence intervals and significance tests. 1996 Technical Report ICS-TR-96-22.

12. Schiavo Rosa A, Hand David J. Ten more years of error rate research. International Statistical Review. 2000;68(3):295–310.

13. Weiss SM. Small sample error rate estimation for k-nn classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1991 Mar;13(3):285–289.

14. Yang Yuhong. Comparing learning methods for classification. Statistica Sinica. 2006;16(2):635–657.

15. Zhang T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann Statist. 2004;32(1).
