Published in final edited form as: Uncertain Artif Intell. 2008; 2008: 357–365.

Small Sample Inference for Generalization Error in Classification Using the CUD Bound

Abstract

Confidence measures for the generalization error are crucial when small training samples are used to construct classifiers. A common approach is to estimate the generalization error by resampling and then assume the resampled estimator follows a known distribution to form a confidence set [Kohavi 1995, Martin 1996, Yang 2006]. Alternatively, one might bootstrap the resampled estimator of the generalization error to form a confidence set. Unfortunately, these methods do not reliably provide sets of the desired confidence. The poor performance appears to be due to the lack of smoothness of the generalization error as a function of the learned classifier, which results in a non-normal distribution of the estimated generalization error. We construct a confidence set for the generalization error by using a smooth upper bound on the deviation between the resampled estimate and the generalization error. The confidence set is formed by bootstrapping this upper bound. In cases in which the approximation class for the classifier can be represented as a parametric additive model, we provide a computationally efficient algorithm. This method exhibits superior performance across a series of test and simulated data sets.

1 Introduction

Measures of uncertainty are particularly critical in the small sample setting, where training set to training set variation is likely to be high. One example occurs in medical diagnostics for mental health, where one may want to classify patients as “low risk” or “high risk” for relapse based on observed medical history. In these settings the signal to noise ratio is likely to be small. Thus, both the estimated generalization error and the confidence in this estimate must be considered, as the confidence might be very low due to the noise in the data. In addition, in this setting data must be collected from clinical trials and are consequently expensive to obtain. The cost of these trials, coupled with notoriously low adherence rates, leads to data sets which are too small to admit a test set. Any inference for the generalization error must be derived from a single training set.

In this paper, we introduce a new method for obtaining confidence bounds on the generalization error in the small sample setting. The paper is organized as follows. Section 2 provides the framework for this research and describes why this problem is difficult. We also survey several standard approaches to this problem and motivate the need for a new method. In Section 3 we derive a new bound, called the constrained uniform deviation bound (CUD bound hereafter), on the deviation between the estimated and true generalization errors and describe how to use this bound to form confidence sets. Section 4 gives an efficient algorithm for computing the new upper bound when the classifier can be represented as a parametric additive model. Section 5 describes an empirical study of several methods for constructing confidence sets and analyzes the results.

2 Framework

Our objective is to derive a confidence set for the generalization error under minimal assumptions. To this end, we only assume a training set 𝒟 = {(x_i, y_i)}_{i=1,…,n} of n iid draws from an (unknown) distribution F over 𝒳 × 𝒴, where 𝒳 is any feature space and 𝒴 = {−1, 1} is the label space. We also assume that there is a space of potential classifiers 𝒢 from which we choose our classifier f̂ using some criterion that depends on our training set 𝒟. Equally important is what we do not assume. We make no assumptions about the structure of 𝒢, nor do we assume that the Bayes classifier belongs to 𝒢. Finally, we do not assume the existence of an underlying “true” deterministic function y = g(x) which generates the data.

Given a training set 𝒟 we choose a classifier f̂ ∈ 𝒢 using some criterion (discussed in detail below). The classifier f̂ aims to predict the value of a label y given a feature x. Thus, a natural measure of the performance of f̂ is its generalization error ξ(f̂) = P_{(X,Y)∼F} I{f̂(X) ≠ Y}, that is, the probability that f̂ will fail to correctly predict label Y from feature X.

When data are abundant, inference for the generalization error is straightforward. In this setting, one partitions the available data into two sets ℒ and 𝒯, called a learning set and a testing set respectively. Estimation is done by choosing a classifier f ∈ 𝒢 using the learning set ℒ and then computing the empirical estimate

$$\hat{\xi}_{\mathcal{T}}(f) = \frac{1}{\#\mathcal{T}} \sum_{(x_i, y_i) \in \mathcal{T}} I\{f(x_i) \neq y_i\}.$$

The exact distribution of ξ̂_𝒯(f) is known and confidence sets can be computed numerically [Langford 2005(1)]. Moreover, ξ̂_𝒯(f) is, under mild regularity conditions, asymptotically normal, making standard asymptotic inference applicable.
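
To make the test-set case concrete, the following is a minimal sketch of an exact confidence interval for ξ(f) based on the binomial distribution of the number of misclassified test points (a Clopper–Pearson style construction). It assumes SciPy is available; the function name is ours and not from the paper.

    # Exact confidence interval for the generalization error from an independent
    # test set, using the binomial distribution of the number of test errors.
    from scipy.stats import beta

    def exact_test_error_interval(n_errors, n_test, delta=0.05):
        """1 - delta confidence interval for xi(f) given n_errors out of n_test."""
        lower = 0.0 if n_errors == 0 else beta.ppf(delta / 2, n_errors, n_test - n_errors + 1)
        upper = 1.0 if n_errors == n_test else beta.ppf(1 - delta / 2, n_errors + 1, n_test - n_errors)
        return lower, upper

    # Example: 7 misclassified points out of 50 held-out observations.
    print(exact_test_error_interval(7, 50))   # roughly (0.06, 0.27)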

The convenient properties of ξ̂_𝒯(f) are a direct consequence of the independent test set 𝒯. We define the small sample setting as one in which the data are too meager to admit a test set. Without a test set one is forced to use the same data both to train and to evaluate the classifier. Estimates and confidence sets for the generalization error must be constructed using resampling (e.g. bootstrap, cross-validation, etc.). However, when the training sets are small, the training set to training set distribution of these estimates may be highly non-normal, thus complicating the construction of confidence sets. This is particularly the case here as the generalization error ξ(f) is a non-smooth function of f.

It is worth pointing out that there exist alternative motivations for seeking confidence sets in the absence of a testing set. The formation of a test set essentially “wastes” observations which are not used in the training of the classifier. The omission of these observations can significantly degrade the performance of the learned classifier. Another motivation is the existence of learning algorithms which assume that training accuracy correctly reflects generalization error when selecting a classifier. Using high-quality bounds on the generalization error in such algorithms can boost performance [Langford 2005(2)], particularly when the training set is small.

Relevant Work

Much of the work on small sample inference for the generalization error has focused on point estimates via resampling. Popular point estimates include the .632+ estimator of Efron and Tibshirani [Efron 1995], the LV1O* estimate of Weiss [Weiss 1991], and more recently the bolstered error estimate of Braga-Neto and Dougherty [Braga-Neto 2004]. A nice survey of generalization error estimates is given by Schiavo and Hand [Schiavo 2000].

In the literature, a variety of general approaches have been suggested for constructing confidence sets for the generalization error in the absence of a test set. We discuss them in turn.

  1. Known distribution. One approach to this problem is to assume the resampled estimator follows a known distribution Φ [Kohavi 1995, Yang 2006]. Typically Φ is taken to be normal, so that a 1 − δ confidence set for the generalization error of classifier f, centered at the resampled estimate ξ̂(f), is given by
    $$\hat{\xi}(f) \pm \Phi^{-1}\!\left(\tfrac{\delta}{2}\right) \sqrt{\nu / n^*},$$
    where ν is some estimate of the standard error and n* is the effective sample size^1 used to construct the resampled estimator. This approach mimics the case where f is learned on a training set and the generalization error is estimated from an independent test set, for which a normal approximation would be justified.
  2. Bayesian Approach. This approach mimics the case where f is trained on one data set and an independent test set is used to construct a Bayesian posterior confidence set. Here we specify a Beta prior π on the generalization error ξ(f) [Martin 1996],
    $$\xi(f) \sim \mathrm{Beta}(\gamma, \lambda).$$
    If an independent test set 𝒯 = {(x_i, y_i)}_{i=1,…,m} were available, then the random variables z_i = I(f(x_i) ≠ y_i) would be iid Bernoulli(ξ(f)). The posterior P(ξ(f) | z_1, …, z_m) is also a Beta distribution,
    $$\xi(f) \mid z_1, \ldots, z_m \sim \mathrm{Beta}\big(\gamma + m\hat{\xi}(f),\ \lambda + m - m\hat{\xi}(f)\big).$$
    Note that the posterior depends only on the number of misclassified points on the test set, mξ̂(f).
    Suppose we do not have a test set. Then, given the resampled estimate ξ̂(f), we can view (n − n*)ξ̂(f) as the number of misclassified examples, where n and n* are the full and effective sample sizes. One can think of this as training on a set of size n* and testing on a set of size n − n*. Using this idea, the resampled model is given by,
    $$\xi(f) \sim \mathrm{Beta}(\gamma, \lambda), \qquad \xi(f) \mid z_1, \ldots, z_m \sim \mathrm{Beta}(\nu, \eta), \qquad \nu = \gamma + \hat{\xi}(f)(n - n^*), \qquad \eta = \lambda + (1 - \hat{\xi}(f))(n - n^*).$$
    The posterior given above can be used to construct confidence intervals for ξ(f) (a code sketch of approaches 1 and 2 is given after this list).
  3. Direct Estimation. In this approach we repeatedly resample the data and from each resampled dataset we construct a point estimate of the generalization error (this may require further resampling). This process generates a sequence of estimates for the generalization error. A confidence set can be constructed, for example, by taking the quantiles of the generated sequence of resampled estimators.
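
Below is a minimal sketch of approaches 1 and 2, assuming SciPy; the function names and the choice ν = ξ̂(f)(1 − ξ̂(f)) in the example are ours, not prescriptions from the cited work.

    # Sketches of the "known distribution" (normal) and Bayesian (Beta posterior)
    # confidence sets for the generalization error.
    from scipy.stats import norm, beta

    def normal_interval(xi_hat, nu, n_star, delta=0.05):
        # Approach 1: xi_hat +/- Phi^{-1}(delta/2) * sqrt(nu / n_star)
        half = abs(norm.ppf(delta / 2)) * (nu / n_star) ** 0.5
        return max(0.0, xi_hat - half), min(1.0, xi_hat + half)

    def beta_posterior_interval(xi_hat, n, n_star, gamma=1.0, lam=1.0, delta=0.05):
        # Approach 2: Beta(gamma, lam) prior updated with (n - n_star) pseudo test points
        m = n - n_star
        posterior = beta(gamma + xi_hat * m, lam + (1.0 - xi_hat) * m)
        return posterior.ppf(delta / 2), posterior.ppf(1 - delta / 2)

    # Example: resampled estimate 0.20, n = 30, effective sample size n* ~ 0.632 * n.
    print(normal_interval(0.20, nu=0.20 * 0.80, n_star=19))
    print(beta_posterior_interval(0.20, n=30, n_star=19))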

The methods described above can work well in some settings. Distribution-based methods tend to work well when their distributional assumptions are met, providing tight confidence sets and accurate coverage. However, these methods can be non-robust to large deviations from those assumptions and can suffer from unacceptable type I error rates. Direct estimation avoids distributional assumptions but is unstable for small samples because of the non-smooth 0–1 loss. These problems are illustrated in the experiments section of the paper.

Ideally confidence sets for the generalization error should satisfy the following criteria.

  1. The confidence set should deliver the promised coverage for a wide variety of examples.
  2. The confidence set should be efficiently computable.
  3. The confidence set should be theoretically justified.

Given multiple algorithms for constructing confidence sets satisfying the above, algorithms that produce confidence sets of smallest size are preferred. We propose a new approach and compare this approach to the above methods using the first and second criteria in the experiments section.

3 The CUD Bound

We construct an upper bound on the deviation between the training error and generalization error of a classifier using ideas from the large literature on constructing bounds on the so-called excess risk of a selected classifier [Shawe-Taylor 1998, Bartlett and Mendelson 2006]. We also allow the use of a convex surrogate for the 0–1 loss, as in Zhang [2004]. In particular, we use the idea of constructing a bound for the supremum of the deviation between the training error and generalization error over a small subset of the approximation space 𝒢. This small subset is determined by the convex surrogate loss L: (𝒟, f) → ℝ used to construct the classifier (f̂ = arg min_{f∈𝒢} L(𝒟, f)). A variety of surrogate loss functions may be used; a common choice is squared error loss, given by

$$L(\mathcal{D}, f) = \frac{1}{n} \sum_{(x_i, y_i) \in \mathcal{D}} (y_i - f(x_i))^2.$$

Some other examples are the hinge loss, logistic loss, exponential loss and penalized quadratic loss.
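
For reference, these surrogates have the familiar per-observation forms, written in terms of the margin y f(x) (the penalized quadratic loss adds a complexity penalty on f to the squared error):

$$\text{hinge: } \max(0,\, 1 - y f(x)), \qquad \text{logistic: } \log\!\left(1 + e^{-y f(x)}\right), \qquad \text{exponential: } e^{-y f(x)}.$$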

The intuition for the upper bound is as follows. Suppose we knew (we don’t) the limiting value of f̂, say f*, as the amount of training data becomes infinite. f* may not be the Bayes classifier. Also notice that f* does not depend on the training set. In addition, since we can think of f̂ as being concentrated near f*, this motivates the use of f* as the anchor for the small subset over which we take the supremum. In particular, we take the supremum over the set of classifiers belonging to a neighborhood of f*. The size of this neighborhood should depend on how far f̂ is from f* in terms of the difference in loss, L(𝒟, f̂) − L(𝒟, f*).

To fix notation, let ξ̂_𝒮(f) be the empirical error of f on a data set 𝒮, that is,

$$\hat{\xi}_{\mathcal{S}}(f) = \frac{1}{\#\mathcal{S}} \sum_{(x_i, y_i) \in \mathcal{S}} I\{f(x_i) \neq y_i\},$$

so that ξ̂_𝒟(f) is the training error. We have the following result.

CUD Bound

If α_n is any positive function of the size n of the training set 𝒟 and g(x) = (x + 1) ∧ 1, then:

$$\left|\hat{\xi}_{\mathcal{D}}(\hat{f}) - \xi(\hat{f})\right| \leq \sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right| \, g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, f)\right)\right).$$

Proof

First notice that g(x) = 1 for all x ≥ 0. Then, since f̂ = arg min_{f∈𝒢} L(𝒟, f), we must have

$$L(\mathcal{D}, f^*) - L(\mathcal{D}, \hat{f}) \geq 0,$$

so that,

$$g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, \hat{f})\right)\right) = 1.$$

The result follows from noticing,

$$\left|\hat{\xi}_{\mathcal{D}}(\hat{f}) - \xi(\hat{f})\right| = \left|\hat{\xi}_{\mathcal{D}}(\hat{f}) - \xi(\hat{f})\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, \hat{f})\right)\right) \leq \sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, f)\right)\right),$$

with the last inequality following because f̂ ∈ 𝒢.

The intuition here is that we consider a small neighborhood around f*, with α_n chosen so that the size of the neighborhood shrinks at the same rate that L(𝒟, f̂) → L(𝒟, f*) as the size of 𝒟 tends to infinity. It is the function g which restricts the domain over which the supremum is taken. In particular,

$$\left|\hat{\xi}_{\mathcal{D}}(\hat{f}) - \xi(\hat{f})\right| \leq \sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right| \, g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, f)\right)\right) \leq \sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right| \, I\!\left(L(\mathcal{D}, f) - L(\mathcal{D}, f^*) \leq \alpha_n^{-1}\right) = \sup_{f \in W} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right|,$$

where,

$$W = \left\{ f \in \mathcal{G} : L(\mathcal{D}, f) - L(\mathcal{D}, f^*) \leq \alpha_n^{-1} \right\}.$$

From the preceding set of inequalities we see that g serves as a continuous approximation to the indicator function. The rate, α_n, will depend on the complexity of 𝒢 and the training set size, n. For example, if 𝒢 has low complexity, such as a smoothly parameterized class with a finite dimensional parameter, then the rate can be expected to be on the order of n. We now present the main result, which follows immediately from the upper bound.

Proposition 1

Let δ ∈ (0, 1]. If Q_{1−δ} is the 1 − δ quantile of

$$\sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}}(f) - \xi(f)\right| \, g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, f)\right)\right),$$

then P(|ξ̂_𝒟(f̂) − ξ(f̂)| ≤ Q_{1−δ}) ≥ 1 − δ.

Thus, the problem is reduced to computing Q_{1−δ}. This is done using the bootstrap.

3.1 Computing Q_{1−δ}

For any fixed δ we estimate Q_{1−δ} with its bootstrap estimate Q̂_{1−δ}. A bootstrap sample 𝒟^(b) from 𝒟 is an iid draw of size n from the distribution which puts equal mass on each point in 𝒟. The bootstrap estimate is constructed by treating the original sample 𝒟 as the true population and each bootstrap draw as a random draw of size n from the true population. The mapping between population and bootstrap quantities is straightforward and given in the table below. The reader is referred to [Efron 1994] or [Hall 1992] for details.

Population        Bootstrap
ξ(f)              ξ̂_𝒟(f)
f*                f̂
L(𝒟, ·)          L(𝒟^(b), ·)
ξ̂_𝒟(f)          ξ̂_{𝒟^(b)}(f)

Using the above, the procedure to compute Q̂_{1−δ} is as follows.

  1. Construct B bootstrap datasets 𝒟^(1), …, 𝒟^(B) from 𝒟.
  2. For b = 1, …, B compute:
    $$Q^{(b)} = \sup_{f \in \mathcal{G}} \left|\hat{\xi}_{\mathcal{D}^{(b)}}(f) - \hat{\xi}_{\mathcal{D}}(f)\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, \hat{f}) - L(\mathcal{D}^{(b)}, f)\right)\right).$$
  3. Set Q̂_{1−δ} to the 1 − δ sample quantile of Q^(1), …, Q^(B).

We now substitute Q̂_{1−δ} into Proposition 1 in place of Q_{1−δ} to construct the confidence set.
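
For concreteness, the following is a minimal sketch of this procedure in which the supremum in step 2 is approximated by a maximum over a finite, user-supplied set of candidate classifiers (Section 4 gives an exact algorithm for parametric additive models); the function and argument names are ours.

    # Bootstrap estimate of Q_{1-delta}. The sup over G is approximated by a max
    # over a finite candidate set of classifiers supplied by the caller.
    import numpy as np

    def g_fn(x):
        # g(x) = (x + 1) ∧ 1
        return min(x + 1.0, 1.0)

    def cud_quantile(data, fit, loss, err, candidates, alpha_n,
                     B=500, delta=0.05, seed=0):
        """data: list of (x, y) pairs; fit(data) -> chosen classifier f_hat;
        loss(data, f) -> surrogate loss; err(f, data) -> empirical 0-1 error."""
        rng = np.random.default_rng(seed)
        n = len(data)
        f_hat = fit(data)
        xi_D = [err(f, data) for f in candidates]        # training errors on D
        Q = []
        for _ in range(B):                               # step 1: bootstrap data sets
            boot = [data[i] for i in rng.integers(0, n, size=n)]
            L_hat = loss(boot, f_hat)
            # step 2: approximate the sup over G by a max over the candidate set
            Q.append(max(abs(err(f, boot) - xi_D[j]) *
                         g_fn(alpha_n * (L_hat - loss(boot, f)))
                         for j, f in enumerate(candidates)))
        return float(np.quantile(Q, 1.0 - delta))        # step 3: 1 - delta quantile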

Initially, the computation of the sup in step 2 above appears to be computationally intensive. Depending on the form of the classifier space 𝒢 and loss function L(·,·), taking this sup may prove to be computationally infeasible. However, when the loss function L(·,·) is convex and the classifier is approximated by a parametric additive model, computation is straightforward by way of a branch and bound algorithm, which we discuss next.

4 An Efficient Algorithm For Parametric Additive Models

In this section we consider the space of classifiers

$$\mathcal{G} = \left\{ f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{p} \beta_i b(\gamma_i, x)\right),\ \beta_i, \gamma_i \in \mathbb{R}\ \forall i \right\},$$

where b(·,·) are basis functions indexed by the γ_i’s and the β_i’s are the basis coefficients. We also assume a convex surrogate loss function L(𝒟, f), where convexity refers to convexity in β = (β_1, …, β_p) but not necessarily in γ = (γ_1, …, γ_p). We assume that given training set 𝒟 we choose classifier

$$\hat{f}(x) = f(x; \hat{\beta}, \hat{\gamma}) = \arg\min_{\beta, \gamma} L(\mathcal{D}, f(\cdot; \beta, \gamma)) = \arg\min_{\beta} \left( \min_{\gamma} L(\mathcal{D}, f(\cdot; \beta, \gamma)) \right) = \arg\min_{\beta} L(\mathcal{D}, f(\cdot; \beta, \hat{\gamma})).$$

We now hold γ̂ fixed and compute the upper bound as a supremum over β.

CUD Bound (Parametric Additive Models)

If α_n is any positive function of the size n of the training set 𝒟 and g(x) = (x + 1) ∧ 1, then:

$$\left|\hat{\xi}_{\mathcal{D}}(f(\cdot; \hat{\beta}, \hat{\gamma})) - \xi(f(\cdot; \hat{\beta}, \hat{\gamma}))\right|$$

is bounded above by:

$$\sup_{\beta \in \mathbb{R}^p} \left|\hat{\xi}_{\mathcal{D}}(f(\cdot; \beta, \hat{\gamma})) - \xi(f(\cdot; \beta, \hat{\gamma}))\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}, f^*) - L(\mathcal{D}, f(\cdot; \beta, \hat{\gamma}))\right)\right).$$

The supremum is difficult to calculate because of the term with the absolute values; it is the absolute difference between two sums of indicator functions and is hence both non-smooth and non-convex. We now discuss how to transform this problem into a series of convex optimization problems with linear constraints.

To begin, suppose 𝒟^(b) is a bootstrap dataset drawn from 𝒟. For any training point (x_i, y_i) in 𝒟, let φ_i be the number of copies of (x_i, y_i) in 𝒟^(b). Then 0 ≤ φ_i ≤ n for all i and the distribution of each φ_i is binomial with size n and probability 1/n. Then we can write,

$$n \left|\hat{\xi}_{\mathcal{D}^{(b)}}(f(\cdot; \beta, \hat{\gamma})) - \hat{\xi}_{\mathcal{D}}(f(\cdot; \beta, \hat{\gamma}))\right| = \left|\sum_{(x_i, y_i) \in \mathcal{D}} (\varphi_i - 1) \, I\{f(x_i; \beta, \hat{\gamma}) \neq y_i\}\right| = \left|\sum_{(x_i, y_i) \in \mathcal{D}^{(b)} \triangle \mathcal{D}} (\varphi_i - 1) \, I\{f(x_i; \beta, \hat{\gamma}) \neq y_i\}\right|.$$

The above term does not depend on points with φ_i = 1, that is, those points which were selected exactly once in the bootstrap resampling. We let 𝒟^(b) △ 𝒟 ⊆ 𝒟 denote the set of training points (x_i, y_i) that satisfy φ_i ≠ 1.
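
This is easy to check numerically; the following small sketch verifies the display above and the restriction to points with φ_i ≠ 1 for a linear classifier f(x; β) = sign(xᵀβ) (the setup and names are ours, purely illustrative).

    # Numerical check that n * |xi_hat_{D^(b)} - xi_hat_D| depends only on the
    # points drawn a number of times different from one.
    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 30, 3
    X = rng.normal(size=(n, p))
    y = np.where(rng.normal(size=n) > 0, 1.0, -1.0)
    beta = rng.normal(size=p)
    miss = (np.sign(X @ beta) != y).astype(float)                # I{f(x_i) != y_i}

    phi = np.bincount(rng.integers(0, n, size=n), minlength=n)   # bootstrap copy counts
    lhs = n * abs(np.mean(phi * miss) - np.mean(miss))           # n |xi_hat_Db - xi_hat_D|
    rhs = abs(np.sum((phi - 1)[phi != 1] * miss[phi != 1]))      # sum over phi_i != 1 only
    print(np.isclose(lhs, rhs))                                  # True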

Notice that the number of points in 𝒟^(b) △ 𝒟 is necessarily smaller than n, which translates into computational savings. To see this, we will need to make use of the following equivalence relation.

Given β_1, β_2 ∈ ℝ^p and 𝒮 a subset of 𝒟, we say β_1 is congruent to β_2 modulo 𝒮 if f(x; β_1, γ̂) = f(x; β_2, γ̂) for all x such that (x, y) ∈ 𝒮 for some y (i.e. β_1 and β_2 lead to the same classification on 𝒮). Congruency modulo 𝒮 defines an equivalence relation and hence partitions ℝ^p. For any fixed bootstrap dataset 𝒟^(b), let 𝒮 be the collection of distinct points in 𝒟^(b) △ 𝒟 and let ℳ_1, …, ℳ_R be the subsequent equivalence classes. Then, for any equivalence class ℳ_i,

$$\left|\hat{\xi}_{\mathcal{D}^{(b)}}(f(\cdot; \beta, \hat{\gamma})) - \hat{\xi}_{\mathcal{D}}(f(\cdot; \beta, \hat{\gamma}))\right| = C(\mathcal{M}_i) \quad \forall\, \beta \in \mathcal{M}_i,$$

where C(ℳ_i) is a constant. Then, referring back to step 2 of the bootstrap procedure outlined in Section 3, we can write,

$$\begin{aligned} Q^{(b)} &= \sup_{\beta \in \mathbb{R}^p} \left|\hat{\xi}_{\mathcal{D}}(f(\cdot; \beta, \hat{\gamma})) - \hat{\xi}_{\mathcal{D}^{(b)}}(f(\cdot; \beta, \hat{\gamma}))\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right) \\ &= \sup_i C(\mathcal{M}_i) \times \sup_{\beta \in \mathcal{M}_i} g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right) \\ &= \sup_i C(\mathcal{M}_i) \times g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - \inf_{\beta \in \mathcal{M}_i} L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right), \end{aligned}$$

with the last equality following from the monotonicity of g. Since L is convex in β, this would be a series of convex optimization problems if membership in ℳ_i could be expressed as a set of convex constraints. We now show how this can be done.

Let β_{ℳ_i} be a representative member of equivalence class ℳ_i. Then for any other β ∈ ℳ_i we must have,

$$f(x_j; \beta_{\mathcal{M}_i}, \hat{\gamma}) \sum_{k=1}^{p} \beta_k b(\hat{\gamma}_k, x_j) \geq 0$$

for all x_j such that (x_j, y) ∈ 𝒟^(b) △ 𝒟 for some y. Since f(x_j; β_{ℳ_i}, γ̂) does not depend on β, we see that membership in ℳ_i is equivalent to satisfying a series of linear constraints, which is clearly convex. We have shown that calculation of Q^(b) amounts to a series of R convex optimization problems.

In practice, even though n is small (e.g. n ≤ 50), R can be very large (on the order of 2^n in the worst case), making direct computation of all R convex optimization problems infeasible. Fortunately, exhaustive computation can be done using a branch and bound algorithm.

To use branch and bound we must recursively partition the search space. To do this we arbitrarily label the feature vectors in 𝒟^(b) △ 𝒟 by x_1, …, x_d. The root of the tree represents all of ℝ^p. The first left child represents the set of all β ∈ ℝ^p such that Σ_{k=1}^{p} β_k b(γ̂_k, x_1) ≥ 0, while the first right child represents all β ∈ ℝ^p such that Σ_{k=1}^{p} β_k b(γ̂_k, x_1) < 0. In general, if 𝒮 is the region defined by a node at level j of the tree, then its left child represents the subspace of 𝒮 which satisfies β ∈ 𝒮 and Σ_{k=1}^{p} β_k b(γ̂_k, x_{j+1}) ≥ 0; similarly, its right child is the subspace of 𝒮 which satisfies β ∈ 𝒮 and Σ_{k=1}^{p} β_k b(γ̂_k, x_{j+1}) < 0. Notice that the terminal nodes of this tree are either infeasible or one of the equivalence classes ℳ_1, …, ℳ_R.

To complete the branch and bound algorithm we need to define upper and lower bounds on the value of the objective function,

$$O(\beta) = \left|\hat{\xi}_{\mathcal{D}^{(b)}}(f(\cdot; \beta, \hat{\gamma})) - \hat{\xi}_{\mathcal{D}}(f(\cdot; \beta, \hat{\gamma}))\right| \times g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right).$$

In particular, given a region 𝒮 of ℝ^p defined by a node on the tree representing our partition, we require an upper bound U(𝒮) on O(β),

$$\sup_{\beta \in \mathcal{S}} O(\beta) \leq U(\mathcal{S}).$$

The upper bound U(𝒮) is constructed by bounding the two terms in O(β) separately. An upper bound on |ξ̂_{𝒟^(b)}(f(·; β, γ̂)) − ξ̂_𝒟(f(·; β, γ̂))| can be obtained by noticing that if 𝒮 is a region of ℝ^p defined by a node at level j, then the classification of the first j points in 𝒟^(b) △ 𝒟 is fixed on 𝒮. The upper bound is constructed by assuming the worst performance possible on the remaining d − j points. To bound the second term, we compute,

$$\sup_{\beta \in \mathcal{S}} g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right),$$

which is a convex optimization problem. The upper bound U(𝒮) is the product of these two upper bounds.

In addition, we require the following lower bound,

$$L(\mathcal{S}) \leq \sup_{\beta \in \mathcal{S}} O(\beta).$$

The lower bound L(𝒮) is obtained by plugging any feasible point in 𝒮 into O(β). In practice, a natural choice is the argsup of the second term in the objective function, which has already been computed during the construction of U(𝒮).

This algorithm, running on a standard desktop with a sample size of n = 50 and using a total of 500 bootstrap samples to construct a confidence set, runs in only a few minutes. While this is still an NP-hard problem,^2 in practice it is significantly less computationally intensive than evaluating all possible classifications. The reason is that the function g allows for significant reduction of the search space by restricting attention only to classifiers within a fixed distance of the selected classifier f̂. To see this, notice that in the construction of U(𝒮), if the term

$$\sup_{\beta \in \mathcal{S}} g\!\left(\alpha_n \left(L(\mathcal{D}^{(b)}, f(\cdot; \hat{\beta}, \hat{\gamma})) - L(\mathcal{D}^{(b)}, f(\cdot; \beta, \hat{\gamma}))\right)\right)$$

is non-positive, we can remove the region 𝒮 from our search space since we know O(β̂) ≥ 0.
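
To make the structure of the search concrete, the following is a minimal sketch for a linear classifier f(x; β) = sign(βᵀx) with squared error surrogate loss, using SciPy's SLSQP solver for the convex subproblems. It follows the partition, bounding, and pruning logic described above, but it is only an illustration under those assumptions, not the implementation used in the experiments (infeasible regions and solver failures are not handled).

    # Schematic branch and bound for Q^(b) with a linear classifier and squared
    # error surrogate loss. phi holds the bootstrap copy counts of each point.
    import numpy as np
    from scipy.optimize import minimize

    def g_fn(x):
        # g(x) = (x + 1) ∧ 1
        return min(x + 1.0, 1.0)

    def sq_loss(beta, X, y, w):
        # weighted squared error surrogate loss on the bootstrap sample D^(b)
        return float(np.sum(w * (X @ beta - y) ** 2))

    def q_b_branch_and_bound(X, y, phi, beta_hat, alpha_n):
        n = len(y)
        w = phi.astype(float)
        diff = np.flatnonzero(phi != 1)          # points of D^(b) "triangle" D
        L_hat = sq_loss(beta_hat, X, y, w)       # loss of the fitted classifier
        best = 0.0                               # valid incumbent, since O(beta_hat) >= 0

        def recurse(level, preds, constraints):
            nonlocal best
            fixed, free = diff[:level], diff[level:]
            # Error-difference term: exact on the fixed points, worst case on the rest.
            fixed_part = np.sum((w[fixed] - 1.0) * (preds != y[fixed])) if level else 0.0
            err_ub = (abs(fixed_part) + np.sum(np.abs(w[free] - 1.0))) / n
            # g term: minimize the (convex) surrogate loss over the region.
            res = minimize(sq_loss, beta_hat, args=(X, y, w),
                           method="SLSQP", constraints=constraints)
            g_ub = g_fn(alpha_n * (L_hat - res.fun))
            if g_ub <= 0.0 or err_ub * g_ub <= best:
                return                           # prune: region cannot beat the incumbent
            if level == len(diff):               # leaf = equivalence class M_i; bound is exact
                best = max(best, err_ub * g_ub)
                return
            xj = X[diff[level]]
            for s in (1.0, -1.0):                # branch on the sign of beta @ x_j
                c = {"type": "ineq", "fun": lambda b, xj=xj, s=s: s * float(b @ xj)}
                recurse(level + 1, np.append(preds, s), constraints + [c])

        recurse(0, np.array([]), [])
        return best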

5 Experiments

In this section we describe a set of experiments designed to test the coverage and diameter of confidence sets constructed using the CUD bound. To form a baseline for comparison we also construct confidence sets using a handful of methods found in the literature. These consist of two non-parametric bootstrap methods, a normal approximation using the bootstrap [Kohavi 1995], a normal approximation using cross-validation [Yang 2006], a generalization of a Bayesian approach [Martin 1996], and the inverted binomial [Langford 2005].^3 An online supplemental appendix provides a full description of these methods and the simulation study, and gives links to the source code (www.stat.lsa.umich.edu/~laber/UAI2008.html). The confidence sets were constructed for sample sizes of n = 30 and n = 50 using the datasets given in Table 1. All the data sets have binary labels and continuous features. The real datasets can be found at the UCI data repository (www.ics.uci.edu/mlearn). The simulated data sets were designed to investigate the performance of the new procedure in several scenarios of interest.

Table 1
Datasets used in experiments along with the reason for their inclusion. Here f* is the limiting value of the classifier as the amount of training data becomes infinite.

For this series of experiments we fit a linear classifier using squared error loss. That is,

$$\mathcal{G} = \left\{ f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{p} \beta_i \varphi_i(x)\right),\ \beta_i \in \mathbb{R} \right\},$$

where the φ_i are transformations of the features (in all cases φ_i was either a polynomial or a projection onto a number of principal components). The surrogate loss function is given by

$$L(\mathcal{D}, f(\cdot; \beta)) = \sum_{(x_i, y_i) \in \mathcal{D}} \left(\sum_{j=1}^{p} \beta_j \varphi_j(x_i) - y_i\right)^2,$$

and the scaling factor α_n used in g is chosen to be n. For each data set the following procedure was used to estimate coverage probabilities.

  1. Randomly divide the data into a training set 𝒟 and a testing set 𝒯.
    • If the data set is real, randomly select n training points for 𝒟 and use the remainder for 𝒯.
    • If the data set is simulated, generate n data points for 𝒟 and generate 5000 data points for 𝒯.
  2. Build classifier f̂ = arg min_{f∈𝒢} L(𝒟, f) using training data 𝒟.
  3. For each procedure, use the training data 𝒟 and chosen classifier f̂ to construct a confidence set with target coverage .950. The confidence interval is centered at ξ̂_𝒟(f̂), taken to be the .632 estimate [Efron 1983] in the competing methods and the training error in the CUD bound.^4
  4. Using ξ(f̂) = ξ̂_𝒯(f̂), we record the coverage for each procedure.

The above steps are repeated 1000 times for each data set and the average coverage computed. The results are shown in Tables 2 and 3 with the following abbreviations: quantile bootstrap (BS1) [Efron 1994], corrected bootstrap (BS2) [Efron 1994], Yang’s CV estimate (Y) [Yang 2006], Kohavi’s normal approximation (K) [Kohavi 1995], a generalization of a Bayesian approach from Martin (M) [Martin 1996], and the inverted binomial of Langford (L) [Langford 2005].
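
A minimal sketch of one replication of this loop is given below; `draw` (a data-generating routine) and `conf_set` (any interval constructor, e.g. the CUD bound) are placeholders, and the least-squares fit mirrors the squared error surrogate used in the experiments.

    # One replication of the coverage experiment for a simulated data set.
    import numpy as np

    def fit_linear(Phi, y):
        # beta_hat = argmin_beta sum_i (Phi_i beta - y_i)^2
        return np.linalg.lstsq(Phi, y, rcond=None)[0]

    def one_replication(draw, conf_set, n=30, n_test=5000, delta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        Phi, y = draw(n, rng)                          # training set D
        Phi_test, y_test = draw(n_test, rng)           # large test set standing in for xi(f_hat)
        beta_hat = fit_linear(Phi, y)
        xi_true = np.mean(np.sign(Phi_test @ beta_hat) != y_test)
        lower, upper = conf_set(Phi, y, beta_hat, delta)
        return lower <= xi_true <= upper               # did the confidence set cover?

    # Averaging one_replication over many seeds estimates the coverage
    # probabilities reported in Tables 2 and 3.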

Table 2
Average coverage of new method and competitors. Target coverage level was .950, those meeting or exceeding this level are highlighted in orange.
Table 3
Average diameter of confidence set constructed using new method and competitors. Method with smallest diameter and having correct coverage is highlighted in yellow.

The results for the n = 50 case are similar and are given in Tables 4 and 5 of the appendix.

Table 4
Appendix: Experimental Results for n = 50
Table 5
Appendix: Experimental Results for n = 50

Tables 2 and 3 show promising results for the new method. It was the only method to provide the desired coverage on all ten datasets and, in addition, it possessed the smallest diameter for two of them. The CUD bound confidence set is conservative, as is to be expected from its construction. However, the diameter of the new confidence set is far from trivial, even for this extremely small sample size.

It is also of interest to learn when competing methods perform poorly. Figure 1 plots the empirical coverage of the top four methods against the mean absolute value of the training error’s kurtosis (standardized central fourth moment minus 3) for each of the nine data sets listed in Table 1. The kurtosis of a normal distribution is 0, and we use absolute kurtosis as a measure of deviation from normality. Figure 1 shows a trend of decreasing coverage as deviation from normality increases for the three distribution-based methods. This suggests these methods may not be robust to non-normal training error. In particular, we see that method K (which has the strongest normality assumptions) is the most sensitive to deviation from normality. In contrast, the CUD bound method has stable, if conservative, coverage, regardless of the normality of the training error.

Figure 1
Shows coverage of top four methods by absolute value of kurtosis of the training error.

6 Conclusions and Future Work

In this paper we introduced a new method for constructing confidence sets for the generalization error in the small sample setting in classification. This bound is derived under the minimal assumption of a training set of iid draws from some unknown distribution F. In addition, we provided a computationally efficient algorithm for constructing this new confidence set when the classifier is approximated by a parametric additive model. In preliminary experiments, the new confidence set exhibited superior coverage while maintaining a small diameter on a catalog of test and simulated data sets. Moreover, it was demonstrated that confidence sets based on distributional assumptions may not be robust to deviation from normality in the training samples.

Much work remains to be done. First, we need to provide theoretical guarantees for the level of confidence. We conjecture that a correct choice of αn will depend on the complexity of the approximation space used to construct the classifier. It is well known that bootstrap estimators perform better when estimating smooth functions. Selecting αn = ∞ is similar to standard quantile bootstrap (BS1) which is ineffective in the classification setting because it bootstraps the non-smooth 0–1 loss. This suggests an upper bound on αn. Conversely, a smaller choice of αn leads to a bootstrapped estimate of a smoother quantity but also introduces conservatism. For example, setting αn = 0 is equivalent to a uniform deviation bound on the training error which can often be trivially loose. This suggests a lower bound on αn as well. Thus, αn must strike a balance between smoothness required by the bootstrap and conservatism.

Footnotes

^1 Here we use effective sample size to mean the expected number of unique points used in estimation during a single step of a resampling procedure.

^2 Update: In the case of linear classification with squared error loss, current work in progress has proved that this runs in polynomial time, on the order of O(n^{VC(𝒢)}).

^3 In this approach we are plugging in a resampled estimate and the effective sample size. That is, we are acting as if we had a test set.

^4 The competing methods experience a significant reduction in performance if they are centered at the training error.

References

1. Bartlett P, Jordan M, McAuliffe J. Convexity, classification, and risk bounds. J Amer Statis Assoc. 2006 Mar;101(473):138–156.
2. Bartlett P, Mendelson S. Rademacher and gaussian complexities: Risk bounds and structural results. J Mach Learn Res. 2002;3:463–482.
3. Braga-Neto Ulisses, Dougherty Edward. Bolstered error estimation. Pattern Recognition. 2004;37:1267–1281.
4. Efron B. Estimating the error rate of a prediction rule: Improvement on cross-validation. J Amer Statis Assoc. 1983;78(382):316–331.
5. Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman & Hall/CRC; May, 1994.
6. Efron B, Tibshirani R. Cross-validation and the bootstrap: Estimating the error of a prediction rule. Technical report. 1995
7. Hall P. The Bootstrap and Edgeworth Expansion. Springer Series in Statistics. New York: Springer; 1992.
8. Kohavi Ron. A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI; 1995. pp. 1137–1145.
9. Langford J. Combining train set and test set bounds. ICML; 2002.
10. Langford J. Tutorial on practical prediction theory for classification. J Machine Learning Research. 2005;6:273–306.
11. Martin J Kent, Hirschberg Daniel S. Small sample statistics for classification error rates II: Confidence intervals and significance tests. 1996 Technical Report ICS-TR-96-22.
12. Schiavo Rosa A, Hand David J. Ten more years of error rate research. International Statistical Review. 2000;68(3):295–310.
13. Weiss SM. Small sample error rate estimation for k-nn classifiers. Transactions on Pattern Analysis and Machine Intelligence. 1991 Mar;13(3):285–289.
14. Yang Yuhong. Comparing learning methods for classification. Statistica Sinica. 2006;16(2):635–657.
15. Zhang Xi, Shin Kang G. Statistical analysis of feedback-synchronization signaling delay for multicast flow control. INFOCOM; 2001. pp. 1152–1161.