
Curr Genomics. 2009 November; 10(7): 446–462.

PMCID: PMC2808673

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77845, USA

Received 2009 February 19; Revised 2009 March 17; Accepted 2009 March 19.

Copyright ©2009 Bentham Science Publishers Ltd.

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.5/), which permits unrestrictive use, distribution, and reproduction in any medium, provided the original work is properly cited.


Discrete classification is common in Genomic Signal Processing applications, in particular in the classification of discretized gene expression data, in discrete gene expression prediction, and in the inference of Boolean genomic regulatory networks. Once a discrete classifier is obtained from sample data, its performance must be evaluated through its classification error. In practice, error estimation methods must be employed to obtain reliable estimates of the classification error based on the available data. Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of small-sample data sets in such applications. This paper presents a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics, focusing on performance in small-sample scenarios as well as asymptotic behavior.

In high-throughput Genomics applications, the objective is often to classify different phenotypes based on a panel of gene expression biomarkers, or to infer underlying gene regulatory networks from gene expression data. It is often advantageous to discretize gene expression data, for reasons of data efficiency and classification accuracy. Classification of discrete data is a subject with a long history in Pattern Recognition [1-7]. In Genomics applications, this methodology has been applied both in the classification of discretized gene expression data [8, 9], and in discrete gene expression prediction and the inference of Boolean genomic regulatory networks, *via* the *binary coefficient of determination* (CoD) [10-12].

The most often employed discrete classification rule is the *discrete histogram rule* [1, 3, 4, 6, 13]. This classification rule has many desirable properties. For example, it can be shown that it is strongly universally consistent, meaning that, regardless of the particular distribution of the data, this rule can eventually learn the optimal classifier from the data, as the sample size increases, with probability one. In addition, the discrete histogram rule is simple enough to allow the exact analytical study of many of its properties.

Once a classifier is obtained from sample data, its performance must be evaluated. The most important criterion for performance is the *classification error*, which is the probability of making an erroneous classification on a future sample. The classification error can be computed exactly only if the underlying distribution of the data is known, which is almost never the case in practice. Robust *error estimation* methods must then be employed to obtain reliable estimates of the classification error based on the available data. An error estimator is a sample-based statistic, the bias and variance (and thus the root mean square error, RMS) properties of which determine how consistently the error estimator is near the true classification error, considering all possible sample training data sets from a given population. More generally, all statistical questions regarding the accuracy of the error estimator can be answered through the joint sampling distribution of the error estimator and true probability of error [14]. From an epistemological perspective, error estimation has to do with the fundamental question of the validity of scientific knowledge [15]. The quality of the error estimate determines the accuracy of the predictions that can be performed by the inferred model, and thus its scientific content.

Both classifier design and error estimation are complicated, in the case of Genomics, by the prevalence of *small-sample* data sets in such applications. With a small training sample set, the designed classifier will be, on average, more dissimilar to the optimal classifier, and thus have a larger classification error. In addition to that, in a small-sample setting, one must use the same data to both design the classifier and assess its validity, which requires data-efficient error estimators, and this in turn calls for careful study of performance.

It is the goal of the present paper to present a broad review of the methodology of classification and error estimation for discrete data, in the context of Genomics. The paper is organized as follows. Section 2 illustrates the application of discrete classification in Genomics through a pair of simple examples. Section 3 formalizes the problem, with particular emphasis on the discrete histogram classification rule. Section 4 reviews the most common error estimators used in discrete classification, commenting briefly on their properties. Sections 5 through 7 contain the bulk of the literature review on the subject. Section 5 reviews results on the small-sample performance of discrete classification; these are analyses that must hold for a given finite number of samples. This section reviews exact and approximate expressions for performance metrics of the actual and estimated errors for the discrete histogram rule; complete enumeration methods that can deal with intractable cases such as conditional performance metrics; distribution-free results on small-sample performance, with emphasis on the pioneering work of G.F. Hughes; as well as recent analytical results that indicate that ensemble classification methods may be largely ineffective in discrete classification. Section 6, by contrast, focuses on the large-sample performance of discrete classification; this is a more technical section, which reviews asymptotic results on whether optimal performance is reached, and how fast, as the sample size increases. Finally, Section 7 reviews the binary coefficient of determination (CoD).

The objective of *classification* is to employ a set of *training data*, consisting of independently observed known cases, and obtain a fixed rule to classify, as accurately as possible, unknown cases from the same population. The training data consists of carefully measured values of predictive variables and a response variable for each case. The response variable in classification is always *discrete*, i.e., it assumes a finite number of values; in fact, it is often binary, indicating one of two states, such as distinct cell phenotypes, disease severity, and so on.

If the predictor variables are also discrete, then one is in the context of *discrete classification*, also known as *multinomial classification* [13] and *discrete discriminant analysis* [7]. Additionally, in Statistics, the term *categorical data analysis* is often employed to refer to the statistical analysis of discrete data [16]. In Genomics, the predictor variables often correspond to the expression of a set of suitably selected genes; for discrete classification, gene expression must first be discretized into a finite number of intervals --- methods to accomplish this are described in [8, 9]. Note that the finite values taken on by the discrete predictors could be numeric (e.g., the mid-point value of an expression range), or purely categorical, as is often done by alluding to “up-regulated” and “down-regulated” genes. This distinction is immaterial in the case of the most commonly used discrete classification rule, known as the *discrete histogram rule* [1, 3, 4, 6, 13]. The discrete histogram rule simply assigns to each combination of possible values of the predictor variables a response value that is decided by majority voting among the observed response values. As will be seen in this paper, this simple rule has many desirable and interesting properties.

Fig. (**1**) depicts an example of how the discrete histogram rule would function in the case of cell phenotype classification based on the discretized expression values of two genes. Classification is between the phenotypes of “treated” and “untreated” cells (e.g., presence or absence of some drug in the culture, presence of enough nutrients *vs*. starvation, normal *vs*. abnormal cells, and a host of other possible conditions), and gene expression is discretized into ternary values, corresponding to down-regulated, basal, and up-regulated values. There are therefore 3^{2} = 9 possible combinations of observable values, or “bins”, which can be organized in this case into a 3×3 matrix. In this example, the observed training data set contains a total of 40 cases, with an equal number of cases in each of the “treated” and “untreated” categories (sometimes called a “balanced experimental design” in Statistics). The *counts* of observed response values over each of the bins are shown in the figure. The majority class is underlined in each case, and this would be the class assigned to a future case by the discrete histogram rule if the observed gene expression values fall into that particular bin. Note that there are two particular cases that require attention: there could be a tie between the counts over a bin, and no values might be observed in the training data over a bin (missing data). These cases can be resolved by randomly picking one of the classes or, if one wants to avoid random classifiers, one can break ties, in a fixed manner, in favor of one of the classes; for example, one might classify such cases as “untreated”. Based on the resulting discrete classifier in this particular example, one might posit that up-regulation of both genes is associated with treatment of the cells.
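The majority-voting scheme just described can be sketched in a few lines of code. This is a hypothetical illustration, not code from the paper; the bins (pairs of ternary expression levels) and counts are invented, and ties and empty bins are resolved, in a fixed manner, in favor of class 0 (“untreated”), as in the tie-breaking option mentioned above.

```python
# Illustrative sketch (not from the paper) of the discrete histogram rule:
# majority voting over bins, with ties and empty bins broken, in a fixed
# manner, in favor of class 0 ("untreated"). Data below are hypothetical.
from collections import defaultdict

def train_histogram_rule(samples):
    """samples: list of (bin, label) pairs, label in {0, 1}.
    Returns a dict mapping each observed bin to its majority label."""
    counts = defaultdict(lambda: [0, 0])
    for b, y in samples:
        counts[b][y] += 1
    # class 1 requires a strict majority; ties go to class 0
    return {b: int(c[1] > c[0]) for b, c in counts.items()}

def classify(rule, b):
    return rule.get(b, 0)  # empty (unobserved) bin: default to class 0

# toy data: bins are pairs of ternary expression levels (0, 1, 2)
data = [((2, 2), 1), ((2, 2), 1), ((2, 2), 0),
        ((0, 0), 0), ((0, 0), 0), ((0, 1), 1), ((0, 1), 0)]
rule = train_histogram_rule(data)
print(classify(rule, (2, 2)))  # strict majority of class 1 -> 1
print(classify(rule, (0, 1)))  # tie -> 0
print(classify(rule, (1, 1)))  # empty bin -> 0
```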

Fig. (**2**) depicts another example, which illustrates how the discrete histogram rule would be applied in the case of discrete gene expression prediction; this constitutes the basic building block for the inference of gene regulatory networks [10, 11]. Gene expression values have been discretized into binary values, indicating activation or not of the particular gene, and the expression of three genes (the predictor variables) is used to predict the expression of a fourth gene (the response, or *target*, variable). Note that the number of bins in this case is 2^{3} = 8. The bins are represented side by side in Fig. (**2**), rather than organized into a matrix as in Fig. (**1**). This clearly makes no difference to the discrete histogram rule (an important point we will return to in the next Section). Note that the values of all variables (predictors and target) are coded into 0 and 1, and that ties in this example are broken in favor of the class 1, that is, high expression. As can be seen, prediction is based on a total of 40 cases, i.e., 40 instances of the 4-tuple consisting of the three predictor genes and the target gene. Note that the values 0 and 1 for the target gene are not represented equally, so the design is “unbalanced.” It will rarely be the case in gene prediction that the design is balanced, since here one cannot possibly or meaningfully specify in advance the target classes for the observations; this is a very important difference with respect to the previous case of phenotype classification, where it is often possible and meaningful to do so.

The validity of any scientific conclusions made based on the previous classification models depends, naturally, on the accuracy of the obtained classifiers. In addition, it critically depends on the reliable *estimation* of said accuracy, based on the available data. These issues will be approached in the sequel.

Examples of discrete classification rules include the discrete histogram rule, mentioned in the previous section, as well as the maximum-mean-accuracy rule of [5], and many other examples of discrete rules used in Data Mining [17]. Among these, the discrete histogram rule is by far the most used one in practice. The discrete histogram rule is “natural” for categorical problems, not only due to its simplicity as majority voting over the bins, but also because it corresponds to the plug-in rule for approximating the optimal Bayes classifier, as we discuss below. In this section, we formalize the problem of discrete classification, which allows us to examine the properties of the discrete histogram rule, including classification accuracy and its estimation from data.

Let the predictor variables be denoted by *X*_{1}, *X*_{2},..., *X _{d}*, and the response variable be denoted by *Y*, taking values 0 and 1. Since each discrete predictor takes on finitely many values, the vector (*X*_{1},..., *X _{d}*) may be coded into a single discrete predictor variable *X* taking values in {1,..., *b*}, where *b* is the total number of bins.

The properties of the discrete classification problem are completely determined by the (discrete) joint probability distribution between the predictor *X* and the target *Y*:

$P\left(X=i,Y=j\right),\ \mathrm{for}\ i=1,\ldots,b\ \mathrm{and}\ j=0,1.$

Given the identity $P\left(X=i,Y=j\right)=P\left(X=i\left|Y=j\right.\right)P\left(Y=j\right),$ it becomes clear that the discrete classification problem is determined by 2*b*+2 positive parameters *c*_{0} = *P*(*Y*=0), *c*_{1} = *P*(*Y*=1), and ${p}_{i}=P\left(X=i\left|Y=0\right.\right),\ {q}_{i}=P\left(X=i\left|Y=1\right.\right),$ for *i* = 1,...,*b*. Note that the parameters are not independent, since one must have ${c}_{0}+{c}_{1}=1,\ \sum {p}_{i}=1,\ \mathrm{and}\ \sum {q}_{i}=1.$

Through Bayes' theorem, these model parameters determine the posterior probabilities $P\left(Y=j\left|X=i\right.\right)$ for the classification problem, $P\left(Y=0\left|X=i\right.\right)=\frac{P\left(Y=0,X=i\right)}{P\left(X=i\right)}=\frac{{c}_{0}{p}_{i}}{{c}_{0}{p}_{i}+{c}_{1}{q}_{i}}$ with $P\left(Y=1\left|X=i\right.\right)=1-P\left(Y=0\left|X=i\right.\right)$. Therefore, the classifier *Ψ*^{*} that achieves the minimum *probability of misclassification* $P\left(Y\ne \mathrm{\Psi}\left(X\right)\right),$
known as the *Bayes classifier* [13], is given by

$${\mathrm{\psi}}^{\ast}\left(X=i\right)={I}_{{c}_{1}{q}_{i}>{c}_{0}{p}_{i}}=\left\{\begin{array}{cc}1,&{c}_{1}{q}_{i}>{c}_{0}{p}_{i}\\ 0,&{c}_{1}{q}_{i}\le {c}_{0}{p}_{i}\end{array}\right.,\quad i=1,\ldots,b.$$

(1)

It can be shown that if there are two or more discrete features in the original feature space (such as in Fig. **1**), and these features are conditionally independent given *Y*, i.e., within each class, then the Bayes classifier *Ψ*^{*} is a linear function of those features [13, p.466].

The minimum probability of misclassification, or *Bayes error*, achieved by the Bayes classifier, can be written as

$${\mathrm{\epsilon}}^{\ast}=\sum _{i=1}^{b}\left[{c}_{0}{p}_{i}{I}_{{c}_{1}{q}_{i}>{c}_{0}{p}_{i}}+{c}_{1}{q}_{i}{I}_{{c}_{0}{p}_{i}\ge {c}_{1}{q}_{i}}\right]=\sum _{i=1}^{b}\mathrm{min}\left\{{c}_{0}{p}_{i},{c}_{1}{q}_{i}\right\}.$$

(2)

Here, *I _{A}* is an indicator variable, equal to 1 if the event *A* occurs, and to 0 otherwise.

The Bayes error is a measure of distance between the classes, and it provides a lower bound on classification performance. For discrete histogram classification, the predictor variables in the original feature space should be chosen so that the Bayes error is as small as possible.
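To make these definitions concrete, the following sketch evaluates the Bayes classifier and the Bayes error of Eqs. (1) and (2) for a small model with *b* = 3 bins; all parameter values below are assumptions chosen purely for illustration.

```python
# Hypothetical model parameters (illustration only)
c0, c1 = 0.5, 0.5
p = [0.6, 0.3, 0.1]   # P(X=i | Y=0), i = 1, 2, 3
q = [0.1, 0.3, 0.6]   # P(X=i | Y=1)

# Eq. (1): assign class 1 to bin i iff c1*q_i > c0*p_i (ties to class 0)
bayes_classifier = [int(c1 * qi > c0 * pi) for pi, qi in zip(p, q)]

# Eq. (2): Bayes error = sum_i min{c0*p_i, c1*q_i}
bayes_error = sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))

print(bayes_classifier)       # [0, 0, 1] (the tie in bin 2 goes to class 0)
print(round(bayes_error, 3))  # 0.05 + 0.15 + 0.05 = 0.25
```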

In practice, one almost never knows the model parameters completely, and therefore one does not know the Bayes classifier. One must rely instead on designing a classifier from sample *training data*; one hopes that such a sample-based classifier is close in some sense to the Bayes classifier. The classifier produced by the discrete histogram rule becomes indeed very close to the Bayes classifier, as sample size increases, in a few important senses; this will be discussed in Section 6.

Given sample data ${S}_{n}=\left\{\left({X}_{1},{Y}_{1}\right),...,\left({X}_{n},{Y}_{n}\right)\right\}$ containing *n* independent and identically distributed (i.i.d.) samples, one defines the *bin counts* *U _{i}* and *V _{i}*, for *i* = 1,...,*b*, as the numbers of samples with *X* = *i* and *Y* = 0, and with *X* = *i* and *Y* = 1, respectively. The discrete histogram rule assigns to each bin the majority label among the training points in that bin:

$${\mathrm{\psi}}_{n}\left({S}_{n},X=i\right)={I}_{{U}_{i}<{V}_{i}}=\left\{\begin{array}{cc}1,&{U}_{i}<{V}_{i}\\ 0,&{U}_{i}\ge {V}_{i}\end{array}\right.,\quad i=1,\ldots,b.$$

(3)

When a specific training sample *S _{n}* is observed, the designed classifier *ψ _{n}*(*S _{n}*, ·) becomes a fixed, nonrandom function of *X*.

Ordinarily, the samples in *S _{n}* are drawn from a mixture of the two class populations, and therefore the numbers ${N}_{0}=\sum {U}_{i}\ \mathrm{and}\ {N}_{1}=\sum {V}_{i}$ of samples in classes 0 and 1, respectively, are random variables. In this case, one speaks of *full* (mixture) sampling; by contrast, under *stratified* (separate) sampling, the class sample sizes *n*_{0} and *n*_{1} are fixed in advance by the experimental design.

The discrete histogram rule is the “plug-in” rule for discrete classification, that is, if one plugs the standard maximum-likelihood (ML) estimators of the unknown model parameters *c*_{0}, *c*_{1} and {*p _{i}*}, {*q _{i}*}, given by

$$\hat{{c}_{0}}=\frac{\sum _{i}{U}_{i}}{n},\quad \hat{{c}_{1}}=\frac{\sum _{i}{V}_{i}}{n},\quad \mathrm{and}\quad \hat{{p}_{i}}=\frac{{U}_{i}}{\sum _{i}{U}_{i}},\quad \hat{{q}_{i}}=\frac{{V}_{i}}{\sum _{i}{V}_{i}},\quad \mathrm{for}\ i=1,...,b,$$

(4)

in the expression for the Bayes classifier in (1), one obtains precisely the histogram classifier in (3). Since the standard ML estimators in (4) are consistent, meaning that they converge to the true values of the parameters as the sample size increases, one would expect the discrete histogram classifier to approach the optimal Bayes classifier as more samples are acquired, which is indeed the case; we come back to this issue in Section 6.
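The plug-in property is easy to verify numerically: with the ML estimates of (4), $\hat{c}_1\hat{q}_i = V_i/n$ and $\hat{c}_0\hat{p}_i = U_i/n$, so the plug-in Bayes rule reduces to comparing *V _{i}* with *U _{i}*, exactly as in (3). A sketch with hypothetical bin counts:

```python
# Hypothetical bin counts (class 0 and class 1), chosen with no ties
U = [5, 2, 0, 3]
V = [1, 3, 4, 6]
n = sum(U) + sum(V)

# ML estimates of Eq. (4)
c0_hat, c1_hat = sum(U) / n, sum(V) / n
p_hat = [u / sum(U) for u in U]
q_hat = [v / sum(V) for v in V]

# plug-in Bayes rule vs. direct majority vote of Eq. (3)
plug_in = [int(c1_hat * qh > c0_hat * ph) for ph, qh in zip(p_hat, q_hat)]
histogram = [int(v > u) for u, v in zip(U, V)]
print(plug_in == histogram)  # True: the two rules coincide bin by bin
```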

The most important performance criterion for the designed classifier *Ψ _{n}* is its accuracy on independent (e.g., future) data, which are assumed to come from the same population as the given training data. This accuracy is measured by the probability of misclassification ${\mathrm{\epsilon}}_{n}=P\left(Y\ne {\mathrm{\psi}}_{n}\left(X\right)\right),$ where (*X*, *Y*) is a test sample independent of *S _{n}* but drawn from the same distribution. For the discrete histogram rule, this error can be written as

$${\mathrm{\epsilon}}_{n}=\sum _{i=1}^{b}P\left(X=i,Y=1-{\mathrm{\psi}}_{n}\left(X=i\right)\right)=\sum _{i=1}^{b}\left[{c}_{0}{p}_{i}{I}_{{V}_{i}>{U}_{i}}+{c}_{1}{q}_{i}{I}_{{U}_{i}\ge {V}_{i}}\right]$$

(5)

Being a function of the random variables *U _{i}* and *V _{i}*, the classification error *ε _{n}* is itself a random variable, with its own sampling distribution, expectation, and variance.
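When the model is known, Eq. (5) can be evaluated directly. The sketch below computes the actual error of a designed histogram classifier; the model parameters and bin counts are hypothetical values chosen for illustration.

```python
# Hypothetical model and observed bin counts (illustration only)
c0, c1 = 0.5, 0.5
p = [0.6, 0.3, 0.1]   # P(X=i | Y=0)
q = [0.1, 0.3, 0.6]   # P(X=i | Y=1)
U = [4, 1, 2]         # class-0 training counts per bin
V = [1, 3, 3]         # class-1 training counts per bin

# Eq. (5): bin i contributes c0*p_i where the classifier says 1 (V_i > U_i),
# and c1*q_i where it says 0 (U_i >= V_i)
eps_n = sum(c0 * pi if vi > ui else c1 * qi
            for pi, qi, ui, vi in zip(p, q, U, V))
print(round(eps_n, 3))  # 0.05 + 0.15 + 0.05 = 0.25
```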

In practice, the underlying probability model is unknown, and the classification error *ε _{n}* has to be estimated from the sample data using an *error estimator*.

Like the classification error *ε _{n}* itself, a nonrandomized error estimator $\hat{{\mathrm{\epsilon}}_{n}}$ produces a fixed value once the training data set is specified; a randomized error estimator, by contrast, depends on internal random factors, and can therefore produce different values over repeated applications to the same data, which contributes an additional internal variance.

The variance of the error estimator, by itself, does not address its relationship to the quantity to be estimated, namely, the actual classification error. Relevant performance metrics that do so are discussed next. The *bias* $E\left[\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right]$ of an error estimator measures whether, on average, it overestimates the true error, or underestimates it, whereas the *deviation variance* $\mathrm{Var}\left(\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right)$ measures the spread of the deviation between estimated and actual errors; it can in fact be written as

$$\mathrm{Var}\left(\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right)=\mathrm{Var}\left(\hat{{\mathrm{\epsilon}}_{n}}\right)+\mathrm{Var}\left({\mathrm{\epsilon}}_{n}\right)-2\mathrm{\rho}\left(\hat{{\mathrm{\epsilon}}_{n}},{\mathrm{\epsilon}}_{n}\right)\sqrt{\mathrm{Var}\left(\hat{{\mathrm{\epsilon}}_{n}}\right)\mathrm{Var}\left({\mathrm{\epsilon}}_{n}\right)}$$

(6)

a remarkable formula that combines the variances of the actual error and error estimator, and their *correlation* $\mathrm{\rho}\left(\hat{{\mathrm{\epsilon}}_{n}},{\mathrm{\epsilon}}_{n}\right)$. Small bias is of little significance if the deviation variance is large; this would mean that on average the error estimator is close to the true error, but that in fact the estimate for any particular sample set is likely to be far away from the true error. The *root mean-square error* (RMS)

$$\mathit{RMS}\left(\hat{{\mathrm{\epsilon}}_{n}}\right)=\sqrt{E\left[{\left(\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right)}^{2}\right]}=\sqrt{{\left(E\left[\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right]\right)}^{2}+\mathrm{Var}\left(\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right)}$$

(7)

conveniently combines both the bias and the deviation variance into a single measure, and is widely adopted for comparison of error estimator performance. Additional performance measures include the *tail probabilities* $P\left(\left|\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\right|>\mathrm{\tau}\right),\ \mathrm{for}\ \mathrm{\tau}>0,$ which concern the likelihood of outliers, as well as the consistency of the error estimator; the *conditional bias* $E\left[\hat{{\mathrm{\epsilon}}_{n}}-{\mathrm{\epsilon}}_{n}\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right]=\hat{{\mathrm{\epsilon}}_{n}}-E\left[{\mathrm{\epsilon}}_{n}\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right]$ (resp. conditional deviation variance and RMS); and *confidence intervals* [*a,b*] such that $P\left(a\le {\mathrm{\epsilon}}_{n}\le b\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right)>1-\mathrm{\alpha},$ for 0 ≤ *α* ≤ 1, which give bounds on the true error corresponding to a given precision *α*, the observed error estimate, and the sample size. Confidence intervals express statistical power in error estimation --- more powerful methods will produce shorter confidence intervals for the true error at the same sample size. A very important fact is that all of the aforementioned performance metrics, and in fact any others, can be determined if one has knowledge of the *joint sampling distribution* of the vector of actual and estimated errors $\left({\mathrm{\epsilon}}_{n},\hat{{\mathrm{\epsilon}}_{n}}\right)$. Section 5 reviews exact analytical methods to compute these performance metrics, as well as complete enumeration methods that allow the computation of the joint sampling distribution of actual and estimated errors.
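All of these metrics follow directly once paired draws of actual and estimated errors are available, e.g., from a simulation. The sketch below uses synthetic, hypothetical pairs solely to illustrate the decomposition RMS² = bias² + deviation variance implied by Eq. (7).

```python
import math, random
random.seed(1)

# hypothetical paired draws (actual error, estimated error); in a real study
# these would come from repeated sampling of training sets from a model
pairs = [(random.uniform(0.1, 0.3), random.uniform(0.05, 0.35))
         for _ in range(10000)]

dev = [e_hat - e for e, e_hat in pairs]
bias = sum(dev) / len(dev)
dev_var = sum((d - bias) ** 2 for d in dev) / len(dev)
rms = math.sqrt(sum(d * d for d in dev) / len(dev))

# Eq. (7): RMS^2 = bias^2 + deviation variance
print(abs(rms ** 2 - (bias ** 2 + dev_var)) < 1e-9)  # True
```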

The resubstitution error estimator $\hat{{\mathrm{\epsilon}}_{n}^{r}}$ [18] is the simplest data-efficient alternative; it is simply the apparent error, or the proportion of errors the designed classifier makes on the training data itself. Clearly,

$$\hat{{\mathrm{\epsilon}}_{n}^{r}}=\frac{1}{n}\sum _{i=1}^{b}\mathrm{min}\left\{{U}_{i},{V}_{i}\right\}=\frac{1}{n}\sum _{i=1}^{b}\left[{U}_{i}{I}_{{V}_{i}>{U}_{i}}+{V}_{i}{I}_{{U}_{i}\ge {V}_{i}}\right].$$

(8)

For example, in Fig. (**2**), the resubstitution estimate for the classification error is 12/40 = 0.3. It is easy to see that plugging the ML estimators of the model parameters in (4) into the formula for the Bayes error (2) results in expression (8). Therefore, resubstitution for the discrete histogram rule is the plug-in estimator of the Bayes error in discrete classification. The resubstitution estimator is clearly nonrandomized, and it is very fast to compute. This estimator is, however, always optimistically biased in the case of the discrete histogram rule, in the sense that $E\left[\hat{{\mathrm{\epsilon}}_{n}^{r}}\right]\le E\left[{\mathrm{\epsilon}}_{n}\right],$ for any sample size and distribution of the data. In fact, it can be shown that

$$E\left[\hat{{\mathrm{\epsilon}}_{n}^{r}}\right]\le {\mathrm{\epsilon}}^{\ast},$$

(9)

so that the average resubstitution estimate bounds from below even the Bayes error; this fact seems to have been demonstrated for the first time in [1] (see also [2]). Observe though that this is not guaranteed to apply to any individual training data and classifier, but only to the average over all possible training data. The optimistic bias of resubstitution tends to be larger when the number of bins is large compared to the sample size; in other words, there is more overfitting of the classifier to the training data in such cases. On the other hand, resubstitution tends to have small variance. In cases where the bias is not too large, this makes resubstitution a very competitive option as an error estimator. In fact, the next Section contains results that show that resubstitution can have smaller RMS than even complex error estimators such as the bootstrap, provided that sample size is large compared to number of bins. In addition, it can be shown that as the sample size increases, both the bias and variance of resubstitution vanish (see Section 6). Finally, it is important to emphasize that these observations hold for the discrete histogram rule; for example, the resubstitution estimator is not necessarily optimistically-biased for other (continuous or discrete) classification rules.
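Expression (8) amounts to counting the minority samples in each bin; a sketch with hypothetical bin counts:

```python
# Hypothetical bin counts for a designed histogram classifier
U = [6, 1, 3, 0]   # class-0 counts per bin
V = [2, 4, 3, 1]   # class-1 counts per bin
n = sum(U) + sum(V)

# Eq. (8): resubstitution counts the minority samples in each bin,
# since those are exactly the points the majority-vote classifier misses
resub = sum(min(u, v) for u, v in zip(U, V)) / n
print(resub)  # (2 + 1 + 3 + 0) / 20 = 0.3
```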

The leave-one-out error estimator ${\hat{\mathit{\epsilon}}}_{n}^{l}$ [19] removes the optimistic bias of resubstitution by counting the errors committed by *n* classifiers, each designed on *n*-1 points and tested on the remaining left-out point, and dividing the total count by *n*. A little reflection shows that

$$\hat{{\mathrm{\epsilon}}_{n}^{l}}=\frac{1}{n}\sum _{i=1}^{b}\left[{U}_{i}{I}_{{V}_{i}\ge {U}_{i}}+{V}_{i}{I}_{{U}_{i}\ge {V}_{i}-1}\right].$$

(10)

For example, in Fig. (**2**), the leave-one-out estimate for the classification error is 15/40 = 0.375, which is higher than the resubstitution estimate of 0.3. In fact, by comparing (8) and (10), one can see that ${\hat{\mathit{\epsilon}}}_{n}^{l}\ge {\hat{\mathit{\epsilon}}}_{n}^{r}$ in all cases; in particular, $E[{\hat{\mathit{\epsilon}}}_{n}^{l}]\ge E[{\hat{\mathit{\epsilon}}}_{n}^{r}]$, showing that the leave-one-out estimator is necessarily less optimistic than the resubstitution estimator. In fact, it is a general result (not restricted to discrete histogram classification) that $E[{\hat{\mathrm{\epsilon}}}_{n}^{l}]=E\left[{\mathit{\epsilon}}_{n-1}\right]$, making this estimator almost unbiased. As it turns out, this bias reduction is accomplished at the expense of an increase in variance [26]. The leave-one-out estimator is, however, nonrandomized.
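The closed form (10) spares the *n* explicit classifier redesigns. The sketch below checks it against a direct leave-one-out count, using hypothetical bin counts:

```python
def loo_closed_form(U, V):
    # Eq. (10), written directly in terms of the bin counts
    n = sum(U) + sum(V)
    return sum(u * (v >= u) + v * (u >= v - 1) for u, v in zip(U, V)) / n

def loo_direct(U, V):
    # explicit leave-one-out: remove each point, retrain, test on it
    n = sum(U) + sum(V)
    errors = 0
    for u, v in zip(U, V):
        # leaving out a class-0 point: counts (u-1, v); error iff v > u-1
        errors += u * (v > u - 1)
        # leaving out a class-1 point: counts (u, v-1); error iff u >= v-1
        errors += v * (u >= v - 1)
    return errors / n

U, V = [6, 1, 3, 0], [2, 4, 3, 1]   # hypothetical counts
print(loo_closed_form(U, V) == loo_direct(U, V))  # True
print(loo_closed_form(U, V))  # 10/20 = 0.5
```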

A randomized estimator is obtained by randomly partitioning the data into *k* *folds* of size *n/k*, counting the errors committed by *k* classifiers, each designed on the *n - n/k* points outside one of the folds and tested on the points in that fold, and dividing the total count by *n*. This yields the well-known *k*-fold *cross-validation* estimator [19-22]. The process can be repeated several times and the results averaged, in order to reduce the internal variance associated with the random choice of folds. The leave-one-out estimator is a cross-validation estimator with *k* = *n*; therefore, cross-validation is nonrandomized in this special case (it is also nonrandomized for other choices of *k* if one averages over all possible fold partitions, whose number can be very large if *n* is large or *k* is small). It is a general result (not restricted to discrete histogram classification) that the *k*-fold cross-validation estimator ${\hat{\mathit{\epsilon}}}_{n}^{\mathrm{cvk}}$ is such that $E\left[{\hat{\mathit{\epsilon}}}_{n}^{\mathit{cvk}}\right]=E\left[{\mathrm{\epsilon}}_{n-n/k}\right].$
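The fold-based procedure can be sketched as follows for the histogram rule; the data, the fold partition, and the tie-breaking convention below are all illustrative assumptions.

```python
import random
random.seed(3)

# hypothetical data: (bin, label) pairs over b = 4 bins
data = [(random.randrange(4), random.randrange(2)) for _ in range(40)]
k = 4

def train(sample):
    counts = {}
    for b, y in sample:
        c = counts.setdefault(b, [0, 0])
        c[y] += 1
    return {b: int(c[1] > c[0]) for b, c in counts.items()}  # ties -> class 0

idx = list(range(len(data)))
random.shuffle(idx)
folds = [idx[f::k] for f in range(k)]   # k folds of size n/k

errors = 0
for fold in folds:
    held = set(fold)
    rule = train([data[i] for i in idx if i not in held])  # design on n - n/k points
    errors += sum(data[i][1] != rule.get(data[i][0], 0) for i in fold)

cv_k = errors / len(data)
print(0.0 <= cv_k <= 1.0)  # True
```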

Another class of popular randomized error estimators is based on the idea of the bootstrap [23-25]. A “bootstrap sample” consists of *n* equally-likely draws with replacement from the original training data. The basic bootstrap estimator $\hat{{\mathit{\epsilon}}_{0}}$ is similar to cross-validation, in that it counts the errors committed by *B* classifiers, each designed on a bootstrap sample and tested on the training points not in that bootstrap sample, and divides the count by the total number of test points (which is variable). The number *B* of bootstrap samples must be made large to reduce the internal variance associated with bootstrap sampling (the ideal case *B* = ∞ leads to a nonrandomized estimator; in practice, this would be achieved by a very large, but finite, *B*, equal to the number of all possible draws of *n* indices with replacement from the index set 1,...,*n*). The estimator $\hat{{\mathit{\epsilon}}_{0}}$ tends to be pessimistically biased, and therefore a convex combination with resubstitution, which is optimistically biased (in the case of discrete histogram classification), was proposed in [24]:

$${\hat{\mathit{\epsilon}}}_{n}^{\mathit{b632}}=\left(1-0.632\right){\hat{\mathit{\epsilon}}}_{n}^{r}+0.632\,{\hat{\mathit{\epsilon}}}_{0}$$

(11)

This is known as the **0.632** *bootstrap* error estimator, and is quite popular in Machine Learning applications [17]. It has small variance, but can be very slow to compute. In addition, it will fail when the resubstitution estimator is too optimistic. A variant called the *0.632+ bootstrap* error estimator was introduced in [25], in an attempt to correct this problem. All cross-validation and bootstrap error estimators tend to be computationally intensive, due to the large number of classifier design steps involved and the need to reduce internal variance by averaging over a large number of iterations.
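A hedged sketch of the 0.632 bootstrap of Eq. (11) for the histogram rule; the data, the number of bootstrap samples *B* = 200, and the tie-breaking convention are all illustrative assumptions, not values from the paper.

```python
import random
random.seed(0)

# hypothetical (bin, label) data over 3 bins
data = ([(0, 0)] * 5 + [(0, 1)] * 2 + [(1, 0)] * 2 + [(1, 1)] * 6
        + [(2, 0)] * 3 + [(2, 1)] * 2)
n, B = len(data), 200

def train(sample):
    counts = {}
    for b, y in sample:
        c = counts.setdefault(b, [0, 0])
        c[y] += 1
    return {b: int(c[1] > c[0]) for b, c in counts.items()}  # ties -> class 0

# resubstitution, Eq. (8)
rule = train(data)
resub = sum(y != rule.get(b, 0) for b, y in data) / n

# zero bootstrap: test each classifier on the points left out of its sample
err_count, test_count = 0, 0
for _ in range(B):
    draw = [random.randrange(n) for _ in range(n)]
    rule_b = train([data[i] for i in draw])
    chosen = set(draw)
    left_out = [j for j in range(n) if j not in chosen]
    err_count += sum(data[j][1] != rule_b.get(data[j][0], 0) for j in left_out)
    test_count += len(left_out)
eps0 = err_count / test_count

# Eq. (11): convex combination of the two estimates
b632 = (1 - 0.632) * resub + 0.632 * eps0
print(round(resub, 3), round(eps0, 3), round(b632, 3))
```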

The fact that the joint distribution of the vector of bin counts (*U _{1}*,...,*U _{b}*, *V _{1}*,...,*V _{b}*) is multinomial makes exact analytical expressions for the performance metrics of the discrete histogram rule tractable; the main such expressions are reviewed next.

From (5) it follows that the expected error over the sample is given by

$$E\left[{\mathrm{\epsilon}}_{n}\right]=\sum _{i=1}^{b}\left[{c}_{0}{p}_{i}E\left[{I}_{{V}_{i}>{U}_{i}}\right]+{c}_{1}{q}_{i}E\left[{I}_{{U}_{i}\ge {V}_{i}}\right]\right]=\sum _{i=1}^{b}\left[{c}_{0}{p}_{i}P\left({V}_{i}>{U}_{i}\right)+{c}_{1}{q}_{i}P\left({U}_{i}\ge {V}_{i}\right)\right]={c}_{1}+\sum _{i=1}^{b}\left({c}_{0}{p}_{i}-{c}_{1}{q}_{i}\right)P\left({V}_{i}>{U}_{i}\right).$$

(12)

The computation of the probability *P*(*V _{i}* > *U _{i}*) depends on the sampling design. In the full sampling case, the triple $\left({U}_{i},{V}_{i},n-{U}_{i}-{V}_{i}\right)$ has a multinomial distribution, so that

$$P\left({V}_{i}>{U}_{i}\right)=\sum _{l>k}\left(\begin{array}{c}n\\ k,l,n-k-l\end{array}\right){\left({c}_{0}{p}_{i}\right)}^{k}{\left({c}_{1}{q}_{i}\right)}^{l}{\left(1-{c}_{0}{p}_{i}-{c}_{1}{q}_{i}\right)}^{n-k-l},$$

(13)

whereas in the stratified sampling case, *U _{i}* is independent of *V _{i}*, with binomial distributions of parameters (*n*_{0}, *p _{i}*) and (*n*_{1}, *q _{i}*), respectively, so that

$$P\left({V}_{i}>{U}_{i}\right)=\sum _{l>k}\left(\begin{array}{c}{n}_{0}\\ k\end{array}\right){p}_{i}^{k}{\left(1-{p}_{i}\right)}^{{n}_{0}-k}\left(\begin{array}{c}{n}_{1}\\ l\end{array}\right){q}_{i}^{l}{\left(1-{q}_{i}\right)}^{{n}_{1}-l}.$$

(14)
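Expressions (12) and (14) are straightforward to evaluate numerically. The sketch below assumes stratified sampling, a small hypothetical model, and *n*_{0} = *n*_{1} = 10; all parameter values are assumptions chosen for illustration.

```python
from math import comb

def binom_pmf(n, k, prob):
    return comb(n, k) * prob**k * (1 - prob)**(n - k)

def prob_V_gt_U(n0, n1, pi, qi):
    # Eq. (14): U_i ~ Binomial(n0, p_i) independent of V_i ~ Binomial(n1, q_i)
    return sum(binom_pmf(n0, k, pi) * binom_pmf(n1, l, qi)
               for k in range(n0 + 1) for l in range(n1 + 1) if l > k)

# hypothetical model parameters and class sample sizes
c0, c1 = 0.5, 0.5
p = [0.6, 0.3, 0.1]
q = [0.1, 0.3, 0.6]
n0 = n1 = 10

# Eq. (12): E[eps_n] = c1 + sum_i (c0*p_i - c1*q_i) * P(V_i > U_i)
expected_error = c1 + sum((c0 * pi - c1 * qi) * prob_V_gt_U(n0, n1, pi, qi)
                          for pi, qi in zip(p, q))
print(round(expected_error, 4))  # at least the Bayes error (0.25 for this model)
```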

To obtain the variance $\mathit{Var}\left[{\mathrm{\epsilon}}_{n}\right]=E\left[{\mathrm{\epsilon}}_{n}^{2}\right]-{\left(E\left[{\mathrm{\epsilon}}_{n}\right]\right)}^{2}$ one needs the second moment $E\left[{\mathrm{\epsilon}}_{n}^{2}\right]$:

$$E\left[{\mathrm{\epsilon}}_{n}^{2}\right]=\sum _{i=1}^{b}\left\{{c}_{0}^{2}{p}_{i}^{2}P\left({V}_{i}>{U}_{i}\right)+{c}_{1}^{2}{q}_{i}^{2}P\left({U}_{i}\ge {V}_{i}\right)\right\}+\sum _{i,j=1,\ i\ne j}^{b}\left\{{c}_{0}^{2}{p}_{i}{p}_{j}P\left({V}_{i}>{U}_{i},{V}_{j}>{U}_{j}\right)+{c}_{0}{c}_{1}\left[{p}_{i}{q}_{j}P\left({V}_{i}>{U}_{i},{U}_{j}\ge {V}_{j}\right)+{p}_{j}{q}_{i}P\left({U}_{i}\ge {V}_{i},{V}_{j}>{U}_{j}\right)\right]+{c}_{1}^{2}{q}_{i}{q}_{j}P\left({U}_{i}\ge {V}_{i},{U}_{j}\ge {V}_{j}\right)\right\}$$

(15)

This expression involves second-order bin probabilities, e.g., $P\left({V}_{i}>{U}_{i},{V}_{j}>{U}_{j}\right)$, which can be computed through multinomial sums analogous to those in (13) and (14).

However, computation of the expression in (15) becomes difficult when *n* or *b* are large. But if one can assume that the events $\left\{{V}_{i}>{U}_{i}\right\}$ and $\left\{{V}_{j}>{U}_{j}\right\}$ are approximately independent for *i* ≠ *j*, then one obtains the approximation

$$\mathit{Var}\left[{\mathrm{\epsilon}}_{n}\right]=\sum _{i=1}^{b}{\left({c}_{0}{p}_{i}+{c}_{1}{q}_{i}\right)}^{2}P\left({V}_{i}>{U}_{i}\right)\left(1-P\left({V}_{i}>{U}_{i}\right)\right)$$

(16)

It is proved in [28] that, under a mild distributional assumption, the expression in (16) is asymptotically exact as the number of bins grows to infinity, for fixed sample size.
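The approximation (16) reuses the same bin probabilities as (14); the sketch below evaluates it under a hypothetical stratified model, with all parameter values chosen purely for illustration.

```python
from math import comb

def prob_V_gt_U(n0, n1, pi, qi):
    # Eq. (14) for stratified sampling
    pmf_u = [comb(n0, k) * pi**k * (1 - pi)**(n0 - k) for k in range(n0 + 1)]
    pmf_v = [comb(n1, l) * qi**l * (1 - qi)**(n1 - l) for l in range(n1 + 1)]
    return sum(pmf_u[k] * pmf_v[l]
               for k in range(n0 + 1) for l in range(n1 + 1) if l > k)

# hypothetical model parameters
c0, c1 = 0.5, 0.5
p = [0.6, 0.3, 0.1]
q = [0.1, 0.3, 0.6]
n0 = n1 = 10

# Eq. (16): independence approximation to Var[eps_n]
var_approx = 0.0
for pi, qi in zip(p, q):
    P = prob_V_gt_U(n0, n1, pi, qi)
    var_approx += (c0 * pi + c1 * qi) ** 2 * P * (1 - P)
print(round(var_approx, 5))
```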

Fig. (**3**) illustrates the application of the formulas above in an example where stratified sampling is assumed, with *c*_{0} = *c*_{1} = 0.5 (so that, in particular, *n*_{0} = *n*_{1} = *n/*2), and bin probabilities *p _{i}* and *q _{i}* specified by a parametric Zipf model with parameter *α*.

Similar exact expressions can be derived for the expectation and variance of the resubstitution and leave-one-out error estimators, as well as their correlation with the actual error; see [29, 30]. These exact expressions allow one to compute exactly the bias, deviation variance, and RMS of both resubstitution and leave-one-out. This is illustrated in Fig. (**4**), where results for resubstitution (resub), leave-one-out (loo), 10-repeated 4-fold cross-validation (cv), and the .632 bootstrap (b632) are displayed. In this figure, “standard deviation” refers to the square root of the deviation variance. For the 0.632 error estimator, *B* = 100 bootstrap samples are employed. Performance measures for resubstitution and leave-one-out are exact; they are computed using the exact expressions mentioned previously. For the other error estimators, performance measures are derived from a Monte-Carlo computation using 20,000 samples from each probability model. The model is the Zipf parametric model mentioned previously, with *c*_{0} = *c*_{1} = 0.5.

Performance of error estimators, for *n* = 20 and *E*[*ε*_{n}] = 0.2. The values for resubstitution and leave-one-out are exact; the values for the other error estimators are approximations based on Monte-Carlo computation.

Analytical exact expressions for the correlation between actual and estimated errors can also be derived [30]. This is illustrated in Fig. (**5**), where the correlation for the resubstitution and leave-one-out error estimators is plotted versus sample size, for different bin sizes. In this example, we assume full sampling and the Zipf parametric model mentioned previously, with *c*_{0} = *c*_{1} = 0.5 and parameter *α* adjusted to yield two cases: easy (Bayes error = 0.1) and difficult (Bayes error = 0.4) classification.

Exact correlation between the actual error and the resubstitution and leave-one-out cross-validation error estimators for probability models of distinct difficulty, as determined by the Bayes error. Left plot: Bayes error = 0.10. Right plot: Bayes error = 0.40.

We can observe that the correlation is generally low (below 0.3). We can also observe that at small sample sizes, correlation for resubstitution is larger than for leave-one-out cross-validation, and, with a larger difficulty of classification, this is true even at moderate sample sizes. Correlation generally decreases with increasing bin size; in one striking case, the correlation for leave-one-out becomes negative, at the critical small-sample situation of *n* = 20 and *b* = 32. This behavior of the correlation for leave-one-out mirrors the behavior of deviation variance of this error estimator, which is known to be large under complex models and small sample sizes [13, 26, 31], and is in accord with (6).

As mentioned previously, all the performance metrics of interest for the actual error *ε*_{n} and any given error estimator $\hat{{\mathrm{\epsilon}}_{n}}$ can be derived from the joint sampling distribution of the pair of random variables $\left({\mathrm{\epsilon}}_{n},\hat{{\mathrm{\epsilon}}_{n}}\right).$ These include conditional metrics, such as the conditional expected actual error given the estimated error and confidence bounds on the actual error conditioned on the estimated error, which are very difficult to study analytically in the general case.

However, due to the finiteness of the discrete problem, it turns out that the joint sampling distribution of actual and estimated errors in the discrete case can be computed exactly by means of *complete enumeration*. Such methods have been extensively used in categorical data analysis [16, 32-35]; complete enumeration has been particularly useful in the computation of exact distributions and critical regions for tests based on contingency tables, as in the case of the well-known Fisher exact test and the approximate chi-square test [32, 33].

Basically, complete enumeration relies on intensive computational power to list all possible configurations of the discrete data together with their probabilities, from which exact statistical properties of the methods of interest are obtained. The availability of efficient enumeration algorithms on fast computers has made possible the use of complete enumeration in an increasingly wider variety of settings.

In the case of discrete classification, recall that the random sample is specified by the vector of bin counts *W*_{n} = (*U*_{1},...,*U*_{b},*V*_{1},...,*V*_{b}). Since both the actual and estimated errors are functions of *W*_{n}, their joint sampling distribution can be obtained by complete enumeration:

$$P\left({\mathrm{\epsilon}}_{n}=k,\hat{{\mathrm{\epsilon}}_{n}}=l\right)=\sum _{{w}_{n}\in {D}_{n}:{\mathrm{\epsilon}}_{n}\left({w}_{n}\right)=k,\hat{{\mathrm{\epsilon}}_{n}}\left({w}_{n}\right)=l}P\left({W}_{n}={w}_{n}\right)$$

(17)

where *P*(*W*_{n} = *w*_{n}) is the probability of observing the bin-count configuration *w*_{n}, and *D*_{n} denotes the (finite) space of all such configurations. Under full sampling, this probability is given by the multinomial distribution

$$P\left({W}_{n}={w}_{n}\right)=\left(\begin{array}{c}n\\ {u}_{1},...,{u}_{b},{v}_{1},...,{v}_{b}\end{array}\right){c}_{0}^{\sum _{i}{u}_{i}}{c}_{1}^{\sum _{i}{v}_{i}}\underset{i=1}{\overset{b}{\Pi}}{p}_{i}^{{u}_{i}}{q}_{i}^{{v}_{i}}$$

(18)

Even though the configuration space *D*_{n} is finite, it quickly becomes huge with increasing sample size *n* and number of bins *b*, so that complete enumeration is computationally feasible only for small *n* and *b*.
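As a concrete illustration of complete enumeration, the sketch below (stdlib Python, full sampling; the toy model is illustrative) lists every bin-count configuration of a small problem and accumulates the exact joint distribution of the actual and resubstitution errors:

```python
from math import comb, prod

def compositions(n, k):
    """All k-tuples of nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_prob(counts, probs):
    """Multinomial probability of a bin-count configuration, as in (18)."""
    coef, rem = 1, sum(counts)
    for c in counts:
        coef *= comb(rem, c)
        rem -= c
    return coef * prod(pr**c for pr, c in zip(probs, counts))

def joint_error_distribution(n, c0, p, q):
    """Exact joint distribution of (actual, resubstitution) errors under
    full sampling, by complete enumeration of the configuration space D_n."""
    b, c1 = len(p), 1.0 - c0
    cell_probs = [c0 * pi for pi in p] + [c1 * qi for qi in q]
    joint = {}
    for w in compositions(n, 2 * b):
        u, v = w[:b], w[b:]
        # histogram rule: bin i is labeled 1 iff v[i] > u[i] (ties -> class 0)
        actual = sum(c0 * p[i] if v[i] > u[i] else c1 * q[i] for i in range(b))
        resub = sum(min(u[i], v[i]) for i in range(b)) / n
        key = (round(actual, 12), round(resub, 12))
        joint[key] = joint.get(key, 0.0) + multinomial_prob(w, cell_probs)
    return joint
```

Even for this toy setting with *n* = 4 and *b* = 2 there are already 35 configurations; the count grows combinatorially with *n* and *b*, which is why complete enumeration is restricted to small cases.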

Exact joint distribution between the actual error and the resubstitution and leave-one-out cross-validation error estimators, for *n* = 20 and *b* = 8 , and a Zipf probability model of intermediate difficulty (Bayes error = 0.2).

This approach can be easily modified to compute the conditional sampling distribution $P\left({\mathrm{\epsilon}}_{n}=k\left|\hat{{\mathrm{\epsilon}}_{n}}=l\right.\right).$ This was done in [14] in order to find exact conditional metrics of performance for resubstitution and leave-one-out error estimators. Those included the conditional expectation $E\left({\mathrm{\epsilon}}_{n}\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right)$ and conditional variance $\text{V}\mathit{ar}\left({\mathrm{\epsilon}}_{n}\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right)$ of the actual error given the estimated error, as well as the 100(1-*α*)% upper confidence bound *γ*_{α}, such that $P\left({\mathrm{\epsilon}}_{n}<{\mathrm{\gamma}}_{\mathrm{\alpha}}\left|\hat{{\mathrm{\epsilon}}_{n}}\right.\right)=1-\mathrm{\alpha}.$

This is illustrated in Fig. (**7**), where the aforementioned conditional metrics of performance for resubstitution and leave-one-out are plotted versus the conditioning estimated error, for different bin sizes. In this example, we assume stratified sampling and the Zipf parametric model mentioned previously, with *c*_{0} = *c*_{1} = 0.5 and parameter *α* adjusted to yield $E\left[{\mathrm{\epsilon}}_{n}\right]\approx 0.25,$ which corresponds to intermediate classification difficulty. Sample size is fixed at *n* = 20 to emphasize small-sample effects, and two bin sizes are considered, *b* = 4 and *b* = 8. The curves for the conditional expectation rise with the estimated error; they also exhibit the property that the conditional expected actual error is larger than the estimated error for small estimated errors and smaller than the estimated error for large estimated errors. A point to be noted is the flatness of the leave-one-out curves. This reflects the high variance of the leave-one-out estimator. Note that the 95% upper confidence bounds are nondecreasing with respect to increasing estimated error, as expected. The flat spots observed in the bounds result from the discreteness of the estimation rule (this phenomenon is more pronounced when the number of bins is smaller).

Note that the model parameters *p*_{i} and *q*_{i}, as well as *c*_{0}, must be fully specified in order to carry out the exact performance analyses described above. An alternative is to study performance averaged over a space of possible models, as discussed next.

In [4], G.F. Hughes provided exact expressions that allow the computation of the average Bayes error $\overline{{\mathrm{\epsilon}}^{*}}\left(b,{c}_{0}\right)$ and the average expected actual error $\overline{E\left[\mathrm{\epsilon}\right]}\left(n,b,{c}_{0}\right)$ for the discrete histogram rule, both averaged over the space of all models with fixed *c*_{0}, by assuming that all models in this space are equally likely to occur. This provides a distribution-free analysis of performance, some of the qualitative features of which are still valid in particular distributional settings. For example, one of the famous conclusions derived in [4] is that, with *n* and *c*_{0} fixed, the curve of the expected actual *accuracy* $1-\overline{E\left[\mathrm{\epsilon}\right]}\left(n,b,{c}_{0}\right)$ as a function of the number of bins *b* *peaks* around an optimal value *b*^{*}, which increases with increasing sample size *n*. Even though this result was derived in terms of the average accuracy over the model space, and for the discrete histogram rule, this “peaking phenomenon” is in fact observed for the majority of individual distributions, and indeed for the majority of classification rules, both discrete and continuous [36].

Using the expressions in Hughes' paper, we plotted the average Bayes accuracy $1-\overline{{\mathrm{\epsilon}}^{*}}\left(b,{c}_{0}\right)$ and the average expected actual accuracy $1-\overline{E\left[\mathrm{\epsilon}\right]}\left(n,b,{c}_{0}\right)$, both as a function of *b*, for various values of *n*, assuming the balanced case *c*_{0} = 0.5; the results are displayed in Fig. (**8**). The curves are plotted as a function of *p* = log_{2}*b*. This corresponds to the case where *p* binary predictors are used in the original feature space; for example, the point *p* = 5 in the plot corresponds to 5 binary features, with *b* = 2^{5} = 32. One can easily observe the “peaking phenomenon” in this plot. The optimal number of features moves to the right with increasing sample size *n*, and, regardless of the value of *n*, accuracy tends to the no-information value of 0.5 as the number of predictors increases. Sample size computations can be performed based on the curves of Fig. (**8**); for example, if one has *p* = 3 binary predictors, so that *b* = 8, then sample size should be equal to *n* = 60 at a minimum, according to this analysis. The expressions for these curves are quite complicated and computationally intensive for large *n*; however, for small *n*, the expressions become quite simple. For example, with *n* = 2,

Average Bayes accuracy and average expected actual accuracy plotted as a function of number of binary predictors *p* = log_{2} (*b*) .

$1-\overline{E\left[\mathrm{\epsilon}\right]}\left(2,b,0.5\right)=\frac{1}{2}+\frac{b-1}{2b\left(b+1\right)}$

so that the accuracy margin over the no-information value of 0.5 vanishes as 1/*b*. This implies that the decrease is exponential in *p* = log_{2}(*b*), as can be gleaned from Fig. (**8**).

Note that peaking ceases to occur as *n* → ∞, which corresponds to the Bayes accuracy (see the next Section). This must be the case, since the Bayes accuracy is known to be nondecreasing in the number of features. The expression for the average Bayes accuracy in the case *c*_{0} = 0.5 is simple; as shown in [4], this is given by

$1-\overline{{\mathrm{\epsilon}}^{*}}\left(b,0.5\right)=\frac{3}{4}-\frac{1}{4\left(2b-1\right)}$

with an asymptotic value (as *b* → ∞) of 0.75 (it is shown in [4] that, for general *c*_{0}, this asymptotic value is equal to 1-*c*_{0}(1-*c*_{0})). This relatively small value highlights the conservative character of Hughes' distribution-free approach; for example, in practice, where one deals with a fixed distribution of the data, the optimal number of features would typically be larger than those observed in Fig. (**8**), so that sample size recommendations based on this analysis tend to be pessimistic --- a fact that was pointed out in [37]. Nevertheless, the qualitative behavior of the analysis is entirely correct. Finally, we remark that Fig. (**8**) closely matches Fig. 3 in [4], but larger values of *b* are shown here, which possibly were difficult to compute in 1968.
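The 0.75 asymptote can be checked numerically. The sketch below is an illustrative reading of the equally-likely-models assumption, with the two class-conditional bin-probability vectors drawn independently and uniformly over the simplex; it estimates the average Bayes accuracy by Monte Carlo:

```python
import random

def random_simplex(b, rng):
    """Uniform (flat Dirichlet) point on the b-simplex via normalized exponentials."""
    x = [rng.expovariate(1.0) for _ in range(b)]
    s = sum(x)
    return [xi / s for xi in x]

def average_bayes_accuracy(b, n_models=2000, c0=0.5, seed=0):
    """Monte-Carlo estimate of the average Bayes accuracy over random models,
    assuming class-conditional bin probabilities uniform over the simplex."""
    rng = random.Random(seed)
    c1 = 1.0 - c0
    acc = 0.0
    for _ in range(n_models):
        p = random_simplex(b, rng)
        q = random_simplex(b, rng)
        acc += 1.0 - sum(min(c0 * pi, c1 * qi) for pi, qi in zip(p, q))
    return acc / n_models
```

For *c*_{0} = 0.5 the estimate climbs toward 0.75 as *b* grows, in agreement with the asymptotic value stated above.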

In [38], Braga-Neto and Dougherty carried out an analysis of the performance of ensemble classification methods [39, 40] when applied to the discrete histogram rule, which provided evidence that such ensemble methods may be largely ineffective in discrete classification. Part of the analysis is similar to the work of Hughes discussed in the previous subsection, in the sense that it examines the average performance over the model space, assuming equally-likely models. Two methods were considered, namely, the *jackknife* and *bagging* ensemble classification rules obtained from the discrete histogram rule. Briefly, ensemble methods are based on perturbing the training data, designing an ensemble of classifiers based on the perturbed data sets using a given base classification rule (in this case, the discrete histogram rule), and aggregating the individual decisions to obtain the final classifier. Data perturbation is often accomplished by resampling methods such as the jackknife [41] and the bootstrap [23] --- the latter case being known as “bagging” [40] --- whereas aggregation is done by means of majority voting among the individual classifier decisions. For the jackknife majority-vote classification rule, it was shown in [38] that, under full sampling and equally-likely classes, the best gain in performance (i.e., decrease in expected classification error) over all models in the model space with fixed *c*_{0} is smaller than the worst deficit (i.e., increase in expected classification error). Any discrepancy in performance however disappears as sample size increases; in particular, the following bound is shown to hold:

$$\left|E\left[{\mathrm{\epsilon}}_{n}^{J}\right]-E\left[{\mathrm{\epsilon}}_{n}\right]\right|\le \frac{1}{\mathit{ne}}\sqrt{\frac{2}{\mathrm{\pi}\left(n+1\right)}}$$

(19)

where $E\left[{\mathrm{\epsilon}}_{n}^{J}\right]$ and $E\left[{\mathrm{\epsilon}}_{n}\right]$ are the expected classification errors of the jackknife and base classification rules, respectively. In addition, an exact expression is given for the average difference $\overline{E\left[{\mathrm{\epsilon}}_{n}^{J}\right]}-\overline{E\left[{\mathrm{\epsilon}}_{n}\right]}$ over the model space with fixed *c*_{0}, assuming equally-likely distributions as in the work of Hughes. In the case of equally-likely classes (*c*_{0} = 0.5), the result simplifies to show that the average difference is positive, that is, there is an average deficit (which in fact is shown to still hold if the classes are only approximately equally likely, in a precise sense). The left plot in Fig. (**9**) displays these quantities plotted as a function of sample size, for *p* = 2 (*b* = 4 discrete cells), and for the balanced case, *c*_{0} = 0.5. We can observe in the plot that the best gain (inf) is smaller than the worst deficit (sup) and that there is an average deficit (positive average deviation). The values of inf and sup are actually independent of *b*.

Performance of ensemble discrete classification rules, for *p* = 2. Left plot: bounds and average difference between expected errors of the jackknife and discrete histogram classification rules, as a function of sample size, for *c*_{0} = 0.5. Right plot: expected error of the bagging classification rule as a function of the number of classifiers in the ensemble.

Regarding the bagging case, it is shown in [38] that, given the training data, and for any sample size, number of cells, or distribution of the data, the random bagging classifier converges to the original discrete histogram classifier with probability 1 as the number of classifiers in the ensemble *m* increases; furthermore, the following exponential bound holds for the absolute difference $\left|{\mathrm{\epsilon}}_{n,m}^{B}-{\mathrm{\epsilon}}_{n}\right|$ between the generalization errors of the bagging and the base classifiers:

$$\left|{\mathrm{\epsilon}}_{n,m}^{B}-{\mathrm{\epsilon}}_{n}\right|\le {e}^{-2m{c}^{2}}$$

(20)

where the constant *c* > 0 does not depend on *m*, but depends in a simple way on the distribution of the data. The difference therefore converges exponentially fast to zero as the number of classifiers in the ensemble increases (for fixed *n*). From this it follows that the difference between expected errors over all training data also converges exponentially fast to zero (the constant *c* is larger, guaranteeing faster convergence, if the classes are more separated, in a precise sense). The right plot in Fig. (**9**) displays the expected error for the bagged discrete histogram classification rule as a function of number of classifiers in the ensemble, for model parameters derived from an actual data set, corresponding to *p* = 2 binary features picked from the SPECT Heart data set of the UCI Machine Learning Repository. The expected classification error for the bagging classifier is found by means of a Monte-Carlo computation using 100,000 simulated training sets, assuming full sampling. The Monte-Carlo computation introduces the wobble visible in the plots (even at this very large number of simulated training sets). Also indicated are the exact expected errors of the base discrete histogram classification rule, by means of dashed horizontal lines. We can see that in all cases bagging leads to a larger expected classification error than the base classification rule, although the deviation quickly converges to zero in each case, in agreement with equation (20) above.
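A minimal sketch of bagging applied to the discrete histogram rule (stdlib Python; the bin counts and ensemble size below are illustrative) shows the majority-vote decisions settling on those of the base classifier as *m* grows, consistent with the convergence result above:

```python
import random

def histogram_classifier(u, v):
    """Base discrete histogram rule: class 1 on bin i iff v[i] > u[i] (ties -> 0)."""
    return [1 if vi > ui else 0 for ui, vi in zip(u, v)]

def bagged_classifier(u, v, m, seed=0):
    """Majority vote over m histogram classifiers trained on bootstrap resamples."""
    rng = random.Random(seed)
    b, n = len(u), sum(u) + sum(v)
    # empirical distribution of the training data as (bin, label) pairs
    pool = [(i, 0) for i in range(b) for _ in range(u[i])] + \
           [(i, 1) for i in range(b) for _ in range(v[i])]
    votes = [0] * b
    for _ in range(m):
        us, vs = [0] * b, [0] * b
        for _ in range(n):                    # bootstrap resample of size n
            i, y = rng.choice(pool)
            (vs if y else us)[i] += 1
        for i in range(b):
            votes[i] += 1 if vs[i] > us[i] else 0
    return [1 if 2 * votes[i] > b * 0 + m else 0 for i in range(b)]
```

With clear margins between the class counts in each bin, even a few hundred ensemble members reproduce the base classifier's bin labels exactly.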

Large-sample analysis of performance has to do with the behavior of the classification error and error estimators as sample size increases without bound, i.e., as *n* → ∞. From a practical perspective, one expects performance to improve, and eventually reach an optimum, as more time and cost is devoted to obtaining an increasingly large number of samples. It turns out that not only is this true for the discrete histogram rule, but it is also possible in several cases to obtain fast (exponential) rates of convergence. Critical results in this area are due to Cochran and Hopkins [1], Glick [6, 42], and Devroye, Gyorfi and Lugosi [13]. We briefly review these results in this Section.

Recall the bin counts *U*_{i} and *V*_{i} for class 0 and class 1 over bin *i*, respectively. By the Strong Law of Large Numbers (SLLN), *U*_{i}/*n* → *c*_{0}*p*_{i} and *V*_{i}/*n* → *c*_{1}*q*_{i} with probability 1 as *n* → ∞, from which it follows that

$$\underset{n\to \mathrm{\infty}}{\mathrm{lim}}{\mathrm{\psi}}_{n}\left(X=i\right)=\underset{n\to \mathrm{\infty}}{\mathrm{lim}}{I}_{{U}_{i}<{V}_{i}}={I}_{{c}_{0}{p}_{i}<{c}_{1}{q}_{i}}={\mathrm{\psi}}^{*}\left(X=i\right)$$

(21)

with probability 1; that is, the discrete histogram classifier designed from sample data converges to the optimal classifier over each bin. This is a distribution-free result: it holds regardless of the joint distribution of predictors *X* and target *Y*, as the SLLN itself is distribution-free. One says then that the discrete histogram rule is *universally strongly consistent* [13].
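Universal consistency can be observed directly in simulation. The sketch below (full sampling; the four-bin model, with Bayes error 0.3, is illustrative) estimates the expected error of the designed histogram classifier for increasing *n*:

```python
import random

def designed_error(n, c0, p, q, rng):
    """Exact error of a histogram classifier designed on one random sample."""
    b = len(p)
    c1 = 1.0 - c0
    u, v = [0] * b, [0] * b
    bins = list(range(b))
    for _ in range(n):                         # full (unstratified) sampling
        if rng.random() < c0:
            u[rng.choices(bins, weights=p)[0]] += 1
        else:
            v[rng.choices(bins, weights=q)[0]] += 1
    # bin i labeled 1 iff v[i] > u[i]; error computed exactly from the model
    return sum(c0 * p[i] if v[i] > u[i] else c1 * q[i] for i in range(b))

def mean_error(n, trials, c0, p, q, seed=0):
    """Monte-Carlo estimate of the expected error E[eps_n]."""
    rng = random.Random(seed)
    return sum(designed_error(n, c0, p, q, rng) for _ in range(trials)) / trials
```

As *n* grows, the estimated expected error decreases toward the Bayes error of the model, while each individual designed classifier's error always stays at or above it.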

The exact same argument, in connection with eqs. (2), (5) and (8), shows that

$$\underset{n\to \mathrm{\infty}}{\mathrm{lim}}{\mathrm{\epsilon}}_{n}=\underset{n\to \mathrm{\infty}}{\mathrm{lim}}{\hat{\mathit{\epsilon}}}_{n}^{r}={\mathrm{\epsilon}}^{*}$$

(22)

so that the classification error, and also the apparent error, converge to the optimal Bayes error as sample size increases. From the previous equation it also follows that

$$\underset{n\to \mathrm{\infty}}{\mathrm{lim}}E\left[{\mathrm{\epsilon}}_{n}\right]=\underset{n\to \mathrm{\infty}}{\mathrm{lim}}E\left[{\hat{\mathit{\epsilon}}}_{n}^{r}\right]={\mathrm{\epsilon}}^{*}$$

(23)

In particular, ${\mathrm{lim}}_{n\to \mathrm{\infty}}E\left[{\hat{\mathit{\epsilon}}}_{n}^{r}-{\mathit{\epsilon}}_{n}\right]=0$ and the bias of resubstitution vanishes with increasing sample size. Recalling (9), one always has $E\left[{\hat{\mathrm{\epsilon}}}_{n}^{r}\right]\le {\mathrm{\epsilon}}^{*}$, so that (23) in fact implies that $E\left[{\mathrm{\epsilon}}_{n}\right]\downarrow {\mathrm{\epsilon}}^{*}$, while $E\left[{\hat{\mathrm{\epsilon}}}_{n}^{r}\right]\uparrow {\mathrm{\epsilon}}^{*}$, as *n* → ∞.

These results are all based on the SLLN (and are thus distribution-free). The question arises as to the speed with which the limits are attained, as the SLLN can yield notoriously slow rates of convergence. This is not only a theoretical question, as the usefulness in practice of such results may depend on how large a sample size needs to be to guarantee that the discrete classifier or error estimator is close enough to optimality. The answer is that exponential rates of convergence can be obtained, if one is willing to drop the distribution-free requirement. Otherwise, polynomial rates of convergence can be established. These results are briefly reviewed below.

Regarding the discrete histogram rule, with the proviso that ties in bin counts are broken randomly (with equal probability), it is shown in [6, Theorem A] that the following exponential bound on the convergence of $E\left[{\mathrm{\epsilon}}_{n}\right]$ to *ε*^{*} applies

$$E\left[{\mathrm{\epsilon}}_{n}\right]-{\mathrm{\epsilon}}^{*}\le {e}^{-\mathit{cn}}$$

(24)

where the constant *c* > 0 is distribution-dependent:

$c=\mathrm{log}\frac{1}{1-\underset{\left\{i:{c}_{0}{p}_{i}\ne {c}_{1}{q}_{i}\right\}}{\mathrm{min}}{\left|\sqrt{{c}_{0}{p}_{i}}-\sqrt{{c}_{1}{q}_{i}}\right|}^{2}}$

Interestingly, the number of bins does not figure in this bound. The speed of convergence of the bound is determined by the minimum (nonzero) difference between the probabilities *c*_{0}*p*_{i} and *c*_{1}*q*_{i} over the bins, as measured by the squared difference of their square roots: the more separated the classes are, the larger *c* is, and the faster the convergence.
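The constant *c* is straightforward to compute for a given model. The helper below (stdlib Python; the models used in testing are illustrative) also illustrates that greater class separation yields a larger *c* and hence a faster exponential bound:

```python
from math import log, sqrt

def glick_constant(c0, p, q):
    """Distribution-dependent constant c in Glick's exponential bound,
    c = log(1 / (1 - min_i (sqrt(c0*p_i) - sqrt(c1*q_i))^2)),
    the minimum taken over bins with c0*p_i != c1*q_i."""
    c1 = 1.0 - c0
    diffs = [(sqrt(c0 * pi) - sqrt(c1 * qi)) ** 2
             for pi, qi in zip(p, q)
             if abs(c0 * pi - c1 * qi) > 1e-15]
    return log(1.0 / (1.0 - min(diffs)))
```

For instance, widely separated class-conditional distributions produce a much larger *c* than nearly overlapping ones.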

On the other hand, a distribution-free bound is provided by [13, Theorem 27.1]:

$$E\left[{\mathrm{\epsilon}}_{n}\right]-{\mathrm{\epsilon}}^{*}\le \sqrt{\frac{b}{n}}$$

(25)

This polynomial *O*(*n*^{-1/2}) bound is inferior to the exponential bound in (24), but it does guarantee a fixed rate of convergence that is independent of the distribution.

Regarding convergence of $E\left[{\hat{\mathrm{\epsilon}}}_{n}^{r}\right]$ to *ε*^{*}, and again assuming random tie-breaking over cells, it is shown in [6, Theorem B] that the following exponential bound applies

$${\mathrm{\epsilon}}^{*}-E\left[{\hat{\mathrm{\epsilon}}}_{n}^{r}\right]\le {e}^{-\mathit{cn}}$$

(26)

*provided that* there is no cell over which *c*_{0}*p*_{i} = *c*_{1}*q*_{i}, so that the constant *c* defined above is strictly positive.

The previous results on the discrete histogram rule concern expectation and bias. In [13], (distribution-free) results on variance and RMS are also given, both for resubstitution and leave-one-out (here, the convention we have adopted of breaking ties in the direction of class 0 is again in effect). For the resubstitution error estimator, one has the following bounds [13, Theorem 23.3]:

$$\mathrm{Var}\left[{\hat{\mathrm{\epsilon}}}_{n}^{r}\right]\le \frac{1}{n}$$

(27)

and

$$\mathrm{R}\mathit{MS}\left({\hat{\mathrm{\epsilon}}}_{n}^{r}\right)\le \sqrt{\frac{6b}{n}}$$

(28)

In particular, both quantities converge to zero as sample size increases. For the leave-one-out error estimator, one has the following bound [13, Theorem 24.7]:

$$\mathrm{R}\mathit{MS}\left({\hat{\mathit{\epsilon}}}_{n}^{l}\right)\le \sqrt{\frac{1+6/e}{n}+\frac{6}{\sqrt{\mathrm{\pi}\left(n-1\right)}}}$$

(29)

This guarantees, in particular, convergence to zero as sample size increases.

An important factor in the comparison of the resubstitution and leave-one-out error estimators for discrete histogram classification resides in the different speeds of convergence of the RMS to zero. Convergence of the RMS bound for the resubstitution estimator is *O*(*n*^{-1/2}) (for fixed *b*), whereas convergence of the RMS bound for the leave-one-out estimator is *O*(*n*^{-1/4}), thus slower. Now, as remarked in [13, p.419], it can be shown that for some distributions there is also a *lower bound* of kind *O*(*n*^{-1/4}) on the RMS of leave-one-out. Therefore, in the worst case, the RMS of leave-one-out is guaranteed to decrease to zero only as fast as *n*^{-1/4}, and is therefore certain to decrease more slowly than the RMS of resubstitution. Note that the poor RMS of leave-one-out is due almost entirely to its large variance, typical of the cross-validation approach, since this estimator is essentially unbiased.
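These contrasting behaviors are easy to reproduce by simulation. The sketch below (full sampling, a Zipf-like 8-bin model; all parameters illustrative) computes the actual, resubstitution, and leave-one-out errors over repeated training samples and reports the biases and RMS values of the two estimators:

```python
import random
from math import sqrt

def sample_counts(n, c0, p, q, rng):
    """Bin counts (u, v) from one random training sample (full sampling)."""
    b = len(p)
    u, v = [0] * b, [0] * b
    bins = list(range(b))
    for _ in range(n):
        if rng.random() < c0:
            u[rng.choices(bins, weights=p)[0]] += 1
        else:
            v[rng.choices(bins, weights=q)[0]] += 1
    return u, v

def error_triple(u, v, c0, p, q):
    """(actual, resubstitution, leave-one-out) errors for one training sample."""
    n, c1 = sum(u) + sum(v), 1.0 - c0
    actual = sum(c0 * p[i] if v[i] > u[i] else c1 * q[i] for i in range(len(p)))
    resub = sum(min(ui, vi) for ui, vi in zip(u, v)) / n
    # leave-one-out: a deleted class-0 point in bin i is misclassified iff
    # v_i >= u_i; a deleted class-1 point is misclassified iff v_i <= u_i + 1
    loo = sum(ui * (vi >= ui) + vi * (vi <= ui + 1) for ui, vi in zip(u, v)) / n
    return actual, resub, loo

def compare_estimators(n, trials, c0, p, q, seed=0):
    """Monte-Carlo biases and RMS values of resubstitution and leave-one-out."""
    rng = random.Random(seed)
    sa = sr = sl = sq_r = sq_l = 0.0
    for _ in range(trials):
        u, v = sample_counts(n, c0, p, q, rng)
        a, r, l = error_triple(u, v, c0, p, q)
        sa += a; sr += r; sl += l
        sq_r += (r - a) ** 2; sq_l += (l - a) ** 2
    t = float(trials)
    return (sr - sa) / t, (sl - sa) / t, sqrt(sq_r / t), sqrt(sq_l / t)
```

At small *n* and moderate *b*, resubstitution shows a large optimistic bias, while leave-one-out is nearly unbiased but pays for it with a large deviation variance.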

In classical regression analysis, the *coefficient of determination* (CoD) gives the relative decrease in unexplained variability when entering a variable *X* into the regression of the dependent variable *Y*, in comparison with the total unexplained variability when entering no variables:

$$\mathit{CoD}\left(X,Y\right)=\frac{{\mathit{SS}}_{Y}-{\mathit{SS}}_{\mathit{XY}}}{{\mathit{SS}}_{Y}}$$

(30)

where *SS*_{Y} and *SS*_{XY} are the sums of squared errors associated with entering no variables and entering variable *X* to predict *Y*, respectively. The term *SS*_{Y} is proportional to the total variance ${\mathrm{\sigma}}_{Y}^{2}$, which is the error around the mean *µ*_{Y} (so that entering no variables in the regression corresponds to using the mean as the predictor).
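For concreteness, the sketch below computes the classical CoD in (30) for simple linear regression, using the standard closed-form least-squares fit (pure Python; the data in the test are illustrative):

```python
def cod_regression(x, y):
    """Classical CoD (30) for simple linear regression of y on x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    ss_y = sum((yi - ybar) ** 2 for yi in y)          # SSE with no predictors
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    beta = sxy / sxx                                  # least-squares slope
    alpha = ybar - beta * xbar                        # least-squares intercept
    ss_xy = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    return (ss_y - ss_xy) / ss_y
```

For simple linear regression this quantity coincides with the squared Pearson correlation between *X* and *Y*, the familiar R².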

In classification, a very similar concept was introduced in [10]:

$$\mathit{CoD}\left(X,Y\right)=\frac{{\mathrm{\epsilon}}_{Y}^{*}-{\mathrm{\epsilon}}_{\mathit{XY}}^{*}}{{\mathrm{\epsilon}}_{Y}^{*}}$$

(31)

where ${\mathrm{\epsilon}}_{Y}^{*}$ is the Bayes error in the absence of any features, and ${\mathrm{\epsilon}}_{\mathit{XY}}^{*}$ is the Bayes error when using feature vector *X* to predict *Y*. By convention, one assumes 0/0 = 1 in the above definition. This *binary coefficient of determination* measures the relative decrease in the prediction error of a target variable when using predictor variables, relative to using no predictor variables; notice the remarkable similarity between (30) and (31).
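A minimal sketch of the binary CoD in (31) for a discrete predictor follows (the encoding of the joint distribution as a dictionary is an illustrative choice):

```python
def binary_cod(joint, c0):
    """Binary CoD (31) for a discrete predictor X.

    joint maps each value x of X to the pair (P(X=x, Y=0), P(X=x, Y=1));
    c0 = P(Y=0) is the class-0 prior."""
    eps_y = min(c0, 1.0 - c0)                 # Bayes error with no features
    eps_xy = sum(min(p0, p1) for p0, p1 in joint.values())
    return 1.0 if eps_y == 0 else (eps_y - eps_xy) / eps_y
```

A perfectly informative predictor yields CoD = 1, while a predictor independent of the target yields CoD = 0.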

The binary CoD was perhaps the first predictive paradigm utilized in the context of microarray data, the goal being to provide a measure of nonlinear interaction among genes [10]. Even though the binary CoD, as defined in (31), has general application in classification, it has been extensively used in the case of discrete classification and prediction, particularly in problems dealing with gene expression quantized into discrete levels [8, 44] --- see the examples given in Section 2 --- and in the inference of gene regulatory networks [11, 12]. Like its classical counterpart, the binary CoD is a goodness-of-fit statistic that can be used to assess the relationship between predictor and target variables (e.g., how tight the association between a set of predictor genes and a target gene is).

Even though the definition above employs Bayes errors, the CoD can be likewise defined in terms of the classification error of predictors designed from sample data, using for example the discrete histogram rule. In addition, the actual classification errors will typically need to be computed through error estimation techniques; e.g., one may speak of resubstitution and leave-one-out CoD estimates. All the issues discussed in previous sections regarding classification and error estimation for discrete data generally apply here.

A recent paper [45] defined and studied the concept of *intrinsically multivariate predictive* (IMP) genes using the binary CoD. Briefly, IMP genes are those the expression of which cannot be predicted well by any subset of binary predicting gene expressions, but is predicted very well by the entire set. In [45], the properties of IMP genes were characterized analytically, and it was shown that high-predictive power, small covariance among predictors, a large entropy of the joint probability distribution of predictors, and certain logics, such as XOR in the 2-predictor case, are factors that favor the appearance of IMP. In addition, quantized gene-expression microarray data were employed to show that the gene DUSP1, which exhibits control over a central, process-integrating signaling pathway, exhibits IMP behavior, thereby providing preliminary evidence that IMP can be used as a criterion for discovery of *canalizing* genes, i.e., master genes that constrain (“canalize”) large gene-expression pathways [46].
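The role of XOR logic can be made concrete. In the noisy XOR model sketched below (the noise level and the dictionary encoding are illustrative), each individual predictor has zero CoD, while the predictor pair attains a large CoD:

```python
def bayes_error(joint):
    """joint maps a feature value to the pair (P(x, Y=0), P(x, Y=1))."""
    return sum(min(p0, p1) for p0, p1 in joint.values())

def noisy_xor_model(eps):
    """Y = X1 xor X2, flipped with probability eps; X1, X2 fair independent bits."""
    pair = {}
    for x1 in (0, 1):
        for x2 in (0, 1):
            y = x1 ^ x2
            p1 = 0.25 * ((1.0 - eps) if y == 1 else eps)   # P(x1, x2, Y=1)
            pair[(x1, x2)] = (0.25 - p1, p1)
    return pair

def marginal(pair, axis):
    """Collapse the two-predictor model onto a single predictor."""
    single = {}
    for (x1, x2), (p0, p1) in pair.items():
        x = (x1, x2)[axis]
        a, b = single.get(x, (0.0, 0.0))
        single[x] = (a + p0, b + p1)
    return single

def cod_from(joint, eps_y=0.5):
    """Binary CoD; Y is balanced in this model, so eps_Y* = 0.5."""
    return (eps_y - bayes_error(joint)) / eps_y
```

With flip probability 0.1, the pair achieves CoD = 0.8 while either predictor alone achieves CoD = 0, which is precisely the intrinsically multivariate behavior described above.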

The importance of discrete classification in Genomics lies in its broad application in problems of phenotype classification based on panels of gene-expression biomarkers and inference of gene regulatory networks from gene-expression data, where data discretization is often employed for reasons of data efficiency and classification accuracy. This paper presented a broad review of methods of classification and error estimation for discrete data, focusing for the most part on the discrete histogram rule, which is the classification rule most employed in practice for discrete data, due to its excellent properties, such as low complexity, small data requirements (when the number of cells is small), and universal consistency. The most important criterion of performance is the classification error, which can be computed exactly only if the underlying distribution of the data is known. In practice, robust error estimation methods must be employed to obtain reliable estimates of the classification error based on available sample data. This paper reviewed analytical and empirical results concerning the performance of discrete classifiers (in terms of the classification error) as well as of error estimators for discrete classification. Those results were categorized into small-sample results --- small-sample data being prevalent in Genomics applications --- and large-sample (i.e., asymptotic) results. The binary Coefficient of Determination was also reviewed briefly; it provides a measure of nonlinear interaction among genes and is therefore very useful in the inference of gene regulatory networks. Progress in classification and error estimation for discrete data, particularly the analysis of performance in small-sample cases, has a clear potential to lead to genuine advances in Genomics and Medicine, and therefore the study of such methods is a topic of considerable research interest at present.

This work was supported by the National Science Foundation, through NSF award CCF-0845407.

1. Cochran W, Hopkins C. Some classification problems with multivariate qualitative data. Biometrics. 1961;17:10–32.

2. Hills M. Allocation rules and their error rates. J. R. Stat. Soc. Series B Methodol. 1966;28:1–31.

3. Hills M. Discrimination and allocation with discrete data. Appl. Stat. 1967;16:237–250.

4. Hughes G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory. 1968;IT-14:55–63.

5. Hughes G. Number of pattern classifier design samples per class. IEEE Trans. Inf. Theory. 1969;IT-15:615–618.

6. Glick N. Sample-based multinomial classification. Biometrics. 1973;29:241–256.

7. Goldstein M, Dillon W. Discrete discriminant analysis. New York: Wiley; 1978.

8. Zhou X, Wang X, Dougherty E. Binarization of Microarray Data Based on a Mixture Model. Mol. Cancer Ther. 2003;2:679–684. [PubMed]

9. Shmulevich I, Zhang W. Binary analysis and optimization-based normalization of gene expression data. Bioinformatics. 2002;18:555–565. [PubMed]

10. Dougherty E, Kim S, Chen Y. Coefficient of determination in nonlinear signal processing. EURASIP J. Appl. Signal Processing. 2000;80:2219–2235.

11. Shmulevich I, Dougherty E, Kim S, Zhang W. Probabilistic Boolean Networks: a rule-based uncertainty model for gene-regulatory networks. Bioinformatics. 2002;18:261–274. [PubMed]

12. Shmulevich I, Dougherty E, Zhang W. From Boolean to probabilistic Boolean networks as models of genetic regulatory networks. Proc. IEEE. 2002;90:1778–1792.

13. Devroye L, Gyorfi L, Lugosi G. A Probabilistic Theory of Pattern Recognition. New York: Springer; 1996.

14. Xu Q, Hua J, Braga-Neto U, Xiong Z, Suh E, Dougherty E. Confidence Intervals for the True Classification Error Conditioned on the Estimated Error. Technol. Cancer Res. Treat. 2006;5:579–590. [PubMed]

15. Dougherty E, Braga-Neto U. Epistemology of computational biology: mathematical models and experimental prediction as the basis of their validity. J. Biol. Syst. 2006;14:65–90.

16. Agresti A. Categorical Data Analysis. 2nd. New York: Wiley; 2002.

17. Witten I, Frank E. Data Mining. San Diego, CA: Academic Press; 2000.

18. Smith C. Some examples of discrimination. Ann. Eugen. 1947;18:272–282. [PubMed]

19. Lachenbruch P, Mickey M. Estimation of error rates in discriminant analysis. Technometrics. 1968;10:1–11.

20. Cover T. Learning in Pattern Recognition. In: Watanabe S, editor. Methodologies of Pattern Recognition. New York NY: Academic Press; 1969. pp. 111–132.

21. Toussaint G, Donaldson R. Algorithms for Recognizing Contour-Traced Hand-Printed Characters. IEEE Trans. Comput. 1970;19:541–546.

22. Stone M. Cross-Validatory Choice and Assessment of Statistical Predictions. J. R. Stat. Soc. Series B Methodol. 1974;36:111–147.

23. Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979;7:1–26.

24. Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J. Am. Stat. Assoc. 1983;78:316–331.

25. Efron B, Tibshirani R. Improvements on Cross-Validation: The .632+ Bootstrap Method. J. Am. Stat. Assoc. 1997;92:548–560.

26. Braga-Neto U, Dougherty E. Is Cross-Validation Valid for Microarray Classification? Bioinformatics. 2004;20:374–380. [PubMed]

27. Braga-Neto U, Dougherty E. Classification. In: Dougherty E, Shmulevich I, Chen J, Wang Z. J, editors. Genomic Signal Processing and Statistics, EURASIP Book Series on Signal Processing and Communication. Hindawi Publishing Corporation; 2005.

28. Braga-Neto U. An Asymptotically-Exact Expression for the Variance of Classification Error for the Discrete Histogram Rule. GENSIPS'2008 -IEEE International Workshop on Genomic Signal Processing and Statistics, Phoenix, AZ. 2008.

29. Braga-Neto U, Dougherty E. Exact performance of error estimators for discrete classifiers. Pattern. Recognit. 2005;38:1799–1814.

30. Braga-Neto U, Dougherty E. Exact Correlation between Actual and Estimated Errors in Discrete Classification. 2009. (In Review)

31. Hanczar B, Hua J, Dougherty E. Decorrelation of the True and Estimated Classifier Errors in High-Dimensional Settings. EURASIP J. Bioinform. Syst. Biol. 2007; Article ID 38473, 12 pages.

32. Agresti A. A survey of exact inference for contingency tables. Stat. Sci. 1992;7:131–153.

33. Verbeek A. A survey of algorithms for exact distributions of test statistics in r×c contingency tables with fixed margins. Comput. Stat. Data Anal. 1985;3:159–185.

34. Klotz J. The Wilcoxon, Ties, and the Computer. J. Am. Stat. Assoc. 1966;61:772–787.

35. Hirji K, Mehta C, Patel N. Computing Distributions for Exact Logistic Regression. J. Am. Stat. Assoc. 1987;82:1110–1117.

36. Hua J, Xiong Z, Lowey J, Suh E, Dougherty E. Optimal number of features as a function of sample size for various classification rules. Bioinformatics. 2005;21:1509–1515.

37. Abend K, Harley TJ, Chandrasekaran B, Hughes G. Comments on "On the mean accuracy of statistical pattern recognizers". IEEE Trans. Inf. Theory. 1969;IT-15:420–423.

38. Braga-Neto U, Dougherty E. Performance of Ensemble Methods in Discrete Classification. 2009. (In Review)

39. Xu L, Krzyzak A, Suen C. Methods for combining multiple classifiers and their applications in handwritten character recognition. IEEE Trans. Syst. Man Cybern. 1992;22:418–435.

40. Breiman L. Bagging Predictors. Mach. Learn. 1996;24:123–140.

41. Quenouille M. Approximate tests of correlation in time series. J. R. Stat. Soc. Series B Methodol. 1949;11:68–84.

42. Glick N. Sample-Based classification procedures derived from density estimators. J. Am. Stat. Assoc. 1972;67:116–122.

43. Feller W. An Introduction to Probability Theory and Its Applications. Vol. 1. New York, NY: Wiley; 1968.

44. Kim S, Dougherty E, Chen Y, Sivakumar K, Meltzer P, Trent J, Bittner M. Multivariate measurement of gene expression relationships. Genomics. 2000;67:201–209.

45. Martins D, Braga-Neto U, Hashimoto R, Bittner M, Dougherty E. Intrinsically Multivariate Predictive Genes. IEEE J. Select. Topics Signal Processing. 2008;2:424–439.

46. Dougherty E, Brun M, Trent J, Bittner M. A Conditioning-Based model of contextual regulation. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008;6:310–320.
