General concepts. Let us consider the general case in which we want to classify a collection of objects *i* = 1, . . ., *n* into *K* predefined classes. For instance, if one wants to distinguish between different types of tumors based on gene expression values, then *K* would represent the number of known existing tumor types. Without loss of generality, data on features can be organized in an *n* × *p* matrix X = (*x*_{ij}), where *x*_{ij} represents the measured value of the variable (feature) *j* in the object (sample) *i*. Every row of the matrix X is therefore a vector **x**_{i} with *p* features, to which a class label *y*_{i} is associated, *y*_{i} ∈ {1, 2, . . ., *c*, . . ., *K*}. In such multiclass classification problems, a classifier *C*(**x**) may be viewed as a collection of *K* discriminant functions *g*_{c}(**x**) such that the object with feature vector **x** will be assigned to the class *c* for which *g*_{c}(**x**) is maximized over the class labels *c* ∈ {1, . . ., *K*}. The feature space *X* is thus partitioned by the classifier *C*(**x**) into *K* disjoint subsets.
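The decision rule just described — assign **x** to the class whose discriminant function is largest — can be sketched as follows; the two discriminant functions here are arbitrary placeholders, not any of the specific classifiers discussed below:

```python
def classify(x, discriminants):
    """Assign x to the class c whose discriminant g_c(x) is maximal."""
    return max(discriminants, key=lambda c: discriminants[c](x))

# Toy example with K = 2 hypothetical discriminant functions:
g = {1: lambda x: -abs(x - 0.0),   # favors points near 0
     2: lambda x: -abs(x - 10.0)}  # favors points near 10
print(classify(2.0, g))  # 1
print(classify(8.0, g))  # 2
```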

There are two main approaches to the identification of the discriminant functions *g*_{c}(**x**) [13]. The first assumes knowledge of the underlying class-conditional probability density functions (the probability density function of **x** for a given class) and assigns *g*_{c}(**x**) = *f*(*p*(**x** | *y* = *c*)), where *f* is a monotonically increasing function, for example the logarithm. Intuitively, the resulting classifier will assign an object **x** to the class in which it has the highest membership probability. In practice, *p*(**x** | *y* = *c*) is unknown and therefore needs to be estimated from a set of correctly classified samples, named the *training* or *design* set. Parametric and nonparametric methods for density estimation can be used to this end. From the parametric category, we will discuss linear and quadratic discriminants, while from the nonparametric one, we will describe the *k*-nearest neighbor (*k*-NN) decision rule. The second approach is to use the data to estimate the class boundaries directly, without explicit calculation of the probability density functions. Examples of algorithms in this category include decision trees, neural networks, and support vector machines (SVMs).

Error estimation. Suppose the classifier *C*(**x**) was trained to classify input vectors **x** into two distinct classes, 1 and 2. The classification results on a collection of input objects **x**_{i}, *i* = 1, . . ., *n*, can be summarized in a *confusion matrix*, which contrasts the predicted class labels of the objects with the true (given) class labels *y*_{i}. As an example, consider a confusion matrix computed for 100 objects, whose two off-diagonal entries are 10 and 20.

The *error rate* (*Err*) of the classifier is defined as the fraction of misclassified samples, i.e., the sum of the off-diagonal elements of the confusion matrix divided by the total number of objects. In the example above, *Err* = (10 + 20) / 100 = 30%. Conversely, the *accuracy* of the classifier can be defined as *Acc* = 1 − *Err* = 70% and represents the fraction of samples classified correctly.
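These definitions can be sketched in a few lines of Python. Since the original matrix is not reproduced here, the labels below use a hypothetical split of the diagonal that is consistent with the 30% error rate stated in the text:

```python
def confusion_matrix(y_true, y_pred, classes=(1, 2)):
    """counts[(t, p)] = number of objects with true class t predicted as p."""
    counts = {(t, p): 0 for t in classes for p in classes}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1
    return counts

def error_rate(counts):
    total = sum(counts.values())
    off_diag = sum(v for (t, p), v in counts.items() if t != p)
    return off_diag / total

# Hypothetical labels reproducing the 30% error rate from the text:
y_true = [1] * 50 + [2] * 50
y_pred = [1] * 40 + [2] * 10 + [1] * 20 + [2] * 30
cm = confusion_matrix(y_true, y_pred)
print(error_rate(cm))      # 0.3
print(1 - error_rate(cm))  # accuracy: 0.7
```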

The goal of developing classification models is to use them to predict the class membership of *new samples*. If the data used to build the classifier are also used to compute the error rate, then the resulting error estimate, called the *resubstitution* estimate, will be optimistically biased [14]. A better way to assess the error is the *hold-out* procedure, in which one splits the data into two equal parts: the first half is used to train the classifier (the *training set*), while the remaining half is used to assess the error (the *test set*). With biological data, this approach is rarely feasible due to the paucity of the data. A more appropriate alternative is the *leave-one-out* (LOO) cross-validation method, which trains the classifier *n* times on (*n* − 1) samples, omitting each observation in turn for testing the classifier. The *n* test results obtained in this way can be arranged into a confusion matrix and *Err* estimated by the proportion of off-diagonal elements. Although the error estimate obtained with the leave-one-out procedure has low bias, it may show high variance [15]. A good tradeoff between bias and variance may be obtained by using *N-fold cross-validation*, in which the dataset is repeatedly split into (*n* − *m*) training points and *m* test points (*N* = *n*/*m*). Using multiple resampling, one can obtain a mean, as well as a standard deviation, for the classifier error.
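The cross-validation schemes above can be sketched generically; the classifier itself is left abstract here as a caller-supplied `train_and_test` function (an assumption of this sketch) that returns the number of misclassified test samples:

```python
import random

def n_fold_cv(X, y, train_and_test, N, seed=0):
    """Estimate Err by N-fold cross-validation.

    train_and_test(train_idx, test_idx) must return the number of
    misclassified test samples; classifier details are left abstract.
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::N] for i in range(N)]
    errors = 0
    for fold in folds:
        train = [i for i in idx if i not in fold]
        errors += train_and_test(train, fold)
    return errors / len(y)

def loo_cv(X, y, train_and_test):
    """Leave-one-out is the special case N = n."""
    return n_fold_cv(X, y, train_and_test, N=len(y))

# A (dummy) classifier that errs on every test point gives Err = 1:
print(n_fold_cv([[0]] * 10, [0] * 10, lambda tr, te: len(te), N=5))  # 1.0
```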

Types of classifiers. *Quadratic and linear discriminants.* A standard classification approach, applicable when the features are continuous variables (e.g., gene expression data), assumes that for each class *c*, **x** follows a multivariate normal distribution *N*(**m**_{c},Σ_{c}) having the mean **m**_{c} and covariance matrix Σ_{c}. The covariance matrix Σ is square with dimension *p* × *p*. The element *i*,*j* of this matrix is the covariance between the variables *i* and *j*.

Using the multivariate-normal probability density function and replacing the true class means and covariance matrices with sample-derived estimates (**m̂**_{c} and Σ̂_{c}, respectively), the discriminant function for each class can be computed as:

*g*_{c}(**x**) = −(1/2) ln |Σ̂_{c}| − (1/2)(**x** − **m̂**_{c})^{T} Σ̂_{c}^{−1}(**x** − **m̂**_{c})

where

**m̂**_{c} = (1/*n*_{c}) Σ_{*i*:*y*_{i}=*c*} **x**_{i}

and

Σ̂_{c} = (1/(*n*_{c} − 1)) Σ_{*i*:*y*_{i}=*c*} (**x**_{i} − **m̂**_{c})(**x**_{i} − **m̂**_{c})^{T}

with *n*_{c} denoting the number of training samples in class *c*. The discriminant functions are monotonically related to the densities *p*(**x** | *y* = *c*), yielding higher values for larger densities. The values of the discriminant functions differ from one class to another only through the estimates of the class mean and covariance matrix. A new object **z** will be classified into the class for which the discriminant is largest. This classification approach produces nonlinear (quadratic) class boundaries, giving the classifier the name *quadratic discriminant* rule or Gaussian classifier.

An alternative to this quadratic classifier is to assume that the class covariance matrices Σ_{c}, *c* = 1, . . ., *K*, are all the same. In this case, instead of using a different covariance matrix estimate for each class, a single pooled covariance matrix is used. This is especially useful when the number of samples per class is low: calculating a covariance matrix from only a few samples may produce very unreliable estimates, and better results may be obtained by assuming a common covariance matrix and using all samples to estimate it. The resulting classifier uses hyperplanes as class boundaries, hence the name *normal-based linear discriminant*.

To cope with situations in which the number of features is comparable to the number of samples, a further simplification can be made to the normal-based linear discriminant by setting all off-diagonal elements of the covariance matrix to zero, i.e., disregarding the covariation between features. Such a *diagonal linear discriminant* was found to outperform other types of classifiers in a variety of microarray analyses [16].
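A minimal sketch of the diagonal linear discriminant follows, assuming equal class priors (an assumption of this sketch) so the discriminant reduces to the diagonal Gaussian log-density up to a constant shared by all classes:

```python
from statistics import mean

def fit_diagonal_ld(X, y):
    """Per-class feature means plus one pooled, per-feature variance;
    all off-diagonal covariances are ignored."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {c: [mean(x[j] for x, c2 in zip(X, y) if c2 == c) for j in range(p)]
             for c in classes}
    # pooled within-class variance for each feature
    var = [sum((x[j] - means[c2][j]) ** 2 for x, c2 in zip(X, y))
           / (len(X) - len(classes)) for j in range(p)]
    return means, var

def predict_diagonal_ld(z, means, var):
    # discriminant: Gaussian log-density with diagonal covariance,
    # up to an additive constant shared by all classes
    g = lambda c: -sum((z[j] - means[c][j]) ** 2 / (2 * v)
                       for j, v in enumerate(var))
    return max(means, key=g)

X = [[0.0, 0.1], [0.2, -0.1], [1.0, 1.1], [1.2, 0.9]]
y = [1, 1, 2, 2]
means, var = fit_diagonal_ld(X, y)
print(predict_diagonal_ld([0.1, 0.2], means, var))  # 1
print(predict_diagonal_ld([1.0, 1.0], means, var))  # 2
```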

The above-presented classifiers work optimally when their underlying assumptions are met, such as the normality assumption. In many cases, some of the assumptions may not be met. However, (as pointed out by one of the anonymous reviewers) what matters in the end for a practical application is how close the estimated class boundaries are to the true class boundaries. This can be assessed through a cross-validation process.

In recent work, Guo and colleagues [17] presented a regularized linear discriminant analysis procedure useful when the number of features far exceeds the number of samples.

*k-Nearest neighbor classifier.* The *k*-NN classifier can be seen as a nonparametric method of density estimation [13] and makes no assumption about the data distribution, except for the continuity of the feature variables. The *k*-NN classifier does not require model fitting but simply stores the training dataset with all available vector prototypes of each class. When a new object **z** needs to be classified, the first step of the algorithm is to compute the distance between **z** and all available objects in the training set, **x**_{i}, *i* = 1, . . ., *n*. A popular choice of distance metric is the Euclidean distance: *d*(**z**, **x**_{i}) = (Σ_{j=1}^{p} (*z*_{j} − *x*_{ij})²)^{1/2}. A thorough discussion of distance functions with application to microarray analysis is given by Gentleman et al. [18].

The distances are ordered and the top *k* training samples (closest to the new object to be predicted) are retained. Let us denote with *n*_{c} the number of objects in the training dataset among the *k* ones which belong to the class *c*. The *k*-NN classification rule classifies the new object **z** in the class that maximizes *n*_{c}, i.e., the class that is most common among the closest *k* neighbors. The *k*-NN discriminant functions can be written as *g*_{c}(**x**) = *n*_{c}. When two or more classes are equally represented in the vicinity of the point **z**, the class whose prototypes have the smallest average distance to **z** may be chosen.
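The *k*-NN rule above, including the average-distance tie-break, can be sketched directly:

```python
from collections import Counter
from math import dist  # Euclidean distance

def knn_predict(z, X, y, k=3):
    """Classify z into the most common class among its k nearest
    training samples (X: list of feature tuples, y: class labels)."""
    neighbors = sorted(zip(X, y), key=lambda s: dist(z, s[0]))[:k]
    votes = Counter(c for _, c in neighbors)
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the class whose neighbors lie closest to z on average
    avg = {c: sum(dist(z, x) for x, cc in neighbors if cc == c) / votes[c]
           for c in tied}
    return min(avg, key=avg.get)

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [1, 1, 1, 2, 2, 2]
print(knn_predict((0.5, 0.5), X, y, k=3))  # 1
print(knn_predict((5.5, 5.5), X, y, k=3))  # 2
```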

*Decision trees.* A special type of classifier is the decision tree [19], which is trained by iteratively selecting the individual feature that is most salient at each node of the tree. The input space *X* is repeatedly split into descendant subsets, starting with *X* itself. There are several heuristic methods for constructing decision-tree classifiers. They are usually constructed top-down, beginning at the root node and successively partitioning the feature space. The construction involves three main steps: 1) selecting a splitting rule for each internal node, i.e., determining the feature, together with a threshold, that will be used to partition the dataset at that node; 2) determining which nodes are terminal nodes, i.e., deciding for each node whether to continue splitting or to make the node terminal and assign it a class label; and 3) assigning class labels to terminal nodes by minimizing the estimated error rate.

The most commonly used decision tree classifiers are binary. They use a single feature at each node, resulting in decision boundaries that are parallel to the feature axes (see ). Although they are intrinsically suboptimal, the resulting classifier is easy to interpret.
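The splitting-rule selection at a single node of such a binary tree can be sketched as follows: for one feature, pick the threshold that minimizes the number of misclassified samples when each side of the split is labeled with its majority class. (A full tree would apply this recursively and choose among all features.)

```python
from collections import Counter

def best_split(X, y, feature):
    """Threshold on one feature minimizing misclassifications when each
    side of the split gets its majority class label."""
    values = sorted(set(x[feature] for x in X))
    best = (None, len(y) + 1)
    # candidate thresholds: midpoints between consecutive observed values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [c for x, c in zip(X, y) if x[feature] <= t]
        right = [c for x, c in zip(X, y) if x[feature] > t]
        errs = sum(len(side) - Counter(side).most_common(1)[0][1]
                   for side in (left, right) if side)
        if errs < best[1]:
            best = (t, errs)
    return best

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1, 1, 1, 2, 2, 2]
print(best_split(X, y, feature=0))  # (6.5, 0)
```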

*Neural networks.* The most common neural network architecture used in classification problems is a fully connected, three-layered structure of nodes in which signals are propagated from the input to the output layer via the hidden layer (see ). The input layer simply feeds the values of the feature vector **x** to the hidden layer. Each hidden unit computes a weighted sum of all outputs of the input layer, adds a bias term, and transforms the result using a nonlinear function, usually the logistic sigmoid:

φ(*a*) = 1 / (1 + *e*^{−*a*})

Similarly, the output layer processes the output of the hidden layer. Usually there is one output unit for each class. The discriminant function implemented by the *k*th output unit of such a neural network can be written as:

*g*_{k}(**x**) = φ(Σ_{j} *α*_{j,k} φ(Σ_{i} *w*_{i,j} *x*_{i} + *b*_{j}) + *b*_{k})

In this equation, *w*_{i,j} is the weight from the *i*th input unit to the *j*th hidden node, *α*_{j,k} is the weight from the *j*th hidden unit to the *k*th output node, *b*_{j} is the bias term of the *j*th hidden unit, and *b*_{k} is the bias term of the *k*th output unit. These are all adjustable parameters and are estimated (learned) during a training process that minimizes a loss function. A commonly used loss function is the sum of squared errors between the predicted and expected signals at the output nodes, given a training dataset.
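The forward pass that evaluates the discriminants *g*_{k}(**x**) can be sketched as follows (the parameter layout, lists of lists, is just one convenient choice; real parameters would come from training, not be set by hand):

```python
from math import exp

def sigmoid(a):
    return 1.0 / (1.0 + exp(-a))

def forward(x, w, b_hidden, alpha, b_out):
    """One forward pass through a fully connected three-layer network.
    w[i][j]: weight input i -> hidden j; alpha[j][k]: hidden j -> output k;
    b_hidden[j], b_out[k]: bias terms. Returns the K output values g_k(x)."""
    hidden = [sigmoid(sum(x[i] * w[i][j] for i in range(len(x))) + b_hidden[j])
              for j in range(len(b_hidden))]
    return [sigmoid(sum(hidden[j] * alpha[j][k] for j in range(len(hidden)))
                    + b_out[k])
            for k in range(len(b_out))]

# With all parameters zero, every unit outputs sigmoid(0) = 0.5:
print(forward([0.2, -0.4], [[0.0, 0.0], [0.0, 0.0]],
              [0.0, 0.0], [[0.0], [0.0]], [0.0]))  # [0.5]
```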

Consider that *N*_{T} training samples are available to train a neural network with *K* output units. The error of the neural network on the training set can be computed as:

*E*(**ω**) = Σ_{s=1}^{N_T} *e*_{s}

where **ω** represents all the adjustable parameters of the neural network (weights and biases), which are initialized with small random values, and *e*_{s} is the error obtained when the *s*th training sample is used as input to the network. The error *e*_{s} is defined as proportional to the sum of squared differences between the expected outputs of the network and the actual outputs, given the current values of the weights, i.e.,

*e*_{s} = (1/2) Σ_{k=1}^{K} (*g*_{s,k} − *t*_{s,k})²

Here, *g*_{s,k} represents the actual output of unit *k* for sample *s*, while *t*_{s,k} is the desired (target) output value for the same sample. When a sample belongs to class *k*, the desired response is that output unit *k* fires a value of 1 while all other output units fire 0. Learning is done by updating the parameters **ω** such that the global error decreases in an iterative process. A popular update rule is the back-propagation rule [20], in which the adjustable parameters **ω** are changed (increased or decreased) in the direction in which the training error *E*(**ω**) decreases the most.
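Back-propagation computes the gradient of *E*(**ω**) analytically. As a minimal illustration of the same idea, moving **ω** against the gradient until the error stops decreasing, the sketch below uses a finite-difference gradient on an arbitrary toy error function (not actual back-propagation):

```python
def numeric_gradient(E, omega, h=1e-6):
    """Finite-difference approximation of dE/d_omega; back-propagation
    computes this gradient analytically and far more efficiently."""
    grad = []
    for i in range(len(omega)):
        bumped = omega[:]
        bumped[i] += h
        grad.append((E(bumped) - E(omega)) / h)
    return grad

def gradient_descent(E, omega, rate=0.1, steps=200):
    """Repeatedly step the parameters in the direction of steepest descent."""
    for _ in range(steps):
        g = numeric_gradient(E, omega)
        omega = [w - rate * gi for w, gi in zip(omega, g)]
    return omega

# Toy error function with minimum at omega = (1, -2):
E = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w = gradient_descent(E, [0.0, 0.0])
print(w)  # close to [1.0, -2.0]
```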

Equation 6 above can be modified so that the training process minimizes not only the sum of squared errors on the training set but also the sum of squared weights of the network. This *weight regularization* enhances the generalization capability of the model by preventing small variations in the inputs from having an excessive impact on the output. The underlying assumption of weight regularization is that the boundaries between the classes are not sharp.

For more details on the theory and practical use of neural networks, see Duda et al. [12], Ripley [21], Venables and Ripley [22], and references therein.

*Support vector machines.* Consider a two-class, linearly separable classification problem, as shown in , left panel. While many decision boundaries exist that separate all the training samples into the two classes correctly, a natural question to ask is: are all these decision boundaries equally good? Here the goodness of a decision boundary is to be evaluated, as described previously, by cross-validation. Among these decision boundaries, SVMs find the one that achieves the maximum margin between the two classes. From statistical learning theory, decision functions derived by maximizing the margin minimize the theoretical upper bound on the expected risk and are thus expected to generalize well [23]. The margin is defined as the distance between a planar decision surface that separates the two classes and the training samples closest to that surface (see , right panel). Let us denote with {(**x**_{i}, *y*_{i})}_{i=1,. . .,n} the labeled training dataset, where **x**_{i} ∈ ℝ^{p} and *y*_{i} ∈ {−1, +1}. SVMs find an optimal hyperplane **wx**^{T} + *b* = 0, where **w** is the *p*-dimensional vector perpendicular to the hyperplane and *b* is the bias. The objective of training an SVM is to find **w** and *b* such that the hyperplane separates the data and maximizes the margin 2 / ||**w**|| (, right panel). By introducing non-negative slack variables *ξ*_{i} and a penalty function measuring classification errors, the linear SVM problem is formulated as follows:

min_{**w**,b,ξ} (1/2) ||**w**||² + *C* Σ_{i=1}^{n} *ξ*_{i}

subject to constraints:

*y*_{i}(**wx**_{i}^{T} + *b*) ≥ 1 − *ξ*_{i}, *ξ*_{i} ≥ 0, *i* = 1, . . ., *n*

where *C* is a parameter set by the user, which controls the penalty assigned to errors. The optimization problem can be reduced to a dual problem whose solution is obtained by quadratic programming [23]. The decision function is simply

*C*(**x**) = sign(Σ_{i=1}^{n} *α*_{i}*y*_{i}(**x**_{i}**x**^{T}) + *b*)

where *α*_{i} are coefficients obtained by solving the dual problem. Data points with nonzero *α*_{i} are called support vectors (SVs) (e.g., , right panel). In SVMs, only the SVs contribute to the construction of the decision boundaries.
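Evaluating the decision function is straightforward once the *α*_{i} and *b* are known; the coefficients below are set by hand purely for illustration (in practice they come from the quadratic programming solution):

```python
def svm_decision(z, svs, labels, alphas, b):
    """Evaluate sign( sum_i alpha_i * y_i * (x_i . z) + b ) over the
    support vectors; only points with nonzero alpha contribute."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    s = sum(a * yi * dot(xi, z) for xi, yi, a in zip(svs, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Hand-set coefficients for illustration: boundary falls at z = 2
svs, labels, alphas, b = [(1.0,), (3.0,)], [-1, +1], [0.5, 0.5], -2.0
print(svm_decision((1.5,), svs, labels, alphas, b))  # -1
print(svm_decision((2.5,), svs, labels, alphas, b))  # 1
```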

The linear SVM can be readily extended to a nonlinear SVM where more sophisticated decision boundaries are needed. This is done by applying a kernel transformation, i.e., simply replacing every inner product (**x**_{i}**x**^{T}) in the linear SVM with a nonlinear kernel function evaluation *K*(**x**_{i}, **x**). This is equivalent to transforming the original input space *X* nonlinearly into a high-dimensional feature space. Training data that are not linearly separable in the original feature space may become linearly separable in the transformed feature space. Consequently, the decision boundaries are linear in the projected high-dimensional feature space and nonlinear in the original input space. Two commonly used kernels are the polynomial kernel, *K*(**x**_{i}, **x**) = (**x**_{i}**x**^{T} + 1)^{d}, and the radial basis function (RBF) kernel, *K*(**x**_{i}, **x**) = exp(−*γ*||**x**_{i} − **x**||²). The kernel functions return larger values for arguments that are closer together in feature space.
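Both kernels are easy to write down directly (the added constant in the polynomial kernel varies by convention; the form with +1 is assumed here):

```python
from math import exp

def polynomial_kernel(xi, x, d=2, c0=1.0):
    """K(xi, x) = (xi . x + c0)^d."""
    return (sum(a * b for a, b in zip(xi, x)) + c0) ** d

def rbf_kernel(xi, x, gamma=1.0):
    """K(xi, x) = exp(-gamma * ||xi - x||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, x))
    return exp(-gamma * sq)

# Closer arguments yield larger kernel values:
print(rbf_kernel((0.0, 0.0), (0.1, 0.0)))  # ~0.990
print(rbf_kernel((0.0, 0.0), (2.0, 0.0)))  # ~0.018
```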

In constructing linear SVMs for classification, the only parameter to be selected is the penalty parameter *C*, which controls the tradeoff between the errors of the SVM on the training data and the margin. For nonlinear SVMs, the learning parameters include *C* and the parameters associated with the kernels used, e.g., *γ* in RBF kernels. In practice, the learning parameters are selected through cross-validation.

To conclude, the key points of SVMs are: a) one believes there is a representation of the features in which the classes can be discriminated by a single hyperplane (perhaps with only a few errors); b) one chooses the hyperplane that lies at the largest distance from the sentinel cases near the class boundary (large margin); and c) one can use kernel transformations when the data are not linearly separable in the original feature space but may be so in the transformed space.

Dimensionality reduction. An important aspect of classifier design is that in some applications the dimensionality *p* of the input space is too high to allow a reliable estimation of the classifier's internal parameters from a limited number of samples (*p* ≫ *n*). In such situations, dimensionality reduction may be useful. There are two main categories of approaches to dimensionality reduction. The first is to obtain a reduced number of new features by combining the existing ones, e.g., by computing a linear combination. *Principal component analysis* (PCA) is one particular method in this branch, in which new variables (principal directions) are identified and may be used instead of the original features. The second type of dimensionality reduction involves *feature selection*, which seeks subsets of the original variables that are adequately predictive.

A serious difficulty that arises when *p* ≫ *n* is *overfitting*. Most of the procedures examined in this tutorial include a set of tunable parameters whose size increases with *p*. When more tunable parameters are present, very complex relationships present in the sample can often be fit very well, particularly if *n* is small. Generalization error rates in such settings typically far exceed training-set error rates. Reducing the dimensionality of the feature space can help reduce the risk of overfitting. However, automated methods of dimensionality reduction must be employed with caution: the utility of a feature in a prediction problem may depend on its relationships with several other features, and simple reduction methods that consider features in isolation may lead to loss of important information.

The statistical pattern recognition literature classifies the approaches to feature selection into *filter methods* and *wrapper methods*. In the former category, a statistical measure (e.g., a *t*-test) of the marginal relevance of the features is used to filter out the features that appear irrelevant, using an arbitrary threshold. For instance, marker genes for cancer prediction were chosen based on their correlation with the class distinction and then used as inputs to a classifier [24].
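A *t*-test-based filter of the kind just described can be sketched as follows; the score and the cutoff of 2.0 are illustrative choices, and by construction the score considers each feature in isolation:

```python
from math import sqrt

def welch_t(a, b):
    """Welch t-statistic for two samples (a filter-style relevance score)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / sqrt(va / len(a) + vb / len(b))

def filter_features(X, y, threshold=2.0):
    """Keep features whose |t| between classes 1 and 2 exceeds an
    arbitrary threshold; joint (synergistic) effects are missed."""
    kept = []
    for j in range(len(X[0])):
        a = [x[j] for x, c in zip(X, y) if c == 1]
        b = [x[j] for x, c in zip(X, y) if c == 2]
        if abs(welch_t(a, b)) > threshold:
            kept.append(j)
    return kept

# Feature 0 separates the classes; feature 1 is noise:
X = [[0.0, 5.0], [0.1, 4.0], [0.2, 6.0], [1.0, 5.1], [1.1, 4.1], [1.2, 6.1]]
y = [1, 1, 1, 2, 2, 2]
print(filter_features(X, y))  # [0]
```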

Although fast and easy to implement, such filter methods cannot take into account the joint contribution of the features. Wrapper methods use the accuracy of the resulting classifier to evaluate either each feature independently or multiple features at the same time. For instance, the accuracy of a *k*-NN classifier has been used to guide a genetic algorithm searching for an optimal subset of genes in a high-dimensional combinatorial space [25]. The main disadvantage of such methods, which try to find optimal subsets of features, is that they may be computationally demanding. The main advantages of wrapper methods include the ability to: a) identify the features best suited to the classifier that will ultimately be used to make the decision, and b) detect possible synergistic feature effects (joint relevance). More details on feature selection methods and classification can be found in the literature [16, 26, 27].