The problem of predicting the diagnostic category of a given tissue sample is of fundamental clinical importance. Conventional diagnostic methods are based on subjective evaluation of the morphological appearance of the tissue sample, which requires a visible phenotype and a trained pathologist to interpret the view. In some cases the class is easily identified by cell morphology or cell-type distribution, but in many cases apparently similar pathologies can lead to very different clinical outcomes. Since the advent of DNA array technology [
1-
6], researchers have begun to use expression array analysis as a quantitative phenotyping tool. The potential advantage to using arrays for phenotyping is that they provide a simultaneous quantitative measure of thousands of parameters (for example, gene-expression levels) some of which are likely to have disease relevance. When array analysis is used predominately for phenotyping, we refer to the expression pattern as an 'expression array phenotype'. Owing to the ability to quantify a large number of parameters, the use of expression array in phenotyping promises both more accurate class prediction and the identification of subclasses that could not be defined by traditional methods.
There has been a recent explosion in the use of expression array phenotyping for identification and/or classification in a variety of diagnostic areas. Examples of diagnostic categories (or classes) include cancer versus non-cancer [
7,
8], different subtypes of tumor [
9-
13], and prediction of responses to various drugs or cancer prognosis [
14-
16]. The prediction of the diagnostic category of a tissue sample from its expression array phenotype given the availability of similar data from tissues in identified categories is known as classification (or supervised learning). A challenge in predicting diagnostic categories using microarray data is that the number of genes is usually significantly greater than the number of tissue samples available, and only a subset of the genes is relevant in distinguishing different classes. Selection of relevant genes for classification is known as feature selection. This has three main applications: first, the classification accuracy is often improved using a subset instead of the entire set of genes; second, a small set of relevant genes is convenient for developing diagnostic tests; and third, these genes may lead to biologically interesting insights that are characteristic of the classes of interest.
There have been many reports that address the classification and feature-selection problems, for example [
9,
10,
14,
17]. However, many of these methods are tailored towards binary classification in which there are only two classes [
9,
14]. Moreover, there has been very limited effort to develop classification and feature-selection algorithms for microarray data with repeated measurements or error estimates. Array data is well known to be noisy; for example, Lee
et al. [
18] showed that any single microarray output is subject to substantial variability. This is particularly true for genes with low expression levels, which are more difficult to measure than genes with high expression levels. As the cost of microarray experiments is declining, more research laboratories are generating microarray data with repeated measurements [
9,
14,
19,
20]. Repeated measurements not only provide improved estimates of gene-expression levels but can also be used to estimate the uncertainty or variability in the measurement. In some cases the repeated measurements are biological replicates (for example, independent samples), whereas in other cases only technical replicates are available. Regardless of the source, however, variability estimates should be taken into account in both clustering and classification algorithms, as variability estimates can potentially be exploited to improve the results.
We have developed two algorithms called the uncorrelated shrunken centroid (USC) algorithm, and the error-weighted, uncorrelated shrunken centroid (EWUSC) algorithm. Both USC and EWUSC are integrated feature-selection and classification algorithms that are applicable to data with any number of classes. Our primary contribution is that both USC and EWUSC exploit interdependence between genes to reduce the number of selected features. In addition, EWUSC takes advantage of variability estimates over repeated measurements to down-weight noisy genes and noisy experiments so that no ad hoc filtering step is necessary. On the other hand, USC is applicable to microarray datasets with or without repeated measurements.
Introduction to classification and feature selection
Classification is a supervised learning approach, in which the classes (or labels) of a subset of samples are inputs to the algorithm. This is in contrast to clustering, which is an unsupervised approach, in which no knowledge of the samples is assumed. A training set is a set of samples for which the classes are known. A test set is a set of samples for which the classes are assumed to be unknown to the algorithm, and the goal is to predict which classes these samples belong to. The first step in classification is to build a 'classifier' using the given training set, and the second step is to use the classifier to predict the classes of the test set.
In the context of gene-expression data, the samples are usually the experiments, and the classes (or labels) are usually different types of tissue samples (for example, cancer versus non-cancer, different tumor types, rate of disease progression, and response to therapy). A typical microarray dataset consists of thousands to tens of thousands of genes, and dozens to hundreds of experiments. One challenge of classification using microarray data is that the number of genes is significantly greater than the number of samples. In this situation, it is possible to find both random and biologically relevant correlations of gene behavior with sample type. To protect against spurious results, the goal is to identify the smallest possible subset of genes that correlate most strongly with the known class labels. In addition, a small subset of genes is desirable for the development of expression-based diagnostics. The problem of selecting relevant genes (or features) for classification is known as feature selection.
Cross validation is a well-established technique used to optimize the parameters or features chosen in a classifier. In m-fold cross-validation, the training set is randomly divided into m disjoint subsets with roughly equal size. Each of these m subsets is left out in turn for evaluation, and the other (m - 1) subsets are used as inputs to the classification algorithm. In this work, we randomly divide each class into m disjoint subsets (where m is less than the size of the smallest class in the training set), so that each class is represented in the subset fed to the classification algorithm. The left-out subset of the training set is used to evaluate classification accuracy because the classes of this subset are known. The most popular form of cross-validation is leave-one-out cross-validation (LOOCV), in which m is equal to the number of samples in the training set, and each sample in the training set is left out in turn to evaluate the prediction results.
Related work
van't Veer
et al. [
14] recently applied a binary classification algorithm to cDNA array data with repeated measurements, and classified breast cancer patients into good and poor prognosis groups. Their classification algorithm consists of the following steps. The first step is filtering, in which only genes with both small error estimates and significant regulation relative to a reference pool of samples from all patients are chosen. The second step consists of identifying a set of genes whose behaviour is highly correlated with the two sample types (for example, upregulated in one sample type but downregulated in the other). These genes are rank-ordered so that genes with the highest magnitudes of correlation with the sample types have top ranks. In the third step, the set of relevant genes is optimized by sequentially adding genes with top-ranked correlation from the second step. Leave-one-out cross-validation is used to evaluate and choose an optimal set of features. van't Veer
et al.'s approach takes variability estimates of repeated measurements into consideration by using error-weighted correlation in their method. However, this method involves an
ad hoc filtering step and does not generalize to more than two classes.
Ramaswamy
et al. [
10] combined support vector machines (SVMs), which are binary classifiers, to solve the multiclass classification problem. They showed that the one-versus-all approach of combining SVM yields the minimum number of classification errors on their Affymetrix data with 14 tumor types. The one-versus-all combination approach builds k (the number of classes) binary classifiers, each of which distinguishes one class from all the other classes. Suppose binary classifier i predicts a discriminant value f
i(x) for a given sample x in the test set. The combined multiclass classifier assigns sample x to the class for which the corresponding binary classifier produces the highest discriminant value. In addition to not taking variability estimates of repeated measurements into account, this approach selects different relevant features (genes) for each binary classifier.
Nguyen and Rocke [
21,
22] used partial least squares (PLS) for feature selection, together with traditional classification algorithms such as logistic discrimination and quadratic discrimination to classify multiple tumor types from microarray data. These traditional classification algorithms require the number of samples (experiments) to be greater than the number of variables (genes), and it is therefore essential to reduce the dimensionality before applying these traditional classification techniques. PLS is a dimension-reduction technique that maximizes the covariance between the classes and a linear combination of the genes. This approach can be generalized to multiple classes, but it does not make use of variability estimates of the data. In addition, it is a multistep process that involves a filtering step (to select genes with significant mean differences) and then application of PLS to further reduce the dimensionality so that the number of samples is greater than the number of dimensions.
Dudoit
et al. [
23] compared the performance of different discrimination methods (including nearest neighbor classifiers, linear discriminant analysis and classification trees) for classifying multiple tumor types using gene-expression data. None of the discrimination methods they evaluated takes measurement variability into consideration, and their emphasis is on discrimination methods and not feature selection.
Yeung
et al. [
24] showed that clustering algorithms that take advantage of repeated measurements (including the error-weighted approach that down-weights noisy measurements) yield more accurate and more stable clusters. Here, we will focus on the supervised learning approach, instead of the unsupervised clustering technique.
Tibshirani
et al. [
17] developed a 'shrunken centroid' (SC) algorithm for classifying multiple cancer types. It is an integrated approach for feature selection and classification. Features are selected by considering one gene at a time: the difference between the class centroid (average expression level or ratio within a class) of a gene and the overall centroid (average expression level or ratio over all classes) of a gene is compared to the within-class standard deviation plus a 'shrinkage threshold' which is fixed for all genes. The intuition is that genes with at least one class centroid that is significantly different from the overall centroid are selected as relevant genes. The size of the shrinkage threshold is determined by cross-validation on the training set to minimize classification errors.
Our contributions
Our algorithms have the following desirable characteristics. Both EWUSC and USC exploit the interdependence of genes to reduce the number of selected features. EWUSC takes advantage of the variability of gene-expression data over repeated measurements, so no ad hoc filtering step is necessary. Both EWUSC and USC can be applied to data with any number of classes. Both EWUSC and USC adopt an integrated approach for both feature selection and classification. Both algorithms make no assumption on data distributions.
We illustrate the advantage of removing correlated genes (for example, the USC algorithm) on the NCI 60 data [
12] for which there is no variability information. This dataset has been extensively used in other publications for classification algorithm development [
22,
23,
25]. We illustrated and compared our USC and EWUSC algorithms with two real datasets: a multiple tumor dataset from Ramaswamy
et al. [
10] and a breast cancer dataset from van 't Veer
et al. [
14]. These two datasets were chosen as they are publicly available in a form from which we can calculate or obtain error estimates for each gene-expression level or ratio. We used a subset of the multiple tumor data [
10] that consists of 7,129 genes and 11 tumor types on Affymetrix chips. There are 96 samples in the training set, and 27 samples in the test set. For the Affymetrix dataset we estimated the variability in the gene-expression levels using the robust multi-array analysis (RMA) tool [
26,
27] from the BioConductor project [
28]. A subset of the published data was used as we could only obtain raw data (.cel files) for a subset. The breast cancer dataset [
14] consists of 25,000 genes with four repeated measurements on cDNA arrays. There are 78 samples in the training set, 19 samples in the test set, and two classes of patients: one class with good prognosis (with more than 5 years of survival time), and another class with poor prognosis (with less than 5 years of survival time). For the breast cancer cDNA array data, published p-values as calculated by Rosetta's Resolver software were used to calculate the error estimates. In addition, we created synthetic datasets with repeated measurements and compared the performance of EWUSC, USC and SC at different noise levels.
We adopted three criteria for assessing feature selection and classification algorithms: prediction accuracy, number of relevant genes and feature stability. Prediction accuracy is defined as the percentage of correct classifications on the test set. The number of relevant genes is the total number of genes used to achieve optimal prediction accuracy. Feature stability is the level of agreement of selected genes chosen over different cross-validation runs of the algorithm.
Using these algorithms we obtained the following general results. Exploiting gene interdependence by removal of correlated genes typically results in comparable or higher prediction accuracy using fewer relevant genes. This is highly desirable if one wishes to develop diagnostic tools from the selected set of genes. Using error or variability estimates as weighting factors generally yields higher feature stability and reduces the number of relevant genes on real datasets. On the multiple tumor data, our EWUSC algorithm achieves 16% increase in prediction accuracy, using only 10% of the genes as features (compared with using all the available genes in the published result). On the breast cancer data, our EWUSC algorithm produces the same number of classification errors as the published result using a larger feature set. Unlike the published algorithm for this dataset, however, the EWUSC algorithm is applicable to datasets with more than two classes.