The results of this paper have been validated on a dataset of gene expression profiles from complementary DNA (cDNA) microarrays related to very similar phenotypes. Only a reduced subset of genes allows for discrimination (Table ). This peculiarity increases the complexity of the classification task, allowing us to better validate the proposed method. It is worth mentioning that, in all experiments, the training-set does not include any sample from the test-set. This is a necessary requirement to avoid overoptimistic results and therefore to honestly evaluate the classifiers' performance.
Structure of the considered data-set
The data-set includes a total of 7 phenotypes. Samples have been downloaded from the cDNA Stanford Microarray database [24]. All genes without a valid UnigeneID have been discarded. The expression level of each gene is measured as the log-ratio between the Cy5 and the Cy3 channel of the array: log2(Cy5/Cy3).
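As a small illustration of this preprocessing step, the log-ratio can be computed per gene from the two channel intensities. This is a generic sketch in Python (the paper's pipeline is in R); the function name and sample intensities are invented for the example.

```python
import math

def log_ratio(cy5: float, cy3: float) -> float:
    """Log2 ratio of the two channel intensities: the expression level used here."""
    return math.log2(cy5 / cy3)

# A gene equally expressed in both channels has log-ratio 0;
# positive values mean a stronger Cy5 (sample) signal, negative a stronger Cy3 (reference).
print(log_ratio(400.0, 100.0))  # log2(4) = 2.0
print(log_ratio(100.0, 100.0))  # log2(1) = 0.0
```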
Four sets of samples have been downloaded from a large set of experiments aiming at performing Lymphoma Classification [25]:
• Diffuse Large B-Cell Lymphoma (DLBCL): a non-Hodgkin lymphoma disease,
• B-Cell Chronic Lymphocytic Leukemia Wait&Watch (CLLww),
• B-Cell Chronic Lymphocytic Leukemia (CLL), and
• Follicular Lymphoma (FL): independent lymph node samples on LymphoChip microarrays [27].
The remaining three phenotypes in the data-set are:
• Acute Lymphoblastic Leukemia (ALL),
• Core Binding Factor Acute Myeloid Leukemia (CBF-AML): subgroups characterized by shorter overall survival [28], and
• Acute Myeloid Leukemia 2 data-set (AML2): peripheral-blood samples or bone marrow samples of intermediate-risk AML with a normal karyotype.
Three multi-class classifiers often used in gene expression profile analysis have been considered in this study: k–Nearest Neighbors (k-NN), feed-forward Neural NETwork with a single hidden layer (N-NET), and Random Forests (RF). All algorithms are available in R. A detailed description of how data have been processed and how the prediction models for the different classifiers have been trained is available in the Methods section.
Class probability estimates analysis and decision rules
The process of detecting samples to reject in a multi-class classification system can be modeled as a binary classification test discriminating between samples that belong to one of the known classes (target samples) and samples that do not belong to any of them (reject samples). The outcome of the test is measured in terms of:
• true positives (TP): target samples correctly accepted,
• true negatives (TN): reject samples correctly rejected,
• false positives (FP): reject samples erroneously accepted, and
• false negatives (FN): target samples erroneously rejected.
The number of TP, TN, FP, and FN sums to 100% of the data-set. The accuracy with which target samples are assigned to their corresponding class is out of the scope of this work and depends on the accuracy of the selected multi-class classifier.
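The four outcomes above determine the binary-test metrics reported later (sensitivity, specificity, and accuracy). A minimal Python sketch, with invented counts purely for illustration:

```python
def rejection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Binary rejection-test metrics from the four outcome counts."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),   # fraction of target samples accepted
        "specificity": tn / (tn + fp),   # fraction of reject samples rejected
        "accuracy": (tp + tn) / total,   # fraction of all samples handled correctly
    }

m = rejection_metrics(tp=40, tn=35, fp=5, fn=10)
print(m["sensitivity"], m["specificity"])  # 0.8 0.875
```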
The multi-class classifiers considered in this paper (RF, N-NET and k-NN) do not natively implement a rejection option. Discarding reject samples by setting a single threshold on the class probability estimates is inaccurate, since class probability estimates show only small differences between target and reject samples (refer to the Methods section for specific details on how class probability estimates have been computed). However, this information can still be used for discrimination if coupled with well-tuned decision rules.
In order to perform a preliminary qualitative analysis of how class probability estimates change between target and reject samples, we performed a set of multi-class classification experiments generating different splits of the considered data-set (in terms of targets/reject samples and test/training data). For each split, the multi-class classifiers have been trained on a subset of the considered phenotypes, using the remaining data as a set of samples to reject. Figure reports, for each classifier, two density plots that show how the value of class probability estimates of target and reject samples distribute in the performed experiments. In the MAX plot the considered random variable is the highest class probability estimate of each classified sample, split into target samples (solid line) and reject samples (dashed line); in the DIFF plot the considered random variable is the difference between the two highest probability estimates of each sample, again considering target and reject samples. The density functions have been estimated from the experimental data by performing a Gaussian kernel density estimation using the density() command of R.
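The MAX and DIFF statistics and their kernel density estimates can be sketched as follows. The paper uses R's density() command; this illustrative Python version hand-rolls a Gaussian KDE with a normal-reference bandwidth (similar in spirit to, but not identical with, R's bw.nrd0 default), and the sample values are invented, not the paper's data.

```python
import math

def gaussian_kde(samples, bandwidth=None):
    """Gaussian kernel density estimate, analogous to what R's density() computes."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference (Silverman-style) bandwidth; R's bw.nrd0 differs slightly.
        mean = sum(samples) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                   for s in samples) / (n * bandwidth * math.sqrt(2 * math.pi))
    return density

# MAX statistic of target vs reject samples (illustrative values only)
target_max = [0.9, 0.85, 0.95, 0.8, 0.88]
reject_max = [0.4, 0.5, 0.45, 0.55, 0.35]
f_t = gaussian_kde(target_max)
f_r = gaussian_kde(reject_max)
print(f_t(0.9) > f_r(0.9), f_r(0.45) > f_t(0.45))  # True True
```

The DIFF statistic is handled identically, using max1 − max2 per sample instead of max1.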
Figure 1 Probability density plots. Probability density plots showing how the class probability estimates of target and reject samples distribute over a set of experiments. In the MAX plot the considered random variable is the highest class probability estimate of each classified sample.
Although the plots of Figure may seem to suggest a strong overlap between the distributions of target samples (solid lines) and reject samples (dashed lines), a certain amount of separation is still visible. This is particularly evident for RF, which shows a clear distinction in both the MAX and the DIFF plots. In particular, in the DIFF plot of RF, the target samples (solid line) peak around 0.8, far from the peak of the reject samples (dashed line), which falls around 0.1. This means that, for a target sample, the difference between the two top-ranked classes is very high (around 0.8 in most cases). Reject samples, instead, show a very low difference between the probability estimates of the two top-ranked classes, revealing the inability of the classifier to clearly select a target class. k-NN and N-NET show smaller separation; however, the experimental results will show that a partial discrimination is still possible.
From this preliminary analysis, it seems reasonable that, for the three considered classifiers, the class probability estimates provided by R could potentially be used for detecting reject samples. The idea exploited in this paper to design a set of decision rules for detecting reject samples is to split the MAX plot into three distinct areas: (i) max area, (ii) decision area, and (iii) reject area, delimited by two rejection thresholds Tmax and Trej (Tmax > Trej), as shown in Figure for RF.
Definition of the rejection thresholds in the case of RF. The MAX plot is split into three distinct zones: (i) max area, (ii) decision area, and (iii) reject area.
Figure describes the overall decision process applied to each sample that must be classified. max1 and max2 denote the two highest class probability estimates for the considered sample.
Decision Rule Pseudo-code. Pseudo-code describing the decision rules able to discriminate between target and reject samples based on the class probability estimates of the sample and on the computed rejection thresholds.
If the highest class probability estimate (max1) is lower than Trej, the sample falls in the reject area and is rejected to maximize the number of TN (Rule R1, Figure , rows 1-2). Similarly, if max1 is higher than Tmax, the sample falls in the max area and can be accepted to maximize the number of TP; the class with probability estimate equal to max1 is predicted (Rule R2, Figure , rows 4-5). The first part of this decision process is very similar to the single-threshold method proposed in [12].
Whenever neither R1 nor R2 is satisfied, i.e., max1 falls between Trej and Tmax (decision area), there are two possible conditions based on the analysis of the difference between max1 and max2 (DIFF plot of Figure ):
1. if max1 – max2 > Tdiff, the sample can be accepted and the class with probability estimate equal to max1 is predicted (Rule R3.1-Alg. 3, rows 7-8). Tdiff is the minimum difference between the probability estimates of the two top ranked classes that allows us to use max1 to perform a reliable classification;
2. if max1 – max2 < Tdiff, the value of max2 is considered. Two cases are possible:
• if max2 is higher than Trej, i.e., both max1 and max2 fall in the decision area, the prediction is considered uncertain (Rule R3.2-Alg. 3, rows 10-11). In this case the classifier does not produce any classification result. This rule prevents the system from providing a result when the distinction between two classes is not sufficient for correct discrimination. Alternatively, multiple classification results can be provided to alert the user that the confidence in the prediction is low;
• if max2 is lower than Trej, the sample is rejected (Rule R3.3-Alg. 3, row 13). This rule mitigates the noise in those samples that fall at the border of the reject area.
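Taken together, rules R1, R2, and R3.1-R3.3 form a single decision function. The snippet below is an illustrative Python rendering of the pseudo-code (the paper's implementation works within R); the class labels and threshold values are hypothetical, not the tuned ones produced by the CMA-ES.

```python
def decide(probs, t_max, t_rej, t_diff):
    """Apply the rejection decision rules to one sample's class probability estimates.

    probs: dict mapping class label -> probability estimate.
    Returns the predicted label, or "REJECT", or "UNCERTAIN".
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (c1, max1), (_, max2) = ranked[0], ranked[1]
    if max1 < t_rej:                  # R1: reject area
        return "REJECT"
    if max1 > t_max:                  # R2: max area
        return c1
    if max1 - max2 > t_diff:          # R3.1: clear winner despite decision area
        return c1
    if max2 > t_rej:                  # R3.2: two competing classes -> uncertain
        return "UNCERTAIN"
    return "REJECT"                   # R3.3: borderline sample rejected

# Illustrative thresholds (not the paper's tuned values)
T_MAX, T_REJ, T_DIFF = 0.8, 0.3, 0.25
print(decide({"ALL": 0.9, "CLL": 0.05, "FL": 0.05}, T_MAX, T_REJ, T_DIFF))   # ALL (R2)
print(decide({"ALL": 0.4, "CLL": 0.35, "FL": 0.25}, T_MAX, T_REJ, T_DIFF))   # UNCERTAIN (R3.2)
print(decide({"ALL": 0.2, "CLL": 0.15, "FL": 0.15}, T_MAX, T_REJ, T_DIFF))   # REJECT (R1)
```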
The three rejection thresholds (Tmax, Trej, and Tdiff) can be empirically chosen analyzing the density plots of Figure :
• if the MAX plot shows a clear separation between target and reject samples, Tmax can be placed in such a way to maximize the number of TP immediately detected by rule R2;
• similarly to Tmax, Trej can be placed to maximize the number of TN detected by rule R1;
• the definition of Tdiff is performed by looking at the DIFF plot. A good heuristic is to consider the point where the two curves intersect.
Manually setting the three thresholds is very complex and may easily lead to a high error rate. When the plots do not show a clear separation between target and reject samples, the choice of the thresholds involves a trade-off between increasing the sensitivity and lowering the specificity of the classifier. This is a complex optimization task.
All thresholds must be set up considering only information extracted from the training data. To tackle the complexity of this process, and to allow the automatic tuning of all rejection parameters, a threshold setup algorithm based on a Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) has been developed. The full description of this algorithm is available in the Methods section.
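The Methods section describes the actual CMA-ES. Purely to convey the idea of evolutionary threshold tuning, the sketch below uses a much simpler (1+λ) evolution strategy with a scalar step size (no covariance matrix adaptation), and a hypothetical quadratic objective in place of the paper's SS objective functions.

```python
import random

def evolve_thresholds(objective, generations=200, lam=8, seed=0):
    """Toy (1+lambda) evolution strategy for tuning (Tmax, Trej, Tdiff).

    Far simpler than the CMA-ES used in the paper, but it illustrates the
    loop: mutate candidate thresholds, keep the best, adapt the step size.
    """
    rng = random.Random(seed)
    parent = [0.7, 0.3, 0.2]          # initial (Tmax, Trej, Tdiff) guess
    best = objective(parent)
    sigma = 0.1                        # mutation step size
    for _ in range(generations):
        improved = False
        for _ in range(lam):
            child = [min(1.0, max(0.0, p + rng.gauss(0, sigma))) for p in parent]
            if child[0] <= child[1]:   # enforce Tmax > Trej
                continue
            score = objective(child)
            if score < best:
                parent, best, improved = child, score, True
        sigma *= 1.1 if improved else 0.9   # 1/5th-success-style adaptation
    return parent, best

# Hypothetical objective: squared distance from some "ideal" threshold triple
ideal = [0.85, 0.25, 0.15]
obj = lambda t: sum((a - b) ** 2 for a, b in zip(t, ideal))
thresholds, score = evolve_thresholds(obj)
print([round(t, 2) for t in thresholds])
```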
Architecture of the multi-class classifier with rejection option
The proposed decision rules can be easily integrated within the multi-class classification flow provided by R or other similar computational environments. Figure shows the computational flow of the resulting system. As usual when working with classification algorithms, a training set containing known samples is used to train the prediction model, which is then used to classify a set of test (unknown) samples.
Computational flow. Computational flow to setup a multi-class classification system with a rejection option based on the proposed decision rules.
Compared to a traditional multi-class classification flow, the proposed system includes two additional modules required to: 1) setup the rejection thresholds, and 2) apply the decision rules.
Setting up the rejection thresholds requires collecting a statistically significant set of class probability estimates for both target and reject samples on which to compute the density plots of Figure . At this stage, in which the model is trained and real reject samples are not available, this information can only be collected starting from samples in the training set by setting up several cross-validation experiments on different folds of these training data. Figure outlines the way this module operates. Let us denote with T the full training set and with Ti a portion of it including only those samples related to a specific phenotype. If k classes are included in T, k subsets of experiments are generated by iteratively excluding one of the classes Ti from T to form a new target class T′. The removed samples are used as a set of reject samples denoted with R (Figure , rows 1-3).
Pseudo-code description of the threshold setup process.
For each subset (Figure , row 4), m folds are generated by removing x samples from each subclass contained in T′, and x samples from R. Each fold therefore generates a test-set (Test*) of x · (k – 1) target samples and x reject samples. To avoid overoptimistic results, target samples of the test-set are removed from T′, forming a new training set Train* (Figure 5, rows 5-8). Each fold is then used for an independent multi-class classification experiment to obtain the class probability estimates of each test sample in Test*. After running all k · m experiments, the CMA-ES analyzes the collected probability estimates and provides a set of optimal thresholds (refer to the Methods section for a complete description of this step).
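The leave-one-class-out fold generation described above might be sketched as follows. This is an illustrative Python version with an invented toy data-set; function and variable names are hypothetical, not from the paper's code.

```python
import random

def make_rejection_folds(train_by_class, m=3, x=2, seed=0):
    """Simulate reject samples from the training set, as in the threshold setup step.

    train_by_class: dict phenotype -> list of samples.
    Each class plays the reject role (R) once; for each role, m folds are
    drawn with x test samples per remaining class and x reject samples.
    """
    rng = random.Random(seed)
    folds = []
    for reject_class, reject_pool in train_by_class.items():
        target = {c: s for c, s in train_by_class.items() if c != reject_class}
        for _ in range(m):
            test, train = [], {}
            for c, samples in target.items():
                picked = rng.sample(samples, x)
                test += [(s, c) for s in picked]                 # target test samples
                train[c] = [s for s in samples if s not in picked]
            test += [(s, None) for s in rng.sample(reject_pool, x)]  # reject samples
            folds.append((train, test))
    return folds

# Tiny illustrative data-set: 3 phenotypes, 5 samples each
data = {p: [f"{p}{i}" for i in range(5)] for p in ["ALL", "CLL", "FL"]}
folds = make_rejection_folds(data, m=2, x=1)
print(len(folds))  # 3 classes x 2 folds = 6
```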
Validation and discussion
The proposed rejection rules have been tested on different groups of experiments based on different configurations of the considered data-set in terms of target and reject samples. The rejection accuracy has been compared with the one of a set of selected one-class classifiers. Five one-class classification algorithms have been considered in this comparison:
• Parzen one-class classifier (Parzen-OC),
• k-NN one-class classifier (k-NN-OC),
• K-Means one-class classifier (k-Means-OC),
• Principal Component Analysis (PCA) one-class classifier (PCA-OC), and
• SVDD, a SVM based one-class classifier (SVM-OC).
All one-class classifiers have been implemented and optimized using Matlab's DD_tools [29], a standard implementation used in several publications on microarray analysis [19].
Two methods of using one-class classifiers have been considered. Suppose the target class includes k different subclasses (phenotypes). The one-class classification problem can be solved either by training k different one-class classifiers, one for each subclass (a sample is rejected only if all k classifiers reject it), or by training a single one-class classifier with samples of all k subclasses. The first approach will be referred to as Multi One-Class with Voting (MOCV), the second as Single One-Class with Multiple Targets (SOCMT).
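The difference between the two configurations can be sketched with a deliberately simple stand-in one-class model. The centroid-distance classifier below is not one of the DD_tools models (Parzen, SVDD, etc.); it is an invented toy that only reproduces the accept/reject interface, enough to show why MOCV and SOCMT can disagree.

```python
def centroid_one_class(train, radius):
    """Toy one-class classifier: accept samples within `radius` of the training centroid."""
    center = [sum(xs) / len(xs) for xs in zip(*train)]
    def accepts(sample):
        dist = sum((a - b) ** 2 for a, b in zip(sample, center)) ** 0.5
        return dist <= radius
    return accepts

def mocv(subclass_sets, radius):
    """Multi One-Class with Voting: one model per subclass; reject only if all reject."""
    models = [centroid_one_class(s, radius) for s in subclass_sets]
    return lambda sample: any(m(sample) for m in models)

def socmt(subclass_sets, radius):
    """Single One-Class with Multiple Targets: one model over the pooled subclasses."""
    pooled = [s for subset in subclass_sets for s in subset]
    return centroid_one_class(pooled, radius)

# Two well-separated subclasses in 2-D
a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
accept_mocv = mocv([a, b], radius=1.0)
accept_socmt = socmt([a, b], radius=1.0)
print(accept_mocv((0.05, 0.05)))   # True: close to subclass a's centroid
print(accept_socmt((0.05, 0.05)))  # False: far from the pooled centroid near (2.5, 2.5)
```

With well-separated subclasses, the pooled centroid of SOCMT lies in a region containing no target samples at all, which is exactly the situation where per-subclass voting behaves differently.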
Four groups of experiments, each denoted as Gk (k ∈ [3, 6]), have been generated. In group Gk the target class includes k out of the 7 available phenotypes, while samples in the remaining 7 – k classes must be rejected. For each group, different random combinations of the k classes included in the target profile are considered, and for each combination several random splits of data into test- and training-set are generated, for a total of 40 experiments per group. In all experiments, the test-set includes a balanced number of target and reject samples.
For each experiment data are classified with MOCV, SOCMT, and the three considered multi-class classifiers paired with the decision rules presented in this paper. Rejection thresholds have been automatically computed according to the process described in Figure .
Table summarizes the results of the classification. For each classifier (rows) and each group of experiments (column groups), it provides the average sensitivity (Sens) and specificity (Spec) with the corresponding Confidence Intervals (CI) computed at a 95% Level of Confidence (LOC). RF coupled with the decision rules is the classifier that performs best overall with respect to any other available option (its performance is highlighted in bold in Table ). The other two multi-class classifiers (N-NET and k-NN) are also comparable to, or in some cases better than, the one-class classifiers. This result can be better appreciated in Table , which reports the average accuracy improvement of the proposed approach over the two possible configurations of one-class classifiers. Accuracy is computed as the percentage of samples correctly classified (TP + TN) over the total amount of classified samples. Averages are computed over the 40 experiments of the corresponding group. Looking at the performance of RF (highlighted in bold), one can notice a significant improvement in accuracy in all experiments compared to one-class classifiers. The table also highlights how the other two multi-class classifiers have performances comparable to one-class classifiers in most of the experiments.
A final confirmation of the improvement introduced by the presented approach can be appreciated in the Receiver Operating Characteristic (ROC) curves of Figure . For each group of experiments, the related ROC curve compares the average performance of the three multi-class classifiers coupled with the decision rules against the two best one-class classifiers (Parzen-OC and SVM-OC). For the multi-class classifiers, the ROC curve is plotted by changing the values of the three rejection thresholds in order to explore the space of possible solutions as fully as possible; for the one-class classifiers, it is obtained by changing the considered rejection rate. In all experiments RF improves on the accuracy of the one-class classifiers, while k-NN and N-NET provide an accuracy comparable to that of the one-class classifiers. This result is obtained using a very simple computational model compared to the one required to set up a one-class classification model.
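The threshold-sweeping idea behind these ROC curves can be sketched in Python. For simplicity the sketch sweeps a single threshold on the MAX statistic (the paper sweeps all three rejection thresholds to explore the solution space), and the MAX values are invented for illustration.

```python
def roc_points(target_max, reject_max, steps=10):
    """Trace ROC points by sweeping one rejection threshold over [0, 1]."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        tp = sum(1 for v in target_max if v >= t)   # target samples accepted
        tn = sum(1 for v in reject_max if v < t)    # reject samples rejected
        sens = tp / len(target_max)
        spec = tn / len(reject_max)
        points.append((1 - spec, sens))             # (FPR, TPR) pair
    return points

# Illustrative MAX statistics (targets score high, rejects low)
targets = [0.9, 0.8, 0.85, 0.7, 0.95]
rejects = [0.2, 0.4, 0.3, 0.5, 0.1]
pts = roc_points(targets, rejects)
print(pts[0], pts[-1])  # (1.0, 1.0) at t=0 ... (0.0, 0.0) at t=1
```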
ROC curves. ROC curves comparing the best one-class classifiers with the multi-class classifiers coupled with the decision rules.
All proposed results have been obtained by computing the rejection thresholds using the CMA-ES with the SS1 objective function (see the Methods section). The diagram of Figure shows the average accuracy obtained during the cross-validation experiment used to set up the rejection thresholds. Results are reported for RF coupled with the rejection rules, considering the different CMA-ES objective functions. The dashed diagonal lines represent iso-accuracy lines, with accuracy decreasing from the top-left corner to the bottom-right corner. The graph shows that the three functions SS, SS1, and SS2 provide the best accuracy, with SS1 providing the best results.
CMA-ES Objective functions performances for RF. ROC graphs comparing the accuracy of RF while computing the rejection thresholds with different objective functions.
The value of the objective function associated with the thresholds computed by the CMA-ES can be used as an indicator of the reliability of the trained model. Whenever the objective function equals 0, discrimination between target samples and reject samples is perfect; values greater than 0 indicate reduced accuracy. This is confirmed by the results of Table , which reports, for each classifier and each group of experiments, the average accuracy and the average value of the SS1 objective function associated with the computed thresholds. The numbers clearly show that RF, the classifier with the best accuracy, has a lower value of the objective function compared to k-NN and N-NET, thus suggesting a more reliable model.
Reliability of the trained model