We begin by demonstrating that the performance of the population of predictors improves on each iteration of the evolutionary algorithm. Figure shows the average scores in each iteration from several trials. The graph shows that the average score increases more rapidly over the first few generations in comparison with the final generations. The evolutionary algorithm typically converges and terminates in less than 50 generations. The runs terminate on different iterations depending upon when the termination condition is met.
In the baseline test, the evolutionary algorithm is run on the entire set of genes (without applying any feature selection method) in order to obtain a baseline measure of performance.
The baseline system is evaluated on the 7,070 genes of the leukemia dataset. The initial predictors in population are built by randomly selecting 10 genes to be an initial feature of the predictors. This means the evolutionary algorithm has to search for 10 optimal predictive genes set from the

possible subsets. Performance of the predictors is evaluated using KNN classifier to determine the LOOCV on training samples. After the best predictor is found in each generation, it will be tested again on test samples to give the performance based on an out-of-sample estimation. The KNN classifier classifies each sample in the test data using a database of all training samples.
Table reports the maximum and the average accuracy of the baseline system on 38 training samples and 34 test samples. The results show that the evolutionary algorithm gives predictors with perfect classification on the training samples but those predictors do not classify the test data well. The average accuracy on test data is 68% at best while the average accuracy on training data is up to 98%. Table indicates that population size may be a more important factor than feature size for the baseline system.
| Table 1The accuracy of the baseline system built by randomly selecting genes from 7,070 genes in the leukemia dataset. |
Building an informative initial gene pool by the RankGene software
We determine the best choice for the number of initial features, i.e. the initial gene pool, to be 100 by experiment (by exploring the performance of the classifier using the information gain ranking method for feature selection). Based on this result, the performance of the following six ranking methods for a range of population and feature sizes is investigated: R1. Information gain; R2. Twoing rule; R3. Gini index; R4. Sum minority; R5. Max minority; R6. Sum of variances. The details of each method can be found at
http://genomics10.bu.edu/yangsu/rankgene/.
Leukemia dataset
The parameters of the evolutionary classifier are evaluated on the set of 38 training samples and the 34 test samples of the leukemia data – as described in the original work. We explore: population size {10, 30, 50} feature size {30, 50} and initial gene pool 100.
The classification results are summarised in table . When a predictor with 100% accuracy is learned in one or more of the test runs we indicate this by (*). Average accuracy over ten trials lies in the range 92–98%. Using feature selection, the accuracy on the test data is more than 19% greater than that of the baseline system. For a given feature selection method, the prediction accuracy on the test samples varies up to 3% across the range of parameter settings surveyed. The choice of feature selection method contributes up to 5% for a given set of algorithm parameters. Information gain consistently gives the best performance on the leukemia dataset.
| Table 2The average accuracy using out-of-sample prediction on the 34 leukemia test samples. The symbol (*) means that there is some perfect predictors found by the algorithm. The highest accuracy is written in bold. |
The results of the confirmation study are shown in table , where feature selection improves accuracy by at least 27%: The accuracy on the training data in the baseline system (without feature selection) is 85.26% which is significantly greater than the test accuracy of 67.06%. A variation of 4.4% is observed in average accuracy across the RankGene methods. The best RankGene method is information gain. In comparison with feature selection by information gain, all other methods have a significantly lower accuracy on the testing data than information gain, while there are no differences in performance on the training data across the other RankGene methods.
| Table 3The average accuracy of the GA/KNN classifier using out-of-sample prediction on the 34 leukemia test samples. |
NCI60 dataset
The aim of this investigation is to find the best parameters and ranking method for the evolutionary classifier when applied to the NCI60 data. The same range of parameters is surveyed as for the leukemia data, and again the performance of six feature selection methods is evaluated.
Due to the very small sample size of the NCI60 dataset, it is not possible to divide the data into training and testing sets. Thus, the accuracy of predictors in table is given by using the LOOCV error rate estimation on the whole dataset. To get a more reliable performance of the evolutionary algorithm on the NCI60 dataset, the .632 bootstrap estimator will be used.
| Table 4The average accuracy on 5 sets of parameters and six ranking methods on the NCI60 data. No perfect predictors are found by the algorithm. The highest accuracy is written in bold. |
The best classification score on the NCI60 data is 76.23%, and was obtained using information gain, with a population and feature size of 30. No predictor learned in any run of the system was able to classify all data 100% correctly.
Discrimination method
The frequency of selection of the genes that are members of the best predictor across 100 independent trials is assessed in order to determine the reproducibility of the results. If a gene is consistently preferentially chosen as a member of a predictive set it would suggest that the gene selection operation is reproducible – despite the random initialisation.
Z-score analysis is one means to determine the significance of the observed frequency of an event against that which might have occurred by chance. This calculation normalises the frequency with which each of the initial genes was selected in all predictors that classify the training and test data perfectly [
9]. The Z score can be calculated using equation (1).
Where Si denotes the number of times genes i was selected, E(Si) is the expected number of times for gene i being selected, and σ denotes the square root of the variance. The calculation of E(Si) is as follows: let A number of perfect predictors found in the experiment, Pi = (number of genes) / (number of genes in the initial gene pool). Then, E(Si) = Pi * A.
The evolutionary classifier was run 100 times on the leukemia dataset using the best set of parameters: 100 initial genes constructed by the information gain and feature size = 50. There are 43 predictors (ignoring the duplication) that classify all training and test data correctly. Figure shows a plot of Z-score applied to the top ranked genes that are most frequently selected. The top 24 genes have a similar Z-score and the top 55 have a positive Z-score. In this case, it seems reasonable to choose the 55 top-ranked gene as the most discriminative genes.
To confirm the predictive power of the 55 top-ranked genes we constructed a classifier using these genes as features. As before, we measure the accuracy on the test set using the training set as a case-base. An accuracy of 100% is obtained.
Additional file:
1 shows the 55 top-ranked genes, their frequency of occurrence in the 43 best predictors and the Z-score value. This set of genes will be used in the further evaluation using .632 bootstrap estimator.
The Z-score analysis was repeated on the NCI60 data and these results are shownin Figure . It can be seen that the Z-score of the 8th ranked gene is half that of the top ranked gene and that the top 40 genes have a positive Z-score. Additional File
2 lists the top 40 genes. A classifier constructed using these genes has an accuracy of 73.8% by LOOCV, and 68.2% by a bootstrap estimate.
.632 bootstrap estimates
The .632 bootstrap method involves sampling the original dataset (with replacement) to obtain a new resampled set on which the classification error
δ is measured [
8]. The accuracy
ε on the samples omitted from the resampled set is also determined. The bootstrap estimator
ab632 for a dataset of
n samples requires
n resampled datasets to be constructed and the classification error calculated according to equation (2). The final bootstrap estimate is the average value of
ab632 over
b iterations of the procedure (we take
b = 200).
On the leukemia data, the .632 bootstrap estimate for the accuracy of the evolutionary algorithm with population size 10 and feature size 50 is 96.40%. The .632 estimate for a classifier based on a fixed set of 55 top-ranked genes is 96.85%.
On the NCI60 data, the .632 bootstrap estimate for the accuracy of the evolutionary algorithm with population size 30 and feature size 30, is 59.19%. The .632 estimate for a classifier based on 40 top-ranked genes is 68.18%.