Our procedure is explicitly targeted to select very small sets of genes, and has been shown [1
] to have a classification error rate on-par with other, state-of-the-art, classification procedures. Additionally, our programs allow the exploratory usage of random forest for identifying large subsets of genes potentially relevant for class prediction. In contrast to other tools, such as M@CBETH [17
], we are not restricted to two-class problems.
To avoid underestimating the error rate of the classification procedure, we use the bootstrap (the 0.632+ approach of [27
]). As in [1
], we bootstrap the complete procedure, including selecting the classifier with minimal out-of-bag error rate (thus, this is a "full" or "double" bootstrap procedure, sensu [5
]), and thus our estimates of error rate are not affected by selection biases. This contrasts, for instance, with Prophet [18
], where the error rate reported is that of the classifier with the smallest cross-validated error rate. Based upon the bootstrap results, we also show the average out-of-bag predictions for each sample, allowing to easily asses poorly predicted samples and potential outliers. There are other tools available for performing cross-validation and bootstrap of classification methods, such as the R package ipred [28
] by A. Peters and T. Hothorn, the BioConductor package MCRestimate [29
] by M. Ruschhaupt, U. Mansmann, P. Warnat, W. Huber and A. Benner, specifically targeted to computing misclassification error rates combining the gene selection and classification steps, or the caGEDA web application [10
] that incorporates bootstrap, leave-one-out, and random resampling validation of several classifiers. Our approach, however, has been tailored to our own variable selection procedure and has been parallelized. A unique feature of GeneSrF and varSelRF are their emphasis on examining possible multiple solutions.
Since we obtain 200 resamples in the process of bootstrapping (see above) there is little added computational cost to providing analysis of stability and multiplicity of solutions. We report the number of genes selected and the identity of the individual genes selected in the original sample and the 200 bootstrap runs, including frequencies of every gene selected in the solutions. Moreover, the biological interpretation of the results is enhanced by the access to additional information. If the input file contains gene identifiers for either human, mouse, or rat genomes (in the form of Affymetrix IDs, Clone IDs, GenBank Accession numbers, Ensembl Gene IDs, Unigene clusters, or Entrez Gene IDs), for each gene in the results, the web-based application provides a link to IDClight [11
], which allows the user to obtain additional information, including mapping between gene and protein identifiers, PubMed references, Gene Ontology terms, and KEGG and Reactome pathways. The multiple solutions can be further studied by sending sets of selected genes to our tool PaLS [30
] to examine PubMed references, Gene Ontology terms, KEGG pathways, or Reactome pathways that are common to a user-selected percentage of genes or lists (bootstrap solutions). A fully commented example of the output is provided in the on-line help [24
Finally, GeneSrF is one of the very few tools for the analysis of gene expression data that uses parallelization and, as far as we know, the only web-based tool to use parallelization for gene selection and classification. This is an important methodological novelty, as we can no longer expect that increases in single-CPU speed will allow us to analyze larger data sets in shorter time: the rate of increase in CPU speed has slowed down considerably in the last five years but, in contrast, increasing numbers of CPU cores (either in individual machines – including laptops – or via off-the-shelf computing clusters) are becoming much more affordable [13
]. Thus, further decreases in user wall time (time to wait for a result) and ability to tackle more complex problems will depend on our ability to use parallel, distributed, and concurrent programming. GeneSrF therefore represents a case example on combining parallel computing with a user-friendly web-based application for the analysis of gene expression data and, by making the full source code available, allows other researchers to build upon our developments.
Future work focuses on extending the software to use random forest-related techniques applicable to heterogeneous types of variables such as addition of categorical data [31
] and other clinical information. As well, we are exploring alternative mechanisms and languages for parallelizing and distributing computations, and we are rewriting most of the code using Pylons [32
], a Python web framework, to try to simplify installation of the web-based application. Installation now involves several steps (see [33
]), and the most time consuming are setting up and verifying the LAM/MPI environment, and using the correct paths in files involved in controlling the MPI environment and executing and controlling R.