It is well known that almost all published studies present positive research results, as outlined by Kyzas et al. [1] for the special case of prostate cancer. In the case of microarray studies, which often focus on the identification of differentially expressed genes or the construction of outcome prediction rules, this means that almost all studies report at least a few significantly differentially expressed genes or a small prediction error, respectively.
According to Ioannidis [2], "[...] most published research findings are wrong". This may be partly due to the editorial policy of many journals, which accept almost exclusively papers presenting positive research results (with the possible exception of recent initiatives like the Journal of Negative Research Results in Medicine). Authors are thus virtually urged to "find something significant" in their data, which encourages the publication of wrong research findings due to a variety of technical and statistical pitfalls. Microarray studies are especially subject to such mechanisms and are known to yield "noise discovery" [3].
Technical challenges that particularly affect microarray studies include, e.g., technical errors in the lab and problems with image analysis and normalization. Statistical pitfalls and biases of studies on microarray-based prediction are equally diverse. A problem well covered in the literature is the "small n, large p" dimensionality problem (also referred to as "n ≪ p", i.e. fewer observations than variables). In univariate analyses for identifying differentially expressed genes, the multiple testing problem resulting from high dimensionality can be addressed, e.g., by means of approaches based on the false discovery rate [4]. In the context of microarray-based prediction, another important statistical pitfall is incomplete cross-validation (CV), as pointed out by numerous authors [6]: if the selection of relevant variables is performed before cross-validation using all available observations, the test observations of each CV iteration have already influenced which variables enter the classifier, so the cross-validated error rate is optimistically biased. Most recent studies take this important point into account, either by repeating variable selection within each CV iteration or by using class prediction methods involving an intrinsic variable selection step, like the Lasso [11]. Hence, we do not address the problem of incomplete CV again in the present article.
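For readers who have not met incomplete CV before, the effect is easy to reproduce on pure noise. The following is a minimal sketch, not the procedure of any cited study: it assumes a simulated data set with no class signal, a simple mean-difference gene ranking, and a nearest-centroid classifier, all chosen for illustration only (the names `select_top_genes`, `nearest_centroid_predict` and `cv_error` are hypothetical).

```python
import numpy as np

def select_top_genes(X, y, k):
    """Rank genes by absolute class-mean difference scaled by the overall SD."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(-diff / (X.std(axis=0) + 1e-12))[:k]

def nearest_centroid_predict(X_train, y_train, X_test):
    """Assign each test observation to the class with the closer centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_error(X, y, k_genes, folds, select_inside):
    """CV error with gene selection inside each fold (complete) or once on all data (incomplete)."""
    n = len(y)
    genes_all = select_top_genes(X, y, k_genes)  # uses ALL observations, test folds included
    wrong = 0
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        X_tr, y_tr = X[train_mask], y[train_mask]
        genes = select_top_genes(X_tr, y_tr, k_genes) if select_inside else genes_all
        pred = nearest_centroid_predict(X_tr[:, genes], y_tr, X[test_idx][:, genes])
        wrong += int((pred != y[test_idx]).sum())
    return wrong / n

# Assumed toy setting: 40 samples, 2000 noise "genes", balanced labels, no true signal.
rng = np.random.default_rng(0)
n, p = 40, 2000
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)
folds = np.array_split(rng.permutation(n), 5)  # same 5 folds for both schemes

err_incomplete = cv_error(X, y, k_genes=10, folds=folds, select_inside=False)
err_complete = cv_error(X, y, k_genes=10, folds=folds, select_inside=True)
print(f"incomplete CV error: {err_incomplete:.2f}")
print(f"complete CV error:   {err_complete:.2f}")
```

Because the data contain no signal, an honest estimate should be close to 50%. In this sketch only the scheme that repeats gene selection inside each fold is expected to come near that value, whereas selecting the genes once on all observations typically yields a much smaller, misleading error rate.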
The reported classification error rate can also be lowered artificially by selecting the values of tuning parameters a posteriori, i.e. on the basis of the computed CV error rates. In doing so, one selects the "best" version of a classifier and evaluates it using the same data, which leads to an underestimation of the error rate (named "bias source I" in the present paper). A quantitative study on this topic can be found in [12]. Note that the problem of optimal parameter selection affects not only microarray research but also classical medical studies based on conventional low-dimensional predictors, although probably not as dramatically. A parameter that is especially crucial in the analysis of high-dimensional data is the number of selected variables (if variable selection is performed). In many studies, it is chosen a posteriori based on the CV results, thus inducing a bias in the reported error rate. Another source of bias (named "bias source II" in our paper), which is related to but more global than optimal parameter selection, is the optimal selection of the classification method itself from the wide range of classifiers available today for the analysis of microarray data (e.g. support vector machines, random forests or L2-penalized logistic regression). This again is an issue that can, in principle, be encountered in all types of medical studies, but it affects microarray studies more drastically. Whereas standard statistical approaches (for instance logistic regression for class prediction problems) have become the methodological "gold standard" in conventional medical statistics and allow a comparatively fair evaluation of research results, the field of microarray data analysis is characterized by the lack of benchmark standard procedures and by a large and heterogeneous set of methods, ranging from adaptations of standard statistical procedures to computer-intensive approaches adopted from machine learning, whose respective merits and pitfalls remain partly unexplored. This is particularly true for studies involving class prediction problems, i.e. when the goal is to derive a classification rule for predicting the class membership (typically the disease outcome) of patients based on their microarray transcriptome data.
In this context, if the sample is not large enough to set aside a validation data set, it is common practice to evaluate the performance of classifiers using techniques like cross-validation (CV), including leave-one-out (LOO) CV as a special case, repeated splitting into learning and test data sets, or bootstrap sampling. See the methods section for more details on the cross-validation technique used here and [13] for an extensive review of cross-validation and resampling techniques in general. However, it is not sufficient to use correct methods together with a correct internal CV scheme. Evaluating several classification methods in cross-validation and then reporting only the CV results obtained with the classifier yielding the smallest error rate is an incorrect approach [14], because it induces an optimistic bias. In their list of dos and don'ts, Dupuy and Simon [14] recommend that authors "report the estimates for all the classification algorithms if several have been tested, not just the most accurate." They discourage optimizing the choice of the classification algorithm based on the obtained results.
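The reason why reporting only the smallest CV error rate is optimistic can be stated in one line. The notation below is introduced here for illustration and is not taken from the cited references: let \hat{e}_1, ..., \hat{e}_K denote the CV error estimates of K candidate classifiers (or tuning parameter values), and assume for simplicity that each estimate is unbiased for the corresponding true error rate e_j.

```latex
\[
  \min_{k} \hat{e}_k \;\le\; \hat{e}_j \quad \text{for every } j
  \qquad\Longrightarrow\qquad
  \mathbb{E}\Big[\min_{k} \hat{e}_k\Big]
  \;\le\; \min_{j} \mathbb{E}\big[\hat{e}_j\big]
  \;=\; \min_{j} e_j .
\]
```

The inequality is strict as soon as the estimates are random and it is not always the same classifier that attains the minimum. Under this assumption, the reported minimum therefore underestimates on average even the error rate of the truly best candidate, and the gap tends to grow with the number K of candidates tried.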
In this article, we empirically investigate the consequences of such an optimization. We report the results of an experiment that allows us to quantify the optimistic bias induced by optimal tuning parameter selection (bias source I) and optimal selection of the classification method (bias source II) in a realistic setting based on original microarray data. After we have illustrated the drastic effect of optimal classifier selection, we discuss alternative ways to report results of class prediction studies when no validation set is available, and give suggestions for good scientific practice in this context.
In our experiment, we compute the misclassification rate of a total of 10 commonly used classification methods (k-nearest-neighbors, linear discriminant analysis, Fisher's discriminant analysis, diagonal linear discriminant analysis, partial least squares followed by linear discriminant analysis, neural networks, random forests, support vector machines, shrunken centroid discriminant analysis and L2-penalized regression) based on cross-validation. Some of these 10 classification algorithms are combined with preliminary variable selection and/or used successively with different plausible tuning parameter values. The aim is to investigate the different sources of bias resulting from optimal selection (optimal choice of tuning parameters including gene selection and optimal choice of the classification method) and their relative importance. All the considered procedures are classical approaches, most of which have already been used in published medical studies. For the sake of reproducibility, all our analyses are based on the freely available Bioconductor package CMA version 0.8.5 [15], which is described extensively in [16].
The classifiers are applied to original and modified data sets, some of which are obtained by permuting the class labels of real microarray data sets. We then assess the minimal misclassification rate over the results of the different variants of classifiers in order to quantify the bias arising when the optimal classification method and/or its tuning parameters are selected a posteriori in a data-driven manner. The permutation of the class labels is used to mimic data sets under the global null hypothesis that none of the genes are differentially expressed with respect to the response class. This approach thus provides non-informative microarray data that, however, preserve their realistic correlation structure, and can serve as a "baseline" to quantify the bias induced by optimal classifier selection.
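The permutation baseline described above can be sketched as follows. This is an illustrative toy version only, not the CMA-based analysis of the article: the expression matrix is simulated with a shared latent factor as a hypothetical stand-in for real, correlated microarray data, and the classifier variants are limited to k-nearest-neighbors with a few assumed values for the number of selected genes and for k (the function names are likewise hypothetical).

```python
import numpy as np

def top_genes(X, y, n_genes):
    """Rank genes by absolute class-mean difference scaled by the overall SD."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(-diff / (X.std(axis=0) + 1e-12))[:n_genes]

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training observations (k odd, two classes)."""
    dist = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def cv_error(X, y, n_genes, k, folds):
    """Complete CV: gene selection is repeated inside every fold."""
    wrong = 0
    for test_idx in folds:
        train = np.setdiff1d(np.arange(len(y)), test_idx)
        genes = top_genes(X[train], y[train], n_genes)
        pred = knn_predict(X[train][:, genes], y[train], X[test_idx][:, genes], k)
        wrong += int((pred != y[test_idx]).sum())
    return wrong / len(y)

rng = np.random.default_rng(42)
n, p = 40, 500
# Hypothetical stand-in for real data: a shared latent factor makes the genes correlated.
X = 0.5 * rng.normal(size=(n, 1)) + rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)

# Assumed classifier variants: (number of selected genes, k of kNN).
variants = [(g, k) for g in (10, 50, 200) for k in (1, 3, 5)]

min_errs, mean_errs = [], []
for _ in range(20):
    y_perm = rng.permutation(y)  # breaks any gene/class association; X is left untouched
    folds = np.array_split(rng.permutation(n), 5)
    errs = [cv_error(X, y_perm, g, k, folds) for g, k in variants]
    min_errs.append(min(errs))            # what a posteriori selection would report
    mean_errs.append(float(np.mean(errs)))

avg_min = float(np.mean(min_errs))
avg_mean = float(np.mean(mean_errs))
print(f"average error over all variants:  {avg_mean:.3f}")
print(f"average of the minimal error:     {avg_min:.3f}")
```

Since the permuted labels carry no information, any classifier has a true error rate of 50% on these balanced data. The difference between the two printed averages gives a rough impression of how much reporting only the best variant lowers the error rate, here under the toy assumptions stated above.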