We developed a classification model for early detection of PCA on the basis of SELDI-TOF MS data using DF. DF is an ensemble method in which each prediction is the mean of the predictions of all the DT models combined to construct the DF model. The idea of combining multiple DT models implicitly assumes that a single DT model cannot completely represent the important functional relationships between the predictor variables (m/z peaks in this study) and the associated outcome variables (PCA in this study), and thus that different DT models can capture different aspects of the relationship for prediction. Given that some degree of noise is always present in omics data, optimizing a single DT model inherently risks overfitting the noise. DF minimizes overfitting by maximizing the difference among the individual DT models, which is achieved by constructing each individual DT model from a distinct set of predictor variables. Noise cancellation and the corresponding signal enhancement are apparent when comparing the results from DF and DT: DF outperformed DT in all statistical measures in the 2,000 L10O runs. Whether DT performs better than other similar classification techniques depends on the application domain and the effectiveness of the particular implementation. However, Lim and Loh (1999) compared 22 DT methods with nine statistical algorithms and two artificial neural network approaches across 32 data sets and found no statistical difference among the methods evaluated. Thus, the superior performance of DF over DT implies that the unique ensemble technique embedded in DF could also be superior to some other classification techniques for class prediction using omics data.
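The combination scheme can be sketched with a toy example. Here each member model is a one-split decision stump (a stand-in for a full DT), each member is grown on a distinct predictor variable as DF requires, and the ensemble prediction is the mean of the member predictions. The data, the stump-fitting routine, and all names are hypothetical illustrations, not the actual DF implementation.

```python
# Toy sketch of the DF combination idea: each member model is grown on a
# DISTINCT predictor variable, and the ensemble prediction is the mean of
# the individual member predictions.

def fit_stump(X, y, feature):
    """Find the threshold/polarity on one feature that best fits y
    (a one-split stand-in for a full decision tree)."""
    best_acc, best = -1.0, None
    for thr in {row[feature] for row in X}:
        for pol in (1, -1):
            pred = [1 if pol * (row[feature] - thr) > 0 else 0 for row in X]
            acc = sum(p == t for p, t in zip(pred, y)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (feature, thr, pol)
    return best

def predict_stump(model, row):
    feature, thr, pol = model
    return 1 if pol * (row[feature] - thr) > 0 else 0

def forest_predict(models, row):
    # DF-style combination: mean of the member predictions
    return sum(predict_stump(m, row) for m in models) / len(models)

# Hypothetical training data: two predictor variables, binary outcome.
X = [(1.0, 0.2), (1.5, 0.4), (3.0, 0.9), (3.5, 1.1)]
y = [0, 0, 1, 1]
models = [fit_stump(X, y, 0), fit_stump(X, y, 1)]  # distinct variables
```

A mean near 0 or 1 indicates that the member models agree; a mean near 0.5 flags a sample the members disagree on.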
Combining multiple DT models to produce a single model has been investigated for many years (Bunn 1987; Clemen 1989; Zhang et al. 2003). A major focus has been the evaluation of different ways to develop the individual DT models to be combined, all of which have been reported to improve ensemble predictive accuracy. One approach is to grow individual DT models on different portions of samples randomly selected from the training set using resampling techniques. However, resampling with a substantial portion of samples (e.g., 90%) tends to produce individual DT models that are highly correlated, whereas a smaller portion (e.g., 70%) tends to produce individual DT models of lower quality. Either highly correlated or lower-quality individual DT models can reduce the benefit of combining that might otherwise be realized. The individual DT models can also be generated using more robust statistical resampling approaches such as bagging (Breiman 1996) and boosting (Freund and Schapire 1996). However, boosting, which uses a function of performance to weight incorrect predictions, is inherently at risk of overfitting the noise associated with the data, which can result in worse predictions from the ensemble model (Freund and Schapire 1996). Another approach to choosing an ensemble of DT models centers on random selection of predictor variables (Amit and Geman 1997). One popular algorithm, random forests, has been demonstrated to be more robust than a boosting method (Breiman 1999). However, in an example of classifying naive in vitro drug-treatment samples based on gene expression data, Gunther et al. (2003) showed reduced prediction accuracy for random forests (83.3%) compared with DT (88.9%).
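The two resampling flavors discussed above can be sketched as follows. The sketch only shows how member training sets are drawn (the part that differs between the approaches), not how the trees themselves are grown; the function names and toy data are hypothetical.

```python
import random

def bootstrap_sample(X, y, rng):
    # Bagging-style draw: n rows sampled WITH replacement, so each member
    # tree sees a perturbed copy of the full training set.
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def subsample(X, y, fraction, rng):
    # Plain subsampling WITHOUT replacement. As noted above, a large
    # fraction (e.g., 0.9) yields highly correlated members, whereas a
    # small fraction (e.g., 0.7) yields lower-quality members.
    k = int(round(fraction * len(X)))
    idx = rng.sample(range(len(X)), k)
    return [X[i] for i in idx], [y[i] for i in idx]

rng = random.Random(0)
X = list(range(10))
y = [0] * 5 + [1] * 5
Xb, yb = bootstrap_sample(X, y, rng)  # same size, duplicates allowed
Xs, ys = subsample(X, y, 0.7, rng)    # 70% of rows, all distinct
```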
It is important to note that the aforementioned techniques rely on random selection of either samples or predictor variables to generate the individual DT models. In each repeat, the individual DT models of the ensemble are different; thus, the biologic interpretation of the ensemble is not straightforward. Furthermore, these methods need to grow a large number of individual DT models (> 400) and can be computationally expensive. In contrast, the difference among individual DT models is maximized in DF, such that the best ensemble is usually realized by combining only a few DT models (i.e., four or five). Importantly, because DF is reproducible, the variable relationships are constant in their interpretability for biologic relevance.
Omics data such as those examined in this article normally have a limited number of samples and a large number of predictor variables. Furthermore, the noise associated with both the categorical dependent variables and the predictor variables is usually unknown. It is consequently imperative to verify that the fitted model does not reflect a chance correlation. To assess the degree of chance correlation in the PCA model, we computed a null distribution of prediction with 2,000 L10O runs based on 2,000 pseudo-data sets derived from a randomization test. The null hypothesis was tested by comparing the null distribution with the DF predictions in 2,000 L10O runs using the actual training data set. The degree of chance correlation in the predictive model can be estimated from the overlap of the two distributions. Generally speaking, a data set with an unbalanced sample population, a small sample size, and/or a low signal:noise ratio tends to produce a model whose distribution overlaps the null distribution. For the PCA model, the two distributions are spaced far apart with no overlap, indicating that the model is biologically relevant.
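The randomization test can be sketched as follows. The classifier here is a trivial single-threshold rule on one hypothetical predictor, standing in for the full DF model; only the label-shuffling logic mirrors the procedure described above.

```python
import random
import statistics

def stump_score(x_vals, labels):
    # Stand-in for a fitted model's accuracy: the best single-threshold
    # rule on one predictor, allowing either split direction.
    best = 0.0
    for thr in set(x_vals):
        acc = sum(int(x > thr) == t for x, t in zip(x_vals, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)
    return best

def null_distribution(x_vals, labels, n_runs, seed=0):
    # Randomization test: shuffling the class labels destroys any real
    # association, so the resulting scores estimate chance correlation.
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        perm = labels[:]
        rng.shuffle(perm)
        scores.append(stump_score(x_vals, perm))
    return scores

# Hypothetical data with a real signal: the outcome tracks the predictor.
x_vals = [float(i) for i in range(10)]
labels = [0] * 5 + [1] * 5
actual = stump_score(x_vals, labels)           # 1.0: perfectly separable
null = null_distribution(x_vals, labels, 200)  # scores mostly below 1.0
```

The gap between `actual` and the bulk of `null` plays the role of the non-overlapping distributions described above.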
A model fitted to omics data has minimal utility unless it can be generalized to predict unknown samples. The ability to generalize is an essential requirement for diagnostics and prognostics in medical settings and/or risk assessment in regulation. Commonly, test samples are used to verify the performance of a fitted model. Such external validation, while providing a sense of real-world application, must incorporate assurance that the samples set aside for validation are representative. Setting aside only a small number of samples might not fully assess the predictivity of a fitted model, and it could also mean the loss of valuable data that might have improved the model. Moreover, one rarely enjoys the luxury of setting aside a sufficient number of samples for external validation in omics research, because in most cases data sets contain barely enough samples to create a statistically robust model in the first place. Therefore, an extensive L10O procedure is embedded in DF that provides an unbiased and rigorous way to assess a fitted model’s predictivity within the domain of the available samples, without losing samples to a set-aside test set.
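The repeated leave-10%-out (L10O) splitting can be sketched as follows; this is a generic sketch of the splitting step only, with hypothetical names, not the authors' code.

```python
import random

def leave_10_percent_out_splits(n_samples, n_runs, seed=0):
    # Each run holds out a random 10% of the samples (at least one) for
    # testing and uses the remaining 90% for model fitting, so over many
    # runs every sample is repeatedly evaluated as an "unknown."
    rng = random.Random(seed)
    k = max(1, n_samples // 10)
    for _ in range(n_runs):
        test = sorted(rng.sample(range(n_samples), k))
        held = set(test)
        train = [i for i in range(n_samples) if i not in held]
        yield train, test

splits = list(leave_10_percent_out_splits(n_samples=30, n_runs=5))
```

In the study described here the procedure is repeated 2,000 times, which accumulates many held-out predictions per sample without permanently sacrificing any samples to a test set.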
A model’s ability to predict unknown samples depends directly on the nature of the training set. In other words, predictive accuracy for different unknown samples varies according to how well the training set represents those samples. Therefore, it is critical to be able to estimate the degree of confidence in each prediction, which can be difficult to derive from external validation. In DF, the information derived from the extensive L10O process permits assessment of the confidence level of each prediction. For the PCA model, the confidence level for predicting unknown samples was assessed from the distribution of accuracy over the prediction probability range for the left-out samples in the 2,000 L10O runs. We found that the sensitivity and specificity of the model were 99.2% and 98.2% in the HC region, respectively, with an overall concordance of 98.7%. In contrast, a much lower prediction confidence of 78.9% was obtained in the LC region, indicating that these predictions need to be further verified by additional methods. Generally, the number of samples in the HC region relative to the LC region depends on the signal:noise ratio of the data set. For noisy data, more unknown samples will be predicted in the LC region, potentially as many as 40–50% (results not shown). For the PCA data set, some 80% of the left-out samples predicted in the 2,000 L10O runs fell in the HC region, indicating that the data set has a high signal:noise ratio.
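The confidence assessment can be sketched as follows. The 0.3/0.7 probability cutoffs and the toy data are hypothetical placeholders for the HC/LC boundaries that would actually be derived from the L10O runs.

```python
def region_accuracy(calls, low=0.3, high=0.7):
    # calls: (predicted probability of disease, true label) pairs from the
    # cross-validation runs. Probabilities far from 0.5 fall in the
    # high-confidence (HC) region; those near 0.5 fall in the
    # low-confidence (LC) region. The cutoffs here are hypothetical.
    stats = {"HC": [0, 0], "LC": [0, 0]}  # region -> [correct, total]
    for prob, truth in calls:
        region = "HC" if prob <= low or prob >= high else "LC"
        predicted = 1 if prob >= 0.5 else 0
        stats[region][0] += int(predicted == truth)
        stats[region][1] += 1
    return {r: (c / n if n else None) for r, (c, n) in stats.items()}

# Toy cross-validation output: confident calls are right more often.
calls = [(0.95, 1), (0.05, 0), (0.85, 1), (0.60, 1), (0.45, 1)]
```

Accuracy tabulated per region in this way is what allows an unknown sample's prediction to be reported with an attached confidence level.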
A number of classification methods reported in the literature require selection of the relevant or informative predictor variables before modeling is actually performed. This is necessary because such methods can be susceptible to noise without this procedure, and the computational cost of iterative variable selection during cross-validation is prohibitive. Although these are otherwise effective methods, they can produce what is called “selection bias” (Simon et al. 2003). Selection bias occurs when the model’s predictive performance is assessed using cross-validation in which only the preselected variables are included. Because of selection bias, cross-validation can significantly overstate prediction accuracy (Ambroise and McLachlan 2002), and external validation becomes mandatory for assessing a model’s predictivity. In contrast, model development and variable selection are integral in DF. DF avoids selection bias during cross-validation because the model is developed at each repeat by selecting variables from the entire set of predictor variables. Cross-validation thereby provides a realistic assessment of the predictivity of a fitted model. Given the trend of ever-decreasing computational cost, carrying out exhaustive cross-validation is increasingly attractive, particularly when scarce sample data can be used for training rather than external testing. Of course, external validation is still strongly recommended when the amount of data suffices, in which case the cross-validation process will still enhance the rigor of the validation.