We applied the parametric method to high dimensional multivariate normal datasets, while varying the parameter settings and the class prevalences. Results are shown in Table 1 and [Additional file 1: Supplemental Table S1]. We considered total samples of size n = 200, n = 100 and n = 50. For example, when m = 50 genes are informative and n = 200, the optimal number of samples for the training set (reading across the first row of Table 1) is 170, 70 or more, 30 or more, and 20 or more for effect sizes of 0.5, 1.0, 1.5 and 2.0, respectively. The "or more" in the last three training set sizes indicates that training set sizes anywhere from the specified size up to 190 result in practically equivalent mean squared error.
Table 1 Optimal allocations of the samples to the training sets
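To make these simulated scenarios concrete, the following is a minimal Python sketch of generating one such dataset. The total gene count p, the identity covariance, and all names are illustrative assumptions, not the paper's exact simulation settings.

    import numpy as np

    # Hypothetical sketch of one parametric scenario: two multivariate normal
    # classes differing by an effect size "delta" on the first m of p genes.
    def simulate_two_class(n=200, p=5000, m=50, delta=1.0, prevalence=0.5, seed=0):
        rng = np.random.default_rng(seed)
        n1 = int(round(prevalence * n))          # class 1 size (equal prevalence here)
        mu = np.zeros(p)
        mu[:m] = delta                           # mean shift on the m informative genes
        X0 = rng.standard_normal((n - n1, p))    # class 0 ~ N(0, I)
        X1 = rng.standard_normal((n1, p)) + mu   # class 1 ~ N(mu, I)
        X = np.vstack([X0, X1])
        y = np.repeat([0, 1], [n - n1, n1])      # class labels
        return X, y

    X, y = simulate_two_class(n=200, m=50, delta=1.0)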
Several features are apparent in Table 1: (i) when the achievable accuracy is not much greater than 50%, the optimal split allocates the vast majority of samples to the test set. In this circumstance, no good classifier is possible, so additional samples allocated to the training set are wasted and detract from lowering the variance of estimation in the test set; (ii) when the gene expression profiles of the two classes are widely separated, e.g., with a large number of differentially expressed genes and large effect sizes, small training sets are adequate to develop highly effective classifiers. The MSE is flat in this circumstance, and large test sets are not needed either.
[Additional file 1: Supplemental Table S1] shows the results when the prevalence is unbalanced, namely, 2/3 from one class and 1/3 from the other class. The results for this imbalanced prevalence setting are very similar to those for the equal prevalence setting. This suggests that the same general optimal splits apply across a range of class prevalences (33% to 67%).
The relative sizes of the three terms contributing to the mean squared error of Equation (1) for the scenarios of Table 1 and [Additional file 1: Supplemental Table S1] are shown in the Supplementary material [Additional file 1]. An example is shown in Figure 2. Generally, the A term tends to be relatively small across the range of sample sizes.
Figure 2 Example of MSE decomposition. Example figure showing the relative contributions of the three sources of variation to the mean squared error. This is a scenario from one entry in Table 1. Plots for all other scenarios associated with Table 1 and [Additional file 1: Supplemental Table S1] appear in the Supplementary material.
The squared bias term B tends to be relatively large for small sample sizes and to dominate the other terms. When development of a good classifier is possible, the actual accuracy of classifiers developed on the training set may initially increase rapidly as the training set size increases. As the sample size increases, the bias term B decreases until it no longer dominates. This is because the accuracy of the classifier improves as the size of the training set increases and approaches the maximum accuracy possible for the problem at hand. The rate of decrease of the squared bias term B will depend somewhat on the type of classifier employed and on the separation of the classes. When the classes do not differ with regard to gene expression, learning is not possible and B will equal zero for all training set sizes.
The binomial variance term V is generally relatively small unless the test set becomes very small, at which point it often dominates. The exceptions to this general rule are cases where the prediction accuracy nears 1 for t < n, in which case the V term remains near zero even as the test set size becomes small. Another partial exception is when the full dataset accuracy is below 85%, in which case the binomial variance is larger.
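Equation (1) itself is not reproduced in this section, but one decomposition consistent with the descriptions of A, B and V above is the following hedged reconstruction (the notation here is ours; the exact definitions are in the Methods):

    \mathrm{MSE}(t) =
      \underbrace{\operatorname{Var}\!\bigl[\alpha(T_t)\bigr]}_{A}
      + \underbrace{\bigl(\alpha_n - \operatorname{E}\bigl[\alpha(T_t)\bigr]\bigr)^{2}}_{B}
      + \underbrace{\operatorname{E}\!\left[\frac{\alpha(T_t)\bigl(1-\alpha(T_t)\bigr)}{n-t}\right]}_{V}

Here α(T_t) is the true accuracy of a classifier built on a random training set T_t of size t, and α_n is the expected accuracy attainable with all n samples. Under this reading, B vanishes when no learning is possible (both accuracies equal 0.5), and V is the binomial variance of the test-set estimate, which grows as the test set of size n − t shrinks unless α(T_t) is near 1, matching the behavior described above.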
Figure 3 is a comparison of the two most common rules of thumb for splitting a sample into a training set and a test set. The figure compares 50% allotment to the training set versus 67% allotment to the training set for the equal prevalence case. Each scenario represented in Table 1 is also present in Figure 3. The x-axis is the average accuracy (%) for classifiers developed from the full dataset of n samples. The y-axis is the excess error from using a non-optimal split. The discussion is organized around the full dataset accuracy:
Figure 3 Comparing two rules of thumb. Comparison of two common rules of thumb: 1/2 of the samples to the training set and 2/3rds of the samples to the training set. The x-axis is the average accuracy (%) for training sets of size n. "Excess error" on the y-axis is the increase in error from using a non-optimal split.
• When the achievable true accuracy using the full dataset for training is very close to 1, both the 50% allotment and the 67% allotment to the training set result in similar excess error.
• When the achievable true full dataset accuracy is moderate, say between 60% and 99%, then in several cases, assigning 67% to the training set results in noticeably lower excess error, while in other cases the two allotment schemes are roughly equivalent.
• Finally, and not surprisingly, when the achievable true full dataset accuracy is below 60% (shaded area on graph), then allotment of 50% to the training set is preferable.
In sum, this graph shows that allotment of 2/3rds of the samples to the training set is somewhat more robust than allotment of 1/2 to the training set; a small numerical check of this comparison is sketched below.
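The following minimal sketch scores the two rules against an estimated MSE curve. Reading "excess error" as the root-MSE difference between a rule-of-thumb split and the optimal split is our assumption here, consistent with the RMSD usage later in this section; the helper name and arguments are hypothetical.

    import numpy as np

    def excess_error(t_grid, mse, t_rule):
        """Root-MSE penalty for using training size t_rule instead of the optimum.
        t_grid: candidate training-set sizes; mse: estimated MSE(t) at each size."""
        i_opt = int(np.argmin(mse))                                 # optimal split
        i_rule = int(np.argmin(np.abs(np.asarray(t_grid) - t_rule)))  # nearest grid point
        return float(np.sqrt(mse[i_rule]) - np.sqrt(mse[i_opt]))

    # Compare the 1/2 and 2/3rds rules for a dataset of n samples:
    # excess_error(t_grid, mse, t_rule=n // 2)
    # excess_error(t_grid, mse, t_rule=2 * n // 3)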
The nonparametric method was applied to simulated datasets and the MSE estimates were compared to those from the parametric approach. Agreement between the two was very good [Additional file 1: Supplemental Section 4].
Table 2 and [Additional file 1: Supplemental Section 5] show that the results are similar under an empirically estimated covariance matrix and distance between the classes [16].
Table 2 Empirically estimated effects and covariance
Table 3 shows the results of applying the nonparametric method to several real-world datasets.
Table 3 Applications to real datasets
Note that the rightmost two columns show the excess error when 1/2 and when 2/3rds of the samples are allotted to the training set. For the Rosenwald et al. [12] dataset of diffuse large B-cell lymphoma, we estimated the optimal split for distinguishing germinal-center B-cell-like lymphoma from all other types. For this dataset of n = 240 patient samples, the optimal split was 150:90, with about two-thirds of the samples devoted to the training set. The excess error (root mean square error difference, RMSD) from the 2/3rds-to-training-set rule of thumb is 0.001; as a comparison, the RMSD for a simple binomial random variable (with p = 0.96) between a sample size of 236 and 240 is also 0.001. Hence, the excess error at t = 2n/3 is very small.
For the Boer et al. [19] dataset, the optimal split was 80 for the training set and 72 for the test set, so that 53% of the samples were used to train the classifier to distinguish normal kidney from renal cell carcinoma. The dramatic difference in gene expression between cancer and normal tissues meant that a smaller training set size was needed to develop a highly accurate classifier [Additional file 1: Supplemental Section 6.3]. As a result, the 1/2-to-training-set rule of thumb is a little better than the 2/3rds-to-training split. That being said, the excess error when 2/3rds are used for training is only 0.004. For comparison, an RMSD of 0.004 is similar to the RMSD resulting from increasing the sample size from 142 to 152 in simple binomial sampling (when p is near 1).
For the Golub et al. [20] dataset, the optimal split was 40 for the training set and 32 for the test set, or 56% for the training set, to distinguish acute lymphoblastic leukemia from acute myelogenous leukemia. This is another example of two classes with dramatically different expression profiles. Like the Rosenwald dataset, the 2/3rds-to-training-set rule resulted in smaller excess error than the 1/2 rule.
Distinguishing oligodendroglioma from glioblastoma in the Sun et al. [21] dataset required 40 samples for the training set and 91 for the test set, or 31% for the training set. This optimal training sample size was somewhat smaller than expected. This appeared to be due to the accuracy leveling off after t = 40 training samples, while the variance terms increased monotonically for t > 40. The multidimensional scaling plot for these data [Additional file 1: Supplemental Section 6.4] showed a pronounced separation into two groups of cases, but these groups only partly corresponded to the class labels. The two groups were found easily with n = 40 samples, but the corresponding error rate was relatively high because of the imperfect correlation between the class labels and the two clusters in the plots. One is left to speculate whether this pattern was the result of real underlying biology, or of artifacts such as batch effects or sample labeling errors. In this case, it did appear that 40 samples in the training set was adequate to achieve accuracy near the best possible with the full n = 130 samples.
A possible explanation for the Sun et al. [21] dataset is that the full dataset accuracy was relatively low. We therefore investigated another dataset, from van't Veer et al. [22], which also had low full dataset predictive accuracy, and found a similar pattern. As shown in [Additional file 1: Supplemental Section 6.5], the multidimensional scaling plot of grade 1/2 tumors versus grade 3 tumors showed two groups that did not match up with the tumor grade labels. This non-normality within groups may reflect underlying biological heterogeneity. As can be seen in the table, the optimal training set proportion is below 50% for this dataset, as it was for the Sun et al. dataset, suggesting that at lower accuracies the setting is more complex and a single rule of thumb may not be adequate.
The supplement provides figures related to the fitting on the real datasets [Additional file 1: Supplemental Section 6]. We found that, for the application to the real-world microarray datasets, it was critical to perform at least 1,000 bootstrap re-samplings and 1,000 sample splits in order to obtain adequately de-noised MSE curves over the range of sample sizes.