There are several optimality criteria that have been used in choosing between models and model construction systems. In no particular order, it is generally considered an improvement to: i) reduce the time of model construction, ii) reduce the complexity of implementing the method, iii) reduce the relative number of model parameters, iv) increase the exposure of each parameter's contribution to the model for interpretation, v) increase the predictive precision of the model and vi) increase the predictive accuracy of the model. Previous comparisons among learning techniques and feature mapping methods for siRNAs have not generally used specific statistical methods to discriminate among the myriad of possible combinations. Here we suggest, and demonstrate the use of, statistical models that maximize both predictive precision and accuracy and that can discriminate among the high dimensionality of model space. Furthermore, from the observations here, it may be difficult to generalize the contributions of specific features when comparing among learning techniques, as there are significant interactions between learning technique and feature that contribute to model performance. Stated plainly, the feature set that maximizes the performance of a GLM model is unlikely to be the same feature set that maximizes an ANN or SVM model, and vice versa; the learning technique therefore influences which features are "relevant" in the model. Inferring "biological relevance" from "model relevance" is then questionable when the modeling technique itself influences the features in the model. Furthermore, any preference for model interpretability leading to the selection of a GLM-based model may be somewhat self-fulfilling, as GLMs tend to perform best (among other GLMs, though not globally best) with a smaller number of features when compared to ANN or SVM models.
Overall, multiple tests are presented and the P values are not corrected for multiple tests. However, there are 28 planned comparisons within a single learning technique between the 8 presented methods, each among the measures of both precision and accuracy. If a Bonferroni correction is warranted as a way to adjust the type-I and type-II error rates, the typically used P value of 0.05 for the type-I error rate becomes 0.05/28 ≈ 1.79E-03, and the cells labeled with '**' still exceed this threshold. Additionally, 10-fold cross validation was used to generate the multiple replicate estimates of model performance, but in cases where additional power is required for comparisons among models, a higher order cross validation can be performed to increase the replication level and the associated power of statistical tests.
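The Bonferroni arithmetic above can be verified with a short sketch (Python used here purely for illustration):

```python
# Bonferroni adjustment for the 28 planned pairwise comparisons
# among the 8 feature mapping methods (8 choose 2 = 28).
from math import comb

alpha = 0.05
n_comparisons = comb(8, 2)          # 28 pairwise comparisons
threshold = alpha / n_comparisons   # corrected per-test threshold, ~1.79E-03
```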
It has been shown that the paired t-test is more liberal than McNemar's test for classification learning problems, but the models tested here are regression models resulting in continuously distributed values, and the tests presented are based on 2-population t-tests without the assumption of homoscedasticity of population variances, using Welch's correction for degrees of freedom. For comparative purposes, McNemar's test on these results can be found in the supplementary materials (Table S1), and consistent with being a more liberal test, McNemar's test fails to reject the null hypothesis of equality for 24 R and 12 MSE comparisons, while the 2-population t-test fails to reject 26 R and 17 MSE comparisons. The 2-population t-test is therefore a more conservative test than McNemar's, and more appropriate for the continuously distributed values that result from regression rather than classification procedures.
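A minimal sketch of the 2-population Welch's t-test applied to per-fold cross-validation replicates follows; the function and the replicate values are illustrative, not taken from the published analysis scripts:

```python
# Welch's t statistic and Welch-Satterthwaite degrees of freedom for
# comparing per-fold R (or MSE) replicates from two models, without
# assuming equal population variances.
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)        # sample variances
    se2 = v1 / n1 + v2 / n2
    t = (mean(x) - mean(y)) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# e.g. hypothetical per-fold R values from 10-fold CV of two models
model_a = [0.70, 0.73, 0.69, 0.71, 0.72, 0.70, 0.74, 0.68, 0.71, 0.72]
model_b = [0.66, 0.69, 0.65, 0.67, 0.70, 0.66, 0.68, 0.64, 0.67, 0.68]
t, df = welch_t(model_a, model_b)
```

The resulting t and df feed a standard t distribution lookup; when variances and sample sizes are equal, df reduces to the pooled 2(n-1).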
It is ill advised to use measures of model precision and accuracy that result from both training and testing on the same dataset; however, for comparative purposes these values are presented in this study. Also, the use of a single kind of cross validation to reduce the problem of over-training models has not been universally adopted. A comparison of the present approaches to previously described methods for training and testing regression learning techniques on the same siRNA dataset is also summarized. It should be pointed out that many of the methods summarized are not being compared on an equal footing, as their training sets were different or the dataset used in model testing was not available; this is simply a proposed mechanism for making comparisons among predictive models when publishing a method. A complete comparison among techniques and methods is difficult due to the lack of many complementary metrics, the lack of availability of the algorithm's implementation, or both. Adopting a common set of standard metrics for model comparison might allow ongoing refinements to be placed in a historical context and comparisons among approaches to take place in a quantitative fashion. A final proposal, to allow extensible comparisons among a growing constellation of models, would be to publish the individual replicates from any cross-validation procedure; with replicates published separately, standard population-level measures and comparisons such as t-tests (or other appropriate tests) would be possible across models.
Comparisons among regression learning techniques' results for model precision and accuracy.
Many of the conclusions here depend on the procedure of cross validation, and several kinds of cross validation have been suggested, including 5×2-fold and 10×10-fold, as well as the 1×10-fold stratified method performed here. To help determine whether the choice of cross validation procedure unduly influences the present results, the PSBC method was used to compare the means and standard deviations resulting from various cross validation procedures across the ANN, GLM and SVM techniques. In general, lower fold (2-fold, 3-fold) cross validation procedures tend to provide lower estimates of the R and higher estimates of the MSE due to their relatively smaller training sets when compared to the higher fold (10-fold, 20-fold) partitions. Also, there are some improvements in the reduction of the standard deviations when increasing the fold partitions to 5, 10 and 20-fold, but there appears to be marginal benefit, from the perspective of estimating the generalization error, in progressing past 10-fold. Finally, 10 replicates of 10-fold (10×10-fold) and stratified 10-fold (1×10-fold) appear to have similar properties, resulting in similar measures of central tendency and dispersion, and the 10-fold increase in computational cost of the 10×10-fold procedure might then be difficult to justify where learning algorithms are time intensive.
Comparison of model cross-validation procedures on the PSBC feature mapping method across 3 learning techniques.
From the ANOVA results, measures of model precision can be explained rather well by a simple linear combination of technique and method (R model 3: R = technique + method + error), with some evidence for interactions between techniques and methods contributing to the variance in R. By contrast, measures of model accuracy cannot be explained by a simple linear combination of technique and method; the model that takes interactions between technique and method into account (MSE model 4: MSE = technique + method + (technique×method) + error) has a significantly better descriptive fit to the data. These observations suggest that finding highly precise models might simply be a matter of performing a 3-step process. The first step would be surveying learning techniques and choosing the technique with the greatest precision. The second step would involve surveying feature mapping methods and choosing the method, or feature set, with the greatest precision. The final step would combine the highly precise learning technique with the highly precise mapping method for the most precise model. By contrast, this 3-step process would not be suitable for finding highly accurate models, due to the large interaction component between technique and method contributing to the variance in model accuracies (MSE). Finally, to address whether any one technique or method had excessive influence on the ANOVA results, each of the 3 techniques and 5 methods was sequentially removed and the ANOVA repeated (see supplementary materials Text S1 for regression CV data, Text S2 for the R statistical analysis script on regression CV and Text S3 for results from the R analysis on regression; similarly see Text S4, Text S5, and Text S6 for mean squared error CV data), and similar conclusions concerning variance partitions can be made under the leave-one-out analyses as with the entire dataset.
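The variance partition that distinguishes models 3 and 4 can be sketched by hand for a balanced two-factor design; the replicate values below are synthetic stand-ins, not the study's CV results, and the point is only to show where the technique×method sum of squares comes from:

```python
# Balanced two-factor ANOVA sums of squares (technique x method).
# Model 3 uses ss_tech + ss_meth + ss_error; model 4 additionally
# separates out ss_inter, the technique x method interaction.
from statistics import mean

# cell[(technique, method)] -> replicate scores (e.g. per-fold R)
cell = {
    ("ANN", "m1"): [0.60, 0.62], ("ANN", "m2"): [0.70, 0.71],
    ("SVM", "m1"): [0.65, 0.66], ("SVM", "m2"): [0.75, 0.74],
}
techs = sorted({t for t, _ in cell})
meths = sorted({m for _, m in cell})
r = len(next(iter(cell.values())))               # replicates per cell
grand = mean(v for obs in cell.values() for v in obs)

def margin_mean(idx, level):
    """Mean over all observations at one level of factor idx (0 or 1)."""
    return mean(v for key, obs in cell.items() if key[idx] == level
                for v in obs)

ss_tech = r * len(meths) * sum((margin_mean(0, t) - grand) ** 2 for t in techs)
ss_meth = r * len(techs) * sum((margin_mean(1, m) - grand) ** 2 for m in meths)
ss_cells = r * sum((mean(obs) - grand) ** 2 for obs in cell.values())
ss_inter = ss_cells - ss_tech - ss_meth          # technique x method term
ss_error = sum((v - mean(obs)) ** 2 for obs in cell.values() for v in obs)
```

An F-test on the interaction mean square against the error mean square is then what decides whether model 4's extra term is warranted.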
The degrees of variability among learning techniques and feature mapping methods for measures of both model precision and accuracy are not equivalent. Overall, for measures of precision the learning techniques generally perform equally, but there are trends suggesting that SVM techniques are more robust than ANN and GLM techniques to the presence of noisy methods (features) when additional features are added. These observations would be consistent with SVM techniques tending to produce robust models with large numbers of features, while ANN and GLM techniques would not be robust under those larger feature set scenarios but would instead be better suited to smaller numbers of features that contain less noise.
By contrast, for measures of accuracy there appear to be vast differences among learning techniques. For accuracy measures, SVM techniques tend to provide lower variance and smaller magnitudes of error. ANN techniques also tend to provide small magnitudes of error, but some feature methods appear to result in higher variability of accuracy measures. Finally, the GLM techniques tend to provide low accuracy models, where errors appear to be additive with the accumulation of more noisy features. The single exception to this low accuracy in GLM is the PSBC method, which is comparable to, but significantly under-performs, the accuracies seen in the ANN and SVM techniques for this method. It is unclear to what degree the one desirable property of GLM techniques outweighs the advantages of ANN and SVM techniques in measures of precision and accuracy. Namely, the explicit contribution of each feature to the final model in GLMs can be useful, but if model predictive precision and accuracy are quantifiable and lower than in other techniques like ANN and SVM, then model transparency will need to be given a higher priority than precision or accuracy in determining a desirable learning technique.
There are several limitations to the present study. First, the available siRNA data for constructing predictive models are limited. While the dataset under study is rather large, there are few additional siRNAs that have complete complementarity to their target mRNAs. So while there are nearly 600 additional 21-mer siRNAs with empirically measured activities, only 223 of these have complete complementarity to their respective target sequences, due to a constant terminal dinucleotide DNA sequence "TT" in the siRNA's 3′-most positions irrespective of whether the target mRNA possessed an "AA" sequence or not.
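A rough sketch of this complementarity screen follows; the helper and sequences are hypothetical, and the actual screening used in the study may differ in detail (a Tuschl-style 21-mer is assumed to be a 19-nt target-matched core plus the constant "TT" DNA overhang, so complete complementarity requires "AA" in the target immediately after the core-matched region):

```python
# Illustrative complete-complementarity check (hypothetical helper,
# not the study's code): the 19-nt core must occur in the target mRNA
# followed by "AA" opposite the constant 3' "TT" overhang.
def fully_complementary(sense_21mer, target_mrna):
    core = sense_21mer[:19]            # target-matched 19-nt core
    return (core + "AA") in target_mrna

sirna = "GACGUAACGGCUAACGAUCTT"                  # 19-nt core + "TT"
target_with_aa = "AUGGACGUAACGGCUAACGAUCAAGGC"   # carries the "AA"
target_without_aa = "AUGGACGUAACGGCUAACGAUCGGGC" # core matches, no "AA"
```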
Second, it has been suggested that there is a positive association between a siRNA's activity and the physical location of the siRNA's target site in the mRNA. Therefore, when creating cross validation partitions for siRNAs, keeping siRNAs in the training set that share the same target footprint as siRNAs in the testing set would result in an upwards bias in estimates of precision and accuracy for that model. To investigate this possible source of bias, we implemented a cross-validation system that removed siRNAs from the training set that shared a target mRNA footprint with any siRNA in the testing set. The stratified cross validation scheme with the SVM technique and P+25 feature mapping resulted in a model R of 0.711 and MSE of 0.020, with an average of 2187.9 siRNAs in the training sets. Cross validation that removed siRNAs from the training partition sharing a footprint with any siRNA in the testing partition resulted in a model with an average among partitions of R of 0.694 and MSE of 0.021, and an average of 2009.6 siRNAs in the training sets. There is no significant difference in model precision (R, P = 0.310) or accuracy (MSE, P = 0.324) when excluding siRNAs from the training set that overlap with any of those in the testing set. These model comparison values result from testing on all 2431 siRNAs across all partitions; simply not all of the siRNAs are used to train the underlying model. So while there may be significant variance components in siRNA activity associated with the siRNA's target, these appear to have no statistically significant influence on the outcomes of predictive models when overlapping siRNAs are removed from training partitions, at least for the SVM technique applied to the P+25 feature method. The reduction of predictive power seen when removing siRNAs from the training set that overlap with the testing set is similar to the reduction seen when removing siRNAs from the training set in general, as with the lower order folds; not surprisingly, reducing the training dataset size reduces model performance.
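The footprint-aware pruning of training partitions can be sketched as follows; the siRNA record layout (target id plus start position) and the overlap rule are assumptions for illustration, not the published implementation:

```python
# Drop any training siRNA that shares a target-mRNA footprint with a
# siRNA in the testing partition: same target, and start positions
# within one 21-mer length of each other.
def overlaps(a, b, k=21):
    return a["target"] == b["target"] and abs(a["start"] - b["start"]) < k

def prune_training(train, test):
    """Keep only training siRNAs with no footprint overlap in the test set."""
    return [s for s in train if not any(overlaps(s, t) for t in test)]

train = [{"target": "geneA", "start": 100},
         {"target": "geneA", "start": 400},
         {"target": "geneB", "start": 50}]
test = [{"target": "geneA", "start": 110}]
kept = prune_training(train, test)    # drops the siRNA at geneA:100 only
```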
Third, the degree to which learning technique parameter tuning, additional features or feature selection methods result in the production of predictive models is not known. To place the learning techniques on a more even playing field, the parameters were optimized using the PSBC feature set, but it is likely that other optimal parameters could be found in scenarios with additional or other features. A combinatory examination of 216 SVM parameter sets across the 8 feature methods (Table S2) suggests, first, not unpredictably, that it is possible to de-tune effective parameters and produce less effective SVM models and, second, that the same general parameters optimized under the PSBC feature set produce maximally (or nearly so) predictive models under other feature sets. In general, it is possible to de-tune an ANN or SVM, by choosing suboptimal model parameters, to perform more poorly on the same feature set. Additional features, for example target secondary structures, have been shown to be a significant factor in siRNA activity, and that feature set was not explored here; however, adding target mRNA secondary structure features does not necessarily result in improved measures of model precision or accuracy if other features already dominate the model. There were 279 distinct feature set combinations across 3 learning techniques, for a total of 837 distinct models, but this is beyond doubt not an exhaustive exploration of model, parameter or feature space.
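A combinatory parameter survey of this size can be sketched as a simple grid product; which three SVM parameters were varied, and their candidate values, are assumptions here, but 6 values for each of 3 parameters reproduces the 216 sets examined:

```python
# Enumerate an illustrative SVM parameter grid: 6 candidate values
# for each of cost, gamma and epsilon gives 6^3 = 216 parameter sets.
from itertools import product

costs = [0.1, 0.3, 1, 3, 10, 30]
gammas = [1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1]
epsilons = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]
grid = list(product(costs, gammas, epsilons))   # 216 parameter sets
```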
Certainly other sources of bias and error exist in the present study, but the intention here is to help determine to what degree the choice of machine learning technique and feature mapping method might produce different results in modeling siRNA effectiveness, possibly accounting for some of the heterogeneity seen in previously published studies modeling siRNA activity and the features that produce maximally predictive models. These features have then been interpreted as the most relevant, but this interpretation needs to be placed clearly in the light of their relevance to a model's predictability and not necessarily their biological relevance. The methods and techniques presented here are all available for download from sourceforge.net (http://sourceforge.net/projects/seq2svm/) as a group of C++ classes and interfaces for their execution. Finally, to provide access to additional data mining and learning techniques in a graphical interface, there is also an executable that transforms a siRNA dataset, by various methods, into the attribute-relation file format (ARFF), appropriate for use in the Waikato environment for knowledge analysis (WEKA).
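For orientation, an ARFF file for a siRNA dataset has the following general shape; the relation and attribute names below are illustrative, not the tool's actual output schema:

```python
# Minimal sketch of writing siRNA features in WEKA's ARFF format:
# a @RELATION header, typed @ATTRIBUTE declarations, then @DATA rows.
def to_arff(rows):
    lines = ["@RELATION sirna", "",
             "@ATTRIBUTE sequence STRING",
             "@ATTRIBUTE gc_content NUMERIC",
             "@ATTRIBUTE activity NUMERIC",
             "", "@DATA"]
    for seq, gc, activity in rows:
        lines.append(f"'{seq}',{gc},{activity}")
    return "\n".join(lines) + "\n"

arff_text = to_arff([("GACGUAACGGCUAACGAUCTT", 0.52, 0.71)])
```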