Four ML schemes were evaluated: a Decision Tree (J48, an implementation of the C4.5 algorithm), a Support Vector Machine (SVM), a Naïve Bayes classifier (NB) and a Multi-layer Perceptron (MLP). Two alternative models for ML were tested in this study, using a dataset of 46 instances and two classes. These were:
• a two-class model with the classifications "CREB-regulated" and "NOT CREB-regulated", and
• a three-class model with a third classification, "Nrf2-regulated" [24].
Nrf2 (NF-E2-related factor 2), the primary transcription factor that binds the Antioxidant Response Element (ARE), was selected for two reasons. First, like CREB, Nrf2 is a ubiquitous transcription factor. Second, it requires CREB Binding Protein for enhanced transcriptional activity [25]. Using the leave-one-out cross-validation technique, the two-class model had lower Mean Absolute Error rates than the three-class model for all learning schemes explored (Figure ). Also, of the four schemes and two models evaluated, the area under the Receiver Operating Characteristic (ROC) curve, a measure of test accuracy, was highest for the C4.5 scheme under the two-class model (Figure ).
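The leave-one-out procedure mentioned above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy feature vectors are invented, and a 1-nearest-neighbour rule stands in for the four schemes actually evaluated. It reports a simple misclassification rate rather than the Mean Absolute Error used in the study.

```python
from math import dist

def nearest_neighbour_predict(train, query):
    """Return the label of the training instance closest to `query` (1-NN).

    This classifier is a hypothetical stand-in, not one of the study's schemes.
    """
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

def leave_one_out_error(data, predict):
    """Hold out each instance once, train on the rest, return the error rate."""
    errors = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # every instance except the i-th
        if predict(train, features) != label:
            errors += 1
    return errors / len(data)

# Invented toy data: (feature vector, class label).
toy_data = [
    ((1, 1), "CREB-regulated"),
    ((2, 1), "CREB-regulated"),
    ((1, 2), "CREB-regulated"),
    ((8, 8), "NOT CREB-regulated"),
    ((9, 8), "NOT CREB-regulated"),
    ((3, 3), "NOT CREB-regulated"),  # sits near the other class on purpose
]

loo_error = leave_one_out_error(toy_data, nearest_neighbour_predict)
print(f"Leave-one-out error rate: {loo_error:.3f}")
```

With n instances the model is fitted n times, each time on n - 1 instances, which is why the technique suits small datasets such as the 46-instance set described here.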
Figure 1. Learning Scheme Accuracy and Error Rates. Accuracy and error rates for learning on a two-class and a three-class model (defined in the Methods section), using the leave-one-out cross-validation technique. A) A comparison of accuracy and error rates for ...
Of the four ML schemes, using the leave-one-out cross-validation technique and the two-class model, the C4.5 Decision Tree algorithm had the lowest overall predicted error rate (Figure ; Table ). Its ROC curve was closest to the left-hand and top borders of the ROC space (Figure and Additional File 1), indicating that it had the best trade-off between sensitivity and specificity among the four schemes evaluated. It also had the highest area under the ROC curve (Table ).
Performance of learning schemes following 460 runs
Figure 2. Learning Scheme ROC Curves. Receiver Operating Characteristic (ROC) curve for learning schemes using the two-class model and the leave-one-out cross-validation technique. The C4.5 test is closest to the left-hand border and the top border of the ROC space, ...
Area under ROC curves, two-class model.
The C4.5 Decision Tree algorithm [26] works top-down, seeking at each stage the attribute that best separates the classes. The attribute with the greatest information gain is chosen. The algorithm then recursively processes the sub-problems resulting from the split until the information either reaches a maximum or is zero. The information measure (entropy) is calculated thus:
$$\mathrm{Entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$$
where $p_1, p_2, \ldots, p_n$ are fractions representing the data distribution at a node (attribute) and sum to 1.
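The entropy measure and the information-gain criterion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the J48 implementation: the function names and the eight example labels are invented, and only plain information gain is shown (C4.5 as usually described further normalises it into a gain ratio).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of the class distribution in `labels`."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = len(parent_labels)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

# Invented node with four instances of each class.
parent = ["CREB"] * 4 + ["NOT"] * 4

# Hypothetical split A leaves both children mixed (3:1 and 1:3).
split_a = [["CREB", "CREB", "CREB", "NOT"], ["CREB", "NOT", "NOT", "NOT"]]
# Hypothetical split B separates the classes perfectly.
split_b = [["CREB"] * 4, ["NOT"] * 4]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"gain A = {gain_a:.4f}, gain B = {gain_b:.4f}")
```

A tree builder following the description in the text would prefer split B, since it yields the larger gain and leaves children whose entropy is zero, so no further splitting is needed.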
The two-class model was also used to test an independent dataset generated from 21 genes of known CREB regulation status. C4.5 correctly classified 81% of instances (Table ), with F-measures of 0.87 and 0.67 for the classes "CREB-regulated" and "NOT CREB-regulated", respectively.
Evaluation of two-class model: C4.5 predictions on an independent set of genes of known CREB regulation status
The F-measure is the harmonic mean of Precision and Sensitivity and can be used as a single measure of a test's performance:
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}$$
where
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Sensitivity (or Recall) is the proportion of actual positives that the test correctly identifies; in hypothesis-testing terms, it is the probability that the test rejects a false null hypothesis:
$$\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
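The three formulas above translate directly into code. The sketch below is a minimal Python illustration. The counts passed in are hypothetical and are not the confusion matrix behind the reported 0.87 and 0.67 values, and the functions assume non-zero denominators.

```python
def precision(tp, fp):
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """True Positives / (True Positives + False Negatives); also called Recall."""
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    """Harmonic mean of Precision and Sensitivity."""
    p = precision(tp, fp)
    r = sensitivity(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts chosen only to exercise the formulas.
tp, fp, fn = 8, 2, 1
score = f_measure(tp, fp, fn)
print(f"precision={precision(tp, fp):.3f} "
      f"sensitivity={sensitivity(tp, fn):.3f} F={score:.3f}")
```

Because it is a harmonic mean, the F-measure is pulled toward the smaller of the two inputs, so a classifier cannot score well by maximising Precision at the expense of Sensitivity or vice versa.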
Additionally, using the two-class model, three out of four genes determined by two independent microarray platforms to be up-regulated in the ILS cerebellum [22] were determined by C4.5 to be transcriptionally CREB-regulated (Table ). The platforms were the Affymetrix (Santa Clara, CA) Mouse Expression Set 430 (MOE430) and the NIA15K cDNA arrays manufactured at the University of Colorado's School of Medicine. Similarly, three out of four genes up-regulated on both platforms in the ISS cerebellum were deemed CREB-regulated (Table ). Furthermore, 64% and 52% of a cross-section of other up-regulated cerebellar genes in ILS and ISS mice, respectively (as per the MOE430 platform), were deemed CREB-regulated.
C4.5 two-class model predictions for up-regulated genes (cross-validated between MOE430 and NIA15K platforms) in ILS mouse cerebellum
C4.5 two-class model predictions for up-regulated genes (cross-validated between MOE430 and NIA15K platforms) in ISS mouse cerebellum