Our experiments are motivated by the clinical need for high-performance classification and anomaly measurement of EEG signals in an online, continuous-monitoring environment. The field of EEG review has no accepted minimum standards for the sensitivity and precision of event detection, nor for anomaly measurement. We see this study as an initial investigation into multi-class epileptiform discharge detection. We did not use sensitivity and precision as benchmarks for performance because which of these measures should be optimized depends greatly on the application. For example, detecting epileptic seizures might require high sensitivity at the cost of precision, so that no seizures are missed even at the expense of a higher false positive rate. Other features of interest, such as episode rates of frontal slowing as a measure of the dissipation of sedating medications, might instead emphasize high precision at the cost of sensitivity. We chose a balance of these two performance measures, F1, as an objective way to compare classifier performance without emphasizing a specific application. We have, as much as possible, avoided the preselection and filtering that often occur in the EEG event detection literature, again with an eye toward online monitoring, where channel and “clean signal” preselection is impossible. That said, we have not yet implemented our methods in a clinical trial, and so can only estimate the standards we must meet for them to be practical and adopted.
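For concreteness, F1 is the harmonic mean of precision and recall, which rewards classifiers that balance the two rather than maximizing either alone. A minimal illustration (the values here are arbitrary, not from our experiments):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# F1 penalizes imbalance: a detector with P=0.9, R=0.1 scores far lower
# than one with P=0.5, R=0.5, even though both average 0.5.
print(round(f1_score(0.9, 0.1), 3))  # 0.18
print(round(f1_score(0.5, 0.5), 3))  # 0.5
```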
4.1. Comparing classifiers
Human labeling of specific waveforms is itself a very difficult task, and the inter-rater reliability among board-certified neurophysiologists is usually low [21]. For computational simplicity, we constrained our classifiers to examine only individual channel-seconds, when in reality experts incorporate much greater spatial and temporal information in their markings. These factors kept our expectations modest for the ultimate recall and precision of our classifiers.
In our classification performance experiment, a different classifier performed best on each of the three datasets; only the DT performed worst on all three. We therefore argue that DBNs have classification performance competitive with the two other high-performing classifiers, SVMs and KNN, on our EEG data. To our knowledge, the only other studies comparing DBNs to SVMs in a head-to-head classification task have been in handwritten digit recognition [11], and in those studies DBNs slightly outperformed SVMs overall.
As previously mentioned, we have found that DBNs can be sensitive to the heavy class imbalance present in our dataset, and we used a simple post-training technique that improves sensitivity on the minority classes. In some ways this step is unfair to the other classifiers, which did not receive such post-training. Analogous post-training steps for the other three classifiers are not entirely straightforward, so we did not undertake them in this study. Nevertheless, we see this problem as a necessary area of future work for our group and others.
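One common family of post-training corrections, sketched below purely for illustration, divides the model’s class posteriors by the training-set class priors and renormalizes, boosting classes that are rare in training. Our exact post-training technique may differ in its details, and the numbers here are hypothetical:

```python
import numpy as np

def rebalance(posteriors, train_priors):
    """Divide each class posterior by its training prior and renormalize.
    Minority classes (small priors) are boosted. This is an illustrative
    sketch of a generic prior-correction step, not necessarily the exact
    technique used in the study."""
    adjusted = posteriors / train_priors
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Hypothetical setup: 95% of training channel-seconds are background
# (class 0) and 5% are spikes (class 1).
priors = np.array([0.95, 0.05])
p = np.array([[0.70, 0.30]])        # raw posterior slightly favors background
adjusted = rebalance(p, priors)     # after correction, the minority class wins
print(adjusted)
```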
In our classification time experiment, the DBN was significantly faster than the SVM and KNN, though slower than the DT. Notably, DBN and DT classification times remain relatively consistent as the dimensionality of the data changes across the three datasets. We see this experiment as a first attempt to identify which classifiers are most likely to efficiently process continuous streams of multi-channel EEG data, which can sometimes reach hundreds of channels at once.
An important goal of this research was to develop a classification technique that can operate in real time on raw data in a clinical environment. The DBN was the only classification technique we explored that was able to do this. While DBN performance was slightly worse than that of both the SVM and KNN classifiers on raw data, neither of those techniques is currently implementable on raw data streams in a real-time clinical environment. Of note, the DBN performed better on the features we extracted in this experiment than either the SVM or KNN. While the DBN did not substantially reduce the high false positive rates that limit existing detectors, we did not optimize the detectors for this task, something we will investigate in future work.
Given that DBNs learn most of their model in the unsupervised phase of training, the large amounts of unlabeled EEG data available may be especially conducive to the semi-supervised DBN training paradigm, allowing it to learn more sophisticated models than is otherwise possible with traditional supervised learners. In addition to the DBN’s fast execution time and ability to learn from unlabeled data, the anomaly measurement capability that falls naturally out of the DBN training paradigm is another advantage of DBNs over SVMs and KNN.
4.2. Comparing data representations
In our exploration of using raw data instead of hand-chosen features, we found that in classification F1, the raw data yielded results comparable to the features data. We made our best attempt to be rigorous and thorough in feature selection. Nevertheless, it is certainly possible that, in spite of our best efforts, the features we selected were not optimal for this specific classification task. That said, optimizing features based on very specific domain knowledge is precisely what we would like to move away from. If using raw data with sophisticated learning algorithms gives performance as good as using features, we believe this move is an improvement, both in increased methodological elegance and in reduced reliance on ultra-specific domain knowledge.
It appears that the top classifiers did not fall victim to the curse of dimensionality with the raw256 data, perhaps because they had a relatively large amount of labeled training data (72,800 samples). In classification time, the SVM and KNN were much slower than with the lower-dimensional feat16 and pca20 data, but the DBN maintained a relatively constant speed regardless of input dimensionality. This result makes sense given that DBN classification consists almost exclusively of a series of matrix multiplications, and for a given model only the first matrix changes with the dimensionality of the input data.
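This near-constant classification cost can be seen in a minimal sketch of a DBN forward pass. The layer sizes, random weights, and 5-class output below are illustrative stand-ins, not our trained model; the point is that only the first weight matrix has a shape tied to the input dimensionality, so the work beyond the first layer is identical for feat16, pca20, and raw256 inputs:

```python
import numpy as np

def dbn_forward(x, weights, biases):
    """Forward pass of a trained DBN used as a classifier: a chain of
    matrix multiplies followed by sigmoids. Only weights[0] depends on
    the input dimensionality."""
    h = x
    for W, b in zip(weights, biases):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h

rng = np.random.default_rng(0)
for d_in in (16, 20, 256):            # feat16, pca20, raw256
    sizes = [d_in, 500, 500, 5]       # hidden sizes are hypothetical
    Ws = [0.01 * rng.normal(size=(a, b)) for a, b in zip(sizes, sizes[1:])]
    bs = [np.zeros(b) for b in sizes[1:]]
    out = dbn_forward(rng.normal(size=(1, d_in)), Ws, bs)
    # The deeper layers cost the same regardless of d_in.
    assert out.shape == (1, 5)
```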
In the DBN anomaly measurement experiment, we showed with the Wilcoxon–Mann–Whitney test that the raw data gives significantly better separation between the background and non-background (anomaly) classes than the features data. This result, combined with the previous finding that raw data seems on average no worse for the classification task, indicates that using raw data instead of features for EEG processing may offer advantages in both methodological elegance and anomaly measurement.
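As a sketch of how such a comparison is run, the Wilcoxon–Mann–Whitney (Mann–Whitney U) test can be applied directly to the two groups of anomaly scores. The scores below are synthetic stand-ins, not our data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Synthetic anomaly scores: if separation is good, background seconds
# should score stochastically lower than non-background (anomaly) seconds.
background = rng.normal(loc=0.0, scale=1.0, size=500)
anomaly = rng.normal(loc=1.0, scale=1.0, size=500)

u, p = mannwhitneyu(background, anomaly, alternative="less")
print(f"U = {u:.0f}, p = {p:.3g}")  # small p => background scores are lower
```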
4.3. Novel multi-class EEG waveform classification
We have focused throughout this study on using different classifiers and different data representations for our EEG waveform classification task. We note that, to the best of our knowledge, no other classification paradigm exists for the 5 clinically significant EEG waveform classes we used in this study. The most recent variant [2] of the commonly used spike-detection algorithm originally proposed by Gotman et al. [24] has a reported recall range of 0.09–0.34 and a false positive rate range of 4.2–48.6 per hour [25]. While the authors of that study do not report their results in terms of F1, we believe that our DBN classifier on the raw256 data, which has 0.2271 recall and 0.1920 precision on the spike & sharp waves class, is competitive with this existing clinical standard, especially since the DBN also detects three other types of clinically important signals at the same time. To our knowledge, two of the classes we detect, GPEDs & triphasic and PLEDs, have never before been examined for automated classification despite their published significance and use in clinical EEG review. Part of this study’s novelty also stems from its being the first to perform automated waveform classification and anomaly measurement in continuous EEG of critically ill patients. Other small-sample studies have restricted themselves to seizure detection and changes in time-frequency parameters of EEG [26].
4.4. Limitations and future work
We did not thoroughly examine the training time of the different models in this paper, but it deserves mention. Training the KNN and SVM models took anywhere from a few hours to a few days, depending on the dataset; training the DBNs took anywhere from a few days to more than a week. DBN training time is an important consideration when applying DBNs in practical clinical settings such as the ICU. Since the waveforms of interest in the ICU, such as epileptiform discharges and frontal slowing, can be very similar from patient to patient, training on one representative dataset, once, would likely be adequate for application to many patients. In cases where the detector must be trained to a rare pattern in one individual, the incremental training time will be much lower because the model has already been initialized, avoiding the first two training stages that take up the bulk of the total training time. These techniques were not investigated in this study, but they are fertile areas for future work. The field of DBN learning is also still very young, and the training regime remains largely unoptimized. With optimization improvements and the use of relatively cheap GPU processors (which have already been shown to decrease RBM training time by up to a factor of 72 [27]), RBM and DBN training times will certainly decrease.
Each of the four classification algorithms has variants that may offer faster test times given certain assumptions about the data. We could not implement and test all of these in our initial experiments but plan to explore them in future ones.
Epileptiform discharges like GPEDs and PLEDs are usually characterized by their spatial patterns across one or both sides of the brain. In this initial work, we explored how well classifiers can separate these signals using only information from one second of one channel and found that they distinguish them reasonably well. Nevertheless, we are confident that incorporating more information in space (channels) and time (more than one second) would improve classification performance. We are currently pursuing this next iteration.
This study finds that DBNs and raw data may be most appropriate for both online and retrospective clinical EEG monitoring and data mining tasks, but we need to apply these methods in the clinical setting to truly understand their classification performance and test time requirements. Our future work will involve training these models on a larger patient population as well as incorporating more of the EEG’s spatial and temporal context.
While our results indicate that DBN computational performance is sufficient for a “real” clinical context, more research will be needed to optimize this process. Using raw data as input to the DBN makes incorporating new patient data fairly easy, as little preprocessing is necessary. For applications on extracted features, computational overhead will increase with the number and/or complexity of the features. Regarding supervision, for some applications a single training set may be applicable to multiple patients, though rare patient-specific patterns may require more individualized training, which will itself require some optimization. We will also explore online DBN learning, in which new, unlabeled patient data can be incorporated into the DBN model. One common online learning paradigm is to use the existing classifier to assign labels to new data and then use the samples labeled with high confidence to further train the model. Finally, we expect to work on more specific applications of this method that require fine-tuned sensitivity and precision targets. We expect to work closely with practicing clinicians to better understand and test the requirements of these systems.
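The self-training paradigm just described can be sketched as follows. The function names and toy model here are hypothetical, and a real implementation would further train the DBN on the kept samples rather than stop at selecting them:

```python
import numpy as np

def self_training_step(model_predict, unlabeled, threshold=0.95):
    """One round of self-training: label new unlabeled data with the
    current model and keep only high-confidence samples (and their
    pseudo-labels) for further training. `model_predict` returns class
    posteriors; all names here are illustrative."""
    probs = model_predict(unlabeled)
    conf = probs.max(axis=1)
    keep = conf >= threshold
    pseudo_labels = probs.argmax(axis=1)
    return unlabeled[keep], pseudo_labels[keep]

# Toy stand-in model: confident about class 1 for strongly positive inputs,
# confident about class 0 for strongly negative ones, unsure near zero.
def toy_predict(x):
    p1 = 1.0 / (1.0 + np.exp(-5.0 * x[:, 0]))
    return np.column_stack([1.0 - p1, p1])

x_new = np.array([[2.0], [0.1], [-3.0]])
x_keep, y_keep = self_training_step(toy_predict, x_new)
# The ambiguous middle sample (0.1) is dropped; the confident ones are kept.
```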