The results of this study suggest that, given a properly chosen threshold, our new automated method performs better than the benchmark method (Staba et al., 2002), and as well as human experts, at identifying gamma-band HFOs in clinical IEEG. Additionally, it is clear from our experiments that human experts are only marginally consistent at visually identifying HFO events, primarily due to the large number of false negative errors they make, but are very good at labeling candidate HFOs when presented with short EEG segments for review. While these results arise from an analysis of just two patients, over 1300 events were studied: we feel that the performance of both human and automated methods is unlikely to improve simply by “adding more patients.” These points support two central motivations for this study: (1) to validate our new method, in preparation for deploying it to track the spatial and temporal evolution of multi-scale oscillations during seizure generation; and (2) to present important concepts and techniques for benchmarking automated algorithms for tracking physiological oscillations in human recordings. As more algorithms are developed to track similar, multi-scale, signature events in normal and pathological brain activity, similar validation against human experts will be required.
The discrepancy between the high-bandwidth recordings previously analyzed (Staba et al., 2002) and the clinical-bandwidth data available for this study implies that direct comparisons between the methods are not exact. In addition, the reference implementation is only approximate, as discussed in section 2.2. In particular, the number of samples per event for clinical data sampled at 200 Hz (~17–75 samples) is smaller than previously used (Staba et al., 2002). This makes confirmation of candidate HFO events more difficult, whether by spectral estimation or by duration thresholding. For the bandwidth considered in this study (0.1–85 Hz), however, comparison of the automated detector implementations (X, Y1) (cf. 3.1) shows that X performs best in both specificity and sensitivity when compared against a ground truth of human consensus markings. We attribute this to the three algorithm improvements we introduced: nonparametric thresholding, spectral equalization, and an alternative energy measure. A more complete comparison of the X and Y methods over a broad parameter range is still necessary, however, to establish their relative performance unequivocally.
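Event-wise sensitivity and specificity against consensus markings can be computed by interval matching. The sketch below is illustrative only (the helper names and the overlap-counts-as-match rule are our assumptions, not the paper's implementation); because there is no natural count of "true negative" events, an event-level precision is shown in place of specificity.

```python
# Hypothetical sketch: event-level sensitivity and precision for an interval
# detector compared against consensus markings. Interval endpoints are in
# seconds; any overlap between two intervals counts as a match.

def intervals_overlap(a, b):
    """True if half-open intervals (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def event_metrics(detected, consensus):
    """Return (sensitivity, precision) for detected vs consensus events."""
    # Consensus events recovered by at least one detection.
    true_pos = sum(any(intervals_overlap(d, c) for d in detected) for c in consensus)
    # Detections that match at least one consensus event.
    matched = sum(any(intervals_overlap(d, c) for c in consensus) for d in detected)
    sensitivity = true_pos / len(consensus) if consensus else 0.0
    precision = matched / len(detected) if detected else 0.0
    return sensitivity, precision

detected = [(0.1, 0.3), (1.0, 1.2), (2.5, 2.7)]
consensus = [(0.15, 0.25), (2.55, 2.65)]
print(event_metrics(detected, consensus))  # sensitivity 1.0, precision 2/3
```

The unmatched detection at (1.0, 1.2) is exactly the kind of "extra" event, discussed below, that may be either a detector false positive or a human false negative.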
Clearly, the proper choice of threshold is the most important factor in automated detector performance, while spectral equalization is necessary to compensate for the spectral rolloff present in clinical EEG. For both patient data sets in this study, the 97.5th percentile was an acceptable threshold that was easily determined by visual inspection of the cdf of line-length values on a small training set. As noted previously, neither energy nor line-length values were normally distributed; hence our nonparametric thresholding strategy appeared more suitable than other proposed methods (e.g., mean-plus-standard-deviation rules (Staba et al., 2002)). However, it is unlikely that a single quantile threshold will perform optimally across multi-patient data sets. Ultimately, a better method of threshold calibration will be required for fully automated HFO detection. We also note that this threshold-selection problem persists for more computationally expensive detection methods (e.g., time-frequency methods such as matching pursuit), and is non-trivial not because of the nature of the HFO events themselves (simple oscillations), but because of the diversity of the background EEG with which they coincide.
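The nonparametric rule described above (line length thresholded at the 97.5th percentile of its training distribution) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the window length, synthetic training signal, and injected 60 Hz test oscillation are all stand-ins of our own choosing.

```python
import numpy as np

def line_length(x, win):
    """Sliding-window line length: sum of |x[n] - x[n-1]| over each window."""
    diffs = np.abs(np.diff(x))
    return np.convolve(diffs, np.ones(win), mode="same")

rng = np.random.default_rng(0)

# Stand-in for band-passed training EEG; 20 samples = 100 ms at 200 Hz.
train = rng.normal(size=20000)
ll_train = line_length(train, win=20)

# Nonparametric threshold: read the 97.5th percentile off the empirical cdf.
threshold = np.percentile(ll_train, 97.5)

# Test segment with an injected 60 Hz (gamma-band) oscillation at 5.0-5.25 s.
test = rng.normal(size=2000)
test[1000:1050] += 3.0 * np.sin(2 * np.pi * 60 * np.arange(50) / 200.0)

candidates = line_length(test, win=20) > threshold
print(int(candidates.sum()))  # number of super-threshold samples (detection mask)
```

Because the threshold is a quantile rather than a mean-plus-standard-deviation rule, it makes no normality assumption about the line-length distribution, consistent with the observation above that these values are not normally distributed.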
Automated vs. Human Performance
The automated detector, X, clearly detected the majority (89.7%) of unanimous human consensus set events, but it also produced nearly twice as many “extra” detections. This raised an important question about the performance of the automated algorithm: were the majority of unlabeled, extra events due to errors made by the human experts during identification (i.e., false negatives), or errors made by the automated detector (i.e., false positives)? Only if the former were confirmed would we consider the automated algorithm acceptable. The results suggest that this was indeed the case, and that the automated algorithm performs at least as well as human experts.
Additional insights into human versus automated performance are also apparent: it is clear that the majority of human errors are false negatives. In retrospect, this seems plausible: fatigue, lapses in vigilance, and distractions make it easy to miss HFOs during marking tasks. Surprisingly, it also appears that false negatives are the most serious error made by the automated detector: the sensitivity of the detector for patient 2 was simply too low, resulting in many false negative errors. This reinforces our previous conclusion that threshold selection (especially in a patient-specific manner) is the key determinant of automated detector performance.
The results of our experiments suggest that human performance on a marking task is not consistent across the full range of HFOs, and is only marginally reproducible. Based on the experts' EEG-reading experience prior to this study (e.g., spike detection, seizure onset localization, and other event identification), we anticipated some degree of poor performance; however, the results were worse than expected. Humans miss many events (false negatives) when asked to visually identify HFOs in EEG. However, when an expert does identify an HFO, other experts are likely to agree with that marking upon review. This highlights the finding that human experts are very consistent (and accurate) when presented with short clips (1–2 s) of candidate EEG to confirm HFOs. We attribute this asymmetry of performance between identification and review to the different vigilance requirements of each task.
One major implication of the poor human performance is that determining a ground truth data set (i.e., a “gold standard”) is difficult. In the limit, it is likely that requiring unanimous agreement among EEG readers would result in an empty ground truth data set. In our case, the unanimous consensus data set accounted for only 21% of all events identified, while review consensus was 67%. Human ground truth data are therefore imprecise, which makes the validation of any detection algorithm difficult. We believe one possible improvement to ground truth determination is to implement subjective event scoring during the identification task (e.g., rating each HFO on a scale of 1–10). By scoring each event, the expert qualitatively states the degree of “HFO-ness” observed, providing an indication of their internal threshold for marking events. We hypothesize that changes in this internal threshold over the course of a marking session, in addition to changes in vigilance, are a major contributor to false negative errors and a confounder of reproducibility. We note, however, that such a scoring system requires much more effort and time from the marker, and may not provide a clear indication of how to form a ground truth set.
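The distinction between unanimous and partial consensus sets can be made concrete by interval matching across raters. The sketch below is a hypothetical construction of our own (function names, the overlap-merging rule, and the example markings are all illustrative): an event enters the consensus set at level k if markings from at least k raters overlap it.

```python
# Hypothetical sketch of forming consensus sets from several experts' event
# markings. Events are (start, end) intervals in seconds; two markings are
# considered the same event if they overlap in time.

def overlaps(a, b):
    """True if half-open intervals (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def consensus(markings, min_raters):
    """Events marked (with overlap) by at least `min_raters` of the raters."""
    kept = []
    for seed in (ev for rater in markings for ev in rater):
        count = sum(any(overlaps(seed, ev) for ev in rater) for rater in markings)
        # Keep the first representative of each group of overlapping markings.
        if count >= min_raters and not any(overlaps(seed, k) for k in kept):
            kept.append(seed)
    return kept

raters = [
    [(0.1, 0.3), (1.0, 1.2)],
    [(0.12, 0.28), (2.0, 2.2)],
    [(0.11, 0.29), (1.05, 1.15)],
]
print(consensus(raters, min_raters=3))  # unanimous: [(0.1, 0.3)]
print(consensus(raters, min_raters=2))  # majority: [(0.1, 0.3), (1.0, 1.2)]
```

As the example shows, raising the required agreement from two raters to all three shrinks the consensus set, mirroring the drop from 67% review consensus to 21% unanimous consensus reported above.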
In addition to improving algorithm precision and speed, experiments are underway within our group to use this and similar automated approaches to track important physiological oscillations in normal and pathological recordings of neuronal activity. While the automated detector, X, performed well in our study, a method for calibrating the threshold on a per-patient basis is required before large patient databases can be analyzed. An acceptable calibration technique may be semi-supervised, with human experts reviewing sampled candidate events identified by the automated detector to determine the “best” threshold. The method presented in this paper should also be benchmarked more exhaustively against a larger patient database to study its performance against a greater range of background EEG. In particular, it would be interesting to characterize the sensitivity of this detector to spikes and other non-HFO epileptiform activity, since basic bandpass energy-based detection strategies operate as weak spike detectors. Finally, mapping events detected from multichannel, multi-patient databases will allow characterization of the spatiotemporal distribution of gamma-band HFOs, and of their spectral characteristics, in patients with epilepsy. Improved automated HFO detection tools will facilitate the exploration of associations between HFO location, ictal onset zone, and pathological tissue.
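The semi-supervised calibration idea could be sketched as a simple sweep over candidate quantile thresholds, scored against expert review of the resulting detections. Everything below is an assumed illustration, not a method from the paper: the expert labels are simulated, the feature values are synthetic, and F1 is just one plausible selection criterion.

```python
import numpy as np

def f1(tp, fp, fn):
    """F1 score from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def calibrate(feature, expert_labels, quantiles):
    """Pick the quantile whose threshold maximizes F1 against expert labels."""
    best_q, best_f1 = None, -1.0
    for q in quantiles:
        detected = feature > np.percentile(feature, q)
        tp = int(np.sum(detected & expert_labels))
        fp = int(np.sum(detected & ~expert_labels))
        fn = int(np.sum(~detected & expert_labels))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_q, best_f1 = q, score
    return best_q, best_f1

rng = np.random.default_rng(1)
feature = rng.exponential(size=1000)        # skewed, like line-length values
# Simulated expert review: pretend experts confirm exactly the top 4% of events.
labels = feature > np.percentile(feature, 96)

q, score = calibrate(feature, labels, quantiles=[90, 95, 96, 97.5, 99])
print(q, round(score, 2))  # → 96 1.0
```

In practice the expert labels would come from reviewing short clips of sampled candidate events, the task the results above show humans perform well, rather than from a synthetic rule as here.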