The distribution of clinical features from the SHHS database
The feature set we considered from the SHHS database included routinely available clinical features [age, sex, race, BMI, systolic and diastolic blood pressure (SBP and DBP), coronary artery disease (CAD; defined by a history of angina or myocardial infarction), diabetes (DM) and ESS] and PSG features [AHI, RDI, total sleep time (TST), sleep efficiency, percentage of stage N1, N2, N3 and REM sleep, arousal index (total, as well as stage-specific: NREM versus REM sleep), percentage of the night below 90% oxygen saturation, and oxygen nadir in REM and NREM sleep] (). In addition, we considered novel ECG-spectrographic features that characterize sleep architecture and sleep-disordered breathing by the dominant frequency of cardiopulmonary coupling, consisting of a combination of autonomic heart rate variability and respiration-related changes in R-wave amplitude (Thomas et al., 2005). The ECG-spectrogram allows sleep to be categorized according to the amount of time spent in states of high-frequency cardiopulmonary coupling (HFC; associated with stable respiratory rate and tidal volumes in NREM sleep), low-frequency coupling (LFC; associated with fluctuations in respiratory rate and tidal volumes), elevated low-frequency coupling (e-LFC; associated with apneas and hypopneas) and very-low-frequency coupling (VLFC; associated with fluctuations characteristic of wake and of REM sleep) (). To use the ECG-spectrographic features, we restricted analysis to the subjects in the SHHS who had an adequate ECG signal (see Materials and methods), as reported previously (Thomas et al., 2009). Thus, a total of 27 features were considered, two of which were chosen as targets for algorithm classification: AHI and ESS.
Sleep Heart Health Study (SHHS) subject features and abbreviations
We divided the features into three categories (clinical, PSG and spectrographic) to represent types of data that might be considered when making clinical predictions. For example, the problem of predicting sleep apnea is mainly relevant when considering non-PSG features, as obtaining the PSG itself includes routine measurement of AHI, so making an AHI prediction based upon, for instance, sleep stages, would be of mainly academic interest. However, being able to predict AHI based on purely clinical features or a simple single-lead ECG would have potential practical utility in screening or risk stratification. Similarly, predicting ESS based on physiology is of interest for mechanistic reasons; however, predicting it based on clinical features is less useful because the ESS is itself obtained easily in routine clinical contact.
shows the distribution of values for routine clinical features. Binned histograms of the continuous variables demonstrated non-Gaussian distributions in each case within this data set. For each continuous variable, we found statistically significant deviation from normality by two tests (the Kolmogorov–Smirnov (KS) normality test and the D’Agostino–Pearson normality test). A minority of subjects had missing values for at least one of the 27 features, so we further restricted the data set for classification to subjects with complete data. To assess whether subjects with missing data differed systematically from those with complete data (n = 4647), each histogram is overlaid with the distribution of the ‘missing data’ subset (n = 652). In each case, subjects with missing data had distributions similar to those with complete data, such that removing this subset from the classification approach is unlikely to confound the results.
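The normality testing described above can be sketched as follows. This is an illustrative example on simulated, right-skewed "BMI-like" data, not SHHS values; applying the KS test to standardized values against a standard normal is one common variant of the procedure.

```python
# Illustrative normality testing on simulated, right-skewed "BMI-like" data
# (not SHHS values), using the two tests named in the text.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
bmi = rng.lognormal(mean=3.3, sigma=0.3, size=5000)  # right-skewed by construction

# Kolmogorov-Smirnov test of the standardized values against N(0, 1)
ks_stat, ks_p = stats.kstest((bmi - bmi.mean()) / bmi.std(), "norm")
# D'Agostino-Pearson omnibus test (combines skewness and kurtosis)
dp_stat, dp_p = stats.normaltest(bmi)

print(f"KS p = {ks_p:.3g}, D'Agostino-Pearson p = {dp_p:.3g}")
# small p-values reject normality for this skewed sample
```

Either test alone suffices to reject normality here; reporting both, as the authors do, guards against the differing sensitivities of the two statistics.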
Correlation analysis of clinical features in the SHHS database
We calculated non-parametric correlations (Spearman’s rank method) between individual clinical, PSG and ECG features (discrete/continuous metrics only; n = 22) and the two variables of interest, ESS and AHI. Small but significant correlations were found between most features and the ESS () and AHI (). The strongest correlations with ESS were small; the only features with |r| ≥ 0.1 were the AHI (0.13), the RDI (0.1), the BMI (0.1), the percentage of time with <90% oxygen saturation (0.1), the REM oxygen nadir (−0.1) and the NREM oxygen nadir (−0.11). The AHI, in contrast, correlated more strongly with several features, all of which had |r| > 0.1 (). Scatterplots are shown for two commonly associated feature pairs in (ESS versus AHI) and (AHI versus BMI). These plots illustrate the variability in the data, consistent with the modest correlations, and suggest the potential utility of more sophisticated classification methods to capture patterns and relationships between the features and the endpoints of ESS and AHI.
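A minimal sketch of this correlation analysis, using simulated AHI- and BMI-like variables; the weak coupling and large noise terms are assumptions chosen to mimic the modest r-values reported here.

```python
# Spearman rank correlation on simulated data: a weak BMI-AHI relationship
# buried in large inter-subject variability (illustrative, not SHHS data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
bmi = rng.normal(28, 5, size=n)
# AHI rises weakly with BMI, with substantial unexplained variance
ahi = np.maximum(0.0, 0.8 * (bmi - 28) + rng.normal(10, 12, size=n))

r, p = stats.spearmanr(bmi, ahi)
print(f"Spearman r = {r:.2f}, p = {p:.3g}")  # small but highly significant r
```

Note how a large sample makes even a weak correlation highly significant, which is why the text emphasizes effect size (|r|) rather than p-values alone.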
Figure 2 Spearman’s correlation of clinical features with Epworth Sleepiness Scale (ESS) and apnea–hypopnea index (AHI). R-values for correlation between ESS (a) and AHI (b) are shown for the listed clinical features. Scatterplots are shown for ESS versus AHI and for AHI versus BMI.
Naive Bayes performance in sleepiness classification
A naive Bayes classifier assigns the test data points (subjects) in question based on the probability of each feature of that subject occurring in a given class (in this study, defined by AHI and ESS score). In other words, each feature may be considered to have a sensitivity and specificity with regard to a given class membership. Thus, the combination of sensitivity, specificity and prior probability of class membership is used by the algorithm according to Bayes’ theorem. The ‘naive’ aspect refers to the fact that the algorithm assumes that the features are independent of each other with regard to class association. This assumption may not reflect the clinical reality of interactions among patient features; we address this below, by comparing the performance of a non-naive Bayes classifier algorithm.
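The mechanics can be made concrete with a minimal Gaussian naive Bayes sketch on synthetic data (the three features and the class rule are invented for illustration): each feature contributes an independent class-conditional likelihood, combined with the class prior via Bayes’ theorem.

```python
# Gaussian naive Bayes on synthetic data: class-conditional likelihoods for
# each feature are modeled independently and combined with the class prior.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 3000
X = rng.normal(size=(n, 3))                 # three standardized "features"
# binary class driven mostly by the first feature, plus noise
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1.5, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"held-out accuracy = {acc:.2f}")
```

The independence assumption is what makes the model "naive": the joint likelihood is simply the product of per-feature likelihoods, which is exactly the assumption examined later with the non-naive variant.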
We first attempted to classify ESS using the naive Bayes classifier, based on either PSG features or ECG spectrographic features. ESS values were dichotomized into normal (0–10) or abnormal (11–24) according to typical clinical criteria. Algorithm performance is shown in the form of a ‘confusion matrix’, which is of similar structure to the familiar dichotomous 2×2 box illustrating sensitivity, specificity and predictive value of diagnostic test performance. Sensitivity and specificity values determine the positive and negative likelihood ratios, according to the following equations: LR(+) = sensitivity/(100 – specificity) and LR(−) = (100 – sensitivity)/specificity.
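The two likelihood-ratio equations can be checked directly; the numbers plugged in below are the PSG-based ESS sensitivity and specificity reported in the next paragraph.

```python
# Likelihood ratios from sensitivity and specificity (percent scale),
# exactly as defined in the text.
def lr_positive(sens_pct, spec_pct):
    """LR(+) = sensitivity / (100 - specificity)."""
    return sens_pct / (100.0 - spec_pct)

def lr_negative(sens_pct, spec_pct):
    """LR(-) = (100 - sensitivity) / specificity."""
    return (100.0 - sens_pct) / spec_pct

# PSG-based ESS classification: sensitivity 16.7%, specificity 88.8%
print(round(lr_positive(16.7, 88.8), 2))  # 1.49
print(round(lr_negative(16.7, 88.8), 2))  # 0.94
```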
The classification accuracy for ESS class was poor whether the classifier was trained on PSG or spectrogram data (). Using PSG data, the sensitivity for detecting abnormal ESS was only 16.7%, while the specificity was 88.8%, corresponding to an LR(+) of 1.49 and an LR(−) of 0.94. Using the ECG data, sensitivity was lower at 5.3% and specificity higher at 96.5%, but the LR values were still poor, with an LR(+) of 1.51 and an LR(−) of 0.98. The closer the LR values are to 1, the smaller the change in disease probability after obtaining the test result, according to Bayes’ theorem. Finally, we tested whether all available information (26 features) would improve the ESS classification (), but the results were similar to those obtained with clinical features only ().
Figure 3 Performance of the naive Bayes classifier in predicting Epworth Sleepiness Scale (ESS) and apnea–hypopnea index (AHI) classes. The prediction power of the naive Bayes classifier algorithm is shown for ESS based on polysomnogram (PSG) features …
Interpreting the confusion matrices requires consideration of the proportion of subjects in each class, that is, the prior probability or prevalence. In the SHHS population used for this analysis, the prevalence of abnormal ESS was approximately 25%, so the PPV of the algorithm (approximately 34–36%) represents only a small improvement over this prevalence value. The NPV was nearly identical to the prevalence of normal ESS, as expected when the sensitivity is so low and the LR(−) is so close to 1. In other words, these LR values indicated that the classification algorithm, viewed as a diagnostic test, yields little information beyond that contained in the prior probabilities. The classification performance was only marginally better when we markedly shifted the cutoff for abnormal ESS (for example, using 0–1 as normal, or using 0–19 as normal), suggesting that the poor performance was not simply attributable to the clinical definition of normal ESS as 0–10 (data not shown). Finally, to assess the possibility that features were not independent (as assumed by the algorithm), we tested the non-naive analog of this classifier for ECG and clinical features. The classifier, however, reduced to the naive case, indicating that the learning algorithm could not find statistical evidence of dependence between any of the features. Although this does not mean that dependencies do not exist (for example, there are known dependencies between AHI and RDI), it suggests that combining features does not significantly improve the performance of the classification algorithm.
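The role of the prior can be made explicit with the standard odds form of Bayes’ theorem (a generic helper, not the authors’ code): pre-test odds multiplied by the LR give the post-test odds.

```python
# Post-test probability from prevalence and a likelihood ratio, via odds.
def posttest_prob(prevalence, lr):
    pre_odds = prevalence / (1.0 - prevalence)
    post_odds = pre_odds * lr
    return post_odds / (1.0 + post_odds)

# ~25% prior for abnormal ESS; LR values from the PSG-based classifier
print(round(posttest_prob(0.25, 1.49), 3))  # 0.332: barely above the prior
print(round(posttest_prob(0.25, 0.94), 3))  # 0.239: barely below the prior
```

With LR values this close to 1, a positive or negative classification shifts the 25% prior by well under ten percentage points, which is the quantitative content of the statement that the classifier adds little beyond the prior.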
The classification performance was worse using the k-NN algorithm, with tested k values of 1, 3, 5 and 10 (data not shown). The poor classification accuracy for ESS with both algorithms could be because the features truly do not predict sleepiness class, because the ESS is a poor marker of sleepiness, or both. For example, the eight questions of the ESS are equally weighted, although falling asleep while driving is likely a more substantial indicator of sleepiness than falling asleep after lying down in the afternoon explicitly to rest.
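The k-NN comparison can be sketched in the same way, sweeping the k values the authors tested; the data and task below are invented for illustration.

```python
# k-nearest-neighbors over the k values tested in the text (1, 3, 5, 10),
# on synthetic data; only the first of three features is informative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] + rng.normal(0, 1.5, 2000) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

accs = {}
for k in (1, 3, 5, 10):
    accs[k] = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(f"k={k}: held-out accuracy {accs[k]:.2f}")
```

Because k-NN weights all feature dimensions equally when computing distances, uninformative features dilute the neighborhoods, one plausible reason it underperformed the Bayes classifier here.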
Naive Bayes performance in classifying apnea severity
We next turned to prediction of an objectively defined metric, the AHI, using clinical features and ECG features. We dichotomized all subjects into normal (0–5) or abnormal (>5), based on the typical clinical criteria for diagnosing OSA. AHI classification was performed based on clinical features () or ECG features (). Using clinical features, the sensitivity for AHI >5 was 57.5% and specificity was 73.7%, corresponding to LR(+) 2.19 and LR(−) 0.58. Using ECG features, the sensitivity was lower at 39.0%, with a higher specificity of 82.7%, corresponding to LR(+) 2.25 and LR(−) 0.74.
Although these results demonstrated improved prediction of the more objective endpoint of AHI compared with the subjective ESS, the LR values are still close to 1, and thus only modestly adjust disease probability. For example, in the subjects studied here, the prevalence of AHI >5 was approximately 45%, such that the PPV using either clinical- or ECG-based classifiers was approximately 65%. The NPV was approximately 68% based on clinical features and approximately 62% based on ECG features.
Classification did not improve substantially when four AHI groups were considered (0–5, 5–15, 15–30 and >30), nor when individual cutoffs such as <30 versus >30 were used (data not shown). As in the case of ESS classification, the k-NN algorithm performed worse for AHI classification, for k values of 1, 3, 5 and 10 (data not shown).
Finally, we performed AHI classification using a combination of clinical, spectrographic and PSG features unrelated to sleep-disordered breathing. We excluded the PSG metrics of RDI, low-oxygen values and arousal indices because they are intimately tied to the calculation of AHI itself, and may thus falsely elevate the classification accuracy for trivial reasons. Using these combined data, the sensitivity for AHI >5 was 56.0% and the specificity was 77.4%, essentially unchanged from the classification based on clinical features ().
Predicting sleepiness and sleep apnea with an SVM classifier
We next turned to the SVM classification technique. SVM considers the distribution of class features in multi-dimensional space and designates a ‘hyperplane’ that best separates the classes of interest on the basis of their features. We used a radial basis function kernel, which flexibly accommodates non-linear feature relationships. ESS classification was poor with this method, and across a range of parameter combinations the classification defaulted to the prior probability: that is, the algorithm assigned all subjects to the normal 0–10 class, which was the most prevalent.
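A minimal sketch of an RBF-kernel SVM on a deliberately non-linear problem (a ring-shaped class rule that no linear boundary could learn); the data and hyperparameters are illustrative, not the authors’ tuned setup.

```python
# RBF-kernel SVM on a synthetic non-linear problem: the class depends on the
# radius in the first two features, so a curved boundary is required.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 3))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2.0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
print(f"RBF-SVM held-out accuracy: {acc:.2f}")
```

Feature scaling before the kernel is standard practice, since the RBF kernel is distance-based and otherwise dominated by the largest-magnitude features.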
Classification of AHI, in contrast, was much better, performing similarly to the naive Bayes classifier () for clinical and spectrographic features. Using clinical features, the sensitivity was 59.0% and the specificity 74.5%, with a PPV of 65.4% and an NPV of 69.0% (); the corresponding LR(+) was 2.3 and LR(−) 0.55. Using spectrographic features, the sensitivity was 43.4% and the specificity 83.5%, with a PPV of 68.3% and an NPV of 64.4% (); the corresponding LR(+) was 2.6 and LR(−) 0.68. Combining these features with non-respiration PSG features yielded a sensitivity of 62.3% and a specificity of 78.3%, with a PPV of 70.1% and an NPV of 71.7%; the corresponding LR(+) was 2.9 and LR(−) 0.48. The combined data performed slightly better than the naive Bayes classifier.
Figure 4 Performance of the support vector machine (SVM) classifier in predicting apnea–hypopnea index (AHI) class. The prediction power of the SVM algorithm is shown for AHI based on clinical features (a), electrocardiogram (ECG) features (b) or a …
The weights assigned to the features used in the ‘combined’ data set are shown in . The relative importance of each feature is in general agreement with clinical expectation and with the non-parametric correlation data shown in . For example, the e-LFC, BMI and LFC features received the largest weights in the algorithm. We note again that although the ‘combined’ data improved the classification, the aim of classification is to perform well without the need for PSG features (which already include the AHI in routine practice); thus, we suggest that the most practical results are those that classify based on easily obtained clinical or ECG features.
Mutual information approach to correlating clinical features with ESS and AHI
Finally, we undertook an information-theoretic approach to quantifying the relationship of various features with the ESS and the AHI. Mutual information is a powerful tool in this regard because it captures how much statistical information one variable provides about another. For the SHHS data, we thus used mutual information to determine the relationship between clinical, PSG and ECG features and the ESS or AHI. Mutual information is not limited to linear relationships, as Pearson’s correlation is, or to monotonic relationships, as the non-parametric Spearman’s rank correlation is. Instead, it captures any relationship (sometimes referred to as dependency) between the variables, without needing to know or specify what that relationship is. Because this calculation depends to some extent upon the number of bins used to categorize the ESS and AHI, we normalized the mutual information value (which, like entropy, is in units of bits) to the entropy of the ESS and AHI distributions themselves. In this way, values approaching zero indicate little or no shared information or dependency, while values approaching 1 indicate a high or exact degree of dependency (of any kind) between the two variables.
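The normalization step can be sketched as follows: estimate mutual information from a 2-D histogram (in bits) and divide by the entropy of the target’s binned distribution. The binning and simulated variables are illustrative; the non-monotonic example (x tracking |y|) shows a dependency that rank correlation would miss entirely.

```python
# Normalized mutual information: MI(x, y) in bits, divided by H(y), so values
# near 0 mean no dependency and values near 1 mean y is nearly determined by x.
import numpy as np

def normalized_mi(x, y, bins=10):
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0
    mi = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    h_y = -np.sum(py[py > 0] * np.log2(py[py > 0]))
    return mi / h_y

rng = np.random.default_rng(5)
y = rng.normal(size=5000)
x_dep = np.abs(y) + rng.normal(0, 0.05, 5000)  # strong but non-monotonic link
x_ind = rng.normal(size=5000)                  # independent of y

print(f"dependent pair:   {normalized_mi(x_dep, y):.2f}")  # well above zero
print(f"independent pair: {normalized_mi(x_ind, y):.2f}")  # near zero
```

Note that finite-sample histogram estimates of MI carry a small positive bias, so the "independent" value is slightly above zero rather than exactly zero.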
shows the normalized mutual information between ESS and multiple discrete/continuous features. In every case the value was close to zero, and always <0.05, indicating very little shared dependency, consistent with the poor performance of the classification algorithms for predicting ESS. Note that even the AHI has a nearly zero value, emphasizing the exceedingly small statistical dependency between the AHI and ESS. shows the normalized mutual information between the features and the AHI. As expected, several features showed some degree of dependency. For example, the arousal index (whether total, in NREM or in REM sleep) showed a small relationship with AHI. The RDI and the oxygenation metrics also showed dependency, as expected, because the RDI depends in part on the AHI, whose scoring incorporates oxygen values. Finally, the ECG features showed a relationship with AHI. This is also expected, as the ECG-spectrogram (in particular, e-LFC) has been associated with apnea severity (Thomas et al., 2009).
Figure 5 Mutual information between various features and Epworth Sleepiness Scale (ESS) or apnea–hypopnea index (AHI). The normalized mutual information is shown for various discrete/continuous features compared with ESS (a) and AHI (b). Categorical features …