COPD patients are burdened with a daily risk of acute exacerbation and loss of control, which could be mitigated by effective, on-demand decision support tools. In this study, we present a machine learning-based strategy for early detection of exacerbations and subsequent triage. Our application uses physician opinion in a statistically and clinically comprehensive set of patient cases to train a supervised prediction algorithm. The accuracy of the model is assessed against a panel of physicians each triaging identical cases in a representative patient validation set. Our results show that algorithm accuracy and safety indicators surpass all individual pulmonologists in both identifying exacerbations and predicting the consensus triage in a 101-case validation set. The algorithm is also the top performer in sensitivity, specificity, and PPV when predicting a patient’s need for emergency care.
Chronic Obstructive Pulmonary Disease (COPD) is a serious long-term lung condition that progressively restricts airflow from the lungs and imposes a significant burden on patients’ daily lives. COPD includes a spectrum of pulmonary phenotypes with emphysema and chronic bronchitis being the two most prominent members. Flare-ups (or exacerbations) are a frequent trigger of physician and hospital visits that are both costly and distressing to patients. Moreover, exacerbations are associated with long-term declines in lung function and health status [1, 2]. A World Health Organization report anticipates that by 2030, COPD will become the third leading cause of mortality and the seventh leading cause of morbidity worldwide.
Despite the recognized impact of exacerbations on morbidity, mortality, and health status, there is no standardized clinical approach to improve self-identification of COPD exacerbations by patients at home. Perhaps the most widely used system is a physician-provided paper checklist or “action plan”. In these instances, patients are instructed to refer to a document when they are feeling concerned about their breathing [4, 5]. The document generally has green, yellow, and red zones, which guide patients to continue usual treatment, call a physician, or go to the emergency room if their symptoms match those designated in a particular zone [6, 7]. While the type of medical guidance offered in these checklists has demonstrated some utility in patient education [8, 9], the method of delivering that guidance through a hard-coded list lacks rigor, validation, and robustness at the level of the individual patient. It is not surprising, then, that urgent calls or visits to the emergency room may provide the fastest path to feedback, especially during hours when a doctor’s office is closed. In fact, COPD is one of the leading chronic conditions driving potentially avoidable hospital admissions. The need for novel solutions that limit the impact of exacerbations on patient health is abundantly apparent.
Both in COPD and many other chronic diseases, telemonitoring and mobile application-based tools have generated a great deal of excitement as novel, nonpharmacologic strategies to improve home-based disease management [12, 13]. In many cases, however, clinical examinations of such approaches have struggled to show statistically significant efficacy [14–16]. In COPD, one difficulty in enabling early diagnosis and potential treatment of exacerbation is the lack of specific predictive diagnostic criteria. For example, the American Thoracic Society (ATS) defines a COPD exacerbation as, “an event in the natural course of the disease characterized by a change in the patient’s baseline dyspnea, cough, and/or sputum and beyond normal day-to-day variations, that is acute in onset and may warrant a change in regular medication in a patient with underlying COPD”. This definition is highly ambiguous given the range, duration, and severity of possible COPD symptoms, which makes a definitive diagnosis of a COPD exacerbation challenging.
Compounding the issue of diagnosis is the inherent complexity in the interdependence of clinical features. For example, a rule-based system that dictates recommendations based on oxygen saturation or pulse may struggle to deliver appropriate guidance to a Gold Stage 1 COPD patient, who likely has normal baseline pulse and oxygen saturation in comparison to a Gold Stage 4 patient, who is likely to have abnormal baseline values. Moreover, the myriad of patient physiologic profiles within individual Gold stages confounds efforts to create effective nested rules systems. Thus, app-based solutions that simply mimic paper-based home action flowcharts are unlikely to result in important improvements in patient outcomes. Machine-learning methods have gained considerable attention as novel strategies for capturing the interdependence of health variables when making predictions of complex health events [19–21]. In this study, we developed one such approach to provide both at-home decision support and an assessment of the possibility of a disease flare-up to COPD patients. We started by performing a detailed literature search and conducting an expert opinion review to define key patient characteristics including demographics, comorbid conditions, history, symptoms and vital signs that are sufficiently and robustly predictive of exacerbation risk [22–31]. We used these variables to generate clinically diverse, simulated patient cases, and we asked physicians to provide their opinion on 1) the severity of the patient’s baseline health, vital signs, and current symptoms, 2) whether or not the patient was experiencing an exacerbation, and 3) the appropriate triage category for the patient. Physician-labeled data sets were used to train a supervised machine-learning algorithm that predicts the likelihood that a patient is having a COPD flare-up and provides guidance on the appropriate responsive action.
The algorithm feature set included a diverse mix of current and baseline health data. The model’s performance was validated by comparing its predictions to the consensus decision of a panel of physicians in an out-of-sample representative patient set. Analysis of the algorithm performance and the physician provided data showed 1) the algorithm showed exceptional performance when compared to individual pulmonologists in assessing the likelihood that a patient is experiencing an exacerbation and identifying the appropriate consensus triage, 2) the algorithm triaged in favor of the safety of the patient, when disagreeing with consensus, more often than individual physicians, and 3) the algorithm decision making was transparent and consistent when compared to participating pulmonologists.
Physician input was used to facilitate three major aspects of the algorithm development process:
All participating physicians were board certified pulmonologists and/or critical care specialists from both private and academic institutions. Refer to S1 Table for the profiles of the physicians and their respective roles in this study.
The most relevant patient symptoms, vital signs, and baseline characteristics in relationship to COPD triage were identified through a multi-tier process. First, a comprehensive literature review of common institutional practices, published guidelines, and COPD assessment tests [32–35], clinical predictors and prediction models of exacerbations [25, 29, 36, 37], and current COPD management applications [38–41] was carried out. Once selected, the features were put under consideration by a panel of three board certified pulmonologists and one critical care specialist. This panel scrutinized and modified the variable list based on consensus practice methodologies and clinical experience. Finally, the questions, responses, and measures of each variable were generated and reviewed for content, conciseness, and patient appropriate language.
The question and response list from the aforementioned process defines a space of possible patient cases. To create the optimum set of data for training and validation, a statistical experimental design was generated using the optFederov function from the R AlgDesign package. Each feature was modeled linearly. This method was applied to the profile variables and baseline vital signs to generate a diverse test set of 100 patient types. Once generated, the remaining symptom, current vital sign, and comorbidity features for each patient case were randomly selected in a Monte Carlo simulation based on known distributions and correlations in the literature [42–45] to create realistic patient scenarios.
The test set was shuffled and sent to a group of 6 pulmonologists to separately triage and assess the likelihood of an exacerbation. This set gave the physicians an opportunity to better assess the suite of patient health variables and provide feedback on the appropriateness of question language, completeness of clinical features, and realism of cases while actively triaging cases. The feedback from physicians was used to update the algorithm feature list and redesign a larger set of 2501 patient scenarios replete with baseline, vitals, and symptom data. In total, 101 cases were randomly selected for validation and 2400 cases were used for training.
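The hold-out step described above can be sketched as follows. This is a minimal illustration, assuming cases are identified by integer IDs and that the split is a simple random draw without replacement; the function name `split_cases` and the seed are hypothetical and not taken from the study.

```python
import random


def split_cases(case_ids, n_validation, seed=0):
    """Randomly hold out n_validation cases; the remainder is used for training."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    validation = set(rng.sample(case_ids, n_validation))
    training = [c for c in case_ids if c not in validation]
    return training, sorted(validation)


# 2501 simulated cases, of which 101 are held out for validation.
case_ids = list(range(2501))
training_ids, validation_ids = split_cases(case_ids, n_validation=101)
```

Removing the validation cases before any model fitting is what keeps them out-of-sample.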
Each of the 6 pulmonologists provided exacerbation and triage data in the training and validation datasets. An additional 3 pulmonologists contributed labels to the validation set. In particular, they provided,
The triage categories from which the physicians could choose were,
Data was sent to physicians in 100-case batches. Triage and exacerbation assessments were recorded in spreadsheets akin to the sample shown in S1 Spreadsheet. Cases that were used in the training were individually labeled by physicians, while cases used in the validation set included the opinion of all 9 previously mentioned physicians. The process is depicted in Fig 1.
The strategy used to find the optimal prediction model is shown in Fig 2. This process was identical both for predicting the presence of an exacerbation and predicting the appropriate triage recommendation. Initially, several candidate supervised learning classifiers were selected including support vector machines, logistic regression, Naive Bayes, KNN and a variety of gradient boosted and ensemble decision tree methods. For each classifier type, thousands of algorithms were trained on each combination of physician training data using Python’s Scikit-Learn suite. All algorithms went through a hyper-parameter optimization process including a grid search with 5-fold cross-validation. The top performing algorithms of each class were selected based on how they performed when making predictions on the out-of-sample validation set.
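A grid search with 5-fold cross-validation of the kind described here might look like the sketch below. The data are synthetic stand-ins generated with `make_classification`, and the hyper-parameter grid is an arbitrary small example; neither reflects the study's actual features or search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for physician-labeled cases with four triage classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=4, random_state=0
)

# Hold out a validation set before any tuning so it stays out-of-sample.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grid only; the study's grid is not specified in the text.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Accuracy of the refit best model on the held-out validation set.
validation_accuracy = search.score(X_val, y_val)
```

The same pattern applies to the other classifier families by swapping the estimator and its grid.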
Algorithm predictions were validated by comparing the algorithm’s triage and exacerbation (y/n) classifications to the consensus decision of a panel of physicians on 101 hypothetical patient cases. Each individual physician and the algorithm were tested for how often their particular recommendation for a patient case matched the majority opinion. In cases of ties, the more conservative medical decision (higher triage/exacerbation category) was accepted as the correct one. The performance of the algorithm was compared to the other physicians in three scenarios: 1) The algorithm voted in the majority opinion, 2) The algorithm did not vote in the majority opinion, and 3) Neither the algorithm nor any individual physician voted in the majority opinion. The 101 validation cases were removed from the 2501 case set prior to training, which made them statistically diverse, clinically relevant, and truly out-of-sample. Statistical measures of performance used in this study included:
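The consensus rule can be sketched in a few lines. This is an illustrative implementation that assumes triage categories are encoded as integers 1 to 4, with higher values being more conservative; the vote data and function names are hypothetical.

```python
from collections import Counter


def consensus_label(votes):
    """Majority vote; ties are resolved in favor of the higher category."""
    counts = Counter(votes)
    top = max(counts.values())
    return max(label for label, n in counts.items() if n == top)


def agreement_rate(member_votes, consensus):
    """Fraction of cases in which one panel member matched the consensus."""
    matches = sum(1 for v, c in zip(member_votes, consensus) if v == c)
    return matches / len(consensus)


# Rows are cases, columns are panel members (made-up votes).
panel_votes = [
    [1, 1, 2, 1],  # clear majority for 1
    [3, 4, 4, 4],  # clear majority for 4
    [2, 2, 3, 3],  # tie between 2 and 3, resolved to 3
    [4, 3, 3, 4],  # tie between 3 and 4, resolved to 4
]
consensus = [consensus_label(row) for row in panel_votes]
first_member_accuracy = agreement_rate([row[0] for row in panel_votes], consensus)
```

To score a member against a consensus that excludes their own vote, the same functions can be applied after dropping that member's column.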
Additional analysis was done to assess the algorithm’s performance on identifying situations in which “medical attention” was required. In particular, medical attention is defined as triage circumstances which call for physician assistance (triage category = 3) or emergency care (triage category = 4). The remaining two triage categories define instances in which no medical attention is needed. The statistical measures of performance for this study included:
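As a sketch of this binary analysis, the four triage categories can be collapsed into "medical attention needed" (category 3 or 4) versus "not needed" (category 1 or 2) before computing standard metrics. The choice of sensitivity, specificity, and PPV here is an assumption based on the measures named elsewhere in the paper, and the example labels are invented.

```python
def needs_attention(triage):
    """Categories 3 (physician assistance) and 4 (emergency care) count as positive."""
    return triage >= 3


def binary_metrics(predicted, consensus):
    """Sensitivity, specificity, and PPV for the medical-attention question.

    Assumes both classes occur in the data, so no denominator is zero.
    """
    pairs = [(needs_attention(p), needs_attention(c)) for p, c in zip(predicted, consensus)]
    tp = sum(1 for p, c in pairs if p and c)
    tn = sum(1 for p, c in pairs if not p and not c)
    fp = sum(1 for p, c in pairs if p and not c)
    fn = sum(1 for p, c in pairs if not p and c)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


consensus_triage = [1, 2, 3, 4, 4, 3, 2, 1]
predicted_triage = [1, 3, 3, 4, 3, 2, 2, 1]
metrics = binary_metrics(predicted_triage, consensus_triage)
```

Note that a prediction of 3 against a consensus of 4 still counts as a true positive in this binary view, since both call for medical attention.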
Confusion matrices were used in this study to visualize the extent of algorithm and physician agreement with consensus for each triage class. Perfect confusion matrices are diagonal, indicating complete agreement between the triager and the consensus. Off-diagonal entries below the diagonal indicate under-triage with comparison to consensus while entries above the diagonal indicate over-triage. Subsequent result sections show such results.
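A small sketch makes the over- and under-triage reading concrete. It assumes rows index the consensus category and columns index the triager's category, which is the orientation under which entries above the diagonal correspond to over-triage; the labels are invented for illustration.

```python
def confusion_matrix(consensus, predicted, n_classes=4):
    """Rows index the consensus category, columns the triager's category (1-based labels)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for c, p in zip(consensus, predicted):
        matrix[c - 1][p - 1] += 1
    return matrix


def triage_summary(matrix):
    """Count agreement (diagonal), over-triage (above) and under-triage (below)."""
    n = len(matrix)
    return {
        "agree": sum(matrix[i][i] for i in range(n)),
        "over": sum(matrix[i][j] for i in range(n) for j in range(n) if j > i),
        "under": sum(matrix[i][j] for i in range(n) for j in range(n) if j < i),
    }


consensus_triage = [1, 2, 3, 4, 4, 3, 2, 1]
predicted_triage = [1, 3, 3, 4, 3, 2, 2, 1]
summary = triage_summary(confusion_matrix(consensus_triage, predicted_triage))
```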
The importance of clinical variables in machine-learning predictions is calculated based on the methodology used by the particular model. In this study, the feature ranking of the Gradient-Boosted Decision Trees (GB) classifier was determined by the expected fraction of samples (case outcomes) to which a particular feature contributed across all trees in the ensemble. Higher fractions indicate higher feature importance. Ultimately, the fractional contribution was determined as an average across the entire forest.
In the case of Logistic Regression, the feature importance was determined by the size of the coefficient effect. When predicting triage, the Logistic Regression included a separate prediction model for each class relative to the other classes, conforming to one-vs-rest methodology. Hence, the feature importance was determined as the average rank of each feature effect over the four prediction models.
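The average-rank calculation for the one-vs-rest models can be sketched as below. The feature names and coefficient values are hypothetical, and the sketch assumes the features were scaled comparably so that coefficient magnitudes can be ranked against each other.

```python
def average_rank_importance(coefficients, feature_names):
    """Rank features by |coefficient| within each one-vs-rest model, then average the ranks.

    coefficients: one list of per-feature coefficients for each class model.
    Returns (feature, average rank) pairs, most important first.
    """
    total_rank = {name: 0 for name in feature_names}
    for model_coefs in coefficients:
        # Largest absolute coefficient gets rank 1 within this model.
        order = sorted(
            range(len(feature_names)),
            key=lambda i: abs(model_coefs[i]),
            reverse=True,
        )
        for rank, i in enumerate(order, start=1):
            total_rank[feature_names[i]] += rank
    n_models = len(coefficients)
    averaged = {name: total / n_models for name, total in total_rank.items()}
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical features and coefficients for four triage-class models.
features = ["spo2", "pulse", "cough"]
coefs = [
    [-2.0, 0.5, 1.0],
    [1.5, -0.2, 0.9],
    [0.3, 1.2, -0.8],
    [-1.1, 0.1, 0.4],
]
ranking = average_rank_importance(coefs, features)
```

With a fitted scikit-learn `LogisticRegression`, the per-class rows of its `coef_` attribute could be passed as `coefficients`.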
The size and scope of the physician panel used in the validation set was a topic of great importance in this study. In order to assess the robustness of the consensus on the validation set, we started by selecting a minimum number of doctors (five) from the complete validation panel of 9 doctors + algorithm. After finding the majority triage opinion of the 5 physician panel on each case, we added a 6th doctor and calculated the percent of 101 total cases that changed triage labels. This process was repeated for all other physicians not in the 5-member panel and the results were averaged. Finally, the outcome of this procedure was averaged for every possible initial combination of 5 physicians to yield the average, max, and min percentage of cases where the majority decision changed after adding a 6th physician panel member. Using this method for all initial panel sizes generates a quantitative assessment of how many physicians are needed to establish a robust consensus in the validation set.
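The panel-size procedure can be sketched as follows for a single starting panel size. It is a minimal illustration with made-up votes, assuming integer triage categories and the same conservative tie-break used for the consensus; all names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import mean


def consensus_label(votes):
    """Majority vote; ties are resolved in favor of the higher category."""
    counts = Counter(votes)
    top = max(counts.values())
    return max(label for label, n in counts.items() if n == top)


def panel_consensus(votes_by_member, members):
    """Consensus label for every case, using only the given members."""
    n_cases = len(votes_by_member[members[0]])
    return [
        consensus_label([votes_by_member[m][i] for m in members])
        for i in range(n_cases)
    ]


def change_rate_after_adding_one(votes_by_member, panel_size):
    """Average, max, and min fraction of cases whose consensus changes
    when one more member joins a panel of panel_size.

    Requires panel_size to be smaller than the total number of members.
    """
    all_members = sorted(votes_by_member)
    rates = []
    for panel in combinations(all_members, panel_size):
        base = panel_consensus(votes_by_member, panel)
        per_addition = []
        for extra in all_members:
            if extra in panel:
                continue
            new = panel_consensus(votes_by_member, panel + (extra,))
            changed = sum(1 for a, b in zip(base, new) if a != b)
            per_addition.append(changed / len(base))
        rates.append(mean(per_addition))  # average over every possible added member
    return mean(rates), max(rates), min(rates)


# Made-up votes: four panel members, three cases.
votes = {"A": [1, 3, 4], "B": [1, 3, 4], "C": [2, 3, 4], "D": [2, 4, 4]}
avg_rate, max_rate, min_rate = change_rate_after_adding_one(votes, panel_size=2)
```

Calling the function for each starting panel size yields the points of a convergence curve.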
The clinical variables selected for algorithm training were found through the multi-tier process described in the Methods section, Algorithm Feature Selection & Patient Case Generation. The final variable list is shown in Table 1, and includes 1) patient background characteristics that are associated with COPD exacerbation risk and severity, 2) current clinical symptoms that encompass widely accepted features of exacerbations, and 3) physiologic measurements that are predicted to influence physician perception of exacerbation severity.
As the feature list included both continuous and categorical variables with different units and responses, the detailed questions and responses are included in S1 Document. The numerical levels of each categorical variable correspond to patient level responses. For example, in the case of cough, the levels 1,2,3 correspond to less than usual, same as usual, and more than usual respectively. All features have an additional response of unknown except for age, weight, height, gender, baseline dyspnea, and symptom questions. This was done to train the algorithm on cases in which patient data could be missing.
As detailed in the methods, patient cases generated using the variables shown in Table 1 were labeled by physicians, and the resultant data were used to train algorithms using a variety of strategies. Fig 3 includes a comparison of the top performing algorithms of each classifier type for out-of-sample classification accuracy. Among the different machine-learning classifiers tested, the top 2 performers were the Gradient-Boosted Decision Tree and the Logistic Regression. All classifier algorithm types were trained in a comparable way inclusive of hyper-parameter optimization and cross-validation.
Model accuracy was measured as the percentage of classifications that matched the consensus triage and exacerbation labels in the validation set. The accuracy results of the GB classifier when the classifier voted in the consensus are depicted in Fig 4. The algorithm agreed with the consensus opinion in 88% of triage cases, whereas an individual physician agreed with the consensus 74% of the time at best. When determining if an exacerbation had occurred, the algorithm assessment again agreed with the consensus determination more than any individual doctor with a success rate of 97% as compared to 95% from the top performing physician. A comparison of the algorithm to the average physician performance is also shown in Fig 4. In the case of triage accuracy, sensitivity, and PPV, the algorithm performed more than 1 standard deviation better than the average physician (more than 2 standard deviations in the case of accuracy). The exhaustive set of statistical performance metrics for the top algorithms and the top physician are shown in Table 2.
It is noteworthy that the algorithm maintained its classification performance relative to the other physicians even in the assessments where it did not vote in the consensus. Results of these tests are shown in Fig 5. When the consensus opinion did not include the algorithm but included all individual physicians (a test that inherently favors the physicians), the algorithm had a triage/exacerbation classification accuracy of 82%/96% compared to the top performing physician at 77%/94%. In the case where no member’s vote was included in the consensus when calculating that member’s accuracy, the top physician dropped considerably in performance with triage and exacerbation accuracies of 62% and 93% respectively.
The confusion matrices shown in Fig 6 give a comprehensive performance summary of both the algorithm and the top-performing physician (the physician with the highest classification accuracy on the validation set) in triage and exacerbation identification. On exacerbation identification, the top performing algorithm and top performing physician showed comparable performance when compared to consensus. On triage category identification, however, a number of performance and safety differences were observed.
In the final study of algorithm performance, the algorithm and physicians were examined for their ability to discern the presence of general medical need (i.e., a triage category of “Call your doctor” or “Go to the ER”). Statistical performance metrics of this study are given in Table 3 with confusion matrices shown in Fig 7. Similar observations can be made about the algorithm’s effectiveness, though the top physician did exhibit superior performance in specificity and PPV over both algorithms. This is likely explained by the fact that the algorithm tended to over-triage in cases when it disagreed with consensus. It is further noteworthy that the algorithm never failed to identify the need for medical attention in the 101 validation cases, while the top performing physician misclassified 11 out of 80 such consensus instances (13.75% of the time).
Physician-labeled data on exacerbation and triage categories were compared in the validation sets to better understand physician decision making. Fig 8 below shows the distribution of triage and exacerbation labels in the validation set per doctor. Plots of the average triage and exacerbation classes are also shown with error bars indicating 1 standard deviation intervals.
A variety of observations can be made about outlier opinions, inter-physician consistency, physician treatment of risk, and correlation between triage and exacerbation categories. Doctor 2, for example, triaged 63% of patients to the emergency room and Doctor 3 triaged 32% of patients as needing no additional medical attention, which are both over 2 standard deviations outside of the respective means. Physician triage assessment was also often highly independent of exacerbation assessment. Drs. 3 and 6, for example, had nearly identical exacerbation class distributions, but Dr. 3 triaged 32% of patients as needing no additional medical attention as opposed to 11% for Dr. 6. This could be partially due to a belief that an alternate diagnosis was driving symptoms. Moreover, the shape of the consensus triage distribution was matched closely only by the algorithm plus Drs. 5, 7, and 9. This suggests that the remaining physicians used a qualitatively different logic when choosing triage categories.
The study of how many physicians constituted a robust validation panel (detailed in the Methods section) resulted in the convergence plot shown in Fig 9. The graph shows that each case of the validation set converged to an unchanging correct answer as more doctors were added. Seven physicians marked the region where the set showed good convergence, with only 8% of cases changing on average when adding another doctor.
The patient variables that most influence triage and exacerbation assessment are shown in Table 4. Table 4 shows the hierarchical importance of the top 15 features in predicting the correct triage category based on the process described in the methods section Algorithm Feature Importance. Interestingly, the GB algorithm favored patient profile characteristics like age, BMI, and height as the most influential factors for triage while the same variables failed to reach the top 15 list for LR triage. LR also tended to weight vital sign levels more heavily when predicting on both triage and exacerbation classes. Despite these differences, both algorithms maintained comparably high statistical measures of performance.
The machine-learned triage approach in this study performed favorably when compared to individual physicians in a broad range of statistical performance measures both in triage and in predicting the presence of a COPD exacerbation. Unlike existing paper checklist type tools, the models incorporated the baseline medical health of the patient in a way that robustly accounted for the complex interactions of patient health variables. Gradient-Boosted Decision Trees and Logistic Regression showed the highest performance when making out-of-sample predictions on the validation set. The performance metrics used to evaluate the algorithms demonstrated accuracy, safety, consistency, and edge case prediction performance comparable to or better than the top performing physicians in all studies with three different assessments of consensus.
The use of machine-learning predictions on a clinical feature set to identify disease flare-ups and provide subsequent patient decision support is a unique contribution of this study. To date, we are not aware of any other work that has produced a comparable result. The use of consensus physician opinion as a validation standard and the analysis of individual physician performance on that standard is also a unique contribution of this study.
The current study has demonstrated that the top two performing algorithms, GB and LR, both yield a suite of statistical performance metrics that compare favorably with individual physicians, and yet, the clinical variables that most influence each model’s output maintain a different rank order of importance. Logistic regression generally weighted vital sign data with more importance. This result suggests that a good recommendation based on the validation standard in this study can be achieved through different logic (modeling type) with diverse health data (algorithm features).
While the algorithm exhibits very strong performance when predicting on the out-of-sample validation set, ultimately the algorithm training is done on cases with individual physician labels. Training on cases with more opinions could facilitate a more robust in-sample cross-validation process. Given the data collection methodology used in this study, it would become increasingly expensive and intractable to collect orders of magnitude more cases for training, but with increased access to electronic medical records, one could consider using a much larger patient dataset with historic patient outcomes guiding the training process. This approach would require considerable thought and further investigation given the lack of gold standard on what constitutes a correct triage assessment.
Thus far, the algorithm has been both developed and tested on hypothetical patient cases. Unlike clinical datasets for post-hoc analysis of hospitalized patients, data for outpatient triage and evaluation is not readily available, necessitating the use of simulated data. Although the feature set within the training data is certainly comprehensive and large when compared to the information that would generally be available to a pulmonologist, internist, or nurse in the clinical setting, an additional level of validation would be to compare the prediction of the algorithm in a real patient setting with a set of physicians actively triaging the same set of patients. This type of clinical data would provide additional insight based on current medical practices.
It is further recognized that the black-box nature of ensemble decision tree methods makes the decision making logic in triage recommendations difficult to interpret. The feature importance studies previously discussed shed light on which patient variables most influence the final outcome, but ultimately, the inherent complexity and interactions of the feature set make it difficult to give a simple, linear causal explanation of the algorithm output based on the inputted features.
Mobile applications geared toward improved at-home patient care and self-management of chronic illnesses have substantially grown in use due largely to the availability of technology and the rising costs of health care. While the growing popularity of mHealth (health care and public health practice supported by mobile devices) is evident, its impact and efficacy are not. This study has shown that machine-learning based applications offer the exciting prospect of accurate and personalized triage of COPD patients. Early detection of disease flare-ups and accurate counsel to patients has the potential to both reduce the severity of exacerbations and prevent unnecessary hospitalizations for otherwise healthy, anxious patients. This may assist the drive towards personalized medicine by better guiding decision support for individual patients.
The current algorithm is deployed in a mobile app that is primarily meant for at-home patient use, though it could also be used by nurses and internists less familiar with a patient’s baseline health as a tool to confirm their assessments during patient calls. The app should be further explored for its effectiveness in real patient populations with respect to various clinically relevant endpoints both for improving patient decision making and for engendering clinical reduction in severity and frequency of COPD exacerbations. Future investigation of the feasibility of machine-learning applications in clinical trials will be needed. Moreover, a robust clinical study on the influence of these applications on patient anxiety, stress, and overall health would be informative.
With modern computational capability and continuously better access to health data, the opportunities to train machine-learning algorithms on large patient outcome datasets will improve. This may have particular relevance in COPD, where emerging data from large phenotyping and genotyping efforts, such as COPDGene and Spiromics, are delineating novel variables that impact exacerbation and disease risk. Blood-based biomarkers and cellular content such as eosinophils, for example, are known to be correlated with increased risk of COPD exacerbations. Although physician opinion is currently the gold standard for many clinical decisions, including diagnosis and triage of COPD exacerbations, active cloud-based training that integrates patient data in electronic medical records with available scientific knowledge may eventually provide specific predictions and recommendations that support medical decision making. Such cloud-based information could be returned to a patient at home or to a provider in a clinical setting as APIs for computers and mobile devices.
This study has shown that a machine-learning approach to triaging patients with COPD is a viable and robust method when compared to individual pulmonologists at facilitating at-home triage and exacerbation self-identification. The ML algorithm exhibited higher accuracy than all individual, board-certified physicians in predicting the consensus opinion on both the presence of an exacerbation and the appropriate triage category in a representative set of patient cases. Furthermore, the algorithm erred in favor of patient safety more often than any individual pulmonologist and exhibited greater consistency in its recommendations. While the app is not meant to be a substitute for physician examinations or physician guided patient care, it does provide simple, easily accessible, safe, and highly accurate at-home decision support which can direct patients to the right care. Furthermore, it is generalizable to other chronic illnesses in which relevant symptoms, signs, and patient profile data are available.
Pulmonologists who participated in this study provided expert opinion in clinical selection/review, algorithm training, and validation. The physician profiles are indicated in the supporting table.
Simulated patient cases are issued to physicians to provide exacerbation and triage labels for the purpose of algorithm training and validation. This file includes a sample batch of 100 cases with the corresponding physician entered data.
This file includes all the data used for training and validation.
The authors thank Dr. Lucas Abraham and Dr. Ervin Anaya for useful consultation through algorithm development and validation.
The authors received no specific funding for this work.
The data is available as a Supporting Information file.