A well-known goal of the use of EMRs is to improve the quality and efficiency of patient care.[1] EMRs have been acknowledged as a source for identifying large numbers of subjects for research studies. Such studies include those limited to data collected through EMRs (e.g., the understanding of individual disease courses and outcomes), but also extend to those requiring additional data gathering (e.g., genetic studies of complex diseases). With these goals in mind, data have been extracted from EMRs at various medical centers to identify subjects with diseases including asthma,[6] diabetes mellitus,[31] and heart failure.[32] However, few studies have demonstrated that the extracted data can themselves be useful for clinical studies. In this work, the authors used data extracted from EMRs with tools available in the i2b2 asthma data mart to characterize and predict which asthma patients develop COPD.
Bayesian networks were used to create a predictive model of COPD using the following extracted variables: age, sex, race, BMI, smoking history, and 104 comorbidities. Of these, age, sex, race, smoking history, and 8 comorbidities modulate the risk of COPD. The model has good predictive accuracy, as indicated by an AUROC of 0.83 (SE 0.03) when using the model to predict COPD in an independent set of patients. The ability of single variables to predict COPD was assessed by using the information from one variable at a time to predict COPD with the network (). The strongest single-variable predictor in the network is age (). Surprisingly, this variable alone predicts COPD with an AUROC of 0.81 (SE 0.04) in the independent subjects, which is not significantly different from the area obtained with all variables. The relationship between age and COPD is well known: because COPD is a chronic disease that worsens over time, it is characteristically present in older adults.[7] Emergency room (ER) visit and hospitalization rates for COPD among U.S. adults have been estimated[33] and are consistent with our findings: the 65–74 and 75+ age groups have the highest rates of ER visits and hospitalizations, while the youngest groups have the lowest. Although the importance of age in COPD is known, it is not obvious that it should be the best predictor in our model. For example, smoking is the most important cause of COPD,[34] and being male is also traditionally associated with a higher likelihood of having COPD.[7]

Besides age, most other single variables are unable to predict COPD better than at random, and those that can have AUROCs significantly lower than those obtained using all variables or age alone (). If age information is suppressed while the remaining variables are used to predict COPD, the corresponding AUROC is 0.73 (SE 0.04). This demonstrates that the model contains significant predictors of COPD besides age, and that interactions among these variables are able to predict COPD, albeit with lower accuracy than age alone. Consistent with the Dutch hypothesis of COPD, these results suggest that some subjects with asthma develop COPD as they age regardless of their smoking status and independently of other network variables. Further study of the relationships among the network's variables is required to confirm this premise, and incorporation of other variables into the model is necessary to understand what alters the progression to COPD among patients with asthma as they age.
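The single-variable analysis described above can be sketched in code. The paper's actual Bayesian network and patient data are not available here, so this illustration uses synthetic data and a logistic regression as a stand-in classifier; all variable names, effect sizes, and the resulting AUROC values are assumptions for demonstration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 90, n).astype(float)
smoking = rng.integers(0, 2, n).astype(float)  # 1 = positive smoking history
male = rng.integers(0, 2, n).astype(float)
# Synthetic outcome: risk rises with age and smoking (illustrative only).
logit = 0.08 * (age - 55) + 0.8 * smoking + 0.3 * male - 1.5
copd = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([age, smoking, male])
X_tr, X_te, y_tr, y_te = train_test_split(X, copd, random_state=0)

# Full model: all variables together.
full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_all = roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])

# Single-variable models: one column at a time, mirroring the paper's analysis.
aucs = {}
for i, name in enumerate(["age", "smoking", "male"]):
    single = LogisticRegression(max_iter=1000).fit(X_tr[:, [i]], y_tr)
    aucs[name] = roc_auc_score(y_te, single.predict_proba(X_te[:, [i]])[:, 1])
    print(f"{name}: AUROC = {aucs[name]:.2f}")
print(f"all variables: AUROC = {auc_all:.2f}")
```

On data like these, where one covariate dominates the outcome, a single-variable AUROC can approach the full-model AUROC, which is the pattern the study observed for age.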
Some of the comorbidities that were found to modulate the risk of COPD are general symptoms, such as “shortness of breath” and “respiratory distress or insufficiency.” These symptoms alone are insufficient to indicate COPD, but in the context of the model they are helpful for predicting COPD. Infections are known to be related to COPD exacerbations,[35] which supports some of the other variable relationships with COPD (e.g., “pneumonia, organism unspecified” and “acute bronchitis”). For each patient, the authors used all ER visit primary diagnoses and hospitalization admission diagnoses available during the observation period to infer comorbidities. The authors did not differentiate comorbidities based on the order in which they occurred in time; therefore, the authors are limited in knowing whether the above comorbidity predictors and COPD are in causal relationships. In future studies, more careful evaluation of the time course of comorbidities may help to establish whether there are early predictors of COPD. The current results are still useful for indicating how COPD is related to other comorbidities. For example, patients who have hospital admissions/ER visits for COPD are likely to have separate hospital admissions/ER visits for shortness of breath, but shortness-of-breath events are also related to pneumonia, heart failure, and respiratory distress or insufficiency. Though one might intuitively expect COPD and shortness of breath to be related, the network demonstrates that the relationship is influenced by several other variables, and relationships between variables that would intuitively be thought directly related are not always present (e.g., acute upper respiratory infections and shortness of breath). Further, the network provides a quantitative measure of relationships among variables, which is stronger than an intuition that relationships should exist.
Though the model performs well at predicting COPD in the independent set of asthma patients, there is clear room for improvement in predictive accuracy. The ROC curve shows that, in the independent group of patients, sensitivities of 100, 90, and 80% correspond to specificities of only 45, 60, and 67%, respectively. Some of the factors that affect the predictive accuracy are errors in data extraction, the inherent limitations of medical record data, and the challenge of determining which patients have COPD.
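Reading specificity off an ROC curve at fixed sensitivity levels, as in the trade-offs quoted above, can be sketched as follows. The labels and scores here are synthetic placeholders, not the study's data, and the helper function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 500)
# Noisy scores correlated with the label (illustrative only).
scores = y_true * 0.8 + rng.normal(0, 0.7, 500)

# fpr and tpr trace the ROC curve; specificity = 1 - fpr.
fpr, tpr, thresholds = roc_curve(y_true, scores)

def specificity_at_sensitivity(target_tpr):
    """Highest specificity achievable at or above a target sensitivity."""
    ok = tpr >= target_tpr
    return (1 - fpr[ok]).max()

for s in (1.0, 0.9, 0.8):
    print(f"sensitivity >= {s:.0%}: "
          f"specificity = {specificity_at_sensitivity(s):.0%}")
```

As the required sensitivity rises, the achievable specificity can only fall, which is the trade-off the text describes for the model's operating thresholds.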
There is virtually no error in extracting age, sex, race, and BMI, as these variables are structured data in the RPDR. However, there are limitations to the ways in which longitudinal age and BMI measures can be represented as a single value for this study. Because the authors condensed longitudinal data to establish which patients were affected by comorbidities, they selected single values for age and BMI for each patient. The age used was that of the patient at the first recorded ER visit or hospitalization. The observation period for each patient was at least 5 years but never greater than 10 years; therefore, the age ranges used to categorize patients tended to be much wider than the amount of aging each patient underwent during the observation period (i.e., most patients remained in the same age category across observations). The BMI was calculated from average height and weight values. Because subjects were adults during the observation period, height is expected to remain constant; a person's weight, however, can fluctuate over time. The authors assumed that the average weight was the best representative weight over the course of observation because it would best account for the weight most often observed for each patient. In the vast majority of patients, there was minimal change in weight across the extracted values.
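The condensing step described above can be sketched as a simple aggregation: one age (at the first encounter) and one BMI (from averaged height and weight) per patient. The column names and toy records below are hypothetical; the actual RPDR/i2b2 schema is not given in the text.

```python
import pandas as pd

# Hypothetical longitudinal encounter records for two patients.
visits = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2001-03-01", "2004-07-15", "2008-01-10", "2002-05-02", "2009-11-30"]),
    "age_at_visit": [52, 55, 59, 67, 74],
    "height_m":    [1.70, 1.70, 1.70, 1.62, 1.62],
    "weight_kg":   [80.0, 83.0, 82.0, 70.0, 72.0],
})

per_patient = (
    visits.sort_values("visit_date")
          .groupby("patient_id")
          .agg(age=("age_at_visit", "first"),   # age at initial encounter
               height=("height_m", "mean"),     # height assumed constant in adults
               weight=("weight_kg", "mean"))    # average weight over observation
)
per_patient["bmi"] = per_patient["weight"] / per_patient["height"] ** 2
print(per_patient)
```

Using the mean weight, as the authors did, smooths out transient fluctuations; the "first" aggregation reproduces their choice of age at the initial recorded encounter.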
Extracting smoking history poses greater challenges than the extraction of other demographic variables because it requires the use of NLP methods on the free-text portions of EMRs. Previous work describes some of the problems involved in extracting smoking status from EMRs, and how NLP has been used to successfully determine it.[4,6] The HITEx methodology that was used to extract the smoking data for this paper has been shown to have an accuracy of 90% compared to expert classification,[6] which represents good reliability. In addition to the challenge of extracting smoking status by NLP, there can be uncertainty in the veracity of patient reporting. The authors chose to define a negative smoking history as one where a patient had “never smoked” in at least 90% of the smoking histories extracted. A positive smoking history could contain a mixture of “current-smoker,” “past-smoker,” and less than 90% “never-smoked” in the extracted smoking histories; most often, patients with a positive history had a large percentage of “current-smoker” and “past-smoker” entries. Because the authors looked at records from patients who were observed for at least 5 years, it was deemed more significant to differentiate patients with a positive smoking history from those with a negative history than to differentiate current smokers from past smokers. Therefore, the authors chose a definition of nonsmokers intended to ensure that this group truly contained patients with a negative history of smoking.
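The 90% rule described above amounts to a small classification function over the set of NLP-extracted statuses. The function name and status labels below follow the text's terminology but are otherwise illustrative.

```python
def classify_smoking(extracted_statuses):
    """Return 'negative' or 'positive' smoking history per the 90% rule:
    a patient is a nonsmoker only if 'never-smoked' accounts for at least
    90% of the statuses extracted from their notes."""
    if not extracted_statuses:
        return None  # no smoking information extracted
    never = sum(1 for s in extracted_statuses if s == "never-smoked")
    return "negative" if never / len(extracted_statuses) >= 0.90 else "positive"

print(classify_smoking(["never-smoked"] * 10))                          # negative
print(classify_smoking(["never-smoked"] * 8 + ["current-smoker"] * 2))  # positive
```

The 90% threshold, rather than requiring 100% agreement, tolerates occasional NLP extraction errors while still keeping the nonsmoker group conservative, which matches the authors' stated intent.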
As with most demographic variables, there is virtually no error in extracting comorbidities from the i2b2 data mart, as these variables were inferred directly from ICD-9 billing codes corresponding to ER primary diagnoses and hospitalization admission diagnoses. However, the assignment of diseases with ICD-9 codes is itself subject to error. In previous work, principal diagnosis classification using ICD-9 billing codes was measured to have an accuracy between 72 and 80%, depending on the amount of data available per record, when compared to expert classification.[6] This low accuracy was nearly identical to that using HITEx (73–82%), but using ICD-9 codes resulted in higher specificity (85–91% compared to 82–87%). Thus, most subjects classified as having a diagnosis according to ICD-9 codes would also be classified as such by an expert. In addition to being limited in their ability to describe individualized disease presentations, ICD-9 codes are often biased and subject to error when they are assigned without a thorough patient evaluation or by physicians inexperienced in their use. Additionally, some diseases are not classified using uniform criteria, leading to patients with different pathological processes being labeled as having the same disease. Despite these limitations, the use of ICD-9 codes is standard in United States hospitals for billing and record keeping because they provide a finite and uniform set of diagnoses.[36] In our work, ICD-9 codes provide a satisfactory means of classifying patients with diseases, as evidenced by the relationships among variables found by our model and by its good predictive accuracy.
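Inferring binary comorbidity variables from ICD-9 primary/admission diagnosis codes, as described above, can be sketched as a lookup over a code-to-comorbidity map. The four-entry mapping below is a tiny illustrative subset using standard ICD-9-CM codes, not the study's actual 104 comorbidity definitions.

```python
# Illustrative subset of an ICD-9 -> comorbidity mapping (not the study's).
ICD9_COMORBIDITY = {
    "486":    "pneumonia, organism unspecified",
    "466.0":  "acute bronchitis",
    "428.0":  "heart failure",
    "786.05": "shortness of breath",
}

def comorbidity_flags(diagnosis_codes):
    """Map a patient's diagnosis codes to binary comorbidity indicators."""
    seen = {ICD9_COMORBIDITY[c] for c in diagnosis_codes if c in ICD9_COMORBIDITY}
    return {name: name in seen for name in ICD9_COMORBIDITY.values()}

# Codes outside the mapping are simply ignored.
flags = comorbidity_flags(["486", "786.05", "V58.61"])
print(flags)
```

Because the flags come straight from billing codes, this step itself introduces virtually no extraction error; the error lies upstream, in how the codes were assigned, as the text notes.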
A further limitation of our study is the definition of COPD. A complex disease process, COPD is not always diagnosed using uniform criteria. Many patients with COPD feature characteristics of emphysema and chronic bronchitis, blurring the distinctions between these diseases, while some subjects with emphysema or chronic bronchitis do not have COPD.[37] Therefore, our definition of COPD, based on ICD-9 codes from EMRs, is likely to classify some subjects as having COPD who would not be classified this way under other standards (e.g., lung function measures). This sort of misclassification would only decrease the predictive accuracy of our model and hence bias our result in the direction of no effect. Nonetheless, incorporating objective measures such as lung function into the data extracted from EMRs would help increase the accuracy of our model. The i2b2 data mart will be expanded to include such measures, although few will likely be available because lung function tests are not routinely ordered for most patients. In this sense, data obtained in epidemiological studies, which gather uniform measures for all participants, have a clear advantage over data extracted from EMRs. However, both sources of data are important, and information extracted from them can be complementary. The fastest increase in clinical knowledge is likely to be achieved by integrating findings from multiple sources.
Despite all the limitations listed above, including those attributable to extracting data by NLP, the authors have shown that our EMR-extracted data are of good enough quality to create predictive models. Our model's AUROC of 0.83 for predicting COPD in an independent group of patients is comparable to that of routine clinical tests such as prostate-specific antigen testing (AUROC 0.62–0.86)[38] and mammography (AUROC 0.67–0.84).[39] Such clinical tests are evaluated prospectively, while our model was created and tested with retrospective data, which may be subject to bias in the assignment of comorbidity ICD-9 codes. Despite the limitations of the retrospective data used, the current AUROC of our model suggests that its performance on prospective data will be good. Gathering prospective data to test our predictive model will provide a more objective assessment of its predictive accuracy.
If the accuracy of NLP extraction of smoking status, currently 90%, were perfect, the performance of our predictive model would increase significantly. Because smoking is known to be an important risk factor for COPD, the authors expect that the AUROC might improve by as much as 0.05. Similarly, if the errors associated with using ICD-9 codes were reduced, our model's predictive performance would increase. One way to accomplish this would be to improve principal/admission diagnosis assignment by combining NLP methods with the extraction of billing codes. Although our current accuracy of classification with ICD-9 codes is low (72–80%), the specificity is good (85–91%). Therefore, even though most comorbidities assigned in our data are accurate, the authors would likely obtain additional comorbidities per patient with better methods of extracting primary/admission diagnoses from EMRs. This would likely strengthen some existing comorbidity relationships and perhaps introduce new ones, increasing the AUROC for prediction of COPD by 0.05–0.10. However, even if classification of smoking status and comorbidities from EMRs were perfect, the authors would still expect the other errors mentioned (e.g., inaccuracy in patient reporting and limitations inherent in disease classification schemes) to keep the prediction accuracy of our model below 100%.
Our results demonstrate the promise of using medical records to create predictive models and attest to the utility of approaches like i2b2's in instrumenting the healthcare enterprise. Potential applications of predictive models include improved resource allocation for healthcare systems and more closely targeted individualized prevention/management programs. To this end, future studies will improve the NLP methodology used to extract data, expand our model to include more comorbidities and medication history, and consider the time course of patients' medical histories.