Our results demonstrate the feasibility as well as the challenges of assessing clinical outcomes in EMR using NLP of clinicians’ narrative notes. Using a simple set of empirically defined terms, which are readily extracted from free text, 23% of narrative notes could be accurately classified as depressed, 22% as euthymic and the remainder as intermediate or subthreshold. We emphasize that a large number of patients and notes remain in this third group by design: criteria were selected a priori to maximize specificity for the two outcome categories (TRD and single-treatment responder), anticipating their use in future biomarker studies. Selection of more liberal thresholds would of course greatly increase the proportion of subjects classified to the extreme groups, and might be desirable for other types of investigations such as effectiveness studies seeking to characterize TRD risk.
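The trade-off between strict and liberal thresholds can be illustrated with a minimal sketch. It is not the classifier used in this study: the term list, the weights, the scoring rule and the threshold values below are all hypothetical, chosen only to show how widening or narrowing the cut-offs moves notes into or out of the intermediate group.

```python
import re
from typing import Dict

# Hypothetical term weights, for illustration only (NOT the study's term list).
# Positive weights point toward "depressed", negative toward "euthymic".
TERM_WEIGHTS: Dict[str, float] = {
    "depressed": 1.0,
    "hopeless": 1.0,
    "anhedonia": 1.0,
    "euthymic": -1.0,
    "remission": -1.0,
}


def note_score(note: str, weights: Dict[str, float] = TERM_WEIGHTS) -> float:
    """Sum the weights of all whole-word term matches in a note.

    This naive count ignores negation ("not depressed") and context,
    which a real NLP pipeline would need to handle.
    """
    text = note.lower()
    score = 0.0
    for term, weight in weights.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        score += weight * len(re.findall(pattern, text))
    return score


def classify(score: float, low: float = -2.0, high: float = 2.0) -> str:
    """Three-way classification with assumed (hypothetical) cut-offs.

    Strict cut-offs (far from zero) favor specificity and leave more
    notes "intermediate"; liberal cut-offs assign more notes to the
    extreme groups.
    """
    if score >= high:
        return "depressed"
    if score <= low:
        return "euthymic"
    return "intermediate"


if __name__ == "__main__":
    note = "Mood euthymic today."
    s = note_score(note)
    print(classify(s))                       # strict cut-offs
    print(classify(s, low=-1.0, high=1.0))   # more liberal cut-offs
```

Under the strict default cut-offs, a note containing a single supporting term stays in the intermediate group; relaxing the cut-offs reassigns it to an extreme group, at the cost of specificity.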
The intermediate group also reflects the limitations both of the diagnostic system and clinical documentation. That is, many patients will experience only partial improvement and this may not be well captured in the narrative text. Of note, even for those individuals classified as euthymic based on the narrative note, the mean QIDS-SR score is in the mildly depressed range. One contributor to this discordance might be the specific guidance given to the raters to not score anxiety or other symptoms, while patients might score anxiety symptoms as (for example) agitation or poor concentration – a challenge any time a self-report and clinician-rated assessment are compared. Given the relative paucity and lack of systematic administration of QIDS-SR, these exploratory analyses should be interpreted with caution. This finding underscores the prevalence of residual mood symptoms in clinical practice, as well as the potential utility of using self-report measures in this context (Nierenberg et al. 2010).
The superiority of using clinician- or even patient-reported measures to determine symptom severity should be apparent, which might lead one to question the utility of NLP-based approaches. Indeed, these results should highlight the limitations of the narrative text as well as the potential utility of standardized assessments (and their inclusion in EMR systems). On the other hand, progress toward this goal has been remarkably slow even in academic mental health systems and, once implemented, it will be many years until large datasets with these measures accumulate. During this transition, the value of using existing large datasets, with millions of patients and years of data collection, should also be clear.
Our findings strongly suggest that billing data alone, including ICD-9 codes used for billing, are unlikely to be adequate for establishing outcomes. This likely reflects clinicians’ lack of concern for accuracy in such codes, as they do not impact reimbursement and are often used primarily to reflect the diagnosis of the patient and not current clinical status. Indeed, prior reports suggest that such codes may not reliably distinguish individuals by diagnosis, as was illustrated in a cohort of mood disorder patients undergoing electroconvulsive therapy (Jakobsen et al. 2008).
We note several caveats in interpreting our work. First, the portability of these classification models remains to be determined. Different healthcare systems may have different standards or formats for narrative notes, which would be expected to influence the performance of our classifiers. However, we emphasize that MGH and BWH, the two major hospitals within the Partners HealthCare system, include two distinct departments of psychiatry with different medical record systems and approaches to documentation, which should improve portability to other systems. The vast majority of clinical notes derive not from the in-patient units, but from affiliated out-patient clinics in the region, most of which are not primarily academic in orientation.
Second, as we have noted, these classifiers should not be construed as a substitute for systematic and quantitative assessment. Manual review of notes identified a remarkable disparity in quality and nature of documentation and consequent ambiguity in description of clinical states. For example, a common notation was ‘depression is stable’, which might refer to a patient who continues to be depressed (as in unchanged), or one whose illness is successfully managed (as in remaining in remission). Likewise, it was not uncommon to encounter documentation of details of recent stressors or events, in the absence of mood symptoms. As more health care systems move to EMR, there is a unique opportunity to better quantify outcomes. For example, the 16-item patient-rated QIDS-SR has been shown to be highly correlated with clinician-rated measures and sensitive to treatment effects (Rush et al. 2003b); another well-validated alternative is the PHQ-9 (Kroenke et al. 2001). Incorporating these measures in EMR systems would greatly improve the capacity of those systems to support future outcome studies. At minimum, EMR systems that utilize templates could require clinicians to record a clinical status [for example, using the 7-point Clinical Global Impression scale (Guy, 1976), or even recording remission status].
Third, in defining longitudinal outcomes, multiple assumptions are required about treatment status. As the Partners HealthCare system is not a ‘closed’ one, there is documentation of a prescription being given but not of it being filled or re-filled. Therefore, there is some risk for misclassification in both directions. Individuals labeled ‘responsive’ may have remitted in spite of not adhering to treatment, as might be expected given the sizeable rates of placebo response in MDD (Fournier et al. 2010). Conversely, individuals labeled as having TRD may actually be non-adherent, or partially adherent, or receive inadequate medication dosage or duration, a phenomenon sometimes referred to as ‘pseudoresistance’. This limitation underscores the value of integrating clinical data with pharmacy billing data whenever possible. A related challenge is determining tolerability; some individuals classified as resistant may actually be intolerant to multiple medications and thus unable to achieve therapeutic doses necessary for symptomatic improvement. Whether tolerability can itself be accurately determined with NLP approaches merits further investigation. Incorporating tolerability data is further complicated by its partial correlation with efficacy: individuals may be more likely to tolerate medications that they perceive as being helpful to them, and vice versa. In addition to adherence and tolerability, psychiatric and medical co-morbidity are also important moderators of treatment response to which NLP approaches may be applicable.
It should be emphasized that TRD was selected for this study precisely because it is a difficult problem for NLP. Many outcomes within psychiatry should be substantially easier to define, particularly those such as hospitalization, which are likely to be available from billing data. Given the chronicity of many psychiatric disorders, however, the ability to parse less ‘hard’ outcomes such as remission among out-patients will clearly be important in facilitating future studies.
Classification based upon narrative notes provides an opportunity to take advantage of existing EMR systems for highly efficient clinical investigation. In the Partners HealthCare system, there are ~4 years of psychiatry out-patient notes, which, even in the absence of detailed rating scales, yield some perspective on clinical outcomes on a very large scale. With appropriate protection of patients’ privacy, this resource could be applied to efficiently identify risk factors for treatment resistance. It can facilitate investigations of effectiveness, for example, by comparing outcomes across different clinics or payor types within a health care system to highlight potential disparities. (We note the importance of considering confounding in these sorts of population-based investigations, and also the well-established methodologies for addressing these concerns.) Finally, it might allow for efficient recruitment of specific clinical populations; for example, investigations of novel interventions specifically for patients with TRD, or pharmacogenomic investigations of TRD. By comparison, in the largest TRD study to date, >4000 patients were enrolled in order to yield fewer than 100 patients per arm in the most treatment-resistant phase (Trivedi et al. 2006). If personalized medicine is to become a reality in psychiatry, multiple large datasets will be required to build and validate models for treatment outcome. Our results suggest that applying NLP tools to existing EMR data may help accelerate this process.