Electronic Health Records (EHRs) collect patient data obtained in the course of clinical care. These records, when aggregated in Clinical Data Warehouses (CDWs), are a rich source of data for research. For example, we may want to estimate disease prevalence, track infectious diseases, identify unexpected side-effects of drugs, identify cohorts of patients for studies, and compare the effectiveness of alternative treatments for a given condition.
Unfortunately, CDWs built from EHRs have not lived up to these hopes. The fundamental problem is that we are attempting to use EHR data for purposes other than supporting clinical care. Over 20 years ago, van der Lei warned against this practice and proposed that “Data shall be used only for the purpose for which they were collected.” [1]
For example, ICD-9-CM codes are routinely assigned to a patient for billing purposes, but billing rules are not meant to preserve and encode clinical reality. Instead, billing rules are meant to comply with the byzantine, sometimes mutually incompatible requirements of insurers, administrators, and regulators. In particular, patients and their insurers are billed only for conditions for which they are treated by providers. In multiple-provider situations this means that each provider only sees a part of the patient’s conditions.
For example, consider a patient P with breast cancer who receives her oncological treatment at Cancer Center A. Cancer Center A bills the patient, and the patient’s insurance, for cancer care. The same patient gets tonsillitis and decides to go to outpatient Clinic B. Clinic B sees her, treats her, and bills her insurance for tonsillitis. At this point in time, Cancer Center A has a record for a patient with breast cancer, and Clinic B has a record for a patient with tonsillitis.
This state of affairs is appropriate for routine clinical care. Somewhere in the patient’s record at Clinic B a physician or nurse will have written that the patient also had breast cancer. If another physician at Clinic B needs to know, she can find out by reading the patient’s file.
Now consider a researcher at Clinic B who wants to know if breast cancer predisposes people towards tonsillitis. Any attempt to find a correlation using billing data will miss patient P. The same is true for researchers trying to perform genomics research on these diseases; they will simply miss these patients.
This is not idle speculation; at our outpatient clinic, approximately 52% of patients who have or had breast cancer according to their own charts have been billed for the condition. Similarly, 23% of patients with endometrial cancer have a billing code compatible with endometrial cancer [2]. Data from other institutions and other conditions are similar. For example, 52% of patients with an ICD-9-CM code for Wegener’s Granulomatosis at St. Alexius Medical Center actually met the diagnostic criteria for the condition [3]. A strategy combining different ICD-9 codes yielded an 88% positive predictive value (PPV) for Lupus Nephritis cases at Brigham & Women’s Hospital in Boston. However, the sensitivity was impossible to compute (i.e., it was not known how many cases were missed) [4]. Other studies had similar outcomes [5
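The distinction between PPV and sensitivity drawn above can be made concrete with a minimal sketch. The counts below are hypothetical and chosen only for illustration; they are not taken from any of the cited studies. The point is that PPV needs only the flagged patients to be reviewed, whereas sensitivity needs the number of missed cases, which is unknown unless unflagged charts are also reviewed.

```python
def ppv(true_pos, false_pos):
    """Fraction of flagged patients who truly have the condition."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos, false_neg):
    """Fraction of patients with the condition who were flagged."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts: 100 patients flagged by billing codes,
# 88 of them confirmed on chart review.
flagged_confirmed = 88
flagged_unconfirmed = 12
print(ppv(flagged_confirmed, flagged_unconfirmed))

# Sensitivity requires the number of true cases that were NOT flagged.
# This value is an assumption here; in practice it is only known after
# reviewing charts of unflagged patients.
assumed_missed_cases = 112
print(sensitivity(flagged_confirmed, assumed_missed_cases))
```

With the same 88 confirmed cases, a code-based strategy can look precise while still missing most patients if the number of unflagged cases is large.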
Many research efforts, such as those focused on Comparative Effectiveness Research (CER), genomics, proteomics, and genealogy, require accurate knowledge of the patient’s entire medical history and list of conditions, sometimes referred to as the patient’s phenotype. These research endeavors aren’t interested in the patient’s billing history. They are interested in what conditions the patients actually had. Determining these conditions at scale is known as high-throughput phenotyping.
This information is often available in physicians’ and nurses’ notes. Further, clinical notes will contain information about past events, unlike other sources of information. For example, a patient with a remote past history of breast cancer, now without evidence of disease, will not receive medications, and will not have procedures or lab exams done that could point to the diagnosis. Clinical notes may also be more abundant than other sources of information. Our CDW contains 295,000 patients with at least one clinical note; 161,000 patients with at least one recorded vital sign; 143,000 patients with at least one medication in a structured field; and 138,000 patients with at least one lab exam. Thus, clinical notes are an important resource for research projects that require clinical information.
Manual review of hundreds of thousands of charts is impractical. Even smaller-scale manual review is expensive, and prone to error and inconsistent coding [11]. The biomedical informatics research community therefore continuously seeks ways to extract computable information from free text [12]. Automated coding systems such as MetaMap [13], cTAKES [14], and MedLEE [15] can map text to Unified Medical Language System (UMLS) concepts; however, without the addition of customized rules they draw no inferences from it. Many interesting problems require determining the state of the patient, i.e., “did the patient ever have breast cancer?” instead of the easier “does this document mention breast cancer?” Automated classification systems built using Weka [16] or MAVERIC’s ARC [17] address the second need, and perform very well on cross-validation [2]. However, these systems have two weaknesses. The first weakness is that they require training data, which are expensive and slow to create, because a clinician must read each patient’s chart and decide whether the patient had the condition in question. The second weakness is that a system that works well to identify one concept may not work as well to identify a different concept, or even the same concept in another data set; in other words, these systems are not generalizable [12
We therefore set out to build a high-throughput phenotyping system that required neither training data nor customized disease-specific rules, used available external knowledge, and performed well compared to existing automated coding systems. We based our design on the intuition that clinicians first look for explicit statements that assert that a patient has the condition of interest. If they fail to find these statements, they look for evidence of the condition. For example, the question “does the patient have diabetes?” can be answered by finding a statement in the notes that asserts that the patient has diabetes. However, if the explicit assertion is missing, it is still possible to determine whether the patient has diabetes by looking for concepts that are commonly associated with diabetes. Thus, a physician might read the chart and discover that the patient had high glycosylated hemoglobin (a lab marker of long-term glucose concentration in blood), takes metformin (a drug used to treat diabetes), and had a foot exam (commonly performed on patients with diabetes during office visits). The presence or absence of these additional elements may add evidence for or against a diagnosis of diabetes, respectively, in the event that the concept is explicitly mentioned. In other words, human experts use background knowledge to understand text; specifically, they look for consistency between multiple concepts found in the text. In many respects, this process mirrors the construction and integration phases of Kintsch’s influential construction-integration model of text comprehension: concepts derived from both the reader’s background knowledge and elements of the text itself are integrated during the process of constructing a mental representation of the text, and the extent to which these concepts are collectively consistent with a particular interpretation (e.g., patient has diabetes) determines whether or not this interpretation prevails.
We imitate this comprehension process [18] by constructing a nearest neighbor graph using a limited breadth-first search from a seed term on UMLS concepts extracted from our CDW, to simulate associative retrieval of related concepts during the process of text comprehension. We also simulate the imposition of external knowledge not explicitly mentioned in the record by using knowledge from the UMLS and the biomedical literature to curate the graph. Finally, we use spreading activation on the graph to simulate the integration component of Kintsch’s model, which resolves inconsistencies by spreading activation across the links between concepts, such that concepts that are contextually consistent will ultimately be more activated.
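The two graph steps described above, a limited breadth-first search from a seed term followed by spreading activation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the system's actual implementation: the function names `limited_bfs` and `spread_activation`, the toy association table, the `max_depth`, `max_neighbors`, `iterations`, and `decay` parameters, and the particular update rule (evidence plus decayed weighted input from neighbors, normalized by the peak each round). The curation step using the UMLS and the literature is not shown.

```python
from collections import deque


def limited_bfs(cooccurrence, seed, max_depth=2, max_neighbors=3):
    """Build an undirected neighbor graph around `seed`.

    `cooccurrence` maps concept -> {neighbor: association strength}.
    Only the `max_neighbors` strongest links of each concept are followed,
    and expansion stops `max_depth` hops from the seed.
    """
    graph = {seed: {}}
    queue = deque([(seed, 0)])
    while queue:
        concept, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        links = cooccurrence.get(concept, {})
        strongest = sorted(links.items(), key=lambda kv: kv[1], reverse=True)
        for neighbor, weight in strongest[:max_neighbors]:
            if neighbor not in graph:
                graph[neighbor] = {}
                queue.append((neighbor, depth + 1))
            graph[concept][neighbor] = weight
            graph[neighbor][concept] = weight
    return graph


def spread_activation(graph, evidence, iterations=10, decay=0.5):
    """Spread activation from concepts found in a record.

    `evidence` maps concept -> initial activation (e.g., 1.0 if mentioned).
    Each round, a node receives its evidence plus a decayed, weighted sum
    of its neighbors' activation; values are then divided by the peak.
    """
    activation = {node: evidence.get(node, 0.0) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, links in graph.items():
            incoming = sum(w * activation[n] for n, w in links.items())
            updated[node] = evidence.get(node, 0.0) + decay * incoming
        peak = max(updated.values()) or 1.0  # avoid dividing by zero
        activation = {node: value / peak for node, value in updated.items()}
    return activation


# Hypothetical association strengths between concepts (toy data).
cooccurrence = {
    "diabetes": {"metformin": 0.9, "hba1c": 0.8, "foot exam": 0.6, "aspirin": 0.1},
    "metformin": {"diabetes": 0.9, "hba1c": 0.5},
    "hba1c": {"diabetes": 0.8},
}

graph = limited_bfs(cooccurrence, "diabetes", max_depth=1, max_neighbors=3)

# A record that mentions metformin and HbA1c but never states "diabetes".
evidence = {"metformin": 1.0, "hba1c": 1.0}
activation = spread_activation(graph, evidence)
print(round(activation["diabetes"], 3))
```

In this toy case the seed concept ends up activated even though the record never mentions it, because activation flows in from consistent neighboring concepts; a record with no related concepts leaves the seed at zero.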