The uptake of electronic medical records (EMRs) is increasing among family physicians in Canada and around the world[1]. EMRs contain comprehensive clinical information on the course of care, including lab results, prescriptions, patient risk factors, family history and past medical history, as well as physical measures such as height, weight and blood pressure and detailed information on clinical encounters that is not presently available from other data sources. However, EMRs were designed not for research but to help physicians improve their clinical practice. As such, secondary use of this data is impeded by the fact that much of the rich clinical data contained in EMRs is not entered in a format that lends itself easily to analysis[3]. Specifically, the lack of methods for de-identifying the narrative free-text portions of EMR data in order to preserve privacy has presented a major challenge for researchers interested in using this data.
At the Institute for Clinical Evaluative Sciences (ICES) we have developed an Electronic Medical Record Administrative data Linked Database (EMRALD) using data from family physician EMRs. This EMR data is linked through unique scrambled health card numbers to the multiple health-related administrative databases for the province of Ontario, housed at ICES. ICES is an independent, not-for-profit health services research organization with a unique designation as a 'prescribed entity' in Section 45(1) of the Personal Health Information Protection Act (PHIPA), Ontario's privacy legislation[4]. This means that ICES has policies and procedures in place to protect the privacy and confidentiality of patients[5], as required by the Act (s.45(3)), which have been reviewed and approved by the Information and Privacy Commissioner of Ontario. This status allows ICES to receive and use health information without consent for the purposes of analysis and compiling statistical information about our health care system. Even though ICES does not release any individual-level information, a free-text de-identification tool is needed to further enhance privacy measures through all steps of in-house EMR data analysis.
Although a number of software programs have been developed to de-identify narrative free-text in different types of medical data[6], none have been customized for the full range of primary care EMR notes. These notes contain free-text from a wide variety of sources, including point-form progress notes, consultation letters from practitioners in a variety of specialties, diagnostic test results, pathology reports and hospital discharge summaries. These free-text records vary widely in formatting and syntax, which makes devising a de-identification tool more complex.
Approaches to free-text de-identification include machine-learning based systems[11] and lexicon and pattern-based systems[6]. The machine-learning systems use labeled examples to automatically search for a statistical pattern of indicator features. For example, a human annotator would label U.S. zip codes or Canadian postal codes as elements to remove from EMRs. Then, features from the text such as the capitalization pattern, the appearance of digits, the term itself, the part of speech and syntactic dependencies are used to find a statistical rule that distinguishes between the postal codes and other text. Success in de-identifying medical discharge summaries has been achieved using a support vector machine (SVM) as the machine-learning algorithm[11]. In this case, the SVM attempts to find a separating hyperplane between the positive (labeled) and negative examples, where the examples are described using a specified set of text-based features.
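As an illustration of the kind of text-based features described above, the following minimal sketch computes a few surface features for a single token. The function name `token_features` and the particular feature set are assumptions made for this example, not the features used in the cited systems; part of speech and syntactic dependencies are omitted because they require a parser. A classifier such as an SVM would take feature vectors of this kind as input.

```python
import re


def token_features(token):
    """Return simple surface features for one token (illustrative only)."""
    # Word shape: digits -> 9, uppercase -> A, lowercase -> a.
    shape = re.sub(r"\d", "9", token)
    shape = re.sub(r"[A-Z]", "A", shape)
    shape = re.sub(r"[a-z]", "a", shape)
    return {
        "term": token.lower(),                       # the term itself
        "is_capitalized": token[:1].isupper(),       # capitalization pattern
        "has_digit": any(ch.isdigit() for ch in token),  # appearance of digits
        "shape": shape,
    }


# The first half of a Canadian postal code has the shape letter-digit-letter.
print(token_features("M5V")["shape"])    # A9A
print(token_features("Smith")["shape"])  # Aaaaa
```

Features such as the word shape let a learned rule generalize from a handful of labeled postal codes to unseen ones, since all of them share the same shape even though the terms differ.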
On the other hand, the lexicon and pattern approach uses a manually, rather than automatically, built collection of word lists, regular expressions and heuristics. This second approach has the disadvantage that experts must spend time creating and organizing the word lists and patterns. However, this characteristic can also be an advantage, because the expert can include knowledge of the field that goes beyond the available training examples or beyond a fixed set of local features.
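A minimal sketch of the lexicon and pattern approach is given below, assuming a tiny hypothetical name list and a simplified regular expression for Canadian postal codes. The names `NAME_LEXICON`, `POSTAL_CODE` and `scrub`, and the placeholder tags, are inventions for this example and do not reflect the word lists or patterns of any existing tool.

```python
import re

# Hypothetical, tiny word list; a real lexicon would be far larger.
NAME_LEXICON = {"smith", "tremblay", "nguyen"}

# Simplified Canadian postal code pattern, e.g. "M5V 3A8" or "M5V3A8".
# Real postal codes exclude some letters; that refinement is omitted here.
POSTAL_CODE = re.compile(r"\b[A-Za-z]\d[A-Za-z][ -]?\d[A-Za-z]\d\b")
WORD = re.compile(r"[A-Za-z]+")


def scrub(text, lexicon=NAME_LEXICON):
    """Replace lexicon names and postal codes with placeholder tags."""
    def replace_name(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in lexicon else word

    # Names are replaced first so that the lexicon is never matched
    # against the placeholder text inserted by the postal code step.
    text = WORD.sub(replace_name, text)
    return POSTAL_CODE.sub("[POSTAL_CODE]", text)


note = "Mrs. Smith, M5V 3A8, seen for BP check."
print(scrub(note))  # Mrs. [NAME], [POSTAL_CODE], seen for BP check.
```

Because the word list and the pattern are independent, an expert can extend one without touching the other. For example, `scrub(note, NAME_LEXICON | {"okafor"})` adds a surname without any retraining.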
It is possible to adapt either type of system, but the style of adaptation differs. Adapting a machine-learning based system emphasizes adding training examples and modifying the set of text-based features. This adaptation would require expertise to label the new examples and then a large number of iterations to evaluate the effect of different features. Given that we are regularly adding EMR records from clinics in different geographic locations that receive information from different institutions and specialty areas, adapting a lexicon and pattern system[17], which emphasizes extending word lists, adding new word lists, and adding and removing regular expressions, appeared to be more appropriate for our needs. For the most part, new words and patterns can be added independently of each other, so that the effects of a change are predictable to the expert. This type of adaptation can require more time from the expert, but it presents the possibility of quickly introducing additional domain knowledge without having to retrain the system each time a new clinic is introduced.
Most previous work in this area has been designed to de-identify all personal health information (PHI) as outlined by the Health Insurance Portability and Accountability Act (HIPAA) in the United States. While PHI such as names and locations need not be preserved, PHI such as age and the dates of hospitalizations, procedures and visits has clinical implications and is important to preserve in EMR data in order to fully utilize the data for research and evaluation purposes.
We set out to determine whether deid, an open source software program designed and tested on hospital nursing notes, could be modified to de-identify primary care EMR records in EMRALD with high precision while preserving clinically important content.