MeDS effectively de-identified a wide variety of HL7 messages from multiple sources, and its performance is comparable to our gold standard. In our first evaluation, our software scrubbed 49,368 (98.45%) of all 50,148 unique identifiers (patient and nonpatient), and it scrubbed 99.06% of all unique HIPAA-specified patient identifiers in this sample. In our second test, on pathology reports exclusively, MeDS scrubbed 92,682 (99.12%) of all 93,509 unique identifiers and 79,993 (99.47%) of the 80,418 HIPAA-specified patient identifiers. These results are better than the performance of most other reported systems. Thomas et al. 6 reported removing 7,151 (92.75%) of 7,710 names with a system designed to scrub only proper names from pathology reports. Beckwith et al. 9 reported removing 3,439 (98.23%) of 3,499 unique identifiers from pathology reports. Gupta et al. 7 reported that their software “performed extremely well” but did not quantify the results. Regarding over-scrubbing, in our first evaluation our software committed 4,012 over-scrubs, roughly 7% of the total number of found identifiers (54,160), a much smaller percentage than reported by others. Beckwith et al. 9 reported 4,671 over-scrubs, roughly equal to the total number of found identifiers (4,515). Neither Thomas et al. 6 nor Gupta et al. 7 reported their over-scrubbing rates. Our study is unique in that we evaluated our software's performance in scrubbing multiple types of documents rather than a single type of report. In the recent i2b2 de-identification challenge, Uzuner et al. 15 evaluated seven different systems and reported sensitivities ranging from 0.80 to 0.96 and specificities ranging from 0.83 to 0.97. The performance of our system is similar to that of the best systems in this challenge, although several differences in study design prevent direct comparison. In our study, we processed nearly 9,000 real-world HL7 messages of varying types taken directly from our health information exchange, and we did not introduce ambiguities into our test set. The i2b2 test set, by contrast, consisted of 220 discharge summaries into which ambiguities were intentionally introduced by replacing patient and provider names with medical terms. Finally, in our study MeDS had access to and used patient information in the header section of documents; the systems in the i2b2 study were not provided with equivalent information.
Overall, we found more non-HIPAA identifiers than HIPAA-specified identifiers, especially in laboratory reports. Approximately 70% of the non-HIPAA identifiers were provider names; the rest were laboratory/hospital names, addresses, and phone numbers. We describe these as non-HIPAA because provider information is not listed among the 19 variables that HIPAA specifies as needing removal and is not strictly a patient identifier. However, most de-identification trials have removed provider information because it could contribute to re-identification of the patient.
We made several modifications to MeDS after our initial testing and achieved a modest improvement in HIPAA-specified identifier scrubbing performance. We added missing elements to several of the regular expressions (such as adding “suite” and “room #” to the expression that detects addresses), and we created several new regular expressions to detect additional variations of date and accession-number patterns. However, we found that the most significant modification needed to prevent missed identifiers was adjusting the order in which the regular expressions are applied. Our initial evaluation showed that most missed patient identifier fragments occurred because an earlier regular expression removed text that a subsequent regular expression needed to match. The following example illustrates this clearly. One regular expression detects and scrubs any number longer than four digits. Another detects a street address by looking for the pattern “any number + any words + street identifier (i.e., street, road, boulevard, etc.).” If the number expression runs before the street address expression, errors can occur: “12345 Main Street” is converted to “xxx Main Street” before the street address expression runs, so the “any number + any words + street identifier” pattern no longer exists. To prevent such errors, we found that, in general, regular expressions that detect more specific patterns should run before those that detect more general patterns. Despite these errors, we deemed it very unlikely that a patient's identity could be determined from any of the retained patient identifier fragments.
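As a concrete illustration of this ordering rule, the sketch below (not MeDS's actual code; the patterns and the “xxx” placeholder are assumptions for illustration) shows how applying the specific street-address pattern before the general long-number pattern avoids the fragmentation error described above.

```python
import re

# Specific pattern: "any number + any words + street identifier"
ADDRESS_RE = re.compile(
    r"\b\d+\s+(?:\w+\s+){0,3}(?:Street|St\.?|Road|Rd\.?|Boulevard|Blvd\.?)\b",
    re.IGNORECASE,
)
# General pattern: any number longer than four digits
LONG_NUMBER_RE = re.compile(r"\b\d{5,}\b")

def scrub(text: str) -> str:
    # Specific before general: scrub full street addresses first so the
    # long-number rule cannot destroy the address pattern beforehand.
    text = ADDRESS_RE.sub("xxx", text)
    text = LONG_NUMBER_RE.sub("xxx", text)
    return text

print(scrub("Lives at 12345 Main Street, MRN 9876543"))
# -> "Lives at xxx, MRN xxx"
# Reversing the two substitutions would first produce "xxx Main Street",
# leaving "Main Street" behind because the address pattern no longer matches.
```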
One of the strengths of our software is that it does not rely on a single method or process to remove identifiers. For example, a patient name in a report could be detected and scrubbed by the regular expression processor (i.e., via the pattern “patient name: firstname lastname”), by a direct match in the proper-name list table, by a match to the header information extracted earlier from the message, and finally, in the case of misspelled names, by the word-nearness similarity algorithm. Any one of these processes alone might detect the patient name, so maintaining multiple processes to remove a single kind of identifier may seem unnecessary. However, we found that this redundancy in the scrubbing processes lessens the likelihood of an identifier escaping detection.
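A minimal sketch of this layered design is shown below, assuming three independent passes (pattern-based, dictionary-based, and header-based); the function names and pass structure are illustrative, not MeDS's actual architecture.

```python
import re

def regex_pass(text: str) -> str:
    """Pattern-based: scrub 'patient name: firstname lastname'."""
    return re.sub(r"(patient name:)\s*\w+\s+\w+", r"\1 xxx",
                  text, flags=re.IGNORECASE)

def name_list_pass(text: str, known_names: set[str]) -> str:
    """Dictionary-based: scrub direct matches from a proper-name list."""
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "xxx", text,
                      flags=re.IGNORECASE)
    return text

def header_pass(text: str, header_values: list[str]) -> str:
    """Message-specific: scrub identifiers lifted from the message header."""
    for value in header_values:
        text = text.replace(value, "xxx")
    return text

def scrub(text: str, known_names: set[str], header_values: list[str]) -> str:
    # Apply every pass in turn; redundancy means an identifier missed by
    # one pass can still be caught by another.
    text = regex_pass(text)
    text = name_list_pass(text, known_names)
    text = header_pass(text, header_values)
    return text
```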
Using data present in the header section of a message had a small effect on the overall accuracy of the scrubbing; only a very small percentage of scrubbings depended on this technique. Although this process is rarely needed for most reports, there are instances in which it could be invaluable, such as when a patient name is also a common word or a medical term. In such cases, the name is more likely to be missed by the pattern-matching and name-matching processes.
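To make this concrete, the sketch below shows one way header data can drive scrubbing: the patient name is read from the PID segment (field PID-5) of an HL7 v2 message and its components are then removed from the report body. The sample message, the “xxx” placeholder, and the parsing shortcuts are illustrative assumptions, not MeDS's actual implementation.

```python
import re

message = (
    "PID|1||123456||DOE^JOHN||19700101|M\r"
    "OBX|1|TX|||Mr. John Doe presents with chest pain.\r"
)

segments = message.rstrip("\r").split("\r")
pid = next(s for s in segments if s.startswith("PID|"))
name_field = pid.split("|")[5]          # PID-5: patient name
name_parts = [p for p in name_field.split("^") if p]

scrubbed = []
for seg in segments:
    if seg.startswith("OBX|"):
        # Remove each name component from the report body, even when it
        # appears in free text rather than a structured field.
        for part in name_parts:
            seg = re.sub(rf"\b{re.escape(part)}\b", "xxx", seg,
                         flags=re.IGNORECASE)
    scrubbed.append(seg)

print("\r".join(scrubbed))
# OBX line becomes: "OBX|1|TX|||Mr. xxx xxx presents with chest pain."
```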
The ultimate goal of de-identification software is to scrub true patient identifiers while minimizing over-scrubbing. A medical report scrubbed not only of all patient identifiers but of all important medical data as well is of no use to researchers. Because we scrubbed information in excess of what HIPAA specifies and our software committed 4,012 over-scrubbing errors, we considered whether our scrubbed messages might no longer hold any research value. We therefore analyzed a sample of 300 scrubbed messages to determine readability and interpretability based on the following criteria: a laboratory message was interpretable if the type of test and the result were retained; a pathology report was interpretable if the type of report, the specimen, and the conclusion could be determined; and a narrative report was interpretable if the majority of significant clinical data was retained and the type of report and conclusion (if applicable) could be determined. An example of a scrubbed HL7 message is shown below. Approximately 95% of scrubbed messages were both readable and interpretable.
Example of a scrubbed HL7 narrative report message (endoscopy report).
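As a hypothetical restatement of the laboratory criterion in code (the evaluation itself was a review of message content; the HL7 field positions and the “xxx” scrub placeholder below are assumptions for illustration):

```python
def lab_message_interpretable(obx_segment: str) -> bool:
    """A lab message counts as interpretable if the type of test and
    the result both survived scrubbing."""
    fields = obx_segment.split("|")
    if len(fields) < 6:
        return False
    test_name = fields[3]   # OBX-3: observation identifier
    result = fields[5]      # OBX-5: observation value
    return all(f and f != "xxx" for f in (test_name, result))

print(lab_message_interpretable("OBX|1|NM|GLUCOSE||105|mg/dL"))  # True
print(lab_message_interpretable("OBX|1|NM|xxx||xxx|mg/dL"))      # False
```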
MeDS's name-nearness scrubber committed many false-positive errors; examples are shown in Table 7. Although this process is clearly not highly specific for detecting misspelled patient names, the importance of removing misspelled patient names from a report makes it valuable nonetheless.
Table 7 Examples of Over-scrubbing Errors Committed by the Name Nearness Scrubber
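The similarity metric underlying the name-nearness scrubber is not specified here, so the sketch below assumes Levenshtein edit distance with an illustrative threshold. This assumption reproduces both the intended catches (misspelled names) and the kind of false positives Table 7 illustrates (valid words that happen to sit close to a name).

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def near_name(word: str, patient_names: set[str], max_dist: int = 1) -> bool:
    """True if word is within max_dist edits of any known patient name."""
    w = word.lower()
    return any(edit_distance(w, n.lower()) <= max_dist for n in patient_names)

# Catches a misspelling of a patient name ...
print(near_name("Johnsen", {"Johnson"}))   # True
# ... but also flags a valid word near a hypothetical surname, the kind
# of over-scrubbing false positive Table 7 illustrates.
print(near_name("cane", {"Kane"}))         # True (false positive)
```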
We acknowledge a limitation of de-identifying reports by removing HIPAA-specified patient identifiers: even after the majority of these identifiers are removed, data can occasionally remain that could result in re-identification. Some documents, such as admission notes, typically contain detailed patient historical information, and if a history is sufficiently distinctive, the identity of the patient could be compromised, especially when coupled with other data. The identities of “a former president of the United States with Alzheimer's disease” and “an HIV-positive, 6-foot 9-inch black male, former professional basketball player” are probably readily apparent, despite the absence of any HIPAA identifiers. We did not find such occurrences in our dataset. This phenomenon illustrates that although the absence of patient identifiers is frequently an adequate measure of de-identification, occasionally it is not. Eliminating the records of well-known individuals from the dataset could help protect against such occurrences. In future versions of the software, we plan to add algorithms to scrub such contextual inferences.
There are several limitations to our study. The developer of the software also acted as the gold standard and evaluator of the scrubbing process; ideally, several trained experts not part of the development team would perform this evaluation, allowing interrater reliability to be measured. In addition, all collected reports were part of the INPC network, which is limited to the central Indiana area. Processing reports originating from a different network in a different geographic area may affect scrubbing accuracy and may require software modification to achieve similar results.
Although these initial results are promising, we see several ways to improve our software. The addition of a geographic name database would lessen the possibility that portions of a patient's address are missed; this approach has been used successfully in other de-identification systems. 9 We anticipate extending the name-nearness scrubber to cover other patient identifiers, such as patient addresses and provider names. Further modification of the name-nearness scrubber is also needed to lessen the likelihood of the software interpreting valid words as misspellings of patient names; however, sophisticated natural language processing techniques would likely be needed to accurately identify true spelling errors.
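As a sketch of how the proposed geographic name database might operate (the gazetteer entries and the matching rule are illustrative assumptions; a real system would load a comprehensive place-name list):

```python
# Three entries keep the sketch small; in practice this would be loaded
# from a full gazetteer of place names.
GAZETTEER = {"indianapolis", "carmel", "greenwood"}

def flag_place_names(text: str) -> list[str]:
    """Return tokens that match known geographic names, catching address
    fragments the regular expressions might miss."""
    tokens = text.replace(",", " ").split()
    return [t for t in tokens if t.lower() in GAZETTEER]

print(flag_place_names("Patient resides in Carmel, Indiana"))
# -> ['Carmel']
```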