The de-identification software was developed to scrub patient and provider identifying information from MIMIC II free-text medical records before limited public release of the data under a data use agreement. This software has been used to de-identify approximately 700,000 nursing notes, 30,000 discharge summaries, and 300,000 radiology reports, containing a total of approximately 220 million words (~1.8 GB). Our algorithm, running on a standard desk-top computer (3.4 GHz Pentium 4 processor with 512 kb cache) can de-identify text at a rate of approximately 10 MB per hour. Large US hospitals produce terabytes of data annually, but typically only about 8 GB of this is free text data (22 MB/day) [
30], so that real time de-identification of free text data from a major hospital seems quite feasible with our software. De-identification of large volumes of data can be even more rapidly performed using several multi-core servers, as is our current approach.
The algorithm recall performance on the gold standard corpus of re-identified nursing notes (with an average rate of 0.967) is better than the average individual human de-identifier (0.81), the best single human de-identifier (0.94), and the average consensus of two human de-identifiers (0.94). On the test data, the algorithm performs as well as the best single human de-identifier and the average consensus of two human de-identifiers.
The algorithm performed well with critical PHI, missing no patient names in each corpus. Only one instance of an age over 89 was missed in the gold standard data, and three in the test corpus. This indicates that number-related sub-routines performed well, evidenced by the fact that only two full dates were missed in the test corpus.
Locations proved more difficult to detect, with less than five and two false negatives per 100,000 words in the gold standard and test corpora respectively. It is interesting to note that when customized dictionaries are not used, the algorithm performs almost as well, except for locations, with more than 23 times as many incidences of PHI for this category. It is therefore important to make sure that an extensive local dictionary is used to ensure locations smaller than a state, and in particular hospital-specific names, are removed by the algorithm.
Word misspellings are a major source of difficulty in de-identifying free text. Extensive lists of known PHI are used to identify instances of names, locations, etc. No spell-checking libraries such as
Ispell or
Aspell [
25] are used. However, misspelled PHI instances do not match these known PHI instances, and currently context information is used to identify them. For example, uncommon words preceded by "Mr." or "Doctor" are identified as name PHI. The false positive count rises when these context rules are too inclusive, while lax rules elevate the risk of PHI release. The rules for the algorithm were based on careful evaluation of the trade-off between the number of false negatives and false positives.
An obvious extension is to look for likely misspellings of the patient's name if known a priori as names are the most dangerous type of PHI to miss. Similarly, searches for single digit omissions, insertions and reversals in social security numbers, medical record numbers and other a priori known patient identifiers can be used.
In addition to the PHI categories specified by HIPAA, free-text context information may reveal a patient's identity. For example, "the patient's trailer was blown away by a tornado the night before Christmas" is a piece of text that does not contain any terms that are outright PHI, but the date is obvious to a human and a news search could potentially reveal details on this newsworthy event and the identity of this relatively unique patient. Despite the risk of such inadvertent PHI disclosure, it is not feasible to have a clinician review every de-identified record to ensure removal of all PHI. In fact, Table illustrates that even a consensus of three expert de-identifiers was incapable of removing all PHI. A potential, but difficult avenue for further work would be to devise an intelligent method to scrub contextual information, such as in the above example, which can indirectly reveal a patient's identity.
Although machine learning approaches coupled with large and representative labelled databases could identify some PHI occurrences, such systems are innately fragile. The combination of machine learning techniques with sensible heuristics has been shown to improve the precision and recall of de-identification systems [
28]. However, it is unlikely that all eventualities can be represented in either of these approaches. It is more likely that an information-identification approach will work more accurately and ultimately be of more use. That is, there is an assumption that every component of a medical text is intended to mean something to the person who wrote it. A classification problem that identifies the semantic categories of each component of the text can thus mark for exclusion as potential PHI (or irrelevant information) any element for which it has no contrary evidence.
Before such systems can be developed, gold standard corpora (encompassing a vast range of data types) must be created and made available. Until now no public database of free text elements of medical records has been available, and comparisons of the algorithm described here with other algorithms are therefore difficult. Brief testing of other publicly available algorithms produced poor results on the gold standard and test corpora because each algorithm is designed specifically for a particular type of data structure. The availability of de-identification algorithms, especially open-source algorithms that can be customized and extended, will make further development of large gold-standard corpora feasible.
Medical de-identification systems have the potential for widespread use in information sharing for research purposes. The system described here is sufficiently generalized to handle text files of almost any format, albeit with varying performance, and will be useful in other research groups' de-identification efforts and database construction. Each PHI identification module can be switched on or off and dictionaries can be changed or switched. Furthermore, the user is able to specify which word categories are to be identified as PHI. Thus, it is possible to identify the full range of PHI categories or any subset of it, making the software appropriate for use on any type of text or medical record. In the spirit of open-source software, the full source-code has been made available online for public use via PhysioNet [
8,
9].
It should be noted however, that de-identification does not supersede the Common Rule [
31], which applies to any human subject research that draws from any confidential medical records and states the responsibility of researchers to obtain IRB approval prior to data collection, as well as the responsibilities of IRB's to ensure the data are used safely.
IRB's are allowed, in the case of "minimal risk" and significant benefit, to grant permission to use clinical data in research without explicit patient consent, if such consent would be impractical to get. For example, we received approval to use raw patient data to help develop de-identification algorithms, with various safeguards to make sure that the data do not "leak". Our algorithm, however, is not meant to subvert these rules, and as such, our algorithm is insufficient to remove all PHI types required by HIPAA, and human oversight is also probably required.
An alternative approach that virtually ensures full HIPAA-compliant de-identification is that of concept-matching [
26] where the output is devoid of phrases that do not map to a reference terminology and is stripped of nonmedical and extraneous information. Although, some relevant information may be removed, concept matching also provides the terminology code for each medical term included in the sentence, making it possible to index and relate the terms to each other and standard biomedical ontologies.