Two reference standard corpora were used for this study: 1) the 2006 i2b2 de-identification challenge corpus [19], and 2) a corpus of various VHA clinical documents developed for the CHIR de-identification project.
The i2b2 de-identification corpus includes 889 documents (discharge summaries) that were de-identified and in which all PHI was “re-synthesized” (i.e., automatically replaced with realistic surrogates; further details about the re-synthesis strategy can be found in [19]). This corpus was annotated for 8 categories of PHI and includes 19,498 PHI annotations (details in Table 2):
Table 2. PHI category distribution and mapping for the VHA, i2b2 and Swedish Stockholm EPR corpora
. Patients: includes the first and last name of patients, their health proxies, and family members, excluding titles (e.g., Mrs. Smith was admitted).
. Doctors: refers to medical doctors and other practitioners mentioned in the records, excluding titles.
. Hospitals: names of medical organizations and of nursing homes, including room numbers, buildings and floors (e.g., Patient was transferred to room 900).
. Locations: includes geographic locations such as cities, states, street names, zip codes, building names, and numbers.
. Dates: includes all elements of a date. Originally, years were not annotated in this corpus; we modified these annotations to include years, for consistency with our VHA date annotations.
. Phone numbers: includes telephone, pager, and fax numbers.
. Ages: ages above 90 years old.
. IDs: refers to any combination of numbers, letters, and special characters identifying medical records, patients, doctors, or hospitals (e.g., medical record number).
The VHA de-identification corpus includes a large variety of clinical documents. A stratified random sampling approach was used to ensure good representation of the variety of clinical notes found at the VHA and a sufficient representation of less common note types. Documents created between 04/01/2008 and 03/31/2009 and containing more than 500 words (to ensure sufficient textual content and PHI identifiers) were included. We then used the 100 most frequent types of clinical notes at the VHA as strata for sampling, and randomly selected 8 documents in each stratum, yielding 800 documents in this corpus. Figure depicts the frequency of the 100 most frequent clinical note types (Addendum excluded). These most frequent note types include consult notes from different specialties, nursing notes, discharge summaries, ER notes, progress notes, preventive health notes, surgical pathology reports, psychiatry notes, history and physical notes, informed consent, operation reports, and other less common note types. A few document types account for the majority of the clinical notes available: the 10 most frequent note types correspond to 42% of all clinical notes, and the 25 most frequent note types to 65%.
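The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the tuple layout `(doc_id, note_type, word_count, created)`, the function name `stratified_sample`, the fixed random seed, and the toy corpus are all hypothetical.

```python
import random
from collections import defaultdict
from datetime import date

# Inclusion window and length threshold as stated in the text.
WINDOW_START = date(2008, 4, 1)
WINDOW_END = date(2009, 3, 31)
MIN_WORDS = 500


def stratified_sample(documents, strata, per_stratum=8, seed=0):
    """Pick `per_stratum` eligible documents at random from each stratum.

    `documents` is an iterable of (doc_id, note_type, word_count, created)
    tuples; this layout is an assumption made for the example.
    """
    rng = random.Random(seed)
    eligible = defaultdict(list)
    for doc_id, note_type, word_count, created in documents:
        in_window = WINDOW_START <= created <= WINDOW_END
        if note_type in strata and word_count > MIN_WORDS and in_window:
            eligible[note_type].append(doc_id)

    sample = {}
    for note_type in strata:
        pool = eligible[note_type]
        # Guard against strata with fewer eligible documents than requested.
        sample[note_type] = rng.sample(pool, min(per_stratum, len(pool)))
    return sample


# Toy corpus: 100 hypothetical note types, each with 20 eligible documents,
# one document that is too short, and one created outside the window.
strata = [f"note_type_{i:03d}" for i in range(100)]
toy_corpus = []
for note_type in strata:
    for j in range(20):
        toy_corpus.append((f"{note_type}_doc{j}", note_type, 600 + j, date(2008, 6, 1)))
    toy_corpus.append((f"{note_type}_short", note_type, 120, date(2008, 6, 1)))
    toy_corpus.append((f"{note_type}_old", note_type, 900, date(2007, 1, 1)))

sample = stratified_sample(toy_corpus, strata)
total_sampled = sum(len(ids) for ids in sample.values())
print(total_sampled)
```

With 100 strata and 8 documents per stratum, the toy run selects 800 documents, matching the corpus size reported above. Drawing the same number of documents from every stratum is what oversamples low-frequency note types relative to their share of all notes.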
As already mentioned, an objective of our sampling strategy was to create a reference corpus as representative as possible of the variety of VHA clinical notes. Although this strategy resulted in oversampling low-frequency note types, it also allowed us to measure, to a certain extent, the generalizability of de-identification tools across document types.
We then manually annotated the documents using a PHI schema that included all categories of PHI defined in the HIPAA “Safe Harbor” legislation [2], as well as some armed forces-specific information such as deployment locations and units. Each document was independently annotated by two reviewers; disagreements were adjudicated by a third reviewer, and a fourth reviewer finally examined ambiguous and difficult adjudicated cases. The PHI categories annotated in our corpus were defined as follows:
. Names: all occurrences of person names, distributed in four sub-categories (i.e., patients, relatives, healthcare providers, and other persons) and including first names, last names, middle names and initials (not titles), e.g. “Patient met Dr. JAMISON JAMES”.
. Street City: addresses including the city, street number and name, apartment number, etc. (e.g., “lived on 5 Main Street, Suite 200, Albany NY 0000”).
. State Country: all mentions of states and countries. It also includes mentions of countries associated with military service, service awards, or place of residence at the time of deployment (e.g., “He was awarded the Korean service medal”).
. ZIP code: zip code information.
. Deployment: armed forces-specific identifiers that describe a deployment location, or mention of units, battalion, regiment, brigade, etc. (e.g., “had worked as a cook at Air Base 42 for 3 yrs”).
. Healthcare Unit Name: any facility performing health care services, including smaller units (e.g., detox clinics, HIV clinics), and generic locations such as MICU, SICU, ICU, ER. This also includes all explicit mentions of healthcare facilities, clinical laboratories, assisted living, nursing homes, and generic mentions such as “the hospital”, “the clinic”, or “medical service” (e.g., “patient was referred to the blue clinic”, “transferred to 4 west”).
. Other Organization Name: company or organization names not related to healthcare that are associated with a patient or provider (e.g., “patient is an active member of the Elk’s club”).
. Date: all elements of a date, including year and time, days of the week, and day abbreviations (e.g., “on December, 11, 2009@11:45am”, “administered every Monday, TU, and Thurs”).
. Age > 89: all instances of age greater than 89 years old.
. Phone Number: all numeric or alphanumeric combinations of phone, fax, or pager numbers, including phone number extensions (e.g., “call 000-LEAD”, “dial x8900”).
. Electronic Address: references to electronic mail addresses, web pages and IP addresses.
. SSN: combinations of numbers and characters representing a social security number, including first initial of last name and last four digits of the SSN (e.g., “L0000 was seen in clinic”).
. Other ID Number: all other combinations of numbers and letters that could represent a medical record number, lab test number, or other patient or provider identifier such as driver’s license number (e.g., “prescription number: 0234569”, “Job 13579/JSS”).
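To make the double-annotation step more concrete, the sketch below shows one way the annotations of two reviewers could be compared to find the cases that need adjudication. It is only an illustration: the text does not state how disagreements were detected, so the exact-match criterion (same character offsets and same category), the `PHIAnnotation` type, and the helper names are assumptions.

```python
from typing import NamedTuple


class PHIAnnotation(NamedTuple):
    """A PHI mention as character offsets plus a category (hypothetical type)."""
    start: int
    end: int
    category: str


def span(text, phrase, category):
    """Build an annotation for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return PHIAnnotation(start, start + len(phrase), category)


def split_for_adjudication(reviewer_a, reviewer_b):
    """Return (agreed, disputed) annotations under exact-match comparison."""
    a, b = set(reviewer_a), set(reviewer_b)
    agreed = sorted(a & b)    # identical span and category from both reviewers
    disputed = sorted(a ^ b)  # produced by only one reviewer
    return agreed, disputed


text = "Patient met Dr. JAMISON JAMES on Monday"

# Reviewer A follows the guideline and excludes the title from the name.
reviewer_a = [span(text, "JAMISON JAMES", "Names"), span(text, "Monday", "Date")]
# Reviewer B includes the title, which produces a different span.
reviewer_b = [span(text, "Dr. JAMISON JAMES", "Names"), span(text, "Monday", "Date")]

agreed, disputed = split_for_adjudication(reviewer_a, reviewer_b)
print("agreed:", agreed)
print("to adjudicate:", disputed)
```

In this toy case both reviewers agree on the date, while the two competing name spans would go to the adjudicator. A more lenient overlap-based comparison is equally plausible; exact matching is used here only because it is the simplest to state.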
For the study presented here, a subset of 225 clinical documents was randomly selected from the annotated VHA de-identification corpus. This 225-document evaluation corpus contained 5,416 PHI annotations. Since our objective in this study was to measure how available text de-identification methods perform with VHA documents, and then to develop a best-of-breed system adapted for VHA clinical narratives, we had to perform a detailed error analysis. We therefore used this subset of 225 documents for this study and set part of the corpus aside for future independent evaluation.
In Table 2, we show the distribution of each PHI category in our 225-document VHA corpus, along with the corresponding categories in the i2b2 de-identification corpus, as well as the Swedish Stockholm EPR De-identified corpus [20]. Although we did not use the latter in this study, we included it in this table to have another comparison of the distribution of PHI in clinical documents. As shown in the table, Dates, Healthcare Units, and Person Names are the most common PHI categories, while other categories like Ages rarely appear in clinical documents. The PHI category distribution varies significantly between the three corpora. For example, Locations represent 5.57% of the annotations in the VHA corpus (considering Street City, State Country, and ZIP code as Locations), while this percentage drops to 1.35% in the i2b2 corpus and 3.35% in the Swedish Stockholm EPR De-identified corpus. These differences contribute to the difficulties encountered when developing automated text de-identification tools that could be applied across different institutions and document types, a challenge that is still unmet.
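The category mapping behind this comparison can be illustrated with a short sketch. It merges the three VHA address-related categories into Locations, as described above, and computes each category's share of all annotations. The function names are hypothetical, and the counts are toy values chosen to make the arithmetic easy to follow; they are not the figures from Table 2.

```python
from collections import Counter

# VHA categories counted as "Locations" for cross-corpus comparison (from the text).
LOCATION_CATEGORIES = {"Street City", "State Country", "ZIP code"}


def merge_category(category):
    """Map a VHA category to its merged comparison category."""
    return "Locations" if category in LOCATION_CATEGORIES else category


def category_distribution(annotation_categories):
    """Return the percentage of annotations per merged category."""
    counts = Counter(merge_category(c) for c in annotation_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Toy counts for illustration only; NOT the values reported in Table 2.
toy_counts = {
    "Date": 50,
    "Names": 30,
    "Healthcare Unit Name": 10,
    "Street City": 4,
    "State Country": 5,
    "ZIP code": 1,
}
toy_annotations = [cat for cat, n in toy_counts.items() for _ in range(n)]

distribution = category_distribution(toy_annotations)
for category, pct in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {pct:.2f}%")
```

Applying the same mapping to each corpus before computing percentages is what makes figures such as the Locations share comparable across annotation schemas with different granularity.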