Since the formula for specificity is based on data set size rather than total number of comparisons, this measure changes with data set size. If instead one divides the number of matches (all of which are deemed to be false positives) by the number of total comparisons: m/((n2–n)/2), the resulting ratio is constant, which makes it more useful when attempting to predict the number of false positives that can be expected for a given data set.
Applying this formula to a STRIDE test data set with m=18 917 and n=158 300, yields
or 1.5 errors per 108
Therefore, when comparing two data sets of around 10 000 patients each, which when compared to each other results in about 108 comparisons, one can expect a few false positives, whereas data sets with 5000 or fewer records are likely to not contain any false positives when linked with the proposed composite identifier. This number is only useful as a very rough rule of thumb, however, since each data set has its own characteristics, and is mostly intended to underscore the point that if false positives are considered unacceptable, this algorithm cannot be relied upon exclusively but should instead be used in conjunction with other linkage variables.
We then turned our attention to the study data set of female breast cancer patients from Stanford and PAMF. With 10 939 patients in one data set and 8166 in the other, our combined cohort contained 19 105 records, of which 2087 were found to be held in common after manual review. After our matching efforts were complete, we found just three erroneous matches, which is reasonably consistent with our predictions.
Our composite identifier found 2028 of the 2087 confirmed matches, with an additional 59 patients found by comparing hashed SSNs, as shown in . Given the fact that we had SSNs on only 87% of our patients, we consider the number 59 as a lower bound on the number of false negatives. In our subsequent manual review of the full identifiers from both sites for patients identified as candidate matches, we found that of the 2028 patients found to be in common by only looking at the composite identifier, 1824 were exactly matching in all of full first name, full last name, and date of birth. The remaining 204 were confirmed by inspection of the full identifiers at both sites to be the same person. This was a very interesting result in that it provided us with a measure of how much better our approach is compared to using full names rather than two-letter prefixes. Fully 10% of our shared cohort would have been mistakenly split had we relied on string hashes of date of birth plus full names rather than using two-letter name prefixes with date of birth.
Characteristics of name prefix plus date of birth as an identity preserving anonymized identifier. PAMF, Palo Alto Medical Foundation; SSN, social security number.
We obtained IRB approval to share the cryptographic hashes of each patient's SSN between informatics staff specifically for the purpose of identifying false negatives. We found 59 patients with differing name prefix/date of birth identifiers who shared the same SSN and turned out to be the same person for a sensitivity measure of 2028/(2028+59)=97.17%, although our true sensitivity may be as low as 2028/(2028+68)=96.7%, since we were missing SSNs on 13% of our cohort. shows the various sources of identity mismatch (false negatives) when matching is based entirely on name prefix and date of birth.
Analysis of reasons for false negatives in name prefix/date of birth (DoB) hash. These cases were identified by matching social security numbers. All 59 false negatives were determined by manual review to be the same person.
By way of contrasting the efficacy of our composite identifier with a real-world identifier, we looked at the results produced by matching on a patient's SSN. By both metrics of rate of false positives and rate of false negatives, SSNs fare significantly worse13
than our composite identifier. The number of patients in common found by comparing SSNs was only 1806, compared with 2028 found by the name prefix and date of birth identifier. Put another way, since there were 222 patients in common found by our identifier who would have been misclassified as different patients had we relied entirely on matching by SSN, the rate of false negatives for SSN-only matches is 10% higher than that of our identifier. One hundred and seventy-two of these were cases where one or both of the SSNs were missing; the reasons for the false negative in the remaining 50 cases are depicted in .
Reasons for false negatives in social security number (SSN) matching. In 12 cases the SSNs were completely different, despite our verification that the matched records referred to the same individual.
Similarly, the rate of false positives was much higher with SSN matching. Compared with only three false positives for our composite identifier, there were 57 false positives for SSN matches, resulting in a nearly 20 times greater number of misclassified records.
Finally, we exchanged individual cryptographic hashes of the patient's full first name, full last name, and date of birth and analyzed the results in order to measure the error rate of other individual linkage variables, the results of which are given in .
Characterization of the primary source of error for several linkage variables in our data set
We observed incidentally that both our institutions appear to adopt a practice of keeping the woman's maiden name on file, generally hyphenating it with her married name, when recording a change of name due to marriage. This is presumably to help the doctors recall the patient. We have no data, however, on how systematic or widespread this practice is, nor can we assess how many records remained split due to name change at one institution but not the other. We do expect at least 0.3% of our patients remain unlinked due to typographic errors in the date of birth and name variations affecting the initial letters of both first and last name.