Analyzing the results of language processing tasks can be challenging in that there is often a subjective aspect to scoring. This is present in healthcare coding and abstracting, where variation between human “experts” is not uncommon. ALM has developed a set of statistical tools for the purposes of quality assurance and performance auditing.^{ 4 } After reporting the i2b2 results, we include examples that demonstrate the need for further analysis, briefly discuss the pertinent techniques from our production analysis tool suite, and provide the analytic results as derived from the raw test scores.

The i2b2 confusion matrix for ALM’s challenge performance, and the precision, recall, and F-measure scores (rounded to 4 decimal places) for all categories are reproduced in below.

| **Table 1**Table 1 A-Life Medical (Reported Matrix) |

| **Table 2**Table 2 A-Life Medical (Reported Scoring) |

The weighted F-measure for our system was 0.8388. The mean across all participants in the i2b2 challenge was 0.7957.

Overall we agreed with the i2b2 scoring of the documents. Systematic issues we recognized in our review included cases in which our engine failed to recognize novel formatting, and cases in which our engine’s production design prevented information extraction, such as the “history of” issue with diagnosis sections.

We also found five cases which we believe were miscategorized. While we recognize that human coders may disagree on the coding of a document,

^{ 4 } we felt that the coding for these five cases was clear. Three examples are detailed below:

- • “SOCIAL HISTORY: Widowed since 1972, no tobacco, no alcohol, lives alone.” (ALM “non-smoker”, i2b2 “past-smoker”).
- • Social History: “No alcohol use and quit tobacco greater than 25 years ago with a 10-pack year smoking history.” (ALM “past-smoker”, i2b2 “current-smoker”).
- • “He is a heavy smoker and drinks 2–3 shots per day at times.” (ALM “current-smoker”, i2b2 “past-smoker”).

Given the small size of the test set, rescoring using our judgments on these five documents provides a significant increase in our scores ().

| **Table 3**Table 3 A-Life Medical (Revised Matrix) |

| **Table 4**Table 4 A-Life Medical (Revised Scoring) |

The weighted F-measure for our system given the revised document scoring was 0.8965.

Although we prefer these revised scores, they are neither more nor less indicative of relative system performance than the original scores. We discuss this point through consideration of two important measures:

- • An estimated coefficient of variation (CV) based on observed variation between human “experts.”
- • A confidence interval and precision level based on test size.

Because medical coding and abstracting are, to a degree, subjective and prone to error, we argue there is no straightforward ‘gold standard’ for evaluation. A measure of coefficient of variation (CV) should be used in evaluating test scores.^{ 4 } For medical coding/abstracting, the CV is a measure of the observed (or expected) variation in performance from one test to the next, or a measure of the variation between qualified coders/abstractors performing the same test. Observed CV on medical coding tasks can be very large when measured on production coders. When measured on qualified auditors, a CV of 0.05 (5%) is not unusual and a CV less than 0.03 (3%) is highly unlikely.

Based on a sample of only two (ALM’s judgments and i2b2’s judgments), combined with large-scale observations of production medical coding and auditing, we believe that a CV of 0.05 (5%) is minimal for the i2b2 smoking history challenge. The challenge organizers noted that on an individual basis the physicians who created the test standard only agreed 80% of the time. This variation seems to argue for measuring the test results on the basis of inter-rater agreement.^{ 5 }

Appropriate confidence and precision levels are dependent in part on test set size. The US Department of Health & Human Services, Office of the Inspector General (OIG) provides the Rat-Stats^{ 6 } calculator for determination of minimum test set sizes for auditing purposes. Applying the Rat-Stats unrestricted sample calculator to the parameters of this challenge produces sample sizes as shown in :

| **Table 5**Table 5 Rat-Stats Sample Sizes |

Making an approximate fit of the i2b2 smoking history 104 document test set to the Rat-Stats sample size yields only an approximate 10% precision level at 95% confidence. Alternatively, at a less than 80% confidence level, the precision level rises to 5%.

We next apply these CV and confidence-precision level measures in our analysis of the probability distributions of the lowest performing, highest performing, and ALM systems.

Note that the desired confidence interval should be greater than CV, i.e., no matter how large the sample, we cannot be more confident of our test results than we are of our scorer. Increasing the sample size, which is the practical effect of decreasing H, will not truly improve precision once H ≤ CV/2. H ≥ (CV/2) + 0.005 is recommended. We also recommend CL ≤ 100 * (1 − CV) because, similar to H, we cannot expect to achieve a confidence level in the test that is greater than the maximum accuracy that the scorer can achieve.

Therefore, using an estimated CV of 0.05 and P of 0.2 (based on the mean F-measure of the challenge participants) and fpc * n of 104 (the actual sample size), then for all scores, X = x − (0.05 * 0.2 * 104). In other words, the raw number of errors (defects) should be reduced by approximately 1, for all participants. Also, the optimal sample size would be approximately 400 documents which would yield a confidence level of about 95% at about a 5% precision level.

Based on this analysis (95% confidence level and 10% precision level) and the reported scores from the i2b2 smoking history challenge adjusted according to X, the probability distributions for the lowest score, ALM’s score and the best score would be approximately as shown in . The 95% confidence interval for the ALM score is marked by vertical bars. Note that within these bars there is significant system overlap, particularly between ALM’s system and the best performing system, but also including the lowest performing system as well.

Although we do not have ready access to the full results from each participant, it appears that a judgment of statistical significance between the majority of the scores is not warranted given the test set size and the variation in judgment regarding the test standard.