The validity of both components of the Autocoder was evaluated using standard techniques of computing precision and recall. Precision and recall are measures widely used in the domains of machine learning and text categorization 29
and are defined in the “Evaluation Measures” subsection. The development of several reference standards is discussed in the “Reference Standard Development” section. Finally, the experiment results are reported in the “System Evaluation” section.
We used the standard evaluation metrics of precision, recall and f-score. Precision is defined as the ratio of correctly assigned categories (true positives) to the total number of categories produced by the classifier (true positives and false positives).
Recall is the ratio of correctly assigned categories (true positives) to the number of target categories in the test set (true positives and false negatives).
F-score represents the weighted harmonic mean of precision and recall, computed according to the formula in (4):

F = 1 / (α/P + (1 - α)/R)    (4)
where P is the precision, R is the recall and α is a weight that is used to favor either precision or recall. In our computations α was set to 0.5 indicating equal weight given to precision and recall.
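As a concrete check, the weighted F-score described above can be sketched in a few lines (the function name is ours; α = 0.5 recovers the ordinary harmonic mean):

```python
def f_score(p, r, alpha=0.5):
    """Weighted F-score: alpha trades off precision (p) against recall (r).

    With alpha = 0.5 this reduces to the harmonic mean 2pr / (p + r).
    """
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# e.g. f_score(0.967, 0.968) combines the two rates into a single figure.
```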
Two sets of precision/recall results are reported: micro-averaged and macro-averaged, as described in Manning and Schütze. 29
The micro-averaging method represents the results where true positives, false positives and false negatives are added up across all test instances first and then these counts are used to compute the statistics. The macro-averaging method computes precision/recall for each test instance first, and then averages these statistics over all instances in the reference standard. These two methods yield different results when the instances have more than one correct category and when categories are represented by unequal numbers of instances. The micro-averaging method favors large categories with many instances, while the macro-averaging method shows how the classifier performs across all categories. 29
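The difference between the two averaging methods can be made concrete with a small sketch (function names are ours; each test instance is summarized by its true-positive, false-positive, and false-negative counts):

```python
def micro_pr(counts):
    """Micro-averaging: sum tp/fp/fn over all instances first,
    then compute a single precision/recall pair from the totals."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return tp / (tp + fp), tp / (tp + fn)

def macro_pr(counts):
    """Macro-averaging: compute precision/recall per instance,
    then average those ratios over all instances."""
    ps = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in counts]
    rs = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in counts]
    return sum(ps) / len(ps), sum(rs) / len(rs)

# Two instances: one fully correct with 3 codes, one with 1 of 2 correct.
counts = [(3, 0, 0), (1, 1, 1)]
# micro_pr(counts) -> (0.8, 0.8); macro_pr(counts) -> (0.75, 0.75)
```

As in the text, the two methods diverge when instances carry unequal numbers of codes: the micro average weights the three-code instance more heavily.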
Training and Testing Data
Three reference standards were developed to evaluate the Autocoder. Given the architecture of the Autocoder and the data flow logic built into it, each diagnostic statement that enters the system falls into one of three broad categories, which we will refer to as A, B, and C. The first type (A) consists of statements that have passed the example-based component's filter controlled by the MIN_EVENT_FREQ parameter set at 25. We can classify these data solely with the example-based component and with high confidence; the categories assigned to this type are not subsequently reviewed.
The second type of data (B) is made up of diagnostic statements that have been found in the Medical Index database of previously coded examples, but whose diagnosis-gender-code event frequency is lower than the value of the MIN_EVENT_FREQ parameter set at 25. We are less confident in classifying a case like this and therefore submit this case for manual review.
The third type (C) consists of diagnostic statements of which we do not have any prior record. These types of data need to be classified with the machine learning component. The codes assigned to these diagnostic statements are of low confidence and can only be used as suggestions for subsequent manual review.
All available data samples collected between 1994 and 2004 were split into training and testing sets as shown in the flow diagram. The training data consisted of over 22 million non-unique examples entered into the database between 1994 and June 1, 2003. The testing data consisted of 898,584 examples collected between June 1, 2003 and January 1, 2004. To determine the distribution of the three types of data, we looked up each testing sample in the database created from the training samples and determined whether it belonged to type A, B, or C. There were 527,673 samples (58.7%) of type A, 213,440 samples (23.7%) of type B, and 157,471 samples (17.5%) of type C. We drew several random samples from each of the three datasets to create reference standards, as discussed in the following subsections.
Training and testing data collection schedule.
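The routing logic described above can be summarized in a short sketch (the function and table names are ours, and the example events are purely illustrative):

```python
MIN_EVENT_FREQ = 25  # threshold established empirically in the pilot study

def route(event, freq_table):
    """Assign an incoming diagnosis-gender-code event to type A, B, or C.

    freq_table maps events to their frequency in the database of
    previously coded examples (an illustrative stand-in for the
    Medical Index lookup).
    """
    freq = freq_table.get(event)
    if freq is None:
        return "C"  # never seen: machine learning component, suggestions only
    if freq >= MIN_EVENT_FREQ:
        return "A"  # high confidence: example-based component, no review
    return "B"  # seen but rare: example-based component plus manual review
```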
To create a reference standard with acceptable statistical power, we conducted a pilot study to determine the expected level of precision and recall and to optimize the parameters of the example-based component. We created a random sample of 75,000 entries from the type A data set. A regression test for precision/recall of the example-based component was performed by varying two parameters: MIN_EVENT_FREQ between 1 and 200 (Chart 1) and MAX_NUM_CAT between 1 and 10 (Chart 2).
Chart 1. Precision/recall results where the MIN_EVENT_FREQ parameter is varied between 1 and 200 and MAX_NUM_CAT is held at 1.
Chart 2. Precision/recall results where the MAX_NUM_CAT parameter is varied between 1 and 10 and MIN_EVENT_FREQ is fixed at 1.
Initially, we varied the MAX_NUM_CAT parameter while holding the MIN_EVENT_FREQ parameter steady at its lowest value of 1. This was the logical starting point, as we knew that the majority of diagnostic statements are assigned one, two, or three codes. The frequency distribution of diagnoses with various code assignments drops off sharply after three categories per diagnostic statement; nevertheless, we extended the parameter range to 10. The results in Chart 2 show that there is a point at which the precision and recall curves cross. When the parameter is changed from 1 to 2, recall rises from 96.1% to 97.6% while precision drops from 97.4% to 96.2%. When MAX_NUM_CAT is set to 3 or higher, recall stays about the same, but precision drops sharply, as expected. This result allows us to set the MAX_NUM_CAT parameter to its optimal value of 2.
Distribution of the number of codes assigned to diagnoses in the test data.
Once the optimal value for the MAX_NUM_CAT parameter was determined, we optimized the MIN_EVENT_FREQ parameter while holding MAX_NUM_CAT steady at its lowest value. The results in Chart 1 show that the optimal MIN_EVENT_FREQ value is 25. Although the precision and recall curves do not cross, the growth in precision clearly levels off at MIN_EVENT_FREQ of 25. Going below 25 gains little in recall but incurs a substantial drop in precision; therefore 25 was set as the optimal value for MIN_EVENT_FREQ.
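The one-parameter-at-a-time search described in the last two paragraphs can be sketched as follows (coordinate_sweep and the evaluate callback are our names; evaluate is assumed to return a (precision, recall) pair for a given parameter setting):

```python
def f_score(p, r, alpha=0.5):
    """Weighted F-score; alpha = 0.5 gives the harmonic mean."""
    return 0.0 if p == 0 or r == 0 else 1.0 / (alpha / p + (1 - alpha) / r)

def coordinate_sweep(evaluate, cat_values, freq_values):
    """First vary MAX_NUM_CAT with MIN_EVENT_FREQ at its lowest value,
    then vary MIN_EVENT_FREQ with the chosen MAX_NUM_CAT fixed."""
    lowest_freq = min(freq_values)
    best_cat = max(cat_values,
                   key=lambda c: f_score(*evaluate(lowest_freq, c)))
    best_freq = max(freq_values,
                    key=lambda m: f_score(*evaluate(m, best_cat)))
    return best_freq, best_cat
```

Note that the study did not pick its settings purely by F-score: MIN_EVENT_FREQ = 25 was chosen because precision gains level off there, trading a little recall for precision and capture rate. A sweep like this only automates the grid, not that judgment.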
Using both MAX_NUM_CAT set to 2 and MIN_EVENT_FREQ set to 25, we arrive at 97% precision and 94% recall on the test set of 75,000 instances. According to our statistical power calculations, at this level of precision and recall we would need to examine over 2,600 random samples manually in order to estimate the results within a 1% margin of error at a 95% confidence level.
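The exact power calculation is not given in the text; a common normal-approximation sketch for sizing such a manual review sample is shown below (sample_size is our name; z = 1.96 corresponds to 95% confidence):

```python
import math

def sample_size(p, margin, z=1.96):
    """Samples needed to estimate a proportion p to within +/- margin
    at the confidence level implied by z (1.96 ~ 95%), using the
    normal approximation n = z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

# The classic worst case (p = 0.5) at a 5% margin needs 385 samples.
```

Depending on which proportion is plugged in (precision, recall, or a more conservative value), the required n at a 1% margin ranges from roughly 1,100 to several thousand, which is consistent in spirit with the figure quoted above.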
Type A Reference Standard
We compiled a set of 3,000 entries from a sample that had been coded manually by the Medical Index staff as shown in . These entries were manually re-verified for accuracy and completeness by two senior Medical Index staff with more than ten years of medical classification experience. Nineteen instances were excluded due to technical problems such as missing text of the entry or patient gender information. The resulting set of 2,981 instances was used as the reference standard for further evaluations of the example-based component.
Type B Reference Standard
A random sample of 3,000 entries with frequency below 25 was extracted from the same test set of 75,000 instances used to develop the type A reference standard. Because a single frequency threshold separates the two types, this population is complementary to the one sampled for the type A reference standard.
Type C Reference Standard
We compiled a random sample of 3,000 entries that were not found via lexical string match in the database used to train the example-based and machine learning components. To make sure that the entries in this set had truly never been seen before, we applied a more aggressive normalization as well as random manual checking. This resulted in a final reference standard of 2,281 entries.
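The text does not specify the normalization used; a sketch of the kind of aggressive normalization that could precede the string match (accent stripping, lowercasing, punctuation and whitespace collapsing) is:

```python
import re
import unicodedata

def normalize(text):
    """Aggressively normalize a diagnostic statement for duplicate
    detection: strip accents, lowercase, drop punctuation, and
    collapse runs of whitespace. Illustrative only."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r" +", " ", text).strip()

# normalize("Diabetes,  Type-II") -> "diabetes type ii"
```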
Neither type B nor type C entries were manually re-verified by classification experts; however, both types had been manually classified before and can thus serve as a reference standard. We are confident that these data are of high quality because the standard manual coding process involves initial coding with subsequent verification by a more experienced coder. Type A data, which were re-verified for this study, can therefore be thought of as doubly verified; the agreement between the first and second verification on the type A data set is 94%. This provides an empirical foundation for our confidence in the quality of coding on type B and type C data and obviates the need for additional verification.
The Autocoder was evaluated on the three reference standards, with each component evaluated on the appropriate standard. The evaluation results for all three types are presented in Tables 1 and 2.
Table 1. Micro-averaged precision/recall results for type A, B, and C data. The cells for precision/recall contain the number of true positives (tp), followed by the sum of true positives and false positives (fp) or false negatives (fn), followed by …
Table 2. Macro-averaged precision/recall results for type A, B, and C data. The cells for precision/recall contain the macro-averaged value for precision/recall, followed by the total number of test instances, followed by the width of a 95% confidence interval …

Evaluation on Type A Data
Type A data consist of diagnostic entries found in the database of previously coded entries with frequency greater than or equal to the empirically established threshold of 25. Since the example-based classifier component is intended to operate without subsequent review, it was necessary to optimize the parameters to maximize not only precision and recall but also the capture rate (the number of entries processed). Because the MAX_NUM_CAT parameter does not affect the capture rate, we plotted the capture rate only with respect to variation in MIN_EVENT_FREQ (Chart 1). With MIN_EVENT_FREQ set at 25, we are able to capture 47.5% of the unique test entries.
With MAX_NUM_CAT parameter set to 2 and MIN_EVENT_FREQ set to 25, the Autocoder achieved a precision of 96.7% and recall of 96.8% resulting in an f-score of 96.7 using the micro-averaging method and a precision of 98.0% and recall of 98.3% (with an f-score 98.2) using the macro-averaging method.
Evaluation on Type B Data
Type B data consist of diagnostic entries found in the database of previously coded entries with frequency lower than the empirically established threshold of 25. Diagnostic statements classified as type B were categorized using the example-based component and were sent for subsequent manual review. The micro-averaging method yielded a precision of 86.6%, recall of 93.7%, and an f-score of 90.4, while the macro-averaging method yielded a precision of 90.1%, recall of 95.6%, and an f-score of 93.1.
Evaluation on Type C Data
Type C data consist of diagnostic entries not found in the database of previously coded entries. The best results for this data type are displayed in . The micro-averaged technique yielded a precision of 58.6%, recall of 44.5%, and an f-score of 50.7. The macro-averaged technique yielded a precision of 58.5%, recall of 50.7%, and an f-score of 54.4.