The results of our method show significant promise. Given a relatively small number of manual annotations, our machine learning-based classifiers are able to effectively learn how to automatically identify Locations and Anchors. Furthermore, they significantly outperform the baselines. The poor performance of the baselines suggests that this task is complex and requires the effective fusion of numerous resources. On the other hand, the success achieved with such a limited number of annotations (by NLP standards) points to the generalizability of our method.
The automatically chosen Location features indicate that Location identification is largely a syntactic recognition task (i.e., recognize which syntactic constructions are typical for Locations). This is not surprising, as our decision to separate Location and Anchor identification was motivated by the observation that spatially ambiguous findings (such as those in ) are usually given a syntactically clear spatial position (the Location), but this was often underspecified and needed a more semantic grounding (the Anchor). This suggests that our Location identifier will prove adept at recognizing cases with common syntactic constructions but would have trouble when rarer syntactic constructions (as well as rarer Location terms) are present. This observation was largely found to hold during an error analysis of our method. One immediately obvious path to improvement is to annotate more data. As stated above, our annotators spent less than 1 day annotating these spatial relations. Our goal, however, was not to achieve the highest possible score, but rather to see how well our method generalizes on a relatively small number of annotations. Our method’s success on this small amount of data suggests it could easily be re-trained for different abnormalities and other types of clinical reports.
The most common type of error for the Location identifier is to select the Anchor token as the Location. Two such examples of this are (Gold indicates ground truth while Selected indicates system output):
- (12) ...low density [thickening]Finding of the distal ...
- (13) There are [inflammatory changes]Finding surrounding the of the proximal .
Another common type of error was to select a different token with approximately the same meaning as the labeled Location. In the following example, “periappendiceal” is related to the appendix (literally meaning near the appendix), and was marked by the human annotator instead of “appendix” simply because it was directly modifying the finding term:
- (14) The is dilated with [inflammatory changes]Finding ...
Other errors involve mistaking a nearby finding’s Location with that of the current finding:
- (17) No thickening or adjacent [inflammatory changes]Finding are noted...
In this case, “wall” is the Location and “gallbladder” the Anchor for the “thickening” finding, but the “inflammatory changes” finding does not have an Anchor, only the Location “gallbladder” (“adjacent” is a spatial proximity indicator, not an anatomical part). Given a perfect syntactic parse structure, it could be recognized that “wall” only modifies “thickening”, while “adjacent” refers to the proximity of “gallbladder”. However, syntactic parsers are not trained on clinical data and thus perform poorly relative to their targeted domain. The classifier is then forced to rely on surface-level features such as the token distance, thus making “wall” appear to be a better candidate than “gallbladder”.
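The effect of falling back on token distance can be illustrated with a small sketch. The sentence used here is an assumed reconstruction of example (17) from the description above, whitespace tokenization is a simplification, and `token_distance` is a hypothetical helper rather than the feature implementation used in the classifier.

```python
def token_distance(tokens, candidate, finding_start):
    """Number of token positions between a candidate word and the start of the finding."""
    return abs(finding_start - tokens.index(candidate))


# Assumed reconstruction of example (17); the original annotation markup is not reproduced.
tokens = "No gallbladder wall thickening or adjacent inflammatory changes are noted".split()
finding_start = tokens.index("inflammatory")

# Distance from each candidate Location word to the "inflammatory changes" finding.
distances = {c: token_distance(tokens, c, finding_start) for c in ("gallbladder", "wall")}

# A purely distance-based preference picks the nearest candidate.
closest = min(distances, key=distances.get)
print(distances, closest)
```

Under this simplification “wall” is one token closer to the finding than “gallbladder”, which is consistent with the error described: without reliable syntactic scoping, proximity favors the wrong candidate.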
The Has-Anchor classifier performs quite well with an accuracy of 92.7%. The data set is quite balanced (50.6% of Locations require Anchors), so the classifier gains roughly 42 points over the baseline of always predicting that an Anchor is necessary. As seen in the Features section, this classifier largely functions by learning which Location words need Anchors, and the contextual information it relies on is largely based on a 1-token window around the selected Location. Since all of the features used for Location identification were tried and only the helpful ones were kept, it can be assumed that the features that capture linguistic constructions (such as those based on the dependency parse) provide little value. The only feature of this type used in the Has-Anchor classifier is the parts of speech between the finding and selected Location.
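The comparison against the always-predict-Anchor baseline is simple arithmetic on the two figures reported above. The following sketch makes it explicit; the function name is hypothetical and only the two percentages come from the text.

```python
def gain_over_majority_baseline(accuracy, positive_rate):
    """Return (baseline accuracy, gain) for a classifier versus always
    predicting the more frequent of two classes."""
    baseline = max(positive_rate, 1.0 - positive_rate)
    return baseline, accuracy - baseline


# 92.7% accuracy; 50.6% of Locations require an Anchor (both values from the text).
baseline, gain = gain_over_majority_baseline(accuracy=0.927, positive_rate=0.506)
print(f"baseline={baseline:.1%}, gain={gain * 100:.1f} points")
```

Because the classes are almost evenly split, the majority baseline is barely better than chance, so nearly all of the reported accuracy reflects what the classifier has learned.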
As might be expected from the features, with little contextual information to fall back on, the Has-Anchor classifier performs poorly on rare Location words:
- (18) There are [inflammatory changes]Finding around the distal [right ureter]Location. []Anchor
Since the most common case is that an Anchor is required (50.6% to 49.4%), the classifier defaults to requiring an Anchor. Additionally, there are some Locations which are spatially unambiguous but marked with Anchors to provide further specificity. In these cases the Has-Anchor classifier has trouble with Locations that do not always have Anchors:
- (19) [Inflammatory changes]Finding are seen in the [right lower quadrant]Location around the [cecum]Anchor .
In our data, “right lower quadrant” is marked as a Location 18 times, with 7 of those cases requiring an Anchor, so unless the context is properly represented the classifier will always select that no Anchor is needed. While features were considered which capture the contextual information, these features proved harmful to the overall results.
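The “right lower quadrant” case can be worked through numerically. The sketch below assumes a purely lexical decision rule that ignores context, which is a simplification of the actual classifier; `lexical_default` is a hypothetical name, and only the counts 7 and 18 come from the text.

```python
def lexical_default(n_with_anchor, n_total):
    """Majority decision for a single Location phrase when context is ignored.

    Returns (predict_anchor, best_achievable_accuracy_on_this_phrase).
    """
    n_without = n_total - n_with_anchor
    predict_anchor = n_with_anchor > n_without
    return predict_anchor, max(n_with_anchor, n_without) / n_total


# "right lower quadrant": 18 occurrences as a Location, 7 of which require an Anchor.
predict_anchor, ceiling = lexical_default(n_with_anchor=7, n_total=18)
print(predict_anchor, round(ceiling, 3))
```

Under this assumption the rule always predicts that no Anchor is needed and is right on 11 of the 18 occurrences; the remaining 7 can only be recovered with contextual features.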
As seen in , the Anchor identifier is the largest source of errors. However, an alternative assessment of Anchor detection is to consider all Locations without Anchors (i.e., those that are unambiguous anatomical parts) to also be Anchors. In this case, the end-to-end Anchor performance improves from 70.96 to 77.26. This score is a realistic measure of how well our method detects anatomical locations of findings.
The chosen features for Anchor identification are similar to those for Location identification with a few notable differences. Like Location identification, the candidate Anchor word and its syntactic relationship with the finding are important. But due to our pipelined architecture (see ), we can include features based on the selected Location as well. Specifically, surface-level lexical features connecting the Anchor to the Location were found to work best, suggesting that content words (as opposed to grammatical relationships) play an important role in determining which token is a valid Anchor for a given Location.
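As an illustration of what surface-level lexical features connecting an Anchor candidate to the selected Location might look like, the sketch below builds a few such features for example (19). The feature names, the function, and the whitespace tokenization are all assumptions made for illustration; they are not the feature set described in the Features section.

```python
def anchor_location_features(tokens, location_span, anchor_index):
    """Build lexical features linking an Anchor candidate to the selected Location.

    location_span is (start, end) with end exclusive; anchor_index is a token index.
    """
    loc_start, loc_end = location_span
    location = " ".join(tokens[loc_start:loc_end]).lower()
    anchor = tokens[anchor_index].lower()
    # Words strictly between the Location and the Anchor candidate, on either side.
    if anchor_index >= loc_end:
        between = tokens[loc_end:anchor_index]
    else:
        between = tokens[anchor_index + 1:loc_start]
    return {
        "anchor_word": anchor,
        "location_anchor_pair": f"{location}|{anchor}",
        "words_between": " ".join(w.lower() for w in between),
    }


# Example (19), tokenized on whitespace; Location = tokens 6..8, Anchor candidate = token 11.
tokens = "Inflammatory changes are seen in the right lower quadrant around the cecum .".split()
features = anchor_location_features(tokens, location_span=(6, 9), anchor_index=11)
print(features)
```

Features of this kind depend only on the words themselves, not on a parse, which matches the observation that content words matter more than grammatical relationships for pairing an Anchor with a Location.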
Given the number of words between Anchors and their respective findings (and therefore the diversity of grammatical relations between them), the largest source of errors for Anchor identification unsurprisingly involves guessing the most common Anchor words instead of the specific anatomical part for the finding. Common examples of these errors are:
- (20) The , spleen, pancreas, and bilateral adrenal glands show now focal mass... Urinary is partially distended with no calcification or [focal wall]Location [thickening]Finding.
- (21) The , also show nondilated caliber, with no adjacent inflammation. diverticulitis is noted, with no [wall]Location [thickening]Finding ...
In our data, both “liver” and “appendix” appear as Anchors far more often than “bladder” and “colonic”. Both of these errors could be overcome with features that recognize syntactic scoping. Specifically, both examples use a copular construction in which the grammatical subjects (“Urinary bladder” and “Colonic diverticulitis”, respectively) dominate the rest of their respective sentences, and thus the findings should be associated with the subject instead of an anatomical part from a previous sentence. Unfortunately, recognizing this semantic property using the available syntactic resources is quite challenging and is thus left to future work.