The ultimate goal of the ChartIndex project is to create a CDA-compliant model of clinical document representation at both the structural and semantic levels. The semantic model requires an indexing engine that can automatically identify and then represent important biomedical concepts in clinical documents as UMLS concept descriptors. Prior work in this area has found that achieving good indexing precision is a major challenge. Our current work uses a variety of approaches to address this issue. In this report we show that using a combination of machine learning and NLP can aid in the automated identification of sentence boundaries and NPs in clinical radiology reports.
Most existing NLP systems identify phrases using shallow parsing or text-chunking methods. This is partly because chunking systems are faster than full parsers and partly because full parsers are perceived as error prone. In the past, some researchers who attempted to use full parsing in their information systems did not see improvements in accuracy [51,52].
However, with recent advances in statistical parsing methods, we believe that full parsers are now better able to resolve important ambiguities within a reasonable time. Our parser takes 1 to 2 seconds to parse a sentence of average length (25 words) on a Pentium 4 2.8 GHz computer with 1 GB SDRAM, which is sufficiently fast for our current applications. The performance of the parser was evaluated on its native training domain in a previous study [45], although its accuracy on medical texts has not been explicitly tested. A full parser can also predict larger NPs that most text chunkers do not attempt to predict. Moreover, full parsers make more detailed assertions about relational syntactic structures, which can reasonably be expected to be useful for future indexing work.
Another concern in applying a statistical parser to a domain other than its native training domain is performance degradation [53,54]. In this study, we applied the Stanford parser, trained on the Penn Treebank WSJ corpus, to clinical radiology documents in the medical domain. As Gildea [54] reported, word-to-word dependencies are corpus specific. Thus, the unlexicalized Stanford parser gave us a compact and fast statistical parser that we believe is less dependent on its training corpus. Furthermore, we augmented the Stanford parser with a biomedical lexicon derived from the UMLS Specialist Lexicon. The performance of the parser on our document collection, although not evaluated on full parse trees, was evaluated at the levels of both maximal and base NPs. The finding by Hwa [53] that higher-level constituents are the most informative linguistic units in grammar induction suggests that evaluation on maximal NPI might be a better indicator of overall parsing accuracy than evaluation on base phrases. As shown above, the results for maximal NPI are acceptable, and those for base NPI are comparable to published performance on newswire text, which is the parser's native training domain.
In addition, we believe that the representation of NPs within a parse tree provides considerable potential flexibility at the time of indexing. Compared with the flat output structure produced by text chunking, a parse tree captures more structural information revealing the semantics of the sentence, which may be very helpful in identifying negated concepts. This approach can support heuristics that select the optimal NP node for indexing by traversing paths between the maximal NPs and base NPs within the parse tree.
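To make the distinction between maximal and base NPs in a parse tree concrete, the following is a minimal sketch rather than the ChartIndex implementation. It assumes Penn Treebank style bracketed parser output, takes a maximal NP to be an NP node with no NP ancestor, and takes a base NP to be an NP node with no NP descendant. The `Node` class, the helper functions, and the example sentence are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None  # set only on leaf (POS tag) nodes

    def words(self) -> List[str]:
        if self.word is not None:
            return [self.word]
        return [w for child in self.children for w in child.words()]


def parse_bracketed(text: str) -> Node:
    """Parse a Penn Treebank style bracketed string into a Node tree."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = Node(label=tokens[pos])
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())
            else:
                node.word = tokens[pos]
                pos += 1
        pos += 1  # consume the closing parenthesis
        return node

    return read()


def contains_np(node: Node) -> bool:
    """True if any descendant of node is labeled NP."""
    return any(c.label == "NP" or contains_np(c) for c in node.children)


def np_phrases(root: Node) -> Tuple[List[str], List[str]]:
    """Return (maximal NPs, base NPs) under the assumed definitions."""
    maximal: List[str] = []
    base: List[str] = []

    def visit(node: Node, inside_np: bool) -> None:
        if node.label == "NP":
            phrase = " ".join(node.words())
            if not inside_np:
                maximal.append(phrase)  # no NP ancestor
            if not contains_np(node):
                base.append(phrase)  # no NP descendant
            inside_np = True
        for child in node.children:
            visit(child, inside_np)

    visit(root, False)
    return maximal, base


# Hypothetical example parse, not taken from the study corpus.
EXAMPLE = (
    "(S (NP (NP (DT the) (JJ distal) (NN graft)) "
    "(PP (IN in) (NP (DT the) (JJ left) (NN leg)))) "
    "(VP (VBZ is) (ADJP (JJ patent))))"
)
maximal_nps, base_nps = np_phrases(parse_bracketed(EXAMPLE))
```

Any NP node lying on the path between a maximal NP and the base NPs it contains is a candidate for indexing, so a selection heuristic of the kind described above could be layered on the same traversal.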
Noun phrase identification is a critical step in the ChartIndex model. Most important biomedical concepts in clinical documents are NPs, and most UMLS concept descriptors are NPs. The identification of NPs in ChartIndex relies on a high-performance statistical parser. Such parsers are usually trained on corpora in a general domain. To apply them to clinical documents effectively, we need to supply the parser with biomedical terms. We have shown the UMLS Specialist Lexicon (SL) to be an effective resource for this purpose. As mentioned above, there are generally two problems in integrating SL with these parsers. The first is the mismatch of syntactic categories, which causes ambiguities in mapping SL syntactic categories to Penn Treebank syntactic categories. To address this issue, we chose to do the mapping conservatively, by mapping only unambiguous terms. The second problem is that statistical parsers usually use a lexicon that records, for each token, the relative frequencies of its syntactic categories. For some common words, these relative frequencies may be very different in the biomedical domain from those in general domains. In the base NPI experiment, we manually changed the relative frequencies for a few words, as mentioned in the Methods section. These results, marked as “SL+GR” in the data tables, show an improvement from 91.1% F1 to 92.8% F1 for all reports. Although this tuning is extremely specific and might raise concerns of overfitting, certain words are so common and so different in distribution between domains that even a few such modifications (for five words in this case) can be widely applicable and generally useful.
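The conservative mapping strategy can be illustrated with a short sketch. This is not the mapping used in the study: the category-to-tag table `SL_TO_PTB`, the function name, and the example entries are assumptions chosen only to show the idea, which is that a term is added to the parser lexicon only when it has a single lexicon category and that category corresponds to a single Penn Treebank tag.

```python
from typing import Dict, List, Set

# Hypothetical category-to-tag table, for illustration only.
# A category that could map to several tags is treated as ambiguous.
SL_TO_PTB: Dict[str, Set[str]] = {
    "noun": {"NN"},
    "adj": {"JJ"},
    "adv": {"RB"},
    "verb": {"VB", "VBP"},  # ambiguous in this toy table, so never mapped
}


def map_unambiguous(sl_entries: Dict[str, List[str]]) -> Dict[str, str]:
    """Keep only terms with one category that maps to exactly one tag."""
    mapped: Dict[str, str] = {}
    for term, categories in sl_entries.items():
        if len(categories) != 1:
            continue  # the term itself has several categories: skip it
        tags = SL_TO_PTB.get(categories[0], set())
        if len(tags) == 1:
            mapped[term] = next(iter(tags))
    return mapped


# Hypothetical lexicon entries (term -> list of lexicon categories).
entries = {
    "atelectasis": ["noun"],
    "distal": ["adj"],
    "left": ["adj", "noun", "verb"],
    "occlude": ["verb"],
}
parser_lexicon = map_unambiguous(entries)
```

Under these assumptions, only "atelectasis" and "distal" are passed to the parser. The trade-off is that ambiguous words such as "left" remain untagged by the extended lexicon and must still be resolved by the parser itself.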
The experiment on the identification of maximal NPs reported consistently lower performance numbers, with F1 of 80.2% without terms from the UMLS Specialist Lexicon and F1 of 83.3% with terms from the UMLS Specialist Lexicon. One reason, as mentioned above, is that identifying the maximal NP is indeed a more difficult problem because the parser needs to resolve attachment ambiguities. Another reason is related to how the errors are counted. For example, “occluded left FEM to distal bypass graft” is a maximal NP in a sentence. The parser mistakenly marked “left” as a verb, which led to three false-positives (“occluded,” “FEM,” and “distal bypass graft”) and one false-negative in maximal NPI. In base NPI, on the other hand, the same parse error led to only two false-positives, “occluded” and “FEM,” since “distal bypass graft” is a correct base NP. Additionally, because the total number of maximal NPs is smaller than the total number of base NPs, each failure carries more weight in the maximal NPI results.
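The error counting in this example can be reproduced with a small sketch. It assumes exact matching of phrase strings (the actual evaluation may have compared text spans), and the gold-standard base NPs listed below are an assumption, since the text states only that “distal bypass graft” is a correct base NP. The function names are hypothetical.

```python
from typing import Set, Tuple


def count_errors(gold: Set[str], predicted: Set[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# NPs produced by the erroneous parse in which "left" is tagged as a verb.
predicted = {"occluded", "FEM", "distal bypass graft"}

# Maximal NPI: the gold standard is the single maximal NP from the text.
gold_maximal = {"occluded left FEM to distal bypass graft"}
maximal_counts = count_errors(gold_maximal, predicted)  # (0, 3, 1)

# Base NPI: assumed gold standard, for illustration only.
gold_base = {"occluded left FEM", "distal bypass graft"}
base_counts = count_errors(gold_base, predicted)  # (1, 2, 1)
```

With these inputs, the single tagging error yields no correct maximal NP at all, whereas one of the base NPs is still credited, which is the asymmetry described above.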
In the maximal NPI results, performance varies less across modalities after the extended lexicon is applied, except for ultrasound (US), which may reflect the small data set. Adding terms from SL improved performance consistently, except in the smaller mammogram (MAMMO) data set, in which performance did not change. Across the whole test set, false-positives were reduced by 15.3% and false-negatives by 16.5% with the use of the Specialist Lexicon.
The base NPI results show the same trend. The baseline F1 measure ranges from 73.6% in ultrasound (US) to 91.1% in radiology procedure (PROC). With SL terms, the F1 measure improved to between 86.4% in US and 93.9% in PROC. The final version, using both SL terms and a grammar with a few changes, further improved F1 to between 90.1% in radiograph (RAD) and 95.5% in US. Comparing the base NPI results with the maximal NPI results shows that, because of the factors mentioned above, the performance improvements are more substantial in base NPI.
This document set presented some challenges. For example, as mentioned previously, some sentences were not well formed. Most of the ill-formed sentences were NPs. The parser could parse most of them correctly but had problems with some long, complex NPs in the Impression section. Sometimes those NPs were parsed as a full sentence, either with an NP and a verb phrase or with a verb phrase only. In both cases, the cause was usually that some words in the text have more than one possible POS tag and were not pretagged using the extended lexicon. These words were also rare (seen by the parser fewer than 20 times in the whole training process) and were tagged by the parser as a verb and the head of a verb phrase. The second type of error originated from commonly used words in capitalized form, such as “Right.” The parser currently treats “Right” as a separate entry from “right,” since “Right” may serve as part of a proper noun. There were also a few parsing errors involving punctuation marks such as parentheses and the pound sign “#”. These errors arose because these punctuation marks are used differently in radiology reports, and they indicate that some syntactic adaptations remain to be done. These cases could not necessarily be addressed by simple lexicon adaptation.
Statistical learning techniques have been widely adopted in text processing and text mining applications and have been shown to be robust and to perform well. One potential hurdle associated with this approach is the need for large labeled training sets for use in supervised machine learning. This is especially true in the biomedical domain, since there are few publicly available large labeled corpora of clinical documents comparable to the Penn Treebank corpora for general domains. We have extended a parser trained on a general-domain corpus using a domain-specific lexicon. Some issues remain with this approach, such as the ambiguous tag mappings that arise when converting between lexicons that use different POS tag sets. Also, the probabilities associated with the POS tags for each term can be very different in clinical documents from those in a general-domain training collection. However, our work has shown that, with the help of lexical entries contained in a domain-specific lexicon (the UMLS Specialist Lexicon), a statistical natural language parser trained on a general-domain training set can achieve significantly improved performance on NP identification within clinical radiology reports.
There are limitations to our analysis. First, the method that we used to create a gold standard was not optimal [55]. Ideally, we would have asked each expert physician to review all 100 radiology reports. However, this was not possible given the time constraints of these experts, and we instead asked each physician to review 25 of the 100 documents. If all 100 documents had been reviewed by each physician, we could have evaluated inter-rater reliability and intra-rater variability. Second, the use of computer pre-markups may bias human experts' judgments. Third, we did not evaluate the entire parse tree for each sentence, although we did get a good estimate of the parser's overall performance through evaluation of the base NPI and maximal NPI.