A great wealth of patient-specific medical data is stored as transcribed free text. While this format is useful for individuals reading the medical record, information stored as free text is difficult to use in decision support systems or automated cross-population studies [1]. Efforts to extract computer-usable information from free-text archives vary widely. Traditionally, teams of trained abstractors have manually reviewed patients' charts. String matching is a simple algorithmic approach, whereas identifying concepts is a much more complex process. Algorithmic natural language understanding holds great promise but remains difficult to achieve [2, 3]. Despite the challenges, a number of groups have applied natural language processing techniques with varying degrees of success [4-10]. Concept-based indexing is another approach that has been applied to a number of areas, including literature retrieval, health-related web sites, clinical diagnoses, and medical narratives [11-16].
Natural language processing is rooted in a logical representation of discourse. Until the 1920s, logic and mathematics were considered spiritual rather than scientific. Since the time of Pythagoras, mathematics had been considered a revelation of the divine order. In Principia Mathematica, Russell and Whitehead demonstrated that mathematics was logical. Logical positivism was then applied to science and psychology.
Noam Chomsky's seminal work, "The Logical Structure of Linguistic Theory," was circulated in mimeograph form in 1955 and published in 1975. This work expressed the view that language was a cognitive activity and that a meta-model of language was required to communicate effectively. He demonstrated that the stimulus-response model could not account for human language. This idea that language is processed led to the application of computer science to free text (natural language) processing. Computational linguistics (CL) is a field of computer science that seeks to understand and represent language in an interoperable set of semantics. CL overlaps with the field of artificial intelligence and has often been applied to machine translation from one human language to another. In 1994, Naomi Sager published a paper in JAMIA entitled "Natural Language Processing and the Representation of Clinical Data." Here Dr. Sager showed that, for a set of discharge letters, a recall of 92.5% and a precision of 98.6% could be achieved for a limited set of pre-selected data using the parser produced by the Linguistic String Project at New York University [1-3].
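Recall and precision recur throughout the studies summarized here, so it may help to state how they are computed. The following is a minimal sketch; the function name and the counts are hypothetical and are not taken from any of the cited studies.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from raw extraction counts.

    precision = TP / (TP + FP): the share of extracted items that are correct.
    recall    = TP / (TP + FN): the share of relevant items that were extracted.
    """
    extracted = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / relevant if relevant else 0.0
    return precision, recall


# Hypothetical counts, chosen only for illustration.
precision, recall = precision_recall(true_positives=90,
                                     false_positives=10,
                                     false_negatives=30)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Recall is the same quantity that other studies report as sensitivity; precision corresponds to positive predictive value.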
In 2004, Friedman et al reported a method for encoding concepts from health records using the UMLS [4]. In this study, Dr. Friedman and colleagues used MedLEE to abstract concepts from the record and reported a recall of 77% and a precision of 89%. In 2001, Nadkarni provided a description of the fundamental building blocks needed for NLP [5]. This work discussed the authors' method for lexical matching and part-of-speech tagging in discharge summaries and surgical notes. Henry Lowe developed MicroMeSH, an early MUMPS-based terminology browser that incorporated robust lexical matching routines. Dr. Lowe, working with Bill Hersh, reported the accuracy of parsing radiology reports using the Sapphire indexing system [6]. They reported good sensitivity and were able to improve performance by limiting the UMLS source vocabularies by section of the report.
MetaMap can be used to code free text (natural language) to a controlled representation, which can be any subset of the UMLS knowledge sources [7]. MetaMap uses a five-step process that begins with the SPECIALIST minimal commitment parser, which identifies noun phrases without modifiers. The next step involves the identification of phrase variants. These variants are then used to identify candidate phrases from within the source material [8]. Linguistic principles are used to calculate a score for each potential match. Brennan and Aronson used MetaMap to improve consumer health information retrieval for patients [9].
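The general idea of generating variants, retrieving candidates, and scoring each potential match can be illustrated with a toy sketch. This is not MetaMap's algorithm: the vocabulary, the identifiers, the crude variant rule, and the coverage-based score below are all assumptions made for illustration only.

```python
# Hypothetical mini-vocabulary: concept name -> made-up identifier.
VOCABULARY = {
    "cellulitis": "C-0001",
    "cellulitis of foot": "C-0002",
    "lymphangitis": "C-0003",
}


def variants(word: str) -> set[str]:
    """Generate naive variants of a word: lowercase form plus a crude singular."""
    lowered = word.lower()
    forms = {lowered}
    if lowered.endswith("s") and len(lowered) > 3:
        forms.add(lowered[:-1])
    return forms


def candidates(phrase: str) -> list[tuple[str, str, float]]:
    """Return (concept name, identifier, score) tuples, best match first.

    The score is the mean of two coverage fractions (an assumption, not a
    published formula): how much of the concept name the phrase covers, and
    how much of the phrase the concept name covers.
    """
    phrase_words = phrase.lower().split()
    phrase_forms: set[str] = set()
    for word in phrase_words:
        phrase_forms |= variants(word)

    scored = []
    for name, code in VOCABULARY.items():
        name_words = name.split()
        matched = [w for w in name_words if w in phrase_forms]
        if not matched:
            continue
        concept_coverage = len(matched) / len(name_words)
        covered = sum(1 for w in phrase_words if variants(w) & set(name_words))
        phrase_coverage = covered / len(phrase_words)
        scored.append((name, code, (concept_coverage + phrase_coverage) / 2))
    return sorted(scored, key=lambda item: item[2], reverse=True)


for name, code, score in candidates("cellulitis of left foot"):
    print(f"{score:.3f}  {code}  {name}")
```

In this sketch the more specific concept ranks first because it accounts for more of the input phrase, which is the kind of preference a scoring step is meant to express.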
We have built and described systems for concept-based indexing, automated term composition, and automated term decomposition. In its current version, the system uses the SNOMED-CT terminology. The accuracy of this automated technique has previously been evaluated [10]. Many individuals have evaluated the accuracy of manual term composition [11, 12]. The clinical coding center of the NHS has previously reported limited success with its own algorithm for automated term dissection [13, 14].
As we move toward compositional terminologies, the need to organize the terms within a compositional expression becomes important for both the readability and the understanding of these composite terms [15, 17]. Identifying concepts that are explicitly asserted as not being the case and separating them from positive assertions becomes critically important if we are to understand the implications of medical text. Linguistic negation is a challenging problem [18]. This trial evaluates a mechanism for automated assignment of negation status to concepts parsed from the terminology using a negation ontology. The text is analyzed to identify expressions indicating negation, and a model of negation is applied to assign values to concepts. We have named this system the automated negation assignment grammar [10]. We recognize the following semantic types: Kernel concepts, Modifiers, Qualifiers, and Negative Qualifiers [19]. A rule base is then applied that organizes the Modifiers, Qualifiers, and Negative Qualifiers around the Kernel concepts. These are represented in a hierarchical structure in which the degree of indentation represents semantic dependency.
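A hierarchical structure of this kind, with dependents indented beneath a Kernel concept, could be represented as follows. This is a minimal sketch and not the system's actual data model: the class name, the rendering format, and the assignment of the example terms to semantic types are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One term in a compositional expression (hypothetical structure)."""
    text: str
    semantic_type: str  # "Kernel", "Modifier", "Qualifier", or "Negative Qualifier"
    children: list["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> list[str]:
    """Return one line per node, indented two spaces per level of dependency."""
    lines = [f"{'  ' * depth}{node.text} [{node.semantic_type}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Illustrative typing of terms; the assignment to semantic types is assumed.
cellulitis = Node("Cellulitis", "Kernel", [Node("left foot", "Modifier")])
lymphangitis = Node("Lymphangitis", "Kernel",
                    [Node("without", "Negative Qualifier")])

for tree in (cellulitis, lymphangitis):
    print("\n".join(render(tree)))
```

Keeping the dependents as children of the Kernel concept means the indentation can be derived from the tree depth rather than stored separately.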
For example, in the phrase "no evidence of pneumonia," the concept "pneumonia" is asserted negatively and must be kept separate from positive assertions of pneumonia.
To illustrate the importance of concept negation, we reference the case of a 62-year-old female who presents with erythema over the dorsum of the left foot and exquisite tenderness over a wound situated over the midfoot. After a comprehensive clinical workup, she was found to have cellulitis of the left foot without signs of lymphangitic spread of her infection. In this case, it is an important distinction that our patient did not have "Lymphangitis" associated with her "Cellulitis, left foot," as opposed to a distinct, separate case in which the diagnosis of "Lymphangitis" was present. Epidemiologically, if one were studying lymphangitis, it would be important to exclude this patient's record from the analysis.
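The effect of negation status on cohort selection in a case like this one can be sketched in a few lines. The record structure, patient identifiers, and function name below are hypothetical and serve only to illustrate the point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedConcept:
    """A concept indexed from a patient's record (hypothetical structure)."""
    patient_id: str
    concept: str
    negated: bool


def cohort(concepts: list[IndexedConcept], target: str) -> list[str]:
    """Return the patients for whom the target concept is positively asserted."""
    return sorted({c.patient_id for c in concepts
                   if c.concept == target and not c.negated})


# Patient A mirrors the case in the text; patient B is an invented contrast case.
indexed = [
    IndexedConcept("A", "Cellulitis", negated=False),
    IndexedConcept("A", "Lymphangitis", negated=True),
    IndexedConcept("B", "Lymphangitis", negated=False),
]

print(cohort(indexed, "Lymphangitis"))  # patient A is excluded
```

A search that ignored the negation flag would return both patients and would wrongly include patient A in a study of lymphangitis.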
A previous study of negation by Mutalik et al described the lexical assignment of negation using the UMLS to code free-text documents. Their intervention had a sensitivity of 95.7% and a specificity of 91.8% [20]. They did not report the UMLS coverage of the concepts that appeared in the text. They also noted that the words "no," "not," "denied/denies," and "without" made up 92.5% of the negation in their study. Chapman et al sought to identify negation in discharge summaries and identified negative UMLS concepts with a sensitivity of 77.8% and a specificity of 94.5% using regular expressions [21].
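A regular-expression approach to negation can be sketched using the trigger words reported above. This is an illustrative simplification, not the algorithm of either cited study: the five-word window, the function name, and the rule that a trigger must precede the concept are assumptions.

```python
import re

# Trigger words taken from the text above; the matching rule is an assumption.
NEGATION_TRIGGER = re.compile(r"\b(?:no|not|denied|denies|without)\b",
                              re.IGNORECASE)


def is_negated(sentence: str, concept: str, window: int = 5) -> bool:
    """Return True if a trigger word occurs within `window` words before the concept."""
    match = re.search(re.escape(concept), sentence, re.IGNORECASE)
    if match is None:
        raise ValueError(f"concept {concept!r} not found in sentence")
    preceding_words = sentence[:match.start()].split()[-window:]
    # Strip surrounding punctuation so that "no," still counts as a trigger.
    return any(NEGATION_TRIGGER.fullmatch(word.strip(".,;:()\"'"))
               for word in preceding_words)


sentence = "Cellulitis of the left foot without signs of lymphangitic spread."
print(is_negated(sentence, "lymphangitic spread"))
print(is_negated(sentence, "Cellulitis"))
print(is_negated("No evidence of pneumonia.", "pneumonia"))
```

A fixed window is a blunt instrument: it cannot recognize where the scope of a negation ends, and it will misfire on constructions such as "not only," which is part of why linguistic negation remains a challenging problem.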