In recent years, computerized approaches have been widely adopted in clinical research, particularly those that use data organized in a machine-processable and machine-understandable way. Louit et al. [1] reviewed current methods and problems in data integration for genomic medicine. Bodenreider [2] described how Semantic Web ontologies [3] are being used for knowledge management, data integration, and other tasks in the biomedical field. Tenenbaum et al. [4] proposed the Biomedical Resource Ontology (BRO) for semantic annotation and discovery of biomedical resources.
One prerequisite is well-structured clinical data, which means converting data that was originally free text into a structured format. Automatic approaches based on Natural Language Processing (NLP) techniques have been well studied [5]. The clinical Text Analysis and Knowledge Extraction System (cTAKES) [6], one of the state-of-the-art NLP systems in the medical domain, has a reported F-score of 82.4% for Named Entity Recognition (NER) [7], a fundamental task in transforming free text into structured data. However, clinical research usually requires more precise results. Fully automatic data extraction is preferred but does not always give satisfactory results, while relying only on manual annotation is unrealistic given the large volume of clinical notes that need to be processed. Semi-automatic data extraction is therefore one option: information is first extracted automatically from clinical narratives, and manual effort is then used to refine the automatic annotations. The results of this semi-automatic process could potentially serve as training sets that help automatic systems further improve their performance.
Another important question is which structured format should be used to store the data. In clinical research, a variety of data formats have been adopted, such as Comma-Separated Values (CSV) [8], relational databases, and eXtensible Markup Language (XML). Jeronimo et al. [9] constructed a multimedia database tool containing cervical cancer related patient records. The data is stored in a relational database and can be exported as CSV files. The HL7 Clinical Document Architecture (CDA) [10] is an XML-based markup standard intended to specify the encoding, structure and semantics of clinical documents for exchange. The advantage of these formats is that they store data in a structured way and thus allow more convenient data management than free-text representation. Compared to these existing formats, the Semantic Web [3] and its corresponding knowledge representation standards (e.g., the Resource Description Framework (RDF) and the Web Ontology Language (OWL)) add an expressive framework for semantics-enabled data representation and knowledge inference.
In the Semantic Web, an ontology is an explicit and formal specification of a conceptualization that formally describes a domain of discourse. It consists of a set of terms (classes) and their relationships (class hierarchies and predicates). RDF is a graph-based data model for describing resources and their relationships. Two resources are connected via one or more predicates in the form of triples. A triple, <s, p, o>, consists of three parts: subject, predicate and object. The subject is an identifier (e.g., a URI) and the object can be either an identifier or a literal value, such as a string, number, or date. One advantage of the Semantic Web is that it supports knowledge inference. For instance, if two classes A and B are defined to be disjoint in an ontology and an automatic system or a human annotator annotates a piece of free text with both classes, the system should be able to report a potential annotation error. For automatic systems, such a formal representation of semantics could help improve performance in specific scenarios [11].
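As an illustration of the triple model and the disjointness check described above, the following is a minimal Python sketch. It is not Semantator's implementation: the `Triple` type, the `find_disjoint_violations` helper, and the `ex:` identifiers are hypothetical names introduced only for this example, and the check ignores class hierarchies, which a real OWL reasoner would take into account.

```python
from typing import List, NamedTuple, Tuple, Union


class Triple(NamedTuple):
    """One <s, p, o> statement; the object may be an identifier or a literal."""
    subject: str
    predicate: str
    obj: Union[str, int, float]


RDF_TYPE = "rdf:type"
OWL_DISJOINT_WITH = "owl:disjointWith"


def find_disjoint_violations(triples: List[Triple]) -> List[Tuple[str, str, str]]:
    """Return (instance, class_a, class_b) for every instance that is typed
    with two classes declared disjoint. Simplification: no subclass reasoning."""
    disjoint_pairs = set()
    classes_of = {}
    for t in triples:
        if t.predicate == OWL_DISJOINT_WITH:
            # Disjointness is symmetric, so store the pair without order.
            disjoint_pairs.add(frozenset((t.subject, t.obj)))
        elif t.predicate == RDF_TYPE:
            classes_of.setdefault(t.subject, set()).add(t.obj)

    violations = []
    for instance, classes in sorted(classes_of.items()):
        ordered = sorted(classes)
        for i, class_a in enumerate(ordered):
            for class_b in ordered[i + 1:]:
                if frozenset((class_a, class_b)) in disjoint_pairs:
                    violations.append((instance, class_a, class_b))
    return violations


# Hypothetical ontology and annotations (all identifiers are made up).
EXAMPLE_GRAPH = [
    Triple("ex:Disease", OWL_DISJOINT_WITH, "ex:Medication"),
    Triple("ex:span1", RDF_TYPE, "ex:Disease"),
    Triple("ex:span1", RDF_TYPE, "ex:Medication"),  # conflicting annotation
    Triple("ex:span1", "ex:hasText", "aspirin"),    # object is a literal
    Triple("ex:span2", RDF_TYPE, "ex:Disease"),
]

if __name__ == "__main__":
    for instance, class_a, class_b in find_disjoint_violations(EXAMPLE_GRAPH):
        print(f"Potential annotation error: {instance} is both {class_a} and {class_b}")
```

In this sketch, the text span `ex:span1` is flagged because it carries two classes that the example ontology declares disjoint, while `ex:span2` is left alone.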
In this paper, we propose Semantator, a semi-automatic annotation system for annotating clinical narratives using Semantic Web ontologies. Here, semantic annotation means annotating entities with ontology classes, which includes creating and deleting instances and their relationships. Although Semantator is designed for annotating clinical documents, it can also be applied to documents in other domains. Semantator is developed as a plugin for Protégé, a well-known tool for building and interacting with Semantic Web ontologies. The current version of Semantator provides: 1) basic manual annotation functionalities, including ontology instance creation/deletion, relationship creation/deletion, linking equivalent instances, and exporting/reloading annotations; 2) automatic annotation by connecting to the NCBO Annotator [12] and cTAKES [6]; 3) Semantic Web based reasoning by utilizing the underlying semantics of owl:disjointWith.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 and Section 4 introduce the manual and automatic annotation functionalities, respectively. We discuss Semantator in light of our user studies in Section 5 and conclude in Section 6.