The difficulties involved in any effort to add semantic mark-up are myriad and this add-in does not resolve all of them. While we think it is a significant step forward, it also highlights some of the more difficult challenges (see [33] for an illuminating discussion of these).
The use of ontologies is a solid initial step in defining what is effectively a controlled vocabulary for term recognition in natural language. These ontologies represent a vast amount of expertise and careful consideration across a wide range of domains. However, they were not created for automated term recognition so it is unsurprising that they are not a perfect fit for this application.
A desirable goal in the creation of an ontology is the inclusion of univocal terms - terms which are unambiguous and precise. For example, the Human Disease Ontology3 contains the term "Leukemia, T-Cell, HTLV-II-Associated," which is very precise and descriptive, but is not likely to appear verbatim in a manuscript and, thus, is not likely to be recognized in a string or pattern matching approach. The ontology creators recognized that terms may have different usages, so most ontologies assign synonyms to the preferred usage of a term. These synonyms can be used in addition to the preferred term to increase the chance of successfully inferring a semantically important word. For example, the synonym for the aforementioned term, "Atypical hairy cell leukemia (disorder)," is a bit more natural and easier to automatically recognize, but actual papers that discuss this disease use "hairy cell leukemia", "hairy-cell leukemia", "hairy T cell leukemia", and "T cell hairy leukemia," terms that are not included in the ontology synonym list [34]. "Hairy cell leukemia" is a separate (less specific) term in this ontology, parent to "Leukemia, T-Cell, HTLV-II-Associated" but also to 12 other distinct leukemias.
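The mismatch described above can be illustrated with a minimal sketch of verbatim string matching. This is not the add-in's implementation; the `LEXICON` dictionary and the `find_terms` function are hypothetical names introduced for illustration, and the lexicon contains only the preferred term and the single synonym quoted in the text.

```python
# Hypothetical lexicon: preferred ontology term -> synonyms listed in the ontology.
# Only the term and synonym quoted in the text are included.
LEXICON = {
    "Leukemia, T-Cell, HTLV-II-Associated": [
        "Atypical hairy cell leukemia (disorder)",
    ],
}


def find_terms(text, lexicon):
    """Return preferred terms whose label or synonym occurs verbatim in text.

    Matching is case-insensitive but otherwise exact, so any variation in
    hyphenation or word order prevents a match.
    """
    lowered = text.lower()
    hits = []
    for preferred, synonyms in lexicon.items():
        for label in [preferred, *synonyms]:
            if label.lower() in lowered:
                hits.append(preferred)
                break  # one matching label is enough for this term
    return hits


# A phrase as authors tend to write it is missed ...
missed = find_terms("We studied patients with hairy T cell leukemia.", LEXICON)
# ... while the verbatim ontology synonym is found.
found = find_terms("Diagnosis: atypical hairy cell leukemia (disorder).", LEXICON)
```

Under these assumptions, `missed` is empty and `found` contains the preferred term, which is the behaviour that makes verbatim ontology labels a poor fit for manuscript text.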
It is not always desirable to use such precise terms when writing a manuscript. General concepts are often necessary, for example, the Human Disease Ontology term "leukemia." However, when a term is less precise it may have different conceptual meanings. The Human Disease Ontology and Family Health History Ontology [39] both contain the term "leukemia," but one defines it as a disease and the other as a medical diagnosis - a subtle, but potentially significant, distinction. Although the add-in allows an author to associate any word or phrase with a specific ontology term, this requires an extra step by the author (at least once per document).
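The extra step mentioned above can be pictured as a lookup that stays unresolved until the author chooses. The following is a rough sketch only: `TERM_INDEX`, `resolve`, and the `author_choices` mapping are hypothetical and do not describe how the add-in stores its associations.

```python
# Hypothetical index: surface term -> ontologies that contain it.
# The two ontology names are taken from the text; the structure is assumed.
TERM_INDEX = {
    "leukemia": ["Human Disease Ontology", "Family Health History Ontology"],
}


def resolve(term, author_choices):
    """Return the ontology to use for term, or None if it is still undecided.

    A term found in exactly one ontology resolves automatically; otherwise
    the result depends on a choice the author has recorded for the document.
    """
    key = term.lower()
    candidates = TERM_INDEX.get(key, [])
    if len(candidates) == 1:
        return candidates[0]
    return author_choices.get(key)


# Without an author decision the ambiguous term stays unresolved.
undecided = resolve("leukemia", {})
# One recorded decision then applies to every occurrence in the document.
decided = resolve("Leukemia", {"leukemia": "Human Disease Ontology"})
```

Recording the choice once per document, rather than once per occurrence, mirrors the "at least once per document" cost noted in the text.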
Rather than invent an ontology alternative to address these problems, a possible adaptation to existing ontologies might be the inclusion of an additional set of synonyms for a term that reflect its use in natural language. Automated finding of these types of synonyms in extant literature is feasible (if not entirely accurate) using heuristic approaches [40]. Synonyms found in this manner, or gathered from term-mapping databases [41], could be used as a supplement to the ontologies. Incorporating a more sophisticated term recognition approach, such as term normalization or other heuristic rules (for example [45]), into the add-in would also likely be a significant improvement.
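As a minimal sketch of what term normalization combined with supplementary synonyms might look like, the example below lowercases a phrase, treats hyphens and punctuation as spaces, and sorts the tokens so that word order is ignored. The `normalize` and `lookup` functions and the `SUPPLEMENT` mapping are hypothetical; in particular, the assignment of each natural-language synonym to an ontology term is an assumption made for illustration, not data from any ontology or from the cited work.

```python
import re


def normalize(phrase):
    """Lowercase, split on anything that is not a letter or digit, sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return " ".join(sorted(tokens))


# Hypothetical supplementary synonyms (natural-language usage -> ontology term).
SUPPLEMENT = {
    "hairy T cell leukemia": "Leukemia, T-Cell, HTLV-II-Associated",
    "hairy cell leukemia": "Hairy cell leukemia",
}

# Index the synonyms by their normalized form.
INDEX = {normalize(synonym): term for synonym, term in SUPPLEMENT.items()}


def lookup(phrase):
    """Return the ontology term for a candidate phrase, or None if unknown."""
    return INDEX.get(normalize(phrase))


# Hyphenation and word-order variants collapse onto the same entry.
hyphenated = lookup("Hairy-cell leukemia")
reordered = lookup("T cell hairy leukemia")
```

Sorting tokens is a deliberately blunt heuristic: it recovers variants such as "T cell hairy leukemia" but could also conflate phrases whose meaning depends on word order, so a real recognizer would need more careful rules.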
Regardless of the automated recognition approach, the author would still need to disambiguate terms and synonyms to ensure that the intended meaning is accurately conveyed. Even professional biocurators do not always agree on the most appropriate terms to assign to concepts in an article [50]. For an author who lacks familiarity with ontologies or literature curation, the process of first identifying the semantically important words and phrases in their manuscript and then choosing the most appropriate term to describe them could prove too challenging, at least without clear guidelines from the intended manuscript recipient [51]. These difficulties may be magnified if co-authors of the manuscript disagree on term usage. Initiatives such as ODIE4 show that establishing a feedback loop between ontology developers and ontology users frequently results in the discovery of new, relevant terms to add to existing ontologies. Ontology developers from the Gene Ontology, for example, have expressed keen interest in creating such a system within this add-in and we intend to explore this in a future version. Ideally, we would also like to enable recognition and mark-up of relations between terms, but this is a significant challenge in its own right and is beyond the scope of the current project.
Although these challenges in the semantic enrichment of literature have not yet been resolved, we believe that the add-in is a significant advance and that it may provide the necessary stimulus to engage researchers beyond the bioinformatics community. Importantly, this add-in can work in concert with the Article Authoring add-in5, which converts a .docx manuscript into the National Library of Medicine's XML format6 - required for deposition of articles in PubMed Central and used by many life sciences publishers. The combined use of these add-ins would generate a document that maintains author-added semantic metadata and can be incorporated directly into these workflows without any further effort on the part of publishers or archives. Feedback during practical use from a large and broad user base will help define any barriers to common use and will guide the design of an interface that can lower those barriers. Few of us want to spend yet more time and effort writing or typesetting papers, but if this effort culminated in a reference to the paper from a database or other resource, authors would likely be rewarded with an increased citation rate and wider readership, in addition to an overall improvement in the accessibility of knowledge.