The typhoon of technological advances witnessed during the last decade has left in its wake a flood of life-science data, and an increasingly impenetrable mass of biomedical literature describing and analysing those data. Importantly, the modern frenzy to gather more and more information has left us without adequate tools either to mine the rapidly increasing data- and literature-collections efficiently, or to extract useful knowledge from them. To be usable, information needs to be stored and organized in ways that allow us to access, analyze and annotate it, and ultimately to relate it to other information. Unfortunately, however, much of the data accumulating in databases and documents has not been stored and organized in rigorous, principled ways. Consequently, finding what we want and, crucially, pinpointing and understanding what we already know, have become increasingly difficult and costly tasks (Attwood et al.).
A group of scientists for whom these problems have become especially troublesome are biocurators, who must routinely inspect thousands of articles and hundreds of related entries in different databases in order to attach sufficient information to a new database entry to make it meaningful. With something like 25 000 peer-reviewed journals publishing around 2.5 million articles per year, it is simply not possible for curators to keep abreast of developments, to find all the relevant papers they need, to locate the most relevant facts within them, and simultaneously to keep pace with the inexorable data deluge from ongoing high-throughput biology projects (e.g. from whole-genome sequencing). To put this in context, Bairoch estimates that it has taken 23 years to manually annotate about half of Swiss-Prot's 516 081 entries (Bairoch, 2009; Boeckmann et al.), a painfully small number relative to the size of its parent resource, UniProtKB (The UniProt Consortium, 2009), which currently contains ~11 million entries. Hardly surprising, then, that he should opine, ‘It is quite depressive to think that we are spending millions in grants for people to perform experiments, produce new knowledge, hide this knowledge in a often badly written text and then spend some more millions trying to second guess what the authors really did and found’ (Bairoch, 2009).
The work of curators, and indeed of all researchers, would be far easier if articles could provide seamless access to their underlying research data. It has been argued that the distinction between an online paper and a database is already diminishing (Bourne, 2005); however, as is evident from the success stories of recent initiatives to access and extract the knowledge embedded in the scholarly literature, there is still work to be done. Some of these initiatives are outlined below.
The Royal Society of Chemistry (RSC) took pioneering steps towards enriching their published content with data from external resources, creating ‘computer-readable chemistry’ with their Prospect software (Editorial, 2007). They now offer some of their journal articles in an enhanced HTML form, annotated using Prospect: features that may be marked up include compound names and bio- and chemical-ontology terms. Marked-up terms provide definitions from the various ontologies used by the system, together with InChI (IUPAC International Chemical Identifier) codes, lists of other RSC articles that reference these terms, synonym lists, links to structural formulae, patent information and so on. Articles enriched in this way make navigation to additional information trivial, and significantly increase their appeal to readers.
In a related project, the ChemSpider Journal of Chemistry exploits the ChemMantis system to mark up its articles (http://www.chemmantis.com). With the ChemSpider database at its heart, ChemMantis identifies and extracts chemical names, converting them to chemical structures using name-to-structure conversion algorithms and dictionary look-ups; it also marks up chemical families, groups and reaction types, and provides links to Wikipedia definitions where appropriate.
In an initiative more closely related to the life sciences, FEBS Letters ran a pilot study (Ceol et al.) with the curators of the MINT interaction database (Chatr-aryamontri et al.), focusing on integration of published protein–protein interaction and post-translational modification data with information stored in MINT and UniProtKB. Key to the experiment was the Structured Digital Abstract (SDA), a device for capturing an article's key facts in an XML-coded summary, essentially to make them accessible to text-mining tools (Seringhaus and Gerstein, 2007); these data were collected from authors via a spreadsheet, and structured as shown in Fig. 1. While clearly machine-readable, this format has the notable disadvantage of being rather human-unfriendly.
Fig. 1. Structured summary for an article in FEBS Letters (Lee et al., 2008). Three interactions are shown, with their links to MINT and UniProtKB.
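To make the idea of an XML-coded structured summary concrete, the sketch below builds a minimal SDA-like record for a set of protein–protein interactions. The element and attribute names (`structuredSummary`, `interaction`, `participant`, `mintId`, `uniprotId`) and the identifiers are hypothetical placeholders for illustration, not the actual schema used in the FEBS Letters pilot:

```python
# Minimal sketch of emitting an SDA-like XML summary.
# NOTE: hypothetical schema and placeholder identifiers, not the real
# FEBS Letters / MINT format.
import xml.etree.ElementTree as ET

def build_sda(interactions):
    """Build an XML structured summary from (protein_a, protein_b, mint_id) tuples."""
    root = ET.Element("structuredSummary")
    for a, b, mint_id in interactions:
        node = ET.SubElement(root, "interaction", mintId=mint_id)
        ET.SubElement(node, "participant", uniprotId=a)
        ET.SubElement(node, "participant", uniprotId=b)
    return ET.tostring(root, encoding="unicode")

xml_text = build_sda([("P00001", "P00002", "MINT-0000001")])
```

Even this toy example shows why such summaries suit text-mining tools well while remaining awkward for human readers: every fact is unambiguous and addressable, but nothing reads as prose.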
A different approach was taken with BioLit (Fink et al.), an open-source system that integrates a subset of papers from PubMed Central with structural data from the Protein Data Bank (PDB) (Kouranov et al.) and terms from biomedical ontologies. The system works by mining the full text for terms of interest, indexing those terms and delivering them as machine-readable, XML-based article files; these are rendered human-readable via a web-based viewer, which displays the original text with colored highlights denoting additional context-specific functionality (e.g. to view a 3D structure image, to retrieve the protein sequence or the PDB entry, or to define the ontology term).
A more adventurous approach was taken by Shotton et al., who targeted an article in PLoS Neglected Tropical Diseases for semantic enhancement. The enrichments they included were live Digital Object Identifiers and hyperlinks; mark-up of textual terms (disease, habitat, organism, etc.), with links to external data resources; interactive figures; a re-orderable reference list; a document summary, with a study summary, tag cloud and citation analysis; mouse-over boxes displaying the key supporting statements from a cited reference; and tag trees for bringing together semantically related terms. In addition, they provided downloadable spreadsheets containing data from the tables and figures, enriched with provenance information, and examples of ‘mashups’ with data from other articles and Google Maps.
To stimulate further advances in the way scientific information is communicated and used, Elsevier offered its Grand Challenge of Knowledge Enhancement in the Life Sciences in 2008. The contest aimed to develop tools for semantic annotation of journals and text-based databases, and hence to improve access to, and dissemination of, the knowledge contained within them. The winning software, Reflect, addressed the dual need of life scientists to jump from gene or protein names to their molecular sequences, and to understand more about particular genes, proteins or small molecules encountered in the literature (Pafilis et al.). Drawing on a large, consolidated dictionary that links names and synonyms to source databases, Reflect tags such entities when they occur in web pages; when clicked, the tagged items invoke pop-ups displaying brief summaries of entities such as domain and/or small-molecule structures, interaction partners and so on, and allow navigation to core biological databases like UniProtKB.
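The core mechanism, dictionary-driven tagging of entity names in web pages, can be sketched as follows. The two-entry synonym dictionary and the `data-ref` markup convention below are toy assumptions for illustration; Reflect's actual dictionary links millions of names and synonyms to source databases:

```python
# Rough sketch of dictionary-driven entity tagging in HTML text.
# Assumption: a toy two-entry synonym dictionary and an invented
# <span data-ref="..."> convention stand in for Reflect's large
# consolidated dictionary and its real markup.
import re

SYNONYMS = {"p53": "UniProtKB:P04637", "TP53": "UniProtKB:P04637"}  # placeholder mapping

def tag_entities(html, synonyms=SYNONYMS):
    """Wrap each known entity name in a span carrying its database reference."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, synonyms)) + r")\b")
    return pattern.sub(
        lambda m: f'<span class="reflect" data-ref="{synonyms[m.group(1)]}">{m.group(1)}</span>',
        html,
    )

tagged = tag_entities("Mutations in TP53 stabilise p53.")
```

Because the tags carry database references rather than mere highlights, a client-side script can use them to drive pop-up summaries and navigation to the underlying database entries.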
All of these initiatives differ slightly in their specific aims, but nevertheless reflect the same aspiration—to get more out of digital documents by facilitating access to underlying research data. As such, it is interesting to see that a number of common themes have emerged: most are HTML- or XML-based, providing hyperlinks to external web sites and term definitions from relevant ontologies via color-coded textual highlights; most seem to ignore PDF as a foundation for semantic enrichment (despite a significant proportion of publisher content being offered in this format). The results of these projects are encouraging, each offering valuable insights into what further advances need to be made: clearly, we need to be able to link more than just a single database to a single article, or a single database to several articles, or several databases to a single issue of a single journal. Although necessary proofs of principle, these are just first steps towards more ambitious possibilities, and novel tools are still needed to help realize the goal of fully integrated literature and research data.
In this article, we describe a new software tool, Utopia Documents, which builds on Utopia, a suite of semantically integrated protein sequence/structure visualization and analysis tools (Pettifer et al.). We describe the unique functionality of Utopia Documents and its use in the semantic mark-up of the Biochemical Journal. We also outline the development of a number of new plugins, by means of which we have imported additional functionality into the system via web services.