Biomedical research is increasingly becoming a data-intensive science in several areas, where prodigious amounts of data are being generated that have to be stored, integrated, shared and analyzed. In an effort to improve the accessibility of data and knowledge, the Linked Data initiative proposed a well-defined set of recommendations for exposing, sharing and integrating data, information and knowledge, using semantic web technologies.
The main goal of this paper is to identify the current status and future trends of knowledge representation and management in Life and Health Sciences, mostly with regard to linked data technologies.
We selected three prominent linked data studies, namely Bio2RDF, Open PHACTS and EBI RDF platform, and selected 14 studies published after 2014 (inclusive) that cited any of the three studies. We manually analyzed these 14 papers in relation to how they use linked data techniques.
The analyses show a tendency to use linked data techniques in Life and Health Sciences, and even if some studies do not follow all of the recommendations, many of them already represent and manage their knowledge using RDF and biomedical ontologies.
These insights from RDF and biomedical ontologies are having a strong impact on how knowledge is generated from biomedical data, by making data elements increasingly connected and by providing a better description of their semantics. As health institutes become more data centric, we believe that the adoption of linked data techniques will continue to grow and be an effective solution to knowledge representation and management.
Biomedical research is increasingly becoming a data-intensive science in several areas, where prodigious amounts of data are being generated that have to be stored and, most of the time, integrated, shared and analyzed. Moreover, biomedical data is highly complex when compared to standard big data projects that, for example, store and analyze short messages and their authors, often collected from a single source and using common formats. In contrast, even a single health institution has to deal with multiple types of data, in heterogeneous formats and from different sources, such as electronic health records, clinical images and reports, or genome sequences.
The challenge of how to store and manage biomedical data in the most precise way possible has a long-standing history, and despite major technological advances it remains an open issue. For example, in 1985 the Committee on Models for Biomedical Research proposed a structured and integrated view of biology to cope with the available data. Nowadays, the BioMedBridges initiative aims at constructing the data and service bridges needed to connect the emerging Biomedical Sciences Research Infrastructures (BMSRI), which are on the roadmap of the European Strategy Forum on Research Infrastructures (ESFRI).
Beyond all the technological advances that we may deliver to make data easily accessible, researchers need more than raw data: they need a clear and objective characterization of who, what, where, why and how that data was collected. For example, owing to his strong commitment to the advance of Science, Galileo integrated the direct results of his observations of Jupiter with careful and clear descriptions of how they were performed, which he shared in Sidereus Nuncius. These descriptions enabled other researchers not only to be aware of Galileo’s findings but also to understand, analyze and replicate his methodology. We must understand the meaning of data to replicate experiments and their outcomes; otherwise the data is just a sequence of zeros and ones in which we can find useless correlations but no causality. For example, knowing the raw sequence of our genome is useless without the knowledge that science has given us after all these years of studying its meaning.
Even if we have easy access to all biomedical data, its real value can only be leveraged through how effectively we can analyze it towards the acquisition of knowledge that needs to be represented and managed. Creating data without producing knowledge is like writing books that are never read, and biomedical data is like erudite books in that their challenging writing styles normally make them hard to read. Thus, biomedical literature has been the traditional and natural means for representing knowledge, where all the findings are properly described and their limitations and potential fully discussed. As a consequence, a large amount of the knowledge acquired in Life and Health Sciences is available through literature. However, representing knowledge as unstructured free text hinders its accessibility and usage, since the retrieval of information from a large collection of texts is a tedious and time-consuming task for humans and a hard and error-prone task for machines.
In an effort to improve the accessibility of data and knowledge without losing too much flexibility, the Linked Data initiative proposed a well-defined set of recommendations for exposing, sharing and integrating data, information and knowledge, using semantic web technologies. This paradigm is more than just a standardized messaging and text communications protocol to avoid data silos, such as HL7: Linked Data enables the association and characterization of any kind of data in the form of links, reinforcing our tools to represent and manage knowledge. The links are described using the Resource Description Framework (RDF), which provides a universal graph-based data model to connect data elements to each other and also to add semantics to them. This model is more flexible than traditional data storage models, but still not as flexible as unstructured free text. Thus, literature will not be replaced by linked data, but more data and knowledge can be easily expressed this way without hindering its accessibility. Besides RDF there are other graph-based models, such as Property Graphs, which may prove to be more effective in some specific areas of biomedicine.
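The graph-based model can be illustrated with a minimal sketch in plain Python, using no RDF library. Each statement is a (subject, predicate, object) triple, and a pattern with wildcards retrieves the links around a resource. The `http://example.org/` URIs, the `hasDiagnosis` and `associatedWith` predicates and the helper `match` are hypothetical names chosen for illustration only; the `rdf:type` and `rdfs:label` URIs are the standard ones.

```python
# Hypothetical namespace: these resource URIs are illustrative, not real identifiers.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# A tiny graph: every element is one (subject, predicate, object) statement.
triples = {
    (EX + "patient/1", RDF_TYPE, EX + "Patient"),
    (EX + "patient/1", EX + "hasDiagnosis", EX + "disease/parkinson"),
    (EX + "disease/parkinson", RDF_TYPE, EX + "Disease"),
    (EX + "disease/parkinson", RDFS_LABEL, "Parkinson's disease"),
    (EX + "gene/SNCA", EX + "associatedWith", EX + "disease/parkinson"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything that points at the disease node, whatever the predicate.
incoming = match(triples, o=EX + "disease/parkinson")
for subject, predicate, _ in incoming:
    print(subject, predicate)
```

Because both data and its description are stored as the same kind of statement, adding a new link or a new type of annotation does not require changing a schema, which is the flexibility the paragraph above refers to.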
One of the earliest well-known attempts at applying Linked Data to biomedical data was Bio2RDF, an open access platform that provided access to millions of documents in normalized RDF format with data from hundreds of different organisms. The potential of Bio2RDF was demonstrated in a case where a knowledgebase about Parkinson’s disease was successfully built and some specialized questions were efficiently answered.
A few years ago, a public-private partnership between the pharmaceutical industry, academia, publishers, and small and medium-sized enterprises initiated the Open Pharmacological Concept Triple Store (Open PHACTS) project. Its goal was to build an open pharmacological knowledgebase that could overcome the complexity of data access and the licensing hurdles intrinsic to this domain, with a solid plan for sustainability, service provision and maintenance in the public domain. Like Bio2RDF, the Open PHACTS platform is based on the RDF format, with a bottom-up perspective of data standards where information from multiple providers is exposed by adaptive integration of the information.
More recently, the European Bioinformatics Institute (EBI), a major provider of bioinformatics data and services, made the EBI RDF platform available to the community. This platform integrated multiple EBI data resources, such as UniProt, Gene Expression Atlas, ChEMBL, BioModels, Reactome and BioSamples, based on the RDF format and accessible through a standard query language interface (SPARQL). The EBI RDF platform is the web interface for online access, but beyond just providing data in a common format, it makes an effort to include as many common vocabularies as possible to describe the semantics and provenance of the data.
Adopting the Linked Data paradigm does not by itself mean that we are sharing knowledge. Each human has their own set of links in their mind, and to start communicating we need a common ground. For example, by using spoken English two humans can share their knowledge about the world and therefore create more links in their minds. There are specific predicates that should be used for linking datasets, such as owl:sameAs, rdfs:seeAlso, skos:exactMatch and skos:closeMatch, and a recent study showed that in Life Sciences the predicate owl:sameAs was the most widely used linking predicate (52.17%).
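A statistic such as the share of each linking predicate can be computed with a simple count over the link triples of a dataset. The sketch below is only an illustration of that arithmetic: the `ex:` and `other:` identifiers and the `ex:targets` predicate are hypothetical, the sample data is made up, and it is not the procedure used by the cited study.

```python
from collections import Counter

# The four linking predicates named in the text, written as prefixed names.
LINKING_PREDICATES = {
    "owl:sameAs",
    "rdfs:seeAlso",
    "skos:exactMatch",
    "skos:closeMatch",
}

# Hypothetical cross-dataset statements (identifiers are made up).
links = [
    ("ex:drugA", "owl:sameAs", "other:D001"),
    ("ex:drugB", "owl:sameAs", "other:D002"),
    ("ex:geneX", "rdfs:seeAlso", "other:G100"),
    ("ex:diseaseY", "skos:closeMatch", "other:C200"),
    ("ex:drugA", "ex:targets", "ex:geneX"),  # ordinary predicate, ignored below
]


def linking_predicate_shares(triples):
    """Percentage of linking statements that use each linking predicate."""
    counts = Counter(p for _, p, _ in triples if p in LINKING_PREDICATES)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {predicate: 100.0 * n / total for predicate, n in counts.items()}


shares = linking_predicate_shares(links)
for predicate, share in sorted(shares.items()):
    print(f"{predicate}: {share:.2f}%")
```

The choice of predicate matters because the predicates carry different strengths of claim: owl:sameAs asserts identity, whereas skos:closeMatch only asserts that two concepts are similar enough to be used interchangeably in some applications.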
One important aspect of the Linked Data paradigm is the usage of common vocabularies that are expressed using RDF Schema (RDFS) and the Web Ontology Language (OWL). These vocabularies are used to describe the data elements and their relations by defining classes and their properties. The usage of common vocabularies is encouraged (but not compulsory) to establish a common interpretation of data and consequently enable knowledge sharing. These vocabularies can vary from simple terminologies to highly complex semantic models of a given domain encoded in the form of ontologies, such as the Gene Ontology (GO). The Linked Data paradigm uses RDF as its data model together with its vocabulary definition languages RDFS and OWL. However, the usage of RDF and ontologies goes beyond the scope of Linked Data, and many biomedical projects exploit them without necessarily following the Linked Data paradigm.
Above, biomedical data was compared to erudite books with challenging writing styles, but now imagine that each one of them were written in an exclusive language that could not be easily mapped to English, and therefore without any thorough translation available. Reading each book would require us to learn a new language to fully understand its message. The knowledge would be there but accessible to just a few. Thus, standard classification vocabularies represent a solution that prevents data and knowledge from being stored as silos by enabling data annotations with common terms, which makes data and its meaning more accessible. These vocabularies are instantiated by Knowledge Organization Systems (KOS) in the form of classification systems, thesauri, lexical databases, gazetteers, taxonomies and ontologies. The latter can be loosely defined as “a vocabulary of terms and some specification of their meaning” [17, 18]. If an ontology is accepted as a reference by the community then the representation of its domain becomes a standard, and knowledge sharing and management is facilitated.
The etymological encyclopedia (Book IV: Medicine) compiled by Isidore of Seville (c. 560–636) was one of the first attempts to systematize medical knowledge. In the seventeenth century, the London Bills of Mortality established a classification terminology for registering morbidity and mortality cases that enabled the study of mortality rates and their causes. However, only in the last decades has the biomedical community openly engaged in developing and using ontologies to represent and manage knowledge. Perhaps the best-known KOS in medicine nowadays is the vocabulary provided by the International Classification of Diseases (ICD), a classification system maintained by the World Health Organization (WHO), which originally aimed at providing a statistical analysis tool for disease incidence and mortality. The current release, ICD-10, provides a vocabulary containing a list of generic clinical terms mainly arranged and classified according to anatomy or etiology.
Another well-known ontology is the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT), originally created by the College of American Pathologists and currently maintained by the International Health Terminology Standards Development Organization. SNOMED CT provides a highly comprehensive and detailed set of clinical terms used in many systems to enrich the information in electronic health records. The July 2016 release provided 321,901 active concepts. SNOMED CT also includes logic-based definitions to represent terminological knowledge, i.e., facts about the meaning of the terms. For example, the term myocardial infarction includes the fact that it must involve the myocardium, and it must involve an infarction. SNOMED CT is available through the Unified Medical Language System (UMLS) maintained by the U.S. National Library of Medicine. The UMLS provides a Metathesaurus that integrates more than one hundred miscellaneous vocabularies (e.g. the Medical Subject Headings thesaurus (MeSH)), which in the 2015AB release covered more than three million concepts.
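The idea of a logic-based definition can be sketched with a deliberately simplified toy in Python. This is not SNOMED CT's actual description-logic model: each concept is reduced to a set of necessary attribute/value pairs, the concept names and values are illustrative assumptions, and one concept is taken to subsume another when all of its conditions also hold for the other.

```python
# Toy definitions: each concept is a set of necessary (attribute, value) pairs.
# Names and values are illustrative, not taken from a terminology release.
DEFINITIONS = {
    "Disorder of myocardium": {("finding site", "myocardium")},
    "Infarction": {("associated morphology", "infarct")},
    "Myocardial infarction": {
        ("finding site", "myocardium"),
        ("associated morphology", "infarct"),
    },
}


def subsumes(general, specific):
    """True if every condition of `general` is also required by `specific`."""
    return DEFINITIONS[general] <= DEFINITIONS[specific]


# Myocardial infarction is inferred to be both a myocardial disorder and an
# infarction, without either parent being asserted explicitly.
inferred_parents = sorted(
    name for name in DEFINITIONS
    if name != "Myocardial infarction" and subsumes(name, "Myocardial infarction")
)
print(inferred_parents)
```

The practical benefit is that hierarchy links can be derived from the definitions rather than maintained by hand, which is what makes such definitions "terminological knowledge" rather than mere labels.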
One of the criticisms of SNOMED CT is the fact that it is proprietary. Therefore the Open Biomedical Ontology foundry (OBO) proposed an alternative approach where design patterns and best practices in ontology specification are promoted on the basis of open usage and collaborative development. OBO established a set of principles that ontologies have to satisfy before becoming part of the project. These principles ensure high quality, formal rigor and also interoperability between OBO member ontologies. One of these principles requires the ontology to be open and available without any constraint other than acknowledging its origin. OBO also provides an alternative format to OWL to represent ontologies, named the OBO format.
GO is one of the most popular OBO ontologies and has been extensively used to annotate gene-products with terms describing their molecular functions, biological processes and cellular components. In September 2014, GO provided more than 41,775 terms and a total of 53,042,843 gene-products annotated with them. GO relies on a large consortium of collaborators that cooperate in maintaining and updating the ontology, which has made it widely used and accepted, and thus considered a major example of success for biomedical ontologies. Another OBO ontology is the Disease Ontology (DO), which provides human disease terms, phenotype characteristics and related medical vocabulary disease concepts. In October 2014, DO contained 8,803 terms, of which 2,384 were considered obsolete. The Human Phenotype Ontology (HPO) is also an OBO ontology, and provides terms for describing phenotypic abnormalities seen in human disease. In 2015, HPO contained more than 11,000 terms, and over 116,000 annotations to over 7,000 rare diseases.
To present an overview of the significant developments in knowledge representation based on linked data approaches over the past one or two years, we started by identifying a representative set of articles by analyzing the citations to the well-known projects described above. Thus, in December 2015 we collected the list of articles on Google Scholar that cited the articles presenting Bio2RDF, Open PHACTS and/or the EBI RDF platform. At the time they were cited by 498, 128, and 59 articles, respectively.
The list of articles was then automatically filtered using the following restrictions: i) published after 2014 (inclusive), ii) having the word data in their title, and iii) published in biomedical journals. The first restriction limited the survey to approaches published over the past one or two years. The second one limited the survey to approaches that have a strong focus on data. The third restriction limited the survey to biomedical studies that are already well-established.
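The three restrictions can be expressed as a single predicate over citation records. The following is a minimal sketch under stated assumptions, not the script actually used for the survey: the `Article` record, its field names and the sample entries are hypothetical, the journal classification is assumed to come from some external list, and matching "data" as a whole word (so that "database" does not count) is one possible reading of restriction ii.

```python
import re
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical citation record; the field names are assumptions."""
    title: str
    year: int
    in_biomedical_journal: bool  # assumed to be decided by an external journal list


DATA_WORD = re.compile(r"\bdata\b", re.IGNORECASE)


def passes_filters(article):
    """Apply restrictions i) to iii) from the text to one record."""
    published_2014_or_later = article.year >= 2014           # i) after 2014, inclusive
    has_data_in_title = bool(DATA_WORD.search(article.title))  # ii) word "data" in title
    return (
        published_2014_or_later
        and has_data_in_title
        and article.in_biomedical_journal                     # iii) biomedical journal
    )


# Made-up records, for illustration only.
candidates = [
    Article("Publishing clinical data as linked data", 2015, True),
    Article("A database of pathways", 2015, True),         # "database" is not "data"
    Article("Linked data for chemistry", 2013, True),      # too early
    Article("Open data on the web", 2014, False),          # not a biomedical journal
]

selected = [a for a in candidates if passes_filters(a)]
print([a.title for a in selected])
```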
Finally, we manually analyzed the scope of each article and selected only the ones that represented case-studies, repositories or frameworks working with data of Life and Health sciences. Thus, we removed all the articles mainly describing software and tools, statistical analysis of data, opinions, reviews and surveys.
The result of this process was a list of 14 articles that we believe provide a good representative overview, not a comprehensive list, of the significant developments in knowledge representation over the past one or two years.
As displayed in Table 1, of the 14 selected articles, i, ix, x, xi, xii, xiii and xiv cite the Bio2RDF work but do not use this platform. Study i, for example, proposed a Linked Clinical Data Cube, whose main goal was to take data from the Australian Imaging, Biomarker and Lifestyle Study of Ageing (AIBL) and make it available as linked data for the research community. The authors cite Bio2RDF as related work, since both studies transform data from databases into RDF and share it in an easy and accessible way, allowing the extraction of relevant information from these databases. Another example is study ix, which created a platform (the eXframe platform) that, like Bio2RDF, makes information available as RDF. In study xiii, a Web tool (TogoTable) is built. It uses the features of RDF to connect several Linked Open Data (LOD) databases, enabling links to Bio2RDF.
Articles iv, vi, vii and viii not only cite Bio2RDF but also used it in their methodology. Article iv presented an approach to integrate pathway data from four different Linked Data repositories, using the Bio2RDF KEGG data as the core and the Bio2RDF Reactome distribution as an extension. The goal of article vi was to mine linked open data, and it used Bio2RDF ontologies to link some entities. Article vii used the Bio2RDF biological database, applying new standards for LOD that are necessary to communicate effectively with other reference databases already operating under the Semantic Web scheme. Finally, article viii proposes a nanopublication publishing format that uses Bio2RDF, since it provides RDF and URIs for different biomedical resources.
The EBI RDF platform is cited in 7 of the 14 papers: v, vi, ix, x, xi, xii and xiii. However, only the work in article v is directly related to this platform. Among the articles that only cite the EBI RDF platform, article x compares the EBI RDF effort to provide an innovative approach to querying and exploring rich biological data collections with its own effort to create an ontology for generating standardized RDF for glycan structures and related data. Another example is article xii, whose goal was to create the SEEK platform: a suite of tools to support the management, sharing and exploration of data and models in systems biology. SEEK stores metadata in RDF, which promotes greater interoperability with other platforms such as the EBI RDF platform.
Open PHACTS is cited in only 4 of the 14 selected articles (ii, iii, xi and xiv), but none of them uses it. The eTOX data-sharing project (ii) is gathering data from the public and private domains with the main goal of developing a common ontology, and it cites Open PHACTS as another initiative to gather chemical-related toxicity information. The same description is given in article xiv, whose work is related to the Semantic Enrichment of the Scientific Literature (SESL) project. Articles iii and xi present Open PHACTS as an example of other efforts where Semantic Web technology has been used for biomedical data integration.
Table 2 shows the data input and output used in the selected articles. Most articles (ii, iii, viii, ix, x, xii and xiv) used Web Ontology Language (OWL) and Open Biomedical Ontologies (OBO) ontologies as input or just as complementary data to improve the output. The eTOX project (ii), for example, used data from preclinical studies, extracted from papers and PDFs through data mining, and the ontoBrowser ontology to confirm and standardize the data so that it could be used to create a new ontology of toxicity. This is useful for creating a predictive model for the drug development process. In ix, several ontologies such as NCBITaxon, EFO, FMA, BTO, CL, NCI Thesaurus and CHEBI were used to annotate data. Ontologies from the Bio2RDF platform are particularly employed, as can be read in articles iv, vi, vii, and viii. The Gene Ontology was used in articles iii, xi and xiv. Article v does not specify the type of input data, mentioning only that the data was manually extracted and curated from chemistry literature.
RDF data was the input in articles i, iv, vii, xi and xiii. In i, clinical study data was extracted in the Clinical Data Interchange Standards Consortium Operational Data Model (CDISC ODM) format, and the Data Documentation Initiative RDF (DDI-RDF) vocabulary was used to enrich clinical data based on the CDISC standards. Both RDF and ontologies are used in two particular cases, viii and xi. In the first one, data from the Bio2RDF platform is used as well as the NIF Standard ontology (NIF-STD), NCI Thesaurus (NCI), Gene Regulation Ontology (GRO), SemanticScience Integrated Ontology (SIO) and Sequence Ontology (SO). The latter, a bio-ontology repository, used several ontologies and RDF data to create the platform. The final output for most articles is RDF data. The exceptions are article ii, which created a new ontology of toxicological terms; article viii, whose output is nanopublications; and article xi, which, as already said, intended to create an ontology repository, so its output is a research ontology platform.
In the selected articles we were not able to find information about the dereferenceability of the vocabularies/ontologies. However, according to recent statistics, most vocabulary terms (71.73%) are not dereferenceable, 19.47% are totally dereferenceable and 8.8% are only partially dereferenceable. In Life Sciences in particular, 66.67% are not dereferenceable, 27.78% are totally dereferenceable and 5.56% are partially dereferenceable.
There is no doubt that nowadays we have access to more data, more easily and with higher quality than a decade ago. However, is our knowledge keeping pace? As presented above, RDF technologies are having a strong impact on how the Life and Health Sciences community is storing, integrating and sharing data and knowledge. Even if not fully following the Linked Data paradigm, the community is now making a large effort to exploit some of its technologies for connecting data elements and consequently providing a better description of their semantics. In that sense, ontologies are performing a crucial role in making semantic annotations consistent and interoperable. Unlike the knowledge concealed in articles, the knowledge shared through annotations using standard ontologies can be easily processed and analyzed by computational methods. For example, it enables us to search for similar and related entities based on their biomedical meaning, such as similar molecular functions and similar diseases.
Many information retrieval systems, such as Google, use similarity measures to calculate the similarity between a query and a document in a way that takes into account the document's relevance to the user. For instance, if we look for physiology models annotated with Scaphoid we may be interested in receiving the models annotated with Wrist as well, but probably not all the models annotated with other Upper limb segments. This relevance can be captured by semantic similarity measures that return a numerical value reflecting the closeness in meaning between semantic annotations. These semantic similarity measures have been successfully developed and applied to biomedical ontologies, particularly to the Gene Ontology, where they are mainly used to compare genes or proteins based on the similarity of their functions. Another popular technique is enrichment analysis, which exploits semantic annotations to identify clinical and biological characteristics that may better describe the outcome of a group of patients with a common disease. For example, this technique was recently applied effectively to improve the disease prognosis of hypertrophic cardiomyopathy.
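The Scaphoid and Wrist example can be made concrete with a minimal sketch of one of the simplest semantic similarity measures, based on counting edges in the ontology hierarchy. Everything here is an assumption made for illustration: the toy anatomy hierarchy, the single-parent structure and the formula 1 / (1 + path length). The measures applied to the Gene Ontology in the literature are typically more elaborate (for example, based on information content) and are not reproduced here.

```python
# Toy single-parent hierarchy (child -> parent); None marks the root.
# The structure is illustrative, not taken from a real anatomy ontology.
PARENT = {
    "Upper limb": None,
    "Wrist": "Upper limb",
    "Scaphoid": "Wrist",
    "Upper arm": "Upper limb",
    "Humerus": "Upper arm",
}


def path_to_root(term):
    """List the term followed by all of its ancestors up to the root."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def path_length(a, b):
    """Number of edges between a and b through their lowest common ancestor."""
    steps_from_b = {t: i for i, t in enumerate(path_to_root(b))}
    for steps_from_a, t in enumerate(path_to_root(a)):
        if t in steps_from_b:
            return steps_from_a + steps_from_b[t]
    raise ValueError("terms share no common ancestor")


def similarity(a, b):
    """Edge-based similarity in (0, 1]; identical terms score 1.0."""
    return 1.0 / (1.0 + path_length(a, b))


for other in ("Scaphoid", "Wrist", "Humerus"):
    print("Scaphoid vs", other, "->", round(similarity("Scaphoid", other), 2))
```

With this toy hierarchy, a query for Scaphoid ranks models annotated with Wrist above those annotated with Humerus, so a retrieval system can apply a threshold to return the former and omit the latter.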
In the near future health institutes will be data centric, with each situation analyzed in light of previous situations by comparing similar patient profiles with similar phenotypes. For example, screening processes that are crucial to detect life-threatening situations in a short period of time would benefit from having a large knowledgebase together with advanced information retrieval systems that could provide these alerts in real time. Due to privacy issues these knowledgebases are normally restricted to local data, which hinders their effectiveness, but sharing data does not require open data. For example, we can use a remote similarity service from an external knowledgebase and, if we get a hit, automatically send a request to access the matching information. If permission is granted, we may access the information in an anonymized and controlled way, i.e., in case of any leak we know who was granted access, how, why and to what. Thus, dealing with sensitive data does not mean that we cannot share metadata and services following a linked data perspective; on the contrary, it is one of the best approaches (if not the best) to represent and manage knowledge in such a setting.
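The workflow described above (a remote similarity query, followed by an access request that is logged and answered with anonymized data) can be sketched as follows. This is a hypothetical in-memory mock-up, not an existing service or API: the class `RemoteKnowledgebase`, its methods, the use of Jaccard overlap as the similarity measure, the threshold and the sample records are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteKnowledgebase:
    """Hypothetical external knowledgebase holding sensitive patient records."""
    records: dict    # record id -> {"patient_name": str, "phenotypes": set}
    approved: set    # requesters that may be granted access
    audit_log: list = field(default_factory=list)

    def find_similar(self, phenotypes, threshold=0.5):
        """Similarity service: returns only opaque record ids, never the data."""
        hits = []
        for record_id, record in self.records.items():
            union = phenotypes | record["phenotypes"]
            shared = phenotypes & record["phenotypes"]
            overlap = len(shared) / len(union) if union else 0.0  # Jaccard index
            if overlap >= threshold:
                hits.append(record_id)
        return sorted(hits)

    def request_access(self, requester, record_id, purpose):
        """Log who asked for what and why; return an anonymized view if granted."""
        granted = requester in self.approved
        self.audit_log.append(
            {"who": requester, "what": record_id, "why": purpose, "granted": granted}
        )
        if not granted:
            return None
        record = self.records[record_id]
        # The identifying field is dropped; only the phenotypes are shared.
        return {"id": record_id, "phenotypes": sorted(record["phenotypes"])}


# Made-up data for illustration.
kb = RemoteKnowledgebase(
    records={
        "r1": {"patient_name": "Jane Doe", "phenotypes": {"tremor", "rigidity"}},
        "r2": {"patient_name": "John Roe", "phenotypes": {"cough"}},
    },
    approved={"hospital-A"},
)

hits = kb.find_similar({"tremor", "rigidity", "bradykinesia"})
shared_view = kb.request_access("hospital-A", hits[0], "screening") if hits else None
denied = kb.request_access("unknown-party", "r1", "unspecified")
print(hits, shared_view, denied, len(kb.audit_log))
```

The design choice worth noting is the separation of the two calls: the similarity query exposes no patient data at all, and every access decision, granted or not, leaves an audit entry.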
Linked Data offers an effective solution to break down data silos; however, the systematic usage of these technologies requires a strong commitment from the research community. Creating linked data resources with sound and comprehensive characterizations of their meaning, and using semantic annotations to common ontologies, is a complex and subjective process, which can be supported by automatic methods, such as text mining, but in most cases requires a lot of specialized human intervention. Recognition and reward mechanisms beyond bibliometric indicators will therefore be essential to avoid the creation of raw data silos that cannot be reused by others, or even by the owners themselves. This incentive is currently so low that sometimes even authors cannot recover the data associated with their own publications. Public funding agencies and journals may enforce data-sharing policies, but adherence is most of the time inconsistent and scarce. The problem, therefore, is obtaining a proactive involvement of the community in integrating and sharing data. To support this, we have to go beyond technological advances and create motivation mechanisms that encourage data owners to share their data in a meaningful way.
Linked Data is not necessarily free or open data, nor is it necessarily sound data: it can have access restrictions, be incomplete and contain errors. Nevertheless, the technological advances and the successful use cases in the Life and Health Sciences shown above are a promising sign that linked data may in the near future be as omnipresent in our daily lives as the Internet is today.
We would like to thank Gonçalo Figueiró, Joana Teixeira and Marta Antunes for helping us in identifying some of the details from the articles reviewed. This work was supported by FCT through funding of the LaSIGE Research Unit, ref. UID/CEC/00408/2013.