In this work, we have presented BCML, a new data format that implements the Process Description specification of the SBGN data model for the representation of biological networks. Moreover, the format was designed to provide significant additional capabilities, increasing the amount of information that it can store and offering additional flexibility. These capabilities are useful for both biological interpretation and data analysis.
The aim that guided the development of BCML was to build a flexible and dynamic representation of biological pathways that is unambiguous yet still understandable by the biologist, that improves the available analyses, and that supports features such as contextual selection, extended annotation and graphical visualization. We chose XML because it is highly flexible, easily parseable and simply transformable into other formats. Moreover, XML is easy to learn, and a well-designed schema (e.g. one using descriptive and self-explanatory tags) makes the resulting files easy to read, even for people with limited computer skills (Barillot and Achard, 2000). Also, specialized XML editors can create easily understandable data models composed of entities nested on multiple levels, and they facilitate the visualization of XML content.
The adoption of the SBGN model ensures that BCML can represent most biological information in an unambiguous way and, at the same time, is compatible with the most recent graphical standard for biological pathways.
Before developing a new format, we examined KGML, BioPAX (Luciano, 2005) and SBML (Hucka et al., 2003). Each of these formats provided features that we needed, but none covered our use case exactly. SBML has a different focus from our intended goal, namely the modeling of quantitative and temporal aspects, and did not properly support a graphical representation. KGML was not detailed enough to capture all the required information, and BioPAX did not implement, at the time of writing, some of our requirements, for example the generation of subpathways from contextual selection. Also, KGML and SBML did not offer SBGN-compliant representations, while BioPAX required an additional conversion step.
One of the most important problems we faced when building this model was that most pathways available in public databases are ‘generic’. Although most pathways are organism specific, they lack information on the precise cell types or tissues in which the described phenomena take place. Even more seriously, many publicly available pathways combine in a single diagram elements that are specific only to certain tissues (Cavalieri et al., 2010). Since the commonly used pathway databases do not currently store information about tissue localization, a life scientist unfamiliar with the minutiae of a specific pathway could easily and incorrectly infer that all the depicted genes are involved in this pathway in all cells. In contrast, a tool relying on BCML can display (on request) only those elements of the pathway that are known to be applicable in the currently selected organism and tissue type, and that are supported by the level of proof specified by the user. Thus, a single BCML pathway representation can capture all the information available about that pathway for all organisms, tissue types, types of evidence, etc., as well as experimentally measured values such as gene expression, protein abundance, etc.
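The contextual selection described above can be illustrated with a minimal Python sketch. The XML fragment, its element and attribute names (`entity`, `finding`, `organism`, `tissue`, `evidence`), the entity identifiers and the function `select_entities` are all hypothetical and chosen only for illustration; they are not taken from the actual BCML schema.

```python
import xml.etree.ElementTree as ET

# Made-up pathway fragment. Element names, attribute names and entity ids
# are illustrative assumptions, NOT the real BCML schema or real findings.
PATHWAY_XML = """
<pathway name="example">
  <entity id="entity_a">
    <finding organism="human" tissue="dendritic cell" evidence="experimental"/>
  </entity>
  <entity id="entity_b">
    <finding organism="human" tissue="macrophage" evidence="experimental"/>
  </entity>
  <entity id="entity_c">
    <finding organism="human" tissue="dendritic cell" evidence="inferred"/>
    <finding organism="mouse" tissue="dendritic cell" evidence="experimental"/>
  </entity>
</pathway>
"""


def select_entities(xml_text, organism, tissue, accepted_evidence):
    """Return the ids of entities having at least one finding that matches
    the requested organism, tissue and accepted evidence levels."""
    root = ET.fromstring(xml_text)
    selected = []
    for entity in root.findall("entity"):
        for finding in entity.findall("finding"):
            if (finding.get("organism") == organism
                    and finding.get("tissue") == tissue
                    and finding.get("evidence") in accepted_evidence):
                selected.append(entity.get("id"))
                break  # one matching finding is enough to keep the entity
    return selected


# Entities supported by experimental evidence in human dendritic cells.
strict = select_entities(PATHWAY_XML, "human", "dendritic cell", {"experimental"})
# Relaxing the required level of proof keeps more of the generic pathway.
relaxed = select_entities(
    PATHWAY_XML, "human", "dendritic cell", {"experimental", "inferred"}
)
```

In this sketch, a single stored pathway yields different subpathways depending on the organism, tissue and level of proof requested, which is the behavior the paragraph above attributes to a BCML-based tool.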
We are strongly convinced that most improvements in pathway analysis will come from a better use of already existing knowledge. For this reason, we needed a format able not only to store additional information (findings) about a pathway, but also to include or exclude elements based on such knowledge, thus enabling the creation of ‘customized’ pathways that are better suited to describe specific biological problems or to highlight gaps in our current knowledge.
BCML is our proposal to overcome this lack of flexibility in the currently available data formats for pathways. The constraints set by BCML, which reflect those set by SBGN, are also important in ensuring that a pathway is designed in the correct form from the start. Moreover, in BCML we added the possibility of integrating experimental measurements as a way to improve the interpretation of experimental results.
Lastly, we wanted to construct a format that could facilitate subsequent analyses, because the use of pathways to perform analysis, especially in the context of high-throughput data, is an expanding field (Cavalieri and De Filippo, 2005; Werner, 2008). Through very simple transformations, BCML can be used both for gene list-based approaches, such as GSEA (Subramanian et al., 2005) or the canonical Fisher's Exact Test (Draghici et al., 2003; Grosu et al., 2002), and for more advanced, topology-aware methods such as Impact Analysis (Tarca et al., 2009). Furthermore, contextual selection can also be applied when transforming the data for analysis, thus permitting analyses ‘tailored’ to specific biological problems.
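As an illustration of the gene list-based approach mentioned above, the following Python sketch computes a one-sided Fisher's Exact Test (the upper tail of the hypergeometric distribution) for the overlap between a pathway gene list and a list of selected genes. It assumes that the pathway has already been reduced to a plain list of gene identifiers; the function name `enrichment_p` and the gene identifiers are hypothetical, and this is not the implementation used in this work.

```python
from math import comb


def enrichment_p(pathway_genes, selected_genes, universe):
    """One-sided Fisher's Exact Test: probability of observing at least the
    actual overlap between pathway and selected genes by chance."""
    universe = set(universe)
    pathway = set(pathway_genes) & universe    # ignore genes outside the universe
    selected = set(selected_genes) & universe
    overlap = len(pathway & selected)

    n_universe = len(universe)
    n_pathway = len(pathway)
    n_selected = len(selected)

    # Hypergeometric upper tail: sum of P(X = k) for k >= observed overlap.
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)


# Toy example with made-up gene identifiers.
universe = [f"g{i}" for i in range(10)]
generic_pathway = ["g0", "g1", "g2", "g3"]
selected = ["g0", "g1", "g9"]
p_value = enrichment_p(generic_pathway, selected, universe)
```

Passing a contextually selected gene list instead of `generic_pathway` changes only the first argument, which is how a ‘tailored’ analysis would fit into the same test.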
To verify that our format offered the advantages intended by its design over the already existing formats, we compared two specific implementations of the Toll-like receptor 3 (TLR3) pathway; TLR3 is a receptor involved in the dendritic cell response to double-stranded RNA (dsRNA) (Kawai and Akira, 2007; Meylan and Tschopp, 2006). One implementation was represented using our format; the other was the one stored in the Reactome pathway database, which we used as an established reference. We then used a publicly available dataset to test the reliability of the implementation when applied to data analysis: we expected an activation of the TLR3 pathway, because the dataset contained dendritic cells stimulated with poly(I:C), a synthetic analog of the dsRNA recognized by the receptor.
Our results showed that the BCML representation of the TLR3 pathway performs as well as the established reference, while offering important additional features: the possibility of incorporating experimental measurements, the possibility of using topology-aware analysis algorithms, and the contextual selection of elements according to a specific biological context. Compared with the use of different pathways in the same species, such an approach is more powerful, as it can highlight subtle differences in signaling networks among different cell types or tissues.
With regard to classical analysis methods, both the Reactome and the BCML implementations gave significant P-values for the TLR3 pathway. The discrepancy between the results of the statistical test is not related to the format; rather, it reflects the different curation of the standard (Reactome) and of DC-ATLAS (Cavalieri et al., 2010), from which the BCML implementation of TLR3 was taken. This shows that our implementation of TLR3 is comparable to established standards when using an external, focused dataset. Additionally, BCML provides the possibility of using topology-aware analysis methods such as Impact Analysis, which are more precise because they take into account the order of, and the causal relationships among, the various entities.
For both methods, using a subset of the TLR3 pathway produced by contextual selection, which takes the specific biological context into account, yielded P-values lower by one order of magnitude than those of the generic implementation. This result is of particular importance because it clearly shows the need for pathway definitions that match as closely as possible the biological context being investigated, leading to more robust and precise results. As a matter of fact, pathways and cellular networks exhibit considerable differences between species (Mestas and Huges, 2005; Shen-Orr et al., 2010), and, even more importantly, the cell type is an even greater discriminating factor: for example, the presence of >50% of the known genes of the TLR3 pathway has not been demonstrated in dendritic cells (Cavalieri et al., 2010). Thus, when using a pathway definition for computational analysis, it is essential to stay as close as possible to the experimental setup, in order to prevent or notice inconsistencies that would ultimately affect the final interpretation of the results. Although BCML is less refined than the currently available alternatives, it provides additional functionality, and it highlights a possible solution to problems that are now arising when representing pathways. Thanks to the selection capabilities of BCML, it is possible to construct specific pathways for ‘tailored’ analyses. Such selection can be used both by the biologist, to visualize the non-demonstrated interactions, and by the bioinformatician, who can adjust the analysis methods to take missing annotations into account.
BCML only covers Process Description at the moment, while a complete SBGN representation of a pathway should also include the Entity Relationship and Activity Flow representations. We will work toward implementing these two specifications in order to provide a more complete data model for all the SBGN graphical pathway descriptions. We will also work on developing tools to convert BCML to other formats, such as BioPAX, to increase interoperability.
One of the major strengths of this work is that the format was conceived in parallel with the implementation of flexible tools for the representation, manipulation and analysis of biological networks. We are hereby making available to the community not only an abstract data model intended for wide adoption, but also the basic tools that allow the manipulation of pathways in this format.
Being SBGN compliant and machine readable, BCML provides a convenient and precise way to represent biological pathways, in a form useful to both the biologist and the bioinformatician. Its dynamic nature makes it an important tool for the dissection of complex, highly specific biological problems. Lastly, because it contains deeper descriptions of biological knowledge, BCML is well suited to advanced pathway analysis methods, as well as to the creation of knowledge-based online resources. We expect that our model will be a useful contribution to the pathway community, making possible the creation of more practical and more complete pathway representations that will be both more end-user friendly and better suited for advanced computational analysis.