Users may view, print, copy, download and text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
BioPAX (Biological Pathway Exchange) is a standard language to represent biological pathways at the molecular and cellular level. Its major use is to facilitate the exchange of pathway data (http://www.biopax.org). Pathway data captures our understanding of biological processes, but its rapid growth necessitates development of databases and computational tools to aid interpretation. However, the current fragmentation of pathway information across many databases with incompatible formats presents barriers to its effective use. BioPAX solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. BioPAX was created through a community process. Through BioPAX, millions of interactions organized into thousands of pathways across many organisms, from a growing number of sources, are available. Thus, large amounts of pathway data are available in a computable form to support visualization, analysis and biological discovery.
Molecular biology research has yielded detailed knowledge of biomolecular components and their interactions. Increasingly powerful technologies, including genome-wide molecular measurements, have accelerated the progress towards a complete map of molecular interaction networks in cells and between cells of key organisms. A single person can no longer memorize these maps; they must therefore be represented in a form suitable for computer processing and storage and made easily available to scientists via software systems. Accordingly, the BioPAX (Biological Pathway Exchange) project aims to facilitate knowledge representation, systematic collection, integration and wide distribution of pathway data from heterogeneous information sources and thereby their incorporation into distributed biological information systems that support visualization and analysis.
Biology has come a long way since the Boehringer-Mannheim wall chart of metabolic pathways1 and the Nicholson Metabolic Map2. Since then, a number of groups have developed methods and databases for organizing pathway information3-16, but only recently collaborated as part of the BioPAX project to develop a generally accepted standard way of representing these pathway maps. Complete molecular process maps must include all interactions, reactions, dependencies, influence and information flow between pools of molecules in cells and between cells. For ease of use and simplicity of presentation, such network maps are often organized in terms of sub-networks or pathways. Pathways are models that biologists have delineated within the entire cellular biochemical network that help us describe and understand specific biological processes. Thus, a useful definition of a pathway is a set of interactions between physical or genetic cell components, often describing a cause-and-effect or time-dependent process, which explain some observable biological function. How do we represent these pathways in a generally accepted and computable form?
The total volume of pathway data mapped by biologists and stored in databases has entered a rapid growth phase17, similar to the rapid expansion of biological sequence data after the introduction of automated sequencing technology. The number of pathway and molecular interaction related online resources has grown from 190 in 2006 to 325 in 2010, a 70% increase17. In addition, molecular profiling methods, such as RNA profiling using microarrays or protein quantification using mass spectrometry, provide large amounts of information about the dynamics of cellular pathway components and increase the power of pathway analysis techniques18,19. However, this growth poses a formidable challenge for pathway data collection and curation as well as for database, visualization and analysis software, as these data are often fragmented.
The principal motivation for building pathway databases and software tools is to facilitate qualitative and quantitative analysis and modeling of large biological systems using a computational approach. Over 300 pathway or molecular interaction related data resources17 and many visualization and analysis software tools3,20-22 have been developed. Unfortunately, most of these databases and tools were originally developed to use their own pathway representation language, resulting in a heterogeneous set of resources that are extremely difficult to combine and use. This has occurred because many different research groups, each with their own system for representing biomolecules and their interactions in a pathway, work independently to collect pathway data recorded in the literature (estimated from text-mining projects23 to be present in at least 10% of the over 20 million articles currently indexed by PubMed). As a result, researchers waste much time collecting information from different sources and converting its representation from one system to another. They may pay substantial opportunity cost as a result of pathway data fragmentation. For instance, visualization and analysis tools developed for one pathway database cannot be reused for others, making software development efforts more expensive. The situation currently resembles biologists assembling a multi-dimensional puzzle, with thousands of pieces, each one created and shared ad hoc. It is, therefore, imperative to develop computational methods to cope with both the magnitude and fragmented nature of this rapidly expanding and exceedingly valuable pathway information. While independent research efforts are needed to find the best ways to represent pathways, community coordination and agreement on one or a few standard sets of semantics is necessary to be able to efficiently integrate pathway data from multiple sources on a large scale.
A common, inclusive and computable pathway data language is necessary to share knowledge about pathway maps and to facilitate integration and use for hypothesis testing in biology24. A shared language facilitates communication by reducing the number of translations required to exchange data between multiple sources (Figure 1). Developing such a representation is challenging due to the large variety of pathways in biology and the diverse uses of pathway information. Pathway representations frequently use abstractions for metabolic, signaling, gene regulation, protein interaction and genetic interaction data, and these serve as a starting point toward a shared language25. Also, several variants of this common language may be required to answer relevant research questions in distinct fields of biology, each covering unique levels of detail addressing different uses, but these should be rooted in common principles and must remain compatible.
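The reduction in translation effort can be illustrated with a simple counting sketch. It assumes that every converter is one-directional and that, without a shared language, each format must be translated directly into every other format; the function names are illustrative only.

```python
def pairwise_converters(n_formats: int) -> int:
    """Converters needed when every format is translated directly to every other."""
    return n_formats * (n_formats - 1)


def shared_language_converters(n_formats: int) -> int:
    """Converters needed when every format is translated only to and from one shared language."""
    return 2 * n_formats


# Compare both strategies for a few hypothetical numbers of formats.
for n in (3, 10, 300):
    print(n, pairwise_converters(n), shared_language_converters(n))
```

Under these assumptions the direct approach grows quadratically with the number of formats, whereas the shared-language approach grows linearly.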
BioPAX was developed to address these challenges. We have developed BioPAX as a shared language to facilitate communication between diverse software systems and to establish standard knowledge representation of pathway information. BioPAX supports representation of metabolic and signaling pathways, molecular and genetic interactions and gene regulation. Relationships between genes, small molecules, complexes and their states (e.g. post-translational protein modifications, mRNA splice variants, cellular location) are described, including the results of events. Details about the BioPAX language are available in online documentation at http://www.biopax.org. The BioPAX language provides a set of terms, with associated descriptions, to represent many aspects of biological pathways and their annotation. It is implemented as an ontology, a formal system of describing knowledge (Box 1) that helps structure pathway data so that it is more easily processed by computer software (Figure 2). It provides a standard syntax used for data exchange that is based on OWL (Web Ontology Language) (Box 1). Finally, it provides a validator that uses a set of rules to verify whether a BioPAX document is complete, consistent and free of common errors. BioPAX is the only community standard for biological pathway exchange to and from databases, but coordinates with other standards in related areas (Figure 6).
An ontology is a formal system for representing knowledge62. Formal representation is required for computer software to make use of information. Formal knowledge systems have been used in science for thousands of years, for example, Aristotle’s representation of the basic elements of all things (the five elements Fire, Earth, Air, Water and Ether). Well known modern examples include organism taxonomies63 or the Gene Ontology64. A formal representation allows for consistent communication of knowledge between individuals or computer systems and helps manage complexity in information processing as knowledge is broken down into clear concepts that can be considered independently. Ontologies also enable integration of knowledge between independent resources linked on the World Wide Web (WWW). Such linked, structured data form the basis of the semantic web, an extension of the WWW that promises improved information management and search capability61. Representing and sharing knowledge using ontologies is simplified by availability of the standard web ontology language (OWL) (http://www.w3.org/TR/owl-features/). Tools to edit OWL, such as Protégé65, have been developed by the Semantic Web community and adopted in the life sciences.
An ontology is composed of classes, properties (representing relations) and restrictions and is used to define individuals (instances of classes, also known as objects) and values for their properties. Classes (also known as concepts, types) are often arranged into a specialization hierarchy (or taxonomy) where child classes are more specific than, and inherit the properties of, parent classes. For example, in BioPAX, the Biochemical Reaction class is a ‘subclass of’ the Conversion class. Classes may have properties (also known as fields, attributes or slots), which express possible relations to other classes (i.e. they may have values of specific types). For example, a Small Molecule is related to the Chemical Structure class by the property structure. Restrictions (also known as constraints) define allowable values and connections within an ontology. For example, Molecular Weight must be a positive number. Individuals are instances of classes where values occupy the properties of those instances. BioPAX defines the classes, properties and restrictions required to represent biological pathways and leaves creation of the individuals to users (data providers and consumers). Advantages of implementing BioPAX using OWL are that both the ontology and the individuals and values can be stored in the same XML-based format, which makes data transmission easier. Also, OWL is a standard ontology language that is supported by useful software tools for editing, transmitting, querying, reasoning and visualizing.
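The four ingredients described above (classes, properties, restrictions and individuals) can be mirrored loosely in an ordinary programming language. The following Python sketch is an analogy only and is not how BioPAX or OWL is implemented; the field names `formula`, `left` and `right` and the example individuals are chosen here for illustration, and cofactors of the example reaction are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChemicalStructure:
    formula: str  # illustrative field, not a BioPAX property name


@dataclass
class SmallMolecule:
    name: str
    # Property: relates a SmallMolecule to the ChemicalStructure class.
    structure: Optional[ChemicalStructure] = None
    molecular_weight: Optional[float] = None

    def __post_init__(self):
        # Restriction: a molecular weight, when given, must be a positive number.
        if self.molecular_weight is not None and self.molecular_weight <= 0:
            raise ValueError("molecular_weight must be a positive number")


@dataclass
class Conversion:
    name: str
    left: List[SmallMolecule] = field(default_factory=list)
    right: List[SmallMolecule] = field(default_factory=list)


@dataclass
class BiochemicalReaction(Conversion):
    """Child class: more specific than Conversion and inherits its properties."""


# Individuals: instances whose properties are filled with values.
glucose = SmallMolecule("glucose", ChemicalStructure("C6H12O6"), 180.16)
g6p = SmallMolecule("glucose 6-phosphate")
reaction = BiochemicalReaction("glucose phosphorylation", left=[glucose], right=[g6p])
print(isinstance(reaction, Conversion))
```

In OWL the class hierarchy, properties and restrictions are declared as data rather than as program code, which is what allows generic tools to edit, query and reason over them.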
Pathway models described by biologists are generally expressed in scientific language and as network diagrams. An example is the AKT signaling pathway, important in regulating proliferation in many eukaryotic cells and often deregulated in cancer26,27. The AKT pathway is a cell surface receptor activated signaling cascade that transduces signals from the outside to the inside of a cell via a series of molecular binding and protein post-translational regulation events. These include protein-protein interactions and protein kinase mediated phosphorylation events that successively activate downstream kinases to phosphorylate additional proteins and activate or inhibit molecular interactions. The activated pathway eventually results in activation of multiple transcription factors, which turn on sets of genes to promote cell survival. A typical AKT signaling pathway diagram with associated text description can only be interpreted by people, and not computationally. By representing the pathway using the BioPAX language (Figure 3), it can also be interpreted by computer software and made available for numerous uses, such as pathway analysis of gene expression data. Representing a pathway using the BioPAX language sometimes necessitates being more explicit to avoid capturing inconsistent data. For instance, the typical notion of an ‘active protein’ is context dependent, as the same molecule could be active in one cellular context, such as a cellular compartment with a set of potential interacting molecules, and inactive in another context. Thus, capturing the specific mechanism of activation, such as phosphorylation modification, is usually required, and the presence of downstream events that include the modified form signifies that the molecule is active. Interactions where the mechanism of action is unknown can also be specified.
BioPAX covers all major concepts familiar to biologists studying pathways, including metabolic and signaling pathways, gene regulatory networks and genetic and molecular interactions (Table 3). The BioPAX language is distributed as an ontology definition (Figure 4) with associated documentation, a validator and other software tools (Table 1). Five pathway abstractions that are frequently used in pathway databases and software are supported: metabolic pathways, signaling pathways, gene regulatory networks, molecular interactions and genetic interactions.
The first three pathway abstractions are process-oriented. They imply a temporal order and can be thought of as extensions of the standard chemical reaction pathway notation to accommodate biological information. Molecular and genetic interactions, however, imply a static network of connections among system components instead of the temporally ordered process of reactions that defines a metabolic or signaling pathway. BioPAX supports combining these different types of data into a single model that is useful to gain a more complete view of a cellular process.
BioPAX provides many additional constructs, not shown in Figure 4, that are used to store extra details, such as database cross-references, chemical structure, experimental forms of molecules, sequence feature locations and links to controlled vocabulary terms in other ontologies (Supplementary Figure S1). BioPAX reuses a number of standard controlled vocabularies defined by other groups. For example, Gene Ontology40 is used to describe cellular location, PSI-MI vocabularies38 are used to define evidence codes, experimental forms, interaction types, relationship types and sequence modifications, and Sequence Ontology41 is used to define types of sequence regions, such as a promoter region on DNA involved in transcription of a gene. Other useful controlled vocabularies can be referenced, such as the molecule role ontology42.
BioPAX defines additional semantics that are currently only captured in documentation. For instance, physical entities represent pools of molecules and not individual molecules, corresponding to typical semantics used when describing pathways in textbooks or databases. A molecular pool is a set of molecules in a bounded area of the cell, thus it has a concentration. Pools can be heterogeneous and can overlap, as in the case of a protein existing in multiple phosphorylation states.
BioPAX also defines a range of constructs that are represented as ontology classes. Some of these represent biological entities, such as proteins, and are organized into classes that conceptualize the pathway knowledge domain. Others are used to represent annotations and properties of the database representation of biological entities. For instance, BioPAX provides xref classes to represent different kinds of references to databases that can be useful for data integration. These are represented as subclasses of utility class for convenience. A future version of BioPAX would ideally capture these semantics and structure these concepts more formally.
Once pathway data is translated into a standard computable language such as BioPAX, it is easier for software to access it and thereby support browsing, retrieval, visualization and analysis by biologists (Figure 5). This enables efficient re-use of data in different ways avoiding the time-consuming and often frustrating task of translating it between formats (Figure 1). Additionally, it enables uses that would be impractical without a standard format, such as those dependent on combining all available pathway data.
BioPAX can be used to help aggregate large pathway datasets by reducing the required collection and translation effort, for instance using software such as cPath43. Typical biological queries, such as “What reactions involve my protein of interest?” generate more complete answers when querying these larger pathway datasets. Another frequent use is to find pathways that are active in a particular biological context, such as a cell state, as determined by a genome-scale molecular profile measurement. For instance, pathways with multiple differentially expressed genes, as measured by DNA microarrays, may be transcriptionally active in one biological condition and not in another. Functional genomics and pathway data can be imported into software and combined for visualization and analysis to find interesting network regions. A typical workflow involves overlaying molecular profiling data, such as mRNA transcript profiles, on a network of interacting proteins to identify transcriptionally active network regions, which may represent active pathways44. A number of recent papers have used this pathway analysis workflow to highlight genes and pathways that are active in specific model organisms or diseased tissues, such as breast cancer, using gene and protein expression, copy number variants (CNVs) and SNPs19,44-49. BioPAX has been used in a number of these studies to collect and integrate large amounts of pathway information from multiple databases for analysis. For instance, protein expression data was combined with pathway information to highlight the importance of apoptosis in a mouse model of heart disease50. Multiple groups have found that tumor associated mutations are significantly related by pathway information47,48. 
Recently, in a study of rare CNVs in 996 autism spectrum disorder affected individuals, a core set of neuronal development related pathways was found to link dozens of rare mutations to autism that were not significantly linked to the disorder on their own by traditional single-gene association statistics49. These studies highlight the importance of pathway information in explaining the functional consequence of mutations in human disease. BioPAX pathway data can also be converted into simulation models, for instance using differential equations51 or rule-based modeling languages52, to predict how a biological system may function after a gene is knocked out.
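The two uses described above, querying for reactions that involve a protein of interest and overlaying molecular profiling data on pathways, can be sketched on a toy dataset. All pathway, reaction and protein names below are invented, and the ranking simply counts differentially expressed members per pathway; published analyses typically rely on statistical tests rather than raw counts.

```python
from collections import defaultdict

# Toy records: (pathway, reaction, participants). All names are invented.
REACTIONS = [
    ("pathway_A", "r1", {"P1", "P2"}),
    ("pathway_A", "r2", {"P2", "P3"}),
    ("pathway_B", "r3", {"P3", "P4"}),
    ("pathway_B", "r4", {"P5", "P6"}),
]


def reactions_involving(protein, reactions=REACTIONS):
    """Answer 'What reactions involve my protein of interest?'."""
    return sorted(rid for _, rid, parts in reactions if protein in parts)


def rank_pathways(diff_expressed, reactions=REACTIONS):
    """Rank pathways by how many of their members are differentially expressed."""
    members = defaultdict(set)
    for pathway, _, parts in reactions:
        members[pathway] |= parts
    counts = {p: len(genes & diff_expressed) for p, genes in members.items()}
    # Highest count first; ties are broken alphabetically by pathway name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


print(reactions_involving("P3"))
print(rank_pathways({"P1", "P2", "P3"}))
```

Because the query runs over records pooled from both toy pathways, the answer for `P3` includes reactions from each, which is the sense in which larger aggregated datasets yield more complete answers.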
BioPAX is useful for exchanging information among and between data providers and analysis software. Pathway database groups can share the effort of pathway curation by making their pathways available in BioPAX format and exchanging them with others. For example, Reactome8 BioPAX formatted pathways are imported by the NCI/Nature Pathway Information Database (PID)9. Data providers can use existing BioPAX enabled software to add useful new features to their systems. For example, the Cytoscape network visualization software20 can read and display BioPAX formatted data as a network. The Reactome group used this feature to create a pathway visualization tool for their website. Because Reactome data were available in BioPAX format, and Cytoscape could already read BioPAX format, this new feature was easy to implement.
The Paxtools Java programming library for BioPAX has been developed to help software developers readily support the import, export and validation of BioPAX formatted data for various uses in their software (http://www.biopax.org/paxtools/). Using Paxtools and other tools, a range of BioPAX-aware software has been developed, including browsers, visualizers, querying engines, editors and converters (Table 2). For instance, the ChiBE and VisANT pathway visualization tools read BioPAX format22 and the WikiPathways website53, a community wiki for pathways, is working on using BioPAX to help import pathways from numerous sources, including manually edited pathways from biologists. The Pathway Tools software21 and CellDesigner pathway editor54 are developing support for BioPAX-based data exchange. In addition, tools for the storage and querying of Resource Description Framework (RDF - http://www.w3.org/RDF/) datasets, generated within the Semantic Web community, can be used to effectively process BioPAX data.
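Because BioPAX is exchanged in an XML-based OWL syntax, even generic XML tooling can extract basic information from a file. The sketch below does not use Paxtools; it uses only the Python standard library and a hand-written, heavily simplified document inspired by the CHK2 phosphorylation example. The namespace URIs and the element names (`Protein`, `BiochemicalReaction`, `Catalysis`, `left`, `right`, `controller`, `controlled`, `displayName`) are assumed here to match BioPAX Level 3, and the snippet is not claimed to pass the BioPAX validator.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed for RDF and BioPAX Level 3.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BP = "http://www.biopax.org/release/biopax-level3.owl#"

# Hand-written toy document; not taken from any database.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:bp="{BP}">
  <bp:Protein rdf:ID="ATM"><bp:displayName>ATM</bp:displayName></bp:Protein>
  <bp:Protein rdf:ID="CHK2"><bp:displayName>CHK2</bp:displayName></bp:Protein>
  <bp:Protein rdf:ID="CHK2_p"><bp:displayName>phospho-CHK2</bp:displayName></bp:Protein>
  <bp:BiochemicalReaction rdf:ID="r1">
    <bp:left rdf:resource="#CHK2"/>
    <bp:right rdf:resource="#CHK2_p"/>
  </bp:BiochemicalReaction>
  <bp:Catalysis rdf:ID="c1">
    <bp:controller rdf:resource="#ATM"/>
    <bp:controlled rdf:resource="#r1"/>
  </bp:Catalysis>
</rdf:RDF>"""


def load_reactions(xml_text):
    """Return one dict per BiochemicalReaction with participant names resolved."""
    root = ET.fromstring(xml_text)
    # Map '#ID' references to display names for every protein in the document.
    names = {
        "#" + el.get(f"{{{RDF}}}ID"): el.findtext(f"{{{BP}}}displayName")
        for el in root.findall(f"{{{BP}}}Protein")
    }

    def refs(element, prop):
        # Collect the rdf:resource targets of one property of an element.
        return [e.get(f"{{{RDF}}}resource") for e in element.findall(f"{{{BP}}}{prop}")]

    # Group catalyst names by the reaction they control.
    catalysts = {}
    for cat in root.findall(f"{{{BP}}}Catalysis"):
        for target in refs(cat, "controlled"):
            catalysts.setdefault(target, []).extend(names[r] for r in refs(cat, "controller"))

    reactions = []
    for rx in root.findall(f"{{{BP}}}BiochemicalReaction"):
        ref = "#" + rx.get(f"{{{RDF}}}ID")
        reactions.append({
            "id": ref[1:],
            "left": [names[r] for r in refs(rx, "left")],
            "right": [names[r] for r in refs(rx, "right")],
            "catalysts": catalysts.get(ref, []),
        })
    return reactions


print(load_reactions(DOC))
```

A dedicated library or an RDF store is preferable for real data, since plain XML parsing ignores the ontology semantics and handles only the one serialization style shown here.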
The BioPAX language uses a discrete representation of biological pathways frequently used in databases, the literature and textbooks. Dynamic and quantitative aspects of biological processes, including temporal aspects of feedback loops and calcium waves, must also be considered in a complete pathway map. BioPAX does not support this, but coordinates with the SBML and CellML mathematical modeling languages55,56 and a growing software toolset supporting biological process simulation57 which cover these aspects. Detailed information about experimental evidence supporting a pathway map is useful for recognizing the relative levels of support for different pathway aspects. This information is only included in BioPAX for molecular interactions, because that was already defined by the Proteomics Standards Initiative Molecular Interactions (PSI-MI) language58 and it was reused. The BioPAX workgroup makes use of PSI-MI controlled vocabularies and other concepts and coordinates with the PSI-MI workgroup to build these vocabularies in areas of shared interest, such as genetic interactions. Although BioPAX does not aim to standardize how pathways are visualized, work is coordinated with the Systems Biology Graphical Notation (SBGN, http://sbgn.org) community to ensure that SBGN can be used to visualize BioPAX pathways. Currently, most BioPAX concepts can be visualized using SBGN process description (PD) and SBGN activity flow (AF) diagrams and a mapping of BioPAX to SBGN entity relationship (ER) diagrams is under development. BioPAX development is coordinated with the above standardization efforts to ensure complementarity and compatibility. For instance, BioPAX uses controlled vocabularies developed by PSI-MI and can be used to annotate SBML and CellML models (Figure 6). BioPAX aims to be compatible with these and other efforts, so that pathway data can be transformed between alternative representations when needed. 
For instance, PSI-MI to BioPAX and SBML to BioPAX converters are available (Table 2).
While BioPAX facilitates communication of current knowledge, it is challenging for all knowledge representation efforts to anticipate new forms of information. As new types of pathway data and new knowledge representation languages and tools become available, the BioPAX language must evolve through the efforts of a community of scientists that includes biologists and computer scientists.
BioPAX is developed via community consensus among data providers, tool developers and pathway data users. More than 15 BioPAX workshops have been held since November 2002, attended by a diverse set of participants. Incremental versions (or levels) of the BioPAX language were progressively developed at these workshops to focus the group’s efforts on attainable intermediate goals. Broader input came from mailing lists and a community wiki. Community members participated in developing functionality they were interested in, which was integrated into specific levels (see Supplementary Table S1). Level 1 supports metabolic pathways, Level 2 adds support for molecular interactions and post-translational protein modifications by integrating data structures from the PSI-MI format, and Level 3 adds support for signaling pathways, molecular state, gene regulation and genetic interactions (Table 3). It is anticipated that newer BioPAX levels will replace older ones, so use of the most recent level, BioPAX Level 3, is currently recommended. To ease the burden on users and developers, BioPAX aims to be backwards compatible where practical. Level 2 is backwards compatible with Level 1; however, Level 3 involved a major redesign that necessitated breaking backwards compatibility. That said, many core classes have remained compatible with previous levels since Level 1 and software is provided for updating older BioPAX pathways to Level 3 (via Paxtools). All BioPAX material (Table 1) is made freely available under open source licenses via a central website (http://www.biopax.org) in order to encourage broad adoption. The database and tool support (Table 2) of a common language aids the creation, analysis, visualization and interpretation of integrated pathway maps.
In addition to the creation of a shared language for data and software, the process of achieving community consensus spurs innovation in the field of pathway informatics. Community discussion helps resolve technical knowledge representation issues faced by many data providers and users and facilitates the convergence to common terminology and representation. Solutions are discovered in independent research groups and incorporated in new data models and community best practices, which then enable identification of new issues. Thus, community workshops support a positive feedback cycle of knowledge sharing that has led to an accepted BioPAX language and development of better software and databases. We expect this to continue and to support new scientific uses of pathway information, motivated by end user access to valuable integrated pathway information and efficiency gain for database and software development groups. This will especially benefit new pathway databases and software tools that adopt standard representation and software components from the start.
The BioPAX shared language is a starting point on the path to developing complete maps of cellular processes. Additional near and long-term goals remain to be realized to enable effective integration and use of biological pathway information, as described below.
Data must be collected and translated to a standard format for it to be integrated. This process is underway, as the descriptions of millions of interactions in thousands of pathways across many organisms from multiple databases are now available in BioPAX format. However, vast amounts of pathway data remain difficult to access in the literature and in databases that do not yet support standard formats. Increasing use of standards requires promoting and supporting data curation teams and automating more of the data collection process using software. Easy-to-use tools for tasks like pathway editing must also be developed so that biologists can share their data in BioPAX format without substantial investment. Ideally, appropriate software would allow authors to enter data directly in standard formats during the publication process, to facilitate annotation and normalization by curators before incorporation into databases for use by researchers53.
To aid data collection, community best practice guidelines and rules must be developed, led by major data providers, to help diverse groups use BioPAX consistently when multiple ways of encoding the same information exist. This will enable data providers to benefit from automatic syntactic and semantic validation of their data so they can ensure they are sharing data using standard representation and best practices59,60. Data collection and automatic validation will facilitate convergence to generally accepted biological process models.
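Rule-based validation of this kind can be sketched as a list of small checks applied to every record. The two rules and the record layout below are hypothetical examples written for illustration; they are not the rules of the actual BioPAX validator.

```python
def check_reaction_sides(record):
    """Hypothetical rule: a reaction needs participants on both sides."""
    if record.get("type") == "BiochemicalReaction":
        if not record.get("left") or not record.get("right"):
            return "reaction must have participants on both sides"
    return None


def check_xrefs(record):
    """Hypothetical rule: a cross-reference needs a database name and an identifier."""
    for xref in record.get("xrefs", []):
        if not xref.get("db") or not xref.get("id"):
            return "cross-reference needs both a database name and an identifier"
    return None


RULES = [check_reaction_sides, check_xrefs]


def validate(records):
    """Apply every rule to every record and collect (record id, message) pairs."""
    problems = []
    for record in records:
        for rule in RULES:
            message = rule(record)
            if message:
                problems.append((record.get("id"), message))
    return problems


# Invented example records: the first is complete, the other two break one rule each.
records = [
    {"id": "r1", "type": "BiochemicalReaction", "left": ["A"], "right": ["B"]},
    {"id": "r2", "type": "BiochemicalReaction", "left": ["A"], "right": []},
    {"id": "p1", "type": "Protein", "xrefs": [{"db": "UniProt", "id": ""}]},
]
print(validate(records))
```

Keeping each rule as an independent function makes it straightforward for a community to add, remove or revise best-practice checks without touching the rest of the validator.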
Multiple models of the same biological process may usefully co-exist. Ideally, different models could be compared for analysis and hypothesis formulation. However, comparison is difficult because the same concept can be represented in multiple ways due to use of multiple levels of abstraction (such as the hRas protein versus the Ras protein family), use of different controlled vocabularies, data incompleteness or errors. Future research needs to develop semantic integration solutions that recognize and aid resolution of conflicts.
Pathway diagrams are highly useful for communicating pathway information, but their automatic construction, in a biologically intuitive way, from pathway data stored in BioPAX is a major challenge. The SBGN pathway diagram standardization effort provides a starting point towards achieving this goal (Figure 3). Intuitive and automatically drawn biological network visualizations may one day replace printed biology textbooks as the primary resource for knowledge about cellular processes.
As uses of pathway information and technology evolve, so must the BioPAX language. For instance, future BioPAX levels should capture cell-cell interactions, be better at describing pathways where sub-processes are not known or need not be represented, more closely integrate third-party controlled vocabularies and ontologies to ease their use and better encode semantics for easier data validation and reasoning.
Many groups within the BioPAX community, including most pathway data providers and tool developers, are working to achieve the above goals. For instance, Pathway Commons (http://www.pathwaycommons.org) aims to be a convenient single point of access for all publicly accessible pathway information and the WikiPathways project (http://www.wikipathways.org/) seeks to enable pathway curation by individuals53. Also, the semantic web community is developing a set of technologies that promise to ease the integration of information dispersed on the World Wide Web (WWW)61. These technologies will aid pathway data integration, since BioPAX is compatible with them through use of the W3C standard Web Ontology Language, OWL. All of the above research and development activities support the vision of data providers sharing computable maps of biological processes in a standard format for convenient use by a community of pathway researchers.
Funded by the US Department of Energy workshop grant DE-FG02-04ER63931, the caBIG program, the US National Institute of General Medical Sciences workshop grant 1R13GM076939, award number P41HG004118 from the US National Human Genome Research Institute and Genome Canada through the Ontario Genomics Institute (2007-OGI-TD-05). Thanks to many people who contributed to discussions on BioPAX mailing lists, at conferences and at BioPAX workshops, especially Alan Ruttenberg and Jonathan Rees.
Supplementary material
Supplementary Table S1. Author contributions.
Supplementary Table S2. An example BioPAX file describing the phosphorylation and activation of CHK2 by ATM in human. Data was originally obtained from the Reactome database8.
Supplementary Table S3. An example BioPAX file describing the two reactions involved in glucose metabolism in Escherichia coli. Data was originally obtained from the EcoCyc database14.
Supplementary Figure S1. Diagram of BioPAX Level 3 utility classes.