At the cellular level, life is a network of molecular interactions. Molecules are synthesized and degraded, transported from one location to another, form complexes with other molecules, and undergo temporary and permanent modifications. However, all of this apparent complexity can be reduced to a simple common representation: each step is an event that transforms input physical entities into output entities.
Much of the power and expressivity of any pathway database lies in the data model used to represent these molecules and their interactions. Reactome uses a frame-based knowledge representation consisting of classes, or 'frames', that describe various concepts such as reaction, pathway, and physical entity. Pieces of biologic knowledge are captured as instances of those classes. Classes have attributes, or 'slots', which hold pieces of information about the instances. For example, each reaction is represented as an instance of the class 'reaction', whose input and output slots are filled with the reactants (input) and products (output) of the given reaction.
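As an illustration only, the frame-and-slot idea can be sketched with Python dataclasses. This is not Reactome's actual schema: the class names and the `input` and `output` slots follow the description above, while the `name` field and the overall layout are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class PhysicalEntity:
    """Frame for anything a reaction can act upon (hypothetical layout)."""
    name: str


@dataclass
class Reaction:
    """Frame for a reaction; `input` and `output` are its slots."""
    name: str
    input: List[PhysicalEntity] = field(default_factory=list)
    output: List[PhysicalEntity] = field(default_factory=list)


# One instance of the 'reaction' class with both slots filled in.
g6p = PhysicalEntity("glucose-6-phosphate")
f6p = PhysicalEntity("fructose-6-phosphate")
isomerization = Reaction("G6P to F6P", input=[g6p], output=[f6p])
```

Each piece of knowledge is then an instance, and querying a slot is ordinary attribute access (`isomerization.output`).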
The Reactome data model extends the concept of a biochemical reaction to include such things as the association of two proteins to form a complex, or the transport of a ubiquitinated protein into the proteasome. Reactions are chained together by shared physical entities; an output of one reaction may be an input for another reaction and serve as the catalyst for yet another reaction.
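The chaining of reactions through shared entities can be made concrete with a small sketch. The entity names `A` to `F`, the `Reaction` layout, and the `downstream` helper are all hypothetical; the only rule taken from the text is that a reaction follows another when it uses one of its outputs as an input or as a catalyst.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Reaction:
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    catalyst: Optional[str] = None


def downstream(reaction: Reaction, all_reactions: List[Reaction]) -> List[Reaction]:
    """Return reactions that consume, or are catalyzed by, an output of `reaction`."""
    result = []
    for other in all_reactions:
        if other is reaction:
            continue
        used = set(other.inputs)
        if other.catalyst is not None:
            used.add(other.catalyst)
        if reaction.outputs & used:
            result.append(other)
    return result


r1 = Reaction("r1", frozenset({"A"}), frozenset({"B"}))
r2 = Reaction("r2", frozenset({"B"}), frozenset({"C"}))                 # B as input
r3 = Reaction("r3", frozenset({"D"}), frozenset({"E"}), catalyst="B")   # B as catalyst
r4 = Reaction("r4", frozenset({"E"}), frozenset({"F"}))                 # unrelated to r1
reactions = [r1, r2, r3, r4]
successors = [r.name for r in downstream(r1, reactions)]
```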
It is convenient, if arbitrary, to give such a set of interlinked reactions a name, thereby organizing them into a goal-directed 'pathway'. In Reactome, the reaction in which fructose-6-phosphate is formed from glucose-6-phosphate is followed by a reaction in which fructose-6-phosphate and ATP are transformed into fructose-2,6-bisphosphate and ADP, and another in which, in response to the positive regulatory effect of fructose-2,6-bisphosphate, fructose-6-phosphate and ATP are transformed into fructose-1,6-bisphosphate and ADP. Together, these and subsequent reactions form the 'glycolysis' pathway. Pathways can be part of larger pathways. Reactome represents glycolysis and gluconeogenesis (glucose synthesis) as parts of 'glucose metabolism', which in turn is a part of a larger pathway named 'metabolism of small molecules'. Reactome pathways are cross-referenced to the Gene Ontology (GO) biologic process ontology [10].
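The nesting of pathways within pathways amounts to a tree whose leaves are reactions. The following sketch assumes a hypothetical `events` slot that may hold reactions or sub-pathways; the pathway names are those used above, and only one glycolysis reaction is included for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Reaction:
    name: str


@dataclass
class Pathway:
    name: str
    # A pathway may contain reactions and other pathways (assumed slot name).
    events: List[Union["Pathway", Reaction]] = field(default_factory=list)


def all_reactions(pathway: Pathway) -> List[Reaction]:
    """Collect every reaction in a pathway, descending into sub-pathways."""
    found = []
    for event in pathway.events:
        if isinstance(event, Pathway):
            found.extend(all_reactions(event))
        else:
            found.append(event)
    return found


glycolysis = Pathway("glycolysis", [Reaction("glucose-6-phosphate to fructose-6-phosphate")])
gluconeogenesis = Pathway("gluconeogenesis")
glucose_metabolism = Pathway("glucose metabolism", [glycolysis, gluconeogenesis])
metabolism = Pathway("metabolism of small molecules", [glucose_metabolism])
reaction_names = [r.name for r in all_reactions(metabolism)]
```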
Reactions that are driven by an enzyme are described as requiring a catalyst activity, modeled in Reactome by linking the macromolecule that provides the activity to the GO molecular function term [10] that describes the activity. In addition, the Reactome data model allows reactions to be modulated by positive and negative regulatory factors. When a precise regulatory mechanism ('positive allosteric regulation', 'noncompetitive inhibition') is known, this information is captured.
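One way to picture catalyst activities and regulation is as two small records attached to a reaction. The sketch below is an assumption-laden illustration: the field names are invented, the enzyme and GO term are placeholders because the text does not name them, and the mechanism is left empty because none is stated for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class CatalystActivity:
    physical_entity: str          # the macromolecule providing the activity
    go_molecular_function: str    # the GO term describing the activity


@dataclass(frozen=True)
class Regulation:
    regulator: str
    positive: bool
    mechanism: Optional[str] = None  # filled in only when the mechanism is known


@dataclass
class Reaction:
    name: str
    catalyst: Optional[CatalystActivity] = None
    regulated_by: List[Regulation] = field(default_factory=list)


def positive_regulators(reaction: Reaction) -> List[str]:
    """Names of entities that up-regulate the reaction."""
    return [r.regulator for r in reaction.regulated_by if r.positive]


reaction = Reaction(
    "fructose-6-phosphate + ATP to fructose-1,6-bisphosphate + ADP",
    # Placeholders: neither the enzyme nor its GO term is named in the text.
    catalyst=CatalystActivity("enzyme (placeholder)", "GO term (placeholder)"),
    regulated_by=[Regulation("fructose-2,6-bisphosphate", positive=True)],
)
```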
Reactome reactions act upon 'physical entities'. Entities include proteins, nucleic acids, small molecules, and even subatomic particles such as photons. A physical entity can be a single molecule, such as a polypeptide chain, or an ensemble of components, such as a macromolecular complex.
Part of the challenge of describing biologic processes in computable form is the complexity of the many transformations in molecules that occur during the course of a pathway. Molecules are modified, moved from place to place, or cleaved, or they may take on different three-dimensional conformations. Many of these modifications are critical to the process under consideration; for example, phosphorylation of a protein at a particular amino acid residue may convert it from an inactive form to an active form. The Reactome data model handles these issues by treating each form of a molecule as a separate physical entity. Under this scheme the unphosphorylated and phosphorylated versions of a protein become separate physical entities, and if the protein can be phosphorylated at different residues then each distinct phosphorylation pattern is treated separately. The corresponding phosphorylation process is annotated as a reaction whose input is the unphosphorylated physical entity and whose output is the phosphorylated version.
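The rule that every modification pattern is its own physical entity can be sketched as follows. The `ProteinForm` class and `phosphorylate` helper are hypothetical; threonine-14 of Cdc2 is taken from the example later in this section, and the second residue (`Y15`) is included purely to illustrate distinct patterns.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProteinForm:
    protein: str
    # Sorted tuple of (residue, modification) pairs, e.g. ("T14", "phospho").
    modifications: Tuple[Tuple[str, str], ...] = ()


def phosphorylate(form: ProteinForm, residue: str) -> ProteinForm:
    """Return the new, distinct entity produced by phosphorylating `residue`."""
    mods = tuple(sorted(form.modifications + ((residue, "phospho"),)))
    return ProteinForm(form.protein, mods)


cdc2 = ProteinForm("Cdc2")
cdc2_pT14 = phosphorylate(cdc2, "T14")
cdc2_pY15 = phosphorylate(cdc2, "Y15")   # illustrative second site
cdc2_double = phosphorylate(cdc2_pT14, "Y15")

# The phosphorylation reaction has the unmodified form as input
# and the modified form as output.
reaction_input, reaction_output = cdc2, cdc2_pT14
```

Because the modifications are stored in sorted order, the same pattern reached by different routes compares equal, while different patterns remain distinct entities.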
Because the functions of biologic molecules critically depend on their subcellular locations, chemically identical entities located in different compartments are represented as distinct physical entities. For example, extracellular D-glucose and cytosolic D-glucose are distinct Reactome entities. This allows us to treat transport events as ordinary reactions; glucose transport is a reaction that takes extracellular D-glucose as its input, and produces cytosolic D-glucose as its output. To annotate the subcellular locations of molecules, we use a subset of the GO cellular component ontology [10] that has been pruned to remove compartments that overlap with others, such as 'intracellular'.
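Treating the compartment as part of an entity's identity means that transport needs no special machinery. In the sketch below, the `Entity` and `Reaction` layouts and the `is_transport` helper are assumptions; the compartment strings stand in for GO cellular component terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    compartment: str  # stands in for a GO cellular component term


@dataclass(frozen=True)
class Reaction:
    name: str
    input: Entity
    output: Entity


def is_transport(reaction: Reaction) -> bool:
    """True when the same chemical appears in a different compartment."""
    return (reaction.input.name == reaction.output.name
            and reaction.input.compartment != reaction.output.compartment)


glucose_out = Entity("D-glucose", "extracellular region")
glucose_in = Entity("D-glucose", "cytosol")
glucose_transport = Reaction("glucose transport", glucose_out, glucose_in)
```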
Reactome also treats molecules that have distinct biologically significant conformational states as separate physical entities. For example, a key event in photoreception in the retina is the photon-triggered isomerization of the rhodopsin 11-cis form to the all-trans form. In Reactome, each functionally significant rhodopsin isomer can be treated separately.
Physical entities that represent the same chemical in different compartments, configurations, or modification states share much of the same information, and it would be inefficient and error prone to replicate that information for each entity. It is also desirable to identify all physical entities that share the same basic chemical structure or sequence. Reactome handles this using the concept of a 'reference entity', which captures the invariant features of a molecule such as its name, reference chemical structure, amino acid or nucleotide sequence (when relevant), and accession numbers in reference databases. The data model allows each physical entity to refer to its reference entity, and vice versa. For the common case of a protein that has undergone post-translational covalent modification, the Reactome data model records the location and type of the modification using the 'modified residue' class.
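The two-way link between physical entities and their reference entity might be sketched as below. The field names (`forms`, `modified_residues`, `position`, `modification`) are assumptions; the Cdc2 accession is the one cited later in this section. The reverse link is excluded from comparison and `repr` so that the cyclic structure stays well behaved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReferenceEntity:
    """Invariant features shared by every form of a molecule."""
    name: str
    accession: str
    # Reverse link to all physical entities that refer to this record.
    forms: List["PhysicalEntity"] = field(default_factory=list, repr=False, compare=False)


@dataclass
class ModifiedResidue:
    position: int
    modification: str


@dataclass
class PhysicalEntity:
    reference: ReferenceEntity
    compartment: str
    modified_residues: List[ModifiedResidue] = field(default_factory=list)

    def __post_init__(self):
        # Keep the reference entity's reverse link up to date.
        self.reference.forms.append(self)


cdc2_ref = ReferenceEntity("Cdc2", "UniProt:P06493")
unmodified = PhysicalEntity(cdc2_ref, "cytosol")
phospho = PhysicalEntity(cdc2_ref, "cytosol", [ModifiedResidue(14, "phosphothreonine")])
```

Shared information such as the accession lives in one place, and `cdc2_ref.forms` retrieves every entity built on the same sequence.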
Most biologic reactions involve not simple molecules, but large macromolecular complexes, and Reactome treats each complex as a named physical entity. This allows us to describe molecular assembly operations, such as the recruitment of double-strand break repair complex components to the site of DNA damage, as a series of reactions in which the inputs and outputs are intermediates in the formation of the DNA repair complex. In the data model, complexes refer to all of the components that they contain, so that it is possible to fetch all complexes that involve a particular component or to dissect a complex to find the individual molecules that make it up. In the data model, a physical entity composed of a single molecule is known as a 'simple entity', whereas entities comprising two or more simple entities belong to the 'complex' class.
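Both queries mentioned here, dissecting a complex and finding the complexes that contain a component, follow directly from storing components on the complex. The sketch below uses hypothetical protein names and helper functions, and allows complexes to nest as assembly intermediates.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class SimpleEntity:
    name: str


@dataclass(frozen=True)
class Complex:
    name: str
    components: Tuple[Union["Complex", SimpleEntity], ...]


def dissect(entity: Union[Complex, SimpleEntity]) -> List[SimpleEntity]:
    """Flatten an entity into the simple entities it is built from."""
    if isinstance(entity, SimpleEntity):
        return [entity]
    parts = []
    for component in entity.components:
        parts.extend(dissect(component))
    return parts


def complexes_containing(target: SimpleEntity, complexes: List[Complex]) -> List[Complex]:
    """Fetch every complex that involves `target`, directly or via a sub-complex."""
    return [c for c in complexes if target in dissect(c)]


# Hypothetical assembly: A and B associate, then C is recruited.
a, b, c = SimpleEntity("protein A"), SimpleEntity("protein B"), SimpleEntity("protein C")
ab = Complex("A:B", (a, b))
abc = Complex("A:B:C", (ab, c))
```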
Like simple entities, complexes that have catalytic activity are cross-referenced to the GO molecular function ontology. When appropriate, we record which component or molecular domain of the complex has the active site for the activity; this aids in the transfer of knowledge to the GO database, which associates molecular function terms with protein monomers and cannot currently accept information about entire complexes.
There are many cases in which it is convenient to group physical entities together into sets on the basis of common properties. For example, the SLC28A2 plasma membrane nucleoside transporter operates equally well on adenosine, guanosine, inosine, and uridine; these four molecules are interchangeable from the point of view of the transport system [12]. In order to avoid creating four almost identical reactions for these nucleosides, Reactome's data model allows the creation of two 'defined sets' for extracellular and cytosolic nucleosides. SLC28A2-mediated nucleoside transport can then be described as a single reaction that converts the extracellular nucleoside set into the cytosolic set. Defined sets are also used to describe protein paralogs that are functionally interchangeable, equivalent RNA splice variants, and isoenzymes.
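The saving offered by defined sets can be seen by expanding the single set-level reaction back into the member-level events it stands for. The `DefinedSet` layout and the `member_level_events` helper are hypothetical; the nucleosides and compartments are those named above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DefinedSet:
    name: str
    compartment: str
    members: Tuple[str, ...]


def member_level_events(source: DefinedSet, target: DefinedSet) -> List[Tuple[str, str]]:
    """Pair each member of `source` with the same-named member of `target`."""
    return [(f"{m} [{source.compartment}]", f"{m} [{target.compartment}]")
            for m in source.members if m in target.members]


NUCLEOSIDES = ("adenosine", "guanosine", "inosine", "uridine")
extracellular = DefinedSet("nucleosides", "extracellular", NUCLEOSIDES)
cytosolic = DefinedSet("nucleosides", "cytosol", NUCLEOSIDES)

# One annotated reaction (set to set) implies four member-level transport events.
implied = member_level_events(extracellular, cytosolic)
```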
Another type of set used by Reactome is the 'candidate set'. This is used when the state of knowledge is incomplete and it is believed that one out of several candidate physical entities is responsible for a particular task. This is used, for instance, to express the assertion that, 'The presence of a particular cyclin-dependent kinase is responsible for this step in cell cycle progression, but we do not know which one.'
Finally, there is an 'open set' class, which is used for cases in which all members of the set cannot be explicitly enumerated. For example, in the RNA transcription pathways, we need to describe reactions that involve all mRNAs but we cannot enumerate all distinct mRNA molecules. Instead, we use an open set named 'mRNA'. As we add distinct mRNA molecules to the database, they become a part of this set, allowing them to be treated simultaneously from the perspective of a generic mRNA subject to transcriptional and splicing reactions, as well as from the point of view of a distinct mRNA that is, for example, under the control of a particular transcription factor.
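The three kinds of set differ in what membership means rather than in how members are stored, which the following sketch tries to capture. The `SetKind` enumeration, the `EntitySet` class, and all member names are hypothetical placeholders; the text does not name specific kinases or mRNAs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SetKind(Enum):
    DEFINED = "defined"      # members are functionally interchangeable
    CANDIDATE = "candidate"  # one member is responsible, but which is unknown
    OPEN = "open"            # members cannot all be enumerated


@dataclass
class EntitySet:
    name: str
    kind: SetKind
    members: List[str] = field(default_factory=list)

    def add(self, member: str) -> None:
        """Add a member once; open sets grow this way as curation proceeds."""
        if member not in self.members:
            self.members.append(member)

    def is_enumerated(self) -> bool:
        """False for open sets, whose listed members are never the full set."""
        return self.kind is not SetKind.OPEN


cdk = EntitySet("cyclin-dependent kinase (candidates)", SetKind.CANDIDATE,
                ["kinase candidate 1", "kinase candidate 2"])  # placeholder names
mrna = EntitySet("mRNA", SetKind.OPEN)
mrna.add("mRNA X")  # placeholder for a newly curated, distinct mRNA
mrna.add("mRNA X")  # adding the same molecule again has no effect
```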
Together, the simple entity, complex, and set classes allow detailed and flexible annotation of physical entities and their interactions. For example, Cdc2 protein (Universal Protein Resource [UniProt]:P06493) can be phosphorylated in the cytosol at threonine-14. The phosphorylated form of Cdc2 is distinct from unmodified Cdc2. Both the phosphorylated and unphosphorylated forms can also be found in complexes with cyclins B1 or B2. Each of these cyclins is represented by its own distinct entity, and the two of them together are represented collectively by a defined set called 'cyclin B'. The complexes between the cyclins and Cdc2 are represented as two instances of the complex class: one complex consisting of the 'cyclin B' defined set and unphosphorylated Cdc2, and the other consisting of the 'cyclin B' defined set and phosphorylated Cdc2. These complexes then take part in the various reactions of the cell cycle pathway. We can simultaneously create complexes of Cdc2 with individual cyclins if a particular cyclin/Cdc2 complex does something that the others do not.
The use of sets simplifies both the curation and the querying of Reactome. For example, the web query interface allows researchers to search for pathways involving 'cyclin B' and obtain a comprehensive list. Without this functionality, a researcher might have to search serially for each member of the set of entities that together comprise cyclin B.
A critical aspect of the Reactome data model is evidence tracking imposed at every level. Every reaction entered into the knowledge base must be backed up by evidence from the biomedical literature, and documented with appropriate citations. Reactome recognizes two types of evidence: direct and indirect. Direct evidence for a reaction in humans comes from a direct assay on human cells. However, much of current biochemical knowledge has been developed from experiments and observations in nonhuman species. Insights obtained in one species are then projected onto other species on the basis of sequence similarity of genes or proteins between the respective species. When work in one species is used to make inferences about a human pathway, Reactome records it as indirect evidence.
In practice, we use nonhuman experimental data to document an inferred human biologic process in two steps. First, we create a reaction that describes the reaction in the nonhuman species, using physical entities that are appropriate for the organism that was directly assayed, for instance Drosophila Notch protein. The papers that describe the experiments used to characterize the nonhuman reaction become the direct evidence for that reaction in the knowledge base. Next, we create an inferred reaction that describes the reaction in human, using human physical entities, for example the four human Notch paralogs. The nonhuman reaction is now used as the evidence to support the inferred human reaction. In this way, the complete chain of evidence is preserved from primary experiment to nonhuman reaction, to the inferred human reaction.
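The two-step procedure yields a short linked chain that can be walked from the inferred human reaction back to the primary literature. The sketch below is illustrative: the `inferred_from` and `literature` slots, the reaction names, and the citation placeholder are assumptions, not Reactome records.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Reaction:
    name: str
    species: str
    literature: List[str] = field(default_factory=list)  # direct evidence
    inferred_from: Optional["Reaction"] = None            # indirect evidence

    def evidence_type(self) -> str:
        return "indirect" if self.inferred_from is not None else "direct"


def evidence_chain(reaction: Reaction) -> List[Reaction]:
    """Follow `inferred_from` links back to the directly assayed reaction."""
    chain = [reaction]
    while chain[-1].inferred_from is not None:
        chain.append(chain[-1].inferred_from)
    return chain


# Step 1: the reaction as assayed in the nonhuman species, with its papers.
fly = Reaction("Notch reaction (fly)", "Drosophila",
               literature=["<citation placeholder>"])
# Step 2: the inferred human reaction, supported by the nonhuman reaction.
human = Reaction("Notch reaction (human)", "Homo sapiens", inferred_from=fly)

chain = [(r.species, r.evidence_type()) for r in evidence_chain(human)]
```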
Reactome uses well recognized external identifiers to establish connections with other public biologic databases. In addition to GO terms to describe molecular function, biologic process and subcellular compartment, we use ChEBI (Chemical Entities of Biological Interest [13]) and UniProt [14] to reference small molecules and protein sequences, respectively. These cross-references are mandatory fields in the corresponding Reactome records and are hand checked by Reactome staff. In addition, we automatically cross-reference proteins, genes, reactions, and other objects to a variety of popular external databases, including Entrez Gene [15], Online Mendelian Inheritance in Man (OMIM) [16], and Kyoto Encyclopedia of Genes and Genomes (KEGG) [17] (Table ). We chose ChEBI and UniProt over other potential reference datasets because these resources are heavily curated to remove redundancy.
Database cross-references in Reactome
The data model includes several classes to describe special cases such as biologic polymers and reactions that occur concurrently within a pathway, as well as utility classes to aid in curation workflow management and the website user interface. There are also classes in the data model that allow us to describe functional submolecular domains in proteins, nucleotide sequences, and other macromolecules.
Pathways and reactions can have attached summations (human-readable text) and illustrations. Summations orient the reader and summarize the process in textbook style. Summations can also be used to add comments that do not fit into the Reactome data model.