In PID, an interaction is an event with its participating molecules and conditions. A PID pathway is a network of these events connected by the participant molecules. PID recognizes four kinds of molecules: small molecules (called compounds), RNA, proteins and complexes. PID recognizes five kinds of events: gene regulation (called transcription, but encompassing both transcription and translation), molecule transport (called translocation), small-molecule conversion (called reaction), protein–protein interactions (called modification) and black-box processes whose internal composition is not provided (called macroprocesses). In addition, an entire pathway can be abstracted and used as a single event in another pathway. As a participant in an event, a molecule may have one of four roles: input, output, positive regulator and negative regulator. These roles define simple relations: an interaction consumes its inputs (but not its regulators) and produces its outputs; and the inputs, positive regulators and the absence of negative regulators are jointly the necessary and sufficient causes of the interaction.
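The roles and their relations described above can be sketched as a small data structure. This is only an illustrative sketch, not PID's actual schema: the class names, the `can_fire` helper, and the example molecules ("substrate", "kinase", "inhibitor") are all hypothetical, and a plain string stands in for a full molecule use.

```python
from dataclasses import dataclass
from enum import Enum


class MoleculeType(Enum):
    COMPOUND = "compound"   # small molecule
    RNA = "rna"
    PROTEIN = "protein"
    COMPLEX = "complex"


class EventType(Enum):
    TRANSCRIPTION = "transcription"   # gene regulation
    TRANSLOCATION = "translocation"   # molecule transport
    REACTION = "reaction"             # small-molecule conversion
    MODIFICATION = "modification"     # protein-protein interaction
    MACROPROCESS = "macroprocess"     # black-box process


class Role(Enum):
    INPUT = "input"
    OUTPUT = "output"
    POSITIVE_REGULATOR = "positive regulator"
    NEGATIVE_REGULATOR = "negative regulator"


@dataclass(frozen=True)
class Participant:
    molecule: str                 # hypothetical: a name stands in for a molecule use
    molecule_type: MoleculeType
    role: Role


@dataclass(frozen=True)
class Interaction:
    event_type: EventType
    participants: tuple

    def with_role(self, role):
        """Names of all participants that play the given role."""
        return {p.molecule for p in self.participants if p.role is role}

    def consumed(self):
        # Inputs are consumed; regulators are not.
        return self.with_role(Role.INPUT)

    def produced(self):
        return self.with_role(Role.OUTPUT)

    def can_fire(self, present):
        """True if inputs and positive regulators are present and no negative regulator is."""
        required = self.with_role(Role.INPUT) | self.with_role(Role.POSITIVE_REGULATOR)
        blockers = self.with_role(Role.NEGATIVE_REGULATOR)
        return required <= present and not (blockers & present)


# Hypothetical example: a modification event with one regulator of each sign.
phos = Interaction(EventType.MODIFICATION, (
    Participant("substrate", MoleculeType.PROTEIN, Role.INPUT),
    Participant("kinase", MoleculeType.PROTEIN, Role.POSITIVE_REGULATOR),
    Participant("inhibitor", MoleculeType.COMPOUND, Role.NEGATIVE_REGULATOR),
    Participant("substrate-P", MoleculeType.PROTEIN, Role.OUTPUT),
))
```

With this sketch, `phos.can_fire({"substrate", "kinase"})` holds, while adding "inhibitor" to the set of present molecules blocks the event; only "substrate" is reported as consumed.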
Each molecule in PID has a defining entity, called a basic molecule. Basic molecules are distinguished by their nucleotide or amino acid sequence (for macromolecules) or by their chemical formula (for small molecules). While PID does not record the sequence of a macromolecule or the chemical formula for a small molecule, each protein or RNA is associated with a UniProt or Entrez Gene accession and most small molecules are associated with Chemical Abstracts Service (CAS) registry numbers. A basic molecule has a primary name and may have multiple aliases. Each molecule use, as an interactant in an interaction or as a component of a complex, references its corresponding basic molecule. Each molecule use may have additional information, including posttranslational modifications (for proteins) and cellular location and activity state (for all molecule types).
A basic protein molecule has a single identifying UniProt accession associated with a particular amino acid sequence. If the particular isoform of a protein used in an interaction is not known, then the basic protein molecule may be associated with an Entrez Gene identifier instead of a UniProt accession; in PID, this method of identifying proteins is restricted almost entirely to the uncurated section of the database imported from BioCarta. A use of a protein as a participant in an interaction or component of a complex may have additional attributes: posttranslational modifications, an abstract activity-state attribute and a cellular location attribute. Currently, PID uses 13 types of posttranslational modifications, with phosphorylation being by far the most frequently used modification. The abstract activity-state attribute, with values such as ‘active’ and ‘inactive’, allows curators to distinguish functionally different forms of a protein even if the precise covalent modifications are not known. Values for the cellular location attribute are drawn from the Gene Ontology (GO) cellular component vocabulary (5). Cleaved subunits of a precursor protein are not distinguished by the posttranslational modification mechanism; rather they are treated as basic protein molecules separate from each other and from the precursor. However, PID explicitly relates the cleaved subunit to its precursor and records the cleavage coordinates when these are known. A PID protein corresponds roughly to a BioPAX Level 3 protein reference, while a BioPAX Level 3 protein corresponds to a PID protein use (with posttranslational modifications and cellular location).
Posttranslational modifications in NCI-Nature Curated data source
PID allows the definition of generic proteins, complexes, small molecules and RNA molecules. A generic molecule is called a family, but is not restricted to the traditional protein families defined by sequence similarity: any set of proteins (or other type of molecule) that are in some respect functionally equivalent may be grouped in a family. Individual protein members of a protein family may have posttranslational modifications or activity states. The family itself can be used as a participant in an interaction, or as a component of a complex.
Because data are entered by multiple curators and because the database contains data from multiple sources, PID needs rules for determining equivalence of molecules. Two basic molecules that are neither families nor complexes are equivalent if they have the same external database accession (e.g. UniProt or Entrez Gene), or if, in cases where neither has an external database accession, they have the same name. Two molecule uses (as participant in an interaction or component of a complex or member of a family) are equivalent if they refer to the same basic molecule, and have the same (or no) posttranslational modifications, and have the same (or no) activity-state attribute, and have the same (or no) cellular location attribute. Two basic families (or complexes) are equivalent if they have the same number of members (or components) and if, for each member (component) of one, there is an equivalent member (component) in the other. These rules are applied recursively to define, for example, equivalent uses of complexes with components that are families. Equivalence of molecule uses is the basis on which novel networks are constructed: any two interactions in the database may be joined in a network if one interaction has a participant that is equivalent to a participant in the other interaction. Analogous rules of equivalence are implemented for interactions and entire networks, allowing equivalent (redundant) interactions to be pruned from the novel networks.
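The recursive equivalence rules above might be sketched as follows. This is a minimal illustration under stated assumptions, not PID's implementation: the `Basic`, `Group` and `Use` classes, the function names and the accessions "ACC1" and "ACC2" are hypothetical, and "the same (or no)" attribute is read as "both uses carry the same value, or both carry none".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Basic:
    """A basic molecule that is neither a family nor a complex."""
    name: str
    accession: Optional[str] = None   # e.g. a UniProt or Entrez Gene accession


@dataclass(frozen=True)
class Group:
    """A basic family (kind='family') or complex (kind='complex')."""
    kind: str
    parts: Tuple["Use", ...]          # members or components, as molecule uses


@dataclass(frozen=True)
class Use:
    """A molecule use: a basic molecule plus optional attributes."""
    molecule: object                  # a Basic or a Group
    ptms: frozenset = frozenset()     # posttranslational modifications
    activity: Optional[str] = None
    location: Optional[str] = None


def basic_equivalent(a, b):
    if isinstance(a, Basic) and isinstance(b, Basic):
        # Same accession, or same name when neither has an accession.
        if a.accession is not None or b.accession is not None:
            return a.accession == b.accession
        return a.name == b.name
    if isinstance(a, Group) and isinstance(b, Group):
        if a.kind != b.kind or len(a.parts) != len(b.parts):
            return False
        # Literal reading of the rule: each part of one has an equivalent
        # part in the other (this does not enforce a strict one-to-one match).
        return all(any(use_equivalent(x, y) for y in b.parts) for x in a.parts)
    return False


def use_equivalent(a, b):
    return (basic_equivalent(a.molecule, b.molecule)
            and a.ptms == b.ptms
            and a.activity == b.activity
            and a.location == b.location)


def joinable(participants_1, participants_2):
    """Two interactions may be joined if they share an equivalent participant."""
    return any(use_equivalent(x, y) for x in participants_1 for y in participants_2)


# Hypothetical example: the same complex entered by two curators.
a = Basic("protein A", accession="ACC1")
a_alias = Basic("A (alias)", accession="ACC1")
b = Basic("protein B", accession="ACC2")
complex_1 = Use(Group("complex", (Use(a, location="nucleus"), Use(b))))
complex_2 = Use(Group("complex", (Use(b), Use(a_alias, location="nucleus"))))
```

Here `use_equivalent(complex_1, complex_2)` is true despite the different component order and the differing primary name, because component matching falls back on the shared accession. Changing the location of one component would make the two uses distinct.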
An interaction may be supported by one or more citations to the literature. Currently, interactions in the NCI-Nature Curated data source are annotated with 3105 distinct PubMed references. In addition, an interaction may be annotated with one or more evidence codes that specify the kind of evidence adduced in the citations in support of the interaction.
Evidence in NCI-Nature Curated data source
A predefined pathway is a curated pathway representing a known biological process. At present, every pathway stored in the PID database is a predefined pathway and every interaction in the database is a member of at least one predefined pathway. However, the search and retrieval tools allow the user to compose novel pathways from interactions defined in the predefined pathways. This ability to recombine interactions and to thus create novel pathways is a distinguishing feature of PID.
Since the original BioCarta diagrams were not associated with an explicit data model, the import of the BioCarta pathway data did not challenge the PID data model. The original BioCarta diagrams show protein–protein interactions, but the semantics of the connecting arrows are implicit. The import of these pathways into PID required the interpretation of each interaction and the manual encoding of the semantics in the PID data model. This was tedious, but since the original BioCarta pathways were underspecified, the process did not entail loss of information. In contrast, the import of the Reactome data is automated but does entail some loss of information. PID uses Reactome's BioPAX export as the source for the imported Reactome data. Some features of Reactome are not expressible in BioPAX Level 2. For example, Reactome has ‘entity sets’, which correspond roughly to PID's molecule families. However, since BioPAX Level 2 lacks the means to specify an entity set, this information was lost in the import process. Along with other important enhancements, this is being corrected in BioPAX Level 3. On the other hand, Reactome has some features that are expressible in BioPAX Level 2 but have no correspondence in PID. For example, in Reactome it is possible to explicitly specify that one interaction is a predecessor (‘preceding event’) of another, and this is also directly expressible in BioPAX Level 2. However, in PID the predecessor relation is implicit, inferred from the identity of interactants and the directionality of inputs and outputs. Consequently, the predecessor relation between two Reactome interactions that do not share an interactant is lost in the PID import.
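The loss of explicit predecessor relations can be illustrated with a short sketch. It assumes, for illustration only, that PID's implicit relation holds when an output of one interaction is an input or regulator of another, and it uses plain string equality in place of PID's molecule-use equivalence. The function name, the dictionary layout and the interactions "r1" to "r3" are hypothetical.

```python
def implicit_predecessors(interactions):
    """Return (earlier, later) pairs where an output of one feeds the other."""
    edges = set()
    for a_name, a in interactions.items():
        for b_name, b in interactions.items():
            if a_name == b_name:
                continue
            # Assumption: sharing an output with an input or regulator
            # is what makes one interaction precede another.
            if a["outputs"] & (b["inputs"] | b["regulators"]):
                edges.add((a_name, b_name))
    return edges


# Hypothetical interactions; r2 and r3 share no interactant.
interactions = {
    "r1": {"inputs": {"A"}, "regulators": set(), "outputs": {"B"}},
    "r2": {"inputs": {"B"}, "regulators": {"K"}, "outputs": {"C"}},
    "r3": {"inputs": {"D"}, "regulators": set(), "outputs": {"E"}},
}

# Explicit 'preceding event' relations as a source such as Reactome might state them.
explicit = {("r1", "r2"), ("r2", "r3")}

implicit = implicit_predecessors(interactions)
lost = explicit - implicit

print(implicit)  # {('r1', 'r2')}
print(lost)      # {('r2', 'r3')}
```

The relation between "r1" and "r2" survives because the output "B" reappears as an input, whereas the stated relation between "r2" and "r3" cannot be recovered from the participants alone.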