Molecular interactions are the heart of cellular physiology, and protein-protein interactions specifically play a key role in a multitude of cellular functions, from signal transduction to gene expression regulation. Thus, knowledge of the interaction networks of cells is fundamental to understanding the roles played by each protein in the cellular machinery. The recent development of high-throughput methodologies for the study of protein-protein interactions offers great promise for the compilation of the cellular interactomes. The volume of data thus generated requires the development of informatics tools for storing, querying and analyzing the data.
The molecular interaction databases MINT (Molecular INTeraction) [1] and IntAct [3] were conceived for the purpose of storing experimentally verified protein-protein interactions reported in peer-reviewed journals. Not all experimental methods and experimental setups are equally trustworthy. For instance, some techniques, although useful for mapping the interaction domains, are performed in vitro, and therefore in the absence of cellular factors that may modulate the interaction; whereas for in vivo techniques the system is often perturbed in order to facilitate the detection of an interaction. Both MINT and IntAct therefore endeavor to capture a full representation of the interaction data available in the literature to allow users to determine the reliability of an interaction. With the aim of achieving complete literature coverage, the two databases (along with other major public interaction data providers) founded the International Molecular Exchange Consortium (IMEx) [5] for sharing curation efforts and for exchanging completed records on molecular interaction data.
One of the most important recent advances in interaction data annotation is the development of the PSI-MI controlled vocabulary (CV) [6]. This was developed by the Molecular Interactions (PSI-MI) work group of the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) [7] and consists of a standardized and hierarchical ontology of terms used for accurately describing interaction data. Each PSI-MI CV term carries an in-depth definition, to which the various synonyms used in the literature can be mapped. Thus, the PSI-MI CV greatly aids consistent, unambiguous annotation and is a boon to data exchange in several respects. First, it permits annotators to describe interaction data fully without resorting to free text; this makes annotation faster and less error prone. Second, when applied in accordance with the agreed standards, the CV permits seamless exchange of data between databases, because no mapping from one lexicon (or one set of semantic rules) to another is required. For instance, to describe an experiment in which a GST-tagged molecule is over-expressed in a eukaryotic cell, pulled down with affinity beads, and interacting partners identified by mass spectrometry, curators can describe the experiment with the most appropriate CV terms available. In the absence of a shared CV, databases may employ free text descriptions that can vary between individual curators and databases, or have separate in-house CVs that do not map to each other. Thus, IntAct and MINT curate data using the PSI-MI CV terms in order to describe interaction data consistently. Advances in experimental techniques for determining and characterizing interactions are reflected in the continual evolution of the CV. A snapshot of the hierarchical PSI-MI CV is shown in Figure . The data themselves are stored and disseminated in the PSI-MI 2.5 standard, an XML exchange format [8].
An overview of the PSI-MI CV in OLS. CV, controlled vocabulary; MI, Molecular Interactions; OLS, Ontology Lookup Service; PSI, Proteomics Standards Initiative.
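The idea of a hierarchical vocabulary to which literature synonyms are mapped can be illustrated with a minimal sketch. The miniature vocabulary below is hypothetical: the `EX:` identifiers are placeholders rather than real PSI-MI accessions, and the names `CVTerm`, `resolve` and `lineage` are introduced here purely for illustration; none of this reflects how MINT, IntAct or OLS are implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CVTerm:
    """One controlled-vocabulary term with its synonyms and parent link."""
    term_id: str
    name: str
    parent_id: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)


# Hypothetical miniature vocabulary. The EX: identifiers are placeholders,
# NOT real PSI-MI accessions, and the hierarchy is simplified.
TERMS = [
    CVTerm("EX:0001", "interaction detection method"),
    CVTerm("EX:0002", "affinity chromatography", "EX:0001"),
    CVTerm("EX:0003", "pull down", "EX:0002", ["pulldown", "GST pull-down"]),
    CVTerm("EX:0004", "two hybrid", "EX:0001", ["yeast two-hybrid", "Y2H"]),
]

BY_ID: Dict[str, CVTerm] = {t.term_id: t for t in TERMS}


def normalize(label: str) -> str:
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


# Map every preferred name and synonym to its term identifier.
LOOKUP: Dict[str, str] = {}
for term in TERMS:
    for label in [term.name, *term.synonyms]:
        LOOKUP[normalize(label)] = term.term_id


def resolve(label: str) -> Optional[CVTerm]:
    """Return the CV term for a free-text label, or None if it is unknown."""
    term_id = LOOKUP.get(normalize(label))
    return BY_ID[term_id] if term_id else None


def lineage(term_id: str) -> List[str]:
    """Return term names from the given term up to the root of the hierarchy."""
    path: List[str] = []
    current = BY_ID.get(term_id)
    while current is not None:
        path.append(current.name)
        current = BY_ID.get(current.parent_id) if current.parent_id else None
    return path
```

In this sketch, `resolve("GST Pull-Down")` and `resolve("pulldown")` return the same term, which is the property that lets two databases exchange records without mapping between lexicons, and `lineage` recovers the more general parent terms of a specific method.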
Deposition into public databases is a mandatory prerequisite for publication of nucleic acid sequences, protein sequences, and protein structures. However, this is not yet the case for molecular interaction data; journals are only now starting to make such database submission mandatory [9]. Thus, the efficient extraction of molecular interaction data from already-published literature is necessary to populate the publicly available databases. Uploading data from high-throughput experiments also requires a curation effort, because the data are usually supplied as supplementary materials and can only be entered through manual curation. To date, manual curation is the only reliable way to extract interaction data, but it is a time-consuming and laborious process. The development of effective text-mining tools could complement manual curation by speeding up the information extraction process, thus permitting increased literature coverage. For instance, text-mining tools could facilitate the mapping of protein interactors to their UniProtKB [10] identifiers, as well as selecting the text that best describes the interaction and matching this text to appropriate PSI-MI CV terms. However, for a full and accurate description of interactions, a manual element is still required (see Challenges for automatic extraction, below).
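As a rough illustration of the kind of assistance described above, the following sketch maps protein mentions in a sentence to UniProtKB accessions by simple dictionary lookup. It is an assumption-laden toy, not a method used by MINT, IntAct or any BioCreative participant: the lexicon is a tiny hand-written example, the function name `tag_interactors` is hypothetical, and ambiguous names and species disambiguation are not handled, which is one reason a manual element remains necessary.

```python
import re
from typing import Dict, List, Tuple

# Tiny hand-written lexicon for illustration only. A real system would derive
# names and synonyms from UniProtKB and would have to resolve ambiguity.
LEXICON: Dict[str, str] = {
    "p53": "P04637",   # human cellular tumor antigen p53
    "tp53": "P04637",
    "mdm2": "Q00987",  # human E3 ubiquitin-protein ligase Mdm2
    "hdm2": "Q00987",
}

# Alphanumeric tokens are enough for this toy example.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def tag_interactors(sentence: str) -> List[Tuple[str, str]]:
    """Return (mention, accession) pairs, keeping the first mention per accession."""
    hits: List[Tuple[str, str]] = []
    seen = set()
    for match in TOKEN.finditer(sentence):
        mention = match.group(0)
        accession = LEXICON.get(mention.lower())
        if accession and accession not in seen:
            seen.add(accession)
            hits.append((mention, accession))
    return hits
```

A curator-facing tool built along these lines would present the candidate pairs for confirmation rather than write them directly to a database record.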
The BioCreative [11] protein-protein interaction (PPI) task addresses exactly these goals. Competitors were compared and evaluated to determine whose methodologies would most likely be useful in real world scenarios, for instance as an aid to the database curators. To assist with the BioCreative PPI task, IntAct and MINT contributed both a training set for development of algorithms and a test set for objective evaluation of the text-mining tools. Interactions annotated from the test set publications were not publicly released by contributing databases until the BioCreative subtasks were completed. In addition, both databases provided a full description of their curation process, including the paper selection criteria and the quality control processes used to check resulting database records.