The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access database that houses genetic and protein interactions curated from the primary biomedical literature for all major model organism species and humans. As of September 2014, the BioGRID contains 749 912 interactions as drawn from 43 149 publications that represent 30 model organisms. This interaction count represents a 50% increase compared to our previous 2013 BioGRID update. BioGRID data are freely distributed through partner model organism databases and meta-databases and are directly downloadable in a variety of formats. In addition to general curation of the published literature for the major model species, BioGRID undertakes themed curation projects in areas of particular relevance for biomedical sciences, such as the ubiquitin-proteasome system and various human disease-associated interaction networks. BioGRID curation is coordinated through an Interaction Management System (IMS) that facilitates the compilation of interaction records through structured evidence codes, phenotype ontologies, and gene annotation. The BioGRID architecture has been improved in order to support a broader range of interaction and post-translational modification types, to allow the representation of more complex multi-gene/protein interactions, to account for cellular phenotypes through structured ontologies, to expedite curation through semi-automated text-mining approaches, and to enhance curation quality control.
Massive increases in high-throughput DNA sequencing technologies (1) have enabled an unprecedented level of genome annotation for many hundreds of species (2–6), which has led to tremendous progress in the understanding of gene organization, genome evolution and the genetic basis for disease. At the same time, sequencing-based methods have uncovered many intricacies of gene regulation at a genomic scale, including expression patterns, alternative splicing, non-coding transcription and the myriad of regulatory factors that bind DNA and RNA (7–11). Proteomics approaches, largely based on mass spectrometry, have similarly mapped the abundance and post-translational modifications of proteins at impressive depth of coverage (12–15). At the phenotypic level, genome-wide reagent collections for systematic perturbation of gene function have led to compendia of functional profiles for many different phenotypic characteristics (16–19). This wealth of new data has been accrued in model organism systems, and particularly in humans, in both normal and disease contexts. In spite of this data deluge, the fundamental problem of how genotype is translated into phenotype, and how genetic mutations can affect this complex relationship, remains a formidable roadblock in our understanding of fundamental biology and the basis for human disease.
It is now evident that genes and their encoded proteins function in the context of a vast, dynamic network of interactions (20–23). The generation of comprehensive genetic and protein interaction maps will thus be essential for unraveling the many complexities of biological processes and for understanding the general genotype to phenotype mapping problem (24). For example, the integration of genetic interaction networks with other genome-wide data types has helped to explain how sets of genes function differently in specific cellular contexts, conditions or tissues (25–28). The systematic experimental identification and characterization of protein and genetic interaction networks in major model organism species and humans has continued to grow in pace and scale (21,23,29–35). With such interaction datasets in hand, it has been possible to implement computational methods for analysis and prediction of the response of cellular networks to perturbation by disease-associated mutation or pathogen infection (28,36–38).
The comprehensive annotation and compilation of all known biological interactions in a computable form is essential for network-based approaches to understanding biological systems and human disease (39). The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) was established in order to help capture biological interaction data from the primary biomedical literature and to provide this data in a readily computable format (40). BioGRID collects and annotates genetic and protein interaction data from the published literature for all major model organism species and humans. When available, data on the influence of protein post-translational modifications, including phosphorylation and ubiquitination, is also captured. The complete BioGRID dataset is freely accessible through a dedicated web-based search portal and is also available for download in various standardized formats. BioGRID data content is updated and permanently archived on a monthly basis, and in addition to the BioGRID web interface, is disseminated to the research community through model organism database (MOD) partners (41–46) and other biological resources and meta-databases (47–52). The interaction datasets in BioGRID thus provide a resource for biomedical researchers who study the function of individual genes and pathways, as well as for computational biologists who analyze the properties of large biological networks.
The current BioGRID release (August 2014, version 3.2.115) houses a total of 749 912 interactions (515 032 non-redundant) comprising 471 525 protein (physical) interactions (318 069 non-redundant) and 278 387 genetic interactions (204 801 non-redundant) (Table 1). The number of interactions housed in BioGRID has increased by ~50% since the 2013 BioGRID update (40). All data in BioGRID has been manually curated from a total of 43 149 articles indexed in PubMed (Figure 1). BioGRID also currently contains data on 42 907 protein phosphorylation sites, which are mainly drawn from high-throughput mass spectrometry studies, as housed in the PhosphoGRID database (53). In 2014, Google Analytics reported that the BioGRID received on average 88 080 page views and 12 399 unique visitors per month, versus 69 237 page views and 10 110 unique visitors per month in 2012. BioGRID data files were downloaded on average 9256 times per month in 2014, compared with 6900 downloads per month in 2012. These statistics do not include the widespread dissemination of BioGRID records by various partner databases and meta-resources. In 2014, the BioGRID user base was located primarily in the USA (30%), followed by China (8%), United Kingdom (7%), Canada (6%), Germany (6%), Japan (6%), India (4%), France (4%), Spain (2%) and all other countries (27%).
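The difference between total and non-redundant counts can be illustrated with a short sketch. It assumes, purely for illustration, that a non-redundant interaction is a unique unordered pair of interactors within one interaction class (physical or genetic); the exact collapsing rule used by BioGRID is not stated in the text, and the record layout, gene symbols and PubMed IDs below are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records: (interactor A, interactor B, interaction class, PubMed ID).
RECORDS = [
    ("CDC28", "CLN2", "physical", 1001),
    ("CLN2", "CDC28", "physical", 1002),  # same pair, reported again in reverse order
    ("CDC28", "CLN2", "genetic", 1003),   # same pair, different interaction class
    ("SIC1", "CDC28", "physical", 1004),
]


def count_interactions(records):
    """Return {class: (total, non_redundant)} under the assumed collapsing rule."""
    totals = defaultdict(int)
    unique_pairs = defaultdict(set)
    for a, b, kind, _pmid in records:
        totals[kind] += 1
        # A frozenset ignores direction, so A-B and B-A collapse to one pair.
        unique_pairs[kind].add(frozenset((a, b)))
    return {kind: (totals[kind], len(unique_pairs[kind])) for kind in totals}


counts = count_interactions(RECORDS)
print(counts)
```

Under this assumed rule the same pair supported by several publications contributes once to the non-redundant count but once per record to the total, which is why the two figures diverge as the literature on well-studied genes grows.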
BioGRID continues to maintain complete curation of the primary literature for genetic and protein interactions in the model yeasts Saccharomyces cerevisiae (342 878 total interactions) and Schizosaccharomyces pombe (68 015 total interactions). These datasets are updated on a monthly basis and released for redistribution through the Saccharomyces Genome Database (41) and PomBase (43). In addition to these two yeasts, BioGRID contains interaction data for more than 30 model organisms at varying depths of coverage. However, the immense extent of the biomedical literature (more than 24 million articles in PubMed as of August 2014) and its ever-accelerating rate of growth render the complete manual curation of all interaction data virtually impossible (39). The identification of publications that contain actual interaction data is a non-trivial step in the curation workflow (54). Although the entire BioGRID dataset is drawn directly from just 43 149 publications, in reality several-fold more publications have been directly parsed by curators, usually in an entirely manual fashion (55). While our initial strategy for the identification of relevant papers relied on simple PubMed searches using keywords and/or gene names, we now prioritize literature queues for different projects through advanced text-mining approaches. For example, BioGRID has several projects that are facilitated by Support Vector Machine (SVM) analyses carried out in collaboration with the Textpresso text-mining group (56). We have also begun to use text-mining for the curation of protein phosphorylation sites through a collaboration with developers of the RLIMS-P system (57). To facilitate the development of improved text-mining approaches, the BioGRID routinely contributes to the BioCreative (Critical Assessment of Information Extraction in Biology) challenge by providing test datasets and curation expertise (58–60).
Curation accuracy and consistency are critical for the integrity of the BioGRID resource. The Interaction Management System (IMS) that is used to coordinate curation efforts helps ensure that only unambiguous and appropriate gene identifiers are used. For direct submission of high-throughput datasets to BioGRID, curators work closely with data providers to ensure proper data representation, particularly for quantitative datasets. For example, BioGRID recently incorporated a pre-publication dataset of 23 756 human protein interactions detected by quantitative affinity capture-mass spectrometry (35), as generated by the Gygi and Harper groups. BioGRID also provides an e-mail based helpdesk for evaluation and correction of dubious entries noticed by authors or other users. Importantly, as each monthly BioGRID update is permanently archived, users are able to trace any alterations to the dataset, and thereby easily assess any potential impact on analyses that may have been performed. BioGRID has also recently implemented an automated random re-curation procedure, whereby small subsets of interactions derived from low-throughput studies are blindly re-curated in order to ensure curation consistency.
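The random re-curation procedure described above can be sketched roughly as follows. This is a minimal illustration and not the IMS implementation: the record fields, the 10% sampling fraction and the function name `pick_for_recuration` are all hypothetical.

```python
import random

# Hypothetical interaction records; "throughput" mirrors the low/high-throughput distinction.
INTERACTIONS = [
    {"id": i, "throughput": "low" if i % 3 else "high"}
    for i in range(1, 31)
]


def pick_for_recuration(interactions, fraction=0.1, seed=None):
    """Draw a random subset of low-throughput records for blind re-curation."""
    rng = random.Random(seed)
    low = [rec for rec in interactions if rec["throughput"] == "low"]
    # Sample at least one record whenever any low-throughput record exists.
    k = max(1, round(len(low) * fraction)) if low else 0
    sample = rng.sample(low, k)
    # Only identifiers are returned, so the re-curator is not shown the original annotation.
    return sorted(rec["id"] for rec in sample)


batch = pick_for_recuration(INTERACTIONS, fraction=0.1, seed=42)
print(batch)
```

Fixing the seed makes a given batch reproducible for auditing, while leaving it unset gives a fresh random draw for each re-curation round.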
To maximize depth of BioGRID curation coverage in specific areas relevant to human disease, we have undertaken a series of themed curation projects delineated by a specific biological process or a specific disease topic. These themed curation efforts are implemented in three discrete steps: (i) compilation of a structured gene annotation reference list for the project, typically in consultation with domain experts; (ii) generation of a list of all candidate publications through custom PubMed queries and text-mining approaches; and (iii) curation of the interaction data according to structured evidence codes as coordinated through the automated IMS curation interface. In the largest such project to date, we have curated the entire literature for interactions associated with the ubiquitin-proteasome system (UPS). Manual expert compilation of a comprehensive UPS gene reference list was augmented by semi-automated parsing of protein domain and protein function annotations available through a number of sequence-based databases (48,49,61–63). We thus annotated 1251 human genes to the UPS in a structured format that classified each gene according to enzymatic and other functional characteristics. This gene list was then used to seed PubMed searches to generate a prioritized curation queue of ~20 000 publications. As will be reported elsewhere in detail, a sustained manual curation effort allowed the construction of a dataset of 102 906 interactions (50 561 non-redundant) in the human UPS. In addition, we carried out the systematic annotation of ubiquitination sites detected by high-throughput mass-spectrometry-based approaches (64–66).
A second major curation theme undertaken recently at BioGRID is the arachidonic acid pathway (AAP) as part of the Personalized NSAID Therapeutics Consortium (PENTACON) project (http://www.pentaconhq.org). The AAP is the primary cellular mechanism for production of pain and inflammation mediators, and is also involved in renal function and homeostasis (67). Core genes involved in the AAP, as well as AAP-related genes and genes involved in blood pressure (BP) regulation, were identified using curated pathway resources such as KEGG (68) and Reactome (69), as well as Gene Ontology (63) annotations. These gene lists were further expanded via ongoing literature review and by input from domain experts associated with the PENTACON project. BioGRID curators directly reviewed over 2400 papers and curated more than 1300 AAP protein interactions, 49% of which were from low-throughput studies. This curation effort was then broadened to include AAP-related and BP-related proteins to yield an additional 1200 interactions (84% low-throughput) and 2100 interactions (70% low-throughput), respectively.
Each themed project will be associated with a specific project page in the BioGRID web interface, which will enable users to identify and query specific gene lists within each project. Similarly, project-specific download datasets will be made available and updated on a monthly basis. Other themed curation areas in progress include projects on Parkinson's Disease (PD) and other neurobiological disorders, breast cancer, the Wnt signaling pathway, the chromatin modification system, the autophagy system and ubiquitin-like modifiers. We encourage enquiries from potential expert collaborators with an interest in interaction curation projects with a particular focus on a human disease or a conserved biological process.
BioGRID curation is based on a structured but simplified set of experimental evidence codes for the representation of protein (physical) and genetic interactions. The BioGRID data model allows for the representation of both binary and higher order interactions. BioGRID evidence codes map directly to the Molecular Interaction Ontology, which is maintained by the Proteomics Standards Initiative (70), thereby making BioGRID data records fully interoperable with other datasets released in PSI-MI format. BioGRID evidence codes are periodically updated to reflect new advances in experimental methods. For instance, a Proximity Label-Mass Spectrometry (MS) evidence code was recently introduced in order to document interactions detected upon covalent modification of interaction partners by diffusible reactive species produced by a bait-enzyme fusion protein (71). All evidence codes are fully documented on the BioGRID help wiki section (http://wiki.thebiogrid.org/doku.php/experimental_systems).
BioGRID has recently collaborated with WormBase (45) to develop a new Genetic Interaction (GI) Ontology. This standard has been approved by the main MODs, including SGD (41), CGD (72), PomBase (43), ZFIN (46), FlyBase (42) and TAIR (73). The new GI ontology reconciles different terminologies often used by the biomedical research community and across different MODs. The GI Ontology is based on a previous standard (74), but extends the list of GI terms and inequalities to provide more granular terms based on terminology that is familiar to geneticists (75,76). These GI terms are structured in an ontological format whereby the relationships between the various interaction types are precisely defined. The GI ontology is also available in a simplified slim version of only 23 terms that cover the majority of the genetic interaction cases curated by various MODs. These newly standardized GI terms will facilitate the interpretation of genetic interactions, enable the integration of large genetic interaction datasets, and allow cross-species comparisons of genetic interaction networks. We note that BioGRID currently contains 265 000 yeast genetic interactions associated with over 600 unique phenotypes, which will be automatically remapped to the new GI ontology terms in future releases. The GI ontology is now available as part of the Proteomics Standards Initiative-Molecular Interaction (PSI-MI) ontology (70) and will be published in full in the near future (Grove et al., in preparation).
The web-based IMS curation interface for the BioGRID has recently undergone major revisions in order to allow more sophisticated annotation for future curation projects. The IMS core architecture now enables curation of a broader range of interaction types, including those involving proteins, genes, RNA, small molecules, domains and protein fragments. The overall database architecture has also been improved to allow representation of higher order relationships between interacting partners, such as triple mutant combinations, protein complexes, chemical-genetic interactions and post-translational modifications (Figure 2). The IMS has been elaborated to include more than a dozen comprehensive new ontologies (77–79) that allow curators to unambiguously record new details of any relationship, such as cell lines, phenotypes, small molecules, alleles, diseases, tissues and enzymes (Figure 3). IMS features for curation tracking, fault tolerance and overall curation quality have also been improved. For example, to accommodate more frequent deposition of high-throughput datasets in BioGRID, new tracking tools enable the long-term storage of Supplementary Data files for archival and data reconciliation purposes. The IMS can also track the decision-making processes of each curator for each specific publication, such that it is possible to trace decisions even when the original source material is no longer available or the curator is no longer a member of the BioGRID team. To improve the overall fault tolerance of the underlying database architecture, we have continuously updated our MySQL database platform to utilize enhancements such as InnoDB tables and transactional logging.
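One way to picture a record model that accommodates both binary and higher order relationships is sketched below. This is an illustrative data structure only and does not reflect the actual IMS schema: the class names, fields and participant identifiers are hypothetical, and the evidence and ontology labels are free text standing in for controlled vocabularies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Participant:
    identifier: str  # e.g. a gene symbol or compound name (placeholder values below)
    kind: str        # "gene", "protein", "RNA", "small molecule", ...


@dataclass
class InteractionRecord:
    participants: List[Participant]  # two for binary records, more for higher order ones
    evidence_code: str               # free text here; a real system would use a controlled vocabulary
    pubmed_id: int
    ontology_terms: List[str] = field(default_factory=list)

    @property
    def is_binary(self) -> bool:
        return len(self.participants) == 2


triple_mutant = InteractionRecord(
    participants=[
        Participant("geneA", "gene"),
        Participant("geneB", "gene"),
        Participant("geneC", "gene"),
    ],
    evidence_code="Synthetic Lethality",
    pubmed_id=0,  # placeholder, not a real PubMed ID
    ontology_terms=["inviable"],
)
chem_genetic = InteractionRecord(
    participants=[Participant("geneA", "gene"), Participant("compoundX", "small molecule")],
    evidence_code="chemical-genetic (illustrative label)",
    pubmed_id=0,  # placeholder, not a real PubMed ID
)
print(triple_mutant.is_binary, chem_genetic.is_binary)
```

Storing participants as a list rather than as fixed A and B columns is the design choice that lets one record type cover triple mutants, complexes and mixed gene and small molecule relationships.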
The BioGRID is currently deployed on five virtual machines (VMs) hosted by a commercial third party provider. The VMs are fully customizable and provide state-of-the-art Intel Ivy Bridge processors, application-specific memory that is scalable from 1 to 96 GB and industry-leading native SSD high performance storage that can be readily expanded as needed. Each system has a fully redundant backup that runs daily and weekly and is situated on a 40 GB network that allows for fast access by BioGRID developers and curators in different countries, as well as by web interface and REST service users. Since deployment to cloud-based servers two years ago, the BioGRID has maintained at least 99.9% uptime, without a single major system failure. Each deployment is routinely refreshed with new hardware and software updates that keep pace with changing requirements, demands for higher usage, and system stability and security.
The IMS and the BioGRID have been improved through a new comprehensive annotation system. Our previous system included more than 28 million unique aliases, identifiers, systematic names and MOD references for over 100 supported model organisms. The updated annotation platform provides 20 million additional references and support for many additional organisms. The new data records will allow faster curation as obscure identifiers used in older publications can be easily translated into common references that are recognizable by most major MODs. The local storage of annotations in the new system also improves robustness of the internal curation pipeline by obviating the need for external APIs. The annotation system is updated on a regular basis and allows for straightforward incorporation of new organisms and facile adaptation to major annotation changes. These enhancements to the database architecture maximize performance and flexibility in curation tasks, especially for HTP datasets.
All BioGRID datasets and interaction records can be accessed and interrogated by a variety of different means. The BioGRID web page allows searches of interaction data by gene name, gene aliases or PMID publication identifiers. The complete BioGRID dataset or subsets thereof are also available for download in a number of tabular (tab, tab2 and mitab) and XML (PSI-MI 1.0, PSI-MI 2.5) formats. A detailed step-by-step guide to the BioGRID web interface is now available (Oughtred et al., submitted). BioGRID interaction data is also accessible to the individual researcher indirectly through a number of other biological databases including NCBI Entrez-Gene (48), Uniprot (49), DroID (80), GermOnline (81), FlyBase (42), TAIR (73), SGD (41), PomBase (43), STRING (47), iRefIndex (82), GeneMania (83) and Pathway Commons (50).
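As a rough illustration of how a downloaded tab-delimited file might be interrogated locally, the sketch below parses a tiny in-memory sample. The column headers are an abbreviated subset modelled on the tab2 layout and should be treated as assumptions to be checked against the header line of an actual download; the two rows are made-up data.

```python
import csv
import io
from collections import Counter

# Two made-up rows with an abbreviated, assumed subset of tab2-style column headers.
SAMPLE_TAB2 = (
    "#BioGRID Interaction ID\tOfficial Symbol Interactor A\tOfficial Symbol Interactor B\t"
    "Experimental System\tExperimental System Type\tPubmed ID\n"
    "1\tGENE1\tGENE2\tTwo-hybrid\tphysical\t111\n"
    "2\tGENE1\tGENE3\tSynthetic Lethality\tgenetic\t222\n"
)


def load_tab2(handle):
    """Read tab-delimited interaction rows into a list of dicts keyed by column header."""
    # QUOTE_NONE keeps any stray quote characters in free-text fields as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


rows = load_tab2(io.StringIO(SAMPLE_TAB2))

# Tally physical versus genetic records, then list the partners of one query gene.
by_type = Counter(row["Experimental System Type"] for row in rows)
partners = sorted(
    row["Official Symbol Interactor B"]
    for row in rows
    if row["Official Symbol Interactor A"] == "GENE1"
)
print(by_type, partners)
```

For a real file, `io.StringIO(SAMPLE_TAB2)` would be replaced by an open file handle; a complete partner search would also need to check the interactor B column for the query gene.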
Software developers can access BioGRID data directly through the BioGRID representational state transfer (REST) service (84). The BioGRID Webgraph (84) and the BioGRID Cytoscape plugin (84) utilize the REST service for the visualization and analysis of BioGRID interaction networks. The BioGRID REST service application programming interface (API) has been completely rebuilt to improve performance, enhance reliability and support scalability through more powerful server hardware available in the cloud. This transition to cloud-based servers has reduced query response times from an average of 5.1 s to <0.02 s. As a direct result of these changes, the BioGRID REST service now supports more than 350 worldwide active projects that perform more than 100 000 queries per month with an average return of more than 2 million interactions per month. For example, the ProHits open source mass spectrometry LIMS platform uses the REST service to incorporate BioGRID data into analysis of experimental mass spectrometry data (85,86). The BioGRID Cytoscape plugin version 2.3 has also been redesigned to take advantage of the improvements made to the REST API and can be downloaded directly from the BioGRID website at http://wiki.thebiogrid.org/doku.php/tools. Finally, we have also implemented support for the PSICQUIC interface (87), which has resulted in more than 140 000 queries per month from a wide variety of users.
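A client-side query to the REST service might be assembled along the following lines. This is a sketch under stated assumptions rather than a reference client: the base URL and the parameter names (`accessKey`, `geneList`, `searchNames`, `taxId`, `format`, `max`) are recalled from the public REST documentation and should be verified there before use, `build_query` is a hypothetical helper, and the access key shown is a placeholder. The code only constructs the URL and does not contact the service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed base URL and parameter names; confirm against the current REST documentation.
BASE_URL = "https://webservice.thebiogrid.org/interactions/"


def build_query(genes, access_key, tax_id=None, fmt="tab2", max_results=100):
    """Build (but do not send) an interaction query URL for a list of gene names."""
    params = {
        "accessKey": access_key,
        "geneList": "|".join(genes),  # assumed separator for multiple genes
        "searchNames": "true",
        "format": fmt,
        "max": max_results,
    }
    if tax_id is not None:
        params["taxId"] = tax_id      # NCBI taxonomy ID used to restrict the organism
    return BASE_URL + "?" + urlencode(params)


url = build_query(["CDC28", "CLN2"], access_key="YOUR_KEY", tax_id=559292)
query = parse_qs(urlsplit(url).query)  # decoded again here only to inspect the result
print(url)
```

Separating URL construction from the network call keeps the query logic easy to test offline; the resulting URL can then be passed to any HTTP client.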
The BioGRID will continue its core mandate to curate biological interaction data from the primary biomedical literature across the major model organism species and humans for unrestricted dissemination to the research community. The BioGRID database architecture will continue to be improved through additional updates to the IMS curation management system that will facilitate the routine deposition of pre-publication large-scale quantitative datasets, allow the capture of detailed phenotype information associated with genetic interactions, and further extend the internal annotation system to new organisms. Future themed curation drives will be focused on conserved biological processes such as the autophagy system and specific human diseases such as neurological and cardiac disorders. All interaction data for themed projects will be made accessible through project-specific web interfaces. The BioGRID will also continue to exploit text-mining technologies in order to improve the efficiency of curation workflows for future themed projects. BioGRID curation parameters for these projects will be extended to additional post-translational modifications, context-specific effects and structured phenotypes. New computational approaches based on integration of genome-scale datasets will be used to develop tissue- and disease-specific functional networks that will help guide and validate expert manual curation. This disease network-associated curation will be augmented through the capture of relevant drug or small molecule interactions. Collectively, these approaches will enable efficient cross-species comparisons of biological interaction networks, particularly for identification of new models of human disease.
The authors thank Chris Grove and Paul Sternberg at WormBase for ongoing collaborative development of the Genetic Interaction Ontology. We also thank Mike Cherry, Val Wood, Gavin Sherlock, Bill Gelbart, Monty Westerfield, Judy Blake, Russ Finley, David Botstein, Henning Hermjakob, Sandra Orchard, Anne-Claude Gingras, Frank Liu, Gary Bader, Chris Sander, Ivan Sadowski, Lincoln Stein, Mark Ellisman, Maryann Martone, Melissa Haendel, Igor Jurisica, Charlie Boone, Wade Harper, Steve Gygi, Olga Troyanskaya and the PENTACON consortium for support and discussions.
National Institutes of Health [R01OD010929 and R24OD011194 to M.T. and K.D.]; Biotechnology and Biological Sciences Research Council [BB/F010486/1 to M.T.]; National Institutes of Health National Heart, Lung and Blood Institute [U54HL117798 Curation Core to K.D., Garret FitzGerald overall P.I.]; Genome Canada Largescale Applied Proteomics; Ontario Genomics Institute (OGI-069); Genome Québec International Recruitment Award and a Canada Research Chair in Systems and Synthetic Biology [to M.T.]. Funding for open access charge: National Institutes of Health [R01OD010929].
Conflict of interest statement. None declared.