The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact email@example.com
Access to unified datasets of protein and genetic interactions is critical for interrogation of gene/protein function and analysis of global network properties. BioGRID is a freely accessible database of physical and genetic interactions available at http://www.thebiogrid.org. BioGRID release version 2.0 includes >116 000 interactions from Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens. Over 30 000 interactions have recently been added from 5778 sources through exhaustive curation of the Saccharomyces cerevisiae primary literature. An internally hyper-linked web interface allows for rapid search and retrieval of interaction data. Full or user-defined datasets are freely downloadable as tab-delimited text files and PSI-MI XML. Pre-computed graphical layouts of interactions are available in a variety of file formats. User-customized graphs with embedded protein, gene and interaction attributes can be constructed with a visualization system called Osprey that is dynamically linked to the BioGRID.
Protein interactions assemble the molecular machines of the cell and underlie the dynamics of virtually all cellular responses (1), while genetic interactions reveal functional relationships between and within regulatory modules (2). The sum of all such interactions defines the global regulatory network of the cell (3). Proteomic and functional genomics platform technologies now generate large datasets of protein and genetic interactions, but these datasets vary widely in coverage, data quality, annotation and availability (4,5). The collation of interaction data in a consistent, well-annotated format is essential for interrogation of gene function, investigation of system-level attributes and benchmarking of high-throughput (HTP) interaction studies. A number of interaction databases, including BIND (6), DIP (7), HPRD (8), IntAct (9), MINT (10) and MIPS (11), provide a variety of datasets and analysis tools. We have developed a biological General Repository for Interaction Datasets (BioGRID) to house and distribute comprehensive collections of physical and genetic interactions. The precursor to BioGRID was originally conceived as a laboratory information management system (LIMS) for HTP interaction data (12). The first public release of BioGRID (version 1.0; July 2002; then termed GRID) housed HTP two-hybrid and mass spectrometric protein interaction data generated from the budding yeast Saccharomyces cerevisiae (13). The BioGRID has since been elaborated into a resource for HTP interaction data from other species, including the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster and human. In addition, the BioGRID now contains many genetic and protein interactions curated from focused studies reported in the primary literature [Reguly,T., Breitkreutz,A., Boucher,L., Breitkreutz,B.-J., Hon,G., Myers,C., Parsons,A., Friesen,H., Oughtred,R., Tong,A. et al. (2005) Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae (submitted)]. The BioGRID has been queried for over 38 000 000 interactions since its inception. The recent version 2.0 release of BioGRID is a fully integrated cross-species database that supports most major model organisms, with increased data content and improved functionality.
HTP approaches to identify novel protein and gene networks have begun to augment hypothesis-driven biochemical and genetic approaches (14). These hypothesis-generating HTP techniques include the two-hybrid (2-H) method for detecting pair-wise protein interactions (15–17), mass spectrometric (MS) analysis of purified protein complexes (12,18), and the synthetic genetic array (SGA) and molecular barcode (dSLAM) methods for systematic detection of synthetic lethal genetic interactions (19,20). BioGRID currently includes HTP protein interaction datasets from two systematic mass spectrometric studies (12,18) and three two-hybrid studies (15–17) in S.cerevisiae, which total 12 994 interactions between 4478 proteins (Table 1). In addition, BioGRID contains all extant HTP genetic interaction datasets from both SGA and dSLAM approaches (19–22), totaling 6119 interactions between 1440 genes. Finally, BioGRID incorporates large-scale HTP two-hybrid surveys for C.elegans (23) and D.melanogaster (24,25), among others.
HTP datasets are laden with false-positive and false-negative interactions (4,5). This shortfall compromises both prediction of gene/protein function and network-level analysis. The primary literature contains a vast collection of well-validated physical and genetic interactions that, while searchable on a publication-by-publication basis in PubMed, are not available in a relational database. A comprehensive set of literature-derived interactions would serve as a gold standard both for HTP datasets and for automated text mining approaches, augment the predictive power of HTP data and enable a re-analysis of global network properties. Spurred on by these potential applications, several databases (6–11), as well as the Gene Ontology (GO) consortium (26), have significant efforts underway to curate interaction data from the primary literature. We have recently manually parsed the entire S.cerevisiae literature for protein and genetic interactions [Reguly,T., Breitkreutz,A., Boucher,L., Breitkreutz,B.-J., Hon,G., Myers,C., Parsons,A., Friesen,H., Oughtred,R., Tong,A. et al., submitted for publication]. This comprehensive curation effort yielded 19 744 protein interactions and 11 234 genetic interactions, all of which have been placed into BioGRID. We note that the size of this literature dataset exceeds all HTP datasets combined. BioGRID also contains imports of 10 943 literature-derived genetic interactions from Flybase (27) and 30 761 literature-derived interactions from HPRD (8). The total number of literature interactions in BioGRID currently stands at over 70 000 (Table 1). In addition to the S.cerevisiae literature, we have curation efforts underway for the fission yeast Schizosaccharomyces pombe, the fruit fly Drosophila melanogaster and focused aspects of the human protein interaction literature, all of which will be deposited in BioGRID.
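The totals quoted above can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text (Table 1 is not reproduced here), and it assumes that the comparison with HTP data refers to the S.cerevisiae HTP protein and genetic interaction counts given earlier; the variable names are illustrative.

```python
# Figures quoted in the text; names are illustrative, not BioGRID identifiers.
literature_protein = 19_744   # curated S.cerevisiae protein interactions
literature_genetic = 11_234   # curated S.cerevisiae genetic interactions
htp_protein = 12_994          # S.cerevisiae HTP protein interactions
htp_genetic = 6_119           # HTP genetic interactions (SGA and dSLAM)
flybase_import = 10_943       # literature-derived imports from Flybase
hprd_import = 30_761          # literature-derived imports from HPRD

# Assumption: "all HTP datasets combined" is read as the two yeast HTP totals.
yeast_literature = literature_protein + literature_genetic
yeast_htp = htp_protein + htp_genetic
all_literature = yeast_literature + flybase_import + hprd_import

print(yeast_literature, yeast_htp, all_literature)
```

Under these assumptions the curated yeast literature set (30 978) is larger than the yeast HTP sets combined (19 113), and the three literature sources sum to 72 682, in line with the stated total of over 70 000.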
As network complexity increases, tabular formats for data display quickly overwhelm human comprehension. Graphical representation of interaction networks not only enables a high density of data to be visualized but immediately conveys complex inter-relationships between graph nodes, in this case either proteins or genes. A defining feature of the BioGRID database is an inter-dependent visualization tool called Osprey (http://biodata.mshri.on.ca/osprey) that runs as a desktop application in Windows, Linux and OSX environments (28). The Osprey platform is a facile graphical interface to query BioGRID datasets, from which the user can build custom graphical representations of any chosen set of interactions. Osprey represents individual genes/proteins by nodes and interactions by edges that connect nodes. Additional color-coded annotation is embedded in nodes and edges to represent GO categories, experimental evidence and/or data source information. A variety of graphical layouts and toggle options afford different views of the network. The Osprey file format captures all annotation associated with each node/edge in the graph, and can thus be used as a graphical file exchange format for interaction data. User-defined datasets can be uploaded into Osprey for annotation and integration with public datasets in BioGRID. Osprey graphs can also be saved in JPEG, PNG and SVG file formats for figure construction. Pre-computed graphical representations of the first-order interaction shell for every gene/protein in the BioGRID are included on each results page and are available for direct download (Figure 1).
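The node and edge model described above, and the idea of a first-order interaction shell, can be illustrated with a minimal sketch. This is not Osprey code: the function names, the placeholder gene names and the evidence labels are all assumptions made for illustration, and the shell is taken here to mean the direct interaction partners of a query gene.

```python
from collections import defaultdict


def build_graph(interactions):
    """Build an undirected adjacency map from (gene_a, gene_b, evidence) records."""
    graph = defaultdict(set)
    for gene_a, gene_b, _evidence in interactions:
        graph[gene_a].add(gene_b)
        graph[gene_b].add(gene_a)
    return dict(graph)


def first_order_shell(graph, query):
    """Return the sorted direct partners of the query gene (empty if unknown)."""
    return sorted(graph.get(query, set()))


# Placeholder records; gene names and evidence labels are hypothetical.
interactions = [
    ("GENE_A", "GENE_B", "two-hybrid"),
    ("GENE_A", "GENE_C", "synthetic lethality"),
    ("GENE_B", "GENE_D", "mass spectrometry"),
]

graph = build_graph(interactions)
shell = first_order_shell(graph, "GENE_A")
print(shell)
```

Here GENE_D is excluded from the shell of GENE_A because it is reached only through GENE_B, that is, it lies in the second-order shell.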
The BioGRID web interface was developed with PHP 5.0.4 and is hosted on an Apache 2.0 web server at our primary mirror (http://www.thebiogrid.org). The entire package is capable of running on any PHP 4.x compatible web server, and has been tested successfully on IIS, Apache 1.3 and Apache 2.0. BioGRID currently uses freely available MySQL 4.1 as its primary database management system (http://www.mysql.com) for both the web-based interface and interaction curation. The BioGRID is readily established on in-house servers and is easily adapted as an internal data management system by the individual laboratory.
Consistent annotation is essential in order to collapse redundant interactions into a single search result and ensure accuracy for queries and results. All ancillary annotation is compiled from over 25 popular web-based resources, extracted and stored via an annotation compilation system (ACS) written with Java Technology and Java SDK version 1.4.2. BioGRID annotation tables are updated on a monthly basis and made freely available via the web-based interface. The BioGRID ACS currently supports 294 140 genes in 13 different organisms: Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Canis familiaris, Bos taurus, Arabidopsis thaliana, Xenopus laevis, Takifugu rubripes and Danio rerio.
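To illustrate why consistent annotation matters for collapsing redundant interactions, the following sketch maps gene names and aliases onto a single identifier and then merges records that describe the same pair. It is a simplified illustration rather than the ACS itself: the alias table, identifiers and source labels are hypothetical, and interactions are treated as undirected, which is an assumption.

```python
def canonical(name, alias_to_id):
    """Map a gene name or alias to its identifier, ignoring case."""
    key = name.upper()
    return alias_to_id.get(key, key)


def collapse(records, alias_to_id):
    """Merge records for the same unordered gene pair, keeping all sources."""
    merged = {}
    for name_a, name_b, source in records:
        pair = tuple(sorted((canonical(name_a, alias_to_id),
                             canonical(name_b, alias_to_id))))
        merged.setdefault(pair, set()).add(source)
    return merged


# Hypothetical alias table: two names resolve to the same identifier.
alias_to_id = {"NAME1": "ID001", "ALIAS1": "ID001", "NAME2": "ID002"}

# The same interaction reported under different names and in a different order.
records = [
    ("Name1", "Name2", "pub_1"),
    ("NAME2", "alias1", "pub_2"),
]

merged = collapse(records, alias_to_id)
print(merged)
```

Without the alias table the two records would appear as distinct interactions; with it they collapse into one result supported by two sources.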
All the interaction data present in BioGRID is freely downloadable at http://www.thebiogrid.org. Data is available in multiple formats including tab-delimited text file and PSI-MI XML (29), as well as in Osprey and other graphical file formats. BioGRID supports the data exchange standard PSI version 2.5, as mandated by the International Molecular Exchange Consortium (IMEx) that aims to facilitate the open distribution of interaction data (see http://imex.sourceforge.net/). Interaction data is updated regularly, and all downloadable files are refreshed to reflect the most recent changes. Download files are customizable by publication, record, organism and experimental system. To maximize performance and minimize database downtime, mirror versions of BioGRID are under construction in the US and Europe. Information on curation contributions or hosting a mirror may be obtained from the BioGRID website. Source code is freely available on request. The BioGRID is actively linked to the Saccharomyces Genome Database (30), Flybase (27) and Germ Online (31) websites.
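As a rough illustration of working with a tab-delimited download, the sketch below reads interaction records and filters them by organism and experimental system. The column names and sample rows are assumptions made for this example and do not reproduce the actual BioGRID file layout, which should be checked against the downloaded file header.

```python
import csv
import io

# Hypothetical tab-delimited content; real column names may differ.
SAMPLE = (
    "interactor_a\tinteractor_b\texperimental_system\torganism\tsource\n"
    "GENE_A\tGENE_B\ttwo-hybrid\tSaccharomyces cerevisiae\tpub_1\n"
    "GENE_A\tGENE_C\tsynthetic lethality\tSaccharomyces cerevisiae\tpub_2\n"
    "GENE_X\tGENE_Y\ttwo-hybrid\tDrosophila melanogaster\tpub_3\n"
)


def filter_interactions(handle, organism=None, system=None):
    """Return rows matching the given organism and/or experimental system."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        if organism is not None and row["organism"] != organism:
            continue
        if system is not None and row["experimental_system"] != system:
            continue
        rows.append(row)
    return rows


yeast_two_hybrid = filter_interactions(
    io.StringIO(SAMPLE),
    organism="Saccharomyces cerevisiae",
    system="two-hybrid",
)
for row in yeast_two_hybrid:
    print(row["interactor_a"], row["interactor_b"], row["source"])
```

To read a file on disk instead of the in-memory sample, an open file handle (for example from `open(path, newline="")`) can be passed in place of the `io.StringIO` object.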
We will continue to curate interactions from major model organisms, including human, which will be posted as monthly updates of interaction data. Annotation will be routinely updated to allow unambiguous retrieval of protein/gene names. Capability to house quantitative genetic interactions and curated post-translational modifications will be implemented in the near future. We also plan to support complex and pathway descriptions, and to enable cross-species predictions through BLAST-based alignments of orthologous networks (32). A planned open source release version of the BioGRID platform, called ProtoGRID, will simplify installation of local versions of BioGRID. Similarly, the curation management system will be released to facilitate curation of interaction data by interested groups. Finally, graphical representations will be augmented through network clustering based on user-defined attributes, including co-expression and co-localization.
We thank Jim Woodgett for generous support and advice; Rachel Drysdale and Don Gilbert for assistance in parsing genetic interactions from FlyBase; Kara Dolinski, Michael Cherry and David Botstein for helpful discussions and support at SGD; and Russ Finley, Joel Bader, Marc Vidal, Jef Boeke, Tim Hughes and Charlie Boone for pre-publication release of large-scale datasets. L.B. is supported by a National Cancer Institute of Canada Doctoral Award with funds from the Terry Fox Foundation; M.T. is supported by a Canada Research Chair in Functional Genomics and Bioinformatics. This work was funded by a grant from the Canadian Institutes of Health Research to M.T. Funding to pay the Open Access publication charges for this article was provided by the Canadian Institutes of Health Research.
Conflict of interest statement. None declared.