Our aim in developing the Genes2Networks software is to provide experimental cell and molecular biologists, as well as computational biologists, with a user-friendly tool for creating subnetworks from lists of mammalian genes or proteins by connecting them through known protein-protein interactions. To accomplish this, we first built a large-scale, high-quality mammalian protein-protein interaction dataset by consolidating ten currently available mammalian protein interaction datasets. Most of the data are low-throughput, literature-based interactions extracted manually by expert biologists, but data generated by high-throughput methods are also included. To prune interactions of low confidence, a simple filter was implemented.
Genes2Networks is delivered as a web application that extracts relevant subnetworks from lists of gene or protein names. The input to the system is a list of Entrez gene symbols (the seed list). The system uses the merged dataset, built from the selected databases, to find interactions between the nodes in the seed list. The merged dataset can be filtered according to user preferences for the maximum number of interactions a single reference may provide and the minimum number of references required for an interaction to be included. The resulting filtered dataset serves as the reference network in which paths between the seed nodes are explored by depth-first traversal. Nodes that fall on paths between seed nodes that are shorter than a user-defined path length are included as intermediates in the output subnetwork.
The system's output includes a statistical analysis report and a three-color network map that highlights the seed nodes in one color, the significant intermediates in a second color, and the non-significant intermediates in a third. The statistical analysis lists the intermediate nodes used to connect the seed genes, sorted by the significance of their specificity for interacting with nodes from the seed list. This process is illustrated in Figure .
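The path search described above can be sketched as follows. This is a minimal illustration and not the Genes2Networks implementation: the function name `extract_subnetwork`, the adjacency-dictionary representation of the reference network, and the reading of the path-length setting as a maximum number of links are all assumptions made for the example.

```python
def extract_subnetwork(adjacency, seeds, max_links):
    """Collect nodes and links lying on simple paths of at most
    `max_links` links that connect two seed nodes.

    adjacency: dict mapping each node to the set of its neighbors
               (hypothetical in-memory form of the reference network).
    """
    seeds = set(seeds)
    nodes, links = set(), set()

    def dfs(path, visited):
        current = path[-1]
        if len(path) > 1 and current in seeds:
            # Reached another seed: keep every node and link on this path.
            nodes.update(path)
            links.update(frozenset(pair) for pair in zip(path, path[1:]))
            return  # longer routes are found when searching from this seed
        if len(path) - 1 == max_links:
            return  # depth limit reached without meeting a seed
        for neighbor in adjacency.get(current, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                dfs(path, visited)
                path.pop()
                visited.remove(neighbor)

    for seed in seeds:
        dfs([seed], {seed})
    return nodes, links


# Toy reference network with made-up node names.
toy = {"A": {"X", "H"}, "X": {"A", "B"}, "B": {"X"}, "H": {"A", "Z"}, "Z": {"H"}}
sub_nodes, sub_links = extract_subnetwork(toy, ["A", "B"], max_links=2)
print(sorted(sub_nodes))
```

In this sketch, seed nodes that cannot be connected within the limit do not appear in the result; whether the actual tool still displays them is not stated in the text.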
Developing a high-quality large-scale mammalian protein interaction network
We used only mammalian (mouse, rat, and human) interactions recorded in the following datasets: BIND [17], HPRD [18], IntAct [19], DIP [20], MINT [21], Rual et al. [22], Stelzl et al. [23], Ma'ayan et al. [24], PDZBase [25], and PPID [19, 26]. All interactions from these databases and datasets were determined experimentally and include a PubMed reference to the primary research article describing the experiments used to identify them. Some of the databases contain interactions that were manually extracted from the literature (e.g. HPRD); some datasets are the result of high-throughput experiments (e.g. Rual et al. and Stelzl et al.); and some databases contain both low- and high-throughput interactions (e.g. BIND, IntAct, and DIP). Consolidation of interactions from the ten databases was accomplished by combining human, mouse, and rat Entrez gene symbols using information from Swiss-Prot [27]. The consolidated network created from the ten datasets contains 44,877 interactions and 11,033 nodes. This network is stored as a space-delimited flat text file, which is loaded into the program using a hash data structure implemented in C for fast loading and access.
We do not include in this initial implementation datasets of interactions created via in-silico ab-initio interaction prediction methods or interactions inferred from model-organism orthologs, such as those collected in OPHID [27], HPID [28], IntNetDB [29], and POINT [30]. The datasets we used describe mostly binary interactions; in rare cases they list complexes containing more than two proteins, and these were excluded from the merged dataset. Nodes in the ten datasets are provided with accession codes linking them to entries describing genes and proteins in databases such as Swiss-Prot [31] and NCBI's Entrez Gene [32].
HPRD [18] and PPID [19, 26] are not included in the public web application because these databases require a license for redistribution. Currently, HPRD and PPID data are available only to internal users at Mount Sinai School of Medicine.
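The consolidation step can be pictured with a short sketch. It is written in Python for brevity, whereas the text states that the program itself uses a hash structure implemented in C. The record layout `(source, accession_a, accession_b, pubmed_id)`, the `to_symbol` lookup table, and the function name `consolidate` are hypothetical; the actual file format and mapping procedure are not described beyond what is stated above.

```python
from collections import defaultdict


def consolidate(records, to_symbol):
    """Merge binary interaction records from several sources into one
    undirected network keyed by a pair of gene symbols.

    records:   iterable of (source, accession_a, accession_b, pubmed_id)
    to_symbol: dict mapping a source accession to an Entrez gene symbol
               (assumed to have been prepared beforehand)
    """
    network = defaultdict(lambda: {"pmids": set(), "sources": set()})
    for source, acc_a, acc_b, pmid in records:
        a, b = to_symbol.get(acc_a), to_symbol.get(acc_b)
        if a is None or b is None:
            continue  # accession could not be mapped to a gene symbol
        pair = tuple(sorted((a, b)))  # A-B and B-A are the same interaction
        network[pair]["pmids"].add(pmid)
        network[pair]["sources"].add(source)
    return dict(network)
```

Keeping the set of PubMed identifiers for each interaction is what makes reference-based filtering possible later on.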
Filtering
Many of the interactions and components listed in the ten databases that we used are the result of high-throughput experiments such as yeast two-hybrid screens [2, 3] and mass spectrometry [6]. These interactions are considered to be of low quality because such techniques often report many false positives [33]. We therefore applied a simple filtering approach that allows users to exclude interactions originating from articles that report many interactions, and/or to include only interactions reported by several different papers. The rationale is the assumption that a research article reporting many interactions is likely describing the results of a high-throughput technique, which tends to produce many false positives. Conversely, interactions that are reported in many different research articles, and that appear in multiple databases, can be given more confidence because they have been reported multiple times independently. Hence, users may choose to include only interactions from low-throughput studies with multiple references to improve the reliability of the consolidated network. List boxes and text boxes allow users to adjust the filtering thresholds. More sophisticated filtering techniques that apply machine learning methods such as support vector machines (SVM) [34] and take into account more knowledge about the interactions (e.g. the experimental method used and the impact factor of the journal) are planned for future implementations.
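The two thresholds can be sketched as a single pass over the consolidated network. This is an illustration under stated assumptions rather than the tool's code: the function and parameter names are hypothetical, and the order in which the two thresholds are applied (first discard references that report too many interactions, then count the references that remain) is one plausible reading of the description.

```python
from collections import Counter


def filter_interactions(network, max_per_reference, min_references):
    """Keep interactions supported by enough low-throughput references.

    network: dict mapping an interaction (pair of symbols) to the set of
             PubMed IDs that report it (hypothetical representation).
    """
    # Number of interactions contributed by each reference.
    per_reference = Counter(
        pmid for pmids in network.values() for pmid in pmids
    )
    filtered = {}
    for pair, pmids in network.items():
        # Ignore references that report more interactions than allowed.
        kept = {p for p in pmids if per_reference[p] <= max_per_reference}
        if len(kept) >= min_references:
            filtered[pair] = kept
    return filtered
```

With `max_per_reference` set very high and `min_references` set to 1, the filter leaves the network unchanged, which corresponds to using the unfiltered consolidated dataset.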
Web interface
To make the core Genes2Networks software more accessible, we developed a web-based interface. This interface allows users to enter lists of human Entrez Gene symbols in a text box or by uploading a text file. As genes are added, the system validates the entries using NCBI's E-utilities: the NCBI Gene database is searched with the input entry while the organism is restricted to human. If an exact match is not found, the user is presented with a list of suggestions, each linked so that the intended entry can be chosen. Clicking a highlighted gene symbol in the list of suggestions adds that gene to the seed list.
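A symbol lookup of this kind could be expressed as follows. The text does not say which E-utility or query string Genes2Networks uses, so the choice of ESearch, the `[Gene Name]` and `[Organism]` field tags, the JSON return mode, and both function names are assumptions made for illustration. The sketch only builds the request URL and reads the ID list from an already decoded response; it performs no network access.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(symbol):
    """Return an ESearch URL that looks up `symbol` in the Gene database,
    restricted to human (assumed query form)."""
    term = f"{symbol}[Gene Name] AND human[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def matching_gene_ids(esearch_json):
    """Extract the list of Gene IDs from a decoded ESearch JSON response.
    An empty list means no match, in which case suggestions would be shown."""
    return esearch_json.get("esearchresult", {}).get("idlist", [])


print(build_gene_query("TP53"))
```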
Using the merged consolidated reference network, the program outputs subnetworks that describe all interactions and nodes found on paths connecting the input gene symbols. The web interface gives users full control over which databases are included in the consolidated reference network used to connect the genes. Users can also upload additional network databases, which can be consolidated with the provided networks. The output subnetwork is visualized using a dynamic, web-enabled AJAX viewer called AVIS [35]. The viewer allows browsing, zooming and panning, and linking to interaction resources. The user can configure node colors so that the seed-list genes, the intermediate genes above a specified Z-score, and the remaining nodes are displayed in different colors. The user can also adjust the maximum number of steps (hops) allowed when searching for paths between the nodes in the seed list. Steps are counted as the number of links (not nodes) needed to connect nodes in the input seed list. Additionally, the program outputs a statistical report that ranks the intermediates used to connect the genes by their specificity for interacting with the seed list. As the user adjusts the settings, the resulting network is automatically redisplayed. A representative screenshot of the system is shown in Figure .
Significant intermediates
The output subnetworks produced by Genes2Networks contain nodes, mostly proteins, that were not provided by the user as input. We call these nodes "intermediate nodes". Some intermediate nodes may be present in the output subnetwork simply because they are highly connected nodes (hubs) in the consolidated reference network used to connect the seed list. Other intermediate nodes may interact specifically with components from the input seed list. Identifying such specific intermediates may help the user find potential regulators of, and participants in, the pathways, protein complexes, and modules that involve the input seed list. For this purpose, Genes2Networks outputs a Z-score for the significance of each intermediate in the output subnetwork. The Z-score is computed for each intermediate node using a binomial proportions test [36] as follows:
$$Z = \frac{a/b - c/d}{\sqrt{\dfrac{(c/d)\,(1 - c/d)}{b}}}$$
where "a" is the number of links from the intermediate node being examined to nodes from the input seed list, "b" is the total number of links of the intermediate node in the consolidated background reference network, "c" is the total number of links in the output subnetwork, and "d" is the total number of links in the consolidated background reference network. The ranked list of intermediates is displayed underneath the subnetwork map viewer.
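A short sketch of the scoring and ranking step is given here. It assumes the usual one-sample form of the binomial proportions test, in which the observed proportion a/b is compared with the expected proportion c/d using b as the sample size; the function names and the dictionary layout are hypothetical.

```python
from math import sqrt


def intermediate_z_score(a, b, c, d):
    """Z-score of one intermediate node.

    a: links from the intermediate to seed-list nodes
    b: total links of the intermediate in the background network
    c: total links in the output subnetwork
    d: total links in the background network
    """
    expected = c / d  # proportion expected by chance
    return (a / b - expected) / sqrt(expected * (1 - expected) / b)


def rank_intermediates(link_counts, c, d):
    """Sort intermediates from most to least specific.

    link_counts: dict mapping node name to its (a, b) pair
    """
    scored = {
        node: intermediate_z_score(a, b, c, d)
        for node, (a, b) in link_counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

Under this form, a hub with many background links but few links to the seed list receives a low or negative score, while a sparsely connected node whose links mostly reach seed nodes scores high.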