The procedure used to integrate multiple PPI databases into a modular and biologically meaningful network is shown in Figure 1. Seven PPI databases were preprocessed so that only human data were retained, identified by unified Entrez Gene IDs. Seven integrated networks were obtained by using the k-votes method for k = 1, 2, 3, …, n, where n = 7. In the k-votes method, all known interactions are examined, and an interaction is included in the integrated network if it is present in at least k PPI databases.
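The k-votes rule described above can be illustrated with a minimal sketch. It assumes each database is available as a list of protein pairs, that interactions are undirected, and that a database contributes at most one vote per interaction; the function name `k_votes` and the toy IDs are hypothetical and are not taken from the study.

```python
from collections import Counter

def k_votes(databases, k):
    """Return the set of interactions reported by at least k databases.

    databases: list of iterables of (gene_a, gene_b) pairs.
    Assumption: interactions are undirected, so (a, b) equals (b, a).
    """
    votes = Counter()
    for db in databases:
        # Normalise pair order and de-duplicate within one database so
        # that each database casts at most one vote per interaction.
        edges = {tuple(sorted(pair)) for pair in db}
        votes.update(edges)
    return {edge for edge, n in votes.items() if n >= k}

# Toy data: hypothetical IDs and three databases instead of seven.
toy = [
    [(1, 2), (2, 3), (3, 4)],
    [(2, 1), (3, 4)],
    [(4, 3), (5, 6)],
]

# One integrated network per k, from the union (k=1) to the intersection (k=n).
networks = {k: k_votes(toy, k) for k in range(1, len(toy) + 1)}
for k, edges in networks.items():
    print(k, sorted(edges))
```

With k = 1 the result is the union of all databases, and with k = n it is their intersection, so the networks shrink monotonically as k grows.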
Figure 1. Network modeling and evaluation flowchart. PPI data are taken from seven preprocessed public PPI databases and used to create seven integrated networks using the k-votes method (A). SCAN is used to generate functional modules for each of these integrated …
After all seven integrated networks were constructed, cluster analysis was performed on each one using the SCAN algorithm, with ε varied from 0 to 1 in steps of 0.01. Each ε value yielded a clustering result. For each clustering result we calculated four quality measures (modularity, similarity-based modularity, clustering score, and normalized enrichment), shown in Figures and . The integrated network that achieved the best overall performance across these clustering quality measures was recognized as the most informative network.
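As a rough illustration of the ε sweep, the sketch below builds the 101-point ε grid and computes the structural similarity that SCAN thresholds against ε, namely the size of the shared closed neighborhood divided by the geometric mean of the two neighborhood sizes. It is not the authors' implementation: the helper names and the toy graph are hypothetical, and the full SCAN procedure (core detection with its μ parameter, hubs, and outliers) is omitted.

```python
import math

def closed_neighborhoods(edges):
    """Map each vertex to its neighbors plus itself (SCAN's vertex structure)."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)
    return nbrs

def structural_similarity(nbrs, u, v):
    """Shared closed neighborhood, normalised by the geometric mean of sizes."""
    shared = len(nbrs[u] & nbrs[v])
    return shared / math.sqrt(len(nbrs[u]) * len(nbrs[v]))

def epsilon_grid(steps=100):
    """Thresholds 0.00, 0.01, ..., 1.00 (101 values for steps=100)."""
    return [i / steps for i in range(steps + 1)]

# Hypothetical toy graph: a triangle (1, 2, 3) with a pendant vertex 4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
nbrs = closed_neighborhoods(edges)
sims = {e: structural_similarity(nbrs, *e) for e in edges}

# For each threshold, count the edges whose similarity is at least epsilon.
surviving = {eps: sum(s >= eps for s in sims.values()) for eps in epsilon_grid()}
```

In a full analysis, each ε in the grid would be passed to a SCAN implementation and the four quality measures would be recorded for the resulting clustering.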
Figure 2. Optimality measures for the seven consensus networks. Figure shows the four optimality measures for Ĝ2: modularity, similarity-based modularity, clustering score, and enrichment score. Figure shows the same …
Seven integrated PPI networks yielded by using the k-votes method
Ĝ1 (k=1): The network is constructed by including all interactions from the seven PPI databases, which is equivalent to the traditional union approach to building a PPI network. Modularity shows a downtrend over ε and does not reach an optimal value at any ε (Figure ). Here, an optimal value for any of the four quality measures means a maximum attained at a non-edge-case ε; values of ε close to 0 or 1 are not considered because they yield only trivial modules that consist of either all vertices or very few vertices. Similarity-based modularity attains an optimal value at ε=0.5, which demonstrates superior performance over modularity. With regard to the biological significance tests, both clustering score and normalized enrichment show an uptrend over ε and do not converge to an optimal value. We therefore conclude that network Ĝ1 (k=1) does not constitute a robust network with reasonable biological significance. One likely reason is false positives: because this network contains every interaction proposed by any one of the seven databases, an interaction wrongly identified by even a single database becomes a false positive and degrades the network’s robustness.
Ĝ2 (k=2): The network comprises interactions that are present in at least two PPI databases. As with Ĝ1 (k=1), modularity could not be optimized for any ε value (Figure ). We obtained an optimal similarity-based modularity at ε=0.3, which again demonstrates superior performance over modularity. In contrast to Ĝ1 (k=1), both the clustering score and the normalized enrichment have a clear maximum, at ε=0.59 and ε=0.74, respectively. Therefore, the network Ĝ2 (k=2) is both statistically significant and biologically meaningful.
Ĝ3, Ĝ4, and Ĝ5 (k=3, 4, 5): For the three networks constructed by using k=3, 4, and 5, respectively, we observed optimality in both statistical clustering quality measures, modularity and similarity-based modularity (Figure ). Interestingly, both measures were optimized at the same ε value. However, there is no biological optimality in terms of either clustering score or enrichment. The networks are therefore statistically significant but not biologically meaningful, and we rule them out as comprehensive networks. One factor that could contribute to their poor biological significance is the low coverage of interactions, which results from the high number of votes (k) required for consensus.
Ĝ6 and Ĝ7 (k=6, 7): For the networks constructed by using k=6 and 7, respectively, the significance tests show flat results over every ε value, which indicates that neither network has statistical or biological significance (Figure ). The main reason is the sparsity of interactions among proteins: most of the proteins and their interactions are not present in these networks.
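The notion of optimality used for the seven networks above (a maximum that is not attained at an edge-case ε) can be sketched as a simple check on a quality-measure curve. The edge margin of 0.05 is an assumption made only for illustration, since the text does not state how close to 0 or 1 an ε must be to count as an edge case; the function name and the synthetic curves are likewise hypothetical.

```python
def interior_optimum(eps_values, scores, margin=0.05):
    """Return (eps, score) for the maximum if it lies away from both ends.

    Returns None when the maximum sits within `margin` of the first or last
    epsilon, which covers monotone trends and flat curves (for a flat curve
    the first index is taken as the maximum).
    Assumption: `margin` is an illustrative cut-off, not a value from the study.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    eps = eps_values[best]
    lower, upper = eps_values[0] + margin, eps_values[-1] - margin
    if lower <= eps <= upper:
        return eps, scores[best]
    return None

eps_values = [i / 100 for i in range(101)]

# Synthetic curves mimicking the qualitative shapes discussed in the text.
downtrend = [1 - e for e in eps_values]            # no optimum
uptrend = list(eps_values)                         # no optimum
flat = [0.2] * len(eps_values)                     # no optimum
peaked = [1 - (e - 0.5) ** 2 for e in eps_values]  # optimum at 0.5

for name, curve in [("downtrend", downtrend), ("uptrend", uptrend),
                    ("flat", flat), ("peaked", peaked)]:
    print(name, interior_optimum(eps_values, curve))
```

A check of this kind only formalizes the visual reading of the curves; noisy measures may need smoothing before the maximum is located.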
The number of nodes (proteins) and edges (interactions), together with the presence of optimality for each quality measure, are summarized in Table . Based on these results, we concluded that network Ĝ2, established by using k=2 in the k-votes method, is the only one of the seven networks that is both statistically significant and biologically meaningful. For this study, therefore, the best integration strategy is a consensus of at least two votes from the committee of seven PPI databases. On the other hand, relative to Ĝ1, the number of edges (interactions) in Ĝ2 drops by approximately 73%, from 132,603 to 36,086. Ĝ1 may therefore be preferred if coverage of possible protein-protein interactions is more important for the biological study and false positive associations are not a major concern. The substantial decrease in interaction coverage also indicates how rarely the original seven PPI databases agree on protein-protein interactions. Hence, there is a trade-off between the coverage and the reliability of protein-protein interactions, and the optimal integrated network is a balance that depends on the focus of the study.
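The reported drop in coverage can be checked directly from the two edge counts quoted above. The variable names are illustrative only.

```python
# Edge counts quoted in the text for the k=1 and k=2 networks.
edges_g1 = 132_603
edges_g2 = 36_086

# Fraction of k=1 interactions lost, and retained, when requiring two votes.
drop = (edges_g1 - edges_g2) / edges_g1
retained = edges_g2 / edges_g1

print(f"dropped:  {drop:.1%}")
print(f"retained: {retained:.1%}")
```

Roughly one interaction in four from the union network is confirmed by a second database, which is the basis for the coverage versus reliability trade-off.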
Presence of Optimal Quality Measures
From a biological perspective, functional modules with high statistical significance reflect a biological (disease) phenotype. We applied the parameter ε=0.59, which achieved the maximal clustering score for the network constructed using k=2. Ninety-seven of the 158 modules were found to be statistically significant by SCAN at an α level of 0.05. Table lists the top ten modules with significant biological enrichment of KEGG pathways, ranked by clustering score. Proteins with similar biological functions can be successfully clustered together by applying SCAN to the network constructed using k=2; in fact, six of the top ten modules (1, 2, 4, 5, 6, and 8) have perfect purity for the KEGG pathway represented.
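One way to read the purity of a module is as the fraction of its annotated proteins that belong to the module's most common pathway. The sketch below follows that reading, which is an assumption because the text does not define purity formally; the function name, protein IDs, and pathway labels are hypothetical placeholders.

```python
from collections import Counter

def module_purity(module, pathways_of):
    """Return (dominant_pathway, purity) for one module.

    module: iterable of protein IDs.
    pathways_of: dict mapping a protein ID to a set of pathway labels.
    Assumption: purity is the share of annotated members that carry the
    most frequent pathway label; unannotated proteins are ignored.
    """
    annotated = [p for p in module if pathways_of.get(p)]
    if not annotated:
        return None, 0.0
    counts = Counter(path for p in annotated for path in pathways_of[p])
    pathway, hits = counts.most_common(1)[0]
    return pathway, hits / len(annotated)

# Hypothetical annotations; "A" and "B" stand in for KEGG pathway labels.
pathways_of = {
    101: {"A"},
    102: {"A", "B"},
    103: {"A"},
    104: {"B"},
}

pure_module = [101, 102, 103]   # every annotated member is in pathway A
mixed_module = [101, 104, 105]  # 105 has no annotation and is skipped

print(module_purity(pure_module, pathways_of))
print(module_purity(mixed_module, pathways_of)[1])
```

Under this definition a purity of 1.0 corresponds to the perfect purity reported for six of the top ten modules.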
Top ten modules with significant biological enrichment in KEGG