Gene proximity implying phenotype similarity
The proposed approach for inferring disease genes is based on the assumption that phenotypically similar diseases are caused by functionally related genes that are usually proximal in a PPI network. Moreover, we assume the existence of a linear relationship between similarities of diseases and proximities of genes that are associated with the diseases. In order to validate this assumption, we compile from HPRD 2,466 associations between 1,590 diseases and 1,440 genes, calculate Bayes factors for these disease genes, and run a Wilcoxon signed rank test to check whether the resulting Bayes factors are significantly greater than 1 (the random case). Results show that the p-value is smaller than 2.2E-16, indicating that the similarities of diseases have a linear relationship with the proximities of disease genes.
To further substantiate this point, we perform a series of permutations towards disease-disease, disease-gene, and gene-gene relationships. First, we break disease-disease relationship by permuting the phenotype similarity profile. Second, we break disease-gene relationship by two methods: (1) permuting disease-gene associations and (2) replacing disease genes in known disease-gene associations with randomly selected genes. Third, we break gene-gene relationship by permuting connections in the underlying protein-protein interaction network while keeping node degrees and recalculating the diffusion kernel. For each of the above permutations, we calculate Bayes factors of disease genes and present the results in Figure , from which we can clearly see that the median of Bayes factors based on the original data is much higher than those using permuted relationships.
Figure 1 Bayes factors of the original and permuted data. “original”, “permuted PPS”, “permuted seed”, “random seed”, “permuted PPI” denote the results obtained using original data, (more ...)
We also perform similar studies using data extracted from BioGRID, BIND, IntAct, and MINT, and we obtain similar results as HPRD. From these comprehensive studies, we conclude that similarities between diseases can be explained using network proximities of genes that are associated with the diseases. In other words, gene proximity implies phenotype similarity.
Prioritization with individual PPI networks
We design a series of large scale leave-one-out cross-validation experiments to show the validity and effectiveness of the Bayesian regression approach on individual PPI networks. As described in the method section, in each run of the validation procedure, we prioritize candidate genes according to the Bayes factors against two control sets, random controls and linkage intervals, with the performance being evaluated by mean rank ratios and AUC scores. Results are shown in Table and Figure .
Performance of the Bayesian regression approach on individual data sources.
ROC curves of the five PPI networks on random controls (A) and linkage intervals (B). AUC scores for the validation of random controls are average of 10 independent runs.
From Table , we see that the mean rank ratios obtained using the five PPI networks are all below 0.17, and the AUC scores are all above 0.83, suggesting the effectiveness of the Bayesian regression approach. The best performance are obtained using HPRD, with which the mean rank ratios against random controls and linkage intervals are 0.1349 and 0.1353, respectively, and the AUC scores are 0.8738 and 0.8720, respectively. From Figure , we see that the ROC curves of the HPRD data set are above those of the other data sets, suggesting that the performance on HPRD is superior over that of the others. To understand this observation, we perform one-sided Wilcoxon rank sum test against the hypothesis that Bayes factors of disease genes for the HPRD data set are greater than those for the other data sets. Results show that Bayes factors of disease genes for the HPRD data set are indeed greater than those of BioGRID (p-value=4.7E-2), BIND (p-value=2.6E-3), IntAct (p-value=1.9E-5), and MINT (p-value=2.5E-5). We therefore conjecture that the performance of the proposed method depends on how well the linear relationship between disease similarity and gene proximity exhibits.
To further demonstrate the effectiveness of the proposed approach, we repeat the same leave-one-out cross-validation experiments using the existing CIPHER approach [25
], which relies on Pearson's correlation coefficient between the disease similarity vector and the gene proximity vector to prioritize candidate genes. We compare the results of these two approaches in Figure , from which we see clearly that the Bayesian regression approach in general achieves lower mean rank ratios and higher AUC scores in all the five data sets. For example, when performing cross-validation for linkage intervals using the HPRD data set, the CIPHER approach achieves a mean rank ratio of 0.1746 and an AUC score of 0.8313, whereas the Bayesian approach achieves a mean rank ratio of 0.1353 and an AUC score of 0.8720, suggesting an obvious improvement over the CIPHER approach. Note that the CIPHER method calculates gene proximity matrix by applying a Gaussian kernel to the shortest path distance matrix of the underlying network. We also try to use the diffusion kernel matrix as the gene proximity matrix and find the difference is not obvious. These results strongly suggest that the Bayesian regression approach is superior over the CIPHER approach in prioritizing candidate genes.
Figure 3 Comparison with the CIPHER approach and the ordinary regression method. Subplots A and C illustrate mean rank ratios and AUC scores against random controls, respectively. Subplots B and D illustrate mean rank ratios and AUC scores against linkage intervals, (more ...)
It is also of interest to compare the Bayesian approach with the ordinary linear regression method. For this purpose, we implement another method that relies on R2, the coefficient of determination, to prioritize candidate genes. We repeat the leave-one-out cross-validation experiments for this method and present the results in Figure , from which we see clearly that the Bayesian regression approach in general achieves higher performance than the ordinary regression method in terms of both mean rank ratios and AUC scores in all the five data sets.
Prioritization with the integration of multiple PPI networks
The coverage of a single PPI network is in general not high. Even the largest HPRD network covers only 9,470 genes, less than half of known human protein-coding genes. We therefore propose to use the Bayesian regression approach to integrate multiple PPI networks, for the purpose of improving the coverage.
By taking the union of genes on individual PPI networks, we obtain 15,644 human genes. Focusing on these genes, we extract from BioMart 2,708 associations between 1,752 diseases and 1,621 genes. With this data set, we repeat the leave-one-out cross-validation experiments using individual gene proximity profiles and present the results in Figure . Note that in this procedure, we set the proximity of two genes to zero (minimum proximity) if any of the two genes is absent from the underlying network. We observe that the performance on individual proximity profiles drops dramatically in this larger data set (in comparison with Table ), simply because each PPI network covers only a fraction of genes, and the scheme of handling missing data (setting to zero) yield small Bayes factors for genes that are absent from the network.
Figure 4 Performance of the integration method. Subplot A illustrates mean rank ratios for the integration method and individual PPI networks. Subplot B illustrates AUC scores for the integration method and individual PPI networks. Results for the validation of (more ...)
We then use the Bayesian regression approach to integrate all the five PPI networks by extending the design matrix to include gene proximities from multiple profiles. We repeat the leave-one-out cross-validation experiments and present the results in Figure , from which we observe clearly the better performance of the proposed approach with the integrated use of multiple PPI networks. The mean rank ratios for random controls and linkage intervals are 0.1385 and 0.1380, respectively, with AUC scores being 0.8702 and 0.8692, respectively. In contrast, combining all genes and interactions in individual PPI networks together to form a large network (15,644 nodes and 77,332 edges) and then applying CIPHER only yields mean rank ratios 0.1850 and 0.1876 (AUC scores 0.8230 and 0.8180) for random controls and linkage intervals, respectively. Directly applying the Bayesian regression approach to the combined network yields mean rank ratios 0.1462 and 0.1469 (AUC scores 0.8624 and 0.8601) for random controls and linkage intervals, respectively. Furthermore, we also extract from the combined network interactions that exist in at least two individual PPI networks and obtain a high confident network (8,463 nodes and 28,617 edges). Focusing on genes in this network, we extract from BioMart 2,219 associations between 1,441 diseases and 1,271 genes. Directly applying the Bayesian regression approach to this high confident network yields mean rank ratios 0.1373 and 0.1380 (AUC scores 0.8717 and 0.8694) for random controls and linkage intervals, respectively. Though these results are slightly better than those of the Bayesian integration method, the coverage of this high confident network is apparently much lower than that of the Bayesian integration approach. From these results, we conclude that the Bayesian regression approach is effective in integrating multiple PPI networks for prioritizing disease genes.
It is of interest to see how much individual PPI networks contribute to the integration approach. For this purpose, we repeat the cross-validation experiments by integrating every four data sources and see how the exclusion of the remaining network affects the prioritization results. We find that all data sources have positive contributions to the integration approach, because mean rank ratios increase and AUC scores drop when any of the five data source is excluded. Results also suggest the order of the data sources according to their contributions (differences in cross-validation results) as follows: HPRD > BIND > BioGRID > IntAct > MINT. It is not surprising that HPRD contributes most to the integration method, because HPRD has the largest coverage and highest performance in previous validations. It is also not surprising that MINT has the least contribution, and IntAct has the second least contribution, because both coverage and performance of these two data sources are not high. However, it is not obvious that BIND contributes more than BioGRID, because individually, BioGRID has higher coverage and performance than BIND. To understand this observation, we analyze the relation between the five data sources and find that Bayes factors calculated from HPRD and BioGRID are highly correlated (Pearson's correlation coefficient = 0.9770). Therefore, the removal of BioGRID, when HPRD is presented, will not significantly affect the performance of the integration method.
Finally, we study whether the integration approach is biased toward well characterized genes, that is, genes appearing in more data sources tend to have higher ranks. We group all genes in a validation procedure into 10 categories according their ranks such that the i-th category contains genes ranked at ((i - 1)×10%,i×10%]. For each category, we group genes according to the number of PPI networks containing the genes such that the j-th category contains genes appearing in exact j networks. We then perform pair-wise Pearson's chi-squared tests against the alternative hypothesis that frequencies of genes appearing in different number of PPI networks are different across different rank categories. Results show that for the cross-validation experiment on linkage intervals, the minimum p-value produced by the series of chi-square tests is 0.0909, suggesting that we cannot reject the null hypothesis. Similar results are obtained for the cross-validation experiment on random controls. We therefore conclude that the integration approach is not biased toward well characterized genes. In other words, genes appearing in more data sources do not tend to have higher ranks.