We used an integrated network data set that contains nearly half of all human proteins (Bossi and Lehner 2009). The Mendelian and complex disease genes were retrieved from hOMIM (Blekhman et al. 2008) and GAD (Becker et al. 2004), respectively. These two types of disease genes were investigated separately because they show distinct properties in many respects (Blekhman et al. 2008; Cai et al. 2009). In addition, we obtained genes identified in GWAS of human disease (Hindorff et al. 2009). In total, 647 hOMIM, 1,110 GAD, and 412 GWAS genes can be mapped on the PPI network (Materials and Methods). However, the three sets are not mutually exclusive. For instance, 331 genes are shared between hOMIM and GAD, whereas 109 genes are shared between GAD and GWAS (supplementary fig. S3, Supplementary Material online). To study each type of disease gene independently, we removed hOMIM genes from the GAD gene set and removed both hOMIM and GAD genes from the GWAS gene set. No genes were removed from hOMIM because all of them were manually curated and are free of associations with complex phenotypes (Blekhman et al. 2008). The results reported in this paper are based on analysis of three nonoverlapping sets of 647 Mendelian disease genes, 779 complex disease genes, and 287 GWAS genes.
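The filtering order described above amounts to successive set differences. The following is a minimal sketch of that logic; the gene identifiers are made up for illustration and the function name is hypothetical, not part of the original analysis.

```python
def make_nonoverlapping(homim, gad, gwas):
    """Return three disjoint gene sets using the priority hOMIM > GAD > GWAS."""
    mendelian = set(homim)                        # hOMIM is kept intact
    complex_genes = set(gad) - mendelian          # drop hOMIM genes from GAD
    gwas_only = set(gwas) - mendelian - set(gad)  # drop hOMIM and GAD genes from GWAS
    return mendelian, complex_genes, gwas_only

# Hypothetical identifiers, for illustration only
homim = {"G1", "G2", "G3"}
gad = {"G2", "G3", "G4", "G5"}
gwas = {"G3", "G5", "G6"}

mendelian, complex_genes, gwas_only = make_nonoverlapping(homim, gad, gwas)
print(sorted(mendelian), sorted(complex_genes), sorted(gwas_only))
```

With the counts reported above, removing the 331 genes shared with hOMIM from the 1,110 GAD genes leaves the 779 complex disease genes.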
Characteristic Network Properties of Mendelian and Complex Disease Genes
We calculated degree (k), betweenness centrality (CBtw), current information flow (CCif), bridging centrality (CBdg), and clustering coefficient (CClu) for each applicable protein in the complete interaction network (Materials and Methods), and compared the mean and variance of these metrics between disease genes and nondisease genes. The average degree (k) of Mendelian disease genes is not different from that of nondisease genes, and that of complex disease genes is only marginally significantly higher than that of nondisease genes. This result suggests that Mendelian and complex disease genes are not hub genes, which is consistent with results of previous studies (Goh et al. 2007; Feldman et al. 2008). Mendelian and complex disease genes have significantly higher CBtw, suggesting that these disease genes tend to occupy network positions that are of global importance in communications between protein pairs. At the same time, they have significantly lower CClu, suggesting that the number of connections among the neighboring proteins of disease genes is unusually low. Interestingly, the variance of k and CClu of Mendelian and complex disease genes is also unusually small, suggesting consistency in network properties of disease genes. Finally, GWAS genes do not show any statistically significant differences from nondisease genes. Note that this might be partially due to the small sample size of GWAS genes. We will return to this question later in the paper.
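To make two of these metrics concrete, the sketch below computes degree and the local clustering coefficient on a toy undirected network, using the standard definition (the fraction of neighbor pairs that are themselves linked). The edge list and helper names are hypothetical, and the paper's own implementation (Materials and Methods) may differ in detail.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-interactions."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree(adj, node):
    """Number of distinct interaction partners."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of neighbor pairs that interact; None when k < 2 (undefined)."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return None
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

# Toy network: "B" links four otherwise unconnected proteins,
# whereas "N" sits inside a fully connected clique.
edges = [("B", "P1"), ("B", "P2"), ("B", "P3"), ("B", "P4"),
         ("N", "Q1"), ("N", "Q2"), ("N", "Q3"),
         ("Q1", "Q2"), ("Q1", "Q3"), ("Q2", "Q3")]
adj = build_adjacency(edges)
print(degree(adj, "B"), clustering_coefficient(adj, "B"))
print(degree(adj, "N"), clustering_coefficient(adj, "N"))
```

Returning `None` for proteins with fewer than two partners is one way to reflect that the clustering coefficient is only defined for "applicable" proteins.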
It seems that Mendelian and complex disease genes (but not GWAS genes) have distinct and consistent network properties; however, this is difficult to interpret for three reasons. First, the network metrics are strongly correlated with each other and thus it is not entirely clear which network properties tend to be truly distinct for disease genes. Second, evolutionary ages of Mendelian and complex disease genes differ from those of nondisease genes (Domazet-Loso and Tautz 2008; Cai et al. 2009) and genes of different ages tend to have different network properties (see below). Thus, disease genes might have distinct network properties simply due to their different age. Finally, it is possible that disease genes have been studied more thoroughly compared with other genes and thus might have a disproportionately high number of detected PPIs. Below we 1) reduce the dimensionality of the network metrics using principal component analysis (PCA), 2) show that the network properties of disease genes are distinct over and above what is expected of genes of their age, and 3) provide evidence that the inspection bias cannot account for the observed results.
Table 2 Correlation Coefficients between Variables: Degree (k), Betweenness Centrality (Btw), Current Information Flow (Cif), Bridging Centrality (Bdg), Clustering Coefficient (Clu), Nonsynonymous Substitution Rate (dN), Synonymous Substitution Rate (dS), ...
Defining Two Key PCs for Network Properties of Disease Genes
To understand the relationships among the five network measures, we conducted PCA. All variables that show deviation from normality (i.e., all except CClu) were log transformed and then scaled to zero mean and unit variance. The result of PCA shows that the first two PCs explain 73.4% of the total variation (40.7% and 32.7% for the first and second PC, respectively).
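The preprocessing step described above (log transformation followed by scaling to zero mean and unit variance) can be sketched as follows. This is an illustration under stated assumptions: a base-10 logarithm, the population standard deviation, and strictly positive input values. The text does not say how zero-valued metrics were handled, so that case is not covered here, and the function name is hypothetical.

```python
import math
from statistics import mean, pstdev

def log_standardize(values, log_transform=True):
    """Optionally log10-transform, then scale to zero mean and unit variance."""
    xs = [math.log10(v) for v in values] if log_transform else list(values)
    mu = mean(xs)
    sd = pstdev(xs)  # population standard deviation (an assumption)
    return [(x - mu) / sd for x in xs]

# Hypothetical degree values spanning two orders of magnitude
degrees = [1, 10, 100]
z_degree = log_standardize(degrees)

# CClu would be scaled without the log transform
clu = [0.0, 0.2, 0.4]
z_clu = log_standardize(clu, log_transform=False)
print(z_degree, z_clu)
```

The standardized columns would then be passed to a PCA routine. The reported 73.4% of variation explained is the sum of the shares of the first two PCs (40.7% + 32.7%).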
The magnitude and sign of each variable's contribution to the first two PCs are shown in a PC biplot. Each variable is represented by a line from the origin to a point with coordinates (c1, c2), where c1 and c2 are the correlations between the variable and the first and second axis, respectively. A longer line indicates a stronger correlation between the variable and the PCs (the biplot axes). The first PC (PC 1) correlates most strongly with three variables, k, CBtw, and CCif; the second PC (PC 2) correlates strongly with the other two variables, CBdg and CClu.
FIG. 1. PCA of network properties of human genes. (A) Biplot showing five variables (represented by arrows): degree (k), betweenness centrality (Btw), current information flow (Cif), bridging centrality (Bdg), and clustering coefficient (Clu). (B,C,D,E) Heat ...
PCA was conducted with all (disease and nondisease) genes. Nondisease and disease genes were highlighted separately in heat maps to show their density and distribution in the PC 1–2 space. Compared with nondisease genes, Mendelian and complex disease genes occupy a much narrower region. Distributions of Mendelian and complex disease genes are more biased (41%, 27%, 20%, and 12% in quadrants I–IV for Mendelian disease genes; 49%, 26%, 18%, and 7% for complex disease genes) than those of nondisease genes, which are more evenly distributed across the four quadrants (29%, 23%, 29%, and 19%). The centers of the distributions are shifted toward the first quadrant, with proportionally more Mendelian and complex disease genes having positive PC 1 and PC 2 (G-test, P < 0.001 for the comparison of Mendelian and complex disease genes with nondisease genes). Note that complex disease genes have a distribution more biased toward the first quadrant than Mendelian genes (G-test, P < 0.001). Because PC 1 correlates strongly and positively with degree (k) and PC 2 correlates strongly and negatively with clustering (CClu), the above results can be stated differently: Mendelian or complex disease genes tend to be highly connected (high k) to genes that are themselves not well connected to one another (low clustering, CClu). This property can be thought of as the “brokering” value of a protein: a protein with a high brokering value connects many other proteins that would not be connected otherwise. For examples of the connection patterns of two broker genes (SUMO4 and PRKCZ) and two nonbroker genes with similar values of k (PCBP1 and BMS1), see supplementary fig. S4.
The distribution of GWAS genes across the four quadrants is less biased (37%, 24%, 27%, and 12%) than that of other disease genes and is only marginally enriched in the direction of the first quadrant (P = 0.016) compared with nondisease genes. Their distribution is not different from that of Mendelian genes (P = 0.32); however, it is significantly different from that of complex disease genes (P < 0.01). This indicates that the different network properties of GWAS genes compared with complex disease genes are not merely a result of the small number of GWAS genes and lack of power.
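The quadrant comparisons above rest on a G-test of independence between gene class and quadrant. The sketch below shows how such a test could be computed for a 2 × 4 table. The counts are illustrative values per 100 genes that mirror the Mendelian and nondisease percentages, not the actual gene counts, so the resulting P value is not comparable with the reported ones. The function names are hypothetical, and the closed-form tail probability applies only to 3 degrees of freedom.

```python
import math

def g_statistic(table):
    """G statistic and degrees of freedom for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute zero
                expected = row_totals[i] * col_totals[j] / total
                g += 2.0 * observed * math.log(observed / expected)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return g, dof

def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Illustrative counts per 100 genes in quadrants I-IV (not the real counts)
disease = [41, 27, 20, 12]
nondisease = [29, 23, 29, 19]

g, dof = g_statistic([disease, nondisease])
p_value = chi2_sf_3dof(g)
print(f"G = {g:.2f}, dof = {dof}, P = {p_value:.3f}")
```

Because G scales with sample size, the same percentages yield far smaller P values when the table holds hundreds or thousands of genes.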
We further placed disease and nondisease genes on the scatter plot of k and CClu. It is clear that most of the highly connected Mendelian and complex disease genes (with log10(k) ≥ 1.5) have a low CClu (≤0.2), which is not the case for nondisease genes with similar values of k. GWAS genes do not show this distinct feature. We split the scatter plot area ad hoc (based on visual inspection) into three regions defined by log10(k) = 1.5 (or k = 31) and CClu = 0.2. Region I contains genes with relatively low k, whereas regions II and III contain genes with high k. The difference between regions II and III is that region III contains genes with lower CClu. Region III represents a characteristic “high brokering value” zone, in which both Mendelian and complex disease genes are present much more often. For instance, only 2.4% and 1.3% of all genes in region II are Mendelian and complex disease genes, respectively, whereas these proportions rise to 10.4% and 16.6% in region III (P < 0.001 for all comparisons, G-test). Again, the pattern is much less pronounced, albeit marginally significant, for GWAS genes (III [3.7%] vs. II [1.3%], P = 0.008, G-test).
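The three-region split can be expressed as a simple classification rule. The sketch below uses the two cutoffs given above; how genes lying exactly on a boundary are assigned is an assumption, and the function names and toy data are hypothetical.

```python
import math

LOG10_K_CUTOFF = 1.5   # cutoff on log10(k) from the text
CLU_CUTOFF = 0.2       # cutoff on the clustering coefficient from the text

def assign_region(k, clu):
    """Region I: low k; region II: high k, higher clustering; region III: high k, low clustering."""
    if math.log10(k) < LOG10_K_CUTOFF:
        return "I"
    return "III" if clu <= CLU_CUTOFF else "II"

def disease_fraction_by_region(genes):
    """genes: iterable of (k, clu, is_disease); returns {region: fraction of disease genes}."""
    totals, hits = {}, {}
    for k, clu, is_disease in genes:
        region = assign_region(k, clu)
        totals[region] = totals.get(region, 0) + 1
        hits[region] = hits.get(region, 0) + int(is_disease)
    return {region: hits[region] / totals[region] for region in totals}

# Toy data: (degree, clustering coefficient, is disease gene)
genes = [(5, 0.40, False), (8, 0.10, True),
         (40, 0.50, False), (60, 0.35, False),
         (50, 0.05, True), (120, 0.10, False)]
fractions = disease_fraction_by_region(genes)
print(fractions)
```

Comparing the disease fraction in region III with that in region II is the contrast reported above for each class of disease gene.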
FIG. 2. Characteristic changes of clustering coefficient (Clu) as a function of degree (k) for disease genes. Red crosses are data points of disease genes, (A) Mendelian, (B) complex, and (C) GWAS. Red circles are means of Clu for data points in the bins with ...
Network Properties as a Function of Gene Age
To investigate whether genes of different ages tend to have different network properties and whether this can explain differences in network properties of disease genes, we grouped all genes into different age groups. Gene age was estimated based on the concept of phylostrata (Domazet-Loso and Tautz 2008), assuming Dollo parsimony (Le Quesne 1974; Farris 1977). Six age groups were defined (labeled 1–6, where group 1 includes the youngest genes and group 6 the oldest genes) and each protein was assigned to one of these age groups (Materials and Methods). Disease and nondisease genes are not distributed equally among the age groups. Mendelian disease genes are overrepresented in the old group, whereas complex disease genes are overrepresented in the middle age groups (Domazet-Loso and Tautz 2008; Cai et al. 2009).
We next examined the changes of PCs as a function of the evolutionary age of the gene. For nondisease genes, average PC 1 increases monotonically with gene age (Spearman's ρ = 0.104, P = 4.44 × 10^-16), indicating that older nondisease genes have higher levels of k and CCif. This is not unexpected because proteins of older genes had more time to acquire interactions with other proteins. In contrast, Mendelian and GWAS genes show no correlation between PC 1 and evolutionary age (both P > 0.001). For complex disease genes, the correlation is positive and marginally significant (Spearman's ρ = 0.113, P = 5.61 × 10^-4). All disease genes have a relatively high level of PC 1 compared with nondisease genes of the same age. PC 2 shows no correlation with gene age for any of the gene groups. We also show the changes of individual network metrics as a function of gene age in the supplementary information (supplementary fig. S5, Supplementary Material online).
FIG. 3. PCs as a function of gene age. (A) PC 1, nondisease versus disease genes; (B) PC 2, nondisease versus disease genes. Types of disease genes include Mendelian, complex, and GWAS genes, at left, middle, and right panels, respectively. Box plots of PCs for ...
Correlations between Evolutionary Age of Genes (age) and Variable x: the First PC (PC 1), Degree (k), Betweenness Centrality (Btw), Current Information Flow (Cif), the Second PC (PC 2), Bridging Centrality (Bdg), and Clustering Coefficient (Clu)
The lack of correlation between PC 1 and gene age is one of the characteristic patterns for all three types of disease genes. Given that the numbers of disease genes (especially those in the young age groups) are small, it is possible that the lack of correlation between PC 1 and gene age in disease genes is a product of the small sample size. To rule out this possibility, we randomly sampled nondisease genes in each age bin such that the number of genes in the sampled subset was equal to the number of Mendelian, complex, or GWAS genes in the corresponding age bin, respectively. We repeated this subsampling process to create 10,000 replicates of nondisease gene sets and computed the Spearman's correlation coefficients between PC 1 and the age of the gene for these subsets. The observed correlation coefficients obtained for disease genes fall at the very end of the lower tail of the resampled ρ distribution (empirical P < 0.0001, P = 6.67 × 10^-4, and P = 3.33 × 10^-4 for Mendelian, complex, and GWAS genes, respectively). Thus, the lack of correlation between PC 1 and gene age cannot be attributed to the small sample size of disease gene sets.
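The age-stratified subsampling can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: sampling is done without replacement within each age bin, Spearman's ρ is computed with average ranks for ties, the empirical P value is the lower-tail fraction of replicates, and all data are synthetic. Function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def age_matched_null(nondisease, disease_ages, n_rep, rng):
    """nondisease: list of (age_group, pc1). Returns rho for each age-matched replicate."""
    by_age = defaultdict(list)
    for age, pc1 in nondisease:
        by_age[age].append(pc1)
    quota = Counter(disease_ages)  # number of disease genes per age bin
    rhos = []
    for _ in range(n_rep):
        ages, values = [], []
        for age, n in quota.items():
            values.extend(rng.sample(by_age[age], n))  # without replacement
            ages.extend([age] * n)
        rhos.append(spearman_rho(ages, values))
    return rhos

def empirical_p_lower(observed, null):
    """Fraction of null replicates at or below the observed value."""
    return sum(r <= observed for r in null) / len(null)

# Synthetic example: PC 1 of nondisease genes rises slightly with age
rng = random.Random(0)
nondisease = [(age, rng.gauss(0.1 * age, 1.0)) for age in range(1, 7) for _ in range(50)]
disease_ages = [1] * 3 + [3] * 10 + [6] * 20
null_rhos = age_matched_null(nondisease, disease_ages, n_rep=200, rng=rng)
observed_rho = 0.0  # hypothetical observed value for a disease gene set
print(empirical_p_lower(observed_rho, null_rhos))
```

The resolution of the empirical P value is limited by the number of replicates: with n replicates, the smallest nonzero value is 1/n.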
Because Mendelian and complex disease genes have distinct age distributions (Domazet-Loso and Tautz 2008; Cai et al. 2009; supplementary fig. S8, Supplementary Material online), it is possible that their distinct network properties are simply a function of their age. To rule out this possibility, we randomly sampled subsets of nondisease genes with the same size and age distribution as the corresponding disease genes. This procedure allowed us to control for the different sizes and age distributions of the gene groups. We used Mendelian and complex disease genes, respectively, as subsampling targets and computed the means and variances of PC 1 and PC 2 for the subsampled gene subsets (shown as scatter crosses). The subsampling procedure was repeated 10,000 times to obtain the 99.9% confidence ellipses. Observed data points for nondisease genes are within the confidence ellipses. The variance of PC 1 for Mendelian disease genes is lower, and the mean of PC 1 for complex disease genes is higher, than expected by chance. Mendelian and complex disease genes have significantly higher mean and lower variance of PC 2. As expected based on the above results, the GWAS genes do not deviate significantly from the subsampled nondisease genes. Note that the GWAS genes have the same age distribution as the nondisease genes (supplementary fig. S4, Supplementary Material online), and thus the GWAS genes were compared with nondisease genes without any subsampling.
FIG. 4. Variance and mean of PCs of disease genes. The open circle and square indicate observed data points of variance against mean of the two PCs for Mendelian and complex disease genes, respectively. The crosses are data points of mean and variance for 10,000 ...
Impact of the Inspection Bias
Last, we address the problem of inspection bias, that is, the impact of more intense investigation of known genes, especially disease genes, on the number of detected PPIs. Inspection bias alone should not dramatically affect the signals we have detected because Mendelian and complex disease genes do not have a higher average degree than nondisease genes, which is opposite to the expectation under inspection bias. Nevertheless, we conducted additional tests to control for other, less obvious potential effects of this bias.
First, we applied a simple assay to show that disease genes have indeed been studied more intensively than other genes. We separated human genes into named and unnamed genes according to whether they have HGNC (HUGO Gene Nomenclature Committee)-approved names. Genes under intensive experimental study tend to have unique and meaningful names; genes that have undergone fewer studies may not have such names. In our gene set, there are 447 unnamed genes, including 419 nondisease genes, 6 Mendelian disease genes, 19 complex disease genes, and 3 GWAS genes (supplementary table S1, Supplementary Material online). Proportionally, disease genes are more likely to be named than nondisease genes (P < 0.0003 for all three types of disease genes, G-test).
We then filtered out all unnamed genes and repeated the analysis with only named genes. In this way, we decreased the impact of inspection bias due to nondisease genes being disproportionately poorly studied. We found that all results in the above sections hold without any qualitative changes (data not shown). Second, we randomly sampled nondisease genes to generate multiple gene sets with the same number of genes and the same distribution of k as in the corresponding disease gene set. For each type of disease gene, we constructed 10,000 such replicates and obtained the distributions of CBtw, CCif, CBdg, and CClu. We found that, except for CBdg, the network measures for Mendelian and complex disease genes fall far from the center of the corresponding distributions, with significantly higher CBtw and CCif and significantly lower CClu. Thus, controlling for k does not affect the detection of characteristic network properties of disease genes. This confirms that genes with the same level of k still differ in other aspects depending on whether they are disease genes or not.
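The degree-controlled resampling can be illustrated as follows. The text does not specify how the distribution of k was matched, so this sketch makes two assumptions: nondisease genes are matched on exact degree, and they are drawn with replacement within each degree class (binning k would be a practical alternative when exact matches are scarce). The function names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def degree_matched_means(nondisease, disease_degrees, n_rep, rng):
    """nondisease: list of (k, metric). Returns the mean metric of each degree-matched replicate."""
    by_k = defaultdict(list)
    for k, metric in nondisease:
        by_k[k].append(metric)
    quota = Counter(disease_degrees)  # number of disease genes at each degree
    for k in quota:
        if not by_k[k]:
            raise ValueError(f"no nondisease gene with degree {k}")
    means = []
    for _ in range(n_rep):
        draw = []
        for k, n in quota.items():
            draw.extend(rng.choices(by_k[k], k=n))  # with replacement (assumption)
        means.append(sum(draw) / len(draw))
    return means

def tail_fraction(observed, null, higher=True):
    """Fraction of replicates at least as extreme as the observed value in one direction."""
    if higher:
        return sum(v >= observed for v in null) / len(null)
    return sum(v <= observed for v in null) / len(null)

# Synthetic example: (degree, clustering coefficient) for nondisease genes
rng = random.Random(1)
nondisease = [(k, rng.uniform(0.1, 0.6)) for k in (2, 5, 40) for _ in range(30)]
disease_degrees = [2] * 4 + [5] * 6 + [40] * 10
null_means = degree_matched_means(nondisease, disease_degrees, n_rep=500, rng=rng)
observed_mean_clu = 0.05  # hypothetical mean CClu of a disease gene set
print(tail_fraction(observed_mean_clu, null_means, higher=False))
```

An observed mean that lies in the far tail of the degree-matched distribution indicates a difference that degree alone does not explain.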
FIG. 5. Distributions of network metrics for subsampled nondisease genes. Network metrics include betweenness centrality (Btw), current information flow (Cif), bridging centrality (Bdg), and clustering coefficient (Clu). Values of network metrics for (A) Mendelian ...