We investigated two sets of human Mendelian disease genes. First, we used the collection of 1,637 human genes involved in diseases from the OMIM Morbid Map (http://www.ncbi.nlm.nih.gov/Omim/getmorbid.cgi). Second, we investigated 803 genes from the hOMIM data set, a manually curated collection of Mendelian disease genes obtained from Blekhman et al. (2008). The hOMIM gene set is less redundant and free of complex phenotypic entries. The two disease gene sets overlap substantially: 781 genes are present in both sets. Because the two data sets generated qualitatively similar results, we report here only the results derived from the hOMIM data set.
For complex disease genes, we investigated 1,347 genes extracted from the GAD database (Becker et al. 2004). The majority of genes collected in GAD are associated with complex diseases. In the study of Blekhman et al. (2008), the list of manually curated complex disease genes (supplementary table S5 of Blekhman et al. [2008]) contains 53 genes; only three of them (namely LTA4H, PALB2, and BLMH) were missing from the GAD gene set.
Non-disease genes were defined as genes that do not appear in any of the disease gene sets (including OMIM Morbid, hOMIM, and the complex disease gene sets). The non-disease gene set contained 13,864 genes (82.9% of all genes), indicating that 17.1% of human genes are known to be associated with either Mendelian or complex diseases.
Distribution of Disease Genes in Age Groups
We estimated the age of all 16,727 genes included in our analysis and split them into nine bins according to their ages, where age group 1 contained the youngest genes and age group 9 contained the oldest genes. The age was estimated using Dollo parsimony (Le Quesne 1974; Farris 1977) by finding the most highly divergent lineage in which an ortholog (using the Phylopat pipeline; Hulsen et al. 2006) or a homolog (using BlastP) of a particular human gene could be found (see Materials and Methods for details).
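The age assignment described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the outgroup names, their ordering, and the `hits` dictionary are hypothetical, and the sketch assumes that ortholog or homolog detection has already been run against each outgroup.

```python
# Hypothetical outgroups, ordered from least to most divergent from human.
OUTGROUPS = ["chimpanzee", "mouse", "chicken", "zebrafish", "fruit fly", "yeast"]

def gene_age(outgroups_with_homolog):
    """Return the 1-based rank of the most divergent outgroup with a homolog.

    Under Dollo parsimony a gene is gained only once, so its age is set by
    the deepest lineage in which it is detected; absences in less divergent
    lineages are interpreted as secondary losses.
    Returns 0 when no homolog is detected in any outgroup.
    """
    ranks = [i + 1 for i, name in enumerate(OUTGROUPS)
             if name in outgroups_with_homolog]
    return max(ranks, default=0)

# Made-up detection results for three genes.
hits = {
    "geneA": {"chimpanzee"},
    "geneB": {"chimpanzee", "zebrafish"},  # missing in mouse/chicken: treated as loss
    "geneC": set(),
}
ages = {gene: gene_age(found) for gene, found in hits.items()}
```

Larger ranks correspond to older genes; the ranks could then be grouped into the nine age bins.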
Two binning approaches, equally populated bins and equally spaced bins, were used (Materials and Methods). Figure 2A illustrates the results obtained for equally populated bins (i.e., having the same number of genes in each of the nine age bins). The bin for age group 1 contained only 10 Mendelian disease genes (0.54%); this frequency is significantly lower than that of any other age group, all of which contained at least 58 disease genes (P < 0.001, χ² test). Older groups (e.g., groups ≥3) contained more Mendelian disease genes: 3.12–7.10% of the genes in these groups were Mendelian disease genes. This pattern was also observed when the genes were grouped using equally spaced bins; the two binning approaches produced qualitatively similar results.
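As a rough check of the comparison between age group 1 and the least populated of the other groups, a χ² test on a 2 × 2 table can be sketched as below. The bin size of 1,858 genes is an assumption (roughly 16,727 genes divided by nine bins), and whether the original analysis used pairwise 2 × 2 tables without continuity correction is also an assumption.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

BIN_SIZE = 1858  # assumption: approximately 16,727 genes / 9 bins

# Rows: age group 1 vs. another group; columns: disease vs. non-disease genes.
stat, p = chi2_2x2(10, BIN_SIZE - 10, 58, BIN_SIZE - 58)
```

Under these assumptions the P value falls below the 0.001 threshold quoted in the text.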
FIG. 2. Frequencies of Mendelian disease genes (A) and complex disease genes (B) as functions of their age. Genes are partitioned into nine equally populated bins as well as (I) young-, (II) middle-, and (III) old-aged groups (Materials and Methods).
To simplify the patterns, we pooled all genes into three groups (young-, middle-, and old-aged) instead of nine. The probability of containing DNA variants associated with Mendelian diseases is significantly lower in the young gene group than in the middle-aged and the old gene groups (both P < 0.001, χ² test) (fig. 2A). This pattern is consistent with the finding of Domazet-Loso and Tautz (2008). We further computed the fractions of complex disease genes in different age groups (fig. 2B). The frequency of complex disease genes in younger groups (groups 1–3) is also significantly smaller than that in middle- and old-aged groups (P < 0.001, χ² test); however, unlike Mendelian disease genes, complex disease genes are more likely to be in the middle-aged than in the old-aged groups (P < 0.001, χ² test).
We also obtained the age of genes from the study of Domazet-Loso and Tautz (2008). They estimated the age of genes using the genes' phylostratum (Domazet-Loso et al. 2007), an approach that focuses on homologs and determines the age of the gene family by strict parsimony, assuming that a gene family can be lost but cannot reevolve independently in different lineages or be horizontally transferred. The phylostratum estimates of gene age match our estimates well (Spearman's ρ = 0.40, P ≪ 0.001). All patterns obtained with the phylostratum age estimate are indeed similar to those obtained with our age estimate (data not shown).
In addition to these two age estimates based on strict parsimony, we calculated the PGL measure for all genes (see Materials and Methods for details). PGL captures the patchiness of phylogenetic distributions for genes that have the same age. The steady state model of gene gain and loss, assuming that genes lost have the same rate distribution as genes gained, predicts that different gene age classes have specific PGLs (Wolf et al. 2009). Indeed, we found that both our gene age and the phylostratum gene age are significantly correlated with PGL (Spearman's ρ = −0.55 and −0.26, respectively, both P ≪ 0.001). We also examined the relation between the propensity of a gene to be lost (see Materials and Methods for details) and the likelihood of the gene being involved in Mendelian or complex diseases. We split genes into small, medium, and large PGL bins. In these bins (with equal numbers of genes), 5.7%, 5.2%, and 3.1% of genes are Mendelian disease genes and 5.5%, 8.0%, and 3.7% of genes are complex disease genes. The pattern resembles the one obtained with the young-, middle-, and old-aged groups. Next, we tested whether disease genes have higher or lower PGL values than non-disease genes. Mendelian disease genes tend to have lower PGL values than non-disease genes (median 0.1671 vs. 0.1690 and mean 0.1756 vs. 0.2142, P = 2.3 × 10⁻¹⁵, Mann–Whitney–Wilcoxon [MWW] test). Complex disease genes also tend to have lower PGL values than non-disease genes, but the difference is less significant (median 0.1671 vs. 0.1690 and mean 0.1910 vs. 0.2142, P = 1.1 × 10⁻⁴, MWW test).
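The equal-count binning by PGL and the per-bin disease gene fractions can be sketched as follows. The PGL values and disease labels are made-up toy data, and the rank-based assignment (ties broken by input order) is one possible convention, not necessarily the one used in the study.

```python
def equal_count_bins(values, n_bins):
    """Assign each value to one of n_bins rank-based bins of (near-)equal size.
    Returns a list of bin indices, where 0 holds the smallest values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

def disease_fraction_per_bin(values, is_disease, n_bins=3):
    """Fraction of disease genes in each equal-count bin.
    Assumes at least n_bins genes, so that no bin is empty."""
    bins = equal_count_bins(values, n_bins)
    totals = [0] * n_bins
    hits = [0] * n_bins
    for b, flag in zip(bins, is_disease):
        totals[b] += 1
        hits[b] += int(flag)
    return [h / t for h, t in zip(hits, totals)]

# Toy data: nine genes with made-up PGL values and disease labels.
pgl = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.41, 0.50, 0.63]
is_disease = [True, False, True, False, True, False, False, False, False]
fractions = disease_fraction_per_bin(pgl, is_disease)  # small, medium, large PGL
```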
Selective Pressure on Disease Genes
Comparison of Variables between Mendelian, Complex, and Non-disease Genes
FIG. 3. Ka, Ks, and Ka/Ks as functions of the age of genes. Mendelian disease genes (A) and complex disease genes (B) are partitioned into nine equally populated bins as well as (I) young-, (II) middle-, and (III) old-aged groups.
However, such an association was not observed in Mendelian disease genes. Ka and Ka/Ks values for Mendelian disease genes do not decrease with gene age (for Ka/Ks, Spearman's ρ = −0.0104, P = 0.783). In fact, there was no difference in Ka or Ka/Ks values among age groups for Mendelian disease genes (P = 0.045, Kruskal–Wallis [KW] test). These results suggest that Mendelian disease genes are under strong purifying selection irrespective of gene age.
Correlations between Various Variables and the Gene Age
Given that the number of Mendelian disease genes in young age bins is very small, it is possible that the lack of correlation between Ka or Ka/Ks and gene age is due to the small sample size of disease genes. To confirm that this was not the case, we randomly sampled subsets of non-disease genes in each of the nine age bins such that the number of genes in each subset was equal to the number of Mendelian disease genes in that age bin. We repeated this subsampling process to create 10,000 replicates of non-disease gene sets and computed the Spearman's correlation coefficients between Ka, Ks, or Ka/Ks and the age of the gene for these subsets. The distributions of the correlation coefficients obtained for these subsets and the observed correlation coefficients for disease genes are plotted in fig. S2. The observed correlation coefficient between Ks values and the age of the gene falls well within the distribution of replicate correlation coefficients (fig. S2B). In contrast, the observed correlation coefficients between Ka (or Ka/Ks) and gene age for disease genes fall far beyond the upper tail of the resampled distributions (fig. S2A,C) (P < 10⁻⁵), confirming that the difference reported above between disease and non-disease genes is not merely due to the small sample size.
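The subsampling procedure can be sketched as below. This is an illustration under stated assumptions, not the code used in the study: the toy rates, the per-bin disease gene counts, the function names, and the use of 200 rather than 10,000 replicates are all hypothetical, and Spearman's ρ is computed here as the Pearson correlation of average ranks.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def subsampled_rhos(nondisease_by_bin, disease_counts, n_rep, rng):
    """Draw, in every age bin, as many non-disease genes as there are disease
    genes, and return one Spearman rho (rate vs. age bin) per replicate."""
    rhos = []
    for _ in range(n_rep):
        ages, rates = [], []
        for age_bin, k in disease_counts.items():
            rates.extend(rng.sample(nondisease_by_bin[age_bin], k))
            ages.extend([age_bin] * k)
        rhos.append(spearman_rho(ages, rates))
    return rhos

def empirical_p_upper(null_rhos, observed):
    """Fraction of replicate rhos that are at least as large as the observed rho."""
    return sum(r >= observed for r in null_rhos) / len(null_rhos)

rng = random.Random(0)
# Toy non-disease genes whose rate declines with age bin (1 = youngest).
nondisease = {b: [rng.gauss(1.0 - 0.08 * b, 0.1) for _ in range(200)]
              for b in range(1, 10)}
disease_counts = {b: 5 + 10 * b for b in range(1, 10)}  # few disease genes in young bins
null = subsampled_rhos(nondisease, disease_counts, n_rep=200, rng=rng)
p_upper = empirical_p_upper(null, observed=-0.01)
```

An observed coefficient near zero that lies above essentially all replicate coefficients indicates that the flat relationship in disease genes is not an artifact of the per-bin sample sizes.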
This difference seems to be driven mainly by the significantly different Ka (or Ka/Ks) values of Mendelian disease genes and non-disease genes among the young genes. In groups 1–3, the Ka and Ka/Ks values of Mendelian disease genes are significantly lower than those of non-disease genes (both P < 0.001, Kolmogorov–Smirnov [KS] test). Similarly, in group I, the Ka and Ka/Ks values of Mendelian disease genes are almost 3-fold lower than those of non-disease genes (both P < 0.001, KS test). In groups 4–9 (or groups II and III), we did not observe a significant difference in Ka (or Ka/Ks) values between disease and non-disease genes (P > 0.05, KS test).
Unlike Mendelian disease genes, both the Ka and Ka/Ks values of complex disease genes are negatively correlated with the age of genes (Spearman's ρ = −0.120 and −0.123, P < 0.001) in a pattern similar to that of non-disease genes (Spearman's ρ = −0.249 and −0.263, P < 0.001). Repeating the subsampling analysis described above, we confirmed that the scarcity of complex disease genes in each age bin was not the reason that complex disease genes resembled non-disease genes in these patterns (fig. S3). Finally, we found significant differences in both Ka and Ka/Ks values between different age groups for complex disease genes (both P < 0.001, KW test).
Although the changes of Ka and Ka/Ks as a function of gene age are similar for complex disease genes and non-disease genes, the Ka and Ka/Ks values of young complex disease genes are still significantly lower than those of young non-disease genes. For genes in groups 1–3, the Ka and Ka/Ks values of complex disease genes are 1.4- and 1.5-fold lower than those of non-disease genes, respectively (both P < 0.001, KS test). In group I, the Ka and Ka/Ks values of complex disease genes are 1.5- and 1.2-fold lower than those of non-disease genes, respectively; however, the differences are less significant (P = 0.0046 and 0.0485, respectively, KS test), underscoring the relatively weaker purifying selection acting on complex disease genes compared with Mendelian disease genes.
We obtained highly consistent results with PGL as a complementary measure of gene evolutionary age. For non-disease genes, values of PGL are positively correlated with values of Ka and Ka/Ks (Spearman's ρ = 0.155 and 0.167, respectively, P ≪ 0.001 in both cases) but not correlated with values of Ks (Spearman's ρ = 0.021, P = 0.026). This result is consistent with those from previous studies (Krylov et al. 2003; Wolf et al. 2006; Borenstein et al. 2007). In contrast, for Mendelian disease genes, PGL does not correlate with any of the divergence rate measures (P > 0.001, Spearman correlation between PGL and Ka, Ka/Ks, or Ks). For complex disease genes, PGL is marginally significantly positively correlated with Ka and Ka/Ks (Spearman's ρ = 0.114 and 0.139, P = 0.002 and 1.28 × 10⁻⁴, for Ka and Ka/Ks, respectively) but not correlated with Ks (P > 0.001). These results suggest that rapidly evolving genes have a higher propensity to be lost, but the pattern holds only for non-disease genes. The trend is less significant in complex disease genes and disappears completely in Mendelian disease genes.
We used an additional measure of selective pressure based on polymorphism data to confirm the results derived from Ka/Ks. The measure is the ratio of nonsynonymous-to-synonymous polymorphisms (Pa/Ps). The recent accumulation of human genome-wide single nucleotide polymorphism (SNP) data enables the derivation of Pa/Ps (International HapMap Consortium 2003; Bustamante et al. 2005). We found that both Mendelian and complex disease genes have lower values of Pa/Ps than non-disease genes when computed from two SNP data sets, HapMap SNPs (International HapMap Consortium 2003) and Applera SNPs (Bustamante et al. 2005; data not shown). This is an additional line of evidence for strong purifying selection in disease genes (see also Liu et al. 2008). With either divergence or polymorphism information, we find that disease genes tend to be under stronger purifying selection than non-disease genes but only in the young gene categories.
Effects of Inheritance Mode and Gene Function
We divided Mendelian disease genes into dominant disease genes (238 hOMIM genes that are known to have dominant disease-causing mutations) and recessive disease genes (389 genes that are known to have recessive disease-causing mutations) as annotated by Blekhman et al. (2008). Dominant genes have significantly lower values of Ka/Ks than recessive genes (median Ka/Ks 0.216 vs. 0.242; P = 3.673 × 10⁻⁴, KS test). This result is consistent with the results reported in two previous studies (Furney et al. 2006; Blekhman et al. 2008). Neither dominant nor recessive genes show any correlation between Ka/Ks and gene age (fig. S4). Collectively, dominant disease genes are younger than recessive disease genes (fig. S5).
We also examined whether the strong purifying selection acting on young Mendelian disease genes was due to the enrichment of particular biological functions in these genes (Materials and Methods). Compared with the non-disease genes in the same age group, young Mendelian disease genes were significantly enriched for anatomical structure development (GO:0048856, adjusted P = 19 × 10⁻⁵ and 8.31 × 10⁻²⁶ for equally spaced bins and equally populated bins, respectively) and multicellular organismal development (GO:0007275, adjusted P = 8.18 × 10⁻⁵ and 1.57 × 10⁻²⁰ for equally spaced bins and equally populated bins, respectively) genes. In addition to these two terms, some GO terms were identified as significant only when we used equally populated bins. These terms include circulation (GO:0008015), response to stress (GO:0006950), cellular component organization and biogenesis (GO:0016043), response to external stimulus (GO:0009605), coagulation (GO:0050817), cellular developmental process (GO:0048869), as well as other terms. The complete list of enriched terms can be found in supplementary table S1. Among all these GO terms, only one term, nucleic acid binding (GO:0003676), was enriched in non-disease genes.
Effects of Gene Expression
Next, we studied the expression patterns of disease and non-disease genes in relation to gene age. We calculated the average (aveExp), maximum (maxExp), and heterogeneity (hetExp) of gene expression across 54 normal tissues for each human gene. Mendelian disease genes show significantly higher hetExp (P = 0.007) and maxExp (P < 0.001) values than non-disease genes, whereas their aveExp values are similar (P = 0.699; KS test). This result is consistent with the hypothesis that tissue-specific genes are more likely to be involved in human disease than widely expressed genes (Winter et al. 2004; Adie et al. 2005).
FIG. 4. Mean expression level (aveExp), expression heterogeneity (hetExp), and peak expression level (maxExp) as functions of the age of genes. Mendelian disease genes (A) and complex disease genes (B) are partitioned into nine equally populated bins.
FIG. 5. Sequence identity of the closest homolog of genes. Mendelian, complex, and non-disease genes are partitioned into (I) young-, (II) middle-, and (III) old-aged groups. Median values and 95% confidence intervals are plotted.
Furthermore, Mendelian disease genes show similar maxExp values across different age groups (P = 0.699, KW test), whereas maxExp for non-disease genes is positively correlated with the age of genes (Spearman's ρ = 0.114, P < 0.001; KW test, P < 0.001). Non-disease genes in different age groups have different hetExp values (P = 0.000443, KW test), but hetExp values for Mendelian disease genes of different age groups show no significant variation (P = 0.191, KW test). There is no correlation between hetExp and gene age for either Mendelian disease genes or non-disease genes (P = 0.195 and 0.221, respectively, Spearman test).
Similar to Mendelian disease genes, complex disease genes show significantly higher maxExp (P = 1.89 × 10⁻⁹, KS test) and hetExp (P = 1.49 × 10⁻²⁰, KS test) values than non-disease genes and similar aveExp values. Moreover, complex disease genes show the same patterns of expression variables versus gene age as Mendelian disease genes; that is, there is a positive correlation between aveExp and gene age, and there is no significant correlation between either maxExp or hetExp and gene age.
We conducted a survey of the tissue-specific expression patterns of disease versus non-disease genes. The distribution of genes showing peak expression in each of the 54 tissues, and the proportions of Mendelian and complex disease genes among all genes showing peak expression in the corresponding tissues, are given in supplementary figure S6 online. We found that Mendelian disease genes are more likely to be most highly expressed in liver and kidney (P ≪ 0.001 in both cases, Fisher's exact tests with Bonferroni correction) but less likely in testis (P = 6 × 10⁻⁶). Complex disease genes are more likely to be most highly expressed in liver.
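A per-tissue enrichment test of this kind can be sketched as a two-sided Fisher's exact test followed by a Bonferroni correction over the 54 tissues. The counts below are made up, and the layout of the 2 × 2 table (disease status by whether peak expression falls in the tissue) is an assumption about how such a test might be set up.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(k) for k in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p_value * n_tests)

N_TISSUES = 54
# Made-up counts for one tissue.
# Rows: disease / non-disease genes; columns: peak in this tissue / elsewhere.
p_raw = fisher_exact_two_sided(30, 70, 200, 1700)
p_adj = bonferroni(p_raw, N_TISSUES)
```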
In addition, disease genes and non-disease genes show no substantial difference in the correlation between Ka/Ks and gene expression, even after these genes were assigned to young-, middle-, and old-aged groups (figs. S7 and S8).
Effects of Presence of Close Duplicates
It has been hypothesized (Lopez-Bigas and Ouzounis 2004) that proteins with similar paralogs should be less often involved in diseases because the compromised function of such proteins when mutated could be compensated for by their functional paralogs (Frenette et al. 1996; Wagner 2000; Gu 2003; Kamath et al. 2003; Dean et al. 2008; Wagner 2008). Here, we test this hypothesis using our gene sets. We used two definitions for “singleton human genes.” The first considers genes that do not have any sequence homologs identifiable by BlastP searches (see Materials and Methods for the criteria used to define homologs). The second considers genes that are not included in any Ensembl protein family (Enright et al. 2002). Using either of these definitions, Mendelian disease genes were not found to be more likely to be singleton human genes than non-disease genes. This result is consistent with that of Yue and Moult (2006).
We next used a different approach to test the role and magnitude of the contribution of duplicate genes to robustness against deleterious human mutations. We used sequence similarity between paralogs or homologs to quantify the likelihood and magnitude of functional compensation, following Hsiao and Vitkup (2008). For nonsingleton human genes (i.e., those with identified paralogs), the distributions of amino acid sequence identities of the closest homologs are significantly different between disease and non-disease genes. The average identity of the closest homolog is 47.9% for Mendelian disease genes, 48.2% for complex disease genes, and 52.3% for non-disease genes (Mendelian vs. non-disease, P < 0.001; complex vs. non-disease, P 0.001; Mendelian vs. complex, P 0.00132, KS test). This difference between disease genes and non-disease genes seems more substantial and statistically significant for middle-aged genes (fig. 5). The lack of statistical significance for young genes may be attributed to the small number of genes.
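A two-sample KS comparison of closest-homolog identities can be sketched as follows. The identity values are made-up toy data, and the P value uses the asymptotic Kolmogorov approximation, which is only a rough guide for small samples; the exact procedure used in the study is not specified here.

```python
import math
from bisect import bisect_right

def ks_two_sample(x, y):
    """Two-sample Kolmogorov-Smirnov test.
    Returns (D, p): D is the largest gap between the two empirical CDFs and
    p is the asymptotic two-sided P value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    t = d * math.sqrt(n * m / (n + m))
    if t < 0.2:
        # The alternating series below converges slowly near zero,
        # where the P value is indistinguishable from 1.
        return d, 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                 for k in range(1, 101))
    return d, min(1.0, max(0.0, 2.0 * series))

# Toy percent identities of the closest homolog (made up).
disease_identity = [35, 41, 44, 47, 50, 52, 58]
nondisease_identity = [40, 48, 51, 55, 57, 60, 63, 70]
d_stat, p_value = ks_two_sample(disease_identity, nondisease_identity)
```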