To further facilitate analysis with our family data, Pseudofam provides key statistics, such as the degree of pseudogenization and pseudogene-to-gene ratio, for each family both online and in the datasets for download. It also provides a tool to correlate different family parameters between species. To identify outlier families that have an unusual degree of pseudogenization, Pseudofam calculates the enrichment of parent proteins in each family and uses the hypergeometric distribution to calculate
the P-value, viz:
$$\Pr(K = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$
This formula calculates the probability Pr(K = k) of having the observed number of parent proteins k for a given family with n proteins under the hypergeometric distribution. The computation requires the total number of proteins N used for identifying the pseudogenes and the corresponding number of parent proteins m. The P-value for a positive enrichment is Pr(K ≥ k) and for a negative enrichment is Pr(K ≤ k). This parent protein approach is preferred over a random sampling method for calculating the enrichment of pseudogenes because it is more computationally efficient and less sensitive to changes in the pseudogene identification algorithm or its parameters, which may cause the number of pseudogenes identified to fluctuate. The following sections show a brief analysis based on the key statistics provided by Pseudofam.
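The enrichment calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Pseudofam implementation: the function names are hypothetical, and the totals N and m in the usage lines are made-up values chosen only to show the call.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """Pr(K = k): k parent proteins in a family of n proteins,
    given m parent proteins among N proteins in total."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n - (N - m)) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

def enrichment_p_values(k, N, m, n):
    """Return (Pr(K >= k), Pr(K <= k)): the P-values for positive
    and negative enrichment of parent proteins, respectively."""
    lo = max(0, n - (N - m))
    hi = min(m, n)
    upper = sum(hypergeom_pmf(i, N, m, n) for i in range(k, hi + 1))
    lower = sum(hypergeom_pmf(i, N, m, n) for i in range(lo, k + 1))
    return upper, lower

# Usage: a family of n = 22 proteins, k = 18 of which are parent proteins.
# N and m below are hypothetical totals, not values from the study.
p_pos, p_neg = enrichment_p_values(k=18, N=20000, m=2000, n=22)
print(f"positive enrichment P = {p_pos:.3g}, negative enrichment P = {p_neg:.3g}")
```

Summing exact binomial terms keeps the sketch free of external dependencies; for large N a statistics library's hypergeometric survival function would likely be the more practical choice.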
Degree of pseudogenization
Table 1 shows the numbers of protein and pseudogene families in different species and their degree of pseudogenization. Among the species in our study, mammals have a higher percentage of families containing pseudogenes (an average of 50%) than nonmammals (an average of 22%). For instance, human has 3486 protein families, of which 1790 (51%) are found to have pseudogenes. In contrast, Drosophila has 2620 protein families, but only 201 (8%) are found to have pseudogenes. Looking at the families individually shows that certain families have a high degree of pseudogenization, while others have no pseudogenes at all. For example, in the reverse transcriptase (RNA-dependent DNA polymerase) family, 18 out of 22 proteins (82%) have associated pseudogenes. In contrast, the bestrophin protein family, which has 71 proteins, has not been found to have any pseudogenes.
Table 1. Numbers of protein and pseudogene families in different species out of 9318 PfamA families
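The percentages quoted in this section follow from simple ratios of counts. The short sketch below reproduces them, assuming that the degree of pseudogenization is the fraction of units (families or proteins) that have at least one associated pseudogene; the function name is hypothetical.

```python
def degree_of_pseudogenization(with_pseudogenes, total):
    """Fraction of units (families or proteins) that have pseudogenes."""
    return with_pseudogenes / total

# Counts quoted in the text of this section.
examples = {
    "human protein families": (1790, 3486),
    "Drosophila protein families": (201, 2620),
    "reverse transcriptase family proteins": (18, 22),
}

for label, (hits, total) in examples.items():
    pct = 100 * degree_of_pseudogenization(hits, total)
    print(f"{label}: {hits}/{total} = {pct:.0f}%")
```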
Correlation of family sizes across species
Since the mammalian genomes have a substantial number of pseudogene families, they enable a more accurate statistical analysis of the correlation between genes and pseudogenes. Table 2 shows the Spearman correlation of family size between the five mammalian genomes in our study. Protein family size is markedly more strongly correlated among species (~0.81) than pseudogene family size (~0.63). The correlation of pseudogene family size also decreases as the evolutionary distance between the species increases. For example, human has a correlation of 0.89 with chimpanzee, but only around 0.58 with dog, mouse and rat. Similarly, mouse has a correlation of 0.67 with rat, but only around 0.58 with human, chimpanzee and dog. This supports the theory that pseudogenes in general are evolving under little or no selection pressure relative to functional genes.
Table 2. Spearman's rank correlation of protein family sizes (the upper right) and pseudogene family sizes (the lower left) between different species
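As an illustration of how such a correlation can be obtained, the sketch below computes Spearman's rank correlation as the Pearson correlation of rank vectors, with tied values given their average rank. It is a minimal sketch, not the code behind Table 2: the function names and the family sizes in the usage lines are hypothetical, and it assumes that both lists enumerate the same families in the same order.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical pseudogene family sizes for the same six families in two species.
species_a = [12, 0, 3, 45, 7, 1]
species_b = [10, 1, 0, 30, 9, 1]
print(f"Spearman rho = {spearman(species_a, species_b):.2f}")
```

Working on ranks rather than raw sizes keeps a few very large families from dominating the correlation, which is one reason a rank-based measure suits heavily skewed family size distributions.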
Extreme families
The enrichment results (see Supplementary Table S1) show that families with housekeeping proteins, such as the GAPDH protein (an NAD-binding enzyme involved in glycolysis and gluconeogenesis) and the ribosomal protein RPL7A (involved in mRNA-directed protein synthesis in all organisms) (14), have significantly more parent proteins than others. To investigate whether proteins with housekeeping functions tend to have more pseudogenes than those with nonhousekeeping functions, we downloaded a total of 575 human housekeeping genes derived from gene expression profiling (23, 24). We selected all 197 pseudogene families that contain both housekeeping and nonhousekeeping genes, and compared the pseudogene-to-gene ratio between these two types of genes using a Wilcoxon signed rank test. We found that the pseudogene-to-gene ratio for housekeeping genes is significantly higher (P-value < 0.04) than for nonhousekeeping genes in such pseudogene families, especially for processed pseudogenes (P-value < 0.01). Gonçalves et al. (25) have also reported previously that housekeeping genes generally have more processed pseudogenes. This could be explained by the relatively constant expression level of housekeeping genes, which boosts their chances of being retrotranscribed.
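The paired comparison described above can be sketched as follows. This is a minimal illustration of a one-sided Wilcoxon signed rank test using the normal approximation, with zero differences dropped and no tie or continuity correction; it is not the analysis code used in the study. The function names are hypothetical and the ratios in the usage lines are made-up values, one pair per family.

```python
from math import erfc, sqrt

def _average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def wilcoxon_signed_rank(x, y):
    """One-sided paired test of whether x tends to exceed y.
    Returns (W_plus, p) from the normal approximation."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    ranks = _average_ranks([abs(d) for d in diffs])
    # W_plus: sum of the ranks that belong to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 0.5 * erfc(z / sqrt(2))  # upper-tail probability of the normal
    return w_plus, p

# Hypothetical per-family pseudogene-to-gene ratios (one pair per family).
housekeeping = [2.0, 1.5, 0.8, 3.1, 1.2, 0.9, 2.4, 1.1]
nonhousekeeping = [1.1, 1.6, 0.5, 1.9, 0.7, 1.0, 1.3, 0.6]
w, p = wilcoxon_signed_rank(housekeeping, nonhousekeeping)
print(f"W+ = {w}, one-sided P = {p:.3f}")
```

A signed rank test fits this design because each family contributes one matched pair of ratios, so differences between families cancel out of the comparison. For small numbers of families an exact P-value would be preferable to the normal approximation used here.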