Genomic duplication and subsequent functional diversification of the resultant paralogs are a major driving force of genomic evolution. Paralogs are enriched in multicellular species and often display complementary expression patterns. This study built upon previous discoveries by other investigators (12), providing evidence that the paralog count distribution in a genome follows a power-law relationship, $P(K) \propto K^{-\alpha}$, and that the value of the parameter α can be used to gauge paralog abundance. The study examined the fluctuation of the value of α among the proteomes of individual cell types and the whole proteomes of unicellular and multicellular species. A quantitative relationship among paralog enrichment, paralog expression pattern diversification and multicellularity was uncovered. To our knowledge, this represents the first quantitative theoretical insight into the role of paralog enrichment and expression pattern diversification in multicellularity, a fundamental phenomenon in biology.
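As a rough illustration of how α summarizes paralog abundance, the sketch below estimates α as the negative slope of log P(K) against log K by ordinary least squares. This is not necessarily the fitting procedure used in this study; the function name `estimate_alpha` and the input format (one positive paralog count K per gene) are assumptions made for the example.

```python
import math
from collections import Counter


def estimate_alpha(paralog_counts):
    """Estimate alpha in P(K) ~ K^(-alpha) from per-gene paralog counts.

    paralog_counts: list of positive integers, one K per gene (hypothetical input format).
    Uses a plain least-squares fit of log P(K) on log K; counts of zero are
    skipped because log(0) is undefined.
    """
    counts = Counter(k for k in paralog_counts if k > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct K values to fit a slope")
    total = sum(counts.values())
    ks = sorted(counts)
    xs = [math.log(k) for k in ks]
    ys = [math.log(counts[k] / total) for k in ks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; alpha is its negation.
    return -sxy / sxx
```

A smaller returned value means that P(K) falls off more slowly with K, i.e. that large paralog families are relatively more common. Least squares on binned log-log data is only a first approximation; a maximum-likelihood fit would be a reasonable alternative.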
Multicellularity imposes a greater burden on the genetic makeup of an organism: meeting the demand of a much wider spectrum of functionality across different developmental stages and differentiated cell types. This study provides direct and quantitative evidence that creating a larger paralog repository with diversified expression patterns is a major evolutionary mechanism to meet this enhanced demand. Even though paralogs are more abundant in a multicellular proteome, the size of the paralog repository in a specific cell type is comparable to that in a unicellular proteome. Proteins of larger paralog families display higher fluctuation in their expression levels across the set of cell types examined in this study. Whether, and to what extent, different sets of paralog families are used in different developmental stages and differentiated cell types remains to be investigated.
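One simple way to quantify expression fluctuation across cell types is the coefficient of variation. The sketch below is only an illustration under that assumption; the measure actually used in this study is not restated here, and the function names and input dictionaries are hypothetical.

```python
import statistics


def expression_cv(levels):
    """Coefficient of variation (population std / mean) of one protein across cell types."""
    mean = statistics.fmean(levels)
    if mean == 0:
        # Never expressed: treated as no fluctuation (a convention of this sketch).
        return 0.0
    return statistics.pstdev(levels) / mean


def mean_cv_by_family_size(expression, family_size):
    """Average CV of proteins, grouped by the size of their paralog family.

    expression: dict protein -> list of expression levels, one per cell type
    family_size: dict protein -> paralog family size
    """
    grouped = {}
    for protein, levels in expression.items():
        grouped.setdefault(family_size[protein], []).append(expression_cv(levels))
    return {size: statistics.fmean(cvs) for size, cvs in sorted(grouped.items())}
```

If larger families fluctuate more, the returned averages should tend to increase with family size.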
While a powerful source of functional innovation in biological evolution, genomic duplication can also have deleterious effects by breaking the balance between duplicated and non-duplicated genes. Essential cellular machineries require a stoichiometric balance among their components. For example, protein complex formation depends on a specific ratio among subunits of the complex (30). Moreover, core functions such as cell growth require a balance among the sets of biochemical pathways involved (31). This evolutionary constraint on gene dosage is captured in the ‘gene balance hypothesis’ (30). Thus, whole-genome duplication is better tolerated than non-whole-genome duplication, since it does not break the gene dosage balance (31). Paralogs in S. cerevisiae quickly diverge to circumvent this evolutionary constraint, in that their biochemical specificity (interaction partners) in the protein–protein interaction networks and their regulatory control change dramatically (31).
There is one additional layer of functional diversification in multicellular species: diversification of cell distribution patterns. The evolutionary pressure is to create complementary expression patterns among paralogous proteins. Many paralogous proteins do not coexist in the same cell. They can therefore preserve their biochemical specificity, e.g. interacting with the same set of proteins, without breaking the gene dosage balance. The gene dosage constraint is thus lessened, explaining the higher retention rate of duplicate genes observed in multicellular genomes (5). Consequently, a larger repository of paralogs is maintained in multicellular species.
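Whether two paralogs are complementary or overlapping in expression can be scored in many ways. As a minimal sketch, the code below uses the Jaccard overlap of the sets of cell types in which each paralog is detected; the threshold, the function names and the per-cell-type dictionaries are assumptions for illustration, not the procedure used in this study.

```python
def expressed_in(levels, threshold=0.0):
    """Set of cell types in which a protein's level exceeds the threshold."""
    return {cell for cell, level in levels.items() if level > threshold}


def expression_overlap(levels_a, levels_b, threshold=0.0):
    """Jaccard overlap of the cell types expressing two paralogs.

    0.0 means fully complementary (never in the same cell type);
    1.0 means expressed in exactly the same cell types.
    """
    cells_a = expressed_in(levels_a, threshold)
    cells_b = expressed_in(levels_b, threshold)
    union = cells_a | cells_b
    if not union:
        return 0.0  # neither paralog is detected anywhere
    return len(cells_a & cells_b) / len(union)
```

Under the argument above, pairs scoring near 0 would be the ones able to keep the same interaction partners without disturbing dosage balance.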
Additionally, the birth, death and innovation model (BDIM), a mathematical model of the birth-and-death theory, was developed to quantitatively explain the power-law distributions of protein domain counts in a proteome (11). We believe it provides a framework to quantitatively interpret the observed pattern of α-values in the whole proteomes of S. cerevisiae and C. elegans and in the proteomes of specific C. elegans cell types. As discussed earlier, a lower value of α in C. elegans indicates that P(K) decreases at a slower pace as K increases, and therefore implies higher paralog abundance. This intuitive interpretation is consistent with BDIM. To exhibit power-law behavior, it assumes that the gene duplication rate (D) and the gene loss rate (L) are functions of the paralog count K, with parameters λ, a and b. The model then predicts a power-law distribution, $P(K) \propto K^{-(1+b-a)}$. The value of α for this distribution, $1 + b - a$, is therefore determined by the gene duplication and loss rates; lower α-values imply higher values of the duplication rate constant, a, and thus evolutionary environments more accommodating to gene duplication events (11). Therefore, eukaryotic genomes have higher paralog abundance than bacterial genomes (14), as the eukaryotic cellular environment is more permissive for gene duplication, allowing duplicate genes to be partitioned to different cellular compartments to bypass the dosage evolutionary constraint. This also explains the lower α-values and higher paralog abundance in multicellular species such as C. elegans, in which duplicate genes can potentially overcome the dosage evolutionary constraint through expression in different cell types. For genes expressed in the same cells, however, this evolutionary mechanism does not apply. Thus, α-values for specific cells are larger. Moreover, individual C. elegans cell types have a similar architecture and operation, and hence a similar cellular environment for genomic evolution, to S. cerevisiae, a fruitful model organism for the study of multicellular species. Therefore, it is understandable that specific C. elegans cells have α-values comparable to that of S. cerevisiae.
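The text above does not spell out the rate expressions for D and L. As a hedged illustration only, the derivation below assumes the linear form in which BDIM is commonly stated, with both rates proportional to a shifted paralog count and sharing the rate constant λ. That functional form is an assumption of this sketch; only the resulting exponent, 1 + b − a, is taken from the text.

```latex
\begin{align}
D(K) &= \lambda\,(K + a), \qquad L(K) = \lambda\,(K + b)
  && \text{(assumed linear rates)} \\
P(K)\,D(K) &= P(K+1)\,L(K+1)
  && \text{(steady-state balance)} \\
\frac{P(K+1)}{P(K)} &= \frac{K + a}{K + 1 + b} \\
P(K) &\propto \frac{\Gamma(K + a)}{\Gamma(K + 1 + b)}
  \;\sim\; K^{-(1 + b - a)} \quad (K \to \infty),
  \qquad \alpha = 1 + b - a .
\end{align}
```

For a fixed b, a larger a gives a smaller α and hence a heavier tail, which matches the reading that lower α-values reflect an environment more permissive to gene duplication.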
How the biochemical specificities of close paralogs diverge is an active research area. It is important for understanding the evolution of biochemical networks, as the networks emerge and grow through gene duplication (node addition) and subsequent divergence (rewiring) (35). It is also an important topic in biomedical research. Drugs often interact with close paralogs of the intended target protein, causing adverse side effects. A general practice is to identify sequence segments that are conserved among orthologs but diverge among paralogs (37–39). The findings reported here can potentially benefit studies of paralog diversification in multicellular proteomes. It is expected that paralogs with overlapping expression patterns tend to diverge in their biochemical specificities, whereas paralogs with complementary expression patterns remain conserved because the gene dosage evolutionary constraint on them is lessened. We are currently identifying sequence segments that are conserved among paralogs with complementary expression patterns but diverge among those with overlapping patterns.
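The idea of segments that are conserved among orthologs but diverge among paralogs can be made concrete at the level of single alignment columns. The sketch below is a deliberately strict illustration, assuming a pre-computed multiple alignment split into the ortholog groups of two paralogs; it is not the method of the cited studies, and the function names are hypothetical.

```python
def invariant_residue(column):
    """Return the residue if all sequences share one non-gap residue in this column, else None."""
    residues = set(column)
    if len(residues) == 1 and "-" not in residues:
        return next(iter(residues))
    return None


def diverged_positions(orthologs_a, orthologs_b):
    """0-based alignment columns invariant within each ortholog group but different between groups.

    orthologs_a, orthologs_b: lists of aligned sequences (equal length, '-' for gaps),
    one list per paralog, each holding that paralog's orthologs from several species.
    """
    length = len(orthologs_a[0])
    if any(len(seq) != length for seq in orthologs_a + orthologs_b):
        raise ValueError("all sequences must come from one alignment and have equal length")
    positions = []
    for i in range(length):
        res_a = invariant_residue([seq[i] for seq in orthologs_a])
        res_b = invariant_residue([seq[i] for seq in orthologs_b])
        if res_a is not None and res_b is not None and res_a != res_b:
            positions.append(i)
    return positions
```

Runs of adjacent reported columns would correspond to candidate segments. A practical analysis would likely relax strict invariance to a similarity score.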
Our results suggest that biochemical network analysis in multicellular species is more challenging than current practice assumes. Most network models are generic: they are constructed without considering whether two proteins are expressed in the same cells. The cDNA expression libraries used in high-throughput protein–protein interaction detection efforts, such as yeast two-hybrid, were constructed without discriminating whether two proteins are expressed together. However, diversification of expression patterns, as our study suggests, cannot be ignored. It is better to construct tissue- or cell-specific network models to guide basic biological and biomedical research.
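A first step toward such cell-specific models is to filter a generic interaction list by co-expression. The sketch below shows one minimal way to do this, assuming a list of protein pairs and, for each cell type, the set of proteins detected there; the data structures and the function name are illustrative assumptions.

```python
def cell_specific_networks(edges, expressed):
    """Split a generic interaction list into one network per cell type.

    edges: list of (protein_a, protein_b) pairs from a generic network
    expressed: dict cell_type -> set of proteins expressed in that cell type
    Returns dict cell_type -> list of edges whose two partners are both
    expressed in that cell type.
    """
    networks = {}
    for cell, proteins in expressed.items():
        networks[cell] = [(a, b) for a, b in edges if a in proteins and b in proteins]
    return networks
```

An interaction between two paralog-specific partners that are never co-expressed would drop out of every cell-specific network, even though it appears in the generic model.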