As a source of information about duplicated yeast genes we use a dataset of 3909 pairs of paralogous yeast proteins. This set was obtained by running BLAST on all yeast proteins against each other with a conservative E-value cutoff of $10^{-10}$ and keeping only pairs in which the aligned region constituted at least 80% of the length of the longer protein. This prevented the appearance of pairs of multidomain proteins that are paralogous over only one of their domains. We further curated this dataset by removing 72 known [17] transposable elements and all their paralogs (108 proteins altogether). That left us with 2299 paralogous pairs formed by 1596 yeast proteins (about 25% of the genome). These pairs are characterized by a broad and relatively uniform distribution of the percent identity (PID) of amino acid sequences, ranging from 20% to 100% (see Fig. ). The histogram in Fig. is binned at 5% PID (as are the data used to plot Fig. ), and one can see that even the least represented bins contain over 40 paralogous pairs, providing sufficient statistics for our analysis. Our set of all possible pairs of paralogous proteins contains some redundant information, especially for large protein families. Indeed, a family of, say, 4 proteins would contribute (4·3)/2 = 6 paralogous pairs to our analysis, while it contains at most 3 true duplicated pairs. However, when the data describing molecular networks are incomplete and noisy, such redundancy is rather beneficial because it provides better statistics. We have verified that, apart from somewhat larger error bars, all our quantitative findings remained virtually unchanged when we repeated our analysis of upstream regulation in yeast using only 938 pairs of putative duplicated proteins. These pairs were obtained from the full set of 2299 paralogous pairs by a detailed phylogenetic analysis of individual families. It is also worth noting that while the average number $K_s$
of silent substitutions per substitution site in a pair of duplicated genes is commonly used as a proxy for the time elapsed since the duplication event [1], the PID (or $K_a$, the number of non-silent substitutions per site, related to PID via $\mathrm{PID} = 100\,\exp(-2K_a)$) is rather a crude estimate of the extent of their functional similarity. Hence, our analysis emphasizes function-dependent rather than time-dependent divergence between paralogous proteins.
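Two pieces of arithmetic in the paragraph above can be made concrete with a minimal sketch. It assumes that a family of n proteins descends from a single ancestor, so that at most n - 1 duplication events produced it (this generalizes the 4-protein example and is our assumption), and it uses the stated relation PID = 100 exp(-2Ka) together with its algebraic inverse. All function names are hypothetical.

```python
import math


def all_pairs(n: int) -> int:
    """Number of paralogous pairs contributed by a family of n proteins."""
    return n * (n - 1) // 2


def max_duplicated_pairs(n: int) -> int:
    """Upper bound on true duplicated pairs, assuming the family grew
    from a single ancestor by n - 1 duplication events (assumption)."""
    return max(n - 1, 0)


def pid_from_ka(ka: float) -> float:
    """Percent identity from Ka, using PID = 100 * exp(-2 * Ka)."""
    return 100.0 * math.exp(-2.0 * ka)


def ka_from_pid(pid: float) -> float:
    """Inverse of pid_from_ka: Ka = -0.5 * ln(PID / 100)."""
    if not 0.0 < pid <= 100.0:
        raise ValueError("PID must lie in (0, 100]")
    return -0.5 * math.log(pid / 100.0)


if __name__ == "__main__":
    print(all_pairs(4), max_duplicated_pairs(4))
    print(ka_from_pid(20.0), ka_from_pid(100.0))
```

The gap between `all_pairs(n)` and `max_duplicated_pairs(n)` grows quadratically with family size, which is why the redundancy mentioned above matters mostly for large families.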
The histogram of amino acid sequence identities (PID) of 2299 pairs of paralogous yeast proteins used in our study.
The system-wide data describing the transcription regulatory network of yeast were taken from Ref. [2], which reports the so-called "chip-on-chip" study of in vivo binding of 106 transcription factors to the upstream regulatory regions of genes encoding all 6270 yeast proteins. Since the number of transcriptional regulators in this dataset is quite large, the probability that by pure chance the same transcription factor would be incorrectly detected among the upstream regulators of both duplicated genes is small (of the order of 1%). Thus the contribution of false positives in the dataset of Ref. [2] to the regulatory overlap $\Omega_{reg}$ is quite insignificant. This allowed us to use a P-value cutoff equal to $10^{-2}$
(12854 regulations), which is less conservative than the $10^{-3}$ cutoff (4418 regulations) of Lee et al. On the other hand, false positives (if present in the data) could significantly affect the average number of regulatory inputs of individual proteins used to normalize the regulatory overlap in Fig. . However, we found that both the initial drop and the rate of exponential decay of the normalized
regulatory overlap remain virtually unchanged when Fig. is repeated for a range of P-value cutoffs starting from $10^{-2}$ (data not shown). In the same range of P-values the average number of regulations per gene changes six-fold (from 2 to 0.33)! This suggests that false positives are not a significant part of the experimental dataset of Ref. [2], at least for P-value cutoffs up to $10^{-2}$, and validates the robustness of the parameters extracted from Fig. . In the analysis shown in Fig. we dropped 3 paralogous pairs sharing the same intergenic sequence, since by design of the chip-on-chip experiment [2] such pairs would have 100% regulatory overlap. We also checked that Fig. does not change significantly if one limits the analysis to genes without divergent promoters, ensuring that a given intergenic region could regulate only one gene.
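The role of the P-value cutoff in the paragraph above can be illustrated with a minimal sketch. It assumes the binding data are held as a mapping from (transcription factor, gene) to P-value, that a regulation is counted when its P-value does not exceed the cutoff, and that the raw overlap of a pair is the number of shared regulators. The normalization used in the paper is not reproduced here. The type alias, the function names, and the toy data are all hypothetical.

```python
from typing import Dict, Set, Tuple

# (transcription factor, gene) -> binding P-value; hypothetical layout.
Binding = Dict[Tuple[str, str], float]


def regulators(binding: Binding, gene: str, cutoff: float) -> Set[str]:
    """Transcription factors bound upstream of `gene` at the given cutoff."""
    return {tf for (tf, g), p in binding.items() if g == gene and p <= cutoff}


def shared_regulators(binding: Binding, a: str, b: str, cutoff: float) -> int:
    """Raw (unnormalized) regulatory overlap of genes a and b."""
    return len(regulators(binding, a, cutoff) & regulators(binding, b, cutoff))


def mean_regulations_per_gene(binding: Binding, n_genes: int, cutoff: float) -> float:
    """Average number of regulatory inputs per gene at the given cutoff."""
    return sum(1 for p in binding.values() if p <= cutoff) / n_genes


# Toy data, invented purely for illustration.
toy: Binding = {
    ("TF1", "geneA"): 1e-4,
    ("TF2", "geneA"): 5e-3,
    ("TF1", "geneB"): 2e-3,
    ("TF3", "geneB"): 0.2,
}

if __name__ == "__main__":
    for cutoff in (1e-2, 1e-3):
        print(cutoff, shared_regulators(toy, "geneA", "geneB", cutoff))
```

With the counts quoted in the text, 12854 regulations over 6270 genes gives about 2.05 regulations per gene at the $10^{-2}$ cutoff, consistent with the upper end of the six-fold range mentioned above.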
As a source of information about binding partners of yeast proteins we combined the data from two independent high-throughput two-hybrid experiments: the core dataset of Ito et al. (806 interactions among 797 proteins) and the extended Uetz et al. dataset [3], downloaded from the website of this group (1446 interactions among 1340 proteins). The resulting network consists of 1734 proteins joined by 2111 non-redundant interactions. Using this combined dataset we found that even 100% identical proteins share on average only 30% of their binding partners. However, unlike upstream regulation, the set of interaction partners of a protein is fully determined by its amino acid sequence. Therefore, an imperfect overlap between the sets of binding partners of identical proteins has to be attributed to the false positives and false negatives inevitably present in high-throughput two-hybrid experiments. The relatively high rate of false negatives in genome-wide two-hybrid experiments is further corroborated by the fact that the two datasets used in our study, coming from two independent experiments [3], have only 141 interactions in common. The abundance of missing interactions makes normalization of the interaction overlap impractical. For this reason, unlike in Fig. , in Fig. we used the raw (unnormalized) interaction overlap. To make sure that the differences between Figs. and are not caused by differences in normalization, we repeated them using various normalization schemes as well as no normalization at all (data not shown). We found that, apart from the overall scale of the y-axis, changes in normalization do not affect the exponential decay parameters of Figs. , .
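The merging of the two interaction datasets and the raw interaction overlap described above can be sketched as follows. The sketch assumes interactions are undirected, so that bait-prey order is ignored when removing redundancy, and that the raw overlap of two proteins is the number of binding partners they share. Function names and the toy edges are hypothetical.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def undirected(pairs: Iterable[Tuple[str, str]]) -> Set[Edge]:
    """Turn (protein, protein) tuples into order-independent edges."""
    return {frozenset(pair) for pair in pairs}


def partners(edges: Set[Edge], protein: str) -> Set[str]:
    """Binding partners of `protein` (self-interactions are ignored)."""
    found: Set[str] = set()
    for edge in edges:
        if protein in edge:
            found |= edge - {protein}
    return found


def interaction_overlap(edges: Set[Edge], a: str, b: str) -> int:
    """Raw (unnormalized) number of binding partners shared by a and b."""
    return len(partners(edges, a) & partners(edges, b))


# Toy data, invented purely for illustration.
first = undirected([("A", "B"), ("B", "C")])
second = undirected([("C", "B"), ("A", "D"), ("E", "D")])
combined = first | second   # non-redundant union
common = first & second     # interactions seen in both datasets

if __name__ == "__main__":
    print(len(combined), len(common), interaction_overlap(combined, "A", "E"))
```

The same inclusion-exclusion count applies to the numbers in the text: 806 + 1446 - 141 = 2111 non-redundant interactions.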
The system-wide data on the viability of S. cerevisiae null-mutants used in our study were obtained from Ref. [5], in which 1103 essential (non-viable null-mutants) and 4678 non-essential (viable null-mutants) yeast proteins were reported. The lists of viable and non-viable null-mutants as discovered in Ref. [5] were downloaded from the Saccharomyces Genome Database [17].
Our analysis of the protective effects of paralogs in C. elegans is based on the set of 15587 viable and 1170 non-viable (embryonic or larval lethality or sterility) RNAi phenotypes reported in [8]. The information about worm paralogs was obtained from the EuGenes database [18] and consists of 30036 paralogous pairs involving 10071 worm proteins (blastp with a $10^{-30}$ cutoff and no requirement on the length of the aligned region). In Fig. we used the 13884 RNAi phenotypes for which we were able to uniquely map the genepair name to the worm protein name used in EuGenes.
The two-hybrid assay of protein-protein interactions in H. pylori used in Fig. contains 1465 interactions between 732 proteins, while there are only 260 paralogous pairs involving 140 proteins. As in yeast, the set of paralogous pairs was obtained by running BLAST on all protein sequences found in the fully sequenced genome against each other with a conservative E-value cutoff of $10^{-10}$ and keeping only pairs in which the aligned region constituted at least 80% of the length of the longer protein.
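The paralog selection rule used here and for yeast can be sketched as a simple filter over all-against-all BLAST hits. This is a minimal illustration rather than the pipeline actually used: the `Hit` record and its field names are hypothetical, the length of the aligned region is assumed to be available as a single number per hit, and self-hits are assumed to be discarded.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Hit:
    """One all-against-all BLAST hit (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    aligned_length: int
    query_length: int
    subject_length: int


def paralogous_pairs(
    hits: Iterable[Hit],
    max_evalue: float = 1e-10,
    min_coverage: float = 0.8,
) -> Set[FrozenSet[str]]:
    """Keep unordered pairs passing the E-value cutoff whose aligned
    region covers at least `min_coverage` of the longer protein."""
    pairs: Set[FrozenSet[str]] = set()
    for hit in hits:
        if hit.query == hit.subject:
            continue  # a protein is not its own paralog
        longer = max(hit.query_length, hit.subject_length)
        if hit.evalue <= max_evalue and hit.aligned_length >= min_coverage * longer:
            pairs.add(frozenset((hit.query, hit.subject)))
    return pairs


# Toy hits, invented purely for illustration.
toy_hits = [
    Hit("P1", "P2", 1e-50, 90, 100, 100),   # kept
    Hit("P2", "P1", 1e-48, 90, 100, 100),   # same pair, reverse direction
    Hit("P1", "P3", 1e-30, 100, 120, 400),  # shares only one domain
    Hit("P4", "P5", 1e-5, 95, 100, 100),    # E-value too weak
    Hit("P1", "P1", 0.0, 100, 100, 100),    # self-hit
]

if __name__ == "__main__":
    print(paralogous_pairs(toy_hits))
```

Measuring coverage against the longer protein is what rejects the third toy hit: a strong alignment confined to one domain of a 400-residue protein covers only a quarter of its length.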
Finally, our analysis of the interaction overlap between paralogous proteins in D. melanogaster is based on the full dataset of the high-throughput two-hybrid experiment [7]. It consists of 20671 protein-protein physical interactions involving 7002 fly proteins. To generate Fig. we also used the set of 16713 paralogous pairs involving 2827 fly proteins.