(a) The correlation structure
We collected or calculated the values of seven genome-related variables, namely, fitness effect of gene knockout experiments (KE), EL, number of genetic interactions between genes (GI), number of physical interactions between gene products (PPI), NP, sequence ER and PGL for 3912 clusters of orthologous eukaryotic genes (KOGs; Tatusov et al. 2003). The KOGs are a natural framework for this type of analysis because they readily allow for estimation of ER and PGL, and also because data obtained on different model systems can be combined through the knowledge of orthologous relationships, thus increasing the number of genes amenable to analysis (see §2 and electronic supplementary material for details; the values of the seven variables for each KOG are given in the electronic supplementary material).
Of the analysed variables, five characterize an organism's phenotype (KE, EL, GI, PPI, NP) and two reflect aspects of evolution (ER, PGL). Examination of the pairwise correlations between these variables reveals a clear-cut pattern: the phenotypic and evolutionary variables form two distinct classes such that variables within a class tend to be positively correlated whereas variables of different classes are negatively correlated, with most of the correlations being statistically significant (table 1). Inverse trends are detected only for GI, namely, a weak positive GI–ER correlation and weak negative GI–EL and GI–KE correlations. Since the majority of the GI data come from synthetic lethal studies, genes with a non-zero number of GI are likely to be non-essential, which seems to explain the deviation from the general pattern. It is also of note that NP behaves in full concordance with the rest of the phenotypic variables although, a priori, there might be some ambiguity as to whether NP is to be classified as phenotypic or evolutionary.
| Table 1. The correlations between the seven genomic variables. (Asterisks denote the correlations that are significantly different from zero, p<0.05.) |
Importantly, all the previously established positive correlations within the phenotypic and evolutionary classes of variables and the negative correlations between the variables of different classes held true for the analysed dataset (table 1). The only significant difference from the previous results, apart from the addition of new variables, was that, in the earlier work (Krylov et al. 2003, p. 2671), we failed to detect a significant negative correlation between ER and KE (although a marginal trend in this direction was seen), whereas in the present work, such a correlation, weak but significant, was detected (table 1). It appears most likely that the difference is due to the greater sensitivity of the present analysis, thanks to the expanded dataset and the more sophisticated procedure employed for the estimation of ER. Also, similarly to the previous studies, all observed correlations were weak to moderate although most of them reached statistical significance, thanks to the large number of data points analysed (table 1). This persistent pattern of weak (even if significant) correlations emphasizes the need for multivariate analysis in order to elucidate the actual nature of the interplay between the phenotypic and evolutionary variables.
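The point that weak correlations can reach significance with thousands of data points can be illustrated with a short calculation. The sketch below assumes Pearson's r and the standard test statistic t = r·sqrt((n − 2)/(1 − r²)); the text does not state which correlation coefficient or test was used, and the r values are hypothetical. The tail probability is approximated with the normal distribution, which is reasonable only because n is large.

```python
import math

def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and, as an assumption valid for
    large n, replaces the t distribution with the standard normal.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))

n_kogs = 3912  # number of KOGs analysed in the text
for r in (0.02, 0.05, 0.10):  # hypothetical correlation magnitudes
    print(f"r = {r:.2f}  approximate p = {correlation_p_value(r, n_kogs):.2e}")
```

Under these assumptions, a correlation as small as 0.05 passes the 0.05 threshold with 3912 points, whereas the same correlation would be far from significant in a sample of 100.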
A potential concern with respect to the relatively weak correlations described here and elsewhere is the effect of the procedures used to derive the aggregate variables (see §2). To eliminate artefacts caused by the specifics of these procedures, we investigated separately the effects of several alternative approaches, e.g. transforming KE from a binary to a continuous variable or replacing the median over paralogues with the maximum as the measure of EL, PPI and GI (see §2 and supplement 6 of the electronic supplementary material for details), on the structure of the correlation matrix. None of these modifications changed the sign of any of the significant correlations, and in most cases, the changes in the magnitude of the correlations were relatively small (supplement 6 of the electronic supplementary material).
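The kind of robustness check described above, swapping the median over paralogues for the maximum and re-examining the correlation, can be sketched as follows. All KOG names and values are hypothetical toy data, and Pearson's r is an assumption made for illustration only.

```python
from statistics import median

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical toy data: expression levels of the paralogues in each KOG
# and a single evolutionary rate per KOG.
paralogue_el = {
    "KOG_A": [9.1, 7.4, 8.8],
    "KOG_B": [5.0, 6.2],
    "KOG_C": [3.3],
    "KOG_D": [2.1, 4.0, 1.5, 2.8],
    "KOG_E": [7.0, 3.9],
}
er = {"KOG_A": 0.2, "KOG_B": 0.5, "KOG_C": 0.9, "KOG_D": 1.1, "KOG_E": 0.4}

kogs = sorted(paralogue_el)
rates = [er[k] for k in kogs]
r_by_rule = {}
for name, rule in (("median", median), ("max", max)):
    # Collapse each KOG's paralogues into one EL value with the chosen rule.
    el = [rule(paralogue_el[k]) for k in kogs]
    r_by_rule[name] = pearson(el, rates)
    print(f"EL aggregated by {name}: r(EL, ER) = {r_by_rule[name]:+.2f}")
```

In this toy example the sign of the EL–ER correlation is the same under both aggregation rules, which is the property the robustness analysis checks for.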
Thus, to succinctly summarize the current results of pairwise correlation analysis, genes whose knockout has a severe effect on fitness, that are highly expressed, have many protein–protein interaction partners, and many paralogues have a propensity to evolve slowly, in terms of both ER and PGL. Conceptually, one may think of these as ‘important’ genes that are subject to strong functional constraints and, as a result, refractory to evolutionary change.
(b) Principal component analysis of the genomic variables: a gene's status, adaptability and reactivity
To investigate the relationships between all the analysed genome-related variables simultaneously, we performed PCA of the 3912 KOGs in the seven-dimensional space. Each of the seven PCs accounted for a significant fraction of the variance in the data, i.e. the contribution of each PC was non-negligible ( and figure 1S of the electronic supplementary material). This shows that none of the original variables can be represented as a linear combination of other variables. Furthermore, the PCA results were found to be highly robust to various modifications of the data analysis procedures, e.g. replacing PGL with the raw number of gene losses or using the maximum among paralogues, instead of the median, to assign a KOG's EL, PPI and GI values, as described in detail in §2 and in supplement 6 of the electronic supplementary material.
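For readers less familiar with the procedure, a minimal sketch of PCA on seven variables follows. It assumes that the variables are standardized and that the PCs are obtained by diagonalizing the correlation matrix; the preprocessing actually used is described in §2 and may differ. The data are synthetic stand-ins generated from a single hypothetical latent factor, not the KOG dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
names = ["KE", "EL", "GI", "PPI", "NP", "ER", "PGL"]

# Synthetic stand-in data (NOT the KOG dataset): one latent factor raises the
# phenotypic variables, lowers the evolutionary ones and leaves GI untouched.
n = 500
latent = rng.normal(size=n)
signs = np.array([1, 1, 0, 1, 1, -1, -1], dtype=float)
X = 0.6 * latent[:, None] * signs + rng.normal(size=(n, 7))

# Standardize each variable, then diagonalize the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = (Z.T @ Z) / (n - 1)
eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]            # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()          # fraction of variance per PC
scores = Z @ eigvecs                         # PC scores, one row per item

for k in range(3):
    loadings = ", ".join(f"{nm}={v:+.2f}" for nm, v in zip(names, eigvecs[:, k]))
    print(f"PC{k + 1}: {explained[k]:.1%} of variance; {loadings}")
```

The overall sign of each eigenvector is arbitrary, so only the relative signs of the loadings within a PC are meaningful.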
The first three PCs captured over one-half (54.8%) of the total variance in the data (; table 1S and figure 1S of the electronic supplementary material). The first PC (PC1), which accounts for 25% of the overall variance, is comprised of strong positive contributions from EL, NP, KE and PPI, large negative contributions from ER and PGL, and effectively no contribution from GI (a; tables 1S and 2S of the electronic supplementary material).
PC1 appears to correspond to what may be viewed as a gene's status in the genome-wide community of genes. Indeed, genes with high values of PC1 are the ‘high-status’ (most ‘important’) genes: those that cannot be knocked out without a major effect on fitness, are highly expressed, occupy a prominent position in the PPI network, have many paralogues and are evolutionarily conserved. By contrast, genes with low values of PC1 can be knocked out at little cost, evolve fast, are typically expressed at a low level and have few (if any) paralogues and protein–protein interactions, i.e. have a low status.
The next two PCs are associated with statistically identical eigenvalues, a property known as sphericity (the p-value of the sphericity test is 0.255; and figure 1S of the electronic supplementary material), and therefore are defined only up to a rotation in the plane PC2–PC3. Nevertheless, it seems possible to interpret this plane as a two-dimensional measure of a gene's functional and evolutionary plasticity, and the two PCs (; table 2S of the electronic supplementary material) as capturing two different facets of this plasticity.
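Why identical eigenvalues leave the two PCs defined only up to a rotation can be checked numerically. The sketch below uses a hypothetical 3×3 covariance matrix with eigenvalues 3, 1 and 1 (not the matrix from this study) and verifies that any rotation of the degenerate pair within its plane yields an equally valid pair of eigenvectors.

```python
import numpy as np

# Hypothetical covariance matrix with eigenvalues 3, 1, 1 (two identical).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthonormal basis
C = Q @ np.diag([3.0, 1.0, 1.0]) @ Q.T

v2, v3 = Q[:, 1], Q[:, 2]        # one valid choice for the degenerate pair
theta = np.deg2rad(35.0)         # arbitrary rotation angle within the plane
w2 = np.cos(theta) * v2 + np.sin(theta) * v3
w3 = -np.sin(theta) * v2 + np.cos(theta) * v3

# The rotated vectors remain orthogonal and still satisfy C w = 1 * w.
still_eigenvectors = np.allclose(C @ w2, w2) and np.allclose(C @ w3, w3)
still_orthogonal = np.isclose(w2 @ w3, 0.0)
print(still_eigenvectors, still_orthogonal)
```

Because the variance is the same in every direction of the plane, the data alone cannot single out one pair of axes, which is why the interpretation of the individual PCs within that plane is necessarily tentative.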
The second PC (PC2), which accounted for 15.3% of the variance, is comprised of positive contributions from NP and GI, negative contributions from KE and effectively no contribution from ER, PGL, EL and PPI (a; tables 1S and 2S of the electronic supplementary material). Thus, PC2 gives high rank to genes that have many paralogues and often are functionally backed up by other genes (high GI) but are non-essential (non-lethal upon knockout). We speculated that these features are associated with genes whose activity is highly malleable in response to changes in the cellular and extracellular environments. Under this interpretation, one would predict that genes with high PC2 values have highly skewed distributions of ELs under different experimental conditions, life cycle stages or different tissues of complex organisms. We tested this prediction by computing the skewness indices for expression scores obtained at different stages of the yeast cell cycle, various developmental stages of Drosophila and different human tissues, and comparing them with the PC2 values (table 2). Indeed, the genes with high PC2 values tend to have more strongly skewed distributions of the ELs, especially those with high status (high values of PC1), i.e. with the most important biological roles (Fisher Omnibus test p-values of 0.01 and much less than 10⁻²⁰ for low- and high-status KOGs, respectively, when the combined data for three organisms were analysed; Bailey & Noble 2003). Thus, we denoted PC2 a gene's ‘adaptability’.
| Table 2. Median skewness of expression score distributions in relation to (a) PC1 and PC2 or (b) PC1 and PC3. (Species abbreviations: Dme, Drosophila melanogaster; Hsa, Homo sapiens; Sce, Saccharomyces cerevisiae.) |
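As an illustration of what a skewness index captures, the sketch below computes the moment-based sample skewness, g1 = m3 / m2^1.5, for two hypothetical expression profiles. The exact definition of the skewness index used in the analysis is not restated in this section, so this particular formula is an assumption.

```python
def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

# Hypothetical expression scores across eight conditions or tissues.
broadly_expressed = [5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
condition_specific = [0.2, 0.3, 0.1, 0.2, 9.5, 0.3, 0.2, 0.1]

print(f"broadly expressed:  {skewness(broadly_expressed):+.2f}")
print(f"condition specific: {skewness(condition_specific):+.2f}")
```

A gene expressed at a similar level everywhere has skewness near zero, whereas a gene that is strongly induced in one condition or tissue has a large positive skewness.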
The third principal component (PC3), which accounts for another 14.5% of the variance (), is similar to PC2 in that it favours non-essential genes with many paralogues. In contrast to adaptability, however, the contribution of GI to PC3 is strongly negative, the contributions of ER and PPI are weakly negative, whereas PGL and EL make substantial positive contributions (b; tables 1S and 2S of the electronic supplementary material). Given that high PC3 values are also associated with significantly increased skewness of expression profiles (table 2; Fisher Omnibus test p-values of 4×10⁻⁴ and much less than 1×10⁻²⁰ for low- and high-status KOGs, respectively), we consider PC3 to be another manifestation of a gene's ability to adjust to different functional modes at different life cycle stages or in different tissues. Thus, we denoted PC3 a gene's ‘reactivity’.
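Combining evidence from several organisms into one p-value can be sketched as follows, assuming that the ‘Fisher Omnibus test’ refers to Fisher's method for combining independent p-values: the statistic X = −2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The input p-values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi-square with 2k d.f. under H0.

    For an even number of degrees of freedom the chi-square tail probability
    has the closed form exp(-x/2) * sum_{i<k} (x/2)**i / i!.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)   # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i                         # (x/2)**i / i!
        total += term
    return math.exp(-half_x) * total

# Hypothetical per-organism p-values (e.g. yeast, fly and human data).
per_organism = [0.04, 0.10, 0.03]
print(f"combined p = {fisher_combined_p(per_organism):.4f}")
```

Individually modest p-values that point in the same direction combine into a considerably smaller one, which is how consistent trends across the three organisms can yield strong overall support.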
(c) Status, adaptability and reactivity of different functional classes of genes and multisubunit complexes
Different functional classes of genes show contrasting trends in status, adaptability and reactivity distributions. Although the individual KOGs in each class span a wide range of values, the group centroids often significantly differ from zero (table 3 and table 3S of the electronic supplementary material; see supplementary data of the electronic supplementary material for the values of status, adaptability and reactivity for all analysed KOGs). Information storage and processing systems, as a whole, are significantly biased toward high status and low adaptability and reactivity; genes involved in cellular processes show, on average, relatively high status and the highest adaptability but low reactivity; genes for metabolic enzymes and transporters are characterized by moderate status and adaptability but the highest reactivity; finally, poorly characterized genes typically fall into the low-status division, and also show low adaptability and reactivity. These trends in status, adaptability and reactivity appear biologically plausible. Thus, the high status of information processing system components is compatible with the fact that many of these are central to genome replication and expression; the characteristic high adaptability of genes involved in cellular processes, particularly signal transduction, might reflect the involvement of these genes in complex networks of partially redundant pathways; and the exceptionally high reactivity of metabolic genes corresponds to the notion that changes in the levels of the respective proteins in response to changes in the availability of metabolites are functionally important and do not necessarily involve much back-up. Finally, a curious observation is the distinctly low status of uncharacterized genes; it seems that the functions of the ‘most important’ eukaryotic genes are already known, at least in general terms.
| Table 3. Status, adaptability and reactivity of selected multisubunit complexes and functional classes of proteins. (*Significantly different from zero (p<0.05), using t-test with Bonferroni correction.) |
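Testing whether a group centroid differs from zero, with a Bonferroni correction for the number of groups examined, can be sketched as below. The group names and scores are hypothetical, and the p-value uses a normal approximation to the t distribution, an assumption that is only reasonable for fairly large groups; an exact t-test would be preferable for small ones.

```python
import math
from statistics import NormalDist, mean, stdev

def centroid_tests(groups, alpha=0.05):
    """Test each group's mean against zero with Bonferroni correction.

    Returns {name: (mean, adjusted p-value, significant?)}. The p-value uses a
    normal approximation to the t distribution (an assumption for large n).
    """
    m = len(groups)                                  # number of tests
    results = {}
    for name, values in groups.items():
        centre = mean(values)
        t = centre / (stdev(values) / math.sqrt(len(values)))
        p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))   # two-sided
        p_adj = min(1.0, p * m)                      # Bonferroni adjustment
        results[name] = (centre, p_adj, p_adj < alpha)
    return results

# Hypothetical status scores for three made-up groups of 40 KOGs each.
groups = {
    "class_high": [0.5, 1.0, 1.5, 2.0, 2.5] * 8,
    "class_neutral": [-1.0, -0.5, 0.0, 0.5, 1.0] * 8,
    "class_low": [-2.0, -1.5, -1.0, -0.5, 0.0] * 8,
}
results = centroid_tests(groups)
for name, (centre, p_adj, significant) in results.items():
    print(f"{name}: centroid = {centre:+.2f}, adjusted p = {p_adj:.3g}, "
          f"significant = {significant}")
```

Multiplying each p-value by the number of tests is equivalent to comparing the raw p-values with alpha divided by the number of tests.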
Genes whose protein products form multisubunit molecular complexes usually show strong coherence in status, adaptability and reactivity, whereas different complexes, even those with generally similar functions, may differ dramatically ( and ). Thus, comparison of cytosolic and mitochondrial ribosomal proteins shows a clean separation, with the former having a much higher status than the latter (a). Indeed, mitochondrial ribosomes are extremely diverse in different taxa, and genes coding for mitochondrial ribosomal proteins evolve fast and are often lost during evolution (Mears et al. 2002; Koonin et al. 2004; Mushegian 2005).
Complexes with the same intracellular location but distinct functions often show different, characteristic status-adaptability patterns. Thus, the vacuolar ATPase subunits are well separated from those of the vacuolar sorting complex, the former having a much higher status and somewhat lower adaptability (b). A similar pattern is seen in a comparison of histones and the replication licensing complex, two chromatin-associated complexes. The histones have a significantly higher status and greater adaptability than the licensing complex subunits (c), which presumably reflects the key role of histones and their modifications in chromatin maintenance and remodelling (Vermaak et al. 2003).
Many functional systems show a distinct pattern, with a dense core of central components of a relatively high status and low adaptability, and a sparse periphery of more adaptable, lower-status genes. This pattern is illustrated in d for the RNA processing and modification systems. As a class, these have a relatively high average status and low adaptability, as is characteristic of information processing systems in general (table 3 and table 3S of the electronic supplementary material). However, a closer examination reveals a tight, high-status–low-adaptability cluster that is enriched for core subunits of the spliceosome and the mRNA cleavage–polyadenylation complex, and a scattered cloud with a significantly lower average status and a wide range of adaptability values, consisting of diverse proteins involved in various forms of RNA processing and modification (d).
Different functional groups of genes also display distinct adaptability–reactivity patterns, e.g. low–low for RNA processing and modification; low–high for translation, ribosomal structure and biogenesis; high–low for signal transduction systems; and high–high for carbohydrate transport and metabolism (table 3 and table 3S of the electronic supplementary material). These patterns might reflect different functional–evolutionary modalities of these categories of genes. For example, both the components of the translation system and those of signal transduction systems are involved in various forms of environmental response, but the latter are characterized by a high level of functional back-up as opposed to the former.