C. albicans Expression Data
We assembled a dataset describing the genome-wide transcriptional responses of C. albicans to diverse perturbations, including drug resistance [15], stress [18], expression of only one mating type locus [19], and response to mating pheromone [20]. Also included were transcription profiles of cells growing as yeast or hyphal cells [25], in biofilms [21], exposed to blood components [22], altered pH [24], or signaling molecules [26]. The studies were performed primarily with laboratory strains, but also with some clinical isolates [15]. Altogether, the dataset consists of 244 expression profiles, generated by seven different laboratories, using four independently designed microarrays. All data were put into a unified format (orf19), which included a total of 6,167 open reading frames (ORFs) (see Materials and Methods).
Previous studies demonstrated that genes with similar functions are often co-expressed (see [28]). To determine if this relationship is observed in the C. albicans expression data, we examined the similarity of the expression patterns of genes assigned to the same biological process within the Gene Ontology (GO) categories [32]. The significance of co-expression within a specific GO category was quantified by calculating the distribution of pair-wise correlations between genes within the category, and by comparing it to the distribution of random gene assemblies of the same size (see Materials and Methods and Figure S1). Indeed, a large fraction of predicted GO categories received a highly significant score, indicating that, also in the C. albicans data, functionally linked genes tend to be co-expressed (A).
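The comparison of a category against random gene sets of the same size can be sketched as follows. This is only an illustration under stated assumptions: the z-score on the mean pair-wise correlation is a stand-in for the scoring described in Materials and Methods, and all function names (`pearson`, `mean_pairwise_correlation`, `coexpression_zscore`) are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def mean_pairwise_correlation(profiles, genes):
    """Average correlation over all gene pairs in `genes`."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


def coexpression_zscore(profiles, category, n_random=200, seed=0):
    """Z-score of a category's mean pair-wise correlation against random
    gene sets of the same size (an assumed, simplified significance score)."""
    rng = random.Random(seed)
    all_genes = sorted(profiles)
    observed = mean_pairwise_correlation(profiles, category)
    null = [
        mean_pairwise_correlation(profiles, rng.sample(all_genes, len(category)))
        for _ in range(n_random)
    ]
    mu = sum(null) / len(null)
    sd = sqrt(sum((v - mu) ** 2 for v in null) / (len(null) - 1))
    return (observed - mu) / sd if sd > 0 else 0.0
```

Here `profiles` maps each gene identifier to its list of expression values across conditions, and `category` is the list of genes annotated to one GO term. A large positive score indicates that the category is more tightly co-expressed than random sets of equal size.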
Functionally Linked Genes Tend to Be Co-Expressed
For comparison, we performed an analogous analysis of S. cerevisiae, using a dataset of ~1,000 publicly available genome-wide expression profiles [33]. Overall, the significance of co-expression within the C. albicans GO terms was lower than that of the S. cerevisiae counterparts (A and S1). This lower significance may reflect the smaller size of the dataset available for C. albicans, its quality, or the fact that the GO terms were originally defined for S. cerevisiae. Alternatively, transcriptional regulation may play a less prominent role in C. albicans. The mitochondrial-targeting and protein-folding GO categories, which were co-expressed more tightly in C. albicans, provided an interesting exception, although the significance of this difference was marginal (B). Despite the quantitative difference, we observed a strong correlation between the significance of the co-expression in the two organisms (correlation coefficient 0.92, B). For example, in both organisms, functional groups involved in aspects of protein synthesis and sugar metabolism were most stringently co-expressed.
Differential Clustering Algorithm for Comparative Analysis of Gene Expression Data
While providing a useful means for systematic analysis, GO categories do not necessarily correspond to transcriptional units. In fact, in most GO categories, only a subset of the genes is co-expressed (e.g., C). Moreover, in certain cases, a single GO category can be separated into subsets that display independent or even inversely correlated expression patterns. For example, the C. albicans genes attributed to gluconeogenesis were split into two autonomously co-expressed subgroups, one associated with the glycolysis pathway itself, the other involved in other aspects of gluconeogenesis. Interestingly, in this case, this split was conserved between S. cerevisiae and C. albicans (C). However, in general, the fine structures in regulatory patterns differed between the two organisms (e.g., tRNA aminoacylation, C).
Differences in the pattern of gene regulation within individual GO categories are likely to reflect differences in the physiology, or in the adaptation to different environments, of the two organisms. Existing approaches for comparative gene expression analyses emphasize mostly conserved co-regulation patterns, rather than differences in expression patterns [8]. To better capture differential expression patterns, we developed a novel approach, termed the differential clustering algorithm (DCA), for systematically characterizing both similarities and differences in the fine structure of co-regulation patterns ().
The Differential Clustering Algorithm (DCA)
The DCA is applied to a set of orthologous genes that are present in both organisms. As a first step, the pair-wise correlations between these genes are measured in each organism separately, defining two pair-wise correlation matrices (PCMs) of the same dimension (i.e., the number of orthologous genes) (A). Next, the PCM of the primary (“reference”) organism is clustered, assigning genes into subsets that are co-expressed in this organism, but not necessarily in the second (“target”) organism. Finally, the genes within each co-expressed subgroup are re-ordered, by clustering according to the PCM of the target organism. This procedure is performed twice, reciprocally, such that each PCM is used once for the primary and once for secondary clustering, yielding two distinct orderings of the genes.
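The two-step procedure can be illustrated with a minimal sketch. The text does not specify the clustering method, so a simple threshold-based grouping (connected components of the graph linking gene pairs whose correlation exceeds a cutoff) is used here as an assumed stand-in; the function names and the default cutoff of 0.5 are hypothetical.

```python
def threshold_clusters(pcm, members, threshold):
    """Group `members` (indices into pcm) into connected components of the
    graph that links pairs whose correlation exceeds `threshold`."""
    unvisited = list(members)
    clusters = []
    while unvisited:
        stack = [unvisited.pop(0)]
        component = []
        while stack:
            g = stack.pop()
            component.append(g)
            linked = [h for h in unvisited if pcm[g][h] > threshold]
            for h in linked:
                unvisited.remove(h)
            stack.extend(linked)
        clusters.append(sorted(component))
    return clusters


def differential_clustering(pcm_ref, pcm_target, threshold=0.5):
    """Primary clustering on the reference PCM, then secondary clustering of
    each primary cluster on the target PCM. Returns a list of primary
    clusters, each given as a list of target-derived subclusters."""
    n = len(pcm_ref)
    result = []
    for primary in threshold_clusters(pcm_ref, range(n), threshold):
        result.append(threshold_clusters(pcm_target, primary, threshold))
    return result
```

Calling `differential_clustering(pcm_a, pcm_b)` and then `differential_clustering(pcm_b, pcm_a)` gives the two reciprocal orderings. A primary cluster that breaks into several subclusters in the target organism is a candidate for the split or partial conservation patterns.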
The results of the DCA are presented in terms of the rearranged PCMs. Since these matrices are symmetric and refer to the same set of orthologous genes, they can be combined into a single matrix without losing information. Specifically, we join the two PCMs into one composite matrix such that the lower-left triangle depicts the pair-wise correlations in the reference organism, while the upper-right triangle depicts the correlations in the target organism (B). Inspection of the rearranged composite PCM allows for an intuitive extraction of the differences and similarities in the co-expression pattern of the two organisms (). An automatic scoring method is then applied to classify clusters into one of the four conservation categories: full, partial, split, or no conservation of co-expression (A and B).
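The composite matrix described above can be built in a few lines. The sketch below assumes the two PCMs are square nested lists indexed by the same orthologous genes and that `order` is a gene ordering produced by the clustering step; the function name `composite_pcm` is hypothetical.

```python
def composite_pcm(pcm_ref, pcm_target, order):
    """Combine two symmetric PCMs into one matrix: the lower-left triangle
    holds reference correlations, the upper-right triangle holds target
    correlations, with rows and columns rearranged according to `order`."""
    n = len(order)
    out = [[1.0] * n for _ in range(n)]  # diagonal stays at 1.0
    for i in range(n):
        for j in range(n):
            gi, gj = order[i], order[j]
            if i > j:
                out[i][j] = pcm_ref[gi][gj]     # lower-left: reference
            elif i < j:
                out[i][j] = pcm_target[gi][gj]  # upper-right: target
    return out
```

Because each PCM is symmetric, no information is lost by keeping only one triangle of each.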
The DCA Method Automatically Classifies Clusters to Different Conservation Classes
Functionally Related Genes Exhibit Different Degrees of Co-Expression Conservation
To systematically characterize the conservation or divergence of co-expression between genes with a related function, we applied the DCA to gene groups defined by membership in the same biological process GO categories [32]. We also applied it to groups of genes that have a common DNA sequence motif of length 6 or 7 base-pairs in their promoter region (within 600 base-pairs upstream of the predicted start codon). The DCA procedure identifies co-expressed clusters embedded within these gene sets, and assigns each of these clusters to one of the four above-mentioned conservation categories (full, partial, split, or no conservation).
Examples of clusters assigned to each category are shown in C. Clusters associated with growth, including genes encoding ribosomal components (C, 14) and genes containing the PAC motif (C, 13, primarily genes encoding rRNA processing proteins), were strongly co-regulated in both organisms, and were thus assigned to the full conservation class. Also assigned to this class were clusters of genes involved in oxidative phosphorylation (C, 15) and monosaccharide catabolism (C, 16).
Of particular interest are clusters that are differentially expressed between the two organisms. The most noticeable differences were found for clusters whose genes are involved in both cytoplasmic and mitochondrial translation. This included, for example, the GO terms “protein synthesis” (C, 9), “tRNA metabolism” (C, 5), and “tRNA aminoacylation” (C). These clusters were uniformly co-expressed in C. albicans. In contrast, in S. cerevisiae they were split into two distinct subclusters, associated with cytoplasmic or mitochondrial functions, respectively, which displayed independent or even inversely correlated expression patterns. This differential expression pattern of mitochondrial genes reflects a major phenotypic difference between the two organisms: rapidly growing S. cerevisiae cells utilize fermentation and do not require oxygen. In contrast, rapid growth in C. albicans relies on aerobic respiration and requires mitochondrial functions.
Flexible Regulatory Patterns of Cell Cycle Genes
Among the clusters assigned to the no conservation class was a group of cell cycle genes that are involved in the transition from S-phase to mitosis (C, ). These genes were tightly co-expressed in C. albicans, but not in S. cerevisiae, suggesting that the cell cycle transcription program differs between the two organisms.
To better characterize the differences in regulation of cell cycle genes, we examined the “cell cycle” GO category in more detail. We also included in this analysis expression data from Schizosaccharomyces pombe, which is evolutionarily more distant from S. cerevisiae and C. albicans than these two yeasts are from each other. For S. cerevisiae and S. pombe, we restricted the expression data to cell cycle experiments. No such cell cycle–dedicated conditions were available for C. albicans. We note, however, that many experiments in the C. albicans dataset used cells emerging from stationary phase with some degree of synchrony, which likely captured some features of cell cycle–specific regulation. It should be noted that the gene set is based on the S. cerevisiae GO term, and therefore does not include genes that are cell cycle–related only in the other two organisms.
We applied the DCA to the above-mentioned data, with each of the three yeasts serving once as a reference and once as a target organism (off-diagonal in , green background). As a control, we considered the same organism as both the reference and target organism, but used only 25% of the expression data for the secondary clustering (diagonal in , gray background). Moreover, for S. cerevisiae and S. pombe, we tested complementary expression data containing no cell cycle experiments as another control. In this case the cluster conservation was weaker, yet some aspects of cell cycle regulation remained (unpublished data).
DCA Analysis of Cell Cycle Genes
Essentially all clusters identified as co-expressed in the reference organism were, at most, partially co-expressed in the other two organisms ( and S4–S13). As an example, we highlight here the regulation of the major cyclin-dependent kinase (encoded by CDC28 in S. cerevisiae) and the associated mitotic B-cyclin (encoded by CLB2 in S. cerevisiae).
In S. cerevisiae, there are six B-cyclins, several with redundant functions [35], and their expression is cell cycle–regulated. CDC28 expression is not correlated with any of them. Accordingly, CDC28 and CLB2 were associated with two distinct clusters: CDC28 was assigned to a cluster composed of genes involved in the early cell cycle functions (e.g., bud neck formation, DNA replication, and repair [Figure S4]), whereas CLB2 was assigned to a cluster composed of genes with functions in mitosis (Figure S12). Neither of these clusters was co-expressed in C. albicans or in S. pombe.
S. pombe has one major, essential B-cyclin, cdc13 (the CLB2 ortholog), which is required for mitosis. In the S. pombe cell cycle data, expression of cdc13 was inversely correlated with expression of cdc2 (the CDC28 ortholog). cdc2 was co-expressed with a cluster of genes, many of whose S. cerevisiae orthologs are involved in replication and DNA repair (Figure S9), whereas cdc13 was co-regulated with genes involved primarily in mitosis and general cell cycle control (Figure S11).
C. albicans has two B-cyclins, and one of them, CLB2, is essential [39]. Interestingly, in C. albicans the CDC28 and CLB2 orthologs were co-expressed. Both genes were assigned to a cluster associated with anaphase and mitotic exit (B and S11). Northern blot analysis of CDC28 and CLB2 transcripts in C. albicans cells emerging synchronously from stationary phase confirmed that the mRNA levels of CDC28 and CLB2 correlate, peaking with the presence of large budded cells (S/G2 phase) (JB and M. McClellan, unpublished data).
We conclude that transcriptional regulation of cell cycle genes is highly flexible and has diverged significantly between the three yeast species. Our results expand on previous reports that have shown that only a small set of genes is subject to similar cell cycle regulation in both S. cerevisiae and S. pombe. Each of these fungi has a distinctive repertoire of morphologies: S. cerevisiae and C. albicans undergo budding to form yeast or pseudohyphae; C. albicans also forms true hyphae by a non-budding mechanism involving different organellar structures [41]; S. pombe is a fission yeast with a distinct, non-budding mechanism of morphogenesis. In all three fungi, cell cycle regulation and morphogenesis are clearly linked [39]. Further analysis is needed to establish how these distinct morphologies are connected to the differential pattern of gene expression found in each organism.
C. albicans Transcription Modularity
The analysis above focused on pre-defined sets of genes that are known to be related and thus are suspected to be, at least partially, co-expressed. In order to identify novel regulatory relationships that are not confined to specific function-related genes, we conducted a complementary, unsupervised analysis of the C. albicans expression data. To this end, we used the iterative signature algorithm (ISA) [31] to determine the modular organization of the C. albicans transcription program. The ISA segregates the data into overlapping transcription modules, each consisting of a subset of co-expressed genes together with the subset of experimental conditions inducing this co-expression.
The ISA assigned 2,770 C. albicans genes into transcription modules with varying stringencies of correlated expression. Modules were classified as core modules (15%), composed primarily of genes possessing an S. cerevisiae ortholog; as C. albicans–specific modules (37%), consisting primarily of genes without S. cerevisiae orthologs; or as modules with a mixture of both types of genes (48%) (A–C).
Modules were annotated manually by examining their gene and condition contents (A; see also http://barkai-serv.weizmann.ac.il/candida). In addition, we systematically checked each module for over-representation of GO categories and of DNA sequence motifs in the 5′-UTR. This analysis clearly established the biological relevance of the C. albicans transcription modules. First, many modules contained one or several over-represented GO terms, indicating their functional coherence (D). Second, most modules were associated with sequence motifs that were significantly enriched in the promoter regions of genes within the module (D).
Module association provides numerous functional links for C. albicans genes (see http://barkai-serv.weizmann.ac.il/candida). We experimentally tested one of these links, namely orf19.5850. Previous studies reported that a strain heterozygous for a transposon disruption allele of this gene exhibits reduced filamentous growth [45]. Our analysis assigned orf19.5850 to the rRNA processing module (E). Indeed, tagging this predicted protein product with yellow fluorescent protein (YFP) revealed its localization to the nucleolus (F), as expected for a gene involved in rRNA processing. After this experiment was initiated, the localization of the S. cerevisiae ortholog was shown to be both nucleolar and nuclear [46].
The C. albicans versus S. cerevisiae Transcription Modularity
The hierarchical organization of a transcription program is captured by its module tree, which connects related modules identified at different stringencies of correlated expression [10] (A). The C. albicans module tree was composed of three main branches. One of these branches was associated with Candida-specific cell types: they were induced in opaque cells and/or repressed in white cells. This module included genes important for fatty acid metabolism, mating, and arginine and glutamine biosynthesis, as well as genes repressed under conditions of biofilm production. The second main branch was composed primarily of modules pertaining to core functions, including genes required for rapid growth (e.g., ribosomal proteins and rRNA processing genes). Finally, the third main branch was associated with carbohydrate metabolism and the response to stress, as well as with genes involved in C. albicans–specific processes such as hyphal or white-opaque growth.
This global organization is similar to that found in the S. cerevisiae module tree, in which two of the major branches were associated with rapid growth and stress-response, respectively [31]. In contrast, in higher eukaryotes, including D. melanogaster, C. elegans, Arabidopsis thaliana, and human, these two core functions did not correspond to main branches of the module trees [10].
Apart from this global similarity, the module trees of C. albicans and S. cerevisiae displayed some notable differences. First, in C. albicans, amino acid biosynthesis was associated with the protein synthesis branch, whereas no such association was seen in S. cerevisiae. This indicates that in C. albicans, but not in S. cerevisiae, amino acid biosynthesis is induced under conditions that also increase protein synthesis (e.g., rapid growth). To test if these differences arose from the distinct types of conditions available in the two datasets, we removed from the S. cerevisiae data all environmental perturbations relevant for amino acid metabolism (such as amino acid or nitrogen starvation). We also removed other subsets of conditions, such as the set of 300 profiles of deletion mutants [47], or the set of general environmental perturbations [48]. In all cases, the amino acid and the protein synthesis modules appeared on separate branches (unpublished data). This indicates that the observed distinctions in the module trees of the two yeasts reflect differences in the underlying organization of their transcriptional programs, rather than differences in the set of available conditions.
In C. albicans, the core protein synthesis branch also included specific modules, which contained members of the major repeat sequence family [49] along with genes important for cell wall synthesis and several genes involved in cell cycle progression, such as CLB2 and CDC5. The reason for this association of cell wall proteins, the major repeat sequence family, and cell cycle genes is not clear. Examining the conditions associated with this module, we noted that this module is induced primarily in white cells and is repressed primarily in opaque cells [19], and thus may reflect a common regulation associated with the conditions used to study the white-opaque transition.
An intriguing feature of the C. albicans–specific branch of the transcription program is that genes related to arginine biosynthesis were separated from the main amino acid biosynthesis module. These genes were co-expressed with genes required for biotin synthesis, most likely because biotin is required for the activity of ornithine transcarbamylase (encoded by ARG3). In addition, these genes were co-expressed with genes associated with the mating response [19] and were up-regulated in C. albicans cells interacting with macrophages [23]. Because methylated arginines are inhibitors of nitric oxide synthesis [51], and nitric oxide is produced by macrophages, it is tempting to speculate that the expression of genes required for arginine synthesis elicits a protective response of C. albicans cells to macrophage attack.
Furthermore, in C. albicans, the mitochondrial ribosomal protein module and the ergosterol biosynthesis module both appear on the protein synthesis branch associated with rapid growth. In contrast, the S. cerevisiae mitochondrial ribosomal protein module is associated with stress responses. Again, this pattern of co-regulation likely reflects the fact that rapid growth requires mitochondria-mediated respiration in C. albicans but not in S. cerevisiae.
Higher-Order Regulatory Relationships between GO Terms Provide Complementary Views of Transcription Programs
The above direct comparison of the two module trees is useful for distinguishing broad features of the respective organizations, yet it is limited by the lack of a one-to-one relationship between the two module sets. For example, the average overlap between S. cerevisiae modules and their best matching C. albicans counterparts is only 19% (C). Furthermore, although many modules are significantly enriched with genes belonging to a specific GO category, typically several distinct GO categories are represented in each module. Thus, associating each module with one summarizing annotation is a simplification that does not capture the full complexity of the transcriptional organization.
To overcome these difficulties, we developed a new approach, termed “higher-order connectivity analysis” (HOCA), in which we analyze the modular components of the transcription program through their association with functional categories. Specifically, we define a GO connectivity network, where two GO terms are connected if they are both over-represented in at least one common transcription module (A, and Materials and Methods). Applying HOCA to the S. cerevisiae and C. albicans expression data yielded two independent “GO networks,” corresponding to the regulatory relationships between the GO terms in S. cerevisiae and C. albicans, respectively. The two networks were composed of a corresponding set of nodes (GO terms), connected by organism-specific links. We quantified the strength of each link using the topological overlap, which weights each edge by the similarity in the overall connectivity of the two nodes (A, and Materials and Methods). The C. albicans GO connectivity diagram is displayed in B.
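The construction of a GO connectivity network and its link weights can be sketched as follows. The exact definition used here is given in Materials and Methods; the sketch assumes the commonly used form of the topological overlap, O(i,j) = (shared neighbors + a_ij) / (min(k_i, k_j) + 1 - a_ij), where a_ij is 1 if the two terms are directly linked and k is the node degree. The function names and the input layout (one set of over-represented GO terms per module) are hypothetical.

```python
from itertools import combinations


def go_network(module_terms):
    """Link two GO terms if both are over-represented in a common module.
    `module_terms` is a list of sets of GO terms, one set per module."""
    edges = set()
    for terms in module_terms:
        for a, b in combinations(sorted(terms), 2):
            edges.add((a, b))
    return edges


def topological_overlap(edges):
    """Topological overlap for every pair of nodes in an undirected network
    (assumed formula; see the lead-in)."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    overlap = {}
    for a, b in combinations(sorted(neighbors), 2):
        a_ij = 1 if b in neighbors[a] else 0
        shared = len(neighbors[a] & neighbors[b])
        k_min = min(len(neighbors[a]), len(neighbors[b]))
        overlap[(a, b)] = (shared + a_ij) / (k_min + 1 - a_ij)
    return overlap
```

Running both functions on the module annotations of each organism gives one overlap value per pair of GO terms, keyed by the alphabetically sorted pair.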
Connectivity Analysis between Gene Attributes Reveals Different Patterns of Co-Expression in C. albicans and S. cerevisiae
Differential Connectivity in the C. albicans versus S. cerevisiae GO Networks
To compare the GO networks of C. albicans and S. cerevisiae, we restricted the set of nodes to the GO terms that are common to both organisms. In this case, we have two matrices of the same dimension (i.e., the number of common GO terms), describing the topological overlaps between all pairs of GO terms in each organism (C). The two matrices were analyzed using the DCA method to automatically classify the resulting clusters of GO terms into the full, split, partial, and no conservation classes of co-expression.
D depicts some of the GO term associations assigned to the different conservation classes. Notably, GO terms concerning carbohydrate metabolism (see cluster 3) were correlated with the stress response in C. albicans but not in S. cerevisiae. This may be related to the fact that C. albicans requires mitochondrial function during rapid (aerobic) growth, producing high levels of reactive oxygen species that, in turn, would induce oxidative stress–related genes. In contrast, rapid (fermentative) growth in S. cerevisiae does not generate such high levels of reactive oxygen species and therefore would not induce these genes.
Sequence Motifs Associated with the Differential Regulation of C. albicans Amino Acid Biosynthesis Genes
Consistent with the modular analysis described above, we detected an interesting difference in the regulation of amino acid biosynthesis genes in C. albicans relative to S. cerevisiae. Cluster 5 (D) includes GO terms involved in the biosynthesis of several amino acids. All these GO terms are connected in S. cerevisiae, presumably reflecting their common regulation by the transcription factor Gcn4p. In contrast, only one subset of these GO terms (arginine, glutamine, and sulfur amino acid metabolism) was connected in C. albicans. This suggests a differential, and more refined regulation of amino acid biosynthesis by C. albicans.
To better characterize this differential co-regulation pattern, we applied the DCA to the genes of the amino acid biosynthesis transcription module in S. cerevisiae (Materials and Methods). In S. cerevisiae, these genes are uniformly co-expressed. In contrast, in C. albicans this group was split into four clusters that displayed distinct regulatory patterns (). These clusters were associated with arginine, methionine, aromatic, and general amino acid biosynthesis.
DCA Analysis of Amino Acid Biosynthesis Genes
To address the mechanism underlying this differential regulatory pattern, we asked whether these clusters are linked to differential appearance of cis-regulatory elements. To this end, we examined the promoter sequences of the genes in each cluster, searching for an over-represented DNA sequence of length 6–8 nucleotides. First, we analyzed the S. cerevisiae promoters and found that, as expected, all clusters were significantly enriched with the TGACTC motif, which is the known binding site for Gcn4p, the transcriptional activator of amino acid biosynthetic genes. Furthermore, the cluster that includes genes required for methionine biosynthesis was associated with an additional motif (CACGTG), which is bound by the Cbf1 transcription factor, a known regulator of methionine biosynthesis genes [53].
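A search for over-represented promoter sequences of this kind can be sketched with a k-mer count and a hypergeometric test. This is an illustration only: the text does not state which enrichment statistic was used, the sketch ignores reverse complements and multiple-testing correction, and the function names and the `alpha` cutoff are hypothetical.

```python
from math import comb


def kmers(seq, k):
    """Set of all length-k substrings of a promoter sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hypergeom_pvalue(N, K, n, x):
    """P(X >= x) when n promoters are drawn from N, K of which carry the motif."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, min(K, n) + 1))
    return tail / comb(N, n)


def enriched_motifs(promoters, cluster, k=6, alpha=0.01):
    """Return (motif, hits in cluster, hits in genome, p-value) for k-mers
    over-represented in the promoters of `cluster`, most significant first.
    `promoters` maps gene -> upstream sequence."""
    present = {g: kmers(s, k) for g, s in promoters.items()}
    N, n = len(promoters), len(cluster)
    hits = []
    for motif in set().union(*(present[g] for g in cluster)):
        K = sum(motif in ks for ks in present.values())
        x = sum(motif in present[g] for g in cluster)
        p = hypergeom_pvalue(N, K, n, x)
        if p < alpha:
            hits.append((motif, x, K, p))
    return sorted(hits, key=lambda h: h[3])
```

Repeating the call for `k` in 6, 7, and 8 covers the motif lengths considered in the text.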
Next, we searched for over-represented DNA sequences in the promoters of genes in the C. albicans clusters. The TGACTC motif was significantly enriched in three of the four clusters (numbers 1–3), consistent with previous reports showing its conservation across different yeast species [54]. Notably, the cluster associated with methionine biosynthesis genes, which is not co-regulated in our dataset, appears to have lost both the TGACTC (Gcn4-binding) and the CACGTG (Cbf1-binding) motifs ().
Strikingly, the three C. albicans clusters that maintained the TGACTC motif were all associated with additional over-represented motifs that were not found in the promoters of the corresponding S. cerevisiae genes (). Specifically, the arginine and general amino acid clusters were each associated with a distinct novel motif (TAACCGC and TTCCTG, respectively), whereas all three clusters were associated with the AATTTT motif [56]. These results suggest that combinatorial regulation by different transcription factors underlies the distinct pattern of amino acid biosynthesis genes in C. albicans. Interestingly, the AATTTT motif (or its reverse complement; see C, 11) is also enriched in genes involved in ribosome biogenesis and rRNA processing, providing a possible explanation for the observed correlation between amino acid biosynthesis and the protein synthesis branch in the C. albicans module tree.
Differential Connectivity between Cis-Regulatory Elements
The above analysis described the higher-order organization of the C. albicans transcription program based on gene sets sharing functional attributes (i.e., GO categories). A complementary approach is to define putative regulatory units based on common sequence motifs in the 5′-UTRs of its genes.
In a given transcription module, more than one sequence element is typically over-represented. Multiple associations of binding motifs that differ by a single nucleotide likely reflect flexibility in the binding specificity of a single transcription factor. These sequences can be summarized by a consensus motif. Indeed, several clusters of motifs assigned to the “split” conservation pattern correspond to consensus motifs that are partially conserved, but exhibit some organism-specific modifications. Interestingly, many single nucleotide sequence variations of a motif were connected only in S. cerevisiae, suggesting that S. cerevisiae transcription factors tend to have a higher degree of DNA binding flexibility as compared to their C. albicans counterparts. Moreover, the consensus sequences in S. cerevisiae were usually slightly different from those in C. albicans.
Over-representation of several distinct sequence motifs in a given transcription module most likely indicates combinatorial regulation of these genes by different transcription factors. For example, in both organisms, the known consensus motifs PAC [57] and the sequence AAAATT were linked in a single cluster (E), pointing to combinatorial action of the associated transcription factors. Moreover, the sequence TGAAAAT was connected to this cluster, but only in S. cerevisiae. This indicates that in S. cerevisiae, the common sequence AAAAAT almost always appears with the prefix TG. In contrast, this TG prefix is not seen in C. albicans. Additional results are summarized at http://barkai-serv.weizmann.ac.il/candida