|Home | About | Journals | Submit | Contact Us | Français|
Organisms simplify the orchestration of gene expression by coregulating genes whose products function together in the cell. Many proteins serve different roles depending on the demands of the organism, and therefore the corresponding genes are often coexpressed with different groups of genes under different situations. This poses a challenge in analyzing whole-genome expression data, because many genes will be similarly expressed to multiple, distinct groups of genes. Because most commonly used analytical methods cannot appropriately represent these relationships, the connections between conditionally coregulated genes are often missed.
We used a heuristically modified version of fuzzy k-means clustering to identify overlapping clusters of yeast genes based on published gene-expression data following the response of yeast cells to environmental changes. We have validated the method by identifying groups of functionally related and coregulated genes, and in the process we have uncovered new correlations between yeast genes and between the experimental conditions based on similarities in gene-expression patterns. To investigate the regulation of gene expression, we correlated the clusters with known transcription factor binding sites present in the genes' promoters. These results give insights into the mechanism of the regulation of gene expression in yeast cells responding to environmental changes.
Fuzzy k-means clustering is a useful analytical tool for extracting biological insights from gene-expression data. Our analysis presented here suggests that a prevalent theme in the regulation of yeast gene expression is the condition-specific coregulation of overlapping sets of genes.
All organisms possess an essentially fixed repertoire of proteins determined by their genome sequence. They have evolved to survive varying internal and external environments by carefully controlling the abundance and activity of these proteins to suit their conditions. To simplify this task, genes whose products function together are often under common regulatory control such that they are coordinately expressed under the appropriate conditions. This property has been frequently exploited in the analysis of genome-wide expression data, as the experimental observation that a set of genes is coexpressed frequently implies that the genes share a biological function and are under common regulatory control . Many proteins have multiple roles in the cell, however, and act with distinct sets of cooperating proteins to fulfill each role. Their genes are therefore coexpressed with different groups of genes, each governed by a distinct regulatory mechanism, in response to the varying demands of the cell (Figure (Figure1a).1a). This complicates the analysis of expression data and calls for a more nuanced approach to data analysis.
The yeast Saccharomyces cerevisiae evolved in a niche in which the availability of nutrients and the conditions of growth vary constantly, and it possesses sophisticated mechanisms to choreograph the expression of its approximately 6,000 genes in order to thrive - or at least survive - in a wide range of environmental conditions. These responses are governed by a complex, condition-specific regulatory system that transduces information through the cell to the nucleus, where gene expression is adjusted accordingly. Many of the individual components of this regulatory system function under particular conditions and govern the expression of overlapping sets of gene targets, allowing a given gene to be coexpressed with different gene groups in response to different conditions (Figure (Figure1a).1a). As a consequence, the targets of each regulatory system often display similar expression patterns in response to one set of conditions but divergent patterns under other situations (Figure (Figure1b).1b). For example, the known targets of the oxidative stress-responsive transcription factor Yap1p are coordinately induced in response to conditions that inflict oxidative damage, but these genes are divergently expressed in response to other environmental changes (Figure (Figure1c)1c) . Similarly, the known targets of other transcription factors in yeast (including Aft1p, Zap1p, Pho4p, Hac1p, Hsf1p, and others) are similarly expressed only in response to certain environments [2,3,4,5,6].
The complexity of the regulatory network that governs yeast gene expression complicates the analysis of whole-genome expression data. Because of the connection between gene-expression regulation and gene product function, computational analysis of expression data is used extensively to identify groups of similarly expressed genes. However, the central limitation of most of the commonly used algorithms is that they are unable to identify genes whose expression is similar to multiple, distinct gene groups, thereby masking the relationships between genes that are coregulated with different groups of genes in response to different conditions. Consider, for example, k-means clustering [7,8]. The k-means algorithm partitions genes into a defined set of discrete clusters, attempting to maximize the expression similarity of the genes in each cluster (Figure (Figure2a).2a). The algorithm is initiated by randomly partitioning the genes into k groups. Each group is then represented by a 'centroid' (the mean expression pattern of genes in the group), and the genes are repartitioned to the cluster whose centroid is most similar to their expression pattern. The partitioning process is iterated until the gene partitions are stable (or some other stopping criterion is met). The end result of the algorithm is a set of k clusters of similarly expressed genes. However, a key property of this algorithm (and many others like it) is that each gene is assigned to one and only one cluster, obscuring the relationships between conditionally coregulated genes such as those shown in Figure Figure1.1. This limitation is especially problematic when analyzing large gene-expression datasets that are collected over many experimental conditions, when many of the genes are likely to be similarly expressed with different groups in response to different subsets of the experiments. A number of methods have been developed to deal with complex relationships between objects [9,10,11]. Here, we explore the utility of one such method - fuzzy k-means clustering.
Fuzzy k-means clustering  facilitates the identification of overlapping groups of objects by allowing the objects to belong to more than one group. The essential difference between fuzzy k-means clustering and standard k-means clustering is the partitioning of genes into each group (Figure (Figure2b).2b). Rather than the hard partitioning of standard k-means clustering, where genes belong to only a single cluster, fuzzy k-means clustering considers each gene to be a member of every cluster, with a variable degree of 'membership'. Each gene has a total membership of 1.0 that is apportioned to clusters on the basis of the similarity between the gene's expression pattern and that of each cluster centroid. Genes whose expression patterns are very similar to a given centroid will be assigned a high membership in that cluster, whereas genes that bear little similarity to the centroid will have a low membership. Importantly, genes can be assigned significant memberships to more than one cluster, thus revealing genes whose expression is similar to multiple, distinct groups of genes.
We implemented a heuristic variant of fuzzy k-means clustering that incorporated principal component analysis (PCA) and hierarchical clustering to analyze published yeast genomic expression data that followed the response of cells to different environments. The method successfully identified clusters of functionally related genes and more comprehensive groups of known transcription factor targets in yeast. In the process of this analysis, we identified previously unrecognized similarities in the expression of yeast genes and uncovered correlations between the environmental conditions. We explored the regulation of gene expression by correlating the identified clusters with known regulatory elements present in the genes' promoters. These details implicate mechanisms that yeast cells use to orchestrate genomic expression programs in response to variable conditions.
We implemented a version of the fuzzy k-means algorithm, based on a description by Gath and Geva , in a C++ program called FuzzyK (available at ; see Materials and Methods for complete details). We altered the algorithm in two fundamental ways: first, we performed three successive cycles of fuzzy k-means clustering, with the second and third rounds of clustering performed on subsets of the data. Second, because the random initialization commonly used in k-means clustering can have a profound impact on the results , we instead chose to initialize each clustering cycle by seeding prototype centroids with the eigen vectors identified by PCA of the respective dataset (see below). Here we present an overview of the algorithm, followed by a discussion of the parameter optimization.
The input of the program is a table of expression values, where each row represents a given gene's relative transcript abundance under the condition indicated in each column (Figure (Figure3).3). The first round of clustering is initialized by defining k/3 prototype centroids (where k is the total number of clusters and 3 is the number of clustering cycles) as the most informative k/3 eigen vectors identified by PCA of the input dataset (see Materials and methods). The prototype centroids are refined in the subsequent steps: each gene is assigned a membership score to each of the prototype centroids, based on the Pearson correlation between the gene's expression pattern and the centroid in question. Each of the centroid patterns is then recalculated as the weighted mean of all of the gene-expression patterns in the dataset, where each gene's weight is proportionate to its membership in the corresponding cluster (see Materials and methods for details regarding the calculations). Genes that have a large membership to a given centroid will contribute more to the mean, and the new centroid will migrate in the direction of those genes. The process of calculating gene-centroid memberships and updating the centroids is iterated until the centroid patterns become fixed (or until the termination criterion is met, as described in Materials and methods).
After this initial round of fuzzy clustering, duplicate centroids (pairs whose Pearson correlation is greater than 0.9) are averaged, and genes with a greater than 0.7 correlation to any of the identified centroids are removed from the dataset (see Materials and methods). The fuzzy k-means clustering steps described above are repeated on this smaller dataset to identify patterns missed in the first clustering cycle, and the new centroids are added to the set identified in the first round. The process of averaging replicated centroids and selecting a data subset is repeated, and a third cycle of clustering is performed on the subset of genes with a correlation of less than 0.7 to any of the existing centroids. The newly identified centroids are combined with the previous sets, and replicate centroids are averaged.
In the final step of the program, the membership of each gene to each centroid is calculated. Thus, the output of the algorithm is twofold: the method presents a list of the unique centroids identified in the fuzzy clustering cycles along with a matrix representing the final membership scores for each gene to each centroid. In this representation, each gene can be related to all the identified clusters through its membership value, allowing genes that belong significantly to multiple clusters to be realized. As a consequence, each cluster consists of a continuous list of all of the genes in the dataset, ranked according to decreasing membership.
The continuous clusters identified by fuzzy k-means clustering present a challenge in visualizing the clustering results. To this end, we have developed the program FuzzyExplorer, a PERL viewer based on the program GeneExplorer (developed by Christian Rees; C. Rees, P.O. Brown and D. Botstein, unpublished results). Using this software, the genes that belong significantly to each cluster can be identified and visualized by applying a membership cutoff: all genes whose membership is greater than the cutoff will be selected as part of the cluster and their gene-expression patterns will be displayed. Rather than define a single cutoff for each cluster, the visualization software applies a sliding membership cutoff to select the genes, allowing each cluster to be expanded or collapsed in terms of the number of genes selected. This flexibility allows the user to define the appropriate membership cutoff for each cluster. For example, at a very high membership cutoff, most of the genes in each cluster will have highly correlated expression patterns in all of the experiments and will be closely related in terms of function and regulation. As the membership cutoff decreases, additional genes will be assigned to each cluster group: in many cases, the similarity in the expression of the selected genes will exist over only a subset of the microarray experiments, promoting the identification of conditionally coregulated genes or genes whose products are more peripherally associated with the same cellular processes (see below). The appropriate membership cutoff will vary for each cluster and for the desired results, and selecting meaningful cutoffs can be guided by additional information (see Discussion). For many of the clusters discussed below, the membership cutoffs were empirically chosen to select genes that had coherent gene expression patterns over a given subset of the experiments.
The parameters used for the fuzzy clustering were empirically defined for the analysis of yeast genomic expression data. The parameters were optimized to maximally recover clusters identified by hierarchical clustering: these were defined as all hierarchical gene clusters that had a Pearson correlation greater than 0.7 (see Materials and methods); in essence, these clusters served as positive controls. We also assessed the ability of the fuzzy k-means algorithm to identify groups of genes with coherent expression patterns, sets of genes whose products are functionally related, and clusters of known transcription factor targets. A summary of the parameter optimization is discussed below, with additional information available at .
We found that performing three clustering cycles, with the second and third cycles performed on subsets of the data as described in Materials and methods, maximized the recovery of the clusters identified through hierarchical clustering. Performing three cycles to identify k = 100 centroids recovered 79% of the known clusters in the dataset, compared to the case when the clustering was carried out in one round using identical parameters and seed vectors, for which 64% of the known clusters were identified (see ). Performing more than three rounds of clustering did not identify additional known clusters in the dataset, nor did it lead to the increased identification of large clusters of coherently expressed genes (data not shown). We therefore implemented three cycles of clustering, although more sophisticated methods of determining the optimal number of cycles can be envisioned.
A significant challenge in partitioning-clustering techniques is defining the number of clusters, k . With standard implementations of the k-means algorithm, underestimating k will result in large clusters of many genes that display divergent gene-expression patterns, while overestimating k will over-fit the data and split groups of similarly expressed genes into multiple, small clusters. Because of the dependence of k-means clustering on k, a number of methods have been developed to estimate this parameter [16,17]. In contrast, fuzzy k-means clustering appears to be less sensitive to over-fitting, because the genes are not forced to belong to only a single cluster. For example, performing the clustering with k = 300 added only approximately 30 unique centroids relative to when the clustering was performed with k = 120, and otherwise-identical parameters (see ). The relatively small number of centroids added when k was increased to 300 was largely due to the fact that the program identified many more replicates of centroids, which were consequently removed from the final set. Of the approximately 30 added centroids, most appeared to represent local minima, as they were centroids that were poorly reproduced in bootstrapping experiments (see Materials and methods) and identified few genes that had coherent patterns of expression (data not shown). Nonetheless, the addition of these patterns did not significantly affect the relative memberships of genes to the other centroids (data not shown), indicating that overestimating k did not appreciably affect the clustering results. This presents a significant advantage over standard k-means clustering as it reduces the requirement of accurately estimating k by allowing this parameter to be overestimated.
We examined a number of different initialization methods (data not shown) and found that seeding prototype centroids with the eigen vectors identified by PCA performed optimally. Together, the eigen vectors describe the variation in the gene-expression dataset, and therefore seeding the centroids with these vectors provides a systematic method of sampling the data space. In addition, this protocol produces deterministic clustering results, in contrast to the random initialization method commonly implemented in k-means clustering. One potential drawback of this method is that the number of clusters, k, is limited to the number of eigen vectors (which is determined by the number of microarray experiments analyzed). This limitation is alleviated by performing successive cycles of fuzzy k-means clustering on subsets of the data and recalculating the eigen vectors for the respective dataset used in each cycle. In addition, the clustering protocol can incorporate user-defined vectors to seed any number of additional centroids.
Many of the eigen vectors identified by PCA seemed to contain little information about the dataset, as previously noted for this type of analysis [18,19]. Nonetheless, most of the eigen vectors diverged to different gene-expression patterns within 10-15 iterations. The final centroids identified by the fuzzy clustering method showed little dependence on the eigen vectors used to seed the process, as evidenced by bootstrapping analysis. More than 50% of the final centroids were identified in 90% of the bootstrapping trials in which PCA was performed on a random sample of the data, despite the fact the most of the eigen vectors were significantly different in each trial (see Materials and methods). Most of the final centroids bore little similarity to the eigen vectors used to initialize the process, with less than 5% of final centroids similar to any of the eigen vectors with a Pearson correlation greater than 0.7.
Similarly to standard k-means clustering, the results of the fuzzy k-means method were affected by the data context. This was evident by the fact that the recovery of known clusters was enhanced by performing successive rounds of clustering on data subsets, as described above. In addition, the algorithm performed slightly better on an input dataset that consisted of the subset of yeast genes that showed differential expression patterns, as opposed to the entire gene-expression dataset. As the input dataset for the clustering process, we empirically selected genes whose standard deviation of expression was greater than around 1.4 (log20.45) from each gene's expression mean, amounting to approximately 4,400 out of the approximately 6,200 genes. The algorithm performed equally well on input datasets selected by other criteria of differential expression (data not shown). Performing the clustering on data subsets posed no limitation to the method, because at the end of the procedure all genes in the complete dataset were assigned membership values to the superset of identified centroids.
We applied the modified fuzzy k-means algorithm to the analysis of 93 published microarray experiments, each measuring the changes in transcript abundance of the approximately 6,200 predicted yeast genes as cells responded to zinc starvation , phosphate limitation , DNA-damaging agents , and a variety of other stressful environmental conditions . Because the algorithm was not significantly affected by overestimating k (described above), we approximated k to be roughly double the number of expected clusters in the dataset (defined as the number of clusters identified by hierarchical clustering). Using k = 120 and the parameters described in Materials and methods, the algorithm identified 91 unique centroids (Figure (Figure4a;4a; the complete results can be viewed at ). More than half of these centroids were correlated more than 0.7 to the cluster means identified by hierarchical clustering of the data, accounting for 87% (46/53) of all of the known clusters in the dataset.
The fuzzy clustering method was also able to identify clusters of genes that were not identified by hierarchical or standard k-means clustering. For example, one centroid (cluster 61, Figure Figure5a5a and see ) represented genes that were strongly repressed in response to prolonged nitrogen starvation but induced by treatment with the sulfhydryl-rearranging drugs dithiothreitol (DTT) and diamide. Six of the eight characterized genes within the top 20 genes in this cluster encode proteins that are localized to or function in relation to the cell wall, including those involved in bud growth and cell separation (AXL2, CIS3, SIM1), cell-wall proteins induced by antifungal drugs (SVS1, PRY2, PRY4, CRH1, TOS6), a putative cell-wall sensor (WSC2), and a Golgi mannose transporter required for glycosylation of cell-wall proteins (VRG4) (see  for references). Given the similarity in the expression of these genes, many of the uncharacterized genes may also be involved in processes related to the cell wall. In fact, nearly 70% of all of the genes in this group encode proteins predicted to contain signal peptides (including nine of the twelve characterized genes and five of the eight uncharacterized genes in the group; data not shown ), supporting the notion that these proteins are secreted to the cell surface. Extending this group to the top 100 genes in the cluster identified many more similarly expressed genes that are functionally related. In addition to the genes involved in cell wall biosynthesis, which accounted for 30% of the characterized genes selected in this group, an additional 30% of the characterized genes encode proteins involved in protein glycosylation and secretion, while the remaining genes are involved in protein synthesis, cytoskeletal functions, and sterol and lipid biogenesis, all of which can be related to cell-wall and membrane synthesis. Most of the genes that had high memberships in this cluster did not fall into a discrete cluster when the data were analyzed with other clustering algorithms: the top 20 genes associated with this group were distributed into five different clusters when the data were organized by hierarchical clustering and four clusters using k-means clustering (see Materials and methods). This example demonstrates the utility of our method in identifying previously unrecognized groups of functionally related genes.
In addition to identifying new groups of similarly expressed genes, fuzzy k-means clustering also provided more comprehensive clusters of previously recognized groups of functionally related genes. In many cases, these genes were similarly expressed in only a subset of the experiments, a feature that prevented their association when the data were analyzed with the other clustering methods. An example of this is a centroid that represents genes that were strongly induced by amino-acid starvation (cluster 2 in Figure Figure5b).5b). Essentially all of the top seven characterized genes associated with this cluster function in methionine biosynthesis and showed similar expression patterns in response to all of the experimental conditions (see ). However, as the membership cutoff was decreased to expand the cluster, additional functionally related genes were included, despite the fact that these genes were divergently expressed in response to conditions other than amino-acid limitation. Of the characterized genes within the top 100 genes belonging to this cluster, 64% (42/66) encode proteins that are directly involved in amino-acid biosynthesis, while more than half of the remaining characterized genes are involved in aspects of nitrogen and carbon metabolism that support amino-acid synthesis. Only half of these genes fell into the same cluster when the data were analyzed with hierarchical clustering or k-means clustering (see Materials and methods), while the remaining genes fell into multiple smaller groups in both cases.
One of the most significant advantages of fuzzy k-means clustering is that genes can belong to more than one group, revealing distinct aspects of their function and regulation. An illustration is provided by the gene KAR2, which encodes an HSP70 protein-folding chaperone localized to the endo-plasmic reticulum (ER) that is known to respond to defects in ER secretion and to unfolded proteins in this organelle (see  for review). Consistent with the known functions of the protein, KAR2 has significant membership in two clusters. The first (cluster 58 in Figure Figure5c)5c) includes genes that were induced by the reducing agent DTT, a condition that prevents proper disulfide-bond formation and secretion in the ER . More than 75% (23/31) of the characterized genes within the top 50 genes in this cluster are localized to the ER and participate in various aspects of secretion, including protein folding (KAR2, LHS1, FKB2, JEM1), protein disulfide isomerization (EUG1, PDI1, ERO1), protein glycosylation (GFA1, PMT3, PMI40, SEC59, WBP1, OST2), and forward and retrograde trafficking (ERD2, ERP1, ERP2, SEC24, SEC13, RET2, RET3, and others ). Many of the uncharacterized genes in this group are likely to be functionally related to the characterized genes. In addition, KAR2 also has significant membership in a second cluster (cluster 11 in Figure Figure5d),5d), which is composed of genes that were induced following heat shock and diamide treatment. Roughly 40% (14/36) of the top characterized genes associated with this group encode protein-folding chaperones localized to different subcellular regions (including those that encode the cytosolic Hsp90 and Hsp70 factors, the mitochondrial Hsp10/Hsp60p and Ssc1p, and the ER- and mitochondrial-associated Ssa1p), and their induction following heat shock and diamide treatment is likely to be in response to widespread protein unfolding inflicted by these conditions. That KAR2 clusters with both groups of genes reflects the dual role of Kar2p in the response to ER-specific challenges and to conditions that generally destabilize proteins throughout the cell, presumably without affecting other aspects of secretion.
The clustering of KAR2 with genes in these two clusters not only reflects the functional role of the encoded protein but also corroborates the conditional regulation of KAR2 expression. In response to defects in ER secretion, KAR2 is known to be induced by the transcription factor Hac1p as part of the unfolded protein response (UPR) [25,26,27,28]. In fact, nearly all of the top 50 genes in cluster 58 were shown by Travers et al.  to be induced following DTT treatment in a manner dependent on Hac1p and its upstream regulator, Ire1p [6,29,30]. However, unlike most of the genes in cluster 58, KAR2 is also induced in response to heat shock, along with the other chaperone genes in cluster 11, by the transcription factor Hsf1p . Consistently, most of the top genes in this group, including KAR2, contain multiple Hsf1p-binding sites in their promoters. The clustering of KAR2 with both clusters of genes therefore reflects the known induction of the gene by Hac1p as part of the UPR but by Hsf1p following heat shock.
Many additional yeast genes have significant membership in more than one of the fuzzy clusters. When the genes were assigned to all clusters with an empirically defined membership cutoff of 0.06, more than a third of the assigned genes were placed in more than one group (Table (Table1);1); at a slightly lower cutoff of 0.04, almost two-thirds of all of the assigned genes were placed into multiple clusters. As with KAR2, the fuzzy assignment of many of these genes was consistent with the known roles of the encoded proteins. Genes involved in histidine biosynthesis (for example HIS4 and HIS5) clustered with other genes involved in amino-acid synthesis (cluster 2 and cluster 80) but also with genes required for adenine biogenesis (cluster 1), in agreement with the roles of these gene products in both histidine and purine metabolism [32,33]. Genes that encode ER vesicle coat proteins (SEC13, SEC21, SEC24, COP1, RET2, RET3 and others) were induced with other ER-specific genes in response to the UPR, as discussed above for cluster 58, but were also strongly repressed following long-term nitrogen and carbon starvation, along with hundreds of other genes that function in diverse aspects of secretion and protein synthesis (cluster 72). The repression of these genes coincided with the cellular growth arrest resulting from the starvation conditions and was likely to be triggered by the decreased demand for protein and membrane synthesis in nondividing cells. That these genes belong to multiple, distinct clusters reflects the condition-specific roles of the encoded proteins and suggests that their conditional expression with these alternative groups of genes is triggered by different cellular signals.
Many of the genes that clustered together by fuzzy k-means clustering are likely to be coregulated at the level of transcription in response to certain environmental conditions. We explored this possibility by characterizing the enrichment of each fuzzy cluster for genes that contained known transcription factor binding sites within 800 base-pairs (bp) upstream of their open reading frames (ORFs). Genes were assigned to each cluster using an empirically chosen membership cutoff of 0.06 or 0.08, and the probability of observing the number of genes in each cluster that contained one or more copies of each transcription factor binding site was calculated, based on the hypergeometric distribution (see Materials and methods). Roughly 25% of the identified fuzzy clusters were statistically enriched (with P <2 × 10-4) for genes that contain copies of at least one of 43 different promoter elements, and more than half of these clusters were enriched for multiple sites (the complete results are available at ). In many cases, the presence of the promoter elements was consistent with the known regulation of the genes' expression (Figure (Figure6).6). For example, cluster 2 was enriched for genes that contain binding sites for Gcn4p, Cbf1p, and Met31/32p. Around 75% of these genes are known to be induced by Gcn4p in response to amino-acid limitation and contain the Gcn4p promoter element [34,35]. However, those that are specifically involved in methionine synthesis also contain the recognition sequences for Met31/32p and/or Cbf1p, factors that cooperatively regulate gene expression according to the demand for the products of this pathway [36,37,38,39].
At a membership cutoff of 0.08, cluster 45 consisted of many genes involved in the tricarboxylic acid cycle and oxidative phosphorylation, and this group was enriched for the binding site of the Hap2/3/4p complex that is known to regulate the genes' expression. At a slightly lower cutoff of 0.06, additional genes involved in respiration and utilization of alternative carbon sources were assigned to the cluster, making the enrichment of the promoter element recognized by the catabolite repressor Mig1p statistically significant [40,41,42]. At both of these membership cutoffs, this cluster was also highly enriched for the sequence recognized by the stress-responsive factors Msn2p and Msn4p. These factors recognize a sequence that is very similar to the Mig1p-binding site, and it is possible that the enriched sequence actually represents derivative Mig1p elements. However, a specific role for Msn2p in the response to glucose starvation has recently been identified , raising the possibility that the factor is directly involved in regulating these genes. Another cluster (cluster 73) consists largely of genes that were sharply repressed in response to environmental stresses [2,44]. The ribosomal protein genes in this group are regulated by the factor Rap1p and contain multiple copies of its binding site within their promoters [45,46], while other genes in this group contain two putative regulatory sequences that have been previously identified as enriched in the promoters of many of these genes [2,47,48,49,50]. In each of these cases, the comprehensive clusters identified by the fuzzy k-means analysis included additional genes that were not previously known to be targets of these factors. Some of the binding-site occurrences shown in Figure Figure66 are likely to have occurred by chance (especially for sequences that are common in the genome, such as the Hap2/3/4p-, Mig1p-, and Msn2/Msn4p-binding sites). However, the similarity between the expression patterns of these genes and those of the known transcription factor targets, along with the functional correlation of the gene products and the presence of the respective binding sites in the genes' promoters, strongly suggest that many of these genes are legitimate targets of these regulators.
The overlapping clusters identified by fuzzy k-means clustering presented more complete groups of transcription factor targets compared to other clustering methods, enhancing the identification of promoter elements enriched in the clusters and implicating details of the conditional regulation of gene expression. An example can be seen in a set of around 15 genes that are induced by Yap1p in response to oxidative stress, but by the general stress factors Msn2p and/or Msn4p (Msn2/Msn4p) in response to other stressful conditions . These genes belonged significantly to three different clusters (Figure (Figure6).6). One cluster (cluster 7) consisted of around 90 known Msn2/Msn4p targets and was enriched for genes whose promoters contain the known Msn2/Msn4p-binding site as well as other C-rich sequences that are similar to, but distinct from, the Msn2/Msn4p site (see below). The second cluster that these genes belonged to (cluster 4) comprised known Yap1p targets, most of which contain the Yap1p-binding site within their promoters. The third cluster (cluster 8) specifically represented the subgroup of genes that are conditionally regulated by Yap1p or Msn2/Msn4p, and this group was enriched for both the Yap1p element and the Msn2/Msn4p-binding site (but not other C-rich sequences). At a lower membership cutoff, additional genes were assigned to cluster 8 that showed similar expression patterns and contain both the Yap1p and Msn2/Msn4p promoter elements, suggesting that these genes may also be conditionally regulated by the factors. In contrast to these results, when the data were analyzed by k-means clustering, these genes could only be assigned to a cluster of Yap1p targets or a cluster of Msn2/Msn4p targets, and therefore no group of genes that was statistically enriched for both of these promoter elements could be identified (data not shown).
The majority of the clusters identified by fuzzy k-means clustering were not statistically enriched for known transcription factor binding sites. To identify novel enriched promoter sequences, we calculated the hypergeometric distribution of all possible 6-mer sequences in the promoters of the genes clustered by fuzzy k-means clustering. Almost all the statistically significant 6-mers represented known transcription factor binding sites, with the exception of a group of C-rich sequences with high statistical enrichment in the promoters of the Msn2/Msn4p targets in cluster 7 (Figure (Figure6).6). We therefore focused our attention on the newly identified group of cell-wall genes, defined as the top 20 genes in cluster 61. Although none of the 6-mers met the significance cutoff for this cluster (P = 10-6), the most significant sequence (CGCGAA, P = 10-5) was identical to the core binding site of SBF, a transcription factor complex that regulates cell-cycle-dependent gene expression at the G1 to S transition [51,52,53]. In fact, more than two-thirds of these genes were identified by Iyer et al.  as part of a larger set of around 180 genes whose flanking regions were physically bound by the SBF complex. However, this set of genes was not coordinately expressed during cell-cycle progression [54,55], suggesting that these cell-wall genes may be regulated by a distinct mechanism in response to environmental conditions.
To try and identify novel sequences conserved in these promoters, we used the motif-finding algorithm MEME , initializing the EM algorithm with the most significantly enriched 6mer for the cluster. Two sequences were repeatedly identified using a variety of parameters (Figure (Figure7):7): one motif was very similar to, but extended from, the known SBF-binding site and was present in 80% of the gene promoters, often in multiple copies. This represents a significant enrichment over the entire set of yeast gene promoters, of which approximately 30% contain the site. Nearly 90% of the promoter regions that contain this motif were shown to be physically bound by the SBF complex , consistent with the idea that SBF binds this sequence. These details raise the possibility that SBF coordinates the expression of this set of cell-wall genes in response to conditions other than cell-cycle progression , perhaps in a manner that is specifically dependent on the variant site identified here. MEME also identified another sequence that has not previously been implicated in gene-expression regulation (Figure (Figure7b).7b). More than 55% of the top 20 genes in cluster 61 harbor the motif in their promoters, in contrast to only 16% of all yeast promoters. The precise roles of these sequences in mediating gene expression will require further experiments; however our ability to identify novel sequences conserved in these promoters highlights the potential for discovering additional regulatory elements by fuzzy k-means clustering.
As well as understanding the similarities between genes based on their expression patterns, it is also enlightening to correlate the experimental conditions in terms of their effects on gene expression. These correlations can implicate common features of the environments while pointing to the regulatory systems that are activated in each situation. A convenient feature of the fuzzy-clustering output is that the experiments can be hierarchically clustered, on the basis of the expression patterns of each selected subset of genes, to reveal similarities in the effects of the conditions. Performing the clustering based on subsets of genes presents subtle correlations between the experiments that cannot be realized when the clustering is based on all the genes in the dataset. For example, when the 93 microarray experiments analyzed in this study were hierarchically clustered on the basis of the expression patterns of all the genes in the dataset, the experiments largely clustered according to the individual time courses (Figure (Figure8a).8a). This reveals that the overall genomic expression program triggered by each environment was unique to each set of conditions.
However, when the microarray experiments were clustered on the basis of subsets of genes identified by fuzzy k-means clustering, more detailed correlations emerged, indicating more information about the effects of each environment. An example is the sulfhydryl-oxidizing drug diamide, which affects many aspects of cell biochemistry. When the experiment clustering was performed on the basis of genes encoding protein-folding chaperones (the top 10 genes in cluster 11), a striking similarity between the effects of diamide and heat shock was observed (Figure (Figure8b).8b). In contrast, when the microarray clustering was performed on the basis of genes involved in oxidative stress defense, (the top 24 genes belonging to cluster 4), diamide was most similar to hydrogen peroxide and menadione, which inflict oxidative damage by generating reactive oxygen species (Figure (Figure8c).8c). In terms of the genes induced in the UPR (identified as the top 30 genes or so in cluster 58), the effects of diamide were most similar to those triggered by the reducing agent DTT (Figure (Figure8d).8d). That the effects of diamide were similar to those of different environmental conditions depending on the genes analyzed reflects the diverse effects of this drug on the cell (Figure (Figure8f).8f). By crosslinking protein sulfhydryl groups, diamide is thought to disrupt protein structure and trigger oxidative stress [57,58], both of which are likely to perturb normal ER functions .
The similarities between other environmental conditions are less well understood. For example, limitation of the essential nutrient zinc triggered diverse gene-expression changes in the cell . Although the overall genomic expression program triggered by this condition was distinct, when the experiment clustering was based on genes involved in the use of alternative carbon sources and respiration (cluster 36 and cluster 45, respectively), a significant similarity between the effects of zinc limitation and conditions that involve glucose starvation emerged (Figure (Figure8e).8e). Like carbon starvation, zinc limitation triggered the increased expression of these genes, even though the cells were not limited for glucose (T. Lyons, personal communication). This result suggests that zinc-limited cells may have a defect in glucose metabolism, leading to the induced expression of the respiration and carbon-utilization genes. While a link between zinc limitation and sugar metabolism has been established in mammals , the molecular basis of this correlation is not known. In contrast to this relationship, when the experiment clustering was performed on genes encoding ER-resident proteins, the effects of zinc starvation were most similar to those inflicted by DTT and diamide (Figure (Figure8d),8d), suggesting that zinc limitation may initiate the UPR. Zinc starvation is not known to induce this program in yeast (C. Patil, personal communication). However, a potential connection between zinc and the UPR is the protein calreticulin, an ER protein-folding chaperone that acts on glycosylated proteins . That mammalian calreticulin is a zinc-dependent protein  raises the possibility that zinc limitation prevents the proper activity of this protein in yeast, leading to unfolded ER proteins and triggering the subtle increase in expression of genes that participate in the UPR. The correlations between the gene-expression changes triggered by these conditions suggest hypotheses about the effects of each condition that warrant future experiments.
To respond to diverse and frequently changing environmental conditions, yeast cells must precisely mediate the synthesis and function of the proteins in the cell. This is controlled in part by the overall genomic expression program that results from the combined action of different regulatory factors, each of which responds to specific extra- and intra-cellular signals. Many of these regulators act under specific conditions, and together they govern the expression of overlapping sets of genes. Individual genes, in turn, are regulated by multiple, condition-specific systems that result in each gene being coexpressed with different groups of genes under different situations.
Although examples of this type of regulation have been observed on an individual gene basis, our results suggest that the condition-specific regulation of overlapping sets of yeast genes is a prevalent theme in the regulation of yeast gene expression. A large fraction of yeast genes is expressed in patterns that are similar to different groups of genes in response to different subsets of the experiments (Table (Table1).1). Furthermore, a substantial number of these genes contain multiple transcription factor binding sites in their promoters (Figure (Figure6,6, and see ), consistent with the idea that they are conditionally regulated by multiple, independent regulatory systems. The condition-specific regulation of gene expression has also been implicated in higher organisms [63,64] and probably has a significant role in regulating genomic expression. This is in contrast to the regulatory logic of prokaryotes, in which the expression of defined sets of genes in operons is a predominant feature of regulation. Thus, the conditional regulation of overlapping groups of genes may represent a regulatory theme that is particularly important in eukaryotes.
The prevalence of conditional gene coexpression poses a challenge for the analysis of gene-expression data, because many genes will have expression patterns that are similar to multiple, distinct gene groups. Fuzzy k-means clustering is well suited to identifying conditionally coexpressed genes for a number of reasons. First and foremost, the method can present overlapping clusters, revealing distinct features of each gene's function and regulation. The resulting implications can be used to assign refined hypothetical functions to uncharacterized gene products on the basis of the known functions encoded by the genes in each cluster. In addition, this information can suggest additional cellular roles of well studied proteins (see ). The overlapping clusters identified by fuzzy k-means clustering also present more comprehensive groups of conditionally coregulated genes. This is especially important for the successful identification of regulatory motifs common to the promoters of similarly expressed genes, because motif-finding algorithms are often hindered by small sample sets. More than two-thirds of the gene clusters we identified are not enriched for known regulatory elements, highlighting the potential for discovering novel sequences involved in gene-expression regulation. We expect that fuzzy k-means clustering will advance that discovery, as illustrated by our ability to identify new sequences conserved in the promoters of clustered genes.
Another benefit of the fuzzy k-means algorithm is that it identifies continuous clusters of genes. This allows each cluster to be expanded or collapsed to view genes of varying similarity in expression. While the genes of highest membership in a given cluster are often tightly correlated in terms of biochemical function and regulation, expanding the cluster can identify genes that are similarly expressed in only subsets of the experimental conditions. The resulting gene relationships can suggest details about the cellular roles served by the encoded gene products and the regulatory systems that govern the genes' expression in response to the relevant conditions. Thus, the results of fuzzy k-means clustering are naturally suited for biologists to use in an intuitive and physiologically meaningful way.
The unique features of fuzzy k-means clustering have allowed us to uncover complex similarities in yeast gene-expression patterns, identify putative transcription factor binding sites present in the genes' promoters, and elucidate the environmental conditions that trigger changes in gene expression. Integrating these details can indicate the cellular signals and regulatory systems that govern the expression of specific sets of genes in yeast (Figure (Figure9).9). For example, the fuzzy clustering of genes involved in methionine biosynthesis with other amino-acid biosynthetic genes and with genes involved in nitrogen utilization lead to the identification of multiple transcription factor binding sites in the genes' promoters. Together, these details reflect the alternative regulatory systems that are known to govern the expression of the methionine biosynthesis genes. Although they are induced by one regulatory system (Cbf1p-Met31/32p) according to the demand for the pathway's products, they are induced by an alternative system (Gcn4p) in response to a general signal of amino-acid starvation [34,35,65], and they are probably also regulated by a third mechanism (GATA factors) in response to the available nitrogen source. Combining this information with similar indications for other sets of genes gives a summary of the details discussed in this study and suggests a model for the organization of the regulatory system that controls gene expression in yeast (Figure (Figure9).9). The overlapping nature of the sets of coregulated genes supports the ability of the cell to customize the emergent genomic expression program to the particular needs of the cell, while minimizing the number of regulators required to produce each genomic expression program.
The fuzzy k-means algorithm used here was chosen for its conceptual and algorithmic simplicity. There are many alternative algorithms that might accomplish the same ends. For example, Ihmels et al.  have applied a heuristic algorithm to the analysis of yeast gene-expression data to identify overlapping sets of genes whose expression is similar to known gene-expression patterns. This method produced interesting results and identified genes that were similarly expressed to known transcription factor targets. A key difference between these algorithms is that fuzzy k-means clustering requires no a priori information about the dataset. Thus, each method may be suitable for a different biological question, namely identifying genes whose expression is similar to known or expected gene expression patterns versus an unbiased, de novo exploration of the gene-expression dataset.
Despite the advantages of fuzzy k-means clustering discussed above, the method also has a number of limitations. Most notably, the assignment of genes to the clusters requires a user-defined membership cutoff. While this allows complete flexibility in data exploration, selecting meaningful cutoffs is a challenge. Choice of cutoff can be guided by a number of criteria, including the coherence of the selected gene-expression patterns, the functional relationships of the characterized genes selected, or the statistical enrichment of sequences in the selected genes' promoters. We have attempted to alleviate the challenge of selecting cutoffs by providing visualization software specifically designed for the fuzzy clustering results, allowing the gene expression data to be inspected directly and dynamically.
Although the fuzzy k-means clustering method successfully identified nearly 90% of the known clusters in the dataset, it routinely failed to identify a small number of groups that were identified by hierarchical clustering. The inability of the method to find the expression patterns representing these groups seemed to be dependent on the overall properties of the dataset, rather than the absence of an appropriate eigen vector used to initiate the process, as the program was unable to identify these patterns even when the process was initiated by seeding the centroids with the unidentified patterns (data not shown). We have accounted for this limitation by allowing any number of expression patterns to be added to the final list of identified cluster centroids, thereby revealing genes that are similarly expressed to the pattern in question.
Despite these limitations, the unique advantages of fuzzy k-means clustering make the technique a valuable tool for gene-expression analysis. We believe that fuzzy k-means clustering will be a useful complement to other computational methods commonly used to analyze gene-expression data. Whereas algorithms that present discrete gene clusters provide a straightforward method of initial data exploration, the flexibility of fuzzy k-means clustering can be used to reveal more complex correlations between gene-expression patterns, promoting refined hypotheses of the role and regulation of gene-expression changes.
The clustering software FuzzyK and the visualization program FuzzyExplorer are available from , along with the complete clustering results and additional information.
Published genomic expression data of wild-type S. cerevisiae responding to zinc starvation , phosphate limitation , DNA-damaging agents , and a variety of stressful environmental changes  were combined into a dataset of 6,153 genes and 93 microarray experiments (dataset A). These data were chosen because the experiments were performed using the same experimental and microarray methods . The data were downloaded from the Stanford Microarray Database and were otherwise unprocessed before clustering, with the exception of the heat shock, DTT, and carbon-source experiments, which were transformed as previously described . The complete dataset organized by hierarchical clustering can be downloaded from . A subset of this data was used in the fuzzy k-means clustering and consisted of 4,373 genes whose standard deviation in expression was log2(0.45) from each vector mean (dataset B), identified using the program Cluster .
For all methods discussed, the weighted, uncentered Pearson correlation was used as the similarity metric (referred to simply as the Pearson correlation) . Where noted, the Pearson distance was used, equal to 1 - correlation. The array weights used in the calculation were generated as previously described, using a Pearson correlation cutoff of 0.8 and an exponent of 1 (see  for details).
Average linkage hierarchical clustering of the data was carried out using the program Cluster as previously described, using the weighted, uncentered Pearson correlation as the similarity metric . Dataset A and dataset B were hierarchically clustered with identical parameters, using array weights calculated based on a correlation cutoff of 0.8 and an exponent of 1 . Clustering of the microarray experiments was carried out similarly, using gene weights calculated with a correlation cutoff of 0.7 and an exponent of 1.
To represent all the significant gene clusters identified by hierarchical clustering, the dendrogram generated for dataset B by the Cluster program was parsed, and the average expression patterns of clusters of more than three genes with an average Pearson correlation >0.7 were calculated. Through this method, 38 hierarchical cluster means were identified. The parsing process was repeated to calculate the average expression patterns of clusters of more than three genes with average correlations >0.8 and >0.9. Cluster means not already represented in the initial group of 38 clusters were added to the group, resulting in a total of 53 hierarchical cluster means identified for the dataset. The centroids identified through fuzzy k-means clustering were considered similar to the hierarchical cluster means if the Pearson correlation between the vectors was >0.7.
We implemented the modified fuzzy k-means method in the C++ program FuzzyK, available at .
The fuzzy k-means algorithm  is based on the minimization of the objective function shown below, for a given fuzzy partition of the data, F, and a set of K cluster centroids, V
where Xi is the expression pattern of the ith gene in the dataset, Vj is the centroid of the jth cluster, d is the Pearson distance between Xi and Vj, mXiVj is the membership of Xi in cluster Vj, N is the number of genes in the dataset, and K is the total number of clusters.
We implemented the algorithm to perform three successive cycles of fuzzy k-means clustering. The first cycle of clustering was initialized by performing PCA on dataset B using the GNU Scientific Library SVD function. Of the top k/3 eigen vectors, those to which no gene had a maximal Pearson correlation were eliminated (for k = 120, only one eigen vector was eliminated in each cycle). The remaining eigen vectors were used as prototype centroids for that clustering cycle. Subsequent cycles of clustering were initialized similarly, except that PCA was performed on the respective data subset used in that clustering cycle.
During the centroid refinement in each clustering cycle, new centroids were calculated on the basis of the weighted mean of all the gene-expression patterns in the dataset according to
where each gene's membership m (a continuous variable from 0 to 1) was defined as
and w was the gene weight: in the first clustering cycle, the gene weights used were those defined by the program Cluster, using a Pearson correlation cutoff of 0.7 and an exponent of 1  and in subsequent cycles the gene weight was empirically defined as
where dXi, Vj is the Pearson distance between gene Xi, and vector Vj, cXi, Xn is the Pearson correlation between genes Xi and Xn, and x is the correlation cutoff, in this case 0.6. This weighting scheme served to overweight genes that were correlated to other genes in the dataset.
In each clustering cycle, the centroids were iteratively refined until the average change in gene memberships between iterations was <0.001 (approximately 40-60 total iterations in each clustering cycle). While around 85% of the centroids stabilized within approximately 15 iterations, some of the centroids required more than 40 iterations before stabilizing.
After each clustering cycle, the centroids were combined with those identified in previous cycles, and replicate centroids were averaged: each centroid was compared to all other centroids in the set, and centroid pairs correlated >0.9 were replaced by the average of the two vectors. The new vector was compared to the remaining centroids in the set and averaged with those to which it was correlated >0.9. This process continued until each centroid (or the vector that replaced it) was compared to all other existing centroids in the set.
Following the first and second clustering cycles, data subsets were selected to apply to subsequent rounds of clustering. Genes that were correlated to any existing centroid with a Pearson correlation >0.7 were removed from the dataset, and array and gene weights were recalculated on the data subset as described above. The new data subset was applied to a subsequent cycle of clustering, performed as described above.
The centroids identified through three rounds of fuzzy clustering were combined into one set and replicate centroids were averaged, as described above. Each gene in dataset A was assigned a membership score to each of the unique centroids. For display in the figures, the final list of centroids was ordered by hierarchical clustering. Genes were selected in each cluster if their membership score was greater than the empirically determined membership cutoff applied to each cluster. For display in Figure Figure5,5, the genes selected in each cluster were subsequently organized by hierarchical clustering.
For an optimal comparison of the results of k-means and fuzzy k-means clustering, we performed the k-means clustering identically to the fuzzy k-means protocol, except that during the clustering iterations each gene contributed only to the cluster to which it was most similar (with a membership of 1.0). Three rounds of hard k-means clustering were performed with k = 120, and each cycle was initiated by seeding k/3 centroids with the most informative k/3 eigen vectors identified by PCA, as described above. The process for merging centroids, selecting the data subsets for subsequent clustering rounds, and gene and array weighting were carried out identically as described for fuzzy k-means clustering. After identification of the final set of centroids, each gene was assigned only to the centroid to which it was most similar.
To estimate the dependence of the procedure on the initial dataset, a bootstrapping method was applied in which the fuzzy k-means protocol was repeated 100 times, each time on 4,373 genes chosen randomly from dataset B, with k = 102. The occurrence of each centroid in the bootstrap trials was determined by summing the number of trials that contained a centroid that was correlated >0.7 to the centroid in question. By this criterion, roughly 50% of the centroids were identified in 90% of the trials, while approximately 25% of the centroids were identified in all of the trials.
To estimate the dependence of the procedure on the eigen vectors used to seed the clusters, a similar bootstrapping procedure was carried out in which PCA was performed on 4,373 genes chosen randomly from dataset B but the cluster refinement was done using all genes in dataset B. The frequency of each centroid was scored as described above. The results were similar to the previous bootstrapping experiment, with around 50% of the centroids present in 90% of the bootstrapping trials, and around 25% of the centroids identified in all the trials.
Genes were assigned to all the 91 identified centroids on the basis of a membership cutoff of 0.06 or 0.08. The statistical enrichment of each cluster for genes that contained known transcription factor binding sites or different 6-mer sequences within 800 bp upstream of the ORF was assessed, according to the hypergeometric distribution. The probability of observing at least q genes that contained one or more copies of a given sequence out of l genes in a fuzzy cluster was calculated as
where M is the number of genes in the genome that contain the motif and N is the total number of genes in the genome. Forty-three transcription factor binding sites were compiled from the literature (see  for a complete list of sequences). The enrichment of each sequence was considered significant if the P value was <0.01 divided by the number of elements searched, or 2 × 10-4 for the 43 transcription factor binding sites and 2 × 10-6 for the 4,096 different 6-mers.
The program MEME  was seeded with the most significant 6-mer (CGCGAA) enriched in the promoters of the genes selected for cluster 61, and the program was run with a variety of parameters (the parameters and MEME output can be found at ). Genes whose promoters contained significant matches to the identified matrices were identified using Patser on the RSA tools website [68,69].
We thank A. Moses and E. Kelley for helpful suggestions and programming assistance, A. Alizadeh, J. Bolderick, N. Ogawa, C. Patil, C. Rees, P. Spellman, and M. Kamvysselis for helpful discussions, and A. Moses, D. Chiang, and J. Fay for critical reading of the manuscript. A.P.G. is supported by an NSF postdoctoral fellowship in biological informatics, and M.B.E. is a Pew Scholar in the Biomedical Sciences. This work was conducted under the US Department of Energy contract No. ED-AC03-76SF00098.