This section starts by describing the steady-state chemostat microarray compendium and the regression analysis to assess the influence of cultivation parameters on gene expression. Then, the combinatorial effects of cultivation parameters on the transcriptome are investigated using enrichment tests and through biological interpretation of these effects on genes of functional categories and biochemical pathways. To demonstrate the usefulness of the compendium, this section concludes by presenting two case studies concerned with, firstly, the functional analysis of uncharacterized and dubious genes, and secondly, the interpretation of shake-flask-based transcriptome studies using the compendium.
Inferring the influence of cultivation parameters on gene expression
The Saccharomyces cerevisiae laboratory reference strain CEN.PK 113-7D (MATa) was grown at steady state in chemostat cultures under 55 different conditions. A condition can be characterized by a specific configuration of the settings of ten different cultivation parameters. One of these cultivation parameters is the available carbon source. Throughout the compendium five different carbon sources were used, i.e. acetate, ethanol, galactose, glucose and maltose. Thus, these five compounds form the settings that the cultivation parameter carbon source can assume. Table provides an overview of the settings for all cultivation parameters. Figure depicts the expression levels of the gene UPC2 across all 55 conditions. The lower part of this figure is a schematic representation of the settings of the ten cultivation parameters over all conditions. Note that the expression levels are absolute expression levels that come from a single-channel microarray system and not relative expression levels, where a reference condition is employed. A regression model was designed to assess the influence of the cultivation parameters on gene expression. The model was applied to all differentially expressed genes individually. (A large majority (6005 of 6383) of the genes in the S. cerevisiae genome was found to be differentially expressed in at least one of the 55 conditions.) Using a step-wise approach, the regression model iteratively selects significant predictors in order to reconstruct the expression pattern of a gene.
| Table 1Settings within the cultivation parameters |
Here, the cultivation parameters form the predictors. We incorporated single effects and two types of combinatorial effects. See Figure for a schematic example of genes that are influenced by these effects. A single effect is constituted by one setting of one cultivation parameter. For example, limiting element carbon is a predictor. (This will be a significant predictor for genes, which show differential expression between carbon-limited growth and growth that is limited by the residual quantity of other substrates.) In Figure gene g1 responds solely to a single effect. The first type of combinatorial effect is constituted by applying the logic AND function between two settings of two different cultivation parameters. For example, limiting element carbon AND aerobic growth (in short: aerobic carbon-limited growth) form such a combinatorial effect. Of course, the cell's transcriptome and metabolome are known to respond in a combinatorial fashion to particular environmental conditions or parameters. That is, the simultaneous presence of certain environmental factors results in a transcriptional and metabolic state that is not a simple aggregation of the states reached based on the single presence of one of these factors. For example, when glucose is present, it is utilized in different ways by S. cerevisiae, depending on the presence of oxygen. Including these AND effects enables the systematic investigation of the influence of combinations of cultivation parameters on gene expression. Gene g2 in Figure responds to an AND effect. The second type of combinatorial effect is constituted by applying the logic OR function on two different settings within the same cultivation parameter. Here, carbon-limited OR iron-limited growth forms an example.
This effect is included, because we expect that closely related settings within a cultivation parameter, e.g. similar carbon sources, will have a similar effect on gene expression. Gene g3 in Figure responds to an OR effect. In the case of UPC2 (Figure ), the regression model successively selected the single effect aeration type, the AND combinatorial effect anaerobic zinc-limited growth and the OR combinatorial effect nitrogen source proline or asparagine. (Note that cultivation parameter aeration type can assume only two settings, i.e. aerobic growth and anaerobic growth. Since these two predictors are mutually redundant, only one of them (aerobic growth) is included as a predictor in the regression model and labeled as aeration type. A positive regression coefficient for aeration type indicates that the gene is more highly expressed under aerobic conditions; a negative coefficient indicates the reverse scenario.) The regression model keeps on adding cultivation parameters as predictors, until no further significant improvement can be made. For example, for g4 in Figure the single effect A+ is selected first, followed by the combinatorial effect A+&Bi. See Methods section for details.
The expression of many genes responds to combinatorial effects
For most genes the regression model was able to explain 60 to 80% of the variance, which is present in their expression patterns across the 55 conditions. See Figure . The amount of explained variance does not depend that much on the average expression level of a gene, although there is a steady increase in explained variance with increasing average expression level. Much more important is the degree to which a gene is differentially expressed. The F-statistic, i.e. the ratio between the variance of the average expression levels across the 55 conditions and the average replicate variance across these conditions, is strongly correlated with the degree to which the gene's expression pattern can be reconstructed. The expression levels of genes with small F-statistics are obscured by measurement noise and do not differ significantly between the growth conditions. Also not surprisingly, when more significant cultivation parameters are selected by the regression model, more of the variance of the gene can be explained. Figure outlines which and how many cultivation parameters were selected to reconstruct the expression patterns of all genes. On average, a gene is influenced by 1.25 (± 1.18) single effects, 1.73 (± 1.43) AND effects and 1.01 (± 1.04) OR effects. The limiting element, aeration type and protocol (which is dealt with in more detail below) are the most prominent factors that influence gene expression behavior. Here it should be noted that the setup of the cultivation parameters in the compendium is not fully combinatorial, i.e. not all possible combinations of cultivation parameters are present in the dataset. For example, across the 55 conditions, 53 have been cultivated under pH 5, while only a single condition was performed with a lower pH (3.5) and similarly for a higher pH (6.5), thereby precluding combinatorial effects between the higher or lower pH and other environmental parameters. Thus, the numbers of genes, which are influenced by a particular cultivation parameter (as visualized in Figure ), are biased by the number of different settings of the cultivation parameters and the number of combinations of cultivation parameters present in the compendium. Anyhow, the results indicate that the expression of many genes is influenced, not only independently by particular cultivation parameters, but also in a combinatorial fashion, i.e. there are many combinatorial effects between cultivation parameters that affect gene expression behavior.
The regression analysis was repeated using only the single effects as predictors. For most genes this resulted in a lower percentage of explained variance. See Figure . Of course, this result could be expected based on the fact that many combinatorial effects were selected as significant predictors in the original regression model. Subsequent enrichment analysis provided additional evidence for combinatorial regulation. Genes, of which their expression levels are manipulated by a particular single effect or combinatorial effect, were grouped and checked for functional overrepresentation. Additional File
1 provides an overview of all enrichment analysis results. It reveals the many cases (> 1000) in which a particular combination of environmental parameters leads to the up- or downregulation of a group of functionally related genes. Also, functional enrichment was compared between the regression analysis including both single and combinatorial effects and the analysis including only single effects. Genes were clustered based on their reconstructed expression patterns that were obtained for both regression models and these clusters were evaluated for enrichment in functional annotation categories. Figure shows that the inclusion of the combinatorial effects leads to increased functional enrichment, and thus further substantiates the existence of the combinatorial influence of the presence of environmental factors and the importance of modeling them. Additional File
2 describes the complete comparison between the regression models including and excluding the combinatorial effects.
The sample preparation protocol has a large impact on the measured gene expression levels
As indicated in Table and Figure the tenth cultivation parameter is termed "Protocol". Unlike the nine other parameters, "Protocol" is not directly related to the cultivation conditions under which yeast is grown, but refers to the protocol to process RNA samples. Several years ago, an improved sample preparation kit was introduced [
14]. This kit obviated the need for the expensive and time-consuming poly-A mRNA purification step included in the original procedure. The decision to omit the purification step, which was also made in other yeast research groups, was supported by information indicating that samples prepared with or without this step were similar [
15]. Thus, two different protocols were used to generate the chemostat compendium's samples for microarray hybridization: Protocol A and Protocol B. The main difference between these protocols is that Protocol A includes the polyA-mRNA isolation step (with cDNA synthesis being performed on purified mRNA), while Protocol B excludes the purification step (with cDNA synthesis being performed on total RNA). (The Methods section and Additional File
3 provide the complete details on both protocols.)
As apparent from Figure , the measured transcript levels of many genes appeared to be influenced by the protocol. Enrichment analysis revealed a significant overrepresentation of characterized genes amongst the genes that have higher apparent transcript levels under protocol B; all three GO root-categories (biological process, cellular component and molecular function) were highly enriched. On the other hand, significantly many uncharacterized genes yielded higher apparent transcript levels under protocol A. Further investigation revealed a trend between transcript level and protocol influence: Genes with higher average expression level tended to yield a higher transcript level in protocol B and genes with a lower average transcript level tended to yield lower transcript levels under protocol B (Figure ). In general, uncharacterized genes have a lower expression than characterized genes, which explains the results from the enrichment analysis. Further evidence for this hypothesis is found when analyzing the genes that encode ribosomal proteins (RP genes), whose mRNA's are highly abundant. Again, significantly many RP genes exhibit higher expression when analyzed with protocol B (middle and bottom plots in Figure ).
The relationship between mRNA abundance (expression level) and protocol is only weak and does not hold for each gene individually. It may, for example, be influenced by the average length of the polyA-tail of different transcripts. Indeed, analysis of mitochondrial genes lacking a poly-A tail demonstrated a large influence of the protocol. Of the 52 transcripts on the microarray representing mitochondrial genes, 27 (amongst which 16 unique mitochondrial genes) were influenced by the protocol, i.e. the regression model selected the protocol as a significant predictor of the expression pattern of these genes. All these 27 mitochondrial genes showed a higher (apparent) transcript level under protocol B. These results illustrate that not only different microarray platforms, labs, and strains, but also the hybridization preparation steps can affect the outcome of microarray analyses. This strongly underlines previous warnings on the challenges involved in comparing microarray results from different experiments.
The chemostat compendium allows us to adequately model the influence of the hybridization protocol on expression. In particular, the compendium contains 18 growth conditions (9 sets of two), where the only differing cultivation parameter is the protocol setting: The growth conditions were identical in these nine cases, only the protocol was different. This provides extra statistical power in the regression procedure and enables us to successfully model the protocol effect. This allows us analyze the influence of the environmental cultivation parameters without interference of the protocol's confounding effect.
Functional categories are specifically associated with combinations of environmental parameters
Many functional categories are specifically influenced by a combinatorial effect. Many genes within such a category are influenced by a combinatorial effect, whereas none or only a few genes are affected by the single effects that constitute this combinatorial effect. See Methods section for these details. This analysis was performed on all MIPS categories. In total 153 significant combinatorial effect-MIPS category pairs were identified. These are depicted in Additional File
4. Here, we focus on the biological interpretation of two such combinatorial effects: Carbon source acetate OR ethanol, and, Limiting element phosphorus OR Sulfur. See Table .
| Table 2MIPS functional categories specifically associated with combinatorial effects |
The first example is provided by the OR effect of carbon sources ethanol and acetate on metabolism and energy household. These C2-compounds share a drastically different impact on central metabolism when compared to using the sugars glucose, maltose and galactose as carbon source. During growth on sugars, all metabolic building blocks can be derived from glycolysis, the tricarboxylic acid cycle and the pentose phosphate pathway, while during growth on C2-compounds, gluconeogenesis and the glyoxylate cycle are essential for the provision of some of these precursors. Furthermore, the higher ATP requirement for biosynthesis during growth on the C2-compounds implies that, at a fixed specific growth rate, dissimilatory fluxes have to be higher with the C2-compounds than with a sugar as the sole carbon source. This is supported by the significant shared influence of the C2 carbon sources on the genes of gluconeogenesis and the tricarboxylic acid pathway.
Besides this and other examples that can be easily explained by current knowledge, there are also many interactions that might represent as of yet unknown regulatory mechanisms. For example, we find that the limiting elements sulfur and phosphorus have a similar effect (i.e. OR effect) on transcription regulation genes. A close inspection of the genes influenced by this OR effect revealed the presence of five genes encoding subunits of Mediator (
MED3/
PGD1 (complex tail),
MED7 and
MED10/
NUT2 (middle),
MED11 and
MED18/
SRB5 (head)), an evolutionarily conserved coregulator of RNA polymerase II [
16] and nine genes encoding chromatin remodeling enzymes (
ARP7,
GCN5,
HST2,
RIF1,
RSC6,
RVB2,
SFH1,
SNF6 and
SPT8). In eukaryotes, gene transcriptional regulation depends on a complex interplay between signal transduction, specific and general gene regulators and complexes that modify chromatin and RNA polymerase II. Under sulfur limitation
S. cerevisiae adapts its transcriptome in order to reduce the expression of sulfur rich genes and proteins [
17,
18]. This response is mediated by Met4 the main sulfur metabolism regulator. The transcriptional changes upon phosphate limitation are mainly related to high affinity phosphate transport, phosphate assimilation and polyphosphate metabolism [
5,
18]. Although
S. cerevisiae requires the transcription of different specific genes under sulfur or phosphate limitations, it is tempting to speculate that the mechanisms that govern the transcription control of these specific sets of genes are shared and depend on shared mechanisms involving specific subunits of the Mediator complex. Such high degree of specificity was demonstrated with the implication of Med2 (a Mediator tail subunit) in the regulation of the low iron response regulon [
16].
Combinatorial regulation within biochemical pathways provides further insight into sulfur metabolism and scavenging
As demonstrated above, we can assess whether groups of genes are influenced by particular (combinations of) environmental parameters using enrichment tests. This opens up the interesting possibility to correlate new and previously known patterns of regulation of individual genes with the regulation of larger families of genes connected to each other in pathways. In contrast to other gene groups, in a metabolic pathway clear connections exist between the gene products and their functions, which allows for more in-depth analysis. Here, we focus on biochemical pathways as described in SGD, which depict the series of chemical reactions converting metabolites, and the enzymes catalyzing these reactions. Enrichment analysis indicated that 5 of the 9 downloaded 'SGD superpathways' were influenced by at least one significant combinatorial effect (at p < 10-3, q < 0.08).
An illustrative example is presented by analyzing the expression profiles of the gene family involved in sulfur- and sulfur containing amino acid-metabolism in yeast (Figure ). Sulfur amino acid biosynthesis involves a considerable number of enzymes required for the de novo biosynthesis of methionine and cysteine and the recycling of organic sulfur metabolites. Expression of the genes encoding the enzymes for this metabolic network is tightly controlled by the available sulfur source, through modulation of the intracellular S-adenosyl-methionine levels. Six different cultivation parameters were significantly often selected to explain the expression patterns of the genes in this pathway (
p < 10
-3). Five of these are combinatorial cultivation parameters. Not surprisingly, the only single effect is sulfur limitation, which causes the upregulation of ten out of the eighteen genes [
19]. See box 1 of the bars near the enzyme names in Figure . Despite large variations in expression under different combinations of conditions, many of the
MET-,
CYS-,
SAM- and
HOM -genes invariably respond to the presence of methionine in the growth medium by clearly reduced expression. See Figure , which depicts the normalized gene expression patterns of all genes of the pathway. This response is independent of the presence of oxygen or growth limitation by carbon or nitrogen sources. Only in the case where methionine is utilized both as sulfur and nitrogen source
and methionine is the limiting element, we observe that the expression of the corresponding genes is not reduced (but even slightly induced, mimicking the (known) response under sulfur limitation). This explains the selection of the combinatorial effects involving methionine depicted by boxes 4, 5 and 6 in Figure .
Interestingly, two genes involved in this sulfur-metabolizing network in part respond differently. HOM2, which is involved in homoserine biosynthesis, responds reciprocally to the availability of methionine in the growth medium compared to the other HOM genes, especially under aerobic conditions. The same observation is made for STR2, which is involved in cystathionine biosynthesis. (In Figure magenta boxes mark the conditions, where methionine is part of the growth medium.) This discrepancy is indicative of a differential regulatory mechanism operating between the HOM2, HOM3 and HOM6 genes of the homoserine pathway, and of the complex regulation of the transsulfuration pathway, involving CYS3, CYS4, STR2 and STR3. Further detailed analysis would be required to elucidate the molecular mechanisms operating in these differential combinatorial controls. Such differential controls operating within a pathway are likely to be involved in intricate flux balancing mechanisms.
Surprisingly, for many of the genes in the pathway under investigation expression levels under zinc limitation are almost as high as under sulfur limitation, especially under aerobic conditions. Moreover, the genes of the transsulfuration pathway are highly expressed under zinc and sulfur limitation, yet lower expressed under the other nutrient limitations. Also here,
STR2 responds reciprocally and is lower expressed under zinc limitation. Although transcript levels per se cannot be used as flux indicators, this expression behavior is consistent with an upregulation of the flux towards cysteine under zinc limitation via the increased synthesis of the corresponding enzymes. (See the graph structure of the pathway near cysteine in Figure .) The exact nature of this response is not immediately apparent. However, it provides an interesting hypothesis on the oxidative stress response of
S. cerevisiae under zinc limitation. As previously described [
20], a "first line of defense" in oxidative stress response is formed by the superoxide dismutase genes
SOD1 and
SOD2, which are induced under aerobic conditions. See Figure . The dithiol glutaredoxin genes
GRX1 and
GRX2 [
21], and the monothiol glutaredoxin genes
GRX3 -
GRX5 [
22], which also participate in the response against oxidative stress, exhibit highly differential transcriptional profiles.
This may provide new insight into the specific roles for each of the varying combinations of glutaredoxins under different growth conditions. Surprisingly, under zinc limitation not only the Cu, Zn-dependent SOD1 gene is lower expressed; also the SOD2 gene, encoding the mitochondrial superoxide dismutase, which is dependent on Mn and not on Zn, is much less induced. A boost in glutathione synthesis apparently takes over the main defense, since the glutathione synthase genes GSH1 and GSH2 are clearly induced, especially under zinc-limited aerobic conditions. This can be seen from the magenta ellipses in Figure . This fits with the fact that significantly many genes in the sulfur scavenging pathway are upregulated under zinc-limited aerobic growth, presumably leading to an induced cysteine pool, cysteine being one of the three components of the tripeptide glutathione.
Functional characterization of uncharacterized and dubious genes using the chemostat compendium
In a recent review [
23] it was pointed out that many (> 1000) genes in the yeast genome are still uncharacterized. Possible reasons for this include genetic redundancy, lack of strong growth phenotype and the possibility that not all of them are real genes. Additionally, genes may be involved in environmental and metabolic responses, which are normally not queried in the lab. Concerning the "characterized" genes, it can be noted that the function of many annotated genes is derived from large-scale studies, and hence, in-depth detailed analysis is lacking for these genes.
We conjecture that the visualization of the expression behavior of a gene over the conditions of the compendium, together with the identification of the significant cultivation parameters to which the gene responds, provides valuable information regarding gene function. With this information, one can design directed biological experiments or assays that probe a specific pathway or activity in order to advance towards the functional characterization of a gene. We mapped our regression results to SGD's genome snapshot, upon which the division of
Saccharomyces cerevisiae ORF's into verified ORF's, uncharacterized ORF'S and dubious ORF's in [
23] was based. For 1350 genes the regression model lead to a good reconstruction of the observed expression pattern (explained variance including replicate variance > 70%). According to SGD, 1009 of these genes were verified ORF's; 286 were uncharacterized and 54 were classified as dubious genes. Amongst the uncharacterized genes, many genes were found to be expressed under conditions which have not been extensively studied before. For example, amongst the 286 uncharacterized genes, five genes are most significantly influenced by zinc limitation, i.e. zinc limitation was the first condition selected by the regression model. One of these,
YOR387C, is only expressed under zinc limitation. These results immediately link the function of a gene to a particular cultivation parameter or a specific biological process related to this cultivation parameter. The expression pattern of these five zinc responsive genes as well as the other genes to be discussed in this section are visualized in Figure . Also, amongst the 54 dubious genes, there are many genes that are highly expressed under one or a few cultivation parameters, while having a constant expression over the remaining conditions. For example,
YJL119C is only highly expressed under phosphorus limitation.
YBL070C also responds to phosphorus limitation, yet particularly when the yeast is grown aerobically. The expression of
YBR292C is influenced by aerobic sulfur-limited growth and
YBL065W is only expressed when grown at a low temperature (12°C). 35 of the 54 dubious genes were affected by the aeration effect or the interaction effect between carbon limitation and aeration. These genes were screened against a recent proteomics study, where expression data of yeast grown in aerobic and anaerobic carbon-limited chemostats was measured [
24]. We found that for three genes unique peptides were quantified. This establishes the existence of the proteins encoded by these "dubious" genes. See Additional File
5 for a list of the 54 dubious genes and details on the peptide identification. Notably, 51 of the 54 dubious genes are no longer present on YG 2.0, the successor of the Affymetrix YG S98 GeneChip, after comparative genomics [
25] and phylogentic footprinting [
26] approaches identified these as false ORF's. However, our analysis reveals a clear-cut influence of environmental conditions on the expression levels of many of these genes, implying that these genes
do have a functional role, at least in the
Saccharomyces cerevisiae strain that was used in this study.
Analysis of shake-flask experiments with the chemostat compendium
Changes in the extracellular environment or perturbations on genetic level do not only affect (signaling) pathways in which the change or perturbation has direct involvement, but can also impact the cell's viability, metabolism or other processes in the cell. For example, there are many experimental conditions and genetic perturbations that will impact the growth rate of the cell. For shake flask cultivations it is not possible to distinguish between the direct and indirect effects, since cultivation parameters like growth rate and nutrient availability cannot be controlled. This also confounds the analysis of gene expression data from shake flask experiments [
11]. By screening a group of genes, which were grouped together on the basis of shake flask experiments, against the compendium, some of the confounding effects can be resolved. The group can be subdivided into clusters of genes that respond to particular environmental parameters within the compendium and thereby identify the cultivation parameters or biological processes that could have played a role in the original shake flask experiment, even when these have not been measured.
To this end, we apply the following strategy: First, we select the (combinatorial) cultivation parameters that are significant for the group under investigation. These are the cultivation parameters that are significantly often selected by the regression model to explain the expression pattern of the genes in the group when compared to the complete genome. Next, the genes are clustered based on the normalized regression coefficients under these cultivation parameters. Finally, these newly obtained clusters are consulted for enrichment of annotation categories. See Methods section for details. As an example, Figure depicts the results of this analysis for the groups of genes, which were found to be induced or repressed in a
dig1Δ,
dig2Δ mutant strain grown in a shake-flask [
27]. To make the induced and the repressed gene groups, we consulted the gene expression data of this study (i.e. the Hughes
et al. yeast mutant microarray compendium [
27]). The induced group is formed by all genes that are upregulated by one fold-change or more in the
dig1Δ,
dig2Δ mutant strain compared to the wild-type strain. The repressed group is formed in a similar fashion by identifying the genes that are downregulated by one fold-change or more.
The results show a clear difference between direct and indirect effects. On the one hand, the enrichment analysis on the TF binding data tells us that the genes in Clusters 3, 4 and 5 form a significantly large part of Dig1's regulon, i.e. the direct targets of TF Dig1. The known role of Dig1 and Dig2 in regulating mating-specific and pheromone-responsive genes is confirmed by the enrichment of these functional categories in Cluster 3. Also, binding sites of TFs Tec1 and Ste12, which together with Dig1 form a regulatory complex involved in mating and filamentation [
28], are enriched for Cluster 5 and Clusters 3 and 5, respectively. Interestingly, the genes within Clusters 3, 4 and 5 were clustered together based on their response to the addition of organic acids propionate, benzoate and sorbate. (The clusters are characterized by the shared transcriptional response of their genes to these acids.) On the other hand, a large set of genes that is induced after the knockout of
DIG1 and functionally redundant
DIG2, is affected by growth rate in the chemostat microarray compendium. See Clusters 1, 6 and 10. The genes of Cluster 1 show high enrichment for metabolism and energy functional categories as well as for general stress response TF Msn2. From this observation we conclude that besides the genes that are directly affected, the double knockout also has a large impact on the metabolism and energy household of the cell when grown in a shake-flask.