shows the model used to infer the functionality of variable positions. The analysis involved four steps. First, the target genes and the binding motif consensus of each TF in
S. cerevisiae were downloaded from the MYBS database (
http://cg1.iis.sinica.edu.tw/~mybs/) (
20) and the variable positions in each consensus were identified. Second, the target genes were separated into groups based on the nucleotides at those positions. Third, the degree of co-expression was quantified. Finally, the Kolmogorov-Smirnov (KS) test was applied to determine whether the gene expression of two groups was significantly different.
Collecting TFBSs
The TFBSs in the
S. cerevisiae genome were also downloaded from the MYBS database (
20), which integrates an array of experimentally verified and predicted PWMs that correspond to 183 TFs. The database allows users to identify TFBSs by using DNA-binding affinity data and phylogenetic footprinting data from eight related yeast species. We used the following two criteria to collect TFBSs in the promoter region of each gene: (i) if a TFBS exists in the promoter region of the gene in
S. cerevisiae, the TFBS should be conserved in at least two of the other seven yeast species; and (ii) the
p-value of the corresponding TF ChIP-chip experiment for the gene should be ≤ 0.01.
Because MYBS integrates both experimental and predicted motifs from several databases, there can be more than one binding motif consensus for a TF. Therefore, for each TF with more than one consensus, a universal consensus was determined by taking the part of each consensus that was common to the entire consensus of the TF. To avoid ambiguity, a gene was excluded in the analysis if the TFBS motif occurred more than once in the promoter region of the gene and the sequences of the occurrences were different. After the above steps, the refined dataset consisted of 71 TFs.
The variable positions in a consensus were determined according to the following criterion. For each position in a consensus, we calculated the frequency of each nucleotide (i.e. the number of target genes containing that nucleotide in the position). Though it is customary to use information content (IC) cutoff to decide whether a position is variable, in our work, for calculation purposes a position was defined as variable if at least two nucleotides were each found more than five times in the total number of occurrences. This is a limitation imposed by the KS test statistic in our method (see the following paragraph). The 71 TFs in our refined dataset contained 632 positions. As binding motifs of 47 TFs lacked variable positions, we omitted them from our analysis. The remaining 24 TFs (with 213 positions) contained 75 variable positions ().
Identifying functional variable positions in TFBSs
For TF α, we grouped its target genes according to their nucleotides at a variable position p in its consensus. The target genes with nucleotide b (A, C, G, or T) at position p formed group b and the remaining genes constituted group ¬b (b). We used this grouping strategy to determine whether the nucleotide b relates to a particular pattern in gene expression. If the nucleotide variant at a variable position contained A, T, C and G, we further assessed whether a combination of two nucleotides relates to a particular pattern in gene expression.
The degree of co-expression of any group of genes in a condition was quantified by calculating the distribution of the pairwise Pearson correlation coefficients for all genes in the group [(d)]. In a pair of genes, if any of the data for a condition was missing, we only used data that was present for both genes to compute the similarities under the constraint that the proportion of calculated observations in each condition was >65%.
To determine whether the degree of co-expression in one group was significantly higher than that in another group, we applied the one-sided Kolmogorov-Smirnov (KS) test, a non-parametric and distribution free statistical method. The hypotheses
H0:

and
H1:

were tested using the one-sided KS test, where
F denotes the distribution function of the co-expression levels of genes in a specific group. If
H0 is rejected,

, which means that the co-expression levels in group
b are ‘stochastically greater’ than the co-expression levels in group
¬b. In addition, we used the false discovery rate (FDR) (
21) to compute the
q-value in order to determine the balance between the numbers of true and false positives. If a position had at least one nucleotide with a
q-value ≤ 0.05, it was deemed a functional variable position.
To find the dependent relationships between two variable positions, we extended the individual position model to consider two variable positions simultaneously. For two variable positions, pi and pj, the target genes with nucleotide bi at pi and nucleotide bj at pj were collected to form the bibj group, and the remaining genes formed the ¬bi bj group [(c)]. Then, based on the two groups, we deduced whether positions pi and pj had an inter-dependent relationship that related to the difference in their gene expression.
Before measuring the difference in gene co-expression between two groups, we applied the
χ2-test to determine the dependency between two variable positions. Let
n(
b, i) be the number of target genes with nucleotide
b at position
pi, and let
N be the total number of target genes. The probability of the occurrence of nucleotide
b is
In addition, let
n(
bi, bj, pi, pj) be the number of target genes with nucleotides
bi at position
pi and nucleotides
bj at position
pj. Then, the
χ2 statistic is defined as
where
n(
bi,bj,pi,pj) is the observed value and

is the expected (theoretical) frequency asserted by the null hypothesis. We use this test to verify the hypothesis
H0:

. If
H0 is rejected, there is reason to believe that these two variable positions are interdependent. The criterion for a combination of two variable positions to be dependent is that the
p-value derived by the
χ2-test must be ≤ 0.05. Correction for multiple testing was not applied at this stage, because it could reduce the number of potential position pairs for further analysis and because we later applied the correction when measuring the difference in gene co-expression. The latter procedure would effectively exclude false positives that might have been included in the set of dependent variable positions detected by the
χ2-test.
Microarray data
From the 25 expression datasets of
S. cerevisiae, available in the Stanford Microarray Database (SMD,
http://genome-www5.stanford.edu/) (
22), we downloaded the datasets with at least seven time points; we empirically determined that a dataset should have at least seven time points for a reliable estimation of the
Pearson correlation coefficients. Ten expression datasets satisfied this criterion. Each dataset corresponded to a particular biological function. In the
glucose dataset (
23), we only used the experiments related to galactose limitation and transcriptional response. The functions of the other nine datasets are as follows:
glucT2 (
24) relates to the physical responses in glucose-limited cultures;
calcium (
25) is an experiment that adds Ca
2+ to yeast;
mec1 (
26) investigates the relationship between DNA damage responses and Mec1 in yeast;
fkh (
27) probes the role of Fkh1 and Fkh2 in the regulation of the cell cycle;
snf (
28) deals with Snf2 and Swi1 in both rich and minimal media;
alpha and
cdc15 (
29) are experiments in which
alpha is obtained from cells treated with alpha-factor transiently, and
cdc15 is collected from a cdc15-2 temperature sensitive mutant that resumes growth after release from heat shock;
sporulation (
30) is related to yeast meiosis and spore formation; and
diauxic (
31) investigates the gene expression accompanying the diauxic shift. In addition, we used the MA lowess (
32) and quantile (
33) normalization methods to reduce the systematic biases within each microarray, as well as the intensity-dependent effects and biases between microarray data.
Determining potential co-occurring TFs
For each TF α, we also investigated whether there was another TF β that was associated in the same target genes more often than random expectation, and could therefore be a potential co-occurring TF. This is estimated by calculating whether N12/N is greater than the random expectation (N1/N) × (N2/N), where N1 is the total number of target genes of TF α; N2 is the total number of target genes of TF β; N12 is the number of target genes of both TFs α and β; and N is the total number of genes in the S. cerevisiae genome. Under random association, the joint probability of N12/N should be equal to the product of the two marginal probabilities, (N1/N) × (N2/N), which correspond to the random variables of row and column factors in the contingency table. If N12/N is significantly greater than (N1/N) × (N2/N), then there is a positive association. The pairs of TFs whose binding sites overlapped in more than 60% of their target genes were not considered. Our criterion for detecting potential co-occurring TFs was nominal, because using a stringent criterion would reduce their number for further statistical analysis.
Association between variable positions and co-occurring TFs
For each variable position
p in the binding motif consensus of TF
α, we studied whether the nucleotide variations at position
p were significantly associated with the co-occurrence of TF
β. For this purpose, we constructed a contingency table (). Each of the target genes of TF
α is grouped according to the nucleotide at position
p (column) and with/without the co-occurring TF
β (row). Fisher’s exact test (
34) was used to examine the association between the row and the column variables. The null hypothesis is that the nucleotide frequencies at position
p are independent of the occurrence of TF
β. If the
p-value is significantly small, it means that, at position
p in the consensus binding motif of TF
α, there is a statistically different nucleotide preference when TF
β co-occurs, compared to that without the co-occurrence of TF
β. We also determined the FDR (
21) by computing the
q-value in order to correct for possible false positives from multiple tests. If the
q-value of a position is ≤0.05, we consider that it has a significant association with the co-occurring TF
β.
Conservation of variable position and position pairs
For each TF that had a predicted functional variable position (or position pairs), we collected the TFBSs in the promoters of its putative target genes and their ortholog genes in seven related species of
S. cerevisiae from MYBS (
20). We formed two groups: one for functional variable position/position-pair, which contained the ortholog genes that had TFBSs and our predicted nucleotide variant (or combination of two nucleotides) at these variable positions in the promoters of
S. cerevisiae and called them the functional group. The other group (non-functional group) corresponds to functional variable positions/position-pairs, but lacked our predicted nucleotides at these positions.
We then calculated the proportion of a nucleotide variant at the functional variable position/position-pair that is conserved in the functional group in the following manner. We computed the ratio of the number of target genes from the functional group with the total number of target genes with the TFBS in the promoter that contains the nucleotide variant. We calculated the proportion of variable positions/position-pairs for the non-functional group in a similar manner, using the target genes from the non-functional counterpart.
To examine whether the conserved proportion of the functional group was higher than that of non-functional group, the one-sided Wilcoxon Signed-Rank test was performed. The null hypothesis was that the proportion of nucleotide variants that are conserved in the non-functional group is greater than or equal to the proportion in the functional group.
In addition, we performed the one-sided two-sample proportion test (
35) to determine the precise significance of the proportion of a nucleotide variant in functional variable positions/position-pairs that are conserved. This was done with a cutoff of
Z-score at the 0.01 critical value from the standard normal distribution table. The null hypothesis was that the proportion of a nucleotide variant of functional variable positions/position-pairs in the functional group and the proportion of a nucleotide variant in the non-functional group are equal. Next, we calculated the standard error of differences between the two proportions. If zero lies within the one-sided confidence interval at
Z0.01, the null hypothesis cannot be rejected, implying there is no statistical difference in the two proportions.