Genome sequence and annotation
Sequence and feature files (.gff files) for the S288c genome were obtained from the Saccharomyces
Genome Database (http://www.yeastgenome.org
) on 7 March 2007. The sequence for YJM789 was obtained from Wei et al (2007)
and aligned to the S288c genome using the procedure described by Wei et al (2007)
We designed a custom Affymetrix tiling array (product no. 520055) with a total of ~6.5 million probes (25-mers) including perfect match and mismatch probes. The probes tile both strands of the S288c genome at a resolution of 8 bp, with a shift between the strands of 4 bp (David et al, 2006
). The array also includes ~106 000 probes complementary to the YJM789 genome (Wei et al, 2007
) at positions of polymorphism between the strains. We also added 10 647 negative-control probes of randomly generated sequences with GC content ranging from 2 to 25 GCs.
Yeast strains and sample preparation
Laboratory and clinically derived S. cerevisiae
strains used in this work were isogenic to S288c and YJM789 and were designated as ‘S' and ‘Y', respectively. Three independent heterozygous hybrid strains (designated as ‘Y/S') were obtained by crossing Y and S strains. Reciprocal hemizygote strains for PHO84
alleles were constructed by crossing relevant Y and S background strains. Supplementary Table VI
lists all strains used in this study.
Total RNA was extracted from yeast cultures grown at 30°C in YPD medium (2% peptone, 1% yeast extract and 2% dextrose) and processed for array hybridizations as described earlier (Perocchi et al, 2007
). Importantly, to remove reverse transcription artifacts, first-strand cDNA was synthesized in the presence of 6.25 μg/ml actinomycin D. As cDNA is chemically same as DNA, we did not expect any systematic differences between cDNA and genomic DNA labeling.
For making mixture series, cDNA from S and Y strains was mixed in the following proportions, according to mass: 0:1, 1:3, 1:1, 3:1 and 1:0.
Probe filtering and classification
Using the following procedure, we classified each probe as common, S-specific, Y-specific or control. Ungapped alignments of the probes to the S288c genome and the aligned portion of the YJM789 genome were produced using the software exonerate (Slater and Birney, 2005
). We considered all perfect matches and near matches (up to two mismatches). A common probe has a unique perfect match to both parental genomes at the same alignment position and no near match. An S-specific probe has a unique perfect match and no further near matches to the S288c genome. It has no perfect match to the YJM789 genome and no near match to the YJM789 genome, except possibly at the same aligned position as its perfect match position in S288c. Y-specific probes were defined analogously. Specific probes whose match overlaps a polymorphism at ±4 bp of its central base were called ‘centered specific probes (CSP)'. Finally, we ensured that each negative control probe had neither a perfect nor a near match in either genome.
Normalization and background subtraction
Calibration of intensities between arrays was done using a variant of quantile normalization (Bolstad et al, 2003
), as follows. The sets of cDNA and genomic DNA (gDNA) hybridizations were treated separately. As specific probes are expected to have different behavior depending on the strain, we restricted the quantile normalization to the set of common probes and used linear interpolation to normalize the intensities of the specific probes.
The background of cDNA hybridizations was subtracted as described earlier (Huber et al, 2006
). Briefly, probes were binned into 10 groups according to their intensity level in the gDNA hybridizations. For each probe group and for each cDNA hybridization, probes falling outside annotated transcribed regions were used to estimate a background level. This level was then subtracted from the intensities of all probes within the group. To subtract the background of DNA hybridizations, we grouped probes by GC content. For each group and hybridization, we estimated the background level as the 10% trimmed mean of the negative control probes and subtracted it from all probes of the group.
New transcript identification and transcript probe sets
We ran a segmentation algorithm combining heterozygote cDNA hybridizations with parental cDNA hybridizations using the R package ‘tilingArray' (Huber et al, 2006
). Segmentation was carried out on the set of common probes, for which the assumption of a constant level across the transcript can be made. For each chromosome, the segmentation parameter S
(number of segments) was set so that the average segment size was 1500 bp. Segments corresponding to unannotated transcripts were then categorized as unannotated intergenic or antisense as described earlier (David et al, 2006
) (‘intergenic' were termed ‘isolated' in the earlier study). Segments with less than 20 probes were discarded. A subsequent manual inspection discarded six dubious antisense segments and recovered 10.
We subsequently inferred the expression of a transcript from the intensities of its probe set. We defined the probe set of a new transcript as the probes for which the match entirely falls within the boundaries of the segment. We defined the probe set of an annotated transcript as the probes whose match entirely falls within the boundaries of an annotated S288c exon.
Probe intensity model
We modeled yij, the normalized and background-subtracted intensity of probe i in hybridization j, as
are the affinities of the probe to its matches in each genome, c1ij
are the expression levels of the respective complementary sequences in the sample j
are the errors. The affinities and the expression levels are non-negative real numbers expressed in arbitrary units. For common probes, we have λ1i
We considered five possible types of hybridization samples: genomic DNA (gDNA) of the two homozygous strains S and Y, their cDNA, and cDNA of the heterozygous Y/S. We set ckij=2 if sample j is homozygous genomic DNA of genome k. Moreover, we fixed ckij=0 if sample j is genomic DNA or cDNA of homozygous strain different from k.
Following Rocke and Durbin (2001)
, we modeled the variance of the errors ij
as functions of the expected intensity Iij
The coefficients aj
and γ were inferred using the R package vsn (Huber et al, 2002
) by treating the cDNA and the gDNA hybridization as two separate groups. We assumed the scaled errors
to be independent and identically distributed and of mean 0.
For the cDNA samples of each strain, we assumed a constant level of each allele across one transcript's probe set. The regression proceeds with each transcript separately using probes only of the transcript probe set.
We denoted p1
the nominal expression levels of the alleles in the homozygous strains, h1
the levels of each allele in the Y/S strain. From equation (1)
, We obtained a set of equations for all hybridizations j
and probes i
that depend on the hybridization sample types:
We fitted the model by weighted least squares. More precisely, we searched for a set of affinities and expression levels that minimizes the sum of squared scaled residuals:
subject to λ0
for common probes, where the weights wij
= 1/ var (ij
were estimated by using equation (2)
We took advantage of the form of the model for the optimization procedure. Indeed, assuming fixed weights, the cost function is a sum of squared terms bilinear in λ
and (p, h
). For a given expression-level vector (p, h
), there is a closed-form solution to the unique optimal affinity vector λ
and vice versa. We thus devised a component-wise optimization algorithm that iteratively optimizes expression levels given affinities and reciprocally, updating the weights at each step using equation (2)
. We considered that the algorithm had converged, if all fitted expression levels of the last 2 iterations differ by less than a value corresponding to 10% of the background level, and stopped the algorithm if convergence did not occur before the 30th iteration.
We estimated confidence intervals per ORF probe set by resampling the scaled residuals with replacement. The regression results in fitted parameters and thus, according to the model, in an estimated intensity Îij
, an estimated weight ŵij
and a scaled residual
for each observed intensity:
We generated new synthetic data as noisy measurements of the fitted intensities:
where the function σ is a random sampling with replacement of the index pairs ij
. We repeated this B
=999 times and obtained B
estimates of the parameters. For all statistics of interest (expression level, allelic differential expression, etc.), 95% equi-tailed confidence intervals were estimated according to the non-parametric basic confidence limit as described in Davison and Hinkley (1997)
P-values and false discovery rates
We estimated significance levels (P
-values) by simulating data under the null hypothesis for the two following hypotheses:
- H1: Levels in parent equal: p1=p2
- H2: Levels in hybrid equal: h1=h2
We fitted an appropriately constrained model for each probe set and for each hypothesis (p1=p2 and h1=h2). Similar to the procedure for estimating confidence intervals, we generated B=999 new synthetic data as noisy measurements of those fitted intensities. Here again we sampled scaled residuals of the primary unconstrained fit, because they reflect the true noise better than those of the constrained fits. On each simulated dataset, we performed an unconstrained regression. For each hypothesis respectively, we considered the T-statistic.
The P-value is then approximated by
is the statistic value for the primary, unconstrained fit and ti*
=1, …, B
are the bootstrap statistic values (Davison and Hinkley, 1997
Treating each hypothesis H1
-values, i.e. false discovery rates (FDR), were obtained using the R package qvalue (Storey and Tibshirani, 2003
) with default parameters.
Sequence validation of differentially expressed transcripts
Quantitative estimates of allelic expression ratios by sequencing were obtained using the method described by Ge et al (2005)
. Primers (Supplementary Table VII
) were synthesized such that they spanned multiple SNPs between the two alleles of a transcript. From two independent Y/S strains, XHS768 and XHS769, cDNA was synthesized using random hexamers and PCR was carried out on the resulting cDNA for sequence analysis. PCR products using the same primers on genomic DNA of a Y/S strain, XHS768, was used to provide reference traces in situation of 1:1 allelic concentrations. The resulting sequence traces were analyzed with the software PeakPicker (Ge et al, 2005
), which estimates allelic expression ratios from relative peak heights at SNP positions. We calculated the allelic ratios of transcripts as the median over all SNPs and traces (Supplementary Table VII
). Out of the 24 transcripts tested, one (HOP1
) did not confirm polymorphic positions in the genomic DNA. Two others (ICL2
) were rejected from further analysis for having ratio estimates derived from less than two SNPs.
We defined the ADE coefficient as ( hY
), where hY
are the expression levels of the Y allele and S allele, respectively in the heterozygote.
Proportion of cis- and trans-regulatory effects
The ratio of cis
-regulatory divergence to the total regulatory divergence (Wittkopp et al, 2008
) is computed as C
) where C
, the cis
-regulatory effect, is the log ratio of the allelic expression levels in the hybrid and T
, the trans
-regulatory effect, is the difference between the log ratio of the parental gene expression levels and C
Analysis of PHO84 reciprocal hemizygote strains hybridizations
The hybridizations of the two PHO84 reciprocal hemizygote strains were analyzed using the same model as described above. Total transcript expression levels (i.e., hY+hS, the sum of the two allele levels for each transcript) were considered for comparison.