Genome sequence and annotation

Sequence and feature files (.gff files) for the S288c genome were obtained from the

*Saccharomyces* Genome Database (

http://www.yeastgenome.org) on 7 March 2007. The sequence for YJM789 was obtained from

Wei *et al* (2007) and aligned to the S288c genome using the procedure described by

Wei *et al* (2007).

Array design

We designed a custom Affymetrix tiling array (product no. 520055) with a total of ~6.5 million probes (25-mers) including perfect match and mismatch probes. The probes tile both strands of the S288c genome at a resolution of 8 bp, with a shift between the strands of 4 bp (

David *et al*, 2006). The array also includes ~106 000 probes complementary to the YJM789 genome (

Wei *et al*, 2007) at positions of polymorphism between the strains. We also added 10 647 negative-control probes of randomly generated sequences with GC content ranging from 2 to 25 GCs.

Yeast strains and sample preparation

Laboratory and clinically derived

*S. cerevisiae* strains used in this work were isogenic to S288c and YJM789 and were designated as ‘S' and ‘Y', respectively. Three independent heterozygous hybrid strains (designated as ‘Y/S') were obtained by crossing Y and S strains. Reciprocal hemizygote strains for

*PHO84* alleles were constructed by crossing relevant Y and S background strains.

Supplementary Table VI lists all strains used in this study.

Total RNA was extracted from yeast cultures grown at 30°C in YPD medium (2% peptone, 1% yeast extract and 2% dextrose) and processed for array hybridizations as described earlier (

Perocchi *et al*, 2007). Importantly, to remove reverse transcription artifacts, first-strand cDNA was synthesized in the presence of 6.25 μg/ml actinomycin D. As cDNA is chemically same as DNA, we did not expect any systematic differences between cDNA and genomic DNA labeling.

For making mixture series, cDNA from S and Y strains was mixed in the following proportions, according to mass: 0:1, 1:3, 1:1, 3:1 and 1:0.

Probe filtering and classification

Using the following procedure, we classified each probe as common, S-specific, Y-specific or control. Ungapped alignments of the probes to the S288c genome and the aligned portion of the YJM789 genome were produced using the software exonerate (

Slater and Birney, 2005). We considered all perfect matches and near matches (up to two mismatches). A common probe has a unique perfect match to both parental genomes at the same alignment position and no near match. An S-specific probe has a unique perfect match and no further near matches to the S288c genome. It has no perfect match to the YJM789 genome and no near match to the YJM789 genome, except possibly at the same aligned position as its perfect match position in S288c. Y-specific probes were defined analogously. Specific probes whose match overlaps a polymorphism at ±4 bp of its central base were called ‘centered specific probes (CSP)'. Finally, we ensured that each negative control probe had neither a perfect nor a near match in either genome.

Normalization and background subtraction

Calibration of intensities between arrays was done using a variant of quantile normalization (

Bolstad *et al*, 2003), as follows. The sets of cDNA and genomic DNA (gDNA) hybridizations were treated separately. As specific probes are expected to have different behavior depending on the strain, we restricted the quantile normalization to the set of common probes and used linear interpolation to normalize the intensities of the specific probes.

The background of cDNA hybridizations was subtracted as described earlier (

Huber *et al*, 2006). Briefly, probes were binned into 10 groups according to their intensity level in the gDNA hybridizations. For each probe group and for each cDNA hybridization, probes falling outside annotated transcribed regions were used to estimate a background level. This level was then subtracted from the intensities of all probes within the group. To subtract the background of DNA hybridizations, we grouped probes by GC content. For each group and hybridization, we estimated the background level as the 10% trimmed mean of the negative control probes and subtracted it from all probes of the group.

New transcript identification and transcript probe sets

We ran a segmentation algorithm combining heterozygote cDNA hybridizations with parental cDNA hybridizations using the R package ‘tilingArray' (

Huber *et al*, 2006). Segmentation was carried out on the set of common probes, for which the assumption of a constant level across the transcript can be made. For each chromosome, the segmentation parameter

*S* (number of segments) was set so that the average segment size was 1500 bp. Segments corresponding to unannotated transcripts were then categorized as unannotated intergenic or antisense as described earlier (

David *et al*, 2006) (‘intergenic' were termed ‘isolated' in the earlier study). Segments with less than 20 probes were discarded. A subsequent manual inspection discarded six dubious antisense segments and recovered 10.

We subsequently inferred the expression of a transcript from the intensities of its probe set. We defined the probe set of a new transcript as the probes for which the match entirely falls within the boundaries of the segment. We defined the probe set of an annotated transcript as the probes whose match entirely falls within the boundaries of an annotated S288c exon.

Probe intensity model

We modeled *y*_{ij}, the normalized and background-subtracted intensity of probe *i* in hybridization *j*, as

where λ

_{1i} and λ

_{2i} are the affinities of the probe to its matches in each genome,

*c*_{1ij} and

*c*_{2ij} are the expression levels of the respective complementary sequences in the sample

*j* and

_{ij} are the errors. The affinities and the expression levels are non-negative real numbers expressed in arbitrary units. For common probes, we have λ

_{1i}=λ

_{2i}.

We considered five possible types of hybridization samples: genomic DNA (gDNA) of the two homozygous strains S and Y, their cDNA, and cDNA of the heterozygous Y/S. We set *c*_{kij}=2 if sample *j* is homozygous genomic DNA of genome *k*. Moreover, we fixed *c*_{kij}=0 if sample *j* is genomic DNA or cDNA of homozygous strain different from *k*.

Following

Rocke and Durbin (2001), we modeled the variance of the errors

_{ij} as functions of the expected intensity

*I*_{ij}=λ

_{1i}*c*_{1ij}+λ

_{2i}*c*_{2ij}:

The coefficients

*a*_{j},

*b*_{j} and γ were inferred using the R package vsn (

Huber *et al*, 2002) by treating the cDNA and the gDNA hybridization as two separate groups. We assumed the scaled errors

to be independent and identically distributed and of mean 0.

Least-squares regression

For the cDNA samples of each strain, we assumed a constant level of each allele across one transcript's probe set. The regression proceeds with each transcript separately using probes only of the transcript probe set.

We denoted

*p*_{1} and

*p*_{2} the nominal expression levels of the alleles in the homozygous strains,

*h*_{1} and

*h*_{2} the levels of each allele in the Y/S strain. From

equation (1), We obtained a set of equations for all hybridizations

*j* and probes

*i* that depend on the hybridization sample types:

We fitted the model by weighted least squares. More precisely, we searched for a set of affinities and expression levels that minimizes the sum of squared scaled residuals:

subject to

**λ****0**,

**p****0**,

**h****0** and λ

_{1i}=λ

_{2i} for common probes, where the weights

*w*_{ij} = 1/ var (

_{ij} were estimated by using

equation (2).

We took advantage of the form of the model for the optimization procedure. Indeed, assuming fixed weights, the cost function is a sum of squared terms bilinear in

**λ** and (

**p, h**). For a given expression-level vector (

**p, h**), there is a closed-form solution to the unique optimal affinity vector

**λ** and vice versa. We thus devised a component-wise optimization algorithm that iteratively optimizes expression levels given affinities and reciprocally, updating the weights at each step using

equation (2). We considered that the algorithm had converged, if all fitted expression levels of the last 2 iterations differ by less than a value corresponding to 10% of the background level, and stopped the algorithm if convergence did not occur before the 30th iteration.

Confidence intervals

We estimated confidence intervals per ORF probe set by resampling the scaled residuals with replacement. The regression results in fitted parameters and thus, according to the model, in an estimated intensity

*Î*_{ij}, an estimated weight

*ŵ*_{ij} and a scaled residual

′

_{ij} for each observed intensity:

We generated new synthetic data as noisy measurements of the fitted intensities:

where the function σ is a random sampling with replacement of the index pairs

*ij*. We repeated this

*B*=999 times and obtained

*B* estimates of the parameters. For all statistics of interest (expression level, allelic differential expression, etc.), 95% equi-tailed confidence intervals were estimated according to the non-parametric basic confidence limit as described in

Davison and Hinkley (1997).

*P*-values and false discovery rates

We estimated significance levels (

*P*-values) by simulating data under the null hypothesis for the two following hypotheses:

*H*_{1}: Levels in parent equal: *p*_{1}=*p*_{2}*H*_{2}: Levels in hybrid equal: *h*_{1}=*h*_{2}

We fitted an appropriately constrained model for each probe set and for each hypothesis (*p*_{1}=*p*_{2} and *h*_{1}=*h*_{2}). Similar to the procedure for estimating confidence intervals, we generated *B*=999 new synthetic data as noisy measurements of those fitted intensities. Here again we sampled scaled residuals of the primary unconstrained fit, because they reflect the true noise better than those of the constrained fits. On each simulated dataset, we performed an unconstrained regression. For each hypothesis respectively, we considered the *T*-statistic.

The *P*-value is then approximated by

where

*t* is the statistic value for the primary, unconstrained fit and

*t*_{i}^{*},

*I*=1, …,

*B* are the bootstrap statistic values (

Davison and Hinkley, 1997).

Treating each hypothesis

*H*_{1} and

*H*_{2} separately,

*q*-values, i.e. false discovery rates (FDR), were obtained using the R package qvalue (

Storey and Tibshirani, 2003) with default parameters.

Sequence validation of differentially expressed transcripts

Quantitative estimates of allelic expression ratios by sequencing were obtained using the method described by

Ge *et al* (2005). Primers (

Supplementary Table VII) were synthesized such that they spanned multiple SNPs between the two alleles of a transcript. From two independent Y/S strains, XHS768 and XHS769, cDNA was synthesized using random hexamers and PCR was carried out on the resulting cDNA for sequence analysis. PCR products using the same primers on genomic DNA of a Y/S strain, XHS768, was used to provide reference traces in situation of 1:1 allelic concentrations. The resulting sequence traces were analyzed with the software PeakPicker (

Ge *et al*, 2005), which estimates allelic expression ratios from relative peak heights at SNP positions. We calculated the allelic ratios of transcripts as the median over all SNPs and traces (

Supplementary Table VII). Out of the 24 transcripts tested, one (

*HOP1*) did not confirm polymorphic positions in the genomic DNA. Two others (

*ICL2* and

*YDL237W*) were rejected from further analysis for having ratio estimates derived from less than two SNPs.

ADE coefficient

We defined the ADE coefficient as (

*h*_{Y} −

*h*_{S})/(

*h*_{Y} +

*h*_{S}), where

*h*_{Y} and

*h*_{S} are the expression levels of the Y allele and S allele, respectively in the heterozygote.

Proportion of *cis*- and *trans*-regulatory effects

The ratio of

*cis*-regulatory divergence to the total regulatory divergence (

Wittkopp *et al*, 2008) is computed as

*C*/(

*C* +

*T* ) where

*C*, the

*cis*-regulatory effect, is the log ratio of the allelic expression levels in the hybrid and

*T*, the

*trans*-regulatory effect, is the difference between the log ratio of the parental gene expression levels and

*C*.

Analysis of *PHO84* reciprocal hemizygote strains hybridizations

The hybridizations of the two *PHO84* reciprocal hemizygote strains were analyzed using the same model as described above. Total transcript expression levels (i.e., *h*_{Y}+*h*_{S}, the sum of the two allele levels for each transcript) were considered for comparison.