The white and opaque transcriptomes
To characterize the transcriptomes of white and opaque cells, we sequenced the poly(A) fraction of RNA extracted from replicate white and opaque cell cultures (Materials and Methods
and ), expecting to find messenger RNAs, polyadenylated non-coding RNAs, and abundant non-polyadenylated transcripts that persist through the purification steps. Importantly, the sequencing libraries were prepared using an approach that preserves the genomic strand from which the sequenced RNA fragments were originally transcribed (see Materials and Methods
and Figure S1
. Our sequencing runs yielded 29–136 million 50-base sequence reads per sample, which were subsequently aligned to a filter database (containing, e.g., rDNA sequences) and then to the Candida albicans
genome (build Ca21) and a database of previously annotated splice junctions (Materials and Methods
and Figure S2
). An overview of the results is depicted in . The majority of reads from each sample (60–68%) was successfully aligned, allowing detection of 93–95% of previously annotated exons with mean 50–200x sequence coverage (i.e., the number of reads aligned across a genomic position). 37–47% of positions were covered by an alignment in the strand-specific genome, and 423–904 deletions, which represent both splice junctions and deletion polymorphisms relative to the haploid reference genome, were detected (Mitrovich et al. 
, in preparation). On the whole, we have obtained more than sufficient sequence depth from these samples to build the first transcript annotation for C. albicans
Candida albicans transcript annotation
Our RNA-Seq dataset allows us the first opportunity to define a true transcript annotation for C. albicans
, which until now has had a gene annotation based primarily on computationally-predicted open reading frame (ORF) sequence boundaries and generally not informed by experimental data. We first developed a general computational approach () that can define a new transcript annotation by combining an existing annotation (in this case the ORF-based annotation) with evidence found in RNA sequence data for un-translated regions (UTRs) and entirely novel transcripts. This effort included the development of new methods for the de novo
identification of splice junctions and transcriptionally active regions (TARs), which are based on gapped read alignments and clusters of sequence coverage, respectively (Materials and Methods
, Figure S3
, and Mitrovich et al. 
, in preparation). We applied these methods to a single dataset produced by combining the reads from all four RNA sequence libraries, reasoning that (1) combining the datasets at this stage would be more powerful and straightforward than combining four separate annotations further downstream, and (2) the different datasets were sufficiently similar to one another. This is supported by the high reproducibility of biological replicates (r
0.95−0.99; Figure S5
) and the observation that most exons, when expressed in both cell types, appear to extend to roughly the same boundaries.
Defining a new transcript annotation for C. albicans.
Rather than providing a completely de novo
gene annotation (as for S. cerevisiae
in Yassour et al. 
, for example), we sought to leverage the existing ORF-based annotation to provide an updated annotation in which existing transcripts, if expressed, were augmented with 5′ and 3′ UTRs, and new, isolated clusters of expression (i.e., those not overlapping an annotated exon on the same strand) were added to the annotation as novel TARs (nTARs). Thus, we devised a method to merge the splice junction and TAR-finding output with the existing ORF-based annotation (Materials and Methods
and Figure S4
) and applied it to our datasets, resulting in the new C. albicans
transcript annotation (Tables S1
; summarized in ).
The new transcript annotation contains 23% more transcripts (N
7,823) covering 13% more of the genome (76.1% versus 63.6%) than the old annotation. We estimate that roughly 1,048 of these transcripts are non-coding because they do not contain a canonical ORF that is at least 120 nucleotides long (i.e., encoding a peptide at least 40 amino acids long), which increases the number of non-coding RNAs (ncRNAs) annotated in C. albicans
by nearly 500%. However, there are also a large number of new coding transcripts (i.e., transcripts that contain putative ORFs encoding peptides 40 or more amino acids long), leading to an estimated 9% increase in the number of coding transcripts. Many of these ORFs may have been missed in previous annotations due to their short length (91% are shorter than 100 amino acids) and, in some cases, due to lack of conservation in other species. It is likely that some of the ORFs defined here by our arbitrary length cutoff are not translated into protein. On the whole though, the number of putative ORFs at least 40 amino acids long found in novel transcripts (N
561) is significantly higher than expected by chance (median N
453; P-value <0.0001 by simulation; Materials and Methods
), suggesting that many are translated into protein. As detailed in the next section, at least 18 of these short, novel ORFs are likely to serve an important function in opaque cells.
In the new transcript annotation 5′ and 3′ UTRs of median length 99 and 136 bases were defined for 5,465 and 5,768 transcripts, respectively. These estimates are longer than estimates of 5′ and 3′ UTR length based on tiling arrays (68 and 91 in David et al. 
), but closely resemble those based on RNA-Seq data (111 and 142 in Yassour et al. 
) for the related model yeast, Saccharomyces cerevisiae
. Finally, 50% of transcripts in the new annotation overlapped transcripts from the opposite strand by at least 1 bp and 31% did so across more than 10% of their length, indicating that, as in other eukaryotes 
, there is widespread antisense transcription in C. albicans
. This observation underscores the importance of sequencing RNA in a strand-specific manner. Overall, the new transcript annotation described here represents a dramatic revision from previous annotations that microarrays were designed to assess. Using this new annotation we revisited the differences in gene expression between white and opaque cells.
Transcripts differentially expressed between white and opaque cell types
We determined which of the 7,823 newly defined transcripts were differentially expressed between white and opaque cell types by employing a likelihood ratio test 
. We required a 2-fold or greater change in expression and false discovery rate (FDR) of 10−4
or less, which resulted in a set of 1,306 differentially-expressed transcripts (Table S3
). As expected, we find strong (50-fold) up-regulated expression of WOR1
, the gene that encodes a master regulator of white-opaque switching (). As predicted by a previous study 
has an unusually long 5′ UTR (1,978 bp, compared to the genome-wide median length of 99 bp). Unexpectedly, the lower WOR1
expression in white cells is associated with increased expression on the strand opposite this long UTR, suggesting an alternative internal antisense promoter is active and may be repressing WOR1
expression in white cells.
Transcripts differentially expressed between white and opaque cell types.
To confirm the quality of these data we compared them directly to data generated using microarrays that are commonly used to study gene expression in C. albicans
. We hybridized the same samples used for RNA sequencing (Materials and Methods
) and examined the fold-change measurements produced by each technology for all previously annotated transcripts (). We found a strong overall correlation (r
0.79), which, as noted in other comparisons of RNA-Seq and microarray data, is stronger for high abundance transcripts (r
0.89) than it is for low abundance transcripts (r
0.71), which are generally more accurately measured by RNA-Seq 
The 1,306 differentially expressed transcripts found here represent a 3-fold increase in the number observed by microarray 
, which is partly attributable to the fact that 37% of these transcripts are novel (N
488) and thus were not probed on previous microarrays. Novel transcripts are unexpectedly frequent amongst the set of white-opaque differentially-expressed transcripts (N
488 versus 218 expected; χ2
), a provocative observation we can not yet entirely explain, but which suggests an important role for non-coding transcripts and short proteins in the white-opaque circuit. In any case, this observation emphasizes the importance of “hypothesis-free” approaches to measuring gene expression. The remaining differentially-expressed transcripts, not recognized as such by microarray (N
376), may be explained by the documented, improved sensitivity and dynamic range of RNA-Seq 
; indeed, these transcripts not discovered by microarray have 2-fold lower average abundance than those that were, as estimated by RPKM (reads per kb of transcript per million uniquely aligned reads).
We were especially interested in the 488 novel differentially expressed transcripts, which fall into three major classes: (1) antisense transcripts, (2) isolated transcripts that encode proteins, and (3) isolated non-coding transcripts. We discuss these three classes in turn. We found 213 novel transcripts that overlap another transcript on the opposite strand across at least one third of their length. NTAR_364
is a particularly informative example of a differentially expressed novel transcript that overlaps another transcript on the opposite strand (). The gene opposite NTAR_364
, which encodes the β subunit of the heterotrimeric G protein complex required for mating 
. Mating is a process specific to opaque cells 
, and accordingly, NTAR_364
's 14-fold down-regulation is inverse to STE4
's 8-fold up-regulation in opaque cells. The anti-correlated expression of these two overlapping transcripts strongly suggests a mechanism in which NTAR_364's expression acts to repress expression of STE4. There is ample precedent for this type of regulation in eukaryotes and bacteria 
. To determine the prevalence of such mechanisms in C. albicans
, we examined the expression profiles of all 759 such sense-antisense transcript pairs, filtering down to the subset of 44 pairs in which both transcripts are significantly changed and at least one transcript is coding (). Our expectation was that we would observe strong anti-correlated differential expression across all such pairs if these mechanisms are prevalent and a lack of correlation if they are not. Instead, we found a modest and significant anti-correlation (r
0.05; ). Sense–antisense pairs in which one member is differentially-expressed are 2-fold more likely, than expected by chance, to have the second member differentially-expressed in the opposite direction (17% versus 8%; χ2
). These results suggest that some, but not all, anti-sense transcripts act to repress the steady-state abundance of their sense counterpart. Despite the lack of perfect anti-correlation, there are several transcript pairs that, like the STE4
pair mentioned, are considerably differentially-expressed in opposite directions (), which strongly suggests a regulatory function for the novel antisense transcripts involved.
A selection of sense-antisense gene pairs with the most strongly anti-correlated expression.
The second major class of novel, differentially-expressed transcripts contains those that are isolated in the genome and code for protein. In total, we identified 224 novel differentially expressed transcripts that do not overlap a transcript on the opposite strand. Sixty-nine of these transcripts encode a putative protein at least 40 amino acids long. Amongst these is a group that clusters into three genomic locations and encodes a large family of novel, short ORFs (, Figure S6A and S6B
). Eighteen of the 24 ORFs in this family are encoded by transcripts that are opaque-specific, including NTAR_1179.2
, which with 287-fold higher abundance in opaque cells is the third most differentially-expressed transcript genome-wide. Using a combination of BLAST and PSI-BLAST against fungal genomes and eukaryotic protein sequence databases, we identified 46 members of this family (see sequence alignments in and Figure S6C
), 24 from C. albicans
and 22 from its closest known relative, Candida dubliniensis
. Homologs could not be identified in any other species, further underscoring the potential importance of these genes to opaque-cell differentiation, since these two yeast species are the only two known to switch between distinct white and opaque forms 
. The neighbor-joining phylogeny inferred for these ORFs ( and Figure S6D
) indicates that most were present and similarly clustered in the common ancestor of C. albicans
and C. dubliniensis
. Computational predictions of secondary structure 
indicated there are likely three β sheets followed by two α helices in these proteins () and the structure prediction server I-TASSER 
found a putative bacterial hemolysin (PDB ID: 3HP7) to be the closest structural analog.
Three clusters of novel Candida-specific ORFs are strongly up-regulated in opaque cells.
Finally, 155 of the isolated, differentially-expressed transcripts do not appear to code for protein. At this time it is difficult to assess their functions in a purely computational manner; thus, their roles in the white-opaque switch await experimental characterization.
In all three classes of novel transcripts we observe examples in which the master regulator Wor1 is bound adjacent to or overlapping the differentially expressed transcripts ( and ), suggesting that these novel antisense and isolated transcripts are directly regulated by Wor1 binding. Thus, they may form a key, but heretofore unknown, part of the circuit.
The new transcript annotation illuminates the Wor1 circuit
To assess the concordance between Wor1 binding and differential expression of nearby transcripts more globally we compared the previous ORF-based and our new RNA-Seq-based gene annotations to regions identified as Wor1-bound in chromatin immunoprecipitation-on-tiling microarray (ChIP-Chip) experiments 
. We first associated Wor1-bound regions with adjacent genes using both the new and the old annotations (Figure S7
), and then evaluated both the frequency with which Wor1 binding flanked at least one differentially expressed gene and the frequency with which Wor1-bound genes were differentially expressed (). We also compared measurements of differential expression from three different platforms: (a) hybridization to spotted PCR-product microarrays (reported previously by Tsong et al. 
), (b) hybridization to custom-designed Agilent 8x15k microarrays (reported here), and (c) strand-specific RNA-Seq (also reported here). The pairing of the new transcript annotation with the RNA-Seq measurements of differential expression (, first row) clearly yields the strongest concordance between Wor1 binding and differential expression: 65% of Wor1-bound regions are associated with at least one differentially expressed transcript. This represents a greater than 2-fold improvement in concordance over a previously published association 
, in which only 30% of bound regions were observed to flank at least one differentially expressed transcript (, last row). In this previous association, differential expression of transcripts was measured by spotted PCR-product arrays designed to assay only transcripts in the old annotation. The concordance between binding and differential expression improves incrementally with the use of better microarray platforms (38–40%; , rows 5–6) and with RNA-Seq-based expression measurements computed using the old transcript annotation (48–51%; , rows 3–4). However, by far the best concordance is found when RNA-Seq-based expression measurements are computed using the new transcript annotation. Thus, the dramatically improved association of master regulator binding and cell type-specific expression observed here is attributable to both the novel transcripts and the improved expression measurements provided by RNA-Seq.
Association of Wor1 binding with white-versus-opaque differential expression when different transcript annotations and measurement platforms are employed.
Unusual properties of the Wor1 circuit
The fact that the WOR1
gene has a 2 kb long 5′ UTR and about 6 kb of Wor1-bound intergenic DNA upstream of it () suggests that this master regulator of white-opaque switching is under complex regulation. We next examined whether other transcripts in the circuit have similar properties. It was previously noted that Wor1-bound intergenic regions are, on average, 5-fold longer than typical intergenic regions (median 3,390 bp for Wor1-bound genes versus 623 bp genome-wide) 
. However, given the substantial changes we have made to the gene annotation, it was unclear whether this length bias would remain; in particular, it seemed plausible that some of the unusually long “intergenic” regions may actually contain, and thus be due to, previously unannotated long UTRs. We find that while genome-wide intergenic length is, on average, more than 2-fold shorter in the new annotation (new median length
262 bp), the intergenic regions bound by Wor1 are still, on average, 5-fold longer than expected by chance (new median length
1346 bp; Mann-Whitney P-value
; ). Unexpectedly, we also found that 5′ UTRs of Wor1-bound genes are 58% longer than expected (median 157 bp in the circuit versus 99 bp genome-wide; Mann-Whitney P-value
; ) and 3′ UTRs in the circuit are 22% longer than expected (median 166 bp in the circuit versus 136 bp genome-wide; Mann-Whitney P-value
Properties of transcripts in the Wor1 circuit.
The unusually long UTRs found in the Wor1 circuit and the apparent change in UTR length at WOR1
() motivated us to look more generally into changes in promoter usage and transcriptional termination between cell types, as reflected in changes in 5′ and 3′ UTR length, respectively. We devised a simple method to isolate putative cases of UTR length change, reasoning that a change in UTR length for a given transcript could be detected as a change in the apparent
expression of the UTR that is significantly less than or greater than what was measured for the transcript's coding region. We required a minimum 2-fold difference in fold-change between UTR and coding region and a χ2
P-value less than 10−5
(Materials and Methods
). Using these criteria, we identified 145 transcripts with at least one UTR apparently changing length between white and opaque cells (Table S4
). Visual inspection revealed that not all these cases are straightforward to interpret; however, many are, and these provide several examples for further study (). Most of the cases identified here are changes in 5′ UTRs (N
111; 77%), which likely reflects an emphasis on the usage of alternative promoters as a means of differentiating the two cell types. One of the transcripts, EFG1
, is a regulator of white-opaque switching and was previously shown to exhibit different 5′ UTR lengths in white and opaque cells 
and 26 other transcripts with significant 5′ UTR changes are also associated with Wor1 binding nearby their genomic loci (observed frequency
). For several of these transcripts, such as ORF19.2049
() and EFG1
(), the UTR is shorter in opaque cells and Wor1 is bound in opaque cells between the apparent white- and opaque-preferred transcription start sites, suggesting a direct regulatory mechanism. Other examples, such as PPS1
(not shown) and ORF19.7060
(), are probably not directly related to Wor1 binding, but may instead involve mechanisms related to the transcription of antisense genes.
Comparing Wor1 binding to gene expression revealed an additional feature of Wor1-controlled transcripts: direct binding of Wor1 within a transcribed region (rather than upstream of it) is associated with strong down-regulation of the bound transcript in opaque cells. The non-coding transcript NTAR_913 provides a clear example of this phenomenon (). Genome-wide, we found 89 cases in which a transcript overlaps a Wor1-bound region by more than 50%, and the expression of such transcripts is frequently white-specific (). This observation suggests the prominence of an underappreciated mode of gene regulation in which a transcription regulator may repress transcription via direct binding to the transcribed region. Given the unusual characteristics of the WOR1 locus and Wor1's target genes, we next examined whether other examples of heritable cell differentiation circuits exhibited similar features.
Unusual properties of the Oct4 circuit governing mammalian differentiation
One of the most studied transcription circuits is that of Oct4, which governs the differentiation and pluripotency of mammalian embryonic stem (ES) cells 
. Oct4 is a master regulator of mammalian cell types in the same sense that Wor1 is a master regulator of Candida
cell types: Oct4 expression is required to maintain the pluripotent ES cell type 
, and Oct4's over-expression in other cell types, along with additional factors, returns them to the ES cell state 
. Although much is known about this circuit, we could not find any previous reports on the general properties of the circuit (e.g., relative UTR length of Oct4-bound genes). To determine if the unusual properties of the Wor1 circuit in Candida
are shared with the Oct4 circuit, we performed a meta-analysis of publicly-available data, including ChIP-Seq-based Oct4 binding data 
and microarray-based profiles of gene expression during stem cell differentiation 
(Materials and Methods
). We discovered that the Oct4 circuit of mice does indeed share “unusual” characteristics with the Wor1 circuit of Candida
. Intergenic regions bound by Oct4 are 33% longer than expected by chance (median 23 kb in the circuit versus 17 kb genome-wide; Mann-Whitney P-value
) and are 2-fold longer than expected if they also flank a transcript that is differentially expressed during differentiation (median 34 kb in the differentially-expressed circuit; Mann-Whitney P-value
; ). 5′ UTRs and 3′ UTRs are also longer than expected (161 and 1048 bp in the circuit versus 137 and 727 bp genome-wide; Mann-Whitney P-values
, respectively; ), but the relative magnitude of length bias for 5′ versus 3′ UTRs (+18% and +44%, respectively) is flipped relative to that observed in the Wor1 circuit (+58% and +22%, respectively). Unfortunately, the appropriate data are not yet available to determine whether UTR lengths are frequently changing between cell types in the Oct4 circuit of mice as they are in the Wor1 circuit of Candida
Properties of transcripts in the Oct4 circuit.