Here we describe a novel RNA-Seq method to map transcribed regions of sequenced genomes. This method offers several advantages over existing technologies, particularly DNA microarrays, which are currently the most commonly used tool for mapping transcribed regions (4). First, it allows interrogation of all unique sequences of the genome, including those that are closely related; as long as unique bases exist, they can be monitored. Microarrays often cannot readily distinguish closely related sequences because of cross-hybridization. Second, because a large number of reads can readily be obtained, the method is very sensitive and offers a large dynamic range; we found that RNA-Seq has an 8000-fold dynamic range (see ). This is likely due to the low background of RNA-Seq; indeed, analysis of over 29 million reads did not reveal a single tag that corresponded to deleted regions of the genome. Thus, RNA-Seq can detect and quantify RNAs expressed at very low levels. In contrast, DNA microarrays have a dynamic range of 60- to 100-fold, and quantification of RNAs expressed at a low level can be difficult; the reduced dynamic range of microarrays is likely due, at least in part, to cross-hybridization to the different probes in the array. Indeed, comparison of our RNA-Seq data with published results (15) revealed that RNA-Seq was significantly better for quantification of RNA levels than standard gene expression microarrays. Third, RNA-Seq can allow accurate determination of exon boundaries. The 3′ polyA signature offers a precise definition of 3′ UTR boundaries, and mapping of discontinuous sequences coupled with the recognition of splicing consensus sequences allows discovery of introns. In principle, determination of the exact boundaries of 5′ ends by overrepresentation of 5′ end sequences is also possible. However, because (a) yeast 5′ ends are often heterogeneous (7) and (b) we performed an amplification step, we did not obtain nucleotide resolution in our study. Rather, an approximate location was deduced from a sharp transition in signal over a small interval. Nonetheless, overall we provide a useful map of exon boundaries with the RNA-Seq approach.
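The intron-discovery idea described above (a read that maps discontinuously, with the gap flanked by splicing consensus sequences) can be illustrated with a minimal sketch. It assumes only the canonical GT...AG dinucleotide rule at the intron ends; the function name, the 0-based half-open coordinate convention, and the toy sequence are hypothetical and do not represent the analysis pipeline used in this study.

```python
def is_canonical_intron(genome: str, intron_start: int, intron_end: int) -> bool:
    """Return True if genome[intron_start:intron_end] starts with GT and ends with AG.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    The gap would come from a read whose two halves align to separate
    genomic blocks; here it is simply passed in.
    """
    intron = genome[intron_start:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy example: exon 1, a short intron, exon 2 (sequences are made up).
exon1 = "ATGGCC"
intron = "GTATGT" + "TTTT" + "TACTAACAG"
exon2 = "GGCTAA"
toy_genome = exon1 + intron + exon2

gap_start = len(exon1)
gap_end = len(exon1) + len(intron)
candidate_ok = is_canonical_intron(toy_genome, gap_start, gap_end)
shifted_ok = is_canonical_intron(toy_genome, gap_start + 1, gap_end)
```

In this toy case the correctly placed gap passes the check and the gap shifted by one base does not, which is why the consensus test helps separate real splice junctions from misaligned reads.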
Using RNA-Seq, we generated a high-resolution transcription map of the yeast genome. We globally mapped the 3′ ends of yeast genes for the first time and found remarkable heterogeneity at the 3′ ends of many of them. A large fraction of genes show local heterogeneity in their 3′ ends, suggesting differential local processing events. In addition, we found more than one polyA location for 540 yeast genes, suggesting different regions of polyA site selection. In many organisms, alternative polyA sites have been shown to produce unique transcripts with distinct biological properties by altering their protein coding capacity (18), translational regulation (19), stability (21), and intracellular localization (22). It will therefore be important to determine whether the alternative 3′ UTRs of yeast genes with multiple polyA addition sites have different functions.
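The distinction drawn above between local 3′-end heterogeneity and genuinely separate polyA locations can be pictured as a clustering of mapped polyA positions. The sketch below is illustrative only: the gap-based grouping rule, the `max_gap` value of 10 bases, and the function name are assumptions made here, not the criteria used in this study.

```python
def cluster_polya_sites(positions, max_gap=10):
    """Group polyA positions so that neighbours within max_gap bases share a cluster.

    Several positions in one cluster would correspond to local heterogeneity;
    more than one cluster for a gene would correspond to distinct polyA locations.
    max_gap is a hypothetical parameter chosen for illustration.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)   # close to the previous site: same cluster
        else:
            clusters.append([pos])     # far from the previous site: new cluster
    return clusters


# Made-up positions relative to a stop codon.
example_clusters = cluster_polya_sites([105, 100, 103, 250, 252])
```

With these made-up positions the sites fall into two clusters, each containing several nearby positions.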
One important aspect of our study is the discovery that S. cerevisiae contains a large number (793) of expressed genes with overlapping 3′ ends. The pervasive occurrence of overlapping transcripts may be a feature unique to S. cerevisiae and other organisms that lack Dicer homologs and thereby avoid mRNA processing and degradation. Overlapping transcription at the 3′ ends could lead to interesting forms of gene regulation in which neighboring genes can potentially influence the expression of one another.
In addition to revealing alternative 3′ ends of genes, RNA-Seq allowed us to map the 5′ ends and introns of the majority of yeast genes. We found that the first ATG in 35 genes resides upstream of the start codon annotated in SGD, and for 29 others the first ATG lies downstream of the annotated start codon. Although we cannot ascertain that the upstream ATGs are used in translation, they are consistent with the expectation that the first ATG is usually used in eukaryotes (23). For cases where the 5′ end is mapped downstream of the annotated ATG, we presume that a downstream ATG is used, at least in the vegetative growth conditions that we analyzed. It is possible that a longer message and the annotated ATG are used in other cell types. Finally, we confirmed the existence of 240 introns. Interestingly, we observed instances in which there is an annotated intron but no evidence for splicing; the lack of sequence tags that span the intron indicates that these transcripts are not abundantly spliced, at least in vegetative cells. In two instances, the presence or absence of the intron affects the resulting protein product. Thus, RNA-Seq can define the presence of introns as well as their absence, at least at a particular level within an mRNA population.
The mapping of 5′ ends is particularly valuable not only for understanding gene regulation but also for biochemical and genetic characterization of the genome. Extensive efforts are currently underway to characterize the yeast genome biochemically using protein microarrays and other methods (24). Likewise, efforts to characterize the yeast genome genetically are underway using overexpression experiments and other methods (26). Assignment of the proper ATG is crucial for ensuring that the entire native protein and gene are analyzed in these studies. Therefore, the reannotated data generated in this analysis will provide a valuable resource to the scientific community for characterization of gene and protein function.
Our study also revealed a large number of genes (321) with uORFs, which have been implicated in gene regulation (27). In yeast, only 17 genes have thus far been reported to contain uORFs (27). Our data therefore indicate that uORFs are much more prevalent than previously appreciated, suggesting that many genes may be regulated through uORFs. Our finding that many DNA binding proteins contain uORFs suggests that these key regulators are themselves often subject to additional layers of control. To date, only the expression of GCN4 (13) has been shown to be controlled by uORFs; our results suggest that this mechanism is much more widespread.
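To make the notion of a uORF concrete, the following minimal sketch scans a 5′ UTR sequence for an ATG followed by an in-frame stop codon. The definition used here (ATG plus in-frame stop, both contained within the 5′ UTR), the function name, and the example sequences are assumptions for illustration; the criteria actually applied in this study are not restated in this section.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr5: str):
    """Return (start, end) pairs, 0-based half-open, for each ATG in the 5' UTR
    that is followed by an in-frame stop codon inside the same UTR.

    Assumption of this sketch: a uORF must begin and end within the UTR,
    so ORFs that run into the main coding sequence are not reported.
    """
    utr5 = utr5.upper()
    uorfs = []
    for start in range(len(utr5) - 2):
        if utr5[start:start + 3] != "ATG":
            continue
        # Walk codon by codon from the codon after the ATG.
        for pos in range(start + 3, len(utr5) - 2, 3):
            if utr5[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Made-up UTR: ATG at index 2, lysine codon, then an in-frame TAG stop.
example_uorfs = find_uorfs("CCATGAAATAGCC")
no_stop = find_uorfs("ATGAAACCC")
```

The first call reports a single three-codon uORF, while the second reports none because the ATG has no in-frame stop within the sequence supplied.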
Our method of analyzing polyA RNA using fragmentation of yeast cDNA proved useful for defining gene boundaries. We also varied the RNA-Seq protocol by preparing RNA lacking ribosomal RNA. Fragmentation of this RNA, followed by generation of cDNA using random primers and Illumina sequencing of the ends, revealed more uniform gene coverage. However, gene boundaries were not as distinct for yeast genes, and thus the resulting data were not as useful for this study.
In addition to characterizing known yeast genes, we also found evidence for novel transcribed regions of the yeast genome. We find that, overall, much (74.5%) of the yeast genome is transcribed. We believe this transcription is not an artifact because of the very low background of RNA-Seq. Moreover, these data are consistent with previous studies in which lacZ insertions lacking an initiation ATG are often expressed even when located in intergenic regions (29). The extensive transcription of the yeast genome allows a significant fraction of the genome to be expressed and therefore is likely to be valuable for the evolutionary selection of novel gene function (31).
Analysis of the intergenic regions reveals that many of them likely form novel transcription units. The exact number of novel transcription units is difficult to determine because it depends on an arbitrary threshold, but there are at least 487 transcribed regions of high confidence. Of these, 204 have been discovered only by RNA-Seq. Examination of 18 novel transcribed regions using qRT-PCR confirms that the majority are expressed, indicating that these regions make bona fide RNAs in yeast.
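The dependence of the region count on a threshold can be seen in a small sketch that calls transcribed regions from per-base read depth. The parameters `min_depth` and `min_length`, the function name, and the toy coverage vector are hypothetical; they only illustrate why the number of regions changes with the cutoff and are not the thresholds used in this study.

```python
def call_transcribed_regions(coverage, min_depth=1, min_length=1):
    """Return (start, end) intervals, 0-based half-open, where every base has
    depth >= min_depth and the run is at least min_length bases long."""
    regions = []
    start = None
    for i, depth in enumerate(coverage):
        if depth >= min_depth:
            if start is None:
                start = i                      # open a new run
        elif start is not None:
            if i - start >= min_length:
                regions.append((start, i))     # close the run if long enough
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


# Made-up per-base read depth over a 10-base stretch.
toy_coverage = [0, 3, 5, 4, 0, 0, 1, 0, 6, 7]
lenient = call_transcribed_regions(toy_coverage, min_depth=1, min_length=1)
strict = call_transcribed_regions(toy_coverage, min_depth=2, min_length=2)
fraction_strict = sum(end - start for start, end in strict) / len(toy_coverage)
```

On this toy vector the lenient setting yields three regions and the stricter setting yields two, which mirrors the point that the count of novel transcription units is threshold dependent.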
In conclusion, the novel RNA-Seq method described here allowed us to map the transcriptional landscape of the yeast genome and to define, for the first time, UTRs and many novel transcribed regions. In the future, application of this method should aid in determining the precise transcriptional landscape of other genomes, including those of complex mixtures of organisms.