We have shown that 5′ SAGE is able to generate accurate TSS data for S.cerevisiae. This method combines the use of TypeIIS restriction enzymes and the ditag strategy from SAGE with the template switching TSS identification of SMART™ RACE (47, 48) to determine the TSS. We have used a TS approach to capture the TSS instead of the cap-capture approach, which has been used successfully with human and mouse mRNA (8–10), showing that there are at least two workable approaches to 5′ SAGE. The accuracy of this method derives from the ditag strategy in 5′ SAGE that, as in SAGE, is essential to ensure that each tag originates from a different mRNA and is not an artifact of PCR amplification. This confidence that multiple-occurrence tags are independent is essential, as we depend on multiplicity of occurrences to validate the individual unitag sites.
Carrying out 5′ SAGE on S.cerevisiae, we have found that the TSS identified by 5′ SAGE agree well with previously reported data, with significant disagreement for only 7 out of 48 genes. The seven genes (PIM1, HEM1, URA1, URA4, HSC82, ADH2 and TPI1) for which there is significant disagreement as to the location of the TSS need to be more closely investigated to identify the source of the discrepancy. For TPI1, primer extension results support the −30 position of the main TSS detected by 5′ SAGE, suggesting that the previous report (32) is erroneous. In some cases, the 5′ SAGE TSS data differed from published primer extension data by 1–5 bases. This 1–5 base discrepancy may result from the difficulties associated with obtaining single base pair resolution with either primer extension or nuclease protection assays. An example of this is TEF2, where our 5′ SAGE and primer extension results identify the 5′-UTR as being 1 nt shorter than previously published (31). In some cases, including TDH3, GLN1, IMD2, RPS17A, GCN4, TEF2, ARO4 and HEM3, a small number of tags are upstream of the majority of the 5′ SAGE tags and the previously published TSS. It is unclear whether these tags are artifacts, represent rare transcripts from an upstream TSS, or represent regulatory RNAs similar to SRG1. For TEF2, TDH3 and GCN4, we have verified by primer extension the occurrence of these upstream TSS.
Our estimation of the accuracy of this method is also supported by the consistency of the TSS identified by 5′ SAGE with the previously reported TSS consensus sequence. Based on the frequency of tags falling outside of the 5′-UTR and the frequency of potentially spurious single occurrence tags within the 5′-UTR, we estimate that 54–70% of all 5′ SAGE TSS tags represent actual TSS. While 30–46% of 5′ SAGE tags do not represent true start sites and likely result from premature termination of reverse transcription or degraded mRNA, these false tags pose little problem because our identification of TSS relies on multiple tag occurrence. Determination of the TSS for a specific gene by 5′ SAGE thus requires sequencing of the 5′ SAGE library to sufficient depth so that multiple TSS are represented for each gene. Approximately 10% of the tags had multiple positions in the genome, consistent with the fraction of repetitive sequences in this genome and their transcription level.
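The multiple-occurrence criterion described above can be illustrated with a minimal Python sketch. The function name `call_tss`, the `min_count` threshold of 2 and the toy tag list are assumptions made for illustration only; they are not the analysis pipeline used in this study, and apart from the −30 position for TPI1 the positions are invented.

```python
from collections import Counter

def call_tss(tag_positions, min_count=2):
    """Keep only TSS positions supported by at least `min_count` tags.

    tag_positions: iterable of (gene, position) pairs, one pair per
    sequenced tag, with position given relative to the start codon.
    Returns {gene: {position: tag_count}} for the retained positions.
    """
    counts = Counter(tag_positions)
    calls = {}
    for (gene, pos), n in counts.items():
        if n >= min_count:  # single-occurrence tags are treated as unvalidated
            calls.setdefault(gene, {})[pos] = n
    return calls

# Hypothetical toy data: three tags at TPI1 -30, one stray TPI1 tag,
# two tags for TEF2 and a single tag for GCN4.
tags = ([("TPI1", -30)] * 3 + [("TPI1", -12)]
        + [("TEF2", -35)] * 2 + [("GCN4", -590)])
print(call_tss(tags))
```

With this toy input, only TPI1 at −30 and TEF2 at −35 are retained; the single-occurrence tags are dropped, which is why sequencing depth limits the genes for which a TSS can be called.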
The refined consensus sequence we have identified for S.cerevisiae differs significantly from that of human. The larger TSS consensus sequence in S.cerevisiae combined with the variable distance of TSS from the TATA element (49) has been previously suggested to indicate that the mechanism of transcription start in S.cerevisiae is significantly different from that in human (50). It has been suggested that instead of assembly of the RNA polymerase II complex at the TATA element coupled to transcription initiation at an adjacent site (51), RNA polymerase II in S.cerevisiae may utilize a scanning model to identify the TSS (49). Supporting the idea that RNA polymerase may scan from the TATA to the TSS in S.cerevisiae is the observation from Giardina and Lis (52) that for the GAL1 and GAL10 promoters, the region of transcription-associated promoter melting is located adjacent to the TATA element, as has been shown for mammalian TATA elements. In Schizosaccharomyces pombe, the TSS has been reported to be 25–40 bp from the TATA element (53), suggesting that having the transcription start adjacent to the site of polymerase assembly may be the more typical eukaryotic mode of transcription initiation, and that S.cerevisiae has an unusual mode of transcription initiation. The TSS consensus sequence found in this study is highly A-rich. As the S.cerevisiae genome is overall >60% AT, and the region of the TSS is >37% A, the role of the A-rich regions in this consensus sequence is unclear: the poly(A) stretches surrounding the TSS may be important in transcription initiation, or they may be an artifact of some other aspect of genome structure. While some mutagenesis work on the TSS has been carried out in S.cerevisiae (54, 55), further mutagenesis work in S.cerevisiae and TSS identification in one of the closely related but GC-rich hemiascomycetes, such as A.gossypii (1), will be necessary to refine our understanding of this consensus sequence. The identity of the TSS consensus sequence from TATA-containing genes and TATA-less genes suggests that the mechanism by which the TSS is selected in S.cerevisiae is independent of the presence of a TATA element. This is in contrast to transcription initiation in human, where, based on a small number of genes, it has been shown that for genes lacking a TATA element a 19 bp window including the TSS, the Inr element, is necessary and sufficient for transcription initiation (16).
We have identified multiple tag TSS for 660 protein-coding genes in this study, representing ~11% of the ~5800 protein-coding genes in S.cerevisiae (56). Consistent with the previously published results, 5′ SAGE data indicate that the majority of S.cerevisiae genes have multiple TSS. We have refined the TSS consensus sequence, refined the TSS to ATG and TSS to TATA average distance measurements, identified the location of a potential regulatory RNA upstream of ODC2 and identified 24 genes with potential uORFs in their 5′-UTR. We have also identified a previously overlooked protein-coding gene and have provided evidence suggesting that the starting methionine for 14 S.cerevisiae genes differs from that currently annotated. Thus, 5′ SAGE, in identifying the TSS of both protein-coding genes and RNA polymerase II transcribed RNA coding genes, provides data useful for understanding gene regulation as well as data refining the annotation of S.cerevisiae.
A variation on 5′ SAGE we are currently using involves replacing the oligo(dT) primer used in the reverse transcription step with a pool of 480 oligonucleotides selected from the minus strand at approximately +100 relative to the start codon. These primers are selected to be unique in the genome and to have similar melting temperatures and GC content. Initial experiments have revealed that it is also necessary to select the pool of primers to minimize primer–primer complementarity, as the MMLV reverse transcriptase is able to extend complementary primers to create false tags. Using pools of primers, a method we refer to as 5′ SAGE II, we are currently generating and evaluating libraries. Preliminary results include the identification of the TSS of PFY1, the gene encoding profilin, at −41, consistent with the previously published value of −41 as the major TSS (57). Further refinements of these methods will allow the identification of the TSS of the set of transcribed S.cerevisiae genes, not just the most highly expressed genes.
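The primer-pool criteria described above (similar melting temperature, similar GC content, minimal primer–primer complementarity) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the selection procedure used for 5′ SAGE II: the Wallace rule of thumb for melting temperature, the Tm and GC windows, the 6-base 3′ complementarity check, the greedy selection order and the example sequences are all hypothetical choices, and the genome-uniqueness check is omitted because it requires the genome sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_fraction(seq):
    """Fraction of G and C bases in the primer."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def wallace_tm(seq):
    """Wallace rule of thumb: 2 degrees C per A/T plus 4 degrees C per G/C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def three_prime_can_anneal(a, b, k=6):
    """True if the last k bases of primer a can base-pair somewhere on b,
    which would give a reverse transcriptase a 3' end to extend."""
    return reverse_complement(a[-k:]) in b

def select_pool(candidates, tm_range=(56, 64), gc_range=(0.4, 0.6), k=6):
    """Greedily keep primers within the Tm and GC windows that do not
    form extendable 3' duplexes with themselves or with kept primers."""
    pool = []
    for p in candidates:
        if not tm_range[0] <= wallace_tm(p) <= tm_range[1]:
            continue
        if not gc_range[0] <= gc_fraction(p) <= gc_range[1]:
            continue
        clash = three_prime_can_anneal(p, p, k) or any(
            three_prime_can_anneal(p, q, k) or three_prime_can_anneal(q, p, k)
            for q in pool
        )
        if not clash:
            pool.append(p)
    return pool

# Hypothetical 20-mers: p2 carries a site complementary to the 3' end
# of p1, and p3 is too AT-rich to fall inside the Tm window.
p1 = "ACGTTGCAAGGCTTACCGTA"
p2 = "GGATACGGTCATTCAGCATC"
p3 = "ATATTTAAATATTAATTTAA"
p4 = "CTGAAGTCCATCGTTGAGAC"
print(select_pool([p1, p2, p3, p4]))
```

In this toy example only `p1` and `p4` are kept. Because the selection is greedy, the result depends on candidate order; a real design would also need to weigh which primer of a clashing pair to discard.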