A computational screen for structural ncRNA in S. cerevisiae was performed using thermodynamic stability to discriminate structural ncRNA from background sequence. The method was tested on positive and negative control sets to determine its effectiveness at identifying known ncRNA and to develop optimal search parameters. These parameters were determined to be a Z-score <−3.5, window sizes of 75 nt to 200 nt, a step size of 5 nt, and a window delta (the increment between successive window sizes) of 5 nt. The parameters were then used to screen for novel ncRNA in the intergenic regions of S. cerevisiae chromosome VI. To reduce the number of false positive predictions, an independent analysis was performed on syntenic regions of S. bayanus. The predictions found in both species were subjected to further experimental verification. Like all computational ncRNA gene discovery approaches currently available, our method can only provide guidance on regions likely to contain structural elements. It cannot predict the exact location of the ncRNA gene or its precise ends; these must be determined experimentally.
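The sliding-window screen described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in the study: the Z-score is assumed to compare a window's folding energy against mononucleotide-shuffled copies of the same window, and `toy_energy` is a hypothetical stand-in for a real minimum-free-energy calculation. All function names are illustrative.

```python
import random
import statistics
from typing import Callable, Iterator, List, Tuple


def iter_windows(seq_len: int, min_size: int = 75, max_size: int = 200,
                 size_delta: int = 5, step: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates for every window size and offset."""
    for size in range(min_size, max_size + 1, size_delta):
        for start in range(0, seq_len - size + 1, step):
            yield start, start + size


def z_score(seq: str, energy: Callable[[str], float],
            n_shuffles: int = 100, seed: int = 0) -> float:
    """Z-score of a window's energy against shuffled copies (assumed background model)."""
    rng = random.Random(seed)
    background = []
    for _ in range(n_shuffles):
        letters = list(seq)
        rng.shuffle(letters)  # mononucleotide shuffle: composition is preserved
        background.append(energy("".join(letters)))
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)
    return (energy(seq) - mu) / sigma if sigma > 0 else 0.0


def screen(seq: str, energy: Callable[[str], float], cutoff: float = -3.5,
           **window_kwargs: int) -> List[Tuple[int, int, float]]:
    """Return every window whose Z-score falls below the cutoff."""
    hits = []
    for start, end in iter_windows(len(seq), **window_kwargs):
        z = z_score(seq[start:end], energy)
        if z < cutoff:
            hits.append((start, end, z))
    return hits


def toy_energy(seq: str) -> float:
    """Hypothetical placeholder: NOT a folding energy, only makes the sketch runnable."""
    return -float(seq.count("GC"))
```

In practice `toy_energy` would be replaced by a call to an RNA folding package, and the shuffling scheme and number of shuffles would need to match whatever background model is chosen.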
Northern blots, rapid amplification of cDNA ends (RACE), and publicly available cDNA library data were used to test the predictions. Each of these methods was selected for specific reasons. The strength of northern blot analysis is that it does not rely on transcript amplification and hence avoids artifacts that can result from an amplification step. However, it is not as sensitive as other methods and this can be a significant limitation when testing for ncRNA that may be expressed at low levels. RACE provides greater sensitivity than northern blot analysis but may be subject to amplification artifacts. The potential for artifacts is reduced because the 5′ and 3′ ends of the transcript are captured. The presence of a cap and poly-A tail provides strong evidence that the transcript has been processed by the cellular machinery and is a legitimate functional transcript. This makes the approach superior to methods such as tiling arrays that provide information on transcription but for which it is difficult to distinguish transcriptional noise from genuine transcripts. The publicly available cDNA data used here also has the advantage of capturing the transcript 5′ and 3′ ends, providing strong evidence for a legitimate, processed transcript.
The initial computational screen presented here produced sixteen ncRNA gene candidates on chromosome VI of S. cerevisiae. Four candidates are well supported by experimental data and have been given the names RUF20 to RUF23. The RUF20 candidate is also expressed in S. bayanus and in the more distantly related species A. gossypii. All of the transcripts were evaluated for the possibility that they might be snoRNA or encode a protein but this was shown to be unlikely (see Materials and Methods). Two additional candidates are also supported by experimental evidence but further experimental testing is needed to confirm their legitimacy. Six of the candidates were found to be part of the 5′ or 3′ untranslated regions (UTRs) of annotated protein-coding genes. These structures are interesting because they may play a functional role in the UTRs of these genes. Additional experimental analysis will be needed to determine the function of the structures as well as the function of the four new ncRNA, RUF20 to RUF23.
There are several possible explanations for why experimental data could not be obtained to support three of the ncRNA predictions. These predictions may represent false positives, they may not be expressed under the conditions tested, or they may be expressed at such a low level that they could not be detected. It has been shown that transcript abundance in yeast varies over six orders of magnitude and that some important transcription factors are expressed at levels as low as one transcript per thousand cells [59]. It is also possible that these transcripts are not transcribed by RNA polymerase II. The method used in this study to generate cDNA depends on a poly-A tail in the RNA transcript, so ncRNA candidates transcribed by polymerase I or III would likely not be captured in the cDNA library.
It should be noted that three genes in the positive control set (snR76, SER3, and RNA170) did not generate a Z-score <−3.5. It is questionable whether these genes actually contain significant structural elements. One of them, snR76, is a C/D box snoRNA, and data from other investigators [19] show that structural features are present in only a subset of these genes. It is therefore not surprising that this category of ncRNA was not easily detected in a screen based on structural thermodynamic stability; some classes of ncRNA will clearly not be identified well in structural screens. The other two genes were RNA170 (unknown function) and SER3. The SER3 gene suppresses expression of its neighboring gene, SRG1, by blocking access to the SRG1 promoter region via its transcription. SER3 and RNA170 are unlikely to contain significant structural features, so the fact that they did not generate Z-scores less than −3.5 tends to validate the method.
Two previous groups have performed computational genome-wide screens for ncRNA in S. cerevisiae. McCutcheon and Eddy, 2003 used the QRNA program to search for structural elements based on observed compensatory changes in pair-wise alignments of S. cerevisiae with related species. A fixed window size of 150 nt and a step size of 50 nt were used to perform the analysis. Two structural ncRNA candidates were found on chromosome VI. One prediction, between RIM15 and HAC1 (74738–74738), was near one of the candidates predicted in this study between the same genes (74926–75006). They were unable to obtain sufficient experimental support for expression of this transcript, which is consistent with our experimental results. The second McCutcheon and Eddy prediction, between SMC1 and BLM10, did not correspond to any predictions generated in this study. They obtained northern blot and RACE data to support expression of this second predicted gene.
A second screen for ncRNA was performed by Steigele et al using the RNAZ program [35]. This program searches for compensatory changes in multiple sequence alignments as well as for thermodynamic stability cues indicative of structural elements. The relative contribution of these two factors to the prediction is not specified. A fixed window size of 120 nt and a step size of 40 nt were used to perform the analysis. They reported a sensitivity (true positives/total) of 47% for identifying snoRNA (pooling H/ACA box and C/D box snoRNA), 66% for snRNA, and 72% for tRNA. The screen generated a total of 18 novel intergenic structural predictions on chromosome VI. Of these, 8 were predicted to be on the Crick strand and 8 on the Watson strand. Five of these intergenic regions were shared by our predictions (YFL051C-ALR2, ACT1-YPT1, TUB2-RPO41, GYP8-STE2 and YFR017C-YFR018C). All 5 of the Steigele et al predictions were on the Watson strand in these regions. Two of the predictions overlapped with our predictions (ACT1-YPT1 and YFR017C-YFR018C).
Our experimental data suggested that the YFL051C-ALR2 region is transcriptionally complex and is likely to produce more than a single transcript. This could account for the fact that both studies predicted structural elements in this region. Our RACE analysis of the ACT1-YPT1 region showed that the predicted structural element was contained within the ACT1 UTR on the Crick strand. The Steigele et al prediction overlaps within the ACT1 UTR but is predicted to be on the opposite strand (Watson). For the TUB2-RPO41 region, we experimentally confirmed a transcript on the Crick strand encompassing our predictions. This transcript overlaps with the Steigele et al prediction but is again on the opposite strand (Watson). Our GYP8-STE2 prediction proved to be part of the GYP8 5′ UTR on the Crick strand. The Steigele et al prediction in this region was on the Watson strand and is beyond the region we measured for the GYP8 UTR (although we were unable to map the end of this 5′ UTR). In the YFR017C-YFR018C region, we obtained RACE results that mapped our prediction to the Crick strand as part of the YFR018C 3′ UTR. The Steigele et al prediction, which largely overlaps our prediction, was for a gene on the Watson strand. Hence, while our predictions and those of Steigele et al are close to one another or overlapping in five regions, in all five cases they are on opposite strands.
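The region-by-region comparison above amounts to a strand-aware interval check. The sketch below shows one way to express it; the `Prediction` type and the coordinates in the example are hypothetical and are not taken from either study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    region: str   # flanking gene pair, e.g. "YFR017C-YFR018C"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "W" (Watson) or "C" (Crick)


def overlap_length(a: Prediction, b: Prediction) -> int:
    """Number of shared positions, ignoring strand."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def compare(a: Prediction, b: Prediction) -> str:
    """Classify two predictions by region, positional overlap, and strand."""
    if a.region != b.region:
        return "different region"
    if overlap_length(a, b) == 0:
        return "same region, no overlap"
    if a.strand == b.strand:
        return "overlap, same strand"
    return "overlap, opposite strands"


# Illustrative coordinates only (hypothetical values).
ours = Prediction("YFR017C-YFR018C", 1000, 1120, "C")
theirs = Prediction("YFR017C-YFR018C", 1050, 1170, "W")
print(compare(ours, theirs), overlap_length(ours, theirs))
```

Reporting the strand alongside the overlap keeps cases like the five regions discussed here from being counted as agreement between two screens.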
It is interesting that there is no overlap between the QRNA and the RNAZ predictions of chromosome VI since both programs consider compensatory changes within alignments to identify structural elements. The reason for this is unclear.
There are two primary differences between the search for ncRNA presented here and the work of previous investigators. First, this method does not require sequence alignments in the analysis. Instead, it relies entirely on thermodynamic stability in unaligned syntenic regions of related species to predict ncRNA structure. The approach is capable of finding ncRNA that have moved out of register within syntenic regions and can be applied in situations where accurate alignments may be difficult to obtain.
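Because no alignment is required, the cross-species filter can be as simple as asking whether both species produced a hit in the same syntenic intergenic region, regardless of position within it. The sketch below assumes hits are keyed by the flanking gene pair; the data structure and the example values are hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, int, float]  # (start, end, Z-score) within one species' region


def shared_regions(hits_a: Dict[str, List[Hit]],
                   hits_b: Dict[str, List[Hit]]) -> Dict[str, Tuple[List[Hit], List[Hit]]]:
    """Keep regions with at least one hit in both species; coordinates need not match."""
    common = sorted(r for r in hits_a.keys() & hits_b.keys()
                    if hits_a[r] and hits_b[r])
    return {r: (hits_a[r], hits_b[r]) for r in common}


# Hypothetical example values, for illustration only.
cerevisiae = {"ACT1-YPT1": [(120, 200, -4.1)], "TUB2-RPO41": [(40, 160, -3.8)]}
bayanus = {"ACT1-YPT1": [(310, 390, -3.9)], "TUB2-RPO41": []}
print(shared_regions(cerevisiae, bayanus))
```

Matching on region rather than on aligned coordinates is what allows a structure that has shifted within the syntenic region to be retained.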
The second difference in this work is its examination of the impact of various window sizes and step sizes on ncRNA detection. The analysis shows that small step sizes are necessary to ensure that most ncRNA are identified. It also shows that more than one window size is needed when screening for ncRNA. Some ncRNA are detected only when using short window sizes, while others are detected only when using long window sizes. Limiting the search to a single window size, as has traditionally been done, is likely to bias the screen toward a subset of ncRNA for which that window size is optimal.
Table 7. Detection of each snoRNA for each window size.
The need for multiple window sizes and step sizes in the screening algorithm increases the computational investment necessary to perform the analysis. However, with the rapid increase in computer performance and the availability of computer clusters, these computations are not unreasonable. The increased computational investment will be rewarded by increased sensitivity.
Our analysis suggests that a few carefully selected window sizes will be nearly as effective at detecting ncRNA as the entire set between 75 nt and 200 nt (total of 26 window sizes). For example, when we used the entire set of window sizes from 75 nt to 200 nt, we detected 22 of the 29 known H/ACA snoRNA within embedded sequences. If we had used only 4 window sizes (80 nt, 120 nt, 160 nt, 200 nt), we would have succeeded in identifying 90% of these H/ACA box snoRNA (20 of the 22) while reducing computational requirements by approximately 85% (4 of 26 window sizes). If these four window sizes were used with a step size of 25 nt, 77% (17 of 22) of the H/ACA box snoRNA would be detected (Table S10). This becomes 64% (14 of 22) if the step size is increased to 50 nt (Table S11).
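The tradeoff in the paragraph above can be tabulated with a few lines of arithmetic. The detection counts are those quoted in the text; the 5 nt step for the first two configurations is assumed from the stated screen parameters, and the 1,000 nt region length used for counting windows is an arbitrary illustrative value.

```python
from typing import List

all_sizes = list(range(75, 201, 5))   # 75, 80, ..., 200 nt
subset_sizes = [80, 120, 160, 200]


def n_windows(region_len: int, sizes: List[int], step: int) -> int:
    """Number of windows needed to tile one region with the given sizes and step."""
    return sum((region_len - s) // step + 1 for s in sizes if s <= region_len)


# (label, window sizes, step in nt, H/ACA snoRNA detected out of the 22 found by the full set)
configs = [
    ("26 sizes, step 5", all_sizes, 5, 22),
    ("4 sizes, step 5", subset_sizes, 5, 20),
    ("4 sizes, step 25", subset_sizes, 25, 17),
    ("4 sizes, step 50", subset_sizes, 50, 14),
]

region_len = 1000  # hypothetical intergenic region length
for label, sizes, step, detected in configs:
    windows = n_windows(region_len, sizes, step)
    print(f"{label}: {windows} windows, {detected}/22 = {detected / 22:.1%} of full-set detections")

size_reduction = 1 - len(subset_sizes) / len(all_sizes)
print(f"window sizes reduced by {size_reduction:.1%}")
```

Counting windows per region makes the cost of a small step size explicit: the number of folds scales roughly inversely with the step and linearly with the number of window sizes.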
Tradeoffs between sensitivity and computational requirements should be evaluated when performing computational screens. We recommend using four window sizes when screening for ncRNA in a genome; one short, one long, and two intermediate values appear to be optimal. Our results suggest that window sizes of 80, 120, 160 and 200 nt should provide good results, and that a step size between 5 and 10 nt should also provide a good screen. These parameters should provide good ncRNA detection while keeping computational time manageable. The development of an efficient computational algorithm implementing the methodology presented here would also significantly reduce computational run time.
This screen used a simple cutoff Z-score value (<−3.5) to discriminate ncRNA. The sensitivity of the screen could probably be improved if a more sophisticated cutoff criterion were developed in which the Z-score cutoff was a function of window size. In the negative control sets, the number of aberrant negative Z-scores dropped as a function of window length, demonstrating that the likelihood of producing a large negative Z-score decreases with increasing window length. Developing a Z-score cutoff value as a function of window length would probably improve the sensitivity of the screen at longer window sizes.
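One possible way to make the cutoff depend on window length is to calibrate it per window size against the negative control set. The sketch below uses an empirical quantile rule; this rule is an assumption made for illustration and is not a criterion proposed or tested in this study. The function name and the `fpr` parameter are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cutoffs_by_window(neg_scores: Iterable[Tuple[int, float]],
                      fpr: float = 0.01) -> Dict[int, float]:
    """Per-window-size Z-score cutoff from negative-control (window_size, Z-score) pairs.

    A window is called a hit when its Z-score is strictly below the cutoff, so at most
    floor(fpr * n) of the n negative-control windows of that size would be called.
    """
    by_size: Dict[int, List[float]] = defaultdict(list)
    for size, z in neg_scores:
        by_size[size].append(z)
    cutoffs = {}
    for size, zs in by_size.items():
        zs.sort()                            # most negative first
        k = math.floor(fpr * len(zs))        # allowed false calls at this window size
        cutoffs[size] = zs[k]
    return cutoffs


# Hypothetical negative-control scores: longer windows show fewer extreme values.
neg = [(80, z) for z in (-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0)]
neg += [(200, z) for z in (-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
print(cutoffs_by_window(neg, fpr=0.25))
```

With real negative-control sets the per-size cutoffs would need enough windows at each size for the quantile to be stable, and could be smoothed across neighboring window sizes.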
This work demonstrates that structural thermodynamic stability is an effective tool for predicting ncRNA genes. As examples of ncRNA are accumulated through computational screens such as this, it may become possible to determine ncRNA key features and gain insight into their biological function. Computational methods can complement experimental approaches in the effort to gain a deeper understanding of these genes.