Conservation of uORFs in GCN4 homologues in other fungi
To estimate the degree of evolutionary conservation of functional uORFs among fungal species, we first investigated homologues of the GCN4 locus, which is well characterised in S. cerevisiae with respect to the regulatory role of its four uORFs [6]. Using WU-BLAST2-TBLASTN at SGD, we identified GCN4 orthologue candidates in 18 fungal species. In all cases it was possible to find one unambiguous homologous locus. All upstream regions were aligned, and uORFs were examined for similarity in sequence and distance from the main ORF (Fig. ). All four uORFs are well conserved in all species up to and including Ashbya gossypii, with the sole exception of Kluyveromyces lactis. uORFs 1, 2, and 4 have discernible homologues at even longer evolutionary distances, as far as Yarrowia lipolytica (representing a split of > 200 MYr [13]). In even more distantly related fungi, representing basidiomycetes and filamentous ascomycetes, however, no homologous uORFs were found. These findings demonstrate that uORFs with a proven regulatory role in S. cerevisiae are indeed conserved in genomes throughout most of Hemiascomycetes. It is thus reasonable to expect conservation of uORFs with a regulatory role among Saccharomyces sister species, and to use this as a criterion for classifying them as functional.
Figure 1 Conservation of uORFs in the GCN4 locus of S. cerevisiae and homologues in 18 fungal species. The species are ordered approximately according to evolutionary distance from S. cerevisiae. uORFs which are conserved with respect to sequence and position ...
Conservation between species among previously recognised uORFs
The starting point for our investigation was a set of 16 S. cerevisiae genes with characterised 5'-UTRs containing uORFs [3] (Fig. , set A). Investigation of this set revealed 27 uORFs, for an average of 1.8 uORFs per gene. A summary of the properties of this set is found in Table . Among this set of uORFs, we discerned three subclasses with respect to their length and positioning (Fig. ). The first and most abundant subclass, typified by GCN4, has short uORFs that do not overlap either with each other or with the main ORF. The second class, which includes YAP2, has short as well as longer uORFs, which overlap with the main ORF but not with each other. The third class, represented here by PET111, has short and long uORFs that overlap both with each other and with the main ORF.
Figure 2 Flowchart of the steps in defining criteria to find novel uORFs that share characteristics with known functional uORFs. Solid arrows denote partition of a gene set into subsets; dotted arrows denote that a gene set or an algorithm is influenced by or ...
Evolutionary conservation of uORFs highlighted by Vilela and McCarthy. Genes with conserved uORFs are shown in bold.
Three major classes of organisation of uORFs found in the S. cerevisiae genome. Not drawn to scale.
To investigate to what extent these uORFs are conserved, we aligned the sequences from 1000 bp upstream of the start codon of each of these S. cerevisiae genes with their orthologues from the other members of the Saccharomyces sensu stricto group, plus S. castellii and S. kluyveri (for an example of visualisation of an alignment, see Fig. ). The result is shown in Table . Nine of the 16 genes (CLN3, CPA1, GCN4, HAP4, HOL1, PET111, TIF4631, YAP1, YAP2) turned out to possess uORFs that are visibly conserved in most other Saccharomyces species where an orthologue could be identified. As expected, there was generally a gradual decline of conservation with increasing evolutionary distance. Thus, all 18 uORFs were conserved in S. paradoxus, S. mikatae, and S. bayanus; 10 were conserved in S. castellii, 8 in S. kudriavzevii, and 3 in S. kluyveri.
Figure 4 Alignment of a region containing uORF1 (closest to the start codon of the main ORF) from S. cerevisiae YJL139c (YUR1) with the orthologous sequences from four other Saccharomyces species. A, sequence alignment. The start and stop codons of the uORF are ...
An analysis of the 9 genes in which conservation of uORFs was evident revealed two features that most of them share, and that might be used to distinguish them from spurious uORFs. First, the uORFs are short, on average 6.5 codons, compared with an average of 12.9 codons for all uORFs in this set and 15.0 codons for the non-conserved uORFs. Second, the most downstream uORF lies no closer than 50 nt to the start codon of the main ORF, in most cases at a distance between 50 and 150 nt.
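The two properties discussed above (uORF length in codons and distance to the main ORF) can be read directly off a 5' flanking sequence. The following is a minimal sketch, not the procedure used in this work: the function name `find_uorfs`, the convention that the codon count includes the ATG but excludes the stop codon, and the convention that distance is measured from the end of the uORF stop codon to the main ATG are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(upstream, main_orf=""):
    """List uORFs whose ATG lies in `upstream` (the DNA 5' of the main ATG).

    `main_orf` may be appended so that uORFs running into the main ORF are
    still terminated at their own in-frame stop codon.
    Assumed conventions (hypothetical, for illustration only):
      codons   = sense codons including the ATG, excluding the stop codon
      distance = nt between the uORF stop codon and the main ATG
                 (negative when the uORF overlaps the main ORF)
    """
    upstream = upstream.upper()
    seq = upstream + main_orf.upper()
    uorfs = []
    for start in range(len(upstream) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk in frame until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                end = pos + 3
                uorfs.append({
                    "start": start,
                    "end": end,
                    "codons": (pos - start) // 3,
                    "distance": len(upstream) - end,
                })
                break
    return uorfs


# Toy flank: one 2-codon uORF followed by 60 nt of spacer.
example = find_uorfs("CC" + "ATGAAATAA" + "G" * 60, "ATGGCTTAA")
```

Note that nested in-frame ATGs are reported as separate uORFs sharing one stop codon; whether these should be merged is a design choice the text does not specify.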
Extension of heuristics for classification of functional uORFs in a larger dataset
In the second step, we extended our analysis to the whole collection of S. cerevisiae genes for which the extent of the 5'-UTR is known [18]. All 294 5'-UTR sequences were downloaded from the UTRResource database and analysed for their uORF content. In 90 of these genes, at least one uORF was found (Fig. , set B). The corresponding sequences from the other genomes were aligned as previously. Out of these 90 genes, 16 were found to contain at least one conserved uORF (average 1.7 uORF per gene; Fig. , set D). The properties of uORFs, both conserved and non-conserved, in this set are summarised in Table , and the 16 genes with conserved uORFs detected in this work are listed in additional file 1.
Properties of uORFs found in 294 previously identified 5'-UTRs, after classification as evolutionarily conserved or non-conserved.
We then reanalysed the combined set of 106 (16 + 90; Fig. , set A + B) uORF-containing 5'-UTRs, again looking for features that distinguish uORFs of the 25 (9 + 16; set C + D) 5'-UTRs where evolutionary conservation was detected, from those without detectable conservation.
Creation of an expert system and its implementation to discriminate functional from spurious uORFs on a genome-wide level
We wanted to perform an analysis of all 5'-flanking sequences of recognised genes in the S. cerevisiae genome, using the approximate criteria that we derived from the set of conserved uORFs in characterised 5'-UTRs. For this, we needed a formal implementation of the criteria that could also perform a genome-wide scan in a reasonable time. We used an expert system (see Materials and Methods) in which the following rules, derived from the analysis of the 106 uORF-containing genes (Fig. , set A + B), were encoded. The system gave as output a numeric score for each uORF based on: a) the length of the uORF (optimal 4 – 6 codons); b) the distance of the gene-proximal uORF (optimal 50 – 250 nt); c) the number of uORFs upstream of a main ORF (optimal < 10). These values were stored in frame structures in an expert system shell. A score (certainty factor, cf) for each uORF was deduced using a set of production rules with associated cfs, and the highest score among the uORFs upstream of a certain gene was assigned to that gene. A diagram visualising the length, position, certainty factor and conservation in other Saccharomyces species is produced automatically for each gene (Fig. ). We analysed a total of 5602 intergenic sequences of recognised genes from S. cerevisiae (Fig. , set E). As in most cases the length of the 5'-UTR was unknown, the entire intergenic sequences were used. Among these sequences, a total of 51904 potential uORFs were found. In our scoring system, 24449 uORFs distributed among the 5' flanks of 2735 genes (set F) were assigned a cf ≥ 0.98.
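To make the idea of rule-based scoring with certainty factors concrete, the sketch below scores a uORF on the three criteria named above and assigns each gene the highest score among its uORFs. It is only an illustration under stated assumptions: the cf attached to each rule (0.8, 0.5, -0.4) and the MYCIN-style combination function are hypothetical choices, since the text here gives neither the actual rule values nor how the expert system shell combines them.

```python
def combine_cf(a, b):
    """MYCIN-style combination of two certainty factors in (-1, 1).

    Assumed combination scheme; the shell used in this work may differ.
    """
    if a >= 0 and b >= 0:
        return a + b * (1 - a)
    if a < 0 and b < 0:
        return a + b * (1 + a)
    return (a + b) / (1 - min(abs(a), abs(b)))


def score_uorf(codons, distance, n_uorfs):
    """Score one uORF from three rules; all cf values are hypothetical."""
    rule_cfs = [
        0.8 if 4 <= codons <= 6 else -0.4,       # a) uORF length in codons
        0.8 if 50 <= distance <= 250 else -0.4,  # b) distance to main ORF (nt)
        0.5 if n_uorfs < 10 else -0.4,           # c) number of uORFs upstream
    ]
    score = 0.0
    for cf in rule_cfs:
        score = combine_cf(score, cf)
    return score


def score_gene(uorfs):
    """Assign a gene the highest score among its uORFs.

    `uorfs` is a list of (codons, distance) tuples for one 5' flank.
    """
    n = len(uorfs)
    return max(score_uorf(c, d, n) for c, d in uorfs)


# Toy flank with one well-placed short uORF and one long, distant uORF.
best = score_gene([(5, 120), (30, 600)])
```

With crisp thresholds like these, a uORF of 7 codons scores the same as one of 70; a graded penalty that grows with the deviation from the optimum would be a natural refinement.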
Figure 5 Schematic of the arrangement of uORFs in the 5' flank of S. cerevisiae YJL139c (YUR1) and its homologues in other Saccharomyces species. This type of diagram is produced automatically for each gene, showing the intergenic sequence as a numbered axis; ...
Conservation of uORFs that conform to newly derived rule set
We extracted the intergenic region from each of the 2735 genes and aligned them to their counterparts from the other 6 Saccharomyces species as described above. uORFs from S. cerevisiae with scores above 0.7 were visualised by colour-coding (red, see Fig. ). We manually examined all alignments. We found 379 uORFs distributed among 252 genes (Fig. , set G) to show a clear conservation of sequence and position in at least 4 species. The mean score of these genes was 0.98, notably higher than the average score of the entire set (-0.09), and the average score of the genes selected for inspection of alignment (-0.005).
The fact that uORFs with a high score were significantly better conserved indicates that the rules of our scoring system are indeed detecting features that have been conserved in evolution and, by inference, are likely to play a functional role. Out of the 16 previously characterised genes with uORFs (Table ), 9 are conserved as previously mentioned, and 7 out of these 9 (CLN3, ...) were also found in the list of 252 genes with uORFs that we identified in the screen described above. By contrast, for a group of 40 randomly selected genes (with an average score of -0.09), the degree of conservation of uORFs was 11.4% in S. paradoxus, 2.4% in S. mikatae, 5.2% in S. bayanus, 2.3% in S. castellii, 6.7% in S. kudriavzevii, and 5.2% in S. kluyveri. The fact that the degree of conservation does not follow the evolutionary closeness between species is a sign that this does not reflect actual conservation of sequences. It should be noted that for PET111, only the shorter uORFs that do not overlap with the main ORF (PET111 uORF1 and uORF3; YAP2 uORF1; Fig. ) received high scores. The complete list comprising 252 genes with conserved uORFs predicted to be functional is shown in additional file 2.
In the course of our work, the study by Zhang and Dietrich [19] verified the 5' ends of a large set of S. cerevisiae mRNAs, 24 of which were shown to contain uORFs (additional file 3). We did not use these to modify our rule set, but examined to what extent they are conserved and predicted to be functional according to our work. The uORFs of three genes (AGE1, ...) are conserved and conform well to our rule set; those of another two (AMN1, ...) are conserved but get lower scores since they deviate too much from the optimal length. Of the remaining 19 genes, the uORFs are not conserved in other species (17 genes), no orthologues were found (IMD1), or no uORF was found at the indicated position (YNR034W-A).
Sequence properties of conserved uORFs
Having identified a large set (379) of uORFs predicted to have biological function, we analysed these for common properties. First, we noted that there is no correspondence between the reading frames of functional uORFs and the frames of either the main ORF or of other uORFs upstream of the same gene.
We noticed that a marked feature of uORFs with a high score and a high degree of conservation was a clear physical separation from other, low-scoring (and by inference spurious), uORFs. In our set of 252 genes, the average distance between a predicted functional uORF and another neighbouring functional uORF is 127 nt, whereas the average distance between a uORF predicted to be functional and its closest neighbouring non-functional uORF is 100 nt. A genome-wide investigation of all intergenic regions in S. cerevisiae of the average distance between neighbouring uORFs gave the value 79 nt. This indicates that functional uORFs are indeed characterised by having a wider uORF-free zone around them than spurious uORFs. Therefore, we decided to add this criterion to augment the process of ranking uORFs according to the likelihood of them having a functional role. From the group of 252 genes with high scores, we manually selected 32 cases (Fig. , set H) with the following properties: a) the uORF responsible for the high score given to that 5' flanking sequence was well separated from other uORFs with low scores, b) optimal distance from main ORF, c) optimal length. This conforms to the properties of the 9 + 16 (set C + D) conserved uORFs that we initially identified. Of these 34 uORFs from 32 genes, all 34 (100%) are conserved in S. paradoxus, 29 (83%) in S. bayanus, 23 (66%) in S. kudriavzevii, 14 (40%) in S. castellii, and 3 (9%) in S. kluyveri. In the S. mikatae genome sequence, syntenic homologues could be identified for only 16 out of the 32 genes, and all 16 of these (100%) had conserved uORFs. These 32 genes, shown in Table , represent the cases where we make the strongest prediction for the presence of functional uORFs with a regulatory role. The uORFs in this sub-group are better conserved than the average in the group comprising 252 genes that they were selected from. In this larger set, only 85% of uORFs were conserved in S. paradoxus, 43% in S. mikatae, 37% in S. bayanus, 37% in S. kudriavzevii, 20% in S. castellii, and 11% in S. kluyveri.
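The separation criterion described above amounts to measuring, for each uORF of interest, the gap to its nearest neighbouring uORF and averaging over the set. A minimal sketch follows, assuming uORFs are given as half-open `(start, end)` coordinate pairs on the intergenic sequence and that overlapping uORFs count as a gap of 0 nt; both conventions, and the function names, are illustrative assumptions rather than the definitions used in this work.

```python
def gap(a, b):
    """Number of nt separating two half-open intervals (0 if they overlap)."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def mean_nearest_gap(targets, neighbours):
    """Average gap from each target uORF to its closest neighbour.

    Targets with no neighbour at all are skipped; returns None if no
    target has a neighbour.
    """
    gaps = []
    for t in targets:
        candidates = [gap(t, n) for n in neighbours if n != t]
        if candidates:
            gaps.append(min(candidates))
    return sum(gaps) / len(gaps) if gaps else None


# Toy flank: one high-scoring uORF flanked by two low-scoring ones.
functional = [(100, 118)]
spurious = [(10, 25), (200, 230)]
separation = mean_nearest_gap(functional, spurious)
```

Passing the same list as both arguments gives the average nearest-neighbour distance within one class, which corresponds to the functional-to-functional and genome-wide comparisons.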
Table 3 32 newly identified genes with highly conserved uORFs strongly predicted by the rule set to be functional (marked in bold), with an optimal spacing to the main ORF and other uORFs. Numbering of uORFs is 3' to 5', as uORFs were found from intergenic sequences. ...
Since we used genomic DNA to derive the uORFs for this study, it is important to consider whether they lie within the transcribed region (5'-UTR) of the gene in question. We manually examined the position of the 34 top-scoring uORFs (set H) using data from the recently published high-density S. cerevisiae transcriptome map obtained from tiling arrays [20]. In 23 of the 34 cases, the genomic uORF was unambiguously placed within the transcribed region (the corresponding genes marked in bold in Table ), and in one additional case (SHO1), it is quite close to the predicted transcript start site. To determine to what extent genomic uORFs not predicted to be functional were transcribed, we picked 40 uORFs with the lowest score, on average located at the same distance from the start codon of the main ORF (250 nt) as the 34 uORFs in the top group in Table . In stark contrast, only 20% of these low-scoring uORFs were located within transcripts.
The A/T-rich sequence downstream of GCN4 uORF1 and the G/C-rich sequence downstream of GCN4 uORF4 have been proposed to be essential for their translational regulatory properties. Therefore, we also compared the G/C content of the 20 nt immediately upstream and downstream of all uORFs in the whole genome with those from the top-scoring 32 genes where uORFs in addition have an optimal distance to the main ORF and a clear separation between uORFs (Fig. , Table ). We found an average G/C content of 38.6% upstream of high-scoring uORFs (vs. 36.9% for all uORFs in the whole genome), and 36.9% downstream of uORFs (vs. 36.3% for all uORFs in the whole genome). We conclude that there is no significant deviation in G/C content from the genome average for sequences flanking functional uORFs.
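The flanking-sequence comparison above reduces to computing G/C content in fixed windows on either side of each uORF. A minimal sketch, assuming the 20-nt windows lie immediately outside the start and stop codons (the text does not say whether the stop codon is counted as part of the downstream window) and using hypothetical function names:

```python
def gc_fraction(seq):
    """Fraction of G and C in a sequence; None for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return None
    return (seq.count("G") + seq.count("C")) / len(seq)


def flank_gc(seq, start, end, width=20):
    """G/C content of the `width` nt upstream and downstream of a uORF.

    `start` and `end` are half-open coordinates of the uORF including its
    stop codon (an assumed convention). Windows are truncated at the
    sequence ends.
    """
    upstream = seq[max(0, start - width):start]
    downstream = seq[end:end + width]
    return gc_fraction(upstream), gc_fraction(downstream)


# Toy sequence: G/C-only upstream flank, A/T-only downstream flank.
toy = "GC" * 10 + "ATGAAATAA" + "AT" * 10
up_gc, down_gc = flank_gc(toy, 20, 29)
```

Averaging these per-uORF values over a gene set and over all uORFs in the genome gives the kind of comparison reported above.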
Finally, we examined the sets of genes carrying candidate functional uORFs found in this work for the predicted folding energies of their 5'-UTRs. It has been shown that 5'-UTRs generally are more weakly folded than bulk or randomised sequences, and that strongly translated mRNAs tend to be even less folded [21]. We found that the predicted folding energies of the 200 nt immediately preceding the AUG of the main ORF were weaker for the initial set of genes containing previously recognised functional uORFs than for the average gene (Table ). Interestingly, our newly found genes containing uORFs predicted here to be functional also have weaker folding energies in this region; most significantly for the 32 most highly ranked genes, and to a lesser extent also the larger set of 252 genes (Table ).
Calculated minimum free folding energy of the 200 nt immediately upstream of the start codon of different sets of uORF-containing genes.
Possible role of peptide product of predicted functional uORFs
We then wanted to estimate the prevalence among regulatory uORFs of mechanisms that depend on the encoded peptide. We reasoned that if the encoded peptide is relevant, this should be reflected by the absence of frameshift mutations (e.g. one +1 followed by a -1 frameshift, thus preserving the length of the uORF but altering the peptide sequence) and by a high ratio of synonymous to non-synonymous mutations (ds/dn), similar to other protein-coding sequences. Among the 34 uORFs we investigated (from the 32 genes in Table ), we found one case of frame-shifts within one uORF, namely YER118c in S. kudriavzevii. As a complementary approach, we calculated the ratio of synonymous to non-synonymous substitutions in uORFs by comparing the orthologous sequences of S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus. For the uORFs in Table , the ds/dn ratio calculated from a total of substitutions is 0.41. This is significantly lower than the average ds/dn value we determined from 3268 protein-coding sequences from the same species, namely 1.80.
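As an illustration of how substitutions in aligned uORFs can be sorted into synonymous and non-synonymous classes, the sketch below compares two gap-free, in-frame sequences codon by codon using the standard genetic code. It is a deliberate simplification and not the method used here: it returns raw counts only, ignores codons that differ at more than one position, and does not normalise by the number of synonymous and non-synonymous sites, as a proper ds/dn estimate would.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(ref, other):
    """Count (synonymous, non-synonymous) single-nucleotide codon changes.

    Both sequences must be in frame and aligned without gaps. Codons with
    ambiguous bases or more than one difference are skipped (simplification).
    """
    syn = nonsyn = 0
    for i in range(0, min(len(ref), len(other)) - 2, 3):
        a, b = ref[i:i + 3].upper(), other[i:i + 3].upper()
        if a == b or a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(a, b)) != 1:
            continue
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy pair: AAA->AAG is silent (Lys), GCT->GAT changes Ala to Asp.
counts = count_substitutions("ATGAAAGCTTAA", "ATGAAGGATTAA")
```

For uORFs of only a few codons, counts from a single uORF are too small to be informative, which is presumably why the ratio is pooled over all uORFs in the set.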
As a further estimation of the likelihood that uORFs encode a functional peptide, we compared the codon adaptation index (CAI; [22]) of the set of 252 conserved uORFs in additional file 2 (CAI = 0.151) with those of the entire group of 24449 uORFs (mostly non-functional; CAI = 0.149). This is to be contrasted with the indices for weakly (CAI = 0.19) and highly (CAI = 0.77) expressed protein-coding main ORFs [23]. There is thus no bias for a higher CAI in the conserved uORFs examined.
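For reference, the CAI of a sequence is the geometric mean of the relative adaptiveness values of its codons, where each value is the usage of that codon relative to the most used synonymous codon in a reference set of highly expressed genes. The sketch below shows the calculation only; the weight table is a hypothetical toy, not the S. cerevisiae reference values, and codons absent from the table (here including ATG and the stop codon) are skipped, which is an assumption about how such codons were handled.

```python
import math


def cai(orf, weights):
    """Codon adaptation index: geometric mean of per-codon weights.

    `weights` maps codon -> relative adaptiveness in (0, 1]. Codons not in
    `weights` are ignored. Returns None if no codon can be scored.
    """
    codons = [orf[i:i + 3].upper() for i in range(0, len(orf) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        return None
    return math.exp(sum(logs) / len(logs))


# Hypothetical weights for two amino acids, for illustration only.
toy_weights = {"AAA": 0.5, "AAG": 1.0, "GCT": 1.0, "GCC": 0.25}
toy_cai = cai("ATGAAAGCCTAA", toy_weights)
```

Because the index is a mean over codons, values for individual uORFs of 4 to 6 codons are dominated by one or two codons, so only set-wide averages such as those quoted above are meaningful.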
The sequences around the start codon that promote efficient translation are much less frequent in uORFs than in main ORFs [24]. In accordance with this, we did not find good fits to the consensus found for S. cerevisiae in most high-scoring uORFs. For the positions with the greatest impact on translational efficiency, the base frequencies as calculated from the set of 252 genes were not significantly different from bulk DNA: at -3, 35% A, 16% C, 20% G, 29% T; at +4, 32% A, 22% C, 17% G, 29% T.
Biological context of genes with predicted functional uORFs
In order to identify any common denominator for the biological function of these 252 genes, we performed a Gene Ontology (GO) term analysis at SGD. There was no single term unifying the majority of the genes; however, there was a moderate overrepresentation of genes with the function "transcription regulator activity" (9.6% vs. 4.4% in the whole genome; P = 3.1 × 10⁻⁴); see Table . There was also an overrepresentation of the cellular process "development" (10.4% vs. 5.4%; P = 10⁻³). The genes associated with "development" are mainly involved in establishment of cell polarity and sporulation. Related to this, we also noted an overrepresentation of genes with a role in pseudohyphal growth (2.4% vs. 0.6%; P = 7 × 10⁻³), even though this category is not classified under "development" in GO. Most of the genes for pseudohyphal growth are also included in one of the other categories (cell polarity, transcription); see Table .
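Overrepresentation of a GO term in a gene list is commonly assessed with a one-sided hypergeometric test. The sketch below shows that calculation as an assumption about the general approach; the text does not state which statistic the SGD tool applied, and the function name and the toy counts are hypothetical, so the P values reported above are not reproduced here.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N = genes in the genome, K = genes annotated with the term,
    n = genes in the list, k = listed genes annotated with the term.
    Exact integer arithmetic is used before the final division.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Toy numbers: 10 genes, 4 annotated, list of 3 containing 2 annotated genes.
p_toy = hypergeom_enrichment_p(2, 3, 4, 10)
```

When many GO terms are tested at once, a multiple-testing correction would normally be applied to the resulting P values; whether one was applied here is not stated.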
Major functional classes for genes that harbour conserved uORFs predicted to play a regulatory role (Fig. 2, set G).