We present the first comprehensive genome-wide analysis of Pol III-dependent genes in ten eukaryotes (nine hemiascomycetes and the archiascomycete S.pombe). This exhaustive analysis yielded several original observations. First, it revealed unexpected decoding features. Yeasts close to S.cerevisiae follow the bacterial sparing rules to decode Leu CUN and Arg CGN codons. Such changes, which are unique among eukaryotes, can be precisely dated on the phylogeny of hemiascomycetes. As shown in , the most ancient switch appears to be the change in decoding of Arg CGN codons from the regular eukaryotic mode to a bacterial-type mode (node #1). The change in the genetic code that reassigned the CUG codon to Ser occurred later, in the branch leading to the Candida genus (D.hansenii and C.albicans, node #2). Independently, in another branch leading to other hemiascomycetes, including S.cerevisiae, the decoding of Leu CUN codons switched from the eukaryotic to the bacterial mode (G34- to A34-sparing, node #3). Remarkably, S.castellii has reverted to the usual eukaryotic G34-sparing (node #5). The capture of tDNA-Asp leading to a novel tDNA-Arg (CCG) also appears to be concomitant with the events occurring at node #3. Finally, the loss of tDNA-Leu (CAG) seems to have occurred several times independently (in these cases, the CUG codon is read by tRNA-Leu (UAG)).
The large size of the collection of tDNA sequences originating from a single eukaryotic phylum allows extensive comparisons both between orthologous genes (i.e. between yeast species) and between paralogous genes within each species. For a given tDNA species (given anticodon), the large variation in the number of gene copies is particularly remarkable [e.g. 1–27 copies for tDNA-Glu (CTC)]. This variation in number is at least partly correlated with variation in codon usage between yeast species. It is also remarkable that within a yeast species, the various gene copies are always identical or nearly so. Moreover, specific deviations from the eukaryotic cloverleaf model apply to all gene copies within a genome. For example, the tertiary base pair T15A48 present in all five tDNA-Phe (TGG) in C.albicans replaces the usual R15Y48 pair; G21 is found instead of the universal A21 in all three tDNA-Met (CAT) in S.pombe; and the A53T61 pair, which forms the outer bases of the B-box, is replaced by G53C61 in all nine copies of tDNA-Ala (AGC) in S.pombe. This suggests a specific role for such deviations and also the existence of a surveillance mechanism that permanently homogenizes the different copies of the same tDNA (same anticodon) within each species.
The sequence homogeneity between orthologous tDNAs (tDNAs coding for the same amino acid in different genomes) contrasts with the sequence divergence between paralogous tDNAs (tDNAs specifying different amino acids within the same genome), as shown by our p-distance analysis. Note that a similar histogram of distances () was already reported several years ago with a much more limited tDNA set, insufficient for phylogenetic analysis (86, 95). With our new dataset, which includes ~600 different tDNA sequences, single clustering of orthologous tDNAs was observed for most amino acids, with the sole exception of tDNAs from S.pombe, offering the opportunity to examine the significance of the exceptions to this rule. A first exception is the close relation between the tDNA-Arg (CCG) and the tDNA-Asp (GTC) in yeasts close to S.cerevisiae (87). However, the origin of the tDNA-Arg (CCG) in the two related genomes D.hansenii and C.albicans remains unclear. While the tDNA-Arg (CCG) of D.hansenii clusters with the other tDNA-Arg within the main tDNA-Arg cluster, that of C.albicans falls into the extra cluster defined by five other tDNA-Arg (CCG) (). For the time being, it seems reasonable to conclude that the tDNA-Arg (CCG) from D.hansenii is a regular tDNA-Arg, not derived from a tDNA-Asp (GTC) ancestor, and that this is also the case for C.albicans. Nevertheless, the emergence of the tDNA-Arg (CCG) (, node #3) is complex, and detailed analyses of more genomes are necessary to clarify its origin in the different organisms, including hemiascomycetes. The second exception is the intriguing clustering of the tDNA-Met (CAT) from Y.lipolytica within the Thr cluster, which suggests a possible case of capture (, node #6). Here again, more genomes (close to Y.lipolytica) will be needed to reach an unambiguous conclusion.
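For readers unfamiliar with the measure, the p-distance used above is simply the proportion of aligned sites at which two sequences differ. The sketch below is only an illustration and not the pipeline used in this work: the function name `p_distance`, the exclusion of gapped sites (pairwise deletion) and the toy sequences are all assumptions.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites with a gap in either sequence are skipped
    (pairwise deletion); the text does not state how gaps were handled.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip sites that cannot be compared
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


# Toy aligned fragments (hypothetical, not taken from the dataset).
toy_a = "GCCGTGATAG"
toy_b = "GCCGTAATGG"
print(p_distance(toy_a, toy_b))
```

Under this definition, low values are expected between orthologous tDNAs and higher values between paralogous tDNAs, which is the contrast the histogram of distances displays.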
Prior to this work, the definition of the promoter elements in the A-box recognized by TFIIIC was uncertain. We used the most representative class of Pol III genes, the tRNA genes, which always amount to more than 41 different types of genes and more than 100 gene copies per genome (up to 500), to extract the genomic signatures of the A- and B-boxes. These short sequence elements were then searched for and retrieved in four other Pol III ncRNAs from the ten genomes (except two cases of probable Pol II transcribed genes in S.pombe). Examination of the 39 A- and B-box sequences () shows that the consensus signatures are indeed always found at appropriate locations, with a few sequence exceptions.
Directed mutagenesis experiments have established that the B-box is the most critical region for TFIIIC binding and that the interaction between the A-box and TFIIIC is less important to the stability of the DNA–TFIIIC complex (96). Among the 2nd, 4th and 5th positions of the B-box (equivalent to positions 54, 56 and 57 of the tRNA), the 4th position, always occupied by a C, is the most critical, and its replacement by G lowers the in vitro binding affinity of TFIIIC by 370-fold (96). Only one of the 39 B-boxes diverges at the 4th position (a T is present instead of a C in the SNR6 gene of C.albicans). In accordance with the less prominent role of the A-box, sequence deviations were more numerous in this element. Nevertheless, A-boxes were always located no more than 21 nt away from the 5′ end of the mature products, which fits with a distance of about 25 nt between the A-box and the start of transcription. The shortest A-B distance observed (24 nt) is greater than the minimal distance experimentally determined for the correct binding of TFIIIC (21 nt) (97). In 35 of the 39 cases, the terminator (poly-T) is followed by A or G, which is indicative of efficient Pol III termination (15).
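The terminator criterion above (a poly-T stretch followed by A or G) can be checked on a downstream sequence along the following lines. This is a minimal sketch under stated assumptions: the helper names `terminator_context` and `followed_by_purine` are hypothetical, the input is taken to be the coding-strand DNA downstream of the mature product, and the minimum run length of five T residues is an illustrative default that is not specified in the text.

```python
import re
from typing import Optional, Tuple


def terminator_context(
    downstream: str, min_t: int = 5
) -> Optional[Tuple[int, Optional[str]]]:
    """Locate the first run of at least `min_t` T residues.

    Returns (run_length, following_base), where following_base is None
    if the run reaches the end of the sequence, or None if no run is found.
    Assumption: min_t = 5 is an illustrative threshold only.
    """
    # The greedy quantifier consumes the whole T run, so the optional
    # next character is the first non-T base after it (if any).
    match = re.search(r"(T{%d,})(.?)" % min_t, downstream.upper())
    if match is None:
        return None
    t_run, following = match.group(1), match.group(2)
    return len(t_run), (following or None)


def followed_by_purine(downstream: str, min_t: int = 5) -> bool:
    """True if the first qualifying T run is followed by A or G."""
    context = terminator_context(downstream, min_t)
    return context is not None and context[1] in ("A", "G")


# Hypothetical downstream sequence, for illustration only.
example = "GCA" + "T" * 6 + "AGC"
print(terminator_context(example), followed_by_purine(example))
```

Applied to each of the 39 genes, such a check would give the kind of tally reported above, although the exact count depends on the chosen minimum run length.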
In contrast to the high conservation of the A- and B-promoter elements throughout the ten genomes, their locations are highly variable, depending on the gene and on the genome. For example, the RPR1 B-box, which is external to the mature product in the yeasts from S.cerevisiae to K.waltii, becomes internal in C.albicans and Y.lipolytica. This illustrates the adaptability of the Pol III transcription machinery, which can overcome the additional constraints exerted on an internal B-box at the RNA level. In these ten genomes, cases of dicistronic Pol III genes (98) were searched for, but none except the tDNA pairs were found. Preliminary searches for tDNA pairs in higher eukaryotes were also unsuccessful, suggesting that this type of organization and the mechanism that maintains species-specific pairs are restricted to yeasts.