SD sequences in bacterial and archaeal genomes.
We defined SD% as the fraction of genes in a given group that possesses an SD sequence. For each genome, Table reports the G+C content, the count of genes encoding products of at least 100 amino acids in length, the anti-SD sequence at the 3′ end of the 16S rRNA sequence, the SD% of all the genes in the genome (≥100 amino acids), and the optimal aligned spacings (OAS) for the SD sequences (discussed below). In bacterial genomes, the anti-SD sequence is AUCACCUCCUUU, although the archaeal genomes show some variation in their anti-SD sequences around the conserved core CCUCC (Table ).
Using the free-energy method and a cutoff value of −4.4 kcal/mol, all the SD sequences detected were at least 4 bases in length, and most harbored the motif GGAG, GAGG, or AGGA (e.g., 88% of the SD sequences in Escherichia coli K-12). In some natural mRNAs an SD sequence can consist of a weaker motif, e.g., AAGG, with a ΔGSD of −2.9 kcal/mol (57). For our purposes we prefer to find only unambiguous SD sequences. In terms of base-pairing potential with the anti-SD sequence, the SD sequences defined by our method may be considered strong SD sequences. Most of them are present at an aligned spacing of between 5 and 13 bases, as verified by histograms of spacings of all the SD sequences in a genome (Fig. and C; also see Supplementary Data Fig. S-2). An SD sequence at this range of spacings has been established to be effective (7
Of the 30 genomes, 22 had an SD% exceeding 40% for all genes. Bacillus subtilis and Thermotoga maritima registered the highest SD%, 89.4% and 90.1%, respectively. The lowest genome SD% occurred for Rickettsia prowazekii, Mycoplasma genitalium, Mycoplasma pneumoniae, Halobacterium sp. strain NRC-1, Thermoplasma acidophilum, Sulfolobus solfataricus, and Pseudomonas aeruginosa, each at around 20%. In general, fast-growing bacteria, gram-negative thermophiles, spirochetes, methanogens, and hyperthermophilic archaea achieved relatively high SD%, while obligate intracellular parasites, surface parasites, pathogens, and cyanobacteria had diminished genome SD%.
We carried out a simulation study to determine whether these SD% values represent real DNA elements or just random motifs. For each genome, we generated 100 (1,000 for Escherichia coli K-12) data sets of random sequences 20 nucleotides long according to the base composition of the original 20-nucleotide 5′ end sequence data set, each with the same number of sequences as in the given genome. SD sequences were detected and SD% was calculated for each set of these random sequences. The SD% values shown in Table were found to represent real motifs in all the genomes except for Mycoplasma genitalium and Halobacterium sp. strain NRC-1, as assessed by distributions of the SD% for these simulated data sets (the probability of these SD% values coming from random sequences was <0.01).
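The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function names are hypothetical, a simple motif search for GGAG, GAGG, or AGGA stands in for the free-energy SD detector, and the empirical probability with an add-one correction is an assumed way of summarizing the simulated distribution.

```python
import random

# Assumption: a crude motif test stands in for the free-energy SD detector.
CORE_MOTIFS = ("GGAG", "GAGG", "AGGA")


def has_sd(seq):
    """Return True if the sequence contains any of the core SD motifs."""
    return any(motif in seq for motif in CORE_MOTIFS)


def sd_percent(seqs):
    """Percentage of sequences that carry an SD-like motif."""
    return 100.0 * sum(has_sd(s) for s in seqs) / len(seqs)


def base_composition(seqs):
    """Pooled frequencies of A, C, G, and U over all sequences."""
    counts = {base: 0 for base in "ACGU"}
    for s in seqs:
        for base in s:
            if base in counts:
                counts[base] += 1
    total = sum(counts.values())
    return {base: c / total for base, c in counts.items()}


def simulate_sd_percent(seqs, n_sets=100, length=20, seed=0):
    """SD% of n_sets random data sets, each with len(seqs) sequences
    drawn according to the base composition of the observed set."""
    rng = random.Random(seed)
    comp = base_composition(seqs)
    bases = list(comp)
    weights = [comp[b] for b in bases]
    results = []
    for _ in range(n_sets):
        random_set = [
            "".join(rng.choices(bases, weights=weights, k=length))
            for _ in seqs
        ]
        results.append(sd_percent(random_set))
    return results


def empirical_p(observed, simulated):
    """Share of simulated SD% values at least as large as the observed
    one, with an add-one correction (an assumed summary statistic)."""
    hits = sum(v >= observed for v in simulated)
    return (hits + 1) / (len(simulated) + 1)
```

With 100 simulated sets, the smallest value this estimate can take is 1/101, so an observed SD% that exceeds every simulated value would fall just under the 0.01 threshold quoted above.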
Correlation between SD presence and predicted gene expression levels.
It is known that not all genes contain an SD sequence. In some genomes, the majority of genes do not have such a motif (Table ). Although an SD sequence is not compulsory for the translation of many genes (21), it may still be effective for genes that contain such a motif. This raises the question of how the SD sequences are distributed in different gene classes.
First we examined SD sequences for the RP genes. Primarily highly expressed during fast growth, the RP gene class showed a very high SD%, around 80% in most genomes (Table ). Even for genomes with a low overall SD%, the RP SD% was significantly high. For example, the SD% was 85.7% for RP genes in Thermoplasma acidophilum (23.5% for the genome) and 58.5% for RP in Sulfolobus solfataricus (23.0% for the genome). This is consistent with a greater SD presence for highly expressed genes.
We then divided the genes of a genome (≥100 codons) into three classes, PHX, PA, and PMX, based on codon usage biases (22). The percentage of PHX genes in different genomes ranged from 2% to 19%, whereas PA genes ranged from 0 to 13% (Table ). PMX genes constituted the bulk of a genome and consisted mostly of average genes. The major PHX genes were RP, TF, and CH genes. Other PHX genes included those encoding enzymes of essential energy metabolism pathways and the principal genes of amino acid and nucleotide biosyntheses (22). Our results on PHX agree well with two-dimensional gel experimental assessments in several prokaryotes (1). The PHX genes in most of the 30 genomes carried a significantly higher SD% than PMX genes. PA genes generally showed an SD% about the same as or less than that of the PMX genes (Table ). Since PA genes are largely composed of putative lateral transfer genes, they tend to have low expression levels (28
To verify the positive correlation of SD presence and gene expression levels, we applied logistic regression analysis. The regression coefficient β and its estimated standard error for each genome are given in Table . All but six genomes (Borrelia burgdorferi, Bacillus subtilis, Mycoplasma genitalium, Methanococcus jannaschii, Halobacterium sp. strain NRC-1, and Pyrobaculum aerophilum) recorded a significant positive correlation between SD presence and E(g) values (P < 0.01 for a likelihood ratio test of the regression). For the genomes of Borrelia burgdorferi, Methanococcus jannaschii, and Halobacterium sp. strain NRC-1, the P value for the likelihood ratio test was between 0.05 and 0.1, indicating a suggestive, though not statistically significant, correlation. Of the three genomes that did not record a significant correlation (P > 0.1), Mycoplasma genitalium had the lowest SD% (10.8%); Bacillus subtilis was among the highest in SD%; and Pyrobaculum aerophilum was low at about 23% (Tables and ).
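For readers who wish to reproduce this kind of test, the following is a minimal sketch of a single-predictor logistic regression with a likelihood ratio test, written with only the Python standard library. It assumes that SD presence (0 or 1) is regressed on E(g) with an intercept and that the test statistic is referred to a chi-square distribution with 1 degree of freedom; the function names are hypothetical and the fitting routine (plain Newton-Raphson) is not necessarily the one used for the results above. It will not converge on perfectly separated data.

```python
import math


def _score_info_loglik(a, b, x, y):
    """Gradient, information matrix entries, and log-likelihood of the
    model P(y = 1) = 1 / (1 + exp(-(a + b * x)))."""
    g0 = g1 = h00 = h01 = h11 = ll = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
        ll += math.log(p) if yi else math.log(1.0 - p)
    return (g0, g1), (h00, h01, h11), ll


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson fit; returns (intercept, beta, se_beta, loglik)."""
    a = b = 0.0
    for _ in range(max_iter):
        (g0, g1), (h00, h01, h11), _ll = _score_info_loglik(a, b, x, y)
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    _, (h00, h01, h11), ll = _score_info_loglik(a, b, x, y)
    # Standard error of beta from the inverse information matrix.
    se_b = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    return a, b, se_b, ll


def likelihood_ratio_p(x, y):
    """Return (beta, se_beta, P value) for the test of beta = 0."""
    _, b, se_b, ll1 = fit_logistic(x, y)
    p0 = sum(y) / len(y)
    ll0 = sum(math.log(p0) if yi else math.log(1.0 - p0) for yi in y)
    stat = max(0.0, 2.0 * (ll1 - ll0))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return b, se_b, math.erfc(math.sqrt(stat / 2.0))
```

Here `x` would hold the predicted expression level E(g) of each gene and `y` a 0/1 flag for SD presence; a positive β with a small P value corresponds to the significant positive correlations reported for most genomes.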
Since all the data sets used were original genome annotations, a reasonable concern was that incorrect annotations of the gene start sites may have affected the accuracy of our SD analysis. To better determine how the genome data would compare to more reliable data sets, we analyzed the SD% for genes from several human-curated Escherichia coli K-12 data sets and obtained very similar results, as shown in Table . The data sets on essentiality were from the Profiling of E. coli Chromosome (PEC) database (http://www.shigen.nig.ac.jp/ecoli/pec/). The PEC data set classifies all E. coli genes into three groups, mainly using information from the literature: genes essential for cell growth (“essential”; total of 191 genes), those dispensable for cell growth (“nonessential”), and those whose essentiality is unknown (“unknown”). The “verified” (total, 656 genes) data set was extracted from EcoMap12 (http://bmb.med.miami.edu/EcoGene/EcoWeb/), which consists of genes whose starts have been confirmed by N-terminal protein sequencing (41). There are 65 genes in the verified set whose start sites were incorrectly annotated in the NCBI genome (4), giving an accuracy of about 90% for start site annotation, which is consistent with the average accuracy estimated for various gene-finding programs (25
Naturally, both the essential and the verified data sets have much higher fractions of PHX genes, and thus higher overall SD%, than do the other data sets (Table ). However, the PHX genes in all these data sets registered an even higher SD% than the PMX or PA genes (Table ). Furthermore, the collection of 591 correctly annotated genes in the verified set displayed a significant positive correlation between SD presence and predicted expression levels by logistic regression analysis (β = 0.62, standard error = 0.22; P < 0.005).
To further reduce potential errors caused by annotation inaccuracies, we compiled a “single-start genes” data set for each genome, which consists of genes with a single start codon (AUG, GUG, or UUG as the first codon) within 90 nucleotides of their annotations. Of the 65 wrongly annotated genes in the E. coli “verified” data set, the correct start was found within 30 codons of the annotations for 54 (83%). Therefore, the single-start genes may have a chance of <0.02 of being wrongly annotated if the error rate for the genome annotations is 10%, or only about 0.04 if the error rate reaches as high as 25% in certain genomes, as estimated by some authors (3). In general, these genes constitute about 26% of a genome (29% PHX genes, 25% PMX, and 26% PA; see Supplementary Data Table S-2). Compared to the whole-genome data, they registered highly comparable SD% for the three gene classes PHX, PMX, and PA, indicating that the inaccuracies in start site prediction could only slightly affect the validity of our results obtained from genome annotations (see Supplementary Data Table S-2).
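The figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes one particular reading of the argument: a single-start gene has no alternative start codon within 30 codons, so its annotation can be wrong only if the true start lies farther away, and the share of such cases among wrong annotations is taken from the E. coli verified set (11 of 65). The variable and function names are illustrative only.

```python
# Counts taken from the text.
VERIFIED_TOTAL = 656      # genes with N-terminally confirmed starts
WRONG_TOTAL = 65          # of these, wrongly annotated in the genome
WRONG_WITHIN_30 = 54      # wrong starts whose true start is within 30 codons

# Start site annotation accuracy in the verified set (about 90%).
accuracy = 1 - WRONG_TOTAL / VERIFIED_TOTAL

# Share of wrong annotations whose true start is beyond 30 codons.
frac_beyond = 1 - WRONG_WITHIN_30 / WRONG_TOTAL


def single_start_error(genome_error_rate):
    """Assumed chance that a single-start gene is wrongly annotated."""
    return genome_error_rate * frac_beyond


print(f"accuracy: {accuracy:.3f}")
print(f"error at 10%: {single_start_error(0.10):.3f}")
print(f"error at 25%: {single_start_error(0.25):.3f}")
```

Under these assumptions the chance is roughly 0.017 for a 10% annotation error rate and roughly 0.042 for a 25% error rate, in line with the values of <0.02 and about 0.04 given above.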
There was also evidence suggesting that wrong starts are likely to be distributed evenly among the different classes of genes (PHX, PMX, and PA) that we used and thus would not significantly affect our comparisons of SD presence between PHX and PMX or PA gene classes. Of the 65 E. coli genes with incorrect starts mentioned above, 20% were PHX, 77% were PMX, and 3% were PA, indicating that incorrect annotations do not tend to bias strongly toward PMX or PA genes.
Taken together, our results on the correlation of SD presence and predicted expression levels have been verified by both human-curated E. coli data sets and the high-quality single-start gene data sets. The validity of the results holds despite the existence of a few incorrectly predicted gene start sites in the genome data.
It is also evident that the increased SD% for PHX genes is not due solely to the presence of RP genes, as shown in Table for Escherichia coli K-12. The collection of PHX genes, excluding RP genes, achieved an SD% similar to that of the complete PHX class for the verified, essential, and whole-genome data sets (Table ).
The results corroborate our assignment of genes as PHX based on codon usage, even in the many prokaryotes for which little direct information on protein abundances is available. Although many factors affect protein abundances, a high rate of translational initiation is essential to achieve a high level of expression and is the factor most simply observed by genome analysis.
SD sequences for PHX and PMX genes.
We also tried to determine whether the SD sequences of RP and PHX genes are stronger than those of PMX genes in terms of base-pairing potential with the anti-SD sequence and with respect to their aligned spacings, which reflect the two major determinants of the strength of an SD sequence (17). Ringquist et al. (34) showed experimentally that the SD sequence UAAGGAGG is about fourfold more effective than AAGGA. The former SD has a ΔGSD of −12 kcal/mol, while the latter has a ΔGSD of −5.3 kcal/mol. Spacing has a substantial effect only when the SD sequence is short (17). Experimental evidence demonstrated that an aligned spacing of 8 to 10 bases is optimal for E. coli
We first determined the OAS for each genome based on the distribution of SD spacings for all the genes in general and the PHX and RP gene classes in particular. The genomes of Escherichia coli K-12 and Pyrococcus abyssi are shown as two examples in Fig. . The OAS are 7, 8, and 9 bases for Escherichia coli K-12 and 9, 10, and 11 bases for Pyrococcus abyssi (Fig. ). Notably, 6, 7, and 8 bases are the most occupied SD spacings for PMX genes from Escherichia coli K-12, whereas 7, 8, and 9 bases are preferred by PHX and RP genes (Fig. ). In fact, no SD sequence for the Escherichia coli K-12 RP genes occurs at an aligned spacing of 6 bases.
Assuming that the SD sequences for RP genes are the closest to optimal, the three aligned spacings of 7, 8, and 9 bases were chosen as the OAS for SD sequences in Escherichia coli K-12. These OAS agree well with experimental evidence that 8 to 10 bases are optimal for SD sequences in Escherichia coli K-12 genes (7). These results also indicate that SD sequences for PHX genes may have a distribution closer to the actual optimal spacings than those for PMX genes.
For the genomes of Haemophilus influenzae, Vibrio cholerae, Campylobacter jejuni, Helicobacter pylori 26695, Chlamydophila pneumoniae, and Chlamydia trachomatis, the OAS were determined in a way similar to that used for Escherichia coli K-12. In other genomes, the OAS were aligned spacings occupied by the largest fraction of SD sequences for both PHX and PMX genes, e.g., for Pyrococcus abyssi (Fig. ; see also Supplementary Data Fig. S-2). However, the SD sequences in the genomes of Mycoplasma genitalium and Pyrobaculum aerophilum were spread to all positions. Their OAS were chosen in the same way but may not represent optimal spacings (see Supplementary Data Fig. S-3).
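One way to picture the choice of OAS is as a search for the three consecutive aligned spacings that together hold the largest share of SD sequences. The sketch below implements that reading; it is an assumption about the selection rule rather than a description of the exact procedure, and the function name is hypothetical. For genomes handled like Escherichia coli K-12, the input would be the spacings of the RP gene SD sequences; for the others, the spacings pooled over PHX and PMX genes.

```python
from collections import Counter


def optimal_aligned_spacings(spacings, width=3):
    """Return (window, fraction): the `width` consecutive spacings that
    hold the most SD sequences, and the fraction of SD sequences that
    fall in that window (the OAS% as a fraction)."""
    counts = Counter(spacings)
    lo, hi = min(counts), max(counts)
    best_start, best_total = lo, -1
    # Slide a window of consecutive spacings across the observed range.
    for start in range(lo, max(lo, hi - width + 1) + 1):
        total = sum(counts.get(start + k, 0) for k in range(width))
        if total > best_total:
            best_start, best_total = start, total
    window = list(range(best_start, best_start + width))
    return window, best_total / len(spacings)
```

When the SD sequences are spread over all positions, as noted for Mycoplasma genitalium and Pyrobaculum aerophilum, the window returned by such a rule is still well defined but its fraction is low, which is one way to see why those OAS may not represent true optima.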
Table displays the OAS for each genome. In general, bacterial genomes attain similar OAS, with position 8 being the most common optimal spacing. Archaeal genomes show a preference for OAS about 2 bases longer than that of most bacterial genomes, usually at positions of 9 to 11 bases (Table , Fig. ).
We display in Fig. for each genome the mean ΔGSD of the SD sequences and the frequencies of the SD sequences at the OAS (designated OAS%) in RP, PHX, and PMX genes. The mean ΔGSD indicates the average affinity of the SD sequences for a given gene class. The 30 genomes are divided into three groups (Fig. ). The first group consists of the proteobacteria. Their SD sequences were among the weakest, with a mean ΔGSD of −6.5 kcal/mol, and about 50% to 70% occurred at the OAS. The most common SD sequence for these genomes was AGGAG (ΔGSD = −6.5 kcal/mol). In comparison, AGGAGG had a ΔGSD of −9.8 kcal/mol. It is also noteworthy that these genomes were highly similar in SD sequences for all three classes of genes (Fig. ).
FIG. 2. SD sequences for RP, PHX, and PMX gene classes. (A) The y axis, OAS%, is the fraction of SD sequences present at the three OAS (given in Table ) for each gene class. * indicates genomes where the OAS% for RP is significantly higher ...
The second group included the other bacteria except Aquifex aeolicus and Thermotoga maritima. The SD sequences in these genomes were more variable in the mean ΔGSD and with an OAS% of around 40%. Bacillus subtilis was the only genome in this cluster to have very strong SD sequences (lower mean ΔGSD).
The third group consisted of Aquifex aeolicus, Thermotoga maritima, and all the archaea. The SD sequences in this cluster were the strongest, except for the genomes with a very low genome SD% (Halobacterium sp. strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum). In Bacillus subtilis, Aquifex aeolicus, Thermotoga maritima, and the euryarchaea, the SD sequences for RP genes were significantly higher in OAS% and significantly lower in mean ΔGSD than the PMX genes. This was mostly valid also for the PHX gene classes in these genomes (Fig. ). In particular, Bacillus subtilis did not show a significant correlation between SD presence and predicted expression levels (Table ), but the SD sequences for its PHX genes did tend to be stronger than those of its PMX genes in both ΔGSD and OAS% (Fig. ). In contrast, the genomes of Mycoplasma genitalium and Pyrobaculum aerophilum appeared to have SD sequences that were weak and not at optimal spacings, even for the PHX and RP genes (Fig. ). The SD sequences in these genomes may not play any significant role in translation initiation as in other genomes, which is also implied by the logistic regression analysis (Table ; see below).
It was previously suggested that there is no direct correlation between the affinity of the SD sequence for the anti-SD sequence and the efficiency of initiation complex formation under certain experimental conditions (10). An SD interaction that involves the center of the anti-SD sequence, CCUCC, may be more efficient in facilitating translation initiation than when it involves off-center sequences (24). This could explain the results of Ringquist et al. (34) and also the twofold-higher yields for GAGGU (ΔGSD = −6.6 kcal/mol) than for UAAGG (−4.2 kcal/mol) found by Chen et al. (7). Not coincidentally, the core anti-SD sequence CCUCC provides the greatest contribution to ΔGSD, as a G:C pair is more stable than an A:U pair.
Since a majority of the SD sequences that we detected involved interaction with the core anti-SD sequence, it might be reasonable to speculate that a lower mean ΔGSD indeed signifies a higher efficiency for SD sequences of PHX genes. We also found that, in Escherichia coli K-12, SD sequences for PHX genes had a higher frequency of GGAG and GAGG (24.7%) and a lower frequency of AGGA (5.0%) than the PMX genes (16.7% and 7.8%, respectively). These three SD sequences had the same ΔGSD of −4.4 kcal/mol, but AGGA was apparently a weaker SD sequence than the other two. In fact, 72% of all the SD sequences for PHX genes in Escherichia coli K-12 harbored the core SD motif GGAG or GAGG, compared to 62% for PMX genes. This trend appears to be valid for most genomes, even those for which no significant decreases in the mean ΔGSD were found for PHX genes versus PMX genes, e.g., proteobacterial genomes (see Supplementary Data Fig. S-4). Therefore, it appears that PHX genes tend to have an SD sequence that has higher affinity to the anti-SD sequence, occurs at a more optimal spacing, and involves interaction with the core anti-SD region. Such an SD sequence is very likely to have a higher efficiency in translation initiation.
Variation of SD% for different functional gene classes.
We also tried to find out whether SD presence is correlated with certain gene classes by assessing the SD% for different functional classes defined in the Cluster of Orthologous Groups (COG) database (50). The two COG categories that are persistently highest in SD% are J (translation, ribosome structure, and biogenesis) and C (energy production and conversion) (see Supplementary Data Table S-3), consistent with the recognition that most genes in these groups are PHX (22). In contrast, the COG categories with low SD% include L (DNA replication, recombination and repair), M (cell envelope biogenesis, outer membrane), and I (lipid metabolism) (see Supplementary Data Table S-3). Genes in these classes usually attain the expression levels of PMX genes (22). Thus, variations in SD% for different COG classes seem to reflect an association with the expression levels of the genes in the class.
Relationship between SD presence and start codon.
Most genes rely on AUG as a start codon, while GUG and UUG are used sparsely (Table ). Moreover, genes with an AUG start codon tend to have a higher SD% than genes with either GUG or UUG. The increase was significant in 12 genomes and most pronounced in the five euryarchaeal genomes with SD% exceeding 40% (Table ).
SD% for genes with different start codons
Considering that AUG is a more potent initiator than GUG and UUG (34), the weak start codons GUG and UUG, in conjunction with lack of an SD sequence, might substantially reduce the expression level of a given gene. For example, of the 449 genes in Escherichia coli K-12 that start with either GUG or UUG and do not have an SD sequence, 228 are annotated as “orf, hypothetical protein” and many others encode “putative” proteins. Only 2 of these 228 open reading frames (ORFs) are PHX, whereas 21 are PA. Many of these ORFs may code for low-expression proteins or may be wrongly annotated. For these genes, a strong SD sequence could compensate for their weak start codons, especially UUG, as shown in laboratory manipulations (10). Our results suggest that the SD sequence might work in concert with the start codons as part of an elaborate regulatory system for gene expression to maintain different expression levels for different genes.
We have shown that SD presence is significantly correlated with predicted gene expression levels in most prokaryotic genomes. In particular, the RP genes and more generally the PHX genes display a higher SD% than the PMX genes (i.e., the average genes). Also, in some genomes the SD sequences of RP and PHX genes are closer to optimal in both base-pairing potential with the anti-SD sequence and spacing to the start codon (Fig. ). This provides further evidence that the SD sequence is important in translation of these genes. A strong SD sequence may also work together with other features of the highly expressed genes, e.g., the stronger start codon AUG and favorable secondary structure around the translation initiation region (16), that improve the efficiency of translation initiation.
Relationship between SD presence and distance between successive genes.
The intergenic distance (Dg) is another important feature of prokaryotic genes that might correlate with the SD presence. For ease of discussion, we define the Dg of gene g as the distance (in base pairs) from the start codon of g to the end of its immediate upstream gene in the same orientation. Negative values of Dg signify genes that overlap their immediate upstream genes. In most genomes, the most prevalent value of Dg is −4 bp (the junction is always AUGA; also see reference 38), which is observed for 7.8% of genes on average and for as many as 18% in Thermotoga maritima.
The median Dg in a genome varies from 9 bp for Campylobacter jejuni and 11 bp for both Thermotoga maritima and Mycoplasma genitalium to 187 bp for Methanococcus jannaschii and 201 bp for Halobacterium sp. strain NRC-1 (see Supplementary Data Table S-4). In most archaeal genomes, the SD% for genes with a Dg of −4 bp is markedly higher than the SD% for all the other genes, at a level comparable to the SD% of the RP genes. In contrast, many genomes recorded a reduced SD% for the collection of genes with a Dg of >20 bp, compared to genes with a Dg of <20 bp. This holds especially for the archaeal genomes, all of which showed this reduction (see Supplementary Data Table S-4).
We then assessed SD% for genes with different Dg ranges. Since the SD% does not show much variation among the groups with a Dg of greater than 30 bp, we focused on genes with a Dg of below 30 bp, which on average constitute 35% of a genome. We divided all the genes in a genome into seven Dg groups: genes with a Dg below −20 bp; five groups with a Dg of from −20 to 30 bp, with 10-bp intervals; and genes with a Dg exceeding 30 bp (see Supplementary Data Table S-5). In most genomes, each group contained more than 30 genes. The gene group with a Dg of −10 to 0 bp was the largest among the five groups of 10-bp intervals. Figure shows the SD% for these Dg groups.
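The definition of Dg and the seven-way grouping can be sketched as follows. The coordinate convention (1-based, inclusive, forward strand) and the treatment of the group boundaries (each 10-bp interval closed on the left, with 30 bp placed in the last group) are assumptions made for this illustration, since the text does not specify them; the function names are hypothetical.

```python
import bisect

# Lower edges of groups 1..6; anything below -20 falls into group 0.
DG_EDGES = [-20, -10, 0, 10, 20, 30]


def intergenic_distance(start, upstream_end):
    """Dg of a forward-strand gene, given the first base of its start
    codon and the last base of the upstream gene (1-based, inclusive).
    An AUGA junction, where the start codon overlaps the upstream stop
    codon, gives -4."""
    return start - upstream_end - 1


def dg_group(dg):
    """Index (0 to 6) of the Dg group: 0 is Dg < -20, 1 to 5 are the
    10-bp intervals from -20 up to 30, and 6 is Dg >= 30."""
    return bisect.bisect_right(DG_EDGES, dg)


def sd_percent_by_group(genes):
    """SD% per Dg group for an iterable of (dg, has_sd) pairs."""
    totals, with_sd = {}, {}
    for dg, has_sd in genes:
        g = dg_group(dg)
        totals[g] = totals.get(g, 0) + 1
        with_sd[g] = with_sd.get(g, 0) + (1 if has_sd else 0)
    return {g: 100.0 * with_sd[g] / totals[g] for g in sorted(totals)}
```

Under this convention a gene whose AUG starts at position 100 and whose upstream neighbor ends at position 103 has a Dg of −4 and lands in the −10 to 0 bp group, the largest of the five 10-bp groups.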
FIG. 3. Relationship between SD% and distances between successive genes (Dg). The y axis represents SD%. The symbols for the lines and points for each plot are shown. In each plot, the seven data points represent seven Dg groups (from left to right): genes with ...
In bacterial genomes, the first group (Dg of below −20 bp) persistently carried a much reduced SD%, except for Pseudomonas aeruginosa, Aquifex aeolicus, and Mycoplasma genitalium (Fig. ). One possible explanation for the low SD% is that many of these genes might be incorrectly annotated. At the other end, the last group (genes with a Dg in excess of 30 bp) contained about 60% to 80% of the genome and had an SD% at about the genome level. In 16 genomes, the group with a Dg of −10 to 0 bp was significantly higher in SD% over the genome level. The groups with a Dg of between 10 and 20 bp were significantly higher in SD% for 10 bacterial and three archaeal genomes (see Supplementary Data Table S-5). The increased SD% in these Dg groups were not due to higher expression levels (data not shown). Of particular interest was the genome of Mycoplasma genitalium, which contained 75 genes with a Dg of between −10 and 0 bp, of which 25% had an SD sequence (Fig. ). Whether these SD sequences are functional remains unclear.
Genes with a Dg of 0 to 20 bp may have strong biases in base composition in their translation initiation region because their 5′ end is located in the regions around the stop codon of the upstream gene (49). Rocha et al. (35) found that the 6 bases following the stop codon in Bacillus subtilis genes are AU rich. Such biases could discount the occurrence of an SD sequence, which might be the reason for the somewhat reduced SD% for the group with a Dg of 0 to 10 bp in bacterial genomes (Fig. ). On the other hand, Eyre-Walker (12) showed that Escherichia coli K-12 genes overlapping a downstream gene tend to have low codon preferences at the 3′ end, which would more easily enable the presence of an SD for the downstream gene (e.g., with a Dg of −20 to 0 bp).
The archaeal genomes revealed a common trend distinct from that of the bacteria. The genes with a Dg of less than 20 bp (Fig. ) or less than 10 bp (Fig. ) were strongly biased toward having an SD sequence compared to genes with a larger Dg. This was even more pronounced for genomes with an overall SD% of less than 30%, especially for gene groups with a Dg of between −20 and 10 bp (Fig. ). These increases in SD% were again not correlated with higher expression levels (data not shown). It is interesting that Bacillus subtilis, Aquifex aeolicus, and Thermotoga maritima resembled the other bacteria in their relationship between Dg and SD presence (Fig. ), even though they were very similar to the archaea in the SD sequences with respect to ΔGSD and OAS (Table ; Fig. ). Thus, the parameters of translation initiation do not sort along simple phylogenetic lines.
Relationship between SD presence and operon structure.
The greatly increased SD presence in genes in close proximity to their upstream genes led us to investigate the connection between the SD sequence and operon structure. Apparently many genes in the groups with a Dg of −20 to 20 bp are genes within operons (38). It has been suggested that operon structure might have arisen during the evolution of both bacteria and archaea by thermoreduction from a common thermophilic ancestor (14). The operon structures in the two kingdoms thus might have some common features, such as the SD sequence. The high SD presence suggests that the SD sequences may play an essential role in translation of these genes.
We analyzed SD sequences for 391 documented operons from Escherichia coli K-12 (each with at least two genes) extracted from the RegulonDB database (39). Of the 601 internal genes within these operons, 69.2% had a Dg of between −20 and 30 bp, compared to only 6.6% of the 391 initial operon genes. The SD% was 71.0% for genes within operons and 67.3% for initial genes.
We then conducted a more general analysis over the 30 genomes. Based on the Dg, we partitioned the genes in a genome into three classes, types I to III, as illustrated in Fig. . Type I consists of genes at least 100 bp in distance from both the upstream and downstream genes; type I genes are presumably single genes. Type II consists of genes with a Dg larger than 50 bp and followed by at least two consecutive downstream genes with a Dg below 20 bp; type II genes are likely initial genes of operons. Type III comprises all genes with a Dg below 20 bp following a type II gene; type III genes are likely genes within operons. The three classes encompass about half of a genome. We found that more than one third of the type II and type III genes in Escherichia coli K-12 were present in the 391 known operons, and most of them were also predicted to be operons by Salgado's method (38). On average, there were three type III genes following each type II gene (see Supplementary Data Table S-6). Figure presents the SD% for these three gene classes.
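The three-way partition can be expressed compactly in code. The sketch below is a simplified reading of the definitions above: it assumes the genes are supplied in order along one strand, that `dg[i]` is the distance from gene i to its upstream neighbor (unknown, `None`, for the first gene), and that the distance to the downstream gene is simply `dg[i + 1]`. Genes matching none of the definitions are left unclassified. The function name is hypothetical.

```python
def classify_genes(dg):
    """Assign 'I', 'II', 'III', or None to each gene from its Dg.

    dg[i] is the distance (bp) from gene i to its upstream neighbor;
    dg[0] may be None when no upstream gene is known.
    """
    n = len(dg)
    types = [None] * n

    # Type I: at least 100 bp from both the upstream and downstream genes.
    for i in range(n):
        up = dg[i]
        down = dg[i + 1] if i + 1 < n else None
        if up is not None and down is not None and up >= 100 and down >= 100:
            types[i] = "I"

    # Type II: Dg > 50 bp, followed by at least two consecutive genes
    # with Dg < 20 bp; those followers are type III.
    i = 0
    while i < n:
        if dg[i] is not None and dg[i] > 50:
            j = i + 1
            while j < n and dg[j] is not None and dg[j] < 20:
                j += 1
            if j - i - 1 >= 2:
                types[i] = "II"
                for k in range(i + 1, j):
                    types[k] = "III"
            i = j
        else:
            i += 1
    return types
```

For example, a gene with a Dg of 80 bp followed by genes with Dg values of 5, −4, and 10 bp would be labeled type II, and the three followers type III, matching the average of three type III genes per type II gene noted above.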
FIG. 4. SD sequences for genes with different internal positions. (A) How the three types of genes were classified (see text for details). (B) Asterisks indicate genomes where the SD% for type III genes is significantly higher than that for type I genes. Boldface ...
Type II genes attain an SD% about the same as or lower than that of type I genes in most genomes. In fact, only Chlamydia trachomatis, Halobacterium sp. strain NRC-1, and Sulfolobus solfataricus recorded significantly higher SD% for type II genes than for type I genes (Fig. ). It appears that initial genes of operons are similar to single genes in SD presence. This may be expected because both kinds of genes lie at the start of a transcript. In contrast, type III genes recorded significantly higher SD% than type I genes in all the thermophiles, from Deinococcus radiodurans to Pyrobaculum aerophilum, and four other bacteria: Escherichia coli K-12, Haemophilus influenzae, Vibrio cholerae, and Mycoplasma genitalium (Fig. ). These results imply that the presence of an SD sequence is especially conserved for genes within operons in bacterial and archaeal genomes, most prominently in thermophiles.
This conservation was even more significant in the genomes where the overall SD% was very low and/or no correlation between the SD presence and predicted expression levels was observed. Such genomes included those of Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp. strain PCC6803, Halobacterium sp. strain NRC-1, Sulfolobus solfataricus, and Pyrobaculum aerophilum (Fig. , 3, and 4B). Thus, it is tempting to speculate that the SD sequence may have coevolved with the operon gene structure in both bacteria and archaea (14). The correlation of SD presence with gene expression levels might have been established later. This would explain the observation that, in all archaeal genomes and Aquifex aeolicus, PHX genes with a Dg of below 50 bp recorded a significantly higher SD% than other PHX genes (data not shown). The RP genes are both highly expressed and frequently expressed from operons, and not surprisingly, they always attained the highest SD% (Table ).
The archaeal genomes provide an excellent system with which to analyze the evolution of both the SD sequence and the bacterial translation mechanism utilizing the SD-anti-SD interaction. Some euryarchaea (Thermoplasma acidophilum and Halobacterium sp. strain NRC-1), and especially crenarchaea (Sulfolobus solfataricus and Pyrobaculum aerophilum), seem to have gradually lost conservation of both the anti-SD and the SD sequences (Table ; Fig. ). Accumulating evidence suggests that many single genes, or initial genes of operons, in these genomes are translated through leaderless mRNA by mechanisms that do not involve the SD-anti-SD interaction (45). The SD sequence may thus become dispensable for these genes. However, for genes within operons, the SD sequence appears to be particularly important, as evidenced by the prevalence of the SD motifs in those genes (Fig. ). Experimental evidence supporting this hypothesis has been reported for Sulfolobus solfataricus
SD presence and other gene features.
It has been suggested that the SD sequence is especially important in a genome where an S1 ribosomal protein is missing, e.g., Bacillus subtilis, which has only a reduced S1 homologue and achieves the second highest SD% of all the genomes (Table ) (35). However, we did not find such a correlation for other genomes. Three bacteria (Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae) and all the archaea lacked S1 and any S1 homologues. But, unlike Bacillus subtilis, the genomes of Ureaplasma urealyticum, Mycoplasma genitalium, and Mycoplasma pneumoniae recorded a very low SD% (Table ). On the other hand, genomes with an S1 gene can achieve very high SD%, e.g., Thermotoga maritima, which had the highest SD% (Table ). Thus, SD presence is not correlated with the presence or absence of an S1 RP gene. Also, the SD sequence seems to be uncorrelated with factors such as copy number of the 16S rRNA, G+C content, total number of genes, gene length, or lifestyle (data not shown).
Given the correlation between the SD sequence and other gene features, especially expression levels and distances between successive genes, we suggest that the SD sequence should be incorporated in algorithms for gene start determination, expression level prediction, and operon prediction to improve accuracy. Most of the genomes studied in this report were annotated with the programs GeneMark (20) and GLIMMER (40) or a combination of automatic gene-finding methods and similarity searches in protein databases. SD information has now been incorporated in recent programs, such as GeneMark.hmm and GeneMarkS (3). It appears to work well for genomes with high SD%, such as low-G+C gram-positive bacteria (e.g., Listeria monocytogenes). However, for many genomes, the SD% is around 30 to 50% and thus would provide only marginal improvements (36
On the other hand, the relationship between SD presence and intergenic distances may contribute greatly to operon predictions, an important part of prokaryotic genomics. To date, no highly reliable method has been developed for operon prediction (38). Also, little is known about operons in archaeal genomes. Our finding that archaeal genes that are presumably within operons have remarkably increased SD presence should help in developing an effective method for operon characterization in these genomes.
Recently, the crystal structures of both the 50S and 30S complexes of the bacterial ribosome have been determined at high resolution (2). A structure of the 80S ribosome from Saccharomyces cerevisiae was also reported (48). These accomplishments greatly augment our understanding of the mechanisms of protein synthesis at the atomic level (5). Furthermore, Yusupova et al. (58) directly observed the path of mRNA in the 70S ribosome from Thermus thermophilus at 7 Å resolution. The model mRNA was based on the phage T4 gene 32 mRNA except that the SD sequence was expanded to AAGGAGGU. They found that about 30 nucleotides are bound to the 30S subunit (15 nucleotides upstream of the initiator to 15 nucleotides downstream), which is roughly the whole translation initiation region. The SD interaction was clearly observed to form a helix, which was accommodated in a cleft formed by 16S rRNA elements and the ribosomal proteins S11 and S18 (58). These results provide additional proof that the SD interaction can be an important part of translation initiation.
The SD sequence in the mRNA, AAGGAGGU, had an aligned spacing of 7 bases. It is interesting that of the 67 AAGGAGGU SD sequences in the 21 bacterial genomes (Table ), only 4 occurred at an aligned spacing of 7 bases, while 10, 19, and 12 occurred at spacings of 8, 9, and 10 bases, respectively. A total of 55 (82.1%) were present at a spacing larger than 7 bases. Thus, an aligned spacing of 9 bases would most likely be preferable for the mRNA in the structure. There are apparently structural constraints that require such an optimal spacing, and three-dimensional simulation studies based on the structure using different SD sequences and spacings could provide insights into these structural constraints and a better understanding of the SD interaction.