Comp Funct Genomics. 2010; 2010: 343569.
Published online 2010 April 22. doi:  10.1155/2010/343569
PMCID: PMC2860111

Codon Usage Patterns in Corynebacterium glutamicum: Mutational Bias, Natural Selection and Amino Acid Conservation


Synonymous codon usage in Corynebacterium glutamicum, a well-known bacterium used industrially for the production of amino acids, was investigated by multivariate analysis. As C. glutamicum is a GC-rich organism, G and C are expected to predominate at the third position of codons. Indeed, overall codon usage analyses indicated that C- and/or G-ending codons are predominant in this organism. Through multivariate statistical analysis, apart from mutational bias, we identified three other trends of codon usage variation among the genes. Firstly, the majority of highly expressed genes are scattered towards the positive end of the first axis, whereas the majority of lowly expressed genes are clustered towards the other end. Furthermore, the distinct difference between the two sets of genes is that C-ending codons are predominant in putatively highly expressed genes, suggesting that C-ending codons are translationally optimal in this organism. Secondly, the majority of the putatively highly expressed genes tend to be located on the leading strand, which indicates that replicational and transcriptional selection might be invoked. Thirdly, highly expressed genes are more conserved than lowly expressed genes, as judged by synonymous and nonsynonymous substitutions among orthologous genes from the genomes of C. glutamicum and C. diphtheriae. We also analyzed other factors that might influence codon usage, such as gene length and hydrophobicity, and found their contributions to be weak.

1. Introduction

It is well established that synonymous codons are generally not used with equal frequency. Grantham et al. first explained the phenomenon of unequal usage and proposed the “genome hypothesis”, stating that the biases are species specific [1], and multivariate analysis methods were used to analyze codon usage and amino acid composition [2–4]. As more and more complete genome sequences of diverse species are investigated, researchers have found that biased usage of synonymous codons may result from various factors. Some unicellular species have extremely biased compositions, where compositional constraints are the main factors in determining the codon usage variation among genes [5–7]. In contrast, both translational selection and compositional constraint operate on the codon usage variation in other organisms [8–14]. Moreover, in several bacteria, replicational and translational selection are responsible for the codon usage variation among genes [15–18]. In organisms such as Escherichia coli [36], Drosophila melanogaster [19], and Caenorhabditis elegans [20], the frequency of codon usage is directly proportional to the corresponding tRNA population, and the preferred codons in highly expressed genes are recognized by the most abundant tRNAs. Meanwhile, it has been reported that amino acid conservation and hydrophobicity are the main factors shaping codon usage among the genes in Mycobacteria [21, 22]. Other factors may also influence synonymous codon usage, such as protein secondary structure [23–26], mRNA folding stability [27, 28], gene function [29, 30], and gene length [31–33].

Corynebacterium glutamicum ATCC 13032, used industrially for the production of amino acids, is an aerobic, gram-positive, rod-shaped bacterium capable of growing on a variety of sugars or organic acids [34]. In this study, we used the available complete genome sequence of this organism and analyzed its codon usage, aiming to understand the genetic organization of the C. glutamicum genome. Our results show that mutational bias, natural selection, and amino acid conservation are the main factors driving codon usage patterns in C. glutamicum genes.

2. Materials and Methods

2.1. Genome Sequence Data

The complete genome sequences and coding sequences of C. glutamicum and C. diphtheriae were obtained from the NCBI ftp site. To minimize sampling errors, only genes of at least 100 codons in length with correct initiation and termination codons were used in further analysis.
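The gene filter described above can be sketched in Python. This is a minimal illustration rather than the script used in the study: the function name passes_filter, the accepted start codons (ATG, GTG, TTG), the rejection of internal stop codons, and the choice to count the 100 codons without the stop codon are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial initiation codons


def passes_filter(cds: str, min_codons: int = 100) -> bool:
    """Return True if a coding sequence meets the length and start/stop criteria."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return False  # not a whole number of codons
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if len(codons) < 2:
        return False  # cannot hold both a start and a stop codon
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    sense = codons[:-1]  # assumption: the stop codon is not counted
    if any(codon in STOP_CODONS for codon in sense):
        return False  # assumption: internal stops mark a faulty annotation
    return len(sense) >= min_codons
```

A gene is kept only when every check passes, so a sequence with a valid start and stop codon but only 99 sense codons is rejected.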

2.2. Multivariate Analysis of Codon Usage

The following analyses and indices were computed using the program CodonW1.42: COA (correspondence analysis of codon usage, which plots the codon usage data in a multidimensional space of 59 axes, excluding Met, Trp, and termination codons, and identifies the axes that represent the most prominent factors contributing to the variation among genes); GC3s (the frequency of G + C at the third synonymously variable coding position, excluding Met, Trp, and termination codons); ENC (the “effective number of codons”, a measure of the bias in codon usage of genes; highly expressed genes usually display lower values than lowly expressed ones); RSCU (the “relative synonymous codon usage”; a value greater than 1.0 indicates that the corresponding codon is used more frequently than expected, whereas the reverse is true for RSCU values less than 1.0); CAI (the “codon adaptation index”; high values indicate stronger codon usage bias and a higher expression level); Fop (the “frequency of optimal codons”); the GRAVY index of hydrophobicity; and A3s, G3s, C3s, and T3s (the composition of each individual base A, G, C, and T at the third synonymous codon positions). CAI was calculated taking the codon usage of the ribosomal proteins as a reference. Other statistical analyses were performed with the SPSS statistical software version 11.0.
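To make two of these indices concrete, the sketch below computes RSCU and GC3s in Python from the definitions given above. It is an illustration under stated assumptions, not the CodonW implementation: the function names rscu and gc3s are hypothetical, the standard genetic code is assumed, and codons containing ambiguous bases are simply skipped.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, bases ordered T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """RSCU per codon: observed count divided by the mean count of its synonymous family."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":  # Met, Trp, and stop codons are excluded (59 codons remain)
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the data set
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def gc3s(seq):
    """Fraction of G or C at the third position of synonymously variable codons."""
    seq = seq.upper()
    gc = total = 0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if CODON_TABLE.get(codon, "*") in "*MW":
            continue  # unknown, stop, Met, and Trp codons do not count
        total += 1
        gc += codon[2] in "GC"
    return gc / total if total else float("nan")
```

For a toy sequence with three GCC codons and one GCT codon, rscu returns 3.0 for GCC and 1.0 for GCT, which matches the reading that values above 1.0 mark codons used more often than expected.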

2.3. Locating Genes Situated on the Leading and Lagging Strands of Replication

Asymmetrical mutational bias between the two complementary strands may contribute to variations in codon usage. To locate the genes on the leading or lagging strand of replication, the sites of origin and termination were determined using the oriloc program, and the GC skew, (G − C)/(G + C), was determined using the GC Skewing program with a window size of 24 kb and a step size of 3 kb.
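The sliding-window GC skew can be sketched in Python as follows. This is a minimal illustration, not the GC Skewing program itself: the function name gc_skew is hypothetical, and the genome is treated as a linear string, so windows that would wrap around a circular chromosome are not generated.

```python
def gc_skew(genome, window=24_000, step=3_000):
    """Return (window start, (G - C)/(G + C)) for each sliding window."""
    genome = genome.upper()
    result = []
    for start in range(0, len(genome) - window + 1, step):
        chunk = genome[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C has no defined skew; 0.0 is used as a neutral value.
        skew = (g - c) / (g + c) if g + c else 0.0
        result.append((start, skew))
    return result
```

Sign changes in the resulting profile are commonly read as approximate positions of the replication origin and terminus, which is how the skew helps separate leading from lagging strands.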

2.4. Orthologous Gene Pairs and Analysis

Orthologous genes were identified by the reciprocal best BLAST hit approach, using the local BLASTP program, as those pairs meeting an identity threshold of 60% and an E-value threshold of 10^−5, overlapping by at least 60% of the length of the longer protein, and with a length of at least 100 amino acids. The protein sequences of 1525 orthologous gene pairs were aligned using the MUSCLE program; the aligned protein sequences were then used to generate the corresponding codon alignments. Ka (the number of nonsynonymous substitutions per nonsynonymous site) and Ks (the number of synonymous substitutions per synonymous site) for each pair of aligned sequences were estimated using the PAML version 4.3 package with runmode = −2 and CodonFreq = 2. Only those pairs of sequences with Ks values below 1.0 were considered in further analysis, and the final dataset comprised 437 gene pairs.
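The reciprocal best hit step can be sketched in Python, assuming the BLASTP output has already been parsed into records. This is an illustration, not the pipeline used in the study: the Hit record, the function names, the use of the lowest E-value to define the best hit, and the direction of each threshold (identity at or above 60%, E-value at or below 1e-5, coverage at or above 0.6) are assumptions. The minimum length of 100 amino acids is not checked here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity
    evalue: float
    coverage: float  # aligned length / length of the longer protein


def best_hits(hits):
    """Keep, for each query, the hit with the lowest E-value (assumed ranking)."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best


def passes(hit, min_identity=60.0, max_evalue=1e-5, min_coverage=0.6):
    """Apply the identity, E-value, and coverage thresholds to one hit."""
    return (hit.identity >= min_identity
            and hit.evalue <= max_evalue
            and hit.coverage >= min_coverage)


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene in A, gene in B) pairs that are each other's best hit."""
    best_ab, best_ba = best_hits(a_vs_b), best_hits(b_vs_a)
    pairs = []
    for query, hit in best_ab.items():
        back = best_ba.get(hit.subject)
        if back is not None and back.subject == query and passes(hit) and passes(back):
            pairs.append((query, hit.subject))
    return sorted(pairs)
```

Requiring the thresholds in both directions is the stricter of two reasonable readings; applying them to one direction only would retain more pairs.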

3. Results

3.1. Overall Codon Usage

As shown in Figure 1(a), the genome of C. glutamicum is biased towards high G + C content, which ranges from 40% to 68% with an average of 54.7% and a standard deviation of 3.7%. With the exception of small regions, the genome shows little variation around the mean value. Due to compositional constraints, G and C are expected to predominate at the third position of codons. Indeed, the codon usage indicated that C-ending codons are predominant overall (data not shown). In order to understand the codon usage variation among different genes, ENC and GC3s values were calculated (Figure 1(b)). ENC values vary from 24.46 to 61.00 with a mean of 46.9 and a standard deviation of 7.55. The heterogeneity of codon usage was further confirmed by the GC3s values, which range from 28% to 87% with a mean of 57.18% and a standard deviation of 8.3%. Wright suggested that plotting ENC against GC3s values could be used to effectively explore codon usage variation among genes [35]. If GC3s were the only determinant of the codon usage variation among genes, then the values of ENC would fall on the continuous curve. The GC3s versus ENC plot reveals that only a small proportion of points lie on the expected curve (Figure 1(b)), which indicates that, apart from the effect of compositional constraints, there might be additional factors driving codon usage variation among the genes.
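The continuous curve in an ENC plot is commonly computed from Wright's approximation, ENC = 2 + s + 29/(s^2 + (1 − s)^2), where s is GC3s expressed as a fraction. The Python sketch below evaluates that expression; the function name expected_enc is hypothetical, and the formula is assumed to be the one behind the curve in Figure 1(b), since the text does not state it.

```python
def expected_enc(gc3s: float) -> float:
    """Expected ENC when GC3s alone shapes codon usage (Wright's approximation)."""
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)
```

At s = 0.5 the expression gives 60.5, close to the maximum of 61 for unbiased codon usage. A gene whose observed ENC lies well below the value returned for its GC3s is more biased than base composition alone would predict.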

Figure 1
(a) The GC content and GC skew of the C. glutamicum genome with a window size of 24 kb and a step size of 3 kb. (b) The ENC plot of C. glutamicum. The continuous curve represents the relationship between GC3s and ENC values under random ...

3.2. Gene Expression and Codon Usage Bias

In order to investigate other possible trends shaping codon usage variation among the genes in C. glutamicum, we subjected the data to multivariate statistical analysis. Figure 1(c) shows the position of genes along the first two axes. The positive end of the first axis comprises putatively highly expressed genes, such as those encoding ribosomal proteins and translation elongation factors, while the majority of putatively lowly expressed genes are scattered towards the other extreme. A more important result emerged when the genes were sorted according to their respective CAI values: the highest positions were occupied not only by the genes encoding ribosomal proteins but also by almost the same genes found at the extreme of the first axis. Table 1 shows that the first axis accounts for 20.33% of the total variation, compared with 10.5% for the second axis; this value is high and much larger than that of the second axis, indicating a primary trend in codon usage across genes. Furthermore, the first axis is positively correlated with CAI (r = 0.855, P < .001), with Fop (r = 0.892, P < .001), with GC3s (r = 0.594, P < .001), and especially with C3s (r = 0.881, P < .001). These results suggest that gene expression may be the main factor shaping codon usage in this organism, that the first axis is associated with expression levels, and that highly expressed genes have a higher G + C content, especially C content, at their synonymous third codon positions than lowly expressed genes.

Table 1
Result of factorial correspondence analyses on codon usage in C. glutamicum.

To investigate the differences between highly and lowly expressed genes, we compared the codon usage of genes located at the two extremes of the first axis (Table 2). Chi-square tests were performed taking P < .01 as the significance criterion. We found 22 codons (corresponding to 18 amino acids) that are used more frequently in putatively highly expressed genes than in putatively lowly expressed genes. Among these codons, there are 14 C-ending codons and 3 G-ending codons, which demonstrates that the presumed highly expressed genes tend to be C3-rich.
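A per-codon chi-square comparison of this kind can be sketched in Python. The text does not specify how the contingency table was built, so the 2 × 2 layout used here (one codon versus all other codons, in the high-expression versus the low-expression set) is an assumption, as are the function names and the example counts. The value 6.635 is the chi-square critical value for one degree of freedom at P = .01.

```python
CRITICAL_P01_DF1 = 6.635  # chi-square critical value, 1 degree of freedom, P = .01


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no test is possible
    return n * (a * d - b * c) ** 2 / denominator


def enriched_in_high(high_codon, high_other, low_codon, low_other):
    """True if the codon is significantly more frequent in the high-expression set."""
    statistic = chi_square_2x2(high_codon, high_other, low_codon, low_other)
    # Compare the two proportions by cross-multiplication to avoid division by zero.
    more_in_high = (high_codon * (low_codon + low_other)
                    > low_codon * (high_codon + high_other))
    return statistic > CRITICAL_P01_DF1 and more_in_high
```

With hypothetical counts of 90 versus 10 in the high set and 50 versus 50 in the low set, the statistic is about 38.1, well above the threshold, so the codon would be flagged as preferred in highly expressed genes.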

Table 2
Codon usage in putatively highly expressed and lowly expressed genes of C. glutamicum.

3.3. Replicational and Transcriptional Selection and Codon Usage

Recent reports on several bacterial strains show that codon usage bias is mainly governed by transcriptional and translational selection [19, 20, 36, 37]. After the origin and terminus of replication and the leading and lagging strands were determined, we assigned the genes to the leading or lagging strand and found that the proportion of genes located on the leading strand increases with CAI, from about 55% for low-CAI genes (<0.35) to 67% for high-CAI genes (>0.65). For the putatively highly expressed ribosomal protein genes, the proportion on the leading strand reaches 84% (44/52) (Table 3). This observation is consistent with previous findings that essential genes are enriched in the leading strand to a greater extent than nonessential genes [38].

Table 3
Percentages of genes in C. glutamicum on the leading (versus lagging) strand.

3.4. Gene Conservation and Codon Usage

The rate of synonymous substitutions has been reported to be nonuniform among different genes in the same species [39]. When we calculated Ka and Ks between the orthologous genes from C. glutamicum and C. diphtheriae, several results emerged. Firstly, there is a negative correlation between Ka and CAI (r = −0.523, P < .001), comparable to that between Ks and CAI (r = −0.459, P < .001) (Figure 2). When the genes are sorted according to their respective Ks values, the genes displaying the lowest values are the presumed highly expressed genes, such as those encoding ribosomal proteins and translation elongation factors. Taken together, this indicates that highly expressed genes have diverged less at synonymous positions than lowly expressed genes. Secondly, Ka and Ks are correlated (r = 0.473, P < .001). Thirdly, there is a significant correlation between Ks and Fop (r = −0.431, P < .001), which indicates that the genes that have diverged less are the ones displaying the highest frequencies of optimal codons.

Figure 2
Plot of CAI values for C. glutamicum against Ka and Ks. (a) Plot of CAI values for C. glutamicum against Ka. (b) Plot of CAI values for C. glutamicum against Ks. The correlation coefficients (r) and level of significance (P) are shown.

Finally, we also investigated the relationship between codon usage and gene length (r = −0.137, P < .001) and between codon usage and hydrophobicity (r = −0.094, P < .001); the low correlation coefficients suggest that the contributions of these factors to the codon usage variation are weak.

4. Discussion

Among prokaryotes, it is generally accepted that the preferences among synonymous codons can be explained as the result of mutational bias and natural selection acting at the level of translation. In C. glutamicum, the compositional bias towards GC means that these bases are predominant at the third codon positions across all genes. Indeed, the putatively highly expressed genes show increased usage of several codons, most of which are C-ending triplets. Ikemura showed that there is a match between such preferred codons and the most abundant tRNAs [36]. In Escherichia coli [36], Drosophila melanogaster [19], and Caenorhabditis elegans [20], highly expressed genes have a strong selective preference for codons whose corresponding acceptor tRNA molecules are present at high concentration; the preferred codons are those best recognized by the most abundant tRNAs. This trend has been interpreted as coadaptation between the amino acid composition of proteins and tRNA pools to enhance translational efficiency. Remarkably, in this study, there is a strong positive correlation (r = 0.94, P < .001) between the Fop of each gene and its CAI value. This strongly suggests that translational selection has influenced the codon usage of C. glutamicum and that the “optimal codons” are more frequent in highly expressed genes.

As more prokaryotic genomes are analyzed, it becomes evident that codon usage is largely dependent on mutational bias and natural selection. For example, the complex pattern of codon usage in Chlamydia trachomatis is inferred to be the result of strand-specific mutation, natural selection, the hydropathy level of each protein, and amino acid conservation [17]. In this study, we present evidence suggesting that, apart from mutational bias and natural selection, strand-specific mutation and amino acid conservation also contribute to the codon usage of C. glutamicum. Strand bias also dominates codon usage in other symbiotic or parasitic bacteria, such as Rickettsia prowazekii, Borrelia burgdorferi, and Lawsonia intracellularis [15, 40, 41]. We found a distribution bias of genes (particularly those with a high CAI) towards the leading strand in C. glutamicum. This is usually interpreted as the result of “replicational selection”, by which presence on the leading strand avoids collisions between polymerases when replication and transcription occur at the same time [16].

It has been reported that codon usage is more biased for amino acids that are more conserved between species [42, 43], and that natural selection makes a larger contribution than mutation to the observed correlation between evolutionary rates and gene expression level in Chlamydomonas [44]. In this study, a correlation between Ks and Fop was also identified. The correlation between Ks and Ka might be explained in several ways. Akashi argued that selection for translational accuracy maintains a high frequency of preferred codons for highly conserved amino acids [43]. Two additional hypotheses for this pattern are a possible mechanistic bias in mutation and the fact that synonymous sites are also subject to some degree of selection [45]. The latter scenario could mean either selection on codon usage, or that synonymous substitutions are not always silent, or evolutionary responses to adaptations [46]. A similar interaction between the level of expression, the level of codon bias, and gene conservation was demonstrated in Mycobacterium [21].

In summary, this study has shown that the codon usage variation among the genes of C. glutamicum is influenced by mutational bias, translational selection, and amino acid conservation. As more complete prokaryotic genomes are being studied, different factors shaping the pattern of codon usage might be found.


This work is supported by the National Natural Science Foundation of China (30571009). G. Liu and J. Wu contributed equally to this work.


1. Grantham R, Gautier C, Gouy M, Jacobzone M, Mercier R. Codon catalog usage is a genome strategy modulated for gene expressivity. Nucleic Acids Research. 1981;9(1):r43–r74.
2. Medigue C, Rouxel T, Vigier P, Henaut A, Danchin A. Evidence for horizontal gene transfer in Escherichia coli speciation. Journal of Molecular Biology. 1991;222(4):851–856.
3. Pascal G, Médigue C, Danchin A. Universal biases in protein composition of model prokaryotes. Proteins. 2005;60(1):27–35.
4. Pascal G, Médigue C, Danchin A. Persistent biases in the amino acid composition of prokaryotic proteins. BioEssays. 2006;28(7):726–738.
5. Ohama T, Muto A, Osawa S. Role of GC-biased mutation pressure on synonymous codon choice in Micrococcus luteus, a bacterium with a high genomic GC-content. Nucleic Acids Research. 1990;18(6):1565–1569.
6. Andersson SGE, Sharp PM. Codon usage in the Mycobacterium tuberculosis complex. Microbiology. 1996;142(4):915–925.
7. Andersson SGE, Sharp PM. Codon usage and base composition in Rickettsia prowazekii. Journal of Molecular Evolution. 1996;42(5):525–536.
8. Malumbres M, Gil JA, Martin JF. Codon preference in corynebacteria. Gene. 1993;134(1):15–24.
9. Ghosh TC, Gupta SK, Majumdar S. Studies on codon usage in Entamoeba histolytica. International Journal for Parasitology. 2000;30(6):715–722.
10. Romero H, Zavala A, Musto H. Compositional pressure and translational selection determine codon usage in the extremely GC-poor unicellular eukaryote Entamoeba histolytica. Gene. 2000;242(1-2):307–311.
11. Musto H, Romero H, Zavala A. Translational selection is operative for synonymous codon usage in Clostridium perfringens and Clostridium acetobutylicum. Microbiology. 2003;149(4):855–863.
12. Das S, Paul S, Chatterjee S, Dutta C. Codon and amino acid usage in two major human pathogens of genus Bartonella: optimization between replicational-transcriptional selection, translational control and cost minimization. DNA Research. 2005;12(2):91–102.
13. Sau K, Sau S, Mandal SC, Ghosh TC. Factors influencing the synonymous codon and amino acid usage bias in AT-rich Pseudomonas aeruginosa phage PhiKZ. Acta Biochimica et Biophysica Sinica. 2005;37(9):625–633.
14. Sau K, Gupta SK, Sau S, Mandal SC, Ghosh TC. Factors influencing synonymous codon and amino acid usage biases in Mimivirus. BioSystems. 2006;85(2):107–113.
15. McInerney JO. Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proceedings of the National Academy of Sciences of the United States of America. 1998;95(18):10698–10703.
16. Lafay B, Lloyd AT, McLean MJ, Devine KM, Sharp PM, Wolfe KH. Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Research. 1999;27(7):1642–1649.
17. Romero H, Zavala A, Musto H. Codon usage in Chlamydia trachomatis is the result of strand-specific mutational biases and a complex pattern of selective forces. Nucleic Acids Research. 2000;28(10):2084–2090.
18. Stoletzki N, Eyre-Walker A. Synonymous codon usage in Escherichia coli: selection for translational accuracy. Molecular Biology and Evolution. 2007;24(2):374–381.
19. Moriyama EN, Powell JR. Codon usage bias and tRNA abundance in Drosophila. Journal of Molecular Evolution. 1997;45(5):514–523.
20. Duret L. tRNA gene number and codon usage in the C. elegans genome are co-adapted for optimal translation of highly expressed genes. Trends in Genetics. 2000;16(7):287–289.
21. de Miranda AB, Alvarez-Valin F, Jabbari K, Degrave WM, Bernardi G. Gene expression, amino acid conservation, and hydrophobicity are the main factors shaping codon preferences in Mycobacterium tuberculosis and Mycobacterium leprae. Journal of Molecular Evolution. 2000;50(1):45–55.
22. Zhou T, Sun X, Lu Z. Synonymous codon usage in environmental chlamydia UWE25 reflects an evolutional divergence from pathogenic chlamydiae. Gene. 2006;368(1-2):117–125.
23. Oresic M, Shalloway D. Specific correlations between relative synonymous codon usage and protein secondary structure. Journal of Molecular Biology. 1998;281(1):31–48.
24. Xie T, Ding D. The relationship between synonymous codon usage and protein structure. FEBS Letters. 1998;434(1-2):93–96.
25. Gupta SK, Majumdar S, Bhattacharya TK, Ghosh TC. Studies on the relationships between the synonymous codon usage and protein secondary structural units. Biochemical and Biophysical Research Communications. 2000;269(3):692–696.
26. Gu W, Zhou T, Ma J, Sun X, Lu Z. The relationship between synonymous codon usage and protein structure in Escherichia coli and Homo sapiens. BioSystems. 2004;73(2):89–97.
27. Chamary JV, Parmley JL, Hurst LD. Hearing silence: non-neutral evolution at synonymous sites in mammals. Nature Reviews Genetics. 2006;7(2):98–108.
28. Kahali B, Basak S, Ghosh TC. Reinvestigating the codon and amino acid usage of S. cerevisiae genome: a new insight from protein secondary structure analysis. Biochemical and Biophysical Research Communications. 2007;354(3):693–699.
29. Epstein RJ, Lin K, Tan TW. A functional significance for codon third bases. Gene. 2000;245(2):291–298.
30. Fuglsang A. Strong associations between gene function and codon usage. APMIS. 2003;111(9):843–847.
31. Moriyama EN, Powell JR. Gene length and codon usage bias in Drosophila melanogaster, Saccharomyces cerevisiae and Escherichia coli. Nucleic Acids Research. 1998;26(13):3188–3193.
32. Duret L, Mouchiroud D. Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proceedings of the National Academy of Sciences of the United States of America. 1999;96(8):4482–4487.
33. Marais G, Duret L. Synonymous codon usage, accuracy of translation, and gene length in Caenorhabditis elegans. Journal of Molecular Evolution. 2001;52(3):275–280.
34. Kalinowski J, Bathe B, Bartels D, et al. The complete Corynebacterium glutamicum ATCC 13032 genome sequence and its impact on the production of L-aspartate-derived amino acids and vitamins. Journal of Biotechnology. 2003;104(1–3):5–25.
35. Fuglsang A. The ‘effective number of codons’ revisited. Biochemical and Biophysical Research Communications. 2004;317(3):957–964.
36. Ikemura T. Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. Journal of Molecular Biology. 1981;151(3):389–409.
37. Banerjee R, Roy D. Codon usage and gene expression pattern of Stenotrophomonas maltophilia R551-3 for pathogenic mode of living. Biochemical and Biophysical Research Communications. 2009;390(2):177–181.
38. Rocha EPC, Danchin A. Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nature Genetics. 2003;34(4):377–378.
39. Kawahara Y, Imanishi T. A genome-wide survey of changes in protein evolutionary rates across four closely related species of Saccharomyces sensu stricto group. BMC Evolutionary Biology. 2007;7, article 9.
40. Davis JJ, Olsen GJ. Modal Codon Usage: assessing the typical codon usage of a genome. Molecular Biology and Evolution. 2010;27(4):800–810.
41. Guo F-B, Yuan J-B. Codon usages of genes on chromosome, and surprisingly, genes in plasmid are primarily affected by strand-specific mutational biases in Lawsonia intracellularis. DNA Research. 2009;16(2):91–104.
42. Ticher A, Graur D. Nucleic acid composition, codon usage, and the rate of synonymous substitution in protein-coding genes. Journal of Molecular Evolution. 1989;28(4):286–298.
43. Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics. 1994;136(3):927–935.
44. Popescu CE, Borza T, Bielawski JP, Lee RW. Evolutionary rates and expression level in Chlamydomonas. Genetics. 2006;172(3):1567–1576.
45. Jordan IK, Rogozin IB, Wolf YI, Koonin EV. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Research. 2002;12(6):962–968.
46. Drummond DA, Wilke CO. The evolutionary consequences of erroneous protein synthesis. Nature Reviews Genetics. 2009;10(10):715–724.

Articles from Comparative and Functional Genomics are provided here courtesy of Hindawi