The "Measure Independent of Length and Composition" (MILC)
Our primary motivation in developing this novel method was to correct for possible artefacts due to sequence length variability. The measure should quantify the distance in codon usage between a gene and some expected distribution of codons, which could either be calculated from the background nucleotide composition, or derived from a single gene or a gene group. MILC is therefore conceptually similar to Karlin and Mrazek's B [21], Novembre's ENC' [19] or Urrutia and Hurst's MCB method [22].
Mathematically, the measure is based on a log-likelihood ratio score used in the statistical G-test for goodness-of-fit. This methodology yields numerically similar results to the more commonly used χ2 test, but may hold theoretical advantages over it in statistical analyses [23]. Both methods have been used in past examinations of codon usage patterns [24].
The individual contribution M_a of each amino acid a to the MILC statistic is calculated as

M_a = 2 Σ_c O_c ln(O_c / E_c)     (Equation 1)

where O_c denotes the actual observed count of the codon c in a gene, and E_c stands for the expected count of the same codon. The O_c/E_c ratio is mathematically equal to, and can be replaced by, f_c/g_c, where f_c is the frequency of the codon c in a gene, and g_c is the expected frequency of the same codon. The sum of f (or g) over all codons of each amino acid should equal 1. The total difference in codon usage is then assessed by the following formula:

MILC = (Σ_a M_a) / L − C     (Equation 2)
The sum of the contributions of all amino acids (stop codons are excluded from the calculation) is divided by L, the gene length in codons, in an attempt to compensate for the expected increase with the total number of codons. This is analogous to the procedure described in [25]. However, such a "scaled χ2" statistic still depends on gene length [20], greatly overestimating the overall amount of bias in shorter sequences. The correction factor C in Equation 2 attempts to correct for this overestimation.
The cause of the abovementioned effect is sampling error: a relatively small number of observations (counted codons) cannot exactly fit the expected distribution, leading to a higher perceived χ2 score. To demonstrate the effect, let us presume that the expected codon frequencies for the two cysteine codons are g(UGU) = 0.5 and g(UGC) = 0.5, and that our hypothetical gene complies with these codon frequencies. However, a short gene might have only a single codon for Cys, so the observed counts can only be O_UGU = 1 and O_UGC = 0, or vice versa. Either way, instead of being equal to 0, the cysteine's contribution to the χ2 score will be:

χ2_Cys = (1 − 0.5)^2 / 0.5 + (0 − 0.5)^2 / 0.5 = 1     (Equation 3)
In case the gene has two cysteines, there is a 50% chance that O_UGU = O_UGC = 1, which would yield a (correct) χ2 score of 0; and a 50% chance that one of the counts will be 2 and the other 0, which gives a χ2 score of 2. The weighted average of these scores will again be equal to 1. Moving on to cases with 3, 4 or more cysteines, we see that always M_Cys = 1, and it can be shown that for each amino acid in this case M_a is equal to its degree of redundancy minus 1 (e.g. M_Ile = 2, M_Pro = 3). In fact, this is the expected value of the χ2 statistic under the null hypothesis (observed frequencies match the expected frequencies), which equals the number of degrees of freedom. The calculation can be generalized to cases where the observed frequencies do not match the expected codon distribution, and is also applicable to the G statistic MILC is based upon. Further examples to better illustrate this point are given in the material accompanying this paper [see Additional file 1].
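The claim that an amino acid's expected χ2 contribution equals its degeneracy minus 1 can be checked by exhaustive enumeration. The snippet below is our own illustration (not from the paper); it covers the two-codon case with equal expected frequencies of 0.5, exactly as in the cysteine example above.

```python
import math

def expected_chi2_two_codons(n):
    """Exact expectation of the chi-square contribution of an amino acid with
    two synonymous codons (expected frequencies 0.5/0.5), occurring n times in
    a gene that truly follows the expected distribution."""
    e = n / 2                               # expected count of each codon
    total = 0.0
    for k in range(n + 1):                  # k = observed count of the first codon
        p = math.comb(n, k) * 0.5 ** n      # binomial probability of this outcome
        chi2 = (k - e) ** 2 / e + ((n - k) - e) ** 2 / e
        total += p * chi2
    return total

# One cysteine gives chi2 = 1 with certainty; on average the contribution
# stays at 1 (one degree of freedom) for any number of occurrences.
for n in (1, 2, 3, 10):
    print(n, expected_chi2_two_codons(n))
```

The same enumeration, extended to r codons with arbitrary expected frequencies, reproduces the general r − 1 result quoted above.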
To reiterate, in a situation where the gene's codon usage matches the expected distribution, with all amino acids present, the sampling errors will increase the χ2 score by 41, and the "scaled χ2" by 41/L. The correction factor C is therefore calculated as:

C = (Σ_a (r_a − 1)) / L − 0.5     (Equation 4)
where r_a is the number of possible codons for the amino acid a – its degeneracy class. Only the amino acids actually present at least once in the sequence contribute to C; e.g. if a gene lacked one of the four-fold degenerate amino acids, C would be 38/L − 0.5. When the observed frequencies match the expected codon distribution closely, MILC could otherwise assume negative values; to compensate, a constant of 0.5 is subtracted in the correction factor C (see Equation 4). Regarding minimum sequence length, we recommend that only sequences of 80 codons or longer be analysed using MILC (or any other measure of codon usage); many researchers set this threshold even higher, e.g. at 100 codons.
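As a concrete illustration of the calculation, a compact implementation of MILC might look as follows. This is a sketch based on Equations 1, 2 and 4 as reconstructed here, not the authors' reference code, and the input layout (a mapping from amino acids to per-codon observed counts and expected within-amino-acid frequencies) is our own choice.

```python
import math

def milc(groups):
    """MILC distance of one gene from an expected codon distribution.

    `groups` maps each amino acid to {codon: (observed_count, expected_freq)},
    where the expected frequencies within one amino acid sum to 1.
    Stop codons are excluded, as in the paper."""
    # L: gene length in codons
    L = sum(o for codons in groups.values() for o, _ in codons.values())
    total, present_dof = 0.0, 0
    for codons in groups.values():
        n_a = sum(o for o, _ in codons.values())  # occurrences of this amino acid
        if n_a == 0:
            continue                              # absent amino acids do not contribute
        # Equation 1: M_a = 2 * sum_c O_c * ln(O_c / E_c), with E_c = n_a * g_c
        total += 2 * sum(o * math.log(o / (n_a * g))
                         for o, g in codons.values() if o > 0)
        present_dof += len(codons) - 1            # r_a - 1 degrees of freedom
    C = present_dof / L - 0.5                     # Equation 4, correction factor
    return total / L - C                          # Equation 2

# Toy gene whose observed counts match the expected frequencies exactly; it is
# far below the recommended 80-codon minimum and serves only to check the
# arithmetic (here MILC reduces to 0 - (2/20 - 0.5) = 0.4).
toy = {"Cys": {"UGU": (5, 0.5), "UGC": (5, 0.5)},
       "Lys": {"AAA": (5, 0.5), "AAG": (5, 0.5)}}
print(milc(toy))
```

Because absent amino acids are skipped both in the sum of M_a and in the degrees-of-freedom count, the correction automatically adapts to genes that lack some amino acids, as described above.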
Behaviour of codon usage measures under varying conditions
A multitude of methods to measure codon usage has been published, including "scaled χ2" [25], "effective number of codons" ENC [26], "codon bias index" CBI [27], "intrinsic codon bias index" ICDI [28], two versions of "codon bias" B [21], "maximum likelihood codon bias" MCB [22], "effective number of codons prime" ENC' [19], and "synonymous codon bias orderliness" SCUO [30]. Among those, we chose to test the methods that have either been frequently used in codon usage examinations, or that are new and have not been extensively tested [20].
ENC is an older, widely accepted measure that quantifies the degree of deviation from equal use of synonymous codons; ENC' gives results comparable to ENC but allows comparison to any desired codon distribution; the 1998 version of Karlin and Mrazek's B has been used extensively in later research of microbial genomes by the same authors; MCB is a method conceptually similar to B, used in examinations of human genes; and SCUO is a representative of the information theory-based measures, which have recently been used on several occasions [31] to analyze codon usage. Finally, the method proposed in this paper, MILC, is compared in performance to the aforementioned methodologies.
Figure 1 demonstrates the behaviour of the methods when examining genes of differing lengths. Pseudorandomly generated sequences (or 'genes') obtained using INCA [33] were used for testing under varying conditions (see Methods): Figures 1a, 1c and 1e show the performance (degree of misestimation) of the chosen measures at 5 different lengths, with 1b, 1d and 1f showing the standard deviations of the 10000 measurements performed at each length. In this aspect, our testing conditions resemble those previously used by Comeron and Aguade [20] or Novembre [19], the essential difference being the normalization and comparison of the results.

Here, the values are presented as percentages of the 'dynamic range' of a measure (the largest difference between its high and low values under realistic conditions; see Methods). We feel this is more reasonable than e.g. normalizing the mean of the sample at a certain length by simply dividing it by the value at 2500 codons, which (i) unfairly penalizes measures that approach zero as bias lessens, as opposed to those approaching an arbitrary value, e.g. 61 for ENC and ENC', and (ii) among the measures approaching zero, favours those displaying larger values at 2500 codons, in spite of this being an undesirable quality – the value should be as close to zero as possible. For instance, both B and MCB are meant to equal 0 when expected and observed codon frequencies match; in practice, however, at the length of 2500 codons B assumes a value of approx. 0.1, and MCB of 0.033 (Table , "None" dataset). Dividing the misestimation of each measure by these values would be unfairly advantageous for B; a more extreme example is ENC with its baseline value of 60.9. These issues are addressed by expressing the results as percentages of the dynamic range – a simple linear transformation essential for objective comparison of the methods' performances. However, when using a single measure to compare genes (or gene groups), or to determine association with other genomic data, it should not matter whether the normalization is performed or not: the relative distances of codon usage in two genes (or gene groups) would remain equal in both cases, and the degree of correlation with other genomic data would also not change.
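The normalization just described amounts to a simple linear rescaling of a measure's misestimation. A minimal sketch (the function and variable names are ours):

```python
def misestimation_pct(value, value_at_2500, dyn_lo, dyn_hi):
    """Misestimation of a codon usage measure, expressed as a percentage of
    its dynamic range.

    value:         the measure's mean at the length under test
    value_at_2500: its baseline mean at 2500 codons
    dyn_lo/dyn_hi: low and high ends of the measure's realistic range
                   (e.g. roughly 20..61 for ENC)
    """
    return 100.0 * (value - value_at_2500) / (dyn_hi - dyn_lo)

# Example with made-up numbers: a measure reading 6.0 at some short length,
# 1.0 at 2500 codons, with a dynamic range of 0..20.
print(misestimation_pct(6.0, 1.0, 0.0, 20.0))
```

Because this is a linear transformation, it changes neither the rank order of genes under a single measure nor its correlation with other genomic data, as noted above.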
Figure 1 Effect of sequence length on behaviour of codon usage measures. Figures 1a, 1c and 1e illustrate the degree of misestimation of a measure at varying sequence lengths (x axis), compared to the values at 2500 codons. The values were obtained by calculating …
Determining the 'dynamic range' for measures of codon usage
We designed three experiments to determine to what extent changing gene length affects each measure. In the first experiment (Figures 1a and 1b) the expected distribution assumes equal codon frequencies ("None", see Methods), and the generated sets of genes attempt to mimic that distribution. Therefore, the methods should ideally report a minimal distance between the observed and the expected distribution. ENC, ENC', MILC and MCB are generally well behaved under these conditions and tend to somewhat overestimate the amount of bias in short sequences; MCB overestimates bias in longer sequences as well. In contrast, B and SCUO greatly overestimate the bias in shorter genes (by "shorter" we mean the range of gene lengths most frequent in genomes, e.g. 100–500 codons). For example, using B on sequences 250 and 500 codons long would make the first sequence seem twice as distant from the expected distribution as the second one. Moreover, the overestimation at 250 codons may amount to as much as a quarter of the dynamic range of B. As anticipated, the variability of all measures (Figure 1b) decreases with increasing gene length. It must be noted that MCB measurements introduce significantly less noise than the rest of the methods, particularly in short genes.
The second experiment, where the overall amount of bias in both the generated sequences and the expected distribution increases (Figure 1c), shows little change regarding length dependence – all methods see a very modest improvement in performance. ENC now tends to slightly underestimate bias; however, the variability chart (Figure 1d) shows that here it becomes noticeably less reliable than the other methods, as does SCUO. MCB is still the best performer, followed by MILC and B for shorter sequences, and ENC' for longer ones.
Figures 1e and 1f, representing the third experiment, demonstrate what happens when a gene unbiased in codon usage differs from the biased expected codon frequencies, derived from the "Med-1" dataset (see Methods). This is, in fact, a situation more likely to occur in real-life applications, as a gene would probably show at least some deviation from the expected codon distribution. ENC and SCUO expectedly behave exactly as in 1a and 1b, because they by definition always assume an unbiased expected distribution. Interestingly, B improves significantly and is not influenced as much by gene length when the observed and expected codon distributions differ. It now performs on a par with ENC' and MCB, both of which are adversely affected by the increasing distance between the observed and the expected distribution. This factor also increases the amount of variation introduced by the measures (excluding ENC and SCUO), most of all ENC', and causes MCB to lose its advantage over MILC and B.
We have shown that ENC and ENC' display a drop in reliability as the overall amount of bias (measured by ENC, Figure 1d) or the difference in bias (measured by ENC', 1f) increases. The explanation is the cutoff value that both measures introduce [19], causing the distribution of the measurements to become asymmetrical and thereby artificially reducing the measures' variance when the observed codon distribution is close to the expected one. Having such a threshold might, in theory, mask biologically relevant information; for an example, see the ENC' plot in Figure 2.
Figure 2 Plots of the E. coli genome made using different measures of codon usage. The four plots were made using measures that allow an expected codon distribution to be specified: B, MCB, ENC' or MILC. The distance of codon usage of a gene from E. coli ribosomal …
Measures of codon usage introduce different levels of statistical bias in shorter genes; however, it must be noted that even if this influence were completely eliminated, there might still be a connection between codon bias and length caused by inherent properties of the sequences. Selection might act to optimize codon usage patterns (and therefore translational efficiency) in energetically costly longer genes; on the other hand, it might also act to reduce the size of highly expressed (and strongly biased) proteins [7]. The only way to nullify these length effects – if this is desired – is to use regression, while employing a length-insensitive measure.
In addition to being resistant to length variation, the methods should ideally be invariant to both overall bias and the relative difference in codon usage. Moreover, the measures should be commutative with respect to properties of the observed and expected distributions. We designed two experiments to investigate these issues.
Figure 3a shows the influence of the overall amount of codon bias ('background nucleotide composition') on the performance of the individual methods: we examined sets of 10000 sequences generated to match the expected frequencies at varying degrees of bias; the sequences were 2500 codons long in order to eliminate gene length effects. The baseline value was determined by comparing unbiased ("None") genes to unbiased ("None") expected frequencies. ENC and SCUO report greater differences from the baseline as the overall bias increases, which is anticipated since overall bias is exactly what these two methods attempt to quantify. The other methods' results should not vary between datasets. Indeed, ENC', MILC and MCB proved to be independent of this factor, while B only slightly decreases as overall bias rises.
Figure 3 Effect of overall amount of bias on behaviour of codon usage measures. Figure 3a describes the change in behaviour of each measure as the overall bias increases from unbiased ('None') to a nucleotide composition noted on the x axis. The values were obtained …
Furthermore, in order to test the commutative property, we used each measure to compare datasets with varying levels of bias to the "None" expected distribution, and vice versa. Theoretically, when using many long sequences, comparing "None" genes to, for instance, the "Med-1" expected distribution should yield the same result as comparing "Med-1" genes to the "None" expected distribution. In Figure 3b we show that, among the measures that allow such comparisons, the only one handling this appropriately was Karlin and Mrazek's B. MILC is less sensitive than ENC' and especially MCB, which displays a polar effect, being more strongly influenced by changes in the overall bias of the expected frequencies.
In genomes, individual amino acids may vary in the amount of codon bias, an occurrence termed 'codon bias discrepancy', best described by the phrase "some codons are more optimal than others" in Fuglsang's paper [34]. For instance, in E. coli the CGU and CGC codons for arginine are strongly preferred over the other four codons, while the six codons for serine are chosen more uniformly, with a mild preference for AGC over the others.
It has been implied that ENC may depend on the strength of the codon bias discrepancy [35], and the same limitations are expected to apply to ENC' due to the similarities in the calculation of the two statistics. Based on two frequency tables adopted from Fuglsang [35], representing examples of moderately biased codon distributions with and without discrepancy, we generated genes of varying lengths and compared them to a uniform distribution of codons. Figure 4a demonstrates that this amount of discrepancy causes most of the methods to moderately overestimate overall bias (10–15% of the dynamic range), while B is less affected by this change. Figure 4b illustrates a similar situation; this time, however, we performed the test using our own codon distribution, "Med-1d", which preserves the GC3s content of "Med-1" while introducing discrepancy (see Methods). All of the methods again overestimated bias, although to a lesser degree; the relations between the methods remain similar. It is still undetermined to what extent amino acids differ in degree of bias in real genomes, and our tests do not indicate too strong an influence of this issue on measures of codon usage.
Figure 4 Effect of codon bias discrepancy on behaviour of codon usage measures. The figure shows how the measures react to codon discrepancy, i.e. when the amino acids within a sequence differ in amounts of bias. The value on the y axis is the amount of overestimation …
Improving prediction of microbial gene expressivity
Analogous to Karlin and Mrazek's method of predicting expression levels of genes [36], we formulate a statistic named MELP (MILC-based Expression Level Predictor), computed simply as the ratio of the respective distances of a gene's codon usage from the genomic average and from a predefined reference set:

MELP(gene) = MILC(gene | genome) / MILC(gene | reference set)
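In code, this reduces to a one-line ratio of two precomputed MILC distances. The sketch below is ours, not the authors' implementation, and the input values in the example are hypothetical.

```python
def melp(milc_vs_genome, milc_vs_reference):
    """MELP = MILC(gene | genome) / MILC(gene | reference set).

    milc_vs_genome:    MILC distance of the gene from the average genomic
                       codon usage
    milc_vs_reference: MILC distance of the gene from the reference set
                       (e.g. ribosomal protein genes)
    Larger values indicate codon usage closer to the reference set than to
    the genomic average, i.e. higher predicted expressivity.
    """
    return milc_vs_genome / milc_vs_reference

# Hypothetical gene far from the genomic average (distance 1.0) but close to
# the ribosomal reference set (distance 0.5): high predicted expressivity.
print(melp(1.0, 0.5))
```

In practice both distances would come from a MILC implementation applied to the gene against the two expected codon distributions.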
This novel method of quantitatively predicting gene expressivity is then compared to the existing methods: CAI [37], E [36] and GCB [38]. Instead of testing for context-independence, as we did with the general measures of codon usage, we chose to rate the expression level predictors by how well they approximate real-world observations. We collected the datasets listed in Table (Methods), which consist of either mRNA or protein abundance data for unicellular organisms obtained by different methods – mostly cDNA microarrays, but also Affymetrix arrays (Pfa-2, and partly Sce-3 data), SAGE (also partly in Sce-3), and a number of quantitative proteomics techniques. This was done in order to assemble a collection of heterogeneous data large enough to allow a rough comparison of codon usage-based predictors of gene expression. Since we wanted to avoid making any assumptions about the distributions of data in each dataset, we used a nonparametric statistic, Spearman's (rank) correlation coefficient, to quantify agreement with the predicted expression levels (Figure 5). We also tried calculating Pearson (linear) correlation coefficients for the data, which in some cases showed significant improvement upon log-transforming the data; however, this effect was not observed consistently among datasets or expression predictors [see Additional file 1].
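For tie-free data, Spearman's rank correlation can be computed with the classic formula ρ = 1 − 6Σd²/(n(n² − 1)), where d is the difference between the paired ranks. The sketch below is our own; it ignores tied ranks, which real abundance data would require handling (e.g. via midranks, as scipy.stats.spearmanr does).

```python
def spearman_rho(x, y):
    """Spearman's rank correlation coefficient for two equal-length samples
    without tied values: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)

    def ranks(values):
        # rank 1 = smallest value; assumes no ties
        order = sorted(range(n), key=lambda i: values[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Perfectly monotone agreement between predicted and measured expression
# (made-up numbers) gives rho = 1.
print(spearman_rho([0.2, 1.5, 0.7, 3.1], [10, 400, 55, 900]))
```

Because only ranks enter the statistic, it makes no assumption about the distribution of the abundance values, which is why it suits the heterogeneous datasets described above.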
Transcript/protein abundance data used for validation of expression level predictors
Figure 5 Performance of codon usage-based expression level predictors. Height of the columns shows the Spearman's (rank) correlation coefficient for each gene expression dataset / predictor combination. Error bars illustrate the change in success of the prediction …
The agreement of predicted and actual protein/transcript levels varied greatly across the examined combinations of prediction method and dataset. The cause may lie in the quality of the experimental data; for instance, mRNA abundances and protein 2D-PAGE data have been shown not to agree well in certain cases [39]; 2D-PAGE as a method may only be suitable for the detection of abundant proteins [40], while microarray data tends to suffer from noise introduced at each step of the different experimental protocols [41]. The other probable reason for the relatively incoherent results is that a model for predicting gene expression from genomic data based solely on codon usage is oversimplified; other factors, such as promoter strength and gene copy number, should also be taken into account. Fortunately, optimal codon usage in genes seems to coincide with factors enhancing transcription – this is why it is possible to observe a correlation between codon usage (acting at the translation level) and transcript abundances. Keeping these limitations in mind, it seems safe to say that, in comparison to the other predictors, GCB and MELP behave more consistently across all datasets.
Transcript and/or protein levels in a cell are normally subject to regulation, as opposed to codon usage patterns, which are 'hard-coded' in the genome sequence. If we suppose that the major force shaping gene-specific codon usage patterns in microbes is selection for translation efficiency, which operates in periods of fast competitive growth, it follows that codon usage will be 'optimised' for genes highly expressed in such periods. For that reason we chose datasets of organisms harvested in the exponential growth phase, and without severe nutritional restrictions in the medium. For instance, the Bsu-2 dataset describes Bacillus harvested at OD600 0.4–0.6; an analogous dataset [see Additional file 1] for bacteria harvested at OD600 1.1 does not correlate as well with the predicted expression levels (Pearson's correlation coefficient for MELP = 0.234 vs. 0.187, for GCB = 0.277 vs. 0.185). In addition, the growth conditions should match the organism's natural habitat. For instance, E. coli grown in a rich medium has gene expression levels closer to the predicted values than E. coli in a defined medium; should the data in the Eco-2 dataset be replaced with data from MOPS+glucose grown cells [see Additional file 1], the Pearson's correlation coefficient for log-transformed data drops from 0.720 to 0.663 (MELP), or from 0.708 to 0.642 (GCB). Furthermore, nitrogen or phosphorus starvation of E. coli in the Eco-3 dataset reduces the correlation with the predicted values (data not shown). Such connections between codon usage and gene expression under different conditions can be used to hypothesize about the exact 'natural' environment of a microbe [42].
Any codon usage-based prediction of gene expression relies on a prior definition of a 'reference set' consisting of highly expressed genes. Our reference sets were defined as all genes coding for ribosomal proteins and longer than 100 codons; other approaches to this issue exist. For instance, the original definition of CAI [37] listed a set of genes that had been empirically proven to be highly expressed in yeast and E. coli; Karlin and Mrazek [36] included transcription/translation-related factors and chaperones in the reference set, in addition to the ribosomal protein genes; and attempts have been made to detect major trends in codon usage by iterative computational methods [38] and use the results to define a reference set. We investigated to what extent reference set composition affects the prediction of gene expression; the alternative reference sets used were obtained from Merkl [44] and were generated by computationally detecting the major trend in codon usage in a genome. These sets normally contained ribosomal protein genes, elongation factors and energy metabolism genes, as well as photosynthesis genes in Synechocystis and histones in P. falciparum; such functional assignments for reference set genes were not unexpected. Under the assumption that the major trend is due to translational selection, the change in reference set composition should theoretically have resulted in improved prediction. However, the outcome was highly dependent on the genome examined and the predictor used (shown as error bars in Figure 5). In some instances, the use of the alternative reference set resulted in poorer correlation. More high-quality transcript/protein abundance data would be required to reach a definite recommendation on forming a reference set.