The genetic code is nearly universal, and the arrangement of the codons in the standard codon table is highly non-random. The three main concepts on the origin and evolution of the code are the stereochemical theory, according to which codon assignments are dictated by physico-chemical affinity between amino acids and the cognate codons (anticodons); the coevolution theory, which posits that the code structure coevolved with amino acid biosynthesis pathways; and the error minimization theory under which selection to minimize the adverse effect of point mutations and translation errors was the principal factor of the code’s evolution. These theories are not mutually exclusive and are also compatible with the frozen accident hypothesis, i.e., the notion that the standard code might have no special properties but was fixed simply because all extant life forms share a common ancestor, with subsequent changes to the code, mostly, precluded by the deleterious effect of codon reassignment. Mathematical analysis of the structure and possible evolutionary trajectories of the code shows that it is highly robust to translational misreading but there are numerous more robust codes, so the standard code potentially could evolve from a random code via a short sequence of codon series reassignments. Thus, much of the evolution that led to the standard code could be a combination of frozen accident with selection for error minimization although contributions from coevolution of the code with metabolic pathways and weak affinities between amino acids and nucleotide triplets cannot be ruled out. However, such scenarios for the code evolution are based on formal schemes whose relevance to the actual primordial evolution is uncertain. A real understanding of the code origin and evolution is likely to be attainable only in conjunction with a credible scenario for the evolution of the coding principle itself and the translation system.
Shortly after the genetic code of Escherichia coli was deciphered (1), it was recognized that this particular mapping of 64 codons to 20 amino acids and two punctuation marks (start and stop signals) is shared, with relatively minor modifications, by all known life forms on Earth (2, 3). Even a perfunctory inspection of the standard genetic code table (Fig. 1) shows that the arrangement of amino acid assignments is manifestly nonrandom (4–7). Generally, related codons (i.e., the codons that differ by only one nucleotide) tend to code for either the same or two related amino acids, i.e., amino acids that are physico-chemically similar (although there are no unambiguous criteria to define physicochemical similarity). The fundamental question is how these regularities of the standard code came into being, considering that there are more than 10^84 possible alternative code tables if each of the 20 amino acids and the stop signal are to be assigned to at least one codon. More specifically, the question is, what kind of interplay of chemical constraints, historical accidents, and evolutionary forces could have produced the standard amino acid assignment, which displays many remarkable properties.
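The figure of more than 10^84 can be checked by counting the ways to map 64 codons onto 21 labels (20 amino acids plus the stop signal) so that every label is used at least once. The sketch below does this by inclusion-exclusion. The function name `count_code_tables` is ours, and the calculation assumes that two tables count as different whenever any codon carries a different label.

```python
from math import comb


def count_code_tables(n_codons: int = 64, n_labels: int = 21) -> int:
    """Count maps from codons onto labels in which every label is used.

    Inclusion-exclusion over the number k of labels that are left unused.
    """
    return sum(
        (-1) ** k * comb(n_labels, k) * (n_labels - k) ** n_codons
        for k in range(n_labels + 1)
    )


n_tables = count_code_tables()
print(f"{n_tables:.3e}")  # on the order of 10^84
```

Python integers are exact, so the count is not affected by rounding; only the printed form is rounded.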
The features of the code that seem to require a special explanation include, but are not limited to, the block structure of the code, which is thought to be a necessary condition for the code’s robustness with respect to point mutations, translational misreading, and translational frame shifts (8); the link between the second codon letter and the properties of the encoded amino acid so that codons with U in the second position correspond to hydrophobic amino acids (9, 10); the relationship between the second codon position and the class of aminoacyl-tRNA synthetase (11), the negative correlation between the molecular weight of an amino acid and the number of codons allocated to it (12, 13); the positive correlation between the number of synonymous codons for an amino acid and the frequency of the amino acid in proteins (14, 15); the apparent minimization of the likelihood of mistranslation and point mutations (16, 17); and the near optimality for allowing additional information within protein coding sequences (18).
When considering the evolution of the genetic code, we proceed under several basic assumptions that are worth spelling out. It is assumed that there are only 4 nucleotides and 20 encoded amino acids (with the notable exception of selenocysteine and pyrrolysine, for which subsets of organisms have evolved special coding schemes (19), see also discussion below) and that each codon is a triplet of nucleotides. It has been argued that movement in increments of three nucleotides is a fundamental physical property of RNA translocation in the ribosome so that the translation system originated as a triplet-based machine (20–22). Obviously, this does not rule out the possibility that, e.g., only two nucleotides in each codon are informative (see, e.g., (23–26) for hypotheses on the evolution of the code through a “doublet” phase). Questions on why there are four standard nucleotides in the code (27, 28) or why the standard code encodes 20 amino acids (29–31) are fully legitimate. Conceivably, theories on the early phases of the evolution of the code should be constrained by the minimal complexity that is required of a self-replicating system (e.g., (32)). However, this fascinating area of enquiry is beyond the scope of this review, and for the present discussion we adopt the above fundamental numbers as assumptions. With these premises, we here attempt to critically assess and synthesize the main lines of evidence and thinking about the code’s nature and evolution.
The code expansion theory proposed in Crick’s seminal paper posits that the actual allocation of amino acids to codons is mainly accidental and ‘yet related amino acids would be expected to have related codons’ (6). This concept is known as ‘frozen accident theory’ because Crick maintained, following the earlier argument of Hinegardner and Engelberg (2) that, after the primordial genetic code expanded to incorporate all 20 modern amino acids, any change in the code would result in multiple, simultaneous changes in protein sequences and, consequently, would be lethal, hence the universality of the code. Today, there is ample evidence that the standard code is not literally universal but is prone to significant modifications, albeit without change to its basic organization.
Since the discovery of codon reassignment in human mitochondrial genes (33), a variety of other deviations from the standard genetic code in bacteria, archaea, eukaryotic nuclear genomes and, especially, organellar genomes have been reported, with the latest census counting over 20 alternative codes (34–38). All alternative codes are believed to be derived from the standard code (35); together with the observation that many of the same codons are reassigned (compared to the standard code) in independent lineages (e.g., the most frequent change is the reassignment of the stop codon UGA to tryptophan), this conclusion implies that there should be predisposition towards certain changes; at least one of these changes was reported to confer selective advantage (39).
The underlying mechanisms of codon reassignment typically include mutations in tRNA genes, where a single nucleotide substitution directly affects decoding (40), base modification (41), or RNA editing (42) (reviewed in (35)). Another pathway of code evolution is recruitment of non-standard amino acids. The discovery of the 21st amino acid, selenocysteine, and the intricate molecular machinery that is involved in the incorporation of selenocysteine into proteins (43) initially has been considered a proof that the current repertoire of amino acids is extremely hard to change. However, the subsequent discovery of the second non-canonical amino acid, pyrrolysine, and, importantly, the existence of a pyrrolysine-specific tRNA revealed additional malleability of the code (19, 44). In addition to the variations on the standard code discovered in organisms with minimized genomes, many experimental attempts on code modification and expansion have been reported (45). Recently, a general method has been developed to encode the incorporation of unnatural amino acids in genomes by recruiting either one of the stop codons or a subset of a codon series for a particular amino acid and engineering the cognate tRNA and aminoacyl-tRNA synthetase (46). The application of this methodology has already allowed incorporation in E. coli proteins of over 30 unnatural amino acids, in a striking demonstration of the potential malleability of the code (45, 46).
Three major theories have been suggested to explain the changes in the code. The ‘codon capture’ theory (47, 48) proposes that, under mutational pressure to decrease genomic GC-content, some GC-rich codons might disappear from the genome (particularly, a small, e.g., organellar, genome). Then, due to random genetic drift, these codons would reappear and would be reassigned as a result of mutations in non-cognate tRNAs. This mechanism is essentially neutral, i.e., codon reassignment would occur without generation of aberrant or non-functional proteins.
Another concept of code alteration is the ‘ambiguous intermediate’ theory which posits that codon reassignment occurs through an intermediate stage where a particular codon is ambiguously decoded by both the cognate tRNA and a mutant tRNA (49, 50). An outcome of such ambiguous decoding and the competition between the two tRNAs could be eventual elimination of the gene coding for the cognate tRNA and takeover of the codon by the mutant tRNA (37, 51). The same mechanism might also apply to reassignment of a stop codon to a sense codon, when a tRNA that recognizes a stop codon arises by mutation and captures the stop codon from the cognate release factor. Under the ambiguous intermediate hypothesis, a significant negative impact on the survival of the organism could be expected but the finding that the CUG codon (normally coding for leucine) in the fungus Candida zeylanoides is decoded as either leucine (3–5%) or serine (95–97%) gave credence to this scenario (37, 52).
Finally, evolutionary modifications of the code have been linked to ‘genome streamlining’ (53, 54). Under this hypothesis, the selective pressure to minimize mitochondrial genomes yields reassignments of specific codons, in particular, one of the three stop codons.
The three theories explaining codon reassignment are not exclusive considering that the ‘ambiguous intermediate’ stage can be preceded by a significant decrease in the content of GC-rich codons, so that codon reassignment might be driven by a combination of evolutionary mechanisms (55), often under the pressure for genome minimization, especially, in organellar genomes and small genomes of parasitic bacteria such as mycoplasmas (38, 54, 56, 57).
The existence of variant codes and the success of experiments on the incorporation of unnatural amino acids briefly discussed in the preceding section indicate that the genetic code has a degree of evolvability. However, all these deviations involve only a few codons, so in its main features, the structure of the code seems not to have changed through the entire history of life or, more precisely, at least, since the time of the Last Universal Common Ancestor (LUCA) of all modern (cellular) life forms. This universality of the genetic code and the manifest non-randomness of its structure cry out for explanation. Of course, Crick’s frozen accident/code expansion theory can be considered a default explanation that does not require any special mechanisms and is only predicated on the existence of a LUCA with an advanced translation system resembling the modern one (that is, the implicit assumption is that LUCA was not a “progenote” with primitive, very inaccurate translation (58)). However, this explanation is often considered unsatisfactory, first, on the most general, epistemological grounds, because it is, in a sense, a non-explanation, and second, because the existence of variant codes and the additional, experimentally revealed flexibility of the code (see above) present a challenge to the frozen-accident view. Indeed, the fact that there seem to be ways to “sneak in” changes to the standard code, and yet, the same limited modifications seem to have evolved independently in diverse lineages suggests that the code structure could be non-accidental. Three main theories, not necessarily mutually exclusive, have been proposed in attempts to attribute the pattern of amino acid assignments in the standard genetic code to physico-chemical or biological factors or a combination thereof.
Rather remarkably, the central ideas of each of these theories have been formulated during the classic age of molecular biology, not long after the code was deciphered or even earlier, and despite numerous subsequent developments, remain relevant to this day. We first briefly outline the three theories in their respective historical contexts and then discuss the current status of each.
Extensive early experimentation has detected, at best, weak and relatively non-specific interactions between amino acids and their cognate triplets (5, 73, 74). Nevertheless, it is not unreasonable to argue that even a relatively weak, moderately selective affinity between codons (anticodons) and the cognate amino acids could have been sufficient to precipitate the emergence of the primordial code that subsequently evolved into the modern code in which the specificity is maintained by much more precise and elaborate, indirect mechanisms involving tRNAs and aminoacyl-tRNA synthetases. Furthermore, it can be argued that interactions between amino acids and triplets are strong enough for detection only within the context of specific RNA structures that ensure the proper conformation of the triplet; this could be the cause of the failure of straightforward experiments with trinucleotides or the corresponding polynucleotides. Indeed, the modern version of the stereochemical theory, the ‘escaped triplet theory’, posits that the primordial code functioned through interactions between amino acids and cognate triplets that resided within amino-acid-binding RNA molecules (75). The experimental observations underlying this theory are that short RNA molecules (aptamers) selected from random sequence mixtures by amino-acid-binding were significantly enriched with cognate triplets for the respective amino acids (76, 77). Among the 8 tested amino acids (phenylalanine, isoleucine, histidine, leucine, glutamine, arginine, tryptophan, and tyrosine) (75), only glutamine showed no correlation between the codon and the selected aptamers. The straightforward statistical test applied in these analyses indicated that the probability of obtaining the observed correlation between the codons and the sequences of the selected aptamers by chance was extremely low; the most convincing results were seen for arginine (75).
However, more conservative statistical procedures (applied to earlier aptamer data) suggest that the aptamer-codon correlation could be a statistical artifact (78) (but see (79)).
A different kind of statistical analysis has been employed to calculate how unusual the standard code is, given the aptamer-amino-acid binding data (75, 77). A comparison of the standard code with random alternatives has shown that only a tiny fraction of random codes displayed a stronger correlation with the aptamer selection data than the standard code (the real genetic code has greater codon association than 90.3% of random codes, and greater anticodon association than 99.8% of random codes). The premises of this calculation can be disputed, however, because the standard code has a highly non-random structure, and one could argue that only comparison with codes of similar structure is relevant, in which case the results of aptamer selection might not come out as being significant.
On the whole, it appears that the aptamer experiments, although suggestive, fail to clinch the case for the stereochemical theory of the code. As noticed above, the affinities are rather weak, so that even the conclusions on their reality hinge on the adopted statistical models. Even more disturbing, for different amino acids, the aptamers show enrichment for either codon or anticodon sequence or even for both (75), a lack of coherence that is hard to reconcile with these interactions being the physical basis of the code.
Quantitative evidence in support of the translation-error minimization hypothesis has been inferred from comparison of the standard code with random alternative codes. For any code, its cost can be calculated using the following formula:
$$\phi(a(c)) = \sum_{c} \sum_{c'} p(c' \mid c)\, d(a(c'), a(c)) \qquad \text{(I)}$$
where a(c) : C → A is a given code, i.e., a mapping of the 64 codons c ∈ C to the 20 amino acids and the stop signal a(c) ∈ A; p(c′ | c) is the relative probability to misread codon c as codon c′, and d(a(c′), a(c)) is the cost associated with the exchange of the cognate amino acid a(c) with the misincorporated amino acid a(c′). Under this approach, the lower the cost ϕ(a(c)), the more robust the code is with respect to mistranslations, i.e., the greater the code’s fitness.
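The cost described here, a sum over codon pairs of the misreading probability times the substitution cost, can be sketched in a few lines. The sketch rests on two simplifying assumptions of ours, neither taken from the cited studies: every single-nucleotide misreading is equally likely (p = 1/9), and the substitution cost d is a bare 0/1 mismatch indicator rather than a physico-chemical distance. All function names are illustrative.

```python
import random

BASES = "UCAG"
CODONS = [x + y + z for x in BASES for y in BASES for z in BASES]
# Standard code in UCAG order; '*' marks the stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(CODONS, AMINO))


def neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def uniform_p(c_prime, c):
    """Assumed misreading model: each single-nucleotide neighbor has p = 1/9."""
    return 1.0 / 9.0


def mismatch_d(a_prime, a):
    """Assumed toy cost: 0 for identical labels, 1 otherwise."""
    return 0.0 if a_prime == a else 1.0


def code_cost(code, p=uniform_p, d=mismatch_d):
    """Sum over c and its one-step neighbors c' of p(c'|c) * d(a(c'), a(c))."""
    return sum(
        p(cp, c) * d(code[cp], code[c]) for c in CODONS for cp in neighbors(c)
    )


# Comparison code: the same 64 labels assigned to codons in random order.
rng = random.Random(0)
labels = list(AMINO)
rng.shuffle(labels)
scrambled = dict(zip(CODONS, labels))

print(code_cost(STANDARD), code_cost(scrambled))
```

With a 0/1 cost, only the block structure of the code matters, so this toy version cannot distinguish codes that share the standard block layout. A real analysis would pass a property-based `d` and a position-dependent `p`.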
The first reasonably reliable numerical estimates of the fraction of random codes that are more robust than the standard code have been obtained by Haig and Hurst (16) who showed that, under the assumption that any misreadings between two codons that differ by one nucleotide are equally probable, and if the polar requirement scale (80) is employed as the measure of physicochemical similarity of amino acids, the probability that a random code is fitter than the standard one is P1 ≈ 10^−4. Using a refined cost function that took into account the non-uniformity of codon positions and base-dependent transition bias, Freeland and Hurst have shown that the fraction of random codes that outperform the standard one is P2 ≈ 10^−6, i.e., ‘the genetic code is one in a million’ (81). Subsequent analyses have yielded even higher estimates of error minimization of the standard code (15, 17, 82, 83).
Despite the convincing demonstration of the high robustness to misreadings of the standard code, the translation-error minimization hypothesis seems to have some inherent problems. First, to obtain any estimate of a code’s robustness, it is necessary to specify the exact form of the cost function (I) that, even in its simplest form, consists of a specific matrix of codon misreading probabilities and specific costs associated with the amino acid substitutions. The form of the matrix p(c′| c) proposed by Freeland et al. (81) is widely used (e.g., (15, 83–86)) but the supporting data are scarce. In particular, it has been convincingly shown that mistranslation in the first and third codon positions is more common than in the second position (65, 87, 88), but the transition-biased misreading in the second position is hard to justify from the available data. In part, to overcome this problem, Ardell and Sella formulated the first population-genetic model of code evolution where the changes in genomic content of a population are modeled along with the code changes (89–91). This approach is a generalization of the adaptive concept of code evolution that unifies the lethal-mutation and translation-error minimization hypotheses and incorporates the well-known fact that, among mutations, transitions are far more frequent than transversions (92, 93). Essentially, the Ardell-Sella model describes coevolution of a code with genes that utilize it to produce proteins and explicitly takes into account the “freezing effect” of genes on a code that is due to the massive deleterious effect of code changes (90). Under this model, evolving codes tend to “freeze” in structures similar to that of the standard code and having similar levels of robustness.
Another problem with the function (I) is that it relies on a measure of physicochemical similarity of amino acids. It is clear that any one such measure cannot be totally adequate. The amino acid substitution matrices such as PAM that are commonly used for amino acid sequence comparison appear not to be suitable for the study of the code evolution because these matrices have been derived from comparison of protein sequences that are encoded by the standard code, and hence cannot be independent of that code (94). Therefore one must use a code-independent matrix derived from a first-principle comparison of physico-chemical properties of amino acids, such as the polar requirement scale (80). However, the number of possible matrices of this kind is enormous, and there are no clear criteria for choosing the “best” one. Thus, arbitrariness is inherent in the matrix selection, and its effect on the conclusions on the level of optimization of a code is hard to assess.
A potentially serious objection to the error-minimization hypothesis (95) is that, although the estimates of P1 and P2 indicate that the standard code outperforms most random alternatives, the number of possible codes that are fitter (more robust) than the standard one is still huge (it should be noted that estimates of the code robustness rely on the employed randomization procedure; the one most frequently used involves shuffling of amino acid assignments between the synonymous codon series that are intrinsic to the standard code, so that 20! ≈ 2.4 · 10^18 possible codes are searched; different random code generators can produce substantially different results (86)). It has been suggested that, if selection for minimization of translation error effect was the principal force of code evolution, the relative optimization level for the standard code would be significantly higher than observed (96). The counterargument offered by supporters of the error-minimization hypothesis is that the distribution of random code costs is bell-shaped, where more robust codes form a long tail, so because the process of adaptation is non-linear, approaching the absolute minimum is highly improbable (17).
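The most frequently used randomization procedure mentioned here, shuffling the 20 amino acids among the synonymous codon series of the standard code, can be sketched as follows. The helper name `shuffled_code` is ours, and the sketch assumes that the stop codons stay where they are, which is what gives a search space of 20! codes.

```python
import math
import random

BASES = "UCAG"
CODONS = [x + y + z for x in BASES for y in BASES for z in BASES]
# Standard code in UCAG order; '*' marks the stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(CODONS, AMINO))


def shuffled_code(code, rng):
    """Permute the 20 amino acids among the codon series of `code`.

    Each codon series keeps its position and size in the table; only the
    amino acid attached to it changes. Stop codons are left untouched.
    """
    amino_acids = sorted(set(code.values()) - {"*"})
    permuted = amino_acids[:]
    rng.shuffle(permuted)
    relabel = dict(zip(amino_acids, permuted))
    relabel["*"] = "*"
    return {codon: relabel[aa] for codon, aa in code.items()}


print(math.factorial(20))  # size of the search space, about 2.4e18
variant = shuffled_code(STANDARD, random.Random(1))
```

Because the block layout is preserved, every code generated this way has the same pattern of synonymous codons as the standard code and differs only in which amino acid occupies which block.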
It has been suggested that the apparent code robustness could be a by-product of evolution that was driven by selective forces that have nothing to do with error minimization (97). Specifically, it has been shown that the non-random assignments of amino acids in the standard code can be almost completely explained by incremental code evolution by codon capture or ambiguity reduction processes. However, this conclusion relies on the exact order of amino acids recruitment to the genetic code (98, 99), primarily, on a specific interpretation of the evolution of biosynthetic pathways for amino acids, which remains a controversial issue.
Regardless of the exact nature of the selective forces that had the greatest effect on the evolution of the code, it is a fact that the standard code is substantially robust to translational misreadings as well as mutations. Thus, it seems to be of considerable importance to determine, as objectively as possible, the level of the code’s optimization. Intriguing questions associated with this problem are how much evolution the standard code underwent and what would be the most likely starting point for such evolution.
Estimates on the total level of code optimization have a long history. The straightforward comparison can be made between the standard code and the most robust code with respect to the mean cost value of random codes. This measure of the optimization level was dubbed the minimization percentage (100, 101); more precisely, MP = (ϕmean − ϕstand )/(ϕmean − ϕmin), where ϕmean is the mean cost of random codes, ϕstand is the cost of the standard code, ϕmin is the cost of the most optimal code [all values are calculated given a particular cost function of the form (I)]. The minimization percentage of the standard code has been estimated at ~70% when the polar requirement scale is used as the measure of amino acid exchangeability (96, 101). Figure 2 shows an example of a code that was optimized for robustness to translation errors by swapping codon assignments for amino acids to minimize the value of the cost function given by formula (I). With respect to this code, the minimization percentage of the standard code is 78% (this MP value is somewhat higher than those reported by Di Giulio (96) because a more realistic misreading matrix p(c′ | c) was employed).
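The minimization percentage is a simple ratio, and a small helper makes the arithmetic explicit. The function name and the numbers in the example are hypothetical, chosen only to show how the three costs combine; they are not values from the cited studies.

```python
def minimization_percentage(phi_mean, phi_stand, phi_min):
    """MP = (phi_mean - phi_stand) / (phi_mean - phi_min).

    Returns a fraction: 0 means the standard code is no better than the
    average random code, 1 means it matches the most robust code found.
    """
    if phi_mean == phi_min:
        raise ValueError("phi_mean and phi_min must differ")
    return (phi_mean - phi_stand) / (phi_mean - phi_min)


# Hypothetical cost values, used only to illustrate the arithmetic.
print(minimization_percentage(phi_mean=10.0, phi_stand=4.0, phi_min=2.0))  # 0.75
```

Note that the result depends on the cost function and on how the minimum is found, which is why different misreading matrices give different MP values for the same code.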
Recently, we explored possible evolutionary trajectories of the genetic code within a limited domain of the vast space of possible codes (only codes that possess the same block structure and the same level of degeneracy as the standard code were analyzed) (86). The assumption behind the choice of this small part of the vast code space is that, at an early stage of the evolution of the code, its block structure was fixed (“froze”) in the current form that could not be changed without a dramatic deleterious effect (a notion that is obviously related to Crick’s frozen accident). Thus, we employed a straightforward, greedy evolutionary algorithm, with elementary steps comprising swaps of amino acid assignments between four-codon or two-codon series, to investigate the level of code optimization. The properties of the standard code were compared with the properties of four sets of random codes (purely random codes, random codes whose robustness is greater than that of the standard code, and two sets of codes that resulted from optimization of the first two sets). Under this model, the code fitness landscape is extremely rugged, so that almost any random code yields its own local maximum. Rather unexpectedly, starting from a random code, the level of optimization of the standard code can be easily achieved with 10–12 evolutionary steps on average, and often, optimization can be continued to reach the level that is attainable when the optimization starts from the standard code. When the starting point is a random code that is more robust than the standard one, the optimization procedure yields much higher levels of optimization than that reachable from the standard code, i.e., the standard code is much closer to its local fitness peak than most of the random codes with similar levels of robustness. Comparison of the standard code with the four described sets of codes shows that the standard code is very close to the set of optimized random codes. 
Thus, the standard genetic code appears to be a point that is located about half way (measured in the number of codon series swaps) along an upward evolutionary trajectory from a random code to the summit of the respective local peak. Moreover, this peak is rather mediocre, with a huge number of taller peaks existing in the landscape (Fig. 3). It should be emphasized that, under this model, the standard code is not locally stable, that is, it can be readily “improved” by a small perturbation (an additional swap). Thus, under the assumption that the function (I) is an adequate measure of the code fitness, it is hard to attribute the lack of further optimization of the standard code to anything other than frozen accident.
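The greedy search described above can be illustrated with a much simplified sketch. It departs from the published procedure in several ways that should be kept in mind: a swap here exchanges the complete codon sets of two amino acids rather than individual four-codon or two-codon series, misreadings are uniform over single-nucleotide neighbors, and `TOY_SCALE` is an arbitrary pseudo-random property scale, not real chemical data. All names are ours.

```python
import itertools
import random

BASES = "UCAG"
CODONS = [x + y + z for x in BASES for y in BASES for z in BASES]
# Standard code in UCAG order; '*' marks the stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(CODONS, AMINO))
AMINO_ACIDS = sorted(set(AMINO) - {"*"})

# TOY property scale: arbitrary pseudo-random numbers, NOT real chemical data.
_rng = random.Random(0)
TOY_SCALE = {aa: _rng.uniform(0.0, 10.0) for aa in AMINO_ACIDS}


def neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


def cost(code, scale=TOY_SCALE):
    """Sum of squared property differences over one-step misreadings (p = 1/9).

    Pairs that involve a stop codon are skipped.
    """
    total = 0.0
    for c in CODONS:
        if code[c] == "*":
            continue
        for cp in neighbors(c):
            if code[cp] != "*":
                total += (scale[code[c]] - scale[code[cp]]) ** 2 / 9.0
    return total


def swap_series(code, a, b):
    """Return a copy of `code` with the codon sets of amino acids a and b exchanged."""
    swap = {a: b, b: a}
    return {c: swap.get(aa, aa) for c, aa in code.items()}


def greedy_optimize(code, scale=TOY_SCALE):
    """Apply the best cost-reducing swap until none helps; return (code, steps)."""
    steps = 0
    current = cost(code, scale)
    while True:
        best = min(
            (swap_series(code, a, b)
             for a, b in itertools.combinations(AMINO_ACIDS, 2)),
            key=lambda candidate: cost(candidate, scale),
        )
        best_cost = cost(best, scale)
        if best_cost >= current - 1e-12:  # no swap improves the code
            return code, steps
        code, current, steps = best, best_cost, steps + 1


optimized, n_steps = greedy_optimize(STANDARD)
print(n_steps, cost(STANDARD), cost(optimized))
```

The search stops at the first code that no single swap can improve, that is, at a local peak of this toy landscape. Starting it from shuffled codes instead of `STANDARD` would give a rough feel for how rugged such a landscape is, although the numbers would carry no biological meaning with a synthetic scale.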
The coevolution theory (reviewed in (71, 102, 103)) postulates that prebiotic synthesis could not produce 20 modern amino acids, so a subset of the amino acids had to be produced through biosynthetic pathways before they could be co-opted into the genetic code and translation; hence coevolution of the code and amino acid metabolism (104). Therefore codon allocations to amino acids could have been guided by metabolic connections between the amino acids. According to the coevolution theory, there were three main phases of amino acid entry into the genetic code: the first (phase 1) amino acids came from prebiotic synthesis, phase 2 amino acids entered the code by means of biosynthesis from the phase 1 amino acids, and phase 3 amino acids are introduced into proteins through post-translational modifications (105). The particular choice of phase 1 amino acids (Fig. 4) is supported by a survey of a variety of criteria used to infer the likely order of amino acid appearance (98) (with one exception), and by the list of amino acids produced by high energy proton irradiation of a carbon monoxide-nitrogen-water mixture (106). Under the coevolution theory, evolution of metabolic pathways is an important source of new amino acids. Given the precursor-product pairs of amino acids, the allocation of amino acids in the standard code is almost impossible to obtain by chance (Fig. 4). Experiments demonstrating that the amino acid composition of proteins is evolvable are construed as supporting the coevolution theory. For instance, it has been shown that Bacillus subtilis could be mutated to replace its Tryptophan by 4-fluoroTrp, and even further to displace Trp completely (107).
Two major criticisms of the coevolution theory have been put forward. First, the coevolution scenario is very sensitive to the choice of amino acid precursor-product pairs, and the choice of these pairs is far from being straightforward. Indeed, in the original formulation of the coevolution theory, Wong did not directly use biochemically established relationships between amino acids but instead employed inferred reactions of primordial metabolism that remain debatable (70, 103). Amirnovin (108) generated a large set of random codes and found that, if the original 8 precursor-product pairs proposed by Wong (70) are considered, the standard code shows a substantially higher codon correlation score (a measure that counts the number of adjacent codons coding for precursor-product amino acids) than most of the random codes (only 0.1% of random codes perform better). However, after the pairs Gln-His and Val-Leu are removed (the validity of the latter pair has been questioned (109)), the proportion of better random codes rises to 3.6%, and if the precursor-product pairs are taken from the well-characterized metabolic pathways of E. coli, the proportion of random codes that show a stronger correlation reaches 34%. Second, the biological validity of the statistical analysis of Wong (70) appears dubious (109). Ronneberg et al., together with a consistent definition of amino acid precursor-product pairs, suggested that, according to the wobble rule, the genetic code contains not 61 functional codons coding for amino acids, but 45 codons, where each two codons of the form NNY are considered as one because no known tRNA can distinguish codons with U or C in the third base position. Under this assumption, there was no statistical support for the coevolution scenario of the evolution of the code (109) (but see (110)).
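The count of 45 follows directly from the standard code table: of the 61 sense codons, the 32 that end in U or C collapse into 16 NNY classes, and the remaining 29 stay separate. The short sketch below reproduces this arithmetic; the function name `wobble_classes` is ours.

```python
BASES = "UCAG"
CODONS = [x + y + z for x in BASES for y in BASES for z in BASES]
# Standard code in UCAG order; '*' marks the stop signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = dict(zip(CODONS, AMINO))


def wobble_classes(code):
    """Merge sense codons NNU and NNC into a single class written NNY."""
    sense = [c for c in CODONS if code[c] != "*"]
    return {c[:2] + "Y" if c[2] in "UC" else c for c in sense}


n_sense = sum(1 for c in CODONS if STANDARD[c] != "*")
print(n_sense, len(wobble_classes(STANDARD)))  # 61 45
```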
As discussed above, despite a long history of research and accumulation of considerable circumstantial evidence, none of the three major theories on the nature and evolution of the genetic code is unequivocally supported by the currently available data. It appears premature to claim, e.g., that ‘the coevolution theory is a proven theory’ (103), or ‘There is very significant evidence that cognate codons and/or anticodons are unexpectedly frequent in RNA-binding sites […]. This suggests that a substantial fraction of the genetic code has a stereochemical basis’ (75). Is it conceivable that each of these theories captures some aspects of the code’s origin and evolution, and combined, they could yield a more realistic picture? In principle, it is not difficult to speculate along these lines, for instance, by imagining a scenario whereby first abiogenically synthesized amino acids captured their cognate codons owing to their respective stereochemical affinities, after which the code expanded according to the coevolution theory, and finally, amino acid assignments were adjusted under selection to minimize the effect of translational misreadings and point mutations on the genome. Such a composite theory is extremely flexible and consequently can “explain” just about anything by optimizing the relative contributions of different processes to fit the structure of the standard code. Of course, the falsifiability or, more generally, testability of such an overadjusted scenario become issues of concern. Nevertheless, examination of the specific predictions of each theory might take one some way toward falsification of the composite scenario.
The coevolution scenario implies that the genetic code should be highly robust to mistranslations simply because the identified precursor-product pairs consist of physico-chemically similar amino acids (97). However, several detailed analyses have suggested that coevolution alone cannot explain the observed level of robustness of the standard code so that additional evolution under selection for error minimization would be necessary to arrive at the standard code (82, 85, 111). Thus, in terms of the plausibility of a composite scenario, coevolution and error minimization are compatible. However, error minimization also appears to be necessary whereas the necessity of coevolution remains uncertain.
The affinities between cognate triplets and amino acids detected in aptamer selection experiments appear to be independent of the highly optimized amino acid assignments in the standard code table (112). Thus, even if these affinities are relevant for the origin of the code, the error minimization properties of the standard code are still in need of an explanation. The proponents of the stereochemical theory argue that some of the amino acid assignments are stereochemically defined, whereas others have evolved under selective pressure for error minimization, resulting in the observed robustness of the standard code. Indeed, it has been shown that, even when 8–10 amino acid assignments in the standard code table are fixed, there is still plenty of room to produce highly optimized genetic codes (112). However, this mixed stereochemistry-selection scenario seems to clash with some evidence. Somewhat paradoxically, the amino acids for which affinities with cognate triplets have been reported are largely considered to be late additions to the code: only 4 of the 8 amino acids with reported stereochemical affinities are phase 1 amino acids according to the coevolution theory (Fig. 4). Notably, arginine, the amino acid for which the evidence in support of a stereochemical association with cognate codons appears to be the strongest, is the “worst positioned” amino acid in the code table, i.e., of all amino acids, a change in the codon assignment for arginine results in the greatest increase in the code’s fitness (e.g., (86)). This unusual position of arginine in the code table makes it tempting to consider a different combined scenario of the code’s evolution whereby the early stage of this evolution involved, primarily, selection for error minimization, whereas at a later stage, the code was modified through recruitment of new amino acids that involved the (weak) stereochemical affinities.
Whether the code reflects biosynthetic pathways according to the coevolution theory or was shaped by adaptive evolutionary forces to minimize the burden caused by improperly translated proteins or even to maximize the rate of the adaptive evolution of proteins (113–115), a fundamental but often overlooked question is why the code is (almost) universal. Of course, the stereochemical theory, in principle, could offer a simple solution, namely, that the codon assignments in the standard code are unequivocally dictated by the specific affinity between amino acids and their cognate codons. As noted above, however, the affinities are equivocal and weak, and do not account for the error-minimization property of the code. An alternative could be that the code evolved to (near) perfection in terms of robustness to translational errors or, perhaps, some other optimization criteria, and this (nearly) perfect standard code outcompeted all other versions. We have seen, however, that, at least with respect to error minimization, this is far from being the case (Fig. 3). What remains as an explanation of the code’s universality is some version of frozen accident combined with selection that brought the code to a relatively high robustness that was sufficient for the evolution of complex life.
Under the frozen accident view, the universality of the code can be considered an epiphenomenon of the existence of a unique LUCA. The LUCA must have had a code with at least a minimal fitness compatible with cellular life, and that code was frozen ever since (except for the observed limited variation). The implicit assumption behind this line of reasoning is that LUCA already possessed a translation system that was (nearly) as advanced as the modern version. Indeed, the universality of the key components of the translation system including a nearly complete set of aminoacyl-tRNA synthetases among the extant cellular life forms (116, 117) strongly suggests that the main features of the translation system were fixed at a pre-LUCA stage of evolution.
The recently proposed hypothesis of collective evolution of primordial replicators explains the universality of the code through a combination of frozen accident and a distinct type of selection pressure (118, 119). The central idea is that universality of the genetic code is a condition for maintaining the (horizontal) flow of genetic information between communities of primordial replicators, and this information flow is a condition for the evolution of any complex biological entities. Horizontal transfer of replicators would provide the means for the emergence of clusters of similar codes, and these clusters would compete for niches. This idea of collective evolution of ensembles of virus-like genetic entities as a stage in the origin of cellular life apparently goes back to Haldane’s classic paper of 1928 (120) but was subsequently recast in modern terms and expanded (121–124), and developed in physical terms (125, 126). Vetsigian et al. (118) explored the fate of the code under collective evolution using a simple evolutionary model which is a generalization of the population-genetic model of code evolution described by Sella and Ardell (90, 91). It has been shown that, taking into consideration the selective advantage of error-minimizing codes, within a community of subpopulations of genetic elements capable of horizontal gene exchange, evolution leads to a nearly universal, highly robust code (118).
The writing of this review coincides with the 40th anniversary of Crick’s seminal paper on the evolution of the genetic code (6) that synthesized the preceding research in this area and presciently outlined the principal lines of thinking on this difficult subject. In our opinion, despite extensive and, in many cases, elaborate attempts to model code optimization, ingenious theorizing along the lines of the coevolution theory, and considerable experimentation, very little definitive progress has been made.
Of course, this does not mean there has been no advance in understanding aspects of the code evolution. Some clear conclusions are negative, i.e., allow one to rule out certain a priori plausible possibilities. Thus, many years of experimentation including the latest extensive studies on aptamer selection show that the code is not based on a straightforward stereochemical correspondence between amino acids and their cognate codons (or anticodons). Direct interactions between amino acids and polynucleotides might have been important at some early stages of the code’s evolution but hardly could have been the principal factor of that evolution. Almost the same seems to apply to the coevolution theory: the possibility exists that evolution of amino acid metabolism and evolution of the code were, to some extent, linked, but this coevolution cannot fully explain the properties of the code. The verdict on the adaptive theory of code evolution, in particular, the hypothesis that the code was shaped by selection for error minimization, is different: in our view, this is the only concept of the code evolution that can legitimately claim to be positively relevant as (so far) no attempt to explain the observed robustness of the code to translation errors without invoking at least some extent of selection has been convincing. So it does appear that selection for translation error minimization played a substantial role in the evolution of the code to the standard form. However, there is also a flip side to the adaptive theory as the standard code appears not to be particularly outstanding in terms of error minimization and, apparently, is easily reachable from a random code with the same block structure.
Statements like “the genetic code is one in a million” (or even in 100 million) are technically accurate but can be easily misconstrued should one overlook the fact that there is a huge number of possible codes that are significantly more robust than the standard code that sits on the slope of an unremarkable local peak in an extremely rugged fitness landscape (Fig. 3). Of course, it cannot be ruled out that the fitness functions employed in modeling selection for error minimization (Eq (I) and similar ones) in the evolution of the code are far from being an accurate representation of the “real” optimization criterion. Should that be the case, the general assessment of the entire field of code evolution would have to be particularly somber because that would imply we have no clue as to what is important in a code. However, this does not seem to be a particularly likely possibility. Indeed, recent theoretical and empirical studies on correlations between gene sequence evolution and expression strongly suggest that minimization of the production of potentially toxic misfolded proteins is a crucial factor of evolution (127–130). It stands to reason that minimization of protein misfolding has driven evolution concordantly at several levels including protein sequences, codon usage (130) and the genetic code itself. Furthermore, general considerations, stemming from Eigen’s theory of quasispecies and mutational meltdown, indicate that, for any complex life to evolve, sufficient robustness of replication and expression is a pre-requisite (131–133). Thus, these more general lines of reasoning from evolutionary biology seem to complement the results of specific modeling of the code’s evolution.
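The kind of comparison behind statements such as "one in a million" can be sketched as a small Monte Carlo experiment. This is a toy illustration under stated assumptions, not the model used in the studies cited here: the cost is the mean squared difference in an amino acid property over all single-nucleotide substitutions, with all positions and substitutions weighted equally; the property is the Kyte-Doolittle hydropathy scale, swapped in for whatever measure a given study uses; and random codes are produced by permuting the 20 amino acids among the codon blocks of the standard code, which preserves the block structure. All function names and parameters are hypothetical.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Kyte-Doolittle hydropathy index (assumed property; other scales could be used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def code_cost(code, prop):
    """Mean squared property change over all single-base changes between sense codons."""
    total, n = 0.0, 0
    for codon, aa in code.items():
        if aa == "*":
            continue
        for pos in range(3):
            for b in BASES:
                if b == codon[pos]:
                    continue
                aa2 = code[codon[:pos] + b + codon[pos + 1:]]
                if aa2 == "*":
                    continue  # substitutions to stop codons are ignored in this sketch
                total += (prop[aa] - prop[aa2]) ** 2
                n += 1
    return total / n


def random_code(code, rng):
    """Permute amino acids among the existing codon blocks; stop codons stay fixed."""
    aas = sorted(set(code.values()) - {"*"})
    shuffled = aas[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(aas, shuffled))
    mapping["*"] = "*"
    return {c: mapping[aa] for c, aa in code.items()}


def fraction_better(code, prop, trials=1000, seed=0):
    """Fraction of random codes with a lower cost than the given code."""
    rng = random.Random(seed)
    base = code_cost(code, prop)
    better = sum(code_cost(random_code(code, rng), prop) < base for _ in range(trials))
    return better / trials


if __name__ == "__main__":
    print(code_cost(STANDARD, HYDROPATHY))
    print(fraction_better(STANDARD, HYDROPATHY))
```

A small fraction returned by such a sampler says only that the standard code is better than most random codes. It does not say how close the code is to the best achievable code, which is the distinction drawn above.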
And then, there is, of course, frozen accident, Crick’s famous “non-explanation” that, even after 40 years of increasingly sophisticated research, still appears relevant for the problem of the code’s origin and evolution. Indeed, given the relatively modest optimization level of the standard code, it appears essentially certain that the evolution of the code involved some combination of frozen accident with selection for error minimization. Whether or not other recognized and/or still unknown factors also contributed remains a matter to be addressed in further theoretical, modeling and experimental research.
Before closing this discussion, it makes sense to ask: do the analyses described here, focused on the properties and evolution of the code per se, have the potential to actually solve the enigma of the code’s origin? It appears that such potential is problematic because, out of necessity, to make the problems they address tractable, all studies of the code evolution are performed in formalized and, more or less, artificial settings (be it modeling under a defined set of code transformations or aptamer selection experiments) the relevance of which to the reality of primordial evolution is dubious at best. The hypothesis on the causal connection between the universality of the code and the collective character of primordial evolution characterized by extensive genetic exchange between ensembles of replicators (118) is attractive and appears conceptually important because it takes the study of code evolution from being a purely formal exercise into a broader and more biologically meaningful context. Nevertheless, this proposal, even if quite plausible, is only one facet of a much more general and difficult problem, perhaps, the most formidable problem of all of evolutionary biology. Indeed, it stands to reason that any scenario of the code origin and evolution will remain vacuous if not combined with understanding of the origin of the coding principle itself and the translation system that embodies it. At the heart of this problem is a dreary vicious circle: what would be the selective force behind the evolution of the extremely complex translation system before there were functional proteins? And, of course, there could be no proteins without a sufficiently effective translation system. A variety of hypotheses have been proposed in attempts to break the circle (see (132–135) and references therein) but so far none of these seems to be sufficiently coherent or enjoys sufficient support to claim the status of a real theory.
It seems that detailed modeling of the code evolution from simpler predecessors such as doublet codes could offer some new windows into the early stages of the evolution of coding (72). Notably, backtracking the standard code to the most likely doublet versions yields codes with an exceptional, nearly maximum error minimization capacity (ASN and EVK, unpublished), an observation that moves selection for error minimization and/or frozen accident at least one step closer to the actual origin of translation. Nevertheless, these and other theoretical approaches lack the ability to take the reconstruction of the evolutionary past beyond the complexity threshold that is required to yield functional proteins, and we must admit that concrete ways to cross that horizon are not currently known.
On the experimental front, findings on the catalytic capabilities of selected ribozymes are impressive (136). In particular, highly efficient self-aminoacylating ribozymes and ribozymes that catalyze the peptidyltransferase reaction have been obtained (137, 138). Moreover, ribozymes whose catalytic activity is stimulated by peptides have been selected (139), hinting at the possible origins of the RNA-protein connection (133). Nevertheless, in a close analogy to the situation with theoretical approaches, we are unaware of any experiments that would have the potential to actually reconstruct the origin of coding, not even at the stage of serious planning.
Summarizing the state of the art in the study of the code evolution, we cannot escape considerable skepticism. It seems that the two-pronged fundamental question: “why is the genetic code the way it is and how did it come to be?”, that was asked over 50 years ago, at the dawn of molecular biology, might remain pertinent even in another 50 years. Our consolation is that we cannot think of a more fundamental problem in biology.
Although the study of the evolution of the genetic code is a relatively well focused field, the literature accumulated over the 50 years of research is extensive, and we could not possibly cover all of it in a brief review article. Our sincere apologies to all colleagues whose relevant work is not cited due to space restrictions. EVK is grateful to Nigel Goldenfeld, Paul Higgs, and Claus Wilke for insightful discussions during the workshop on “Evolution: from Atoms to Organisms” at the Aspen Center for Physics (Aspen, CO), 8/10/2008-8/31/2008. The authors’ research is supported by the Department of Health and Human Services intramural program (NIH, National Library of Medicine).