How faithfully a given stretch of nucleotides is replicated and expressed depends not only on the machinery for DNA and RNA processing in the cell, but also on the sequence of the nucleotide stretch itself. Certain sequences are inherently prone to errors during replication and expression, whereas other sequences are more stable. The stability of a nucleotide sequence can evolve independently of the sequence of the encoded protein. This is a consequence of the redundancy of the genetic code. As 61 codons code for 20 amino acids, any given amino acid sequence can be encoded by different nucleotide sequences that differ in their propensity for errors during replication and expression. Here we ask if the nucleotide sequences actually used by organisms are a random sample of all the possible sequences encoding that particular amino acid sequence, or if they deviate from a random choice in the direction of stability or instability.
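The redundancy argument can be made concrete by counting how many distinct nucleotide sequences encode one amino acid sequence. The following is a minimal sketch that assumes the standard genetic code; the names `CODON_TABLE`, `SYNONYMS`, and `count_encodings` are illustrative and do not come from the study.

```python
from itertools import product
from math import prod

# Standard genetic code (assumption: NCBI translation table 1),
# written in the conventional T, C, A, G codon order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Map each amino acid to its list of synonymous (sense) codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)


def count_encodings(protein):
    """Number of distinct nucleotide sequences that encode `protein`."""
    return prod(len(SYNONYMS[aa]) for aa in protein)


if __name__ == "__main__":
    print(sum(len(codons) for codons in SYNONYMS.values()))  # sense codons
    print(count_encodings("KL"))  # lysine (2 codons) x leucine (6 codons)
```

Under this code table, the dipeptide Lys-Leu already has 12 possible encodings, and the count grows multiplicatively with protein length.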
It has been speculated that the evolution of unstable sequences could result from selection for novel and advantageous mutations. This idea goes back to reports about high mutation rates in certain loci of pathogenic bacteria [1]. More recently, high local mutation rates have been implied for loci in non-pathogenic bacteria [2] and yeast [3], and it has been speculated that unstable nucleotide sequences could generally make a substantial contribution to genetic variation for selection to act upon [4].
The evolution of stable sequences is thought to be the consequence of the costly flipside of instability. Although mutations can occasionally confer benefits, most mutations are deleterious, and a higher mutation rate can lead to a harmful mutational load. Additionally, errors during transcription or translation are metabolically costly. Such combined costs could lead to selection towards alleles with stable nucleotide sequences.
Currently, the relative importance of selection for stability or instability of DNA sequences is not clear. To investigate this question, we asked whether coding sequences of a number of organisms were more or less stable than expected by chance. We focused on one important determinant of stability, the occurrence of mononucleotide repeats: homogeneous runs of one nucleotide. Mononucleotide repeats have a strong influence on the local mutation rate [6]. In yeast, extending a mononucleotide repeat (of length ≥4) by one nucleotide increases the local mutation rate by about a factor of two [7]. The most common mutations in mononucleotide repeats are insertions or deletions of one or more nucleotides, which often change the reading frame for the remainder of the coding sequence. This alters the amino acid sequence and typically leads to the emergence of a premature stop codon. Errors during expression are also strongly influenced by mononucleotide repeats: high error rates of transcription [8] and translation [9] have been reported for mononucleotide repeats in Escherichia coli. For this study, we focus on mononucleotide repeats because repeats consisting of short units are much more common and less stable [10] than repeats of longer units.
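As an illustration of what counts as a mononucleotide repeat, the sketch below lists homogeneous runs in a nucleotide sequence. The function name `mononucleotide_runs` and the default threshold `min_length=4` are assumptions chosen for the example, not parameters taken from the study.

```python
from itertools import groupby


def mononucleotide_runs(seq, min_length=4):
    """Return (start, base, length) for every homogeneous run in `seq`
    that is at least `min_length` nucleotides long (0-based start)."""
    runs = []
    position = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


if __name__ == "__main__":
    # Hypothetical example sequence with one A5 run and one T4 run.
    print(mononucleotide_runs("ATGAAAAAGCTTTTG"))
```

For the hypothetical sequence above, the function reports a run of five adenines starting at position 3 and a run of four thymines starting at position 10.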
The distribution of mononucleotide repeats in organisms' genomes has been addressed by many other studies [11]. These studies have often found that the observed number of mononucleotide repeats exceeded the expected number [12], and this finding is sometimes interpreted as evidence for selection for evolvability [4]. A few studies also reported under-representation of mononucleotide repeats [11]. The expected number of repeats is usually calculated by assuming that nucleotides or codons are randomly distributed within a gene (but see [17]). This null model does not preserve the amino acid sequence of proteins. It would be appropriate if the amino acids were randomly distributed in genes. However, many proteins contain amino acid repeats [19], and some of these repeats have functional significance [20]. Amino acid repeats make the emergence of nucleotide repeats more likely and thus increase their numbers relative to a null model that does not take the amino acid sequence into account. This effect could explain the common result that actual nucleotide sequences contain more mononucleotide repeats than the random sequences generated under this particular null model. These studies are thus not sufficient to resolve the question of whether the amino acid sequence is encoded in a way that avoids or promotes the emergence of nucleotide repeats.
In contrast to most of these earlier studies, we used a null model that preserves the amino acid sequence. Such null models have been used for comparing observed and expected nucleotide sequences in terms of RNA secondary structure [21], the frequency of short nucleotide motifs in different frames [22], and the frequency of targets for errors during translation [23]. A recent study used such a null model to identify runs of adenines and thymines that are thought to be involved in errors during transcription in bacterial genomes [24]. Here we used this method to investigate the occurrence of mononucleotide repeats in the genomes of E. coli, Saccharomyces cerevisiae, and Caenorhabditis elegans. We analyzed all genes of these organisms and asked whether they contain more or fewer repeats than expected under this null model. This allowed us to determine whether these organisms use stable or unstable nucleotide sequences to encode their proteins.
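One possible way to realize a null model that preserves the amino acid sequence is sketched below. It is an assumption made for illustration, not necessarily the procedure used in this study: synonymous codons are permuted among the positions of the same amino acid within a gene, which keeps both the protein sequence and the gene's codon usage fixed. All function names, the run threshold `min_length=4`, and the example sequence are hypothetical.

```python
import random
from collections import defaultdict
from itertools import groupby, product

# Standard genetic code (assumption: NCBI translation table 1); '*' = stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(cds):
    """Split an in-frame coding sequence (length divisible by 3) into codons."""
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def translate(cds):
    return "".join(CODON_TABLE[codon] for codon in split_codons(cds))


def count_runs(seq, min_length=4):
    """Count homogeneous runs of at least `min_length` nucleotides."""
    return sum(1 for _, g in groupby(seq) if sum(1 for _ in g) >= min_length)


def shuffle_synonymous(cds, rng):
    """Permute codons among positions that encode the same amino acid."""
    codons = split_codons(cds)
    positions = defaultdict(list)
    for index, codon in enumerate(codons):
        positions[CODON_TABLE[codon]].append(index)
    shuffled = list(codons)
    for indices in positions.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled)


def expected_runs(cds, n_samples=1000, min_length=4, seed=0):
    """Mean run count over randomized sequences with the same protein."""
    rng = random.Random(seed)
    counts = [
        count_runs(shuffle_synonymous(cds, rng), min_length)
        for _ in range(n_samples)
    ]
    return sum(counts) / len(counts)


if __name__ == "__main__":
    gene = "ATGAAAAAGTTTTTCAAGAAATAA"  # hypothetical toy coding sequence
    print("observed:", count_runs(gene))
    print("expected:", expected_runs(gene))
```

Comparing the observed count with the mean of the randomized counts indicates whether a gene contains more or fewer repeats than expected given its protein sequence. Other variants of such a null model, for example drawing each codon from genome-wide synonymous codon frequencies, are equally conceivable.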