We investigated 53 complete bacterial chromosomes for intrachromosomal repeats. In previous studies on eukaryote chromosomes, we proposed a model for the dynamics of repeats based on the continuous genesis of tandem repeats, followed by an active process of high deletion rate, counteracted by rearrangement events that may prevent the repeats from being deleted. The present study of long repeats in the genomes of Bacteria and Archaea suggests that our model of interspersed repeats dynamics may apply to them. Thus the duplication process might be a consequence of very ancient mechanisms shared by all three domains. Moreover, we show that there is a strong correlation between nucleotide composition bias and the repeat density of genomes: the more biased the composition (i.e. the lower its entropy), the higher the repeat density. We hypothesise that in highly biased genomes, non-duplicated small repeats arise more frequently by random effects and are used as primers for duplication mechanisms, leading to a higher density of large repeats.
DNA repeats can be defined as sequences sharing extensive similarity with other sequences of the same genome. It is usually supposed that repeats arise by successive duplications, and several causal mechanisms have been proposed to be involved, including hyperploidisation (even polyploidisation), tandem duplication, and double-strand break repair by insertion or transposition. The underlying mechanisms are thought to act at different levels depending on the kingdom, or even on the organism [e.g. polyploidisation has been proposed to explain the presence of large repeats in eukaryotes (1,2), but is probably absent in Archaea and Bacteria]. Once a repeat is created, it can be targeted by the recombination apparatus and be subject to deletion. Thus, genome size results from a balance between duplication and deletion events. The importance of deletion processes seems crucial in compact genomes, especially in those of intracellular endosymbionts or pathogens (3).
Usually, repeats in Bacteria are divided into two subclasses: low complexity repeats (sometimes mislabeled ‘tandem repeats’) and longer repeats (the centre of our interest). The first category consists of small oligonucleotides (typically ranging from mononucleotide to pentanucleotide in size) repeated many times in a head-to-tail configuration. These low complexity repeats, e.g. microsatellites, are very abundant in the genomes of eukaryotes, in which they have been widely studied (4). Although they are less abundant in bacterial and archaeal genomes (5), the mechanisms of their origin (6), their function (7), the consequences for genome dynamics (8) and the structural constraints imposed on the chromosome (9) have all been studied.
Longer repeats include transposable elements, minisatellites (mostly in Eukarya), large tandem repeats and spaced repeats. DNA transposable elements (like IS) are widely distributed among the Archaea and Bacteria. As specific mechanisms for the duplication of mobile elements have been identified (10), such self-replicating elements have to be considered separately when the origin of repeats is analysed. However, they must be taken into account when the influence of repeats on genome stability is considered.
Several mechanisms have been proposed for the genesis of tandem repeats: slipped strand mispairing, unequal crossover (by homologous recombination), rolling circle and circle excision with reinsertion (11). Some of these mechanisms could also result in a tandem repeat deletion. These mechanisms render tandem repeats unstable, easy to create but also easy to delete. In contrast, distant repeats can almost only be deleted by homologous recombination and at the cost of large deletions of genetic material. As a consequence, they may persist more easily during genome evolution. Two mechanisms have been envisaged to create spaced repeats ex nihilo. The first, known as Campbell-like insertion, creates repeats by inserting exogenous sequences and has been proposed to explain the peculiar distribution of many repeats in Bacillus subtilis (12). The second, referred to as ‘conversion’ or ‘insertion’, repairs a double-strand break by copying a sequence sharing similarity with the edges of the broken sequence: this mechanism works either by break-induced replication or by gap repair (for reviews in yeast see 13,14).
The first question we tackled in this work concerns the origin of interspersed repeats (excluding transposable elements). Our previous studies (15,16) had led us to propose a model (Fig. 1) for the origin of eukaryote intrachromosomal repeats based on the permanent genesis of close direct repeats (CDR, repeats with copies separated by <1 kb). Since our model is compatible with all mechanisms, we do not assume any particular one for the creation of CDR. Newly created CDR are then subject to a strong rate of exchange (conversion and deletion). Experimental studies undertaken on B.subtilis (17) and Escherichia coli (18–20) have shown that the rate of illegitimate recombination is negatively correlated with the distance between the copies (spacer size) and positively correlated with repeat length. Recombination between close repeats tends to maintain neighbouring repeats identical (by conversion) but also to eliminate them (by deletion). At each round of exchange, both events are possible (although we do not know whether they are equally likely). While conversion can be followed by deletion, the opposite is not true: a deletion event cannot be followed by conversion. Over a long time, this will result in a bias in favour of deletions, with CDR disappearing sooner or later (depending on the relative rates of conversion and deletion). Thus, in the absence of strong selective pressure, long CDR are too unstable to persist, unless the copies are moved further apart by chromosomal rearrangements (i.e. insertion, translocation and inversion). In this case, the rate of illegitimate recombination will drop severely and the repeats may be maintained.
In this context, one expects CDR to be more similar than distant repeats, since either they are more recent or they are more subject to conversion. On the other hand, one expects that larger repeats will only survive fast deletion by frequent illegitimate recombination if they are placed distantly. Thus, under our model, CDR tend to have smaller and more identical repeats whereas distant repeats tend to be longer and less similar. This matches the observations we have made in eukaryote genomes, where repeats are both more identical and smaller when they are closer (15). The main goal of this work was to test if this model, first established in Eukarya, could be applied to Bacteria and Archaea.
The second focus of our attention concerns the factors influencing the dynamics of our model, i.e. rates of duplication, deletion and rearrangement. Here we analyse precisely the relation between the origin of tandem repeats and the genome composition biases. Duplication mechanisms typically require the pre-existence of a region of similarity. Levinson and Gutman (8) proposed that small non-duplicated repeats (afterwards referred to as repeats appearing by chance) are primers for mechanisms such as slipped strand mispairing, thus creating larger repeats. We have tried to analyse this proposition by deciphering the relations between repeat density and the relative frequencies of nucleotides in the chromosome.
We analysed the complete genomes of 40 Bacteria and 11 Archaea (Table 1). All sequences were extracted from GenBank (ftp://ftp.ncbi.nih.gov/genbank/Bacteria), except for those of Pyrococcus furiosus, downloaded from http://www.genome.utah.edu.
We followed the methodology previously developed to detect repeats in eukaryote genomes (15,16), but made an extra effort to detect smaller, but significant, repeats, since bacterial chromosomes are smaller. The methodology is described below and follows four main steps.
First step: detection of seeds. In this step, exact direct and inverse repeats (seeds) of 15 bp were detected using the REPuter software (21). Many seeds with lengths that are not statistically significant according to Karlin and Ost statistics were retained (22). The second step is intended to further extend these seeds into larger, non-strict repeats.
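As a rough illustration of what seed detection involves, the sketch below indexes every k-mer of a sequence and reports pairs of positions carrying identical k-mers. It is not the algorithm used by REPuter (which relies on suffix trees); the function name `find_direct_seeds` is hypothetical, and inverted repeats would additionally require a comparison against the reverse complement.

```python
from collections import defaultdict

def find_direct_seeds(seq, k=15):
    """Return sorted (i, j) start-position pairs, i < j, of identical k-mers.

    Naive k-mer index, shown only to illustrate the idea of exact seeds;
    it is quadratic in the size of each k-mer family.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    pairs = []
    for starts in positions.values():
        # Every pair of occurrences of the same k-mer is an exact direct seed.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)
```

With `k=15` this would mirror the seed length quoted above; overlapping occurrences are reported as separate pairs and would need merging before the extension step.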
Second step: from seeds to repeats. Local alignment (23) is used to extend the edges of the seeds into larger repeats. Except for the construction of the score matrix, the extension process is the same as the one we used to analyse eukaryote chromosomes (16). This method produces non-exact repeats by extending a seed on both sides for as long as similarity remains high.
Nucleotide composition differs widely between genomes, with G + C content ranging from 25 to 75% (24). Therefore, if an identity matrix is used for the local alignment, seeds of the same size in chromosomes with a very unbalanced distribution of nucleotides (e.g. Ureaplasma urealyticum where A ≈ T ≈ 0.37 and C ≈ G ≈ 0.13) tend to produce larger repeats than in genomes with equal frequencies (e.g. E.coli). In order to avoid this effect, we used an empirical scoring matrix for each chromosome, which takes into account its specific composition. These matrices provide a better score for matches between rare nucleotides:
$$\mathrm{match}_{i/i} = 100 \times (1 - p_i^2); \quad \mathrm{match}_{N/i} = 25$$
$$\mathrm{mismatch}_{i/j} = -100 \times (1 - p_i \times p_j); \quad \mathrm{gap}_{open} = -400; \quad \mathrm{gap}_{ext} = -100$$
where $p_i$ is the frequency of nucleotide i. By building these matrices for all species, we observed scores for matches ranging from 86 to 98 and scores for mismatches ranging from –98 to –86. Thus the score of gapopen is always less than 4 × mismatch and the score of gapext always less than 1 × mismatch. We also tried other matrices that gave similar results.
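The scoring scheme can be written down directly from the formulas above. The following is a minimal sketch, assuming frequencies are supplied as a dictionary; the names `scoring_matrix`, `GAP_OPEN` and `GAP_EXT` are hypothetical and not taken from the original software.

```python
GAP_OPEN = -400  # gap opening penalty quoted in the text
GAP_EXT = -100   # gap extension penalty quoted in the text

def scoring_matrix(freqs):
    """Build composition-aware substitution scores.

    freqs maps each nucleotide (A, C, G, T) to its frequency in the chromosome.
    Matches between rare nucleotides score higher than matches between
    frequent ones; any pairing with the ambiguous symbol N scores 25.
    """
    scores = {}
    for i, p_i in freqs.items():
        for j, p_j in freqs.items():
            if i == j:
                scores[(i, j)] = 100 * (1 - p_i ** 2)
            else:
                scores[(i, j)] = -100 * (1 - p_i * p_j)
        scores[("N", i)] = scores[(i, "N")] = 25
    return scores
```

For the composition quoted for U.urealyticum (A ≈ T ≈ 0.37, C ≈ G ≈ 0.13) this gives match scores of about 86 for A or T and about 98 for C or G, which matches the range reported above.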
Third step: removing repeats that are not statistically significant. Since seeds are rather small, many repeats may not have statistically significant lengths. To remove these non-significant repeats, we built, for each chromosome, 10 additional random chromosomes by shuffling it with respect to its trinucleotide composition (Markov chains of order 2). In these random sequences, repeats were detected as in real sequences (steps 1 and 2). Afterwards, we built a distribution of observed alignment scores from the set of repeats detected in the 10 random chromosomes. We then defined a threshold of significance corresponding to the 0.001 level of this distribution (i.e. the score exceeded by only 0.1% of the repeats found in the random chromosomes). Below this minimal score (Smin), repeats were regarded as non-significant and removed from further analysis. Smin depends essentially on the size and composition of the genome (and naturally on our choice of scoring system) and ranges from 2052 (Chlamydia pneumoniae) to 2258 (Mycoplasma pulmonis). Using score (S), length (L) and identity (Id), the characteristics of some pertinent repeats from these two organisms are given in more detail: (i) for C.pneumoniae, the smallest score corresponds to S = 2052, L = 36 and Id = 80.6%; the medians of the distributions being S = 4505, L = 220 and Id = 63.1%; (ii) for M.pulmonis, the smallest score corresponds to S = 2258, L = 82, Id = 71.7%; the medians being S = 3005, L = 90 and Id = 68.9%.
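The thresholding part of this step can be sketched as an empirical quantile over the scores obtained from the shuffled chromosomes. The function name `significance_threshold` and the exact quantile convention (no interpolation, threshold taken as an observed score) are assumptions made for illustration; the order-2 Markov shuffling itself is not shown.

```python
def significance_threshold(random_scores, alpha=0.001):
    """Return a score that at most a fraction alpha of random scores exceed.

    random_scores: alignment scores of repeats found in shuffled chromosomes.
    Repeats in the real chromosome scoring below the returned value would be
    discarded as non-significant.
    """
    if not random_scores:
        raise ValueError("need at least one random score")
    ordered = sorted(random_scores, reverse=True)
    # Number of random repeats that may score strictly above the threshold.
    n_allowed = int(alpha * len(ordered))
    return ordered[min(n_allowed, len(ordered) - 1)]
```

For very small samples the function simply falls back to the highest observed random score, which is the most conservative choice available.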
Fourth step: determining family sizes. At this stage, all significant repeats are given as a series of pairs. However, many repeats are organised in multicopy families (e.g. IS and rRNA operons). Hence, we developed a procedure to detect such multicopy families in our data set.
To do so, we built, for each chromosome, a map in which each position is linked to its ‘n-plication’ degree: unique, duplicated, triplicated, etc. These maps were built by counting, for each chromosome position, the number of times this position is found in repeats (direct and inverted ones were pooled together). Each pair was then associated with the map and the family size of each repeat was determined.
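A minimal sketch of such a map is given below. It assumes that repeats are supplied as pairs of half-open intervals and that a position found in c pairwise repeats belongs to a family of c + 1 copies (which holds only if every pair within a family has been detected); the function name `nplication_map` is hypothetical.

```python
def nplication_map(chrom_length, repeat_pairs):
    """Return the n-plication degree of each chromosome position.

    repeat_pairs: iterable of ((start1, end1), (start2, end2)) with 0-based,
    end-exclusive coordinates; direct and inverted repeats are pooled.
    Degree 1 means unique, 2 duplicated, 3 triplicated, and so on.
    """
    coverage = [0] * chrom_length
    for copy1, copy2 in repeat_pairs:
        for start, end in (copy1, copy2):
            for pos in range(start, end):
                coverage[pos] += 1
    # Assumption: a position seen in c pairs has c partner copies elsewhere.
    return [c + 1 for c in coverage]
```

The family size of a given repeat could then be read off the map, for instance as the largest degree found along one of its copies.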
In order to characterise the repeats, we used two measures of density, the density in number and the density in length. They are defined as:
DN = no. of copies/size of chromosome (Mb)
DL = 100 × [size of repeat sequence (bp)/size of chromosome (bp)]
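Both densities are simple ratios; the short sketch below makes the units explicit (copies per megabase for DN, percentage of the chromosome for DL). The function and argument names are hypothetical.

```python
def repeat_densities(n_copies, repeated_bp, chrom_bp):
    """Return (DN, DL) for one chromosome.

    n_copies:    number of repeat copies detected
    repeated_bp: number of base pairs lying in repeated sequence
    chrom_bp:    chromosome size in base pairs
    """
    d_n = n_copies / (chrom_bp / 1e6)    # copies per Mb
    d_l = 100 * repeated_bp / chrom_bp   # percentage of the chromosome
    return d_n, d_l
```

For example, 50 copies covering 20 000 bp of a 2 Mb chromosome would give DN = 25 copies/Mb and DL = 1%.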
Complexity is frequently used as a compact measure of the departure of the nucleotide distribution from equal frequencies. In this context, information entropy has been proposed to describe biases of mononucleotide distributions (25):
$$H = -\sum_{i \in \{A,C,G,T\}} p_i \log_4 p_i$$
where $p_i$ is the frequency of nucleotide i. If a sequence exhibits an equal repartition of its four nucleotides (maximum complexity), its entropy is 1. In bacterial chromosomes it ranges from 0.91 to 1.
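The entropy can be computed in a few lines, assuming it is normalised with a base-4 logarithm so that four equally frequent nucleotides give a value of 1, as stated above. The function name `composition_entropy` is hypothetical.

```python
import math

def composition_entropy(freqs):
    """Information entropy of mononucleotide frequencies, in base 4.

    freqs: the four nucleotide frequencies (summing to 1).
    Returns 1 for equal frequencies and lower values for biased compositions.
    """
    # Terms with p = 0 contribute nothing (limit of p * log p as p -> 0).
    return -sum(p * math.log(p, 4) for p in freqs if p > 0)
```

With the composition quoted earlier for U.urealyticum (0.37, 0.37, 0.13, 0.13) this gives about 0.91, the lower end of the range reported for bacterial chromosomes.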
CDR were originally defined as repeats with a distance between their two copies of <1 kb. We estimated the proportion of CDR expected if repeats are spread randomly along a chromosome. The proportion of CDR is calculated as the ratio between the number of CDR and the total number of repeats. Two cases were taken into account. (i) If the chromosome is circular, the largest spacer size is L/2, where L is the chromosome length. The distribution of spacer size is uniform from 0 to L/2. So, the proportion of CDR in a circular chromosome is 1000 × 2/L. (ii) If the chromosome is linear, the largest spacer size is L and the density of spacer sizes decreases linearly from 0 to L. Using the intercept theorem of Thales (or any analytical demonstration), it can easily be shown that the proportion of CDR is 1000/L × (2 – 1000/L).
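Both expressions follow from integrating the spacer-size density up to 1 kb. The sketch below evaluates them, with the 1 kb limit exposed as a parameter; the function name `expected_cdr_proportion` and the clamping for chromosomes shorter than the limit are additions made for illustration.

```python
def expected_cdr_proportion(chrom_length, circular=True, max_spacer=1000):
    """Expected fraction of repeats with spacer < max_spacer under random placement.

    Circular: spacers are uniform on [0, L/2], giving 2 * max_spacer / L.
    Linear:   spacer density falls linearly from 0 to L, giving
              (max_spacer / L) * (2 - max_spacer / L).
    """
    if chrom_length <= 0:
        raise ValueError("chromosome length must be positive")
    if circular:
        return min(1.0, 2 * max_spacer / chrom_length)
    x = min(1.0, max_spacer / chrom_length)
    return x * (2 - x)
```

Multiplying this proportion by the observed number of direct repeats gives the number of CDR expected by chance, which can then be compared with the observed count.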
We found a large number of repeats in most (but not all) bacterial genomes (Table 2). In order to characterise these repeats, we used two measures of repeat density, DN and DL (see Materials and Methods). As expected, both densities were positively correlated (τ = 0.63, P < 10⁻⁴, Kendall τ rank test): a chromosome with many repeats also devotes a high proportion of its sequence to duplications. However, the biological interpretation of these measures may be quite different: DN can be assimilated to the rate of amplification (a balance between duplication and deletion processes) and DL to the history of the chromosomes, a measure of the redundancy tolerated by a chromosome. Thus, DN and DL should be analysed in parallel as they give complementary information on chromosomal redundancy. The data in Table 2 bring to the fore two issues. (i) Chromosomes of related organisms often exhibit similar densities of repeats: both Chlamydia trachomatis strains, the three C.pneumoniae strains, the three Pyrococcus strains, both Mycobacterium tuberculosis strains, both Staphylococcus aureus strains, both Neisseria meningitidis strains and both Helicobacter pylori strains. However, exceptions do exist. Escherichia coli O157:H7 is more repeated than K12, in agreement with previous observations (26). Also, when we broaden the phylogenetic range, we observe that the four Mycoplasma spp. show very different densities (DN and DL), indicating fast divergence, possibly due to their rudimentary repair mechanisms and to the selective pressure for variation in these pathogens (27). (ii) Both DN and DL exhibit a positive correlation with chromosome size (τ = 0.24, P < 10⁻³ for DN and τ = 0.37, P < 10⁻⁴ for DL). These observations are in good agreement with previous observations on parts of both bacterial genomes and eukaryote genomes (16,28) (Fig. 2).
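For readers unfamiliar with the statistic used throughout this section, Kendall's τ compares the ordering of every pair of observations in two variables. The sketch below implements the simplest variant (τ-a, without tie correction), which is an assumption: the text does not state which variant was used, and the significance test is not shown. The function name is hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall rank correlation (tau-a) between two equal-length sequences.

    Counts concordant and discordant pairs; tied pairs count as neither.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Applied to two columns of per-chromosome values (for instance DN and DL), a value near 1 indicates that chromosomes ranked high for one measure are also ranked high for the other.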
Since we were interested in the repeats’ origins and in the supposition that they arise by duplication, we determined the proportions of two-copy repeats (and respective densities DN2 and DL2) among all repeats (Table 2). As expected, DN2 is positively correlated with DN (τ = 0.77, P < 10⁻⁴) and DL2 with DL (τ = 0.73, P < 10⁻⁴). It can be noted that, in contrast to eukaryote genomes in which DN2 is similar for chromosomes of the same species (16), densities varied between the two chromosomes of Deinococcus radiodurans and also between the two chromosomes of Vibrio cholerae.
Chromosomes containing transposable elements exhibit lower DN2/DN and DL2/DL ratios (P < 0.01, Mann–Whitney rank tests). Since transposable elements are mostly multicopy families, this can be easily understood. We observed a few exceptions (low ratios in the absence of transposable elements), involving small genomes and, in particular, Mycoplasma genitalium and Mycoplasma pneumoniae. The multicopy repeats of these two genomes are associated with their immunodominant proteins and are related to antigenic and tissue tropism variation (27).
In order to test whether our model holds for Bacteria and Archaea we have tested its four major predictions. If interspersed repeats originate massively from tandem repeats, one might expect that (i) direct repeats are more numerous than inverted ones and that (ii) CDR are in large excess. Since the exchange rate between CDR is expected to be negatively correlated with spacer size and positively correlated with repeat length there should be (iii) a negative correlation between repeat similarity and spacer size and (iv) a positive correlation between repeat length and spacer size. Since we are interested in the origin of repeats, we decided to analyse only two-copy repeats further. This removed all low complexity repeats from our data set. Based on the annotations, we show that repeats located at least half in rRNA, tRNA or functional transposase represent ≤5% of our two-copy repeats, except for C.trachomatis (14%, four of 28) and the second chromosome of V.cholerae (7%, two of 29) (data not shown).
Direct repeats are more numerous than inverted ones. The large majority of the chromosomes (47 of 53) exhibit a higher density of two-copy direct repeats as compared with inverted ones (P < 0.001, binomial test), although sometimes the relative difference is not very high (Fig. 3). It is worth noticing that the two chromosomes that exhibit the largest excess of direct repeats are those of M.genitalium and M.pneumoniae. This is due to the previously described repeats located inside the adhesin genes.
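The binomial test quoted here can be checked with a short calculation. The sketch below assumes a one-sided test against the null hypothesis that direct and inverted repeats are equally likely to be in excess (p = 0.5); the sidedness is an assumption, as the text does not specify it, and the function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """Probability of observing at least `successes` out of `trials`.

    One-sided binomial tail under a null success probability p.
    """
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )
```

Calling `binomial_upper_tail(47, 53)` gives a probability far below the 0.001 level reported above.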
CDR are over-represented. We estimated the numbers and densities of two-copy CDR, N2CDR and DN2CDR, respectively, and the theoretical number of CDR as a function of the number of direct repeats in linear and circular chromosomes (see Materials and Methods). As predicted by the model, CDR are over-represented in all chromosomes, taking into account the number of repeats (Table 3). The only exception is Buchnera sp., for which there are few CDR, but it is unclear whether this is a statistical artifact or has biological meaning. The Buchnera sp. genome is thought to be undergoing reductive evolution (3) and lacks an evident RecA homologue (29). Further, there is evidence that intracellular bacteria are subject to weaker selection (30). Thus, the absence of CDR could be the result of the reductive evolution process. Even if CDR are created, selection will not prevent them from being deleted. This deletion could arise easily since CDR deletion is mainly RecA independent.
Identity and length are constrained by spacer size. We looked for correlations between identity and spacer size within two-copy CDR for species in which there were at least 20 CDR (24 chromosomes). In 18 chromosomes identity was significantly negatively correlated with spacer size (P < 0.01, Table 4). In order to extend our analysis, we also took into account multicopy repeats for chromosomes with fewer than 20 two-copy CDR or for those exhibiting a non-significant correlation for two-copy CDR (17 + 6 chromosomes). However, because the number of couples increases when families become very large [c = n × (n – 1)/2, where c is the number of couples and n the number of copies], we retained only repeats with between two and five copies. This test identified significant negative correlations for 15 additional chromosomes (P < 0.01). Thus, out of the 41 chromosomes tested, 33 exhibited a significant negative correlation between identity and spacer size. Table 4 suggests that many others are weakly correlated.
Correlations between length and spacer size were tested under the same conditions as for identity (Table 5) and were also in agreement with the model. A positive correlation was found in 24 of the 41 chromosomes at P < 0.01 and in nine further chromosomes at a less significant α level (P < 0.05). Although very significant, these results are weaker than for the correlation between identity and spacer size and this deserves some comment. In the model, interspersed repeats are mostly created as identical tandem repeats, but their size can vary. Successive rounds of recombinational exchange constrain these repeats to be both highly identical and small due to the deletion bias mentioned above. Therefore, while the conversion process only maintains the pre-existing characteristics of the repeats (a high identity), the deletion process establishes an additional new constraint (small length). It is then conceivable that more rounds of exchange are required to establish the correlation between length and spacer size, which would explain the weaker correlations.
Since the previous results suggest the adequacy of our model, we proceeded to test the influence of chromosomal features on the duplication process, and in particular of nucleotide composition biases. Bacterial chromosomes exhibit large differences in their nucleotide composition, especially in terms of G + C composition, which can vary from 25 to 75% (24). We used the information entropy to measure the composition bias and found a significant negative correlation between entropy and the density of two-copy repeats, DN2 (τ = –0.34, P < 10⁻³, Fig. 4), as well as with total repeat densities, DN (τ = –0.34, P < 10⁻³, Fig. 4). In other words, the more biased the composition (the lower the entropy), the higher the repeat density. One would expect more biased random chromosomes to be more repetitive, since they use a subset of the possible symbols more frequently. However, our methodology to search for repeats already tackles this effect: we determined threshold scores based on empirical distributions for each genome and also defined specific scoring matrices, calculated taking into account the nucleotide compositions of the genomes (see Materials and Methods). This is why the minimal significant alignment score is larger for more biased genomes, such as some Mycoplasma spp. Since methodological biases were taken into account in the search for repeats, one is inclined to explain these results from a biological point of view.
Whatever the mechanism of tandem repeat genesis, it always requires pre-existing small repeats (11). Levinson and Gutman (8) have proposed that small repeats appear by chance and are at the origin of larger repeats that are created by slipped strand mispairing between these small repeats. It so happens that low complexity genomes, by chance alone, present a larger number of small repeats. If we accept the hypothesis that tandem genesis mechanisms are not down-regulated in low complexity genomes, then we are immediately led to the conclusion that tandem genesis must be more frequent in these genomes, simply due to their higher compositional bias. Thus, we propose that in such genomes a higher number of primers appear by chance and lead to more abundant repeats.
Small, non-duplicated repeats can be used as primers for initiation of tandem duplications. Thus, many types of repeats are related: small repeats are transformed into tandem repeats, which are then turned into interspersed repeats. As a consequence one gains by analysing these repeats together, instead of dividing them into different classes.
In this respect, it is interesting to note that chromosomes 2 and 3 of Plasmodium falciparum exhibit a very high density of repeats (as compared with eukaryote chromosomes of the same size) (16) which is associated with a very low G + C content (18%). It is therefore tempting to suggest that in eukaryote chromosomes complexity of the genome also plays an important role in the mechanisms of repeat generation. Naturally, the statistical testing of this generalisation will have to await the availability of a larger sample of complete eukaryote genomes.
We have shown that a model for the dynamics of repeats (previously established in Eukarya), based on tandem genesis with further dispersion, holds for most Bacteria and Archaea. As predicted by the model, we show that in most genomes (i) direct repeats are more numerous than inverted repeats, (ii) CDR are in large excess, (iii) there is a negative correlation between repeat identity and spacer size and (iv) there is a positive correlation between repeat length and spacer size. This strongly suggests that despite their diversity, intrachromosomal repeats of all genomes share similar dynamics that are probably related to very ancient mechanisms shared by the three domains of life. Naturally, this model is not exclusive of other mechanisms of duplication (transposition, horizontal gene transfer, insertions, hyperploidisation, etc.).
We have also shown that nucleotide composition biases of the chromosome strongly influence the rate of tandem repeat creation and thus the rate of repeat amplification. Other effects are likely to shape the dynamics of bacterial repeats and the large availability of complete genomes will shed light on them. This will certainly provide new clues in deciphering the dynamics of repeats in bacterial genomes and shed additional light on genome evolution.
We would like to thank I. Gonçalves, D. Higuet, E. Maillier and J. Pothier for their scientific help and their friendly support. We would also like to thank P. Avner and E. Leguern for their helpful remarks on previous versions of this manuscript. This work was supported by grants from the Association pour la Recherche sur le Cancer. G.A. was funded by the Fondation pour la Recherche Médicale. E.C. and P.N. are members of Université Pierre et Marie Curie (Paris, France).