Conceived and designed the experiments: G Morelli, B Kusecek, C Bahlawane, S Suerbaum, M Achtman. Performed the experiments: G Morelli, B Kusecek, S Schwarz, C Bahlawane. Analyzed the data: G Morelli, X Didelot, C Bahlawane, D Falush, M Achtman. Wrote the paper: X Didelot, M Achtman.
Our understanding of basic evolutionary processes in bacteria is still very limited. For example, multiple recent dating estimates are based on a universal inter-species molecular clock rate, but that rate was calibrated using estimates of geological dates that are no longer accepted. We therefore estimated the short-term rates of mutation and recombination in Helicobacter pylori by sequencing an average of 39,300 bp in 78 gene fragments from 97 isolates. These isolates included 34 pairs of sequential samples, which were sampled at intervals of 0.25 to 10.2 years. They also included single isolates from 29 individuals (average age: 45 years) from 10 families. Sequence diversity accumulated with time of separation in a clock-like manner in the sequential isolates. We used Approximate Bayesian Computation to estimate the rates of mutation, recombination, mean length of recombination tracts, and average diversity in those tracts. The estimates indicate that the short-term mutation rate is 1.4×10⁻⁶ (serial isolates) to 4.5×10⁻⁶ (family isolates) per nucleotide per year and that three times as many substitutions are introduced by recombination as by mutation. The long-term mutation rate over millennia is 5–17-fold lower, partly because purifying selection removes non-synonymous mutations. Comparisons with the recent literature show that short-term mutation rates vary dramatically between bacterial species and can span several orders of magnitude.
Mutation rates in bacteria have generally been considered to be much lower than in viruses. This is partly because estimates of long-term mutation rates for the evolution of distinct species have been inappropriately used for dating divergence within species. Furthermore, the most commonly used long-term mutation rate is based on geological dates that are no longer accepted. In addition, only a few short-term mutation rates have been calculated within bacterial species, and these differ between species by several orders of magnitude. Here, we provide robust estimates for short-term mutation and recombination rates within Helicobacter pylori, a bacterium that commonly infects the human gastric mucosa, based on serial isolates from long-term infections and on differences between isolates from multiple family members. These short-term mutation rates are 5–17-fold higher than long-term mutation rates in H. pylori that have been calibrated by parallel ancient migrations of humans. Short-term mutation rates in bacteria, including those for H. pylori, can be quite high, partially overlapping with those for viruses. Future calculations of ages of bacterial species will need to account for dramatic differences in mutation rate between species and for dramatic differences between short- and long-term mutation rates.
When did modern pathogenic bacteria evolve? Current wisdom teaches that 10,000–50,000 years have elapsed since a variety of genetically highly monomorphic bacterial pathogens evolved from their last common ancestors, and the ages of pathogenic bacteria with greater levels of genetic diversity have been estimated as reflecting millions of years of evolution. Age estimates for bacteria are higher than those of viruses, many of which appeared a few hundred years ago, primarily because many bacterial estimates are based on a supposedly universal molecular clock rate, μS, for synonymous polymorphisms in genes that encode proteins. In 1987, Ochman and Wilson calibrated this clock rate as 3.4×10⁻⁹ per nucleotide per year by dating the split between Escherichia coli and Salmonella enterica within the framework of a universal clock rate for bacterial rRNA sequences. The divergence between E. coli and S. enterica was equated with the age of mammals, estimated as ~160 Myr. However, the validity of this molecular clock rate for dating bacterial evolution is highly questionable.
Some of the geological dates used to calibrate the rRNA clock rate have since been revised (Table 1). These revisions are so drastic that the original linear regression of diversity with time is no longer valid (Figure 1). Furthermore, the estimate of ~160 Mya for the age of the split between E. coli and S. enterica depends on the assumption that E. coli is specific for mammalian hosts, unlike S. enterica, which infects reptiles as well as mammals. But E. coli can be readily isolated from reptiles or birds, which invalidates this argument. An independent recent study also dates the split between E. coli and S. enterica at 57–176 Mya on the basis of long-term phylogenies of protein-encoding sequences. However, both this recent estimate and the original estimate of Ochman and Wilson share the problem that geological events that occurred billions of years ago are extrapolated to speciation events that supposedly occurred ~100 Mya, which implicitly assumes that molecular clock rates are linear over large time scales for diverse microorganisms. This is unlikely to be the case (see below). The use of such long-term clock rates is even more problematic for age estimates of divergence within genetically monomorphic or recently emerged pathogens, which require extrapolations over a further three to four orders of magnitude.
Long-term clock rates are now thought to accelerate by one to two orders of magnitude for recent events. Furthermore, clock rates for genetic diversity between species should not be used for dating within a species. Diversity between species represents fixation events whereas diversity within a species reflects the accumulation of polymorphisms. Finally, molecular clock rates probably vary between different bacterial species, which can differ by up to two orders of magnitude in their relative ratios of divergence of rRNA to protein-encoding genes. As a result of these considerations, almost all age estimates for recently evolved bacterial pathogens need to be reconsidered and should be based on species-specific short-term molecular clock rates.
Age estimates for viruses depend on the use of archival samples that were stored over several years or decades. Only very few attempts, summarized in Table 2, have been made to estimate ages in bacteria with this approach, in part because their clock rates were thought to be too slow. In the case of Yersinia pestis, which was introduced to Madagascar in the early 20th century, the clock rate was similar to that of Ochman and Wilson (Table 2). However, a clock rate dated by migration of Buchnera, an aphid endosymbiont, to North America in the late 19th century is two orders of magnitude higher (Table 2).
Two recent studies of Campylobacter jejuni and Vibrio cholerae have found synonymous clock rates of >10⁻⁶ per site per year, several orders of magnitude higher than the clock rate of Ochman and Wilson. However, we are sceptical about the validity of these two estimates due to problems with their sampling schemes. The C. jejuni isolates were obtained over a three-year period from infected humans within a sampling area of only 968 km² in Lancashire, England, and might reflect admixture due to the import of novel polymorphisms from outside the catchment area. Similarly, the V. cholerae estimates were based on a comparison of only three genomes whose epidemiological patterns suggested that they had evolved shortly before the dates of sampling. A third recent study found a clock rate of 3×10⁻⁶ for ST239 of Staphylococcus aureus, which would mean that ST239 evolved in the mid-1960s. However, the ST239 genealogy consists of multiple, early radiations, which suggests adaptation due to selective pressures.
Clock rates are distorted when based on polymorphisms that are under positive selection because adaptation can increase the fixation rate for mutations by orders of magnitude. As an extreme example, serial isolates from human infections that are repeatedly treated with antibiotics acquire mutations that are associated with antibiotic resistance and can result in hyper-mutation. Sixty-eight mutations in the 6.5 Mbp genome were observed over eight years of lung infection by Pseudomonas aeruginosa in a patient with cystic fibrosis, and 35 mutations in the 2.9 Mbp genome during 12 weeks of endocarditis caused by S. aureus. Similarly, patho-adaptive, transient mutations in an E. coli adhesin are selected during infection of the urinary tract but rapidly disappear due to source-sink dynamics. Short-term positive selection may be common because an appreciable fraction of E. coli genes show traces of such selection.
These various analyses show that mutation rates may be sufficiently high in some bacteria that microevolution can be observed within serial bacterial isolates from individual humans. Here we analyze such microevolution within Helicobacter pylori. H. pylori is commonly acquired in childhood, after which, in the absence of antibiotic therapy, it can continue to infect the stomachs of humans over their entire lifespan. H. pylori has infected humans for at least 60,000 years: it accompanied anatomically modern humans out of Africa. H. pylori also exhibits an atypically high genetic diversity: every third nucleotide in housekeeping gene fragments is polymorphic in global analyses, and the pair-wise synonymous diversity of individual genes ranges from 0.1 to 0.3. High genetic diversity can reflect a long evolutionary history but can also result from a high mutation rate. Indeed, the frequency of mutants per cell among natural isolates of H. pylori is approximately 10–100 fold higher in laboratory experiments than for E. coli, with some variation between individual isolates. That high mutation frequency may reflect the lack of genes encoding the MutHLS1 mismatch repair system. A high mutation rate in the laboratory suggests that the mutational clock rate may also be high during natural infection, possibly facilitating the adaptation of these bacteria to individual human hosts. However, as for most bacteria, robust estimates of the microevolutionary mutation rate are lacking.
In addition to its high mutation rate, H. pylori undergoes particularly frequent recombination. This conclusion was originally reached on the basis of homoplasy analysis. Although this methodology has been recently criticized, recombination is clearly frequent in nature because mosaic imports have been observed, a direct signal for homologous recombination. In laboratory experiments, DNA transformation followed by homologous recombination introduces mosaic stretches of 1.3–3.9 Kbp into the recipient, occasionally interrupted by interspersed segments of recipient DNA sequences that have not been replaced. In nature, mixed infection of individual humans with multiple distinct strains occurs sufficiently frequently that unambiguous mosaics were detected in serial isolates or isolates from members of a family. Recombination is also indicated by analyses using Structure and the three gamete test on random isolates from diverse global sources. In the analyses of serial isolates, the sequences of 10 gene fragments were compared between pairs of strains that were isolated from 26 individuals in Louisiana and Colombia at intervals of 3–36 months (mean 1.8 years). No sequence differences were found in 13 pairs, three pairs of isolates differed by a single nucleotide, and six pairs of isolates differed by eight mosaic stretches. (Four other pairs were excluded from analysis because they either reflected mixed infections with genetically unrelated strains or an infection with a cloud of related isolates whose genetic diversity had arisen prior to infection.) For the six pairs of isolates with mosaic stretches, homologous recombination had introduced imports of an average size of 417 bp (CI [95% confidence interval] 259–732) at a rate per nucleotide per year of 6.9×10⁻⁵ (CI 3.5×10⁻⁵ to 1.2×10⁻⁴).
The three pairs that differed by a single polymorphism were used to calculate a maximal mutation rate per nucleotide per year of 4.1×10⁻⁵, but these polymorphisms could not be definitively ascribed to mutation because they might have represented atypically short imports.
Here we have reanalyzed the same pairs of isolates plus others that spanned longer time periods. We examined the sequence diversity in 78 gene fragments in order to provide robust short-term clock rates for mutation and recombination. These clock rates were compared to long-term clock rates that were calibrated by the dates of human migrations.
We sequenced 78 gene fragments from 97 isolates (Table 3, Table S1, Table S2). Two of these fragments are parts of genes that encode outer membrane proteins and all others are within housekeeping genes. We first sequenced an average of 398 bp from each gene fragment; for fragments with polymorphisms we also sequenced ~500 bp from each of the flanking regions. This resulted in an average total of 39,301 bp that was sequenced per isolate, almost ten times more than in our previous study. The 97 isolates included 34 pairs of serial isolates from continuously infected individuals, of which 22 had been the subject of our previous analysis. Twelve other pairs were from chronically infected patients in the Netherlands with an average sampling interval of 8.4 years (Table 3). The remaining 29 isolates were from 10 families consisting of siblings plus their parents with an average age of 44.5 years from Colombia (4 families), Korea (3), the UK (2) and the USA (1). The strains within each pair or group of isolates must have diverged very recently because each pair/group shared identical sequences within at least four of the seven MLST housekeeping fragments. In contrast, in previous population genetic studies based on these seven gene fragments, random pairs of isolates were usually distinct at all or most of the seven gene fragments. Despite the limited differences found here between pairs of isolates, the frequency of polymorphic sites across the entire data set was high (0.18±0.04), almost as high as in a comparison of the same 78 gene fragments from seven genomic sequences (0.27±0.07; Table S4).
Figure 2 shows a comparison of the paired sequences from the serial isolates. Out of a total of 2650 pair-wise sequence comparisons of gene fragments, 62 contained 1 polymorphic site, 12 showed two polymorphisms and 50 showed more than two polymorphisms. The total number of fragments with sequence differences correlates significantly with the time difference between the serial samples (R=0.4, p=0.02; Figure 3A), referred to as the minimal age below. Thus, sequence diversity introduced by mutation plus recombination seems to accumulate in a clock-like manner in infected individuals. We note that minimal age represents only a lower bound for the time of divergence between those isolates because the variant might have arisen earlier and persisted together with the parent in the form of a mixed infection. The maximal age is the extreme opposite scenario to the minimal age, namely that the variants evolved soon after birth. We approximated the maximal time of divergence within each individual as the sum of the ages at sampling. There is apparently no correlation between this maximal age and the number of polymorphic fragments (R=0.07, p=0.7, Figure 3B).
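The clock-like test described above amounts to a Pearson correlation between the sampling interval and the number of gene fragments that differ. The following is a minimal sketch of that calculation; the helper name `pearson_r` is hypothetical and the numbers are made up purely for illustration (they are not the study's data).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values only (NOT the study's data):
# years between the two samples, and number of fragments that differ.
intervals_years = [0.25, 1.0, 2.5, 4.0, 8.0, 10.2]
differing_fragments = [0, 1, 0, 3, 2, 6]

r = pearson_r(intervals_years, differing_fragments)
print(f"R = {r:.2f}")
```

The same function could be applied to the maximal age (sum of the ages at sampling) instead of the sampling interval to reproduce the second comparison.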
Pair-wise comparisons of sequences from the family isolates revealed even greater diversity (Figure 4), as expected because the time of separation of these pairs is greater. Out of 2568 pair-wise gene fragment comparisons, 183 showed one nucleotide difference, 30 had two and 186 had at least three. However, although the longer time span for divergence of the family isolates was expected to show even stronger correlations with time, this was not the case. Instead, we could not find a significant correlation between the numbers of non-identical gene fragments and any function of the age of the family members that was tested. For example, if infection were transmitted to siblings or children when they reached 20 years of age, a significant correlation should have been observed between the numbers of distinct gene fragment sequences and the minimum age of the two family members – 20 (minimal age), but this was not the case (R=−0.19; p=0.28) (Figure 3C). Similarly, if each of the family members were infected at birth, a significant correlation would have been expected against the sum of the ages of the two family members (maximal age), but again this was not the case (R=0.03; p=0.86; Figure 3D). Visual examination of the data indicated that this lack of correlation with age largely reflected two families, numbers 23 and 26, which had unusually high levels of polymorphism. After removal of data from these two families, the number of differences was significantly correlated with maximal age (R=0.4; p=0.045; Figure S1D).
We designed a statistical model of the microevolutionary process in order to analyze our data. Our model assumes that each sequenced fragment evolved independently for an unknown number of years. During that time, mutation events happen according to a molecular clock with a constant rate m per site per year, and independent recombination events occur in and around the fragment at a constant rate r per initiation site per year. We follow Falush et al. in assuming that when a recombination event happens, it affects a stretch of DNA with a geometrically distributed length of mean λ from the initiation point. In the affected region, each site has a probability of being substituted which is drawn from a normal distribution with mean equal to ν. Our recombination model is therefore similar to that of ClonalFrame, except that the rate of substitution introduced by each recombination event is drawn from a distribution rather than being constant. The use of such a distribution is advantageous because it reflects the variation in relatedness between donor and recipient across recombination events.
We applied this microevolutionary model to our data using Approximate Bayesian Computation (ABC). ABC is a Monte Carlo method to perform statistical inference on the parameters of a model using summary statistics, and is well suited to deal with the complex models that arise in population genetics. We therefore performed ABC inference under the model described above, using the algorithm described by Marjoram et al. This algorithm uses Markov chain Monte Carlo, but instead of guiding the random walk on the parameter space according to the likelihood, as is usually done, it is guided according to the ability of the parameters to produce a dataset with similar summary statistics (see Materials and Methods).
Our model can be directly applied to the serial isolate data since it describes the evolution between a pair of isolates, resulting in the parameter estimates that are summarized in the first column of Table 4. However, we also wanted to perform the same statistical analysis with the family isolate data as for the serial isolate data. To do so, we first attempted to deduce the genealogical relationships between the isolates within each family using ClonalFrame, but the statistical uncertainty found in these reconstructions was too high to make this approach practical, i.e. it is unclear who infected whom. Therefore, we made no assumptions about phylogeny but rather performed pair-wise comparisons of each pair of isolates within a family. This technique has the disadvantage that it might count some microevolutionary events several times in the pair-wise comparisons, but it is the only approach available in the absence of a robust estimate of phylogenies. The parameter estimates for the family data are also reported in Table 4.
We assessed the validity of our model by comparing the observed distributions for two summary statistics that were not used in the ABC inference with their posterior predictive distributions, i.e. the distributions obtained by simulations using parameters from the posterior sample (Figure 5). This method of model criticism has been applied previously in multiple ABC studies. The distribution of the number of polymorphisms per gene fragment was quite similar between the data and the posterior simulations from the serial isolates: most gene fragments contained only one polymorphism, several contained two or three polymorphisms, and the frequencies of larger numbers of polymorphisms were spread fairly uniformly over the entire data set (Figure 5A). The length of the polymorphic stretches was less uniform (Figure 5B). The data contained multiple fragments with polymorphisms in stretches of less than 50 bp whereas larger polymorphic stretches were distributed fairly evenly up to the maximum length of just under 1,600 bp. In contrast, the posterior predictive distribution of lengths of polymorphic stretches was fairly uniform, except that stretches of 500–900 bp and of 1,300–1,500 bp were somewhat more frequent. However, these differences between observed data and simulations were relatively minor, again providing support for the validity of our model and inference methodology. Similarly, only minor differences were found when comparing the family data in the same way (Figure S2).
The average rate of polymorphism introduced by recombination events (ν) was 0.02 (Table 4), which is somewhat lower than the average genetic distance between unrelated members of H. pylori from Ladakh in northern India (0.03) or Europe (0.04). In turn, this lower rate indicates that donors and recipients were somewhat more closely related than are random, unrelated isolates, and may reflect increased opportunities for recombination within members of the same subpopulations due to geographical structure. Local geographic structure arises due to isolation by distance, and isolates within families may have had more opportunities for prior recombination events that would reduce diversity than do geographically separated isolates.
The mean length of imports (λ) was 1247 bp, which is in good agreement with recent estimates from experimental work, but considerably greater than the value of 417 bp found previously among serial isolates by Falush et al. We ascribe this discrepancy to the limited number (eight) of recombination events examined by Falush et al. rather than to differences in methodology. The combination of these two estimates (λ ν) indicates that on average 18.6 nucleotide substitutions were introduced by each recombination event, although this number varied greatly between individual recombination events (Figure 5A).
The average rate of mutation m (per nucleotide site, per year) was estimated as 1.4×10⁻⁶ and the average rate of recombination r (per initiation site, per year) was 2.4×10⁻⁷. These estimates are sensitive to our choice of prior for the evolutionary time of split between isolates, on which there is much uncertainty. However, Figure 3A provides support for clock-like microevolution versus the time of isolation of the paired isolates (minimal age) and the ABC analyses were performed using very uninformative priors for their time of separation, consisting of the range since birth to the time of isolation of the bacterial strains. Furthermore, data and simulations based on the estimated parameters correspond well in regard to the frequencies of numbers of polymorphisms and reasonably well for the lengths of polymorphic stretches (Figure 5). We therefore conclude that these estimates are reasonably accurate as measures of mutation and recombination rates over very short time periods of up to 10 years.
The ratio r/m should be a robust measure of the relative frequencies at which mutation and recombination are initiated at a given site because both r and m are equally affected by any under- or over-estimation of the split times. The mean estimate for r/m is 0.19. Thus mutations are on average 5 times more frequent than recombination events over the genome of H. pylori. However, even though recombination happens less often than mutation, its effect is much more dramatic, as indicated in Table 4 by the estimate of 3.4 for r λ ν/m, which represents the ratio of the rates at which a site is substituted through recombination and through mutation. According to this estimate, a site is >3 times as likely to be substituted by recombination as by mutation.
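As a rough check on these two ratios, the point estimates quoted in the text for the serial isolates can simply be plugged in. This is only a sketch: the variable names are illustrative, and the plug-in values (about 0.17 and 3.2) differ slightly from the reported posterior means of 0.19 and 3.4, presumably because the mean of a ratio over posterior samples is not the same as the ratio of the means.

```python
# Point estimates quoted in the text for the serial isolates.
m = 1.4e-6               # mutations per site per year
r = 2.4e-7               # recombination initiations per site per year
subs_per_import = 18.6   # lambda * nu: substitutions per recombination event

# Relative frequency at which the two kinds of event are initiated at a site.
initiation_ratio = r / m

# Relative rate at which a site is substituted by recombination vs. mutation.
substitution_ratio = r * subs_per_import / m

print(f"r/m           = {initiation_ratio:.2f}")
print(f"r*lambda*nu/m = {substitution_ratio:.2f}")
```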
The average estimates for m and r were about 3 times higher within the families than in the paired isolates (Table 4). We considered the possibility that the different estimates of r and m between serial and family data might reflect the fact that families 23 and 26 exhibited elevated numbers of polymorphisms. However, after excluding these two families, the resulting parameter estimates did not differ dramatically from the estimates summarized in Table 4. We note, however, that in the absence of specific evidence from the data, Bayesian analysis with a broad uniform prior will tend to settle on values within the range of the prior rather than at the extremes. Genetic diversity within families correlated with maximal age (after excluding families 23 and 26; Figure S1D) whereas diversity between serial isolates correlated with minimal age (Figure 3A). Thus, this tendency to use internal values within a broad prior range would shift our parameter estimates for the serial and family isolates in opposite directions away from the extreme age that best correlated with diversity, and could well account for the threefold difference between the two sets of parameter estimates. Finally, we also note that we tested 10 family isolates to see whether the elevated numbers of polymorphisms in families 23 and 26 were accompanied by extreme in vitro frequencies of mutation and DNA transformation (from strain J99). However, although a broad range was measured for the frequencies of both mutation (sevenfold) and transformation (200 fold) (each with one outlier), there was no clear correlation between the two exceptional families and the extremes of the laboratory rates (data not shown).
In contrast to r and m themselves, the ratio r/m is independent of time and should be robust. For the family isolates, this ratio has a mean value of 0.18, very similar to the estimate of 0.19 for the serial isolates (Table 4). Similarly, the tract length λ and the frequency ν at which polymorphisms were introduced are also independent of time, and were only slightly higher in the family data than in the serial isolate data (Table 4). ν remains lower than the average pair-wise distance between two random strains of H. pylori and λ is consistent with recent estimates of tract lengths introduced by recombination in the laboratory. Finally, the relative effect of recombination and mutation, r λ ν/m, should also be relatively robust in regard to uncertainties about time of separation. The mean value of 5.5 was about 60% higher than for the serial isolates (3.4), possibly reflecting more opportunities for recombination over the longer time period of infection in the families than within the serial isolates.
The estimated short-term mutation rates in the serial and family isolates were 1.4×10⁻⁶ and 4.5×10⁻⁶, respectively. This range is a robust estimate of the mutation rate over years to decades. It is also possible to calculate a longer term mutation rate for genetic diversity between H. pylori from different global sources, because isolation by distance over the last 60,000 years has resulted in parallel trends in changes in genetic diversity between these bacteria and their human hosts. As a result, diversity between H. pylori from different global sources has accumulated in a clock-like manner that correlates with, and can be dated by, the times of separation of their human hosts. We estimated the long-term mutation rate on the basis of the ClonalFrame analyses described by Moodley et al., yielding a long-term estimate for m of 2.6×10⁻⁷ (Table 2). This value is 5–17 fold lower than the short-term rates calculated here, which is probably a general phenomenon among bacteria according to theoretical considerations. One reason for such discrepancies is that even neutral polymorphisms are usually lost over time through genetic drift. A second reason is that non-synonymous mutations will be selected against with time because many of them are slightly deleterious, which should result in a lower dN/dS ratio (the ratio of the rates of non-synonymous to synonymous mutations). A loss of non-synonymous mutations will reduce the apparent mutation rate because approximately 75% of all mutations in coding genes are non-synonymous.
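The 5–17 fold range follows directly from dividing each short-term estimate by the long-term estimate. A minimal sketch of that arithmetic, using only the rates quoted in the text (the function name `fold_difference` is hypothetical):

```python
def fold_difference(short_term_rate, long_term_rate):
    """How many times higher the short-term rate is than the long-term rate."""
    if long_term_rate <= 0:
        raise ValueError("long-term rate must be positive")
    return short_term_rate / long_term_rate

LONG_TERM_M = 2.6e-7   # long-term estimate for m quoted in the text
SHORT_TERM_M = {"serial isolates": 1.4e-6, "family isolates": 4.5e-6}

for label, rate in SHORT_TERM_M.items():
    print(f"{label}: {fold_difference(rate, LONG_TERM_M):.1f}-fold higher")
```

With these inputs the two ratios come out at roughly 5.4 and 17.3.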
We estimated what proportion of the 5–17 fold reduction in the long-term mutation rate could be accounted for by the loss of non-synonymous mutations. Based on our simulations with the serial isolates, approximately 99% of paired fragments with only one polymorphism resulted from mutation rather than recombination. Thus we could equate the polymorphisms within fragments containing only one SNP to mutations, allowing the calculation of dN/dS even when other fragments had undergone recombination. The resulting dN/dS ratio was 0.5, which indicates that only little purifying selection had taken place over the time period considered here, as is also the case in other examples of recent microevolution. Over longer time periods, purifying selection of deleterious non-synonymous mutations does take place in H. pylori, resulting in an average dN/dS ratio of 0.07 (sevenfold lower) in housekeeping genes among unrelated isolates, which is in good agreement with the 5–17 fold difference in mutation rates.
Finally, we return to the general question of the short-term clock rate within bacteria. The results presented here demonstrate that the short-term clock rate in H. pylori is approximately the same (0.4–1.4 fold) as the short-term clock rate in S. aureus ST239, 6.2–20.5 times the rate in Buchnera and 158–524 times the rate in Y. pestis (Table 2). These comparisons show that the short-term clock rate varies dramatically among different bacteria, and in some cases overlaps with those of RNA viruses. However, in all cases considered here, it is higher than the long-term (synonymous) clock rate of 3.4×10⁻⁹ that has often been used until now to calculate the ages of genetically monomorphic bacteria.
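To illustrate why the choice of clock rate matters so much for dating, the sketch below converts a fixed amount of pairwise divergence into an age under different rates. It assumes the textbook strict-clock relation d = 2 × rate × T for two lineages diverging from a common ancestor, which is an assumption of this example rather than the method used in the study; the divergence value is hypothetical and the function name is illustrative.

```python
def divergence_time_years(pairwise_divergence, clock_rate):
    """Time since two lineages split under a strict clock: d = 2 * rate * T."""
    if clock_rate <= 0:
        raise ValueError("clock rate must be positive")
    return pairwise_divergence / (2.0 * clock_rate)

d = 0.001  # hypothetical pairwise divergence per site, for illustration only

rates = [
    ("long-term synonymous clock", 3.4e-9),
    ("H. pylori short-term, serial isolates", 1.4e-6),
    ("H. pylori short-term, family isolates", 4.5e-6),
]
for label, rate in rates:
    print(f"{label}: {divergence_time_years(d, rate):,.0f} years")
```

Under this simple relation the inferred age scales inversely with the rate, so replacing 3.4×10⁻⁹ with 1.4×10⁻⁶ shortens the estimate by a factor of about 400.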
We studied two types of bacterial isolates of H. pylori: serial isolates which were collected from individual persons after a specified time interval, and family isolates which were collected concurrently from two or more members of the same family (Table S2). The 68 serial isolates were collected from 34 patients at intervals ranging from 3 months to 10.2 years. The 29 family isolates were collected from 2 to 5 members of 10 families.
Fragments of 78 genes were sequenced (Table S1). Additional extended flanking regions were also sequenced when sequence polymorphisms were detected in the standard fragments. PCR products were amplified and sequenced by standard Sanger sequencing on an ABI 3730 XL as described, using the oligonucleotide primers listed in Table S3, except that PCR products were cleaned using shrimp alkaline phosphatase plus exonuclease I. All sequence data have been deposited at the Helicobacter pylori Multi Locus Sequence Typing website (http://pubmlst.org/helicobacter/projects/microevolution/alldata.zip) developed by Keith Jolley and sited at the University of Oxford.
We designed a microevolutionary model which describes the evolution of the genome of a strain over a certain period of time T. During this time, each nucleotide of the genome is mutated with probability T×m and is the initiation site of a recombination with probability T×r. When a recombination occurs, it affects a segment of the genome starting from the initiation site and stretching to the right over a length which is geometrically distributed with mean λ. Each site of the affected segment has a probability to be substituted which is normally distributed with mean ν.
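The generative process described above can be sketched as a small forward simulation of a single fragment. Several details here are assumptions made only for illustration and are not specified in the text: the standard deviation `nu_sd` of the normal distribution, the clipping of the drawn substitution probability to [0, 1], the size of the upstream `flank` from which tracts may extend into the fragment, and the function name itself. Back-mutation is ignored.

```python
import random

def simulate_fragment(length, years, m, r, lam, nu,
                      nu_sd=0.01, flank=5000, rng=None):
    """Return sorted 0-based positions in a fragment that differ after `years`.

    Assumed for this sketch (not from the source): nu_sd, clipping the
    substitution probability to [0, 1], and the upstream flank size.
    """
    rng = rng or random.Random()
    changed = set()

    # Point mutations: each site mutates independently with probability T*m.
    p_mut = min(1.0, years * m)
    for pos in range(length):
        if rng.random() < p_mut:
            changed.add(pos)

    # Recombination: each site, including the upstream flank, is an
    # initiation site with probability T*r; tracts extend to the right.
    p_rec = min(1.0, years * r)
    for start in range(-flank, length):
        if rng.random() >= p_rec:
            continue
        # Geometric tract length with mean lam (support 1, 2, ...).
        tract = 1
        while rng.random() >= 1.0 / lam:
            tract += 1
        # Per-event substitution probability drawn around nu, then clipped.
        p_sub = min(1.0, max(0.0, rng.gauss(nu, nu_sd)))
        for pos in range(max(start, 0), min(start + tract, length)):
            if rng.random() < p_sub:
                changed.add(pos)

    return sorted(changed)

# Example call with the serial-isolate point estimates quoted in the text.
diffs = simulate_fragment(398, 10.0, 1.4e-6, 2.4e-7, 1247, 0.02,
                          rng=random.Random(1))
print(len(diffs), "differing sites")
```

Repeating such a simulation over many fragments and pairs would yield simulated summary statistics of the kind that an ABC procedure compares with the observed data.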
The parameters of this microevolutionary model are the time T separating each compared pair of isolates, the mutation rate m per site per year, the recombination rate r per initiation site per year, the average tract length of recombination λ and the average rate of polymorphism introduced by recombination ν. The prior for the time of divergence between the paired isolates is described below. Priors for the four other parameters were uniform from 0 to infinity (improper prior).
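The model described above can be sketched in a few lines of code. The following is a minimal illustration, not the implementation used in this study. It makes several simplifying assumptions: the per-site substitution probability within a recombined tract is fixed at ν rather than drawn from a normal distribution, tracts that start to the left of the fragment are ignored, and tracts are truncated at the right end of the fragment. The function names simulate_fragment and geometric are hypothetical.

```python
import math
import random


def geometric(rng, mean):
    """Draw a tract length >= 1 from a geometric distribution with the given mean."""
    if mean <= 1.0:
        return 1
    p = 1.0 / mean
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return 1 + int(math.log(u) / math.log(1.0 - p))


def simulate_fragment(length, T, m, r, lam, nu, rng):
    """Return the sorted positions substituted in one fragment after time T.

    Assumption: nu is used directly as a fixed per-site substitution
    probability inside a recombined tract (the text describes it as the
    mean of a normal distribution).
    """
    substituted = set()
    for site in range(length):
        # Point mutation at this site with probability T * m.
        if rng.random() < T * m:
            substituted.add(site)
        # Recombination initiated at this site with probability T * r.
        if rng.random() < T * r:
            tract = geometric(rng, lam)
            # The tract stretches to the right and is cut at the fragment end.
            for pos in range(site, min(site + tract, length)):
                if rng.random() < nu:
                    substituted.add(pos)
    return sorted(substituted)


rng = random.Random(1)
# Illustrative parameter values only; they are not estimates from the data.
positions = simulate_fragment(500, T=5.0, m=1e-4, r=1e-4, lam=400, nu=0.05, rng=rng)
```

Repeating such a simulation for each of the 78 fragments of a pair of isolates would give one simulated data set for a given choice of T, m, r, λ and ν.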
Because the evolutionary time separating pairs of isolates is unknown, we had to assume a prior for this quantity in order to perform Bayesian inference. For the serial isolates, we know that the time elapsed between successive isolations represents a lower bound. If we further assume that the two isolates originated from the same infection, and since this infection must have happened after the birth of the patient, each of the two lineages can have evolved for at most the age of the patient, which gives an upper bound equal to twice the age of the infected person. We thus assumed a uniform prior for the evolutionary time separating serial isolates between these lower and upper bounds.
For the evolutionary time separating a pair of family isolates, we took a lower bound equal to the minimum of the ages of the two family members minus 20, based on the idea that H. pylori infection usually occurs before the age of 20. We took an upper bound equal to the sum of the ages of the two family members. We assumed a uniform prior for the evolutionary time separating pairs of family isolates between these lower and upper bounds.
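The bounds of the two uniform priors described above can be written out as a short sketch. The function names are hypothetical, and clamping the family lower bound at zero when the younger family member is under 20 years old is an assumption of this sketch that is not stated in the text.

```python
import random


def serial_time_bounds(interval_years, age_years):
    """Bounds for serial isolates: sampling interval up to twice the patient's age."""
    return (interval_years, 2.0 * age_years)


def family_time_bounds(age_a, age_b):
    """Bounds for family isolates: (younger age - 20) up to the sum of both ages.

    Assumption: a negative lower bound is clamped to zero.
    """
    lower = max(0.0, min(age_a, age_b) - 20.0)
    return (lower, age_a + age_b)


def sample_time(bounds, rng):
    """Draw an evolutionary time from the uniform prior between the bounds."""
    lower, upper = bounds
    return rng.uniform(lower, upper)


rng = random.Random(42)
# Illustrative inputs: a pair sampled 3 months apart from a 45-year-old patient.
t_serial = sample_time(serial_time_bounds(0.25, 45), rng)
# Illustrative inputs: family members aged 45 and 30.
t_family = sample_time(family_time_bounds(45, 30), rng)
```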
We performed inference under the model above using the Approximate Bayesian Computation (ABC) algorithm described by Marjoram et al. This algorithm was run independently for the serial isolates and the family isolates. The length of each run was set at 100,000 iterations, which took approximately 5 hours on a desktop computer. Several independent runs were performed and compared manually in order to ensure that good convergence and mixing properties were achieved.
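For readers unfamiliar with likelihood-free Markov chain Monte Carlo, the general scheme can be illustrated on a toy problem. The sketch below is not the code used in this study. It assumes a single parameter, a symmetric Gaussian random-walk proposal, a flat prior on the interval from 0 to 1, and a stand-in simulator in which each of 78 fragments changes independently with probability theta; the observed count of 8 changed fragments, the tolerance and the step size are invented for illustration.

```python
import math
import random


def abc_mcmc(observed, simulate, log_prior, theta0, step, tolerance, n_iter, rng):
    """Likelihood-free MCMC with a symmetric Gaussian random-walk proposal."""
    theta = theta0
    chain = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        lp_new = log_prior(proposal)
        if lp_new > -math.inf:
            # Simulate data under the proposal and compare summaries.
            close = abs(simulate(proposal, rng) - observed) <= tolerance
            # Symmetric proposal: the Hastings ratio reduces to the prior ratio.
            log_ratio = lp_new - log_prior(theta)
            if close and math.log(1.0 - rng.random()) < log_ratio:
                theta = proposal
        # A rejected proposal repeats the current value in the chain.
        chain.append(theta)
    return chain


def toy_simulate(theta, rng, n_fragments=78):
    """Toy stand-in for the model: number of fragments that changed."""
    return sum(1 for _ in range(n_fragments) if rng.random() < theta)


def flat_log_prior(theta):
    """Flat prior on [0, 1]; minus infinity outside the support."""
    return 0.0 if 0.0 <= theta <= 1.0 else -math.inf


rng = random.Random(7)
chain = abc_mcmc(observed=8, simulate=toy_simulate, log_prior=flat_log_prior,
                 theta0=0.1, step=0.03, tolerance=0, n_iter=20000, rng=rng)
posterior_mean = sum(chain[2000:]) / len(chain[2000:])  # discard burn-in
```

In a real analysis the scalar summary would be replaced by a vector of summary statistics with a suitable distance, and convergence would be checked by comparing independent runs, as described above.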
One essential step in ABC analysis is the choice of the summary statistics used, which determines how exact the inference is. If the whole data were used as a summary, the algorithm would be exact but unfeasibly slow. If no summary statistic were used at all, the Markov chain would explore the prior on the parameters. It is thus important to find a handful of statistics that summarize the information contained in the data about the parameters as well as possible. Here we found that the data were well summarized by the numbers of gene fragments with zero, one, two or at least three substitutions, and the average spread of substitutions for the fragments with at least three substitutions. The rationale behind this choice is that fragments with one substitution are likely to be caused by mutation whereas fragments with at least three substitutions are likely to be caused by recombination. Therefore, even though our model makes no assumption about the cause of observed polymorphisms, the number of fragments with one substitution is informative about the mutation rate m and the number of fragments with at least three substitutions is informative about the recombination rate r. Furthermore, the average spread of substitutions for the fragments with at least three substitutions is informative about the average tract length of recombination λ.
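The five summary statistics can be computed as in the following sketch. The text does not define the spread of substitutions precisely; here it is assumed to be the distance between the first and last substituted positions in a fragment. The function name and the example data are hypothetical.

```python
def summary_statistics(fragments):
    """Summarize a list of fragments, each given as a list of substituted positions.

    Returns [n0, n1, n2, n3plus, mean_spread], where mean_spread is averaged
    over fragments with at least three substitutions (0.0 if there are none).
    Assumption: spread = last substituted position - first substituted position.
    """
    counts = [0, 0, 0, 0]
    spreads = []
    for positions in fragments:
        k = len(positions)
        counts[min(k, 3)] += 1  # fragments with >= 3 substitutions share one bin
        if k >= 3:
            spreads.append(max(positions) - min(positions))
    mean_spread = sum(spreads) / len(spreads) if spreads else 0.0
    return counts + [mean_spread]


# Invented example: five fragments with 0, 1, 2, 3 and 4 substitutions.
example = [[], [5], [3, 40], [10, 20, 110], [1, 2, 3, 51]]
stats = summary_statistics(example)
```

Applying the same function to observed and simulated fragments gives the two vectors that are compared at each iteration of the ABC run.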
We note that this model determines mutation and recombination by a phylogenetic approach, which implicitly assumes that each mutation is fixed rather than resulting in a polymorphism. This approach allows comparisons with the other mutation rates in Table 2, which were also calculated by a phylogenetic approach, except for that of C. jejuni. However, as pointed out by one of the reviewers, Joshua B. Plotkin, the sequence differences we have analyzed correspond to segregating polymorphisms, which might have implications for our estimated mutation rates.
As in Figure 3, except that pair-wise comparisons between isolates from families 23 and 26 were not included in (C,D).
(0.16 MB PDF)
Comparisons of data and simulations from family isolates. All other details are as in Figure 5.
(0.27 MB PDF)
78 gene fragments whose sequences were compared between paired isolates and within isolates from families.
(0.04 MB XLS)
(A) Paired serial isolates from 34 individuals. (B) Single isolates from 29 individuals in 10 families.
(0.03 MB XLS)
Sequences of oligonucleotide primers used for amplification and sequencing.
(0.09 MB XLS)
Polymorphic sites in 78 gene fragments from genomic sequences and from the paired isolates.
(0.06 MB XLS)
We gratefully acknowledge receipt of the family isolates from Johannes G. Kusters and additional information on them from Ernst J. Kuipers. We thank William Martin for discussions and citations on dating and Yoshan Moodley for providing the original data from which we could calculate a long-term mutation rate for H. pylori. Incisive comments by Sylvain Brisse resulted in re-examination of the family data and comments by Francois Balloux resulted in improvements in the text. We also thank Jessika Schulze for expert technical assistance and the two reviewers for their helpful and enthusiastic remarks.
The authors have declared that no competing interests exist.
This study was supported by grants Ac 36/11-2 and SU 133/7-2 (Deutsche Forschungsgemeinschaft) to MA and SS, INCA LSHC-CT-2005-018704 from the Sixth Research Framework Programme of the European Union and ERA-NET PathoGenoMics HELDIVNET to SS, and 05/FE1/B882 (Science Foundation of Ireland) to MA. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.