Two opposing models of within-patient bacterial evolution have been proposed: a “dominant lineage” model, in which beneficial mutations drive superior lineages to dominate in the population, and a “diverse community” model whereby adaptive lineages rise to intermediate frequency and coexist with other lineages ()11-14
. The diversity of within-patient pathogen populations has major implications for drug treatment and resistance7,15,16
, for inferring transmission networks8,9,17,18
, and for understanding evolutionary processes13,19
. Here, to distinguish between these models and to understand the sources of genetic diversity, we compared the genomes of many bacterial cells of the same strain from the same clinical sample.
Alternative models of within-patient evolution
We focused on chronic infections with Burkholderia dolosa
, a rare and deadly opportunistic pathogen that spread among 39 people in with cystic fibrosis (CF) cared for at a single center in Boston starting in the 1990s20,21
. The airways of these patients were infected with very similar starting strains, and surviving patients have been colonized for years. A previous retrospective study of single-colony isolates revealed specific B. dolosa
genes that evolved under strong selective pressures during the outbreak8
. Now, using sputum samples collected during clinical care, we characterize contemporary intraspecies diversity in 5 individuals from this outbreak who have been infected with B. dolosa
since the early 2000's.
We used two genomic approaches, colony re-sequencing (Patient 1) and deep population sequencing (Patients 1-5), to identify single nucleotide mutations and their frequencies within single sputum samples. In our colony re-sequencing approach, we isolated dozens of colonies from a clinical sample and analyzed their genomes individually by alignment of reads to a B. dolosa
reference genome, AU0158, a strain taken from a different patient in this outbreak. Since each colony originates from a single bacterium, this approach is equivalent to comparing different bacterial cells from the initial clinical sample. In the population sequencing approach22,23
, we pooled hundreds of colonies from each clinical sample and sequenced the pool with deep coverage (~450×). We then aligned reads to AU0158 and identified fixed mutations, appearing in all reads, and polymorphisms, appearing in only a fraction of the reads. To remove false positive polymorphic sites caused by systematic sequencing or alignment errors24,25
, we developed a set of thresholds and statistical tests that reject polymorphic sites where the mutated and ancestral reads have significantly different properties22,23
(see Supplementary Note
). We calibrated this approach using an isogenic control for which we expect no polymorphisms. For validation, we performed both methods on a single sample from Patient 1, comparing diversity among 29 individual colonies to the population sequencing approach (). The population sequencing approach reliably detects polymorphisms where the minor allele frequency is larger than 3%, while decreasing the cost and labor required per sample.
Two methods for studying genomic intraspecies diversity
We found that most mutations that arise during the course of infection do not fix, but remain polymorphic within the patient. The colony re-sequencing approach performed for Patient 1 revealed 188 mutations occurring in some, but not all, isolates and only 10 mutations shared among all isolates. This dominance of polymorphisms, also seen in the population sequencing from the same sample, strongly supports the diverse community model (). Similarly, for the four other patients, population sequencing on single samples identified a preponderance of polymorphisms compared to fixed mutations (≥73% of mutations, ). We found these excesses despite the bias to overestimate mutations fixed during infection; some fixed mutations in a sputum sample might be polymorphic within the patient's airways or have fixed prior to patient colonization (Supplementary Fig. 1
Within-patient evolution leads to diversification, not substitution
The observed genomic diversity is a reflection of multiple coexisting lineages. Investigating the community structure of B. dolosa within Patient 1, we found a deeply branched phylogeny with 6 lineages separated by at least 5 lineage-specific mutations (). On average, pairs of isolates from this sample differed by 26 mutations, and, of all 406 possible isolate pairs, only one was identical. Thus, even within a single sputum sample, the population is so diverse that full identity between isolates is extremely rare.
In one patient (Patient 5), the B. dolosa
community had many more mutations than other patients' populations (P < 0.05, Grubbs' test for outliers). This excess of mutations is due solely to increased transitions and not transversions, suggesting hypermutation (Supplementary Fig. 2a
, P < 0.01, Grubbs' test). A search of the 199 mutated genes unique to Patient 5's population revealed a single mutation involved in DNA repair: a nonsynonymous mutation at a conserved position in mutL
, defects of which are known to cause excess transitions26
(Supplementary Fig. 2b
). These excess mutations are enriched in synonymous mutations relative to the other patients, further supporting hypermutation (P < .001, Supplementary Fig. 2c
). While hypermutation is a common phenotype in many pathogens, hypothesized to accelerate the evolution of antibiotic resistance27-30
, it has not been previously described in members of the B. cepacia
For how long have these diverging lineages coexisted? The time to the last common ancestor (LCA) of each non-hypermutating patient's population32
can be estimated using the number of mutations accumulated since the LCA and the molecular clock previously measured for this outbreak (2.1 SNPs/year8
). Given the phylogeny of isolates from Patient 1, we calculated the distribution of the number of mutations since the LCA, dLCA
, across the population (). The mean value of dLCA
across isolates, <dLCA
>, is 19.6 single nucleotide mutations per genome (95% confidence interval, CI = 18.3-20.8), suggesting that the LCA existed 9.3 years ago (CI = 8.7-9.9). This places the LCA of the isolates from this sample slightly earlier than the first B. dolosa
culture from this patient (7.6 years before sample collection), suggesting that the B. dolosa
population in Patient 1 has been diverging since, or perhaps before, initial colonization. While the population sequencing approach cannot provide a distribution of dLCA
, due to a lack of information regarding linkage between mutations, we can still calculate <dLCA
>: it is the sum of the polymorphic mutation frequencies (see Supplementary Note
for derivation). Using this approach, the estimated time to LCA for Patient 1's population is 7.9 years. This value is slightly lower than calculated from the clonal re-sequencing approach, likely due to mutations left undetected by our conservative polymorphism caller (see Supplementary Note
for discussion of error). For Patients 2 and 4, the time to LCA calculated by this population sequencing approach is several years less than the time since first positive culture, suggesting fixation events sometime during these patients' histories (Supplementary Table 1
). For all these patients, we estimate that diverging lineages have coexisted in each of these patients for at least 5 years ().
Sublineages coexist within a patient for many years after divergence
To explore the drivers of this long-coexisting diversity, we examined the identity of the evolving genes. Interestingly, we found that within each sample, several B. dolosa
genes carried 2-4 coexisting polymorphisms (Supplementary Table 2
). This clustering is a significant departure from a neutral model given the number of mutations and the distribution of gene lengths (, P < 0.005 for Patients 1-4; Online Methods
). A similar analysis at the operon-level further identified several operons enriched for polymorphisms (Supplementary Table 3 and Supplementary Fig. 3
). An enrichment of nonsynonymous mutations in these multi-diverse genes and operons suggests that they are drivers of adaptive change in vivo
(dN/dS = 7.0, CI = 2.3-34.9, ). Polymorphisms are thus concentrated within genes undergoing adaptive evolution.
Coexistence of alternative adaptive mutations in the same sample highlights specific genes as drivers of within-host evolution
To understand why polymorphisms cluster within some genes, we asked if coexisting mutations in the same gene appeared in different lineages or were linked in a double mutant. Examining the single isolate genomes, we found no isolates with doubly mutated genes (Supplementary Fig. 4
). Similarly, for the population sequencing, in 10 of 11 cases where polymorphic positions are close enough on the genome to be covered by the same short sequencing reads, we did not find reads that contain both variants (, Supplementary Fig. 5
). In some of these cases, the ancestral genotype is completely purged from the population (). Thus, diversification is driven by multiple adaptive mutations in the same genes evolving in parallel within individual patients.
These findings provide a new signature of past selective pressures detectable in a single clinical sample; the coexistence of multiple polymorphisms within the same gene in a clinical sample. Sixteen B. dolosa
genes display this multi-diverse signature, including genes with homologs involved in outer membrane synthesis, antibiotic resistance, iron scavenging, oxygen sensing, amino acid synthesis, lactate utilization, and stress response. Additionally, some genes with less characterized biological roles display a multi-diverse signature, including two transcriptional regulators with unknown targets in B. dolosa
, an uncharacterized glucoamylase, and two genes that encode hypothetical proteins (Supplementary Table 2
). A similar signature for selection is seen in three operons, two involved in lipopolysaccharide transport and one containing a two-component regulatory system with unknown targets (Supplementary Table 3
). Selection on many of these elements can be rationalized based on the relevance of their annotated functions to conditions to which the bacteria are exposed in the course of the infection. Yet, further investigation will be required to understand the potential roles of some of these genes in antibiotic resistance, fitness, and other aspects of pathogenesis.
We found that many of the selective forces acting on the pathogen are the same across patients (). Often, genes showing a multi-diverse signature for selection in one patient also carry mutations in other patients (P < 0.002, hypergeometric test). A prominent example is gyrA, a well-studied target of quinolones, which is mutated in all patient populations. Further support for commonality in mutational trajectories across patients emerges from a significant overlap between this list of 16 multi-diverse genes and 17 genes previously found to be under parallel evolution across a larger group of patients, only one of whom (Patient 2) was included in both studies (P < 0.001, hypergeometric test). Thus, the study of a single clinical sample can provide generalizable lists of selective pressures felt within the human body.
Yet, some multi-diverse signatures are patient-specific. A penicillin-binding protein (BDAG_01166, homologous to PBP7) has 3 nonsynonymous mutations in Patient 1, but is not mutated in other patients. Such patient-specific parallel evolution might reflect patient-specific selective pressure or perhaps a fitness benefit dependent upon previously acquired mutations. But these hypotheses are hard to test because the genomic target for a selective force might include more than one gene. For example, four of the five patients' populations have a mutation in a homolog of the histidine kinase fixL
(BDAG_01161, known to be under strong selection in these infections8
) while the fifth has a mutation in the corresponding response regulator.
To investigate the stability of these multi-diverse signatures for selection, we collected a second sputum sample 14 days after initial sample collection from Patient 2. Three of the four genes with the multi-diverse signature at day 0 show the same pattern at day 14. The absence of the signature in the fourth gene at the later time point does not reflect a relaxation in selection for mutant alleles, but rather incomplete detection of genes under selection; this gene also has abundant nonsynonymous mutants at day 14, concentrated at a single nucleotide position (Supplementary Fig. 6
). These results suggest that the multi-diverse signature for selection is relatively stable and that multiple sample collections per patient can increase the sensitivity of the detection.
Our results reject the dominant lineage model of infection, yet demonstrate that these diversifying bacteria adapt under the pressure of natural selection. These observations are consistent with clonal interference: in large asexual populations, multiple beneficial mutations emerge and compete, impeding the ability of these lineages to reach fixation33-35
. In addition to large population size (108
cells/mL sputum), the branched structure of the airways may further hinder the capacity of any adaptive lineage to dominate and fix, and the immune system or niche-specific adaptations might directly promote diversity. Diversified by any of these means, lineages may then continue to evolve in parallel against common selective forces.
As B. dolosa
adapts to the airways of people with cystic fibrosis, mutations lead to diversification rather than fixation and replacement. Though it is possible that adaptive mutations will lead to fixation more frequently in other infections, there is evidence that, at least in long-term colonization, diversity might be common14,36-38
. This long-term coexistence of diverse lineages records the genomic history of selection on the pathogen within its host. The ability to rapidly read off within-patient evolutionary history from the genotypic diversity within a single clinical sample may greatly accelerate the ability to survey selective pressures acting on bacterial pathogens in vivo
– shifting from an epidemic level investigation to a single-patient paradigm.