Genetic markers previously reported to occur at significantly different frequencies in isolates of Escherichia coli O157:H7 obtained from cattle and from clinically affected humans concordantly delineate at least five genetic groups. Isolates in three of these groups consistently carry one or more markers rarely found among clinical isolates.
Escherichia coli serotype O157:H7 is an important zoonotic pathogen that may cause diarrhea, bloody diarrhea, and hemolytic-uremic syndrome (3, 10, 14). E. coli O157 is transmitted to humans by direct contact with the animal reservoir (which includes cattle and other ruminant animals) or indirectly by ingestion of contaminated food or water (3, 10). Genetic analyses of bovine isolates of E. coli O157 from diverse geographic origins have provided evidence for the global dissemination of genotypes and also for significant regional differences in the relative prevalence of some genotypes (5, 8, 18, 19).
Several research groups have identified genetic markers that occur at different relative frequencies among E. coli O157 isolates from human clinical cases and from cattle. One group initially used octamer-based genomic scanning to identify two lineages of U.S. origin E. coli O157 (7), of which lineage I was composed mostly (36/44) of clinical isolates and lineage II was composed mostly (25/32) of cattle isolates. Subsequently, a simpler multiplex PCR-based assay (Table 1), the lineage-specific polymorphism assay (LSPA), was developed to identify these lineages (19). Six LSPA loci with alleles characteristic of lineage I, lineage II, or neither lineage I nor lineage II are, respectively, classified with the digit 1, 2, or 3, and these digits are concatenated into an LSPA code: 111111 indicates lineage I, and 211111 indicates a genetically intermediate group termed lineage I/II (20), whereas all other genotype variations are considered to belong to lineage II. More recently, a typing assay based on Shiga toxin-encoding bacteriophage insertion (SBI) sites grouped 91 of 92 clinical E. coli O157 isolates from the northwestern United States into three clusters, of which clusters 1 and 3 predominated (>90%) (16). SBI consists of six PCRs (Table 1) that amplify the Stx toxin genes and the insertion site junctions of the Stx1- and Stx2-encoding bacteriophages of E. coli O157. In a subsequent study, the predominance (92.6%) of clusters 1 and 3 was confirmed in 190 additional human (clinical) isolates (1). In contrast, many (48.8%) E. coli O157 isolates from cattle in the northwestern United States and western Canada demonstrated SBI patterns rarely found among human (clinical) isolates (1).
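The LSPA coding rule described above can be written as a short lookup. The following is a minimal sketch, not the published assay software; the function name `lspa_lineage` and the label strings are illustrative assumptions.

```python
def lspa_lineage(code: str) -> str:
    """Map a six-digit LSPA code to a lineage label.

    Each digit is 1 (lineage I allele), 2 (lineage II allele),
    or 3 (neither), one digit per LSPA locus.
    The function name and return labels are hypothetical.
    """
    if len(code) != 6 or any(digit not in "123" for digit in code):
        raise ValueError(f"not a valid six-locus LSPA code: {code!r}")
    if code == "111111":
        return "lineage I"
    if code == "211111":
        return "lineage I/II"
    # Every other allele combination is treated as lineage II.
    return "lineage II"
```

For example, `lspa_lineage("213111")` falls through both special cases and is labeled lineage II.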
Additional individual markers reported to occur at differing frequencies among clinical and reservoir isolates include the presence or absence of stx2-Q junction alleles (e.g., Q933 and Q21 alleles in 90% and 15.2% of 66 human isolates versus 44% and 64.8% of 91 bovine isolates, respectively) (9) and the nonsynonymous single nucleotide polymorphism (SNP) 255T→A in tir, a key virulence gene of E. coli O157:H7 (<1% of 108 human isolates versus 44% of 77 bovine isolates had the A allele) (2).
The goal of this study was to evaluate the concordance of these various markers reported to occur at different frequencies among isolates from asymptomatic cattle and from human patients. A convenience set of 145 E. coli O157 isolates obtained from cattle, aggregated from two isolate sets chosen to maximize the diversity of geographic and temporal origins within our isolate bank and whose provenance and SBI types were described previously, was used for this study (1, 5, 18). Briefly, these isolates were non-sorbitol-fermenting, beta-glucuronidase-negative E. coli O157 isolates from cattle on 130 different premises in five countries and 14 U.S. states, isolated in 12 different years ranging from 1991 through 2004. The isolates from outside North America included isolates from Australia (n = 7, obtained in 1993 to 2003), Japan (n = 17, obtained in 1996 to 1997), and Scotland (n = 11, obtained in 1999). LSPA was applied to this set by using previously described primer sequences (19), although capillary rather than gel electrophoresis was used (Table 1; DNA analyzer 3730, LIZ 600 size standard; Applied Biosystems, Foster City, CA). Data were analyzed with GeneMarker software (SoftGenetics, LLC, State College, PA). Q-stx2 alleles Q933 and Q21 were detected by PCR, and the tir polymorphism was detected by real-time PCR as described previously (2, 9).
Comparison of typing results produced by the LSPA, SBI, Q-stx2, and tir methods showed considerable overall agreement. Cross-classification of the LSPA and SBI results (Table 2) showed particularly strong agreement in assignment to the two human disease-associated genotypes (LSPA 111111 and 211111; SBI 1 and 3; chi square = 268, 20 df, P < 0.001; Cramer's V statistic = 0.681). Q-stx2 typing identified the Q933 allele in 117 isolates, including 59 of 60 LSPA/SBI human disease-associated genotypes. The Q21 allele was detected in 67 isolates but was not strongly associated with either human disease or cattle-associated genotypes overall (data not shown). The tir nucleotide 255A allele was detected in 39 isolates, only 1 of which had an LSPA/SBI human disease-associated genotype.
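For readers unfamiliar with the statistic, Cramer's V rescales chi square by the sample size and the smaller table dimension. The sketch below is illustrative only. It assumes a 5 × 6 cross-classification, which is one table shape consistent with 20 df but is not stated in the text, and the function names are hypothetical.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                continue  # an empty row or column contributes nothing
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V = sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Reported chi square and isolate count; the 5 x 6 shape is an assumption.
v = cramers_v(chi2=268, n=145, n_rows=5, n_cols=6)
```

Under that assumed shape, the rounded chi-square value gives a V of approximately 0.68, in line with the reported statistic.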
While these cross-comparisons supported a significant degree of concordance between the results of the various typing systems, the data analysis was complicated by the differing numbers of genotypes determined by the different systems, and in particular by the classification by LSPA and SBI of numerous isolates into a number of sparsely populated genotypes (Table 2). More generally, it seemed likely that the best classification of the isolates would result from a consideration of all of the data generated. Therefore, we used Markov chain Monte Carlo (MCMC) model-based clustering, implemented in the structure software package, version 2.2 (6), to investigate the population structure using as input data the 15 locus-specific test results (i.e., the six loci each from the LSPA and SBI genotyping panels together with the Q933, Q21, and tir loci) (see Table S1 in the supplemental material). The model assumes K populations, each of which is characterized by allele frequencies at multiple unlinked or weakly linked loci. Within each population, the loci are assumed to be at linkage equilibrium. It was not possible to test the validity of these assumptions for the isolate set modeled here, and it is likely that at least some degree of linkage disequilibrium is present within E. coli O157:H7 populations (12). We utilized this model both to determine the most likely number of populations (K) within the isolate set and to assign individual isolates to the best-fitting population(s). K = 1 would imply a lack of genetic substructure within the isolate set, while any K of >1 would assume the presence of the corresponding number of subgroups with distinct sets of allele frequencies. Initial assignments of group membership for each isolate were based on the location (North America, Scotland, Japan, or Australia) of the cattle from which the E. coli O157 isolates were obtained, due to the potential for genetic divergence of geographically separated populations.
K values of 1 to 10 were initially evaluated with 10 model runs each, with each run consisting of a 20,000-step burn-in followed by a 50,000-step parameter estimation. Comparison of the estimated logarithmic posterior probabilities [ln P(X|K), where X is the data] of these runs revealed that K values of <4 or >7 were highly unlikely. Additional runs (25 runs, each consisting of 100,000 steps for burn-in, followed by 100,000 steps for parameter estimation), were then performed in order to model each K value from 4 through 7. The results of these models demonstrated nearly equal maximum relative posterior probabilities for K = 5 and K = 6.
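One common way to compare candidate K values is to convert the estimated ln P(X|K) values into relative posterior probabilities under a uniform prior on K. The sketch below illustrates that normalization; the uniform prior, the function name, and the example values are assumptions for illustration and are not results from this study.

```python
import math

def relative_posteriors(ln_likelihoods):
    """Normalize ln P(X|K) values into relative posterior probabilities.

    ln_likelihoods maps each K to its estimated ln P(X|K).
    A uniform prior over the listed K values is assumed.
    """
    top = max(ln_likelihoods.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {k: math.exp(v - top) for k, v in ln_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# Hypothetical ln P(X|K) values, chosen only to show two K values
# with nearly equal support; they are not taken from the study.
example = {4: -1250.0, 5: -1200.0, 6: -1200.5, 7: -1230.0}
posteriors = relative_posteriors(example)
```

Because the values are on a log scale, differences of only a few units between K values translate into large differences in relative probability, which is why two K values must have very similar ln P(X|K) to be considered nearly equally likely.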
We selected K = 5 models for assigning isolates to specific clusters, based on (i) the parsimony principle (K = 5 being a less complex population structure than K = 6), (ii) the precision of the posterior probabilities (K = 5 models had consistently lower variances than K = 6 models), (iii) the lack of sensitivity of the model-derived posterior probabilities to the prior population assignments used to initialize the model (posterior probabilities of models initialized or not initialized with each isolate's country of origin increasingly diverged in values as K increased from 6), and (iv) the admixture determinations for individual isolates (as K increased from 6, an increasing proportion of the study isolates shared characteristics of two or more clusters). Cluster assignments from six independent, randomly selected K = 5 model runs were compared for concordance: using a criterion of a 0.5 or higher probability to assign isolates to their best-fit clusters, all cluster assignments from the six selected runs were perfectly concordant, with 140 to 142 isolates assigned to specific clusters, leaving only 3 to 5 isolates (depending on the run) with no cluster assignable at a 0.5 or higher probability (see Fig. S1 in the supplemental material). However, it is possible that the uncertainty of these ancestry assignments was underestimated or that the assignments were biased as a result of possible violation of the assumptions of linkage equilibrium within populations (6).
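The 0.5 probability criterion for assigning isolates to best-fit clusters can be expressed as follows. This is a minimal sketch under the assumption that each isolate's membership probabilities are available as a mapping from cluster label to probability; the function names and data layout are hypothetical.

```python
def assign_cluster(memberships, threshold=0.5):
    """Return the best-fit cluster label, or None if no cluster
    reaches the threshold probability.

    memberships maps cluster label -> estimated membership probability.
    """
    best = max(memberships, key=memberships.get)
    return best if memberships[best] >= threshold else None

def summarize(run):
    """Count assigned and unassigned isolates in one model run.

    run maps isolate identifier -> membership mapping (hypothetical layout).
    """
    labels = {iso: assign_cluster(m) for iso, m in run.items()}
    assigned = sum(1 for label in labels.values() if label is not None)
    return assigned, len(labels) - assigned
```

An isolate with memberships such as 0.40, 0.35, and 0.25 across three clusters would be left unassigned, which corresponds to the admixed isolates excluded from the cluster comparisons.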
The concordant assignments of 142 isolates to five genetic clusters (designated A to E) were then used as the basis for individual evaluation of the different genetic typing systems by comparing each genotyping test or system for agreement with the model-derived cluster assignments. These comparisons revealed associations between genetic markers typical of human infection (for example, SBI type 1 and LSPA type 211111 in cluster A and SBI type 3 and LSPA type 111111 in cluster B), whereas isolates in clusters C to E each contained one or more markers rarely found in clinical isolates (Fig. 1). All markers/marker systems were strongly nonindependently distributed among the model-derived clusters (χ2 = 84 to 338; 4 to 16 df, Cramer's V = 0.662 to 0.937; P < 0.001 for each system). Not surprisingly, some isolates were assigned to clusters C to E by the model based on the complete data set despite carrying one or more markers typical of clinical isolates. For example, LSPA type 211111 was frequent among isolates assigned to both clusters A (17 of 19) and D (9 of 15), suggesting that this LSPA genotype may be polyphyletic. Clusters A and B cumulatively contained 73 of the 142 classified isolates (51%). We previously reported that the proportions of isolates with SBI genotypes typical of clinical isolates in different countries were weakly correlated to the respective national incidences of E. coli O157:H7-associated hemolytic-uremic syndrome (18). The structure version 2.2-derived cluster assignments reported here also differed by isolate provenance (Fig. 2; χ2 = 30.0, 4 df, P < 0.001; Cramer's V = 0.262).
While the number of international source isolates examined here is clearly insufficient to support strong inferences, the data indicate the possibilities of (i) the unique occurrence of cluster C in North America, (ii) a relatively high frequency of cluster A and a low frequency of cluster E in Scotland, and (iii) a relatively low frequency of cluster A in Japan and Australia. As the genetic markers of cluster A have been associated with increased virulence (11), further research on the association of the distribution of E. coli O157:H7 genotypes and the national incidence and severity of E. coli O157:H7-associated disease may be merited.
Multiple-correspondence analysis (MCA) and hierarchical clustering were used in a second approach to explore the relationships between the isolates defined by the same set of genetic markers by using the methods of Murtagh (13). The application of MCA provided an opportunity to test whether the clusters identified by the MCMC models were supported by this very different analytical method. MCA identifies a lower-dimensional subspace that approximately represents the diversity within a multivariate data set. In initial MCA models using the full data set, uncommon LSPA and SBI types (specifically, those each comprising less than 5% of the isolate set) exhibited a strong tendency to cocluster, and therefore these unusual types were pooled to produce four LSPA categories. MCA of this reduced data set (SBI [1, 3, 5, 6, or other], LSPA [111111, 211111, 213111, or other], Q933 [positive or negative], Q21 [positive or negative], and Tir [255T or 255A]) identified four dimensions (factors) that cumulatively accounted for >80% of the variation within the data set and retained 69 to 93% of the quality of representation of each marker (13) (see Table S2 in the supplemental material). The coordinates of the projections of each marker onto these four dimensions were extracted from the model and hierarchically clustered by using minimum-variance methods, weighting each marker by its mass (marginal total) (13) (Fig. 3), resulting in five clusters very similar (and named accordingly) to those produced by the MCMC model illustrated in Fig. 1.
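The pooling of uncommon types before MCA can be illustrated with a short preprocessing step. The sketch below covers only the pooling rule (types comprising less than 5% of the isolate set are relabeled as "other"), not the MCA or the hierarchical clustering; the function name and the toy data are assumptions.

```python
from collections import Counter

def pool_rare(types, min_fraction=0.05, other="other"):
    """Relabel categories that make up less than min_fraction of the set."""
    counts = Counter(types)
    n = len(types)
    return [t if counts[t] / n >= min_fraction else other for t in types]

# Toy data only: the counts, and the codes 212111 and 222222,
# are made up for illustration and are not from the study.
lspa_types = (["111111"] * 50 + ["211111"] * 40 + ["213111"] * 8
              + ["212111"] + ["222222"])
pooled = pool_rare(lspa_types)
```

In this toy set the two singleton codes are merged into a single "other" category, while the three types at or above the 5% cutoff are kept as they are.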
In summary, these results clearly demonstrate that the several individual genetic tests or multiple test marker systems previously reported to occur at different frequencies among isolates from cattle and humans identify largely concordant genotypes of E. coli O157. The distribution of these markers among this international collection of isolates strongly indicated the existence of five (or more) genetic groups of E. coli O157, only two of which (clusters A and B) predominantly carry markers previously associated with clinical isolates. It is nearly certain that additional genetic groups or subgroups of E. coli O157 exist in nature, since delineation of these five groups is based on the sampling of only a tiny proportion of the genome: For example, in a recent study, 96 SNPs differentiated clinical E. coli O157 isolates into nine discrete clades (11). The Stx content and the relative frequencies of the two numerically predominant clades of clinical isolates identified in reference 11, clades 2 and 8, are consistent with those of clusters B and A, respectively, as described here.
The concordance of the multiple genetic markers, each with alleles differentially associated with human disease, supports the hypothesis of the existence of discrete genotypes of E. coli O157 that differ in their virulence for humans. This diversity is consistent with a source-sink ecological model characterized by broad genetic diversity in the reservoir (source) bovine populations that includes at least five genetic clusters, of which only two carry genetic markers typical of clinical isolates (17). In this ecological model, human infections represent a “sink” characterized by relatively short-duration infections unlikely to be persistently transmitted (R0 < 1.0). The source-sink model implies that various E. coli O157 genotypes diverged in the bovine reservoir through genetic drift and/or through bovine fitness-based selection, during which some genotypes evolved into accidental human pathogens. Based on this model, we predict that the genomic DNA sequences of E. coli O157 genotypes largely restricted to the bovine reservoir will reveal more genetic diversity than is apparent from the clinical isolate sequences now available, and SNP data supporting this prediction have already appeared (4). Investigation of the presence and expression of virulence factors by diverse bovine E. coli O157 genotypes may be required to reveal the mechanism(s) underlying their differential association with human disease.
This work was funded in part by NIAID NIH contract N01-AI-30055 and by the Agricultural Animal Health Program, Washington State University College of Veterinary Medicine, Pullman.
Published ahead of print on 30 October 2009.
†Supplemental material for this article may be found at http://aem.asm.org/.