The purpose of this work was to evaluate the evolutionary history of Campylobacter coli isolates derived from multiple host sources and to use microarray comparative genomic hybridization to assess whether there are particular genes comprising the dispensable portion of the genome that are more commonly associated with certain host species. Genotyping and ClonalFrame analyses of an expanded 16-gene multilocus sequence typing (MLST) data set involving 85 isolates from 4 different host species tentatively supported the development of C. coli host-preferred groups and suggested that recombination has played various roles in their diversification; however, geography could not be excluded as a contributing factor underlying the history of some of the groups. Population genetic analyses of the C. coli pubMLST database by use of STRUCTURE suggested that isolates from swine form a relatively homogeneous genetic group, that chicken and human isolates show considerable genetic overlap, that isolates from ducks and wild birds have similarity with environmental water samples, and that turkey isolates have a connection with human infection similar to that observed for chickens. Analysis of molecular variance (AMOVA) was performed on these same data and suggested that host species was a significant factor in explaining genetic variation and that macrogeography (North America, Europe, and the United Kingdom) was not. The microarray comparative genomic hybridization data suggested that there were combinations of genes more commonly associated with isolates derived from particular hosts and, combined with the results on evolutionary history, suggested that this is due to a combination of common ancestry in some cases and lateral gene transfer in others.
Campylobacter species are a leading bacterial cause of gastroenteritis within the United States and throughout much of the rest of the developed world. According to the CDC, there are an estimated 2 million to 4 million cases of Campylobacter illness each year in the United States (37). Campylobacter jejuni is generally recognized as the predominant cause of campylobacteriosis, responsible for approximately 90% of reported cases, while the majority of the remainder are caused by the closely related sister species Campylobacter coli (27). Not surprisingly, therefore, the majority of research on Campylobacter has centered on C. jejuni, and C. coli is a less studied organism.
A multilocus sequence typing (MLST) scheme of C. jejuni was first developed by Dingle et al. (13) on the basis of the genome sequence of C. jejuni NCTC 11168. There have also been a number of studies using the genome sequence data to develop microarrays for gene presence/absence determination across strains of C. jejuni and to identify the core genome components for the species (6, 15, 32, 33, 42, 43, 53, 57). Although C. coli is responsible for fewer food-borne illnesses than C. jejuni, the impact of C. coli is still substantial, and there is also evidence that C. coli may carry higher levels of resistance to some antibiotics (1). C. coli and C. jejuni also tend to differ in their relative prevalences in animal host species and various environmental sources (4, 48, 58), and there is some evidence that both taxa may include groups of host-specific putative ecotype strains (7, 36, 38, 39, 52, 56). At present, there is only a single draft genome sequence available for C. coli, and there are no microarray comparative genomic hybridization data for C. coli strains. Thus, there is no information on intraspecies variability in gene presence/absence in C. coli and how such variability might correlate with host species.
The purpose of this work was to develop and apply an expanded 16-locus MLST genotyping scheme to evaluate the evolutionary history of Campylobacter coli isolates derived from multiple host sources and to use microarray comparative genomic hybridization to assess whether there are particular genes comprising the dispensable portion of the genome that are more commonly associated with isolates derived from different host species.
A collection of 84 Campylobacter coli isolates from diverse geographic origins (the United States, Canada, the United Kingdom, Poland, Switzerland, Sweden, Israel, Belgium, Slovenia, and Bosnia and Herzegovina), including the sequenced strain RM2228 and 11 isolates from swine, 19 from bovines, 12 from chickens, 16 from turkeys (chickens and turkeys are hereinafter occasionally grouped in the host category “poultry”), and 26 from humans, was selected for MLST genotyping and comparative genomic analyses.
MLST was performed by sequencing the seven housekeeping genes (protein products are shown in parentheses) aspA (aspartase A), glnA (glutamine synthetase), gltA (citrate synthase), glyA (serine hydroxymethyltransferase), pgm (phosphoglucomutase), tkt (transketolase), and uncA (ATP synthase α subunit) according to the method of Dingle et al. (12, 13). To increase the genotyping resolution, nine additional housekeeping loci, scattered through the C. coli chromosome, were selected from the complete and draft sequences of C. jejuni strain NCTC 11168 and C. coli strain RM2228. The chromosomal locations of these housekeeping loci were chosen such that it was unlikely for any of these loci to be coinherited in the same recombination event. Details regarding these nine loci can be found in Table 1.
The evolutionary history of the C. coli isolates was evaluated using eBURST (http://eburst.mlst.net/) (21) and ClonalFrame version 1.1 (11). Sequence types (STs) were grouped into clonal complexes (CCs) by using eBURST version 3, and phylogenetic analysis was performed using ClonalFrame version 1.1, including all 16 loci. ClonalFrame has received wide use in the assessment of evolutionary relationships of strains of the same species of bacteria, including C. jejuni and C. coli (e.g., references 2, 10, 14, 23, and 52). Two of the many benefits of reconstructing the evolutionary history of bacterial clonal lines via ClonalFrame analysis are that bacterial recombination can be taken into consideration when the history is reconstructed and that the time, in coalescent units, to the most recent common ancestor of different groups can be estimated (11). To assess the influence of recombination, two 50% consensus trees were created for all 84 isolates with 16 loci, one estimating parameters of recombination and the other with recombination parameters fixed at zero. Five independent runs were performed for each model, with each run consisting of 100,000 burn-in iterations plus 200,000 sampling iterations. The first half of the chains was discarded, and the second half was sampled at intervals of 100 iterations. Convergence was estimated based on the Gelman-Rubin statistic (25).
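The convergence check mentioned above can be illustrated with a minimal sketch of the Gelman-Rubin statistic for a single scalar parameter sampled in several independent runs. The function name `gelman_rubin` and the toy chains are hypothetical and are not taken from ClonalFrame; the sketch only shows the standard comparison of within-run and between-run variance, under the assumption that all runs have equal length.

```python
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists of post-burn-in samples,
    one list per independent run (hypothetical input format).
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    # B/n: sample variance of the chain means.
    b_over_n = statistics.variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b_over_n
    return (var_hat / w) ** 0.5

# Toy chains (illustrative values only, not data from the study).
agreeing = [[0.0, 1.0, 2.0, 1.0], [1.0, 0.0, 2.0, 1.0]]
disagreeing = [[0.0, 1.0, 2.0, 1.0], [10.0, 11.0, 12.0, 11.0]]
r_agree = gelman_rubin(agreeing)
r_disagree = gelman_rubin(disagreeing)
```

Values close to 1 indicate that the runs are sampling the same distribution; values well above 1 indicate that the runs have not yet converged.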
To examine the effects of host/environment and geography on C. coli population structure, a large data set of 969 isolates, including the 84 isolates in this study and the 885 isolates available on pubMLST, from different host species and geographic regions (see Table S1 in the supplemental material) were assigned to bacterial populations by using the linkage model of the program STRUCTURE 2.2 (18, 19, 44). STRUCTURE has been used in similar analyses involving a range of species of bacteria, including C. jejuni (e.g., references 20, 36, and 54). The data set was assembled by treating each of the 1,683 polymorphic sites from the seven MLST genes as a single locus. Map distances, used by the linkage model, were assumed to be proportional to the number of base pairs between sites, except for sites in different gene fragments, which were treated as being unlinked. The number of bacterial populations, K, was determined by comparing the posterior probability from multiple runs, assuming that 2 was ≤K and that K was ≤14. Three individual runs (100,000 burn-in iterations and 200,000 sampling iterations) were performed for each value of K. An additional examination of these pubMLST data, focusing on assessing the importance of host species and geography in structuring the genetic variation, was conducted using the analysis-of-molecular-variance (AMOVA) approach in Arlequin (17, 49).
Combimatrix CustomArray 4X2K was used in this study (26). This array is divided into 4 sectors, each of which contains 2,240 in situ-synthesized oligonucleotide probes (spots) with the same probe design and layout. On the basis of the sequence of Campylobacter coli strain RM2228, oligonucleotide probes were designed to have a similar annealing temperature of 56°C and a length of 35 to 40 bp. Two separate designs were used in this study; both included 100 control probes (20 negative controls with sequences from plants and Escherichia coli phage, each with 5 replicate spots) as well as loci from the RM2228 genome. Because of the strict criteria for probe design, not all open reading frames (ORFs) could be covered in this analysis. The first design included 1,942 of the 1,967 protein-coding genes described to occur in the unfinished sequence of C. coli strain RM2228. The second-generation design was based on genes that were not clearly present (loci with low intensity or no hybridization for at least one strain) in the hybridization results involving the first design and included a total of 615 loci. Two to five additional probes, separated from one another in order to span the entire gene, for each of these 615 ambiguous loci were synthesized in situ to occupy the 2,240 independent microarray spots.
Genomic DNA was digested by sonication to sizes of 200 to 400 bp, as visualized by agarose gel electrophoresis, and then purified by using a Qiagen Qiaquick PCR purification kit. Purified fragments (1 to 2 μg) were labeled with biotin, using a Mirus Label IT μArray biotin labeling kit (Mirus Corp., Madison, WI) according to the manufacturer's instructions, followed by removal of unincorporated dyes by use of QIAquick (Qiagen) columns.
The standard hybridization conditions for the biotinylated target included preblocking for 30 min at 50°C with 6× SSPE (1× SSPE is 0.18 M NaCl, 10 mM NaH2PO4, and 1 mM EDTA [pH 7.7]) containing 0.05% Tween 20, 5× Denhardt's solution, and 100 ng/μl salmon sperm DNA, followed by hybridization of the biotinylated target in hybridization solution (6× SSPE, 20 to 50 ng/μl labeled DNA, and 0.05% SDS) overnight at 50°C. The arrays were then washed once for at least 15 min with SSPE wash 1 (6× SSPE, 0.05% Tween 20) and then for 1 min each with SSPE wash 2 (3× SSPE, 0.05% Tween 20), SSPE wash 3 (0.5× SSPE, 0.05% Tween 20), and PBST wash (2× phosphate-buffered saline [PBS], 0.1% Tween 20), followed by a final 2× PBS wash at room temperature. The hybridized array was then blocked with 5× casein-PBS buffer (BioFX Laboratories, Owings Mills, MD) for 15 min at room temperature and labeled for 30 min with Cy5-streptavidin (GE Healthcare, Amersham Biosciences, Piscataway, NJ) diluted 1:1,000 in 5× casein-PBS buffer. The arrays were scanned after they were washed twice each with the PBST and PBS solutions.
Hybridized microarrays were scanned with a GenePix 4000B laser scanner (Axon Instruments, Union City, CA) at a wavelength of 635 nm to obtain raw Cy5 fluorescence intensity. Replicate or triplicate arrays were hybridized for 65 strains tested in this study. Background intensity was estimated as the 75th percentile of all negative-control probes and subtracted on a log2 scale from the foreground Cy5 intensity of all spots on the array. Such normalization made the magnitudes of signal values of different arrays more comparable, with absent genes on each array centering around 0 (see Fig. S1 in the supplemental material). An expectation maximization (EM) algorithm was then applied on the adjusted probe signals of each array to estimate the means and standard deviations of present genes and absent genes, with the following parameters at initiation: the means ± standard deviations for the genes present and absent were 5.0 ± 1.0 and 0 ± 0.5, respectively, with the percentage of absent genes at 10%. The EM algorithm was run for 100 iterations or stopped when the magnitude of change in mean estimate was less than 0.001 between iterations. The EM algorithm fitted a well-established Gaussian mixture model to the normalized signals of each array independently to distinguish the population of absent genes from that of present genes (see Fig. S1 in the supplemental material). It did not rely on comparison to signals of positive- and negative-control strains, so it was robust to technical processing variation in the array experiment.
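The mixture-model step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it fits a two-component Gaussian mixture by EM using the starting values, iteration limit, and stopping rule quoted in the text, while the function names, the guard against empty components, and the simulated signals are assumptions introduced for the example.

```python
import math
import random

def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_present_absent(signals, max_iter=100, tol=1e-3):
    """Fit a two-component Gaussian mixture to background-adjusted log2 signals."""
    # Starting values quoted in the text.
    mu_p, sd_p = 5.0, 1.0   # present genes
    mu_a, sd_a = 0.0, 0.5   # absent genes
    frac_a = 0.10           # initial fraction of absent genes
    n = len(signals)
    for _ in range(max_iter):
        # E-step: posterior probability that each spot belongs to "absent".
        resp = []
        for x in signals:
            pa = frac_a * normal_pdf(x, mu_a, sd_a)
            pp = (1.0 - frac_a) * normal_pdf(x, mu_p, sd_p)
            resp.append(pa / (pa + pp) if pa + pp > 0.0 else 0.5)
        w_a = sum(resp)
        w_p = n - w_a
        if w_a < 1e-9 or w_p < 1e-9:
            break  # assumption: stop if one component becomes empty
        # M-step: weighted means, standard deviations, and mixing fraction.
        new_mu_a = sum(r * x for r, x in zip(resp, signals)) / w_a
        new_mu_p = sum((1.0 - r) * x for r, x in zip(resp, signals)) / w_p
        sd_a = max(math.sqrt(sum(r * (x - new_mu_a) ** 2
                                 for r, x in zip(resp, signals)) / w_a), 1e-6)
        sd_p = max(math.sqrt(sum((1.0 - r) * (x - new_mu_p) ** 2
                                 for r, x in zip(resp, signals)) / w_p), 1e-6)
        frac_a = w_a / n
        shift = max(abs(new_mu_a - mu_a), abs(new_mu_p - mu_p))
        mu_a, mu_p = new_mu_a, new_mu_p
        if shift < tol:  # stop when the mean estimates change by < 0.001
            break
    return {"mu_present": mu_p, "sd_present": sd_p,
            "mu_absent": mu_a, "sd_absent": sd_a, "frac_absent": frac_a}

# Simulated signals for illustration only (not data from the study).
rng = random.Random(0)
demo = ([rng.gauss(5.0, 1.0) for _ in range(900)]
        + [rng.gauss(0.0, 0.5) for _ in range(100)])
fit = fit_present_absent(demo)
```

Because the fit is made per array from that array's own signal distribution, no reference to control strains is needed, which matches the robustness argument made in the text.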
For each array, genes with spot signals below the lower 0.5th percentile of the estimated distribution of present genes were called absent, i.e., at P values of <0.005. For C. coli strains tested on multiple arrays, a gene for a given probe was called present if it had a present call on at least one array. For genes with multiple probes, a present call ratio, based on the total number of present calls out of the total number of probes for each gene, was used as a measurement of the divergence of the test strain from the reference strain. The ratio was in the range of 0 to 1, which reflects the gene divergence level (absent to present, respectively) in the test strain compared to the reference strain.
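The calling rules in this paragraph can be sketched in a few lines. The sketch assumes that the present-gene distribution is the Gaussian estimated by the EM step, that a signal exactly at the cutoff counts as present, and that one cutoff is available per array; the function names and example values are hypothetical.

```python
from statistics import NormalDist

def absent_cutoff(mu_present, sd_present, tail=0.005):
    """Signal below which a spot is called absent (lower 0.5th percentile)."""
    return NormalDist(mu_present, sd_present).inv_cdf(tail)

def probe_present(signals_by_array, cutoffs_by_array):
    """A probe is present if it is called present on at least one array."""
    return any(s >= c for s, c in zip(signals_by_array, cutoffs_by_array))

def present_call_ratio(probe_calls):
    """Fraction of a gene's probes that were called present (0 to 1)."""
    return sum(probe_calls) / len(probe_calls)

# Illustrative values only: a strain hybridized on two arrays,
# and a gene covered by four probes.
cutoff = absent_cutoff(5.0, 1.0)
call = probe_present([1.0, 3.0], [cutoff, cutoff])
ratio = present_call_ratio([True, False, True, True])
```

A ratio near 1 then indicates a gene that is conserved along its length, and a ratio near 0 indicates a gene that is absent or highly divergent relative to RM2228.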
To assess the microarray performance, gene presence and absence predictions were compared to the genome sequence of two randomly selected C. coli strains, cco4 and cco74 (swine and human origins, respectively). Draft genome sequences were obtained for these two strains by using the Illumina Genome Analyzer II system. For each strain, one lane was used and yielded 5 million and 8 million 36-bp reads, respectively. Reads were first aligned to the RM2228 draft genome with the mapping-and-assembly-with-qualities (MAQ) method, using default parameters (34). A preliminary analysis of the MAQ consensus sequences revealed that many regions of both genomes were too divergent for the reads to be mapped and for a polymorphism to be accurately differentiated from the absence of the region in the sequenced genome. To resolve these undetermined regions, de novo assemblies were performed using Velvet (62). Several hash lengths and coverage cutoffs were used, and the best assembly was selected on the basis of a combination of the N50, contig number, and total contig size statistics. For each of these resulting assemblies, open reading frames were called by using Glimmer, using default settings (8). Then, using BLAST, each MAQ coding consensus sequence that had any unresolved positions (i.e., either absent or too divergent) in the original mapping was searched against the de novo assemblies. When a single hit was found, the corresponding open reading frame was then aligned to the consensus sequence, and whenever possible, the undetermined positions were resolved using the de novo assembled sequence. The resulting enhanced MAQ consensus sequences were used to predict gene presence/absence by counting the percentage of sites absent for each coding sequence. The distribution of the percentages of absent sites appeared clearly bimodal, with a peak at 0% and 100%, corresponding to present and absent genes, respectively (see Fig. S2 in the supplemental material). 
There were nevertheless a few genes showing intermediate levels of present sites, which appeared to be duplicated and divergent genes for which the de novo assemblies could not be used (more than one hit). In the following analyses, we used a 50% absent-site threshold to delimit present and absent genes.
On the basis of both microarray designs, a presence ratio was calculated, with the number of present calls divided by the number of probes in a particular ORF, and used to predict ORF presence and absence. The presence ratio threshold used to call a gene present was determined using the two control strains sequenced using the Illumina technology and was drawn at 0.5 (see Fig. S3 in the supplemental material). When compared with the Illumina sequencing presence and absence calls, the microarray showed a false-negative rate (FNR) ranging between 3% and 5% (for cco74, 3.3%, and for cco4, 4.8%), while the false-positive rate (FPR) remained between 0% and 3% (for cco74, 0%, and for cco4, 2.6%). Following the Illumina sequencing and assembly procedure, there were only 5 genes for strain cco4 that were absent but were called present by the array. These 5 loci were plasmid genes and included the following: CCOA0022, CCOA0027, CCOA0031, CCOA0151, and CCOA0152. The alignment of the Illumina reads against these genes showed that very small portions of the genes (<100 bp) were indeed conserved but that the vast majority of the locus was absent. Coincidentally, some of the microarray probes were designed in these regions and therefore suggested that the ORFs were present. Given the very recombinant nature of plasmid genes, it is difficult to state if these genes should be considered present or not. If one excludes the plasmid genes, FPRs equal to 0 for both control strains are obtained. Thus, the microarray double design used in this paper yields virtually no false positives, while maintaining a reasonable number of false negatives (<5%).
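The comparison between array calls and sequencing calls can be sketched as below. The text does not spell out the denominators, so this sketch assumes the usual definitions: the FNR is the fraction of genes present by sequencing that the array calls absent, and the FPR is the fraction of genes absent by sequencing that the array calls present. Treating a presence ratio of exactly 0.5 as present is also an assumption, and the gene names and values are hypothetical.

```python
def call_orfs(presence_ratios, threshold=0.5):
    """Convert per-ORF presence ratios into present/absent calls."""
    # Assumption: a ratio equal to the threshold counts as present.
    return {gene: ratio >= threshold for gene, ratio in presence_ratios.items()}

def error_rates(sequenced_present, array_present):
    """Return (FNR, FPR) of the array calls relative to sequencing calls."""
    fn = fp = n_pos = n_neg = 0
    for gene, truth in sequenced_present.items():
        call = array_present[gene]
        if truth:
            n_pos += 1
            fn += (not call)   # present by sequencing, absent on the array
        else:
            n_neg += 1
            fp += call         # absent by sequencing, present on the array
    fnr = fn / n_pos if n_pos else 0.0
    fpr = fp / n_neg if n_neg else 0.0
    return fnr, fpr

# Toy example (hypothetical genes and ratios, not data from the study).
ratios = {"g1": 1.0, "g2": 0.8, "g3": 0.5, "g4": 0.2, "g5": 0.0, "g6": 0.6}
truth = {"g1": True, "g2": True, "g3": True, "g4": True, "g5": False, "g6": False}
fnr, fpr = error_rates(truth, call_orfs(ratios))
```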
All sequence data arising from the nine additional housekeeping loci have been deposited in the NCBI GenBank database under accession numbers GQ325800 to GQ326546. The Illumina reads were deposited in the NCBI trace archive under accession numbers SRX016174 and SRX016251, and the two assembled sequences are available for download at http://stanhope.vet.cornell.edu/data.html. The microarray data discussed in this paper have been deposited in NCBI's Gene Expression Omnibus (16) and are accessible through GEO Series accession number GSE16787.
A total of 84 isolates were successfully typed by both MLST schemes. A comparison of the data derived from the 7-gene scheme to the Campylobacter MLST database (http://pubmlst.org/campylobacter/) indicated that our 84 isolates comprised 55 STs (Table 2), including 11 STs from swine (11/11; 100% unique), 8 from bovines (8/19; 42% unique), 7 from chickens (7/12; 58% unique), 13 from turkeys (13/16; 81% unique) and 23 from humans (23/26; 88% unique). A total of 15 STs, including 4 isolates from swine, 4 from bovines, 4 from humans, and 3 from turkeys, were new to the MLST database. With the default eBURST setting of 6 out of 7 shared alleles, the vast majority of our isolates could be assigned to the ST 828 complex (71/84) and included isolates from all the different host sources involved in this study. An examination of host species and ST, based on the 7-gene scheme, that incorporates our data along with the previously existing data from the MLST database, indicated that many C. coli STs are present in multiple host species. For example, one of the more commonly represented genotypes, ST 825, has been detected among isolates from human, chicken, turkey, swine, and environmental samples. There were, however, several exceptions to this, including, for example, ST 1104 (n = 6), currently listed only in the pubMLST database from bovines, as well as several other STs nearly exclusively represented in particular host species (e.g., ST 1017, represented in 8/10 isolates from poultry [chicken and turkey], and ST 1096, represented in 5/6 isolates from swine). Furthermore, a Fisher exact test rejected the null hypothesis of independence of ST and host species (P values of <0.001, with chicken and turkey included together as poultry), and therefore, these data support a tendency for particular C. coli genotypes (STs) to be more commonly associated with certain host species.
On the basis of the 16-locus MLST scheme, the 84 test isolates were resolved into 73 different sequence types (Table 2). eBURST analysis of the 16-locus MLST scheme, with the two most stringent settings for group definition (15 shared alleles out of 16 or 14 shared alleles out of 16), identified the same four host-specific groups, ranging in size from 3 to 8 isolates. A clonal complex definition of 13/16 alleles resulted in 11 host-specific groups, and a group definition of 12/16 alleles resulted in 12 host-specific groups (with the exception of CC1, which includes 9 bovine isolates and 1 human isolate) (Table 2). The groups ranged in size from 2 to 10 isolates, with the two largest groups coming from bovines. When the number of shared alleles was dropped to 11, several groups of mixed host composition were observed. Fisher exact tests rejected the null hypothesis of independence of genotype (in this case, clonal complex) and host species for all of these group definitions (P values of <0.001). Thus, with decreasing numbers of shared alleles, more groups, some consisting of more isolates, were identified, but nonetheless, each of these groups was host specific up to the level of 11/16 shared alleles, at which point mixed-host groups became evident. This result, combined with the tendency for STs based on the 7-gene scheme also to be host specific, argues for the possible existence of multiple host-preferred groups; however, n is relatively small for any of these groups, suggesting the need for a much more extensive sample before a more definitive view could be reached.
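The effect of relaxing the shared-allele threshold can be illustrated with a simplified grouping sketch. It is not the eBURST program itself; it assumes only the basic group definition, in which every member of a group shares at least a given number of alleles with at least one other member, and implements it as single-linkage clustering with a union-find structure. The allelic profiles and ST names are hypothetical.

```python
def shared_alleles(p, q):
    """Number of loci at which two allelic profiles carry the same allele."""
    return sum(1 for a, b in zip(p, q) if a == b)

def group_profiles(profiles, min_shared):
    """Single-linkage groups of STs under a shared-allele threshold."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_alleles(profiles[a], profiles[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical 16-locus profiles (allele numbers are made up).
toy = {
    "st_a": (1,) * 16,
    "st_b": (1,) * 15 + (2,),        # 15 alleles shared with st_a
    "st_c": (1,) * 13 + (3, 3, 3),   # 13 alleles shared with st_a
    "st_d": (9,) * 16,               # no alleles shared with the others
}
strict = group_profiles(toy, 15)
relaxed = group_profiles(toy, 13)
```

With the stricter threshold only the two closest profiles group together; lowering the threshold pulls in the more distant profile, which mirrors how larger groups, and eventually mixed-host groups, appear as the number of required shared alleles is reduced.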
Yet another way to evaluate the evolutionary history and possible host group composition of genetic groups is through ClonalFrame analysis of our 16-gene MLST data. The analysis suggests that the relative impact of recombination versus that of point mutation, expressed as a ratio (r/m), was approximately 1.56 (mean of results from 5 independent runs, summarized in Table 3), and the relative frequency of recombination in comparison to point mutation (ρ/θ) was about 0.25. This estimate of frequency of recombination suggests that recombination is relatively rare compared to that in some species, such as Streptococcus uberis (ρ/θ, 9.05), Streptococcus pneumoniae (ρ/θ, 2.1), Clostridium perfringens (ρ/θ, 3.2), and Neisseria meningitidis (ρ/θ, 1.1), but roughly similar to that observed for other groups, like lineage I of Listeria monocytogenes (ρ/θ, 0.13). With some minor exceptions, a 50% majority rule ClonalFrame consensus tree recovered most of the same groups apparent in the eBURST 16-gene analysis (Fig. 1). Phylogenetic analyses using more-traditional approaches, such as neighbor joining (NJ; data not shown), of the 16-gene concatenated alignment also recovered the same ClonalFrame groups highlighted in Fig. 1 (bovine A and B and poultry A and B), with one minor exception: bovine isolate cco067 did not group with other bovine isolates in the NJ tree. The levels of neighbor-joining bootstrap support for the groups indicated in Fig. 1 were as follows: for bovine A, 100%; for bovine B, 90%; for poultry A, 61%; and for poultry B, 80% (relationships between these groups were unresolved). A comparison of the ClonalFrame tree without correction for recombination (Fig. 1A) to that with correction for recombination (Fig. 1B) indicates that the time to the most recent common ancestor (TMRCA) for bovine group A that ignores recombination (Fig. 1A) is about 0.18 coalescent units and that the TMRCA which incorporates recombination (Fig. 1B) is about 0.05. On the other hand, bovine group B has values for TMRCA that are about 0.07 and 0.05 coalescent units for histories which ignore and incorporate recombination, respectively. This indicates that recombination has played a different role in the diversification of these two bovine groups; however, the similar values for TMRCA for the two groups in Fig. 1B suggest (assuming that the mutation rate follows a molecular clock in both lineages) that the two groups are of similar age. There are also two poultry clades, one of which (poultry B) includes within it a turkey-specific subclade. The tree ignoring recombination suggests coalescent times for the poultry A and B clades of about 0.03 and, when recombination is incorporated, about 0.05. Thus, the data suggest that these two poultry groups are of ages similar to one another and to the bovine groups, while the effect of recombination in the diversification of these two clades is more even and overall less significant than that observed for the bovine case. However, recombination does appear to be an important factor in the diversification of the turkey-specific subclade of poultry B; the TMRCA observed for this turkey group when recombination is ignored is less than 0.01, and that observed when recombination is incorporated is about 0.025. This also suggests a relatively recent origin for this turkey group. Multi-isolate, exclusively human clades were not evident; however, a partial human clade was present in both trees, inclusive of three turkey isolates (cco106, cco117, and cco080), with coalescent times without and with correction for recombination at about 0.045 and 0.085 coalescent units, respectively. This suggests a possible older origin for this human/turkey group than for the two poultry and bovine clades. There were no multi-isolate swine clades, but several pairwise associations were scattered throughout both trees.
Our sample of isolates does, however, have certain geographic aspects to it that should be considered along with any inferences regarding host preference. For example, our bovine samples come predominately from the states of Washington and California, and the two bovine groups discussed above have a distinct state bias to their composition: bovine group A includes 9 bovine isolates and 1 swine isolate, of which 8 bovine isolates come from Washington State and 1 comes from Oregon. Bovine group B includes 8 bovine isolates, 6 from California and 2 from Washington State. For these bovine isolates, we have more specific information available, and if we look at this from the perspective of town or herds, this geographic tendency tends to break down, with isolates from different herds or towns frequently more closely related on the 16-gene ClonalFrame tree; thus, there tends to be a state clustering but not a town or herd clustering. Interestingly, Miller et al. (39) noted that C. coli samples from bovines tended to be more clonal than those derived from other hosts. They found ST 1068 (7 locus MLST scheme) to comprise 83% of the 63 isolates sampled from bovines in 26 feedlots and 11 different states. We have 7 examples of ST 1068, 6 of them from bovines and 1 from a human. Thus, although our geographic sample of isolates from bovines is limited, our results are consistent with a pattern of particular clones being predominately associated with this host organism on a much broader geographic scale. An examination of the relative importance of geography versus host species in the poultry and human host groups tends to provide mixed support for the significance of geography. For example, the human/turkey ClonalFrame group includes 4 human isolates and 3 turkey isolates; all of the turkey isolates are from the United States, and all the human isolates are from Poland. However, our data set includes two further human isolates from Poland, and these do not group together. 
Another point arguing against geographic influence involving the human isolates is that we have 7 human isolates in the data set from Switzerland and they appear in various places scattered throughout the ClonalFrame tree. Similarly, we have 5 chicken isolates from Switzerland, and they do not group together or with the human isolates from Switzerland. Overall, the genotyping and ClonalFrame analyses provide tentative support for the development of C. coli host-preferred groups and suggest that recombination has played various roles in their diversification; however, a more diversified geographic sampling involving proximal and distal samples from the same and different hosts would be necessary for definitive exclusion of geography as a contributing factor behind host group formation.
Further inferences regarding population history can be derived from the ClonalFrame analyses by implementing the external/internal branch length ratio test, which computes the sum of the lengths of the external branches divided by the sum of the lengths of the internal branches, compares it to the expected distribution under the coalescent model, and calculates the statistical significance of the deviation between the observed and expected ratios. In our case, the external-to-internal branch length ratio is significantly smaller than expected (Fig. 2), which indicates that our C. coli evolutionary history is consistent with an expansion of population size or the acquisition of a fitness advantage early in the history of the group (11). This is what one might expect with repeated evolution of multiple host-preferred groups and the presumed population expansion and/or fitness advantage accompanying the development of new host resources. It has been suggested elsewhere (51) that C. coli and C. jejuni may be converging as a consequence of recent changes in gene flow, involving an acceleration of import of C. jejuni alleles by C. coli, and that this could be associated with the development of agricultural practices that brought the two taxa together (however, for an alternative opinion, see references 5 and 60). Part of this overall picture could be the repeated evolution of C. coli host-preferred groups, which may be coincidental with the widespread development of the bovine and poultry industries, reflecting a highly adaptable bacterial species.
The limitations of our data regarding lack of geographic diversity in the samples, and the relatively limited number of isolates, can to an extent be addressed in analysis of the pubMLST database; however, in our opinion the drawback to this database is the resolution provided by the 7-gene data set. For example, ClonalFrame analysis of just our 84 isolates by use of only the 7-gene MLST data set (data not shown), which incorporates recombination in the evolutionary reconstructions, fails to recover any of the host groups discussed above, with the exception of one bovine group, although the composition of this group is not at all similar to that observed in the 16-gene analysis. In fact, the vast majority of isolates in this 7-gene analysis are entirely unresolved. Nonetheless, it is possible to take advantage of the large number of isolates represented in the pubMLST database, with their relatively broad geographic distribution, and analyze the 7-gene MLST data collectively in ways other than ClonalFrame, employing population genetic approaches. Analysis using STRUCTURE assumes that the observed data are derived from K ancestral subpopulations. The optimal K was determined by doing multiple runs with different K values and choosing that with the highest likelihood score. Our analysis of C. coli by use of this approach suggests that a K value of 9 may be optimal; however, approximately similar results were obtained with K values as low as 5 (Fig. 3). The program, therefore, infers for each site of each sequence its posterior probability of deriving from one of the K ancestral subpopulations and computes the average proportion of genetic material derived from each ancestral subpopulation by each individual. The analysis is summarized and presented in Fig. 3 and illustrates a few important general trends: (i) isolates from swine tend to be a relatively homogeneous genetic group, largely distinct from isolates derived from other hosts; (ii) chicken and human isolates tend to show a great deal of genetic overlap, strongly suggesting that the majority of C. coli human infections arise from chickens; (iii) isolates from ducks and wild birds show distinct similarity to those from environmental water samples, suggesting transmission from wild birds to water or vice versa; (iv) isolates from turkeys have a connection with human infection similar to that observed for isolates from chickens, but with a greater proportion forming a unique genetic group; and (v) geography (at least at the macrogeographic level of North America, the United Kingdom, and Europe) tends to have little bearing on the determination of genetic groups, and host species is a more influential factor. Our surmise above regarding the potential source of human infections is in agreement with recent studies of C. jejuni and C. coli which find chicken meat as the most likely source of human infection for both species of Campylobacter (50, 59). This analysis is also in agreement with other studies suggesting that campylobacteriosis due to C. coli infection is less commonly associated with consumption of pork products (35, 52). Earlier studies have also implicated waterfowl as a possible source of water infection (e.g., reference 40), but generally, this evidence has centered primarily on C. jejuni; our analysis provides a probable similar connection for wild birds and water involving C. coli. The degree to which contaminated water is then a source of human infection remains uncertain, although there are occasional outbreaks which have been linked in this regard (24, 28). A recent study from northwest England suggests that C. coli samples from surface water comprise a population genetically distinct from that described for human cases of disease (52).
A population genetic study of C. jejuni involving modeling DNA sequence evolution of MLST genotyping data from over 1,000 clinical isolates in Lancashire, England, concludes that the vast majority (97%) of sporadic disease can be attributed to animals farmed for meat and poultry (59). Our analysis of C. coli supports only minimal direct connection between birds/water and humans, although the connection between birds and water seems quite clear. AMOVA of the 7-gene C. coli pubMLST database, using Arlequin, does not support geography (North America, Europe, and the United Kingdom) as a significant factor in explaining the observed genetic variation but does support host species as a significant factor in explaining genetic variation (Table 4). It should be acknowledged that although the pubMLST database includes a variety of hosts from diverse geographic regions, the sample set is not completely randomized with regard to host species and geography. Some hosts, for example, are better represented in certain geographic areas, and there is a lack of information on the ecological success of isolates from different host/geographic settings. Nonetheless, the results that we obtained with ClonalFrame, STRUCTURE, and AMOVA collectively suggest that the most parsimonious explanation is that isolates have evolved certain host preferences and then spread throughout different geographic areas rather than that certain clones diversify, largely independently, in separate geographic regions.
The hybridization profile for each of 65 strains analyzed in this study (Fig. 4) was obtained by hybridizing labeled genomic DNA to a C. coli microarray designed from sequence data for the genome strain RM2228. Divergent and absent genes are expected to show decreased hybridization signals with respect to those obtained with the reference, sequenced strain, while the core C. coli genome consists of genes present and highly conserved in nucleotide sequence across a wide diversity of strains. In the first design, 991 ORFs out of 1,942 tested ORFs were conserved across test strains. This study employed a double design strategy to eliminate potential ambiguous calls. The second design was based on 615 ambiguous ORFs, which showed no hybridization or intermediate levels of hybridization in at least one test strain in the first design. Of this set of 615 ORFs, 98 were misidentified as absent or highly divergent in the first design and were corrected as present in the second design. Of the tested 1,942 genes, 1,089 were common to all strains, representing 56% of the C. coli RM2228 genome; 853 ORFs were variable in at least one of the 65 test strains. However, based on our FNR estimate of 4% (see Materials and Methods, “Verification of microarray data”), we can predict that the microarray will overlook some core genes. For example, treating each of the 65 test strains as an independent Bernoulli trial (i.e., a binomial model), the probability that a core gene will be found absent in at least one strain is 0.93, while the probability that a core gene will be found in 60 or more strains is 0.95. If one considers that any gene present in at least 60 strains is a core gene, this is consistent with the convention of Lan and Reeves (29), which suggests that genes present in 95% or more of independent strains are considered the core genome, and we obtain a core genome size estimate of 1,473, which is similar to other estimates of core genome size for C. jejuni (15).
This result highlights the fact that microarrays will generally underestimate the core genome size, as false negatives accumulate quickly with the number of strains tested, and therefore, microarray-based values should be considered a minimum estimate of the core genome. This phenomenon is illustrated by a core genome size cumulative plot (see Fig. S4 in the supplemental material), where it is apparent that, despite the analysis of a relatively large number of strains, a plateau was not reached. A similar result, pertaining to core genome size estimation from microarray data, was reported recently for Streptococcus thermophilus (46). A list of the C. coli core genome loci appears in Table S2 in the supplemental material.
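The two probabilities quoted above for a 4% FNR and 65 test strains can be checked with a short calculation. The following is a minimal sketch, assuming false negatives occur independently across strains with the same probability (a binomial model); the function names are illustrative and are not taken from the study.

```python
from math import comb


def prob_absent_in_at_least_one(n_strains: int, fnr: float) -> float:
    """Probability that a true core gene is called absent in >= 1 strain."""
    # Complement of "detected in every strain".
    return 1.0 - (1.0 - fnr) ** n_strains


def prob_detected_in_at_least(n_strains: int, min_detected: int, fnr: float) -> float:
    """Probability that a true core gene is detected in >= min_detected strains."""
    # Equivalent to at most (n_strains - min_detected) false negatives.
    max_misses = n_strains - min_detected
    return sum(
        comb(n_strains, k) * fnr**k * (1.0 - fnr) ** (n_strains - k)
        for k in range(max_misses + 1)
    )


# Values stated in the text: 65 strains, FNR of 4%, threshold of 60 strains.
p_missed_somewhere = prob_absent_in_at_least_one(65, 0.04)   # about 0.93
p_found_in_60_plus = prob_detected_in_at_least(65, 60, 0.04)  # about 0.95
print(f"{p_missed_somewhere:.2f} {p_found_in_60_plus:.2f}")
```

Under these assumptions, requiring detection in all 65 strains would exclude most true core genes, whereas a threshold of 60 strains retains roughly 95% of them.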
Dendrogram clustering of the gene presence/absence data indicated that isolates of the same host tended to cluster together (Fig. 4). For example, the heat map dendrogram suggests a poultry group, including 67% (12/18) of the poultry isolates in our analysis, plus a single human isolate (cco083). This group bears only partial similarity in terms of isolate composition to the poultry clades on the ClonalFrame tree, instead including isolates scattered throughout the ClonalFrame tree, suggesting that the similarity in genome composition evident in the heat map clustering is largely independent of common ancestry and instead reflects a tendency for certain genes to be more common in isolates derived from poultry hosts, presumably arising through lateral gene transfer (LGT). A similar line of logic applies to the human isolates, which form two adjacent heat map dendrogram clusters. One of these groups bears some partial resemblance in isolate composition to an association of human isolates in the ClonalFrame analysis, while the other group is composed of human isolates scattered throughout the ClonalFrame tree. There are also two bovine heat map clusters; however, in this case they are not immediately adjacent to one another, and both of these are identical, or nearly so, in composition to the ClonalFrame bovine clades. Finally, in complete contrast to the ClonalFrame analysis, where nearly all the swine isolates were scattered throughout the tree, all but two of the swine isolates cluster together, with the inclusion of a single chicken isolate. Thus, we have a combination of common ancestry in some cases and lateral gene transfer in others, which underlie a tendency for sets of genes to be common to isolates derived from particular hosts. It should be noted, however, that factors other than LGT, such as lineage-specific gene deletion, could result in some of the same clustering as that observed in Fig. 4.
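The idea that isolates group by shared gene content can be illustrated with a toy comparison of presence/absence profiles. This is only a sketch: the isolate names and profiles below are hypothetical, and the Hamming distance with a nearest-neighbor lookup is a stand-in for illustration, not the clustering method used to build the heat map dendrogram.

```python
def hamming(a, b):
    """Number of genes whose presence/absence call differs between two isolates."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of genes")
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor(profiles):
    """Map each isolate to the other isolate with the most similar gene content."""
    result = {}
    for name, prof in profiles.items():
        candidates = [
            (hamming(prof, other_prof), other)
            for other, other_prof in profiles.items()
            if other != name
        ]
        result[name] = min(candidates)[1]  # smallest distance wins
    return result


# Hypothetical profiles: 1 = gene present, 0 = absent or highly divergent.
profiles = {
    "poultry_1": [1, 1, 0, 0, 1, 0],
    "poultry_2": [1, 1, 0, 0, 1, 1],
    "swine_1": [0, 0, 1, 1, 0, 0],
    "swine_2": [0, 1, 1, 1, 0, 0],
}
neighbors = nearest_neighbor(profiles)
print(neighbors)
```

In this toy data set each isolate's closest match comes from the same host, which is the pattern a dendrogram built from real presence/absence calls would show as a host-associated cluster. Such grouping reflects gene content only and says nothing by itself about whether the similarity arose from common ancestry or LGT.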
Groups of genes common to particular host groups were occasionally evident in a more detailed inspection of the microarray data. For example, in the case of the bovine isolates, there was a group of genes common to bovine group A that was largely absent from the rest of the isolates, which included the following 8 loci: haloacid dehalogenase hydrolase (CCO1528), cyclase (CCO1530), dihydroxyhept-2-ene-1,7-dioic acid aldolase (CCO1533), alpha-2,3-sialyltransferase (CCO1538), sulfate adenylyltransferase (CCO1541), 3′(2′),5′-bisphosphate nucleotidase (CCO1543), a glycosyl transferase (CCO1546), and capsular polysaccharide synthesis (CPS) C (CCO1547). Conversely, there was also a group of genes largely absent in bovine group A, generally present in the other isolates, including 9 loci: periplasmic proteins (CCO0110 and CCO0113), peptidyl-prolyl cis-trans isomerase (CCO1240), hypothetical proteins (CCO1241 and CCO1672), arginyl-tRNA synthetase (CCO1244), uroporphyrinogen decarboxylase (CCO1333), a MoaA/NifB/PqqE family protein (CCO1334), and a membrane protein (CCO1335). Of course, this assessment of presence and absence is all relative to the reference microarray strain RM2228, and because of the likelihood of considerable gene diversity in the dispensable portion of the C. coli genome (see, for example, references 31 and 55), this will not reflect a thorough picture of genes that are possibly present or absent in particular host groups. However, the facts that we do detect a few genes that appear characteristic of some groups and that we do get some clustering of host specificity on the heat map dendrogram strongly suggest that there are sets, or combinations of genes, more important to particular types of host adaptation.
A detailed look at the gene presence/absence data for several important pathogenic gene clusters across the different strains of C. coli reveals very different levels of gene conservation across the different gene regions (Fig. 5). The most divergent region was the CPS locus, with only 5 strains having gene composition similar to that observed for the sequenced strain. The majority of the remaining strains did not have just a few genes that were different from RM2228, but instead, most of these strains had an almost entirely different gene composition for this cluster. This suggests enormous gene diversity for the CPS locus in the species C. coli, similar to that reported in comparative genomic hybridization studies involving C. jejuni (41, 45) and for comparative sequence studies of other species of pathogenic bacteria (e.g., Streptococcus pneumoniae). The genes involved in O-linked glycosylation and lipid oligosaccharide synthesis (LOS) were the next most variable gene clusters. No clear pattern of presence and absence within these clusters was correlated with host type. In each of these gene groups, there were a few loci that were consistently represented across all isolates and blocks of genes that were much more variable. The block of genes that were most frequently absent in the LOS locus (CCO1211 to CCO1218) includes loci which share variable levels of low homology with C. jejuni strains and includes several sialyl transferase genes, which have been implicated in Guillain-Barré and Fisher syndromes (61). This block of genes was generally either entirely absent or, more rarely, entirely present across the C. coli strains. Similar to genomic comparisons of C. jejuni strains (15), the N-linked glycosylation gene cluster (pgl) was largely conserved in gene composition across C. coli isolates.
The results of this work suggest that C. coli may have evolved multiple host-preferred groups, although more extensive geographic sampling would be required for definitive evaluation of this issue. Comparative genomic hybridization data suggested that there were combinations of genes in the dispensable portion of the genome more commonly associated with isolates derived from particular hosts, with a history of common ancestry in some cases and lateral gene transfer in others. This suggests that a more thorough understanding of the pan-genome of C. coli could result in the identification of genes of key functional significance to host-specific adaptation.
This work was supported by NIH contract N01-AI-30054 (ZC003-05), awarded to M.J.S.
Published ahead of print on 22 January 2010.
†Supplemental material for this article may be found at http://aem.asm.org/.