This study is based on 1180 core gene families found in 27 genomes of Campylobacter jejuni and C. coli, identified using a BLASTP distance cutoff of 0.8, as defined in the Methods section. This cutoff is relatively permissive, allowing proteins that share only 20% amino acid similarity to appear in the same gene family. As a result, more than half of an average Campylobacter genome belongs to the core. However, other ways of computing gene families also use cutoffs in the same range, e.g. the 50-50 rule used by [17] corresponds roughly to a cutoff of 0.75 in our approach. Both [17] and [18] produced core size estimates for Campylobacter populations in a similar range. As seen in Figure , any choice of BLAST distance cutoff between 0.6 and 0.8 results in almost the same number of core gene families (less than 1% difference). With a smaller cutoff some of the gene families will have additional members from some genomes, but since we only include the ortholog from each genome in the downstream analysis, this will have no impact. The cutoff 0.8 maximizes the number of core gene families, which is our reason for choosing it. Too small a cutoff results in more gene families but fewer core gene families, since some of the families obtained at cutoff 0.8 will then lack members from at least one genome. Too large a cutoff produces fewer core gene families because it produces too few gene families in the first place, merging families that are distinct at smaller cutoffs. The cutoff 0.8 strikes the balance between these two effects for this data set.
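The two opposing effects of the cutoff can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a gene family is a single-linkage group of proteins whose pairwise distance is at most the cutoff; the family definition actually used is the one given in the Methods section. The function name count_core_families and the six-protein toy distance matrix are hypothetical.

```python
import numpy as np

def count_core_families(dist, genome_of, cutoff):
    """Return (number of families, number of core families) at a given cutoff.

    Assumption: families are single-linkage components of the graph that
    links two proteins whenever their distance is <= cutoff.
    """
    n = len(genome_of)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] <= cutoff:
                parent[find(i)] = find(j)

    # Collect the set of genomes represented in each family.
    genomes_in_family = {}
    for i in range(n):
        genomes_in_family.setdefault(find(i), set()).add(genome_of[i])

    n_genomes = len(set(genome_of))
    n_core = sum(len(g) == n_genomes for g in genomes_in_family.values())
    return len(genomes_in_family), n_core

# Toy data: gene A (proteins 0-2) is conserved in three genomes; gene B
# (proteins 3-5) has one diverged copy in genome 2; A and B are unrelated.
dist = np.full((6, 6), 0.9)
np.fill_diagonal(dist, 0.0)
for i, j, d in [(0, 1, 0.1), (0, 2, 0.1), (1, 2, 0.1),
                (3, 4, 0.1), (3, 5, 0.7), (4, 5, 0.7)]:
    dist[i, j] = dist[j, i] = d
genome_of = [0, 1, 2, 0, 1, 2]

for cutoff in (0.5, 0.8, 0.95):
    print(cutoff, count_core_families(dist, genome_of, cutoff))
# 0.5  -> (3, 1): gene B is split, so only gene A is core
# 0.8  -> (2, 2): both genes are core
# 0.95 -> (1, 1): everything is merged into one family
```

In this toy case the intermediate cutoff gives the largest number of core families, which is the same qualitative behaviour as described above for the real data.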
Principal components
The principal component analysis revealed that 60% of the variation in normalized evolutionary distances can be captured in three linear combinations (see Figure ). This figure also indicates a substantial incongruence in the evolutionary distances for the various core gene families. If all genes displayed the same evolutionary signal, we would have captured all variability in a single principal component, i.e. 100% explained variance after the first component in Figure . The fact that the explained variance grows fairly slowly means that the 1180 rows of the data matrix X contain many different patterns. We also built phylogenetic trees for each CGF separately and computed consensus-trees, which verified this (see Additional file 1: Figure S1). By considering only the three-dimensional principal component space, we focus our analysis on the major variability in the data. Performing the analysis in this subspace means the results are based only on the dominating evolutionary patterns, and all smaller differences are downweighted. Our use of PCA here has an effect similar to the use of bootstrapping on phylogenetic trees, in the sense that only the dominating patterns in the data come to the surface.
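The reasoning about explained variance can be checked on toy data with a short Python sketch. It assumes a matrix with one row per gene family and one column per pairwise genome distance, centers the columns, and reads the explained variance off the singular values. The function name and both random toy matrices are hypothetical and are not the data of this study.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Cumulative fraction of variance captured by successive principal components."""
    Xc = X - X.mean(axis=0)                    # center each column
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
    var = s ** 2                               # proportional to component variances
    return np.cumsum(var) / var.sum()

rng = np.random.default_rng(0)
pattern = rng.random(10)          # one shared pattern over 10 genome pairs
rate = rng.random(50) + 0.5       # gene-specific scaling for 50 families

# Case 1: every family follows the same pattern, up to scaling.
X_shared = np.outer(rate, pattern)
cev_shared = cumulative_explained_variance(X_shared)

# Case 2: family-specific deviations are added to the shared pattern.
X_mixed = X_shared + rng.random((50, 10))
cev_mixed = cumulative_explained_variance(X_mixed)

print(np.round(cev_shared[:3], 3))   # first component captures essentially everything
print(np.round(cev_mixed[:3], 3))    # explained variance grows more slowly
```

A curve that rises slowly, as in the second case, is what signals that the rows do not share a single pattern.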
It is clear from Figure that most CGFs are found in a dense region near the origin, where 5 of the 7 MLST genes are also found. Apart from these, genes are mainly scattered to the right (along PC 1) or upwards (along PC 2) in the upper panel, or downward (along PC 3) in the lower panels. The loadings of Figure indicate that the major variation in this data is related to the separation of C. coli from C. jejuni. Core genes with a small value in the first component coordinate (left side of Figure , upper panel) show a different separation of species than those with a large coordinate value (right side of Figure , upper panel). The remaining variation we have included (components 2 and 3) is highly influenced by three distinct genomes, C. jejuni 414, C. coli 6461 and C. coli 6067, from which the distances to all other genomes fluctuate severely.
Core gene clusters
The cluster analysis reveals some clusters of genes that are distinctly separated from the majority. The gap-statistic analysis shows a large increase in the gap statistic when going from K=1 to K=2, indicating that this is not a homogeneous set of genes. The first peak occurs at K=5, indicating that a partition of the CGFs into 5 clusters is optimal (see Figure ).
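For readers unfamiliar with the gap statistic, the selection of K can be sketched in Python as follows. This is a sketch under stated assumptions, not the procedure of this study: it assumes k-means clustering and reference data drawn uniformly from the bounding box of the observations, which is one common formulation. The helper names (kmeans_wss, gap_statistic, first_peak) and the three-blob toy data are hypothetical.

```python
import numpy as np

def kmeans_wss(X, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares over n_init k-means runs."""
    best = np.inf
    for _ in range(n_init):
        # k-means++ style seeding: new seeds are drawn far from existing ones.
        centers = [X[rng.integers(len(X))]]
        for _ in range(1, k):
            d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        centers = np.array(centers)
        for _ in range(n_iter):                       # Lloyd iterations
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best

def gap_statistic(X, k_max, n_ref=10, seed=0):
    """Gap(K) = mean log W_K over uniform reference sets minus log W_K of the data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, k_max + 1):
        log_w = np.log(kmeans_wss(X, k, rng))
        ref = [np.log(kmeans_wss(rng.uniform(lo, hi, size=X.shape), k, rng))
               for _ in range(n_ref)]
        gaps.append(np.mean(ref) - log_w)
    return np.array(gaps)

def first_peak(gaps):
    """Smallest K (1-based) at which the gap curve stops increasing."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1]:
            return i + 1
    return len(gaps)

# Toy data: three well-separated groups in a three-dimensional score space.
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
scores = np.vstack([c + rng.normal(size=(50, 3)) for c in blob_centers])

gaps = gap_statistic(scores, k_max=5)
print(np.round(gaps, 2), "first peak at K =", first_peak(gaps))
```

An increase in the gap from K=1 to K=2 indicates that the data are not one homogeneous group, and the first peak of the curve is taken as the number of clusters, mirroring the reading of the gap curve described above.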
These five clusters were further compared. The blue and the cyan clusters are just two parts of the same central set of CGFs. Merged into one big group, they contain 935 of the 1180 Campylobacter core genes. Six of the seven MLST markers are in this main group, and in Figure we can see that the consensus-tree for these genes separates all C. coli from the C. jejuni strains, but with C. jejuni 414 as a 'coli-like' strain of C. jejuni. The red cluster in Figure is mainly separated from the rest along the first principal component, which makes it the most distinct cluster outside the main group. The loading plot in Figure suggests that this principal component has to do with the separation of the two species, and the consensus-tree of the red cluster in Figure confirms this. Here C. coli strains are not separated from the majority of the C. jejuni strains. Hence, the 120 core genes in the red cluster tell a consistently different story than the blue cluster about how all these strains are related. Also, note that the MLST marker aspA as well as the marker PorA are in this red group. The green cluster in Figure is located at the same position along PC1 as the blue group, and is only separated along the second component. The green consensus-tree is also quite similar to the blue, but with the noticeable difference that for these 103 core genes C. coli 6067 is no longer found in the C. coli branch. This is in essence the effect of the second principal component, as was also indicated in Figure (distances to C. coli 6067 are different). Finally, the small brown cluster, which is only separated along the third component, has a consensus-tree that is a mixture of the red and the blue tree. The PC3-typical information, which is related to the strains C. jejuni 414 and C. coli 6461, is not strong enough to affect the consensus-tree in Figure .
Many tests for phylogenetic congruence are designed to compare neighboring sequences on the chromosome (sequence 'windows'), and breakpoints are identified that may correspond to recombination events. Our search for gene clusters does not use positional information, but as shown in Table , the clusters we find are still highly enriched for neighboring genes. The fact that all groups show a clumping index I larger than 1.0 indicates that core genes are themselves not a random selection of genes in the reference genome (C. jejuni 11168 was arbitrarily chosen, see Methods). The three groups we identify outside the main group (colored red, green and brown in the figures) all have a very large clumping index. Thus, the genes within these clusters are very often found next to each other on the chromosome.
We also found that, among the genes showing indications of being under selective pressure, 28 out of 30 are in the red or brown cluster (Table ). These two clusters deviate from the other CGFs by their location along the PC1 direction which, as can be seen from Figures and , represents the separation of species. A large score along PC1 means less separation between C. jejuni and C. coli, and this seems to coincide with selective pressure.
The population recombination rate γ is another descriptor of the CGFs. A CGF with a large γ value indicates a locus where HGT contributes to increased genetic variation. From Figure and Table we see that again the red cluster separates from the blue main group, having on average an almost twice as large recombination rate. The green cluster also tends to have slightly larger γ values, but this increase is only weakly significant (p=0.02).
In [11] indications of convergence between the two sympatric sister species C. jejuni and C. coli were found, based on analysis of a large number of MLST isolates. These results have later been countered in a re-analysis by [19], and a pangenome study by [18] also concluded that there is no evidence of convergence between these two species. Lefebure et al. found that a total of 80% of the core genes were free of any between-species recombination, and although we have made no attempt to trace the history of any recombination events, our results show that 89% of the core genes maintain a good separation of the two species (blue/cyan and green clusters). Also, our interpretation of the first, and most important, principal component as a species separation means our results support the conclusion in [18] with respect to convergence of the species.