Strains of the same bacterial species often show considerable genomic variation. To examine the extent of such variation in Rhizobium etli, the complete genome sequence of R. etli CIAT652 and the partial genomic sequences of six additional R. etli strains having different geographical origins were determined. The sequences were compared with each other and with the previously reported genome sequence of R. etli CFN42. DNA sequences common to all strains constituted the greater part of these genomes and were localized in both the chromosome and large plasmids. About 700 to 1,000 kb of DNA that did not match sequences of the complete genomes of strains CIAT652 and CFN42 was unique to each R. etli strain. These sequences were distributed throughout the chromosome as individual genes or chromosomal islands and in plasmids, and they encoded accessory functions, such as transport of sugars and amino acids, or secondary metabolism; they also included mobile elements and hypothetical genes. Sequences corresponding to symbiotic plasmids showed high levels of nucleotide identity (about 98 to 99%), whereas chromosomal sequences and the sequences with matches to other plasmids showed lower levels of identity (on average, about 90 to 95%). We concluded that R. etli has a pangenomic structure with a core genome composed of both chromosomal and plasmid sequences, including a highly conserved symbiotic plasmid, despite the overall genomic divergence.
It is becoming clear that bacterial genomes of strains of the same species vary widely both in size and in gene composition (39). An unexpected degree of genomic diversity has been found by comparing whole genomes (39). For instance, in Escherichia coli strains, differences of up to 1,400 kb account for some strain-specific pathogenic traits (5, 56). The extent of intraspecies genome diversity varies in different bacterial lineages. Some species have a wide range of variation; these species include E. coli (42), Streptococcus agalactiae (53), and Haloquadratum walsbyi (34). Other bacteria display only limited gene content diversity; an example is Ureaplasma urealyticum (1, 54). Tettelin and colleagues have suggested that bacterial species can be characterized by the presence of a pangenome consisting of a core genome containing genes present in all strains and a dispensable genome consisting of partially shared and strain-specific genes (53, 54). This concept is rooted in the earlier ideas of Reanney (43) and Campbell (7) concerning the structure of bacterial populations, and it indicates both that there is a pool of accessory genetic information in bacterial species and that strains of the same or even different species can obtain this information by horizontal transfer mechanisms (7, 43).
Genome size and diversity are related to bacterial lifestyle. Small genomes are typical of strict pathogens such as Rickettsia prowazekii (2) and endosymbionts such as Buchnera aphidicola (44a). In contrast, free-living bacteria, such as Pseudomonas syringae and Streptomyces coelicolor, have large genomes (4, 6). The bacteria with the largest genomes are common inhabitants of heterogeneous environments, such as soil, where energy sources are limited but diverse (32). An increase in genome size is attributable mainly to expansion of functions such as secondary metabolism, transport of metabolites, and gene regulation. All these features are common to the nitrogen-fixing symbiotic bacteria of legumes, which are collectively known as rhizobia, and their close relative the plant pathogen Agrobacterium. The genomes of such bacterial species have diverse architectures with circular chromosomes that are different sizes or linear chromosomes, like that in Agrobacterium species, and the organisms contain variable numbers of large plasmids (31, 49). Comparative genomic studies have highlighted the conservation of gene content and order among the chromosomes of some species of rhizobia (22, 23, 25, 40). Furthermore, Guerrero and colleagues (25) observed that most essential genes occur in syntenic arrangements and display a higher level of sequence identity than nonsyntenic genes. In contrast, plasmids, including symbiotic plasmids and symbiotic chromosomal islands (like those in Mesorhizobium loti and Bradyrhizobium japonicum) are poorly conserved in terms of both gene content and gene order (21). It is not clear what evolutionary advantage, if any, is provided by multipartite genomes, but some authors have speculated that such genomes may allow further accumulation of genes independent of the chromosome. Recently, Slater and coworkers (46) proposed a model for the origin of secondary chromosomes. 
Their idea is based on the notion of intragenomic gene transfers that might occur from primary chromosomes to ancestral plasmids of the repABC type. Observations of conservation of clusters of genes in secondary chromosomes or in large plasmids that retain synteny with respect to the main chromosome support this hypothesis (46).
We have been studying Rhizobium etli as a multipartite genome model species (23). This organism is a free-living soil bacterium that is able to form nodules and fix nitrogen in the roots of bean plants. The genome of R. etli is partitioned into several replicons, a circular chromosome, and several large plasmids. In the reference strain R. etli CFN42, the genome is composed of a circular chromosome consisting of about 4,381 kb and 6 large plasmids whose total size is 2,148 kb (23). A 371-kb plasmid, termed pSym or the symbiotic plasmid, contains most of the genes required for symbiosis (21). Previous studies have described the high level of genetic diversity among geographically different R. etli isolates (41). The strains are also variable with respect to the number and size of plasmids. Nevertheless, there has been no direct measurement of diversity at the genomic level, nor have comparative studies of shared and particular genomic features of R. etli strains been reported. Therefore, to assess the degrees of genomic difference and genomic similarity in R. etli, we obtained the complete genomic sequence of an additional R. etli strain and partial genomic sequences of six other R. etli strains isolated worldwide. Our results support the concept of a pangenomic structure at the multireplicon level and show that a highly conserved symbiotic plasmid is present in divergent R. etli isolates.
Seven strains of R. etli were chosen for this study (Table 1), and in our analysis we also used the complete genomic sequences of R. etli CFN42 (referred to below as RetCFN42) and R. leguminosarum biovar viciae 3841 (referred to below as Rlv3841) described previously (23, 58). R. etli strains were previously classified by using the following taxonomic criteria: the ability to nodulate and fix nitrogen in common bean plants, the presence of reiterations of the nifHDK operon encoding nitrogenase, and demonstration of species-specific growth characteristics. The strains were isolated from distinct locations worldwide and had different plasmid contents (Table 1).
Random genomic sequences were obtained from the seven strains by the shotgun method, and this was followed by capillary DNA sequencing using an ABI3730XL automatic DNA sequencing machine (Applied Biosystems, Foster City, CA). We determined the complete genomic sequence for one strain, R. etli CIAT652 (referred to below as RetCIAT652), and partial genomic sequences (coverage, about 1×) for the other six strains (Table 1). Assemblies were obtained using Phred-Phrap-Consed software (15, 16, 24). Gaps were filled in by use of appropriate PCR amplification. Partial assemblies were obtained for the low-coverage genomes using only high-quality reads. An ad hoc Perl script was used to trim at least the first 20 low-quality bases at the 5′ and 3′ ends of each read. Reads with Phred values lower than Q20 were discarded.
We used the formula of Fraser and Fleischmann (19) to estimate genome size (L): L = −nw/ln(c/n), where n is the number of clones, w is the average read length, and c is the number of contigs. To predict the fraction (percentage) (p) of DNA which was not sequenced and to estimate genome coverage, we employed the formula of Lander and Waterman, p = e^(−nw/L) (33). To evaluate the accuracy of our predictions, we assembled 13,000 random reads for the complete genomes of RetCFN42 and RetCIAT652 and then predicted the genome sizes of the other strains using the formula described above and the partial genomic sequences. We found good agreement between the predictions and the experimentally determined lengths of the complete genomes (Table 1).
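As a rough illustration of the two formulas above, the following Python sketch computes the genome size estimate and the unsequenced fraction. The function names are hypothetical and are not taken from the authors' scripts, and the example values (a 500-base mean read length and a 6,000-kb genome) are assumptions chosen only to show that the two formulas are consistent with each other.

```python
import math


def estimate_genome_size(n_reads, mean_read_len, n_contigs):
    """Genome size estimate L = -n*w / ln(c/n).

    n_reads       -- number of clones/reads (n)
    mean_read_len -- average read length in bases (w)
    n_contigs     -- number of contigs after assembly (c)
    """
    if not 0 < n_contigs < n_reads:
        raise ValueError("need 0 < n_contigs < n_reads")
    return -n_reads * mean_read_len / math.log(n_contigs / n_reads)


def unsequenced_fraction(n_reads, mean_read_len, genome_len):
    """Fraction of the genome not covered by any read, p = exp(-n*w/L)."""
    return math.exp(-n_reads * mean_read_len / genome_len)


if __name__ == "__main__":
    # Hypothetical example: 13,000 reads of 500 bases from a 6,000-kb genome.
    n, w, true_len = 13_000, 500, 6_000_000
    p = unsequenced_fraction(n, w, true_len)
    # Under the same model the expected number of contigs is n * p, so
    # feeding that value back should recover the genome length.
    expected_contigs = n * p
    print(f"unsequenced fraction: {p:.3f}")
    print(f"recovered genome size: {estimate_genome_size(n, w, expected_contigs):.0f}")
```

Because both expressions derive from the same random-sampling model, the estimate is only as good as the assumption that reads are drawn uniformly from the genome.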
We performed pairwise comparisons of the whole proteomes of RetCFN42, RetCIAT652, and Rlv3841 using BLASTp with an E value of <1e-7 without filtering for low complexity and enabling the algorithm of Smith and Waterman (47). Orthologs were defined by constructing a similarity matrix using OrthoMCL (35). Similarity matrices were used to run the MCL program (14) to cluster orthologs (and recent paralogs) into families using the following parameters: inflation (I) = 1.5, initial iteration (i) = 1.4, initial inflation (l) = 5, scheme (K) = 7, and centering (c) = 1.2. Proteins that were not grouped into any family were classified as single-member families.
Genes encoding members of single-member protein families were located in the replicons of RetCFN42 (seven replicons), RetCIAT652 (four replicons), and Rlv3841 (seven replicons). To identify replicons with common gene contents in the three genomes, we constructed a profile for each family using genome localization. The profiles were given the following numbers: 1 for the chromosomes of the three genomes; 2 to 7 for plasmids of RetCFN42 and Rlv3841, starting with the smallest plasmid; and 2 to 5 for plasmids of RetCIAT652, starting with the smallest plasmid. The absence of a gene in any replicon was encoded 0. As expected, linked genes in a replicon had the same profile. For instance, the profile 1-1-1 reflected genes present exclusively in chromosomes, and the profile 1-1-2 meant that the gene was located in the chromosome in two species but in plasmid 2 in the other species. To visualize the profiles, we used the program E-Burst V3 (17, 50).
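The profile encoding described above can be sketched in a few lines of Python. This is only an illustration: the genome order within a profile (RetCFN42, RetCIAT652, Rlv3841), the function names, and the example family identifiers are assumptions, since the text does not specify them.

```python
from collections import defaultdict

# Assumed order of genomes within a profile string (not stated in the text).
GENOMES = ("RetCFN42", "RetCIAT652", "Rlv3841")


def profile(locations):
    """Build a profile such as '1-1-2' for one protein family.

    locations maps a genome name to a replicon number: 1 for the chromosome,
    2 and up for plasmids ordered from smallest to largest. A genome missing
    from the mapping is encoded as 0 (gene absent).
    """
    return "-".join(str(locations.get(genome, 0)) for genome in GENOMES)


def group_by_profile(families):
    """Group family identifiers by their localization profile."""
    groups = defaultdict(list)
    for family_id, locations in families.items():
        groups[profile(locations)].append(family_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical families, for illustration only.
    families = {
        "famA": {"RetCFN42": 1, "RetCIAT652": 1, "Rlv3841": 1},
        "famB": {"RetCFN42": 1, "RetCIAT652": 1, "Rlv3841": 2},
        "famC": {"RetCFN42": 7, "RetCIAT652": 5},
    }
    for prof, members in sorted(group_by_profile(families).items()):
        print(prof, members)
```

Families that share a profile are candidates for linked genes on equivalent replicons, which is the grouping that the visualization step then displays.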
The RetCIAT652 genome was annotated manually by following the gene model constructed for the previously reported genomic sequence of RetCFN42 (23). Open reading frames were predicted by using GLIMMER 2.0 (10, 44), and annotations were obtained by analysis of BLASTx hits with the nonredundant databases of GenBank and Interpro. To compare partial genomic sequences with the nonredundant database of GenBank, BLASTx searches were performed, and the top hits were classified with respect to organisms with which they matched. Additional comparisons of the complete genomes of RetCFN42, RetCIAT652, and Rlv3841 and the collection of shotgun genomic sequences of strains of R. etli were performed using either BLASTn, BLASTx, or Mummer (12). To be considered homologous contigs, genes, or proteins, alignments had to be 30% (60% in the case of contigs) identical over 60% of the length of the largest contig, gene, or protein. Chromosomal islands were defined as contiguous groups of genes that were present in one strain but not in another. Predictions of chromosomal islands were obtained with the aid of the Alien-Hunter program (55). Pangenomic prediction was performed using the power law fit method as described by Tettelin and colleagues (54).
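The homology criterion stated above can be written as a small filter. The sketch below is one possible reading, not the authors' code: it assumes the thresholds are minimum values (at least 30% identity, or 60% for contigs, over at least 60% of the longer sequence), and the function and parameter names are hypothetical.

```python
def is_homologous(pct_identity, aligned_len, len_a, len_b, is_contig=False):
    """Apply the identity and coverage thresholds to one pairwise alignment.

    pct_identity -- percent identity of the alignment (0 to 100)
    aligned_len  -- length of the aligned region
    len_a, len_b -- full lengths of the two sequences being compared
    is_contig    -- True for contig comparisons (60% identity threshold)
    """
    min_identity = 60.0 if is_contig else 30.0
    # Coverage is measured against the longer of the two sequences.
    coverage = aligned_len / max(len_a, len_b)
    return pct_identity >= min_identity and coverage >= 0.60


if __name__ == "__main__":
    # Hypothetical alignments, for illustration only.
    print(is_homologous(35.0, 300, 400, 450))                  # protein-level hit
    print(is_homologous(35.0, 300, 400, 450, is_contig=True))  # same hit, contig rule
```

Measuring coverage against the longer sequence keeps a short fragment from being called homologous to a much larger gene or contig on the basis of a small shared region.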
R. etli complete genomes and the contigs of partial genomes were compared all versus all by BLASTn. Then we used the median percentage of identity for bidirectional best hits between pairs of genomes as a similarity measure, making no distinction between coding and noncoding regions (48). Further, this value was used to estimate the evolutionary distances based on the numbers of substitutions per site, to construct a distance matrix, and subsequently to construct an unrooted tree by the neighbor-joining method (57). For estimation of evolutionary distances we used the Poisson distance (d), determined as follows: d = −lnμ, where μ is (q − 0.05)/0.95 and q is the percentage of identity.
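The distance calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: q is converted from a percentage to a fraction before the correction is applied (the formula only makes sense on a 0-to-1 scale), the function names and input layout are hypothetical, and the neighbor-joining step itself is not shown.

```python
import math
from statistics import median


def poisson_distance(q):
    """d = -ln(mu), with mu = (q - 0.05) / 0.95 and q a fractional identity."""
    mu = (q - 0.05) / 0.95
    if mu <= 0.0:
        raise ValueError("identity too low for this correction")
    return -math.log(mu)


def distance_matrix(names, best_hit_identities):
    """Build a symmetric distance matrix from bidirectional best hits.

    best_hit_identities maps a pair (a, b) to the list of percent identities
    of the bidirectional best BLASTn hits between genomes a and b. The median
    of each list is used as the similarity measure for that pair.
    """
    dist = {a: {a: 0.0} for a in names}
    for (a, b), pct_values in best_hit_identities.items():
        q = median(pct_values) / 100.0  # percent to fraction (assumption)
        d = poisson_distance(q)
        dist[a][b] = d
        dist[b][a] = d
    return dist


if __name__ == "__main__":
    # Hypothetical identities, for illustration only.
    hits = {("A", "B"): [90.0, 95.0, 100.0], ("A", "C"): [88.0, 90.0]}
    matrix = distance_matrix(["A", "B", "C"], hits)
    print(round(matrix["A"]["B"], 4), round(matrix["C"]["A"], 4))
```

Using the median rather than the mean makes the similarity measure less sensitive to a few highly divergent or unusually conserved segments, which matters when some genomes are represented only by short low-coverage contigs.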
The complete sequences of the RetCIAT652 chromosome (accession number NC_010994) and plasmids pCIAT652a (NC_010998), pCIAT652b (NC_010996), and pCIAT652c (NC_010997) have been deposited in the GenBank database. Draft genomes of R. etli strains 8C-3 (NZ_ABRA00000000), Brasil5 (NZ_ABQZ00000000), CIAT894 (NZ_ABRD00000000), GR56 (NZ_AABRD00000000), IE4771 (NZ_ABRD00000000), and Kim5 (NZ_ABQY00000000) have also been deposited in the GenBank database.
R. etli strains have diverse genomic architectures highlighted by disparities in plasmid size and number (Table 1). To study the degree of intraspecies genomic similarity and divergence, we obtained the complete genomic sequence of RetCIAT652, a strain isolated in Costa Rica, and the partial sequences of six other R. etli strains isolated from various sites worldwide (23). RetCIAT652 had a circular chromosome consisting of about 4,513 kb and three plasmids (designated pCIAT652a, pCIAT652b, and pCIAT652c) consisting of about 414 kb, 429 kb, and 1,091 kb, respectively. The chromosome of RetCIAT652 is 131 kb larger than the chromosome of RetCFN42. In RetCIAT652, most of the genes required for symbiosis were carried on the pCIAT652b plasmid, which is 58 kb larger than the equivalent plasmid of R. etli CFN42, p42d. Annotation of the CIAT652 genome yielded 4,072 protein-encoding sequences (CDS) in functional classes and a substantial number (about 2,220) of hypothetical and orphan CDS. Compared with the CFN42 genome, the CIAT652 genome contained 473 more CDS with unknown functions and 215 fewer CDS for which functional annotations were available. Like the CFN42 genome, the CIAT652 genome contained a large number of CDS involved in transport and transcriptional regulation. There were 20 sigma subunits encoded in the CIAT652 genome, compared with the 23 sigma subunits encoded in the CFN42 genome.
To establish structural correspondence among the replicons of the two R. etli strains, the chromosomal and plasmid sequences were aligned using Mummer (11). The chromosomes of both strains showed a straight line of synteny interrupted by several gaps of different sizes but without inversions or any other large rearrangements (Fig. 1). The plasmids were structurally heterogeneous, but some of them seemed to be equivalent. For instance, pCIAT652a had several large segments in common with p42e, as did pCIAT652b with p42d (pSym), and pCIAT652c with p42f and p42b. There were no matches with plasmids p42a and p42c of RetCFN42, indicating that these plasmids were not present in RetCIAT652.
Previous genomic comparisons of RetCFN42 and its close relative Rlv3841 showed that there is extensive chromosomal synteny and, to a lesser degree, synteny between some pairs of plasmids (9). A similar result was obtained when RetCIAT652 and Rlv3841 were compared (data not shown). Despite the divergence of these species, plasmid pCIAT652a showed conservation with pRL11 and plasmid pCIAT652c showed conservation with pRL9 and pRL12. The symbiotic plasmids of R. etli (p42d and pCIAT652b) are not related to any replicon in Rlv3841, except for 20 common genes required for symbiosis (9). Recently, the complete genome sequences of two strains of R. leguminosarum biovar trifolii (RtrWSM1325 and RtrWSM2304) were deposited in the GenBank database; for RtrWSM1325 the accession number for the chromosome was NC_012850 and the accession numbers for the plasmids were NC_12848, NC_12858, NC_12853, NC_12852, and NC_12854, and for RtrWSM2304 the accession number for the chromosome was NC_011369 and the accession numbers for the plasmids were NC_11366, NC_011368, NC_011370, and NC_011371. Strain RtrWSM1325 has 5 plasmids, and strain RtrWSM2304 contains 4 plasmids. We looked for plasmid equivalence between these strains and RetCFN42. Nucmer comparisons showed that the plasmids of the RtrWSM1325 and RtrWSM2304 strains have large syntenic regions in common with plasmids p42b, p42c, p42e, and p42f but not with the pSym (p42d) plasmid. This observation suggests that there may be a common plasmid pool for R. leguminosarum and R. etli.
Synteny relationships among the replicons of the two R. etli strains and the single Rlv3841 strain indicate that the core genome of Rhizobium might not be confined to the chromosome but may extend to some plasmids. To test this possibility, we used a clustering method to group the whole predicted proteomes encoded by the three complete genomes into protein families and then examined the distributions of these families in replicons. We found that a set of 3,971 protein families was encoded in the three genomes; 3,753 of these protein families were protein products of single genes, whereas 218 families corresponded to families with two or more protein homologs encoded in the genomes (Fig. 2). The genes encoding the 3,753 single-protein families were localized in the replicons of the three genomes by constructing presence-absence profiles, as described in Materials and Methods. Genes encoding core protein families were found predominantly in chromosomes but were also present in plasmids common to the three genomes (Fig. 3). There were two main clusters that contained plasmid-encoded proteins. One cluster corresponded to p42f, pCIAT652c, and pRL12 genes encoding 242 proteins and represented 42% of the coding capacity of p42f, 23% of the coding capacity of pCIAT652c, and 30% of the coding capacity of pRL12. The second cluster consisted of 237 proteins encoded by genes in p42e (51% of the total coding capacity), pCIAT652a (59%), and pRL11 (37%). Furthermore, most genes encoding these proteins were arranged in syntenic segments common to the three plasmids (data not shown). Another cluster consisted of 91 proteins common to plasmid pCIAT652c, the largest plasmid in the CIAT652 genome, and plasmids p42b and pRL9. Since, as shown above, plasmid pCIAT652c is related to p42f and pRL12, this cluster shows that pCIAT652c might represent a chimeric structure that originated by interreplicon recombination.
In addition, several other minor profiles appeared in the analysis, indicating that intragenomic recombination also involves small DNA segments (Fig. 3, smallest circles).
Core genome stability is affected by other recombination processes, like gene duplications and gene loss. For construction of profiles we only used single-member proteins; thus, we were unable to observe the effect of gene duplications. However, gene loss was illustrated well by the presence of profiles that included proteins encoded by only two genomes in common clusters. Proteins encoded by only two genomes are subgroups of core replicons (data not shown). For example, proteins represented by the pCIAT652c-pCFN42f, pCIAT652c-pRL12, and pCFN42f-pRL12 profiles, as well as proteins encoded by two chromosomes, fall into this category (data not shown). A particular but important case in this category is the proteins encoded by the symbiotic plasmids that have a unique profile for the two R. etli strains and are grouped separate from the symbiotic plasmid pRL10 of Rlv3841 together with R. etli plasmid p42c (data not shown) (9). These data suggest that the genes that encode proteins involved in symbiosis are not part of the core genome of Rhizobium (Fig. 3). Quite a few of the symbiosis genes that have been identified are common among Rhizobium species (9, 22). In RetCIAT652, RetCFN42, and Rlv3841, only 11 genes were maintained in the symbiotic plasmids (cluster 18) (Fig. 3). These genes are nodA, nodB, nodC, fdxB, nodI, nodJ, fixX, fixA, fixB, fixC, and nifN.
Estimation of the size of the core genome of Rhizobium was performed by using the methods of Tettelin et al. (54) and the three complete Rhizobium genome sequences available (RetCFN42, RetCIAT652, and Rlv3841 sequences). This estimation yielded 3,220 core genes (with a 99% confidence interval) and predicted that about 99 new genes might be added to the pangenome by every new complete genome sequence (data not shown). Although the analysis was limited by the small number of complete Rhizobium genome sequences available, the estimate for the α parameter was 0.6. Since α represents the proportion of new genes discovered as more genomes are sampled, the fact that in this case α is <1 suggests that Rhizobium might have an open pangenome.
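To make the role of α concrete, the sketch below fits a power law of the form n(N) = κN^(−α) to counts of new genes found as genomes are added, using ordinary least squares on the log-transformed values. Both the functional form and the fitting procedure are assumptions made for illustration and may differ in detail from the method of reference 54; the input numbers are synthetic and are not the data of this study.

```python
import math


def fit_power_law(n_genomes, new_genes):
    """Fit new_genes = kappa * n_genomes ** (-alpha) by log-log regression.

    Returns (kappa, alpha). All inputs must be positive, and at least two
    distinct values of n_genomes are needed.
    """
    if len(n_genomes) != len(new_genes) or len(n_genomes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(n) for n in n_genomes]
    ys = [math.log(g) for g in new_genes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("n_genomes values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic data generated with kappa = 500 and alpha = 0.6.
    counts = [2, 3, 4, 5, 6]
    observed = [500 * n ** -0.6 for n in counts]
    kappa, alpha = fit_power_law(counts, observed)
    print(f"kappa = {kappa:.1f}, alpha = {alpha:.2f}")
```

With only three complete genomes, as in the analysis above, such a fit rests on very few points, which is the limitation the text already notes.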
The difference between R. etli and R. leguminosarum can be measured by estimating the numbers of species-specific proteins. Rlv3841 has the highest number of individual protein families (2,071), whereas the two R. etli strains were significantly different for this characteristic (Fig. 2). There were 698 protein families of RetCFN42 that were not present in RetCIAT652 and 994 protein families in RetCIAT652 that were not present in RetCFN42. Most members of these families are hypothetical conserved proteins with unknown functions or orphans (23% and 35% for RetCFN42 and RetCIAT652, respectively). In contrast to core genes, which have an average G+C content of 61%, genes unique to each genome had low G+C contents (on average, 57 to 58%).
To further examine genomic differences between R. etli strains, we analyzed samples of genome sequences from six R. etli strains having distinct geographical origins. The data were obtained by random shotgun sequencing of whole genomes with coverage of about 1× that of the predicted genome length (Table 1). This coverage allowed estimation of the genome lengths for these strains of R. etli. The genome size varied over a wide range; strain GR56, an isolate from Spain, had the smallest genome (about 5,000 kb), and RetCFN42 and RetCIAT652 had the largest genomes. Thus, the differences in genome sizes among the strains of R. etli examined here were on the order of hundreds of kilobases to 1,500 kb (Table 1).
To determine genomic relationships among R. etli strains, BLASTn analysis was used to compare the contigs of each partial genomic sequence with the whole-genome sequences of RetCFN42 and RetCIAT652 and the GenBank nonredundant database. As expected, most sequences of R. etli strains were present in the complete genomes (Fig. 4, red and orange bars). These sequences were equivalent to approximately 3,000 kb of common DNA and thus represented less than 50% of the total genome of RetCFN42 or RetCIAT652. In addition, there was about 1,000 kb in each strain for which no homologous sequences were detected in the complete RetCFN42 genome. A smaller, but still substantial, amount of extra DNA was found when similar comparisons were performed using the RetCIAT652 genome as the reference (Fig. 4). A proportion of this extra DNA showed matches to sequences of other organisms deposited in the GenBank database. However, most of the extra DNA did not match any other sequence in the database. A small proportion of the extra DNA was similar to sequences present in at least one other R. etli strain (Fig. 4, dark blue and light blue bars).
When the collection of shotgun genomic sequences was aligned with the sequence of the chromosome of either RetCFN42 or RetCIAT652, the distribution of sequences along the chromosomes was found to be essentially random. Almost every chromosomal region of RetCFN42 and RetCIAT652 contained sequences present in at least one strain. Nevertheless, some chromosomal regions in RetCFN42 and RetCIAT652 contained sequences with no matches with any sequence from either incompletely or completely sequenced R. etli strains or Rlv3841 (Fig. 5). A prediction of the Alien Hunter program suggested that many of these regions are chromosomal islands that were acquired by horizontal transfer. Such regions differ from the average with respect to nucleotide composition and codon usage. We analyzed 13 chromosomal islands that were present exclusively in RetCFN42 and 12 chromosomal islands that were unique to RetCIAT652. As Fig. 5 shows, these islands were for the most part not present in the other R. etli strains examined and Rlv3841. The chromosomal islands were variable in length (range, about 8 kb to 69 kb), and the largest chromosomal island was found in the RetCFN42 chromosome. The islands were dispersed throughout the chromosomes, and only three locations appeared to be preferentially used for island insertion. Islands 3, 4, and 6 of RetCFN42 had the same locations as islands in the RetCIAT652 chromosome (islands 3, 4, and 5), but the islands differed in gene composition. Island 3, between the tRNAHis and tRNAGln genes in the RetCFN42 chromosome, contains genes already described as the α-lps region, which is involved in synthesis, maturation, and transport of the O antigen (8, 13). Although mutations in some genes of this locus affect the symbiotic capabilities of R. etli CE3 (a streptomycin-resistant strain derived from RetCFN42), the presence of this island is not widespread in R. etli.
In contrast, an island at the same position was found in the chromosome of RetCIAT652, and it also seemed to contain genes involved in polysaccharide synthesis and transport; however, none of these genes was homologous to genes of the α-lps loci. Island 4 of both chromosomes was highly variable in terms of genetic content, whereas island 6 in RetCFN42 (island 5 in RetCIAT652) was located near the putative terminus of replication, a region of the bacterial chromosome thought to undergo frequent genetic rearrangement. The other islands harbored most genes unique to RetCFN42 or RetCIAT652. These genes are genes related to a variety of enzyme activities, genes with unknown functions, and mobile elements (insertion sequences and genes of phage and plasmid origin). Genomic comparisons of B. japonicum strains using macroarrays of the reference strain B. japonicum USDA110 showed the presence of 14 genomic islands, some of which were associated with the symbiotic performance of strain USDA110 (29). Thus, the presence of genomic islands in R. etli might also be related to some symbiotic capabilities, but this possibility was not addressed here.
We showed that the two complete R. etli genome sequences displayed a high degree of conservation at the chromosomal level and that large syntenic segments were present in plasmids. To evaluate the level of DNA divergence between homologous replicons of the two completely sequenced R. etli genomes, we compiled all local alignments made by BLASTn. The aligned regions of plasmids pCIAT652a and pCIAT652c, as well as those of the chromosomes, had levels of nucleotide identity of about 85 to 95% (Fig. 6). In marked contrast, sequences that the pSym plasmid of RetCFN42 (pCFN42d) and the pSym plasmid of RetCIAT652 (pCIAT652b) had in common showed the highest levels of nucleotide identity for these genomes (98 to 100%) (Fig. 6). Mummer alignments of the collection of partial genomic sequences with the complete genome sequence of RetCFN42 showed a similar pattern of nucleotide identity in the pSym plasmids, in contrast to the rest of the genomic sequences (Table 2). An exception to this pattern was strain IE4771, an isolate from Puebla, México, which had the most divergent pSym sequences compared with the pSym plasmid sequences of both RetCFN42 and RetCIAT652. Strain IE4771 also had a low proportion of pSym sequences (less than 10%) compared with other strains, which clearly contained at least 50% of the pSym sequence (Table 2). This indicates that the IE4771 strain lost pSym or that a different type of pSym plasmid is present. The latter suggestion seems more plausible because BLASTx comparisons of the partial sequence of strain IE4771 yielded matches with some pSym genes of RetCFN42 or RetCIAT652, including nifH, nifA, fixN, fixA, and fixB.
When the R. etli CIAT652 genome was used as a reference in a comparison of the partial genome sequences, other conservation patterns were observed. Strains 8C-3 and Brasil 5 exhibited a high level of nucleotide conservation compared to RetCIAT652, and there were no appreciable differences in nucleotide identities among the pSym sequences and the rest of their genomes. Two strains (IE4771 and Kim5) had the lowest levels of nucleotide identity in the genome as a whole, and only strains CIAT894 and GR56 had pSym sequences that were more conserved than the chromosome (Table 2).
To evaluate if there is a relationship between the overall genomic divergence in R. etli strains and the conservation of pSym, we constructed a distance matrix tree based on BLASTn comparisons performed in an all-versus-all manner, using both the two complete genomes and the partial genomic sequences of R. etli and including Rlv3841 as the outgroup (Fig. 7). This tree showed that RetCIAT652, 8C-3, and Brasil 5 are very closely related but the rest of the strains have diverged to different degrees. In particular, RetCFN42 appeared to have no close relatives. Based on all of these observations, we concluded that pSym sequences are highly conserved and dispersed in variable genomic backgrounds.
Previous theories of bacterial evolution emphasized that bacteria have evolved a “strategy to expand the effective genome size of the species without imposing on each individual the burden of reproducing the entire genome” (7). Campbell and other researchers hypothesized that a bacterial genome is composed of a “euchromosome” and “accessory elements” (7, 43). The terms have been changed in the modern genomic era and are now the “core genome” and “pangenome” (3, 37, 52). The core genome is defined as the set of genes shared by all members of a monophyletic group (1). In contrast, the pangenome is an expanded version of the repertoire of genes found in a species, and accessory genes are not present in all individual bacteria. Genome sequences of a number of strains of species such as S. agalactiae, E. coli, Haemophilus influenzae, and Streptococcus pneumoniae have provided support for this concept (27, 28, 42, 52).
We have previously sequenced the complete genome of RetCFN42, the reference strain. Here we approached an understanding of the pangenomic structure of R. etli by sequencing another complete genome and by low-coverage sequencing of six other genomes of strains of R. etli. We found a set of genes that might represent the core genome of Rhizobium by comparing sequences of the two complete R. etli genomes and the sequence of the close relative Rlv3841. Furthermore, we found that in R. etli (and Rlv3841) the core genome is not limited to the chromosome but also extends to some large plasmids. It is known that members of the rhizobia have multipartitioned genomes composed of the chromosome and a variable number of plasmids (31, 49). Genes of the core genome are commonly carried on the chromosome and are maintained in syntenic blocks in closely related species (25, 51). In contrast, neither symbiotic plasmids nor other plasmids are conserved, except for the presence of a few common genes (21, 23). We found here that plasmids p42f, pCIAT652c, and pRL12 are highly related in terms of gene content, as are plasmids p42e, pCIAT652a, and pRL11, and that these plasmids might be considered part of the core genome. The large plasmids and the linear chromosome of Agrobacterium, chromosome II of Brucella, and the pSymB plasmid of Sinorhizobium meliloti might also be viewed as secondary chromosomes (46). The sizes of the replicons, the G+C content, and the presence of certain classes of genes normally found in primary chromosomes make this possibility plausible. Recently, Slater and colleagues suggested that secondary chromosomes originated via intragenomic gene transfers from primary chromosomes to an ancestral repABC replicon (46).
Evidence for this hypothesis comes from the conservation of several syntenic blocks of genes, such as minCDE (cell division proteins), hutIGU (histidine utilization), and pcaGHID (protocatechuate catabolism), in secondary chromosomes and plasmids across members of the rhizobia (46). These three blocks of genes are located in the p42e-pCIAT652a-pRL11 cluster that is included in the set of genes encoding core proteins. Our comparison in the present work revealed that a substantial proportion of about 479 single-member protein families are encoded in the largest plasmids common to the genomes of R. etli and Rlv3841. Furthermore, most genes encoding these proteins were arranged in common syntenic segments (data not shown).
We found that a substantial proportion of the DNA in the newly partially sequenced strains of R. etli was present in neither the model strain RetCFN42 nor RetCIAT652, whose new complete genome sequence is reported here. Between the two strains for which complete genome sequences are available, 738 protein-encoding genes in RetCFN42 and 1,002 in RetCIAT652 were different. Similar amounts of extra DNA were detected when partial genomic sequences of various R. etli strains were compared to sequences of the complete genomes of RetCFN42 and RetCIAT652. The extra DNA represents the accessory component of the R. etli pangenome. This DNA has a low G+C content and contains numerous hypothetical genes and mobile elements, which are also common in the accessory components of other bacterial species. Strain-specific chromosomal islands, which were shown here to be present in the chromosomes of RetCFN42 and RetCIAT652, are some of the locations of such extra DNA. The plasmid pool of R. etli is variable and also contributes importantly to the extra DNA, but the extent of this contribution is not known.
A striking result of the present work was the high level of nucleotide identity in homologous segments of pSym of R. etli, in contrast to the more divergent sequences seen in the rest of the genome. In these segments there are 210 highly conserved CDSs (98 to 100% nucleotide identity), which represent 60% of the coding capacity of the pSym plasmid of RetCFN42, including known symbiosis genes (nif, nod, fix, fdx), as well as other genes not involved in symbiosis (vir, tra) and hypothetical genes. Only strain IE4771 displayed a low level of nucleotide identity and a low level of coverage compared with pSym sequences of RetCFN42 and RetCIAT652. According to 16S rRNA gene data and nodulation tests, strain IE4771 belongs to an R. etli group that is distinct because it has two copies, not three copies, of nifH (three copies are usual in the more common R. etli strains) (45). These data suggest that at least two symbiotic plasmids may exist in the R. etli population. One plasmid would be highly conserved, often found in R. etli isolates, and prototypically defined by three nifH reiterations. This plasmid is exemplified by the pSym plasmids of RetCFN42 and RetCIAT652. The other, more divergent pSym plasmid, which is present in isolates from Puebla, México, has not yet been characterized at the genomic level. A recent origin of pSym is the simplest explanation for the high level of conservation of pSym in very divergent R. etli isolates. Alternatively, nodulation performance might provide strong selection pressure, selecting against any variation in pSym. The latter hypothesis seems improbable, as pSym genes with roles in nodulation and genes having unknown functions both have identical nucleotide sequences. In previous work, we analyzed the patterns of single-nucleotide polymorphisms in DNA sequences of the pSym plasmids of several R. etli strains (some of which were included in this study) (18).
The data indicate that most of the nucleotide substitutions are spread over the population by recombination and that the contribution of mutations to polymorphism is relatively low. In agreement with this model, very few nucleotide variations were found in the pSym sequences compared here.
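The nucleotide identity levels discussed above (for example, the 98 to 100% range used to describe conserved pSym CDSs) can be illustrated with a small calculation. This is a minimal sketch under stated assumptions rather than the method used in this work: it assumes two sequences that are already aligned to the same length, ignores gapped columns, and uses hypothetical function names and toy sequences.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent nucleotide identity between two pre-aligned sequences.

    Assumption: columns containing a gap ('-') in either sequence are skipped,
    so identity is computed over aligned nucleotide pairs only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue
        compared += 1
        if base_a == base_b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def conserved_fraction(identities, threshold=98.0):
    """Fraction of CDSs whose identity meets or exceeds the threshold."""
    if not identities:
        return 0.0
    conserved = sum(1 for value in identities if value >= threshold)
    return conserved / len(identities)


# Toy sequences and identity values (hypothetical; not data from this study).
print(percent_identity("ATGCATGCAT", "ATGCATGCAA"))
print(percent_identity("ATG-CA", "ATGGCA"))
print(conserved_fraction([99.5, 98.0, 91.2, 100.0]))
```

Note that this counts CDSs rather than nucleotides, so it is not equivalent to a fraction of coding capacity, which would require weighting each CDS by its length.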
Several years ago Palacios and colleagues (38) asked how many genotypes would be capable of conferring the R. etli phenotype. Our data indicate that a unique pSym genotype (or perhaps very few pSym genotypes) might be responsible for the ability of R. etli to nodulate the common bean. Other comparative genomics studies using microarrays have shown that the symbiotic plasmid pSymA is the most variable replicon in strains of S. meliloti (20, 26). This result contrasts with the pSym conservation in the R. etli strains compared here. Since it was demonstrated that the ability to nodulate is encoded by plasmids (30), it has been common to call these plasmids “symbiotic.” New genome sequencing technologies and comparative genomics have revealed that a variety of mechanisms have led to symbiosis with legumes (36) and that pSym plasmids in rhizobial species are not comparable despite the fact that they have some common nod and nif genes. As genome sequencing technology becomes more accessible, it is now feasible to analyze many more R. etli genomes to understand diversification and evolution in R. etli pSym plasmids and in the genome of the species as a whole. Future work should also help determine more accurately the sizes of the core and accessory genomes, as well as the size of the plasmid pool of R. etli. Lastly, a clear picture of the evolutionary relationships among the different genome components of Rhizobium should emerge from studies performed with this kind of approach.
We thank José Espíritu and Víctor del Moral for help with computational resources and Miguel A. Cevallos for critical reading of the manuscript. We also thank the anonymous reviewers for their valuable suggestions.
This work was supported by grants from CONACyT (grant U4633) and PAPIIT-UNAM (grants IN215908 and IN223005).
V.G. was responsible for the experimental design and manuscript preparation; J.L.A. was responsible for the comparative genomic analysis; R.I.S. and P.B. were responsible for genome sequencing and annotation; I.L.H.G. was responsible for bioinformatic analysis; J.L.F. and R.D. were responsible for genome sequencing; M.F., R.P., and G.D. were responsible for discussion of the data and provision of supporting materials; and J.M. was responsible for genome sequencing and participated in discussions. All authors read and approved the final manuscript.
Published ahead of print on 4 January 2010.