Symbiotic bacteria known as rhizobia interact with the roots of legumes and induce the formation of nitrogen-fixing nodules. In rhizobia, essential genes for symbiosis are compartmentalized either in symbiotic plasmids or in chromosomal symbiotic islands. To understand the structure and evolution of the symbiotic genome compartments (SGCs), it is necessary to analyze their common genetic content and organization as well as to study their differences. To date, five SGCs belonging to distinct species of rhizobia have been entirely sequenced. We report the complete sequence of the symbiotic plasmid of Rhizobium etli CFN42, a microsymbiont of beans, and a comparison with other SGC sequences available.
The symbiotic plasmid is a circular molecule of 371,255 base-pairs containing 359 coding sequences. Nodulation and nitrogen-fixation genes common to other rhizobia are clustered in a region of 125 kilobases. Numerous sequences related to mobile elements are scattered throughout. In some cases the mobile elements flank blocks of functionally related sequences, thereby suggesting a role in transposition. The plasmid contains 12 reiterated DNA families that are likely to participate in genomic rearrangements. Comparisons between this plasmid and complete rhizobial genomes and symbiotic compartments already sequenced show a general lack of synteny and colinearity, with the exception of some transcriptional units. There are only 20 symbiotic genes that are shared by all SGCs.
Our data support the notion that the symbiotic compartments of rhizobia genomes are mosaic structures that have been frequently tailored by recombination, horizontal transfer and transposition.
Nitrogen-fixing symbiotic bacteria grouped within the Rhizobiaceae, Phyllobacteriaceae and Bradyrhizobiaceae families are widespread in nature . Ordinarily known as rhizobia, these organisms contain genomes of one or two chromosomes and several large plasmids ranging in size from about 100 kilobases (kb) to more than 2 megabases (Mb). A common feature of the genomes of rhizobia is that the genes involved in the symbiotic process are located in specific symbiotic genome compartments (SGCs), either as independent replicons known as symbiotic plasmids (pSym) or as symbiotic islands or regions within the chromosome. Complete genome sequences have been recently reported for Mesorhizobium loti MAFF303099 , Sinorhizobium meliloti 1021 [3-6], Bradyrhizobium japonicum USDA110  and the non-nitrogen-fixing close relative Agrobacterium tumefaciens C58 [8,9]. In addition, the sequence of the pSym of Rhizobium species NGR234 - pNGR234a  - as well as that of the chromosomal symbiotic regions of B. japonicum USDA110 and M. loti R7A have been reported [11,12]. Genomic comparisons reveal that the chromosomes of S. meliloti, M. loti, and the circular chromosome of A. tumefaciens have more than 50% of orthologous genes in common . A clear syntenic relationship is observed between the circular chromosomes of S. meliloti and A. tumefaciens [8,9] and albeit to a lesser extent, synteny is also apparent when both are compared to the chromosome of M. loti [6,8,9]. These results lead to the hypothesis that rhizobial chromosomes have a common ancestral origin [6,8,9]. Other genome constituents of rhizobia (that is, other chromosomes and plasmids) are thought to be the result of subsequent events of genomic rearrangements and horizontal transfer [6,8,9], but the precise mechanisms involved in their generation have not been elucidated so far.
Here we report the complete DNA sequence of the pSym (p42d) of Rhizobium etli CFN42 and its comparative analysis with other rhizobial SGCs. R. etli is the symbiont of the common bean Phaseolus vulgaris and has been widely used as a model for metabolic and genome dynamics studies [13,14]. Its genome is composed of one chromosome and six plasmids ranging in size from 184 kb to about 600 kb. The physical map of p42d was previously determined and was the basis for obtaining the entire sequence. In this study we show that the SGCs are heterogeneous in sequence, gene composition and gene order. There are only 20 symbiotic genes that are shared by all SGCs. There are also some conserved gene clusters of related function that are present in some SGCs, but absent in others. Besides genes unique to a particular SGC, several orthologous genes are located in different genome contexts in other rhizobia. Other features common to all SGCs, such as reiterated genes, pseudogenes, and a large number of insertion sequences (ISs), support the view that p42d, as well as other SGCs, is a mosaic structure that may have been assembled from different genome contexts, either chromosomal or plasmidic.
The symbiotic plasmid p42d is a circular molecule of 371,255 base-pairs (bp) (Figure 1) that belongs to the repABC type of replicator. We identified 359 coding sequences (CDS), of which 63% have an assigned function, 17% have homologs in databases without an assigned function, and 20% are orphan (Figure 1, see also Additional data file 1). The CDS distribution between the two strands is asymmetrical, with 61% of them located in the minus strand. The plus and minus strands were defined according to the previously reported physical map. Moreover, the plus strand contains two reiterated nifHDK gene clusters in a clockwise orientation (NRa and NRb, Figure 1). The main functional classes of genes identified are: transport, nitrogen fixation, nodulation and transcriptional regulation. Ten pseudogenes related to known genes were identified that carry deletions and frameshifts at their amino or carboxyl termini. The plasmid also contains many reiterated sequences and a large number of elements related to insertion sequences (ERIS) accounting for 10% of the entire sequence. The major reiterations (28 elements) were grouped into 12 families on the basis of their sequence similarity (see below).
The average GC content of the plasmid is 58.1%. When genes were classified as low, average or high GC content (using the mean GC ± 1 standard deviation as thresholds), we observed a clear distinction between high or low GC in some gene clusters (Figure 2a). Several hypothetical genes and the nod genes show the lowest GC values (< 55%), whereas the highest GC values (> 62%) were displayed by the genes for cytochrome P450 (CPX), tra genes (TRA), and the genes for type III (TSSIII) and type IV (TSSIV) transport secretion systems. Similarly, when the genes were classified according to poor, typical or rich codon usage (CU) (see Materials and methods for details), genes with high GC also exhibited a rich CU (Figure 2b), whereas the GC-CU correlation was found to be lower for other genes. For example, the regions that contain the nitrogenase structural genes, other nif genes (NRa, b and c, see below), and the fixNOQPGHIS genes (FIX1) showed average GC content but rich CU. The variable correlation between GC content and CU levels reveals sequence heterogeneity within p42d and suggests a dynamic structure for this plasmid, presumably as a consequence of extensive genomic rearrangements, recombination rates, lateral transfer, and relaxation or intensification of selective pressures.
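The mean ± 1 standard deviation rule used for the GC classes can be illustrated with a minimal sketch. It assumes that the mean and standard deviation are taken over per-gene GC fractions and that a population standard deviation is used; the text does not state either detail, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def classify_gc(genes: dict[str, str]) -> dict[str, str]:
    """Label each gene 'low', 'average' or 'high' GC.

    Thresholds are mean - 1 SD and mean + 1 SD of the per-gene GC
    fractions (assumption: population SD over genes).
    """
    gc = {name: gc_fraction(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sd = pstdev(gc.values())
    labels = {}
    for name, value in gc.items():
        if value < mu - sd:
            labels[name] = "low"
        elif value > mu + sd:
            labels[name] = "high"
        else:
            labels[name] = "average"
    return labels
```

The same three-way split could be applied to a codon-usage score to reproduce the poor, typical and rich CU classes, provided a per-gene score is available.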
Most nod genes present in p42d are clustered in a region of 16 kb (NOD); however, nodA is separated from nodBC by 27 kb. The Nod factor backbone of R. etli CFN42 is an N-acetylglucosamine pentasaccharide synthesized by the common nodA and nodBC gene products. Modifications to this backbone consist of methyl and carbamoyl groups at the non-reducing end, while the reducing one is modified by the addition of a fucosyl group that is in turn acetylated . The methyltransferase, fucosyltransferase and acetyltransferase activities required for these modifications are encoded by nodS, nodZ and nolL, respectively. It is unclear, however, which gene product is responsible for the carbamoylation of the Nod factor, as nodU, the most likely gene to carry out this function, is a pseudogene. The two membrane proteins encoded by nodI and nodJ (located downstream of nodBCSU), participate in the transport of the Nod factor to the outside of the cell . Other genes present in p42d whose homologs in other rhizobia have a role in nodulation are nolO, nolE, nolT and nolV, the last two being part of the TSSIII system (see below). In addition to nodU, two other pseudogenes, noeI and nodQ, were identified.
The expression of nod genes depends on the activity of NodD proteins , which interact with specific sites known as nod boxes located upstream of the nod operons . The sequence of the p42d revealed three nodD genes; nodD1 is present in the NOD region while nodD2 and nodD3 are 50 kb apart. We also predict 15 potential nod boxes (see Materials and methods and Additional data file 2), seven of which are associated with almost all nod genes: nodA, nodZ, nodBCSU, nolE, nodD1, nodD2, and nodD3. The rest of the nod boxes are located proximal to genes so far unrelated to the nodulation process; namely, the genes bglS (β-glucosidase), yp108 (putative monooxygenase), and the orphans yh005, yh007 and yh050. There is also a putative nod box upstream of the gene encoding NifA, the major transcriptional regulator of the nitrogen-fixation genes. Even though the regulation of nifA is variable among rhizobia, dependence on flavonoid induction is unknown.
The nif and fix genes are distributed in five regions spanning a total of 125 kb (Figure 1); the NOD cluster mentioned previously maps within this section as well. There are three copies of the nitrogenase reductase gene nifH, defining the three nif regions (NR) a, b and c (Figure 1). NRa contains nifHDK genes and a truncated nifE pseudogene; NRb contains the nifHDKENX genes; NRc contains nifH and a truncated nifD pseudogene. The largest reiterated regions found in p42d correspond to NRa and NRb regions that share 4,470 identical nucleotides. The NRc region of 1,131 nucleotides is identical to sequences within NRa and NRb. The orientation of NRc is inverted in relation to the direction of NRa and NRb. Recent duplications of these NR regions might underlie the unusually high sequence identity between them. Alternatively, a mechanism of 'copy-correction' resembling gene conversion may be involved in maintaining nucleotide identity.
The highest density of nif and fix genes in p42d occurs 10 kb upstream of NRb, in the FIX2 region. This contains the fixABCX genes that encode a flavoprotein  (see below); nifB, which is needed for the synthesis of the iron-molybdenum cofactor ; nifW and nifZ, whose products may be required for protection of the nitrogenase from oxygen [26,27]; and the genes for the regulatory proteins NifA and RpoN2. The genes nifU, nifS and hesB (also named iscN) also map in the FIX2 region. The products of these genes have been implicated in the formation of the Fe-S cluster required for nitrogenase complex function . In R. etli CNPAF512, the inactivation of hesB (iscN) results in a Fix- phenotype . Other genes commonly found in nif regions of rhizobia were also identified in the FIX2 region. These are the ferredoxin gene fdxN, which is essential for nitrogen fixation in S. meliloti , and the gene for the anaerobic transcriptional regulator FnrNd (see below). The products of nifV and nifQ have been involved in the synthesis of the iron-molybdenum cofactor [30,31]; nevertheless, nifV is absent in the p42d and nifQ is located upstream of nifHc in the NRc region.
The RpoN (σN, also known as σ54) subunit of the RNA polymerase, encoded by rpoN, and the transcriptional activator NifA protein, encoded by nifA (both present in the FIX2 region, Figure 1), participate in the regulation of nif genes. RpoN binds to specific promoter regions and interacts with the NifA protein that binds to specific upstream activator sequences (UAS). In R. etli CNPAF512, two rpoN genes have been described, one located in the chromosome (rpoN1), and the other in the pSym (rpoN2). The rpoN gene found in the p42d is orthologous to rpoN2. Regulation by RpoN and NifA has been demonstrated for nifH a, b and c. We predicted, as described in Materials and methods, potential RpoN-binding sites and UAS for NifA in the upstream region of several genes (see Additional data file 3). Both types of sites were also identified upstream of other genes: the reiterated yp003, yp021 and yp099 genes that encode the recently described BacS protein, highly expressed in nodules; yp010 in the putative operon for terpenoid synthesis; the fixA, hesB, and cpxA5 genes; and yp104, which encodes a toxin-transport-related protein. The expression of yp003 (bacS) and hesB (iscN) has recently been shown to depend on NifA [28,35].
RpoN-like promoters were also predicted upstream of several genes for which no associated NifA-binding sites could be detected (see Additional data file 3). Among them are the nitrogen-fixation genes fixO, nifQ and nifB, and the genes for the putative decarboxylase, pcaC1, and alcohol dehydrogenase, xylB2. Furthermore, potential sites for RpoN were also found in several genes of unknown function. Recently, Dombrecht et al. predicted RpoN promoter sites in all complete rhizobial genomes and p42d; we report here a larger set of genes potentially regulated by RpoN in p42d. It includes genes for nitrogen fixation, electron transfer, transport, and several of unknown function. The genes reported by Dombrecht et al. are mainly nif and fix genes, the ferredoxins (fdxB and N; not predicted by us), and some genes of unknown function. The differences between their results and ours may be explained in part by the different strategies used to construct the weight matrices in the two studies: ours includes only 85 RpoN promoters whose transcription start sites have been experimentally determined, instead of the whole set of 186 promoters used by Dombrecht et al.; see Materials and methods for details.
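To make the notion of a weight matrix concrete, the following minimal sketch builds a log-odds matrix from aligned binding sites and scans a sequence with it. It is not the procedure used in either study: the pseudocount, the uniform background and all function names are assumptions chosen for illustration, and real inputs would be the aligned promoter sites themselves.

```python
import math

BASES = "ACGT"


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length, pre-aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(width):
        column = [s[i].upper() for s in sites]
        matrix.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def score(matrix, window):
    """Sum of per-position weights; unknown bases (e.g. N) contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, window.upper()))


def scan(matrix, sequence, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score(matrix, sequence[start:start + width])
        if s >= threshold:
            hits.append((start, s))
    return hits
```

Because the weights depend directly on which sites are used to build the matrix, two matrices built from different promoter sets can rank the same upstream region differently, which is consistent with the explanation given above for the differing predictions.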
The electron flux and supply of energy for the reduction of molecular nitrogen require the flavoprotein encoded by the fixABCX genes mentioned above (FIX2 region, Figure 1), a specific cytochrome oxidase encoded by fixNOQP genes, and a cation pump encoded by fixGHIS genes. The fixNOQP and fixGHIS genes cluster in the FIX1 region (Figure 1). A second copy of fixNOQP and fixG genes has been found in the plasmid p42f.
In the symbiotic state, the cytochrome terminal oxidases encoded by the fixNOQP operon provide the energy required to fix nitrogen . The cytochrome production is regulated in response to oxygen concentration and the products of fixLJ, fixK and fnrNd genes are also known to be involved in such regulation . In R. etli, the duplicated fixNOQP operons are differentially regulated and only the fixNOQPd is required for symbiosis . An inactive fixKd is present in p42d but no fixJ genes have been found in R. etli CFN42 . It has been shown that FixKf controls both fixNOQP operons; loss of FixLf (presumably a fusion protein of FixL and FixJ) suppresses fixNOQPf expression, but has only a moderate effect on that of fixNOQPd . Two fnrN genes have been described in R. etli CFN42; one is chromosomal (fnrNchr), and the other is on p42d (fnrNd). Both regulators participate in the activation of the operon fixNOQPd .
In Escherichia coli, Fnr is an oxygen-responsive global transcriptional regulator that binds to conserved boxes upstream of several genes (anaeroboxes) . By computational methods we predict 45 possible anaeroboxes in p42d (see Materials and methods). In some cases there are pairs of anaeroboxes in the same region. For example, two anaeroboxes lie within the intergenic region of the divergent operons fixKd and fixNOQPd and two more were detected upstream of fixG, nocR, nodD3, and fnrNd. Other genes that display single anaeroboxes are fixX, nifW, nifU, hemN2, psiB, hesB, mcpC, teuB1 and some other genes of unknown function. Although there is no direct transcriptional evidence about the expression of these genes in microaerobic conditions, previous observations suggest that several regions of p42d are activated under these conditions .
A variety of transporters, which account for 10% of the CDSs, are scattered throughout p42d. These include several partial and complete ABC transporters for sugars, as well as the type III (TSSIII), and type IV (TSSIV) large-molecule secretion systems (Figure 1).
In several pathogenic bacteria, the TSSIII translocates virulence factors into eukaryotic cells. In Rhizobium this system was first found in pNGR234a and it has been shown to have a role in nodulation efficiency in some host plants. Genes that encode proteins implicated in this system are also present in some of the SGCs and were detected in the sequence of p42d. Interestingly, a gene homologous with an elicitor of the hypersensitive response in plants, hrpW, is exclusively present in p42d. This gene might form an operon with pcrD, which encodes a calcium-binding membrane protein that is also part of the type-III secretion system.
The TSSIV encoded by the virB genes has been described in several α-proteobacterial pathogens and plant symbionts (see below). It consists of a membrane channel for delivering proteins or DNA into eukaryotic cells. In p42d, a complete set of virB genes, from virB1 to virB11, is present (Figure 1). Other TSSIV correspond to the tra genes that participate in bacterial conjugation. Although p42d is not a self-conjugative plasmid, it contains the traACDG genes, an oriT and a truncated traI pseudogene (yp096), suggesting that p42d might have lost its self-conjugative capability.
In addition to nifA, fnrN, rpoN2, and nodD1-3, 12 predicted genes encoding potential transcriptional regulators are present in p42d. They belong to different families, including LysR, AraC, Crp and GntR. The plasmid also encodes other functions including plasmid maintenance, electron transfer, polysaccharide biosynthesis, melanin synthesis and secondary metabolism. The sequence of p42d revealed a putative methionyl-tRNA synthetase that could represent a reiterated gene or could have another functional role (for example in antibiotic resistance).
In general, large numbers of ERIS have been found in the symbiotic compartments of rhizobia [2,7,10,11]. The genome of S. meliloti, however, contains a relatively low abundance of these elements and their distribution is asymmetric; that is, ERIS are more abundant in the pSymA, especially near symbiotic genes. In p42d, ERIS belonging to 12 known IS families comprise 10% of the entire DNA sequence. The great majority of them belong to the IS3 and IS66 families. Although most ERIS represent incomplete, presumably inactive, IS sequences, some of them are organized in complete IS elements (Figure 1).
The positions of some ERIS suggest a role in plasmid shuffling. The 125 kb region that contains most of the symbiotic genes (Figure 1) is flanked by two complete IS elements. Both elements share identical 30 bp direct repeats at their borders, suggesting a potential transposition capability. The presence of the gene for an integrase-like protein (yp018) and the fact that the 125 kb region separates the repABC and the tra genes have prompted the idea that the entire symbiotic region could be a mobile element. Furthermore, some groups of genes flanked by ERIS might have arrived in p42d as part of composite transposons, such as the cytochrome P450 cluster (see below), the NRb region, and a putative ATPase of an ABC transporter.
It has previously been shown that p42d contains several reiterated DNA sequences that can recombine, leading to genomic rearrangements [14,49,50]. The sequence of the plasmid revealed a large amount of DNA reiteration. The major reiterated families were defined by containing a continuous stretch of at least 300 nucleotides with identical sequence. There are 12 such families, with two or three members each (Figure 1). In addition to the nif family described above, five families are related to ERIS and the rest are various genes such as those that encode the BacS protein, or gene fragments.
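The criterion of a continuous stretch of at least 300 identical nucleotides can be sketched as a search for maximal exact repeats. The code below is an illustration only, not the method used for p42d: it indexes every window of the minimum length, so it is simple rather than memory-efficient, and it considers direct repeats on a linear sequence. Inverted copies (such as NRc relative to NRa and NRb) and matches spanning the origin of a circular molecule would need the reverse complement and a wrapped sequence, which are left out here. The function name is hypothetical.

```python
from collections import defaultdict


def exact_repeats(seq: str, min_len: int = 300):
    """Return sorted (start_a, start_b, length) for maximal direct repeats.

    A repeat is reported once, at its leftmost matching pair of positions,
    and extended to the right for as long as the two copies stay identical.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)

    repeats = []
    for positions in index.values():
        for x in range(len(positions)):
            for y in range(x + 1, len(positions)):
                a, b = positions[x], positions[y]  # a < b by construction
                # Not left-maximal: the pair one base to the left covers it.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                length = min_len
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                repeats.append((a, b, length))
    return sorted(repeats)
```

Grouping the reported pairs that share a copy would then give families with two or three members, in the sense used above.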
As previously shown with pNGR234a, the DNA sequence allows prediction, identification and isolation of the potential rearrangements that may be generated by homologous recombination. The complete sequence of p42d will allow the identification of the precise sites of previously identified genomic rearrangements. In the present study we have predicted the major potential rearrangements in p42d as previously described; these include amplifications, deletions and inversions such as those illustrated in Figure 1.
In other SGCs the differences in number, organization, orientation and length of the reiterated elements predict specific genome rearrangements, as exemplified by the rearrangements that involve the nifH reiteration of p42d and pNGR234a [50,51].
The putative protein sequences of p42d were compared to the proteomes of several complete bacterial genomes extracted from GenBank (see Materials and methods) as well as to the SGC sequences available to date (Figure 3). We identified all pairs of potential orthologs between p42d and each of the genomes analyzed, following the strategy and definition described in Materials and methods. As expected, the highest percentage of orthologs common to p42d and to any other bacterial genome was found among the nitrogen-fixing symbiotic bacteria. S. meliloti and M. loti have, respectively, 51% and 45% of the orthologs found in p42d (see Additional data file 5). Members of the α-proteobacterial subclass such as Caulobacter crescentus, Brucella melitensis and A. tumefaciens (a plant pathogenic member of the Rhizobiaceae) have from 25% to 32% of the orthologs present in p42d. The percentage of p42d orthologs within the genomes of plant pathogens varies from 17% for Xylella fastidiosa to 31% for Ralstonia solanacearum. Human bacterial pathogens such as Haemophilus influenzae and Helicobacter pylori, those with small genomes such as Rickettsia prowazekii and Mycoplasma genitalium, and the archaea compared here, display the lowest number of shared orthologs. Instances of putative orthologs found in p42d and some complete bacterial genomes are shown in Figure 3a. In general, a collection of orthologs involved in diverse enzymatic activities is present in p42d and in most genomes compared here. They include the genes hemN1, hemN2, ctrE, hisC, icfA, pgmV, aatC, pcaC1, adhE, ribAB, bglS, kprS, mcpG, mcpA and mmsB (see Additional data file 1 for the assigned function). Their identity is in most cases 50% or lower.
When we examined the distribution of orthologs in the six SGCs (see above), including p42d and using the genomes of M. loti and S. meliloti as reference, it was found that half of the hits lie in the respective SGCs and the rest are dispersed among other replicons, including the chromosomes (Table 1, Figure 3f,g). In general, the genes for symbiosis are very well conserved in the SGCs, whereas the orthologs of genes not involved in symbiosis are distributed in nonsymbiotic plasmids and in the chromosomes (Figure 3c-e).
A total of 177 p42d CDSs (49%) have orthologs in at least one SGC. A subset of these (80 CDSs) belongs to the symbiotic region of 120 kb (Figure 3c-g, from NRa to NRb regions; Table 1) and the rest are interspersed in the remaining 251 kb of the plasmid. Among the SGCs compared, pNGR234a shares the highest percentage of orthologs (30%) with p42d (Table 1, Figure 3d), followed by the pSymA (28%) and the SGC of M. loti R7A (Table 1, Figure 3g and 3e, respectively). The SGCs of M. loti MAFF303099 and B. japonicum share the fewest orthologs (24%) with p42d (Table 1, Figure 3f and 3c, respectively). The A. tumefaciens plasmids display the highest similarity with the TSSIV, TRA, and REP regions of p42d; the rest of the matches are distributed in the circular and the linear chromosomes (Figure 3b).
There are 20 genes common to all SGCs. These correspond exclusively to symbiotic genes including both nitrogen fixation (nifHDKENXAB, fixABCX, fdxN, fdxB) and nodulation (nodABCIJD) genes. The essential nodBC genes, however, have possible paralogs in some plant pathogens such as A. tumefaciens, Ralstonia solanacearum and Xanthomonas. Possible paralogs of the transport genes nodIJ are present in all the genomes analyzed. In these bacterial species, putative paralogs of nod genes might participate in the synthesis and secretion of outer membrane lipopolysaccharides .
The fixNOQPGHIS common to different nitrogen-fixing symbiotic rhizobia are not always confined to the SGCs [10,11]. As mentioned above, in R. etli CFN42, the fix genes are distributed in two replicons, p42d and p42f, and some of them are reiterated, as is frequently observed in other genomes [2,3]. In S. meliloti, these genes are reiterated three times in pSymA  and in M. loti there are two copies of the entire operon . In B. japonicum they lie outside of the SGC (410 kb) determined by Gottfert et al.  but are included in the equivalent 681 kb SGC of the complete genome . Moreover, in Rhizobium sp. NGR234 they are chromosomal . The fixNOQPGHIS cluster was identified in the genome of the plant pathogen A. tumefaciens (circular chromosome), the intracellular parasite Brucella melitensis (chromosome I), and in the free-living aquatic bacterium C. crescentus; all of them belonging to the α-proteobacterial subdivision. Among γ-proteobacteria, the plant pathogen Pseudomonas aeruginosa has this fix cluster, which is absent in E. coli. Also, orthologs of this gene cluster are conserved in R. solanacearum, a plant pathogen that belongs to the β-proteobacteria.
The fixABCX operon is highly conserved in diazotrophs as well as in a wide variety of other bacterial and archaeal species such as E. coli, Mycoplasma genitalium, Bacillus subtilis, Thermotoga maritima and Archaeoglobus fulgidus. In E. coli these fix genes are related to the carnitine pathway, but their function is unknown in the other species. The ferredoxins FdxN and FdxB are always linked to nif genes in symbiotic as well as nonsymbiotic organisms. In S. meliloti, mutations in fdxN significantly impair the nitrogen-fixation process.
As mentioned above, the CPX cluster (9 kb, 15 CDSs) in p42d exhibits GC and CU profiles that diverge from the rest of the genes; CPX gene function is not known and no symbiotic role has so far been assigned to them. In B. japonicum some of these genes might participate in terpenoid synthesis . The genes included in the CPX region showed similar organization in the SGC of M. loti (strains MAFF303099 and R7A), in pNGR234a, and in p42d. In B. japonicum, the CPX cluster was not located in the 410 kb SGC [11,36] but is present in the 680 kb SGC determined by Kaneko et al. . In pSymA of S. meliloti, this cluster is partially represented by homologs of cpxP2, cpxP4, ctrE and some conserved hypothetical genes, yp013-yp015. Homologs of IS are located at the right border of the CPX region in pNGR234a and pSymA, while they are to the left of the SGC in M. loti R7A. The CPX region in p42d is flanked by ERIS, highlighting its potential for transposition.
A common feature in the SGCs is the presence of either the TSSIII or the TSSIV transport secretion systems. The TSSIII is found in pNGR234a, in the SGCs of B. japonicum, and in M. loti MAFF303099. The TSSIV is located in pSymA of S. meliloti and the symbiotic island of M. loti R7A. Both transport systems are present in p42d. In the pTi and pRi plasmids of A. tumefaciens C58, the TSSIV system is used for transferring the T-DNA to plant cells. In the absence of T-DNA in the SGCs, the precise function of these systems is not clear. Furthermore, both TSSIII and TSSIV are found in bacterial pathogens of plants and animals as well as in some α-proteobacteria. Complete or partial TSSIII or TSSIV are present in Brucella melitensis, C. crescentus, X. citri and X. campestris, whereas in Rickettsia prowazekii, some virB genes are conserved. P. aeruginosa contains a complete TSSIII but lacks homologs of the TSSIV, while in Xylella fastidiosa, nine putative conjugative proteins of the plasmid pXF41 are clearly orthologs of the corresponding virB gene set found in other microorganisms.
It is generally known that gene order is conserved in closely related strains and species. The six SGCs compared here, except the SGCs of the two M. loti strains, have 20-30% of genes in common according to our estimates (Table 1). Most of these genes are located within the conserved clusters described above. Furthermore, genes unique to each of the individual genomes are interspersed among genes present in all SGCs. For example, p42d contains 71 orphan genes throughout its structure.
The SGCs in M. loti strains MAFF303099 and R7A share large conserved segments that contain all the symbiotic genes (Figure 4, panel 10). The colinearity is disrupted by genes unique to either of the SGCs. The smallest region that encloses the 20 common orthologous genes (essentially nod and nif genes) can be delimited to about 50 kb in pSymA, 120 kb in p42d, 250 kb in pNGR234a, 300 kb in the SGC of B. japonicum, and 320 kb in the two SGCs of M. loti (Figure 5). Such variability in gene order suggests that the SGCs have recombined frequently with other genome elements.
Several transcriptional units that are conserved in some SGCs appear to have undergone rearrangements in others. Examples taken from the nif, fix and nod operons are illustrated in Additional data file 6. The nifHDK and nifENX, are neighboring conserved transcriptional units in p42d, in the two SGCs of M. loti, and in pNGR234a. However, nifH and nifN are separated from their respective operons in the SGC of B. japonicum and in pSymA of S. meliloti, respectively. Similarly, nodA is located away from the nodBC genes in p42d, and nodB is distant in the SGC of M. loti strains. The operon fixABCX is disrupted in the SGC of B. japonicum, where fixA is in an operon with nifA. In turn, in other SGCs, nifA is commonly found in an operon with nifB and fdxN. Phylogenetic analyses of the 20 common genes in the six SGCs result in nonequivalent trees, even for genes that are organized in operons (data not shown). For example, trees derived from the genes of the operons nifHDK and fixABCX are incongruent, indicating that intraoperon recombination has been also frequent.
Our results indicate that p42d contains several regions that significantly deviate from the average GC content and typical CU. The plasmid harbors a large number of ERIS and several reiterated DNA families. In addition, it contains 10 pseudogenes. These features resemble those found in other SGCs. All SGCs sequenced so far are heterogeneous regarding their gene content, and the genes common to most of them are mainly those involved in nodulation and nitrogen fixation. Other common genes are present either in SGCs or in other genome locations (see above). The lack of synteny between p42d and the different SGCs analyzed gives further support to the notion that the symbiotic compartments of rhizobial genomes are mosaic structures, presumably assembled from regions derived from diverse genomic contexts, that might have been frequently modified as a consequence of transposition, recombination and lateral transfer events.
A minimal set of cosmids that covers the entire p42d was used to generate shotgun libraries (1-2 kb mean insert size) cloned in M13 or pUC19 vectors. DNA sequencing reactions were performed using the Big-Dye Terminator kit in an automatic 373A DNA Sequencer (Applied Biosystems, Foster City, CA). Gaps were filled by a primer-walking strategy as well as by sequencing appropriate clones from pBR328 and pSUP202 libraries. A total of 6,210 readings of 450 bases on average were collected to achieve a coverage of 7× for the entire p42d.
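The stated coverage can be checked with simple arithmetic, taking the plasmid length of 371,255 bp reported above. This sketch assumes that every read base counts toward coverage, so it gives an upper bound before quality trimming; the function name is hypothetical.

```python
def mean_coverage(n_reads: int, mean_read_len: float, target_len: int) -> float:
    """Average sequencing depth: total bases read divided by target length."""
    return n_reads * mean_read_len / target_len


# 6,210 reads of about 450 bases over the 371,255 bp plasmid
coverage = mean_coverage(6210, 450, 371_255)
print(f"{coverage:.1f}x")  # prints 7.5x
```

The raw figure of about 7.5 is in line with the reported 7× coverage once some loss to trimming and low-quality bases is allowed for.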
Base calling was done using the program PHRED and the assembly was obtained by PHRAP [56,57]. Graphic representation and editing of the assembly were accomplished using the CONSED program. Low-quality and single-stranded regions were located, and further sequencing was done to cover these areas. An error rate of less than 1 per 10,000 bases was estimated using base qualities determined by the PHRAP assembler. To confirm the assembly, pairs of forward and reverse primers were designed and used to generate overlapping PCR products with an average size of 5 kb, covering the entire plasmid in a single circular contig. The PCR products obtained agree well with the determined sequence (data not shown).
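The link between base qualities and an estimated error rate can be illustrated with the standard Phred scale, in which a quality value Q corresponds to an error probability P = 10^(-Q/10). The sketch below assumes that the consensus qualities are Phred-scaled and that the overall error rate is taken as the mean per-base error probability. The exact procedure used for p42d is not described in the text, and the quality values in the example are hypothetical.

```python
def error_probability(q: float) -> float:
    """Phred scale: Q = -10 * log10(P), hence P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def expected_error_rate(qualities) -> float:
    """Mean per-base error probability over a consensus sequence."""
    quals = list(qualities)
    return sum(error_probability(q) for q in quals) / len(quals)


# Hypothetical consensus qualities; Q40 corresponds to 1 error per 10,000 bases.
example_qualities = [40, 50, 60]
rate = expected_error_rate(example_qualities)
print(rate < 1e-4)
```

Under this scale, an error rate below 1 per 10,000 bases corresponds to an average consensus quality above Q40.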
The coding capacity of p42d was determined by applying GLIMMER 2.02 [59,60] iteratively to enhance the overall prediction efficiency. Given the evidence indicating that GLIMMER-based predictions are less effective in plasmids, our approach also took into consideration the existence of several gene classes with different codon-usage (CU) patterns, and a potential ribosome-binding site (RBS) specific to p42d to aid GLIMMER in the selection of start codons.
An initial set of presumably functional genes with a corresponding upstream RBS was detected by running BLASTX comparisons (using a maximum e-value cutoff of 0.001) of the entire plasmid against the nonredundant (nr) database at the National Center for Biotechnology Information (NCBI). All matches with hypothetical or putative proteins as well as those with an upstream neighbor hit closer than 50 bp were discarded to avoid genes within operons. We took into consideration only hits displaying an identity ≥ 40%, starting at the first amino acid, and alignment coverage of at least 80% of the matched protein. We then extended the selected hits towards the 5' terminus and kept those with an upstream in-frame stop codon before any other possible start codon. This procedure left 21 hits. From the p42d sequence we extracted 20-bp regions upstream of the start codons of these hits and inferred the most probable RBS (6 bp in length) by applying the CONSENSUS program. The resulting consensus matrix supported the sequence GGAGAG with an expected frequency of 2.034 × 10⁻⁸.
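The hit-selection criteria described above can be expressed as a simple filter. The following is a minimal sketch, not the original implementation. The `Hit` record and its field names are hypothetical, and it assumes that each BLASTX hit has already been parsed into the alignment coordinates on the matched protein and the distance to the nearest upstream hit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    description: str                 # description line of the matched protein
    evalue: float
    identity_pct: float              # percent identity of the alignment
    subject_start: int               # 1-based alignment start on the matched protein
    subject_end: int                 # 1-based alignment end on the matched protein
    subject_len: int                 # full length of the matched protein
    upstream_gap_bp: Optional[int]   # distance to nearest upstream hit, None if absent


def passes_filters(h: Hit) -> bool:
    """Apply the criteria from the text to one BLASTX hit."""
    desc = h.description.lower()
    if "hypothetical" in desc or "putative" in desc:
        return False                 # uninformative matches are discarded
    if h.upstream_gap_bp is not None and h.upstream_gap_bp < 50:
        return False                 # likely internal to an operon
    coverage = (h.subject_end - h.subject_start + 1) / h.subject_len
    return (
        h.evalue <= 1e-3
        and h.identity_pct >= 40.0
        and h.subject_start == 1     # alignment begins at the first amino acid
        and coverage >= 0.80
    )


# Hypothetical hits for illustration only.
good = Hit("nitrogenase iron protein NifH", 1e-80, 92.0, 1, 290, 297, 400)
bad = Hit("hypothetical protein", 1e-40, 75.0, 1, 200, 210, None)
print(passes_filters(good), passes_filters(bad))
```

The subsequent step, extending each retained hit towards the 5' end until an in-frame stop codon is found, requires the plasmid sequence itself and is not shown here.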
To train GLIMMER, we took the initial output of the BLASTX comparison detailed above, and selected as training set all hits with an alignment length ≥ 100 amino acids. Again, all matches with hypothetical/putative proteins were discarded. Overlapping hits matching the same protein were merged into a single larger hit, generating a total of 183 DNA segments. The RBS and the training set obtained were then used to run GLIMMER, yielding a prediction of 460 CDSs that included 93% of the 183 segments in the training set. However, we noticed that running GLIMMER iteratively yielded better results, because it produced a lower number of predicted CDSs and a greater number of segments in the training set mapping within predicted CDSs. We applied the method of A.M.-S., G.M.-H., A. Christen and J.C.-V. (unpublished work) to split the initial set of 460 CDSs into three groups displaying poor, typical and rich codon usage. Essentially, this method quantifies the extent to which individual genes use the most abundant codons in the plasmid. Each group was used as a training set and GLIMMER was run for 20 iterations to predict CDSs ≥ 300 bp (CDSs ≥ 500 bp in the first iteration composed the training set for the second iteration, and so forth). The best prediction for each CU group was selected, and the three resulting predictions were incorporated into a single one that produced 396 CDSs and recovered 97.75% of the initial training set.
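The merging of overlapping hits that match the same protein is an interval-merging problem. The sketch below is one way to do it, assuming that each hit is reduced to a tuple of protein identifier and start and end coordinates on the plasmid. Strand and reading frame are ignored for brevity, and the coordinates in the example are hypothetical.

```python
from collections import defaultdict


def merge_hits(hits):
    """Merge overlapping (protein_id, start, end) hits that share a protein_id.

    Coordinates are plasmid positions with start <= end. Hits to different
    proteins are never merged, even if they overlap.
    """
    by_protein = defaultdict(list)
    for pid, start, end in hits:
        by_protein[pid].append((start, end))

    merged = []
    for pid, spans in by_protein.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:                 # overlaps the current segment
                cur_end = max(cur_end, end)
            else:                                # gap: close segment, start a new one
                merged.append((pid, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((pid, cur_start, cur_end))
    return sorted(merged, key=lambda seg: (seg[1], seg[2]))


# Hypothetical coordinates for illustration only.
example_hits = [
    ("nifH", 100, 400),
    ("nifH", 350, 900),
    ("nodA", 2000, 2600),
    ("nifH", 5000, 5300),
]
segments = merge_hits(example_hits)
print(segments)
```

Each merged segment would then serve as one entry of the GLIMMER training set.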
All CDSs were manually curated using BLASTX comparisons (e-value ≤ 0.001) against the nr database. The following criteria were applied to annotate the CDSs: CDSs were tagged as hypothetical (yh) when no homolog could be detected; hypothetical conserved CDSs (yp) were those displaying strong similarity to hypothetical proteins or weak similarity to known genes; CDSs with similarity ≥ 50% along the entire length of known genes were assigned the same name as the matching gene; CDSs related to insertion sequences (IS) and transposons (yi) were compared with BLASTN and BLASTX against the IS database to identify the family they belong to. These elements were also analyzed for the presence of inverted repeats at their borders using the OLIGO 6.4 and BLAST2 programs. Functional classification was carried out following the categories proposed in Freiberg et al. Additional support for annotation was obtained by searching for protein domains and motifs with the Interpro suite. Transmembrane domains and leader peptides were searched using the PSORT program. A relational database that compiles all this information is available at , and Additional data file 1 shows the set of 359 annotated CDSs.
We predicted that all CDSs in p42d are organized in 235 transcription units (TUs) by applying a previously reported distance-based methodology. Binding sites were detected using upstream regions of variable length (specified in each case below) for all annotated CDSs in the pSym. To identify genes potentially expressed from RpoN promoters, we compiled an initial training set containing 85 prokaryotic promoters for which the transcription start site has been experimentally mapped. The CONSENSUS/PATSER set of programs was then used to predict promoters 16 bp long in upstream regions of 250 bp. A final set of 37 RpoN promoters was obtained using as PATSER threshold the mean (μ) minus one standard deviation (σ) estimated from the set of 85 promoters (μ - 1σ = 6.33). Binding sites for NifA or UAS were predicted using seven reported sites [27,70-73] as the training set. CONSENSUS/PATSER was run to predict sites of 16 bp in length within -400 to +50 bp regions, yielding 21 sites with PATSER score ≥ 8.03 (μ - 1σ); if a more stringent threshold is used instead, several known sites are undetected. We further discarded all predicted UAS without an associated RpoN promoter. In the case of nod boxes, we applied the dyad-sweeping method to a set of six reported sites [75-78] in order to pinpoint the location of potential nod boxes in p42d (as a conglomerate of five or more dyads), and then used CONSENSUS/PATSER to determine 47-bp sites within -600 to +50 bp regions.
Seven putative nod boxes were found by these approaches; however, several known functional sites were still undetected, and thus we trained CONSENSUS/PATSER again with the seven p42d sites found. Because the mean PATSER score for these sites is high (21.48), the usual threshold (μ - 1σ) is correspondingly high (16.21), and no additional sites could be predicted with it. We therefore used as threshold the lowest PATSER score (9.15) obtained from the seven training sequences, which finally yielded 15 predicted nod boxes. CONSENSUS/PATSER programs were also applied to identify regulatory motifs for Fnr based on 30 known binding sites in E. coli extracted from RegulonDB. Predictions were carried out in the -400 to +50 bp regions using a PATSER score ≥ 6.2 (μ - 1σ) as threshold, which yielded 45 potential Fnr binding sites in p42d. With the stricter threshold of the mean PATSER score (9.77), only eight sites are detected. Nonetheless, given the evidence suggesting that there is high transcriptional activity in p42d under low-oxygen conditions, we decided to use the more relaxed threshold to allow more Fnr sites.
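The score thresholds used for the binding-site predictions (μ - 1σ, the mean, or the lowest training score) can be derived from the scores of the training sites as sketched below. This is an illustration only. It assumes that σ is the sample standard deviation, which the text does not specify, and the scores and positions in the example are hypothetical rather than the actual PATSER output for p42d.

```python
from statistics import mean, stdev


def score_thresholds(training_scores):
    """Candidate cutoffs derived from the scores of the training sites."""
    mu = mean(training_scores)
    sigma = stdev(training_scores)   # sample SD; an assumption, see lead-in
    return {
        "mean": mu,
        "mean_minus_sd": mu - sigma,
        "min": min(training_scores),
    }


def select_sites(candidates, threshold):
    """Keep (position, score) pairs whose score reaches the threshold."""
    return [(pos, score) for pos, score in candidates if score >= threshold]


# Hypothetical training scores and genome-wide candidates.
cutoffs = score_thresholds([10.0, 12.0, 14.0])
candidates = [(120, 9.5), (3400, 10.0), (7800, 12.5), (9100, 15.0)]

relaxed = select_sites(candidates, cutoffs["mean_minus_sd"])
strict = select_sites(candidates, cutoffs["mean"])
print(len(relaxed), len(strict))
```

As in the text, a stricter cutoff retains fewer sites, so the choice of threshold trades sensitivity to known functional sites against the number of likely false positives.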
Protein sequences from different genomes or symbiotic compartments were obtained from GenBank: pNGR234a U00090; B. japonicum USDA110 symbiotic region AF322012 and AF322013; S. meliloti AL591688; M. loti MAFF303099 NC_002678; M. loti R7A symbiotic island AL672111; A. tumefaciens C58 (U. Washington) AE008688 and AE008689; A. tumefaciens C58 (Cereon) AE007869 and AE007870; Ralstonia solanacearum AL646052; C. crescentus AE005673; Rickettsia prowazekii AJ235269; E. coli O157:H7 BA000007; E. coli K12 U00096; Brucella melitensis AE008917; P. aeruginosa AE004091; Xanthomonas citri AE008923; Nostoc NC_003272; Xanthomonas campestris AE008922; Synechocystis PCC6803 AB001339; Xylella fastidiosa AE003851; Borrelia burgdorferi AE000783; Buchnera sp. APS AP000398; Mycoplasma genitalium L43967; Thermotoga maritima AE000512; Aquifex aeolicus AE000657; Archaeoglobus fulgidus AE000782; Aeropyrum pernix BA000002; Methanobacterium thermoautotrophicum AE000666; Methanococcus jannaschii L77117; Methanopyrus kandleri AE009439. Most probable orthologs were detected by applying a previously reported method. Essentially, the method performs BLASTP pairwise comparisons against the protein sequences of p42d, and bidirectional best hits (BDBHs) are used to define the most likely orthologous genes. All BDBHs with an e-value ≤ 0.0001 and alignment coverage of at least 50% of the smaller CDS were taken into consideration.
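The bidirectional best hit criterion can be sketched as follows. This is a simplified illustration, not the published method. It assumes that BLASTP results have been parsed into tuples with the fields named in the docstring, that the best hit of a query is the one with the highest bit score, and that coverage is the alignment length divided by the length of the shorter protein. The identifiers in the example are hypothetical.

```python
def best_hits(hits, max_evalue=1e-4, min_coverage=0.5):
    """Return {query: best subject} after applying e-value and coverage cutoffs.

    hits: iterable of (query, subject, evalue, bitscore, aln_len, query_len, subject_len).
    """
    best = {}
    for query, subject, evalue, bits, aln_len, q_len, s_len in hits:
        if evalue > max_evalue:
            continue
        if aln_len / min(q_len, s_len) < min_coverage:
            continue                             # too little of the smaller CDS aligned
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a, **cutoffs):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b, **cutoffs)
    reverse = best_hits(b_vs_a, **cutoffs)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical parsed BLASTP results for two proteomes A and B.
a_vs_b = [
    ("a1", "b1", 1e-50, 200.0, 100, 120, 110),
    ("a1", "b2", 1e-20, 90.0, 80, 120, 100),
    ("a2", "b2", 1e-30, 150.0, 90, 100, 100),
]
b_vs_a = [
    ("b1", "a1", 1e-50, 200.0, 100, 110, 120),
    ("b2", "a1", 1e-20, 90.0, 80, 100, 120),
]
pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this example only a1 and b1 are reciprocal best hits; a2 selects b2, but b2 selects a1, so that pair is not reported.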
The nucleotide sequence reported here has been deposited in GenBank under the accession number U80928.
The most relevant features of the functional annotation of p42d can be found in Additional data file 1, available with the online version of this paper. It contains the name, the predicted protein size, the best nr-matching homolog, and the percentage of similarity/identity. Lists of predicted binding sites are shown in Additional data file 2 (nod boxes), Additional data file 3 (RpoN promoters and NifA UAS) and Additional data file 4 (anaeroboxes). The number of BDBHs between several complete genomes and p42d is given in Additional data file 5. The topological representation of the 20 common genes in the six SGCs is detailed in Additional data file 6: A, p42d; B, M. loti MAFF303099; C, pNGR234a; D, M. loti R7A SGC; E, B. japonicum SGC; F, S. meliloti pSymA.
We dedicate this paper to Rafael Palacios and Jaime Mora in gratitude for their support and stimulating critical discussions. We are grateful for the skillful technical support and advice given by J.A. Gama, R.E. Gómez, R.I. Santamaría, S. Caro, J. Espíritu, D. García, F. Sánchez, E. Díaz, E. Pérez-Rueda, V. del Moral, K.D. Noel, J. Sanjuan, M. Rosenblueth, P. Gaytán, E. López, P. Rabinowicz, and P.M. Reddy. This work was partially supported by a public grant from CONACyT (México) under the Program for Emerging Areas (N-028).