The DNA for the libraries was obtained from the total DNA content of B.garinii strain PBi. Thus, besides the chromosome, the plasmids of this strain should also be represented at least in part in the shotgun data. All sequences from the whole-genome library were binned according to their similarities to the B.burgdorferi genome parts (i.e. chromosome and plasmids): 37.8% of all B.garinii reads were derived from plasmids (Table ). This value is comparable to that of the plasmid fraction of B.burgdorferi (40%).
Comparison of the B.garinii low redundancy WGS assembly to the B.burgdorferi sequence
Altogether the assembly comprises 1.227 Mb of B.garinii sequence. Three contigs completely cover the counterparts of the B.burgdorferi chromosome and of plasmids lp54 and cp26. An additional 37 contigs >2 kb, amounting to 239 kb, were obtained for the remaining plasmid fraction of B.garinii (Figure ). A clear-cut, error-free assignment of these plasmid contigs to defined B.burgdorferi plasmids was not possible, which further underlines the variability of the plasmid complement in Borrelia species. The whole chromosome and two plasmids (lp54, A and cp26, B) are completely represented in B.garinii. Eight plasmids of B.burgdorferi are highly similar to one another (cp32: L, M, N, O, P, R, S; lp56: Q). These also seem to be completely represented by B.garinii plasmid contigs, although neither an exact assignment of individual contigs to a specific B.burgdorferi plasmid nor a calculation of their copy number is possible. The remaining plasmid contigs of B.garinii also show similarities to B.burgdorferi plasmids, but some regions are either unique to B.burgdorferi or have low DNA similarity. In four contigs, we observed similarities to different plasmids of B.burgdorferi, indicating breakage/fusion points between different plasmids.
Figure 1. View of the B.burgdorferi genome indicating similarities to sequences of the B.garinii genome. The calculation of the similarities was done using BLAST. Threshold for identity was 75% on a length of 40 bases. Portions of the B.burgdorferi genome with ...
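The similarity threshold in the caption of Figure 1 can be expressed as a filter over BLAST hits. The following is a minimal sketch, assuming the hits are available in BLAST's standard 12-column tabular output and reading the threshold as at least 75% identity over an alignment of at least 40 bases. The function name and the sample rows are hypothetical and are not taken from the study.

```python
# Sketch: keep BLAST hits that meet an identity/length threshold.
# ASSUMPTION: input is the standard 12-column tabular format (e.g. -outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
MIN_IDENTITY = 75.0  # percent identity, from the Figure 1 caption
MIN_LENGTH = 40      # alignment length in bases, from the Figure 1 caption


def filter_hits(lines, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Return (query, subject, identity, length) for hits passing both cutoffs."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        identity = float(fields[2])  # column 3: percent identity
        length = int(fields[3])      # column 4: alignment length
        if identity >= min_identity and length >= min_length:
            kept.append((fields[0], fields[1], identity, length))
    return kept


# HYPOTHETICAL rows, for illustration only.
example = [
    "contig1\tlp54\t92.5\t310\t23\t0\t1\t310\t100\t409\t1e-100\t400",
    "contig2\tcp26\t70.0\t120\t36\t0\t1\t120\t5\t124\t1e-10\t60",
    "contig3\tlp17\t88.0\t35\t4\t0\t1\t35\t9\t43\t1e-5\t40",
]
print(filter_hits(example))
```

In this toy input only the first row passes: the second fails the identity cutoff and the third fails the length cutoff.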
In our low redundancy project, the average coverage of the chromosome assembly is 3.38 (Table ). The initial assembly of the B.garinii chromosome had 260 gaps (comprising sequencing and clone gaps). After applying gap closure procedures, we obtained one contig covering almost the complete (99.5%) linear B.burgdorferi chromosome. Despite the low redundancy of the sequence reads, >80% of the chromosome has an error frequency of <10⁻⁴. The overall expected error rate is 0.26% (Supplementary Figure S1). The comparative analysis shows an unbroken collinearity between the B.burgdorferi and B.garinii chromosomes. The only parts of the B.burgdorferi chromosome with no counterpart in B.garinii reside in both telomeric regions, with a size of 168 and 8458 bp, respectively. Since the ends of linear chromosomes are not clonable without further manipulation, the missing bases on the left end are probably due to a cloning bias. It was previously shown that the right end of the chromosome in B.burgdorferi exhibits length variations in defined steps in different strains (33). The shorter right end of the B.garinii chromosome is comparable to one form of these stepwise variable lengths of the B.burgdorferi chromosome.
Substitutions, insertions and deletions
Base substitutions are a measure of the evolutionary distance between organisms. The overall identity of the B.garinii chromosome with the B.burgdorferi chromosome is 92.7%. A calculation based on the shared CDS at the amino acid level gives the same result, indicating an equal distribution of substitutions over the chromosome irrespective of information content. The decrease of similarity between the two chromosomes below 80% in three regions around the origin of replication, as can be seen in Figure , is apparently caused by larger insertions and deletions (indels; Figure , numbers 4–6). In total, we found 66 482 single base substitutions (Table ). Transitions and transversions are almost equally distributed in the genic and intergenic regions.
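How substitutions are sorted into transitions and transversions can be illustrated with a short sketch. It assumes two already aligned sequences of equal length, with gaps or ambiguous bases skipped. The function name and the toy sequences are hypothetical and do not reproduce the pipeline used in this study.

```python
# Sketch: count transitions and transversions in a pairwise alignment.
# ASSUMPTION: both strings are pre-aligned and of equal length.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitutions(ref, qry):
    """Return counts of compared positions, transitions and transversions."""
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")
    counts = {"compared": 0, "transitions": 0, "transversions": 0}
    for a, b in zip(ref.upper(), qry.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        counts["compared"] += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            counts["transitions"] += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            counts["transversions"] += 1  # purine<->pyrimidine
    return counts


# HYPOTHETICAL toy alignment.
result = classify_substitutions("ACGTACGT", "GCGTTCGA")
identity = 1 - (result["transitions"] + result["transversions"]) / result["compared"]
print(result, identity)
```

The toy alignment contains one transition (A to G) and two transversions (A to T, T to A) over eight compared positions.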
Figure 2. A sketch of the B.garinii chromosome compared to the collinear B.burgdorferi counterpart. Base identity of B.garinii versus B.burgdorferi is shown as green line, GC content as light blue line (both left scale) and GC skew as purple line. All values were ...
Frequency of single base substitutions in the collinear genomic elements of B.burgdorferi and B.garinii
Besides the shorter right end of the chromosome, we found eight insertions and six deletions with a size >100 bp (Figure , numbers 1–8; Supplementary Table S1) in the B.garinii chromosome as major structural differences relative to the chromosome of B.burgdorferi.
The largest observed insertion with a size of 1878 bp is caused by a duplication of a region containing the bmpA gene and part of the bmpB gene (see below) resulting in a tandem repeat of these genes (Figure , number 3). A series of five insertions is separated by short orthologous sequences of at most 470 bases (Figure , number 4; Supplementary Table S1). This cluster of insertions is located in a region containing two tRNA genes (tRNA-Ile-1, tRNA-Ala-1) and expands only intergenic regions.
Indel region 5 consists of an insertion of 538 bases (Figure , peak 3) followed by two deletions of 211 and 498 bases, respectively, which are separated by 133 bases. In this indel region 5 resides the gene encoding inositol monophosphatase (BB0524) in B.burgdorferi. This gene is only partly represented (59 of 284 amino acids) in B.garinii. The eighth insertion (Figure , number 7) is located in an intergenic region. According to the observed indel regions, two ‘hotspots’ for rearrangements on the Borrelia chromosome could be defined: indel region 4 and indel region 5. Most interestingly, these two regions are located in close vicinity to the origin of replication of the chromosome at position 475 kb.
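Figure 2 plots GC content and GC skew along the chromosome, and a sign change in GC skew is commonly used as a marker for a replication origin. A minimal sketch of such a windowed calculation is given below, assuming the conventional definition of GC skew as (G − C)/(G + C). The window and step sizes are hypothetical parameters, since the values used for the figure are not stated here.

```python
# Sketch: GC content and GC skew in sliding windows.
# ASSUMPTION: GC skew is defined as (G - C) / (G + C).


def gc_content_and_skew(seq):
    """Return (GC content, GC skew) for one sequence window."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    content = (g + c) / len(seq) if seq else 0.0
    skew = (g - c) / (g + c) if (g + c) else 0.0
    return content, skew


def windowed_skew(seq, window, step):
    """Return a list of (window start, GC skew) along the sequence."""
    return [
        (start, gc_content_and_skew(seq[start:start + window])[1])
        for start in range(0, len(seq) - window + 1, step)
    ]


# HYPOTHETICAL toy sequence: G-rich first half, C-rich second half.
print(windowed_skew("GGGGCCCC", window=4, step=4))
```

In the toy sequence the skew flips from positive to negative between the two windows, which is the kind of transition one would look for near an origin or terminus.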
All deletions >100 bases, including the missing chromosomal ends, comprise 10 448 bases, all insertions 4099 bases. Thus, the B.garinii chromosome is 0.7% shorter than that of B.burgdorferi.
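The 0.7% figure can be checked with simple arithmetic. The sketch below assumes a B.burgdorferi B31 chromosome length of roughly 910.7 kb, a value that is not stated in this section and is used only for illustration. Only the indels >100 bases reported above enter the calculation.

```python
# Sketch: net length difference implied by the reported indels.
deleted_bases = 10448    # from the text: deletions >100 bases incl. missing ends
inserted_bases = 4099    # from the text: insertions >100 bases
# ASSUMPTION: approximate length of the B.burgdorferi B31 chromosome in bp;
# this value is not given in the section and is used for illustration only.
reference_length = 910725

net_loss = deleted_bases - inserted_bases
percent_shorter = 100 * net_loss / reference_length
print(net_loss, round(percent_shorter, 2))
```

Under this assumption the net loss of 6349 bases corresponds to about 0.7% of the reference chromosome, in line with the value given above.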
A comparative annotation of the B.garinii chromosome was performed using the previously published B.burgdorferi gene prediction and annotation (GenBank NC_001318.1) (16). In parallel, we used GeneMarkS for ab initio gene predictions. This program was not able to detect 36 of the original gene predictions on the B.burgdorferi chromosome (Supplementary Table S3). Two of these non-verified coding sequences are fused to neighbouring coding sequences in B.garinii ( ). The failure of GeneMarkS to identify BB0412 is possibly a false negative result, as the program predicted the ortholog in B.garinii. An additional ten potential genes of the remaining 814 genes annotated on the chromosome of B.burgdorferi are not predicted on the chromosome of B.garinii (Supplementary Table S4). Interestingly, these genes are annotated in B.burgdorferi only as predicted coding regions without any other supporting evidence such as similarities to other genes. Twelve predicted B.burgdorferi genes are fused in B.garinii ( ). Eight annotated genes show extensive length divergence and only partly overlap their B.garinii counterparts; seven of these are altered due to differing open reading frames (ORFs) on an otherwise orthologous genomic sequence (BB0475 ). The lmp1 gene ( ) is affected by the largest deletion in B.garinii (Figure , number 1) and thereby shortened by 648 bases at the 5′ end. In summary, 786 GenBank annotated genes of B.burgdorferi are supported by GeneMarkS predictions in both Borrelia species and thus most likely represent the orthologous set of chromosomal genes. The length of 452 orthologs is unaltered, 255 genes are longer in B.burgdorferi and 79 genes are longer in B.garinii, but since these length differences are only small, the core of the deduced amino acid sequence is not affected.
BB0086 is split in B.garinii. Two genes are affected by a large duplication in B.garinii leading to a second copy of bmpA (BB0382) and a partial copy of bmpB (BB0383; Figure , number 3). Due to nonsense mutations, this partial copy is represented in the predicted gene set by four small ORFs.
In addition to the 807 predicted B.garinii genes with homologous sequences (including split, fused and other altered gene structures) in annotated CDS of B.burgdorferi, GeneMarkS predicts 33 further genes. These genes are comparably small (<60 amino acids) and most likely represent false positive predictions. This is further underlined by the fact that four of these potential genes lie within rRNA and tRNA gene regions and additional four predictions are apparently derived from the truncated copy of bmpB. The additional 39 genes on the B.burgdorferi chromosome, which are predicted by GeneMarkS, may also be false positive predictions.
On the DNA level, no predicted protein-coding gene of B.garinii is identical to its ortholog in B.burgdorferi, whereas 20 tRNA genes (out of a total of 33 tRNA genes) are identical to their orthologous counterparts. Interestingly, the mutations do not seem to affect the tRNA genes randomly, since most sequence changes occur in non-unique tRNA genes (11 of 13). All four copies of tRNA-Leu, all three copies of tRNA-Ser, two of the three tRNA-Arg copies, and the second copy of tRNA-Lys and tRNA-Thr, respectively, are mutated.
Due to the high similarity of the chromosomes, the statistics of the codon usage shows only a slight difference between the two species. For example, in B.burgdorferi a higher preference for TTG as a start codon than in B.garinii predicted genes could be observed (Supplementary Table S2).
On the protein level, only 11 of all B.garinii genes are not altered in comparison to B.burgdorferi. This includes the ribosomal proteins rpsU, rpsL, rpmG, rpsJ, rpsS, rpmF, a putative subunit K of an ATPase, the flagellar motor switch protein fliG-2, the phosphocarrier protein ptsH-2, and the chemotaxis-related proteins cheX and cheY-3.
An additional 25 genes are affected only by conservative exchanges between amino acids with the same chemical properties, thus increasing the number of highly conserved proteins to 38; not surprisingly, 18 genes of this expanded group code for ribosomal proteins.
As an indication for positive selection, 94 (11.7%) of all orthologous genes and genes with similar sequences contain more non-synonymous than synonymous exchanges. Of these, 61 have no functional assignments. Interestingly, a higher than average proportion of the deduced amino acid sequences is predicted to contain transmembrane domains (39% compared to 26% for all proteins, Supplementary Table S8). The remaining 33 predicted genes are listed in Table . Many of these proteins seem to be located, according to their function, on the surface of the cell.
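The comparison of non-synonymous and synonymous exchanges can be illustrated by counting differing codons in two aligned coding sequences. This is a deliberately crude sketch that assumes the standard genetic code and in-frame, gap-free alignment of the compared codons. It tallies raw codon differences and does not compute normalized Ka and Ks rates, and it is not the procedure used in this study. All names and the toy sequences are hypothetical.

```python
# Sketch: count synonymous vs non-synonymous codon differences.
# ASSUMPTION: standard genetic code, codons listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_exchanges(cds_a, cds_b):
    """Return (synonymous, non_synonymous) counts of differing codons."""
    cds_a, cds_b = cds_a.upper(), cds_b.upper()
    synonymous = non_synonymous = 0
    usable = min(len(cds_a), len(cds_b)) // 3 * 3  # ignore trailing partial codon
    for pos in range(0, usable, 3):
        codon_a, codon_b = cds_a[pos:pos + 3], cds_b[pos:pos + 3]
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            non_synonymous += 1
    return synonymous, non_synonymous


# HYPOTHETICAL toy sequences: AAA->AAG is silent (Lys), CTT->CCT changes Leu to Pro.
syn, nonsyn = count_exchanges("ATGAAACTT", "ATGAAGCCT")
print(syn, nonsyn, nonsyn > syn)
```

A gene would be flagged in the sense used above when the non-synonymous count exceeds the synonymous count; a rate-based Ka/Ks estimate would additionally normalize by the number of synonymous and non-synonymous sites.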
Proteins with ascribed function with more non-synonymous (Ka) than synonymous (Ks) codons
Different strains of the same Borrelia species can carry different sets of plasmids. The differing plasmid repertoire of the cells can be partly a function of the living conditions (34). Additionally, strains can lose parts of their equipment due to a lack of selection pressure (35). Since the primary assembly of the chromosomal reads without additional gap closure sequences resulted in 91% coverage of the chromosome, we may conclude that the plasmids are also represented in the same range of coverage. Using the whole-genome shotgun data, it was possible to assemble two individual plasmids of the B.garinii PBi strain completely (Table ). These two plasmids are highly similar and collinear to the linear plasmid lp54 and the circular plasmid cp26 of B.burgdorferi B31. The nearly two times higher coverage of one of these plasmids (lp54) compared to that of the chromosome indicates that it should be present in about two copies per cell. Compared to the chromosome, we find an equal number of base substitutions (8.9%) on the cp26 plasmid. Interestingly, at 15%, substitutions are twice as frequent on plasmid lp54. Most remarkably, the transition:transversion ratio on lp54 is 3:2, whereas that of the chromosome as well as that of cp26 is approximately 4:1 (Table ). The coding capacity of both plasmids is comparable to their counterparts in B.burgdorferi. Only three B.garinii lp54 genes predicted by GeneMarkS are orphans, whereas the majority of predicted genes (49 of 74) have orthologs in the B.burgdorferi plasmid: 22 of the predicted genes match as pairs to different parts of 11 B.burgdorferi genes, indicating nonsense mutations leading to split CDSs in B.garinii. On the other hand, 14 predicted B.burgdorferi genes have no counterparts on the B.garinii plasmid (Supplementary Table S5). None of these genes has an ascribed function. Interestingly, the B.burgdorferi lp54 gene family (BBA68 ) appears to be almost completely conserved in B.garinii, only BBA72 being split into two predicted genes. The analysis of the coding capacity of cp26 showed that all 26 predicted B.garinii genes have orthologs in B.burgdorferi; only three (BBB15 ) of the B.burgdorferi predicted genes are orphans (Supplementary Table S5). Interestingly, two cp26 encoded genes are subjected to a rapid positive selection: ospC is well characterized as outer surface protein, whereas BBB08 so far has no assigned function. On the other hand, only 17 of the 55 lp54 encoded proteins may be subjected to a neutral evolution or purifying selection.
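The copy number argument for lp54 rests on the ratio of read coverage between a plasmid and the chromosome. A minimal sketch of that reasoning follows, assuming the chromosome is present in one copy per cell and that both replicons are sampled equally well by the shotgun library. The chromosome coverage of 3.38 is taken from the text, whereas the lp54 coverage value is a hypothetical placeholder chosen to be nearly twice as high.

```python
# Sketch: relative copy number from read coverage.
# ASSUMPTION: the chromosome is single copy and cloning is unbiased.


def relative_copy_number(replicon_coverage, chromosome_coverage):
    """Return the coverage ratio of a replicon to the chromosome."""
    if chromosome_coverage <= 0:
        raise ValueError("chromosome coverage must be positive")
    return replicon_coverage / chromosome_coverage


chromosome_coverage = 3.38  # average chromosome coverage, from the text
lp54_coverage = 6.5         # HYPOTHETICAL value, "nearly two times higher"

estimate = relative_copy_number(lp54_coverage, chromosome_coverage)
print(round(estimate, 2), round(estimate))
```

Rounding the ratio to the nearest integer gives the approximate number of copies per chromosome equivalent, here two.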
The assignment of clusters of orthologous groups (COG) (36) to the predicted proteins is clearly different between the chromosome and the plasmids. Whereas 81.6% of the chromosome-encoded orthologous proteins can be assigned to a COG, only 53.9% (cp26) and 26.5% (lp54) can be categorized this way (Supplementary Table S6).
All other B.garinii plasmids are represented in our assembly as 37 contigs >2 kb comprising 239 kb. We here refer to these plasmid parts as variable plasmid segments (VPS). As is known from the assembly of the B.burgdorferi plasmids, there are redundant segments distributed over several plasmids (37). The same holds true for the B.garinii VPS. Some redundant regions containing polymorphisms could be assembled separately. Yet, the read coverage of some contigs is higher than that of the chromosome and cp26. Thus, it is very likely that these portions of the VPS represent paralogous sequences. Therefore, they cannot be assembled properly into individual plasmids. Accordingly, due to this non-unique nature of many segments in the plasmids, a clear 1:1 assignment to defined plasmids of B.burgdorferi is not possible.
A GeneMarkS gene prediction revealed 338 complete and truncated potential protein-coding genes on the VPS: 284 of these predicted genes have matches to predicted genes in the B.burgdorferi genome on protein level. Many, mainly small genes (117) show partial matches, but 167 predicted genes exhibit similar lengths in both genomes. One of these genes is a true ortholog to BB0844, which is encoded on the chromosome in B.burgdorferi. All other predicted genes are related to plasmid-encoded genes.
To get an overview of our B.garinii VPS assembly, we performed a BLAST search at the nucleotide level of all plasmid-derived contigs against a database of B.burgdorferi plasmids. This BLAST search revealed that 70% (167 kb) of the VPS are similar enough to B.burgdorferi plasmid sections to be detected (Figure ). The remaining sequences (73 kb) have no detectable similarity to B.burgdorferi plasmids on the DNA level. Yet, if we search for similarities on the protein level, we find matches (40–70% identity on the amino acid level) for all contigs to putative B.burgdorferi plasmid-encoded proteins. The contig with the lowest similarity to B.burgdorferi sequences is contig AY722928, which encodes a vls locus involved in antigenic variation in the mammalian host (39). This locus is located on lp28-1 in B.burgdorferi. Thus, all VPS sequences seem to be represented in B.burgdorferi.
We then asked, which part of the coding capacity of the B.burgdorferi plasmids is present in B.garinii. Since small predicted genes are often false positives, for a reliable comparison of the gene sets, we took into account only B.burgdorferi plasmid-derived proteins >100 amino acids, and searched for their counterparts in the whole B.garinii shotgun data. Only one protein each from plasmids lp5 (T), lp54 (Q), lp21 (U), cp32-3 (S), lp25 (E), lp28-1 (F), lp28-2 (G), lp28-3 (H) and lp28-4 (I) had no counterpart. The failure to detect these proteins in the B.garinii genome could be due to missing shotgun data. Interestingly, from plasmids lp38 (J) 15 of 21 and from plasmid lp36 (K) 9 of 24 proteins were not represented in the shotgun data (Supplementary Table S7). A more detailed inspection revealed that the predicted protein-coding regions from these two plasmids that have matches to the shotgun data belong to protein families. Members of these protein families are encoded also on different plasmids. These results taken together indicate that B.garinii PBi lacks the counterparts of plasmids lp38 and lp36.
Since the copy number of plasmid segments can affect the phenotype of Borreliae (40), we also analysed the data in this respect. According to the BLAST hits against B.burgdorferi proteins >100 amino acids, plasmids lp28-1 (F), lp28-3 (H), lp28-4 (I), lp17 (D), lp25 (E) and lp5 (T) are present in one copy per cell. Plasmid lp28-2 is represented by two independent segments in our assembly. Thus, it should exist as two slightly different copies in B.garinii. For the proteins from the highly redundant cp32 and lp56 plasmids, we observed between three and four copies each. Furthermore, since parts of these segments are identical, the assembly placed parts of these segments at a higher than average coverage. We thus estimate that this plasmid group is present in at least five copies. The plasmid cp9 encodes proteins similar to those of the cp32 plasmids, albeit with much lower similarities. Thus, we are not able to determine whether a counterpart of this particular plasmid belongs to the B.garinii