To study the pattern of gene degradation, we developed a program for the identification and visualization of positional pseudogenes in multiple, related species. Previously developed programs for the identification of strain-variable regions, such as tRNAcc [26] and Islander [27], are based on high-throughput systematic interrogation of tRNA and transfer mRNA genes, which act as hotspots for insertions of temperate phages and pathogenicity islands [28]. The tRNAcc approach identified 49 genomic islands in the vicinity of 18 tRNA genes in Escherichia coli, representing as much as 1.7 megabases of these genomes [26]. Islander was applied to the analysis of 106 bacterial genomes of different phylogenetic affiliation, with 95% of the islands identified in Firmicutes, and α- and γ-proteobacteria [27]. However, Islander failed to identify strain-variable regions in about half of the genomes examined, including all obligate pathogens and endosymbionts.
Yet another program for the identification of variable segments is IslandPath, which searches for horizontally acquired DNA by profiling GC contents and dinucleotide composition patterns of individual genes and clusters of genes [29]. However, because variability may not only be caused by gene acquisitions, but also by species-specific gene degradation processes, strain-variable segments will go unrecognized if nucleotide and codon usage statistics of the gene remnants temporarily appear normal. Thus, programs such as tRNAcc [26], Islander [27], and IslandPath [29] are not useful for the analysis of the genomes of intracellular bacteria in which the deteriorating ORFs are neither flanked by integrases or tRNAs nor exhibit atypical GC-content statistics.
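To make the limitation of composition-based detection concrete, the following is a minimal sketch of per-gene GC-content profiling. It is not IslandPath's actual implementation: the function names, the z-score rule, the cutoff of 1.5, and the toy sequences are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes: dict[str, str], z_cutoff: float = 1.5) -> list[str]:
    """Return IDs of genes whose GC content deviates from the mean over all
    genes by more than z_cutoff standard deviations (cutoff is an assumption)."""
    gc = {gene_id: gc_content(seq) for gene_id, seq in genes.items()}
    mu = mean(gc.values())
    sigma = pstdev(gc.values())
    if sigma == 0:
        return []
    return [gene_id for gene_id, value in gc.items() if abs(value - mu) / sigma > z_cutoff]


# Hypothetical toy sequences: five genes with typical composition, one of
# which stands in for a degraded remnant, plus one GC-rich acquired gene.
toy_genes = {
    "core_1": "AAATTTGGCA",
    "core_2": "AAATTTGGCA",
    "core_3": "AAATTTGGCA",
    "core_4": "AAATTTGGCA",
    "remnant": "AAATTTGGCA",
    "island_gene": "GGCCGGCCAT",
}
flagged = flag_atypical_genes(toy_genes)
print(flagged)
```

In this toy set only the GC-rich gene is flagged. The remnant has the same composition as the core genes and goes unnoticed, which is the situation described above for deteriorating ORFs in intracellular bacteria.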
The application of our program, GenComp, to a visualization study of eight complete genome sequences of Rickettsia has unveiled a mosaic of conserved core genes for basic cellular functions interspersed with segments containing complete and deteriorating genes that are variably present across species. The novelty of the findings reported here is that as much as 75% of the variable genes and pseudogenes in Rickettsia have no homologs in either O. tsutsugamushi or W. pipientis. Many of these are extensively degraded or represent mobile genetic elements and their associated genes. Although some genes such as those for cell wall biosynthesis may have been lost from the outgroup species, we believe that a majority of the variably present and deteriorating genes entered at the base of the Rickettsia lineage. A smaller subset seems to be circulating across some of the modern Rickettsia spp. in a manner that is inconsistent with vertical inheritance. Collectively, the results suggest that a substantial fraction of the variability in gene content and extent of deterioration is accounted for by horizontal gene transfers into the genus Rickettsia.
In total, we identified 688 R7 core genes and 1,160 ORF-clusters, with 469 of the ORF-clusters containing easily recognizable homologs in species outside the R7-class. Although the aim of this analysis was not to quantify the total number of ancestral genes in Rickettsia, our minimal and maximum estimates of 1,157 and 1,848 ORF-clusters in the rickettsial ancestor, respectively, are consistent with previous inferences of 1,252 to 1,650 ancestral genes [30]. Our visualization analysis has shown that the fragmentation process often involves segments with multiple pseudogenes, in accordance with the suggestion that lost genes are clustered more frequently than expected by chance [30]. With only a few exceptions, we found no tendency for genes with similar functions to be clustered in the same segment. Rather, the deletions appear to cover blocks of genes in some species whereas in others they are best explained by independent small deletions in neighbouring genes, putatively acquired by horizontal gene transfer.
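One reading of the counts given above reproduces both bounds: the lower bound adds to the core genes only those ORF-clusters with homologs outside the R7 class, and the upper bound adds all ORF-clusters. The text does not state this derivation explicitly, so the short check below is an interpretation; the variable names are illustrative.

```python
core_genes = 688                      # R7 core genes (from the text)
orf_clusters = 1160                   # all ORF-clusters (from the text)
clusters_with_outside_homologs = 469  # ORF-clusters with homologs outside the R7 class

# Lower bound: core genes plus ORF-clusters with recognizable outside homologs.
minimal_estimate = core_genes + clusters_with_outside_homologs
# Upper bound: core genes plus every ORF-cluster.
maximal_estimate = core_genes + orf_clusters

print(minimal_estimate, maximal_estimate)  # prints: 1157 1848
```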
Another observation is that the rate of evolution is related to the species distribution patterns such that genes present in fewer Rickettsia spp. tend to accumulate more substitutions than those present in more species. One explanation for a faster rate of evolution for horizontally transferred genes is positive selection and adaptation [31]. However, because we observed a correlation between limited species distribution patterns, high substitution frequencies, and small ORF sizes, we believe that most of the enhancement in the rate of evolution is due to degenerative processes.
For many of the heavily deteriorating genes, remnants were identified in the SFG Rickettsia but not in the TG Rickettsia. The presence of homologs in R. bellii suggests that these genes were acquired at the base of the Rickettsia, with subsequent loss in the TG Rickettsia. Hence, it is tempting to speculate that some of the gene acquisitions at the early stage of rickettsial evolution conferred functions that facilitated invasion and spread into novel arthropod hosts, or into multiple tissues of an already infected host. Upon subsequent host switches and/or niche adaptation, this early set of acquired genes may have become superfluous, leading to gene degradation and elimination. This is consistent with the observation that species with a restricted host range, such as R. prowazekii, exhibit extensive gene loss. Among the few genes present in the variable segments of the TG but not the SFG Rickettsia are duplicated genes for glucosyltransferases and enzymes involved in lipopolysaccharide biosynthesis.
Some of the variably present genes, mostly mobile elements such as transposons, conjugative transfer elements, and their associated genes, may represent recent gene transfers into individual lineages. The insertion of these at unique locations in the genome with no indications of remnants in any of the other species supports a recent integration rather than acquisition at the base of the lineage and loss in all other species.
The bias for loss of recently gained genes in Rickettsia is consistent with computational inferences of insertion/deletion rates based on gene presence/absence data in other bacterial species [32]. For example, a study of 13 completely sequenced genomes from Bacillus showed that there are more genes coming in and going out at the tips of the phylogeny than at the deeper nodes, suggesting that most of the laterally transferred genes are lost shortly after their insertion [33]. In our study, genes of unknown or general function prediction were found to be lost more frequently than expected by chance alone. A trivial explanation for the apparent high turnover rate of genes of unknown function at the tips of the tree may be false gene predictions. This is almost certainly one aspect of the problem; the visualization profiles of the Rickettsia genomes confirm that many short ORFs located in immediate proximity to each other (previously annotated as different genes) represent short fragments of one and the same gene. For example, the annotated R. conorii genes RC0215, RC0216, RC0217, and RC0218 are short fragments of the longer positional homolog RP174 in R. prowazekii (see Additional data file 1; Reg_id 5). However, false gene predictions cannot be the sole explanation because a more rapid deterioration of genes acquired at the base of the Rickettsia lineage was observed even when counting ORF-clusters with homologs to species outside the R7 class instead of individual ORFs. Thus, our results from Rickettsia suggest a low residence time for horizontally transferred genes of yet unknown function.
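The difference between counting individual ORFs and counting ORF-clusters can be sketched as a grouping of annotated ORFs by their shared positional homolog. This is an illustrative simplification rather than the GenComp procedure. The function name and data layout are assumptions, and the pair ORF_A/HOM_A is a hypothetical placeholder; only the RC0215 to RC0218 assignment to RP174 comes from the text.

```python
from collections import defaultdict


def group_fragments(orf_to_homolog: dict[str, str]) -> dict[str, list[str]]:
    """Group annotated ORFs that map to the same positional homolog."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for orf, homolog in orf_to_homolog.items():
        clusters[homolog].append(orf)
    return {homolog: sorted(orfs) for homolog, orfs in clusters.items()}


annotated = {
    "RC0215": "RP174",  # fragment assignments from the text
    "RC0216": "RP174",
    "RC0217": "RP174",
    "RC0218": "RP174",
    "ORF_A": "HOM_A",   # hypothetical intact gene with its own homolog
}
clusters = group_fragments(annotated)
print(len(annotated), "annotated ORFs ->", len(clusters), "ORF-clusters")
```

Counting by cluster treats the four R. conorii fragments as a single deteriorating gene, so false gene predictions of this kind no longer inflate the number of lost genes.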
Gene acquisition depends on the availability of mobile elements that can mediate the transfers, whereas the probability for retention is determined by the mutation bias along with selection and drift. Therefore, the transfer-deterioration process is expected to be particularly high in bacterial species that contain plasmids or are exposed to bacteriophages. Recent gene acquisitions in Rickettsia
appear to have been mediated by plasmids, discovered in R. felis
, R. monacensis
], and several additional Rickettsia
spp. isolated from ticks [35
]. Species that lack plasmids (for example, R. prowazekii
) exhibit a much lower incidence of recent gene acquisitions than the plasmid-bearing species R. felis.
Previous studies of the R. felis
plasmid genes revealed that 38 of the 68 pRF plasmid genes have no chromosomal homologs and are not present in the SFG Rickettsia
, although 18 of these show homology to other bacterial proteins [16
]. Plasmid genes with chromosomal homologs show mostly an evolutionary relationship with earlier diverging species, such as R. bellii
, although the chromosomal homolog may support the expected relationships with other Rickettsia
]. Our phylogenetic analysis of the tra
genes is fully consistent with this pattern, by showing that the conjugative system on the R. felis
plasmid diverged earlier than the chromosomally encoded tra
genes in R. bellii
and O. tsutsugamushi
. This was also observed in a recent phylogenetic analysis of the tra
cluster genes [25
]. Traces of the tra
genes could not be identified in the other five Rickettsia
spp., nor in W. pipientis.
Although some species of Wolbachia
infect the same arthropod host, these two genera do not share the same mobile gene pool; plasmids are the vehicle of choice in Rickettsia
whereas bacteriophages dominate in Wolbachia
. Thus, in contrast to free-living micro-organisms such as Escherichia coli
], species-specific ORFs are not derived from bacteriophages. However, the co-transferred genes encode proteins in the same broad families, such as ankyrin and TPR repeat proteins. The taxonomic distribution of species containing the most similar sequences outside the Rickettsiales was different for core genes and strain-variable ORFs, with a higher fraction of non-α-proteobacterial relatives for strain-variable ORFs. However, the phylogenetic analyses indicated fairly distant evolutionary relationships (data not shown), which might suggest transfers via yet unsequenced bacteria and their plasmids.
Taken together, the findings of our analysis suggest that the likelihood for a rickettsial gene that does not have homologs in Wolbachia to be degraded is at least three times as high as that for a gene that has been vertically inherited since the divergence of the three genera. Our estimate of deteriorating horizontal transfers in Rickettsia corresponds well with a global 'failed horizontal transfer index' of 2.3, which means that 'pseudogenes' are 2.3 times more likely to arise from horizontal transfer than vertically inherited core genes [1]. However, at the detailed level there are large inconsistencies between the two studies. Whereas we identified hundreds of pseudogenes and gene fragments in the R. conorii genome using GenComp, only nine were detected by the prokaryotic pseudogene pipeline, none of which was inferred to have been acquired by horizontal gene transfer [1]. The discrepancy in gene numbers and origin of acquisition suggests that comparative genomics methods are superior for the identification and analyses of pseudogenes and gene fragments.
The analysis presented here also explains the puzzling observation that the median size of R. conorii proteins is only 173 amino acids, making them the shortest of bacterial proteins, as compared with a median size of 267 amino acids estimated from 191,541 bacterial proteins [37]. We have estimated the median size of the R7 core proteins in R. conorii to be 284 amino acids, which is equal to the median size of R. prowazekii proteins and slightly higher than the global bacterial protein size estimate [37]. Thus, the previously estimated short size of the R. conorii proteins was markedly influenced by the short sizes of the many pseudogenes present in this genome. The overall variability in protein size among bacterial species has been attributed to different adaptations to stress, temperature, protection, and other environmental factors [37]. A simpler explanation is different proportions of pseudogenes and gene fragments (of different sizes) in the genome annotation lists. Because 2,300 of an estimated 6,895 candidate pseudogenes in prokaryotic genomes overlap with more than 2,600 annotated hypothetical ORFs [1], attempts to determine mean and median bacterial protein sizes from genome sequence data are likely to yield underestimates, unless the annotated gene lists have first been decontaminated for pseudogenes.
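The effect of pseudogene fragments on a median size estimate can be illustrated with a small calculation. All lengths below are hypothetical values invented for the example and are not taken from any genome annotation; only the direction of the shift is the point.

```python
from statistics import median

# Hypothetical protein lengths in amino acids (illustrative values only).
intact_lengths = [250, 284, 300, 410, 520]
fragment_lengths = [40, 55, 60, 75, 90, 110]  # short annotated ORFs from pseudogenes

median_intact = median(intact_lengths)
median_all = median(intact_lengths + fragment_lengths)

print(median_intact, median_all)  # prints: 300 110
```

When the fragments are left in the list, the median falls from 300 to 110 amino acids in this toy set. Removing them before computing the statistic restores the estimate for intact proteins.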