The results above represent a continuation of earlier work [1], but are limited to prokaryotic genomes. Previously [1], it was demonstrated that the distribution of RR and YR stretches in eukaryotes was very different from that in prokaryotes. That is, the distribution of YR and RR stretches in eukaryotic genomes deviates strongly from the Markov-based, short-range correlation model used for prokaryotes. The constraints responsible for the different distributions of RR and YR stretches in prokaryotic and eukaryotic organisms are not known, but may be attributable to the non-linear, multi-scaled and highly fractal organization of nucleotides in eukaryotic genomes, which is not observed in prokaryotes [10].
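To illustrate what a Markov-based, short-range expectation can look like, the sketch below fits a first-order Markov chain to a sequence reduced to the purine/pyrimidine (R/Y) alphabet and compares the expected number of windows matching a given pattern with the observed number. It is a minimal sketch under stated assumptions, not the model used in [1]: the function names (`to_ry`, `markov_expected_windows`, `observed_windows`), the chain order and the use of overlapping windows are all hypothetical choices made for illustration.

```python
from collections import Counter


def to_ry(seq: str) -> str:
    """Map A/G to 'R' (purine), C/T to 'Y' (pyrimidine), anything else to 'N'."""
    table = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
    return "".join(table.get(base, "N") for base in seq.upper())


def markov_expected_windows(ry: str, pattern: str) -> float:
    """Expected number of overlapping windows equal to `pattern` under a
    first-order Markov chain whose parameters are estimated from `ry`."""
    n, k = len(ry), len(pattern)
    if k == 0 or n < k or n < 2:
        return 0.0
    # Probability of the first symbol: its overall frequency in the sequence.
    prob = ry.count(pattern[0]) / n
    # Transition probabilities estimated from adjacent-symbol counts.
    pair_counts = Counter(zip(ry, ry[1:]))
    from_counts = Counter(ry[:-1])
    for a, b in zip(pattern, pattern[1:]):
        if from_counts[a] == 0:
            return 0.0
        prob *= pair_counts[(a, b)] / from_counts[a]
    return (n - k + 1) * prob


def observed_windows(ry: str, pattern: str) -> int:
    """Number of overlapping windows in `ry` that equal `pattern`."""
    k = len(pattern)
    return sum(1 for i in range(len(ry) - k + 1) if ry[i:i + k] == pattern)
```

Under these assumptions, a ratio of observed to expected windows above 1 for the pattern `"R" * 10` would indicate overrepresentation of 10 bp purine stretches, and a ratio below 1 for `"YR" * 5` would indicate underrepresentation of alternating stretches.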
Analyses of the distribution of RR and YR stretches in prokaryotic chromosomes (figures &) reveal that while YR stretches of 10 bp tend to be underrepresented relative to expectation, RR stretches are to a large extent overrepresented. For YR stretches this is true for all phyla except β-Proteobacteria, in which the GC-rich Burkholderia genus is found to have a larger fraction of YR stretches than any other genus (see Figure ). As has been noted earlier, YR stretches may form Z-DNA in GC-rich sequences, and Z-DNA is highly unstable in bacteria [7]. In general, YR stretches tend to be associated with genome arrangement and recombination [11]. In mammals, Z-DNA formation has been found to generate large genetic alterations possibly associated with certain types of cancer [7]. The observation that pathogenicity was a significant factor (p < 0.001) describing YR stretches in bacterial genomes was therefore of considerable interest.
Burkholderia species are also known to contain many CG repeats which are, in general, associated with Z-DNA formation [1]. Horizontal transfer and frequent DNA exchange are also common within the Burkholderia
]. The significance of the pathogenicity factor was reduced to t~2.0 (p < 0.05) when the entire Burkholderia genus, consisting of 32 chromosomes, and the extreme outlier Treponema pallidum were removed from the dataset. In contrast, when the fraction of RR stretches was substituted for the fraction of YR stretches as the response in the same model on the same dataset, the resulting significance was t = -0.4. The reduced dataset contained 194 pathogenic and 318 non-pathogenic chromosomes, while the main dataset included 222 pathogenic and 324 non-pathogenic chromosomes.
The finding that alternating pyrimidine/purine stretches of 10 bp or more are significantly associated with pathogenicity may indicate that YR tracts are positively correlated with genomic regions in bacteria that are susceptible to recombination or horizontal gene transfer, resulting in the acquisition of pathogenicity islands. The fact that YR stretches are underrepresented in prokaryotic genomes may suggest counter-selection against unstable regions. This is in stark contrast to what is observed in many eukaryotic organisms [1].
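For concreteness, the stretch definitions discussed here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used to produce the results: it assumes that purines are A/G and pyrimidines are C/T, that the same 10 bp minimum applies to both stretch types, that only the given strand is scanned, and that the fraction is the share of bases covered by qualifying stretches. The names `to_ry`, `find_stretches` and `stretch_fraction` are hypothetical.

```python
import re


def to_ry(seq: str) -> str:
    """Map A/G to 'R' (purine), C/T to 'Y' (pyrimidine), anything else to 'N'."""
    table = {"A": "R", "G": "R", "C": "Y", "T": "Y"}
    return "".join(table.get(base, "N") for base in seq.upper())


# Maximal purine runs of at least 10 bases (assumed threshold).
RR_PATTERN = re.compile(r"R{10,}")
# Maximal strictly alternating runs of at least 10 bases, starting with
# either a pyrimidine or a purine (assumed threshold).
YR_PATTERN = re.compile(r"(?:YR){5,}Y?|(?:RY){5,}R?")


def find_stretches(seq: str, kind: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of 'RR' or 'YR' stretches, end exclusive."""
    pattern = {"RR": RR_PATTERN, "YR": YR_PATTERN}[kind]
    return [match.span() for match in pattern.finditer(to_ry(seq))]


def stretch_fraction(seq: str, kind: str) -> float:
    """Fraction of the sequence covered by stretches of the given kind."""
    if not seq:
        return 0.0
    covered = sum(end - start for start, end in find_stretches(seq, kind))
    return covered / len(seq)
```

Because a purine run on one strand is a pyrimidine run on the complementary strand, a genome-scale analysis would also need to decide how the two strands are combined; that choice is left open in this sketch.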
Purine stretches are overrepresented in all phyla except for the γ-Proteobacteria, Bacteroidetes/Chlorobi and α-Proteobacteria groups. Actinobacteria and β-Proteobacteria are the only groups found to have a lower than expected fraction of purine stretches. From figures and it can be seen that fractions of RR stretches were most diversely distributed in archaea, while β-Proteobacteria had the most varied distributions of YR stretches. The over- and underrepresentation of RR and YR stretches is also presumed to be influenced by DNA helix preference [1].
Both models revealed several important factors associated with the respective distributions of RR and YR stretches. The best model, in terms of R², was obtained for the distribution of RR stretches. This implies that different factors may shape the distributions of RR and YR stretches in bacterial genomes, which is supported by the regression models finding different factors significant. While AT content, extreme halotolerance, oxygen requirement, and growth temperature were significant factors in the RR-based regression model, habitat and pathogenicity were found to be significant in the YR-based model. The phyla factor was significantly associated with both the RR- and YR-based regression models.
The model explaining RR stretches found oxygen requirement and growth temperature to be important and significant factors (p < 0.001). GC content has been associated with oxygen requirement in prokaryotes [15]. A slight, but significant (p < 0.001), improvement was obtained by adding the oxygen requirement factor to the RR-based regression model, but the addition of the growth temperature factor improved the model considerably. Why thermophilicity and halotolerance are linked with the distributions of purine tracts is not known, but RR stretches appear to be more stable than YR stretches [4]. Genomic GC content resists any linear association with growth temperature (p > 0.5 from our data, using a transformed regression model) [16]. However, the GC content of RNA genes has been found to correlate with growth temperature [18], and purine tracts are overrepresented in mRNA sequences of thermophilic prokaryotes [2]. The association between RR stretches and growth temperature was very clear compared to that between genomic AT content and growth temperature.
That AT content is an important factor for oligonucleotide frequencies has been noted previously [20]. To what extent AT content affects the distribution of RR stretches has, to the best of our knowledge, not been accurately described for prokaryotes (see Figure ). It has been observed that many bacteria from the AT-rich Firmicutes group tend to prefer purines on the leading strand [1]. Genomes having an overrepresentation of purine stretches on the leading strand have additionally been found to carry a PolC proof-reading enzyme [21]. It is therefore also possible that an excess of purine stretches is associated with the polC gene. More data are needed, however, before this can be examined further.
All of the regression models are affected by co-linearity. That is, several predictor variables overlap to some extent in terms of the variance they explain in the model. For instance, AT content has been found to correlate with genome size [22], and some co-linearity is also assumed between phyla and AT content. Therefore, the exact influence of the different predictors in the models cannot be precisely stated, and the primary function of the models presented is to identify significant influences as a starting point for further analysis.
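One common way to gauge co-linearity between two numeric predictors, such as AT content and genome size, is the Pearson correlation and the variance inflation factor (VIF) derived from it. The sketch below is illustrative only and is not part of the analysis reported here; `pearson_r` and `vif_two_predictors` are hypothetical helper names, and the VIF formula shown, 1 / (1 - r²), holds only for the special case of a model with exactly two predictors.

```python
from math import inf, sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)


def vif_two_predictors(x: list[float], y: list[float]) -> float:
    """Variance inflation factor when x and y are the only two predictors."""
    r = pearson_r(x, y)
    if abs(r) >= 1.0:
        return inf  # perfectly co-linear predictors
    return 1.0 / (1.0 - r * r)
```

A VIF close to 1 means the two predictors carry largely independent information, while large values mean that their individual coefficients are poorly determined, which is the situation described above.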
Overrepresentation of YR stretches in Xanthomonas oryzae MAFF 311018 is found to be associated with transposons and an 'RND complex' [9], both of which are connected to mobile genetic elements and horizontal transfer. The RND complex is also found in many other bacteria, and the associated outer membrane protein found in the Xanthomonas oryzae MAFF 311018 genome is presumably promiscuous [23]. Thus, preliminary analysis indicates that YR stretches may play some role in the life of mobile genetic elements, and that this may be the link to pathogenicity that we found.