All genomic DNA sequences were obtained from the NCBI genome database [29] together with information about the different organisms. Additional information can also be found in additional file 2.
The computer programs used to generate the results were implemented as described below. The following notation will be used throughout:
Let (w1w2...wn)i represent an oligonucleotide (n-mer), with 1 ≤ i ≤ N = 4^n possible combinations. The function
gives the overlapping empirical frequency of the oligonucleotide (w1w2...wn)i with respect to the DNA sequence Z = {w1w2...wS}, where S is much larger than n.
This means that:
The hexanucleotide-based relative abundances can then be calculated as follows:
where 1 ≤ i ≤ N = 4^n.
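As an illustration of the quantities defined above, the following Python sketch counts overlapping n-mers and divides each observed frequency by the product of the mononucleotide frequencies of its constituent nucleotides. This form of the relative abundance is an assumption based on the description of the denominator given for the genomic signature. The function names are hypothetical and are not taken from the programs used in this study.

```python
from collections import Counter
from math import prod


def overlapping_frequencies(seq, n):
    """Overlapping empirical frequency of every n-mer that occurs in seq."""
    seq = seq.upper()
    windows = len(seq) - n + 1  # number of overlapping positions
    if windows <= 0:
        raise ValueError("sequence must be at least n nucleotides long")
    counts = Counter(seq[k:k + n] for k in range(windows))
    return {word: count / windows for word, count in counts.items()}


def relative_abundances(seq, n=6):
    """Observed n-mer frequency over the product of mononucleotide frequencies.

    Assumed form; words that never occur are simply absent from the result
    (their relative abundance is zero). Ambiguous bases are not filtered.
    """
    mono = overlapping_frequencies(seq, 1)
    freqs = overlapping_frequencies(seq, n)
    return {w: f / prod(mono[c] for c in w) for w, f in freqs.items()}
```

For a genome-sized sequence, the dictionary returned for n = 6 holds at most 4^6 = 4096 entries.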
The genomic signature is then found by comparing two genomic DNA sequences with the Pearson correlation formula:
N = 4^n designates the total number of possible DNA word combinations, with
and
The nucleotides wl, 1 ≤ l ≤ 6, in the denominator of equations (4) and (5), are the corresponding nucleotides in the ith hexanucleotide w1w2w3w4w5w6.
The following formulas
represent the average hexanucleotide relative abundance values.
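The comparison described above can be sketched in Python as a Pearson correlation taken over all N = 4^n possible words. The sketch assumes that each genome is represented by a dictionary mapping words to relative abundances, and that words absent from a genome have a relative abundance of zero. The function name and the input layout are hypothetical.

```python
from itertools import product
from math import sqrt


def pearson_signature(rho_x, rho_y, n=6):
    """Pearson correlation between two relative abundance profiles."""
    words = ["".join(p) for p in product("ACGT", repeat=n)]  # all 4^n words
    x = [rho_x.get(w, 0.0) for w in words]
    y = [rho_y.get(w, 0.0) for w in words]
    mean_x = sum(x) / len(x)  # average relative abundance, genome X
    mean_y = sum(y) / len(y)  # average relative abundance, genome Y
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either profile is constant.
    return cov / sqrt(var_x * var_y)
```

Applying the function to every pair of genomes yields a symmetric correlation matrix with ones on the diagonal.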
Hierarchical clustering based on Euclidean distance was performed on the resulting symmetric 867 × 867 correlation matrix. Average linkage was used to put emphasis on the closest matches based on group similarities.
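A minimal sketch of average-linkage clustering is given below. It assumes that each genome is described by its row of the correlation matrix and that Euclidean distances are taken between rows. It is a naive implementation for illustration only, not the software used in this study, and it would be slow on a matrix of the size analysed here.

```python
from math import dist


def average_linkage(rows):
    """Agglomerative clustering with average linkage.

    Returns the merges in order as (members_a, members_b, distance).
    """
    n = len(rows)
    # Euclidean distance between every pair of rows.
    d = {(i, j): dist(rows[i], rows[j])
         for i in range(n) for j in range(i + 1, n)}
    clusters = {i: [i] for i in range(n)}

    def linkage(a, b):
        # Mean of all pairwise distances between members of the two clusters.
        total = sum(d[min(p, q), max(p, q)]
                    for p in clusters[a] for q in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    merges = []
    next_id = n
    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = ((x, y) for k, x in enumerate(ids) for y in ids[k + 1:])
        a, b = min(pairs, key=lambda pair: linkage(*pair))
        merges.append((sorted(clusters[a]), sorted(clusters[b]), linkage(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Cutting the resulting merge list at a chosen distance gives cluster groups of the kind used as the response variable in the regression analysis.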
Oligonucleotide usage variance (OUV) can be considered as a measure of oligonucleotide frequency bias, or selection pressure on the genomic DNA composition, and was calculated according to the given formula for each chromosome:
The function M0((w1w2...wn)i) approximates oligonucleotide frequencies with the corresponding mononucleotide frequencies:
The formula implicitly assumes that each nucleotide in the approximated n-mer is independent of the neighbouring nucleotides. In addition, equation (7) assumes that genomic oligonucleotide frequencies are only influenced by AT content, which means that low values can be interpreted as random mutations carrying little or no information. High variance values, on the other hand, mean that substantial information is carried by the oligonucleotide being approximated.
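The mononucleotide approximation can be sketched in Python as the product of the frequencies of the individual nucleotides in a word, which follows from the independence assumption stated above. The variance computed below is only an assumed form of OUV, taken here as the sample variance of the ratio between observed and approximated frequencies over all 4^n words; the formula actually used in this study may differ. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod


def mononucleotide_approximation(word, mono):
    """Approximate a word frequency assuming independent nucleotides."""
    return prod(mono[c] for c in word)


def oligonucleotide_usage_variance(seq, n):
    """Assumed OUV: sample variance of observed / approximated frequencies."""
    seq = seq.upper()
    base_counts = Counter(seq)
    # Assumes seq contains only A, C, G and T.
    mono = {c: base_counts[c] / len(seq) for c in "ACGT"}
    windows = len(seq) - n + 1
    counts = Counter(seq[k:k + n] for k in range(windows))
    ratios = []
    for p in product("ACGT", repeat=n):
        word = "".join(p)
        expected = mononucleotide_approximation(word, mono)
        if expected > 0:  # skip words containing a nucleotide absent from seq
            ratios.append((counts[word] / windows) / expected)
    mean = sum(ratios) / len(ratios)
    return sum((r - mean) ** 2 for r in ratios) / (len(ratios) - 1)
```

Under this assumed form, a sequence whose word frequencies match the mononucleotide approximation exactly would give a variance of zero.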
Linear regression analysis was performed between OUV for di-, tetra-, and hexanucleotide frequencies (response variable) and genomic AT content (predictor variable) using log transformation. R^2 designates the coefficient of determination, expressed as a percentage.
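The regression step can be sketched as an ordinary least-squares fit in Python. The text does not state which variable was log-transformed, so the sketch assumes that the natural logarithm is applied to the response (OUV) only. The function name is hypothetical.

```python
from math import log


def fit_log_linear(at_content, ouv):
    """Least-squares fit of log(OUV) on AT content.

    Returns (slope, intercept, R^2 in percent).
    """
    x = list(at_content)
    y = [log(v) for v in ouv]  # assumed: log transform of the response only
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, 100 * r_squared
```

One fit would be run per word length (di-, tetra- and hexanucleotides), each with its own R^2.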
A conditional logistic multinomial (polychotomous) regression model was fitted to assess the individual influences of the predictors genome size, AT content, OUV, phyla, oxygen requirement, habitat, growth temperature and pathogenicity, with the cluster groups as the response variable. The AIC and McFadden R^2 statistics were used as indicators of the quality of the fitted model. The following multinomial logistic regression model was run in the statistical program R using the package nnet:
The response variable "Groups" is a categorical variable consisting of the different cluster groups (see Figure ). The predictors Phyla, Oxygen, Habitat and Growth temperature were also categorical factors, while Size, AT and OUV were numerical factors. The Oxygen factor consisted of the categories: aerobic, anaerobic and facultative. Habitat consisted of the categories: host-associated, multiple, specialized, terrestrial, and aquatic, while the growth temperature factor consisted of the following categories: psychrophilic, mesophilic and thermophilic. This information was taken from the NCBI website
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi. The regression model converged after 220 iterations. Assessment of statistical significance was carried out with the car package.
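The two model quality indicators mentioned above follow directly from log-likelihoods, as the Python sketch below illustrates using their standard definitions. It assumes that the log-likelihood of the fitted model and its number of estimated parameters are already available from the fitting software; the null model is taken to be the intercept-only multinomial model. The function names are hypothetical.

```python
from collections import Counter
from math import log


def null_log_likelihood(labels):
    """Log-likelihood of an intercept-only multinomial model."""
    counts = Counter(labels)
    n = len(labels)
    # Each class is predicted with its observed proportion.
    return sum(c * log(c / n) for c in counts.values())


def aic(log_likelihood, n_parameters):
    """Akaike information criterion; lower values indicate a better model."""
    return 2 * n_parameters - 2 * log_likelihood


def mcfadden_r2(log_likelihood_model, log_likelihood_null):
    """McFadden pseudo R^2: 1 minus the ratio of model to null log-likelihood."""
    return 1 - log_likelihood_model / log_likelihood_null
```

A McFadden R^2 close to zero indicates that the predictors add little beyond the observed group proportions.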
All regression models were statistically significant with the significance level set to p < 0.001.