Chromosomes, plasmids and phages were downloaded from the NCBI website (http://www.ncbi.nlm.nih.gov/genome/), while the GIs were downloaded from the Islandviewer website (http://www.pathogenomics.sfu.ca/islandviewer/query.php). Only DNA sequences larger than 10 kb were considered due to limitations of the method. Single-copy orthologs were assigned by OrthoMCL [37] for the genomes of Mycobacterium tuberculosis F11 (CP000717.1), M. tuberculosis H37Ra (AL123456.2), M. leprae Br4923 (FM211192.1) and M. leprae TN7 (AL450380.1). Statistical analyses were carried out with R (http://www.r-project.org/), which was also used to create all figures except the BLAST atlas (Figure ). The BLAST atlas was made using CBS in-house software [24, 38].
The Kullback-Leibler divergence (DKL, also referred to as the relative entropy) is a measure of the difference between two discrete probability mass functions [18]. Let s be a DNA sequence, and let z1, ..., z256 be all possible tetramers over the DNA alphabet (4^4 = 256). The observed frequency of tetranucleotide zi in DNA sequence s is written as O(zi|s). The expected frequency of zi in s, found using a zero-order Markov model, is written as E(zi|s). The KL divergence for the sequence s is given as:

$$D_{KL}(s) = \sum_{i=1}^{256} O(z_i \mid s) \, \log \frac{O(z_i \mid s)}{E(z_i \mid s)}$$
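As an illustration of the definition above, the following is a minimal Python sketch and not the in-house software used in this study. It assumes that the zero-order Markov expectation E(zi|s) is the product of the single-nucleotide frequencies of s, that only one strand is counted, that windows containing non-ACGT symbols are skipped, and that the logarithm is taken to base 2; the function name `tetramer_dkl` is hypothetical.

```python
from collections import Counter
from math import log2, prod


def tetramer_dkl(seq, k=4):
    """Relative entropy of observed k-mer frequencies versus a
    zero-order Markov expectation (assumed: product of base frequencies)."""
    seq = seq.upper()
    alphabet = "ACGT"

    # Single-nucleotide frequencies, ignoring ambiguous symbols such as N.
    base_counts = {b: seq.count(b) for b in alphabet}
    n_bases = sum(base_counts.values())
    base_freq = {b: base_counts[b] / n_bases for b in alphabet}

    # Overlapping k-mer counts; windows with non-ACGT symbols are dropped.
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(alphabet)
    )
    total = sum(counts.values())

    dkl = 0.0
    for word, count in counts.items():
        observed = count / total
        expected = prod(base_freq[b] for b in word)
        # k-mers that never occur contribute nothing (0 * log 0 := 0).
        dkl += observed * log2(observed / expected)
    return dkl
```

In this sketch a homopolymer gives a value of zero, because the observed tetramer frequencies are exactly what the base composition predicts, whereas a strictly periodic sequence gives a large value.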
A lower DKL is interpreted to mean that the DNA sequence s carries less information potential, owing to weaker dependence between the nucleotides within the corresponding tetranucleotides. Conversely, a higher DKL is taken to mean that the DNA sequence carries more information potential (higher relative entropy), since the nucleotides within the corresponding tetranucleotides are more dependent on each other. The OUV measure [17], described in the Discussion section and compared to relative entropy, is calculated as follows (O, E, zi and s are the same as above):
Although the OUV measure is similar to relative entropy, we use the latter here due to the larger theoretical framework and tools available from information theory [12, 18].
Comparisons between DKL and factors such as phyla, AT content, DNA sequence size, etc. were carried out using linear regression with transformations applied to correct for non-normality where needed.
DKL was computed for each DNA sequence (chromosome, plasmid, phage and GI) and compared to AT content, size and phyla using linear regression:
For comparisons between chromosome, plasmid, GI and phage size (Y = Ysize) versus DKL (XKL) no transformation was used.
To examine the relationship between DKL, DNA sequence size and AT content for bacterial chromosomes and plasmids, a linear regression model was used without transformations on the response:
For the linear regression with DKL as outcome (Y = YKL) and AT content as predictor (X = XAT), a log transformation was applied:
Several transformations were used to assess associations between chromosome, plasmid, phage and GI size (YSize) vs AT content (XAT) using the following regression equation:
With sequence size as the response, a square root transformation was used for chromosomes, a log transformation for both phages and plasmids, and an inverse (1/YSize) transformation for GIs.
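The size transformations described above can be sketched as follows. This is an illustrative Python example rather than the R code used for the analyses; it assumes a simple linear model of the form t(YSize) = b0 + b1 XAT + error, fitted by ordinary least squares, and the names `TRANSFORMS`, `fit_line` and `fit_size_vs_at` are hypothetical. The natural logarithm is assumed for the log transformation.

```python
from math import log, sqrt

# Response transformation per sequence type, as stated in the text.
TRANSFORMS = {
    "chromosome": sqrt,
    "plasmid": log,
    "phage": log,
    "gi": lambda size: 1.0 / size,
}


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_size_vs_at(kind, at_content, sizes):
    """Regress the transformed size on AT content for one sequence type."""
    transform = TRANSFORMS[kind]
    return fit_line(at_content, [transform(s) for s in sizes])
```

Because the coefficients are estimated on the transformed scale, they describe the change in the square root, logarithm or inverse of size per unit of AT content, not the change in size itself.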
Comparisons of DKL between chromosomes, plasmids, phages and GIs, as seen in Figure , were carried out using the non-parametric Wilcoxon (Mann-Whitney) test due to skewed (but similar) distributions.
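For readers unfamiliar with the test, a minimal Python sketch of the Mann-Whitney U statistic is given here. It is not the R routine used in this study: it assigns mid-ranks to ties and derives a two-sided p-value from the normal approximation without continuity or tie correction, which is a simplifying assumption suited only to larger samples. The function name `mann_whitney_u` is hypothetical.

```python
from math import erfc, sqrt


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value by normal approximation)."""
    # Pool the samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    n = len(pooled)

    rank_sum_a = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    n1, n2 = len(a), len(b)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2

    # Normal approximation (no continuity or tie correction in this sketch).
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2))
    return u_a, p_value
```

A U value near 0 or near n1 × n2 indicates that one group's values are almost all ranked below the other's.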
All statistical results presented were statistically significant with p < 0.001, unless otherwise stated in the text.
All DKL measurements of DNA sequences were carried out using in-house software. The profiles measuring DKL changes within bacterial chromosomes, as seen in Figure , were computed using non-overlapping sliding windows of 5 kbp, with each window compared to the average chromosomal DKL.
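The windowed profile can be sketched in Python as follows. This is an illustration under stated assumptions, not the in-house software: the input is assumed to contain only uppercase A, C, G and T, the expected tetramer frequencies are taken as products of the base frequencies within each window, the logarithm is base 2, a trailing partial window is dropped, and the "average chromosomal DKL" is taken to be the mean over all windows. The names `dkl` and `dkl_profile` are hypothetical.

```python
from collections import Counter
from math import log2, prod


def dkl(seq, k=4):
    """Relative entropy of k-mers in seq (assumes only A, C, G, T)."""
    freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    words = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(words.values())
    return sum(
        (c / total) * log2((c / total) / prod(freq[b] for b in w))
        for w, c in words.items()
    )


def dkl_profile(seq, window=5000):
    """DKL per non-overlapping window, as deviation from the window mean.

    Returns a list of (window start, DKL of window minus mean DKL).
    """
    starts = range(0, len(seq) - window + 1, window)
    values = [dkl(seq[s:s + window]) for s in starts]
    mean = sum(values) / len(values)
    return [(s, v - mean) for s, v in zip(starts, values)]
```

Positive deviations mark windows whose tetranucleotide usage departs more strongly from the base composition than is typical for the chromosome; negative deviations mark the opposite.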