Genome Sequencing and Annotation
We analyzed whole-genome sequences from eleven Bp strains, comprising ten clinical isolates from four countries (Australia, Thailand, Singapore, and Vietnam) and one soil isolate (S13) from Singapore. To achieve maximal genetic diversity, we elected to analyze all Bp strains regardless of their source of isolation (clinical or environmental). Notably, environmental Bp isolates have also been shown to exhibit high levels of virulence in animal models 
. Among the clinical isolates, strain pairs 1106a–1106b and 1710a–1710b were isolated from the same patients during either primary infection or disease relapse (Table S1
). Reflecting the genetic diversity in this panel, the Bp isolates belong to different multi-locus subtypes (MLST) with an overall MLST allele/subtype ratio of 2.67, markedly higher than the allele/subtype ratio of the general Bp population (0.43, as of Jan 2009). Ten genomes were sequenced by conventional Sanger based shotgun methods (coverage range 7.75x – 11.4x), while strain Bp 22 was sequenced using next-generation instrumentation (GS20-454, average read length 100 bp, 20× coverage) followed by de novo
assembly using a custom 454 large-insert paired-end sequencing protocol (CN and YR, manuscript in prep). The genome sequences were uniformly annotated by a FGENESB gene prediction pipeline 
, and predicted protein-coding regions, tRNAs, rRNAs, and potential promoters, terminators and operons were identified. Predicted genes were comprehensively annotated against known proteins in the NR, COG, KEGG and STRING databases (details in Methods
). All genomes revealed similar benchmark data such as genome size, GC content, and numbers of predicted genes ().
Genome Statistics of Sequenced B. pseudomallei Strains.
An Updated Bp Annotation Reveals Additional Genomic Complexity
Our comparative analysis allowed us to revisit the original 2004 genome analysis with updated annotation protocols. Our annotation pipeline identified 6332 protein coding genes in Bp K96243 (Datasets S1
), a considerably higher number (~10%) than the 5855 genes originally described 
. The vast majority (90%) of genes, however, were commonly identified in both annotation pipelines (), indicating that differences in the two annotation sets are likely due to subtle differences in the prediction algorithms used 
(FGENESB vs GeneMark/Glimmer). Deciding to investigate these previously unreported genes, we sought to distinguish between likely bona-fide
new genes and those arising due to computational over-prediction (false positives). We manually curated a set of 519 novel predicted genes exhibiting non-overlapping start-stop boundaries to the previously reported genes (see for an example), and subjected the 519 putative novel genes to three independent lines of analysis (mRNA transcript information, homology to previously reported genes, and presence of ribosomal binding sites, RBSs).
Experimental Re-annotation of the Bp Genome.
First, using whole genome tiling microarrays covering the entire non-repetitive Bp K96243 genome, we identified transcription units from Bp cultures isolated from six distinct growth conditions (see Methods
). Confirming the accuracy of the microarray, many mRNA transcripts were tightly associated with the boundaries of previously-identified genes (Figure S2
). Of the 519 novel genes, we found that 280 (53%) were associated with discrete mRNA transcripts. 178 novel genes exhibited mRNA transcripts in at least 1 out of 6 different growth conditions, indicating that they are differentially-regulated (), while the remaining 102 were constitutively expressed across the six conditions. The presence of several novel gene transcripts was also directly confirmed by targeted RT-PCR assays (Figure S3
). To investigate if any of the novel genes might correspond to non-coding RNAs (ncRNAs), we used Rfam, a public database of non-coding RNA families 
, to identify ncRNAs in the BpK96243 reference genome. Of 82 small ncRNAs identified by Rfam analysis, 8 ncRNAs corresponded to the novel genes.
Second, using matching criteria similar to other studies 
), approximately 46% of the novel genes (239) were associated with at least one other matching protein in the COG, KEGG, STRING and NR databases (, 
). 138 novel genes had matching proteins previously observed in other Bp strains, and 97 novel genes had matches to other Burkholderia
species. A small fraction (~1%) exhibited homology to other non-Burkholderia
species (eg Xanthomonas oryzae
pv. oryzae MAFF, Sodalis glossinidius
Third, using the RBSfinder program 
, we checked the novel genes for the presence of ribosome binding sites (RBS). The ability of RBSfinder to detect true RBSs in the Bp genome was confirmed by benchmarking the numbers of RBS predictions using previously-identified Bp genes against a set of background randomized sequences 
). Of the 519 novel genes, we identified high-confidence RBSs in 309 genes (59.5%), without requiring alteration of the predicted gene start/stop coordinates.
Combining these three lines of supporting evidence (mRNA transcripts, database matches, presence of RBS), we identified 282 novel genes supported by two lines of evidence (“dual evidence genes”), and 81 novel genes supported by all three lines (Table S2
). A comparison of compositional features (length, G+C content, CAI, hydrophobicity 
) between the 282 dual evidence genes and 5728 protein-coding genes from the original 2004 annotation revealed striking differences in gene length between the sets (average gene length 98±56 aa vs 348±307 aa between novel and 2004 genes, p
) (). Significant differences in G+C content, CAI, and hydrophobicity were also observed (eg G+C content 0.63±0.1 vs 0.68±0.05, p
) (Table S3
). Interestingly, some of these latter compositional differences might be indirectly related due to the short lengths of the novel genes, as significant G+C content, CAI, and hydrophobicity differences were also observed when a set of “short length” genes from the original annotation (<200 aa) were compared against the entire 5728 set (Table S3
). Because compositional differences can often influence gene prediction accuracy 
, it is possible that some of these differences might have contributed to the novel genes being missed in the original annotation. To facilitate integration with existing genome features, we assigned identities to the 282 novel genes based on their proximity to existing genes (eg BPSL2192.1
) (Table S2
We also investigated the 120 genes missed in the current gene prediction analysis but identified by the previous 2004 genome annotation (Table S4
). Of these 120 genes, 87 genes (73%) were categorized either as “doubtful CDs”, “gene remnants”, or “pseudogenes” in the original 2004 annotation, indicating that these genes were likely regarded as ambiguous in the previous annotation as well. Of the remaining 33 genes, 21 genes encode hypothetical proteins while another 6 appear to have bacteriophage origins that may contain coding signals distinct from the rest of the Bp genome. The ambiguous nature for three-quarters of these genes, coupled with presence of atypical coding signals, provides the most likely explanation for their failure to be detected by the current automated prediction pipeline.
The availability of multiple Bp genomes also permitted the analysis of pseudogene dynamics within a species. Of 26 previously-described pseudo-genes in Bp K96243 
, at least 6 were ‘resurrected’ in >6 other Bp genomes. For example, the BPSL2828
pseudo-gene exhibits a premature truncation due to a stop codon at position 107 (TGG → TGA). This mutation, however, was only observed in Bp K96243 and Bp Pasteur 52237; while the other 9 Bp genomes had an extended gene sequence to position 147 (Figure S4
). The differential presence of multiple pseudogenes across the Bp strains suggests that pseudogene formation in Bp is likely to be an active and highly dynamic process, consistent with its role as a recently evolved pathogen.
Comparative Analysis of the Bp Core Genome
An analysis of gene orthologs across the Bp genomes identified a BpCG of 4908 genes present in all 11 strains (, 
), with slight variations in individual genomes due to the presence of gene duplications and paralogs (range 5049–5139 genes). Similar core genome estimates were obtained when the analysis was confined to the nine independently derived isolates (Figure S5
). We confirmed the robustness of this BpCG estimate using the method of Tettelin et al 
. An evolutionary comparison of the BpCG against two closely related Burkholderia
species with highly distinct niches - B. mallei
ATCC23344 (Bm), a intracellular pathogen specific to horses 
, and B. thailandensis
E264 (Bt), a non pathogenic, environmental bacterium 
, defined a common set of ~3616 genes found in all three species (). 270 out of 335 genes are common to Bp and Bm with no orthologs in Bt, while 641 out of 769 genes are common to Bp and Bt with no ortholog in Bm. Besides the core genes, gene accumulation curves also project the global gene repertoire of Bp (the Bp pangenome) to be ~7,500 genes (), a number close to 1.5x the size of the Bp core genome. A detailed analysis of the Bp pangenome will be described elsewhere.
Comparative Analysis of the Bp Core Genome.
Genetic Variation in the Bp Core Genome
To survey the landscape of genetic variation in Bp, we focused on a high quality ortholog set of 4673 BpCG genes (one orthologous gene per genome with >50% sequence similarity, each member exhibiting positional conservation to every other member, and excluding paralogs). We catalogued single-nucleotide polymorphisms (SNPs) and insertion/deletion sequences (indels) in the BpCG. Each Bp strain exhibited an average of ~8594 SNPs compared to the K96243 reference genome, resulting in an overall SNP/Kb frequency of ~2.0 for BpCG genes, while indels account for 0.1% and 0.3% of the total genetic variation in chromosomes 1 and 2 respectively. We confirmed the reliability of the genetic variation data by several methods. First, we confirmed by targeted resequencing >100 randomly-selected SNPs and 25 randomly-selected indels (data not shown). Second, 83% of identified SNPs are either (a) recurrently observed across multiple genomes (Table S5
, or (b) observed in Bp genomes of particularly high sequence quality (1106a, 1710b, 22, K96243 and 406e) (Table S5
). Third, the SNP distributions are entirely consistent with geographic models in that strains with the highest levels of genetic variation compared to K96243 were observed in isolates from Australia, the most geographically distant locale (). This is consistent with previous proposals that strains from Australia are genetically distinct from their Asian counterparts 
and form an ancestral population 
. The existence of a deep genetic distinction between the South East Asian and Australian strains was further supported by phylogenetic analysis of 14,544 shared orthologous SNPs across 23 Bp genomes (including the genomes analyzed in this study), and also by an MLST population structure analysis involving >1800 Bp strains (647 sequence types) (Figure S6
Genetic Variation in the Bp Core Genome.
Among the clinical isolates, strain pairs 1106a–1106b and 1710a–1710b were isolated from the same patients during either primary infection or disease relapse, with intervening periods of approximately three years (Table S1
). Surprisingly, a comparison of the primary and relapse strain genomes in both pairs failed to reveal a significant number of newly acquired mutations in relapsed strains (4 variants in 1106a vs 1106b, 6 variants in 1710a vs 1710b, none recurrent between both pairs) (Table S6
). This lack of genetic variation between the primary and relapsed strains suggests that the former may have remained dormant in the human host during this intervening period, supporting the notion that that the Bp genome is likely to exhibit a high degree of stability during in vivo
infection and persistence.
Positive Selection in the Bp Core Genome
To assess the functional implications of BpCG variation, we divided the BpCG SNPs into subsets predicted to cause either synonymous (Ks
) or nonsynonymous (Ka
) nucleotide substitutions. The Ks
rate was similar between Bp Chr 1 and 2, indicating comparable levels of background genetic diversity between the two chromosomes. However, the Ka
rate of Chr 2 was significantly higher than Chr 1 (P
, unpaired t-test, under a one-ratio model (M0) assuming a constant Ka
ratio, ), indicating that BpCG genes on Chr 2 are experiencing a higher degree of functional substitution than Chr 1. These chromosomal differences support the model of Holden et al 
that Chr 1 of Bp represents the ancestral chromosome, with genes primarily related to housekeeping functions while Chr 2 contains genes involved in accessory functions and secondary adaptation.
We identified BpCG genes with signatures of positive selection using established methods 
). A maximum likelihood analyses was performed on each Bp core gene to detect coding sequence sites displaying features of differential selective pressure (positive selection) using two different likelihood ratio (LR) models (M1a-M2a, or M7-M8). Out of 4673 genes, Model M1a-M2a was significant for 212 genes, while model M7 -M8 test was significant for 239 genes (Ka
>1; ~2% FDR; P<0.001, LR Test). In total, 211 genes were commonly identified by both models as being positively selected (Table S7
). Consistent with these 211 genes exhibiting above-background rates of functional variation (median Ka
60.07 and P<0.001, LR Test), the average Ks
value of the 211 positively selected genes was similar to the Ks
value of non-PS genes (Ks
0.2 for PS and non-PS genes, p
0.56), while in contrast, Ka
, the rate of non-synonymous substitution was 3 times greater in the positively-selected genes compared to genes under neutral selection (p
, t-test). The Ka
value of the positively selected genes was also markedly higher compared to seven housekeeping genes typically used in MLST analysis (ace
) (P<0.001, LR Test). A significantly greater fraction of positively-selected genes were identified on Chr 2 than Chr 1 (P
test, 10000 simulations). These observations suggest that a significant proportion of the Bp core genome (~4.5%) may be under positive selection.
We investigated whether the elevated Ka
rate of the 211 positively selected genes might be due to mutation or recombination between the genomes in this strain panel. All 4673 core genome alignments were tested for the potential presence of recombination using two different methods (GENECONV 
, and the Pairwise Homoplasy Index (Phi)) 
. Combining both methods, 56 out of 4673 core genes were identified as exhibiting a recombination signature. Of these 56, only 3 belong to the 211 positively selected genes, indicating that only a relatively minor component of the 211 genes are associated with a recombination signature. We also assessed rho/theta, the recombination/mutation ratio, of the Bp genomes analyzed in this study 
. Using the Clonalframe algorithm 
, an inspection of 4294032 variation sites estimated rho/theta to be 0.012–0.015 (95% credibility region) for Chr 1 and 0.015–0.019 for Chr 2 respectively. This low value suggests that mutation rather than recombination appears to be the predominant evolutionary process explaining the patterns of genetic variation observed in the current panel of Bp strains.
Consistent with the BpCG responding to multiple selective pressures, the positively selected genes were widely dispersed across a wide variety of functions, including metabolic processes, membrane functions, signal transduction, and gene expression regulation (). A functional category analysis subsequently revealed that positively selected genes in the Bp core genome were significantly enriched in COG categories related to secondary metabolism (P
0.036) and carbohydrate metabolism (P
0.01, binomial test after correction for multiple hypotheses) (), highlighting these two metabolic pathways as major processes experiencing selective pressure.
Representative Bp Genes Exhibiting Signatures of Positive Selection.
Positively Selected Genes May Contribute to Mammalian Virulence
We were intrigued by the possibility that the positively selected genes, while overtly responding to environmental pressures encountered by Bp in soil, might indirectly facilitate the colonization of mammalian hosts. Supporting this notion, the positively selected genes were significantly enriched in genes previously identified as putative virulence-related genes 
(20 genes, P
0.019, based on 10,000 empirical permutations). For example, one representative class of virulence-related genes are Type IV pili (TFP), which are bacterial surface proteins implicated in multiple cellular processes, including motility, cell adhesion, microcolony formation, and virulence 
. Of eight previously identified TFP loci in Bp K96243 
, positively selected genes were associated with three TFP loci (TFP2, TFP4 and TFP7), with the TFP4 Type IVA minor pilin locus containing two positively selected genes (BPSL2754
pilW and BPSL2755
pilV). To evaluate if TFP4 might be involved in mammalian virulence, we generated isogenic Bp mutant strains deleted in the TFP4 locus, and tested the virulence of TFP4 deletion strains in a BALB/c mouse intranasal infection assay 
. TFP4 deleted strains exhibited significantly reduced virulence compared to parental Bp K96243 wild-type controls (p
0.048, Mantel-Haenszel log-rank test, ), supporting a role for Type IV minor pilin activity in murine virulence. These results suggest that a subset of positively selected genes in Bp may influence virulence in mammals.
Functional Analysis of Positively Selected Genes.
To further explore if other positively selected genes might conceivably provide traits facilitating successful mammalian infection, we then investigated two other features typically associated with successful intracellular human pathogens - a) the ability to interact with host cellular processes, and b) the ability to utilize host metabolites as nutrients. Previous studies have shown that many microbial pathogens can alter host cytoskeletons and cell morphology during infection, using proteins such as TTS factors to induce actin stress fibers, lamellipodia, and filapodia 
. To examine the role of positive selection in this process, we curated a list of ten positively selected genes, either related to TTS biology (BPSS1552
) or present in Bp and Bm (both pathogenic species) but absent from Bt (non-pathogenic) (Table S8
). We cloned and expressed these ten genes in Hela cells, and examined the transfected cells for cytoskeletal perturbations. As a positive control, we also included BopE
), a TTS effector protein capable of inducing actin rearrangements 
. Nine of the positively selected genes were successfully expressed in Hela cells but did not induce any significant differences in actin morphology compared to vector controls (eg BPSS0415
, ). In contrast, cells transfected with BPSL1057F1
, a hypothetical protein and one of the novel genes identified in this study, exhibited a marked increase in actin stress fiber formation in the majority (60%) of transfected cells, with phenotypes very similar to BopE
transfection (). Protein analysis of BPSL1057F1
revealed the presence of a twin-arginine signal peptide sequence, often found in proteins exported into an extra-cellular environment 
. These results suggest that some positively selected genes in Bp may provide Bp with the potential to interact with host cellular pathways.
We also analyzed the list of positively selected genes for potential genes involved in host metabolite catabolism. Of metabolites linked to the 10 positively selected secondary metabolism genes, we focused on taurine (2-aminoethanesulfonate), since taurine is an amino acid found at high levels in potential mammalian hosts in muscles, bile, and white blood cells, but absent or present at only trace levels in bacteria and plants 
. Supporting the notion that Bp has developed an ability to metabolize taurine, the taurine dixoygenase gene BPSS0161
) exhibited a significant degree of positive selection across the eleven Bp genomes (P<0.001, Ka
57.6, EC 184.108.40.206). Prompted by this finding, we further explored the role of taurine metabolism genes in Bp and discovered a previously-unreported species-specific expansion of additional tauD
gene members in Bp. Specifically, compared to Bt or Bm which have three tauD
genes on Chr 2, the Bp Chr 2 genomes harbor eight-nine tauD
genes, a three-fold expansion ( 
, also on Chr 2). The Bp tauD
genes all share the same tauD
pfam family domain (PF02668) but otherwise exhibit low sequence similarity between each other (average nucleotide homology of 36%), arguing against this expansion occurring by gene duplication. Instead, sequence analysis suggests that many of the Bp tauD
genes were likely acquired by lateral gene transfer. For example, BPSS0665
, another tauD
gene, is localized to genomic island 14 (GI14), a region of codon bias deviation and atypical % GC content (Figure S8
). Intriguingly, despite exhibiting many features of mobile elements, GI14 has been previously shown to be consistently present across a large panel of natural Bp isolates in contrast to other GIs 
). It is possible that a selective requirement for maintaining levels of tauD
activity might have contributed to GI14 behaving as a conserved feature of the Bp genome.
In other bacterial species, tauD
is required to metabolize taurine as a sulphur source 
. Experimental assays comparing the growth Bp and Bt strains confirmed that Bp also exhibits a significantly enhanced ability to efficiently utilize taurine as a sulphur source compared to Bt (p
0.002, ). The ability of Bp to metabolize taurine for sulphur utilization is specific, as Bp was unable to use taurine as an alternative carbon or nitrogen source, activities which are not mediated by tauD
). Finally, to investigate the molecular response of Bp to taurine, we generated whole-genome transcriptome profiles of Bp exposed to high levels of taurine (250 uM). Here, the taurine concentrations used were based on previous reports studying taurine metabolism in E. coli 
. Compared to Bp grown in standard laboratory media, taurine-exposed Bp exhibited transcriptional up-regulation of ~280 genes, of which 40% (126 genes) have been previously associated with pathogenicity, host–cell interaction, or survival in diverse and challenging environments 
. Specific examples of taurine-regulated genes implicated in virulence included several flagella gene clusters (BPSL0024-BPSL0032
, BPSL3288- BPSL3330
, siderophore biosynthesis and iron metabolism genes (BPSL1771- BPSL1787
, BPSS0239- BPSS0244
, BPSS0581- BPSS0588
, and fimbrae/pili (BPSL2026- BPSL2031
, BPSS1593- BPSS1605
(, Table S9A and S9B
). Taken collectively, these findings suggest that altered taurine metabolism likely mediated by tauD
may represent a species-specific adaptation of Bp that may have also facilitated its ability to survive in infected mammalian hosts