The 31 organisms used in our analysis are described in Table . These genome sequences are all contained in the TB database (TBDB) [75
]. The three unpublished sequences generated at the Broad Institute (M. tuberculosis F11, M. tuberculosis Haarlem
, and M. tuberculosis C
) are high-quality genome sequences. M. tuberculosis F11
and M. tuberculosis Haarlem
are finished, and M. tuberculosis C
has 6.7× coverage and 4 scaffolds. The Broad Institute sequencing read pipeline interacts with the sample management system to ensure the read is associated with the correct sample. Vector identification, length checks and quality clipping were performed on all reads. Contamination checks and organism checks were also performed using a kmer-based algorithm that can compare sequence to a profile from any organism.
Defining protein families and constructing phylogenetic trees
The SYNERGY algorithm [27
] was applied to the 31 genomes in Table . SYNERGY organizes groups of genes across organisms into orthogroups, or groups of orthologs and paralogs, which consist of all the genes descended from a single ancestral gene in the species' last common ancestor. SYNERGY also associates orthogroups with a gene tree, from which we can derive an "extended phylogenetic profile", showing the gene copy number in each extant organism and at each ancestral node. Importantly, by reconciling an organism tree with each gene tree, SYNERGY provides an evolutionary scenario for each gene tree predicting where all losses, gains, and duplications occurred in its evolution. These lists of losses, gains, and duplications contain actual evolutionary events, as well as artifacts caused by genes that could not be properly categorized by SYNERGY. However, we observe that SYNERGY is effective at properly categorizing genes into orthogroups, and the SYNERGY orthogroups were very useful in our analysis. Analysis of the 31 genomes resulted in a total of 32,505 orthogroups, including those containing single genes from only a single genome (below). There were 177 "uniform" (1:1:1:1...) orthogroups representative of some of the most conserved and indispensible housekeeping genes. Additional file 4
: Figure S3 summarizes the SYNERGY orthogroups.
We started running SYNERGY using an initial phylogenetic tree generated using orthologs based on bidirectional best BLAST hits. The list of uniform orthogroups from the first SYNERGY run was used to construct a refined phylogenetic tree. SYNERGY was then re-run using the refined phylogenetic trees. To generate our final phylogenetic tree, the final set of 177 31-way orthologs (31-way uniform orthogroups from the SYNERGY analysis) were aligned according to their nucleotide sequences with CLUSTALW [88
] and concatenated, distances were computed with Phylip's DNADIST algorithm [89
], and Phylip's FITCH algorithm was used to create the tree.
Because of the similarity of the genomes within the Mtb complex, we were not able to resolve the phylogeny using only these 177 proteins that are uniform across all 31 organisms. In order to better resolve the tree within the Mtb cluster, we computed a separate tree using 1747 orthogroups that are uniform across the Mtb cluster and M. ulcerans, which we used as an outgroup. Using this expanded gene set, we were able to resolve the tree for the Mtb cluster.
Bootstrap analysis was performed to validate tree topologies. Phylip's SEQBOOT was used to create 1000 bootstrap input replicates for each tree. Phylip's CONSENSE was used to obtain a bootstrap tree (Additional file 5
: Figure S4)
Metabolic pathways and functional groups
] was used to assign EC numbers for proteins in all 31 organisms. Metabolic pathways were constructed in Biocyc [91
]. An orthogroup was considered to be part of a metabolic pathway if any of its component genes had been identified as part of that pathway using this pipeline.
We obtained the Gene Ontology (GO) [29
] and GO Slim terms for each of the 31 organisms using BLAST2GO [93
]. PFAM assignments [30
] were taken from http://www.tbdb.org
]. An orthogroup was associated with a GO, GO Slim, or PFAM descriptor if greater than half of its protein members were associated with that descriptor.
For each node in the phylogenetic tree, we tabulated orthogroups lost, gained, or duplicated. Using GO terms, GO Slim terms, and PFAM domain groupings with less than 500 members, we calculated over-representations within losses, gains, and duplications each of these groupings at each node using the hypergeometric test. A complete summary of gains, losses, and duplications for all nodes in the phylogenetic tree is available on our supplementary information website.
Extended phylogenetic profiles for each category (metabolic pathways, GO terms, GO Slim terms and PFAM categories) were obtained from SYNERGY output by summing the phylogenetic profiles from their component orthogroups. We define a category-level phylogenetic profile as the sum of its component orthogroup-level phylogenetic profiles. The evolution of each of these categories can be quickly visualized on our website. Since genes with the same phylogenetic profile can be linked functionally [94
], the webpage for each category contains a link to other categories with similar phylogenetic profiles (Methods
). Categories with the most similar profiles were obtained by calculating Euclidean distances to all other profiles.
Instances of expanded or missing pathways across the 31 organisms will have non-uniform pathway-level phylogenetic profiles. Thus we tabulated the number of genes in each genome for each category, and automatically searched for gene categories whose copy number (normalized for genome size) had the most non-uniform distribution across the 31 organisms in order to identify the most significant examples of expansions or losses. To identify categories with bimodal properties (such as a categories with a loss or a large expansion on only certain branches of the phylogenetic tree), we clustered each profile into two groups and looked for the pathways with the greatest separation between the two clusters. We used k-means (k = 2) to cluster the profile vectors, and compared the intra- and inter-cluster point-to-centroid distances to find the clusters with the greatest separation. We ranked categories by this separation to find bimodal categories. We further select those that have at least five organisms in the smallest of the two clusters, and an average of at least five genes per genome. P-values are calculated from a T-test between the values for the two groups, with Bonferroni correction applied. In our Supplementary Information website we list those categories with p < 0.05, ranked by the difference between their inter- to intra-centroid distances. When we select the metabolic pathways, PFAM domains, and GO terms with the most non-uniform category-level phylogenetic profiles overall, we find that many of the top categories are lipid metabolism-related categories expanded in the Mycobacteria.
We also measured the similarity between evolutionary profiles to find the PFAM categories and GO terms with the biggest difference between pre-defined sets of organisms. For example, we compared both the Mtb complex and a group consisting of other pathogenic Mycobacteria to the set of soil-dwelling Mycobacteria in order to examine the evolution of soil-dwelling, free-living Mycobacteria into more pathogenic Mycobacteria that require a host to survive. We used the following categories:
1. All Mycobacteria (excluding M. leprae because of its massive gene loss).
2. All non-Mycobacteria in our set (excluding Nocardia and Rhodococcus because of their similarity to Mycobacteria)
3. Mtb complex (8 organisms)
4. Other pathogenic Mycobacteria (M. ulcerans, M. avium 104, M. avium K10, M. marinum).
5. Soil-dwelling Mycobacteria that do not require a host (M. sp. MCS, M sp. KMS, M. smegmatis, M. vanbaalenii, M. abscessus, M. gilvum).
6. R. jostii RHA1 and N. farcinia
We calculated differences between two sets of organisms exactly as we calculated distances between clusters (above). However, rather than using different clusters of organisms determined by k-means clustering, we used these pre-defined clusters of organisms. We looked at distances between the following sets of organisms: 1-2, 3-4, 3-5, 3-6, 4-5, 4-6, 5-6. For each PFAM domain or GO term represented in at least two organisms in these pairings, we calculated p-values for the differences between the profile values by T-test (Bonferroni-corrected by the number of PFAM domains represented in that set of organisms) and computed inter-and intra-centroid distances (as described in the above paragraph). We compiled lists of those that are most expanded and a list of those most contracted across these pairings. On our website we have included complete lists of PFAM categories, including those that do not make the strict Bonferroni-corrected p-value cutoff. Many potentially interesting expansions do not make the overly conservative Bonferroni-corrected p-value cutoff [95
Using a compendium of 946 microarray experiments from the TB database [75
], we used several different clustering methods to generate predicted regulons. We searched the upstream regions of these regulons for shared transcriptional regulatory motifs. We clustered microarray data by hierarchical and k-means clustering. Because real regulons can be of varying sizes, we performed k-means with k = 50, 100, 200, and 250, then used all the resulting clusters for further analysis. We found that the clusters obtained from hierarchical clustering were not very useful because their size distribution did not approximate that of real regulons as well as those from k-means; therefore we did not analyze clusters from hierarchical clustering further.
We used AlignACE [97
] to search the upstream regions of the genes in these clusters for motifs. We used the methods for operon prediction, selecting upstream regions, and applying AlignACE to prokaryotic genomes as described in McGuire et al. [77
]. Briefly, because of the presence of operons in prokaryotes, we must choose the upstream region of the operon head rather than the region immediately upstream of the gene of interest. Since it is more important to include the correct region than to erroneously include extra incorrect regions, we use a loose operon definition and include sequences for several different possibilities if there is any ambiguity. We look upstream of our gene of interest and select all intergenic sequences until we encounter either a divergent intergenic region or an intergenic region longer than 300 bp.
Motifs of interest were selected by applying a set of filters: specificity score [77
], quality of alignment (AlignACE MAP score) [97
], palindromicity [77
], and conservation. To determine the degree of conservation, a search matrix was constructed for each motif. Each of the other genomes was searched with this search matrix using CompareACE, and N-way conserved sites were identified. N-way conserved hits are hits identified upstream of orthologous genes in N genomes, where orthology is defined by membership in the same SYNERGY orthogroup. To select interesting motifs we required specificity score < 1e-10, palindromicity > 0.7, MAP score > 5, and at least 8 sites conserved in 8 genomes.
Motifs were compared to a library of search matrices for 9 known Mtb
motifs (Acr, Crp, CsoR, DosR, FurA, IdeR, KstR, MprAB, and ZurB), as well as a library of 55 E. coli
] and 22 Corynebacterial
]. Comparison of motifs was done using CompareACE [76
Defining groups based on expression under different lipids
We separated the experiments in our compendium of Mtb H37Rv microarray experiments into separate conditions based on what nutrients were present in their growth conditions (focusing on different lipid conditions, because of the observed importance of lipid metabolism in these organisms). The following categories were used (the number of experiments in each category is shown in parentheses): Palmitic acid (168), Oleic acid (102), Arachidonic and Eicosatetraynoic acids (76), Linoleic acid (41), Eicosatetraynoic acid (13), Ceramide (4), Nordihydroguaiaretic (3), Cholesterol (2), Glucose (1), KstR knockout (1), KstR knockout with cholesterol added (1).
Within each experiment, we extracted a list of genes upregulated 1.5 and 2 standard deviations above the mean. For each category, we considered a gene to be upregulated if it was upregulated in more than 50% of the experiments making up that category. We then searched for genes that were only upregulated under certain conditions or sets of conditions.
We used PAML to calculate dN
values according to several different evolutionary models [100
]. Since orthogroups contain paralogs as well as orthologs, we used the gene trees output from SYNERGY when running PAML. Some orthogroups may contain single-copy orthologs in only two closely related organisms, whereas others could contain paralogs in all 31 organisms.
For the basic model, we used the following parameters: model = 0 and getSE = 1 (to calculate standard errors). This simple evolutionary model gives one value of dN
for each orthogroup, averaged over all lineages as well as all positions in the gene [102
]. While this model does not reflect the evolutionary history that has taken place, it is nevertheless a very blunt yet efficient tool for observing selection.
To gain insight into the evolution of the three major clades of the phylogenetic tree we also used a "branch model" where a different dN
value is allowed on a "foreground" branch (but dN
is averaged along positions in the protein) [103
]. This was done in PAML by using "Model = 2". We compared this model to the basic model using a log-likelihood χ2
test with d.o.f. = 1. For each of the three foreground branches, we used a Bonferroni correction equal to the number of orthogroups present at the branch. We ran this separately for three different "foreground" branches on the phylogenetic tree (labeled in Figure ): A) The branch leading to the Mtb
complex; B) The branch leading to pathogenic Mycobacteria; and C) The branch leading to soil-dwelling, non-pathogenic Mycobacteria. The log-likelihood model that we use here compares this branch model to the simple model with a single value of dN
described above, and tests whether the model allowing dN
to differ on the foreground branch fits the data better than the basic model.
Branch-site models allow dN
to vary across branches of the tree and among sites in the protein. We also used the branch-site model of Zhang and Nielsen [100
] using Model = 2, NSsites = 2, and fix_blength = 2. We used the model = 0 calculations to determine branch lengths for the branch-site model calculations to save computational time. We compared the results for a subset of the orthogroups with and without fixed tree lengths and determined there was little difference in the results). We chose the same three sets of branches (A-C) that we used for the branch model described above. We compared this model to the corresponding null model using a log-likelihood χ2
test with d.o.f. = 1 [100
]. For each of the three foreground branches, we used a Bonferroni correction equal to the number of orthogroups present at the branch. The branch-site model was the most informative.
We calculated the functional group over-representations separately for each functional group dataset. These datasets included 21 COG categories, 168 KEGG categories, 749 metabolic pathways, and 7 additional Mycobacteria-specific groupings (PE genes, PPE genes, toxin-antitoxin genes, DosR regulon, esx genes, Rv0474 regulon, and the KstR regulon). We multiplied the hypergeometric p-values by a Bonferroni correction equal to the number of categories or tests performed.
Generating multi-genome alignments
We constructed whole-genome alignments of all 31 organisms, as well as subsets including only Mycobacterial organisms, and organisms within the Mtb complex. These alignments can be downloaded from our website. Our whole genome multiple alignments reveal unannotated stretches of conservation in noncoding regions including transcription factor binding sites in promoter regions, noncoding RNAs, and mis-annotated proteins.
To generate whole-genome multiple alignments, we first aligned the reference genome to each target genome in a pairwise manner. The process of pairwise whole-genome alignment consists of using PatternHunter [105
] to identify anchors of local alignment, grouping collinear anchors separated by a limited distance into chains, filtering out smaller chains that shadow larger ones, and finally using LAGAN [106
] to globally align the entire chain. Once all genomes have been aligned to the reference, we then identified intervals of the reference that map tightly to a single interval of some or all of the target genomes, and we consider these the endpoints of blocks of multiple alignment. These blocks are generally smaller than any precursor pairwise alignment, because a rearrangement or loss of detectable similarity in any genome will truncate the block for all member genomes. We then ran the multiple aligner MLAGAN on each block. Finally, to facilitate searches for constrained regions of the reference, we projected the blocks onto the reference genome, effectively unwinding all genome rearrangements in the target genomes relative to the reference. We visualized the alignments in the GenomeView browser [64
Selecting conserved regions within the alignments
We used Gumby [65
] to select conserved regions in our multiple alignments using a value of p < 0.5. In a multiple alignment of all 19 Mycobacterial genomes, we identified 4697 regions of conservation overlapping coding genes in the reference annotation, and 394 regions in intergenic regions.
We also used the method of Ruzzo and Tompa [107
] to identify conserved regions. Scores were normalized to the background inferred from the 3 rd-base frequencies. For all H37Rv
coding sequences, all bases in the third position were extracted from the 31-way multiple alignment. These were concatenated in a new multiple alignment only containing third bases. From this new multiple alignment we calculated the baseline conservation which is used to normalize the conservation scores for the regular alignment. Both sets of highly conserved regions can be viewed as alignment tracks for the GenomeView browser [64
], downloadable on our website.
We predicted regions likely to form RNAs within the conserved intergenic regions of our multiple alignment of 19 Mycobacteria, using Evofold [66
]. We divided the intergenic region into 240-bp segments, tiled by 80 bp, to run Evofold. Looking within intergenic regions, we identified 536 regions with Evofold (regions greater than 5 bp in length with length-normalized folding potential score > 0.2).
We examined these 536 regions, as well as the 394 conserved intergenic regions found by Gumby, to see if any of these showed significant expression in our log-phase Mtb
RNA-seq data. We calculated RPKM [108
] values for each of these regions. We examined the regions with RPKM value ≥200 and a number of RNA-seq reads ≥ 20. We eliminated an additional 35 regions which corresponded to known RNAs from the Mtb
annotation, or RNAs similar to those found in M. bovis
], including 26 tRNAs, 2 riboswitches, and 3 found in other organisms.
To select intergenic regions with high levels of expression that do not correspond to UTRs, we also calculated RPKM values for the 100 bp regions of the flanking genes closest to the intergenic regions. We selected those intergenic regions with the highest ratio of the RPKM value of the region of interest (within the intergenic region) to the RPKM of the start/stop of the flanking genes. We also looked for regions with a gap in expression between the gene and the region of interest. This will eliminate many regions that merely correspond to UTRs, and select for regions that are disproportionately expressed within the intergenic region only. We found this method to be most useful for selecting regions of interest, and successfully enriched our top hits for previously known small RNAs. The top 50 predicted RNAs can be viewed as a track in the GenomeView browser (see Supplementary Information).
We further examined log-phase RNA-seq data from M. smegmatis to confirm that many of the orthologous regions also show expression in M. smegmatis.
Strain, media, and culture conditions for RNA-seq
Mycobacterium tuberculosis H37Rv and M. smegmatis were grown at 37°c in 7H9 media supplemented with 10% ADC (Becton Dickinson), 0.2% glycerol and 0.05% Tween 80. For log phase, cells were grown to OD540 0.2. Roller bottles were used for culturing M. tuberculosis, and shaker flasks for M. smegmatis.
RNA isolation from in vitro cultures for RNA-seq
Bacterial pellet from log-phase cultures of M. tuberculosis and M. smegmatis were resuspended in TRIzol reagent (Invitrogen) and immediately transferred to 2 ml screw-cap tubes containing 0.1 mm zirconia/silica beads (BioSpec Products). M. tuberculosis cells were lysed using a FastPrep-24 bead-beater (MP Biomedicals) 3 times for 30 seconds each at speed 6. M. smegmatis cells were lysed using MagNalyser (Roche). Samples were kept on ice for 1 min between pulses. The TRIzol extracted RNA was treated twice with DNAse and further purified using RNAeasy kit (Qiagen).
Directional mRNA-seq libraries for RNA-seq
We generated mRNA-seq libraries for sequencing on Illumina's GA Sequencer (San Diego, CA). 2 μg purified RNA was depleted of ribosomal RNA using Ambion's MICROBExpress Kit (Austin, TX) as per manufacture's recommended protocol. The enriched mRNA was used to prepare libraries using Illumina's Directional mRNA-seq Library Prep v1.0 protocol. Briefly, 100 ng mRNA was fragmented with cations and heat, end-repaired, adapted by sequential ligation of unique 5-prime and 3-prime adapters, reverse transcribed, PCR amplified, and purified using Agencourt's AMPure Beads (Beverly, MA). The libraries were visualized on an Agilent 2100 Bioanalyzer (Santa Clara, CA) and found to have the expected average fragment length of ~250 bp.
RNA isolation and Northern Blotting
Total RNA was isolated from Mtb as described previously [109
] with minor modifications. Briefly, log-phase cells were pelleted, resuspended in TRIzol (Invitrogen), and transferred to Lysing Matrix B tube (QBiogene). The cells were lysed using MagNalyser (Roche), and RNA extracted with Trizol reagent as instructed by the manufacturer. RNA was treated with Turbo DNase (Ambion) for 30 minutes at 37°C twice and purified further using TRIzol solution and 100% Ethanol.
Total RNA was separated on 10% TBE-Urea acrylamide gels (Bio-Rad) and electroblotted onto Hybond N + membranes (GE Healthcare). After UV cross-linking the membranes were pre-hybridized and hybridized with labeled probes at 48°C as per the DIG manual (Roche). Probe sequences are CGATGGTCGAAAAGGAACTCGATACGGCTATGCGGTTCT (RNA1), AGTTCACGAAACGAAGAAAGAAGCTAAGAAGACATAGGTT (RNA2), GACTGCCAGCAGGCGCCGCGCAATGCGCTTGCAGGACTTC (RNA3), and GGGTGACATGGCTCAGGGAAGCCCGGGCGGGCTGGGACGT (RNA9). After hybridization the membranes were washed twice using a low stringency buffer (2× SSC, 0.1% SDS), and a high stringency buffer (0.1× SSC, 0.1% SDS), for 15 and 5 minutes at 48°C, respectively. The membranes were processed with DIG detection system (Roche) and exposed to X-ray film.