|Home | About | Journals | Submit | Contact Us | Français|
Virulence and immunity are poorly understood in Mycobacterium tuberculosis. We sequenced the complete genome of the M. tuberculosis clinical strain CDC1551 and performed a whole-genome comparison with the laboratory strain H37Rv in order to identify polymorphic sequences with potential relevance to disease pathogenesis, immunity, and evolution. We found large-sequence and single-nucleotide polymorphisms in numerous genes. Polymorphic loci included a phospholipase C, a membrane lipoprotein, members of an adenylate cyclase gene family, and members of the PE/PPE gene family, some of which have been implicated in virulence or the host immune response. Several gene families, including the PE/PPE gene family, also had significantly higher synonymous and nonsynonymous substitution frequencies compared to the genome as a whole. We tested a large sample of M. tuberculosis clinical isolates for a subset of the large-sequence and single-nucleotide polymorphisms and found widespread genetic variability at many of these loci. We performed phylogenetic and epidemiological analysis to investigate the evolutionary relationships among isolates and the origins of specific polymorphic loci. A number of these polymorphisms appear to have occurred multiple times as independent events, suggesting that these changes may be under selective pressure. Together, these results demonstrate that polymorphisms among M. tuberculosis strains are more extensive than initially anticipated, and genetic variation may have an important role in disease pathogenesis and immunity.
Current evidence suggests that as a species Mycobacterium tuberculosis exhibits very little genomic sequence diversity (24, 34). Most genetic variability that has been detected is associated with transposable elements and drug resistance phenotypes (5, 17, 28, 38). It follows that M. tuberculosis should exhibit very little phenotypic variation in immunologic and virulence factors. However, evidence of phenotypic diversity among clinical isolates conflicts with this hypothesis (22, 37). The presence of significant sequence diversity in M. tuberculosis would provide a basis for understanding pathogenesis, immune mechanisms, and bacterial evolution. Polymorphic genes are good candidates for virulence and immune determinants, because proteins that interact directly with the host are known to have elevated divergence. Polymorphic sequences also serve as markers for phylogenetic and evolutionary studies. Such studies are currently limited by a paucity of known genetic markers.
Recently, the genome of the M. tuberculosis laboratory strain H37Rv was completely sequenced (GenBank accession no. NC_000962) (10). This sequence provided important insights into the biology of this species but did little to address issues of sequence diversity. Furthermore, significant differences can be documented among the genomes of laboratory strains with long histories of passage and among recent clinical isolates (25). H37Rv had been passaged for many decades outside of the human host. Thus, the relevance of the H37Rv genome sequence to clinical M. tuberculosis strains has been questioned. We sequenced the genome of a clinical M. tuberculosis strain, CDC1551 (GenBank accession no. AE000516), and performed a comprehensive sequence comparison. In contrast to H37Rv, CDC1551 is a strain involved in a recent cluster of tuberculosis cases and is known to be transmissible and virulent in humans (38). The CDC1551 strain appears to be highly infectious in humans, is comparable in virulence to strain H37Rv in animal models (8), and has greater immunoreactivity than H37Rv and other clinical strains due to increased induction of tumor necrosis factor alpha, interleukin-6 (IL-6), IL-10, and IL-12 (22).
Investigators have utilized a variety of low- and high-resolution comparative genome techniques to identify differences in the genomes of Mycobacterium bovis, M. bovis BCG vaccine strains, and M. tuberculosis laboratory and clinical strains. Low-resolution analyses have included subtractive hybridizations and investigations of sequence variations associated with restriction length polymorphisms. These methods have identified a number of sequence differences between the different mycobacterial species and strains (15, 16, 21, 26). However, low resolution analysis has inherent limitations. Consequently, the polymorphisms identified have been either very large or restricted to localized areas of the M. tuberculosis genome. Comparative genomic techniques with higher resolution have also been performed recently. However, these studies have been limited by the use of H37Rv as the single reference strain (6, 7, 18). The only whole-genome comparison undertaken thus far has been an incomplete and inaccurate comparison of the H37Rv and CDC1551 strains. This study apparently used an incomplete version of the CDC1551 strain sequence and resulted, for example, in the misidentification of several comparative deletions in strain CDC1551 (7). Here we present the first comparison of the complete genomes from strains H37Rv and CDC1551, including differences resulting from large-sequence polymorphisms (LSPs) (greater than 10 bp) and single-nucleotide polymorphisms (SNPs). We discovered an unexpectedly high degree of sequence variation between the two genomes and determined that much of the variation was also present in a large panel of clinical M. tuberculosis isolates.
The methodologies for library construction, template preparation, sequencing, and closure of the M. tuberculosis genome were primarily those developed during the whole-genome shotgun sequencing of the 1.83-Mbp genome of Haemophilus influenzae (13). In brief, high-molecular-weight genomic DNA from strain CDC1551 was obtained from a seed lot culture maintained by William Bishai, Johns Hopkins University School of Medicine. The genomic DNA was randomly sheared by nebulization and size selected for 2.0-kb fragments. The DNA fragments were cloned into a SmaI-digested pUC18 vector. The recombinant molecules were freshly transformed by electroporation into DH10B electrocompetent cells (catalog no. 8290SA; GIBCO BRL) and transferred directly to a nutrient-rich SOB plate (13). Approximately 50,000 templates were prepared and sequenced with M13 forward and reverse primers. Sequencing reactions were analyzed on AB377 DNA sequencers. The random shotgun fragments were assembled with the TIGR assembler (35). Sequence gaps were filled by selecting a template whose forward and reverse sequence reads were in adjacent contigs. Physical gaps were ordered by combinatorial PCR based on oligonucleotide primers designed from the ends of each group of assemblies. The sequences of the physical gaps were determined by primer walking across each of the PCR products. Contigs were edited with the TIGR editor.
Open reading frames (ORFs) were identified with GLIMMER (30). The ORFs were searched against an in-house nonredundant amino acid database with blast_extend_repraze, which uses a BLASTP algorithm to generate pairwise amino acid alignments (4, 42). In addition to the pairwise alignments used to generate gene assignments, the ORFs were evaluated by comparison to a database of hidden Markov models generated from multiple sequence alignments for protein families and superfamilies (33). A team of annotation experts evaluated the results generated by these various tools and assigned to each ORF with a significant match an accession identification and a biological-role identification.
We have developed a heuristic approach to aligning large segments (millions of base pairs in length) of closely related DNA sequence (11). The system uses a combination of three ideas: suffix trees, the longest common subsequence, and Smith-Waterman alignment. The complete nucleotide sequences of H37Rv and CDC1551 are provided as input. The alignment process then follows these steps. (i) A maximal unique match decomposition of the two genomes is performed to identify all maximal unique shared subsequences in both genomes. (ii) The matches are sorted based on a longest ascending subsequence algorithm, providing an easy and natural scan of the alignment from left to right. (iii) Gaps in the alignment are closed by performing local identification of large inserts, repeats, small mutated regions, tandem repeats, and SNPs. (iv) The alignment is outputted, including all the matches in the maximal unique match alignment and the detailed alignments of regions that do not match exactly.
One hundred sixty-nine clinical isolates taken at random from a collection of M. tuberculosis isolates cultured at Montefiore Medical Center and 64 clinical isolates cultured at The Johns Hopkins University School of Medicine were analyzed for LSPs. Genomic DNAs from the clinical isolates and strains CDC1551 and H37Rv were bound onto Biotrans plus nylon membranes (ICN Pharmaceuticals, Costa Mesa, Calif.) in longitudinal strips using a multislot hybridization apparatus (immunoblotter; Immunetics, Cambridge, Mass.). The membrane was then turned 90° in the same slot blot apparatus and simultaneously hybridized with γ-32P-labeled probes complementary to different LSPs. Forty-four different genomic DNA samples could be slotted in an array consisting of 44 lines extending across the membrane. Hybridizing of probes for each LSP at 90° to this array permitted every probe to come into contact with every genomic DNA sample. All isolates were subjected to DNA fingerprinting using IS6110-based restriction polymorphism analysis; low-band-number isolates were also tested with a secondary fingerprinting technique (1).
Some apparent SNPs might represent sequencing errors rather than true SNPs. Verification of the sequence differences was accomplished by two independent methods. One hundred SNPs were chosen at random, and the base calls were independently verified by inspection of the original electropherograms at The Institute for Genomic Research (CDC1551) and The Sanger Center (H37Rv). In collaboration with Qiagen Genomics, Inc., these 100 SNPs were verified for both strains by using Masscode technology for SNP identification (19). The visual inspection of the electropherograms and the Masscode results were in good agreement and indicated that 91% (80 of 88 successful assays) of the nucleotide differences were genuine. The number of synonymous and nonsynonymous substitutions between the two strains was determined by alignment of homologous ORFs with ClustalW (36). Poor alignments were removed based on visual inspection, resulting in a comparison of 3,535 ORFs. The number of synonymous differences (Sd) and nonsynonymous differences (Sn) between homologous ORFs was calculated with the program SNAP (20).
The complete genome sequence of M. tuberculosis strain CDC1551 was determined utilizing the whole-genome shotgun strategy (13). The complete annotation of the CDC1551 genome is located at The Institute for Genomic Research website at www.tigr.org/CMR and under GenBank accession no. AE000516. A whole-genome alignment was performed utilizing MUMmer software developed at The Institute for Genomic Research (11). The circular representation of the M. tuberculosis chromosome illustrated in Fig. Fig.11 depicts the location of each predicted protein coding region as well as selected features differing between the CDC1551 and H37Rv strains, including LSPs and SNPs.
The two genomes contained notable differences. The H37Rv strain contained 37 insertions (greater than 10 bp) relative to strain CDC1551. Twenty-six insertions affected ORFs (Fig. (Fig.1;1; Table Table1),1), and 11 were intergenic. The insertions in strain H37Rv included tandem repeats, additions to the 5′ or 3′ ends of ORFs, and the addition of complete ORFs. Complete ORFs included three encoding hypothetical proteins (Rv0793, Rv3427c, Rv3428c), two encoding PPE proteins (Rv3425, Rv3426), one encoding a PE_PGRS protein (Rv3519), and two encoding proteins with putative functions (Rv0794c, a dihydrolipoamide dehydrogenase, and Rv0792c, a putative transcriptional regulator). Forty-nine insertions were identified in strain CDC1551 relative to strain H37Rv. Thirty-five insertions affected ORFs (Fig. (Fig.1;1; Table Table2),2), and 14 were intergenic. In addition to tandem repeats and additions to the 5′ and 3′ ends of ORFs, insertions introduced 17 complete ORFs. Eight ORFs encoded conserved hypothetical or hypothetical proteins (MT1813, MT2080, MT2080.1, MT2081, MT2420, MT2421, MT3427.1, and MT3429). The nine additional CDC1551 ORFs with functional assignments included genes encoding an adenylate cyclase (MT1360), a glycosyl-transferase (MT1800), an oxidoreductase (MT1801), a 12 transmembrane protein (MT1802), a membrane lipoprotein (MT2619), a PPE family protein (MT3248), paralogs of moaB (MT3426) and moaA (MT3427), and a gene encoding a putative transcription regulatory protein (MT3428). The changes in the 5′ end of a phospholipase C gene (MT1799) and the addition of a 12 transmembrane transport protein (MT1802) are particularly notable because of their potential role in bacterial virulence (32). It is worth noting that almost half of the insertions and deletions in both strains involved genes encoding PPE or PE_PGRS family proteins. We found only one major rearrangement of genome structure between the two strains. A prophage, initially identified in the RD3 region of M. bovis (21) and later characterized as prophage phiRv1 in strain H37Rv (10), is associated with the REP13E12 family of repeats (14). In strain H37Rv, prophage phiRv1 is integrated between coordinates 1,779,312 and 1,788,503. In strain CDC1551, phiRv1 is integrated into a second member of the REP13E12 family at CDC1551 coordinates 3,870,803 and 3,879,990.
The IS3-type insertion sequence IS6110 is the principal epidemiological marker for M. tuberculosis. A number of the insertions and deletions were associated with this insertion sequence, suggesting a role for this element in genome plasticity (23, 41). Studies have shown that homologous recombination between nearby copies of IS6110 may result in genomic deletions and can be a mechanism for generating genomic diversity (12). We identified 4 copies of IS6110 in CDC1551 compared to 16 copies in H37Rv. Four of the 16 IS6110 elements present in H37Rv lacked the characteristic 3- to 4-bp direct repeat and were directly adjacent to regions that were deleted relative to strain CDC1551. Two of these deletions included the 6,807-bp region containing ORFs MT1799, MT1800, MT1801, and MT1802 and the 4,083-bp region deleted in strain H37Rv containing the molybdopterin cofactor biosynthesis gene cluster (MT3426 and MT3427) (Table (Table2).2). While strain H37Rv contained several deletions associated with a possible mechanism of homologous recombination between nearby IS6110 elements, none of the deletions in strain CDC1551 appeared to be the result of such a mechanism. The association of the IS6110 element with the deletion of important genes calls into question its utility as a neutral marker for phylogenetic analysis.
Regions differing in the copy number of several genes between strains H37Rv and CDC1551 were identified. Phylogenetic analysis of these regions revealed surprisingly contradictory evolutionary relationships among CDC1551, H37Rv, and M. bovis. Among the genes encoding membrane lipoproteins in strain H37Rv were two genes in tandem (Rv2543 and Rv2544). The homologous genome region in strain CDC1551 contained the orthologs MT2618 and MT2620, respectively. However; a third ORF, MT2619, was interspersed between the two genes (Fig. (Fig.2a).2a). Examination of the homologous region in the M. bovis genome (www.sanger.ac.uk) revealed orthologs of the three membrane lipoprotein genes observed in the CDC1551 genome in an indistinguishable organization and nucleotide sequence. There was essentially no nucleotide diversity between orthologs. The nucleotide diversity among the paralogs was similar for all three genes, indicating equivalent evolutionary distances. Phylogenetic analysis suggested that the three paralogs arose in a common ancestor of the M. tuberculosis complex by gene duplications and that subsequent loss of MT2619 occurred in the H37Rv lineage.
A second polymorphic region contained an unusual organization of three (Rv1318c, Rv1319c, and Rv1320c) and four (MT1359, MT1360, MT1361, and MT1362) putative adenylate cyclase genes in tandem in strains H37Rv and CDC1551, respectively (Fig. (Fig.2b).2b). Each ORF had a strong match to the catalytic domain of guanylate or adenylate cyclase. The comparative sequence data between the two strains showed that in H37Rv, the middle gene of the cluster (Rv1319c) is actually an in-frame chimera corresponding to the 3′ and 5′ ends of genes MT1360 and MT1361, respectively, from CDC1551 (Fig. (Fig.2b).2b). M. bovis has a structure similar to that of H37Rv. Phylogenetic analysis of this polymorphic region indicated that the in-frame chimera was generated by a deletion-fusion event. Thus, the ancestral structure was likely four tandem genes. The shared fusion in H37Rv and M. bovis could be due to a single event, indicating that these strains share a common ancestor relative to CDC1551. Notably, this hypothesis conflicts with the evolutionary scenario proposed for the membrane lipoprotein gene duplication. Together, these findings indicate that genetic variability in M. tuberculosis arises through a complex evolutionary process that involves recombination or multiple insertion-deletion events occurring independently at the same locus.
The comparative sequence data between CDC1551 and H37Rv provided us with a starting point for characterizing the degree and frequency of certain deletion/insertion events among clinical M. tuberculosis isolates. We tested 169 clinical M. tuberculosis isolates for the presence of a subset of seventeen CDC1551-H37Rv LSPs representing known and hypothetical genes. The seventeen probes for these LSPs, the genes involved, and their respective coordinates are listed in Table Table3.3. The isolates included 19 restriction fragment length polymorphism (RFLP)-defined clusters and 88 unique isolates. We defined “clustered” isolates as those isolates which through DNA fingerprinting of an insertion sequence element (IS6110) shared identical banding patterns. “Unique” isolates contained unique banding patterns. Clustered isolates are thought to have a common ancestor and to be possibly linked together by recent transmission events, while unique isolates are likely to be genetically distinct strains (40). We discovered a surprisingly large degree of sequence heterogeneity among the clinical isolates. All of the 169 tested isolates lacked at least one LSP. An average of 3.7 sequences were missing for each isolate, with a range of one to seven deletions. We reproducibly demonstrated the presence or absence of specific LSPs in duplicate experiments, and we performed polymorphism specific PCR assays to reconfirm a subset of these results (data not shown).
We used a second set of epidemiologically well-characterized isolates to determine whether variability of LSPs also occurred among isolates that were closely linked through epidemiological investigations (referred to herein as “epi-linked” isolates). Clusters of epi-linked isolates presumably share a recent common ancestor as part of an outbreak of disease. Thus, they represent very closely related isolates. We studied 42 isolates that included 16 RFLP-defined clusters and 15 unique isolates. Seven clusters contained a total of 17 epi-linked isolates. The pattern of LSPs was identical in all epi-linked isolates within a cluster. In contrast, four isolates within three clusters without epi-links contained deletions of at least one sequence (Fig. (Fig.3).3). As each cluster is likely to have arisen from a common ancestor, the additional deletions likely occurred subsequent to the common ancestor. We also found the same deletions in other clustered and unique isolates. These findings suggest that the loss of these LSPs occurred multiple times as independent events, and that new polymorphisms rarely develop within the short time frame of a clinical outbreak of disease. Alternately, the disparate clustered and unique strains containing identical deletions could have arisen from a common strain in which a unique ancestral deletion occurred. However, this second hypothesis is not supported by evidence for recurrent deletions within the phospholipase C region (16) or by the contradictory deletion-based phylogeny of the adenylate cyclase region (see above).
The comparison of the H37Rv and CDC1551 genomes identified 1,075 SNPs between the two genomes (Table (Table4).4). Approximately 85% of the substitutions occurred in coding regions (93% of the genome). We found transitions (purine to purine and pyrimidine to pyrimidine) to be more numerous than transversions (purine to pyrimidine and pyrimidine to purine), which represented 61 and 39% of the substitutions, respectively. We calculated synonymous and nonsynonymous substitutions for 3,535 pairs of homologous ORFs. There was at least one synonymous or one nonsynonymous substitution in 298 (8.4%) or 457 (12.9%) of the ORFs, respectively. In total, there were 342 and 579 synonymous (Sd) and nonsynonymous (Sn) substitutions, respectively. The proportion of synonymous differences per synonymous site corrected for the possibility that a site changed multiple times (Ds) was 3.6 × 10−4, or a 1/Ds of 1 synonymous substitution per 2,752 synonymous sites. This value for Ds was more than threefold greater than that previously described for M. tuberculosis (34), emphasizing the unexpected sequence diversity.
Surprisingly, the ratio of M. tuberculosis Ds to nonsynonymous substitution (Ds/Dn) was approximately 1.6, in contrast to studies of housekeeping genes in Escherichia coli and Salmonella enterica and invasion genes in S. enterica which show a Ds/Dn ratio of ranging from 4 to 17 (9, 39). The observation in E. coli and S. enterica is consistent with the generally accepted view that the many nonsynonymous substitutions are lost through purifying selection. The ratio observed in M. tuberculosis indicates that additional selective pressure is present on synonymous substitutions or there is decreased selective pressure against nonsynonymous mutations.
Patterns of synonymous and nonsynonymous substitutions can reveal information about mutations and selective pressures on genes, as well as information about population structure and recombination. We analyzed the frequency of substitution for 877 gene families to see if any contained significantly more SNPs than the genome as a whole. We calculated a P (the probability that the frequency of substitution was due to random fluctuation) for synonymous and nonsynonymous substitutions for each gene family. We used a P of <0.0001 as a cutoff to identify gene families in which the higher frequency of substitution was not likely to be due to random fluctuation (Table (Table5).5). We identified three paralogous families with significantly more synonymous substitutions and nonsynonymous substitutions (gmt [“Genome Mycobacterium tuberculosis” database] domain_PF00924, gmt domain_75, and gmt domain_337 [http://www.tigr.org/tigr-scripts/CMR2/ParalogousList.spl?db=gmt]).The gmt domain 69 also had significantly higher synonymous substitution rates, but the P for nonsynonymous substitutions was slightly above the cutoff. Interestingly, the PE and PPE paralogous family (gmt domain_PF00924) was among the three families with significantly higher synonymous and nonsynonymous substitutions. The PE and PPE genes encode a family of acidic, glycine-rich proteins that are postulated to be expressed on the extracellular surface and are considered potential antigens for host immunity (27, 31).
The fact that the H37Rv strain has been in culture for nearly a century raised the possibility that many of the nucleotide differences observed between the two strains could be a result of in vitro passage. We examined a subset of 28 of the clinical isolates for 11 nonsynonymous SNPs (4 of which resulted in conservative amino acid substitutions) initially detected through the H37Rv-CDC1551 comparison (Table (Table6).6). Seven of the eleven SNPs were polymorphic in at least 1 of the 28 isolates tested. Two of the SNPs, SNP-2904 in the gene encoding the carbon starvation A protein and SNP-4090 in a gene encoding a putative transcriptional regulator, were highly polymorphic (each allele present in ≥20% of the isolates). Thus, the SNPs discovered by the two-genome comparison appear to be broadly represented in clinical strains. The distributions of SNP-2904 and SNP-4090 are comparable to those of two SNPs previously described as highly polymorphic (gyrA-95 at codon 95 of the gyrA gene and katG-463 at codon 463 of the katG gene) (34), suggesting that they might be useful for phylogenetic analysis (Table (Table66).
We performed a phylogenetic analysis of 21 clinical isolates, H37Rv, CDC1551, and M. bovis using a combination of data from LSPs, SNPs, and selected phenotypic traits (Fig. (Fig.4).4). The tree is supported by statistical analysis for most of the branching patterns. The phylogenetic analysis allows the assessment of the consistency of different markers with the consensus tree (Fig. (Fig.5).5). Several of the markers have a high consistency index as well as a reasonable level of variability. These markers provide information for constructing models of the phylogenetic relationships between strains. As described above, analysis of the adenylate cyclase region contradicted the model in which H37Rv, CDC1551, and M. bovis shared a common ancestor. The low consistency index of this region (marker 2) confirms on a population level that the cyclase region would be expected to be a poor phylogenetic marker, explaining this apparent contradiction. Markers with low consistency indices represent regions of the genome which may have evolved by mechanisms of convergence, recombination, and mutational frequencies outside the average for the species. Such sites can be useful in discriminating among strains linked by other markers, such as insertion sequence markers.
Several studies have compared closely related strains of bacteria, including Helicobacter pylori, several Chlamydia species, and pathogenic and nonpathogenic E. coli (2, 3, 25, 29). In general, these studies were limited to the sequencing of a comparative strain or species and identification of polymorphisms between the two but failed to extend the findings beyond the two strains being compared.
Several studies of the Mycobacterium genus describe LSPs among the M. bovis BCG vaccine strains and virulent M. bovis as well as among other tubercle bacilli (6, 15, 21, 18). While these studies describe several regions of LSP, the methodologies of subtractive hybridization and RFLP analysis limit the resolution and therefore the utility of the markers described. Betts et al. (7) in their analysis of the M. tuberculosis proteome attempt a comparative analysis of the genome contents of the H37Rv and CDC1551 strains. Several conclusions drawn from this analysis appear to be incorrect based on our present analysis. In particular, they describe eight ORFs completely unique to H37Rv, including Rv0278c, Rv0279c, Rv0746c, Rv0747, and Rv1087, all of the PE_PGRS family. The identification of a 5,742-bp deletion associated with ORFs Rv0278c and Rv0279c and a deletion of 4,910 bp associated with ORFs Rv0746c, Rv0747, and Rv0748 is incorrect and may be the result of analyzing an incomplete version of the sequence of strain CDC1551. Two other regions of strain H37Rv, bp 2,714,308 to 2,714,808 and bp 3,933,523 to 3,936,659, are incorrectly identified as being deleted in strain CDC1551. Their analysis of insertions in the strain CDC1551 genome relative to strain H37Rv also appears to contain several errors. They failed to identify approximately 15 regions of insertion in strain CDC1551 relative to H37Rv. Among these regions was a 4,425-bp region containing ORFs MT3426 and MT3427 and moaB and moaA paralogs, respectively. Most of the coordinates reported by Betts et al. (7) are incorrect with respect to strain CDC1551. This is again likely due to the analysis of an incomplete genome sequence.
We initially undertook the sequencing and annotation of a second M. tuberculosis strain, CDC1551, in an attempt to correlate genotypic changes with strain phenotype. Prior to our study it was generally accepted that little sequence variation existed. Our study demonstrates that a much higher level of polymorphism is present among the M. tuberculosis species. This discovery complicates attempts to associate specific genotypic changes with phenotypic differences. However, it also provides several important opportunities.
First, it is likely that a subset of the polymorphic sequences code for genes involved in host-pathogen interactions. A statistical analysis of the single-base substitution frequency in 877 gene families identified several families in which the frequency of substitution was significantly greater than that observed for the genome as a whole. This included the PE/PPE gene family, which encodes acidic, glycine-rich proteins that are postulated to be expressed on the extracellular surface and are considered potential antigens for host immunity (27, 31). The higher substitution frequency in this family might be the result of antigenic variation or may be due to other interactions with the host. Three other gene families representing conserved hypothetical proteins also had a higher substitution frequency. These genes warrant further investigation as potential candidates for virulence or immunogenicity. The LSPs that we discovered may also encode proteins involved in disease pathogenesis as demonstrated by the polymorphisms in one of the phospholipase C genes and members of the PE/PPE gene family.
Second, the level of LSPs and SNPs that we discovered among the M. tuberculosis isolates suggested that we could also develop a set of markers that would be valuable in studying the phylogenetics of the M. tuberculosis species and other tubercle bacilli. Analysis of two LSPs, an ORF encoding a membrane lipoprotein and an ORF encoding a putative adenylate cyclase, suggested conflicting scenarios of the evolutionary relationship of CDC1551, H37Rv, and M. bovis. However, phylogenetic analysis of 21 clinical isolates along with strains CDC1551, H37Rv, and M. bovis demonstrated that the adenylate cyclase region had a low consistency index and would be a poor phylogenetic marker, explaining its apparent contradiction with the membrane lipoprotein region. Other polymorphisms with high consistency indexes were discovered that are likely to be excellent markers for investigating the evolution and phylogenetics of this species.
We demonstrated that the Ds is more than threefold greater than estimated previously (24, 34). Interestingly, the Ds-to-Dn ratio was close to 1, unlike the much higher level expected if selection was against nonsynonymous but not synonymous substitutions. This could be due to decreased selective pressure against nonsynonymous mutations. For example, the prolonged passage of strain H37Rv in culture may have permitted the accumulation of nonsynonymous mutations in many genes that would otherwise be under strict selection in vivo. Alternatively, selective pressure may exist for certain synonymous substitutions. This may be due to codon bias aimed at maintaining a high G+C content (65.6%) and thus limiting the number of synonymous substitutions. Other explanations consistent with M. tuberculosis biology include a low recombination frequency, a small population size, or a recent bottleneck in M. tuberculosis evolution.
The sequencing of approximately 8.8 Mbp of M. tuberculosis (the combined complete sequences of strains CDC1551 and H37Rv) provided approximately 74 LSPs and more than 1,000 SNPs as potentially informative markers. Based on the consistency indices of the 17 LSPs and 10 SNPs evaluated, 5 and 2, respectively, would be good phylogenetic markers. This compares favorably with the effort by Sreevatsan et al. (34) in sequencing 2 Mbp of 26 selected genes and the identification of 32 SNPs, of which 2 were present at high frequency. The comprehensive comparison based on a genome-wide scan between just two isolates suggests that polymorphisms between M. tuberculosis strains may be more extensive than initially anticipated and that such polymorphisms may have great value in providing comparative information for the basis of human colonization, infectivity, and virulence and informative loci for the analysis of evolutionary and phylogenetic relationships within the Mycobacterium genus and among clinical isolates.
We thank J. Parkhill, The Sanger Center, for review of the H37Rv electropherograms; A. Duesterhoeft and K. Dix, Qiagen Genomic, Inc., for Masscode verification of SNPS; and The Sanger Center for the availability of the M. bovis sequence data.
This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, grants RO1-AI40125 and AI46669.