Several studies have compared closely related strains of bacteria, including Helicobacter pylori, several Chlamydia species, and pathogenic and nonpathogenic E. coli. In general, these studies were limited to the sequencing of a comparative strain or species and identification of polymorphisms between the two but failed to extend the findings beyond the two strains being compared.
Several studies of the Mycobacterium genus describe LSPs among the M. bovis BCG vaccine strains and virulent M. bovis as well as among other tubercle bacilli (6). While these studies describe several regions of LSP, the methodologies of subtractive hybridization and RFLP analysis limit the resolution and therefore the utility of the markers described. Betts et al. (7), in their analysis of the M. tuberculosis proteome, attempt a comparative analysis of the genome contents of the H37Rv and CDC1551 strains. Several conclusions drawn from this analysis appear to be incorrect based on our present analysis. In particular, they describe eight ORFs completely unique to H37Rv, including Rv0278c, Rv0279c, Rv0746c, Rv0747, and Rv1087, all of the PE_PGRS family. The identification of a 5,742-bp deletion associated with ORFs Rv0278c and Rv0279c and a deletion of 4,910 bp associated with ORFs Rv0746c, Rv0747, and Rv0748 is incorrect and may be the result of analyzing an incomplete version of the sequence of strain CDC1551. Two other regions of strain H37Rv, bp 2,714,308 to 2,714,808 and bp 3,933,523 to 3,936,659, are incorrectly identified as being deleted in strain CDC1551. Their analysis of insertions in the strain CDC1551 genome relative to strain H37Rv also appears to contain several errors. They failed to identify approximately 15 regions of insertion in strain CDC1551 relative to H37Rv. Among these regions was a 4,425-bp region containing ORFs MT3426 and MT3427 and moaB paralogs, respectively. Most of the coordinates reported by Betts et al. (7) are incorrect with respect to strain CDC1551. This is again likely due to the analysis of an incomplete genome sequence.
We initially undertook the sequencing and annotation of a second M. tuberculosis strain, CDC1551, in an attempt to correlate genotypic changes with strain phenotype. Prior to our study it was generally accepted that little sequence variation existed among M. tuberculosis strains. Our study demonstrates that a much higher level of polymorphism is present within the M. tuberculosis species. This discovery complicates attempts to associate specific genotypic changes with phenotypic differences. However, it also provides several important opportunities.
First, it is likely that a subset of the polymorphic sequences lies within genes involved in host-pathogen interactions. A statistical analysis of the single-base substitution frequency in 877 gene families identified several families in which the frequency of substitution was significantly greater than that observed for the genome as a whole. These included the PE/PPE gene family, which encodes acidic, glycine-rich proteins that are postulated to be expressed on the extracellular surface and are considered potential antigens for host immunity (27). The higher substitution frequency in this family might be the result of antigenic variation or may be due to other interactions with the host. Three other gene families representing conserved hypothetical proteins also had a higher substitution frequency. These genes warrant further investigation as potential candidates for virulence or immunogenicity. The LSPs that we discovered may also encode proteins involved in disease pathogenesis, as demonstrated by the polymorphisms in one of the phospholipase C genes and members of the PE/PPE gene family.
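The comparison of a gene family's substitution frequency with the genome-wide frequency can be illustrated with a short sketch. The text does not state which statistical test was used, so the one-sided binomial test below is an assumption made purely for illustration, and the function names and the counts in the example are hypothetical rather than taken from the data.

```python
import math


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summing terms in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def family_enrichment(family_subs, family_bp, genome_subs, genome_bp):
    """Return (family rate, genome rate, one-sided p-value).

    Assumption: each base in the family is treated as an independent trial
    with the genome-wide substitution probability.
    """
    genome_rate = genome_subs / genome_bp
    p_value = binomial_upper_tail(family_subs, family_bp, genome_rate)
    return family_subs / family_bp, genome_rate, p_value


# Hypothetical counts for one gene family; not values from this study.
family_rate, genome_rate, p_value = family_enrichment(30, 20_000, 1_000, 4_400_000)
print(f"family {family_rate:.2e}, genome {genome_rate:.2e}, p = {p_value:.2e}")
```

Because many families are tested at once (877 here), a per-family p-value of this kind would normally be compared against a threshold corrected for multiple testing, for example by dividing the significance level by the number of families.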
Second, the level of LSPs and SNPs that we discovered among the M. tuberculosis isolates suggested that we could also develop a set of markers that would be valuable in studying the phylogenetics of the M. tuberculosis species and other tubercle bacilli. Analysis of two LSPs, an ORF encoding a membrane lipoprotein and an ORF encoding a putative adenylate cyclase, suggested conflicting scenarios of the evolutionary relationship of CDC1551, H37Rv, and M. bovis. However, phylogenetic analysis of 21 clinical isolates along with strains CDC1551, H37Rv, and M. bovis demonstrated that the adenylate cyclase region had a low consistency index and would be a poor phylogenetic marker, explaining its apparent contradiction with the membrane lipoprotein region. Other polymorphisms with high consistency indexes were discovered that are likely to be excellent markers for investigating the evolution and phylogenetics of this species.
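To make the notion of a consistency index concrete, the sketch below uses the standard definition: the minimum number of state changes a character could require (number of observed states minus one) divided by the number of changes it requires on a given tree, here counted with Fitch parsimony. How the indices in this study were computed is not described in this section, so the tree encoding, function names, and isolate labels are hypothetical and serve only as an illustration.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for a rooted binary tree.

    `tree` is either a leaf name (str) or a 2-tuple of subtrees;
    `states` maps each leaf name to its character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_steps + right_steps + 1


def leaf_names(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaf_names(tree[0]) + leaf_names(tree[1])


def consistency_index(tree, states):
    """Return minimum possible changes / observed changes, or None if invariant."""
    observed = fitch(tree, states)[1]
    if observed == 0:
        return None  # character does not vary, so the index is undefined
    minimum = len({states[name] for name in leaf_names(tree)}) - 1
    return minimum / observed


# Hypothetical five-taxon tree and presence (1) / absence (0) of two regions.
tree = ((("iso1", "iso2"), ("iso3", "iso4")), "outgroup")
region_a = {"iso1": 1, "iso2": 1, "iso3": 0, "iso4": 0, "outgroup": 0}
region_b = {"iso1": 1, "iso2": 0, "iso3": 1, "iso4": 0, "outgroup": 0}
print(consistency_index(tree, region_a))  # single change on the tree
print(consistency_index(tree, region_b))  # two independent changes needed
```

A region whose distribution requires repeated independent changes on the tree, like the second hypothetical character, receives a lower index and is correspondingly less useful as a phylogenetic marker.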
We demonstrated that the Ds is more than threefold greater than estimated previously (24). Interestingly, the Ds/Dn ratio was close to 1, unlike the much higher level expected if selection was against nonsynonymous but not synonymous substitutions. This could be due to decreased selective pressure against nonsynonymous mutations. For example, the prolonged passage of strain H37Rv in culture may have permitted the accumulation of nonsynonymous mutations in many genes that would otherwise be under strict selection in vivo. Alternatively, selective pressure may exist for certain synonymous substitutions. This may be due to codon bias aimed at maintaining a high G+C content (65.6%) and thus limiting the number of synonymous substitutions. Other explanations consistent with M. tuberculosis biology include a low recombination frequency, a small population size, or a recent bottleneck in M. tuberculosis.
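The distinction between synonymous and nonsynonymous substitutions that underlies this comparison can be sketched as follows. This is a simplified illustration and not the method used in this study: it only tallies codons that differ at a single base between two aligned coding sequences, using the standard genetic code, and returns raw counts. A proper per-site estimate would also normalize by the numbers of synonymous and nonsynonymous sites, which is not implemented here. The function name and example sequences are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_codon_changes(seq_a, seq_b):
    """Return (synonymous, nonsynonymous, skipped) codon counts.

    Both sequences must be aligned, in frame, and of equal length.
    Codons differing at more than one base, or containing characters
    other than A, C, G, T, are counted as skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    synonymous = nonsynonymous = skipped = 0
    for pos in range(0, len(seq_a), 3):
        codon_a = seq_a[pos:pos + 3].upper()
        codon_b = seq_b[pos:pos + 3].upper()
        if codon_a == codon_b:
            continue
        differences = sum(x != y for x, y in zip(codon_a, codon_b))
        if (differences != 1 or codon_a not in CODON_TABLE
                or codon_b not in CODON_TABLE):
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous, skipped


# Hypothetical aligned fragments: GCC->GCG (Ala, synonymous),
# GAA->GAT (Glu->Asp, nonsynonymous), CTG->TTG (Leu, synonymous).
print(classify_codon_changes("ATGGCCGAACTG", "ATGGCGGATTTG"))
```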
The sequencing of approximately 8.8 Mbp of M. tuberculosis (the combined complete sequences of strains CDC1551 and H37Rv) provided approximately 74 LSPs and more than 1,000 SNPs as potentially informative markers. Based on the consistency indices of the 17 LSPs and 10 SNPs evaluated, 5 and 2, respectively, would be good phylogenetic markers. This compares favorably with the effort by Sreevatsan et al. (34) in sequencing 2 Mbp of 26 selected genes and the identification of 32 SNPs, of which 2 were present at high frequency. The comprehensive comparison based on a genome-wide scan between just two isolates suggests that polymorphisms between M. tuberculosis strains may be more extensive than initially anticipated and that such polymorphisms may have great value in providing comparative information for the basis of human colonization, infectivity, and virulence and informative loci for the analysis of evolutionary and phylogenetic relationships within the Mycobacterium genus and among clinical isolates.