Samples of six H37Rv strains obtained from separate laboratories were sequenced using an Illumina GenomeAnalyzerII (Table ). All strains were sequenced with a read length of 36 bp, though two were sequenced in single-end mode and four in paired-end mode (where reads are sequenced from both ends of ~250-bp fragments, providing additional localization constraints). The depth of coverage ranged between 29.4× (H37RvHA) and 151.3× (H37RvJO). The genomes were sequenced to >97.97% completion (defined as sites covered by at least one read), with the remaining sites with zero depth of coverage largely confined to the highly GC-rich coding regions of PE_PGRS genes, where sequencing is not as efficient. The spoligotypes of the strains were determined by matching sequencing data (reads) to the oligonucleotide sequences of the 43 spacers in the direct-repeat (DR) region (20
). While H37RvLP, H37RvJO, and H37RvCO have spoligotype patterns that match the general pattern for the H37Rv reference sequence (octal representation, 777777477760771; missing spacers 20 to 21 and 33 to 36), the other three strains (H37RvAE, H37RvMA, and H37RvHA) are missing an additional spacer, number 26. The distinct spoligotypes suggest a clustering of the strains into subgroups based on phylogenetic relationships (see below).
H37Rv strains sequenced in this study
There are 72 polymorphisms that are shared in common among all six H37Rv variants plus H37Ra, in comparison to the H37Rv reference sequence. These include 57 single-nucleotide polymorphisms (SNPs) and 15 insertions/deletions (indels) 1 to 3 bp in length, although polymorphisms in PE_PGRS genes were excluded (due to ambiguity caused by low coverage). Among the SNPs, there are 6 in noncoding regions, 21 synonymous mutations, and 30 nonsynonymous mutations. Rv0197 and pstA1 show truncations (mutations to a stop codon mid-open reading frame [ORF]), and pks3 and Rv1783 show mutations in their stop codons to extend their ORFs. There are eight indels in coding regions, causing frameshift mutations in Rv0197, PPE7, Rv0907, Rv1046c, Rv1575, Rv2251, Rv3655c, and sigM. Of the 15 mutations in noncoding regions, 6 are within putative regulatory regions (<100 bp upstream of coding regions). It should also be noted that there are two local clusters of SNPs (12 within a 137-bp span in PPE9 and 14 within an 80-bp span in PPE47) that are not included in this analysis.
There is only one site where there is a polymorphism shared only among the five H37Rv variant strains but not the H37Rv or H37Ra reference strains: A459399C in a noncoding region, at −84 bp upstream of Rv0383c. This SNP appears to be unique to the H37Rv lab strains as a group. H37Ra has a 55-bp deletion in this region.
Many of these sites are consistent with errors discovered in the H37Rv reference sequence upon resequencing by Zheng et al. (48
). Of 48 successfully resequenced sites with apparent SNPs in the H37Ra genome that matched CDC1551 but not H37Rv (excluding SNPs in PE_PGRS genes, PPE9, and PPE47), they found only 3 sites with genuine differences, and the rest turned out to be errors in the H37Rv sequence. Our results agree exactly with the resequencing, in that all of the 45 errors appear as shared SNPs among our H37Rv lab strains shown in Table , and the 3 genuine SNPs between H37Rv and H37Ra were not found in the H37Rv lab strains (position 754186, G to A; position 1077312, G to A; and position 4100975, C to T [these remain unique to H37Ra]). Nine of the 15 indels in our study were also resequenced by Zheng et al. (48
), and all were confirmed as errors in the H37Rv sequence, including the frameshift in amino acid (aa) 160 of sigM
(shortening the ORF from 222 aa in H37Rv to 196 aa in H37Ra).
Shared polymorphisms among all five H37Rv lab strains and H37Ra, relative to H37Rva
Polymorphisms unique to certain strains or shared among a subset of strains were also identified, potentially representing differences among the stock cultures (Table ). None of the differences were shared with H37Ra, which matched H37Rv at all these sites. Six polymorphisms were shared among H37RvAE, H37RvMA, H37RvHA, and H37RvCO, which are ATCC 27294 specific. One of these is a −CCG deletion in ponA1
(a predicted penicillin-binding transpeptidase/transglycosylase gene) corresponding to Pro630, which is an apparent reduction in copy number of a CCGn
tandem-repeat region from six copies in H37Rv and H37Ra to five copies. Four polymorphisms are shared by only H37RvAE, H37RvMA, and H37RvHA, including three SNPs (in lprO
[silent], Rv2604c, and a noncoding region) and +CAC in Rv2553c (a hypothetical protein). The similarity of these strains is consistent with the spoligotyping results, suggesting that these three strains form a derived subgroup. Three unique mutations are observed in H37RvAE (in Rv1063c [silent], fadD31
, and ppsA
). The mutation H620N in fadD31
, encoding an acyl coenzyme A (acyl-CoA) ligase, is outside the boundaries of the AMP-binding domain (spanning residues 70 to 553 according to Pfam [http://pfam.sanger.ac.uk
]) and thus is unlikely to affect function. The mutation in ppsA
is a nonsense mutation (W1294*). ppsA
encodes a multidomain polyketide synthase involved in PDIM biosynthesis, specifically the polymerization of the phthiocerol component. The mutation occurs near the C terminus of the protein product (total length, 1,876 amino acids) and would truncate the terminal acyl carrier protein (ACP) domain and thus presumably abrogate the function of ppsA
. It is most likely that this mutation is a one-off mutation (specific to the sample sequenced) that occurred between passaging and is not propagated in the parental strain, as this strain is regularly passaged in mice and retains virulence and PDIM production (W. R. Jacobs, Jr., unpublished results). Supporting this, we did not observe the ppsA
W1294* mutation in several isogenic mutants that we sequenced. Mutations in genes involved in PDIM biosynthesis are frequently observed in strains maintained in the laboratory without the selective pressure to maintain virulence (11
), reinforcing the value of regular passaging in mice.
Polymorphic sites containing mutations unique to one lab strain or shared among a subset (excluding polymorphisms in PPE and PE_PGRS proteins)
The two ATCC 25618 samples, H37RvLP and H37RvJO, share five mutations, including a frameshift mutation +GC in mas
(encoding mycocerosic acid synthase, which is also involved in PDIM biosynthesis) in amino acid 829 out of 2,111. This mutation is known to disrupt the function of mas
, as H37RvJO has been shown to be PDIM deficient (21
). However, both H37RvLP and H37RvJO are fully virulent in mice (21
). This would appear to contradict reports that PDIM deficiency correlates with reduced virulence (10
). One possible explanation for this discrepancy could be a difference in genetic backgrounds, e.g., a compensating mutation that allows these strains to retain virulence despite loss of PDIM. There are three mutations shared between H37RvLP and H37RvJO (other than the frameshift in mas
and a synonymous mutation in Rv0573c): R9H in Rv0480c, P402S in Rv0538, and a 17-bp deletion in Rv3785. Rv0480c encodes a putative nitrilase, but the mutation occurs in the N terminus, which is on the surface of the protein, not the active site, based on the crystal structure of the yeast homolog (Protein Data Bank [PDB] accession no. 1F89). Rv3785 is a hypothetical protein of unknown function. However, Rv0538 has previously been recognized as an immunogenic protein (38
). Rv0538 is a 548-amino acid membrane protein of unknown function containing Pro/Thr repeats, and hence it is also called proline-threonine repetitive protein (PTRP). In a study aimed at identifying antigenic proteins for potential diagnostic development, antibodies to Rv0538 were found with high frequency in HIV-negative, TB-positive patients (38
). Peptide epitope mapping identified four 20-amino-acid regions that each bound antibodies in sera of >50% of patients with active infections, and Pro402 is located directly in the middle of the third region, PT40. Though the connection between Rv0538 and virulence is still unresolved, this study directly links the site of the P402S SNP in Rv0538 shared by H37RvLP and H37RvJO to stimulation of the immune response of the host and therefore could be part of the explanation for how these strains retain their virulence despite the loss of PDIM production.
Several of the mutations observed among these six H37Rv strains (7 out of 20 in coding regions) are silent and thus unlikely to cause phenotypic differences. However, the functional relevance of the remaining mutations, many of which are conservative (e.g., D151E in Rv2604c and V393A in Rv3645), is unknown. The SNP in pks1
in H37RvLP probably does not cause a phenotype, since it is part of a disrupted polyketide synthase gene that produces phenolglycolipid (PGL) in other strains of M. tuberculosis
but which has been split into two ORFs (pks15
) by a frameshift mutation in H37Rv and thus is already nonfunctional (9
). Of the three SNPs that occur in noncoding regions, two are in putative regulatory regions near the transcriptional start sites of operons. The closest one is an A-to-G substitution 33 bp upstream of Rv0991c (encoding a hypothetical protein). Of four indels observed, two are in frame (−3 bp in ponA1
and +3 bp in Rv2553c), and two cause frameshifts (in mas
[discussed above] and Rv3785, encoding another hypothetical protein whose function is presumably disrupted in H37RvLP and H37RvJO).
The polymorphisms shared among certain strains suggest that they can be clustered by similarity. A phylogenetic tree was constructed using maximum parsimony (dnapars in PHYLIP 3.66) based on the polymorphic sites list in Table (Fig. ), augmented with genuine differences between H37Rv and H37Ra not attributed to sequencing errors (48
). As expected, this phylogeny suggests that H37RvLP and H37RvJO diverged from H37RvCO, H37RvAE, H37RvMA, and H37RvHA, forming two distinct clusters, consistent with their typing as ATCC 25618 and ATCC 27294, respectively. There are five ATCC 25618-specific polymorphisms and six ATCC 27294-specific polymorphisms distinguishing the highest-level branches. Subsequently, H37RvAE, H37RvMA, and H37RvHA diverged from H37RvCO as a subcluster. Several of the strains subsequently accumulated a small number of independent mutations (three in H37RvAE, five in H37RvLP, and three in H37RvJO). The number of polymorphisms in each strain is small compared to the number of differences between the H37Rv and H37Ra (30 SNPs, excluding those in PE_PGRS genes). Note that H37Rv is on a shallow branch because all of the sequencing errors have been filtered out, leaving only one site (position 459399, A to C) where it differs from all the other strains.
Maximum-parsimony tree of relationships among strains. Branch lengths represent number of nucleotide differences (total tree length = 57).
In addition to the small number of substitutions and insertions/deletions distinguishing these variant strains of H37Rv, there are also several differences in insertion sites of the IS6110
transposable element. All six strains have the same 16 copies of IS6110
as in the H37Rv reference strain, with three exceptions. First, H37RvLP and H37RvJO have lost IS6110
12, located at coordinate 3.55 Mb. Second, H37RvJO has a novel insertion site at coordinate 3.49 Mb, disrupting Rv3128c, encoding a hypothetical protein of unknown function. This site occurs in a 556-bp region called NTF-1, in which Beijing strains of M. tuberculosis
also have one or two unique IS6110
). Third, H37RvCO contains a novel IS6110
insertion site in plcD
617 bp upstream from the site in H37Rv. In addition, H37RvAE, H37RvMA, and H37RvHA appear to have only one copy in this region and lack the intervening portion of plcD
. This can be explained by a series of three IS6110
insertion/loss events (Fig. ). H37Ra shows an insertion in the coding region of plcD
, along with an additional one downstream, creating an adjacent pair and spanning several genes in between (including mmpL14
). It was postulated that the IS pair may have been ancestral and that the 8-kb region, including the upstream IS element, was lost in H37Rv (25
). It appears that after this event, a new insertion of IS6110
occurred even further upstream in plcD
in a progenitor of four of the six H37Rv lab strains, resulting in the pattern observed in H37RvCO. Subsequently, this IS and the intervening portion of plcD
in H37RvAE, H37RvMA, and H37RvHA were deleted, probably through homologous recombination with the IS6110
element downstream. Given the frequency of insertion events in this region, it is likely that plcD
represents a “hot spot” for IS6110
insertion sites (37
). However, these insertion/deletion events probably do not have any functional relevance, since they are in a gene that is already disrupted.
Proposed history of IS6110 insertion/deletion events in the region of plcD. “hypo” indicates coding regions annotated as encoding hypothetical proteins in the H37Ra genome.