WGS analysis of all M. tuberculosis isolates from a recent outbreak of TB enabled us to 1) identify microevolutionary events within a chain of transmission, 2) confirm the epidemiologic links and the directionality of the chain of transmission, and 3) determine the number of SNPs per transmission event during this outbreak.
The median time between the diagnosis of patient A1 and the diagnoses of the 5 patients with transmission confirmed by the WGS data was 9 months (range 4 to 15 months), and the median time between the diagnoses of B1 and B2 was one month. Both patients A1 and B1 had multiple cavities, an indicator of high bacterial burden and a factor associated with a patient’s increased ability to transmit infection. Although none of these patients were infected with HIV, a known risk factor for rapid progression to disease, a recent analysis of TB outbreaks in the US found that most patients did not have established medical risk factors (although not all patients were tested for HIV), but they did have social risk factors for TB. In our study population, all patients were recent immigrants to the US, with an average stay of 30 months in the country, and may have had multiple factors that increased the risk of developing active TB. While we cannot rule out the possibility that the patients were infected by the same bacteria in a previous outbreak and that they all reactivated over a 22-month period, it is unlikely that 9 patients would develop active TB from the same unique latent strain within such a short period. The MIRU genotype of this M. tuberculosis strain was not found in the genotyping databases from San Francisco, California, USA, the CDC TB GIMS (TB Genotyping Information Management System), or the international MIRU-VNTR website (November 28, 2012), which suggests that this may not be a prevalent M. tuberculosis strain.
The sequences of the isolates from all patients with the outbreak strain were identical except for seven SNPs that were confirmed by Sanger sequencing. These polymorphisms were useful in identifying genetic variations in a well-described outbreak. These data, together with detailed epidemiologic information, enabled us to study the microevolutionary events during transmission in detail. The genomic and epidemiologic data indicated that one patient (A1) transmitted M. tuberculosis to multiple contacts, and only one of these contacts became a source of secondary cases (B1). The use of WGS to identify a single additional source of infection is an important finding. With the clinical and epidemiologic data alone, it was difficult to ascertain the directionality and sequence of transmission because the cases were all diagnosed within a short period of time.
As indicated by WGS, the M. tuberculosis isolate from patient A1 resulted in five different variants of M. tuberculosis. We do not know if these mutations occurred randomly and then expanded or were the product of natural selection. We cannot rule out the possibility that some of the mutations may have affected the pathogen’s transmissibility and ability to cause active disease. The M. tuberculosis isolate from patient A2 had an nsSNP in Rv2629, a gene induced during hypoxia. The isolate from B1 had an nsSNP in Rv0146, a gene involved in the innate responses of primary macrophages. The SIFT scores for these genes suggested that these mutations may affect protein function. Further research is required to determine what effect, if any, mutations occurring during transmission events have on the pathogen, and whether these mutations are the result of chance or of the transmission event acting as a selective force on the bacterial population.
The number of SNPs we found was much lower than the 204 SNPs reported by Gardy et al., who performed WGS of 32 M. tuberculosis isolates from an outbreak using the same sequencing platform that we used. The large discrepancy between the numbers of SNPs found by Gardy et al. and in our study is probably due to the criteria used to define a SNP. Based on the reported methodology, Gardy et al. did not confirm the data obtained by WGS. They also included 76 SNPs located in paralog gene families, some of which may be false SNPs due to mapping errors. SNPs detected by WGS in these genes were not further analyzed in our study, and we included only WGS-detected SNPs in which more than 85% of the base calls were homozygous and that were subsequently confirmed by PCR sequencing. The number of mutations we found is similar to that reported by Schürch et al., who performed WGS analysis on three isolates (from years 1992, 2004 and 2006) of M. tuberculosis that had the same IS6110-RFLP genotyping pattern. These isolates were part of a prevalent cluster in the community that was composed of 104 cases identified during 1992–2008, and the authors found eight SNPs among the three sequenced isolates. These SNPs were subsequently investigated in all 104 isolates of the cluster, resulting in the identification of secondary and tertiary source patients. It is interesting that the number of SNPs from a strain prevalent in one population for 14 years is similar to the number of SNPs from a non-prevalent strain causing an outbreak over a 22-month period. This may be partly explained by the fact that not all the isolates were sequenced by Schürch et al. Recently, Walker et al. sequenced 114 paired isolates from individuals and household members with tuberculosis, and in 96% of the cases the whole genomes differed by five or fewer SNPs, which is compatible with our finding.
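The SNP inclusion criteria described above can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record fields (`position`, `gene_family`, `major_allele_fraction`, `sanger_confirmed`), the excluded-family labels, and all example values are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical labels for regions that are excluded from SNP analysis.
EXCLUDED_FAMILIES = {"PE", "PPE", "mobile_element", "paralog"}
# The text requires "more than 85%" of base calls to agree, so the bound is strict.
MIN_ALLELE_FRACTION = 0.85


@dataclass
class CandidateSNP:
    position: int                 # genome coordinate (hypothetical values below)
    gene_family: str              # annotation label for the affected region
    major_allele_fraction: float  # fraction of base calls supporting the variant
    sanger_confirmed: bool        # True if PCR/Sanger sequencing confirmed the call


def passes_filter(snp):
    """Return True only if the SNP meets all three inclusion criteria."""
    if snp.gene_family in EXCLUDED_FAMILIES:
        return False
    if snp.major_allele_fraction <= MIN_ALLELE_FRACTION:
        return False
    return snp.sanger_confirmed


def filter_snps(candidates):
    """Keep only the candidates that pass the filter, preserving input order."""
    return [snp for snp in candidates if passes_filter(snp)]


# Illustrative candidates only; these are not data from the outbreak.
candidates = [
    CandidateSNP(1200, "other", 0.97, True),   # kept
    CandidateSNP(3400, "PPE", 0.99, True),     # dropped: excluded family
    CandidateSNP(5600, "other", 0.80, True),   # dropped: too few agreeing calls
    CandidateSNP(7800, "other", 0.92, False),  # dropped: not confirmed
    CandidateSNP(9100, "other", 0.85, True),   # dropped: not strictly above 85%
]
kept = filter_snps(candidates)
```

Stricter criteria of this kind reduce false SNPs from mapping errors at the cost of possibly missing true variants in the excluded regions.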
WGS verified the epidemiologic links established through contact investigation and confirmed index case A1 as the source case of the outbreak. WGS also showed that it was patient A1, not patient B1, who transmitted the TB organism to patient C. In addition, patient D (who was symptomatic 2 weeks before diagnosis and was diagnosed almost 22 months and 7 months after patients A1 and C, respectively) had a mixed population of M. tuberculosis compatible with the M. tuberculosis isolates from patients A1 and C. It is possible that patient A1 had both the wild-type M. tuberculosis population and a small population of the SNP5 mutant, which was not cultured or detected by WGS, and transmitted both populations to patients C and D. The likelihood that patient D is the link between A1 and C is low, as the timeline is not compatible with this link and the contact investigation did not reveal any epidemiologic links. Similarly, we did not find epidemiologic links to patient E, who became symptomatic 4 months before diagnosis and was diagnosed 22 months after patient A1.
WGS was able to provide a link between cases that did not otherwise have an epidemiologic link. In patients with an identical strain, WGS indicated the sequence and direction of transmission, revealed the mutations acquired between transmission events, and gave an indication as to whether those mutations would affect the fitness of an organism or not.
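One way to picture how SNP data can suggest the direction of transmission is to treat each isolate as a set of SNPs relative to the ancestral sequence: an isolate whose SNP set strictly contains another's is compatible with lying downstream of it. The sketch below is illustrative only and is not the method used in the study; the function name, isolate labels, and SNP labels are hypothetical, and such an ordering cannot by itself show in which patient a mutation arose.

```python
def candidate_sources(snp_sets):
    """Map each isolate to its closest compatible upstream isolates.

    An isolate is a compatible source if its SNP set is a proper subset of
    the target's SNP set; among those, only the largest subsets are kept.
    """
    result = {}
    for name, snps in snp_sets.items():
        upstream = [
            other
            for other, other_snps in snp_sets.items()
            if other != name and other_snps < snps  # proper subset test
        ]
        if not upstream:
            result[name] = []
            continue
        best = max(len(snp_sets[u]) for u in upstream)
        result[name] = sorted(u for u in upstream if len(snp_sets[u]) == best)
    return result


# Hypothetical isolates and SNP labels, not the outbreak data.
isolates = {
    "P1": frozenset(),                    # matches the ancestral sequence
    "P2": frozenset({"snp_a"}),
    "P3": frozenset({"snp_a", "snp_b"}),
    "P4": frozenset({"snp_c"}),
}
sources = candidate_sources(isolates)
```

In this toy example, `P1` is the only compatible source for `P2` and `P4`, and `P2` is the closest compatible source for `P3`. Epidemiologic data are still needed to decide among compatible sources.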
Although WGS, combined with epidemiologic data, enables us to determine the most likely sources of infection, it is not possible to determine whether the mutations occurred while M. tuberculosis was replicating in the index case or in the secondary cases after transmission. Therefore, it is not possible to estimate the mutation rate per number of generations of M. tuberculosis or per unit of time.
WGS has the potential to become an important molecular epidemiologic tool, as it provides information about the microevolution of a strain during transmission as well as the source(s) of infection and the sequence of transmission events. This will be important because it can distinguish patients who are part of a recent chain of transmission from those with disease resulting from progression of remote infection. Moreover, WGS can be used in areas where the isolates causing TB are genetically similar (prevalent M. tuberculosis strains) and where current tools cannot distinguish between recent transmission and prevalent strains. Applying WGS to large population-based studies is still limited by the available laboratory and analytic tools. More importantly, the programmatic utility of WGS needs to be defined.
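As a rough illustration of how SNP data might be used to flag isolate pairs compatible with recent transmission, the sketch below counts the SNPs that differ between two isolates and applies a cutoff. The cutoff of five SNPs is an assumption borrowed from the household-pair observation cited above, not a validated programmatic threshold; the function names and example SNP positions are hypothetical.

```python
from itertools import combinations

# Assumption: cutoff taken from the "five or fewer SNPs" observation in the text.
SNP_THRESHOLD = 5


def snp_distance(snps_a, snps_b):
    """Number of SNP positions present in exactly one of the two isolates."""
    return len(snps_a ^ snps_b)


def close_pairs(snp_sets, threshold=SNP_THRESHOLD):
    """Return (isolate, isolate, distance) for pairs at or below the threshold."""
    pairs = []
    for name_a, name_b in combinations(sorted(snp_sets), 2):
        distance = snp_distance(snp_sets[name_a], snp_sets[name_b])
        if distance <= threshold:
            pairs.append((name_a, name_b, distance))
    return pairs


# Hypothetical SNP positions for three isolates, not data from the study.
example = {
    "iso1": {1001, 2002},
    "iso2": {1001, 2002, 3003},
    "iso3": {4004, 5005, 6006, 7007, 8008, 9009},
}
flagged = close_pairs(example)
```

Here only `iso1` and `iso2` fall within the cutoff. A small SNP distance is compatible with recent transmission but does not establish it, which is why the cutoff would need to be defined for each setting.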
Our study has limitations. The WGS was performed on DNA extracted from M. tuberculosis sub-cultured from the original liquid media. It is possible that the culture process caused a bottleneck, so that the sequenced population may not reflect the true genetic make-up of the bacteria causing the disease in the patient. Direct sequencing of M. tuberculosis from clinical specimens would eliminate such concerns. However, the SNP-based network correlated with most of the clinical and epidemiologic network, suggesting that this did not have a substantial impact on our findings. We did not identify small deletions and insertions; however, these are difficult to identify accurately. It is possible that the cultures had more heterozygosity, which we did not actively investigate except at the loci that were polymorphic. SNPs in mobile elements and in the PE and PPE families were excluded from the analysis. These last two account for 10% of the genomic coding potential. Recently, it has been shown that nsSNPs in these families are 3 to 3.3 times more frequent than in non-PE/PPE genes; therefore, it is possible that these isolates have mutations in these genes.
In conclusion, WGS was useful in determining the mutations that occurred during microevolution of an M. tuberculosis strain during a well-documented outbreak. The SNP data were useful to validate the directionality and sequence of transmission that were suggested by the epidemiologic data. Comparative genomic studies will be needed to determine if the mutations observed have any impact on the ability of the bacteria to transmit and cause disease.