We included TB cases that had valid genotyping data and were reported by the 50 states and Washington, D.C., to the CDC National Tuberculosis Surveillance System during 2004–2010, the latest genotyping data available at the time of this analysis. Genotyping data were obtained from the CDC National Tuberculosis Genotyping Service by methods described elsewhere. M. tuberculosis culture isolates were analyzed to determine spoligotype and 12-locus mycobacterial interspersed repetitive units-variable number tandem repeats (MIRU-VNTR) pattern. Two patients were considered to have matching genotypes if their isolates had indistinguishable spoligotype and MIRU-VNTR patterns. A genotype cluster was defined as 2 or more TB patients with matching genotypes in the same geographic area.
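The matching rule above can be illustrated with a minimal sketch. It assumes each isolate is reduced to a spoligotype code, a MIRU-VNTR pattern string, and a single area label. The `Isolate` type and its field names are hypothetical, and a fixed area label is a simplification of the spatial scan used in the actual analysis.

```python
from collections import defaultdict
from typing import NamedTuple


class Isolate(NamedTuple):
    # Hypothetical record layout, for illustration only.
    patient_id: str
    spoligotype: str  # spoligotype code
    miru: str         # 12-locus MIRU-VNTR pattern
    area: str         # geographic area label (assumed to be precomputed)


def genotype_clusters(isolates):
    """Group patients whose isolates share spoligotype, MIRU-VNTR pattern, and area."""
    groups = defaultdict(list)
    for iso in isolates:
        groups[(iso.spoligotype, iso.miru, iso.area)].append(iso.patient_id)
    # A genotype cluster requires 2 or more patients with matching genotypes.
    return {key: ids for key, ids in groups.items() if len(ids) >= 2}
```

Patients whose genotype is unique within their area fall out of the result, because a single patient does not form a cluster.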
The statistical program SaTScan, version 9.1.0, was used to identify spatially concentrated clusters of TB cases with a specific genotype during 2006–2010; we used residential zip code as the geographic unit of measurement. We applied the discrete Poisson probability model, using all culture-positive TB cases as the background population. SaTScan uses the spatial scan statistic, based on the log-likelihood ratio (LLR), to determine spatial concentration of cases in a cluster. SaTScan identified clusters in order of significance, starting with the smallest p-value. Cases not yet assigned to a cluster were then evaluated to identify additional clusters. Parameters were set so that each case could belong to only 1 cluster. Clusters with both significant (p<0.05) and nonsignificant concentrations were included in the cohort. We set a cluster's radius to be no more than 50 kilometers, as previous analyses demonstrated that setting a maximum of 100 kilometers produces the same clusters, while a 20-kilometer radius may split clusters. To focus on incident rather than endemic clusters (i.e., those present over a long period of time), we restricted our analysis to new clusters, defined as those in which the initial case occurred during 2006–2008 and was preceded by a 24-month period of no reported cases. Routine genotyping was not initiated in all areas at the same time, and in 2006, national genotyping coverage, defined as the proportion of culture-positive cases with a reported genotype in the National Tuberculosis Genotyping Service, was 70%. When an area first begins genotyping, all clusters in the area will appear to be new. To avoid inclusion of endemic clusters as newly emerging strains from areas with incomplete genotyping coverage, we excluded clusters if the county with the most cases had <75% annual genotype coverage. If a cluster had the majority of its cases in a county that did not meet this criterion, the entire cluster, not just the cases, was excluded.
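The new-cluster and coverage criteria can be expressed as a short eligibility check. The following is a sketch under stated assumptions, not the procedure actually run. The function names are hypothetical, earlier cases with the matching genotype in the same area are assumed to be supplied as a list of dates, and the county coverage is assumed to be passed in as a single proportion for the relevant year.

```python
import calendar
from datetime import date


def shift_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month's end."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def is_eligible_new_cluster(first_case, earlier_matching_cases, county_coverage):
    """Apply the three criteria described in the text to one cluster.

    first_case: diagnosis date of the cluster's initial case.
    earlier_matching_cases: dates of earlier cases with the same genotype and area.
    county_coverage: annual genotype coverage (0 to 1) of the county with the most cases.
    """
    # Initial case must fall in 2006-2008.
    starts_in_window = 2006 <= first_case.year <= 2008
    # No reported case in the 24 months before the initial case.
    lookback_start = shift_months(first_case, -24)
    quiet_before = not any(
        lookback_start <= d < first_case for d in earlier_matching_cases
    )
    # The whole cluster is excluded if the main county has <75% coverage.
    covered = county_coverage >= 0.75
    return starts_in_window and quiet_before and covered
```

Because the coverage test is applied to the county with the most cases, a failing county removes the entire cluster, not only the cases reported from that county.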
To ensure that clusters had an equal chance (i.e., an equal time period) to become outbreaks, we established a standardized observation period of 24 months after the third case. The 24-month follow-up period was derived from an analysis of the time between the third and sixth cases (or the last case, if the cluster did not reach 6 cases by 2010). The longest time interval observed was 23.9 months (data not shown). This approach identified 148 new clusters of at least 3 cases that could be observed for 24 months.
Although no standard definition of a TB outbreak exists, for the purposes of this analysis, we defined an outbreak as a cluster that grew from 3 to at least 6 cases during the observation period, in which at least 2 of the cases could be linked epidemiologically (i.e., had spent time in the same place when at least 1 of them was contagious), and in which the cluster was confirmed to be an outbreak by local public health officials (usually state as well as county TB control officers). Time intervals between dates of diagnoses were calculated based on the earliest of 3 possible dates (i.e., the date a patient specimen was collected for drug susceptibility testing, the date TB treatment was initiated, or the date the patient was counted as a verified TB case). We used the rate of initial cluster growth as a predictive variable; we considered the times between diagnosis of the first and second case, between the first and the third case, and between the second and the third case. We used SaTScan to determine which clusters were significantly concentrated (p<0.05) at the time of the third case. (SaTScan could not define significance for 3 clusters.)
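The date rule, the growth intervals, and the outbreak definition can be sketched as follows. All function names are hypothetical. Skipping missing dates and treating the end of the observation period as inclusive are assumptions not stated in the text. The caller is assumed to supply `window_end`, the date 24 months after the third case.

```python
from datetime import date


def diagnosis_date(specimen_collected, treatment_started, counted_as_case):
    """Return the earliest of the 3 candidate dates (missing dates are skipped; assumption)."""
    known = [
        d for d in (specimen_collected, treatment_started, counted_as_case)
        if d is not None
    ]
    return min(known)


def growth_intervals(case_dates):
    """Days between the first, second, and third diagnoses in a cluster."""
    first, second, third = sorted(case_dates)[:3]
    return {
        "first_to_second": (second - first).days,
        "first_to_third": (third - first).days,
        "second_to_third": (third - second).days,
    }


def is_outbreak(case_dates, window_end, epi_linked, confirmed_by_officials):
    """Apply the working outbreak definition to one cluster.

    window_end: end of the observation period (24 months after the third case).
    epi_linked: True if at least 2 cases could be linked epidemiologically.
    confirmed_by_officials: True if local public health officials confirmed an outbreak.
    """
    # Count cases diagnosed up to and including the end of the window (assumption).
    cases_in_window = sum(1 for d in case_dates if d <= window_end)
    return cases_in_window >= 6 and epi_linked and confirmed_by_officials
```

All three conditions must hold, so a cluster that reaches 6 cases within the window is still not classified as an outbreak without an epidemiologic link and local confirmation.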
Other predictive variables were based on patient characteristics reported to the National Tuberculosis Surveillance System, described elsewhere. The unit of analysis was the cluster. A cluster was considered “exposed” to a characteristic if any 1 of the first 3 patients had that characteristic. For brevity in this report, we refer to patients who reported homelessness or excess alcohol use or illicit drug use in the 12 months before diagnosis, or who reported being incarcerated at the time of TB diagnosis, as being “homeless, incarcerated, or drug or alcohol users.” Clusters with at least 1 of the first 3 patients who reported any of these conditions are described as “marginalized” in this context. Socioeconomic measures for crowding, education, and unemployment were derived from the 2000 U.S. Census; median values were calculated for all zip codes. A cluster was considered “exposed” if the zip code with the most cases had a value above the median. For clusters from multiple zip codes with equal numbers of cases, one zip code was randomly selected. The influence of genotype lineages was assessed based on spoligotyping of TB in the United States; lineages include M. bovis as well as the subgroups of Indo-Oceanic, Euro-American, East Asian, and East African-Indian. (M. africanum was not identified in the cohort.) Univariate analysis was performed to describe a cluster's risk for becoming an outbreak. We used SAS 9.2 (SAS, Cary, NC, USA) to calculate relative risks and 95% confidence intervals (CI).
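For illustration, the cluster-level exposure rule and a relative risk with a 95% CI can be sketched as below. The text does not state which interval method SAS was configured to use. This sketch assumes the common log-based (Wald) interval, and the function names and 2×2 cell labels are hypothetical.

```python
import math


def cluster_exposed(first_three_flags):
    """A cluster counts as exposed if any 1 of its first 3 patients has the characteristic."""
    return any(first_three_flags[:3])


def relative_risk(a, b, c, d, z=1.96):
    """Relative risk of becoming an outbreak, with a log-based Wald interval (assumed method).

    a: exposed clusters that became outbreaks
    b: exposed clusters that did not
    c: unexposed clusters that became outbreaks
    d: unexposed clusters that did not
    All four cells must be nonzero for this formula to be defined.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for two independent binomial proportions.
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper
```

An interval that excludes 1 indicates that the characteristic is associated with a cluster's risk of becoming an outbreak at the 5% level under this approximation.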
SAS JMP 9.0.1 was used for the decision-tree analysis, based on recursive partitioning, to determine which combination of variables best predicted clusters that became outbreaks. JMP compares possible binary partitions based on the LogWorth statistic, which is calculated as -log10(adjusted p-value), where the adjusted p-value takes into account the number of different ways partitions can occur for each variable. JMP determines the partition that best predicts the outcome of interest for both continuous and categorical variables. When the decision-tree analysis identified a partition that resulted in a node with fewer than 20 clusters, we stopped the partitioning.
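The LogWorth comparison and the stopping rule can be sketched as follows. This is not JMP's implementation. The adjusted p-values are assumed to be supplied by the caller, because the adjustment itself is not described here, and the function names and tuple layout are hypothetical.

```python
import math


def logworth(adjusted_p):
    """LogWorth = -log10(adjusted p-value); larger values indicate a stronger split."""
    return -math.log10(adjusted_p)


def best_partition(candidates, min_node_size=20):
    """Pick the candidate binary partition with the largest LogWorth.

    candidates: list of (label, adjusted_p, left_size, right_size) tuples, where the
    sizes are the numbers of clusters in the two resulting nodes.
    Returns (label, logworth, stop), where stop is True if either resulting node
    has fewer than min_node_size clusters, signalling that partitioning should end.
    """
    label, adjusted_p, left, right = max(candidates, key=lambda c: logworth(c[1]))
    stop = min(left, right) < min_node_size
    return label, logworth(adjusted_p), stop
```

The sketch only flags when the stopping condition is met. Whether the final small-node partition is kept in the tree is left to the caller, because the text does not specify it.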
Approval by an institutional review board was not required because data were collected and analyzed for this project as part of routine TB surveillance, and the project was therefore not considered research involving human subjects.