VNTR typing

*M. bovis* is a member of the closely related

*Mycobacterium tuberculosis* complex, which consists of several species with a shared ancestry

[27] but which have evolved marked but not absolute host preferences

[28],

[29]. On a global scale, the

*M. tuberculosis* complex can now be subdivided into discrete lineages, which show strong phylogeographical localisation to regions

[30],

[31]. VNTR profiling is a genotyping technique based on determining the copy number of a series of short, simple DNA repeats, originally identified by genome analysis

[32]. However, while mutations in VNTR type have been observed within the timescale of observation at the regional level, in most cases, probable transmission events are associated with the same VNTR-type, therefore requiring finer resolution typing to order the members of these groups.

Herd-level

*M. bovis* genotyping has been performed by the Agri-Food and Biosciences Institute, Belfast, UK since 2003, as follows. The first (disclosing)

*M. bovis* isolate from all bovine TB herd incidents was subjected to genotyping (eight-VNTRs and spoligotyping convention). Heat-inactivated

*M. bovis* cell lysates were used directly as PCR-ready templates. VNTR profiling, spoligotyping, nomenclature, reference strains and quality control were as described

[32]. The inferred tandem repeat copy number at each VNTR locus was used to produce a concatenated multi-locus VNTR profile (a string of integers), which was then simplified to a number indicating the prevalence of that profile. Genotype 001 (SB0140), with a spoligotype of SB0140, was the most prevalent in Northern Ireland when surveyed in 1999 to 2003

[32]. Spoligotypes were named according to an agreed international convention (

www.mbovis.org).

DNA extraction and sequence analysis

*M. bovis* was isolated and confirmed from suspect bovine granulomatous tissue using standard protocols. Confirmed cultures were grown to single colonies on LJ slopes and single colonies were amplified for DNA extraction using the standard CTAB and solvent extraction protocol

[34]. Extracted DNA was sequenced at the Sir Henry Wellcome Functional Genomics Facility at the University of Glasgow using an Illumina Genome Analyser IIx. Pair-end reads of 70 bp in length, separated by an average of about 350 bp, were trimmed from both ends based on phred quality scores so as to result in an error rate of 0.001 or less for each base call in the remaining sequence. Reads were mapped to a published UK reference genome (AF2122/97)

[35] using the Geneious assembler under the “medium-low sensitivity” option, allowing for a maximum of 10% gaps and mismatches per read

[36]. The reference sequence belongs to the same spoligotype (SB0140) as VNTR type 10 and shares identical repeat numbers with it for four out of eight loci used for typing. Mapping resulted in greater than 99% genome coverage with at least 1× and an average read depth of 60–112× for all isolates (see Table S1 in

Text S1 for full details). Consensus sequences were generated from the mapped contig based on the quality score sum for each position. A cumulative quality score threshold of 60 (corresponding to an error probability of 1 in 1,000,000) was applied to each position to ensure that accuracy of the final consensus sequence was dependent on both quality and read depth, rather than read depth alone. Below this threshold, the consensus base call was scored as unknown (“N”). Alignment of consensus sequences was carried out using Mauve

[37], as implemented within Geneious, assuming collinear genomes and with automatic calculation of seed weight and of the minimum Locally Collinear Blocks (LCB) score. Regions that were difficult to align or which contained >3 consecutive columns of unknown bases or gaps were removed from the final alignment. Similarly, sites that were polymorphic solely due to one or more sequences having ambiguity base calls were removed; this was the only context in which ambiguities were observed. The final alignment, which still represented 99.2% of the reference genome, thus only contained dimorphic single site polymorphisms situated within otherwise invariable regions.

After stripping identical sites, a total of 39 SNPs were identified (38 substitutions, 1 deletion, Table S2 in

Text S1), of which seven were shared between two or more sequences. All SNPs were examined to confirm their validity before further analysis. Of particular concern was the potential inclusion of spurious SNPs associated with repeat regions in the genome for which mapping may be unreliable. While four of the SNPs were found to fall either in or close to potentially problematic regions, the reliability of the mapping and SNP calling could be confirmed in all four cases (see Supplementary Materials). All SNP calls were supported by at least 38× coverage, with high consistency among reads (usually>95%). The only exception to this was a SNP in position 221927 (G−>A), for which consensus calling was ambiguous in one of the four isolates in which it occurred (Herd5_E_2010, 92×, A: 64%, G: 36%, Table S2 in

Text S1). Preliminary analyses further showed that the phylogenetic information provided by this site was in conflict with that of other informative positions (which were in complete agreement). Because these observations raised doubts about the reliability of scoring this SNP as well as about the information it provided, the site was removed from the data set. The final data set was used to generate a maximum likelihood tree using phyml

[38] under a Jukes-Cantor model using a heuristic search and the reference genome for outgroup rooting.

All sequence data generated for this project are available from the European Nucleotide Archive (

http://www.ebi.ac.uk/ena/) under accession number ERP001418.

Modelling of transmission chains

Using the anonymised life history data of the herds containing genotyped isolates, we adapted a method previously used to study foot and mouth disease virus epidemiology and transmission phylodynamics at the between-farm scale

[6],

[19]. All cattle were assumed to be in one of four states that reflect the critical points in the cattle infection process

[18],

[39]–

[41]: susceptible (S), exposed (E), where an animal is infected but neither tests positive nor infects others, potentially test positive (T), where the animal can test positive but is not yet infectious and infectious (I), with a constant transition rate from the E to T states and from the T to I states. States T or I can be truncated by a positive test. Knowing the dates each reactor (an animal that tests positive) was detected, we use the transition rates to determine the set of distribution functions for each animal describing the probability of being in each of the four infection states at any time in its life. We used these distributions to determine the likelihood of the observed pattern of reactors in the cluster of herds.

This key simplification (conditioning on the observed test data) is less rigorous than an approach where the likelihood is integrated forward across all possible dates of infection and the observed data is treated as a random variable. However the latter approach would have been computationally prohibitive for our system. The simplification implies that mean generation times are consistent over the observed timeframe, even though they would naturally be expected to contract over the course of an epidemic

[42]. The assumption is expected to be reasonable if the disease is in an endemic state, or an endemic state is rapidly reached after the initial introduction, where ‘rapid’ is relative to the observed timescale – i.e. the distribution across states is ‘quasi-stationary’, with E and T proportional to the reciprocal of their respective transition rates, i.e.

and

respectively. Though observation of our incidence data suggested this to be likely, to test this assumption, we allowed the proportion in state I at the time of detection to vary over the course of the epidemic, with probability

that it is infectious (i.e. in the I state) and

for being in the T state, where

*t* is the time since the genotype was initially detected. Here, we have assumed that

has the form

, where

*a* and

*b* are fitted constants and

is a probability with range [0…1].

We assumed that all cattle are equally infectious if in state ‘I’ but that an infected animal does not contribute to further infection after it is detected; most reactor cattle are removed from the herd within days of being identified. All infection is caused by either infectious cattle or by a reservoir (either undetected infected cattle within the herd or external, probably local, factors).

At this population scale, we assumed that the probability of a false positive is negligible. To account for untested and undetected contributions to the epidemic, we assumed all test-negative/untested cattle to have a fitted probability of being infectious.

The date(s) of an animal “

*i*” having a positive test is denoted as

, at which it may be infectious with a probability

, where

is the time the outbreak was initially detected in the herd and

*t* is equal to

at the point of evaluation. We assumed that a reactor in the infectious state became infectious (i.e. moved from the test sensitive state) within a maximum of

days prior to the positive test.

is the maximum infectious lifetime of the animal prior to detection. Because a uniform test is applied to all cattle over 6 months of age on an annual basis, the date of first test after infection would fall in the range of 0 to 1 year; therefore we assume that the duration of the effective infectious period is uniform. This has the (conservative) effect of minimising the potential duration of cattle-only transmission chains since it does not allow for long infectious periods. The probability that a reactor was infectious at any time prior to the date it tested positive is:

All other model states were assumed to be exponentially distributed. The probability that an individual is test sensitive at a given time

*t* is therefore

where

is the transition rate from the T to the I states in the forward SETI model. The parameter

determines the duration of the state T and therefore the time when the backwards T to E transition occurs, conditional on the backwards transition from I to T and the time of detection

. The first term accounts for the animal being in the potentially test positive state at

and the second term accounts for the animal being infectious at

. Here,

is the increase in

that occurs due to transition from

at time

*t′*. Similarly the probabilities of being in the exposed and susceptible states are

where

is the transition rate from the E to the T states in the forward-in-time model. The forward-in-time transition from the S to the E state is determined by

and is estimated via the calculated force of infection based on the probabilities of being in the I state. It is incorporated directly into the calculation of the likelihood as described below.

A similar approach was used for animals identified by a post mortem test in which case

is the date the animal was post mortem examined. We assumed further that post mortem testing only identifies the same infection classes as the standard tuberculin test (i.e. only

*T* or

*I* class individuals are detected).

We calculated the probability that an animal became infected by considering the forces of infection on it during its lifetime. The life history of each animal that resided in the cluster between January 1990 and 31 December 2010 was converted into a lookup table of the herds in which it resided in fortnightly time steps. If an animal moved between herds in a time step it was treated as belonging to both herds since it can contribute to an outbreak in both herds during that step. Animals born into a herd are added to the lookup table for that time step and conversely death results in an animal being removed from the lookup table – movements back to the herd cluster via other herds are also retained. Using this lookup table we were able to calculate the probability an animal became infected in each period.

We calculate the probability that an animal would have become infected at any time

*t*,

, given that is was a known reactor based on the force of infection calculated from the model. If an animal was a reactor, then the incremental probability

that an animal,

*i*, was infected by all sources over the time interval

*t*,

*t+dt* is therefore given as

calculated using the explicit infection histories. If the animal never tested positive then

i.e. we consider the probability that an animal was infected despite never testing positive. Young calves destined for slaughter at a young age are test exempt and are considered to be a negligible risk due to their short lifespan; they are therefore excluded from the analysis. The parameter

is the transmission rate from a contact with an infected animal in the same herd at the same time (the summation is over animals in the same herd at

*t*). The reservoir term,

, can be written as

i.e. a combination of any external forces of infection,

, (e.g. infected wildlife) that is represented by a fixed rate and

, the fraction of herd at

*t* that never tested positive, for which the contribution of latent infections is summarized by the probability

*P*_{latent}.

We calculated the probability each animal was infected during its lifetime by integrating

over the animals' entire life (approximated by a summation over fortnightly time steps) and thus calculate the likelihood for the model as:

Here,

is the fitted sensitivity of the test for bTB infection. Consistent with the low number of reactors in officially TB free countries such as Scotland, the test specificity is assumed to be effectively 100%. Using

*L*, we used the Metropolis-Hastings algorithm to derive the posterior distributions over the parameter space defined by

. The length of burn-in for the MCMC simulations was checked using multiple MCMC chains with dispersed starting points, perturbing a subset of the parameters at each step in the chain (further descriptions in the supplementary information, and Figures S4 and S7 for the evolution of the MCMC chains and the posterior distribution of the parameters). Prior distributions were chosen to be uniform, with ranges

,

,

,

,

,

,

,

and

, based on the parameters estimates previously reported in various field and experimental studies

[43]–

[45].

A running sum of the probability of a transmission event (

) was retained, ignoring potential transmissions that were incompatible with the phylogenetic tree (for example between animals with isolates in different clades of the tree, excluding these animals from being linked by transmission). We used a modified Dijkstra's algorithm

[46] to identify all possible infection paths between cattle for which we have isolates. Each link “

*i*” in a defined chain has a probability

*p*_{i} of being associated with transmission, so that the probability of that transmission chain occurring is given by the product of the

*p*_{i}'s. Then the probability that a path will exist between two individuals

*A* and

*B* is simply the probability that at least one of the identified possible chains will connect the two, and therefore

The calculation included all chains that do not pass through another sequenced isolate, and for computational convenience, we have limited the calculation to chains of less than three individual intermediate cattle. In order to test our assumptions, we ran forward simulations of a SETI model, based on the most likely parameter values of the posterior distributions from the outbreak data analysis. Applying our estimation approach to the simulated outbreak data confirmed that all the input parameter values fell within the 95% credible intervals of their corresponding posterior distributions derived from the original analysis of the data (not shown).