|Home | About | Journals | Submit | Contact Us | Français|
Identifying risk factors for the presence of Escherichia coli O157 infection on cattle farms is important for understanding the epidemiology of this zoonotic infection in its main reservoir and for informing the design of interventions to reduce the public health risk. Here, we use data from a large-scale field study carried out in Scotland to fit both “SIS”-type dynamical models and statistical risk factor models. By comparing the fit (assessed using maximum likelihood) of different dynamical models we are able to identify the most parsimonious model (using the AIC statistic) and compare it with the model suggested by risk factor analysis. Both approaches identify 2 key risk factors: the movement of cattle onto the farm and the number of cattle on the farm. There was no evidence for a role of other livestock species or seasonality, or of significant risk of local spread. However, the most parsimonious dynamical model does predict that farms can infect other farms through routes other than cattle movement, and that there is a nonlinear relationship between the force of infection and the number of infected farms. An important prediction from the most parsimonious model is that although only ~ 20% farms may harbour E. coli O157 infection at any given time ~ 80% may harbour infection at some point during the course of a year.
Escherichia coli O157 emerged over two decades ago and is now widespread in Scotland, where the incidence rate of human infection is generally higher than in most other UK, European or North American countries (Locking et al., 2006; Chase-Topping et al., 2008). Cattle are the main reservoir host for E. coli O157 (Armstrong et al., 1996) and play a significant role in the epidemiology of human infections (Griffin and Tauxe, 1991). Previous work has shown that direct contact with animals, their faeces and the farming environment are all important risk factors for sporadic human infections (O'Brien et al., 2001; Locking et al., 2001; Willshaw et al., 2003). Spatial analyses also identified a positive association between human infections and areas of high cattle density (Michel et al., 1999; Kistemann et al., 2004; Innocent et al., 2005). Although human infection may arise from person to person contact and from consumption of food contaminated by asymptomatic human carriers, primary human infection can be attributed to contamination of the environment or the food chain from several livestock species, especially cattle (Espie et al., 2006). Therefore understanding the mechanisms of persistence and spread of E. coli O157 on Scottish cattle farms is key to reducing the risk of human infection in Scotland and elsewhere.
Two recent surveys (Gunn et al., 2007; Pearce et al., 2009) concluded that c.20% of Scottish cattle farms harbour E. coli O157 infection, although there is variation in both time and space. Investigations with an exploratory mathematical model (Liu et al., 2007) suggested that although cattle movement is a significant contributor to the observed prevalence of E. coli O157 positive farms, it is not by itself sufficient to explain the persistence of E. coli O157 on Scottish cattle farms. The objective of this study is therefore to better understand the impact of other risk factors on the spread and persistence of E. coli O157 in Scotland and their interaction with cattle movement. To do this we developed a set of stochastic epidemiological models that represent different assumptions regarding the transmission of infection among farms. By comparing the goodness of fit of different models we gain insights into the underlying epidemiology of infection. Data were also analysed using traditional risk factor analysis in order to produce comparative results. Correspondence between the results from empirical statistical models and those from the fitting of dynamical process models has not, as far as we are aware, previously been examined in infectious diseases systems.
Data. Three separate databases were used in this study. First, the June 2003 Agricultural census of livestock premises combined with the Department for Environment, Food and Rural Affairs (DEFRA) list of livestock premises; second, the Cattle Tracing System (CTS); and third, E.coli O157 prevalence data. Each data set will be described briefly below.
We consider a metapopulation model where individual farms are regarded as either susceptible (S) or infected (I) (i.e. E. coli O157 positive at least one cow). E. coli infection is not pathogenic in adult cattle and infections are transient (Shere et al., 1998; Robinson et al., 2004); therefore infected cattle can recover to become susceptible and E. coli positive farms can lose their infected status.
One route of transmission from one farm to another is the movement of infected cattle between them. Because movement data are provided for each day, it is sensible to use a time unit of one day. We assume that if a susceptible farm i on day t receives Mij(t) cattle from an infected farm j on which a fraction xj is infected, the chance that farm i will become infected due to this movement event is 1 − (1 − xj)Mij (c.f., Green et al., 2006). Summing over all cattle moving onto the farm from infected farms, the probability that farm i will become infected on day t is
where I(t) is the set of infected farms at time t (and also the number of infected farms at time t, see below). The fraction of cattle infected on farm j, xj was randomly sampled (with replacement) from the distribution of on-farm prevalences found from IPRAVE survey data (Liu et al., 2007), and Mij(t) was obtained from DEFRA CTS data as described above.
Farms can also become infected through a variety of other routes including acquisition of infection from other host species and a contaminated environment. Let Λi represent the force of infection for farm i due to sources other than movements. Combining together, the overall probability that farm i becomes infected on day t is
When both contributions in transmission are small, Eq. (2) can be approximated as Pi(t) = Λi(t) + ∑ j I(t)xiMij(t), which suggests that the probability is proportional to the force of infection and the number of infected animals moved in. Λ(t) has units per day. The infected farms can recover to become susceptible again. For simplicity, we initially assume that this happens with a constant probability of recovery per day
The expression for the force of infection Λi depends on the assumptions made about the process of transmission. As E. coli O157 is transmitted via the faecal-oral route the main transmission vehicles are thought to be contaminated feed, water and grazing. Once in the environment, E. coli O157 can remain viable for long durations (Gagliardi and Karns, 2000; Williams et al., 2005). Transmission is easier between animals kept at higher densities and hence in closer proximity (Synge et al., 2003; Smith et al., 2009). A longitudinal study of a dairy herd (Mechie et al., 1997) and previous work in Scotland (Synge et al., 2003) reported seasonal incidence of shedding. Livestock species other than cattle, and also wildlife, may be infected and transmit infection (Synge, 2000). Mechanical transmission of contaminated faeces is a likely means of farm-to-farm spread.
To represent all the possible mechanisms that are relevant to transmission, we first consider the following basic model: the probability of a farm becoming infected depends on the number of cattle present Ni and the number of infected farms at time t, I(t) (equivalently, the prevalence of infection across the whole population of farms) in a nonlinear way,
Here (for the case b = 1) βNia is the weighted probability of infection of farm i per infected farm per day. The dimensionless nonlinear index b determines the relationship between the force of infection and the fraction of farms infected (c.f., Liu et al., 1987), ranging from b = 0 (Λ independent of I) to b = 1 (Λ proportional to I) (c.f., Liu et al., 2007). The dimensionless nonlinear coefficient a describes heterogeneity in susceptibility as a function of the herd size, N (Tildesley et al., 2008). We note that the field data do not indicate any relationship between herd size and on-farm prevalence (Spearman rank correlation: rs = −0.061, p = 0.574), i.e. xj is not a function of Nj.
Further, to examine the influence of other possible risk factors on the transmission of E. coli O157, the following variants to the basic model were considered:
The infection of E. coli O157 on Scottish farms is likely to be at an approximate steady state because a comparative analysis of the IPRAVE data with an early Scottish Executive Environment and Rural Affairs Department (SEERAD)-funded survey, performed between 1998 and 2000, found no statistically significant difference in farm-level prevalence of E. coli O157 (Pearce et al., 2009). However, the IPRAVE prevalence data are quasi-longitudinal (although individual farms were visited only once, the survey design involved a stratified sequence of visits to different regions of the country) and some fluctuation in prevalence is apparent (Fig. 2b). The time scale of these fluctuations allows us to obtain an estimate of the recovery rate.
We used the following method to obtain model predictions of the probability that each farm i is infected, i, for i = 1 to 461. For a given model and set of parameter values, a large number of replications of simulated steady state conditions was needed to estimate the average probability i given the highly stochastic nature of the infection and recovery processes. Visual C++ was used to code the simulation programme. We randomly chose five different infection seeds, and allowed a burn-in period of 15 years to let the system reach a steady state condition (c.f. Liu et al., 2007). The model system was run for another 1000 × 3 years. κ–statistic tests of autocorrelation (Dohoo et al., 2003) showed that the infection status of the farm in one 3 year interval was independent of its status in the adjacent interval; thus each simulated 3 year period can be regarded as an independent sample. From these 1000 samples, the predicted probability was estimated for each individual IPRAVE farm i as: i = (number of times farm i was positive) / 1000. Here, ‘positive’ indicates that the simulation predicted that infection would be present on the day (relative to the simulated 3 year period) on which the farm was sampled in the field during 2002–2004. That 1000 samples were adequate was justified by numerical experiment. When a few samples are taken, then the average log-likelihood l (see below) fluctuates widely across different runs. As the number of samples increases l converges quickly. When the number of samples equals 1000, l varies over a very small range such that its standard deviation is about 0.5 units in absolute value, with no discernible reduction in variability for larger sample sizes (Fig. 3).
Given the model prediction i for each of the 461 IPRAVE survey farms, the natural logarithm of the likelihood is calculated to quantify model fit:
where ri = 1 if farm i is recorded in IPRAVE survey as positive, ri = 0 otherwise. The downhill simplex method (Press et al., 1997) was employed to search for the maximum likelihood estimates (MLE) of the model parameters for each model. To select the most parsimonious model (balancing goodness of fit with the number of parameters fitted), the Akaike Information Criterion (AIC) (Akaike, 1973) was calculated:
where pm is the number of parameters of model m and MLEm represents the maximum likelihood estimates of the parameters of model m. The most parsimonious model is the one with the lowest AIC value. To quantify the extent to which the most parsimonious model is better than competing models, evidence ratios were calculated as:
where model 1 is the estimated most parsimonious model and k indexes each alternative model. This value quantifies the relative likelihood that model 1 is better than model k, given the data set (Burnham and Anderson, 2002).
To evaluate each model's accuracy in predicting the observed infection status of each individual IPRAVE farm, an empirical estimate of the odds ratio for each independent sample was calculated by:
where F++ is the number of farms that are positive for both model prediction and observed data, and the meanings of F--, F+- and F-+ follow accordingly. An odds ratio greater than 1 indicates that the model is more likely than not to predict the correct infection status of a farm. The larger the odds ratio, the stronger the predictive power of the model. We also calculate the area under the Receiver Operating Characteristic (ROC) curve, which provides a measure of a model's ability to discriminate between farms that are positive and those that are negative (Dohoo et al., 2003). For this (non-parametric) calculation, the farms are arranged in the descending order of the predicted probability i, which acts as the cut-off point for distinguishing between positive and negative farms. In our system, 461 farms were sampled, of which 87 positive. For a given value of cut-off point, say ρCT, the false positive rate (FPR) is calculated as NFP/374 where NFP is the number of farms that are actually negative and have predicted probability i ≥ ρCT, the true positive rate (TPR) as NTP/87 where NTP is the number of farms that are actually positive and have predicted probability i ≥ ρCT. We obtain the Receiver Operating Characteristic (ROC) curve by plotting TPR versus FPR over all possible ρCT values from which the area under the ROC curve (AUC) can be derived (Fig. 4).
Data for 461 of the 481 farms sampled in the IPRAVE study were extracted from the DEFRA CTS database (time of movement events, number of animals moved, coordinates of destination farm) during year 2002–2004 and the DEFRA 2003 census database (the X–Y coordinates of the farm-house, the area of the farm, and the numbers of cattle, sheep and pigs) as described above. Additional variables were created to describe farm clustering (number/presence of farms within 1, 3 and 5 km), characteristics of surrounding farms (size, type, distance, number of cattle), recent movement (movement within 1, 2, 3, 4, 8, and 30 weeks before sampling), amount of movement (number of movement ‘events’ onto farms, number of cattle moved, number of different farms supplying cattle). Movement events were further characterised by distance moved (< 10 km, > 10 km), number of animals moved per ‘event’ and the season of movement. Risk factors for the presence of E. coli O157 on a farm were analysed using generalised linear models (GLM) and Generalised Linear Mixed Models (GLMM) procedures in SAS (SAS Institute Inc., Cary, NC) with a binomial error distribution and a logit link function. GLM analyses were carried out on a univariate basis initially. A hierarchical forward selection and backward elimination approach was used. The change in the deviance of the model was monitored as an indicator of improved fit. Variables were added and removed based on significant improvement in the mean deviance after changes to the model. Two-way interactions were also tested in this manner. The final model was carried forward for GLMM analysis incorporating a random variable called ‘farm cluster’ which records the temporal/spatial nature of the sampling carried out during the IPRAVE survey.
In order to compare the results of the statistical analysis to the results generated from the stochastic modelling an analogous empirical odds ratio estimate was developed. Values for the empirical estimate of the odds ratio were derived using the parameter estimates from the GLMM model. These estimates were used in the statistical model to simulate binary response random variables for each farm (absence/presence of E. coli O157 on-farm). The predicted presences/absences were then related to the observed presences/absences, and aggregated over all farms, from which summaries an odds ratio was calculated as above (Eq. (9)). This process was repeated to produce a distribution of odds ratios which could then be summarised.
During a survey from March 1998 to May 2000, funded by The Scottish Executive Environment and Rural Affair Department (SEERAD), 952 farms were surveyed. It was found that E. coli O157 is present on 21.7% (19.2–24.5%) of the farms (Gunn et al., 2007). Among 461 IPRAVE farms of interest here, 395 farms were also in the SEERAD study. As a test of the predictive power of the most parsimonious model (see Results), the expected degree of association (using Cohen's kappa statistic, κ — Cohen, 1960) between different sets of simulation results for those 395 farms was calculated and compared to the observed degree of association.
Table 1 shows the comparison among different model variants. According to the AIC value the most parsimonious model is the basic model, i.e. that described by Eqs. (2)–(4). We therefore compare all other models with the basic model (noting that there is some error in likelihood estimates – see Fig. 3 – and therefore in the AIC values).
The most parsimonious model generates a predicted prevalence of infection on the IPRAVE study farms close to the observed prevalence of 18.9% and well within the exact binomial 95% confidence intervals (15.5 to 22.7%). However, the mean odds ratio (which gives a measure of the ability to distinguish between positive and negative farms) is low (1.38, with a standard error of 0.39) and the area under the ROC curve (AUC) is only 0.636 (Fig. 4) (below the 0.7 threshold for ‘acceptable’ discrimination specified by Hosmer and Lemeshow, 2000). Several of the alternative models perform marginally better using one or more of these indicators but in general the models perform poorly at discriminating between positive and negative farms.
The models can also be compared in terms of the parameter estimates obtained. Estimated values of β are highly dependent on model structure and so vary widely between models. Estimated values of γ (with the exception of the model relating recovery rate to farm size) are not directly influenced by model structure and show much less variation. For the best five models estimates of γ range from 0.0232 to 0.0354. Similarly, estimates of a range from 0.249 to 0.476 and of b (unless fixed) from 0.176 to 0.230. Overall, where comparable, parameter estimates are reasonably robust across the set of best models.
The results of comparing the simulation output for different time periods show that the model predictions for the periods 1998 to 2000 and 2002 to 2004 are only weakly associated: the modal value of κ is 0.010 (Fig. 5), i.e. slightly larger than random. Comparison of the SEERAD and IPRAVE survey data gives a similarly low κ value. Among the 395 farms, 22 farms were positive and 253 were negative in both surveys, while 56 farms were positive in the SEERAD survey and negative in the IPRAVE survey, and 64 farms were negative in the SEERAD survey and positive in the IPRAVE survey giving κ = 0.077 (95% confidence interval −0.028 to 0.182, p = 0.128).
The results of the statistical analysis of the GLMM model are shown in Table 2. The results suggest that farms that were positive for E. coli O157 tended to have a larger number of cattle as well as recent movements of cattle (within 4 weeks of sampling) onto the farm. The mean overall odds ratio for this model was 1.297, with a standard deviation of 0.379. None of the variables related to the clustering of farms or the presence of sheep were statistically significant in the multi-variable models. These results are very similar to the results obtained by the basic model in the simulation models above.
To investigate factors that affect the transmission dynamics of E. coli O157 infection on Scottish cattle farms, we developed a set of stochastic models and employed AIC to select the most parsimonious model. By comparing the fit of different model variants, we have shown that the most parsimonious model to describe the dynamics of the spread of E. coli O157 infection between farms is the one that is given by Eqs. (2)–(4). This model simply says that the transmission between farms is by two different routes: i) movement of infected cattle; and ii) all other routes, including acquisition of infection from a contaminated environment or other reservoir. The recovery rate is a constant, but the farm susceptibility increases with herd size (c.f. Tildesley et al. (2008) who found a similar result for farm susceptibility to foot-and-mouth disease). The force of infection (ignoring transmission via movements) is not a linear function of the number of infected farms (contrasting with the standard mass action assumption). The nonlinearity (with coefficient b considerably below 1) indicates saturation, that is, the risk that a susceptible farm becomes infected increases more rapidly with each additional infected farm when there are few infected farms than it does when there are many (c.f. Liu et al., 1987). Indeed, we cannot formally exclude the possibility that the force of infection is independent of the number of infected farms, i.e. b = 0. This result has a significant influence on the anticipated effectiveness of intervention measures, explored in detail elsewhere.
More complex models incorporating other risk factors do not significantly improve the fit. For example, because sheep and pigs can also carry E. coli O157, the presence or absence of these other livestock species could give rise to different probabilities of infection. Similarly, the prevalence of infection varies seasonally, so season could also affect transmission rate. Inclusion of these risk factors marginally improves the model fit but increases the AIC value, so the models are rejected as not parsimonious. Also, it is plausible to argue that proximity to an infected farm makes a farm more likely to become infected. However, our model fitting exercise provided no support for localized spread. Further, it is possible that a large farm will remain infected for longer and has a smaller recovery rate but this does not improve model fit, and thus was rejected.
All of these results are consistent with our risk factor analysis. Generalised Linear Mixed Models also identify herd size and cattle movements as risk factors, but the other risk factors considered here (sheep, seasonality, clustering of farms) do not have statistically significant effects. Correspondence between the results from empirical GLMMs and those from the fitting of dynamical models has not, as far as we are aware, previously been examined for any infectious diseases system. The good agreement found here is encouraging since both approaches are commonly used to inform the design of disease control programmes, which implies that the targeting of interventions could be based on either (this point will be considered in more detail in subsequent work).
Model selection, not parameter estimation, was the major objective of this analysis. The maximum likelihood method used here does not readily yield confidence intervals on parameter estimates; these are being obtained using Markov chain Monte Carlo methods and will be reported separately. However, we note that the estimates obtained appear reasonably robust across the best fitting set of models, which includes two models allowing for the slight under-detection of infection allowed for in the field survey design and for the imperfect diagnostic test sensitivity. The estimated rate of loss of infection gives an average duration of infection on a farm of approximately one month; this is highly consistent with data from smaller scale longitudinal studies (Synge et al., 2003).
Although these analyses do suggest farm size and cattle movements as significant risk factors the risk profile across the population of farms is relatively flat. This is indicated by the low odds ratios (Tables 1 and 2) and the small area under the ROC curve (Fig. 4). Reflecting these results, model simulations indicate that, while less than 20% farms are E. coli positive on any given visit, over 80% can be expected to be infected at some stage during a calendar year, reflecting both the risk profile and the typically short duration of infection on a farm (Robinson et al., 2004). This prediction is testable and clearly has important consequences for the design of future control programmes.
This work was supported by the Wellcome Trust (project grant to MEJW) and the DEFRA/SFC VTRI programme.
The authors do not have commercial or other associations that might pose a conflict of interest.
X-SZ performed all the process modelling and wrote the paper. MECT performed the statistical modelling and prepared the manuscript for submission to the journal. IJM designed the analysis to generate comparable AUC estimates from the empirical modelling and helped to interpret the data. MEJW and NJS supervised the study and helped to interpret the data. All authors read and approved the final manuscript.
The E. coli O157 prevalence data used in this study was collected as part of the International Partnership Research Award in Veterinary Epidemiology (IPRAVE), Epidemiology and Evolution of Enterobacteriaceae Infections in Humans and Domestic Animals, funded by the Wellcome Trust. The CTS data and census data extracts used in this study were provided by Victoriya Volkova through funding by the Scottish Government's Centre of Excellence in Epidemiology, Population Health and Disease Control (EPIC). Thanks to Lynda Reid, Paul Gavin, Andy Reid and the team at the Scottish Government Rural and Environment, Research and Analysis Directorate. Thanks also to Paul Bessell for assistance with database management.