|Home | About | Journals | Submit | Contact Us | Français|
We present a general statistical modeling framework to characterize continental-level influenza dynamics in the United States for the purposes of examining state-level epidemiological sources and sinks. The methods we describe depend directly on state-level influenza data that are prepared on a weekly basis by Google Flu Trends. The Google Flu Trends team has provided a powerful new approach to collecting and reporting epidemiological data and, when used in conjunction with sophisticated statistical models, can allow for the identification and quantification of the flow of influenza across the continental United States. Our proposed methods, when conditioned on such a comprehensive search query product, can provide unprecedented scientific learning about large-scale pathways and barriers to disease transmission which can ultimately be helpful for policy, remediation, and response efforts.
With the recent outbreaks of the H5N1 and H1N1 strains of the influenza virus, there is an increasing need for sophisticated methods of characterizing the spatio-temporal dynamics of these epidemic processes. New statistical models have been developed that allow researchers to intuitively represent dynamic behavior in epidemiological processes and are capable of identifying both barriers and corridors to the spread of disease over large spatial extents (e.g., Lawson and Zhou, 2005; Waller et al., 2007; Wheeler and Waller, 2008; Hooten and Wikle, 2010). Additionally, advances in search engine and data mining technology have allowed for the collection and reporting of continental-scale epidemiological data at a high temporal resolution (Ginsberg et al., 2009).
The transmission of disease can be complicated at the small-scale (Altizer et al., 2006), however, important features of large-scale transmission may become apparent if a model can be specified that is able to partition the dynamics into scientifically meaningful and identifiable components that can be estimated using available data. Resulting statistical inference can then provide information about past, current, and future states and mechanisms of the epidemiological process.
More specifically, in what follows we present a general spatio-temporal modeling framework that is specified in terms of intuitive small scale (i.e., fine scale) dynamics and allows us to learn about commonly studied, but unobserved, epidemiologic features of the population in addition to both pathways and barriers to the spread of influenza in the continental United States. Moreover, we are able to link the spread of influenza to various environmental and anthropogenic covariates in an effort to determine whether certain states function as epidemiological sources or sinks and how susceptible states are to both intrastate and interstate transmission of influenza-like illness (ILI).
Numerous modeling frameworks for statistically characterizing spreading phenomena have appeared in the literature in the past decade. Many of these adopt a hierarchical structure so that multiple sources of uncertainty can be accounted for explicitly in each of the model levels (Cressie et al., 2009). The Bayesian hierarchical modeling framework (Berliner, 1996) has proven to be a powerful method for solving large complex problems by breaking them up into smaller tractable conditional problems. Such models often involve three levels, commonly referred to as the data model, process model, and parameter model. Specifically, Wikle (2003) and Hooten and Wikle (2008) let the middle level (i.e., the process) be motivated by spatio-temporal partial differential equations, whereas, Hooten et al. (2007) use a discrete matrix model to characterize the dynamics and forecast the spread of invasive species on continental scales. Other approaches involving generalized linear models (for non-Gaussian data) and Markov random fields have also proven to be effective for describing dynamic spatio-temporal processes (Royle and Dorazio, 2008; Zhu et al., 2005).
Another form of model that has successfully been utilized for mimicking the spatiotemporal behavior of epidemics is sometimes referred to as an agent-based model (e.g., Grimm et al., 2005; Grimm and Railsback, 2005). These models are often called individual-based models when implemented in the literal sense, where an individual organism’s behavior is modeled directly. In the broader context, agent-based models can be constructed with the focus on a reporting unit as an “agent,” rather than an individual organism. In this sense, the models are scalable and could operate at the molecular level or, as in our case, at a much larger spatial level. Smith et al. (2002) implemented such a model, which is based on a simple set of rules governing fine (i.e., between agent) behavior that leads to complicated large-scale evolution, to describe the dynamic behavior of a rabies epidemic in raccoon populations in Connecticut. A full statistical implementation of this type of model is presented by Hooten and Wikle (2010) and allows for the estimation of important types of epidemic behavior such as: persistence, anisotropic and non-stationary transmission, and long-distance transmission. A critical feature of the models presented by Hooten and Wikle (2010) links transmission probability to the change in covariates, rather than to the covariates themselves. In this way, they are able to focus the movement of the disease on the dynamic nature of the heterogeneous cellular landscape rather than the static covariates themselves. That is, having modeled persistence (a cell-based epidemic process) in other model components, they rely on the change in landscape to identify neighborhood-based movement probabilities. A complication in the models of Hooten and Wikle (2010) is that they are working strictly with binary data and thus must use discrete probability distributions to properly accommodate the support of the response variable.
In the mathematical literature, the difference between “top-down” versus “bottom-up” dynamic model specifications is referred to as Lagrangian versus Eulerian (Turchin, 1998). An example of the Eulerian approach occurs in cases where the process level of a hierarchical model is formulated in terms of a partial differential equation (PDE, e.g., Wikle, 2003). In these settings, the mathematical form of the PDE dictates the overarching dynamic behavior inherent in the model. By contrast, the Lagrangian approach relies on rules governing the small scale dynamics that are then scaled up to exhibit large-scale behavior. Turchin (1998) shows that certain Lagrangian models, when scaled up and approximated, take an Eulerian form; in these cases, the mathematical approximation implies that the Lagrangian models are more general. The statistical crux then, is in the estimation aspects when considering these models from an inverse perspective (i.e., learning about underlying mechanisms given observed data).
In other recent work, Wheeler and Waller (2008), use a combination of statistical methods to identify and include important features of a heterogeneous environment (such as barriers and pathways), into a hierarchical Bayesian model for characterizing epidemics. In their methods, Bayesian areal Wombling is employed to detect areas of rapid change in the underlying environment. This is similar in spirit to the use of a gradient landscape (i.e., directional derivatives of the covariate mean field) to connect movement of the epidemic to the underlying environment.
A common way to study epidemics at the population level is through compartment models such as the SIR model (i.e., susceptible → infected → recovered), where the entire population of interest can be partitioned into these three compartments at any given time. When immunity to an infectious disease is only temporary, the SIR model is extended to accommodate the ability of a recovered individual to become susceptible again. These are termed SIRS models, and, are often formulated as a coupled system of ordinary differential equations (ODE):
where, the quantities S, I, and R represent the portion of the total population (N, where N = S + I + R) that is either susceptible, infected, or recovered, respectively. Note that this model depends on instantaneous quantities, but can be integrated and considered over intervals or periods of time. In fact, most measurements of S, I, or R are recorded as integrated quantities.
The United States Centers for Disease Control and Prevention (CDC) reports spatiotemporal influenza data, on a weekly basis, in the percentage of total physician visits that are due to influenza-like illness (ILI). The actual description of ILI, according to the CDC, is a person with a fever of at least 100 degrees Fahrenheit and a cough or sore throat. Also, each person can average between 1–6 episodes of ILI per year. Clearly, not all ILI cases will actually be influenza, but record keeping is performed in this manner for convenience and efficiency. Recently, researchers at Google formally linked an influenza-related search query index with the CDC ILI and began reporting both current and historical modeled ILI from June in year 2003–Current as the number of patients (yi,t) of out of 100, 000 physician visits that have ILI for state i and week t (Ginsberg et al., 2009).
In their paper, Ginsberg et al. (2009) claim that by simply compiling and processing their search query data faster than the CDC, they are able to report ILI (http://www.google.org/flutrends/) one to two weeks ahead of the official ILI data. That is, they are not forecasting the ILI, only predicting it using search query data. Thus, yi,t is a noisy version of true ILIi,t, but also more readily available. Accounting for this additional source of uncertainty is an important component of any further modeling and inference based on these data.
The weekly ILI data themselves can be most easily viewed as a set of time series on the same plot, where each time series represents yi,t/100, 000 (Figure 1). The spatio-temporal dynamics of ILI can be illustrated best as a movie, however, we can also view the empirical orthogonal functions (EOFs) of ILI by decomposing the space-time series using a principal components approach and viewing the components (EOFs) as well as the resulting scores. In a space-time setting, the EOFs can be visualized as maps and the scores represent the corresponding time series indicating at which points in the temporal domain the spatial pattern in each EOF reoccurs (Preisendorfer, 1988). In this case, the first two EOFs account for approximately 90% of the overall spatio-temporal variability and are shown in Figure 2. By contrasting the time series with the data in Figure 1 we can see the scores from the first EOF correspond to the overall magnitude of ILI, where the scores from the second EOF indicate a contrast largely between southwestern and northeastern states. In this case, the peaks and troughs in the second set of scores provide an indication of which side of the continental United States the annual epidemic ILI wave begins on. For example, notice that the first wave starts in the west while the second one starts on the east.
These observed epidemic waves contain valuable information as to how ILI, as a spreading phenomenon, propagates within and between states. The model we present in the next section partitions this spreading behavior into multiple components that allow us to make conclusions about the components of ILI transmission in the continental United States.
In constructing a model for continental influenza, where state-level ILI data are available, we seek to incorporate SIRS dynamics based on state-level intra- and interstate disease transmission. In doing so, we take a hierarchical approach where we can accommodate the discrete and integrated nature of the available data conditioned on unobserved SIRS quantities of interest. The dynamics relating these unobserved quantities are then specified in an intuitive fashion while using available scientific knowledge about the epidemiology of ILI.
Recall that Google Flu Trends reports state-level ILI data on a weekly basis in terms of yi,t, a count out of 100, 000 physician visits in state i during week t. Thus, we begin by specifying a stochastic model for the data. In this case, we consider the observed count (yi,t) conditioned on an underlying ILI probability (ILIi,t) to come from a binomial distribution: yi,t ~ Binom(100K, ILIi,t), where K 1000. Note that, for a state population of Ni, the number of people who became infectious at time ν in state i (i.e., Ĩi(ν)) is a function of ILIi,t:
In order to connect this data model to an underlying SIRS model, we need to convert Ĩi(ν) to the number of infectious people in state i at time t (i.e., Ii,t). This requires information about the persistence of infectiousness. We adopt a common model used for the natural history of influenza (e.g., Yang et al., 2009), where the duration of infectiousness extends from 1 to 7 days with probabilities p = (1, 1, 1, 0.8, 0.6, 0.4, 0.2)′, for each of the days respectively. Assuming ILI infection occurs uniformly throughout the week, we arrive at the following relationship between Ii,t and Ĩi(ν):
where the latter equality obtains from substituting (1) into (2). The result in (2) now implies the following data model: yi,t ~ Binom(100K, ). Hence, assuming that Ni is known, we now have a likelihood such that the data depend on an unknown underlying epidemic process (i.e., Ii,t).
The underlying system dynamics are specified in the middle stage of the hierarchical model. In the specific case of ILI (i.e., a set of diseases) a process model based on compartment dynamics in which immunity is only temporary (i.e., SIRS) can be useful for studying transmission. Given the discrete temporal nature of the data, we use a coupled set of ordinary difference equations to represent the SIRS dynamics:
where, ηi,t ~ N(0, σ2) and the equation for the susceptible portion (3) of the population is self-explanatory (essentially, every person that is not either infected or recovered in state i and time t), while the other two equations, (4) and (5), are discussed in detail in the following sections.
Note that, in (4), the model for infection contains three main additive terms. The first, , corresponds to intrastate transmission and is a function of state-level covariates xi. As a dynamic model, this term expresses the number of infected people at the current time as a function of the number of infected people at the previous time and unique state-level characteristics. The βW coefficients, then, represent the importance of each covariate to intrastate transmission. The second term, , is the first of two terms dealing with interstate transmission. In this case, it is a function of the differential between state characteristics and interacts with all contiguous states (of which there are ni) infection status at the previous time. The differences, xj − xi, are used here to describe an underlying flow surface, similar to that considered by Hooten and Wikle (2010), in terms of directional derivatives. The bi,j are weights that represent the connectivity of states i and j; in this case they represent the proportion of shared boundary. The coefficients, βB are most easily visualized when they are premultiplied by the covariate matrix X (x1,…, xn)′. The quantity XβB represents the flow surface itself which can be thought of as a wavy map where ILI travels easily from high values to low values. This surface, when estimated, provides one of the main forms of inference allowed by the model. It should also be noted that this state to state transmission is easily specified and the main reason why we describe this model as agent-based.
The final non-error term in (4), , accommodates interstate transmission through air travel, a second form of connectivity between states. In this equation, the ai,j represent the relative amount of air travel from state j to state i and the coefficient βA then scales this as necessary to provide the contribution of long-distance transmission during period (t − 1, t].
Noting that the recovered portion of the population is known when the Ii,t are known, we simply need to formally define the relationship between Ri,t and Ii,t. In doing so, we first rewrite (5) as
where, traditionally, the coefficient c controls the amount of recovered people who become susceptible in week t. In the latter equality, we express this temporary immunity in terms of the random variables uj, for the jth recovered person at time t. Due to the previously described natural history of ILI, a person who has had an ILI can become infected with another ILI uj more times in a year, where uj ~ discunif(0, 5).
For the final term in (6), we note that, as a consequence of our specification in (2), each week, a total of people recover. Thus, since , we are left with people that were infected but are now recovered.
A Bayesian approach allows us to easily accommodate the multiple sources of stochasticity in the hierarchical model specified in the previous sections. Thus, in completing the full hierarchical specification we need to specify probability distributions for the unknown random parameters of interest. Given that many of the model parameters are assumed to be fixed and known (based on descriptions in the literature), we have only to specify distributions for the interstate transmission coefficients (βW, βB, and βA) and the variance component (σ2). Specifically, we used a N(0,Σβ) prior for β and a uniform on (0, γ) for the standard deviation σ (as suggested by Gelman, 2006). Note that, in the prior specification above, β = (β′W β′B, β′A) and Σβ = 1000· I while γ is chosen to be a very large finite value such that the prior on σ is proper, but has minimal effect on the posterior.
In a Bayesian framework, we are interested in the posterior distribution of the unobserved latent process S, I, R, as well as the unknown random parameters β and σ2. Using the conventional square bracket notation for hierarchical Bayesian models, the posterior distribution can be written:
The posterior distribution in (7) can be easily approximated using Markov chain Monte Carlo (MCMC) with Metropolis-Hastings updates for non-conjugate parameters (Gelman et al., 2004). For computational efficiency, we take an additional approximation step in the implementation, using a normal approximation to the binomial likelihood. Given the actual observed values of yi,t we have found this to be a very reasonable approximation that yields conjugate full-conditional distributions for all unknown quantities and allows for the use of a Gibbs sampler MCMC algorithm.
For covariates, we consider four spatial variables that we believe may be influential for the spread of ILI in the United States (Figure 3):
Using weekly Google Flu Trend data from June 1, 2003 to January 4, 2009, along with the covariates depicted in Figure 3 and a large sample of airline ticket data (http://www.transtats.bts.gov/), we fit the model discussed in the previous section to obtain parameter and flow surface estimates. The Gibbs sampler, which was written in the R Statistical Programming Environment (R Development Core Team, 2008), was run for 5,000 iterations (taking approximately 5 hours on an 8 processor 3 Ghz server with 28 GB of RAM). Using 3 MCMC chains, the Gelman-Rubin convergence statistics (Gelman and Rubin, 1992) were computed for each parameter and were all less than 1.001, indicating acceptable MCMC convergence was achieved. For inference, the first 1,000 MCMC samples were discarded as burn-in, and the remainder were used to approximate for the transmission coefficients β, which are provided in Table 1. Equal tail posterior credible intervals indicate that all of the coefficients are significantly different from zero (at the α = 0.05 level). The estimated posterior mean spatial fields (XβW and XβB) are shown in Figure 4.
Another means for illustrating the results are to view the individual state posterior susceptibilities, both in terms of interstate (Figure 5) and intrastate susceptibility (Figure 6). We refer to intrastate susceptibility as XβW, while interstate susceptibility corresponds to the cumulative effect of all neighboring states on the state in question and is computed as: , where the sum is over all neighbors of state i.
In addition to the full model discussed above, we also fit two additional reduced models: one omitting the between-state transmission (i.e., middle term on the right hand side of (4)), and another omitting the air travel component of transmission (i.e., second to last term in (4)). The deviance information criterion (DIC, Gelman et al., 2004) resulting from each of these model fits were 152615, 152569, and 152315, respectively.
The results from fitting the full model suggest that all model parameters are significant in the model, and thus, deemed important for describing ILI dynamics in the continental United States. For example, the first column of Table 1 indicates that state area, population density (PPSM), and average July temperature are all positively related to intrastate influenza transmission, while average January temperature had a negative effect. These results imply that as area, population, and summer temperature increase within a state, so do the transmission rates; whereas, as winter temperatures get colder, interstate transmission increases. We find the area, population, and winter temperature results to have reasonable scientific interpretations, while the summer temperature results require more thought; it is not immediately apparent why an increase in July temperature should lead to increased intrastate ILI transmission. Of course, given the observational nature of this study, it is quite likely that estimated effects are confounded with lurking variables not included in the study. However, regardless of potential multicollinearity between any missing covariates and those considered here, inference concerning the underlying flow surfaces can still be made. For example, Map 2 in Figure 4 indicates (with darker shades of gray) which states are more sensitive to intrastate transmission than others. This map implies that many of the states with large areas in the middle of the country are more susceptible to intrastate transmission than those in, say, in the northwest (e.g., Oregon and Washington) where January temperatures are higher than other states at similar latitudes. State level transmission could also be viewed as susceptibility, and Figure 6 identifies which of the continental states are most and least susceptible to intrastate transmission. The boxplots shown in Figure 6 represent the marginal posterior distributions of XβW given the Google Flu Trends data and available covariates. Thus, boxplots that are significantly larger than zero indicate the most susceptible states, while those near zero indicate a neutral intrastate susceptibility. States with negative posterior susceptibilities (e.g., Oregon and Washington) suggest that ILI transmission is likely due to other sources (e.g., interstate transmission).
In terms of interstate transmission, we can see from the second column of Table 1 that the effects of the covariates on the interstate flow surface share the same sign, but differ in magnitude, from those influencing intrastate transmission. From Table 1 it is apparent that all coefficients except average January temperature have positive posterior mean effects, though the role that average July temperature plays in interstate transmission is reduced. These results imply that if a neighboring state is larger, has a higher population density, and higher summer temperature than the state of interest, the interstate transmission increases; the converse is true for neighboring states with higher winter temperatures, implying that ILI spreads slower from states with warm winters to states with cold winters. Viewed as a map, the posterior mean interstate flow surface (Map 1, Figure 4), corresponding to XβB, illustrates where these areas of high and low interstate transmission occur. The darker northern Great Plains states and midwest indicate a ridge on the surface. Theoretically, this flow surface would suggest that if an ILI wave began in the northern Great Plains, it would disperse rapidly “downhill” toward the southeastern states and west coast. We do not see this behavior in the data however, and thus it is more likely that the ridge of higher values in the center of the country represents a distinct barrier to cross-country transmission. That is, an epidemic starting on the west coast will spread readily through western states, while only slowly making its way through the center of the country toward the east. When considering this finding in combination with that of Map 2 in Figure 4, it becomes obvious that interstate ILI transmission in the middle of the country actually occurs relatively slowly and thus the intrastate mechanism becomes more of a driver for the epidemiologic process in that region. Generally, darker contiguous regions on such maps indicate barriers to disease spread, while lighter contiguous regions suggest potential corridors for spread. Overall, the slightly larger DIC values for the full versus reduced model (where interstate transmission is omitted) suggest that the component of the infection model (4) involving interstate transmission may not be contributing much to the overall fit of the model. In computing the DIC, we found that the effective number of parameters were about four greater in the full model than the reduced model without the between state component. In this case, if one seeks inference regarding interstate transmission, it may be beneficial to account for these effects even though the DIC is slightly larger when they are included.
The interstate susceptibility can also be viewed on a state by state basis (Figure 5). The boxplots in Figure 5 represent the approximate posterior distribution for the sum of incoming ILI transmission from neighboring states. That is, states with posterior susceptibilities greater than zero indicate that neighboring states are responsible for a significant portion of ILI transmission. Based on the available data and covariate information used in this study, Figures 5 and and66 indicate that small states with high population densities, such as New Jersey and Rhode Island, are most susceptible to intrastate ILI transmission, while states surrounded by larger states with high population densities act as sinks for ILI and are quite susceptible to neighborhood-based interstate transmission. In terms of long-distance interstate transmission, the positive posterior mean coefficient for air travel (Table 1) indicates that air travel may play a significant role in the spread of ILI, beyond the intrastate and neighborhood based interstate transmission. However, in this case, the resulting DIC was lower for the reduced model omitting the air travel component than either of the other two fitted models. Thus, despite a non-zero estimated air travel effect when fitting the full model, the air travel component may not be helpful overall, as including it increases the number of effective parameters by approximately fifty. In terms of modeling the spread of human influenza, this could be welcome news, as airline information can be difficult to obtain.
In summary, we have introduced a general hierarchical framework for utilizing the ILI data produced by Google Flu Trends to link the dynamics and mechanisms (including both barriers and pathways) of influenza dispersal in the continental United States. In this effort, we have built on the previous works of Wheeler and Waller (2008) and Hooten and Wikle (2010) and are able to avoid some of the complications arising from those earlier studies due to the support and extent of the data. The general agent-based framework we present here relies only specification of small scale (within and between state) dynamics and can readily be extended to accommodate important features of continental epidemic spread such as time-varying covariates, different disease natural histories, continental boundary conditions (e.g., other countries), and other forms of data. Our model formulation focuses on separating intrastate dynamics from interstate dynamics using a relevant set of environmental and anthropogenic covariates in both static and gradient forms. Our model results provide an indication that valuable scientific learning can result when these types of models are combined with data arising from sophisticated search engine mining technology.
Mevin B. Hooten, Department of Mathematics and Statistics, Utah State University, Logan, UT 84322.
Jessica Anderson, Department of Mathematics and Statistics, Utah State University, Logan, UT 84322.
Lance A. Waller, Department of Biostatistics, Emory University, Atlanta, GA 30322.