Home | About | Journals | Submit | Contact Us | Français |

**|**Biostatistics**|**PMC3114649

Formats

Article sections

- Abstract
- INTRODUCTION
- METHODS
- SIMULATIONS
- DATA ANALYSIS: INFLUENZA A(H1N1), 2009
- DISCUSSION
- SUPPLEMENTARY MATERIAL
- FUNDING
- Supplementary Material
- References

Authors

Related links

Biostatistics. 2011 July; 12(3): 548–566.

Published online 2010 November 11. doi: 10.1093/biostatistics/kxq068

PMCID: PMC3114649

Eben Kenah^{*}

Department of Biostatistics, University of Washington, Seattle, WA 98105, USA ; Email: eek4/at/u.washington.edu

Received 2010 January 7; Revised 2010 October 10; Accepted 2010 October 11.

Copyright © The Author 2010. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

This article has been cited by other articles in PMC.

We argue that the time from the onset of infectiousness to infectious contact, which we call the “contact interval,” is a better basis for inference in epidemic data than the generation or serial interval. Since contact intervals can be right censored, survival analysis is the natural approach to estimation. Estimates of the contact interval distribution can be used to estimate *R*_{0} in both mass-action and network-based models. We apply these methods to 2 data sets from the 2009 influenza A(H1N1) pandemic.

Infectious disease remains one of the greatest threats to human health and commerce, and the analysis of epidemic data is one of the most important applications of statistics in public health. The most important quantities estimated in the early stages of an epidemic are the basic reproductive number (*R*_{0}) and the generation interval. *R*_{0} is the mean number of secondary infections caused by a typical infectious person in the early stages of an epidemic (Diekmann and Heesterbeek, 2000). The generation interval of an infectious disease is the time between the infection of a secondary case and the infection of his or her infector. The serial interval is the time between the symptom onset of a secondary case and the symptom onset of his or her infector. The generation and serial interval distributions are often considered characteristic features of an infectious disease (Fine, 2003). Higher values of *R*_{0} imply that an epidemic will be larger and harder to control. At a given *R*_{0}, shorter generation intervals imply faster spread of the epidemic.

Serial intervals, which are times between observed events, are often used as a proxy for generation intervals. Recent analyses of several emerging infectious diseases have been based on serial interval distributions, including the 1918 influenza (Mills , 2004), Severe acute respiratory syndrome (SARS) (Lipsitch *and others*, 2003), (, WallingaTeunis), pandemic influenza A(H1N1) (Fraser *and others*, 2009), (, McBryde), and avian influenza (Ferguson *and others*, 2005), (, Ferguson2). Three methods form the basis of these applications. The Lotka–Euler equation can be used to estimate *R*_{0} given a known serial interval distribution and the exponential growth rate at the beginning of an epidemic (Diekmann and Heesterbeek, 2000), (, Svensson), (, WallingaLipsitch), (, RobertsHeesterbeek). Two other methods use the time series of symptom onset times, assuming that all infections are symptomatic and observed and all serial intervals are independent and identically distributed (i.i.d.). Wallinga and M Teunis (2004) use a serial interval distribution to estimate the effective reproductive number *R*(*t*), which is the mean number of secondary infections caused by persons infected at time *t*. This approach has been extended to use serial interval observations from contact-tracing data (Cauchemez *and others*, 2006). White and Pagano (2008) jointly estimate *R*_{0} and the serial interval distribution by assuming the number of secondary cases generated by each infectious person has a Poisson distribution with mean *R*_{0} and serial intervals have a multinomial distribution. These estimators are based on a branching process approximation to the initial spread of disease, where secondary infections are the offspring of their infector, and the generation interval is the age at which the infector gives birth. In reality, epidemics spread through populations rather than creating them, exposure to multiple infectous persons causes generation and serial intervals to contract (Svensson, 2007), (, Kenah3), and serial intervals are i.i.d. in the early stages of an epidemic only if the latent, incubation, and infectious periods are all constant.

In this paper, we outline an alternative analysis of epidemic data based on “contact intervals.” The contact interval from *i* to *j* is the time between the onset of infectiousness in *i* and the first infectious contact from *i* to *j*, where we define infectious contact to be a contact sufficient to infect a susceptible individual. This interval will be right censored if *j* is infected by someone else prior to infectious contact from *i* or if *i* recovers from infection before making infectious contact with *j*. Unlike methods based on branching processes, these methods use information contributed by uninfected person-time, account for the effects of exposure to multiple infectious persons and remain valid with variable latent, incubation, and infectious periods. For simplicity, we focus on the analysis of completely observed “susceptible-exposed-infectious-recovered” (SEIR) epidemics. The SEIR framework applies to acute, immunizing diseases that spread from person to person, such as measles, influenza, and smallpox.

In Section 2, we show that standard maximum likelihood methods can be used to obtain parametric estimates of the contact interval distribution that can be used to estimate *R*_{0}. In Section 3, we evaluate the performance of these methods in simulated epidemic data and examine the effects of various forms of misspecification. In Section 4, we used these methods to analyze 2 sets of data from the 2009 influenza A(H1N1) pandemic. In Section 5, we discuss the advantages and limitations of survival analysis in infectious disease epidemiology.

Consider a closed population of *n* individuals assigned indices 1…*n*. Each individual is in 1 of 4 possible states: susceptible (S), exposed (E), infectious (I), or removed (R). Person *i* moves from S to E at his or her *infection time* *t*_{i}, with *t*_{i} = *∞* if *i* is never infected. After infection, *i* has a “latent period” of length ε_{i}, during which he or she is infected but not infectious. At time *t*_{i} + ε_{i}, *i* moves from E to I, beginning an “infectious period” of length *ι*_{i}. At time *t*_{i} + *r*_{i}, where *r*_{i} = ε_{i} + *ι*_{i} is the “recovery period,” *i* moves from I to R. Once in R, *i* is either dead or immune and can no longer infect others or be infected. The latent period is a nonnegative random variable, the infectious and recovery periods are strictly positive random variables, and the recovery period is finite with probability one.

After becoming infectious at time *t*_{i} + ε_{i}, person *i* makes infectious contact with person *j*≠*i* at their “infectious contact time” *t*_{ij} = *t*_{i} + ε_{i} + *τ*_{ij}^{*}, where the “infectious contact interval” *τ*_{ij}^{*} is a strictly positive random variable with *τ*_{ij}^{*} = *∞* if infectious contact never occurs. Since infectious contact must occur while *i* is infectious or never, *τ*_{ij}^{*}(0,*ι*_{i}] or *τ*_{ij}^{*} = *∞*. We define infectious contact to be sufficient to cause infection in a susceptible person, so *t*_{j} ≤ *t*_{ij}.

An epidemic begins with one or more persons infected from outside the population, which we call “imported infections.” For simplicity, we assume that epidemics begin with one or more imported infections at time 0 and there are no other imported infections.

For each ordered pair *ij*, let *C*_{ij} = 1 if infectious contact from *i* to *j* is possible and *C*_{ij} = 0 otherwise. We assume that the infectious contact interval *τ*_{ij}^{*} is generated in the following way: a contact interval *τ*_{ij} is drawn from a distribution with hazard function *λ*_{ij}(*τ*). If *τ*_{ij} ≤ *ι*_{i} and *C*_{ij} = 1, then *τ*_{ij}^{*} = *τ*_{ij}. Otherwise *τ*_{ij}^{*} = *∞*. Let *S*_{i}(*t*) = 1_{t ≤ ti} and *I*_{i}(*t*) = 1_{t(ti + εi,ti + ri]} be the susceptibility and infectiousness processes, where 1_{X} = 1 if *X* is true and zero otherwise. As defined, both processes are left-continuous with respect to *t* and infectious contact from *i* to *j* is possible at time *t* only if *C*_{ij}*I*_{i}(*t*)*S*_{j}(*t*) = 1.

Let *v*_{j} denote the index of the person who infected person *j*, with *v*_{j} = 0 for imported infections and *v*_{j} = *∞* if *t*_{j} > *T*. The “transmission network” is the directed network with an edge from *v*_{j} to *j* for each *j* such that 0 < *v*_{j} < *∞*. Following Wallinga and M Teunis (2004), we represent it with a vector **v** = (*v*_{1},…,*v*_{n}), and let denote the set of all possible **v** consistent with the observed data. Let _{j} = {*i*:*C*_{ij}*I*_{i}(*t*_{j}) = 1} denote the set of persons who could have infected person *j*, which we call the “infectious set” of person *j*.

We assume that all contact intervals are i.i.d. and have a continuous distribution with *λ*_{ij}(*τ*) = *λ*(*τ*;*θ*_{0}) for all *ij*. We assume that *θ*_{0} is unique and that *λ*(*τ*;*θ*) is strictly positive for *τ* > 0, with continuous third derivatives with respect to *θ* in an open neighborhood of *θ*_{0}.

Our population has size *n*. For all ordered pairs *ij*, we know *C*_{ij}. Observation begins at time 0 and ends at time *T*. During this time, we observe the times of all *S*→*E* (infection), *E*→*I* (onset of infectiousness), and *I*→*R* (recovery) transitions that occur in the population. The total number of infections observed is *m*.

Our discussion of maximum likelihood estimate (MLE) is based on the theory of counting processes and martingales, which is described in Kalbfleisch and Prentice (2002) and Aalen *and others* (2009). Let _{ij}(*t*) = 1_{t ≥ tij} count the first infectious contact from *i* to *j* at or before time *t*. We count only the first infectious contact because *t*_{j} ≤ *t*_{ij}. Since *λ*_{ij}(*τ*) = *λ*(*τ*;*θ*_{0}) and _{ij}(0) = 0,

(2.1)

is a zero-mean martingale when *θ* = *θ*_{0}. We observe infectious contacts from *i* to *j* only while *j* is still susceptible and before time *T*, which gives us the observed counting process

(2.2)

When *θ* = *θ*_{0},

(2.3)

is a zero-mean martingale because it is the integral of a predictable process with respect to _{ij}(*u*;*θ*_{0}). When we observe infectious contacts from *i* to *j* between time 0 and time *T*,

(2.4)

is the log-likelihood and its score process is

(2.5)

which is a zero-mean martingale when *θ* = *θ*_{0}. If 0 < *t*_{j} ≤ *T*, this likelihood requires that we observe whether or not *i* made infectious contact with *j* at time *t*_{j}.

Now fix *j*. If we observe all pairs *ij* from time 0 until time *T*, the log-likelihood is _{·j}(*θ*) = ∑_{i≠j}_{ij}(*θ*) with score process *U*_{·j}(*t*;*θ*) = ∑_{i≠j}*U*_{ij}(*t*;*θ*). Since it is a sum of zero-mean martingales, *U*_{·j}(*t*;*θ*_{0}) is a zero-mean martingale. The likelihood _{·j}(*θ*) corresponds to a survival likelihood, where the *t*_{ij} are failure times and *C*_{ij}*I*_{i}(*t*)*S*_{j}(*t*) = 1 indicates risk of failure (i.e., infectious contact) in the ordered pair *ij* at time *t*. At the earliest infectious contact with *j*, the contact intervals in all other *ij* at risk are right censored, which is a type-II independent censoring mechanism (Kalbfleisch and Prentice, 2002). If 0 < *t*_{j} ≤ *T*, this likelihood assumes that we observe which counting process *N*_{ij} jumped at *t*_{j}, which is equivalent to observing which *i* infected *j*.

The complete-data log-likelihood when we observe who-infected-whom is (*θ*) = ∑_{j = 1}^{n}_{·j}(*θ*) and its score process is *U*(*t*;*θ*) = ∑_{j = 1}^{n}*U*_{·j}(*t*;*θ*). Since it is a sum of zero-mean martingales, *U*(*t*;*θ*_{0}) is a zero-mean martingale. Differentiating (*θ*), evaluating at *θ*_{0}, and taking expectations yields

(2.6)

where *U*(*θ*)(*τ*) is the predictable variation process of *U*(*τ*;*θ*). Under our regularity assumptions for *λ*(*τ*;*θ*), this establishes the consistency and asymptotic normality of the maximum likelihood estimator as the number of observed infections *m*→*∞* (Kalbfleisch and Prentice, 2002).

Now suppose that we do not observe who infected person *j*, so we see only *N*_{·j}(*t*) = ∑_{i≠j}*N*_{ij}(*t*). Since the total hazard of infectious contact to *j* at time *t* is

(2.7)

the process

(2.8)

is a zero-mean martingale when *θ* = *θ*_{0}. When *j* is observed from time 0 to time *T*, the log-likelihood is

(2.9)

and its score process is

(2.10)

which is a zero-mean martingale when *θ* = *θ*_{0}.

The complete-data log likelihood when we do not observe who-infected-whom is with score process . Since it is a sum of zero-mean martingales, is a zero-mean martingale. Differentiating , evaluating at *θ*_{0}, and taking expectations yields

(2.11)

where is the predictable variation process of . Under our regularity assumptions for *λ*(*τ*;*θ*), this establishes the consistency and asymptotic normality of the maximum likelihood estimator as the number of observed infections *m*→*∞* (Kalbfleisch and Prentice, 2002).

Given *θ*, the probability that person *i* infected person *j* is

(2.12)

where *λ*_{·j}(*t*;*θ*) is defined in (2.7) (Kenah *and others*, 2008). A transmission network **v** can be generated by choosing an *i*_{j} for each *j* such that 0 < *t*_{j} ≤ *T*. Given *θ*, the probability of **v** is

(2.13)

The log-likelihood corresponds to the likelihood

(2.14)

Given that we observe **v**, the log-likelihood (*θ*) corresponds to the likelihood , so it follows immediately that . Similarly, .

In a mass-action model, *C*_{ij} = 1 for all *ij*, but the hazard of infectious contact is inversely proportional to the population size. If *λ*_{n}(*τ*;*θ*) is the hazard of infectious contact in a model with population size *n* > 1,

for a “scaled hazard function” . Suppose is specified up to an unknown parameter vector θ with true value *θ*_{0}. When *n* is unknown and the observed number of infections *m**n*, there is an approximate likelihood that depends only on data about infected persons.

Expanding in terms of , we get

(2.15)

All summands in the first term are zero except for those *j* with 0 < *t*_{j} ≤ *T*. The second term is not a function of *θ* and can be ignored. The third term can be split into terms from *j* who get infected on or before time *T* and from those who remain uninfected at time *T*:

where *x*∧*y* = min(*x*,*y*) and . Since the first term is less than or equal to

disp

we have → for a fixed *m* as *n*→*∞*, where

(2.16)

In the case where who-infects-whom is observed, a similar derivation starting from (*θ*) leads to the asymptotic likelihood

(2.17)

Maximum likelihood estimates from and inherit consistency and asymptotic normality from (*θ*) and , respectively, where we take limits first as *n*→*∞* and then as *m*→*∞*.

In a mass-action model, the scaled hazard and cumulative hazard functions can be interpreted in terms of *R*_{0} and the time course of infectiousness in the limit as *n*→*∞*. Given an infectious period *ι*, the expected number of infectious contacts made is

(2.18)

Therefore,

(2.19)

where the expectation is taken over the distribution of *ι* among early infectives. Given that *i* makes infectious contact with *j* and has infectious period *ι*, the probability density function of the infectious contact interval from *i* to *j* is

(2.20)

In a network-based model, transmission takes place across the edges of a “contact network,” so we have *C*_{ij} = 1 if and only if there is an edge leading from *i* to *j* in the contact network. Here, we assume that contact networks are undirected, so *C*_{ij} = *C*_{ji} for all *i* and *j*. Let *d*_{j} denote the degree of node *j* in the contact network. If *j* is infected in the early stages of transmission, he or she can transmit infection across the remaining *d*_{j} − 1 edges. Across each edge, the probability of transmission is 1 − exp( − Λ(*ι*_{j};*θ*_{0})), so the expected number of secondary infections caused by *j* is (*d*_{j} − 1)(1 − exp( − Λ(*ι*_{j};*θ*_{0}))). Therefore,

(2.21)

where the expectation is taken over the joint distribution of *d* and *ι* among early infectives.

To evaluate the performance of the methods from Section 2, we ran 4 series of simulations: mass-action models with exponential and Weibull contact intervals and network-based models with exponential and Weibull contact intervals. All models had exponential infectious periods with mean one and no latent period. For each model, we used data from the first *m* = 1000 infectious in a population of size *n* = 100000. For each infected person *j*, we recorded the infection time *t*_{j} and the recovery time *t*_{j} + *ι*_{j}. In network-based models, the degree *d*_{j} and the indices of all neighbors of *j* were also recorded. All outbreaks started with a single imported infection at time *t* = 0. Outbreaks that terminated with a final size less than 1000 were discarded; if a model ran 100 times without producing an epidemic of size at least 1000, it was discarded and another model generated. In all models, ln *R*_{0} was sampled from a uniform distribution on (1.05,2), corresponding to *R*_{0} between 1.05 and 7.39, and the rate parameter of the contact interval distribution was chosen to achieve the sampled *R*_{0}. All network-based models had undirected Erdős–Rényi contact networks, which have a Poisson degree distribution (Newman *and others*, 2006). A new contact network with mean degree was generated for each simulation, where was sampled uniformly from ln*R*_{0}+0.1,3).

The true contact interval distribution in each model was either exponential or Weibull. The exponential distribution has the hazard function *λ*(*τ*;*β*) = *β* for *τ* > 0, where *β* > 0 is the rate parameter. The Weibull distribution has the hazard function *λ*(*τ*;*α*,*β*) = *α**β*(*β**τ*)^{α − 1} for *τ* > 0, where *α* > 0 is the shape parameter and *β* > 0 is the rate parameter. For models with Weibull contact interval distributions, lnα was sampled from a uniform distribution on ( − 1.25,1.25), corresponding to *α* between 0.29 and 3.49. For each model, analyses were performed assuming exponential, Weibull, and gamma contact interval distributions. The gamma distribution has the hazard function

(3.1)

where *β* is the rate parameter, *κ* is the shape parameter, and Γ(*a*,*b*) is the upper incomplete gamma function. The exponential distribution is a Weibull distribution with *α* = 1 and a gamma distribution with *κ* = 1. Using a gamma distribution allowed us to examine the effect of fitting a richer parametric model to exponential contact intervals and fitting a misspecified parametric model to Weibull contact intervals.

All analyses assumed that who-infected-whom was not observed. For network-based models, we used the log-likelihood . For mass-action models, we used the asymptotic log-likelihood from (2.16). When information about relationships between infected persons are missing, it is common to analyze the data using a mass-action model (Lipsitch *and others*, 2003), (, Mills), (, WallingaTeunis), (, FraserH1N1), (, YangH1N1). To look at the possible effects of this, we analyzed data generated by network-based models using the asymptotic log-likelihood for a mass-action model, ignoring all information about the links between infected persons and about uninfected neighbors of infected persons.

Simulations were implemented in Python 2.6 (www.python.org) using SciPy 0.7 (Jones *and others*, 2001–2009). Contact networks were generated using NetworkX 1.0 (Hagberg *and others*, 2008). Analyses were performed in R 2.10 (R Development Core Team, 2009) via RPy2 2.0 (Moreira and Warnes, 2002–2009). Maximum likelihood estimates and confidence intervals were calculated in the log scale using the “mle” and “confint” functions in R. Bias-corrected bootstrap confidence intervals were calculated for *R*_{0}. Multivariate normal samples were obtained using the Cholesky decomposition of the covariance matrix (Rizzo, 2008). Code for the models and estimates is provided as online supplementary material.

Let *ι*_{k} denote the infectious period of the *k*th infection observed. For mass-action models with exponential contact intervals, *R*_{0} = *β*. Our point estimate of *R*_{0} is , where is the rate parameter maximum likelihood estimate (MLE). A bootstrap replicate of *R*_{0} is , where ln*β*_{*} is a sample from the approximate normal distribution of *In* and *ι*_{1}^{*},…,*ι*_{m}^{*} is a bootstrap sample from the observed *ι*_{1},…,*ι*_{m}. For mass-action models with Weibull contact intervals *R*_{0} = *β*^{α}Γ(*α* + 1). Our point estimate of *R*_{0} is

(3.2)

where is the shape parameter MLE and is the rate parameter MLE. A bootstrap replicate of *R*_{0} is

(3.3)

where (ln*α*^{*},ln*β*^{*}) is a sample from the approximate joint normal distribution of .

In a contact network with *n* nodes, let denote the expected number of edges across which a nonimported infection in the early stages of an epidemic can transmit infection (Andersson, 1998), (, Newman1), (, Kenah1). Let *ι*_{k} and *d*_{k} denote the infectious period and degree, respectively, of the *k*th infection observed. When the contact interval has an exponential distribution, and . A bootstrap replicate of *R*_{0} is , where ln*β*^{*} is a sample from the approximate normal distribution of *ln* and (*ι*_{1}^{*},*d*_{1}^{*}),…,(*ι*_{m}^{*},*d*_{m}^{*}) is a bootstrap sample from (*ι*_{1},*d*_{1}),…,(*ι*_{m},*d*_{m}). When the contact interval has a Weibull distribution, , and

(3.4)

A bootstrap replicate of *R*_{0} is

(3.5)

where (ln*α*^{*}, ln*β*^{*}) is a sample from the approximate joint normal distribution of . We assume that infected persons can report their degree in the contact network, so these estimates of *R*_{0} do not depend on any prior knowledge of the contact network on the part of the investigators.

For mass-action models with exponential contact intervals, Figure 1 shows scatterplots of the estimated versus true *R*_{0} and Table 1, panel A, shows 95% confidence interval coverage probabilities. For analyses assuming an exponential contact interval, *R*_{0} estimates are close to the truth and coverage probabilities for *R*_{0} and *β* are excellent. Analyses assuming a Weibull or gamma distribution produce *R*_{0} estimates that are biased upward and right skewed, particularly for the Weibull distribution. Confidence interval coverage probabilities for *R*_{0} are above 80% but well below 95%. On the bright side, both analyses have greater than 90% chance of failing to reject the null hypothesis that the contact interval distribution is exponential at the 5% level of significance. Performance of the gamma and Weibull analyses improves at lower *R*_{0}.

Scatterplots of estimated versus true *R*_{0} for mass-action models with exponential contact interval distributions, with smoothed means excluding the highest 10 *R*_{0} estimates. Agreement is excellent when the contact interval is correctly assumed to have an **...**

For mass-action models with Weibull contact intervals, Figure 2 shows scatterplots of the estimated versus true *R*_{0} and Table 1, panel B, shows 95% confidence interval coverage probability estimates. Analyses assuming a Weibull contact interval distribution produce estimates that are right skewed but nearly unbiased, with confidence interval coverage probabilities only slightly below 95%. Analyses assuming an exponential or gamma contact interval distributions are biased upward and right skewed, but those assuming a gamma distribution perform far better than those assuming an exponential distribution. Performance of both misspecified analyses improves at lower *R*_{0}.

Scatterplot of estimated versus true *R*_{0} for mass-action models with Weibull contact interval distributions, with smoothed means excluding the highest ten *R*_{0} estimates. In all cases, the distribution of point estimates is strongly right skewed. When the **...**

For network-based models with exponential contact intervals, Figure 3 shows scatterplots of the estimated versus true *R*_{0} and Table 1, panel C, shows 95% confidence interval coverage probabilities. Analyses assuming a network-based model with exponential contact intervals produce excellent point and interval estimates. The cost of relaxing this assumption to allow Weibull or gamma contact intervals is far less than that seen for mass-action models in Figure 1 and Table 1, panel A. *R*_{0} estimates remain close to the truth, and all confidence intervals have coverage probabilities near 95%. Ignoring information about neighbors of infected persons and assuming a mass-action model with exponential contact intervals is far less successful, though performance improves at lower *R*_{0}.

Scatterplot of estimated versus true *R*_{0} for network-based models with exponential contact interval and infectious period distributions, with smoothed means excluding the highest 10 *R*_{0} estimates. Agreement is excellent for all three network-based estimates, **...**

For network-based models with Weibull contact intervals, Figure 4 shows scatterplots of the estimated versus true *R*_{0} and Table 1, panel D, shows 95% confidence interval coverage probabilities. Analyses assuming a network-based model with Weibull contact intervals produce excellent point and interval estimates. Assuming a gamma contact interval distribution produces *R*_{0} estimates biased upward and confidence interval coverage probabilities well below 95%. Assuming an exponential contact interval distribution produces *R*_{0} estimates that are biased upward and confidence intervals that are much too narrow. Ignoring information about neighbors of infected persons and assuming a mass-action model with Weibull contact intervals produces estimates that are strongly biased upward, although confidence interval coverage probabilities are superior to those of a network-based analysis assuming exponential confidence intervals. The performance of all misspecified analyses improves at lower *R*_{0}.

In this section, we use survival methods based on contact intervals to analyze 2 data sets from the influenza A(H1N1) pandemic in 2009. The first is an epidemic curve of the illness onsets of 763 lab-confirmed cases in Mexico City between April 13 and April 24 (Mexican Ministry of Health, 2009). The second is surveillance data from 57 households with 62 index cases and 35 secondary cases collected by the Los Angeles County Department of Public Health between April 22 and May 19. A summary of influenza A(H1N1) surveillance data in Los Angeles County up to June 1 is available online (Los Angeles County Department of Public Health, 2009). Individuals who presented to healthcare providers or the county health department with acute febrile respiratory illness (AFRI, defined as fever ≥ 100^{о}*F* plus cough, sore throat, or runny nose) had nasopharyngeal swabs and aspirates tested for pandemic influenza A(H1N1). Those who tested positive were interviewed by telephone and asked to report AFRI episodes among household contacts, including the dates of illness onset. Additional AFRI episodes among household contacts were ascertained during follow-up interviews 14 days after the illness onset of the index case. All subsequent cases of AFRI in the household were assumed to be cases of pandemic influenza A(H1N1). These data sets are typical of those available early in an epidemic.

In our analyses, we assumed that the incubation period (between infection and illness onset) has a discrete uniform distribution on {1, 2, 3} and that infectiousness begins one day before the onset of illness. In continuous time, infectious persons make their first infectious contacts after *t*_{i} + ε_{i}. In discrete time, infectious persons begin making infectious contacts on the day after their onset of infectiousness. We assumed a fixed infectious period of 5, 6, 7, or 8 days (not including the day of infectiousness onset). These natural history assumptions are adapted from those used in Yang *and M others* (2009). For each data set and each assumed infectious period, we fit exponential, Weibull, and gamma contact interval distributions. To evaluate the sensitivity of our results to the assumed incubation period distribution, we repeated all analyses assuming an incubation period with a discrete uniform distribution on {1,2}. To allow for variation in the incubation period, the likelihood contribution of each infected person *j* is

(4.1)

where *p*_{k} is the probability that the incubation period has length *k* and *L*_{j}(*θ*|incubationperiod = *k*) is the likelihood contribution of *j* assuming an incubation period of length *k*. Since the incubation periods in different individuals are assumed to be independent, the incubation period of *j* affects only his or her own likelihood contribution. Correcting for variable onset of infectiousness or infectious periods is much more difficult because, for each *i*, *t*_{i} + ε_{i} and *t*_{i} + *r*_{i} affect the likelihood contributions for all *j* such that *C*_{ij}*S*_{j}(*t*_{i}) = 1. For each data set, we fit exponential, Weibull, and gamma contact interval distributions.

The analysis of the epidemic curve from Mexico City uses the asymptotic likelihood for a mass-action model with unobserved infectors given in (2.16), and the *R*_{0} estimate given in (2.19). Bias-corrected bootstrap confidence intervals for *R*_{0} were obtained in the manner described in Section 3.1, using 10000 bootstrap replicates.

To analyze the household data from Los Angeles, we treated each household as a small network–based model on a complete graph, using the likelihood for a network-based model with unobserved infectors given in Section 2.6. We disregard transmission from outside the household because these data were from the very early stages of the pandemic. We used the estimated cumulative hazard function of the contact interval distribution and the assumed infectious period *ι* to estimate the “household secondary attack rate (SAR),” which is . In a household of size *H*, the household reproductive number *R*_{h} = SAR(*H* − 1). For both SAR and *R*_{h}, we calculated bias-corrected bootstrap confidence intervals. Bootstrap replicates of the household *SAR* were obtained using the approximate normal distribution of *ln*. Bootstrap replicates of *R*_{h} were obtained by combining samples of the contact interval distribution parameters with bootstrap samples from the household sizes of the index cases. Since the household sizes of the index cases had mean 5.15 and variance 8.29, we found that 100000 bootstrap replicates were necessary to get sufficiently stable confidence intervals. To estimate *R*_{0}, the estimated *R*_{h} can be divided by the proportion of cases infected by a household contact (Yang *and M others*, 2009). It is generally believed that 30–40% of influenza transmission occurs within households (Yang *and M others*, 2009). In these data, 35 out of 97 cases (36%) are thought to have been infected by a household contact.

*R*_{0} estimates based on the epidemic curve from Mexico City depend heavily on the assumed infectious period and the assumed parametric family of the contact interval distribution (Table 2, panel A). Assuming an exponential contact interval distribution, we got *R*_{0} estimates in the upper range of plausible *R*_{0} values for pandemic influenza A(H1N1), which most estimates placed between 1.3 and 2.3 (Fraser *and others*, 2009), (, McBryde), (, YangH1N1). However, model fits for the Weibull and gamma distributions all reject the null hypothesis of an exponential contact interval distribution, with shape parameters significantly greater than one. The behavior of the estimates assuming Weibull and gamma distributions is similar to that seen in the mass-action simulations with exponential contact intervals from Section 3; both tend to overestimate *R*_{0}, with the Weibull overestimating more than the gamma. Since the epidemic spread in Mexico City led to a pandemic, the *R*_{0} in Mexico City during the time interval covered by the epidemic curve may actually have been higher than that estimated elsewhere. Even so, the *R*_{0} estimates obtained by assuming a Weibull or gamma contact interval are too high given the subsequent progress of the pandemic.

In contrast, estimates of the household SAR and *R*_{h} based on the Los Angeles County household surveillance data are remarkably consistent across assumed infectious periods and contact interval distributions (Table 2, panel B). Model fits assuming exponential and Weibull contact interval distributions all indicate a household SAR around 0.08 an *R*_{h} around 0.40, and model fits assuming a gamma distribution indicate a household SAR around 0.09 and an *R*_{h} around 0.46. All fits assuming a Weibull contact interval distribution fail to reject the null hypothesis of an exponential contact interval distribution. Fits assuming a gamma contact interval distribution reject this null hypothesis only for assumed infectious periods of 7 or 8 days, which have *κ* point estimates and 95% confidence intervals of 0.84(0.81,0.89) and 0.81(0.79,0.84), respectively. The household SAR estimates are in the lower range of household SARs for previous strains of influenza A(H1N1), which run from approximately 0.05–0.20 (Yang *and M others*, 2009). The relatively low household SAR in Los Angeles County may be due in part to prophylactic antiviral use or other behavior changes in response to the pandemic. Of the 179 individuals for whom antiviral usage data were available, 38% of cases and 52% of noncases took prophylactic antivirals. Nonetheless, this analysis indicates that pandemic influenza transmission in Los Angeles County was not under control in late April and early May, 2009. Depending on the proportion of influenza transmission that occurs within households, our exponential and Weibull *R*_{h} estimates are consistent with an *R*_{0} between 1.0 and 1.3, and our gamma *R*_{h} estimates are consistent with an *R*_{0} between 1.1 and 1.6.

In both analyses, nearly identical results were obtained assuming an incubation period uniformly distributed on {1,2} (not shown).

The results of the simulations in Section 3 confirm that standard maximum likelihood methods can be applied successfully to survival likelihoods written in terms of the contact interval distribution. Survival methods based on contact intervals can incorporate a much greater variety of transmission models than methods based on generation or serial intervals. Analyses assuming mass action performed poorly when applied to data generated by network-based SEIR models, so this flexibility is essential for accurate statistical inference. The simulations also suggest that our methods produce more robust *R*_{0} estimates for network-based models than for mass-action models.

Similarly, our analysis of the Los Angeles data, which used household data and a network-based analysis, produced more consistent results than our analysis of the epidemic curve from Mexico City, which assumed mass action. The epidemic curve from Mexico City produced *R*_{0} estimates that were high and sensitive to the assumed infectious period. The household data produced a consistent and plausible picture of the state of the influenza A(H1N1) pandemic. Taken together, the simulations in Section 3 and the data analysis in Section 4 suggest that data from households or other settings where close contacts of infected individuals can be identified will support more reliable statistical inference than population-level data with little or no information about the relationships between cases or about uninfected persons.

The methods and results in this paper have several implications for data collection during an epidemic. First, they indicate that information about close contacts of cases can be very valuable, whether or not they are infected. The use of information contributed by uninfected person-time is a fundamental advantage of methods based on contact intervals over methods based on generation or serial intervals. Second, they show the potential value of data about the onset and duration of infectiousness. For an acute infectious disease, the onset and duration of illness may be a useful proxy, especially if there is some knowledge about the incubation period and the pattern of pathogen shedding. Methods based on serial intervals do not require such data, but this apparent advantage comes at a tremendous cost in terms of the flexibility and validity of the subsequent analysis. Framing the analysis in terms of contact intervals provides a statistical basis for treating the analysis of partially observed epidemics as a missing-data problem.

We have made many simplifying assumptions that could be relaxed. The SEIR framework limits our methods to acute, immunizing diseases that spread person-to-person, excluding diseases of great public health importance such as tuberculosis, meningococcal and pneumococcal diseases, foodborne and waterborne diseases, and HIV/AIDS. Many (but not all) emerging infections fit into the SEIR framework, and most infectious disease models make this assumption. The range of diseases covered by our methods can be extended by allowing individuals to experience multiple failures and failures of different types. We assumed that the population is homogeneous. To allow for heterogeneity, we could allow *λ*_{ij}(*τ*) to depend on covariates describing *i*, *j*, and their relationship. We assumed that the contact interval *τ*_{ij} is independent of the infectious period *ι*_{i}, which could be relaxed by including *ι*_{i} as a covariate in *λ*_{ij}(*τ*) or by adapting multivariate survival methods. Finally, we assumed that all times of infection, onset of infectiousness, and recovery are observed. This is unrealistic, but the development of incomplete-data methods requires complete-data methods. Survival methods based on contact intervals could be incorporated in a Bayesian or expectation-maximization framework that imputed missing or unobserved data.

In this paper, we have presented mass-action and network-based models as mutually exclusive. They are not, and real epidemics include transmission between close contacts and between strangers. The general framework of survival analysis based on contact intervals allows the simultaneous inclusion of both types of contacts as well as the inclusion of a hazard of infection from outside the population. It is also possible to derive analogues of Nelson–Aalen and Kaplan–Meier estimators for the contact interval distribution, which can be used to test the goodness of fit of parametric survival models for epidemic data. We hope to describe these methods in future manuscripts.

Survival methods based on contact intervals have the potential to become important tools in infectious disease epidemiology. We have introduced these methods in the simplest setting possible, but the powerful and flexible framework of survival analysis offers many possibilities for generalizing these methods or adapting them to the characteristics of particular diseases. These methods can be seen as descendants of methods based on generation and serial intervals but they are more flexible and more explicit about analytical assumptions and data requirements.

Supplementary material is available at http://biostatistics.oxfordjournals.org.

National Institute of General Medical Sciences (F32GM085945), “Linking transmission models and data analysis in infectious disease epidemiology”. Office space, computing facilities, and administrative support were provided by the Fred Hutchinson Cancer Research Center.

I gratefully acknowledge the guidance of M. Elizabeth Halloran, the computing advice of Dennis Chao and Dirk Peterson, and the comments Yang Yang, Ira M. Longini, Jr., and participants in the workshop “Design and Analysis of Infectious Disease Studies” (MFO Oberwolfach, November 1–7, 2009). Brit Oiulfstad, Dee Bagwell, Brandon Dean, Laurene Mascola, and Elizabeth Bancroft of the Los Angeles County Department of Public Health generously allowed the use of their data, which was provided to me by Jonathan Sugimoto. *Conflict of interest:* None declared.

- Aalen OO, Borgan O, Gjessing H. Statistics for Biology and Health. New York: Springer; 2009. Survival and Event History Analysis: A Process Point of View.
- Andersson H. Limit theorems for a random graph epidemic model. Annals of Applied Probability. 1998;8:1331–1349.
- Cauchemez S, Boëlle P-Y, Thomas G, Valleron A-J. Estimating in real time the efficacy of measures of control emerging communicable diseases. American Journal of Epidemiology. 2006;164:591–597. [PubMed]
- Diekmann O, Heesterbeek JAP. Mathematical Epidemiology of Infectious Diseases: Model Building, Analysis and Interpretation. Wiley Series in Mathematical and Computational Biology. Hoboken, NJ: John Wiley & Sons; 2000.
- Ferguson NM, Cummings DAT, Cauchemez S, Fraser C, Riley S, Meeyai A, Iamsirithaworn S, Burke DS. Strategies for containing an emerging influenza pandemic in southeast asia. Nature. 2005;437:209–214. [PubMed]
- Ferguson NM, Cummings DAT, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic. Nature. 2006;442:448–452. [PubMed]
- Fine PEM. The interval between successive cases of an infectious disease. American Journal of Epidemiology. 2003;158:1039–1047. [PubMed]
- Fraser C, Donelly CA, Cauchemez S, Hanage WP, Van Kerkhove MD, Hollingsworth TD, Griffin J, Baggaley RF, Jenkins HE, Lyons EJ. and others. Pandemic potential of a strain of influenza A (H1N1): early findings. Science. 2009;324:1557–1561. [PMC free article] [PubMed]
- Hagberg AA, Schult DA, Swart PJ. In: Exploring network structure, dynamics, and function using NetworkX. Varoquaux Gaël, Vaught Travis, Millman Jarrod., editors. Pasadena, CA: Proceedings of the 7th Python in Science Conference (SciPy2008); 2008. pp. 11–15. Available from http://conference.scipy.org /proceedings/SciPy2008/index.html.
- Jones E, Oliphant T, Peterson P. 2001–2009. SciPy: Open Source Scientific Tools for Python. http://www.scipy.org.
- Kalbfleisch JD, Prentice RL. Wiley Series in Probability and Statistics. 2nd edition. Hoboken, NJ: John Wiley & Sons; 2002. The Statistical Analysis of Failure Time Data.
- Kenah E, Lipsitch M, Robins JM. Generation interval contraction and epidemic data analysis. Mathematical Biosciences. 2008;213:71–79. [PMC free article] [PubMed]
- Kenah E, Robins JM. Second look at the spread of epidemics on networks. Physical Review E. 2007 76, 036113. [PMC free article] [PubMed]
- Lipsitch M, Cohen T, Cooper B, Robins JM, Ma S, James L, Gopalakrishna G, Chew SK, Tan CC, Samore MH. and others. Transmission dynamics and control of severe acute respiratory syndrome. Science. 2003;300:1966–1970. [PMC free article] [PubMed]
- Los Angeles County Department of Public Health. 2009. Novel influenza a(h1n1): epidemiology report (as of June 1, 2009). http://www.lapublichealth.org.
- McBryde ES, Bergeri I, van Gemert C, Rotty J, Headley EJ, Simpson K, Lester RA, Hellard M, Fielding JE. Early transmission characteristics of influenza A(H1N1)v in Australia: Victorian State, 16 May–3 June 2009. Eurosurveillance. 2009;14:19363. [PubMed]
- Mexican Ministry of Health. 2009. Situación actual de la epidemia (20 de mayo del 2009) http://portal.salud.gob.mx/contenidos/noticias/influenza/estadisticas.html.
- Mills C, Robins JM, Lipsitch M. Transmissibility of 1918 pandemic influenza. Nature. 2004;432:904–906. [PubMed]
- Moreira W, Warnes GR. 2002–2009. RPy Reference Manual. http://rpy.sourceforge.net.
- Newman M, Barabási A-L, Watts DJ. Structure and Dynamics of Networks. Princeton, NJ: Princeton University Press; 2006.
- Newman MEJ. Spread of epidemic disease on networks. Physical Review E. 2002 66, 016128. [PubMed]
- R Development Core Team. Vienna: Austria: R Foundation for Statistical Computing; 2009. R: A Language and Environment for Statistical Computing. Available from http://www.R-project.org.
- Rizzo ML. Statistical Computing with R. Boca Raton, FL: Chapman & Hall/CRC; 2008.
- Roberts MG, Heesterbeek JAP. Model-consistent estimation of the basic reproduction number from the incidence of an emerging infection. Journal of Mathematical Biology. 2007;55:803–816. [PMC free article] [PubMed]
- Svensson Å. A note on generation times in epidemic models. Mathematical Biosciences. 2007;208:300–311. [PubMed]
- Wallinga J, Lipsitch M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proceedings of the Royal Society, Series B. 2007;274:599–604. [PMC free article] [PubMed]
- Wallinga J, Teunis P. Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. American Journal of Epidemiology. 2004;160:509–516. [PubMed]
- White LF, Pagano M. A likelihood-based method for real-time estimate of the serial interval and reproductive number of an epidemic. Statistics in Medicine. 2008;27:2999–3016. [PMC free article] [PubMed]
- Yang Y, Sugimoto J, Halloran ME, Basta NE, Chao DL, Matrajt L, Potter G, Kenah E, Longini, IM The transmissibility and control of pandemic influenza A(H1N1) virus. Science. 2009;326:729–733. [PMC free article] [PubMed]

Articles from Biostatistics (Oxford, England) are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |