Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC3402623

Formats

Article sections

- Summary
- 1. Introduction
- 2. Methods
- 3. Simulation Study
- 4. Data Analysis
- 5. Discussion
- Supplementary Material
- References

Authors

Related links

Biometrics. Author manuscript; available in PMC 2013 January 3.

Published in final edited form as:

Published online 2012 April 16. doi: 10.1111/j.1541-0420.2012.01757.x

PMCID: PMC3402623

NIHMSID: NIHMS361614

Yang Yang,^{1} Ira M. Longini, Jr.,^{1,}^{2} M. Elizabeth Halloran,^{1,}^{2} and Valerie Obenchain^{1}

The publisher's final edited version of this article is available at Biometrics

See other articles in PMC that cite the published article.

In epidemics of infectious diseases such as influenza, an individual may have one of four possible final states: prior immune, escaped from infection, infected with symptoms, and infected asymptomatically. The exact state is often not observed. In addition, the unobserved transmission times of asymptomatic infections further complicate analysis. Under the assumption of missing at random, data-augmentation techniques can be used to integrate out such uncertainties. We adapt an importance-sampling-based Monte Carlo EM (MCEM) algorithm to the setting of an infectious disease transmitted in close contact groups. Assuming the independence between close contact groups, we propose a hybrid EM-MCEM algorithm that applies the MCEM or the traditional EM algorithms to each close contact group depending on the dimension of missing data in that group, and discuss the variance estimation for this practice. In addition, we propose a bootstrap approach to assess the total Monte Carlo error and factor that error into the variance estimation. The proposed methods are evaluated using simulation studies. We use the hybrid EM-MCEM algorithm to analyze two influenza epidemics in the late 1970s to assess the effects of age and pre-season antibody levels on the transmissibility and pathogenicity of the viruses.

To assess transmissibility and pathogenicity of acute infectious diseases such as seasonal influenza in close contact groups, it is important to collect information about baseline susceptibility, infection outcomes and symptom outcomes at the individual level whenever possible. Infection outcomes here refer to infection status (infected or escaped) and infection time. Data on symptom outcomes include symptom status (present or absent) and symptom onset time. In more complex settings, the outcomes may incorporate additional information such as the viral type or genetic sequences of the pathogen associated with a given infection. Infection and symptom outcomes of each individual in turn provide information about the exposure history of his or her contacts. In most epidemiological studies, however, not all individual-level information is collected. Among all the crucial measurements, the most difficult one to obtain is the infection time of asymptomatic cases. Missing information also occurs in outbreaks of newly emerging pathogens, in which baseline immunological measures are generally not available. In such settings, a person with a high level of post-epidemic antibody but without symptoms or viral culture may have two potential states: pre-season immunity to infection, or infected but without symptoms.

Failure to account for unobserved key quantities will likely bias the assessment of transmissibility and the effects of risk factors (Sugimoto et al., 2011) and may under-evaluate the variances of parameter estimates. Under the assumption of missing at random (MAR), unobserved quantities can be integrated out for proper statistical inference (Little and Rubin, 2002). For the infection times of asymptomatic cases, MAR means that whether an infection is symptomatic or not does not depend on the time of infection, when conditioning on the observed exposure history determined by observed contacts with infectious sources and covariates that modify the transmissibility. Analytic integration is possible in certain models, for instance, the one addressing unmeasured heterogeneity in baseline susceptibility (Longini and Halloran, 1996). Numeric integration via data-augmentation techniques can be effectively used for unobserved infection times. Within the Bayesian framework, Markov chain Monte Carlo (MCMC) algorithms have been used in different study settings for infectious diseases (Gibson, 1997; Auranen et al., 2000; O’Neill et al., 2000; Cauchemez et al., 2004; Yang, Longini and Halloran, 2009). The use of these models by epidemiological investigators are limited by either the heavy computational burden or the need to specify prior distributions for parameters.

Another option for data augmentation is the EM algorithm which has been used to augment the data with infection times (Yang, Longini and Halloran, 2007) or the transmission trees (Kenah, 2010) when only symptomatic infections are modeled. Direct application of the EM algorithm to asymptomatic infections may be numerically impossible due to the high-dimensionality of the unobserved infection times. In this work, we explore the applicability of an MCEM algorithm based on importance sampling (Levine and Casella, 2001) to the setting of infectious diseases. Taking advantage of the assumption that the close contact groups are independent and that missing data of high dimension often exist in only a few of the observed contact groups, we propose a hybrid algorithm that applies the MCEM algorithm to contact groups with high dimensional missing data and the EM algorithm to the others. To further expedite the algorithm, we propose a bootstrap approach to assess the Monte Carlo error due to sampling and account for such error in the variance estimation.

Our investigation was motivated by the surveillance study of influenza infections in families with school-aged children that was implemented in Seattle, Washington, during 1975–1979 (Fox et al., 1982a,b). In this study, a total of 639 families were recruited during Octobers and Novembers of 1975 and 1976. Blood samples were collected at four-month intervals, from which influenza subtype-specific hemagglutinin antibody levels were measured via hemagglutinin inhibition (HI) titration both before and after the influenza season. The pre-season HI titers can be used to assess susceptibility to the circulating influenza strains, and a substantial increase between the pre- and post-season HI titers indicates the occurrence of an infection. Illness information and nose-throat swabs were collected during regularly scheduled visits or when a new illness was reported. Data are relatively complete for the type of analysis considered here for two of the four winters: 1977–1978 and 1978–1979. The circulating strains are H3N2 (A/Victoria and A/Texas) in the former and H1N1 (A/USSR) in the latter. There were small herald outbreaks of H3N2 in the spring of 1977 and H1N1 in the spring of 1978, which immunized fully or partially a proportion of the study families for the subsequent major outbreaks in the winters. Our interest is to assess the transmissibility of the two strains and the associated effects of age and prior-immunity level measured by the HI titers. The epidemic curves (timeline of illness onsets in symptomatic infections) of the two epidemics are shown in Figure 1. Inclusion/exclusion criteria of study households for analysis is given in Web Appendix A. Numbers of individuals by infection status, age group and pre-season HI titers used in this analysis are shown in Table 1. The household sizes are distributed, in the form of size:number, as (2:6, 3:41, 4:76, 5:31, 6:11, 7:8, 8:2, 9:1) in the earlier epidemic and as (2:5, 3:25, 4:48, 5:20, 6:8, 7:3, 8:3, 9:1) in the later one. Both epidemics lasted for a few months, and each has 20–30% households with one or more asymptomatic infections. The large numbers of potential configurations of individual states in these households make the traditional EM algorithm computationally expensive.

Epidemic curves of illness onsets of symptomatic cases in the Seattle influenza seasons. The dashed lines denote the relative infectivity levels of the unknown source over time, calculated using rescaled 7-day moving averages of daily case numbers. (a) **...**

Consider a prospective study for the transmission of influenza in *H* households each of size *n _{h}*,

Let
${t}_{i}^{}$ and * _{i}* be the infection day and infectiousness onset day of person

We assume *z _{i}* ~ Bernolli(

$$\begin{array}{ll}\text{logit}({\alpha}_{i})=\text{logit}(\alpha )+{\mathit{x}}_{i}^{\tau}{\mathit{\beta}}_{\alpha},\hfill & \text{logit}({it}_{)}\text{logit}({b}_{it})=\text{logit}(b)+{\mathit{x}}_{it}^{\tau}{\mathit{\beta}}_{b},\hfill & \text{logit}({p}_{\mathit{ijt}})={(-\infty )}^{\mathbf{1}({c}_{j}\ne {c}_{i})}{\text{(logit}(p)+{\mathit{x}}_{it}^{\tau}{\mathit{\beta}}_{p})}^{\mathbf{1}({c}_{j}={c}_{i})},\hfill \hfill \end{array}$$

where *α _{i}*,

The probability of person *i* escaping infection during and up to day *t* are, respectively, *e _{it}* = (1−

$${L}_{(i)}(\mathit{\psi}{\mathit{u}}_{i})={\alpha}_{i}^{zi}{(1-{\alpha}_{i})}^{1-{z}_{i}}{\{{{q}_{iT}}^{1-{y}_{i}}{(\sum _{t}f({\stackrel{~}{t}}_{i}-t){q}_{i(t-1)}(1-{e}_{it}){({it}_{)}{s}_{i}}^{{(1-{it}_{)}1-{s}_{i}}^{}}){y}_{i}}^{\}}1-{z}_{i}.}^{}$$

For household *h*, let *O** _{h}* = {

$${L}_{h}(\mathit{\psi}{\mathit{O}}_{h},{\mathit{U}}_{h})=\underset{{L}_{(i)}(\mathit{\psi}{\mathit{u}}_{i})}{i:{c}_{i}=h}$$

and

$$L(\mathit{\psi}\mathit{O},\mathit{U})=\underset{{L}_{h}(\mathit{\psi}{\mathit{O}}_{h},{\mathit{U}}_{h}).}{\overset{}{h=1H}}$$

We assume independence between households, and therefore imputation of missing quantities is performed at the household level. Let *δ _{h}* be the dimension of domain, or equivalently the number of possible realizations, of

The hybrid EM-MCEM algorithm adapted from the importance-sampling-based MCEM algorithm of Levine and Casella (2001) is the following:

- Choose the value of
*J*to partition the households into three groups: Δ= {_{OBS}*h: δ*= 1}, Δ_{h}= {_{EM}*h*: 1 <*δ*<_{h}*J*}, and Δ= {_{MCEM}*h: δ*≥_{h}*J*}. EM is implemented on households in Δ, and MCEM is implemented on households in Δ_{EM}._{MCEM} - Choose an integer
*K*to be the number of importance samples for the MCEM algorithm. The value of*K*is chosen to be reasonablly large, depending on the computational feasibility and how well the MCMC samples are mixing. - Choose an initial value
*ψ*^{(0)}for. For household*ψ**h*Δ, draw_{MCEM}*K*samples of*U*from Pr(_{h}*U*|_{h}*O*,_{h}*ψ*^{(0)}) using a MCMC algorithm, and let these samples be*Û*,_{hk}*k*= 1, …,*K*. - Set
^{(0)}=*ψ*^{(0)} - At iteration
*r*≥ 0,- Update the conditional probabilities for all
*h*Δ:_{EM}$${\lambda}_{hk}^{(r)}=\frac{{L}_{h}({\widehat{\mathit{\psi}}}^{(r)}{\mathit{O}}_{h},{\mathit{U}}_{hk}^{}{\sum}_{l=1}^{{\delta}_{h}}{L}_{h}({\widehat{\mathit{\psi}}}^{(r)}{\mathit{O}}_{h},{\mathit{U}}_{hl}^{},k=1,\dots ,{\delta}_{h},}{}$$and the importance weights for all*h*Δ:_{MCEM}$${\omega}_{hk}^{(r)}=\frac{{L}_{h}({\widehat{\mathit{\psi}}}^{(r)}{\mathit{O}}_{h},{\widehat{\mathit{U}}}_{hk}){L}_{h}({\mathit{\psi}}^{(0)}{\mathit{O}}_{h},{\widehat{\mathit{U}}}_{hk}),k=1,\dots ,K.}{}$$ - Maximize$$Q(\mathit{\psi},{\widehat{\mathit{\psi}}}^{(r)})=\sum _{h{\mathrm{\Delta}}_{\mathit{OBS}}}$$with regard to
to find*ψ*^{(}^{r}^{+1)}, where ${\omega}_{h\xb7}^{(r)}={\sum}_{k=1}^{K}{\omega}_{hk}^{(r)}$ for*h*Δ._{MCEM}

Repeat this step until convergence in the estimates of ** ψ**, and denote the final estimate by

The traditional MCEM algorithm of Levine and Casella (2001) is a special case of the hybrid algorithm with *J* = 2. Like the traditional algorithm, the performance of this hybrid algorithm may be sensitive to the initial value *ψ*^{(0)}. We follow the suggesiton of Levine and Casella (2001) that a few initial iterations be carried out with all weights set to 1 and that new samples be drawn based on updated estimates at each iteration, and *ψ*^{(0)} be set to the estimates of the final iteration.

For the types of data we consider, it is convenient to use the Gibbs sampler to draw the importance samples in step (3). Let Λ* _{i}* be the collection of all possible realizations of complete data for each

$$Pr({\mathit{u}}_{i}={\mathit{u}}_{i}^{}$$

For each household *h*, we start with some initial value of *U** _{h}*, and alternate the sampling of

For point estimation, one maximizes
${\text{E}}_{\mathit{U}\mathit{O},{\widehat{\mathit{\psi}}}^{(r)}}$ at each EM iteration. Taking advantage of the independence between households, each E_{Uh|Oh, (r)} ln *L _{h}*(

$${\text{E}}_{\mathit{U}\mathit{O},\widehat{\mathit{\psi}}}$$

which is not equal to

$$({\text{E}}_{\mathit{U}\mathit{O},\widehat{\mathit{\psi}}}\times {({\text{E}}_{\mathit{U}\mathit{O},\widehat{\mathit{\psi}}}\tau .}^{}$$

To facilitate variance evaluation, we need an equal number of importance samples for all housheolds with *δ _{h}* > 1. This can be attained by generating

$$\begin{array}{l}{\widehat{\mathit{V}}}^{-1}(\widehat{\mathit{\psi}},\stackrel{~}{\mathit{\psi}},\stackrel{~}{\mathit{U}})=(\frac{1}{{\sum}_{k=1}^{K}{\stackrel{~}{\omega}}_{k}}\sum _{k=1}^{K}{\stackrel{~}{\omega}}_{k}\frac{\mathit{d}lnL(\mathit{\psi}\mathit{O},{\stackrel{~}{\mathit{U}}}_{\xb7k})d\mathit{\psi}}{})\phantom{\rule{0.16667em}{0ex}}{(\frac{1}{{\sum}_{k=1}^{K}{\stackrel{~}{\omega}}_{k}}\sum _{k=1}^{K}{\stackrel{~}{\omega}}_{k}\frac{\mathit{d}lnL(\mathit{\psi}\mathit{O},{\stackrel{~}{\mathit{U}}}_{\xb7k})d\mathit{\psi}}{})\tau}^{}-\frac{1}{{\sum}_{k=1}^{K}{\stackrel{~}{\omega}}_{k}}\sum _{k=1}^{K}{{\stackrel{~}{\omega}}_{k}\{\frac{{d}^{2}lnL(\mathit{\psi}\mathit{O},{\stackrel{~}{\mathit{U}}}_{\xb7k})d{\mathit{\psi}}^{2}}{+}|\mathit{\psi}=\widehat{\mathit{\psi}}.}_{}\end{array}$$

Another special case worth noting is ** = **** which yields *** _{k}* = 1 for all

In traditional MCEM algorithms, there are steps to assess whether the conditional MC error (MC variation of *ψ*^{(}^{r}^{+1)}|*ψ*^{(}^{r}^{)}) swamps out the E-M step, and if so, how to increase the number of importance samples. The assessment of the conditional MC errors in previous work was mostly based on the central limit theorem, so that the sampling of *u*_{i}*U** _{h}*, which is computationally expensive, can be skipped. Booth and Hobert (1999) gave a sandwich formula for the conditional MC variance when the samples are independent. Using subsamples drawn at the renewal times of the MCMC samples, Levine and Casella (2001) estimated the conditional MC variance for the weighted average of the score functions, but not for the parameter estimates themselves.

We found in simulation studies that the assessemnt of conditional MC error leads to extremely slow convergence rate, even with a moderate sample size (500 people) and a moderate tolerance level of relative error in parameter estimates, such as 10^{−6}. Alternatively, it is more practical to choose a reasonably large *K* and factor the total MC error directly into the variance estimates of the parameters. By total MC error we mean the variation in the final parameter estimates due to the sampling. Unlike the conditional MC error, the total MC error needs to be assessed only once after convergence; therefore, sampling- and resampling-based approaches are computationally affordable.

An intuitive approach is to generate via MCMC multiple sets of new point-estimation importance samples based on *ψ*^{(0)}, to which we refer as the exact approach. We propose a more efficient resampling-based approach as follows: For each household *h* Δ* _{MCEM}*, assume that the

$${\widehat{\mathit{V}}}^{\#}=\frac{1}{M}\sum _{m=1}^{M}{\left({\widehat{\mathit{\psi}}}_{m}^{\#}-\frac{1}{M}\sum _{m=1}^{M}{\widehat{\mathit{\psi}}}_{m}^{\#}\right)}^{2}+\frac{1}{M}\sum _{m=1}^{M}{\widehat{\mathit{V}}}_{m}^{\#}.$$

(1)

We simulate 30-day epidemics in a population composed of 100 households under the following setting. The household sizes are distributed, in the form of size:probability, as (3:0.1, 4:0.2, 5:0.3, 6:0.2, 7:0.1, 8:0.05, 9:0.05), with a mean size of 5.4. The parameter values are set to *b* = 0.01, *p* = 0.05, *α* = 0.2, and = 0.7. A single time-independent binary covariate was generated for each individual to modify the transmission probabilities via logit(*b _{it}*) = logit(

We first investigate, via both the traditional MCEM and the hybrid EM-MCEM algorithms, the performance of the resampling approach for assessing the total MC error by comparing it with the exact approach. For illustration, only the infectiousness onset times ** of asymptomatic cases are subject to missingness, and the range of possible values for each missing *** _{i}* is 2–30 days, assuming a minimum latent period of one day. We set

Comparing the total MC variances based on *M* = 500 new importance samplings between the bootstrap approach and the exact approach. Results for *α* and are not presented as the estimates are constant. A single 30-day epidemic in 100 households **...**

With the current simulation setting, the time ratio of the bootstrap approach to the exact approach is at the scale of 10^{−5} for the sampling phase and 0.6–0.7 for the whole process of estimating the total MC variance, regardless of the algorithm used or the value of *K*. Relative to the traditional MCEM algorithm, the hybrid EM-MCEM algorithm reduced the total MC variance by more than 50% for all three parameters at all values of *K*. Table 2 also suggests that the MC variance is roughly proportional to 1/*K*, as expected. With a tolerance rate of 10^{−5} for the relative error in , if the true value of *b* is at the 10^{−2} scale, a scale of 10^{−7} is needed for the standard deviation of , or equivalently, 10^{−14} for the MC variance, which translates into a sample size of about 5 million for the traditional MCEM algorithm and 1.5 million for the hybrid algorithm. It is therefore much more practical to factor the MC error into the variance estimation using (1) rather than to control it by increasing the number of importance samples. In addition, the component
${\scriptstyle \frac{1}{M}}{\widehat{\mathit{V}}}_{m}^{\#}$ in (1), which measures the uncertainty inherent in epidemic-generation, is at the scales of 10^{−6} for *b*, 10^{−5} for *p*, and 10^{−2} for the OR (not shown in the table), implying that the MC error is ignorable in magnitude under this simulation setting.

We then investigate the performance in estimation of the hybrid EM-MCEM algorithm versus the traditional MCEM algorithm. In addition to the infectiousness onset times of asymptomatic infections, one more layer of missingness is introduced in this investigation. Among people whose final states are either prior immunity (*z _{i}* = 1,

Results of comparing the two algorithms are based on 500 simulated epidemics and presented in Table 3. When the assumed value of *θ* is correct, most parameter estimates appear to be unbiased, regardless of the value of *ξ*, except for a slight upward bias for OR which will diminish for larger population sizes or longer epidemics. Interestingly, higher values of *ξ*, which correspond to higher uncertainty levels of the data, only slightly increase the standard deviation (SD) estimates for all parameters except for . However, setting *ξ* = 1.0 leads to non-convergence in most simulated epidemics. This observation suggests that it is good enough to determine the prior immunity status in a small proportion of uninfected people, but no attempt of such determination at all will cause a nonidentifiability issue if one intends to estimate the prior immunity level. That estimation for is not affected by the value of *ξ* is expected as the information about only relies on the infections. The traditional MCEM algorithm yielded almost identical results as the hybird algorithm, with only a slight difference in the CI coverage for *α*.

Performance of the MCEM algorithms: mean estimates, standard deviations (SD) and 95% CI coverages are based on 500 simulated 30-day epidemics among 100 households each of size 5, with *b* = 0.01, *p* = 0.05, *α* = 0.2, = 0.3 and *θ* **...**

Misspecification of the value of *θ* in estimation does affect the inference about *b* and *p* but not other parameters. Assuming *θ* = 0 leads to minor overestimation and slightly lower CI coverage than the desired level for *b* and *p*. Conversely, the direction of bias changes to downwards when *θ* = 1 is assumed; nevertheless, the CI coverage for *b* seems relatively insensitive to the bias. The estimate for the odds ratio would not be affected by the value of *θ* as long as the covariate value of a susceptible is independent of his or her exposure level, e.g., independent of whether or not being exposed to asymptomatic infections. For the proportion of preseason immunity, *α*, and the probability of symptom onset given infection, , their information is determined by whether infection occurs or not rather than by the infectiousness level of asymptomatic infections,

With *ξ* = 0.5, *θ* = 0.5, and other simulation parameters the same as in Table 3, the computational efficiency is compared between the traditional MCEM and the hybrid EM-MCEM algorithms in Figure 2 for a few settings of *J* and *K*. Panel (a) shows a highly skewed distribution of *δ _{h}*, the household-specific numbers of possible realizations of missing final states, over all households and 100 simulated epidemics, with the majority of the values below 50 and a few larger than one million. For each setting of

Comparison of the computational efficiency between the traditional MCEM algorithm and the hybrid EM-MCEM algorithm for disease transmission in independent households. Panel (a) shows the distribution of δ_{h}, the numbers of possible realizations **...**

As *J* increases, the times for the sampling and EM iterations stages first decrease until *J* reaches thousands and begin to rise thereafter, the exact turning point depending on *K*. The time for variance calculation is minimal when *J* = 2, as no new sampling is necessary, and is constant when *J* > 2 but constitutes only a small fraction of the overall time. In most scenarios, the overall computational time seems to be primarily driven by the sampling stage, except when *J* is extremely large, e.g., ≥ 10^{6}, where the EM iterations stage become dominant and the overall time becomes unfavorable as compared to *J* = 2. The slight increases in the time of the sampling stage at large values of *J* are attributed to the initial EM iterations to obtain ^{(0)} in this stage. In terms of computational efficiency under our simulation setting, the hybrid algorithm with a wide range of *J* outperforms the traditional MCEM algorithm, and the relative efficiency reaches ≥ 50% at appropriate values of *J*. Not surprisingly, the traditional EM algorithm is the most time-inefficient with such a high dimension of individual-level missing data.

The daily probability of infection due to the unknown source accounts for the risk due to casual contacts outside households; therefore it is reasonable to associate its variation over time with the changes in the observed total number of cases during each epidemic course, assuming that the study households are representative of all infected households. For simplicity, we assign to each day the average daily number of cases in the 7-day window centered around that day, and refer to the moving averages as the expected numbers. These expected numbers are then divided by their average over all days, so that the rescaled expected numbers, denoted by *W _{t}* for day

$$\begin{array}{l}\text{logit}({it}_{)}\end{array}$$

with the constraints *β _{b}*

Results are shown in Table 4, with the relative infectiousness level *θ*, which we do not estimate directly, varying over 0, 0.5 and 1. Estimates of parameters at *θ* = 0.5 are reported as the primary results. Interestingly, the two subtypes of influenza had similar transmissibility in terms of both *b* and *p*. The secondary attack rate (SAR) is the baseline probability that a susceptible is infected by a case during his or her infectious period. Mathematically,
$\text{SAR}=1-{l=0\delta -1}_{(1-pg(l))}^{}$. The SAR presented in Table 4 indicates that the probability a susceptible child with the lowest HI titer level gets infected when exposed to a symptomatic infective household member is about 0.27 for both subtypes in these epidemics. The H3N2 strain was somewhat more pathogenic than the H1N1 strain, with the probability of developing symptoms being 0.81 for the former and 0.62 for the latter, but their confidence intervals are overlapping. For the effects of age group and pre-season HI titer level, odds ratios instead of regression coefficients are presented, and are denoted by OR* _{bp,covariate}* for transmissibility and OR

Analysis of the influenza H3N2 epidemic during 1977–1978 and the H1N1 epidemic during 1978–1979 in Seattle. Estimates and 95% CIs (in parentheses) are presented by the infectiousness level of asymptomatic cases relative to symptomatic **...**

Children were more susceptible to both infection and the disease (symptoms) than adults, although the CIs for OR* _{,AGE}* do include 1. The age effects on pathogenicity are similar between the two subtypes, adults having about 40–55% reduction in the odds of developing symptoms. The effects of pre-season HI titer levels on transmissibility share the same trend between the two subtypes, that is, the level adjacent to the reference level was not protective, but the next and higher levels were. For H3N2, the titers of 40 and 80+ reduced the odds of infection by 50% and 78%, respectively. For H1N1, the titer of 40 or above was highly protective, reducing the odds of infection by 91%, but the titers of 10–20 did not show protection. A further examination of the H1N1 data found that the majority of cases were children (144/149), and the majority of children in the second HI titer level (63/80) had titer values of 10. In addition, the majority of adults (158/220) had titer values of 20 or more, and three out of the five adult cases had a pre-season titer value of 10. An analysis with a finer grouping of pre-season HI titers (< 10, 10, 20, ≥ 40) but without modeling pathogenicity for the H1N1 epidemic yielded OR estimates of 0.92 (95% CI:0.62, 1.36), 0.21 (95% CI:0.065, 0.69), and 0.08 (95% CI:0.01, 0.61), confirming the protective effect of titers ≥ 20. For H3N2, the titers of 40+ reduced the odds of disease by 60–70%. For H1N1, only a small number of cases had titers ≥ 10, so these titers are combined into one level for the estimation regarding the OR for pathogenicity. It seems surprising that, compared to the base titer level, the higher level doubled the odds of disease, but this effect is not statistically significant.

The variation in *θ* changes the estimates for transmission probabilities to a certain extent, with a general association of lower estimates with higher infectiousness levels of asymptomatic infections. If the extra exposure to asymptomatic infections does not change the ratio of existing exposure levels to symptomatic infections between infected and escaped people, which is satisfied under MAR (being symptomatic or not is independent of the time of infection given observed data), then such association is expected as the same proportion of infections among initial susceptibles is explained by more infective sources when asymptomatic infections are assumed infectious. The estimation of *p* for the H1N1 epidemic is quite sensitive to the value of *θ*, leading to a 50% jump in the SAR estimate when *θ* moves from 0 to 1. The estimates for the ORs are highly resistant to the variation in *θ* for both subtypes. Further sensitivity analyses were performed with shorter and longer means for the latent (1.6 and 2.5 days) and infectious (3.8 and 5.8 days) periods within reasonable ranges. Very mild changes occurred in the estimates for *b* and *p* in response to the variation of the latent-period distribution. With a longer infectious period, the estimates for *p* decreased moderately, but the estimates for the SARs remained unchanged. This is expected because a reasonable variation of the distribution of the infectious period should not drastically change the total amount of risks delivered by all cases. All OR estimates are highly robust to the changes in the assumptions about natural history.

The hybrid MCEM algorithm is an extention of the traditional importance-sampling-based MCEM algorithm in Levine and Casella (2001) to account for missing data in settings where independent units have different dimensions of domain of missing data and integration over extremely large domains may be numerically impractical. The use of this algorithm is demonstrated in one such setting, infectious disease transmission data among close contact groups in which individuals may have unknown final disease states. Via simulating influenza transmission within households we show that the hybrid algorithm is computationally more efficient than the traditional MCEM or EM algorithms. The optimal partition point *J* may be difficult to find in advance, mainly because the number of EM iterations is unknown. A practical guideline is to apply MCEM only to units with very high dimension of missing data. If the clusters of transmission under study is much larger than households, such as school transmission studies, there could be much higher numbers of asymptomatic infections, and the computational efficiency of the MCEM component of the hybird algorithm would be even more appreciable.

To avoid the slow convergence problem that often occurs when conditional MC error is assessed in the presence of high-dimensional missing data, we fix the number of importance samples and assess the total MC error. Such assessment can be further expedited via the bootstrap of importance samples. However, this approach does not guarantee monotonic increase in the likelihood over the EM iterations. One way to avoid far-from-optimal estimates is to start the algorithm from multiple initial estimates. If the multiple estimation processes do not agree on the final estimates, then *K* should be increased. The hybrid EM-MCEM algorithm can also be coupled with the approach of assessing conditional MC error and increasing the number of importance samples in Levine and Casella (2001), whenever computationally affordable, and should converge faster than the traditional MCEM algorithm, as the hybrid algorithm has smaller MC errors as shown in Table 2.

The EM algorithm is known to converge slowly in some applications, which we have not experienced in our investigation. In most cases, there were no more than ten EM iterations before convergence. When slow convergence is of concern, various EM accelerators may be considered, for example, the quasi-Newton accelerator QN1 proposed by Jamshidian and Jennrich (1997). Some accelerators such as the conjugate gradient accelerator (Jamshidian and Jennrich, 1993) require knowledge about the likelihood based on the observed data and are therefore not applicable to the data considered here as *L*(** ψ**|

We discussed the application of the hybrid EM-MCEM algorithm for infectious disease data under the prospective design in which households are followed from the beginning of an epidemic. The algorithm can be generalized to the case-ascertained design in which households are enrolled only when index cases are acertained by symptom onsets. However, caution should be taken for determining the sampling range of the infectiousness onset times for asymptomatic cases, as the starting time of the epidemic is generally unknown. It may be reasonable to include in the sampling range a few days before the symptom onset time of the index case, so that all possible realizations of exposure of susceptibles to the asymptomatic infections are taken into account. If an asymptomatic person could be infected much earlier than the index case, then prior immunity can be considered as a possible state in addition to asymptomatic infection, so as to avoid a too wide sampling range.

Using only final infection states but no information on symptom onset times and assuming that asymptomatic infections are as infectious as symptomatic infections, a final size model (Longini et al., 1982) estimated the infection SARs as 0.21 (95% CI:0.17, 0.25) for the 1977–1978 H3N2 epidemic and 0.31 (95% CI:0.22, 0.39) for the 1978–1979 H1N1 epidemic. No covariate was explicitly adjusted for in those analyses, but only subjects with HI titers ≤ 640 (nearly the whole population as only three individuals have higher titers) were considered susceptible in the H3N2 epidemic, and only subjects < 20 years old with HI titers ≤ 10 were considered susceptible in the H1N1 epidemic. In addition, the risk of infection from the unknown source was assumed constant. Applying similar cutoffs for the HI titers and assumption about relative infectivity of asymptomatic infections, our infection SAR estimates are 0.16 (95% CI:0.12, 0.21) for the H3N2 epidemic and 0.27 (95% CI:0.19, 0.37) for the H1N1 epidemic. The discrepancy in the SAR estimates for H3N2 is most likely a consequence of different inclusion/exclusion criteria of households for analysis. The numbers of households used in Longini et al. (1982) are 159 and 93 for the two epidemics, both smaller than the numbers we used. The distribution of the final sizes was given in Longini et al. (1982) for the H1N1 epidemic but not for the H3N2 epidemic, and our H1N1 data give a fairly similar distribution.

Conditioning on the pre-season HI titer level, strong age effects on reducing susceptibility to infection were detected in both epidemics, which may be indicative of the role of either a T-cell response or the neuraminidase antibody level (not available) in preventing transmission. The larger age effect on transmissibility for H1N1 relative to H3N2 is likely due to the fact that it was the first emergence of the H1N1 virus after vanishing in late 1950s (Xu et al., 2010), and therefore only adults of 20 years or above had previous exposure to this virus.

As *θ* increases over the range of [0, 1], the full likelihood for the complete data increases for the H3N2 epidemic and decreases for the H1N1 epidemic (results not shown). However, we did not pursue a profile likelihood approach for the estimation of *θ* for two reasons. First, an estimate of *θ* > 1 would be difficult to interpret in biology as influenza is a respiratory disease whose transmission is primarily driven by symptoms such as coughing and sneezing. Second, how reliable *θ* can be estimated and the sample size required for such estimation are under future investigation. Our assumptions concerning the knowledge about *θ* as well as the distribution of the infectious period are made to avoid non-identifiability of key quantities such as *p* and covariate effects in the presence of uncertainty about the final status. Previous studies have shown that, even with final outcomes completely observed, a relative large population is necessary for more flexibility in modeling the infectious period (Yang, Longini and Halloran, 2009; Kenah, 2010).

Our simulation study indicates that the inference is not greatly compromised even when the proportion of people with uncertain final states is relatively high but is totally impossible when such proportion reaches 100%. The usefulness of a validation set (individuals with lab-confirmed final states) with appropriate statistical methods has been previously found and emphasized in the setting of infectious diseases (Halloran et al., 2003; Yang et al., 2010). However, such usefulness should by no means diminish the importance of the efforts to ascertain as complete information as possible via carefully designed studies.

This work was supported by the National Institute of General Medical Sciences MIDAS grant U01-GM070749 and the National Institute of Allergy and Infectious Diseases grant R37-AI032042.

Web Appendices referenced in Sections 1 and 4, together with the data and computer code, are available with this paper at the Biometrics website on Wiley Online Library.

- Auranen K, Arjas E, Leino T, Takala AK. Transmission of pneumococcal carriage in families: a latent Markov process model for binary longitudinal data. Journal of the American Statistical Association. 2000;95:1044–1053.
- Booth JG, Hobert JP. Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society B. 1999;61:265–285.
- Cauchemez S, Carrat F, Viboud C, Valleron AJ, Böelle PY. A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data. Statistics in Medicine. 2004;23:3469–3487. [PubMed]
- Elveback LR, Fox JP, Ackerman E. An influenza simulation model for immunization studies. American Journal of Epidemiology. 1976;103:152–165. [PubMed]
- Fox JP, Hall CE, Cooney MK, Foy HM. Influenza virus infections in Seattle families, 1975–1979. I. Study design, methods and the occurrence of infections by time and age. American Journal of Epidemiology. 1982a;116:212–227. [PubMed]
- Fox JP, Hall CE, Cooney MK, Foy HM. Influenza virus infections in Seattle families, 1975–1979. II. Pattern of infection in invaded households and relation of age and prior antibody to occurrence of infection and related illness. American Journal of Epidemiology. 1982b;116:228–242. [PubMed]
- Gibson GV. Markov chain Monte Carlo methods for fitting spatio-temporal stochastic models in plant epidemiology. Applied Statistics. 1997;46:215–233.
- Halloran ME, Longini IM, Gaglani MJ, Piedra PA, Chu H, Herschler GB, Glezen WP. Estimating efficacy of trivalent, cold-adapted, influenza virus vaccine (CAIV-T) against influenza A (H1N1) and B using surveillance culture. American Journal of Epidemiology. 2003;158:305–311. [PubMed]
- Jamshidian M, Jennrich RI. Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association. 1993;88:221–228.
- Jamshidian M, Jennrich RI. Acceleration of the EM algorithm by using Quasi-Newton methods. Journal of the Royal Statistical Society B. 1997;59:569–587.
- Kenah E. Contact intervals, survival analysis of epidemic data, and estimation of
*R*_{0}. Biostatistics. 2010 doi: 10.1093/biostatistics/kxq068. [PMC free article] [PubMed] [Cross Ref] - Levine RA, Casella G. Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics. 2001;10:422–439.
- Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley; 2002.
- Liu C, Rubin DB, Wu YN. Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika. 1998;85:755–770.
- Longini IM, Koopman JS, Monto AS, Fox JP. Estimating household and community transmission parameters for influenza. American Journal of Epidemiology. 1982;115:736–751. [PubMed]
- Longini IM, Halloran ME. A frailty mixture model for estimating vaccine efficacy. Journal of the Royal Statistical Society C. 1996;45:165–173.
- Louis TA. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society B. 1982;44:226–233.
- O’Neill P, Balding DJ, Becker NG, Eerola M, Mollison D. Analysis of infectious disease data from household outbreaks by Markov chain Monte Carlo methods. Applied Statistics. 2000;49:517–542.
- Scharfstein DO, Halloran ME, Chu H, Daniels MJ. On estimation of vaccine efficacy using validation samples with selection bias. Biostatistics. 2006;7:615–629. [PMC free article] [PubMed]
- Sugimoto JD, Yang Y, Halloran ME, Oiulfstad B, Bagwell D, Dean B, Mascola L, Bancroft E, Longini IM. Accounting for unobserved immunity and asymptomatic infection in the early household transmission of the pandemic influenza A (H1N1) 2009. 2011. Under review.
- Xu R, Ekiert DC, Krause JC, Hai R, Crowe JE, Jr, Wilson IA. Structural basis of preexisting immunity to the 2009 H1N1 pandemic influenza virus. Science. 2010;328:357–360. [PMC free article] [PubMed]
- Yang Y, Longini IM, Halloran ME. A data-augmentation method for infectious disease incidence data from close contact groups. Computational Statistics and Data Analysis. 2007;51:6582–6595. [PMC free article] [PubMed]
- Yang Y, Longini IM, Halloran ME. A Bayesian model for evaluating influenza antiviral efficacy in household studies with asymptomatic infections. Biostatistics. 2009;10:390–403. [PMC free article] [PubMed]
- Yang Y, Halloran ME, Daniels M, Longini IM, Burke DS, Cummings DAT. Modeling competing infectious pathogens from a Bayesian perspective: Application to influenza studies with incomplete laboratory results. Journal of the American Statistical Association. 2010;105:1310–1322. [PMC free article] [PubMed]

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |