Despite rapid advances in experimental cell biology, the in vivo behavior of hematopoietic stem cells (HSC) cannot be directly observed and measured. Previously we modeled feline hematopoiesis using a two-compartment hidden Markov process that had birth and emigration events in the first compartment. Here we perform Bayesian statistical inference on models which contain two additional events in the first compartment in order to determine if HSC fate decisions are linked to cell division or occur independently. A Pareto Optimal Model Assessment approach is used to cross-check the estimates from Bayesian inference. Our results show that HSC must divide symmetrically (i.e., produce two HSC daughter cells) in order to maintain hematopoiesis. We then demonstrate that the augmented model that adds asymmetric division events provides a better fit to the competitive transplantation data, and we thus provide evidence that HSC fate determination in vivo occurs both in association with cell division and at a separate point in time. Last, we show that assuming each cat has a unique set of parameters leads to either a significant decrease or a nonsignificant increase in model fit, suggesting that the kinetic parameters for HSC are not unique attributes of individual animals, but are shared within a species.
Stochastic two-compartment model; hidden Markov models; reversible jump MCMC; hematopoiesis; stem cell; asymmetric division
DNA copy number variation (CNV) has recently gained considerable interest as a source of genetic variation that likely influences phenotypic differences. Many statistical and computational methods have been proposed and applied to detect CNVs based on data generated by genome analysis platforms. However, most algorithms are computationally intensive, with complexity at least O(n^2), where n is the number of probes in the experiments. Moreover, the theoretical properties of these existing methods are not well understood. A faster and better characterized algorithm is desirable for ultra-high-throughput data. In this study, we propose the Screening and Ranking algorithm (SaRa), which can detect CNVs quickly and accurately with complexity down to O(n). In addition, we characterize theoretical properties and present numerical analysis for our algorithm.
Change-point detection; copy number variations; high dimensional data; screening and ranking algorithm
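As an illustration of how a screening step of this kind can run in linear time, the sketch below computes a moving-window mean-difference diagnostic from prefix sums and keeps local maxima that exceed a threshold. The diagnostic, window rule, and threshold are illustrative assumptions, not the authors' exact SaRa implementation.

```python
import numpy as np

def sara_screen(y, h, threshold):
    """Screen for candidate change points with a moving-window diagnostic.

    y         : 1-D array of probe intensities
    h         : window half-width (bandwidth)
    threshold : candidates must exceed this in absolute value
    """
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))   # prefix sums give O(1) window means
    d = np.zeros(n)
    for t in range(h, n - h):
        left = (cs[t] - cs[t - h]) / h            # mean of y[t-h:t]
        right = (cs[t + h] - cs[t]) / h           # mean of y[t:t+h]
        d[t] = right - left                       # local diagnostic D(t)
    # keep local maxima of |D| that clear the threshold; the "ranking" step
    # would then order these candidates by |D|
    cand = [t for t in range(h, n - h)
            if abs(d[t]) > threshold and abs(d[t]) == np.abs(d[t - h:t + h + 1]).max()]
    return np.array(cand), d

# toy usage: a single jump in mean at position 500
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 500), rng.normal(1.5, 1, 500)])
candidates, diagnostic = sara_screen(y, h=20, threshold=0.8)
```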
When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects’ identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from this model. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data.
Confidentiality; disclosure; dissemination; spatial; synthetic; tree
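A hedged sketch of the core imputation step described above: fit a regression tree for each coordinate given the attributes, then replace a record's coordinates with donor values drawn from its leaf. Fitting latitude and longitude separately (rather than as a bivariate response) and the single-tree, single-draw setup are simplifications for illustration, not the authors' synthesizer.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def synthesize_coordinates(X, lat, lon, min_leaf=20, seed=None):
    """Draw synthetic (lat, lon) for each record from its regression-tree leaf.

    X        : (n, d) array of non-geographic attributes
    lat, lon : length-n arrays of coordinates to be replaced
    """
    rng = np.random.default_rng(seed)
    synthetic = np.empty((len(lat), 2))
    for j, target in enumerate((lat, lon)):
        tree = DecisionTreeRegressor(min_samples_leaf=min_leaf).fit(X, target)
        leaves = tree.apply(X)                     # leaf membership for every record
        for leaf in np.unique(leaves):
            idx = np.where(leaves == leaf)[0]
            # resample coordinates from donors in the same leaf; a smoothed or
            # Bayesian-bootstrap draw could be substituted here
            synthetic[idx, j] = rng.choice(target[idx], size=len(idx), replace=True)
    return synthetic
```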
It has been estimated that about 30% of the genes in the human genome are regulated by microRNAs (miRNAs). These are short RNA sequences that can down-regulate the levels of mRNAs or proteins in animals and plants. Genes regulated by miRNAs are called targets. Typically, methods for target prediction are based solely on sequence and structure information. In this paper we propose a Bayesian graphical modeling approach that infers the miRNA regulatory network by integrating expression levels of miRNAs with their potential mRNA targets and, via the prior probability model, with their sequence/structure information. We use a directed graphical model with a particular structure adapted to our data based on biological considerations. We then achieve network inference using stochastic search methods for variable selection that allow us to explore the huge model space via MCMC. A time-dependent coefficients model is also implemented. We consider experimental data from a study on a very well-known developmental toxicant causing neural tube defects, hyperthermia. Some of the pairs of target gene and miRNA we identify seem very plausible and warrant future investigation. Our proposed method is general and can be easily applied to other types of network inference by integrating multiple data sources.
Bayesian variable selection; data integration; graphical models; miRNA regulatory network
The parametric bootstrap can be used for the efficient computation of Bayes posterior distributions. Importance sampling formulas take on a convenient form involving the deviance in exponential families, and are particularly simple when starting from Jeffreys invariant prior. Because of the i.i.d. nature of bootstrap sampling, familiar formulas describe the computational accuracy of the Bayes estimates. Besides computational methods, the theory provides a connection between Bayesian and frequentist analysis. Efficient algorithms for the frequentist accuracy of Bayesian inferences are developed and demonstrated in a model selection example.
Jeffreys prior; exponential families; deviance; generalized linear models
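A toy illustration of the general recipe in a one-parameter normal-mean model: draw parametric-bootstrap replications of the MLE, then reweight them by the ratio of prior times likelihood to the bootstrap sampling density, so that the weighted draws approximate the posterior. The flat prior and the simple model are assumptions made here for brevity; the paper's deviance-based formulas for exponential families are not reproduced.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.0, size=40)            # data with known unit variance
n, theta_hat = len(y), y.mean()              # MLE of the mean

# parametric bootstrap: theta* ~ N(theta_hat, 1/n)
B = 20000
boot = rng.normal(theta_hat, 1 / np.sqrt(n), size=B)

# importance weights: prior(theta*) * likelihood(theta*) / bootstrap density(theta*)
log_lik = np.array([norm.logpdf(y, th, 1.0).sum() for th in boot])
log_prior = 0.0                              # flat prior for this toy example
log_g = norm.logpdf(boot, theta_hat, 1 / np.sqrt(n))
w = np.exp(log_lik + log_prior - log_g)
w /= w.sum()

post_mean = np.sum(w * boot)                 # approximate posterior mean
post_sd = np.sqrt(np.sum(w * (boot - post_mean) ** 2))
```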
Dyadic data are common in the social and behavioral sciences, in which members of dyads are correlated due to the interdependence structure within dyads. The analysis of longitudinal dyadic data becomes complex when nonignorable dropouts occur. We propose a fully Bayesian selection-model-based approach to analyze longitudinal dyadic data with nonignorable dropouts. We model repeated measures on subjects by a transition model and account for within-dyad correlations by random effects. In the model, we allow a subject’s outcome to depend on his/her own characteristics and measurement history, as well as those of the other member in the dyad. We further account for the nonignorable missing-data mechanism using a selection model in which the probability of dropout depends on the missing outcome. We propose a Gibbs sampler algorithm to fit the model. Simulation studies show that the proposed method effectively addresses the problem of nonignorable dropouts. We illustrate our methodology using a longitudinal breast cancer study.
Dyadic Data; Missing Data; Nonignorable Dropout; Selection Model
We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using a semi-parametric zero-inflated modeling approach, where the observed CAC scores from this cohort consist of a high frequency of zeros and continuously distributed positive values. Both partially constrained and unconstrained models are considered to investigate the underlying biological processes of CAC development from zero to positive values, and from small amounts to large amounts. Unlike existing studies, a model selection procedure based on likelihood cross-validation is adopted to identify the optimal model, and it is justified by comparative Monte Carlo studies. A shrinkage version of the cubic regression spline is used for model estimation and variable selection simultaneously. When applying the proposed methods to the MESA data analysis, we show that the two biological mechanisms influencing the initiation of CAC and the magnitude of CAC when it is positive are better characterized by an unconstrained zero-inflated normal model. Our results are significantly different from those in published studies, and may provide further insights into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.
cardiovascular disease; coronary artery calcium; likelihood cross-validation; model selection; penalized spline; proportional constraint; shrinkage
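A minimal two-part sketch of the unconstrained idea described above: one model for whether CAC is positive, and a separate model for its magnitude when it is. Plain logistic and linear fits stand in for the penalized-spline machinery of the paper, and the lognormal form for positive scores is an assumption made only to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

def fit_two_part(X, cac):
    """Unconstrained two-part model: P(CAC > 0 | X) and E[log CAC | CAC > 0, X]."""
    positive = cac > 0
    occurrence = LogisticRegression(max_iter=1000).fit(X, positive)
    magnitude = LinearRegression().fit(X[positive], np.log(cac[positive]))
    return occurrence, magnitude

def predict_mean(occurrence, magnitude, X):
    # E[CAC | X] ~ P(positive | X) * exp(E[log CAC | positive, X]); the lognormal
    # variance correction is omitted in this sketch
    p = occurrence.predict_proba(X)[:, 1]
    return p * np.exp(magnitude.predict(X))
```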
Extreme environmental phenomena such as major precipitation events manifestly exhibit spatial dependence. Max-stable processes are a class of asymptotically-justified models that are capable of representing spatial dependence among extreme values. While these models satisfy modeling requirements, they are limited in their utility because their corresponding joint likelihoods are unknown for more than a trivial number of spatial locations, preventing, in particular, Bayesian analyses. In this paper, we propose a new random effects model to account for spatial dependence. We show that our specification of the random effect distribution leads to a max-stable process that has the popular Gaussian extreme value process (GEVP) as a limiting case. The proposed model is used to analyze the yearly maximum precipitation from a regional climate model.
Gaussian extreme value process; generalized extreme value distribution; positive stable distribution; regional climate model
Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
Data integration; Multi-block data; Principal Component Analysis; Data fusion
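A simplified alternating sketch of the decomposition for two data blocks sharing columns (samples), with the ranks taken as known: the joint term is a low-rank fit to the stacked residual blocks, and each individual term is a low-rank fit to that block's remaining residual. Rank selection, convergence checks, and the orthogonality constraint between joint and individual structure are omitted; the sketch only conveys the form of the decomposition.

```python
import numpy as np

def low_rank(M, r):
    """Best rank-r approximation of M via the SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def jive_sketch(X1, X2, r_joint, r1, r2, n_iter=50):
    """X_i ~ J_i + A_i + noise, with a shared joint component J = [J1; J2]."""
    A1, A2 = np.zeros_like(X1), np.zeros_like(X2)
    for _ in range(n_iter):
        J = low_rank(np.vstack([X1 - A1, X2 - A2]), r_joint)   # joint variation
        J1, J2 = J[:X1.shape[0]], J[X1.shape[0]:]
        A1 = low_rank(X1 - J1, r1)                              # individual variation
        A2 = low_rank(X2 - J2, r2)
    return J1, J2, A1, A2
```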
Public data repositories have enabled researchers to compare results across multiple genomic studies in order to replicate findings. A common approach is to first rank genes according to a hypothesis of interest within each study. Then, lists of the top-ranked genes within each study are compared across studies. Genes recaptured as highly ranked (usually above some threshold) in multiple studies are considered to be significant. However, this comparison strategy often remains informal, in that Type I error and false discovery rate are usually uncontrolled. In this paper, we formalize an inferential strategy for this kind of list-intersection discovery test. We show how to compute a p-value associated with a "recaptured" set of genes, using a closed-form Poisson approximation to the distribution of the size of the recaptured set. The distribution of the test statistic depends on the rank threshold and the number of studies within which a gene must be recaptured. We use the Poisson approximation to investigate the operating characteristics of the test. We give practical guidance on how to design a bioinformatic list-intersection study with prespecified control of Type I error (at the set level) and false discovery rate (at the gene level). We show how the choice of test parameters affects the expected proportion of significant genes identified. We present a strategy for identifying the optimal choice of parameters, depending on the particular alternative hypothesis which might hold. We illustrate our methods using prostate cancer gene-expression datasets from the curated Oncomine database.
Concordance; Validation; Gene ranking; Meta-analysis; Rank-based methods; Gene expression analysis; Microarray; Sequencing; Cancer
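A hedged sketch of the set-level calculation under the simplest null hypothesis: genes are ranked independently and uniformly in each of m studies, so the chance that a given gene appears in the top k of at least j lists is a binomial tail probability, and the size of the recaptured set is approximately Poisson. The paper's exact formulation may differ in detail; this only illustrates the closed-form approximation.

```python
from scipy.stats import binom, poisson

def list_intersection_pvalue(G, m, k, observed, min_recaptured=None):
    """Set-level p-value for the size of the recaptured gene set.

    G              : number of genes ranked in each study
    m              : number of studies
    k              : rank threshold defining each study's top list
    observed       : number of genes recaptured in the data
    min_recaptured : a gene must appear in at least this many top lists
                     (defaults to all m studies)
    """
    j = m if min_recaptured is None else min_recaptured
    p_gene = binom.sf(j - 1, m, k / G)      # P(a null gene lands in >= j top lists)
    lam = G * p_gene                        # expected size of the recaptured set
    return poisson.sf(observed - 1, lam)    # P(Poisson(lam) >= observed)
```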
The vast amount of biological knowledge accumulated over the years has allowed researchers to identify various biochemical interactions and define different families of pathways. There is an increased interest in identifying pathways and pathway elements involved in particular biological processes. Drug discovery efforts, for example, are focused on identifying biomarkers as well as pathways related to a disease. We propose a Bayesian model that addresses this question by incorporating information on pathways and gene networks in the analysis of DNA microarray data. Such information is used to define pathway summaries, specify prior distributions, and structure the MCMC moves to fit the model. We illustrate the method with an application to gene expression data with censored survival outcomes. In addition to identifying markers that would have been missed otherwise and improving prediction accuracy, the integration of existing biological knowledge into the analysis provides a better understanding of underlying molecular processes.
Bayesian variable selection; gene expression; Markov chain Monte Carlo; Markov random field prior; pathway selection
Classical statistical process control often relies on univariate characteristics. In many contemporary applications, however, the quality of products must be characterized by some functional relation between a response variable and its explanatory variables. Monitoring such functional profiles has been a rapidly growing field due to increasing demands. This paper develops a novel nonparametric L-1 location-scale model to screen the shapes of profiles. The model is built on three basic elements: location shifts, local shape distortions, and overall shape deviations, which are quantified by three individual metrics. The proposed approach is applied to the previously analyzed vertical density profile data, leading to some interesting insights.
Functional data; L-1 regression; nonparametric methods; profile control charts
Many epidemic models approximate social contact behavior by assuming random mixing within mixing groups (e.g., homes, schools, and workplaces). The effect of more realistic social network structure on estimates of epidemic parameters is an open area of exploration. We develop a detailed statistical model to estimate the social contact network within a high school using friendship network data and a survey of contact behavior. Our contact network model includes classroom structure, longer durations of contact with friends than with non-friends, and more frequent contacts with friends, based on reports in the contact survey. We performed simulation studies to explore which network structures are relevant to influenza transmission. These studies yield two key findings. First, we found that the friendship network structure important to the transmission process can be adequately represented by a dyad-independent exponential random graph model (ERGM). This means that individual-level sampled data are sufficient to characterize the entire friendship network. Second, we found that contact behavior was adequately represented by a static rather than dynamic contact network. We then compare a targeted antiviral prophylaxis intervention strategy and a grade closure intervention strategy under random mixing and network-based mixing. We find that random mixing overestimates the effect of targeted antiviral prophylaxis on the probability of an epidemic when the probability of transmission in 10 minutes of contact is less than 0.004, and underestimates it when this transmission probability is greater than 0.004. We find the same pattern for the final size of an epidemic, with a threshold transmission probability of 0.005. We also find that random mixing overestimates the effect of a grade closure intervention on the probability of an epidemic and on the final size for all transmission probabilities. Our findings have implications for policy recommendations based on models assuming random mixing, and can inform further development of network-based models.
contact network; epidemic model; influenza; simulation model; social network
New advances in nanoscience open the door for scientists to study biological processes on a microscopic, molecule-by-molecule basis. Recent single-molecule biophysical experiments on enzyme systems, in particular, reveal that enzyme molecules behave fundamentally differently from what the classical model predicts. A stochastic network model was previously proposed to explain this experimental discovery. This paper conducts detailed theoretical and data analyses of the stochastic network model, focusing on the correlation structure of the successive reaction times of a single enzyme molecule. We investigate the correlation of experimental fluorescence intensity and the correlation of enzymatic reaction times, and examine the role of substrate concentration in enzymatic reactions. Our study shows that the stochastic network model is capable of explaining the experimental data in depth.
Autocorrelation; continuous time Markov chain; fluorescence intensity; Michaelis-Menten model; stochastic network model; single-molecule experiment; turnover time
Alternative splicing of gene transcripts greatly expands the functional capacity of the genome, and certain splice isoforms may indicate specific disease states such as cancer. Splice junction microarrays interrogate thousands of splice junctions, but data analysis is difficult and error-prone because of the increased complexity compared to differential gene expression analysis. We present Rank Change Detection (RCD) as a method to identify differential splicing events based upon a straightforward probabilistic model comparing the over- or underrepresentation of two or more competing isoforms. RCD has advantages over commonly used methods because it is robust to false positive errors due to nonlinear trends in microarray measurements. Further, RCD does not depend on prior knowledge of splice isoforms, yet it takes advantage of the inherent structure of mutually exclusive junctions, and it is conceptually generalizable to other types of splicing arrays or RNA-Seq. RCD specifically identifies the biologically important cases in which a splice junction becomes more or less prevalent compared to other mutually exclusive junctions. The example data are from different cell lines of glioblastoma tumors assayed with Agilent microarrays.
Alternative splicing; gene expression analysis; microarray
Network inference approaches are now widely used in biological applications to probe regulatory relationships between molecular components such as genes or proteins. Many methods have been proposed for this setting, but the connections and differences between their statistical formulations have received less attention. In this paper, we show how a broad class of statistical network inference methods, including a number of existing approaches, can be described in terms of variable selection for the linear model. This reveals some subtle but important differences between the methods, including the treatment of time intervals in discretely observed data. In developing a general formulation, we also explore the relationship between single-cell stochastic dynamics and network inference on averages over cells. This clarifies the link between biochemical networks as they operate at the cellular level and network inference as carried out on data that are averages over populations of cells. We present empirical results, comparing thirty-two network inference methods that are instances of the general formulation we describe, using two published dynamical models. Our investigation sheds light on the applicability and limitations of network inference and provides guidance for practitioners and suggestions for experimental design.
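One concrete instance of the linear-model formulation discussed above, included as an illustration rather than as any single published method: for time-course averages, regress each gene's per-unit-time change over an interval on the expression of all genes at the start of the interval, with an L1 penalty, and read candidate edges off the nonzero coefficients. The handling of the time intervals and the penalty level are exactly the kinds of choices the paper compares.

```python
import numpy as np
from sklearn.linear_model import Lasso

def infer_network(expr, times, alpha=0.1):
    """expr: (T, p) expression averages at T time points; times: length-T array.
    Returns a (p, p) matrix W with W[j, i] = estimated influence of gene i on gene j."""
    dt = np.diff(times)[:, None]
    X = expr[:-1]                           # regulator levels at the start of each interval
    Y = (expr[1:] - expr[:-1]) / dt         # per-unit-time change of each target gene
    p = expr.shape[1]
    W = np.zeros((p, p))
    for j in range(p):
        fit = Lasso(alpha=alpha).fit(X, Y[:, j])
        W[j] = fit.coef_                    # nonzero entries are candidate edges
    return W
```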
Genetical genomics experiments are now routinely conducted to measure both genetic markers and gene expression data on the same subjects. The gene expression levels are often treated as quantitative traits and are subject to standard genetic analysis in order to identify expression quantitative trait loci (eQTL). However, the genetic architecture for many gene expressions may be complex, and poorly estimated genetic architecture may compromise inferences about the dependency structures of the genes at the transcriptional level. In this paper, we introduce a sparse conditional Gaussian graphical model for studying the conditional independence relationships among a set of gene expressions, adjusting for possible genetic effects, where the gene expressions are modeled with seemingly unrelated regressions. We present an efficient coordinate descent algorithm to obtain the penalized estimation of both the regression coefficients and the sparse concentration matrix. The corresponding graph can be used to determine the conditional independence among a group of genes while adjusting for shared genetic effects. Simulation experiments and asymptotic convergence rates and sparsistency are used to justify our proposed methods. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one. We apply our methods to the analysis of a yeast eQTL data set and demonstrate that the conditional Gaussian graphical model leads to a more interpretable gene network than the standard Gaussian graphical model based on gene expression data alone.
eQTL; Gaussian graphical model; Regularization; Genetic networks; Seemingly unrelated regression
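A two-stage approximation of the idea, offered only as a sketch and clearly not the paper's joint coordinate-descent estimator: first remove estimated marker (eQTL) effects from each expression trait with an L1-penalized regression, then estimate a sparse concentration matrix from the residuals. Function names and penalty levels are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.covariance import GraphicalLasso

def conditional_network(markers, expr, alpha_reg=0.05, alpha_graph=0.05):
    """markers: (n, q) genotypes; expr: (n, p) expression levels.
    Returns a sparse estimated concentration (precision) matrix of the expressions."""
    residuals = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        fit = Lasso(alpha=alpha_reg).fit(markers, expr[:, j])   # adjust for genetic effects
        residuals[:, j] = expr[:, j] - fit.predict(markers)
    gl = GraphicalLasso(alpha=alpha_graph).fit(residuals)       # sparse concentration matrix
    return gl.precision_        # zero entries encode conditional independence
```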
HIV dynamic studies have contributed significantly to the understanding of HIV pathogenesis and antiviral treatment strategies for AIDS patients. Establishing the relationship of virologic responses with clinical factors and covariates during long-term antiretroviral (ARV) therapy is important to the development of effective treatments. Medication adherence is an important predictor of the effectiveness of ARV treatment, but an appropriate determinant of adherence rate based on medication event monitoring system (MEMS) data is critical for predicting virologic outcomes. The primary objective of this paper is to investigate the effects of a number of summary determinants of MEMS adherence rates on virologic response measured repeatedly over time in HIV-infected patients. We developed a mechanism-based differential equation model, with drug adherence interacting with virus susceptibility to drug and baseline characteristics, to characterize the long-term virologic responses after initiation of therapy. This model fully integrates viral load, MEMS adherence, drug resistance and baseline covariates into the data analysis. In this study we employed the proposed model and the associated Bayesian nonlinear mixed-effects modeling approach to assess how to efficiently use the MEMS adherence data for prediction of virologic response, and to evaluate the predictive power of each summary metric of the MEMS adherence rates. In particular, we address the questions: (i) how to summarize the MEMS adherence data for efficient prediction of virologic response after accounting for potential confounding factors such as drug resistance and covariates, and (ii) how to evaluate the treatment effect of baseline characteristics interacting with adherence and other clinical factors. The approach is applied to an AIDS clinical trial involving 31 patients who had the data required for the proposed model. Results demonstrate that appropriate determinants of MEMS adherence rates are important for efficient prediction of virologic response, and that investigations of adherence to ARV treatment would benefit from measuring not only the adherence rate but also its summary metric assessment. Our study also shows that the mechanism-based dynamic model is powerful and effective in establishing a relationship of virologic responses with medication adherence, virus resistance to drug and baseline covariates.
Bayesian mixed-effects models; confounding factors; HIV dynamics; longitudinal data; MEMS adherence assessment; time-varying drug efficacy; virus resistance
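For readers unfamiliar with mechanism-based HIV dynamic models, the sketch below integrates a generic target-cell-limited model in which drug efficacy is scaled by a time-varying adherence function, which is the qualitative structure such models build on. The equations, parameter values, and the adherence and resistance handling here are placeholders, not the paper's fitted Bayesian mixed-effects model.

```python
import numpy as np
from scipy.integrate import solve_ivp

def viral_dynamics(t_span, adherence, eps_max=0.9, lam=1e4, rho=0.01,
                   k=8e-7, delta=0.7, N=100.0, c=13.0, y0=(1e6, 0.0, 50.0)):
    """Target-cell-limited model: uninfected cells T, infected cells Ti, virus V."""
    def rhs(t, y):
        T, Ti, V = y
        eps = eps_max * adherence(t)          # time-varying efficacy driven by adherence
        infection = (1 - eps) * k * T * V
        return [lam - rho * T - infection,    # dT/dt
                infection - delta * Ti,       # dTi/dt
                N * delta * Ti - c * V]       # dV/dt
    return solve_ivp(rhs, t_span, y0, dense_output=True)

# usage: adherence drops from 0.95 to 0.6 after day 100
solution = viral_dynamics((0, 365), lambda t: 0.95 if t < 100 else 0.6)
```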
We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates. This step yields an importance measure for each covariate. In step 2, a procedure similar to the first step is implemented, with the exception that for each bootstrap sample, a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates’ importance. The adaptive lasso may be used in the second step, with weights determined by the importance measures. The final set of covariates and their coefficients are determined by averaging the bootstrap results obtained from step 2. The proposed method alleviates some of the limitations of the lasso, elastic-net and related methods noted especially in the context of microarray data analysis: it tends to either remove highly correlated variables altogether or select them all, while maintaining maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive with or superior to that of the alternatives. We illustrate the proposed method by extensive simulation studies. The proposed method is also applied to a Glioblastoma microarray data analysis.
Lasso; microarray; regularization; variable selection
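A compact sketch of the two steps as described above. The importance measure (mean absolute bootstrap coefficient), the fixed number of candidate covariates per bootstrap sample, and the fixed penalty level are simplifying assumptions; the published procedure tunes these choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def random_lasso(X, y, B=200, q=None, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = q or max(1, p // 10)                 # covariates drawn per bootstrap sample

    def bootstrap_coefs(select_prob):
        coefs = np.zeros((B, p))
        for b in range(B):
            rows = rng.integers(0, n, n)                       # bootstrap observations
            cols = rng.choice(p, q, replace=False, p=select_prob)
            fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
            coefs[b, cols] = fit.coef_
        return coefs

    # Step 1: equal selection probabilities; importance = mean |coefficient|
    importance = np.abs(bootstrap_coefs(np.full(p, 1.0 / p))).mean(axis=0)
    # Step 2: re-draw covariates with probability proportional to importance,
    # then average the bootstrap coefficients to obtain the final estimates
    prob = importance + 1e-12
    return bootstrap_coefs(prob / prob.sum()).mean(axis=0)
```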
Recent advances in tissue microarray technology have allowed immunohistochemistry to become a powerful medium-to-high-throughput analysis tool, particularly for the validation of diagnostic and prognostic biomarkers. However, as study size grows, manual evaluation of these assays becomes a prohibitive limitation; it vastly reduces throughput and greatly increases variability and expense. We propose an algorithm—Tissue Array Co-Occurrence Matrix Analysis (TACOMA)—for quantifying cellular phenotypes based on textural regularity summarized by local inter-pixel relationships. The algorithm can be easily trained for any staining pattern, has no sensitive tuning parameters, and can report the salient pixels in an image that contribute to its score. Pathologists’ input via informative training patches is an important aspect of the algorithm that allows training for any specific marker or cell type. With co-training, the error rate of TACOMA can be reduced substantially for a very small training sample (e.g., of size 30). We give theoretical insights into the success of co-training via thinning of the feature set in a high-dimensional setting when there is “sufficient” redundancy among the features. TACOMA is flexible, transparent and provides a scoring process that can be evaluated with clarity and confidence. In a study based on an estrogen receptor (ER) marker, we show that TACOMA is comparable to, or outperforms, pathologists in terms of accuracy and repeatability.
Graphs and networks are common ways of depicting information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This kind of a priori graph information is a useful supplement to standard numerical data such as microarray gene expression measurements. In this paper, we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graph-constrained regularization procedure, and its theoretical properties, for regression analysis that takes into account the neighborhood information of the variables measured on a graph, where a smoothness penalty on the coefficients is defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real dataset that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.
Regularization; Sign consistency; Network; Laplacian Matrix; High dimensional data
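To make the penalty concrete: because the smoothness term is the quadratic form lam2 * b'Lb, the estimate can be computed, up to the rescaling familiar from the elastic-net, by an ordinary lasso on data augmented with rows built from a square root of the graph Laplacian. The sketch below ignores that rescaling and any normalization of L, so it is an illustration of the penalty rather than the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import Lasso

def graph_constrained_fit(X, y, L, lam1, lam2):
    """Minimize ||y - Xb||^2 + lam1 * ||b||_1 + lam2 * b' L b via an augmented lasso."""
    # symmetric square root of the positive semidefinite graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    L_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    X_aug = np.vstack([X, np.sqrt(lam2) * L_half])   # extra rows encode the smoothness penalty
    y_aug = np.concatenate([y, np.zeros(L.shape[0])])
    n_aug = X_aug.shape[0]
    # sklearn's Lasso minimizes (1 / (2 n)) ||y - Xb||^2 + alpha ||b||_1
    fit = Lasso(alpha=lam1 / (2 * n_aug), fit_intercept=False).fit(X_aug, y_aug)
    return fit.coef_
```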
Neural spike trains, which are sequences of very brief jumps in voltage across the cell membrane, were one of the motivating applications for the development of point process methodology. Early work required the assumption of stationarity, but contemporary experiments often use time-varying stimuli and produce time-varying neural responses. More recently, many statistical methods have been developed for nonstationary neural point process data. There has also been much interest in identifying synchrony, meaning events across two or more neurons that are nearly simultaneous at the time scale of the recordings. A natural statistical approach is to discretize time, using short time bins, and to introduce loglinear models for dependency among neurons, but previous use of loglinear modeling technology has assumed stationarity. We introduce a succinct yet powerful class of time-varying loglinear models by (a) allowing individual-neuron effects (main effects) to involve time-varying intensities; (b) allowing the individual-neuron effects to involve autocovariation effects (history effects) due to past spiking; (c) assuming excess synchrony effects (interaction effects) do not depend on history; and (d) assuming all effects vary smoothly across time. Using data from the primary visual cortex of an anesthetized monkey, we give two examples in which the rate of synchronous spiking cannot be explained by stimulus-related changes in individual-neuron effects. In one example, the excess synchrony disappears when slow-wave “up” states are taken into account as history effects, while in the second example it does not. Standard point process theory explicitly rules out synchronous events. To justify our use of continuous-time methodology we introduce a framework that incorporates synchronous events and provides continuous-time loglinear point process approximations to discrete-time loglinear models.
Discrete-time approximation; loglinear model; marked process; nonstationary point process; simultaneous events; spike train; synchrony detection
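A minimal illustration of the discrete-time building block for a single neuron: bin the spike train, then fit a Poisson loglinear model whose main effect varies smoothly in time through a small bump basis and which includes a recent-spiking history covariate. The basis, bin width, and the synchrony (interaction) terms of the full model are omitted; this sketch covers only the individual-neuron piece.

```python
import numpy as np
import statsmodels.api as sm

def fit_time_varying_rate(spikes, n_basis=6, history_lag=1):
    """spikes: array of spike counts in small time bins for one neuron."""
    T = len(spikes)
    t = np.linspace(0, 1, T)
    # smooth time-varying main effect via overlapping cosine bumps across the trial
    centers = np.linspace(0, 1, n_basis)
    width = centers[1] - centers[0]
    B = np.cos(np.clip((t[:, None] - centers) / width, -np.pi / 2, np.pi / 2))
    history = np.roll(spikes, history_lag)        # did the neuron spike recently?
    history[:history_lag] = 0
    X = sm.add_constant(np.column_stack([B, history]))
    fit = sm.GLM(spikes, X, family=sm.families.Poisson()).fit()
    return fit                                     # fit.predict(X) gives the fitted intensity
```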
Functional MRI (fMRI) has become the most common method for investigating the human brain. However, fMRI data present some complications for statistical analysis and modeling. One recently developed approach to these data focuses on estimation of computational encoding models that describe how stimuli are transformed into brain activity measured in individual voxels. Here we aim at building encoding models for fMRI signals recorded in the primary visual cortex of the human brain. We use residual analyses to reveal systematic nonlinearity across voxels not taken into account by previous models. We then show how a sparse nonparametric method [J. Roy. Statist. Soc. Ser. B 71 (2009b) 1009–1030] can be used together with correlation screening to estimate nonlinear encoding models effectively. Our approach produces encoding models that predict about 25% more accurately than models estimated using other methods [Nature 452 (2008a) 352–355]. The estimated nonlinearity impacts the inferred properties of individual voxels, and it has a plausible biological interpretation. One benefit of quantitative encoding models is that estimated models can be used to decode brain activity, in order to identify which specific image was seen by an observer. Encoding models estimated by our approach also improve such image identification by about 12% when the correct image is one of 11,500 possible images.
Neuroscience; vision; fMRI; nonparametric; prediction
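A bare-bones version of the screening-plus-sparse-fit pipeline for a single voxel: rank candidate stimulus features by marginal correlation with the voxel response, keep the top few hundred, and fit a sparse linear model on the survivors. A cross-validated lasso stands in for the sparse nonparametric (additive) method used in the paper, and the feature count is an arbitrary placeholder.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def encode_voxel(features, response, n_keep=300):
    """features: (n_images, n_features) stimulus representation; response: (n_images,) voxel signal."""
    # marginal correlation of every feature with the voxel response
    fz = (features - features.mean(0)) / (features.std(0) + 1e-12)
    rz = (response - response.mean()) / (response.std() + 1e-12)
    corr = fz.T @ rz / len(response)
    keep = np.argsort(-np.abs(corr))[:n_keep]      # correlation screening
    model = LassoCV(cv=5).fit(features[:, keep], response)
    return keep, model      # model.predict(features[:, keep]) gives fitted responses
```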
A predictor variable or dose that is measured with substantial error may possess an error-free milestone, such that it is known with negligible error whether the value of the variable is to the left or right of the milestone. Such a milestone provides a basis for estimating a linear relationship between the true but unknown value of the error-free predictor and an outcome, because the milestone creates a strong and valid instrumental variable. The inferences are nonparametric and robust, and in the simplest cases, they are exact and distribution free. We also consider multiple milestones for a single predictor and milestones for several predictors whose partial slopes are estimated simultaneously. Examples are drawn from the Wisconsin Longitudinal Study, in which a BA degree acts as a milestone for sixteen years of education, and the binary indicator of military service acts as a milestone for years of service.
Attenuation; errors in measurement; full matching; instrumental variables; nonbipartite matching; questionnaire design
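In the simplest case the milestone yields a Wald-type ratio estimator: with an error-free binary milestone Z, the slope of the outcome on the true predictor is estimated by the between-group difference in outcomes divided by the between-group difference in the error-prone predictor, since classical measurement error that is independent of Z cancels from the denominator in expectation. The sketch below is this textbook instrumental-variable ratio, not the paper's matching-based exact inference.

```python
import numpy as np

def milestone_slope(y, x_observed, milestone):
    """Wald-type IV estimate of the slope of y on the error-free predictor.

    y          : outcomes
    x_observed : error-prone measurements of the predictor
    milestone  : boolean array, True if the subject is past the milestone
    """
    above, below = milestone, ~milestone
    dy = y[above].mean() - y[below].mean()
    dx = x_observed[above].mean() - x_observed[below].mean()
    return dy / dx
```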
Acute respiratory diseases are transmitted over networks of social contacts. Large-scale simulation models are used to predict epidemic dynamics and evaluate the impact of various interventions, but the contact behavior in these models is based on simplistic and strong assumptions which are not informed by survey data. These assumptions are also used for estimating transmission measures such as the basic reproductive number and secondary attack rates. Development of methodology to infer contact networks from survey data could improve these models and estimation methods. We contribute to this area by developing a model of within-household social contacts and using it to analyze the Belgian POLYMOD data set, which contains detailed diaries of social contacts in a 24-hour period. We model dependency in contact behavior through a latent variable indicating which household members are at home. We estimate age-specific probabilities of being at home and age-specific probabilities of contact conditional on two members being at home. Our results differ from the standard random mixing assumption. In addition, we find that the probability that all members contact each other on a given day is fairly low: 0.49 for households with two 0–5 year olds and two 19–35 year olds, and 0.36 for households with two 12–18 year olds and two 36+ year olds. We find higher contact rates in households with 2–3 members, helping explain the higher influenza secondary attack rates found in households of this size.