Data collection

Over the period March – May 2006, 750 persons living in Belgium were recruited by random digit dialing on fixed telephone lines. Only one person per household could participate. In line with the total Belgian population (Chi-square test for equality of distributions, p-value 0.85), participants were recruited from the Flemish (n = 441), Walloon (n = 239) and Brussels geographic regions (n = 70), making sure also that the areas in these regions were representative of the overall urban and rural divide in Belgium. With regards to age, sampling was undertaken in order to obtain an age distribution with 10% in each of the age groups: 0–4, 5–9, 10–14, 15–19 years, and 6% in each of the adult age groups 20–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, 65+ years. Most (480) participants were adults aged 18 years and older of whom 50% were male, representative for the Belgian population (p-value 0.70). There were also 130 teenage participants aged 9–17 years (49% males, p-value 0.60) and 140 children aged 0–8 years (56% males, p-value 0.053).

The contact survey conducted as part of this study required participants to complete a survey anonymously, without changing their usual behaviour. No persons were subject to interventions and no physical samples were collected as part of this study. The study protocol was approved by the ethical committee of the Antwerp University Hospital.

Two types of physical contacts were defined: (1) two-way conversations during which at least three words were spoken; (2) contacts which involved skin to skin touching. Each participant was asked to fill in a paper diary recording their contacts during one randomly assigned weekday and one randomly assigned day on the weekend (not always in that order). Teenagers (9–17 y) filled in a simplified version of the diary, and were closely followed up to anticipate interpretation problems (all surveyors had received identical training instructions to accomplish this). For the children (< 9 y) a parent or exceptionally another adult caregiver filled in the diary. In addition to a pilot test in Luxemburg, specific Belgian Dutch and French language versions of each type of diary were made and tested in Belgium (see also [

6,

11]).

Information recorded in the diary included the exact or estimated age and gender of each contacted person, the duration of contacts per person over the entire day, frequency (habitual nature) and location or circumstance of contact (multiple options possible). A list of variables used for our analysis is given in Table .

| **Table 1**Classification Tree Variables. |

The diaries were sent and collected by mail. Each participant was reminded by phone that they had to fill in the diary, one day prior to each assigned day, and was followed up after the first day to check whether they had. If they had not filled in the diary, they were assigned to a new random day. If they had not returned the diary, they were reminded by a maximum of three follow up calls to send it in. If participants repeatedly failed to fill in the diary on their assigned day, they were excluded, and replaced by a new recruit. Each participant received a small token of appreciation for the amount of 5 Euro. All diaries were double entered in a computer database and checked manually.

Virtually all (98%) participants recorded their contacts in the four weeks from March 18 to April 14 2006, and 49% and 73% of the recorded week days and weekend days, respectively, were during the Eastern holiday period. Note that weekends at the start and the end of the holiday period were considered as holiday and represented 72% of all recorded weekend days.

About half the children aged 0–2 attended childcare (55%), and almost all children aged 3–8 years attended childcare or school (93%), while school participation was 100% for participants aged 8–17 years. Adults aged 18 years or older were mostly employed (49%), unemployed (35%) or in further education (9%). Overall, 10% of the participants lived alone and these were all 18 years or older. Nearly 1 in 5 children aged 0–8 years lived in a single parent family, and 26%, 28% and 11% of the participants lived in a household of size 2, 3 and 4, respectively. Larger household sizes were only rarely observed (4%). These characteristics of the sample are all broadly in line with general Belgian population statistics (National Institute for Statistics, 2006, Belgium).

Household contacts

Although in order to keep the diary manageable, we intentionally did not record the relationship with the contacted persons, we extracted household-like contact data from the database by identifying those contacts with the same age as the registered ages of the household members (knowing that the contacts occurred at home and that an exact age was given, and not an estimated age range as for the ages of contacted persons from whom the exact age is unknown). We performed a sensitivity analysis with respect to the selected contacts of the same age.

Work contacts

As mentioned in the introduction, only a subsample of the Belgian contact survey (BE), was described and analyzed by [

11], together with those from the seven other POLYMOD countries. These contact surveys, although broadly aiming to record similar outcomes, were not quite the same due to dissimilar implementations in each of the countries (see [

11]). One important difference between these surveys was the registration of work contacts.

In half of the POLYMOD surveys (namely BE, DE, FI and NL), a threshold value was set for reporting contacts at work, requiring people with more estimated work contacts than a predefined threshold not to record these contacts in their diary. In the Belgian diaries, the participants were asked prior to filling in their dairy for the first time the following sequence of questions. (1) Do you have a profession through which you have a large number of contacts (e.g., clients, patients, students, etc)? YES/NO; (2) If YES, please make an estimate of the average number of persons you contact professionally per day? .... persons; (3) Please tick in which of these age categories these professional contacts mostly occur (multiple options possible): 0–5 y, 6–11 y, 12–17 y, 18–60 y and over 60 years. (4) If you estimated the number of these contacts at more than 20, then do not record these contacts in the diary, but only record the other (non-professional) contacts. The complete Belgian surveys (including diaries) are available [see Additional file

1,

2,

3,

4,

5,

6 and

7].

Although these questions were asked with the common intention to reduce reporting bias for people with many professional contacts (eg bus drivers), there are some important differences in how we approached this, in comparison to DE, FI and NL. First, based on information from the pilot studies, we set the threshold value at 20, whereas in DE, FI and NL it was set at 10. Second, we asked first if the participants thought they had many professional contacts, and only if they did subjectively think so, how many they estimated these to be. Third, we only revealed in the last question what the consequence of their estimate was for the effort required to complete the diary. Fourth, we did not only ask about the number of professional contacts, but also about their usual age range.

In reporting the results of all the POLYMOD diaries combined, [

11] excluded these extra non-recorded contacts, thus presenting, in this respect, underestimates for the number of contacts in BE, DE, FI and NL, in comparison to the other 4 countries. In the current paper, we used imputation to complete the full Belgian data set. However, this too had its limitations, since imputation enabled generating data from which reliable inferences can be made, but could not recreate the values that were not recorded [

12].

More specifically, the age of contacted persons provided a basis for imputation [see Additional file

7]: Consider

> 20 as the number of work contacts to be imputed for participant

*i *and (s)he indicated that these professional contacts were in specific age-categories denoted by a set

(e.g.

= [

6,

11] for a primary school teacher). We sampled

age-values from

with probabilities according to the population age-distribution. Contrasting this method with data for EW, IT, LU and PL indicated good performance.

In order to impute the other contact characteristics, plausible assumptions were made based on the available information from other POLYMOD countries. Mossong et al. [

11] did not find a significant association between age and whether contacts involved skin-to-skin touching for EW, IT, LU and PL. Therefore we imputed, this variable in the Belgian dataset using the same distribution as when 10 <

≤ 20. The imputation of gender of contacts was performed based on the same reasoning. For the imputation of contact characteristics like duration and usual frequency, we considered it unlikely that a single professional contact in a large set of such contacts would last longer than 4 hours and reoccur daily. Therefore, we imputed these two variables jointly by sampling from the bivariate distribution of duration and frequency for work contacts of participants for whom 10 <

≤ 20. This method could also be more widely applied to all characteristics in an attempt to avoid disrupting dependencies, but could just the same also enforce dependencies. Whereas the imputation for age and gender was well founded (negligible change in distribution with

), other imputations seemed more speculative. Indeed, the higher the recorded

was, the shorter the contact durations were, but the distribution of contact frequency remained quite stable. Furthermore, the choice of distribution for the duration of contacts would be subjective since (a) it was clearly unknown for what was missing and (b) the distributions varied substantially between countries. For the remainder of the paper, we focus on this augmented data set but note that the results were similar whenever these imputations were left out, except when estimating the number of contacts. This will be illustrated for one of the methods introduced in the next section.

Statistical methodology

Since the contact survey contains a lot of information, we will, next to some descriptive statistics, use up-to-date statistical methodology to highlight different aspects from the data. We will first describe the use of association rules and classification trees to identify associations and factors related to type, location, frequency and duration of contact. Then we will use weighted generalized estimating equations to properly model the number of contacts taking into account the correlation of contacts recorded by the same participant on two different days. Finally, we relate contact patterns to the spread of infectious disease using the next generation matrix. While this methodology provides more insight into how close contact infectious diseases are spread, we note that we do not aim to describe the stochastic nature of an emerging epidemic.

In this section we present some more detail on the methods used throughout this manuscript and refer to the literature for an extensive description.

Association rules

In large databases, like the contact database, patterns between variables can be discovered by data mining techniques such as association rules [

13]. If

*A*,

*B *denote properties of contacts (e.g., frequency, location, touching), then association rules would focus on relations

*A *→

*B *that estimate how likely the event

*B *is given the occurrence of event

*A*. While

*A *could consist of more than one contact property (e.g. a home contact which also involved skin-to-skin touching),

*B *is restricted to be a single property. The length of a rule is the total number of properties constituting that rule. Since numerous combinations of contact properties exist, various interestingness measures could be studied. We focused on the support, the confidence and the lift of a rule. The support of a rule expresses the relative frequency of all contacts in that rule and searching for rules with a high support can be seen as a simplification of 'bump hunting' [

14]. The confidence of a rule expresses the conditional probability associated with that rule

*P*(

*B*|

*A*). Finding rules with high confidence is similar to finding rules with a high association between the items, as is finding rules with a high lift value. In the latter case, a higher lift value indicates that the support of the rule is higher than the product of the support of the items in that rule (expressing independence between items [

14]).

Classification trees

To gain more insight in factors determining contact intensity (physical contact, location, frequency and duration), the binary classification tree methodology as introduced by [

15] was used. This is a recursive partitioning method which is nonparametric in nature, simple and intuitively appealing. At each step, the recursive partitioning algorithm determined an optimal cut off point such that all the contacts were split in two subpopulations to achieve high predictive classification with respect to the variable of interest; touching (yes or no), location (home, school, work, school, transport, leisure or other), frequency (daily, weekly, monthly, a few times a year or first time) and duration (0–5 min, 5–15 min, 15 min – 1 hour, 1–4 hours, more than 4 hours). The resulting subpopulations were split repeatedly until no additional partitioning was warranted: either a subpopulation contained only one class or it was too small to divide any further. The program Rpart [

16], implemented in R, was used to generate the decision trees depicting the classification rules generated through recursive partitioning. When growing a tree equal misclassification costs were assigned to the categories of the response variable. Pruning of the trees (to correct for overtraining) was undertaken using the 1 SE rule described by [

15] in combination with a maximal tree-depth of 4 layers. Error rates associated with these trees were determined by simple resubstitution, and a 10-fold cross-validation (CV) was used to evaluate the performance of the tree. Table lists the variables used in the tree construction.

Modelling the number of contacts

While [

11] used a censored negative binomial distribution to relate the overall number of contacts to different participant characteristics, we used generalized estimating equations (GEE, [

17]). This approach allowed us to account for the correlation between the number of contacts recorded by the same individual on the two different days. Weights were included in the analyses, in the first instance to adjust for relative differences in age and household size as compared to Belgian demographic data (source: EUROSTAT), and in the second instance to adjust for differences in sampling proportions with respect to weekdays and holiday periods.

Model building was done a priori, using a nonparametric method called 'random forests' [

18]. Random forests were constructed using binary regression trees [

15] which relate a response variable, the contact number, to explanatory variables in a similar way as with the classification trees. However, in regression trees, the response is continuous and the objective is to minimise the mean squared error (MSE). In view of the skewness of the distribution of the number of contacts, a log-transform was used. Random forests were constructed by joining several of these regression trees, each based on a random sample of the observations and the explanatory variables, to explicitly take into account the variability associated with the construction of a single tree. A by-product of this random forests methodology, the so-called 'variable importance' list, reflects how often a variable is used as a splitting-criterion. The variables with a high importance, more explicitly using a threshold of 5% increase in MSE, were then selected and used in further model building. We refer the reader to [

18] for more details on how 'random forests' are constructed.

Using the most important variables from the 'random forests' importance list, one can opt to look at the interactions between them. Since this inevitably leads to sparse cells, we only analysed interactions with respect to weekdays and holidays and for the remainder only considered main effect terms. We then further reduced the model using a stepwise Poisson regression (with log-link) based on backward selection and the BIC-criterion [

19] and finally applied the weighted GEE analysis.

This procedure is applied separately for the three different types of diaries and for all types of contacts, resulting in 57 analyses.

Who mixes with whom?

Consider the random variable *Y*_{ij}, i.e. the number of contacts in age class *j *during one day as reported by a respondent in age class *i*, which has observed values *y*_{ij, t }= 1,..., *T*_{i}, where *T*_{i }denotes the number of participants in the contact survey belonging to age class *i*. The elements *m*_{ij }= *E*(*Y*_{ij}) make up a *J *× *J *matrix, which is called the 'social contact matrix'. Now, the contact rates *c*_{ij }are related to the social contact matrix as follows:

where

*w*_{i }denotes the population size in age class

*i*, obtained from demographic data (EUROSTAT). When estimating the social contact matrix, the reciprocal nature of contacts needs to be taken into account (see [

5]):

*m*_{ij}*w*_{i }=

*m*_{ji}*w*_{j}, which means that the total number of contacts from age class

*i *to age class

*j *must equal the total number of contacts from age class

*j *to age class

*i*. We used a negative binomial distribution with mean

*m*_{ij }and variance

to model the number of contacts, while allowing overdispersion. The following negative binomial likelihood function was maximized in order to estimate the means

*m*_{ij }and the dispersion parameters

*k*, assuming the

*Y*_{ij }were independent (

*i*=1,...,

*I*,

*j *= 1,...,

*J*):

where

is the diary weight of the

*t*^{th }participant in age class

*i*. In order to achieve the necessary flexibility and to exploit the continuous nature of the data, we estimated the mean contact rate using one-year age-intervals and a bivariate smoothing approach as described by Wood [

20]. The basis was a tensor-product spline ensuring flexibility when modeling the average number of contacts as a function of the respondent's and contact's age. By applying a smooth-then-constrain-approach as proposed by [

21], the reciprocal nature of contacts was taken into account.

Mimicking the Spread of an Epidemic and the Impact of School Closure

In order to mimic the initial spread of a newly emerging pathogen in Belgium, we used the next generation operator as defined in (3).

For Belgium, population size (

*N *= 10 547 958), life expectancy at birth (

*L *= 80) and age-specific mortality rates, μ(

*t*) were obtained from official government statistics (source: EUROSTAT). The infectious period was set to five days

*D *= 5/365, similar to the infectious period typically estimated for influenza (see e.g. [

22]). We assumed

*β*(

*a*,

*a*'), i.e. the effective transmission rate was proportional to the contact rate

*c*(

*a*,

*a*') as estimated by

*c*_{ij }using one-year age-intervals and contacts involving skin-to-skin touching only and chose the proportionality factor such that the largest Eigen value of the next generation operator (3), i.e. the basic reproduction number

*R*_{0}, equaled 2 (thus mimicking the emergence of pandemic influenza [

23,

24]). Note that daily contacts were down weighted with a factor 1/5 to reflect the unlikely occurrence of transmission of infections during a second such contact. We first focused on a regular period and used the leading eigenvector to simulate the initial spread of a newly emerging pathogen [

25]. We investigated the effect of school closure by using contact rates from the non-holiday period, the holiday period and in the weekend (excluding holidays), and comparing the resulting relative impact on

*R*_{0}.