Overall, of the 477,768 collected tweets, 318,379 were classified as relevant to the influenza A(H1N1) vaccine. Of those, 255,828 were classified as neutral, 26,667 as negative, and 35,884 as positive. Starting in late August 2009, the number of relevant tweets in the United States increased steadily until early November 2009, after which it dropped back to previous levels. The accompanying figure (panel A) shows the absolute numbers of positive (n+), negative (n−), and neutral (n0) tweets per day in the United States. The overall influenza A(H1N1) vaccine sentiment score, measured as the relative difference of positive and negative tweets, (n+ − n−)/(n+ + n− + n0), started at a negative value in late summer 2009 and showed relatively large short-term fluctuations. The 14-day moving average turned positive in mid-October (as the vaccine became available) and remained positive for the rest of the year.
(A) Total number of negative (red), positive (green), and neutral (blue) tweets relating to influenza A(H1N1) vaccination during the Fall wave of the 2009 pandemic.
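The sentiment score and its smoothing can be illustrated with a short sketch. The function names are hypothetical, and the use of a trailing window (rather than a centered one) is an assumption, since the text only states that a 14-day moving average was used.

```python
from collections import deque

def sentiment_score(n_pos, n_neg, n_neu):
    """Relative difference of positive and negative tweets: (n+ - n-) / (n+ + n- + n0)."""
    total = n_pos + n_neg + n_neu
    if total == 0:
        return 0.0  # assumption: a day without relevant tweets is scored as neutral
    return (n_pos - n_neg) / total

def moving_average(values, window=14):
    """Trailing moving average; early entries average over the days seen so far."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)  # the oldest value is dropped automatically once the window is full
        out.append(sum(buf) / len(buf))
    return out

# Overall counts reported in the text (positive, negative, neutral).
overall = sentiment_score(35884, 26667, 255828)
```

Applied to daily counts, `moving_average([sentiment_score(p, n, z) for p, n, z in daily_counts])` would give a smoothed curve of the kind described above, where `daily_counts` is a hypothetical list of per-day triples.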
For vaccination sentiments measured online to be meaningful, they need to be validated against empirical data. A positive correlation between the influenza A(H1N1) vaccination sentiment score and estimated vaccination coverage would be relevant to public health efforts because it would allow target areas for communication interventions to be identified. To test for such a correlation, we used estimated influenza A(H1N1) vaccination rates up to January 2010 as provided by the CDC. These estimates are based on results from the Behavioral Risk Factor Surveillance System (BRFSS) and the National 2009 H1N1 Flu Survey (NHFS). We found a very strong correlation at the level of HHS regions (weighted r = …, p = 0.017; regions as defined by the US Department of Health & Human Services) using the estimated vaccination coverage for all persons older than 6 months, and a strong correlation at the level of states (weighted r = …, p = 0.0046). All reported correlation values are Pearson product-moment correlation coefficients, weighted by the total number of tweets (n+ + n− + n0) per region. Pearson correlation was used because the variables considered for analysis are normally distributed (Shapiro-Wilk test and Anderson-Darling test).
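A weighted Pearson correlation of this kind can be sketched as follows. This is a minimal illustration using the standard weighted-covariance formula; the function name is hypothetical, and the computation of the p-value is not shown.

```python
import math

def weighted_pearson(x, y, w):
    """Pearson correlation of x and y with non-negative observation weights w."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    total = sum(w)
    # Weighted means of both variables.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    # Weighted (unnormalized) covariance and variances; the normalization cancels in r.
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the sentiment score per region, `y` the estimated vaccination coverage, and `w` the total number of tweets per region. With equal weights the result reduces to the ordinary Pearson coefficient.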
Using data on who followed whom among users in the dataset, we generated a directed network of information flow whose structure (with respect to the distribution of opinions on vaccination) can provide insight into how sentiments are distributed (see Methods). To investigate whether users preferentially seek information from other users who share their opinion, we measured assortative mixing of users with a qualitatively similar opinion on vaccination (homophily) by calculating the assortativity coefficient r, which is defined as

$$r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{1 - \sum_i a_i b_i},$$

where $e_{ij}$ is the fraction of edges in the network that connect a node of type i to one of type j (in the direction i → j), and $a_i = \sum_j e_{ij}$ and $b_i = \sum_j e_{ji}$ are the fractions of edges that start and end, respectively, at nodes of type i. A positive value of r (with maximum value 1) is found in a network where nodes are predominantly connected to nodes of the same type. A value of r = 0 would indicate a randomly mixed network, and a negative value (≥ −1) would indicate a disassortative network where nodes of one type are predominantly connected to nodes of the other type (for the technical reasons why the minimum value of r is not always −1, see ref.).
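The assortativity coefficient can be computed directly from a directed edge list. The sketch below assumes the standard formulation for discrete node types, r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i), where a_i and b_i are the fractions of edges starting and ending at nodes of type i; the function name and the input layout are hypothetical.

```python
def assortativity(edges, node_type):
    """Assortativity coefficient for discrete node types on a directed edge list.

    edges     -- list of (source, target) pairs
    node_type -- dict mapping each node to its type (e.g. '+' or '-')
    """
    types = sorted(set(node_type.values()))
    index = {t: k for k, t in enumerate(types)}
    n = len(types)
    # Mixing matrix: e[i][j] is the fraction of edges going from type i to type j.
    e = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        e[index[node_type[src]]][index[node_type[dst]]] += 1.0 / len(edges)
    trace = sum(e[i][i] for i in range(n))
    a = [sum(e[i]) for i in range(n)]                       # edges starting at type i
    b = [sum(e[i][j] for i in range(n)) for j in range(n)]  # edges ending at type j
    ab = sum(a[i] * b[i] for i in range(n))
    if ab == 1.0:
        raise ValueError("r is undefined when all edges join nodes of a single type")
    return (trace - ab) / (1.0 - ab)
```

For example, a network in which every edge joins two users of the same type gives r = 1, and one in which every edge joins users of opposite types (with both types equally common) gives r = −1.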
In the network of 39,284 users who had a non-zero sentiment score (from now on referred to as the opinionated network, i.e., containing only users who expressed predominantly either positive or negative opinions), we find r = 0.144. To assess the significance of this value, we randomized the opinions on the network (bootstrap with replacement) 10,000 times and found the maximum value of r among these randomized networks to be 0.0056, more than an order of magnitude lower than in the original network (mean: −3 × 10^−4, 95% CI: −0.0032, 0.0027). We also calculated for each node (user) the fraction f of incoming edges from nodes with the same qualitative sentiment, then randomized the opinions and compared the new distribution of f to the original distribution. For 10,000 randomized networks, we found that in all cases the mean of the original distribution (0.601) was significantly larger than the mean of the distribution in the randomized network (p < 10^−95 for all tests using the paired Wilcoxon signed-rank test; max. mean: 0.548, mean of means: 0.531, 95% CI: 0.52, 0.541). These results demonstrate that there is significantly more information flow between users who share the same sentiments than expected based on the distribution of sentiments.
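The randomization procedure can be sketched on a toy network. All names and the toy data below are hypothetical; opinions are coded as +1 and −1, nodes without incoming edges are excluded from the mean of f (an assumption), and the Wilcoxon test on the distributions of f is not shown.

```python
import random

def assortativity(edges, opinion):
    """Assortativity of a directed edge list for two opinion classes (+1 / -1)."""
    m = len(edges)
    same = sum(1 for s, d in edges if opinion[s] == opinion[d]) / m
    ab = 0.0
    for t in (+1, -1):
        a = sum(1 for s, _ in edges if opinion[s] == t) / m  # edges starting at type t
        b = sum(1 for _, d in edges if opinion[d] == t) / m  # edges ending at type t
        ab += a * b
    if ab == 1.0:
        return None  # undefined when every edge joins nodes of a single type
    return (same - ab) / (1.0 - ab)

def mean_same_fraction(edges, opinion):
    """Mean over nodes of f, the fraction of incoming edges from same-opinion nodes."""
    incoming, same = {}, {}
    for s, d in edges:
        incoming[d] = incoming.get(d, 0) + 1
        same[d] = same.get(d, 0) + (opinion[s] == opinion[d])
    return sum(same[d] / incoming[d] for d in incoming) / len(incoming)

def randomized_opinions(opinion, rng):
    """Resample opinions with replacement and reassign them to the nodes."""
    nodes = list(opinion)
    values = list(opinion.values())
    return dict(zip(nodes, rng.choices(values, k=len(nodes))))

def null_distribution(edges, opinion, statistic, n_rand=1000, seed=0):
    """Values of `statistic` on networks with randomized opinions."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_rand):
        value = statistic(edges, randomized_opinions(opinion, rng))
        if value is not None:
            null.append(value)
    return null

# Toy network: two fully connected groups of 10 users joined by two cross edges.
opinion = {i: (+1 if i < 10 else -1) for i in range(20)}
edges = [(i, j) for g in (range(10), range(10, 20)) for i in g for j in g if i != j]
edges += [(0, 10), (10, 0)]

r_obs = assortativity(edges, opinion)
r_null = null_distribution(edges, opinion, assortativity, n_rand=200)
f_obs = mean_same_fraction(edges, opinion)
f_null = null_distribution(edges, opinion, mean_same_fraction, n_rand=200)
```

Comparing `r_obs` with the range of `r_null`, and `f_obs` with `f_null`, mirrors the comparison between the observed network and the randomized networks described above.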
Social networks often naturally divide into communities, i.e., groups of people who share common interests, beliefs, and opinions. In a network of opinionated users, the question of community structure naturally arises: are there communities within the network where positive or negative attitudes towards the novel vaccine dominate? To address this question, we separated the giant component of the opinionated network (34,025 users) into communities of users that are densely connected compared to the rest of the network using the spin glass community detection algorithm. For each community, we then calculated the proportion of users with negative attitudes, p(−), and compared it to the average in the giant component, p(−) = 0.396. With the exception of a single community, all communities containing at least 1% of the users in the entire network were significantly more positive or more negative than expected (Fisher's exact test, 10^−279 …), ranging from p(−) = 0.764 in the most negative community (2,453 users) to p(−) = 0.266 in the most positive community (2,517 users).
(A) Proportion of negative sentiments p(-) in the network communities. Dashed line shows overall proportion in the opinionated network.
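The comparison of one community against the whole component can be illustrated with a self-contained Fisher's exact test. This sketch assumes a two-sided test on the 2×2 table (negative or positive, inside or outside the community), which the text does not spell out; the function names are hypothetical, and log-probabilities are used so that large counts do not overflow.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_hypergeom(k, neg_total, size_in, size_total):
    """log P(k negative users in a community of size_in drawn from size_total users)."""
    return (log_comb(neg_total, k)
            + log_comb(size_total - neg_total, size_in - k)
            - log_comb(size_total, size_in))

def fisher_exact_two_sided(neg_in, size_in, neg_total, size_total):
    """Two-sided p-value: total probability of all tables no more likely than the observed one."""
    lowest = max(0, size_in - (size_total - neg_total))
    highest = min(size_in, neg_total)
    log_obs = log_hypergeom(neg_in, neg_total, size_in, size_total)
    p = 0.0
    for k in range(lowest, highest + 1):
        log_p = log_hypergeom(k, neg_total, size_in, size_total)
        if log_p <= log_obs + 1e-7:  # small tolerance for floating-point ties
            p += math.exp(log_p)
    return min(p, 1.0)
```

A call such as `fisher_exact_two_sided(neg_in, size_in, neg_total, size_total)` takes the number of negative users and the size of the community, followed by the same two counts for the whole component. For extremely small p-values the sum can underflow to zero in double precision, in which case reporting the log-probability of the observed table is more informative.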
Non-random distributions of opinions on vaccines can have a profound effect on the likelihood of disease outbreaks if they lead to a clustered distribution of vaccination status in the population. Communities with very low vaccination rates are not protected by herd immunity even if the overall vaccination rate in the population is high. To quantify this effect, we used a recently collected high-resolution contact network relevant for infectious disease transmission to simulate the spread of influenza A(H1N1). We performed simulations as described previously with a constant vaccination rate but varying levels of assortativity (see Methods). The simulations show that the probability of large outbreaks is greatly increased when susceptibility to disease is positively assorted. The probability of an outbreak that infects >5% of the population, for example, can be increased more than 10-fold at r > 0.14 (as observed in the Twitter network) relative to the random distribution where r ≈ 0.
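The effect of clustered vaccination status can be illustrated with a toy simulation. This is not the simulation model or the contact network used in the study: the sketch assumes a synthetic ring lattice, a fully protective vaccine, a simple one-attempt-per-contact transmission rule, and only two extremes (random versus fully clustered placement of unvaccinated individuals) instead of a tunable assortativity. All names and parameter values are hypothetical.

```python
import random

def ring_lattice(n, k):
    """Undirected ring lattice: each node is linked to its k nearest nodes on each side."""
    nbrs = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, k + 1):
            j = (i + step) % n
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def assign_vaccination(n, coverage, clustered, rng):
    """Return the set of vaccinated nodes at a fixed coverage."""
    n_unvacc = n - round(coverage * n)
    if clustered:
        # Unvaccinated nodes form one contiguous block (strongly assorted).
        start = rng.randrange(n)
        unvacc = {(start + i) % n for i in range(n_unvacc)}
    else:
        # Unvaccinated nodes are scattered at random.
        unvacc = set(rng.sample(range(n), n_unvacc))
    return set(range(n)) - unvacc

def outbreak_size(nbrs, vaccinated, p_transmit, rng):
    """Number of infected nodes after seeding one random unvaccinated node."""
    susceptible = [v for v in nbrs if v not in vaccinated]
    if not susceptible:
        return 0
    index_case = rng.choice(susceptible)
    infected = {index_case}
    frontier = [index_case]
    while frontier:
        u = frontier.pop()
        for v in sorted(nbrs[u]):
            if v in infected or v in vaccinated:
                continue
            if rng.random() < p_transmit:  # one transmission attempt per contact
                infected.add(v)
                frontier.append(v)
    return len(infected)

def prob_large_outbreak(n=500, k=3, coverage=0.9, clustered=False,
                        p_transmit=0.5, threshold=0.05, runs=200, seed=1):
    """Fraction of runs in which more than `threshold` of the population is infected."""
    rng = random.Random(seed)
    nbrs = ring_lattice(n, k)
    large = 0
    for _ in range(runs):
        vaccinated = assign_vaccination(n, coverage, clustered, rng)
        if outbreak_size(nbrs, vaccinated, p_transmit, rng) > threshold * n:
            large += 1
    return large / runs

p_random = prob_large_outbreak(clustered=False)
p_clustered = prob_large_outbreak(clustered=True)
```

Both settings use the same vaccination coverage, so any difference between `p_random` and `p_clustered` comes only from where the unvaccinated individuals sit in the network, which is the qualitative point made above.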