In the early days of a disease outbreak, clinicians, public health officials, and policy makers need rapidly available data to plan a response to an impending epidemic. Data collected and reported through official public health institutions is often not available for weeks while reporting mechanisms are established and bolstered. We examined data from two informal sources, HealthMap and Twitter, both made available on the Internet in real time, to determine whether the trend in volume of such reports over time would correlate with the trend in volume of cases reported through official mechanisms. We found that in the 2010 Haitian cholera outbreak, there was good correlation between trends in the volume of informal data and officially reported case data during the initial stages of an outbreak or relevant event. We demonstrate one potential use of this informal data to gain early insight into an evolving epidemic: estimating the reproductive number of the cholera epidemic, which has important implications for the implementation of disease control measures.
Informal media sources such as search query volume have previously been shown to be accurate metrics for “predicting present activity” in economics, sales, disease prevalence, and consumer activity.10,23
Here, for the first time, we show their use in monitoring an outbreak of a neglected tropical disease in a resource-limited setting and in estimating the effective reproductive number of an epidemic to gain early insight into disease dynamics.
We found that data from the informal sources correlated best with MSPP data with a 1-day lag, meaning that the changes in volume occurred 1 day later in those sources than in MSPP data. However, these data sources are made publicly available in real time, whereas MSPP data is released with up to 2 weeks of delay. Thus, because informal sources can be accessed in near real time, estimates from these data sources can be made earlier than from formal sources, which become available only after delays incurred in the traditional chain-of-command structure of public health. The use of electronic sources can also facilitate finer temporal resolution than more traditional data streams, often at the level of single days or better. Consequently, estimates derived from these data sources can be generated very early and often, with the potential to precede insight available from official sources. Electronic sources also offer very fine spatial resolution, which is not explored in this study. Near real-time estimates of epidemic activity may provide valuable insights into the trajectory of an infectious disease outbreak, help project the spread of an epidemic, and provide guidance on the magnitude of control measures needed. The reproductive number can be used to determine the proportion of the population that needs to be immunized to contain an epidemic, or the proportion that will be infected when the disease reaches its endemic equilibrium.
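The lag comparison described above can be illustrated with a minimal sketch. It assumes two equal-length daily time series and uses the Pearson correlation at each candidate shift; the function names, the `max_lag` parameter, and the choice of Pearson correlation are assumptions made for illustration and are not taken from the study's methods.

```python
import numpy as np


def lagged_correlation(official, informal, max_lag=7):
    """Correlate official[t] with informal[t + lag] for each candidate lag.

    A positive lag means the informal series trails the official one
    by that many days. Hypothetical helper, not the study's code.
    """
    official = np.asarray(official, dtype=float)
    informal = np.asarray(informal, dtype=float)
    if official.shape != informal.shape:
        raise ValueError("series must have the same length")
    n = len(official)
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x = official[: n - lag]
            y = informal[lag:]
        else:
            x = official[-lag:]
            y = informal[: n + lag]
        # Pearson correlation of the overlapping portions of the two series.
        results[lag] = float(np.corrcoef(x, y)[0, 1])
    return results


def best_lag(correlations):
    """Return the lag (in days) with the highest correlation."""
    return max(correlations, key=correlations.get)
```

Applied to daily official case counts and daily informal report volume, `best_lag(lagged_correlation(official, informal))` would return the shift at which the two series align most closely; a result of 1 would correspond to the 1-day lag reported here.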
In the study presented here, we found that trends in the volume of informal media sources correlated with trends in official case volume early in the epidemic, during periods of exponential growth, when estimates of Re are made. We showed how estimates of Re can vary with the number of days used to determine the growth rate. Very early estimates from media sources diverged much more than early estimates from official data, indicating a media amplification effect around initial news of an event. During the second period of exponential growth (around the time of Hurricane Tomas), the growth rates were very similar for informal sources and official sources. This suggests that the media amplification effect may be more important around the time of a new outbreak and less relevant as an epidemic continues to spread. Because epidemic curves from informal sources showed exponential growth during corresponding time periods, estimates of the reproductive number could be made within 10 days of the outbreak onset. Although correlation was not good later in the epidemic, it was strong during periods of exponential rise in cases, which is when the reproductive number is estimated.
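The dependence of Re on the number of days used to fit the growth rate can be sketched as follows. This is an illustration under stated assumptions, not the estimator used in this study: the growth rate r is taken as the least-squares slope of log counts against time, and Re is approximated as 1 + rT, a relation that holds when the serial interval is exponentially distributed with mean T. The function names and the `min_days` parameter are hypothetical.

```python
import numpy as np


def growth_rate(daily_counts):
    """Least-squares slope of log(count) versus day index (per day)."""
    counts = np.asarray(daily_counts, dtype=float)
    if np.any(counts <= 0):
        raise ValueError("counts must be positive to take logarithms")
    days = np.arange(len(counts))
    slope, _intercept = np.polyfit(days, np.log(counts), 1)
    return float(slope)


def reproductive_number(r, mean_serial_interval):
    """Approximate Re as 1 + r * T.

    Assumption: exponentially distributed serial interval with mean T
    (in days). Other interval distributions give different formulas.
    """
    return 1.0 + r * mean_serial_interval


def re_by_window(daily_counts, mean_serial_interval, min_days=4):
    """Estimate Re using only the first d days, for each window length d."""
    return {
        d: reproductive_number(growth_rate(daily_counts[:d]), mean_serial_interval)
        for d in range(min_days, len(daily_counts) + 1)
    }
```

Running `re_by_window` on the official series and on an informal series, over a range of plausible mean serial intervals, gives one way to see how early-window estimates from the two kinds of source diverge or converge as more days are included.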
In the data from the MSPP, there was a third peak of cases that was not captured by the informal sources, which could be caused by local disease dynamics that did not garner further media attention, or by a loss of media attention after the initial stages of the outbreak. Accordingly, the methods here are primarily useful for evaluating the relationship between informal and official data streams during periods of high disease transmission activity, which commonly occur at the beginning of an outbreak. We found that estimates of Re using informal sources during the second phase of exponential growth in the Haiti epidemic matched estimates from official sources within the calculated error margins for the selected range of mean serial intervals, whereas estimates in the initial phase were approximately 1.2 to 1.9 times larger than estimates from official sources. Temporal differences in the relationship between informal and official data streams could be caused by differences in the accuracy of official reports or in characteristics of the disease dynamics or media as the epidemic progressed.
In principle, the methods and data types presented here can be extended to other diseases and to other metrics of disease activity. Media sources can act as an independent metric for gauging disease activity, which is unaffected by biases of, or can convey trends not captured in, official data. Passive surveillance data collected from health facilities by the government can be afflicted by temporally varying logistical or political limitations and generally result in underestimation of the true disease burden in epidemics.2,7,24
Furthermore, for diseases transmitted purely from person to person, the mean serial interval may be better understood allowing for more precise estimates of Re. For cholera, there is poor data on timing of transmission between individuals, which represents a major source of uncertainty in the estimates of Re presented here. Alternative approaches for estimating the reproductive number, such as Bayesian or maximum likelihood frameworks, could also be used to estimate the reproductive number early in an epidemic by using informal data volume combined with assumptions about the serial interval distribution.25–27
Informal data sources may contain biases that should be considered. First, there may be geographic biases constraining media prevalence; media may be more ubiquitous in and about larger urban centers or developed regions in general. Furthermore, media volume originating from Haiti or any post-disaster environment may be reduced because of poor existing or resulting infrastructure. In addition, global media coverage of neglected tropical diseases may be reduced even if case burden is similar to that of events for other diseases. Second, data contributed by individuals through informal media (such as microblogging, cell phones, etc.) may be more prevalent from certain age or other demographic groups.28
However, penetration and use of consumer technology are constantly increasing and facilitating more communication in a variety of worldwide settings, which should decrease demographic and geographic biases in the information. A third potential bias is that informal media reports may contain false positives; they may appear in the absence of disease, based upon false alerts, rumors, or misreporting, particularly in situations of fear or panic. This would contribute to disproportionality between trends in media reports and the underlying volume of disease. Other studies have generated rules for determining relevancy for and reconciling these spikes in time-series data,29 and these methods could be incorporated into future work. Additionally, broadening our inclusion criteria to include Tweets that also contained words such as “diarrhea” or “vomiting” would have increased the sensitivity of captured Tweets but decreased the specificity. Finally, we found that correlation between informal media sources and case numbers was not significant later in the epidemic, which may be an important limitation of this method late in epidemics.
We have shown here that social and news media sources yielded data that correlated well with officially reported data from the MSPP during the early stages of the epidemic. Furthermore, at the early stages of an outbreak, informal sources can not only indicate that an outbreak is occurring but also highlight disease dynamics through estimation of a key epidemic parameter, the reproductive number. Social and news media data, such as those from HealthMap and Twitter, are a cost-effective data source. Further research is needed to determine whether informal media will be a good measure of morbidity in other epidemics, and how such sources can best be used for monitoring and characterizing future infectious disease epidemics. The next steps would also entail studying how this could be done prospectively. These methods are not a replacement for traditional surveillance methods; however, our results show that these sources can be used to complement current methods for early estimation of epidemiological parameters.