Although there is a trend towards modernizing surveillance of infectious diseases, dengue surveillance remains largely traditional, based mostly on passive routine reporting or on sentinel site surveillance, which is a preferable, active, but more costly approach
[4]. The current standard approaches to dengue surveillance have recognized shortcomings including low sensitivity and accuracy and lack of timeliness. Therefore, the need to take steps to improve dengue surveillance has been well acknowledged
[1],
[4],
[16], but cost and feasibility remain major obstacles.
The results of this study show that, in general, models built on the fraction of Google search volume for dengue-related queries adequately estimated true dengue activity, as measured by official dengue case counts reported by national ministries of health or the WHO, for five selected countries and for the majority of the seasons in the time frame analyzed. To our knowledge, few have explored data sources outside traditional clinical/laboratory settings for monitoring dengue epidemics. Our results provide evidence of the availability of a novel data source that could supplement traditional surveillance. Furthermore, a web-data-based approach would be a low-cost option, as it is passive and would require minimal resources to run.
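The relationship described above, between the dengue-related fraction of search volume and official case counts, can be illustrated with a minimal sketch. It assumes a simple univariate least-squares fit; the model actually used is the one defined in the Methods, and the function names and all numbers below are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_linear(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a)
           * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den


# Hypothetical weekly data (NOT from the study): the dengue-related
# fraction of all searches, and the official case counts for the same weeks.
query_fraction = [0.0010, 0.0015, 0.0030, 0.0042, 0.0025,
                  0.0012, 0.0018, 0.0035]
official_cases = [120, 170, 330, 450, 280, 140, 200, 380]

# Fit on the first five weeks, then evaluate on the three held-out weeks.
intercept, slope = fit_linear(query_fraction[:5], official_cases[:5])
held_out_estimates = [intercept + slope * q for q in query_fraction[5:]]
held_out_r = pearson(held_out_estimates, official_cases[5:])
```

Evaluating on weeks that were not used for fitting is what makes the correlation a measure of how well the search signal tracks official counts, rather than a measure of how well the line was fitted.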
The main added benefit of monitoring web-searching behavior is the potential for earlier detection. While notifications by doctors or laboratories to ministries of health are often delayed until there is a confirmed diagnosis
[17], it is believed that individuals, especially at earlier stages of illness, may seek health information on the Internet before or even instead of making medical visits. One study evaluating a community-based surveillance system in rural Cambodia found that 67% of cases of hemorrhagic fever were treated at home as opposed to a health facility
[18]. While rural areas are less likely to be served by Internet access, in other more developed areas, the Internet could be a source of information for those who do not actively seek clinical care. These data could therefore have the potential to provide earlier signals of epidemics in the community than clinical or laboratory data. Several studies have already demonstrated that web access logs and search query data work well for tracking influenza
[9]–
[12], although whether these data are actually timelier than traditional data is uncertain, with differing results depending on the study and gold standard of comparison.
However, even if the signals in web query data are no timelier than those in traditional laboratory/clinical surveillance data, a tool built on the presented models could still provide a time advantage: it would give immediate access to an indicator of dengue activity that could help illustrate the dengue situation as it currently stands. This idea reflects the concept of “now-casting” as opposed to forecasting, that is, predicting the present rather than the future
[14]. Official case counts are not made publicly available in all countries, and where they are, the delay before these data become available varies widely, from only a couple of days (as in the case of Singapore) to months or even years (as in the case of the WHO's DengueNet system, which collects data for all countries). This tool is not meant to fill these gaps with actual estimates of case counts, but by estimating an indicator of dengue activity that would be available in near real-time, it could serve as a stepping stone to prompt further investigation if warranted.
The lack of data stems from a variety of factors, including under-reporting. Even with mandatory reporting of dengue, under-reporting is prevalent
[1],
[4],
[16]. Field investigations, sero-surveys and capture-recapture methods have yielded some remarkably low estimates for the sensitivity of dengue case notification, reflecting under-reporting
[17],
[19]–
[21]. Reasons for under-reporting include lack of resources (both personnel and equipment), motivation and leadership, in addition to misunderstandings about or unfamiliarity with case definitions, complicated reporting procedures, a tendency to report only the most severe cases, lack of reporting from the private health sector
[4],
[16], and the reality that a proportion of the ill do not seek clinical care, whether because they self-treat at home
[18] or because their infection is asymptomatic or subclinical
[21]. Unfortunately, the problem of under-reporting extends to our models as well, since they were built on official data that are affected by precisely these problems. Therefore, the value that a tool built on such models would be trying to capitalize on is not sensitivity but the ability to capture the same trends as the official data at a potentially earlier time point.
A main challenge remains that rural areas and developing nations currently tend to have limited or no Internet access. Web-query-based surveillance depends on sufficient web search volume from any country of interest in order both to generate signals and to drown out noise. In fact, the requirement of sufficient search volume turned out to be a significant limiting factor in our process of identifying appropriate disease/location candidates.
Another limitation to be kept in mind with respect to expanding to different countries is that inter-country comparisons may be difficult due to differences in case definition for the official time series to which models were fitted. Unfortunately, because data using a consistent case definition across all countries do not exist (to our knowledge), each presented country and model must be considered independently.
Lack of Internet access may also be a potential explanation for the discrepancy between the fitted and actual values for the 2005 season in India. Though the gap is narrower today, there is a tremendous amount of regional variation in Internet penetration in India, a reflection of the country's economic disparity, especially between rural and urban areas
[22]. The 2005 season was predominantly driven by a major outbreak in the state of West Bengal, which includes the city of Kolkata, where per-capita Google searches at that time were far fewer than in cities like Delhi and Mumbai. Correspondingly, a model that fits aggregate national-level search data to national official case count data could underestimate true activity in regions with limited Internet usage. If state-level data become available, future improvements to the model could include the addition of state-level adjustments.
Another limitation is that not everyone who submits a dengue-related search query is actually ill with dengue. Indeed, a prevailing concern in such uses of web-searching behavior data for monitoring epidemic signals is the susceptibility of these data to panic-induced searching; the announcement of a novel outbreak, especially if compounded by media sensationalism, usually leads to increased online searching activity, and while a proportion of that behavior may be spurred by legitimate personal medical concern, a larger proportion is likely driven by fear or curiosity. By training the models over multiple years of data, we are able to filter out terms that might be popular at a specific point in time during one season but not over multiple seasons. For example, when the first wave of H1N1 swine influenza emerged in 2009, there were large increases in search activity for “swine flu”, but this term was not included in Google Flu Trends models since it was not used significantly prior to 2009. Additionally, dengue is probably somewhat shielded from mass panic-induced searching; being an endemic disease in the regions we have focused on, dengue is less likely to receive the same degree of attention as a novel or rare disease. This hypothesis is supported by our results, which show that dengue-related search queries are generally not strongly influenced by news coverage. For example, despite more severe and newsworthy outbreaks in Bolivia in 2009, Brazil in 2008, Indonesia in 2004, and Singapore in 2005, the models were able to handle these high levels of dengue activity without any significant overestimation. The one exception occurred in 2006 in India, when news about members of the prime minister's family being hospitalized for suspected dengue caused an unusually large spike in dengue queries. However, adjusting this spike prior to model fitting as described in our
methods proved to be an effective way of retaining model fit. As with Google Flu Trends, despite strong historical correlations, our system remains susceptible to false alerts that could be caused by a sudden increase in dengue-related queries
[9].
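As a rough illustration of what adjusting a news-driven spike before model fitting might look like, the sketch below damps isolated outliers in a weekly query-fraction series. This is not the adjustment procedure described in the Methods: the local-median rule, the function name `damp_spikes`, the `window` and `factor` parameters, and the data are all assumptions made for this example.

```python
from statistics import median


def damp_spikes(series, window=2, factor=3.0):
    """Replace any point that exceeds `factor` times the median of its
    neighbours (up to `window` points on each side) with that median.

    Hypothetical rule for illustration only; not the paper's method.
    """
    adjusted = list(series)
    for i, value in enumerate(series):
        lo = max(0, i - window)
        hi = min(len(series), i + window + 1)
        neighbours = list(series[lo:i]) + list(series[i + 1:hi])
        if not neighbours:
            continue  # a single-point series has nothing to compare with
        local = median(neighbours)
        if local > 0 and value > factor * local:
            adjusted[i] = local
    return adjusted


# Hypothetical weekly query fractions with one news-driven spike in week 4.
weekly_fraction = [0.0011, 0.0012, 0.0013, 0.0090, 0.0014, 0.0013, 0.0012]
adjusted = damp_spikes(weekly_fraction)
```

A rule of this kind only removes short, isolated spikes. A sustained rise over several weeks raises the local median itself and is therefore left in place, which is the behavior one would want if the rise reflects a genuine outbreak.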
Incorrect self-diagnosis is another instance where dengue-related search queries may not correspond to true illness. Notably, chikungunya and dengue are particularly difficult to distinguish because they manifest with similar symptoms and share the same vectors
[23]. The distinction has become even more difficult since late 2005, when chikungunya re-emerged and led to a major outbreak in the Indian Ocean region, resulting in its current co-circulation with dengue in India
[24]. Misdiagnosis is an obvious limitation of this tool, but where even doctors would find it difficult to make that distinction on clinical symptoms alone, it also afflicts the clinical data used in traditional dengue surveillance. Misdiagnosis should therefore be acknowledged as a problem, but search query data could nonetheless be useful for evidence-based decisions, providing earlier signals on the basis of which more formal epidemiological investigation and coordination with diagnostic laboratories could be initiated.
Mining Google search query data raises obvious privacy concerns, and policies that protect personal information must extend to the application of this tool in public health practice. The main safeguard is that such a tool presents query volume only at an aggregate level, where the unit of analysis prevents any re-identification of patients. The product of this work is freely available at
www.google.org/denguetrends. The presented tool is not intended to replace traditional dengue surveillance, but by taking advantage of readily available data essentially provided by millions of individuals, it could be a useful and low-cost complement. These data can help mitigate some of the many gaps that exist in the current dengue surveillance landscape. More broadly, these results also contribute to a growing pool of evidence demonstrating the capability of relatively novel sources such as web-based data to assist with public health goals.