We have found that specific internet search terms are highly correlated with dengue incidence. Our best model for Singapore, which included 16 terms, showed a correlation of 0.931 with observed dengue incidence and an
. The 8-term model for Bangkok performed comparably, with a correlation of 0.869 and an
. Out-of-sample performance is, as expected, lower, but not significantly so. Our predictions of time periods with high dengue incidence are very accurate, with sensitivities and specificities of 0.861–1.00 and 0.765–1.00 for multiple thresholds in each location. Together, these results demonstrate the viability of this data stream in supporting dengue surveillance.
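To make the sensitivity and specificity figures concrete, the following minimal sketch shows how such values can be computed once a threshold defines a "high incidence" period. The function name, the rule that a period counts as high when its value is at or above the threshold, and the weekly counts are all illustrative assumptions; they are not the thresholds or data used in this study.

```python
def sensitivity_specificity(observed, predicted, threshold):
    """Score predictions of high-incidence periods.

    Assumption for illustration: a period is 'high incidence' when its
    value is at or above `threshold`.
    """
    tp = fn = tn = fp = 0
    for obs, pred in zip(observed, predicted):
        actual_high = obs >= threshold
        predicted_high = pred >= threshold
        if actual_high and predicted_high:
            tp += 1  # high period correctly flagged
        elif actual_high:
            fn += 1  # high period missed
        elif predicted_high:
            fp += 1  # low period wrongly flagged
        else:
            tn += 1  # low period correctly left unflagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up weekly case counts, for illustration only (not study data).
observed = [10, 40, 80, 120, 90, 30, 15, 60]
predicted = [12, 35, 85, 110, 45, 55, 10, 65]
sens, spec = sensitivity_specificity(observed, predicted, threshold=50)
```

Repeating the call over several thresholds gives a range of sensitivities and specificities of the kind reported above.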
Our model performed similarly to models built in other efforts to predict influenza incidence using internet search terms. Ginsberg et al. found a correlation of 0.90 for influenza incidence in the United States using a model that included 45 search terms, and Polgreen et al. fit a series of models to influenza data in the United States, all of which had values of
. In out-of-sample prediction, our models performed slightly worse than the influenza models of Ginsberg et al., which achieved a correlation of 0.97 (compared to 0.921 in Bangkok and 0.785 in Singapore). It should be noted that our model produces predictions for the entire year, including high and low incidence seasons, whereas the models of Ginsberg et al. produce predictions for only the influenza season. The accuracy of our predictions may be due to the clear clinical presentation of severe dengue. The larger interannual variability in dengue incidence may also allow us to disentangle seasonal search behavior from dengue-specific search behavior.
The search terms in the models include nomenclature terms and terms describing signs and symptoms as well as treatment seeking. Interestingly, 11 of the 13 search terms found to be significant in our final model for Singapore were in English, which suggests that English is the language typically used for health-seeking searches in Singapore. In Bangkok, three of the seven significant terms were also in English.
We validated the candidate models using leave-one-observation-out, leave-one-year-out and forward and backward validation techniques. The model performance was fairly consistent across these different approaches. In our validations, we found one year with large incidence to be highly influential for the performance of our model (see Supporting Information S1). We expect that including future years with large incidence might further improve our results.
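As an illustration of the leave-one-year-out idea, the sketch below holds out each year in turn, fits a linear model on the remaining years, and scores the held-out predictions by their correlation with observed incidence. It is a simplified sketch under stated assumptions: a single search-volume predictor stands in for the multi-term models, the scoring choice (within-year Pearson correlation) is assumed rather than taken from our analysis, and the data are synthetic. All function and variable names are hypothetical.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def leave_one_year_out(years, search_volume, incidence):
    """Fit on all years but one, predict the held-out year, and return
    {year: correlation of held-out predictions with observations}."""
    results = {}
    for held_out in sorted(set(years)):
        train = [i for i, yr in enumerate(years) if yr != held_out]
        test = [i for i, yr in enumerate(years) if yr == held_out]
        a, b = fit_line([search_volume[i] for i in train],
                        [incidence[i] for i in train])
        preds = [a + b * search_volume[i] for i in test]
        results[held_out] = pearson(preds, [incidence[i] for i in test])
    return results


# Synthetic data, for illustration only (not study data).
years = [2004] * 4 + [2005] * 4 + [2006] * 4
search_volume = [10, 20, 30, 20, 15, 40, 60, 25, 12, 22, 35, 18]
incidence = [22, 41, 63, 38, 33, 79, 124, 52, 25, 47, 68, 39]
scores = leave_one_year_out(years, search_volume, incidence)
```

A year whose removal changes the fitted coefficients markedly, such as a year with unusually large incidence, would show up here as a held-out score that differs noticeably from the others.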
Singapore has an extremely well developed dengue surveillance system that makes reported cases available to policy makers and the general public with a delay of around one week. With reporting this rapid, it is challenging for an internet search term model to return results more quickly and with better performance than a model that uses only reported cases to predict future cases. This point has been demonstrated elsewhere for predicting consumer behavior: predictive search term-based models perform better when used in conjunction with rich independent data sets. Thus, in Singapore, this tool might best be used as a supplement to existing surveillance systems. However, in settings with less developed surveillance systems, an internet search term-based system may yield substantial gains in the rapidity of predictions. In Thailand, for example, the reporting of cases from many areas of the country is considerably delayed, and our model may offer meaningful improvements in such settings. It is conceivable that some dengue-endemic settings in South and Southeast Asia may have substantial internet use before surveillance systems are developed, and thus an internet search term-based model may serve as a proxy for routine surveillance in these settings.
Caution must be used when generalizing our method to other settings. Even though we have chosen two settings that have very different rates of internet usage, both countries are of higher income than many of the countries in the region. However, it is reasonable to assume increasing internet penetration in the future. Individual models need to be developed for specific settings using local surveillance data and search terms. This effort shows that this approach may have promise in other settings.
There are several other limitations to our work. Internet searching behavior is susceptible to the impact of media reports, as has been found for influenza systems. The rate of internet use and the rate of health information seeking in these settings may be changing over time, and thus our parameters might need to be updated over time to incorporate the impact of these changes. Although not affecting performance here, future outbreaks of other clinically similar diseases such as chikungunya may challenge the performance of our model for dengue. Finally, the Google Insights tool returns a sample of actual search data and limits the availability of search terms for which there are very few returns, often aggregating these terms to a coarser temporal resolution. This limits the utility of these terms for the purposes of prediction.
Search query surveillance is rapidly expanding into many areas of public health, including the surveillance of noninfectious diseases, and into influencing policy. The current work demonstrates the utility of search query surveillance for forecasting the incidence of a tropical infectious disease. Importantly, we have constructed forecasting models using freely available search query data from Google Insights and publicly available surveillance data from Singapore and Bangkok, and we have developed these models using open source software from the R statistical project. Our approach can be readily adapted to other settings where proprietary efforts cannot be implemented. The approach may be an important tool in many dengue-endemic settings in supporting the public health response to dengue.