|Home | About | Journals | Submit | Contact Us | Français|
Effective disease monitoring provides a foundation for effective public health systems. This has historically been accomplished with patient contact and bureaucratic aggregation, which tends to be slow and expensive. Recent internet-based approaches promise to be real-time and cheap, with few parameters. However, the question of when and how these approaches work remains open.
We addressed this question using Wikipedia access logs and category links. Our experiments, replicable and extensible using our open source code and data, test the effect of semantic article filtering, amount of training data, forecast horizon, and model staleness by comparing across 6 diseases and 4 countries using thousands of individual models. We found that our minimal-configuration, language-agnostic article selection process based on semantic relatedness is effective for improving predictions, and that our approach is relatively insensitive to the amount and age of training data. We also found, in contrast to prior work, very little forecasting value, and we argue that this is consistent with theoretical considerations about the nature of forecasting. These mixed results lead us to propose that the currently observational field of internet-based disease surveillance must pivot to include theoretical models of information flow as well as controlled experiments based on simulations of disease.
Despite a rapid pace of advancement in medicine, disease remains one of the most tenacious challenges of the human experience. Factors such as globalization and climate change contribute to novel and deadly disease dynamics , as highlighted by recent events such as the 2014–2015 Ebola epidemic in West Africa  and the ongoing Zika crisis in the Americas .
The highest-impact approach for addressing this challenge is a strong public health system [56, 139], including effective and timely disease surveillance, which quantifies present and future disease incidence to improve resource allocation and other planning. Traditionally, this depends on in-person patient contact. Clinic visits generate records that are accumulated, and these results are disseminated, typically by governments. Given appropriate resources, this approach is considered sufficiently reliable and accurate, but it is also expensive and coarse. Further, it is slow. Measurements of present disease activity are not available, only measurements of past activity.
These problems have motivated a new, complementary approach: use internet data such as social media, search queries, and web server traffic. This is predicated on the conjecture that individuals’ observations of disease yield online traces that can be extracted and linked to reality. The anticipated result is estimates of present disease activity available in near real-time, called nowcasts, as well as forecasts of future activity informed by near real-time observations — as we will argue, a related but fundamentally different product.
This notion was well-received in both the popular press [66, e.g.] and the scientific literature. One of the first papers has been cited over 2,000 times since its publication in late 2008 . Based on these results and amid the rush of enthusiasm, Google founded web systems, Google Flu Trends and Google Dengue Trends, to make nowcasts available to the public.
However, subsequent developments have led to increasing skepticism. For example, after significant accuracy problems  and scientific criticism , Flu and Dengue Trends folded in the summer of 2015 .
Among the numerous studies in the scientific literature, the details of how to process and evaluate internet traces are varied and scattered, as are experimental contexts. Further, we suspect many additional studies yielded negative results and were not published. That is, measuring disease using internet data works some of the time, but not always. We ask: when does it work and how does it work?
We address this question using Wikipedia. The present work is a large-scale experiment across 19 disease/country contexts curated for comparability on these two dimensions, using linear models atop Wikipedia traffic logs and an article filter leveraging relatedness information from Wikipedia categories . This enables us to address four research questions motivated by the above conjecture and core question. These are:
This paper is supported by open-source experiment code1 and a supplementary data archive2 containing our input and output data, early explorations of data and algorithms, and a machine-readable summary of our literature review.
Our contributions are as follows:
Our paper is organized as follows. We review the relevant literature, outline the design of our experiment, and present our results. We close with a discussion of what this really tells us and our proposals for how the field should best proceed.
This paper falls into four contexts: traditional patient- and laboratory-based disease surveillance, internet-based disease surveillance, Wikipedia-based measurement of the real world, and the latter two applied to forecasting. This section extends and adapts the literature review in our previous paper .
To our knowledge, this review is comprehensive in the latter three areas as of April 2016. A BibTeX file containing all of our references is in the supplement, as is a spreadsheet setting out in table form the related work attributes discussed below.
Traditional forms of disease surveillance are based upon direct patient contact or biological tests taking place in clinics, hospitals, and laboratories. The majority of current systems rely on syndromic surveillance data (i.e., about symptoms) including clinical diagnoses, chief complaints, school and work absenteeism, illness-related 911 calls, and emergency room admissions .
For example, a well-established measure for influenza surveillance is the fraction of patients with influenza-like illness. A network of outpatient providers report the number of (1) patients seen in total and (2) who present symptoms consistent with influenza and having no other identifiable cause ; the latter divided by the former is known as “ILI”. Other electronic resources include ESSENCE, based on data from the Department of Defense Military Health System  and BioSense, based on data from the Departments of Defense and Veterans Affairs, retail pharmacies, and Laboratory Corporation of America .
Clinical labs play a critical role in surveillance of infectious diseases. For example, the Laboratory Response Network, consisting of over 120 biological laboratories, provides active surveillance of a number of diseases ranging from mild (e.g., non-pathogenic E. coli and Staphylococcus aureus) to severe (e.g., Ebola and Marburg), based on clinical or environmental samples . Other systems monitor non-traditional public health indicators such as school absenteeism rates, over-the-counter medication sales, 911 calls, veterinary data, and ambulance runs. Typically, these activities are coordinated by government agencies at the local or regional level.
These systems, especially in developed countries, are generally treated as accurate and unbiased.3 However, they have a number of disadvantages, notably cost and timeliness: for example, each ILI datum requires a clinic visit, and ILI data are published only after a delay of 1–2 weeks .
The basic conjecture of internet-based disease surveillance is that people leave traces of their online activity related to health observations, and these traces can be captured and used to derive actionable information. Two main classes of trace exist: sharing such as social media mentions of face mask use  and seeking such as web searches for health-related topics .4
In this section, we focus on the surveillance work most closely related to our efforts, specifically, that which uses existing single-source internet data feeds to estimate a scalar disease-related metric. We exclude from detailed analysis work that (for example) provides only alerts [38, 178], measures public perception or awareness of a disease , includes disease dynamics in its model [147,170], evaluates a third-party method [95, 120], uses non-single-source data feeds [38, 55], or crowd-sources health-related data (participatory disease surveillance) [34,35].
We also focus on work that estimates biologically-rooted metrics. For example, we exclude metrics based on seasonality [11,130], pollen counts [58,165], over-the-counter drug sales volume [82,98], and emergency department visits .
These traces of disease observations are embedded in search queries [5, 7, 9, 12, 14, 17, 21, 25, 26, 31, 32, 33, 39, 49, 50, 53, 59, 63, 64, 71, 72 73, 77, 78, 81, 85, 87, 90, 97, 103, 104, 109, 119, 126, 127, 131, 132, 141, 142, 144, 146, 157, 158, 162, 163, 166, 168, 169, 170, 173, 177, 179, 180, 182], social media messages [1, 2, 8, 10, 20, 36, 40, 41, 42, 46, 51, 60, 62, 68, 76, 84, 89, 92, 93, 115, 116, 118, 123, 124, 148, 149, 151, 176], web server access logs [57, 79, 101, 105], and combinations thereof [13, 19, 30, 91, 136, 143, 167].
At a basic level, traces are extracted by counting query strings, words or phrases, or web page URLs that are related to some metric of interest (e.g., number of cases), forming a time series of occurrences for each item. A statistical model is then created that maps these inputs to the official values of the metric. This model is trained on time periods when both the internet and official data are available, and then applied to estimate the metric over time periods when the official data are not available, i.e., forecasting the future and nowcasting the present. The latter is useful in the typical case where official data availability lags real time.
The disease surveillance work cited above has been applied to a wide variety of infectious and non-infectious conditions: allergies , asthma [136, 176], avian influenza , cancer , chicken pox [109, 126], chikungunya , chlamydia [42, 78, 109], cholera [36, 57], dengue [7, 31, 32, 57, 62, 109], diabetes [42, 60], dysentery , Ebola [5, 57], erythromelalgia , food poisoning , gastroenteritis [45, 50, 71, 126], gonorrhea [77, 78, 109], hand foot and mouth disease [26, 167], heart disease [51, 60], hepatitis , HIV/AIDS [57, 76, 177, 180], influenza [1, 2, 8, 9, 10, 13, 19, 20, 21, 30, 33, 40, 41, 43, 46, 48, 53, 57, 59, 68, 72, 73, 79, 81, 84, 85, 89, 90, 91, 92, 93, 97, 101, 103, 104, 105, 109, 115, 116, 118, 123, 124, 126, 131, 132, 141, 142, 143, 146, 148, 149, 151, 157, 163, 168, 170, 173, 179, 182], kidney stones , listeriosis , Lyme disease , malaria , measles , meningitis , methicillin-resistant Staphylococcus aureus (MRSA) , Middle East Respiratory Syndrome (MERS) , obesity [42, 60, 144], pertussis [109, 116], plague , pneumonia [109, 131], respiratory syncytial virus (RSV) , Ross River virus , scarlet fever , shingles , stroke [60, 162], suicide [64, 169], syphilis [78, 127], tuberculosis [57, 180], and West Nile virus .
In short, while a wide variety of diseases and locations have been tested, such experiments tends to be performed in isolation and are difficult to compare. We begin to address this disorganization by evaluating an open data source and open algorithm on several dimensions across comparable diseases and countries.
Wikipedia article access logs, i.e., the number of requests for each article over time, have been used for a modest variety of research. The most common application is detection and measurement of popular news topics or events [3, 6, 37, 70, 80, 83, 121, 171, 172]. The data have also been used to study the dynamics of Wikipedia itself [133, 155, 161]. Social applications include evaluating toponym importance in order to make type size decisions for maps , measuring the flow of concepts across the world , and estimating the popularity of politicians and political parties . Finally, economic applications include attempts to forecast box office revenue [44, 108] and stock performance [29, 99, 113, 114, 164].
In the context of health information, the most prominent research direction focuses on assessing the quality of Wikipedia as a health information source for the public, with respect to cancer [96, 135], carpal tunnel syndrome , drug information , kidney conditions , and emerging infectious diseases .
A few health-related studies make use of Wikipedia access logs. Tausczik et al. examined public “anxiety and information seeking” during the 2009 H1N1 pandemic, in part by measuring traffic to H1N1-related Wikipedia articles . Laurent and Vickers evaluated Wikipedia article traffic for disease-related seasonality and news coverage of health issues, finding significant effects in both cases . Aitken et al. found a correlation between drug sales and Wikipedia traffic for a selection of approximately 5,000 health-related articles , and Brigo et al. between news on celebrity movement disorders and Wikipedia traffic on topical articles 
A growing number of such studies map article traffic to quantitative disease metrics. McIver & Brownstein used Poisson regression to estimate the influenza rate in the United States from Wikipedia access logs . Hickmann et al. combined official and internet data to drive mechanistic forecasting models of seasonal influenza in the United States . Bardak & Tan assessed forecasting models for influenza in the U.S. based on Wikipedia access logs and Google Trends data, finding an improvement in both nowcasting and forecasting power when both were combined . Finally, we previously found high potential of article traffic for both nowcasting and forecasting in several disease-location contexts .
In short, the use of Wikipedia access logs to measure real-world quantities is growing, as is interest in Wikipedia for health purposes. To our knowledge, the present work is the broadest and most rigorous test of this opportunity to date.
Of the internet data and Wikipedia surveillance studies discussed in the above two sections, 28 included a statistical forecasting component of some kind. (One used disease dynamics for forecasting , which is again out of scope for this review.) This forecasting falls into three basic classes: finding correlations, building models with linear regression, and building models with other statistical techniques.
The basic form of these 8 studies is to vary the temporal offset between internet and official data, either forecasting (internet mapped to later official) or anti-forecasting (internet to earlier official), and report the correlation. Thus, the claim is that a linear model is plausible, but no actual model is fitted and there is no test set. Various offsets have been tested, with up to ±7 months in one case .
Four of these studies report that forecasting is promising in at least some contexts [50, 79, 87, 109], while four report that it is not [101, 103, 126, 179]. Two studies report their best correlations at an anti-forecasting offset [50, 87].
This approach lacks a convincing explanation for why given offsets performed well or poorly. Rather, the goal of these studies appears to be to illustrate a relationship between internet and official data in a specific context, in order to motivate further work.
Extending the above analysis by one step is to actually fit the offset linear models implied by high correlation. We found 8 studies that did this. One used ridge regression , three used ordinary least squares [13, 57, 97], one “applied multiple linear regression analysis with a stepwise method” , one used a log-odds linear model with no fitting algorithm specified , and the rest did not specify a fitting algorithm [77, 131, 157]. We suspect that the latter four used ordinary least squares, as this is the most common linear regression algorithm. Again, the offsets tested varied.
Six of these studies reported positive results, i.e., that the approach was successful or promising in at least some contexts [13, 57, 77, 131, 167, 169], while two reported negative results [97, 157].
An important methodological consideration is whether a separate test set is used. This involves fitting the model on a training set and then applying the coefficients to a previously unseen test set; the alternative is to fit the model on a single dataset and then evaluate the quality of fit or estimates from the same dataset. The former is the best practice, while the latter is vulnerable to overfitting and other problems.
Three of the studies reporting positive results had no test set [57, 131, 169], two performed a single experiment with one training and one test set [77, 167], and one used cross-validation (multiple separations of the data into training/test pairs) . Two of these studies, both reporting positive results, used Wikipedia data [13, 57].
Our work falls into this class, as we use elastic net regression  to test forecasting up to 16 weeks. Our experiment is much larger than any prior work, with thousands of individual training/test pairs, and each test is arranged in a realistic temporal arrangement with the test set immediately following the training set (as opposed to cross-validation, which intermingles them).
A smorgasbord of other methods have been used to address the problem of statistical forecasting. All 12 of these studies reported positive results (though recall that negative results are less likely to be published). None used Wikipedia data.
The linear methods we found are linear autoregressive models [115, 124], stacked linear regression , multiple linear regression with a network component , and a dynamic Poisson autoregressive model . We also found non-linear methods: matrix factorization and nearest neighbor models , support vector machines , AdaBoost with decision trees , neural networks , and matching meteorological data to historic outbreaks [21,151].
It is quite plausible that statistical techniques other than linear regression can produce better models, though as we argue later, this should be supported by solid theoretical motivation for a specific technique.
We elected to use linear regression for several reasons. First, this is most common in prior work, and we wanted to reduce the number of dimensions on which ours differed. Second, in general, simpler methods are preferred; we wanted to find out the performance of linear regression before moving on to other things. Finally, our sense (as we explore in greater detail below) is that a better understanding of the data and the information pipeline is a higher priority than exploring the complex menu of statistical techniques.
That said, our open-source experimental framework can support evaluation of these and other statistical methods. This may be an opportunity to directly compare different methods.
This section describes the data sources we used and the design of our two experiments.
Our goal was to perform a large-scale experiment to evaluate our approach for internet data-based disease monitoring and forecasting across two dimensions: countries and diseases. We tested 19 contexts of 4 countries by 6 diseases at the same weekly time granularity.
We implemented the experiments in Python 3.4 with a variety of supporting libraries, principally Pandas  and Scikit-learn . Our experiment code is open source, and brief replication guidance is included with the code.
Our analysis period is July 4, 2010 through July 5, 2015, i.e., 261 weeks starting on Sunday or Monday depending on the country. This period was selected somewhat arbitrarily as a balance between data availability and a larger experiment.
We used three data sources: official disease incidence data, Wikipedia article traffic logs, and the Wikipedia REST API. All of the data we used are in the supplement.
We downloaded official disease incidence data from national government ministry of health (or equivalent) websites.5 In some cases, these raw data were in spreadsheets; in others, they were graphs presented as images, which we converted to numbers using WebPlotDigitizer . Each disease/country context resulted in a separate Excel file, which we then collated into a single Excel file for input into the experiment.
Our goal was to select a tractable number of countries and diseases that produces a large number of disease/country pairs in the dataset, in order to facilitate comparisons. We had three general inclusion criteria:
This selection is a challenging optimization task due to the variety of national surveillance and reporting practices. We believe that the set we selected is illustrative rather than representative.
Of the 24 disease/country combinations above, we found data for 19, as summarized in Table 1.
Summary access logs for all Wikipedia articles are available to the public.6 These contain, for each hour from December 9, 2007 to present and updated in real time, a compressed text file listing the number of requests for every article in every language, except that articles with no requests are omitted.7
If an article’s traffic during a month was less than 60 requests, we treated that entire month as zero; if all months had traffic below that threshold, we removed the article from our dataset. We normalized these request counts by language and summed the data by week, producing for each article a time series vector where each element is the fraction of total requests in the article’s language that went to that article. These weeks were aligned to countries’ reporting week (starting on Monday for Germany and Sunday for the other three countries).
We did not adjust for time zone between Wikipedia traffic data (in UTC) and official incidence data (presumably in local time). This simplified the analysis, and these offsets are relatively small compared to the weekly aggregation.
Wikipedia provides an HTTP REST API to the public.8 It is quite complex, containing extensive information about Wikipedia content as well as editing operations. We used it for two purposes: (1) given an article, get its links to other articles, and (2) given an article, get its categories. This study used the link and category state as of March 24, 2016.
We applied the latter iteratively to compute the semantic relatedness of two articles. The procedure roughly follows that of Strube and Ponzetto’s category tree search :
We refer to this relatedness as category distance for specificity. The distance between two articles is the number of fetch cycles, capped at 8. A category encountered at different levels in the two expansions yields the greater level as the distance. An article’s distance from itself is 1.
Our approach maps Wikipedia article traffic for a given week to official disease incidence data for the same or a different week, using a linear model. That is, we select a small subset of articles (order 10–1,000) and use a linear regression algorithm on some training weeks to find the best linear mapping from those articles’ request counts to the official case counts. We then apply these fitted coefficients to later, previously unseen article traffic and see how well the prediction matches the known incidence numbers. This approach is illustrated in Figure 1.
The independent variables are:9
A model is a set of linear coefficients ωj and intercept ω0 that maps article requests A1 through An to official data I:
A model can produce one prediction per staleness s, at week i + s.
Accuracy is evaluated by creating a model sequence of models with consecutive epochs and the same horizon, duration, and staleness; this yields a time series of one prediction per model. We then compare the predicted time series to the official data using the coefficient of determination r2 [0, 1 (i.e. the square of the Pearson correlation r), which quantifies the fraction of variance explained by the model .
The key advantage of r2 is that it is self-normalizing and dimensionless, making it possible to compare models with different incidence units and scales. Also, it is widely used, making comparison with prior work possible. However, the metric does have disadvantages. In particular, it can be hard to interpret. Not only is it unitless, it cannot be used to assess sensitivity and specificity, nor answer questions like the number of infections identified by the model. As we argue below, future work should investigate more focused evaluation metrics.
To fit our linear models, we used elastic net regression , an extension of ordinary least squares regression with two regularizers. One minimizes the number of non-zero coefficients, and the other minimizes the magnitude of the coefficients.10 This reflects our desire for models using few articles and for setting roughly the same coefficients on articles whose traffic is correlated.
Elastic net has two parameters. α is the combined weight of both regularizers, which was chosen by Scikit-learn using cross-validation. ρ (called l1_ratio by Scikit-learn) is the relative weight between the two, which we set to 90% non-zero and 10% magnitude.
These choices were guided by data examination; we also considered lasso  and ridge regression  as well as setting the parameters manually or automatically. The Jupyter notebooks  that supported this examination are in the supplement.
We used a two-stage article filter, using first article links and second category distance. For a given disease, we manually select a root article in English and follow Wikipedia’s inter-language links to find roots for the other languages (Table 1). This is the only manual step in our approach. The root article and all articles linked comprise the first group of candidates.
Next, we reduce the candidate pool to those articles whose category distance is at most some limit d [1, 8]. This yields anywhere from a dozen to several hundred articles whose traffic vectors are offered to the training algorithm (which itself selects a further subset for non-zero coefficients).
We performed two experiments. Experiment 1 focuses on U.S. influenza, a context familiar to the authors and with excellent data, thus affording a more detailed analysis addressing all four research questions.
In Experiment 1, we tested category distances of 1–8 with and without confounding. That is, for the non-confounded control, we used a single root of “Influenza”; for the confounded case, we added two additional root articles: “Basketball” and “College basketball”. The goal was to offer the algorithm correlated but unrelated and thus potentially misleading articles, in order to test the hypothesis that purely correlation-based feature selection can be problematic.
This hypothesis is hinted at in Ginsberg et al. , whose feature selection algorithm was driven by correlation between search queries and ILI. This model ultimately included the 45 most-correlated queries, which were all influenza-related. However, of the 55 next-most-correlated queries, 19 (35%) were not related, including “high school basketball”. Models with up to the top 80 queries performed almost as well as the 45-query model, with varying correlation up to roughly 0.02 worse, and may have contained unrelated queries (Figure 1 in the citation). The 81st query was the unrelated “oscar nominations”, which caused a performance drop of roughly 0.03 all by itself. That is, one can imagine that slightly different circumstances could lead a Ginsberg-style model to contain unrelated, possibly detrimental queries.
The remaining independent variables are: 9 horizons from 0–16 weeks, 7 training durations from 16–130 weeks (roughly 4 months through 2½ years), staleness from 0–25 weeks for horizon 0, and staleness 0 for other horizons, for a total of 185,236 models11 comprising 3,808 model sequences.
Experiment 2 broadens focus to all 19 disease/country contexts and narrows analysis to research questions RQ2–RQ4. We built models for category distance limits of both 2 and 3 and chose the best result, as detailed below. The other independent variables are the same as Experiment 1. In total, Experiment 2 tested roughly 432,721 models12 and 9,044 model sequences.
Next, we present the key results of these experiments. All output is in the supplement, but we report only specific analyses.
Our hope for this work was to uncover a widely applicable statistical disease forecasting model, which would complement traditional surveillance with its real-time nature and small number of parameters.
This section presents results of testing this notion. In Experiment 1, we explore U.S. influenza in some detail, then broaden our focus in Experiment 2 to all 19 disease/country pairs.
In this experiment, we tested 3,808 U.S. influenza model sequences. Some of these worked well (informally, succeeded), and others worked less well (failed). Figure 2 illustrates some of these predictions, with Figure 2a being one of the most successful. The other subfigures highlight the effects of the four research questions:
The following sub-sections explore these aspects of accuracy in detail, first the value of semantic relatedness (RQ1), then how the models are less sensitive to training duration and staleness (RQ2 and RQ4), and finally the models’ near-total inability to forecast (RQ3).
We tested the value of limiting model inputs to articles semantically related to the target context in two ways: by testing the effect of likely-correlated but unrelated articles (basketball) and relatedness limits based on Wikipedia categories. We held forecast horizon and staleness at zero. The results are shown in Figure 3.
Comparing the control case of a single root article, “Influenza” to the confounded one of adding two additional roots, “Basketball” and “College basketball”, yields two results: confounding the models either makes things worse, sometimes greatly so, (training duration 78 weeks or less) or has little effect (training of 104 weeks or more). This suggests that casting a wide net for input articles is a bad idea: it is unlikely to help, it might hurt a lot, and it can be very costly, perhaps even turning a workstation problem into a supercomputer problem.13
A complementary experiment tested category distance limits of 1 to 8. All training durations showed a roughly similar pattern: distance 1 was clearly sub-optimal, distance 2 had the highest accuracy or nearly so, and distance 3 or more showed a plateau or modest decline. Considered with the results above, this implies that somewhere in the gap between category distance 8 and entirely unrelated articles, a significant decline in accuracy is probable. This suggests that a relatively small limit is most appropriate, as higher does not help, and offering the fitting algorithm more articles is risky, as we showed above.
In short, we show that an input filter based on some notion of relatedness is useful for Wikipedia-based disease prediction, and we offer a filter algorithm that requires minimal parameterization — simply select a root article, which need not be in a language known to the user due to inter-language links.
We tested the effect of training duration from 16 to 130 weeks (4 months to 2½ years) with all 8 distance limits and forecast horizon and staleness held at zero, as well as staleness from 0 to 25 weeks (6 months) with three training values, distance at 2, and forecast horizon at zero.
Training durations of 26, 52, and 78 weeks at distance 2 show similar r2 of roughly 0.76, while 39 weeks dips to r2 = 0.64 (Figure 4); we suspect this is due to severely overestimating the peak magnitude in the 2013–2014 season. Other category distances yield a similar pattern, with distance 2 being close to the best at all training durations. We speculate that this general plateau of 6–18 months’ training is because the fit selects similar models, yielding similar results.
This suggests that optimizing training duration should have lower priority than other model parameters and that considerations other than pure accuracy may be relevant. For example, influenza is a yearly, seasonal disease; 39 weeks is ¾ cycle, which may be an awkward middle ground that often attempts to combine small fractions of two influenza seasons with summer inactivity. On the other hand, 52 weeks and 78 weeks capture 1 and 1½ cycles respectively, which plausibly is more robust. Another implication is that one shouldn’t simply train on all the data available; conversely, potential models need not be discarded a priori if training data are limited.
Staleness also showed relative insensitivity (Figure 5). At 78 weeks training, accuracy started at r2 = 0.77 for staleness 0 and gradually declined to roughly 0.6 by 25 weeks. Training of 52 and 104 weeks showed a broadly similar pattern with generally lower r2, except that the former had very poor performance from staleness 3 to 12. This was caused by a huge, erroneous predicted spike in spring 2015, caused in turn by the models for epoch 248 and 249 making severe overestimates. We suspect that these models unluckily chose articles with traffic data quality problems.
This result is encouraging because staleness is a parameter determined by data availability rather than user-selected. That is, models robust to high staleness might be usable to fill longer delays in official data publication, and our results suggest that Wikipedia data can be used to augment official data delayed much longer than the 1–2 weeks of U.S. ILI.
We tested forecasting horizons of 0 to 16 weeks, with four training durations, category distance of 2, and staleness of 0 weeks.
Figure 6 shows model sequence examples for forecasts of 0 (nowcasting), 1, and 2 weeks. These results are not encouraging. While the nowcast performs relatively well (r2 = 0.77), the 1-week and 2-week forecasts perform poorly (r2 = 0.41 and 0.51, respectively). In addition to greater noise, the forecasts show a tendency for spikes (e.g., January 2014 in Figure 6b) and an “echo” effect of the same length as the forecast horizon. We speculate that the latter is due to training data that includes summer periods when phase does not matter; when applied to times near the peak when it does, the model cannot compensate.
Figure 7 quantifies this situation. All four training periods show rapid decline in forecasting accuracy, even with a single-interval horizon of 1 week.
These results contrast with prior work. For example, we reported several situations with high correlations between Wikipedia article traffic and official data under a forecasting off-set . However, this work had no independent test set; i.e., it reported training set correlations, which are very likely to be higher than test set correlations. Similarly, Bardak & Tan  reported linear regression models whose performance was best at a 5-day forecast. This work tested a large number of algorithms and parameterizations and used internal cross-validation rather than a test set that strictly followed the training, the latter being a more realistic setting.
In short, it appears that statistical forecasting of disease incidence with summary Wikipedia access logs may be risky; whether this applies to internet data in general is another hypothesis. We return to these broader issues in the discussion section below.
Recall that our goal was to obtain comparable disease incidence data in a dense variety of diseases and locations, hoping that this would enable comparisons on those two dimensions for the training, horizon, and staleness research questions (RQ2, RQ3, RQ4). That is, what is the relationship between these parameters and properties of given diseases or locations?
The best r2 score for any parameter configuration in each disease/country context is listed in Table 1 above. These range from r2 = 0.06 for Israel malaria to 0.93 for Germany influenza, with most of these best scores being rather poor. Thus, these generally negative results do not support the analysis we originally envisioned, though we do observe that all contexts with small case numbers perform poorly.
Instead, we analyze the four best-performing contexts (Germany influenza, Colombia dengue, Germany pertussis, and U.S. influenza) as individual case studies; this number is somewhat arbitrary but does appear to capture all the contexts above the accuracy inflection point. For all three questions, results were generally consistent with Experiment 1.
Parameterizing category distance was guided by our count of how many articles fell at which category distances (category_distance.ipynb in the supplement), which suggested using 3, as well as the results of Experiment 1, which suggested 2. We ran the experiment at both distances and report the distance which yielded the best mean r2 across all training durations at horizon zero, staleness zero.
Figure 8 illustrates accuracy of the four best-performing contexts by training duration, with horizon and staleness both held at zero. In addition to U.S. influenza, both Germany influenza and Colombia dengue both show plateaus of insensitivity to training duration, though these plateaus fall at different parts of the range, roughly 16–52 weeks and 52–130 weeks respectively. We speculate that this is because cultural and disease-related properties both affect how the models work, which in turn influences parameterization.
Germany pertussis shows a different trend: highest performance is at 16 weeks, with a steady and large decline as training increases. Examining these models (see supplement) shows that they are autoregressive; essentially, the best fit sets all article coefficients to zero, and the estimate is the mean of the training period. Colombia dengue shows a similar pattern to a lesser degree. With short training periods, this yields higher r2 because it is responsive.
That is, in these cases, Wikipedia has little to add. We speculate that this is caused by some combination of the Wikipedia data lacking sufficient signal, the official data being noisy (variation of 20% or more from week to week is common), and pertussis being a non-seasonal disease (dengue and influenza both are). This result is consistent with Goel et al., which found a similar result for Yahoo search queries and influenza .
Figure 9 illustrates accuracy by staleness, with horizon held at zero and each context presented at its best training period. Germany influenza is consistent with the U.S.: very modest decline with increasing staleness. Colombia dengue and Germany pertussis show results consistent with the training discussion above: autoregressive models that predict the mean of recent history decline in performance as they reach further into the future.
Figure 10 illustrates the accuracy of best-performing models by forecast horizon, at staleness zero and the best four training durations. Like the U.S., Germany influenza shows immediate and dramatic decline in performance. It appears that there may be some forecasting promise for Colombia dengue, as its performance decline is much slower.
However, Figure 11 illustrates the nowcast and 2-week forecast predictions for Colombia dengue. These show that the high r2 is misleading. While the forecast accurately captures the overall trend, it misses the critical growth period of early 2013. That is, the same echo effect seen in U.S. influenza forecasting results in good predictions when we care least and erroneous ones when we care most. This highlights the pitfalls of non-time-series-aware, yet popular, metrics such as r2. That is, apparently high-performing models may not actually be so, and careful examination of the data as well as appropriate metrics are necessary.
In this section, we address the implications of our results: what does this study tell us, and what should we do next?
Our most direct findings are related to the four research questions. First, we found that filtering inputs with semantic relatedness improves predictions and that filters based on minimal human input are effective (RQ1). This is consistent with the adage “garbage in, garbage out” and prior admonishments on the risks of big data . A further valuable implication is that predictions and prediction experiments are tractable on workstations, not just supercomputers.
Second, we found that our linear approach is less insensitive to both the amount (RQ2) and age (RQ4) of training data. This simplifies parameter tuning and implies that in some cases, nowcasting models can be quite robust.
Finally, and contrary to prior work, we found very little forecasting value, even just one time step ahead (RQ3). Why the different outcomes? We identify eight non-exclusive possible explanations:
We find the last item the most intriguing. That is, though the others likely play a role, we argue that it is the crux of the issue. This argument is rooted in a simple distinction: that forecasting the future is fundamentally different, and more difficult, than nowcasting the present.
The internet-based approach to disease measurement is based in the (very probably true) conjecture that individuals leave traces of their health observations online. These traces, however, are snapshots of the present, not the future. This is why, under the right circumstances, nowcasting likely adds value to traditional disease surveillance systems with reporting lag.
On the other hand, it is not obvious why successful nowcasting should extend to forecasting. Why should lay health observations, let alone their highly imperfect internet traces, be predictive of future disease incidence? If one could forecast disease using a statistical map between internet data and later official data, this would suggest that individuals create online traces weeks or months prior to similar observations finding their way into traditional surveillance systems.
In some contexts, this is plausible. Forexample, Paparrizos et al. recently reported that searches for symptoms of pancreatic cancer such as ulcers, constipation, and high blood sugar significantly lead the diagnosis , though with a high false positive rate.
In others, this implication is unconvincing. For example, influenza and dengue fever move much faster than pancreatic cancer and do not have distinctive leading symptoms. Also, their seasonal patterns are likely more reliably informed by non-internet data sources such as weather reports. Thus, despite previous work to the contrary, we are unsurprised by our models’ inability to forecast.
We argue that forecasting disease requires capturing the mechanism of its transmission. One approach to this is to deploy the statistical model with an informed sense of which and how leading symptoms are likely to be present, as Paparrizos et al. did; another is to combine internet and/or official data with mathematical transmission models rooted in first principles. This approach uses internet data to constrain mathematically plausible forecast trajectories, much like weather forecasting, and has shown promise on multiple-month horizons [67,147].
More specifically, we speculate that an internet-based disease surveillance system useful in practice might have four components: (1) official data arriving via a tractable number of feeds, large enough for diversity but small enough that each feed can be examined by humans at sufficient frequency and detail, (2) internet data feeds, again arriving in a tractable manner, (3) mathematical models of disease transmission appropriate for the diseases and locations under surveillance, and (4) algorithms to continually re-train and re-weight these three components. These algorithms would use recent performance on validated, epidemiologically appropriate metrics (probably not r2) to make evaluations and adjustments. The system would emit incidence nowcasts and forecasts with quantitative uncertainty, as well as diagnostics to enable meaningful human supervision. For example, if a specific Wikipedia article has been informative in the past but then suddenly becomes non-informative, this change might be alerted.
This brings us to the question of how to follow the present study. We propose focused improvements to the approach we presented as well as a broader look at what might be best for the field in the long term.
This study raises several focused, immediate questions. Some of these can be answered using the supplement, and others would require new experiments.
First, we did no content analysis of the articles selected by our algorithm, either at the filter stage or by the elastic net for nonzero coefficients. We also have not compared the automatically derived related-article lists with manually curated ones, nor studied why some seasons did better than others. These would provide additional insight into the information flow.
Second, the statistics and algorithms have various opportunities, including non-linear models, regression that includes time series data properties such as auto-correlation, more advanced semantic relatedness metrics, more focused metrics such as the timing and magnitude of peaks, automatic parameter tuning, and ensemble approaches. (These improvements would be straightforward additions to the experiment source code.)
Finally, the Wikipedia data have drawbacks. A technical issue is that article traffic is noisy; for example, we observed some articles where one hour had many orders of magnitude more hits than the preceding and following hours. Characterizing and mitigating the noise problems may help.
A more important concern is that language is a questionable proxy for location and unlikely to be viable at the sub-national level. We see two solutions to this problem. First, the Wikimedia Foundation and Wikipedia community could make available geo-located traffic data; privacy-preserving methods for doing so are quite plausible . Second, the Wikipedia-based semantic distance measure could be combined with a different data source for quantifying health observations.
One possibility is the Google Health Trends API , which makes available geo-located traffic counts for both search queries and conceptual “entities”  to approved researchers. One could use Wikipedia to specify a set of time series and Google to obtain them. A disadvantage is that these Google data are not available to the public, as Wikipedia traffic is. However, understanding the value-add (if any) would support a cost-benefit analysis of this issue.
We opened the paper by asking “when and how does it work”, hoping to find the answers by comparing across diseases and countries. We did not turn up enough positive results to do this comparison, which caused us to reconsider the approach in general.
This question has three pieces — when, how, and work — none of which have been adequately answered, by us or (to our knowledge) anyone else. To explain this, we invoke the three pillars of science: theory, experiment, and simulation . The field of internet-based disease measurement rests on just experiment, and further, the experiments in question (including ours) are observational, not controlled.
That is, a study proposing a new method or context for such disease measurement must not only show convincingly that it does work but also when and how; otherwise, how can we avoid the risky potential attractor of unrelated yet correlated features? How do we know whether a positive result is robust or how close a negative one was to success? In this section, we propose a research agenda to bring together all three pieces, which may build a new foundation for the field and make it reliable enough for the life-and-death decisions of health policy and practice.
Under what circumstances can an approach be expected to work, i.e., what are the relevant parameters, what is their relationship to success, and what are their values in a given setting? Recall the core conjecture of the field: that individuals’ observations of disease yield online traces that can be transformed into actionable knowledge. This conjecture’s truth is clearly non-uniform across all settings: parameters such as disease, location, temporal granularity, number of cases, internet penetration, culture-driven behavior, various biases, exogenous causes of traffic, internet systems(s) under study, and algorithm choice all play a role. The relevant parameters are poorly known, as the sensitivity of a given approach to their values.
This requires a theoretical argument that a given approach, including all its pieces, can be trusted to work, which in turn requires understanding, not simply a high r2. Without this, one has little reason to believe or disbelieve a given result.
A related issue is correlation vs. informativeness. Finding correlated signals is not the same as finding informative signals; recall that in the U.S., basketball is correlated with influenza. Decent theory will provide a basis to argue that a given signal is in fact informative (or not).
One must quantify the value added over traditional surveillance alone; simply matching official data well does not do this. We need evaluation metrics that are relevant to downstream uses of the model and provide contrast to plausible alternatives.
For example, very simple models such as autoregression are often competitive on metrics like r2, as demonstrated above and by others . However, constraining mechanistic models with internet-based ones seems most useful with estimates of inflection points, rather than incidence. Knowing outbreak peak timing even a week or two sooner can provide much value, and inflection points such as the peak are when simple models might break down. That said, inflection points must be accurate, and the current internet-based approaches are noisy.
The bottom line is: We argue that this field needs (1) mathematical theory describing the flow of disease-related information from human observations through internet systems and algorithms to actionable knowledge and (2) controlled experiments, which are not possible in the real world but can be performed using simulated outbreaks and simulated internet activity.
In short, we claim that the right question is not “what happens when we try X” but rather “what are the relevant fundamental characteristics of the situation, and how does X relate to them”. This cannot be answered by observation alone.
We thank our anonymous reviewers for key guidance and the Wikipedia community whose activity we studied. Emiliano Heyns’ Better Bib(La)TeX for Zotero plug-in14 was invaluable, as was his friendly, knowledgeable, and speedy technical support. Plot colors from ColorBrewer15 by Cynthia A. Brewer, Pennsylvania State University. This work is supported by NIH/NIGMS/MIDAS, grant U01-GM097658-01, and DTRA/JSTO, grants CB10092 and DTRA10027.
3These data streams do have known inaccuracies and biases  but are the best available. For this reason, we refer to them as official data rather than ground truth.
4In fact, there is evidence that the volume of internet-based health-seeking behavior dwarfs traditional avenues .
5Case definitions for the same disease vary somewhat between countries; interested readers should follow the references for details.
7This request count differs from the true number of human views due to automated requests, proxies, pre-fetching, people not reading the article they loaded, and other factors. However, this commonly used proxy for human views is the best available.
9Vocabulary used in the experiment source code differs somewhat.
10That is, elastic net is roughly the weighted sum of lasso regression (minimize non-zero coefficients) and ridge regression (minimize coefficient magnitude).
11The number of models takes into account boundary effects (e.g., whether there is enough training data available for a given epoch) as well as models invalid due to missing data.
12The total number of models created was 464,080, but we estimate that 23,267 of them were for the confounded U.S. influenza, ignored in Experiment 2, and 8,092 for Colombia influenza, which we excluded because we had official data for only one season.
13Our two experiments totaled roughly 12 hours on 8 cores.
AUTHOR CONTRIBUTIONS(Equal contribution within each topic.) Led study and writing: Priedhorsky. Principal experiment design: Osthus, Priedhorsky. Experiment design: Del Valle, Deshpande. Experiment programming and execution: Priedhorsky. Data acquisition: Daughton, Fairchild, Generous, Priedhorsky. Content analysis for unreported pilot experiment: Daughton, Moran, Priedhorsky. Principal results analysis: Priedhorsky. Results analysis: Daughton, Osthus. Supplement preparation: Priedhorsky. Wrote the manuscript: Daughton, Moran, Osthus, Priedhorsky. Critical revisions of manuscript: Del Valle, Deshpande, Fairchild, Generous.
VERIFYING THE SUPPLEMENT
The supplement contains Python pickles and other potentially malicious files. Accordingly, inside the archive is a file sha256sums, which contains SHA-256 checksums of every other file. The checksum of sha256sums itself is: