Imagine a system which continuously monitors Internet postings (be they on websites, blogs, microblogs such as Twitter, social media, discussion boards, or other publicly available sources), employing natural-language processing and other methods to classify the postings by topic and to obtain indicators of changes over time. We call such metrics supply-based infodemiology indicators.
Information (Concept) Prevalence
The most basic infodemiologic supply indicators are information prevalence and information occurrence ratios (or, perhaps more precisely, concept prevalence and concept occurrence ratios), measuring the absolute or relative number of occurrences of a certain keyword or concept in a pool of information. Note that we are talking about “keywords” if we simply look for the occurrence of certain terms, and "concepts" if we try to “understand” meaning, at a minimum combining multiple keywords to take into account synonyms.
The “pool of information” can be a set of documents, postings, status lines (Twitter, Facebook), a collection of Web pages, or websites. For example, we could automatically obtain estimates of the (absolute) number of Internet postings about a certain topic identified by a set of keywords. We call these kinds of data information prevalence. To be more specific about how the prevalence was obtained, we could also talk about keyword prevalence or concept prevalence.
Information prevalence data are particularly useful if we track them longitudinally (ie, track how the number of Internet postings on a given health-specific topic changes over time), as we would, for example, to see changes in relation to certain external events, such as a media campaign or a disease outbreak.
A crude method to obtain these prevalence indicators is to enter a search term (with a Boolean OR to include synonyms) into a search engine, which provides an absolute number of occurrences over time (see however the caveat on the reliability of search engines below). An occurrence can (depending on the search engine) be either the number of documents containing the search term at least once, or can be the number of term occurrences in the entire database (the unit does not really matter for our purpose, as long as we use the same method consistently). More advanced methods would also take into account synonyms and do a semantic search (ie, tracking concepts as opposed to keywords), and/or filter the searches to focus on specific geographical regions (for example countries).
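The distinction between keyword prevalence and concept prevalence can be sketched in a few lines of Python. This is a minimal illustration operating on a small in-memory list of postings rather than on search engine results; the function names, the sample postings, and the use of a simple synonym list as a stand-in for a "concept" are all assumptions made for this example.

```python
import re

def keyword_prevalence(documents, keyword):
    """Return both counting units mentioned in the text: the number of
    documents containing the keyword, and the total number of occurrences."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    doc_count = 0
    term_count = 0
    for text in documents:
        hits = len(pattern.findall(text))
        if hits:
            doc_count += 1
        term_count += hits
    return {"documents": doc_count, "occurrences": term_count}

def concept_prevalence(documents, synonyms):
    """Count documents mentioning at least one synonym (a Boolean OR).
    A synonym list is only a crude approximation of a semantic concept."""
    alternatives = "|".join(re.escape(s) for s in synonyms)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return sum(1 for text in documents if pattern.search(text))

# Hypothetical postings, invented for illustration only.
posts = [
    "I think I have the flu. The flu is awful.",
    "Influenza season started early this year.",
    "Great weather today!",
    "Got my flu shot.",
]

print(keyword_prevalence(posts, "flu"))
print(concept_prevalence(posts, ["flu", "influenza"]))
```

Whichever counting unit is chosen (documents or term occurrences), it should be applied consistently across all measurements that are compared.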
The figure below illustrates the information prevalence of various cancers in the Canadian top-level domain (.ca) plotted against actual disease incidence (a cautionary note: these data on information prevalence are based on crude Google hits rather than semantic analyses). Such information prevalence versus disease incidence scatterplots (or other comparators, for example information prevalence versus mortality) may be useful to illustrate to policy makers in which areas there may be an information deficit. From a public health perspective, diseases and conditions which have a high incidence and high disease burden (mortality rate or impact on quality of life), and which are preventable or for which screening tests exist, should enjoy better “coverage” in the media and on the Internet than those which are not. Thus, it is neither expected nor desirable that there is a strict correlation between cancer incidence and information prevalence. However, the figure shows that, compared to other forms of cancer with similar disease burden, breast cancer is an extreme outlier, pointing to a larger health care disparity between, for example, breast and prostate cancers (which have similar incidence and mortality, yet receive different levels of attention and funding), which has been previously referred to as the “prostate cancer gap” [11]. Policy makers need to be aware of such inequalities and information gaps, and there is a role for supply-based infodemiology indicators, both for management of chronic diseases and for management of public health emergencies. An "infodemiology dashboard" could be developed which displays some of these metrics to inform policy makers for which areas health marketing media campaigns are required.
Information prevalence versus disease incidence scatterplot (Eysenbach, in preparation)
As an analogy to the epidemiological terminology, we can also calculate information incidence rates, which determine the number of new information units created per unit of time. For example, comparing the incidence of Web pages which contain information about a certain topic, such as a new medical discovery, between countries, would provide interesting knowledge dissemination metrics.
Information or concept incidence rates may also point to emerging public health threats. For example, the Infovigil project monitors Twitter microblogs for mentions of public health relevant keywords and phrases, such as “I have fever”. Together with information on the location of the user, as well as automated conversations and directing users to surveys, these data can provide valuable information for public health agencies and the public alike. The figure below illustrates a very basic trend analysis of information incidence from Twitter feeds.
Information incidence (keyword occurrence) trends from Twitter status feeds ("tweets") (DIYCity/sickcity)
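A basic information incidence series of this kind can be computed by counting, per day, the newly created postings that contain a phrase. The following is a minimal sketch under stated assumptions: the postings are already available as (creation date, text) pairs, the sample data are invented, and plain substring matching stands in for any real classification step. It is not a description of how Infovigil works.

```python
from collections import Counter
from datetime import date

def incidence_by_day(posts, phrase):
    """Count new postings per day whose text contains the phrase.
    `posts` is an iterable of (creation_date, text) pairs."""
    phrase = phrase.lower()
    counts = Counter()
    for created, text in posts:
        if phrase in text.lower():
            counts[created] += 1
    # Days without any matching posting are simply absent from the result.
    return dict(sorted(counts.items()))

# Hypothetical status updates, invented for illustration only.
posts = [
    (date(2009, 3, 1), "I have fever and chills"),
    (date(2009, 3, 1), "Lovely day"),
    (date(2009, 3, 2), "ugh, I have fever again"),
    (date(2009, 3, 2), "I HAVE FEVER"),
]

for day, count in incidence_by_day(posts, "I have fever").items():
    print(day.isoformat(), count)
```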
Information (Concept) Occurrence Ratios
As the number of websites is constantly increasing, absolute figures on information prevalence are less meaningful than normalized indicators (ie, relative indicators such as rates and ratios). If the total number of “information units” in the “pool of information” is known, then the denominator used to normalize the absolute count could simply be the total number of information units. For example, if we know that the Web has a total of x Web pages in a given language at a given point in time, and y of these pages deal with cancer, then we can express the information prevalence as the proportion y/x. However, in the case of the Web, the denominator, which would be the total number of all indexed files and documents (including, for example, html, excel, and powerpoint files etc) in the specific language of the numerator keywords, is often hard to obtain or not known. While search engines such as Google may have data on the total number of indexed documents in a certain language, this information is usually proprietary and not accessible to researchers.
Thus, it is often easier to express the information prevalence as a fraction of information units about a certain topic compared to a control keyword or concept. For example, if the number of Web resources mentioning “prostate cancer” is 21.6 million, compared to 214 million resources mentioning cancer, then the occurrence ratio of prostate cancer to cancer is 21.6:214, or approximately 10%.
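The arithmetic of the occurrence ratio is simple enough to state as code. The sketch below reuses the two counts from the example above; the function name is an assumption made for illustration.

```python
def occurrence_ratio(topic_count, control_count):
    """Ratio of information units about a topic to units about a control concept."""
    if control_count <= 0:
        raise ValueError("control_count must be positive")
    return topic_count / control_count

# Counts taken from the example in the text.
ratio = occurrence_ratio(21.6e6, 214e6)
print(f"{ratio:.1%}")  # prints 10.1%
```

Because both counts come from the same pool at the same time, the ratio is less sensitive to overall growth of the Web than the absolute count.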
Studying occurrence ratios may provide fascinating insights into the linguistic and cultural differences in the use of words and concepts between countries, but it may also be a method to study inequalities and differences in access to health information. The figure below illustrates differences between the information occurrence ratios for “cervical cancer” information versus “cancer” information in Canada, the United Kingdom, and Australia. However, these are crude analyses based on keyword searches on Google. A proper infodemiological investigation would attempt to “understand” semantically the content of Web pages.
Information occurrence ratios for various concepts in English-speaking industrialized countries
Another important caveat is that many search engines do not give accurate or reliable hit counts. Not only do different search engines provide different results, but even the same search engine queried multiple times during the same day may give different estimates. Systems like Infovigil collect this information from different search engines at different times during the day, and employ statistical methods to even out discrepancies. This also reduces the risk that changes in the number of hits for certain keywords are confounded by changes in search engine algorithms. Alternative methods exist that can bypass search engines altogether, for example random IP sampling or the random creation of domain names, but these methods have their own set of problems, such as triggering security alerts because they resemble hacking attempts.
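The text does not specify which statistical methods are used to even out discrepancies between hit counts. One simple and robust option, shown here purely as an assumption, is to take the median of repeated samples within each search engine and then the median across engines. The engine names and counts are invented.

```python
from statistics import median

def pooled_hit_count(samples):
    """Combine repeated hit counts from several search engines.

    `samples` maps an engine name to a list of hit counts collected at
    different times of day. The median is used at both levels so that a
    single outlying estimate has little influence (an assumed choice, not
    a documented method of any particular system)."""
    per_engine = {
        engine: median(counts) for engine, counts in samples.items() if counts
    }
    if not per_engine:
        raise ValueError("no hit counts supplied")
    return median(per_engine.values())

# Hypothetical hit counts, invented for illustration only.
samples = {
    "engine_a": [100, 120, 110],
    "engine_b": [200, 190],
    "engine_c": [150],
}
print(pooled_hit_count(samples))
```

Pooling per engine first prevents an engine that happens to be sampled more often from dominating the combined estimate.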
Looking for co-occurrences of different keywords or concepts (for example, a disease name and the name of a pharmaceutical substance) could provide knowledge translation or innovation diffusion metrics. For example, after publication of a trial confirming the effectiveness of a new drug in a medical journal, researchers could measure how long it takes for a new therapy to be acknowledged and taken up by the public, as reflected by the incidence of the disease term and the treatment concept occurring together. These indicators could in turn be useful to study different methods to accelerate knowledge translation (eg, publishing in open access journals, hosting workshops, holding press-conferences, and issuing press-releases, etc). Moreover, algorithms could be developed which monitor the medical, peer-reviewed literature, on the one hand, and the Internet, on the other, to collect and provide continuous real-time knowledge translation indicators.
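As a rough illustration of such a knowledge translation metric, the sketch below measures the number of days between a publication date and the first dated posting in which a disease term and a treatment term occur together. The data layout, the function names, the placeholder drug name "drugx", and the use of substring matching are all assumptions made for this example.

```python
from datetime import date

def cooccurs(text, term_a, term_b):
    """True if both terms appear in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return term_a.lower() in lowered and term_b.lower() in lowered

def uptake_lag_days(posts, disease, treatment, published):
    """Days from publication to the first posting mentioning both terms.
    `posts` is an iterable of (creation_date, text) pairs.
    Returns None if no co-occurrence is found on or after publication."""
    dates = [
        created
        for created, text in posts
        if created >= published and cooccurs(text, disease, treatment)
    ]
    return (min(dates) - published).days if dates else None

# Hypothetical postings; "drugx" is a placeholder, not a real product.
posts = [
    (date(2009, 1, 5), "asthma and drugx rumours"),      # before publication
    (date(2009, 1, 20), "Trying DrugX for my asthma"),   # first co-occurrence
    (date(2009, 2, 1), "asthma is exhausting"),          # disease term only
]
print(uptake_lag_days(posts, "asthma", "drugx", date(2009, 1, 10)))
```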
While technically more challenging, it should also be possible to automatically identify and classify cases of misinformation or unbalanced information, tracking trends over time. For example, anti-vaccination websites use specific language, have specific attributes (eg, linking to other anti-vaccination sites), and cite a specific subset of the medical literature to provide a one-sided, biased view of the medical evidence [12]. A generic algorithm to obtain a measure for bias would, for example, be to compare the reference list of a systematic review to the references cited on a given website, which would enable researchers to quantify the direction and degree of content biases.
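One way such a comparison could be operationalized is sketched below. It assumes, beyond what the text states, that each reference in the systematic review has already been labeled with the direction of its finding (+1 supporting, -1 opposing an intervention), and it defines bias as the difference between the mean direction of the references a website cites and the mean direction of the full review. The labels, identifiers, and scoring rule are illustrative assumptions only.

```python
def citation_bias(review_refs, site_refs):
    """Compare a website's citations with a systematic review's reference list.

    `review_refs` maps a reference identifier to +1 or -1 (assumed labels).
    `site_refs` is the set of reference identifiers cited on the website.
    Returns the share of review references the site cites (coverage) and
    the shift in mean direction relative to the review (bias)."""
    if not review_refs:
        raise ValueError("review_refs must not be empty")
    shared = {ref for ref in site_refs if ref in review_refs}
    review_mean = sum(review_refs.values()) / len(review_refs)
    if not shared:
        return {"coverage": 0.0, "bias": None}
    site_mean = sum(review_refs[ref] for ref in shared) / len(shared)
    return {
        "coverage": len(shared) / len(review_refs),
        "bias": site_mean - review_mean,
    }

# Hypothetical labels, invented for illustration only.
review = {"A": 1, "B": 1, "C": 1, "D": -1}
site = {"D", "X"}  # cites only the one opposing study, plus an unrelated source
print(citation_bias(review, site))
```

A bias near zero suggests that the cited subset mirrors the balance of the review, while a large negative or positive value points to selective citation in one direction.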
Once this information on the incidence of bias in a given field is collected in a longitudinal fashion, the effectiveness of public health and health marketing programs becomes measurable. For example, a media campaign addressing myths surrounding vaccination should lead to a change in the ratio of anti-vaccination postings to pro-vaccination statements, which in turn may be a predictor for changes in actual vaccination rates.
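Tracking such a ratio over time could look like the following sketch. It assumes the hard part, classifying each posting as "anti" or "pro", has already been done by some other means; the month keys, labels, and sample data are invented for illustration.

```python
from collections import defaultdict

def stance_ratio_by_month(posts):
    """Ratio of anti- to pro-vaccination postings per month.

    `posts` is an iterable of (month, stance) pairs, where stance is
    "anti" or "pro"; other labels are ignored. The ratio is None for
    months without any pro-vaccination posting."""
    counts = defaultdict(lambda: {"anti": 0, "pro": 0})
    for month, stance in posts:
        if stance in ("anti", "pro"):
            counts[month][stance] += 1
    return {
        month: (c["anti"] / c["pro"] if c["pro"] else None)
        for month, c in sorted(counts.items())
    }

# Hypothetical pre-classified postings, invented for illustration only.
posts = [
    ("2009-01", "anti"), ("2009-01", "anti"), ("2009-01", "pro"),
    ("2009-02", "anti"), ("2009-02", "pro"), ("2009-02", "pro"),
]
print(stance_ratio_by_month(posts))
```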
A final application area to be mentioned here is policy implementation and evaluation. As the management adage says, “one cannot manage what cannot be measured”, and the case for gathering infodemiology data can be predicated on measuring the progress towards policy objectives, for example policies which address health information and health communication specifically related to the quality of information for the public. For example, the US public health policy document “Healthy People 2010” [13] contained “[Increase of] quality of Internet health information sources” as an explicit policy objective (Objective 11-4). Other policy objectives (not from this document) may, for example, stipulate an increase of information written at a specific reading level, or an increase of culturally sensitive health information for certain population groups or in certain languages (eg, minority languages). In most of these cases, it is conceivable that infodemiology methods could be developed and used to obtain and track indicators that would measure the progress towards such policy goals.
Identifying and Aggregating Public-Health Relevant Information from Secondary Sources
Another class of “supply-side” based applications, for example the Global Public Health Intelligence Network (GPHIN), the HealthMap System, and the EpiSPIDER Project, analyzes selected secondary data sources, such as news reports and expert newsletters (ProMED mail), and aggregates public-health relevant information, in particular about infectious disease outbreaks [14]. These systems can be seen as tools for Open Source Intelligence (OSINT) collection. OSINT is intelligence that is “produced from publicly available information that is collected, exploited, and disseminated in a timely manner to an appropriate audience for the purpose of addressing a specific intelligence requirement” (Sec. 931 of Public Law 109-163, National Defense Authorization Act for Fiscal Year 2006). (Note that in this context “open source” refers to publicly available information, not to open source software.)
These systems usually use a more selective approach in terms of choosing high-quality, expert-curated secondary data sources, as opposed to systems such as the Infovigil system, which attempt to harness the “collective intelligence” of people on the Internet by analyzing noisier "primary data" on information supply and demand (eg, Twitter feeds or search and navigation behavior).
Identifying and Aggregating Public-Health Relevant Information on Social Networks
A final category of systems which could be developed would be systems which analyze and extract information from the Internet about the structure of social networks. For certain public health situations, especially in the case of an outbreak, but also for health marketing campaigns, it is advantageous to gather intelligence about the relationships between people. For example, it is conceivable that information on who knows whom from the friends list on Facebook may help to contain the spread of an infectious agent if public health professionals have ready access to this information. Obviously, "knowing" somebody, communicating with someone, or being a "friend" of somebody on Facebook does not necessarily mean that these people have physical contact. Hence, more advanced methods than just extracting the "friends list" from Facebook are required in order to be of use for public health.
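To make the idea concrete, the sketch below finds everyone within a given number of "friend" hops of an index person using a breadth-first search. The graph, the names, and the hop limit are invented for illustration, and, as noted above, an online friendship edge is not evidence of physical contact, so the output is at best a starting list for further investigation.

```python
from collections import deque

def contacts_within(graph, start, max_hops):
    """Breadth-first search over a friendship graph.

    `graph` maps a person to a list of friends. Returns a dict mapping
    each person reachable within `max_hops` to their hop distance from
    `start` (the start person is excluded)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(person, ()):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance

# Hypothetical friendship graph, invented for illustration only.
graph = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}
print(contacts_within(graph, "ann", 2))
```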