We analyzed misinformation about Ebola circulating on Twitter and Sina Weibo, the leading Chinese microblog platform, at the outset of the global response to the 2014–2015 Ebola epidemic to help public health agencies develop their social media communication strategies.
We retrieved Twitter and Sina Weibo data created within 24 hours of the World Health Organization announcement of a Public Health Emergency of International Concern (Batch 1 from August 8, 2014, 06:50:00 Greenwich Mean Time [GMT] to August 9, 2014, 06:49:59 GMT) and seven days later (Batch 2 from August 15, 2014, 06:50:00 GMT to August 16, 2014, 06:49:59 GMT). We obtained a 1% random sample of all tweets and analyzed those containing the keyword “Ebola.” We retrieved all Sina Weibo posts with Chinese keywords for Ebola for analysis. We analyzed changes in frequencies of keywords, hashtags, and Web links using relative risk (RR) and the χ2 feature selection algorithm. We identified misinformation by manual coding and categorizing randomly selected sub-datasets.
We identified two speculative treatments (i.e., bathing in or drinking saltwater and ingestion of Nano Silver, an experimental drug) in our analysis of changes in frequencies of keywords and hashtags. Saltwater was speculated to be protective against Ebola in Batch 1 tweets, but mentions of it decreased in Batch 2 (RR=0.11 for “salt” and RR=0.14 for “water”). Nano Silver mentions were higher in Batch 2 than in Batch 1 (RR=10.5). In our manually coded samples, Ebola-related misinformation constituted about 2% of Twitter and Sina Weibo content. A range of 36%–58% of the posts were news about the Ebola outbreak and 19%–24% of the posts were health information and responses to misinformation in both batches. In Batch 2, 43% of Chinese microblogs focused on the Chinese government sending medical assistance to Guinea.
Misinformation about Ebola circulated at a very low level globally on social media in both batches. Qualitative and quantitative analyses of social media posts can provide relevant information to public health agencies during emergency responses.
Communicating scientifically accurate information about an outbreak is important, because an informed public will likely be less susceptible to misinformation that could hinder outbreak control.1 Although social media have been used by public health agencies to communicate disease prevention information,2–4 rumors and alternate understandings of disease can circulate. Anecdotal evidence suggests that Twitter might have played a role in Nigeria's efforts to control Ebola5 at the outset of the 2014 Ebola outbreak, but the World Health Organization (WHO) noted rumors circulating on social media claiming that certain products or practices could prevent or cure Ebola virus disease.6 A 2014 study found that 55% of English-language tweets from Guinea, Liberia, and Nigeria during September 1–7, 2014, using the terms “Ebola” and “prevention” or “cure” contained medical misinformation.7
We analyzed information and misinformation about Ebola on Twitter—the world's largest microblogging service—and Sina Weibo—the leading microblogging platform in China—shortly after the WHO declaration of the Ebola outbreak as a Public Health Emergency of International Concern (PHEIC) on August 8, 2014.8 Our study focused on microblogging, an Internet-based self-publishing application that enables people to share user-created content or republish others' messages online. It has played a critical role in mass communication during crises, such as natural disasters.9–11
Microblog users' response was part of the public's reaction to the global threat of Ebola. Most people had insufficient knowledge about the disease and limited time and resources to access additional information sources to understand it.12,13 We used Twitter as a representative social media platform. We included Sina Weibo because Twitter is blocked in China, requiring microblog users in China to use alternative platforms.
Our primary research questions were (1) What Ebola-related information and misinformation (and their proportions) was circulated on two popular microblogging platforms (Twitter and Sina Weibo) in the two most commonly spoken languages (English and Chinese) on the day of the PHEIC announcement? and (2) What changes could be observed a week later? Our analysis may aid public health agencies in developing their social media communication strategies.
We collected microblog data from Twitter and Sina Weibo in two batches to document changes in Ebola-related microblog content one week after the WHO PHEIC announcement. Batch 1 was collected during the first 24 hours after the WHO PHEIC announcement12 (from August 8, 2014, 06:50:00 Greenwich Mean Time [GMT] to August 9, 2014, 06:49:59 GMT); Batch 2 was collected during a 24-hour period seven days later (from August 15, 2014, 06:50:00 GMT to August 16, 2014, 06:49:59 GMT).
Twitter had 271 million monthly active users worldwide in 2014.13 Tweets are seen as digital footprints for monitoring the public's health-related responses and behaviors.14,15 We retrieved Twitter users' publicly available tweets. We used the Twitter streaming Application Programming Interface (API) to sample Twitter data for monitoring the Ebola outbreak (details available upon request). We retrieved a random sample of 1% of all publicly posted tweets (n=4,366,946 tweets in Batch 1 and n=4,305,841 tweets in Batch 2). Of these, 4,844 tweets in Batch 1 and 2,001 tweets in Batch 2 (1,109 and 465 per one million tweets, respectively) contained the keyword “Ebola” (Figure).
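The per-million rates quoted above follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the function name is ours and is used only for illustration.

```python
def rate_per_million(matching: int, total: int) -> float:
    """Return the number of matching tweets per one million sampled tweets."""
    return matching / total * 1_000_000


# Counts reported in the text for the 1% random sample.
batch1_rate = rate_per_million(4_844, 4_366_946)
batch2_rate = rate_per_million(2_001, 4_305_841)

print(round(batch1_rate), round(batch2_rate))
```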
Sina Weibo had 156.5 million monthly active users as of June 201416 and is often used for digital epidemiology studies on Chinese social media.17 Despite the state's control of political information, Chinese online users can use Sina Weibo to speak with some autonomy on public health affairs.
Because Sina Weibo currently does not provide a streaming API service similar to that of Twitter, Chinese Ebola-related microblogs were obtained via Sina Weibo's Internet search engine. Two Chinese terms for Ebola, 伊波拉 and 埃博拉, were entered into the search engine. The results were then computationally captured page by page by a script (developed by one of the authors) based on R version 184.108.40.206 The script was programmed to run every 10 minutes during the Batch 1 and Batch 2 periods and was executed by two computers located separately in Hong Kong and Athens, Georgia, for data redundancy management. When data collection was completed, items with duplicated identity codes were discarded. Of the 7,645 microblog posts in Batch 1, 219 posts created outside the specified time frame and 49 posts containing the Chinese term for “Ebola” in their username but not in the body of the posts were excluded, leaving 7,377 posts for further text processing. Of the 3,416 microblog posts in Batch 2, 64 posts containing the Chinese term for “Ebola” in their username but not in the body of the posts were excluded, leaving 3,352 posts for further text processing. Finally, we parsed the text of the collected microblog messages and the time of posting and recorded them in a comma-separated value (CSV) file.
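The cleaning steps described above (discarding duplicated identity codes, posts outside the batch window, and posts that match only in the username) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' R script: the record fields (`id`, `created_at`, `user`, `text`) and the sample posts are hypothetical, and the keyword tuple assumes the two common Chinese renderings of “Ebola.”

```python
from datetime import datetime, timezone

# Batch 1 window as stated in the text (GMT), both bounds inclusive.
BATCH1_START = datetime(2014, 8, 8, 6, 50, 0, tzinfo=timezone.utc)
BATCH1_END = datetime(2014, 8, 9, 6, 49, 59, tzinfo=timezone.utc)
KEYWORDS = ("伊波拉", "埃博拉")  # assumed search terms


def clean_posts(posts, start, end, keywords):
    """Keep one copy of each post ID, inside [start, end], with a keyword in the body."""
    seen_ids = set()
    kept = []
    for post in posts:
        if post["id"] in seen_ids:
            continue  # duplicate captured by a repeated or redundant crawl
        seen_ids.add(post["id"])
        if not (start <= post["created_at"] <= end):
            continue  # created outside the batch window
        if not any(keyword in post["text"] for keyword in keywords):
            continue  # keyword matched the username only, not the body
        kept.append(post)
    return kept


def at(day, hour, minute):
    """Helper for building hypothetical timestamps in August 2014 (GMT)."""
    return datetime(2014, 8, day, hour, minute, tzinfo=timezone.utc)


# Hypothetical records, for illustration only.
sample = [
    {"id": "1", "created_at": at(8, 7, 0), "user": "a", "text": "埃博拉 疫情"},
    {"id": "1", "created_at": at(8, 7, 0), "user": "a", "text": "埃博拉 疫情"},
    {"id": "2", "created_at": at(7, 23, 0), "user": "b", "text": "埃博拉"},
    {"id": "3", "created_at": at(8, 9, 0), "user": "埃博拉news", "text": "hello"},
    {"id": "4", "created_at": at(8, 12, 30), "user": "c", "text": "伊波拉 病毒"},
]
cleaned = clean_posts(sample, BATCH1_START, BATCH1_END, KEYWORDS)
print([post["id"] for post in cleaned])
```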
The retrieved tweets were classified as English or non-English by using R;19 only English-language tweets were analyzed. Using R's text mining package,20 tweets were then stemmed (i.e., words were reduced to their roots, such as “tables” to “table”) and tokenized for keyword analysis. Tokenization is the process of segmenting a sentence into different units of meaning (words). Because of the linguistic difference between English-language tweets and Chinese microblogs, the Chinese microblogs were processed differently. Because written Chinese does not use spaces to separate the words in a sentence as English does, Chinese text in microblogs was segmented into phrases using the Viterbi algorithm21 implemented in the Jieba toolkit.22,23 For example, an original Chinese sentence such as
[English translation: Ebola virus disease (formerly known as Ebola hemorrhagic fever) is a severe, often fatal illness, with a death rate of up to 90% caused by Ebola virus, a member of the filovirus family.]
was segmented into phrases by the Viterbi algorithm as
A word-by-word translation in English would be: Ebola / virus / disease / formerly / called / Ebola / hemorrhagic fever / is / from / filial / virus /'s / Ebola virus / cause /'s/ one kind / severe / and / eventually / fatal /'s / disease / death rate/ as high as / 90 / %.
Candidate keywords for subsequent analysis were all unique keywords in the tokenized English-language tweets and segmented Chinese microblogs with more than three occurrences in the entire collection of tweets and microblogs. We extracted hashtags and Web links from the body of the tweets and the Chinese microblogs. The extracted Web links were mostly shortened URLs and were resolved by the cURL program24 to obtain the full Web links and domain names. We removed stop words, punctuation, emoticons, and special symbols used in microblogging (e.g., RT and @author_name) in the English-language and Chinese microblogs. We recorded the contents and time of posting for each microblog post in a CSV file.
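The extraction and filtering steps described above can be sketched in Python with the standard library. This is a simplified illustration, not the authors' R pipeline: the regular expressions, the stop-word list, and the sample tweets are assumptions, stemming is omitted, and the token pattern handles English text only.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
# Illustrative stop-word list; "rt" covers the retweet marker.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "rt"}


def extract(text):
    """Return (hashtags, links, keyword tokens) for one English-language post."""
    links = URL.findall(text)
    body = URL.sub(" ", text)  # remove links before looking for hashtags
    hashtags = [tag.lower() for tag in HASHTAG.findall(body)]
    body = MENTION.sub(" ", body)
    body = HASHTAG.sub(" ", body)
    tokens = [t for t in re.findall(r"[a-z]+", body.lower()) if t not in STOP_WORDS]
    return hashtags, links, tokens


def candidate_keywords(texts, min_count=4):
    """Keep keywords with more than three occurrences across the collection."""
    counts = Counter()
    for text in texts:
        counts.update(extract(text)[2])
    return {word for word, count in counts.items() if count >= min_count}


# Hypothetical tweets, for illustration only.
tweets = ["RT @who: Ebola outbreak declared an emergency #Ebola http://t.co/abc"] * 4
tweets.append("Stay safe from Ebola")

print(extract(tweets[0]))
print(sorted(candidate_keywords(tweets)))
```

Resolving shortened URLs to full domain names, which the text attributes to cURL, would require network access and is left out of this sketch.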
We analyzed seven-day changes in the relative frequencies of keywords, hashtags, and domain names of the shared Web links. The unit of analysis was a keyword, a hashtag, or a domain name. “Trending” signifies increasing usage of the item, and “fading” signifies declining usage. We evaluated the relative frequency of each item in Batch 1 and Batch 2 using the relative risk (RR) and χ2 feature selection algorithm.25
The χ2 feature selection algorithm (1) evaluated the strength of evidence for the hypothesis that the frequency of an item between the two batches of microblogs was different and (2) ranked terms by the χ2 value that measured how much the observed count was different from the expected count, assuming that occurrence of the item was independent of the batch of microblogs. High χ2 values indicate strong evidence for the existence of a difference between the observed value and expected value.26 Such measurements did not account for the direction of the difference (i.e., an item with a high χ2 value may have a higher frequency in Batch 2 than in Batch 1, or vice versa). We used RR to supplement the χ2 feature selection algorithm to indicate the direction of the relative frequency. The RR for an item i was calculated as:
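$$\mathrm{RR}_i = \frac{P_{i,\,\mathrm{Batch\ 2}}}{P_{i,\,\mathrm{Batch\ 1}}}$$

where $P_{i,\,\mathrm{Batch\ 1}}$ denoted the probability of tweets with item i in Batch 1 and $P_{i,\,\mathrm{Batch\ 2}}$ denoted the probability of tweets with item i in Batch 2, so that an RR>1 indicated a higher relative frequency in Batch 2. A 0.5 was added to both its denominator and numerator to correct for zero frequencies. In this analysis, trending items were those with a high χ2 value and an RR>1, while fading items were those with a high χ2 value and an RR≤1.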
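A minimal Python sketch of the two measures is given below. It rests on several assumptions that the text does not confirm: the 0.5 correction is applied to the item counts of each batch, the χ2 statistic is the Pearson statistic for a 2×2 table (item present or absent by batch), and a cutoff of 3.84 (the 5% critical value for one degree of freedom) stands in for a “high” χ2 value. The counts in the example are hypothetical.

```python
def relative_risk(count1, n1, count2, n2):
    """RR of Batch 2 relative to Batch 1, with 0.5 added to each item count (assumed)."""
    p1 = (count1 + 0.5) / n1
    p2 = (count2 + 0.5) / n2
    return p2 / p1


def chi_square_2x2(count1, n1, count2, n2):
    """Pearson chi-square for item presence/absence by batch (no continuity correction)."""
    a, b = count1, n1 - count1  # Batch 1: posts with / without the item
    c, d = count2, n2 - count2  # Batch 2: posts with / without the item
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # item is in every post or in none; no evidence of change
    return (n1 + n2) * (a * d - b * c) ** 2 / denominator


def label(count1, n1, count2, n2, threshold=3.84):
    """Classify an item as trending or fading; the threshold is an assumption."""
    if chi_square_2x2(count1, n1, count2, n2) < threshold:
        return "no clear change"
    return "trending" if relative_risk(count1, n1, count2, n2) > 1 else "fading"


# Hypothetical counts: an item in 40 of 1,000 Batch 1 posts and 4 of 1,000 Batch 2 posts.
print(relative_risk(40, 1000, 4, 1000), chi_square_2x2(40, 1000, 4, 1000))
print(label(40, 1000, 4, 1000), label(0, 1000, 10, 1000))
```

The second call to `label` shows why the correction matters: with a zero count in Batch 1, the uncorrected ratio would be undefined.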
We manually categorized microblog content under different themes to identify the information and misinformation contained therein. For manual coding, we randomly selected 5%–7% of the de-identified social media posts. In Microsoft® Excel® spreadsheets, for each Twitter tweet or Sina Weibo post, we assigned a random number between 0 and 100 inclusive (=RANDBETWEEN[0,100]). If the random number was ≤5, the microblog post was selected for manual coding. Because the proportion of random numbers generated that were ≤5 was not the same for each dataset, the manually coded dataset was 5%–7% of the original datasets: Twitter: Batch 1: n=299/4,844 (6.2%); Batch 2: n=116/2,001 (5.8%); Sina Weibo: Batch 1: n=469/7,645 (6.1%); Batch 2: n=207/3,416 (6.1%). After random selection, 17 of the 469 randomly selected Sina Weibo posts in Batch 1 were excluded because they were outside the time frame, leaving 452 manually coded posts.
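The selection rule above can be mimicked in Python to show why the sampled fraction falls near 6% rather than exactly 5%: the integers 0 through 100 give 101 equally likely values, six of which (0 to 5) are ≤5. The sketch below is an illustration only; the seed and function name are ours, and `random.randint` is used as a stand-in for the spreadsheet function.

```python
import random


def select_for_coding(n_posts, seed=0):
    """Return indices of posts whose random integer in [0, 100] is <= 5."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [i for i in range(n_posts) if rng.randint(0, 100) <= 5]


expected_fraction = 6 / 101  # values 0..5 out of 0..100
selected = select_for_coding(4_844)  # size of the Batch 1 Twitter dataset

print(f"expected {expected_fraction:.2%}, simulated {len(selected) / 4_844:.2%}")
```

Because each post is selected independently, the realized fraction varies from dataset to dataset, which is consistent with the 5.8%–6.2% range reported above.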
Our social media samples were first coded (i.e., categorized under different themes) by at least one coder and then recoded by the first author, who made the final coding decision. To assess reliability, a second coder coded a randomly selected 10% sample of our manually coded samples. Comparing the codes assigned by the second coder and the first author, the interrater agreement was moderate for Twitter data (Cohen's κ=0.58 for Batch 1 and κ=0.56 for Batch 2) and substantial for Sina Weibo data (Cohen's κ=0.66 for Batch 1 and κ=0.78 for Batch 2).27
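For readers unfamiliar with the statistic, Cohen's κ compares the observed agreement between two coders with the agreement expected by chance from each coder's category frequencies. The Python sketch below shows the calculation on hypothetical labels; it is not the study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two coders' marginal proportions per category.
    expected = sum(freq_a[cat] * freq_b[cat] for cat in freq_a) / n**2
    if expected == 1:
        return 1.0  # both coders used a single identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six posts.
first_author = ["news", "news", "health", "rumor", "news", "health"]
second_coder = ["news", "health", "health", "rumor", "news", "news"]
kappa = cohens_kappa(first_author, second_coder)
print(round(kappa, 3))
```

In this toy example the coders agree on four of six posts, but a sizable share of that agreement is expected by chance, so κ is noticeably lower than the raw agreement of 67%.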
Within each selected sample, the microblog posts were first categorized into English/Chinese posts and posts that were not English or Chinese. For Twitter data, the non-English, non-Chinese tweets (54 in Batch 1 and 29 in Batch 2) were excluded from further analysis, leaving 245 tweets in Batch 1 and 87 posts in Batch 2 for analysis. We found no significant difference in the proportion of English-language tweets between Batch 1 and Batch 2 (p=0.15). We manually categorized English/Chinese posts in our selected samples according to the following scheme (derived from a previously published scheme)28: news of the Ebola outbreak or cases; news of travel bans, border blockades, flight route suspensions, sports game bans, and travel advice; health education and information; alternative health information; responses to alternative health information; advertisement and entertainment; social issues; and others.
In our analysis of Sina Weibo samples, we created two categories for posts related to the Chinese medical team's departure for Guinea and for posts that reported news of a Sina Weibo user who allegedly spread rumors about a “suspected case in a hospital in Shanghai” and was detained by Chinese police. These news items were unique to Sina Weibo content. We performed Fisher's exact test and χ2 tests as appropriate to compare the number of posts in each category between the two batches for Twitter and Sina Weibo, respectively. The Twitter and Sina Weibo data were anonymized and de-identified prior to analysis.
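As a simplified illustration of Fisher's exact test, the Python sketch below computes a two-sided p-value for a 2×2 table from the hypergeometric distribution. This is an assumption-laden simplification: the comparisons described above involve several content categories at once, whereas this sketch collapses them into one category versus all others, and it is not the software the authors used.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= p_observed * (1 + 1e-9)
    )


# Alternative health information vs. all other coded tweets, by batch
# (6 of 245 tweets in Batch 1 and 2 of 87 tweets in Batch 2, as reported in the results).
p_value = fisher_exact_2x2(6, 245 - 6, 2, 87 - 2)
print(round(p_value, 3))
```

The small relative tolerance in the comparison guards against floating-point ties between equally likely tables.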
Key terms used in the WHO's PHEIC announcement were among the top 20 keywords list of the English and Chinese microblogs in Batch 1: “Ebola,” “virus,” “outbreak/epidemic,” “public health,” and “emergency” (Table 1).
Terms related to the WHO's PHEIC announcement, including “emergency event” (χ2=458.9, RR=0.01), “public health” (χ2=371.0, RR=0.10), or “announced” (χ2=232.0, RR=0.12), were fading in Batch 2 on Sina Weibo, while the term “declare” was fading in Batch 2 on Twitter (Table 2).
Two rumors featured in numerous posts, but both were confined to Twitter and did not appear on Sina Weibo. Saltwater appeared in Batch 1 tweets but faded in Batch 2 (salt: χ2=63.9, RR=0.11; water: χ2=39.7, RR=0.14). An experimental drug known as Nano Silver, which was backed by the Nigerian Ministry of Health but had no scientific evidence on efficacy, was on the top 10 trending Twitter hashtag list (χ2=1.7, RR=10.5) (Table 2).
Tweets reading “Ebola may be vastly underestimated,” reflecting a WHO assessment, made “underestimate” the top trending term on Twitter in Batch 2. In Chinese microblogs, the term “assistance to Africa” ranked at the top of the trending list in Batch 2 (Table 2).
We found a significant difference (Fisher's exact test, p=0.01) in content categories between Batch 1 and Batch 2. Of 245 tweets in Batch 1 and 87 tweets in Batch 2, alternative health information (i.e., information that is not in line with current scientific understanding of Ebola and its prevention and control) accounted for six tweets in Batch 1 and two tweets in Batch 2. We found a similar percentage of tweets on health education and information (i.e., information that is in line with current scientific understanding of Ebola and its prevention and control) (Batch 1: n=26, 10.6%; Batch 2: n=10, 11.5%) and tweets in response to alternative health information (Batch 1: n=32, 13.1%; Batch 2: n=10, 11.5%) (Table 3).
We observed a significant difference in the manually coded categories between randomly selected Sina Weibo posts from Batch 1 and Batch 2 (Fisher's exact test, p<0.001). Of 452 posts analyzed in Batch 1 and 207 posts analyzed in Batch 2, alternative health information accounted for 11 posts in Batch 1 and three posts in Batch 2. The percentages of posts about health education and information were similar between the two batches (Batch 1: n=55, 12.2%; Batch 2: n=27, 13.0%). Responses to alternative health information accounted for 43 (9.5%) posts in Batch 1 and 13 (6.3%) posts in Batch 2 (Table 3).
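The shares of alternative health information implied by these counts can be checked with a few lines of Python. The sketch below pools the two batches for each platform, which is our own summary choice rather than a figure reported in the tables.

```python
def percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Alternative health information among manually coded posts (counts from the text).
twitter_share = percent(6 + 2, 245 + 87)
weibo_share = percent(11 + 3, 452 + 207)

# Spot checks of percentages quoted in the text.
twitter_health_b1 = percent(26, 245)
weibo_response_b2 = percent(13, 207)

print(twitter_share, weibo_share, twitter_health_b1, weibo_response_b2)
```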
In the manually coded subset, alternative health information about Ebola in Batch 1 included two posts on “bathing in salt water and then drinking it”; one post that the Ebola virus came from space; one post that “crystal meth can cure Ebola”; one post that Ebola arose from human cannibalism; and one post to “tie a palm leaf and a red cloth round your head and your waist, and then dance around any banana tree” to stop Ebola. Alternative health information in Batch 2 included two posts on news about the use of Nano Silver as an Ebola treatment in Nigeria.
In the manually coded subset, alternative health information about Ebola on Sina Weibo in Batch 1 included three posts with scientific comments that contained mistakes (e.g., “from its discovery to today, Ebola had only caused havoc in West Africa”), two posts that advocated homeopathy, and six posts that said traditional Chinese medicine could be used to treat or prevent Ebola. Two posts advocating traditional Chinese medicine criticized the Western medical idea of Ebola virus causing the Ebola virus disease. That particular comment drew 25 posts containing criticism and sarcastic comments from Sina Weibo users. In Batch 2, examples included two posts on traditional Chinese medicine and one post that smoking tobacco can prevent Ebola infection.
Our finding that about 2% of microblogs contained alternative health information in August 2014 contrasts with a September 2014 study that focused on English-language tweets from Guinea, Liberia, and Nigeria and found that 55% of Ebola-related tweets contained medical misinformation.7 Whereas that study focused on three African countries affected by the outbreak, we sampled our microblog contents without geographical restrictions. Moreover, such rumors might have been circulated before or after our sample collections and via other routes or platforms (e.g., word of mouth).
Our findings that 47%–58% of the posts were Ebola-relevant news stories and 19%–24% were health information and responses to alternative health information in both batches are compatible with another analysis of Ebola-related Twitter data from July 24 to August 1, 2014. The four main topics identified in that study were Ebola risk factors, prevention education, disease trends, and prayer for countries in Africa. The authors interpreted their findings as evidence of knowledge gaps about Ebola as relevant information was being provided and sought.29
A study that analyzed 85 of the top 100 YouTube videos on Ebola on December 9, 2014, found that 54 (63.5%) videos had misleading information and 31 (36.5%) videos were useful. However, the study found that the number of views of misleading videos was significantly higher than those of useful videos (p=0.005).30,31 In contrast, an analysis of 118 videos screened on November 1, 2014, found that 31 (26.3%) YouTube videos were misleading and 87 (73.7%) were useful.32 An analysis of the 100 most widely viewed Ebola-related YouTube videos as of December 9, 2014, found that 36 (36%) reported CDC-described transmission routes of Ebola.33 Such differences might reflect differences in criteria for content categorization but also the timing of data collection, because social media content evolves continuously.
Two other studies reported an increase in the volume of Ebola-related Twitter traffic during October 2014, when domestic cases were reported in the United States.34,35 One exploratory study counted the frequency of keywords associated with different emotional states among Ebola-related tweets.34
We found that key terms used in the WHO's PHEIC announcement were among the top 20 keywords list of the English and Chinese microblogs in Batch 1. This observation might suggest that social media platforms helped disseminate the key WHO messages immediately after the announcement. Our observation that the word “China” occurred more frequently than “Ebola” in Batch 2 of Chinese microblogs was not surprising. Because Chinese government-run media channels emphasized China's anti-Ebola efforts, the term “China” was the most frequently used term in Batch 2 of Chinese microblogs.
Our observation of the increased ratio of Sina Weibo bloggers using hashtags from Batch 1 to Batch 2 might be attributable to hashtags created by some news outlets or the service provider Sina Weibo (e.g., a hashtag of a promotional campaign reading “Sina News to share with prizes”).
Although many studies attempted to use digital big data (including social media data) to detect outbreaks36 or to estimate or forecast disease incidence,37–39 their successful application in public health practice faces technical challenges.40–42 Communication surveillance is becoming an important application of social media data in public health surveillance.43 Communication surveillance includes both surveillance of general awareness of certain diseases28,44 and monitoring of reactions to public health messages or campaigns.2,45
The low proportion of microblogs with alternative health information at the onset of the global response to the 2014–2015 Ebola outbreak mirrors results from studies during the 2009 influenza pandemic, when only about 2% of tweets were seen as misinformation.46 We found that most information on social media came from mainstream news agencies, which generally report information from public health agencies. Our findings also indicate a contextual difference between a free and open online platform and a state-regulated online platform in the contents of Ebola-related microblogs. China's Internet market is controlled by the government,47 which explains why posts related to misinformation or rumors were not widely observed among Chinese microblogs, whereas misinformation (e.g., saltwater, Nano Silver) was freely distributed on Twitter. For the same reason, topics of discussion on Twitter were more diverse. These findings suggest that a free and open online platform may enable the dissemination of unofficial and unconfirmed information but also pluralistic views. Although censoring allows governments to control rumors and alternative information, it can put the society at risk of a potential government cover-up, as in the initial denial of the 2003 severe acute respiratory syndrome outbreak in China.48
The strength of manual coding is the understanding and interpretation of the context and meaning of the content. We compensated for our weakness of having a single manual coder by having a second coder code 10% of the sample, with moderate to substantial reliability. Our random sample (about 4.5 million tweets per day) constituted about 1% of the Twitter universe. Although we did not retrieve all Ebola-related tweets, our sample was representative. Our Twitter data retrieval method allowed us to calculate the incidence rate of tweets that were Ebola-related across a period of time.
Unlike in the United States and China, where social media are important channels of disseminating outbreak information,4,43,44 the areas most affected by the outbreak are likely to use traditional means of communication to disseminate misinformation. Therefore, the low proportion of misinformation on social media might not reflect the rumors circulating among the public in West Africa, where the epidemic occurred. Additionally, we did not code microblogs written in languages other than English and Chinese. Future studies analyzing tweets in other languages might help confirm the external validity of our findings. Furthermore, Twitter and Sina Weibo may not be representative of the global population, because users are generally young and educated.49,50
A small percentage of Ebola-related microblogs contained misinformation; most contained outbreak-related news and scientific health information, echoing the Nigerian success in Ebola health communications via Twitter.5 Further studies of health information dissemination on government-censored social media47 can serve as a comparison to studies performed on uncensored platforms. A future retrospective longitudinal study of Ebola-related information on microblogs will allow us to investigate how the volume and contents of misinformation changed during this outbreak. Analyzing the sources of misinformation and understanding the process by which rumors are created and circulated through re-tweeting will inform effective public health communication strategies on social media.
The authors thank Sidi Liu and Zhaochong Liu for their technical assistance in the retrieval of Twitter and Sina Weibo data; Tsun Chan and Angela An Chang Cheung for their preliminary manual coding of Weibo posts; Po-Ying Lai for serving as the second coder; and Manoj Gambhir, PhD, and Karen K. Wong, MD, for their helpful discussion.
Isaac Chun-Hai Fung and Zion Tsz Ho Tse receive salary support from the Centers for Disease Control and Prevention (CDC #15IPA1509134 and #16IPA1619505). This project was not funded by CDC and CDC had no role in the study design, data analysis, or writing of this article. This article does not represent the official position of CDC or the U.S. government.
Our protocol of data processing and de-identification was approved by the Human Research Ethics Committee for Non-Clinical Faculties, The University of Hong Kong, and the Georgia Southern University Institutional Review Board (H14167).