Use of data generated through social media for health studies is gradually increasing. Twitter is a short-text message system developed 6 years ago, now with more than 100 million users generating over 300 million Tweets every day. Twitter may be used to gain real-world insights to promote healthy behaviors. The purposes of this paper are to describe a practical approach to analyzing Tweet contents and to illustrate an application of the approach to the topic of physical activity. The approach includes five steps: (1) selecting keywords to gather an initial set of Tweets to analyze; (2) importing data; (3) preparing data; (4) analyzing data (topic, sentiment, and ecologic context); and (5) interpreting data. The steps are implemented using tools that are publicly available, free of charge, and designed for use by researchers with limited programming skills. Content mining of Tweets can contribute to addressing challenges in health behavior research.
Use of data generated through social media for health studies is gradually increasing. Because of their growing pervasiveness, social media have the potential to support the collection and analysis of health-related data in real time in the real world.1,2 One social medium that has shown exponential growth is Twitter, a micro-blogging service with a “what’s happening” prompt and an allowance of 140 characters per “Tweet.” People use Twitter to share their momentary feelings, observations, activities, and daily lives with others.
Twitter usage statistics report 140 million active users generating 340 million Tweets on average per day as of March 2012.3 Tweet content has been used not only as a rapid and inexpensive way to glimpse public opinion in general,4 but also within the health domain for purposes such as monitoring diseases5,6 and delivering health care.7,8 Despite the growing attention to analyzing user-generated content from social media, most health researchers have little knowledge about how to apply content-mining methods.
Applying content-mining methods to social media in order to study health behaviors is important because gaining a full understanding of such behaviors has been difficult due to their complexity. Tweets are a source of real-time, real-world data about health behaviors, and they share characteristics with traditional methods of ecologic momentary assessment that simultaneously capture a behavior and allow individuals to report their current activity, location, and social surroundings at any particular moment.9 However, unlike most ecologic momentary assessment methods, Tweet contents are not dependent on a specific intermittent stimulus to the intended respondent. Thus, Tweets may represent more-naturalistic content and have the additional advantage of being available in large volume. This paper describes a practical approach to analyzing Tweet contents and illustrates application of the approach to physical activity, a substantial and challenging public health issue and a health behavior of interest to many researchers.10
Web mining focuses on the discovery of meaningful knowledge from data such as online mailing lists, blogs, and social media and includes analysis of structure, usage and content.11 Web content mining aims to extract and analyze useful information (e.g., opinions, sentiment, main topics) from web content by applying techniques from multidisciplinary fields including data mining, machine learning, natural-language processing, information retrieval, and statistics. Following the traditional framework of general data mining, a typical content-mining process11 includes preparing data so they can be imported and read in data-mining software, reducing the dimensionality of data, applying classic data-mining techniques, and terminating or iterating the process according to interpretation. Dodds and Danforth12 and Kleinberg13 have reported how to mine social media content using natural-language processing. However, unlike the practical approach presented here, those methods require understanding of sophisticated large-scale computing methods.
The practical steps of Tweet content mining are illustrated in Figure 1 (additionally, the Sidebar provides a condensed example of these steps to illustrate how the process could be used for the content area of obesity): (1) selecting keywords; (2) importing data; (3) preparing data; (4) analyzing data; and (5) interpreting data (Appendix A, available online at www.ajpmonline.org).14 Preliminary steps are to obtain review for human subjects research and to identify specific research questions. For the physical activity example, the analysis met federal criteria for Human Subjects Exemption. Research questions included: (1) What is the content of Tweets that mention specific physical activities?; (2) Does Tweet content vary by specific physical activity?; (3) Does Tweet content change over time?; (4) What proportion of Tweets that mention specific physical activities express positive as compared to negative sentiments?; and (5) How is context expressed in Tweets that mention specific physical activities?
Initially, domain experts analyze a concept to be studied and identify appropriate key terms and phrases (e.g., synonyms/morphologic variants) for extraction of Tweets to create the analytic corpus. For physical activity, 17 diverse activities (e.g., aerobics, jogging, swimming), defined by ‘mypyramid.gov,’ were selected as key phrases.
In the second step, Tweets are imported by searching the selected terms and phrases via Tweet import tools. For the current study, NodeXL was used, a publicly available open-source Microsoft Excel template for creating a Tweet data corpus. Unlike other tools, NodeXL offers convenient data manipulation for users without programming skills by facilitating searching Twitter for public Tweets and importing the Tweets as an Excel file. Up to 1000 Tweets for each activity were randomly imported via NodeXL for each of 12 weeks from Week 1 of March to Week 4 of May 2010 (total 174,394 Tweets) to create the initial analytic corpus.
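The sampling design described above (up to 1000 Tweets per activity per week) can be sketched in a few lines of Python. This is only an illustration of capped random sampling, not a description of how NodeXL works internally; the function name `sample_weekly`, the `(activity, week, text)` record layout, and the fixed seed are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_weekly(tweets, cap=1000, seed=0):
    """Group (activity, week, text) records and keep at most `cap`
    randomly chosen Tweets per (activity, week) pair."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for activity, week, text in tweets:
        groups[(activity, week)].append(text)
    # random.sample draws without replacement; never ask for more than exist
    return {key: rng.sample(texts, min(cap, len(texts)))
            for key, texts in groups.items()}
```

Groups with fewer Tweets than the cap are returned whole, which is one plausible reason a corpus total can fall below the nominal maximum.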
Text cleaning. The preparation step includes text cleaning, text transformation to generate attributes, and reduction of dimensionality through attribute selection. Compared to other genres of documents such as news stories or traditional webpages, the linguistic characteristics of Tweets are noisy, due to use of a variety of languages, format/signs, unstructured grammar, and unofficial abbreviations. Thus, it is important to remove all nonstandard characters or special characters that would hinder the use of content-mining tools. For instance, Weka 3.7.1, a popular data-mining tool (www.cs.waikato.ac.nz/ml/weka/), has a special use for the typographic symbol for quotation (“). Consequently, quotation marks must be removed from Tweets prior to analysis.
Other typographic symbols that must be removed to ensure readability as CSV (comma-separated value) data for import into Weka include apostrophes and single quotation marks (‘ ’); commas (,); and semicolons (;). Other symbols (e.g., ^, @, space, line feed) and unnecessary strings (e.g., www, http://*) can also be removed in this step. Symbols can be removed with the replace function of an open-source code editor (e.g., Notepad++) or Microsoft® Word.
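For readers who prefer a script to an editor's replace function, the cleaning rules above can be approximated with regular expressions. The following is a minimal sketch, not the procedure used in the study: the pattern names and the function `clean_tweet` are hypothetical, and collapsing runs of whitespace (rather than deleting every space) is an assumption made so that words stay separated.

```python
import re

# Web addresses such as http://... or www....
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
# Straight and curly quotation marks, apostrophes, commas, semicolons, ^ and @
SYMBOL_PATTERN = re.compile(r"[\"'“”‘’,;^@]")
# Spaces, tabs, and line feeds
WHITESPACE_PATTERN = re.compile(r"\s+")

def clean_tweet(text):
    """Remove URLs and CSV-hostile symbols, then normalize whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = SYMBOL_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()
```

URLs are removed before the symbols so that punctuation inside a web address does not leave fragments behind.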
In the text-transformation step, Tweet contents are represented as a vector of features. The simplest features are the individual words composing the Tweet, and the associated feature values are the frequencies of word occurrences in the Tweet.15 Other examples of features are N-grams and numeric features, such as the length of a Tweet.
N-grams were used in the current physical activity example. An N-gram is a subsequence of N items in a given sequence, where the sequence items or grams can be anything from characters to words. Given the phrase “physical activity burns,” there are three unigrams (“physical,” “activity,” and “burns”); two bigrams (“physical activity” and “activity burns”); and one trigram (“physical activity burns”).
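The definition above translates directly into code. The sketch below generates word-level N-grams with a sliding window; the function names `ngrams` and `all_ngrams` and the whitespace tokenization are assumptions for illustration, not part of the tools used in the study.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def all_ngrams(text, max_n=3):
    """Return unigrams, then bigrams, then trigrams (up to max_n)."""
    tokens = text.split()  # simple whitespace tokenization
    return [gram for n in range(1, max_n + 1) for gram in ngrams(tokens, n)]
```

Applied to “physical activity burns,” `all_ngrams` yields the three unigrams, two bigrams, and one trigram listed above. A sequence of k words produces k + (k − 1) + (k − 2) such terms when k is at least 3.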
Because a combination of unigrams, bigrams, and trigrams has been reported to be effective,16 the authors used this combination in the physical activity content analysis. The Tweet term-frequency dictionary was computed by the N-gram method from the corpus of 174,394 publicly available Tweets that were imported from twitter.com/ via NodeXL. Each unigram, bigram, and trigram generated is considered one attribute.
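A term-frequency dictionary of this kind can be sketched as a counter over all N-grams in the corpus. The example below is an assumption-laden illustration rather than the study's Weka workflow: `term_frequencies` is a hypothetical name, and lowercasing and whitespace tokenization are choices made here for simplicity.

```python
from collections import Counter

def term_frequencies(tweets, max_n=3):
    """Count every unigram, bigram, and trigram across a list of Tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

Each key of the resulting counter corresponds to one attribute, and `counts.most_common(10)` would list the ten most frequent terms.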
Given that many attributes can be generated from a single sentence (e.g., “I am swimming with my sister” generates 15 attributes: six unigrams, five bigrams, and four trigrams), and that it is more difficult to algorithmically process large data sets with high dimensionality,11 it is typically necessary to reduce the dimensionality of a data set by decreasing the number of attributes. The authors applied two methods for reducing dimensionality: removal of stop words and stemming. Stop words (i.e., words that are very common in any document, such as “the” and “to”) have little informational content, are unlikely to help with text mining, and can be removed.11,17 Individual stop words (e.g., of, a, an, for) were removed from the dictionary to reduce dimensionality, resulting in a 15%–20% data reduction among various physical activities, but stop words within phrases (e.g., “for” in “physical activity for,” or “of” in “physical activity of”) were retained. Stop words were removed using the stop-word removal function in Weka.18 The number of features ranged from 500 to 1000 per 1000 Tweets per physical activity. Increasing the Java heap size allows Weka to handle a larger number of features than was required for the physical activity analysis.
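The rule described above (drop stand-alone stop words, keep phrases that contain them) can be sketched as a simple filter over dictionary terms. The stop-word list and the function name `drop_stop_word_unigrams` below are illustrative assumptions; the list used by Weka is considerably longer.

```python
# A tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"the", "to", "of", "a", "an", "for"}

def drop_stop_word_unigrams(terms, stop_words=STOP_WORDS):
    """Remove terms that are exactly a stop word.

    Bigrams and trigrams such as "physical activity for" are kept,
    because the whole phrase is compared, not its individual words.
    """
    return [term for term in terms if term.lower() not in stop_words]
```

Because the comparison is against the whole term, no special handling is needed to protect stop words inside phrases.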
Stemming reduces dimensionality by identifying a word root and removing suffixes and prefixes from different word forms. For example, the two words “exercising” and “exercised” can be stemmed to “exercise” and, thus, the two features can be merged into one attribute representing all the features with “exercise” as a stem. Stemming has the disadvantage of discarding linguistic information.11,19 However, for the physical activity example, stemming was necessary so that the dictionary was not too large to be processed by Weka. There are three different types of stemming algorithms: affix removing, statistical, and mixed. Porter’s algorithm, an affix-removal approach, is the most popular and standard approach because it is concise and efficient20 and was applied through “snowball stemmers” within Weka.21
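To make the merging effect of stemming concrete, the sketch below strips a handful of common English suffixes and sums the counts of terms that share a stem. It is a deliberately naive affix-removal illustration and is not Porter's algorithm or the Snowball stemmer in Weka; the suffix list, the minimum stem length, and the names `naive_stem` and `merge_by_stem` are all assumptions. As with most affix-removal stemmers, the stems it returns are truncated forms (e.g., “exercis”) rather than dictionary words.

```python
from collections import Counter

# Checked in order, so longer suffixes are tried before shorter ones.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "class" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def merge_by_stem(counts):
    """Sum the frequencies of all words that share a stem."""
    merged = Counter()
    for word, freq in counts.items():
        merged[naive_stem(word)] += freq
    return merged
```

In this sketch, separate counts for “exercising,” “exercised,” and “exercise” collapse into a single attribute, which is the dimensionality reduction the text describes.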
In this step, preprocessed data are analyzed in order to discover patterns such as “hot topics” or sentiments. Three approaches were used to discover patterns in the physical activity Tweet corpus: topic detection, sentiment analysis, and categorization of ecologic momentary context.
To detect and summarize topics, classic data-mining techniques are used on the structured data matrix that resulted from the previous stages. These include descriptive statistics (frequency counts); visualization; classification; and clustering.14 The frequency of terms can be compared across diverse physical activities with vector values. Frequently occurring terms can be visualized with two-dimensional (2D) graphs and with three-dimensional (3D) motion charts to visualize data trends over time (www.excelcharts.com/blog/google-motion-chart-api-visualization-population-trends/). A keyword-only approach will create noise, and the level of noise may vary according to the concept studied. For example, the Tweet corpus of ‘run’ contains more noise than ‘swim’, ‘weight lifting’ or ‘chopping wood.’
Classification is used to predict which category a new observation belongs to among a set of predefined categories. In contrast, clustering is an unsupervised learning approach, used to aggregate data into groups that are meaningful or useful.22 Because no a priori categorization of Tweet contents existed, the clustering data-mining technique was tested in the current study to summarize physical activity Tweet contents. However, because the technique was overly reliant on investigator interpretation for the corpus, only descriptive statistics (frequency counts) and visualization using 2D and 3D motion charts were used to summarize the Tweet corpus. Chi-square statistics were also applied to find the terms that were discriminative of a particular activity using the ‘ChiSquaredAttributeEval’ mining algorithm in Weka.
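The idea behind chi-square term selection can be illustrated with a 2×2 table that cross-classifies Tweets by activity (target vs. all others) and by whether they contain a given term. The sketch below computes the standard Pearson chi-square statistic for such a table; it is not a reimplementation of Weka's ‘ChiSquaredAttributeEval’, and the function name and argument layout are assumptions.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table of Tweet counts.

    a: target-activity Tweets containing the term
    b: target-activity Tweets without the term
    c: other-activity Tweets containing the term
    d: other-activity Tweets without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty; the term cannot discriminate
    return n * (a * d - b * c) ** 2 / denominator
```

For instance, a term found in 30 of 100 Tweets for one activity and in 10 of 100 Tweets for the others gives a statistic of 12.5, whereas equal rates in both groups give 0. Larger values mark terms that are more distinctive of the target activity.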
Recently, there has been more research in computational linguistics and machine learning about sentiment analysis, the automated detection of opinions or attitudes in text.23 Sentiment analysis extracts subjective information about a topic or a document by applying computational analytic techniques. The authors relied on a sentiment analysis tool (twittersentiment.appspot.com/) that categorizes Tweets as positive or negative. Accuracy >80% has been reported for other corpora.24 Tweets were qualitatively examined for aspects of context typically assessed in ecologic momentary assessment: time, purpose, environment, social context, and feeling.
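Once each Tweet carries a positive or negative label from a sentiment tool, the proportion of positive Tweets per activity is a simple tally. The sketch below assumes the labels are already available as `(activity, label)` pairs with the literal strings "positive" and "negative"; that layout and the function name `positive_share` are hypothetical, and the classification step itself is not shown.

```python
from collections import Counter, defaultdict

def positive_share(labeled_tweets):
    """Return the fraction of positive Tweets for each activity."""
    tallies = defaultdict(Counter)
    for activity, label in labeled_tweets:
        tallies[activity][label] += 1
    shares = {}
    for activity, tally in tallies.items():
        total = tally["positive"] + tally["negative"]
        if total:  # skip activities with no usable labels
            shares[activity] = tally["positive"] / total
    return shares
```

Running the same tally separately for each week would give the kind of time trend in sentiment that the research questions ask about.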
In the interpretation step, the decision is made to terminate the content-mining process or iterate, and domain experts play a critical role. When domain experts decide results are interpretable according to their study’s scope and aims, the processes are terminated.25 Conversely, when the results are not satisfactory, one must return to the previous iteration.11 In the physical activity example, domain experts terminated the process because the study scope was to investigate the spring season. An alternative decision could have been to return to Step 2 in order to import Tweets in a different season (e.g., winter) to increase generalizability.
Results are organized according to the five research questions: (1) What is the content of Tweets that mention specific physical activities?; (2) Does Tweet content vary by specific physical activity?; (3) Does Tweet content change over time?; (4) What proportion of Tweets that mention specific physical activities express positive as compared to negative sentiments?; and (5) How is context expressed in Tweets that mention specific physical activities?
The Tweet term-frequency dictionary, computed with N-gram-based text processing from the corpus of 174,394 Tweets, contained 31,489 terms. Table 1 shows the frequency of six sample terms. The most frequently occurring unigram, bigram, and trigram for each of the physical activity terms are described in Table 2. For example, “class,” “aerobic class,” and “water aerobics class” were the most frequent unigram, bigram, and trigram in Tweets that mention aerobics.
The term “good” appeared across all activities, whereas “obesity” occurred only for a few activities. The most distinct terms calculated by chi-square test were “but should” (light workout), “mountain” (hiking), and “the basics” (bicycling). Interactive trends graphs (Appendix B, available online at www.ajpmonline.org) show the selected distinct terms that frequently appeared in Tweets that mention physical activity over the course of 12 weeks; those distinct terms include: student, women, unintentional physical activity, breast cancer prognosis, arthritis walk season, cell phone, improve symptoms gerd, everyone benefits, and CDC.
The snapshots of 3D motion charts (Figure 2) display 12-week trends of Tweets that mention bicycling or physical activity; in April 2010, “bike friendly” appeared as the most frequent distinct term in Tweets that mention bicycling. The interactive trends graph (Appendix B, available online at www.ajpmonline.org) shows how the contents of Tweets that mention physical activity changed over time. Student appeared frequently from Week 2 to Week 3 in March. Normal bm (bowel movement) and women occurred from late March to early April. Unintentional physical activity was posted briefly, for a few days in early April. Breast cancer prognosis appeared frequently in mid-April for 10 days, followed next most often by cell phone. In early May, improve symptoms gerd appeared frequently, followed by everyone benefits. CDC was frequently discussed around the last week in May.
For the physical activity data set during the period of July 21 to August 16, 2010, most Tweets reflected positive attitudes, with bicycling (77%) as the top-ranked category. Tweet categories that reflected more than 40% negative sentiments were: walking fast, running, weight-lifting, physical activity, basketball, yard work, weight training, and jogging. Tweets mentioning hiking, golf, dancing, and swimming showed consistent positive sentiments, whereas others varied over time. Dancing-related Tweets were overwhelmingly positive (76%; n=84,217), and the associated trend graph shows that the relationship of positive and negative attitudes did not fluctuate (Appendix C, available online at www.ajpmonline.org).
Tweets that mention outdoor activities, such as hiking and yard work, contained information about physical contexts such as weather condition, trails, seasonal condition, and time of day (Table 3). In addition, Tweets that mention running provided detailed information about the activity (e.g., miles run, duration—hours/minutes/seconds, intensity—fast/slow). Emotional context (e.g., emotional obligation, feelings) was also prevalent in some Tweets. Tweets revealed social context (e.g., with my dad, friend, sister).
The intent of the current paper was to introduce a simple and easy-to-apply method for mining Tweet contents, and to illustrate application of the content-mining pipeline to gain insights for health behavior research. All tools used for data collection and analysis are publicly available and free of charge. Further, this study introduced a 3D motion chart as a visualization strategy for large volumes of mining results.26 As opposed to traditional 2D visualization methods, a 3D motion chart was able to succinctly merge and present 12 weeks of Tweet topics within one chart.
Key challenges in the health behavior research field include: (1) assessment of complex health behaviors; (2) health-promoting behavior intervention design; and (3) motivation for health-promoting behaviors. The authors believe that the methods described may provide insights and address challenges for health behavior research. First, the study findings support the applicability of Twitter as an ecologic momentary assessment tool by demonstrating the relevance of naturally generated Tweet contents as a source of health behavior data. Such an approach may overcome the recognized limitations of self-reports of health behavior27 and methods that include stimulation to generate responses.
For example, Tweets revealed situational momentary context prior to and right after physical activity. Analysis of frequently occurring terms provided situational context such as purpose (e.g., to build muscle); time (e.g., now, today); social context (e.g., gym with); environment situation (e.g., water, trail); feeling (e.g., felt great, love, hungry); and post-activity plans (sleeping, eating). Surprisingly, Tweets also captured fairly detailed measurement information such as the amount of calories burned (“157 calories burned”) and distance covered (“ran 2.02 miles”).
Second, the simple methods of topic detection may help health behavior researchers who have minimal programming skills utilize the detected topics for health behavior intervention design. In the physical activity content-mining case, distinct term lists indicate that some phrases appeared uniquely for a specific physical activity. Given that designing effective physical activity interventions is a considerable challenge,28 frequently occurring distinct terms may suggest intervention strategies. For example, the basics, traffic, and safety were frequently occurring distinct terms in bicycle-related Tweets. Researchers, government officials, and health providers can harness such distinct terms for designing interventions that promote bicycling. In a similar way, terms related to time or distance measurement were distinctly common among Tweets that mention running, thus suggesting that these measures may be particularly important for interventions that incorporate running.
Finally, one of the biggest challenges in the health behavior research field is motivation for changing health behaviors. The authors observed that communication moved between levels within the Tweet corpus. Although Tweet contents appeared at individual, community, and organization levels, the content of the three different levels of conversations is not independent. For example, when Google formally announced the new release of a bicycling map (i.e., organization level), Twitter users freely expressed their personal feelings with diary-like posts (individual level), and small group discussions (community level) occurred. In other words, the observations suggest that the Twitter medium was able to transform formal information into informal conversation, thus providing preliminary evidence that Twitter has the potential to support transformation of formal informational materials into conversations that may motivate behavior change.
There are several limitations to the authors’ application of content-mining methods to study the health behavior of physical activity that are applicable to other uses. The primary limitation of this study is its limited generalizability as a result of the characteristics of Twitter users (e.g., mainly young adults) and other factors (e.g., various languages having diverse linguistic structures, different Tweeting culture). The unit of analysis was the Tweet in the virtual Twitter community. However, each Tweet has a different probability of occurring in the data set, and its value is unknown. Further, some Tweets may be missed in a search, due to the informal language used in Tweets. To minimize this issue, hash-tag lists (twitter.com/toptweets) were checked to avoid missing large volumes of Tweets referring to physical activity terms (informal terms or abbreviations) other than the search terms.
The findings are also limited due to the accuracy of the sentiment analysis tool used in this study. Although the accuracy of the tool is reported as being more than 80% according to its developers,24 some experts in this field have expressed their concerns about the accuracy of sentiment analysis.29 Further, Twitter users might have had a tendency to provide information that they believed to be consistent with social norms and expectations, and to over-report engagement in health-enhancing physical activity.30–32 Last, it should be emphasized that the presented methods are simply based on frequency of the words to describe a phenomenon of interest. Other sophisticated natural-language tools and weighting techniques are necessary to provide a deeper understanding of the semantics of the Tweets and a representative sample.
This application of text-mining and sentiment analysis methods to analyze physical activity–related Tweets enhanced understanding of physical activity behaviors and their associated situational contexts. Such approaches offer an alternative to traditional self-reports and ecologic momentary assessments for capturing health-related behaviors.
The authors thank Drs. Mary W. Byrne, Elizabeth Cohn, PoYin Yen and Jacqueline Merrill for their contributions to the dissertation research on which this article is based.
The study was supported by T32NR007969. Article preparation was also supported by R01 HS019853.
No financial disclosures were reported by the authors of this paper.