Mechanical Turk Setup
Amazon Mechanical Turk facilitates several steps in a crowdsourcing-based study, in particular publishing the task, recruiting participants, collecting the data, and compensating workers. Data preparation and response quality assessment need to be done offline by the researcher.
The first step in setting up a crowdsourcing task is to create and fund a Mechanical Turk account. There is no cost associated with setting up an account, but funds to compensate the workers and to pay the nominal fees charged by the website need to be paid into the account in advance.
The next step involves setting up the task to be performed by workers. The study designer needs to define the overall task, break it up into microtasks (small tasks that can quickly be performed by an individual worker), formulate the instructions to workers, and prepare the data associated with each task (such as text or images to be annotated, or survey questions). This is done offline, using in-house tools. The Mechanical Turk infrastructure is then used to determine the design of the HIT webpage presented to workers, as well as task and worker attributes, and to upload the data to Mechanical Turk. The desired number and type of HITs are then created automatically from the uploaded data and design template, and they are offered to workers meeting the specified attributes. For example, a researcher might want to annotate 100 different paragraphs. In this case, the template is the form designed to display the paragraph and capture the worker’s annotation. Then the data—the 100 different paragraphs—are loaded to create 100 individual HITs.
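To make the template-plus-data mechanism concrete, the following sketch uses the AWS SDK for Python (boto3), one of several clients for the Mechanical Turk API; the form fields are hypothetical stand-ins for a requester's own design. The single template below, with its {paragraph} placeholder, would yield 100 HITs when combined with 100 data rows.

import boto3

# Connect to the Mechanical Turk requester API (us-east-1 is the only
# Mechanical Turk region; a sandbox endpoint is available for testing).
mturk = boto3.client("mturk", region_name="us-east-1")

# A minimal HTMLQuestion design template with hypothetical form fields.
# A production form must POST to Mechanical Turk's externalSubmit
# endpoint and include the assignmentId; this is omitted for brevity.
HTML_TEMPLATE = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <form>
      <p>{paragraph}</p>
      <label>Your annotation:</label>
      <textarea name="annotation"></textarea>
      <input type="submit" value="Submit"/>
    </form>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>"""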
Task design and attribute specification can be performed using one of three alternatives: the Web-based requester interface, command-line tools, or an application programming interface. In each case, several predefined options are available for the page design (including, for example, checkboxes, drop-down menus, radio buttons, and free-text answers). The task attributes include the compensation per task, number of days the task will be available on Mechanical Turk, the maximum time allotted to any individual worker for completing the task once he or she has accepted it, the number of assignments per task (how many different workers process a given task), and the autoapproval period (the time period after which the results submitted by the worker will automatically be approved). The worker attributes include his or her approval rating (based on previous HITs completed on Mechanical Turk), geographic location, adult content qualification, and any additional qualifications set up by the requester (such as performance on previous tasks by the same requester). The set of worker attributes allows requesters to cultivate pools of trusted workers who habitually deliver good-quality results.
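Continuing the sketch above, one HIT per data item can then be created programmatically. The reward, time limits, and assignment count below are illustrative; the two qualification type IDs are Mechanical Turk's built-in locale and approval-rating qualifications.

# Hypothetical data: in practice, e.g., the 100 paragraphs to annotate.
paragraphs = ["First paragraph to annotate...", "Second paragraph..."]

hit_ids = []
for text in paragraphs:
    response = mturk.create_hit(
        Title="Annotate a short paragraph",
        Description="Read one paragraph and provide an annotation.",
        Reward="0.25",                        # compensation per task, in USD
        MaxAssignments=20,                    # number of different workers per task
        LifetimeInSeconds=7 * 24 * 3600,      # how long the task stays available
        AssignmentDurationInSeconds=15 * 60,  # time allotted once accepted
        AutoApprovalDelayInSeconds=3 * 24 * 3600,  # autoapproval period
        Question=HTML_TEMPLATE.format(paragraph=text),
        QualificationRequirements=[
            {   # worker attribute: located in the United States
                "QualificationTypeId": "00000000000000000071",
                "Comparator": "EqualTo",
                "LocaleValues": [{"Country": "US"}],
            },
            {   # worker attribute: approval rating of at least 95%
                "QualificationTypeId": "000000000000000000L0",
                "Comparator": "GreaterThanOrEqualTo",
                "IntegerValues": [95],
            },
        ],
    )
    hit_ids.append(response["HIT"]["HITId"])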
As soon as the data collection form (the template) has been created and the data are loaded, researchers can publish the HITs and begin receiving answers from workers. Responses can be downloaded, assessed for quality, and approved or rejected online or by uploading a corresponding data file. Once a HIT has been approved, the worker is paid the promised compensation; requesters also have the option of awarding bonuses to workers for particularly good results.
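The review step can be scripted in the same way. The sketch below continues the example above; looks_valid stands in for whatever offline quality assessment the researcher applies.

def looks_valid(answer_xml):
    # Hypothetical placeholder for the researcher's quality assessment.
    return "annotation" in answer_xml

for hit_id in hit_ids:
    # Retrieve submitted responses and approve or reject each one.
    result = mturk.list_assignments_for_hit(
        HITId=hit_id, AssignmentStatuses=["Submitted"]
    )
    for assignment in result["Assignments"]:
        if looks_valid(assignment["Answer"]):  # worker's response, as XML
            mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])
            # Optional bonus for particularly good work:
            # mturk.send_bonus(WorkerId=assignment["WorkerId"],
            #                  BonusAmount="0.10",
            #                  AssignmentId=assignment["AssignmentId"],
            #                  Reason="High-quality annotation")
        else:
            mturk.reject_assignment(
                AssignmentId=assignment["AssignmentId"],
                RequesterFeedback="Response failed the quality check.",
            )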
Validity of Responses
One of the main difficulties faced when conducting crowdsourcing studies is assuring the validity of the responses obtained [27]. Since participation is anonymous and linked to monetary incentives, crowdsourcing can attract participants who do not fully engage in the requested tasks or may be unqualified to complete them accurately. There are several ways a researcher might address this validity issue. The first, already mentioned above, is setting up qualifications, including qualification tests that must be passed before a worker can accept a HIT. Second, when the task has an objective ground-truth answer for a subset of the data (such as finding a particular image among a set of images), responses that do not match the ground truth can be rejected automatically and the worker can be blocked; however, this is not possible when the worker's task is to provide a purely subjective assessment. Third, crowdsourced data collection can involve multiple sequential stages, where at each stage a different set of workers corrects the output of previous workers. Fourth, various measures of reliability can be computed on the responses offline, such as outlier statistics or agreement between multiple workers performing the same tasks. Finally, sanity checks (eg, comprehension questions) can be included in the HIT itself.
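As a sketch of the second and fourth checks, assume responses have been downloaded into simple Python structures; the data layout and agreement threshold below are hypothetical.

from collections import Counter

def screen_responses(responses, gold, min_agreement=0.7):
    # responses: hypothetical mapping of task id -> list of (worker, answer)
    # gold: ground-truth answers for the subset of tasks that have one
    flagged_tasks, blocked_workers = [], set()
    for task, worker_answers in responses.items():
        # Ground-truth check: block workers who miss a known answer.
        for worker, answer in worker_answers:
            if task in gold and answer != gold[task]:
                blocked_workers.add(worker)
        # Agreement check: flag tasks whose majority answer is weak.
        counts = Counter(answer for _, answer in worker_answers)
        majority = counts.most_common(1)[0][1]
        if majority / len(worker_answers) < min_agreement:
            flagged_tasks.append(task)
    return flagged_tasks, blocked_workers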
Crowdsourcing in the Development of Health Promotion Materials
We developed an online survey to test formatting and modality preferences for a variety of messages on pediatric dental health issues (see Multimedia Appendix 1).
The survey consisted of three sections. In the first part we asked a set of questions about the participants’ demographic background, including country of origin, native language, age range, gender, highest education level achieved, whether participants had a regular dentist, and when they last saw a dentist. In the second part, described in more detail below, a paragraph extracted from a pediatric dental education document was presented in four different formats along with text comprehension questions. In the third part participants were asked to select which of the four formats they preferred, followed by an open-text question asking them to state the reasons for their preferences. Optionally, participants were able to provide feedback on the task itself.
In total, we created 12 different survey forms for 12 different documents, each about a different dental health topic. Information for consent to participate, including the estimated time to complete the survey and the types of information being collected, was provided before the survey began. We did not collect any personally identifiable information during the survey; workers remain anonymous and are associated only with an alphanumeric identification tag. The University of Washington Human Subjects Division approved the study.
For parts 2 and 3 we selected paragraphs from consumer education materials available on US national dental association websites, including the National Institute of Dental and Craniofacial Research, the American Dental Association, the American Academy of Pediatric Dentistry, and the American Academy of Family Physicians. We selected paragraphs to represent a variety of topics regarding childhood dental health, such as tooth brushing, pediatric dental visits, or fluoride use. The content of each selected paragraph was formatted into four versions. Format A consisted only of the running-text paragraph. Format B was a text-only bulleted list. Format C showed the running-text paragraph with a content-related image (either a photorealistic image or a graphic). Format D showed the bulleted list plus the image. All four formats were displayed on the same page, but the order in which they were presented was randomized. To ensure that participants read each of the four versions thoroughly, and thus to support the validity of their responses, they were asked to answer a different text comprehension question after the presentation of each format; responses with incorrect answers were discarded. We created and tested two versions of the survey, one in English and one in Spanish.
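For illustration, the format randomization and comprehension-based screening just described reduce to a few lines; the field names are hypothetical.

import random

FORMATS = ["A", "B", "C", "D"]  # paragraph, bullets, paragraph+image, bullets+image

def presentation_order():
    # Randomly order the four formats on the survey page.
    order = FORMATS[:]
    random.shuffle(order)
    return order

def keep_response(response, answer_key):
    # Discard any response with an incorrect comprehension answer;
    # response["comprehension"] is a hypothetical {format: answer} mapping.
    return all(response["comprehension"][f] == answer_key[f] for f in FORMATS)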
For each survey form, we created a separate HIT on Mechanical Turk. For each HIT, we collected 20 responses (ie, up to 20 different workers answered a single HIT, but a single HIT could not be completed multiple times by the same individual). For each of the two surveys we thus obtained 240 responses.
Participation was limited to individuals located in the United States who were 18 years of age or older. For the Spanish survey, participants were required to be native Spanish speakers and were asked to specify their country of origin. A separate language qualification test was not applied; however, all Spanish survey materials, including the HIT description and the comprehension questions, were in Spanish, and we saw no evidence of nonnative speakers taking the Spanish survey. In addition, all comprehension questions were answered correctly. To help ensure reliable participants, we also required an approval rate of at least 95% on previously completed HITs. We allocated 15 minutes for the completion of a single HIT, although we estimated that it could be completed in much less time; the compensation was US $0.25 per HIT.