The Duck of Minerva

Mechanical Turk and Experiments in the Social Sciences

23 July 2013

Amazon created a platform called Mechanical Turk that allows Requesters to create small tasks (Human Intelligence Tasks or HITs) that Workers can perform for an extremely modest fee such as 25 or 50 cents per task.* Because the site can be used to collect survey data, it has become a boon for social scientists interested in an experimental design to test causal mechanisms of interest (see Adam Berinsky’s short description here). The advantage of Mechanical Turk is the cost. For a fraction of the expense it costs to field a survey with Knowledge Networks/GfK, Qualtrics, or other survey companies, one can field a survey with an experimental component. Combined with other low-cost survey design platforms like SurveyGizmo, a graduate student or faculty member without a huge research budget might be able to collect data for a couple of hundred dollars (or less) instead of several thousand. But, storm clouds loom: in recent weeks, critics like Andrew Gelman and Dan Kahan have weighed in and warned that Mechanical Turk’s problems make it an inappropriate tool, particularly for politically contentious topics. Are these criticisms fair? Should Mechanical Turk be off limits to scholars?

I’m going to put my cards on the table here. Most of what I know about experiments I’ve learned from Bethany Albertson, who in addition to being a fabulous political psychologist, is also married to me. We’ve run some experiments together with a survey market research company, and I ran several experiments last summer on Mechanical Turk after programming them on SurveyGizmo. I realize, as a largely qualitative scholar in international relations, I could be at risk of getting out of my lane, especially with statistician (!) Gelman on record expressing his disapproval of Mechanical Turk. Nevertheless, I’m going to stick up for Mechanical Turk. Sure, for some purposes, the gold standard might be nationally representative samples, but Mechanical Turk has its virtues which should not so easily be dismissed.

Social Scientists Who Use Mechanical Turk Know Not What They Do

screenIn covering the announcement of a new Journal of Experimental Political Science, Andrew Gelman writes emphatically on the Monkey Cage that Mechanical Turk produces weak scholarship:

Regular readers of the sister blog will be aware that there’s been a big problem in psychology, with the top journals publishing weak papers generalizing to the population based on Mechanical Turk samples and college students, lots of researcher degrees of freedom ensuring there will be no problem finding statistical significance, and with the sort of small sample sizes that ensure that any statistically significant finding will be noise, thus no particular reason to expect that patterns in the data will generalize to the larger population. A notorious recent example was a purported correlation between ovulation and political attitudes.

Here, he goes on further and makes the point more stridently, writing:

Just to be clear: I’m not saying that the scientific claims being made in these papers are necessarily wrong, it’s just that these claims are not supported by the data. The papers are essentially exercises in speculation, “p=0.05” notwithstanding.

And I’m not saying that the authors of these papers are bad guys. I expect that they mostly just don’t know any better. They’ve been trained that “statistically significant” = real, and they go with that.

He concludes by saying that “what I’m objecting to is fishing expeditions disguised as rigorous studies.” I’m not a regular reader of Gelman’s blogging, and his research interests are pretty far from my own, but I’d like to know more about why he arrived at such claims.

I tried to parse his rationale, reading the comment thread and some earlier posts, some of which focused on nonresponse bias (people who take the survey are different from those who opt not to). I’m certainly all for refraining from reifying p values of .05, but I’m not clear what motivates his argument that the data in these papers doesn’t support the claims, that survey design makes it easy to find statistical significance, that any findings are most likely noise and not generalizable. I’d love for him or other readers to unpack those claims, as I would certainly want to avoid making such mistakes.

Mechanical Turk Respondents Are Too Liberal!logo
In reading through Gelman’s previous blog posts, I found another post “Don’t Trust the Turk” where he cited Dan Kahan’s blog. Kahan’s work focuses on climate change and is much more in my wheelhouse.

Kahan’s objections to Mechanical Turk are more specific, focusing on selection bias, repeated exposure by survey-takers to study measures, and misrepresentation of nationality. Of these, having read the comments thread on Kahan’s posts as well as Gelman’s, I think only the first really has much validity.

As the comments thread point out, repeat survey-takers are likely a problem in student labs as well (and not everyone can do an expensive national sample). On the nationality claim, Kahan seems to be speculating without much evidence. Yes, if you fail to specify that you want Americans only, you can get foreign subjects, mostly from India, but if you specify, you can block out non-American respondents and non-American-based IP addresses. I learned that the hard way! I’ve gotten e-mails from Americans overseas saying, hey I’m on vacation or working abroad, why did you exclude me? Maybe people are just trying to fool me, but that’s a lot of effort for 50 cents. I’d like to see more evidence that this is a problem.

As for selection bias, Kahan worries that:

it is clear that the MT workforce today is not a picture of America.

MT workers are “diverse,” but are variously over- and under-representative of lots of groups.

Like men: researchers can end up with a sample that is 62% female.

African Americans are also substantially under-represented: 5% rather than the 12% they make up in the general population.

There are other differences too but the one that is of most concern to me—because the question I’m trying to answer is whether MT samples are valid for study ocultural cognition and like forms of ideologically motivated reasoningis that MT grossly underrepresents individuals who identify themselves as “conservatives.”

Yes, liberals and women are over-represented on Mechanical Turk. This was my experience as well (at least for the United States; men were over-represented in my India sample). What are the implications?

In Kahan’s view, it means one cannot do serious scholarship on contentious political topics like climate change because conservatives won’t do the survey on-line or those that do are likely to be different from “other” conservatives, so even an oversample of conservatives on Mechanical Turk wouldn’t do:

The problem is what the underrepresentation of conservatives implies about the selection of individuals into the MT worker “sample.” There’s  something about being part of the MT workforce, obviously, that is making it less appealing to conservatives.

Is this problem insurmountable? Does it damn Mechanical Turk as a method? It’s not clear to me that the biases of conservatives on Mechanical Turk are likely to be all that different from other conservatives. Kahan describes some Mechanical Turk findings at odds with his expectations for other studies, but I wonder if he and I are both are engaged in the kind of motivated reasoning he finds troubling (I’ll admit: I’m inclined to like Mechanical Turk because it is cheap).

Many political psychologists are less concerned about the broader generalizability of their findings than they are in assessing whether theorized mechanisms of political persuasion work with small populations. Random assignment allows the scholar to assess differences in political attitudes and behavior between a control and experimental conditions. Whether those claims are generalizable to the broader U.S. population is not the overriding aim. Usually, good methodologists are clear on the limits of their findings.

The commentators to Gelman’s post capture these sentiments thoughtfully. Rice’s Rick Wilson writes:

IF researchers are aiming to generalize to the population, then MTurk or other convenience samples are a problem. On the other hand, if the aim of an experiment is to establish causal direction then I worry far more about internal validity. A convenience sample, randomized into control and treatment groups, allows inferences about causal direction.

In the same comment thread, Michigan’s Ted Brader seconds the call for not fetishizing generalizability. He instead calls for replication and transparency:

However I wouldn’t want the editors or anyone else to interpret this as saying the journal should not publish data from certain kinds of samples (students, MTurk, etc.) or that the journal should become hyper-focused on generalizability. Having read many of your other blog posts, I’m confident this is not what you were saying. Generalizability is a “problem” — in the sense, we can’t be completely sure the relationship holds in all times, under all circumstances, for all people — for just about every empirical paper ever written. The key for authors (and editors) is to make sure the paper is thoughtful about potential limitations in this regard and thus that the claims are tethered appropriately to the reach of the data.

Mechanical Turk’s Alright if You Use it Properly

MIT’s Adam Berinsky is a user of Mechanical Turk and has a 2012 peer-reviewed paper in Political Analysis that successfully replicated a number of experimental findings using Mechanical Turk. In the APSA Experimental newsletter, he describes some of its virtues and limitations.

On the plus side, he writes:

Berinsky, Huber and Lenz (2012) (henceforth: BHL 2012) found that based on observable characteristics, the U.S.-based MTurk sample is more representative of the national population than convenience samples of college students, but less representative than national probability samples.

As Berinsky acknowledges, Mechanical Turk is useful for some purposes but not others:

Furthermore, there are atypical characteristics in the MTurk sample that may bias the estimated treatment effects under certain conditions. For instance, MTurk subjects in the U.S. seem to be relatively young, Democratic, liberal, and knowledgeable about politics (BHL 2012). For developing countries, MTurk is very likely to be over-sampling the sub-population that is more internet-savvy, more proficient in English, and more exposed to Western influence than the average citizen (BQS 2012).

On The Monkey Cage, Berinsky responded directly to Gelman’s/Kahan’s points:

It seems there are some fair points, but I’m not sure the claims made here damn MT (or any subject pool out of hand). I have said before — and I think Kahan would agree — that when running experiments, we want to think about how the composition of the sample affects our results; we don’t want to focus on the sample in isolation from our experiments. So if you want to study the behavior of conservatives and you have reason to think that conservatives in a given sample (say MT) react to your task different than conservatives in the population, that is something you should be worried about. But to go the next step and dismiss a subject recruitment process out of hand for all purposes seems to me a step too far.

Mechanical Turk and IR: Pluses and Minuses

Should you consider using Mechanical Turk? It depends on the purpose. For my forthcoming book with Ethan Kapstein, I wanted to test the persuasive messages of some mock political ads by an advocacy group calling for universal access to pharmaceuticals.  I wanted to test whether or not campaigns for goods essential for life (like cancer and AIDS drugs) performed better than inessential luxury goods like impotence drugs. We also looked at goods not quite as central to survival such as hypertension drugs. With slight differences in campaign wording, we wanted to see if we could generate some differences in attitudes between conditions. condition1

For a limited budget, we were able to run two successive rounds of U.S. experiments on Mechanical Turk and a third experiment with Indian subjects. We found some notable similarities between the U.S. and Indian subjects and we acknowledge the biases of the samples that might not generalize to a wider population. Given the consistent findings, we then hired a survey market research firm to conduct the survey with a more representative U.S. sample, again with similar results.

However, there were clear limits to Mechanical Turk. We initially wanted to pair results from two developed countries (the U.S. and the UK) where advocates had been active on universal access to medicines and two middle income countries with large populations living with HIV (India, South Africa). We weren’t able to generate enough respondents (or more than a handful) in either the UK or South Africa so had to abandon that idea, but we found through replication of the U.S. surveys and testing using a non-U.S. sample that we could draw some suggestive conclusions that corresponded with other evidence and arguments we developed.

Concluding thoughts

Political psychologists that use Mechanical Turk or other samples of convenience such as student samples typically look precisely for variables like partisanship, education, knowledge, gender, ideology or other factors that may moderate how their experimental conditions operate. While such samples may be, for example, more liberal than the general U.S. public, the samples may have enough conservatives to assess whether or not those factors matter.

Recognizing that they face a high bar to convince readers and journals of student samples alone, scholars have looked to other ways to collect non-student data such as Mechanical Turk, the Time-Sharing Experiments for the Social Sciences (TESS), or by paying survey market research firms like Knowledge Networks/GfK. But, some of these other options are not always feasible. TESS is highly competitive, and Knowledge Networks/GfK and other firms are expensive.

I’ve run experiments with a survey market research firm that costs a fraction of GfK, and a single study easily costs $1500 to $2000. The truth in experiments is that they often don’t work. You need to test and refine the experiments. Sometimes, you make a mistake. In some cases, you need to think through your theory and revise, refine, or jettison ideas that don’t work (I’m mindful of the critique of experiments that scholars hide non-findings and keep tweaking until they find results that suit their theory).

If you are graduate student where experiments with U.S. or Indian subjects make sense, then Mechanical Turk may be useful, particularly if you don’t yet have a large research budget. Even if you have some resources, Mechanical Turk can allow you to beta-test a survey before you make expensive mistakes. Moreover, convenience samples like Mechanical Turk can allow you to demonstrate the effectiveness of your studies so that you can compete for funding/time for higher quality data sources like TESS or NSF dissertation improvement grants. In short, in the context of research that uses multiple samples, Mechanical Turk can be an important new source of data for scholars who are mindful of its limits.

(*The name Mechanical Turk is derived from “The Turk” (pictured above), a “machine” that played chess in the 18th century that was made by Wolfgang von Kempelen. The Turk successfully defeated opponents across Europe and the Americas but was later revealed to be a hoax, as a chessmaster was hidden in a compartment below and controlled movement of the pieces.)