The Duck of Minerva

IR, it’s time to talk about what the “multiple comparisons issue” really is

15 November 2019

As a reviewer and recipient of reviews, I’ve noted a recent trend among IR papers. A study uses cross-national data with regression analysis, and runs multiple models with different variables or subsets of the data. Sometimes the results are consistent, sometimes they aren’t. But often a reviewer will object to the study’s validity, pointing to the “multiple comparisons” issue. Multiple comparisons can be a real problem in quantitative IR studies, but I worry we’re misdiagnosing it.

What do I mean? Imagine we’re writing a paper on interstate conflict. We could measure conflict onset, duration or intensity. We could measure intensity by an ordinal scale, the number of deaths, or other more complicated measures. We could include all dyads, politically-relevant dyads, dyads since 1945, dyads since 1989, etc. Ideally our theory would dictate these choices, but our theories are often not specific enough to do so.

So we decide to run all the different combinations and report them, some in text, some in an appendix. We find a general pattern among the various dependent variables and subsets of data, which we present as a finding. One reviewer objects to the finding’s validity, though; they say that running that many models creates a multiple comparisons problem. As I’ve said, this is something I’ve gotten from peer reviewers on my own articles, and seen other reviewers raise when I was a reviewer. And I think it’s a misapplication of the idea that can needlessly undermine interesting studies.

So what is the “multiple comparisons” issue? It was brought to my attention by Andrew Gelman, a statistician who frequently critiques (i.e. tears apart) quantitative studies in psychology and political science. In a paper with Eric Loken, he discusses it as a study having

“a large number of potential comparisons when the details of data analysis are highly contingent on the researcher.”

The attempt to find a statistically significant result amidst numerous potential tests is known as “p-hacking,” “researcher degrees of freedom,” or a “fishing expedition.” That is, scholars run and re-run tests until they find one that works. But Gelman and Loken argued that there is a problem even in the absence of a “fishing expedition”:

“researchers can perform a reasonable analysis given their assumptions and their data, but had the data turned out differently, they could have done other analyses that were just as reasonable.”

Looking at the above IR example, it’s easy to see how a multiple comparisons issue could arise. Someone could run all the different possibilities and pick the one that seems best. Someone could just run the first one they tried, like the result, and write it up. But the findings could be unstable, changing or disappearing if we tried a different set of variables or data.
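
To get a feel for why picking the best of many specifications is risky, here is a rough simulation sketch. It is my own illustration, not anything from Gelman and Loken. It assumes there is no true effect, so each specification’s p-value is uniform between 0 and 1, and it assumes the specifications are independent of one another, which real models run on overlapping data are not. The function names are made up for the example.

```python
import random


def share_with_a_significant_result(n_specs, n_studies=20000, alpha=0.05, seed=1):
    """Share of simulated null studies where at least one specification has p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Assumption: under a true null, each p-value is uniform on [0, 1)
        # and the specifications are independent of each other.
        p_values = [rng.random() for _ in range(n_specs)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


def analytic_share(n_specs, alpha=0.05):
    """Closed-form version of the same quantity: 1 - (1 - alpha) ** n_specs."""
    return 1 - (1 - alpha) ** n_specs


if __name__ == "__main__":
    # Compare the simulated and analytic shares for a few grid sizes.
    for k in (1, 5, 12, 24):
        simulated = share_with_a_significant_result(k)
        print(k, round(simulated, 3), round(analytic_share(k), 3))
```

Under these simplified assumptions, a researcher choosing among 12 specifications would find at least one “significant” result roughly 46 percent of the time even when nothing is going on. Correlated specifications would change that number, but not the basic worry.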

We can’t avoid messy, open-to-interpretation data in IR without abandoning important questions in favor of narrowly-defined experimental studies. So we will run into such potential multiple comparisons issues. What can we do?

As Gelman and Loken discuss, an increasingly accepted solution is preregistration: post the entire data collection and analysis plan before you run the study. This is tricky as it’s a lot of extra work, and as they note, the best analyses are often part of an iterative process. Alternatively, we could “follow up an open-ended analysis with pre-publication replication.” But replication often isn’t feasible in political science, so they suggest a best practice may be to “analyze all relevant comparisons” in the data.

So in the case of our hypothetical conflict paper, we’d run models using every reasonable way of defining “severity of conflict,” on every reasonable subset of the data, and report all the findings. Ideally there’s a pattern to the findings across models. If not, we still report them and explain the differences, hopefully through reference to our theory.
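
In practice, “analyze all relevant comparisons” can be as simple as looping over every combination of outcome and sample and reporting the whole table. The sketch below is one way that might look, not a recommended workflow. Everything in it is invented for illustration: the data are synthetic, the column names (`x`, `relevant`, `onset`, `duration`, `deaths`) are hypothetical, the outcomes are continuous stand-ins rather than real conflict measures, and a bivariate slope stands in for whatever model the paper would actually use.

```python
import itertools
import random


def ols_slope(x, y):
    """Slope from a bivariate least-squares fit of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def make_fake_dyads(n=400, seed=7):
    """Synthetic dyad-years; every column here is made up for the example."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)  # hypothetical explanatory variable
        rows.append({
            "x": x,
            "year": rng.randint(1900, 2010),
            "relevant": rng.random() < 0.5,  # hypothetical politically-relevant flag
            "onset": 0.3 * x + rng.gauss(0, 1),
            "duration": 0.3 * x + rng.gauss(0, 1),
            "deaths": 0.3 * x + rng.gauss(0, 1),
        })
    return rows


OUTCOMES = ["onset", "duration", "deaths"]
SAMPLES = {
    "all dyads": lambda r: True,
    "politically relevant": lambda r: r["relevant"],
    "since 1945": lambda r: r["year"] >= 1945,
    "since 1989": lambda r: r["year"] >= 1989,
}


def run_all_specifications(rows):
    """Fit every outcome-by-sample combination and keep all results, not just the best."""
    results = []
    for outcome, (sample_name, keep) in itertools.product(OUTCOMES, SAMPLES.items()):
        subset = [r for r in rows if keep(r)]
        slope = ols_slope([r["x"] for r in subset], [r[outcome] for r in subset])
        results.append({"outcome": outcome, "sample": sample_name,
                        "n": len(subset), "slope": slope})
    return results


if __name__ == "__main__":
    # Print the full grid so a reader can see every specification at once.
    for res in run_all_specifications(make_fake_dyads()):
        print(f'{res["outcome"]:<9} {res["sample"]:<21} n={res["n"]:<4} slope={res["slope"]:+.3f}')
```

The point of the design is that the grid of choices is written down once, up front, and every cell gets reported. A reader can then judge for themselves whether the pattern holds across specifications.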

Now, inconsistent results may make us question the paper’s conclusions. That’s fair (although I think we should give credit for transparency and modesty). But we shouldn’t ding the paper for a multiple comparisons issue. Running and reporting all the relevant comparisons is actually how to address it.