Labels and tribes


Feb 19, 2012

In the Matrix, it’s trivial to specify the underlying
data-generating process. It involves kung fu.

Given PTJ’s post, I wanted to clarify two points from my original post on Big Data and the ensuing comment thread.

I use quantitative methods in my own work. I’ve invested a lot of time and a lot of money in learning statistics. I like statistics! I think that the development of statistical techniques for specifying and disciplining our analytic approach to uncertainty is the most important development in social science of the past 100 years. My objection in the comments thread, then, was not to the use of statistics for inference. I’m cautious about our ability to recover causal linkages from observational data, but no more so than, say, Edward Leamer–or, for that matter, Jeffrey Wooldridge, who wrote the first econometrics textbook I read.

My objection instead is to the simple term “inferential statistics,” because the use of that term to describe certain statistical models, as opposed to the application of statistical models to theoretically-driven inquiry, often betrays an unconscious acceptance of a set of claims that are logically untenable. The normal opposition is of “inferential” to “descriptive” statistics, but there is nothing inherently inferential about the logistic regression model. Indeed, in two of the most famous applications of handy models (Gauss’s use of least-squares regression to plot asteroid orbits and von Bortkiewicz’s fitting of a Poisson distribution to data on Prussian soldiers killed by horse kicks), there is no inference whatsoever being done; instead, the models are simply descriptions of a given dataset. More formally, then, “inferential” does not describe a property of statistical models; it should be taken strictly to refer to their use. What is doing the inferential work is the specification of parameters, which is why it is sometimes entirely appropriate to have a knock-down fight over whether a zero-inflated negative binomial or a regular Poisson is the best fit for a given test of a given theory.
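To make the descriptive use concrete, here is a minimal sketch in Python of the horse-kick example. The counts are an assumption on my part: they are the commonly reproduced tabulation of 200 corps-years, not figures taken from this post. The variable names are mine as well. The point is only that fitting the model is a summary of the data in hand; nothing in the code tests a theory.

```python
import math

# ASSUMPTION: commonly reproduced horse-kick tabulation (200 corps-years).
# Keys are deaths in a corps-year; values are how many corps-years saw that many.
observed = {0: 109, 1: 65, 2: 22, 3: 3, 4: 1}

n = sum(observed.values())                        # number of corps-years
total_deaths = sum(k * c for k, c in observed.items())
rate = total_deaths / n                           # Poisson mean fitted as the sample mean


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Fitted frequencies: what the Poisson "description" says the table should look like.
fitted = {k: n * poisson_pmf(k, rate) for k in observed}

for k in observed:
    print(f"{k} deaths: observed {observed[k]:3d}, fitted {fitted[k]:6.1f}")
```

The fitted column tracks the observed one closely, and that is all the exercise shows. Whether the same Poisson (or a zero-inflated negative binomial) is the right specification for testing a theory is a separate question, and that is where the inferential work happens.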

So, my objection on this score is narrowly to the term “inferential statistics,” which I simply suggest should be replaced by something slightly more cumbersome but much more accurate: “the use of statistics for inference.” What this phrasing loses in brevity it gains in accuracy.

The second point is that my post about Big Data was meant to serve as a warning to qualitative researchers about what could happen if they did not take the promise of well-designed statistical methods for describing data seriously. My metaphor of an invasive species was meant to suggest that we might end up with a much-impoverished monoculture of data mining that, by dint of its practitioners’ superior productivity, would displace traditional approaches entirely. But the proper response to this is not to equate the use of statistical methods with data mining (as I think a couple of commenters thought I was arguing). Quite the contrary: it would be much preferable for historians to learn how to use statistics as part of a balanced approach than for historians to be displaced by pure data miners.

This is all the more relevant because the flood of Big Data that is going to hit traditionally qualitative studies will open new opportunities for well-informed and teched-up researchers who have the skills to exploit petabytes of newly available data. After all, the real enemy here for qual and quant researchers in social science is not each other but a new breed of data miner who believes that theory is unnecessary, a viewpoint best expressed in 2008 by Chris Anderson in Wired:

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. … There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

I feel confident that no reader of the Duck wants to see this come to pass. The best way to head that off is not to adopt an unthinking anti-statistical stance but rather to use those methods when proper in order to support a deeper, richer understanding of social behavior.

