The Danger of Data without Theory

13 August 2010, 1250 EDT

I came across this Chris Anderson piece from a 2008 issue of Wired via Ana Andjelic. Anderson argues that in the era of Big Data we no longer need to rely on theory and the scientific method to achieve advances in knowledge:

Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required. That’s why Google can translate languages without actually “knowing” them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

…faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

There is certainly value in sophisticated data mining and an inductive approach to research, but to dismiss the deductive approach (construct a theory, deduce testable hypotheses, then empirically verify or falsify them) would be shortsighted.

Modern data mining may be enough to authoritatively establish a non-random relationship, and in some cases (translation and advertising) that more than suffices for useful application. However, even the largest data sets still represent only a sample, and therefore an approximation, of reality. Moreover, establishing correlation still doesn’t get you to the underlying mechanisms that produce it. Even if Google, with enough data and advanced statistical techniques, can claim that a causal relationship exists, it can’t tell you why it exists.

For some subjects, “why” may not matter: do we care why Google’s program is able to translate accurately between languages, or is the practical effect enough for us? But for others it is crucial, particularly when thinking about how to construct an intervention to alter some state of being (e.g., a medical condition, poverty, or civil war). Understanding causal mechanisms can also help us think through the consequences of an intervention: what are some potential side effects? Are there other, seemingly unrelated, areas that the intervention might affect in a negative way? When we are dealing with more interconnected, complex systems (like human physiology or society), it behooves us to go beyond relationships and understand what levers are being pulled.
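A toy simulation makes the gap between correlation and intervention concrete. This is only an illustrative sketch under assumptions of my own, not anything drawn from Anderson's piece or from Google's systems: a hidden factor `z` drives both `x` and `y`, so the two are strongly correlated in observed data even though `x` has no effect on `y`. The variable names, noise levels, and sample size are all hypothetical.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def simulate(n, rng, intervene=False):
    """Draw n (x, y) pairs from a toy world with a hidden common cause z.

    Assumption: y depends only on z, never on x. When intervene=True,
    x is set independently of z, as an experimenter pulling the lever would.
    """
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # hidden common cause
        if intervene:
            x = rng.gauss(0, 1)              # x set from outside, ignoring z
        else:
            x = z + rng.gauss(0, 0.5)        # x merely reflects z
        y = z + rng.gauss(0, 0.5)            # y is driven by z alone
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)
observed = pearson(*simulate(20000, rng))
intervened = pearson(*simulate(20000, rng, intervene=True))

print(f"correlation in observed data:   {observed:.2f}")
print(f"correlation after intervention: {intervened:.2f}")
```

In this toy world the observed correlation is strong, yet it all but vanishes once `x` is manipulated directly. More data would only sharpen the estimate of the observed correlation; it would not reveal that `x` is the wrong lever. That takes a theory of the mechanism.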

[Cross-posted at Signal/Noise]