Data Dilemmas & Converging Logics

5 December 2013, 1311 EST

One of the recurring subjects among folks using data is: why does person x not share their data with me?  Mostly because they are fearful and ignorant.  Fearful?  That their work will get scooped and/or their data might be found to be problematic.  Ignorant?  That they don’t know that they are obligated to share their data once they publish off of it and that it is in their interest to share their data.  There is apparently a belief out there that data should be shared only after the big project is published, not after the initial work has been published.  I will address this as well as the converging logics of appropriateness and consequences here.

Let me address first this new belief.  The idea that one can hold onto one’s data after publishing an article because one has not yet published the book is not the norm.  The norm, the obligation imposed by the National Science Foundation and expected from the discipline, is that when the first piece is published, then the data for that piece is supposed to be accessible.  That way, people can replicate that study.  If the larger study is years later (and it almost always is), that means that we would have to wait years to replicate the initial article.  Not only that, but there are no guarantees that the scholar will finish the book project AND find a publisher.

Ok, let’s move from the specific myth to the broader logics of replication.  It has become clear over the decades that providing access to data is the right thing to do from the standpoint of a logic of appropriateness.  The discussions make it clear that scholars need to be transparent about their research and make it easy for others to replicate their findings.  This is the basic expectation for doing any research but especially quantitative research.  Providing interview notes can be problematic due to confidentiality issues (although the NSF is funding a project to figure this out), but providing data that one has created/collected is the standard expectation of social science.  Many journals now have replication policies and store data at their websites.  The question these days is not about the obligation to share data but about the obligation to share the “do” files, macros, or programs that are used to analyze the data.

Sharing data is clearly right from an appropriateness standpoint, but it is also right from a rational self-interest perspective.  People worry that citations are over-rated, and they may be so.  But if you want to get cited, one of the best ways to do that is to share your very useful dataset.  According to the ISP symposium linked above (p. 21): “An author who makes data available is on average cited twice as frequently as an article with no data.” If you check out the various lists of who gets cited the most, those who create datasets and share them get cited more.  Will Moore shared with some folks a story at a recent conference, saying how he got pretty famous in the discipline long before he published much because his name was attached to the Polity dataset.  He had no idea that this was going to happen.  One of the requirements of new grant applications is to show how one plans to disseminate one’s research.  Sharing data is one basic and very important strategy.

We are in the business of creating public goods: knowledge.  Not just the findings but the data we develop along the way.  It does mean that some folks may free ride, but it also means that the collective enterprise moves forward.  Holding onto data is not only selfish but short-sighted.  Having others work on the same dataset is likely to lead to feedback, which means your work gets better, and to broader imaginations of what is possible.  When Ted Gurr developed the Minorities at Risk project, he really had no idea how others would use it.  Those who followed him used it in a variety of ways, adding bits and pieces of data (I took some of the IR data they had collected but not coded and used that to test some stuff that Gurr never intended to ponder), asking different questions and developing some very interesting findings.

Yes, folks along the way also discovered problems with the dataset, but Gurr and the larger MAR team (which I subsequently joined) worked on ways to improve the data (and, hey, we got NSF money to do that; the next batch of papers will address the improvements and then the revised data will become available with better instructions).  This is how social science works.  Keeping the data to oneself, even if only for a few more years, is completely contrary to our enterprise.

Update: APSA has been working on this, with roundtables at the last annual meeting and a forthcoming special issue of PS.  See pages 9-10 of the APSA’s revised Professional Ethics, including:

6. Researchers have an ethical obligation to facilitate the evaluation of their evidence-based knowledge claims through data access, production transparency, and analytic transparency so that their work can be tested or replicated.
6.1 Data access: Researchers making evidence-based knowledge claims should reference the data they used to make those claims. If these are data they themselves generated or collected, researchers should provide access to those data or explain why they cannot.
6.6 Researchers who collect or generate data have the right to use those data first.  Hence, scholars may postpone data access and production transparency for one year after publication of evidence-based knowledge claims relying on those data, or such period as may be specified by (1) the journal or press publishing the claims, or (2) the funding agency supporting the research through which the data were generated or collected.

One year after publication (unless other requirements kick in).  Not when the larger project is published.