Summarizing the Research Article: “Where are human subjects in Big Data research? The emerging ethics divide”


On June 1, 2016, Jacob Metcalf and Kate Crawford published “Where are human subjects in Big Data research? The emerging ethics divide” in the journal Big Data & Society.

In this blog post I will offer a simple summary of their paper so that, regardless of your data science background, you will understand the authors’ main points.

Although this research paper was written more than four years ago, arguably an eon in technology years, it is especially pertinent as big data continues to grow exponentially. The Information Overload Research Group (IORG) estimates that 90% of the world’s data was created in the last two years alone.

And if you are a California resident, as of January 1, 2020, the California Consumer Privacy Act (CCPA) gives you the right to ask Big Data companies to tell you exactly what information they have about you and what they do with it, and to require them to delete it.

Image: Getty Images

So why would California create such a law? And why all the debate and hearings on Capitol Hill about big data companies and what they do with our information?

Well, the answer is that we’re not exactly sure what they do with our information, but we are pretty certain that they are monetizing it and profiting from it. Additionally, there is much debate about the ethics of big data and the potential harm it can cause to individuals and to society as a whole. This is what Metcalf and Crawford’s paper speaks to.

Research papers start with an abstract, which summarizes what the paper addresses and how it is addressed. Metcalf and Crawford’s abstract states that data science practices, especially in big data, are exempt, or have exempted themselves, from established ethical standards that are regularly applied to other sciences, such as biomedical research. They say that the Common Rule, which is the primary regulation governing human-subjects research in the USA, largely excludes data science methods from human-subjects regulation. The paper sets out to examine the ethics of a few well-known big data cases, including the 2014 Facebook emotional contagion study and the 2016 use of geographical data techniques that revealed the true identity of the famous artist Banksy, despite his desire to remain anonymous.

The research paper’s introduction lays out two potential ethical issues related to data science. The first is the growing divide between established research ethics in traditional disciplines and the research methods of Big Data. The second is that US research regulations exempt projects that make use of already existing, publicly available datasets, on the assumption that they pose only minimal risks to the human subjects they document. The authors argue that this assumption is founded on a misconception: publicly available data can be put to a wide range of secondary uses, especially when it is combined with other data sets, and those uses can pose serious risks to individuals and communities.
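To make the idea of “combining data sets” a little more concrete, here is a minimal Python sketch of my own. It is not from the paper, and every record, name, and field in it is made up for illustration. It shows how a release with the names stripped out can still be matched back to people when a second public list shares a few ordinary fields such as zip code, birth year, and sex.

```python
# Toy, entirely made-up records; no real people or datasets are involved.
# An "anonymized" release: names removed, but a few ordinary fields kept.
health_records = [
    {"zip": "10001", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "10002", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
]

# A separate, hypothetical public list that does include names.
public_roll = [
    {"name": "Alice Example", "zip": "10001", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "10002", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "10003", "birth_year": 1990, "sex": "F"},
]

# Fields that appear in both lists and can act as a join key.
SHARED_FIELDS = ("zip", "birth_year", "sex")


def link(anonymous_rows, named_rows, keys=SHARED_FIELDS):
    """Match nameless rows to named rows that share the same key fields."""
    index = {}
    for row in named_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anonymous_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # exactly one candidate means the row is re-identified
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(health_records, public_roll)
```

Neither list looks dangerous on its own; the risk only appears once the two are joined, which is the kind of secondary use the authors are worried about.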

The introduction goes on to explain how, in 2016, researchers used publicly available data to track and ultimately reveal the artist Banksy’s real name, despite his explicit desire to remain anonymous. The researchers claimed that doing so was ethical because all of the data used to identify him came from public data sets. Although big data was not used in this case, a parallel can be drawn between using publicly available data on a small scale to track an individual and combining publicly available data on a large scale for other, potentially nefarious purposes.

Image: painting by Banksy

The authors assert that “there is real urgency to define what a ‘human subject’ is in Big Data research and critically interrogate what is owed to ‘data subjects.’”

They note that the precursor disciplines of data science (computer science, applied mathematics and statistics) “have not historically considered themselves as conducting human-subjects research.” But now data science is very much involved in human-subjects research. Because it is so new, it doesn’t fit into the established ethics guidelines for disciplines that study human subjects, and that is ultimately what the authors want addressed. They ask, “If the familiar human subject is largely invisible or irrelevant to data science, how are we to devise new ethical parameters? Who is the ‘data subject’ in a large-scale data experiment, and what are they owed?”

The stated purpose of the paper is to “offer a preliminary examination of how critical data studies might generate a theory of data subjectivity that would enable responsible scientific practice with Big Data methods.” They say that critical data studies have routinely demonstrated that it is deeply mistaken to treat research data as neutral and raw.

The paper explains that ethics regulation agencies are criticized for addressing ethics with a one-size-fits-all approach and then applying those rules inconsistently across similar cases. This might explain why some big data scientists are skeptical about being held to the same regulations. In response, Metcalf and Crawford propose that data scientists should aim to model the norms and practices that would build and sustain the public trust necessary to earn the right of effective self-regulation.

The paper asks, “What are the actionable ethical obligations data scientists and practitioners have for the well-being of their data subjects?” and “How do we assess that those obligations are being met?” The authors revisit a claim made earlier in the paper: it is flawed and dangerous to assume that the risk to research subjects depends only on what kind of data is obtained and how it is obtained, and not on what is done with the data afterward.

They cite the New York City Taxi & Limousine Commission data set that was made public in 2013 as an example of public data gone wrong. In that case, researchers were able to figure out the taxi drivers’ medallion numbers and then use the data set to identify personal information about the drivers and their fares, such as home addresses and places visited.
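How could hidden medallion numbers be figured out? Public write-ups of the incident reported that the numbers had been scrambled with an unsalted MD5 hash. That detail is not in my summary above, so treat the Python sketch below as my own illustration under that assumption, with a deliberately simplified medallion format (digit, letter, digit, digit) and a made-up example value. The point is only that when the set of possible identifiers is small, every candidate can be hashed in advance and the published values simply looked up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Return the MD5 digest of an ASCII string as hexadecimal text."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every possible identifier in the simplified format.

    Assumed format (a simplification for illustration): digit, letter,
    digit, digit, which gives only 10 * 26 * 10 * 10 = 26,000 candidates.
    """
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        medallion = d1 + letter + d2 + d3
        table[md5_hex(medallion)] = medallion
    return table


lookup = build_lookup()

# A made-up "published" value standing in for one row of the released data.
published_value = md5_hex("7B42")

# Reversing the hash is just a dictionary lookup.
recovered = lookup[published_value]
```

On an ordinary laptop, building the full table takes well under a second, which is why scrambling a small set of identifiers this way does not really anonymize them.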

They also touch on the 2014 Facebook emotional contagion study, in which, during one week in January 2012, data scientists skewed what almost 700,000 Facebook users saw on their feeds. Some people were shown content with happy and positive words, and some were shown sadder content. When the week was over, the manipulated users were more likely to post positive or negative words themselves, depending on what they had seen during the week.


In conclusion, the research paper seems to ask more questions than it answers, and that is okay: these questions need to be asked and kept in front of all of us. Big Data is not going away, and the ethics of how it is used must be addressed before it is too late.
