This is a guest post from Alex Engler. Alex is a Data Scientist at the Urban Institute and teaches Data Science and Data Visualization classes at Georgetown University and Johns Hopkins University. Follow him on Twitter @AlexCEngler.
If you haven’t come around yet, it’s past time: Data ethics is really important.
A quick glance at recent ethical dilemmas is telling. Troubling instances of the mosaic effect — in which different anonymized datasets are combined to reveal unintended details — include the tracking of celebrity cab trips and the identification of Netflix user profiles. It is also difficult to remain unconcerned with the tremendous influence wielded by corporations and their massive data stores, most notoriously embodied by Facebook’s secret psychological experiments. And new issues are emerging all the time. I dare you to read MIT’s recent article on why we must train self-driving cars to kill without letting out a disquieted “Huh.”
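To make the mosaic effect concrete, here is a minimal sketch in Python. All of the records below are invented for illustration: the point is that neither dataset identifies anyone on its own, but a simple join on shared quasi-identifiers (time and place) does.

```python
# Hypothetical illustration of the mosaic effect. The "anonymized" trip
# records carry no names, and the public sightings carry no trip IDs --
# but joining the two on pickup location and time re-identifies a rider.
# All data here is invented for the example.

anonymized_trips = [
    {"trip_id": 1, "pickup": "W 44th St", "time": "2014-03-06 23:10"},
    {"trip_id": 2, "pickup": "5th Ave",   "time": "2014-03-07 01:45"},
]

# Public information, e.g. a timestamped photo of a celebrity hailing a cab.
public_sightings = [
    {"person": "Celebrity A", "location": "W 44th St", "time": "2014-03-06 23:10"},
]

def reidentify(trips, sightings):
    """Link 'anonymous' trips to named people via shared time and place."""
    matches = []
    for trip in trips:
        for s in sightings:
            if trip["pickup"] == s["location"] and trip["time"] == s["time"]:
                matches.append((s["person"], trip["trip_id"]))
    return matches

print(reidentify(anonymized_trips, public_sightings))
# One match: Celebrity A is now tied to trip 1, and everything else
# recorded about that trip.
```

The real celebrity-cab case worked on essentially this logic, just with a much larger dataset and weakly hashed taxi medallion numbers.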
Given these developments, I was elated to see Data Science DC — graciously hosted by the Pew Research Center’s new data science team — put together a fantastic event on socially responsible data science. The two speakers, Mike Williams and Lisa Singh, offered compelling and complementary perspectives on how critical and difficult the modern ethical considerations of data are.
Mike Williams is a research engineer at Fast Forward Labs with a PhD in astrophysics from Oxford University (he professes his PhD is ‘no big deal’). Dr. Williams argued that supervised learning can, and often does, perpetuate systemic biases. His illustration of this concept is persuasive, so allow me to paraphrase:
A common service offered by social media analytics companies is the identification of dissatisfied consumers using Twitter and similar sites. These customers can then be targeted and placated with special treatment, aiming to reduce rates of customer loss, or churn. In order to identify dissatisfied customers, companies build algorithmic sentiment analysis systems that work best on clear and unambiguous statements of negativity about the relevant product or service.
Ok, what’s the problem? Well, it turns out that this does not treat all customers equally, as men are far more likely than women to express strong negative statements on the Internet. And thus without any specific intent of the data scientists involved, a new algorithm is deployed that disproportionately benefits men. For more, see Mike’s slides.
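A toy sketch makes the mechanism clear. The word list, tweets, and classifier below are all invented; real sentiment systems are far more sophisticated, but the failure mode is the same: a model tuned to strong, unambiguous negativity rewards the people who complain bluntly.

```python
# Minimal sketch (invented data, toy classifier) of how a churn model
# tuned to clear negative statements can benefit one group over another.
import re

# A crude "strong negativity" lexicon -- hypothetical, for illustration only.
STRONG_NEGATIVE = {"terrible", "awful", "worst", "hate"}

def flag_for_retention_offer(tweet):
    """Flag a customer for special treatment if their tweet is bluntly negative."""
    words = set(re.findall(r"[a-z]+", tweet.lower()))
    return bool(words & STRONG_NEGATIVE)

tweets = [
    ("customer_1", "this service is the worst, I hate it"),  # blunt complaint
    ("customer_2", "hmm, not entirely happy with my plan"),  # hedged complaint
]

flagged = [cid for cid, text in tweets if flag_for_retention_offer(text)]
print(flagged)  # only the blunt complainer is offered the discount
```

If one demographic group systematically phrases dissatisfaction more bluntly than another, the discounts flow disproportionately to that group, even though no one intended any such policy.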
While this effect might be negligible in any single application (say, a discount from an online retailer), across hundreds or thousands of algorithms the aggregate effect can become quite significant. It is also easy to imagine how algorithms that determine creditworthiness or predict employee performance might have far more dire consequences.
Lisa Singh, Director of Graduate Studies in the Computer Science Department at Georgetown University, added another dimension of data ethics to consider. Her talk focused on how public online personal information is exposed and easily aggregated across multiple websites. Using only a few known variables (like name and city of residence), relatively simple web scraping and analytical methods can probabilistically combine information about individuals from Facebook, Twitter, LinkedIn, Foursquare, Instagram, etc. It is too easy, says Dr. Singh, to learn valuable personal information using even the most straightforward approaches.
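The aggregation step can be sketched in a few lines. The profiles, weights, and threshold below are all hypothetical, and real linkage systems use richer features, but the core idea is just fuzzy matching on a handful of known variables:

```python
# Hedged sketch of cross-site profile linkage: given a known name and city,
# score candidate profiles scraped from other sites and keep likely matches.
# The profiles, weights (0.6/0.4), and threshold are invented for illustration.
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1] via difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_profiles(known, candidates, threshold=0.75):
    """Return candidate profiles whose weighted name+city score clears threshold."""
    linked = []
    for c in candidates:
        score = (0.6 * similarity(known["name"], c["name"])
                 + 0.4 * similarity(known["city"], c["city"]))
        if score >= threshold:
            linked.append((c["site"], c["handle"], round(score, 2)))
    return linked

known = {"name": "Jane Q. Example", "city": "Washington, DC"}
candidates = [
    {"site": "twitter",  "name": "Jane Example", "city": "Washington DC", "handle": "@janeq"},
    {"site": "linkedin", "name": "John Sample",  "city": "Boston, MA",    "handle": "jsample"},
]

print(link_profiles(known, candidates))
```

Run over a handful of social networks, a loop like this is enough to stitch a worryingly complete dossier out of individually innocuous public profiles.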
It may be tempting to shake off this concern, since this is information we make public more or less consciously. But even if shockingly well-targeted advertisements don’t worry you, identity theft and financial fraud should give you pause. You can find links to her papers on this subject on her website — oh, and I should mention on her behalf that Georgetown’s Computer Science graduate program is expanding (bias disclaimer: I teach in the public policy school at Georgetown).
During his talk, Dr. Williams reminded us that data science methods can be deeply opaque by design: we predict the future with historical data that is inevitably biased in some way, our models are then applied at a scale we can’t monitor, and their output is often interpreted by people who understand them even less than we do. Speaking at the recent Data Science Education Conference, White House Chief Data Scientist DJ Patil argued that data scientists are “force multipliers,” with the decisions we make echoed through hundreds or even thousands of actors.
He’s right, and we as a community need to more responsibly own up to our influence.