This is a guest post from Alex C. Engler and was originally posted on the Urban Institute's Urban Wire blog. It is reposted here with permission. Alex is a Statistical Programmer at the Urban Institute and teaches Data Science and Data Visualization classes at Georgetown University and Johns Hopkins University. Follow him on Twitter @AlexCEngler.
For the past century, social science research has carried on with staid and tested research methods. We analyze surveys, government administrative data, and the occasional randomized controlled trial. Certainly, the computer age has helped simplify and facilitate our work, but policy research does not seem to have discovered what those in the data science community will tell you: we are in the midst of a data revolution.
Last week, the Urban Institute hosted a discussion on the evolving landscape of data and the potential impact on social science. “Machine Learning in a Data-Driven World” covered a wide range of important issues around data science in academic research and their real policy applications. Above all else, one critical narrative emerged:
Data are changing, and we can use these data for social good, but only if we are willing to adapt to new tools and emerging methods.
Data are changing
Boeing’s new 787 Dreamliner produces half a terabyte of data during every flight. Just imagine the smart city of the near future, where there is similarly granular data uploading from every building, bus, taxi, parking meter, and even trash can. This is what we can expect as more data originate from the sensors and wireless devices that make up the emerging “internet of things.” And more ambitious approaches are becoming increasingly feasible, like Toronto’s application of satellite imaging data to analyze its green spaces and New York City’s use of laser flyovers to create a 3D scan of the entire city.
Additionally, the creation of unstructured forms of data like text, audio, images, and video now vastly outweighs the traditional tabular variety. Social media content, written medical records, wearable fitness and health devices, cameras at traffic lights and on police cars or officers—all of these can offer insight into important policy questions.
Using data science to make lives better
New York City locates illegal building conversions that can lead to more dangerous fire hazards. Chicago proactively targets homes with young children and lead paint, rather than waiting for symptoms to appear. A data science competition in Boston aims to use Yelp reviews to better guide restaurant health inspections.
The federal government is getting involved as well, using machine learning to fight tax fraud, streamline pharmaceutical evaluations, and predict hospital readmission for sick veterans. Internationally, data science has been able to forecast episodes of civil unrest and detect outbreaks of health crises like Ebola.
These innovative applications are made possible through an expanded view of research tools, since many common methods of policy evaluation are not suited for this new world of data.
Embracing new methods and tools
Video, image, and audio data need to be interpreted with machine learning algorithms. Natural language processing is necessary to extract useful meaning from increasingly huge volumes of text. Networked data sources, whether from systems of sensors or social networks, require analytics rooted in graph theory. These data science methods have been generally eschewed by social scientists, but this is no longer a practical strategy.
According to Constantine Kontokosta, deputy director for academics at New York University’s Center for Urban Science and Progress, advances in data collection may allow for dramatically faster policy evaluation cycles through methods like A/B testing, instead of waiting for the results of multi-year surveys.
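For readers who have not run an A/B test, here is a minimal sketch of the arithmetic behind one, written in Python using only the standard library. It assumes the simplest case: two randomly assigned groups and a yes/no outcome, compared with a two-proportion z-test. The function name and the enrollment numbers are hypothetical, made up purely for illustration; they do not come from the event or from any real program.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (difference in rates, z statistic, p-value).
    Assumes both groups are non-empty and the pooled rate is not 0 or 1.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Hypothetical example: version A of an outreach letter led 120 of 1,000
# recipients to enroll; version B led 150 of 1,000 to enroll.
diff, z, p_value = two_proportion_z_test(120, 1000, 150, 1000)
print(f"difference in enrollment rate: {diff:.3f}")
print(f"z statistic: {z:.2f}, two-sided p-value: {p_value:.3f}")
```

A real evaluation would also need a sample size chosen in advance and a check that assignment to the two versions was actually random, but the core comparison is no more complicated than this.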
There are implications for the tools we use as well. Proprietary statistical programs like SAS and Stata are struggling to keep up with the rapid pace of change, while open source statistical languages like Python and R are thriving, with millions of users and thousands of active contributors.
Engaging with all data
In the past few years, universities have opened new centers for research and teaching that meld policy and data science. In the federal arena, the White House recently hired its inaugural chief data scientist, and 40 percent of major federal agencies now have a chief data officer.
These are encouraging developments, but there is more work to be done. Social scientists need to get more involved with expanding, integrating, and opening city databases, on which much of our future research will rely. Researchers should work to create partnerships with private companies, into whose hands the overwhelming majority of data are going. We must also engage with statistical methodologists and computer scientists to ensure that the next generation of big data algorithms can be applied not just for prediction, but also to learn causal effects.
More than anything else, policy researchers need to start experimenting with data science. By breaking out of the inertia of tradition, we can not only deepen our understanding of society and governance, but also leverage data to directly improve lives.
If you’re looking for ways to engage or opportunities to get started, consider taking some free online courses through Johns Hopkins University’s Data Science Specialization, or getting involved with your local data science community.
As an organization, the Urban Institute does not take positions on issues. Scholars are independent and empowered to share their evidence-based views and recommendations shaped by research.