The Value of the Data Scientist - A Data Science DC Event Review

Earlier this month the Data Science DC Meetup hosted researchers and staff from the Sunlight Foundation for a presentation on bringing clarity to flat text data using advanced analysis and data visualization. A wealth of raw, unstructured data is available to the public and obtainable via the methods outlined under the Freedom of Information Act (FOIA). This data includes disclosure forms on lobbying activities before the Federal Government by private enterprises in the United States, as well as comprehensive summaries of pending, passed, and rejected legislation before the United States Congress.

Lee Drutman, a Senior Fellow at Sunlight, and his colleagues took on the project of pulling a subset of this data out of its flat, unstructured form, and transforming it into meaningful content, illustrating the relationships between lobbying activity by industry sector and various immigration reform bills. In this way, the team at Sunlight has provided “a comprehensive and interactive guide to the web of interests with something at stake”, detailing “who these interests are, what they care about, and how intensely they are likely to lobby to get what they want” (1). They’ve named their project Untangling the Webs of Immigration Reform.

Lee opened the session by describing the underlying unstructured data the team had to work with: how it is generated, how it is collected, its total volume, and other descriptive measures. Given the volume and complexity of the data, Lee’s team decided a network analysis would best bring it to life.

Before we can detail the relationships between lobbying activities by sector and immigration reform bills, we first need to understand the relationships among the immigration bills themselves. Zander Furnas, Research Fellow at Sunlight, described how the team used Latent Semantic Analysis to identify similarities between bills along the lines of immigration reform subtopics. He and his team represented each bill as a vector, using the bill summaries (the actual bills are too long for practical use) to construct the corpus. Once each bill was represented as a vector, the team compared the vectors for similarity. The comparison produced a similarity matrix, which served as the input to a specific type of clustering called Hierarchical Agglomerative Clustering (HAC). The team used Ward’s method as the linkage criterion to obtain more uniform clusters.
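To make the pipeline concrete, here is a minimal sketch of Latent Semantic Analysis followed by Ward-linkage HAC. It is not Sunlight's code: the four bill summaries are invented placeholders, the raw word counts, the choice of two latent dimensions, and the vector normalisation are all assumptions, and NumPy and SciPy simply stand in for whatever tools the team actually used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-ins for bill summaries (not real Sunlight data).
summaries = [
    "visa cap increase skilled workers",
    "skilled workers visa program expansion",
    "border security fence funding",
    "funding border security patrol agents",
]

# Document-term count matrix: one row per summary, one column per word.
vocab = sorted({word for s in summaries for word in s.split()})
counts = np.array(
    [[s.split().count(word) for word in vocab] for s in summaries],
    dtype=float,
)

# LSA: a truncated SVD keeps only the k strongest latent dimensions.
k = 2  # assumed for this toy corpus
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = U[:, :k] * S[:k]

# Assumption: scale each vector to unit length so that Euclidean
# distance between vectors reflects the angle between them.
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# Hierarchical agglomerative clustering with Ward's linkage criterion.
Z = linkage(doc_vectors, method="ward")

# Cut the resulting tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The linkage matrix `Z` is the data behind a dendrogram; `scipy.cluster.hierarchy.dendrogram(Z)` would draw it. In this toy corpus the two visa summaries fall into one cluster and the two border summaries into the other.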

The output produced by the HAC analysis was a dendrogram, which was then combined with the data on lobbying activities using a network diagram. Zander described how the Sunlight team configured the network diagram as a bipartite graph (two types of nodes), in which the first type of node represents industry sectors and the second type represents individual bills. The edges run from the first type of node to the second, i.e. from an industry sector to a bill it lobbied on. The edge weight is determined by the quantity of lobbying activity, measured as the number of lobbying reports. This basic network diagram, while technically accurate, produces a confusing and unappealing visualization, as shown below (2).

Network Graph, first run, unfiltered. Source: Sunlight Foundation.
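The bipartite structure described above can be sketched in a few lines. The sector names, bill names, and report list here are hypothetical; the only idea taken from the talk is that an edge links a sector to a bill and its weight is the number of lobbying reports.

```python
from collections import Counter

# Hypothetical lobbying reports, one (sector, bill) pair per report.
reports = [
    ("Agriculture", "Bill A"),
    ("Agriculture", "Bill A"),
    ("Technology", "Bill A"),
    ("Technology", "Bill B"),
    ("Technology", "Bill B"),
    ("Technology", "Bill B"),
]

# Edge weight = number of reports a sector filed on a bill.
edge_weights = Counter(reports)

# The two node types of the bipartite graph.
sectors = {sector for sector, _ in edge_weights}
bills = {bill for _, bill in edge_weights}

for (sector, bill), weight in sorted(edge_weights.items()):
    print(f"{sector} -> {bill} (weight {weight})")
```

Because every edge joins a sector to a bill and never two nodes of the same type, the graph is bipartite by construction.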

Zander detailed the filtering methods the team used to streamline the data represented in the network diagram, including the exclusion of sectors that fell below a minimum threshold of lobbying activity. This exclusion resulted in a K-Core sub-graph of the original, with a degeneracy of three. The graph produced by K-Core filtration was cleaner but still visually cluttered. The group then turned to their toolbox of layout algorithms to rearrange the data into a simpler, more meaningful visualization. Using the OpenOrd layout provided by Gephi (the open-source visualization tool of choice for the Sunlight team), Zander was able to produce a visualization with tighter clusters. He refined the network graph further using weighted nodes: sectors with more lobbying activity are drawn larger than those with less, and bills that were lobbied on more are drawn larger than those lobbied on less.
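For readers unfamiliar with the term, the following sketch shows the textbook definition of a k-core: repeatedly drop every node with fewer than k neighbours until none remain. It prunes by unweighted degree, which is an assumption; the talk did not specify exactly how the team's lobbying threshold was applied. The function name `k_core` and the toy graph are hypothetical.

```python
def k_core(adjacency, k):
    """Return the sub-graph in which every remaining node has degree >= k.

    `adjacency` maps each node to the set of its neighbours and must be
    symmetric (if a lists b, then b lists a).
    """
    adj = {node: set(nbrs) for node, nbrs in adjacency.items()}
    changed = True
    while changed:
        changed = False
        too_small = [node for node, nbrs in adj.items() if len(nbrs) < k]
        for node in too_small:
            # Remove the node and detach it from its remaining neighbours.
            for nbr in adj.pop(node):
                adj[nbr].discard(node)
            changed = True
    return adj


# Toy graph: three sectors each lobbying on three bills, plus one
# sector (S4) that lobbied on a single bill.
edges = [(s, b) for s in ("S1", "S2", "S3") for b in ("B1", "B2", "B3")]
edges.append(("S4", "B1"))

adjacency = {}
for sector, bill in edges:
    adjacency.setdefault(sector, set()).add(bill)
    adjacency.setdefault(bill, set()).add(sector)

core = k_core(adjacency, 3)
print(sorted(core))
```

Here S4 has only one connection, so it drops out of the 3-core, while the six densely connected nodes survive. Gephi ships a K-core filter that performs this kind of pruning interactively.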

Amy Cesal, graphic designer at the Sunlight Foundation, joined the project at that stage at Zander’s invitation. She was tasked with applying her visual design skills to beautify the final output of Zander’s network graph. She walked the audience through her decision-making process with regard to color palette, flat graphs, cluster isolation and breakdown, and overall design strategy.

The final graph incorporating Amy’s work is shown below (1).

Network Graph, final draft, filtered. Source: Sunlight Foundation.

Following the presentation, the Sunlight team took part in an open and excellent Q&A session with the audience, addressing specific questions about their project and their organization.

The visual analytic work that Lee, Zander, and Amy are conducting at the Sunlight Foundation deserves attention because it illustrates that access to pertinent data is not enough. Every organization today has jumped on the big data bandwagon and begun to amass large volumes of unstructured data. What is equally important is access to skilled data scientists who can pull that unstructured data off the flat page and transform it into a meaningful visualization for concerned audiences. Data scientists who understand the advanced mathematics and work involved in analysis, and who also know how to relate to and present information to a non-technical audience, are extremely valuable in the data science community.

Audio is available for download in mp3 format here.