government data

DIDC Lean Data Product Development with the US Census Bureau - Debrief and Video

Thank you

I want to thank everyone for attending DIDC's May Meetup event, Lean Data Product Development with the US Census Bureau. This was our first attempt at bringing potential data product needs to our audience and, based on the feedback we received, it will not be our last. That said, we would love your thoughts on how we could improve future events like this one.

I also want to add a massive thanks not only to our in-person and online panelists, but also to Logan Powell, who was a major force in organizing this event and who also served as emcee, guiding the conversation.

Video of the Event

If you missed it, a video of the panel and event is available here:

https://www.youtube.com/watch?v=bWWbk5E1Jzg

Information Resources

Finally, below are some follow-up information links for those interested.

From Judith K. Johnson, Lead Librarian SBDCNet

From Sara Schnadt

November Data Science DC Event Review: Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

This is a guest post from Data Science DC member and quantitative political scientist David J. Elkind. At the November Data Science DC Meetup, Nathan Danneman, an Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman's talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors' incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I'll address each of these in turn.

Analytical Task

Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)

Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.

But what do smugglers look like? Unfortunately, Nathan's data set did not itself include any examples of watercraft that had been unambiguously identified as smugglers by, e.g., the US Navy, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he had to puzzle out how to leverage the information contained in a craft's spatial location and movements.

I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.

Importantly, framing the research in this way rests the validity of the inferences entirely on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons that ships might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don't coincide with shipping lanes, and so look similar to the hypothesized smugglers.

Outline of Approach

The basic approach can be split into three parts: partitioning the strait into many grid squares, generating fake boats to compare against the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude and longitude) to differentiate the real boats from the fake ones.

Partitioning the strait into grid squares helps emphasize the local character of ships' movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on headings taking them along that channel. Fake boats, generated with a bivariate uniform distribution over the grid square, will tend not to fall in the path of ordinary traffic, just like the hypothesized behavior of smugglers. The same goes for the uniformly-distributed timestamps and otherwise randomly-assigned boat attributes of the comparison set: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behavior will advance the goal of differentiating between smugglers and ordinary traffic.

Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I'm loose with the particulars – but in this case it refers to the notion that there are two classes of data points, one drawn i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density. Pooling both samples gives one a mixture of the two densities; the problem then becomes one of comparing the relative densities of the two classes of data points – that is, this problem is actually a restatement of the problem of logistic regression! Additional details can be found in Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
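To make the unsupervised-as-supervised idea concrete, here is a minimal sketch in Python. This is not Dr. Danneman's code: the feature values, grid bounds, and uniform sampling scheme below are assumptions made up for illustration, following the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Toy stand-in for real AIS observations within one grid square:
# speed, heading, latitude, longitude, clustered around a "shipping lane".
real = np.column_stack([
    rng.normal(12, 2, n),        # speed (knots)
    rng.normal(90, 10, n),       # heading (degrees)
    rng.normal(26.5, 0.05, n),   # latitude
    rng.normal(56.25, 0.05, n),  # longitude
])

# "Fake" boats: attributes drawn uniformly over the same grid square,
# so they ignore the structure of ordinary traffic.
fake = np.column_stack([
    rng.uniform(0, 25, n),
    rng.uniform(0, 360, n),
    rng.uniform(26.3, 26.7, n),
    rng.uniform(56.0, 56.5, n),
])

X = np.vstack([real, fake])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = real, 0 = simulated

model = LogisticRegression(max_iter=1000).fit(X, y)

# Real boats the model scores as barely more "real" than the uniform noise
# are the candidate outliers.
scores = model.predict_proba(real)[:, 1]
candidate_outliers = np.argsort(scores)[:10]
```

In the real analysis this step would be repeated per grid square, so that "ordinary" is defined locally rather than for the strait as a whole.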

Presentation of Findings

After fitting the model, we can examine which of the real boats the model rated as having low odds of being real – that is, boats which looked so similar to the randomly-generated boats that the model had difficulty differentiating the two. These are the boats that we might call "outliers," and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, they are the ones more likely to be engaged in smuggling.

I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportions roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression model is fixed to reflect the ratio of positive cases to negative cases in the data set. All of the terms in the model, including the intercept, are still maximum likelihood estimates, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to replace the intercept term with one based on knowledge of the overall ratio of positives to negatives in the population.
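For readers who want to see what that replacement looks like, here is a small sketch of the standard prior-correction for case-control sampling; the "population" rate plugged in below is purely illustrative, since the real prevalence of unusual craft is unknown.

```python
import numpy as np

def corrected_intercept(fitted_intercept, sample_rate, population_rate):
    """Shift a logistic regression intercept from the case-control
    sampling ratio to an assumed population base rate."""
    sample_offset = np.log(sample_rate / (1 - sample_rate))
    population_offset = np.log(population_rate / (1 - population_rate))
    return fitted_intercept - sample_offset + population_offset

# One fake boat per real boat gives a 50/50 sample; suppose, purely for
# illustration, that the class of interest is 1% of real-world traffic.
b0 = corrected_intercept(fitted_intercept=0.3,
                         sample_rate=0.5,
                         population_rate=0.01)
```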

In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each of the ships (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that any applied effort would leverage this information – were it available – and model it appropriately; however, as befits a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.

Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.

The presentation and code may be found on Nathan Danneman's web site. The audio for the presentation is also available.

DIDC MeetUp Review - The US Census Bureau Pushes Data

Data Community DC is excited to welcome Andrea to our roster of bloggers. Andrea's impressive bio is below, and she will be bringing energy, ideas, and enthusiasm to the Data Innovation DC organizational team.

Census Data is cool?

At least that's what everyone discovered at last night's Data Innovation DC MeetUp. The U.S. Census Bureau came in to "reverse pitch" their petabytes of data to a group of developers, data scientists, and data-preneurs at Cooley LLP in downtown DC.

First off, let's offer a massive thanks to the US Census Bureau, which sent five of its best and brightest to come engage the community long into the evening and late-night hours. Who specifically did they send? Just take a look at the impressive list below:

(Image: contact list of the Census Bureau panelists.)

Editor's note - a special thank you to Logan Powell who made this entire event possible.

And they brought the fantastic Jeremy Carbaugh (jcarbaugh [at] sunlightfoundation.com) from the Sunlight Foundation, an organization working on making census data (and other government data) interesting, fun, and mobile. They have this sweet app called Sitegeist. You give it a location and it gives you impressive stats such as the history of the place, how many people are baby making, or just living the bachelor lifestyle; it even connects to Yelp and Wunderground, just in case you need the weather and a place to grab a brewski while you're at it. Further, Eric at the Census Bureau made a great point for everyone out there in real estate: you can use this app to show potential buyers how the demographics in the area have changed, good school districts, income levels, number of children per household, etc. You know you'll look good whipping out that tablet and showing them ;)

By the way, Sunlight created a very convenient Python wrapper for the Census API; you can pip it off of PyPI and check out the source on GitHub here (a round of applause for our Sunlight folks!). Did I mention that they are a non-profit doing this with far less funding than many others out there?
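As a quick taste, here is a minimal sketch of the wrapper in use. The API key is a placeholder and the ACS variable code is just an example I picked, so check the project's README and the Census variable listings for the fields you actually need.

```python
from census import Census  # Sunlight's wrapper: pip install census

c = Census("YOUR_CENSUS_API_KEY")  # free key from the Census developer site

# Total population (ACS 5-year variable B01003_001E) for every state.
rows = c.acs5.get(("NAME", "B01003_001E"), {"for": "state:*"})
for row in rows[:5]:
    print(row["NAME"], row["B01003_001E"])
```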

Sitegeist is nice, but exactly how accessible is the Census data? I am glad you asked. The Census Bureau offers two approaches, American FactFinder and an API, both easy to use. FactFinder is good for perusing what you may find interesting before actually grabbing the data for yourself. The API is like the Twitter version 1 API: you get a key and use stateless HTTP GET requests to pull the data via the web. For those non-API folks, I'll be posting a how-to shortly.
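Until that how-to appears, here is the general shape of such a request using Python's requests library. The dataset path and variable names are illustrative examples rather than anything specific the panel demoed; the Census API documentation lists the real endpoints and variables.

```python
import requests

API_KEY = "YOUR_CENSUS_API_KEY"

# Example: total population (P001001) from the 2010 Decennial Census
# summary file, for every state.
url = "https://api.census.gov/data/2010/dec/sf1"
params = {"get": "P001001,NAME", "for": "state:*", "key": API_KEY}

rows = requests.get(url, params=params).json()
header, data = rows[0], rows[1:]  # the first row holds the column names
print(header)
print(data[:3])
```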

The Census Bureau also has its own fun mobile app called America's Economy.

Alright, so we've got some data and some ways to get it, but what's up with the reverse pitch thing? This was the best part, as everyone had awesome ideas.

Some questions included:

Can we blend WorldBank and Federal Reserve Bank data to get meaningful results?

This came from a guy who was already building some nice apps around WB and Fed data. The general consensus was "yes," a lot of business value can come from that, but they need folks like us to come up with use-cases. So, thoughts? Please comment and tinker away.

What about the geospatial aspects of the data?

There were a lot of questions about the GIS mapping data and some problems with drilling down on the geospatial data to block sizes or very small lots of land. People seem really interested in getting this data for things like understanding how diseases spread, patterns of migration, etc. The Census folks said that with the longer-term surveys you can definitely get down to the block level but, because boundaries and borders can be defined differently across the nation, it is very difficult to normalize the data. Another use-case? A herculean effort? Hmm... food for thought. Also, shortly after the event, someone posted this on geo-normalization in Japan. Thanks Logan!

Editor's note: More information on US Census Grids can be found here.

How does Census data help commercial companies?

There was a great established use case where the Census Bureau helped the retailer Target understand its demographics. That blew me away. The gov't and a private retail company working together to make a better profit, a better product? This definitely got my creative juices flowing; hopefully it will get everyone out there cogitating too.

https://www.youtube.com/watch?v=jgsdQxTv5kY

or, check out this case study from the National Association of Homebuilders:

http://www.youtube.com/watch?v=CBDmE5Nj0BY

and last but not least, an example of Census data helping disaster relief (not really commercial but Logan didn't get a chance to show all of his videos):

http://www.youtube.com/watch?v=PaEu8-xH9LE

We finally had people talking about the importance of longitudinal studies.

What is different now for our nation in terms of demographics, culture, and geography from 20, 30, or 50 years ago? Just imagine some really cool heat map or time series visualization of how Central Park in NY or Rock Creek in DC has changed… yes, I am saying this so someone actually goes out and gives that one a go. Don't worry, you can take the credit ;)

Oh, and I almost forgot: due to obvious privacy issues, a lot of the data is pre-processed, so you can't stalk your ex-boss/boyfriend/girlfriend. But listen up! If you are in school, doing research, and want to get your hands on the microdata, you can apply. Go to this link and check it out (http://www.census.gov/ces/rdcresearch/howtoapply.html). For those of you stuck on a thesis topic in any domain that may need information about society... cough cough, nudge nudge.

So there you have it; these are the kinds of meetups happening at Data Innovation DC. I don't know about you, but I definitely have a new perspective on government data. I also feel a little more inclined to open my door when those Census folks drop by, and to give them real answers.

Please comment as you see fit and send me questions. Also, JOIN Data Innovation DC and check out Data Community DC and all of the other related data meetup groups. Let us know what kind of information you want to know about and what issues/topics you want us to address.

I’m new to the blog/review game but will continue to review meetups and some hot topics, podcasts etc. that I think need to be checked out. Let me know if you want me to speak to anything in particular.

Why Aren't There More Open Data Startups?

This post is a guest reblog (with permission; originally published 1/19/2011) by Tom Lee, the Director of Sunlight Labs and recent speaker at Data Innovation DC. It's a question I'm seeing asked more and more: by press, by Gov 2.0 advocates, and by the online public. Those of us excited by the possibilities of open data have promised great things. So why is BrightScope the only government data startup that anyone seems to talk about? I think it's important that those of us who value open data be ready with an answer to this question. But part of that answer needs to address the misperceptions built into the query itself.

There Are Lots of Open Data Businesses

BrightScope is a wonderful example of a business that sells services built in part on publicly available data. They've gotten a lot of attention because they started up after the Open Government Directive, after data.gov -- after Gov 2.0 in general -- and can therefore be pointed to as a validation of that movement.

But, if we want to validate the idea of public sector information (PSI) being a useful foundation for businesses in general, we can expand our scope considerably. And if we do, it's easy to find companies that are built on government data: there are databases of legal decisions, databases of patent information, Medicare data, resellers of weather data, business intelligence services that rely in part on SEC data, GIS products derived from Census data, and many others.

Some of these should probably be free, open, and much less profitable than they currently are*. But all of them are examples of how genuinely possible it is to make money off of government data. It's not all that surprising that many of the most profitable uses of PSI emerged before anyone started talking about open data's business potential. That's just the magic of capitalism! This stuff was useful, and so people found it and commercialized it. The profit motive meant that nobody had to wait around for people like me to start talking about open formats and APIs. There are no doubt still efficiencies to be gained in improving and opening these systems, but let's not be shocked if a lot of the low-hanging commercial fruit turns out to have already been picked.

Still, surely there are more opportunities out there. A lot of new government data is being opened up. Some of it must be valuable... right?

Government Does What The Market Won't

Well, sure. Much of it is extremely valuable. But it may not be valuable to entrepreneurs. To understand why, we need to get a little philosophical. What does government do? It provides public goods: things of value that the market is not able to adequately supply on its own. A standing army and public schools and well-policed streets and clean water are all things that are useful to society as a whole, but which the market can't be relied upon to provide automatically. So we organize government as a structure that can provide those kinds of things, and which will make sure that everyone can benefit from them in a way that's fair.

These are not ideal conditions under which to start a business: the fact that the government is the one collecting a particular type of data may mean that no one is interested in buying it -- a natural market for the data doesn't exist in the way that it does for, say, sports scores or stats about television viewership. And, even if you create a business that takes advantage of the subsidy represented by government involvement (data collected at taxpayer expense, resold at low, low prices!), your long-term prospects may still be poor since there's no way to deny competitors access to the same subsidy**. Someone else can come along and undercut you, and there's nothing you can do about it except be better and cheaper. That's great for the consumer, but not so great for people hoping to start a lucrative business. (Those who think BrightScope is a counterexample should have a closer look at their about page: they utilize a mix of public data, data that they laboriously capture themselves, and data bought from subscription services.)

Data's Real Value Can Be Hard To Measure

I'll be glad to see more open data startups -- and to be clear, I think we will see more. But the open data movement will be important regardless of whether any IPOs come out of it.

There are lots of types of value that are difficult to measure. If the IRS puts forms online, taxpayers have to spend less time waiting in line at the post office. If Census data reveals where a retailer's new store should go, it can mean profits for shareholders and more jobs for the community. If scientific data's openness allows more researchers to engage with a question, it can lead to better conclusions, better policies and better outcomes. If regulatory data about companies is public, it can give firms an incentive to self-police and help markets price things correctly.

All of these are real benefits, but they can be difficult or impossible to calculate -- and tough for a startup to monetize. Still, this is where I think the really exciting benefits to open data are likely to be found. If government data helps entrepreneurs make money, that's great. If it makes our country work better, that's fantastic.

* Historically, many gov data vendors have made money off of the data's artificial scarcity -- a legacy that we must unravel, even though doing so will be politically difficult: openness's benefits to the public will probably mean less revenue for the vendors.

** There shouldn't be, anyway -- in practice, public/private partnerships often fall short of this goal.

Introducing the Congress App for iOS from the Sunlight Foundation

This post is reblogged with permission from the Sunlight Foundation and the original can be found here. We at Data Community DC are always looking to highlight local innovations in data including software, apps, data sets, infrastructure, databases, startups, algorithms, and more. If you would like to garner publicity for your efforts, please contact me, Sean Murphy, at SeanM@DataCommunityDC.org. As Congress returns for their July session, the Sunlight Foundation is excited to announce our free Congress app for iOS devices that allows anyone to get the latest from Washington. Download it here. Now it is easy to learn more about your member of Congress, contact them directly, and see their activities right from your phone. Follow the latest legislation and floor activity, and even get a breakdown of votes with just a swipe and a tap. The new Congress app for iOS has many more features in development and complements the immensely popular version for Android.

When you launch the app, you'll go right to a feed of the latest activity. Tap on a piece of legislation and you can see the summary information, sponsorship details, movement through committees, votes and links to the full text. Easily swipe to the left to access the menu of other features. Under the legislators tab, you can quickly browse the list of members sorted by state, chamber or even tap the location icon to see who represents your current spot or wherever you drop a pin. There is no shame in endlessly dropping new pins to discover the interesting shapes of congressional districts.

A screenshot of Rep. Ann Eshoo's profile from the Sunlight Foundation's new Congress app for iOS devices.

The Congress app has detailed and up-to-date information for every member of Congress. Quickly see their picture, get directions to their DC office, find links to their website and social media, see a map of their district, and tap a button to call them directly. You can also see the bills they sponsored and their voting record. Star any legislator or bill to quickly access them through the "Following" section. From there you can easily see the latest activity on the bills you follow and, if you're looking at a vote breakdown, the legislators you follow will appear at the top.

In future releases we will have push notifications for actions related to what you're following as well as a new section for committee listings, calendars, floor updates and much more. Stay tuned for the latest updates by following the Congress app Twitter account here. Like all of Sunlight's work, the Congress app is open source with the code available on GitHub. The app uses data from official sources through the Sunlight Congress API and the beautiful maps are powered by MapBox. Please email us with any feedback.
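For developers curious about the data layer underneath the app, here is a rough sketch of hitting the Sunlight Congress API directly. The endpoint, parameters, and field names are my recollection of the public documentation and should be verified there; the key is a placeholder you would request from Sunlight.

```python
import requests

API_KEY = "YOUR_SUNLIGHT_API_KEY"
BASE = "https://congress.api.sunlightfoundation.com"

# Look up legislators for a point, roughly what the app's location icon does.
resp = requests.get(
    f"{BASE}/legislators/locate",
    params={"latitude": 38.9072, "longitude": -77.0369, "apikey": API_KEY},
)
for member in resp.json().get("results", []):
    print(member["bioguide_id"], member["last_name"], member["chamber"])
```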

https://www.youtube.com/watch?feature=player_embedded&v=MIH8DWNNyJs