open data

DIDC Lean Data Product Development with the US Census Bureau - Debrief and Video

Thank you

I want to thank everyone for attending DIDC's May Meetup event, Lean Data Product Development with the US Census Bureau. This was our first attempt at helping bring potential data product needs to our audience and, based on audience feedback, it will not be our last. That said, we would love your thoughts on how we could improve such events in the future.

I want to add a massive thanks not only to our in-person and online panelists, but also to Logan Powell who was a major force in both organizing this event and also acting as the emcee and guiding the conversation.

Video of the Event

If you missed it, a video of the panel and event is available here:

Information Resources

Finally, below are some follow-up information links for those interested.

From Judith K. Johnson, Lead Librarian SBDCNet

From Sara Schnadt

A Rush of Ideas: Kalev Leetaru at Data Science DC

This review of the April Data Science DC Meetup was written by Ross Mohan. Ross is a solutions architect for Five 9 Group.

Perhaps you’ve heard the phrase lately that “software is eating the world”. Well, to be successful at that, it’s going to have to do at least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.

Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work — certainly in the sense of the size and scope of the data. The goal of GDELT and related systems is to stream global news and social media, in as near realtime as possible, through multiple processing steps. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection, and to do so globally. It is a grand goal.

Kalev Leetaru’s talk covered several broad areas: the history of data and communication, data quality and “gotcha” issues in data sourcing and curation, the geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.


Leetaru observed that “the map is not the territory” in the sense that actual voting, resource, or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.

Distinguishing global data science from established HPC-like pursuits (such as computational chemistry), Kalev Leetaru observed that we make our own bespoke toolkits: there is no single “magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchains together.

After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs): How do you find all protests? How do you locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned — and a lot more — can be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.

GDELT (covered further below) labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us the “WHY” and the “how reliable.” One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.

One recurring theme was finding ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data, with more focus on appearance than actual ground truth. One example he cited was a conclusion from a recent report on Syria, which -- blithely based on geotagged English-language tweets and Facebook postings -- cast a skewed light on Syria’s rebels (Bzzzzzt!).


Leetaru provided one answer to “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are, after all, only 3% of the total. Yet he used those tweets reliably. How? By correlating location to electric power availability (r = .89). He also talked about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.

Kalev Leetaru talked engagingly about the geography of Twitter, encouraging us to do more RTFD (D = data) than RTFM: cut your own way through the forest. The valid maps have not been made yet, so be prepared to make your own. Among the challenges he cited: how to break up typical #hashtagswithnowhitespace and put them back into sentences, and how to build — and maintain — sentiment/tone dictionaries. Expect, therefore, to spend the vast majority of time on innovative projects in human-tuning the algorithms, understanding the data, and then iterating the machine. Refreshingly “hands on.”
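One of those challenges, splitting a #hashtagwithnowhitespace back into words, is essentially dictionary-based word segmentation. A minimal sketch (the function, vocabulary, and approach are mine for illustration, not from the talk):

```python
def segment_hashtag(tag, vocab):
    """Split a hashtag like '#datasciencedc' back into words using a
    known-word list (a tiny dynamic-programming word segmenter)."""
    text = tag.lstrip("#").lower()
    n = len(text)
    best = [None] * (n + 1)  # best[i] = a segmentation of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]  # None if the hashtag can't be segmented

vocab = {"data", "science", "dc"}
print(segment_hashtag("#datasciencedc", vocab))  # ['data', 'science', 'dc']
```

In practice the vocabulary itself is the hard part, which is exactly Leetaru's point about spending most of your time on human tuning.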

Scale and Tech Architecture

Kalev Leetaru turned to the scale of the data, which is now easily in the petabytes-per-day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’

For a book images project (60 million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor CEPH, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archive), we need fast language detection and captioning-error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning,” and that CC is far worse in quality than Optical Character Recognition output. The standard Stanford NL Understanding toolkit is very “fragile” in this domain, one reason being that news media has an extremely high density of location references, forcing the analyst to use context to disambiguate.

He covered his GDELT (Global Database of Events, Language, and Tone), which tracks human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, and online media back to 1979, it relies on both human translation and Google Translate, and will soon be extended across languages and back to the 1800s. Further, he is incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014. (Sources include JSTOR, DTIC, CIA, CVORE, CiteSeerX, and IA.)

GDELT’s architecture, which relies heavily on Google Cloud and BigQuery, can stream at 100,000 input observations per second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. The GDELT tools are myriad, but Perl was frequently mentioned (for text processing).

Kalev Leetaru shared some post-GDELT-construction takeaways: “it’s not all English,” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good, but performance can be an issue; watch out for XML, which bogs down processing on highly nested data, and use it for interchange more than anything else. Sharding seems “nice,” but “you can’t shard a graph,” and “RAM disk is your friend,” even more so than SSD, FusionIO, or fast SANs.
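The Unicode warning is worth taking literally: one malformed byte sequence can kill a naive loader hours into a run. A minimal defensive-reading sketch in Python (the function and error policy are my illustration, not from the talk):

```python
def load_lines(path, encoding="utf-8"):
    """Read a text file while surviving malformed byte sequences:
    decode explicitly and replace bad bytes with U+FFFD instead of
    letting a UnicodeDecodeError abort the whole load."""
    with open(path, encoding=encoding, errors="replace") as f:
        return [line.rstrip("\n") for line in f]
```

Whether to replace, skip, or log bad input is a per-pipeline decision; the point is to decide it up front rather than discover it mid-load.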

The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.

DATA Act Passes U.S. House - President Urged to Endorse Landmark Open Data Legislation

This is a reposted community announcement from our friends at the Data Transparency Coalition. The original announcement can be seen here. Contact: Zack Pesavento, 202-420-1065


WASHINGTON, DC -- The Digital Accountability and Transparency Act (DATA Act) passed the United States House of Representatives this evening by a 388-1 vote. The landmark transparency bill (H.R. 2061) would standardize and publish federal spending data online.

"We are hopeful that the Senate will answer this call from the House of Representatives to reap the rewards from greater accountability and  tech-sector innovation that real spending transparency can provide," said Hudson Hollister, the Executive Director of the Data Transparency Coalition. "And President Obama should put the goals of his Open Data Policy into action by publicly endorsing the DATA Act. As Comptroller General Gene Dodaro testified in July, without this legislative mandate, spending transparency won't happen."

The DATA Act would require the Treasury Department to create government-wide data standards for agency financial reports, payments, budget actions, contract reporting, and grant reporting; direct agencies to use those data standards; and mandate that the information be published online. Once it is fully implemented, the DATA Act will be the most significant federal transparency reform since President Johnson signed the Freedom of Information Act in 1966.

"The American people deserve a functioning government that is both open and accountable," said Majority Leader Rep. Eric Cantor (R-VA) on the House floor. "The DATA Act is an important step to achieving this goal, because it will publish federal spending data and transform it from disconnected documents into open, searchable data for people to see and read through online."

Although key features of the House bill remain unchanged from the original iteration that passed the House Oversight Committee in May, some minor amendments were included. A fiscal offset was added to the bill, which was scored last week by the Congressional Budget Office.

The comprehensive House bill retains an accountability platform that was removed from the upper chamber's companion legislation on November 6, when it passed the Senate Homeland Security and Governmental Affairs Committee. The provision would expand the mandate of the Recovery Accountability and Transparency Board's Recovery Operations Center, which used open data analytics to eliminate potential waste and fraud in stimulus spending, to cover all federal disbursements rather than just stimulus grants and contracts.

The two chambers will need to reconcile the differences between the two bills. The Data Transparency Coalition supports the use of a conference committee to produce a unified bill aligned with the broad bipartisan consensus that has emerged around the House version of the legislation. Earlier this month, the Coalition joined with 24 other organizations from across the political spectrum in signing a public letter endorsing the version of the DATA Act that was originally introduced in the House.

On December 5, the Data Transparency Coalition will hold the first installment of a new breakfast series, presented by PwC, exploring the impact of the DATA Act and similar policies across government. Staffers from the Department of the Treasury's Fiscal Service and the Recovery Accountability and Transparency Board -- the two agencies poised to oversee implementation of the DATA Act -- will lead the discussion at the breakfast, entitled "Open Data: Transforming Federal Management and Accountability."

Members of the media can request complimentary tickets, subject to availability, by contacting


About the Data Transparency Coalition

The Data Transparency Coalition is a trade association that advocates for data reform in the U.S. government. The Coalition brings together technology companies, nonprofit organizations, and individuals to support policies that require federal agencies to publish their data online, using standardized, machine-readable, nonproprietary data standards. The coalition is steered by a board of advisors. Members include sector leaders such as Teradata Corporation, WebFilings, RR Donnelley, and PwC, and smaller start-ups such as Level One Technologies and BrightScope. For more information, visit

OpenGov Voices: PDF Liberation Hackathon - At Sunlight in DC and Around the World - January 17-19, 2014

Marc Joffe is the founder of Public Sector Credit Solutions (PSCS), which applies open data and analytics to rating government bonds. Before starting PSCS, Marc was a Senior Director at Moody’s Analytics. Marc is also one of the winners of Sunlight Foundation’s OpenGov Grants.

Extracting useful information from PDFs is a problem as old as … PDFs. Too often, we focus on extracting information from a specific set of documents instead of looking at the bigger picture. If you’ve ever struggled with this problem, join us for Sunlight’s PDF Liberation Hackathon, dedicated to improving open source tools for PDF extraction.

Instead of focusing on one set of documents, coders will come together to add features, extensions and plugins to existing PDF extraction frameworks, making them more flexible, useful and sustainable. Sunlight’s PDF Liberation Hackathon will tackle real-world PDF data extraction problems. In doing so, we will build upon existing open-source PDF extraction solutions such as Tabula and Ashima’s PDF Table Extractor, built on Poppler. In addition, hackers will have the option of using licensed PDF software libraries as long as the implementation cost of these libraries is less than $1,000. If you have an idea for a library you want to use, please mention it in your signup form and we will try to work out the licensing ahead of time so that things run smoothly.

Smithsonian American Art Museum Hackathon Day 2 !

Hackathon Day 2 - the sequel!!


Welcome back!

Let me just start off by saying thank you, SMITHSONIAN AMERICAN ART MUSEUM, for putting on the best hackathon ever! Granted, it was my first hackathon, but they have set the bar very high. In case you missed it, check out part 1 of this two-part post.

Day 2 commenced with breakfast at 10 am and a continuation of hacking and developing our 2-minute videos and 5-minute presentations to meet the 4:00 pm deadline. You could clearly see many of us had gotten little sleep despite the fact that we got to go home. We all basically went home and continued to ruminate over our beautifully designed ideas.

The buzz in the room began to grow as the deadline neared and everyone started escaping into little nooks and crannies to record their 2-minute video pitches.

At 4:30 pm presentations began, and they were incredible. The link to all the videos will be posted shortly. Here is a quick summary of what, in less than 48 hours, the hackers accomplished:

1. Team Once Upon a Time: A dad and his kids got together and created an incredible video of an interactive museum experience where you would basically tap on items in the museum and get information on them via online sources.

2. Team Muneeb and Sohaib: Two brothers with a mission! Sohaib developed a game simulation of the museum where you could pick a character and immerse yourself in a fantastical experience at the Luce Center. Muneeb developed a phone application where you could call a number and get information on a given art piece. It was like 311 for art!

3. Team Kiosk to the Future: A beautifully designed mobile application where you could take a picture of the art, get information about it, share with your friends, and all other sorts of neat tricks. It was fully functional on their phone, by the way.

4. Team Geosafe: A mobile application where you could learn more about the artists behind the art pieces, like where they lived, and even see where they lived on a map in real time. It had a really nice “touch-screen” feel and view. This was a one-man team!

5. Team Patrick: A great prototype of having a number of very interactive kiosks in the art museum so you could get information on the art you were looking at in that moment and find related artworks nearby. You could walk around, feel welcomed by the space, and always re-orient yourself at the nearest kiosk. Another one-man team.

6. Team Megatherium: A great interactive game idea, where you would select tiles on a screen about art, play a game with questions about those art pieces, compete with other friends, record your score, etc. The best part is that the game didn’t end at the museum: you could go online at home, continue to play, and earn badges. It kept you connected beyond your visit. They also planned a Phase 2 to expand to other museums.

7. Team Diego: Another great game where you would learn about art and also tag it with the term you think best described it. These tags would be fed back into an algorithm to better label the art pieces in the database as well. *Diego was not part of the teams to be judged, since he is a fellow and helped develop the API. He was also one of the judges.

8. Team Art Pass: They developed a card with a QR code that you would walk around with to interact with the space. You would scan it on a mobile device, log in, and also scan it across art pieces. It would record all your activity and save it to your profile. They also mentioned expanding it to other museums, so you could have all your experiences recorded in one place. They even printed a mock-up of the card with the QR code!

9. Team Back Left: Our team developed a website where you would basically learn about art, play games, see exactly where an art piece was via Google Maps (regardless of where you were), develop your own art with a word-tag cloud that you could print at home or at the museum, and more! We wanted to make sure you always felt you were in the museum regardless of where you physically were.

The judges deliberated for half an hour and came back with reasons why all of the ideas were great. Prizes were awarded in the following categories:

Best use of the API: Team Geosafe

Most Franchise-able: Team Art Pass

Most unexpected: Team Muneeb, the visual arts museum game

People's Choice: Team Once Upon a Time

Runner Up: Team Kiosk to the Future

Absolute Favorite: Team Back Left!

At the end we all took a great communal photo, and yes, we managed to crowdsource the solution of fitting 30 people in one photo ;)

We then had the opportunity (which we all took) to videotape small 2-minute pitches on why we think this API should stay open and available for developers to use. I think all of our faces said WHY NOT!!?

There was some interacting, mingling, and more ideation until the museum started pushing us out. The day ended at around 7 pm.

So there you have it: a magnificent time to make friends, help one of our Smithsonian institutions, and refine our hacking abilities.


Until another data adventure!

Signing off,





Smithsonian American Art Museum Hackathon Day 1!

At around 8 am on a chilly Saturday morning, the Smithsonian American Art Museum opened its doors early for a special group: a group that decided this weekend they would get together and help the Art Museum modernize the digital face of its Luce Foundation Center.

The group consisted of developers, cartographers, user experience specialists, engineers, strategists, White House fellows, game developers and more. I don't know about you, but that is one good looking group!

The Smithsonian team worked hard to get an API up and running that we could use to access the entire collection of the American Art Museum. For now, that API is private, and, well, that made us feel pretty special. :) I would like to give credit to that team now (with contact info to follow in the second post):

  • Sarah Allen - Presidential Innovation Fellow, Smithsonian Institution
  • Bridget Callahan - Luce Foundation Center Coordinator, Smithsonian American Art Museum
  • Georgina Goodlander - Deputy Chief, Media and Technology, Smithsonian American Art Museum
  • Diego Mayer - Presidential Innovation Fellow, Smithsonian Institution
  • Jason Shen - Presidential Innovation Fellow, Smithsonian Institution
  • Ching-Hsien Wang - Supervisory IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
  • Andrew Gunther - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution and volunteer at the Smithsonian Transcription Center
  • Deron Burba - Chief Information Officer, Smithsonian Institution
  • George Bowman - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
  • Rich Brassell - Contractor, Office of the Chief Information Officer, Smithsonian Institution
  • Eryn Starun - Assistant General Counsel, Smithsonian Institution

Day 1

Day 1 started with a general overview of expectations (thank you, Georgina) and a fantastic tour of the Luce Foundation Center provided by Bridget. The tour was a walking, talking, invigorating ideation session. As we strolled through the incredible pieces in the collection, the questions posed led to note writing, iPad jotting, and inquisitive “I have an idea” looks.

After about two hours, we headed back down to the MacMillan Education Center to prepare for an all-out hackathon session ending at 11 pm. We received our mission, to create an elevator pitch of our idea by 4 pm Sunday afternoon, and got to work.

Teams formed organically as we all huddled together and started building our prototypes. There was an exhilarating buzz in the room, an energy that can only be found at hackathons, enhanced by the fact that we weren't receiving monetary prizes, job promotions, or some kind of national acclaim.

We are here because we believe in bringing a better user experience to visitors of the Luce Foundation Center and the Smithsonian American Art Museum. We are here to help an institution that really belongs to the people of this nation. Okay, that's enough "tooting our own horn," but it had to be done.

During our "brainwork" and lunch-eating, we were given a great overview of the API, with some sample use cases and instructions. The Smithsonian’s technology team was extremely helpful and stayed with the hackathon the whole time to ensure everyone was able to access resources correctly. They even gave us a very nice web-based “hackpad” where we could all post questions, share thoughts and keep track of our work.

There were small brain-breaks where groups broke out to work on other projects and step away from their work, getting a little extra stimulation so they could look at their ideas with a fresh mind. Thank you, Erie Meyer, for that. Erie works for the White House Office of Science and Technology Policy. That's a pretty cool gig.

White paper with Post-its, colored markers, sketches, flow diagrams and the like was lined up across the wall. If you just stopped and looked around, you could see the brain power that room encapsulated.


As the evening hours neared, pizza came and the “idea-ting” continued. I must admit I left a little early to handle some non-hacking work, so I look forward to checking in with my team and getting back to you all on Hackathon Day 2. I will be writing about the final results, photos, videos, and some general commentary from my fellow hackathon crew.

If you want to check out other resources before the second post of the hackathon, feel free to peruse these useful links:

Stay tuned for Day 2 and final results!

Signing off


DIDC MeetUp Review - The US Census Bureau Pushes Data

Data Community DC is excited to welcome Andrea to our roster of bloggers. Andrea's impressive bio is below, and she will be bringing energy, ideas, and enthusiasm to the Data Innovation DC organizational team.

Census Data is cool?

At least that’s what everyone discovered at last night’s Data Innovation DC MeetUp. The U.S. Census Bureau came in to "reverse pitch" their petabytes of data to a group of developers, data scientists, and data-preneurs at Cooley LLP in downtown DC.

First off, let's offer a massive thanks to the US Census Bureau, which sent five of their best and brightest to come engage the community long into the evening and late-night hours. Who specifically did they send? Just take a look at the impressive list below:


Editor's note - a special thank you to Logan Powell who made this entire event possible.

And they brought the fantastic Jeremy Carbaugh (jcarbaugh [at]) from the Sunlight Foundation, an organization working on making census data (and other government data) interesting, fun, and mobile. They have a sweet app called Sitegeist: you give it a location, and it gives you impressive stats such as the history of the place and how many people are baby-making or just living the bachelor lifestyle; it even connects to Yelp and Wunderground, just in case you need the weather and a place to grab a brewski while you’re at it. Further, Eric at the Census Bureau made a great point for everyone out there in real estate: you can use this app to show potential buyers how the demographics in the area have changed, good school districts, income levels, number of children per household, etc. You know you’ll look good whipping out that tablet and showing them ;)

By the way, Sunlight created a very convenient Python wrapper for the Census API; you can pip install it from PyPI and check out the source on GitHub here (a round of applause for our Sunlight folks!). Did I mention that they are a non-profit doing this with far less funding than many others out there?

Sitegeist is nice, but exactly how accessible is the Census data? I am glad you asked. The Census Bureau has two approaches, American FactFinder and an API, both easy to use. FactFinder is good for perusing what you may find interesting before actually grabbing the data for yourself. The API is like the Twitter version 1 API: you get a key and use stateless HTTP GET requests to pull the data over the web. For those non-API folks, I’ll be posting a how-to shortly.
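To make that concrete, here is a minimal sketch of what such a stateless GET request looks like. The endpoint, dataset vintage, and variable codes below are illustrative assumptions on my part; check the Census developer documentation for the real ones:

```python
from urllib.parse import urlencode

# Illustrative endpoint/vintage; the actual dataset paths are in the
# Census developer docs.
BASE = "https://api.census.gov/data/2012/acs5"

def build_request_url(variables, geography, api_key):
    """Build the stateless GET URL: which variables to fetch, for which
    geography, authenticated with your API key."""
    params = {"get": ",".join(variables), "for": geography, "key": api_key}
    return BASE + "?" + urlencode(params)

def rows_to_dicts(payload):
    """The API answers with JSON as a list of lists whose first row is
    the header; zip each data row against that header."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

url = build_request_url(["NAME", "B01003_001E"], "state:*", "YOUR_KEY")
# Fetching `url` would return something shaped like this sample:
sample = [["NAME", "B01003_001E", "state"],
          ["Maryland", "5773552", "24"]]
print(rows_to_dicts(sample)[0]["NAME"])  # Maryland
```

The Sunlight wrapper mentioned above hides exactly this URL-building and header-zipping busywork behind a friendlier interface.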

The Census Bureau also has its own fun mobile app, called America's Economy.

Alright, so we’ve got some data and we’ve got some ways to get it, but what’s up with the reverse-pitch thing? This was the best part, as everyone had awesome ideas.

Some questions included:

Can we blend World Bank and Federal Reserve Bank data to get meaningful results?

This came from a guy who was already building some nice apps around World Bank and Fed data. The general consensus was "yes," a lot of business value can come from that, but they need folks like us to come up with use cases. So, thoughts? Please comment and tinker away.

What about the geospatial aspects of the data?

There were a lot of questions around the GIS mapping data and some problems with drilling down on the geospatial data to block sizes or very small lots of land. People seem really interested in getting this data for things like understanding how diseases spread, patterns of migration, etc. The Census folks said that with the longer-term surveys you can definitely get down to the block level, but, because boundaries and borders can be defined differently across the nation, it is very difficult to normalize the data. Another use case? A herculean effort? Food for thought. Also, shortly after the event, someone posted this on geo-normalization in Japan. Thanks, Logan!

Editor's note: More information on US Census Grids can be found here.

How does Census data help commercial companies?

There was a great established use case where the Census helped Target Retail understand their demographic. That blew me away. The gov’t and a private retail company working to make a better profit, a better product? This definitely got my creative juices flowing; hopefully it will get everyone out there cogitating too.

Or check out this case study from the National Association of Homebuilders:

And last but not least, an example of Census data helping disaster relief (not really commercial, but Logan didn't get a chance to show all of his videos):

We finally had people talking about the importance of longitudinal studies.

What is different now for our nation in terms of demographics, culture, and geography from 20-30-50 years ago? Just imagine some really cool heat map or time series visualization of how Central Park in NY or Rock Creek in DC has changed…yes I am saying this so someone actually goes out and gives that one a go. Don’t worry you can take the credit ;)

Oh, and I almost forgot: due to obvious privacy issues, a lot of the data is pre-processed, so you can’t stalk your ex-boss/boyfriend/girlfriend. But, listen up! If you are in school, doing research, and want to get your hands on the microdata, you can apply. Go to this link and check it out. For those of you stuck on a thesis topic in any domain that may need information about society, cough cough, nudge nudge ...

So there you have it, these are the kinds of meetups happening at Data Innovation DC. I don’t know about you, but I definitely have a new perspective on government data. I also feel a little more inclined to open my door when those census folk drop by and give them real answers.

Please comment as you see fit and send me questions. Also, JOIN Data Innovation DC and check out Data Community DC and all of the other related data meetup groups. Let us know what kind of information you want to know about and what issues/topics you want us to address.

I’m new to the blog/review game but will continue to review meetups and some hot topics, podcasts etc. that I think need to be checked out. Let me know if you want me to speak to anything in particular.

Why Aren't There More Open Data Startups?

This post is a guest reblog (with permission; originally published 1/19/2011) by Tom Lee, the Director of Sunlight Labs and a recent speaker at Data Innovation DC.

It's a question I'm seeing asked more and more: by press, by Gov 2.0 advocates, and by the online public. Those of us excited by the possibilities of open data have promised great things. So why is BrightScope the only government data startup that anyone seems to talk about? I think it's important that those of us who value open data be ready with an answer to this question. But part of that answer needs to address the misperceptions built into the query itself.

There Are Lots of Open Data Businesses

BrightScope is a wonderful example of a business that sells services built in part on publicly available data. They've gotten a lot of attention because they started up after the Open Government Directive -- after Gov 2.0 in general -- and can therefore be pointed to as a validation of that movement.

But if we want to validate the idea of public sector information (PSI) being a useful foundation for businesses in general, we can expand our scope considerably. And if we do, it's easy to find companies that are built on government data: there are databases of legal decisions, databases of patent information, Medicare data, resellers of weather data, business intelligence services that rely in part on SEC data, GIS products derived from Census data, and many others.

Some of these should probably be free, open, and much less profitable than they currently are*. But all of them are examples of how genuinely possible it is to make money off of government data. It's not all that surprising that many of the most profitable uses of PSI emerged before anyone started talking about open data's business potential. That's just the magic of capitalism! This stuff was useful, and so people found it and commercialized it. The profit motive meant that nobody had to wait around for people like me to start talking about open formats and APIs. There are no doubt still efficiencies to be gained in improving and opening these systems, but let's not be shocked if a lot of the low-hanging commercial fruit turns out to have already been picked.

Still, surely there are more opportunities out there. A lot of new government data is being opened up. Some of it must be valuable... right?

Government Does What The Market Won't

Well, sure. Much of it is extremely valuable. But it may not be valuable to entrepreneurs. To understand why, we need to get a little philosophical. What does government do? It provides public goods: things of value that the market is not able to adequately supply on its own. A standing army and public schools and well-policed streets and clean water are all things that are useful to society as a whole, but which the market can't be relied upon to provide automatically. So we organize government as a structure that can provide those kinds of things, and which will make sure that everyone can benefit from them in a way that's fair.

These are not ideal conditions under which to start a business: the fact that the government is the one collecting a particular type of data may mean that no one is interested in buying it -- a natural market for the data doesn't exist in the way that it does for, say, sports scores or stats about television viewership. And, even if you create a business that takes advantage of the subsidy represented by government involvement (data collected at taxpayer expense, resold at low, low prices!), your long-term prospects may still be poor since there's no way to deny competitors access to the same subsidy**. Someone else can come along and undercut you, and there's nothing you can do about it except be better and cheaper. That's great for the consumer, but not so great for people hoping to start a lucrative business. (Those who think BrightScope is a counterexample should have a closer look at their about page: they utilize a mix of public data, data that they laboriously capture themselves, and data bought from subscription services.)

Data's Real Value Can Be Hard To Measure

I'll be glad to see more open data startups -- and to be clear, I think we will see more. But the open data movement will be important regardless of whether any IPOs come out of it.

There are lots of types of value that are difficult to measure. If the IRS puts forms online, taxpayers have to spend less time waiting in line at the post office. If Census data reveals where a retailer's new store should go, it can mean profits for shareholders and more jobs for the community. If scientific data's openness allows more researchers to engage with a question, it can lead to better conclusions, better policies and better outcomes. If regulatory data about companies is public, it can give firms an incentive to self-police and help markets price things correctly.

All of these are real benefits, but they can be difficult or impossible to calculate -- and tough for a startup to monetize. Still, this is where I think the really exciting benefits to open data are likely to be found. If government data helps entrepreneurs make money, that's great. If it makes our country work better, that's fantastic.

* Historically, many gov data vendors have made money off of the data's artificial scarcity -- a legacy that we must unravel, even though doing so will be politically difficult: openness's benefits to the public will probably mean less revenue for the vendors.

** There shouldn't be, anyway -- in practice, public/private partnerships often fall short of this goal.

Weekly Round-Up: Open Data Order, Data Discovery, Andrew Ng, and Connected Devices

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles ranging in topic from open data to connected devices. In this week's round-up:

  • Open Data Order Could Save Lives, Energy Costs And Make Cool Apps
  • Four Types of Discovery Technology
  • Andrew Ng and the Quest for the New AI
  • Our Connected Future

Open Data Order Could Save Lives, Energy Costs And Make Cool Apps

This is a TechCrunch article about President Obama's recent Open Data Order, an executive order intended to make more government agency data openly available for analysis. The article goes on to talk about some of the ways open data has been used in the past and has a link to Project Open Data's Github page where you can find more details.

Four Types of Discovery Technology

This Smart Data Collective post talks about the value of discovery in data analytics and business. The author claims there are four types of discovery for business analytics - event discovery, data discovery, information discovery, and visual discovery - and he goes into some detail explaining each one and the differences between them.

Andrew Ng and the Quest for the New AI

This is an interesting Wired piece about Andrew Ng, best known as the Stanford machine learning professor who also co-founded Coursera. The article talks about Ng's background and interest in artificial intelligence as well as some of the deep learning projects he is working on. It goes on to explain a little about what deep learning is and how it may evolve in the future.

Our Connected Future

Our final piece this week is a GigaOM article about connected devices and how they will become more prevalent in the future. The article highlights some very interesting devices, explains what they do, and describes how they are being used. The article also talks about the data that can be collected from connected devices such as these and different ways that this data can be used.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups