Marc Joffe is the founder of Public Sector Credit Solutions (PSCS), which applies open data and analytics to rating government bonds. Before starting PSCS, Marc was a Senior Director at Moody’s Analytics. You can contact him at email@example.com. Marc is also one of the winners of Sunlight Foundation’s OpenGov Grants.
Extracting useful information from PDFs is a problem as old as … PDFs. Too often, we focus on extracting information from a specific set of documents instead of looking at the bigger picture. If you’ve ever struggled with this problem, join us for Sunlight’s PDF Liberation Hackathon, dedicated to improving open source tools for PDF extraction. Instead of focusing on one set of documents, coders will come together to add features, extensions and plugins to existing PDF extraction frameworks, making them more flexible, useful and sustainable.
Sunlight’s PDF Liberation Hackathon will tackle real-world PDF data extraction problems. In doing so, we will build upon existing open-source PDF extraction solutions such as Tabula and Ashima’s PDF Table Extractor built on Poppler. In addition, hackers will have the option of using licensed PDF software libraries as long as the implementation cost of these libraries is less than $1,000. If you have an idea for a library you want to use, please mention it in your signup form and we will try to work out the licensing ahead of time so that things run smoothly.
Let me just start off by saying thank you SMITHSONIAN AMERICAN ART MUSEUM for putting on the best hackathon ever! Granted, it was my first hackathon, but they have set the bar very high. In case you missed it, check out part 1 of this two-part post.
Day 2 commenced with breakfast at 10 am and a continuation of hacking and developing our 2-minute videos and 5-minute presentations to meet the 4:00 pm deadline. You could clearly see many of us had little sleep despite the fact that we got to go home. We all basically went home and continued to ruminate over our beautifully designed ideas.
The buzz in the room began to grow as the deadline neared and everyone started escaping into little nooks and crannies to record their 2 minute video pitches.
At 4:30pm presentations began, and they were incredible. The link to all the videos will be posted shortly. Here is a quick summary of what the hackers accomplished in less than 48 hours:
1. Team Once Upon a Time: A dad and his kids got together and created an incredible video of an interactive museum experience where you would basically tap on items in the museum and get information on them via online sources.
2. Team Muneeb and Sohaib: Two brothers with a mission! Sohaib developed a game simulation of the museum where you could pick a character and immerse yourself in a fantastical experience at the Luce Center. Muneeb developed a phone application where you could call a number and get information on that art piece. It was like 311 for art!
3. Team Kiosk to the Future: A beautifully designed mobile application where you could take a picture of the art, get information about it, share with your friends, and all sorts of other neat tricks. It was fully functional on their phone, by the way.
4. Team Geosafe: A mobile application where you could learn more about the artists of the art pieces, like where they lived, and even see where they lived on a map in real time. It had a really nice “touch-screen” feel and view. This was a one-man team!
5. Team Patrick: A great prototype of having a number of very interactive kiosks in the art museum so you could get information on the art you were looking at during that moment and find related art works nearby. You could walk around and feel welcomed by the space and always re-orient yourself at the nearest kiosk. Another one-man team.
6. Team Megatherium: A great interactive game idea, where you would select tiles on a screen about art, play a game with questions about those art pieces, compete with other friends, record your score, etc. The best part about this is that the game didn’t end at the museum: you could go online at home, continue to play, and earn badges. It kept you connected beyond your visit. They also planned to have a Phase 2 to expand to other museums.
7. Team Diego: Another great game where you would learn about art, and also tag it with the term you thought best described it. These tags would be fed back into an algorithm to better help label the art pieces in the database as well. *Diego was not a part of the teams to be judged since he is a fellow and helped develop the API. He was also one of the judges.
8. Team Art Pass: They developed a card with a QR code that you would walk around with to interact with the space. You would scan it to a mobile device, log in and also scan it across art pieces. It would record all your activity and save it to your profile. They also mentioned expanding it to other museums, so you could have all your experiences recorded in one place. They also printed a mock-up of the card with the QR code!
9. Team Back Left: Our team developed a website where you would basically learn about art, play games, see exactly where the art piece was via Google Maps (regardless of where you were), develop your own art with a word tag cloud that you could print at home or at the museum, and more! We wanted to make sure you always felt you were in the museum regardless of where you physically were.
The judges deliberated for half an hour and came back with reasons why all of the ideas were great. Prizes were awarded in the following categories:
Best use of the API: Team Geosafe
Most Franchise-able: Team Art Pass
Most unexpected: Team Muneeb, the visual arts museum game
People's Choice: Team Once Upon a Time
Runner Up: Team Kiosk to the Future
Absolute Favorite: Team Back Left!
At the end we all took a great communal photo, and yes, we all managed to crowdsource the solution to fitting 30 people in one photo ;). Group Photo!
We then had the opportunity (which we all took) to videotape short 2-minute pitches on why we think this API should stay open and available to developers. I think all of our faces said WHY NOT!!?
There was some interacting, mingling and more ideation until the museum started pushing us out. The day ended at around 7pm.
So there you have it, a magnificent time to make friends, help one of our Smithsonian institutions and refine our hacking abilities.
PLEASE COMMENT HERE IF YOU THINK THE API SHOULD BE OPEN AND WHAT IDEAS YOU MAY HAVE IF YOU COULD USE IT OR OTHER SMITHSONIAN INSTITUTIONS WHERE YOU WOULD LIKE TO SEE THIS HAPPEN.
Until another data adventure!
At around 8am on a chilly Saturday morning, the Smithsonian American Art Museum opened its doors early for a special group. A group that decided this weekend they would get together and help the Art Museum modernize the digital face of its Luce Foundation Center (http://americanart.si.edu/luce/).
The group consisted of developers, cartographers, user experience specialists, engineers, strategists, White House fellows, game developers and more. I don't know about you, but that is one good looking group!
The Smithsonian team worked hard to get an API up and running that we could use to access the entire collections of the American Art Museum. For now, that API is private, and well, that made us feel pretty special. :) I would like to give credit to that team now (with contact info to follow in the second post):
- Sarah Allen - Presidential Innovation Fellow and Smithsonian Institution
- Bridget Callahan - Luce Foundation Center Coordinator, Smithsonian American Art Museum
- Georgina Goodlander - Deputy Chief, Media and Technology, Smithsonian American Art Museum
- Diego Mayer - Presidential Innovation Fellow and Smithsonian Institution
- Jason Shen - Presidential Innovation Fellow and Smithsonian Institution
- Ching-Hsien Wang - Supervisory IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
- Andrew Gunther - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution and volunteer at the Smithsonian Transcription Center
- Deron Burba - Chief Information Officer, Smithsonian Institution
- George Bowman - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
- Rich Brassell - Contractor, Office of the Chief Information Officer, Smithsonian Institution
- Eryn Starun - Assistant General Counsel, Smithsonian Institution
Day 1 started with a general overview of expectations (thank you Georgina), and a fantastic tour of the Luce Foundation Center provided by Bridget. The tour was a walking, talking, invigorating ideation session. As we strolled through the incredible pieces in the collection, questions posed led to note writing, iPad jotting and inquisitive “I have an idea” looks.
After about 2 hours, we headed back down to the MacMillan Education Center to prepare for an all-out hackathon session ending at 11pm. We received our mission, to create an elevator pitch of our idea by 4pm Sunday afternoon, and got to work.
Teams were organically created as we all huddled together and started building our prototypes. There was an exhilarating buzz in the room, an energy that can only be found in hackathons. This was enhanced by the fact that we weren't receiving monetary prizes, job promotions, or some kind of national acclaim.
We are here because we believe in bringing a better user experience for visitors of the Luce Foundation Center and the Smithsonian American Art Museum. We are here to help an institution that really belongs to the people of this nation. Okay, that's enough “tooting our own horn,” but it had to be done.
During our "brainwork" and lunch-eating, we were given a great overview of the API, with some sample use cases and instructions. The Smithsonian’s technology team was extremely helpful and stayed with the hackathon the whole time to ensure everyone was able to access resources correctly. They even gave us a very nice web-based “hackpad” where we could all post questions, share thoughts and keep track of our work.
There were small brain-breaks where groups broke out to work on other projects and step away from their work to get a little extra stimulation, so they could look at their ideas with a fresh mind. Thank you Erie Meyer for that. Erie works for the White House Office of Science and Technology Policy (http://www.whitehouse.gov/administration/eop/ostp). That's a pretty cool gig.
Sheets of white paper with Post-its, colored markers, sketches, flow diagrams and the like were lined up across the wall. If you just stopped and looked around, you could see the brain power that room encapsulated.
As the evening hours neared, pizza came and the “idea-ting” continued. I must admit I left a little early to handle some non-hacking work, so I look forward to checking in with my team, and getting back to you all on Hackathon Day 2. I will be writing about final results, photos, videos and some general commentary from my fellow hackathon crew.
If you want to check out other resources before the second post of the hackathon, feel free to peruse these useful links:
- Luce Foundation Center website: http://www.americanart.si.edu/luce/about
- Luce Center Interact page: http://www.americanart.si.edu/luce/interact/
- Smithsonian American Art Museum Collections Search (filter by Luce): http://www.americanart.si.edu/collections/search/
- Luce Foundation Center on Facebook: http://www.facebook.com/americanartluce
- Luce Foundation Center on Tumblr: http://americanartluce.tumblr.com
- Luce Foundation Center Flickr: http://flickr.com/groups/lucehack
Stay tuned for Day 2 and final results!
Data Community DC is excited to welcome Andrea to our host of bloggers. Andrea's impressive bio is below and she will be bringing energy, ideas, and enthusiasm to the Data Innovation DC organizational team.
Census Data is cool?
At least that’s what everyone discovered at last night’s Data Innovation DC MeetUp. The U.S. Census Bureau came in to "reverse pitch" their petabytes of data to a group of developers, data scientists and data-preneurs at Cooley LLP in Downtown DC.
First off, let's offer a massive thanks to the US Census Bureau that sent five of their best and brightest to come engage the community long into the evening and late night hours. Who specifically did they send? Just take a look at the impressive list below:
Editor's note - a special thank you to Logan Powell who made this entire event possible.
And they brought the fantastic Jeremy Carbaugh jcarbaugh [at] sunlightfoundation.
By the way, Sunlight created a very convenient Python wrapper for the Census API; you can pip it off of PyPI and check out the source on GitHub here (a round of applause for our Sunlight folks!). Did I mention that they are a non-profit doing this with far less funding than many others out there?
Sitegeist is nice, but exactly how accessible is the Census data? I am glad you asked. The Census has two approaches, their American FactFinder and their API, both easy to use. FactFinder is good for perusing what you may find interesting before actually grabbing the data for yourself. The API is like the Twitter version 1 API: you get a key and use stateless HTTP GET requests to pull the data via the web. For those non-API folks, I’ll be posting a how-to shortly.
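To make the "get a key, send a GET request" idea a bit more concrete, here is a minimal Python sketch using only the standard library. Treat the details as assumptions rather than gospel: the base URL, the dataset path, the variable name `P001001` and the `state:*` geography are illustrative and should be checked against the Census developer documentation, and the helper names (`build_query_url`, `fetch`, `rows_to_dicts`) are made up for this post. The sketch also assumes the API answers with a JSON list of rows whose first row is the header.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and dataset path; confirm against the Census developer docs.
BASE_URL = "https://api.census.gov/data/2010/dec/sf1"


def build_query_url(variables, geography, api_key, base_url=BASE_URL):
    """Build the URL for one stateless GET request."""
    params = {
        "get": ",".join(variables),  # which columns to pull
        "for": geography,            # which geography, e.g. "state:*"
        "key": api_key,              # the key you signed up for
    }
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return f"{base_url}?{urlencode(params, safe=',:*')}"


def fetch(url):
    """Send the GET request and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def rows_to_dicts(payload):
    """Turn a header-row-plus-data-rows payload into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


if __name__ == "__main__":
    # Print the URL only; call fetch(url) once you have a real key.
    print(build_query_url(["P001001", "NAME"], "state:*", "YOUR_KEY"))
```

Because every request carries its own key and parameters, nothing has to be remembered between calls, which is what "stateless" means here. Once you have a real key, `rows_to_dicts(fetch(url))` would give you one dictionary per geography, assuming the response has the shape described above.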
The Census also has their own fun mobile app called America's Economy.
Alright so we’ve got some data, we’ve got some ways to get it but what’s up with the reverse pitch thing? This was the best part as everyone had awesome ideas and ideations.
Some questions included:
Can we blend WorldBank and Federal Reserve Bank data to get meaningful results?
This came from a guy who was already building some nice apps around WB and Fed data. The general consensus was "yes," a lot of business value can come from that, but they need folks like us to come up with use-cases. So, thoughts? Please comment and tinker away.
What about the geospatial aspects of the data?
There were a lot of questions around the GIS mapping data and some problems with drilling down on the geospatial data to block sizes or very small lots of land. People seem really interested in getting this data for things like understanding how diseases spread, patterns of migration, etc. The Census folks said that with the longer-term surveys you can definitely get down to the block level but, because boundaries and borders can be defined differently across the nation, it is very difficult to normalize the data. Another use-case? A herculean effort? Hmm... food for thought. Also, shortly after the event, someone posted this on geo-normalization in Japan. Thanks Logan!
Editor's note: More information on US Census Grids can be found here.
How does Census data help commercial companies?
There was a great established use case where the Census helped Target Retail understand their demographic. That blew me away. The gov’t and a private retail company working to make a better profit, a better product? This definitely got my creative juices flowing, hopefully it will get everyone out there cogitating too.
Or, check out this case study from the National Association of Homebuilders:
And last but not least, an example of Census data helping disaster relief (not really commercial but Logan didn't get a chance to show all of his videos):
We finally had people talking about the importance of longitudinal studies.
What is different now for our nation in terms of demographics, culture, and geography from 20-30-50 years ago? Just imagine some really cool heat map or time series visualization of how Central Park in NY or Rock Creek in DC has changed…yes I am saying this so someone actually goes out and gives that one a go. Don’t worry you can take the credit ;)
Oh, and I almost forgot: due to obvious privacy issues, a lot of the data is pre-processed so you can’t stalk your ex-boss/boyfriend/girlfriend. But, listen up! If you are in school and doing research and want to get your hands on the microdata, you can apply. Go to this link and check it out (http://www.census.gov/ces/rdcresearch/howtoapply.html). For those of you stuck on a thesis topic in any domain that may need information about society, cough cough, nudge nudge ...
So there you have it, these are the kinds of meetups happening at Data Innovation DC. I don’t know about you, but I definitely have a new perspective on government data. I also feel a little more inclined to open my door when those census folk drop by and give them real answers.
Please comment as you see fit and send me questions. Also, JOIN Data Innovation DC and check out Data Community DC with all of its other related data meetup groups. Let us know what kind of information you want to know about and what issues/topics you want us to address.
I’m new to the blog/review game but will continue to review meetups and some hot topics, podcasts etc. that I think need to be checked out. Let me know if you want me to speak to anything in particular.
This post is a guest reblog (with permission; original 1/19/2011) by Tom Lee, the Director of Sunlight Labs and recent speaker at Data Innovation DC.
It's a question I'm seeing asked more and more: by press, by Gov 2.0 advocates, and by the online public. Those of us excited by the possibilities of open data have promised great things. So why is BrightScope the only government data startup that anyone seems to talk about? I think it's important that those of us who value open data be ready with an answer to this question. But part of that answer needs to address the misperceptions built into the query itself.
There Are Lots of Open Data Businesses
BrightScope is a wonderful example of a business that sells services built in part on publicly available data. They've gotten a lot of attention because they started up after the Open Government Directive, after data.gov -- after Gov 2.0 in general -- and can therefore be pointed to as a validation of that movement.
But, if we want to validate the idea of public sector information (PSI) being a useful foundation for businesses in general, we can expand our scope considerably. And if we do, it's easy to find companies that are built on government data: there are databases of legal decisions, databases of patent information, Medicare data, resellers of weather data, business intelligence services that rely in part on SEC data, GIS products derived from Census data, and many others.
Some of these should probably be free, open, and much less profitable than they currently are*. But all of them are examples of how genuinely possible it is to make money off of government data. It's not all that surprising that many of the most profitable uses of PSI emerged before anyone started talking about open data's business potential. That's just the magic of capitalism! This stuff was useful, and so people found it and commercialized it. The profit motive meant that nobody had to wait around for people like me to start talking about open formats and APIs. There are no doubt still efficiencies to be gained in improving and opening these systems, but let's not be shocked if a lot of the low-hanging commercial fruit turns out to have already been picked.
Still, surely there are more opportunities out there. A lot of new government data is being opened up. Some of it must be valuable... right?
Government Does What The Market Won't
Well, sure. Much of it is extremely valuable. But it may not be valuable to entrepreneurs. To understand why, we need to get a little philosophical. What does government do? It provides public goods: things of value that the market is not able to adequately supply on its own. A standing army and public schools and well-policed streets and clean water are all things that are useful to society as a whole, but which the market can't be relied upon to provide automatically. So we organize government as a structure that can provide those kinds of things, and which will make sure that everyone can benefit from them in a way that's fair.
These are not ideal conditions under which to start a business: the fact that the government is the one collecting a particular type of data may mean that no one is interested in buying it -- a natural market for the data doesn't exist in the way that it does for, say, sports scores or stats about television viewership. And, even if you create a business that takes advantage of the subsidy represented by government involvement (data collected at taxpayer expense, resold at low, low prices!), your long-term prospects may still be poor since there's no way to deny competitors access to the same subsidy**. Someone else can come along and undercut you, and there's nothing you can do about it except be better and cheaper. That's great for the consumer, but not so great for people hoping to start a lucrative business. (Those who think BrightScope is a counterexample should have a closer look at their about page: they utilize a mix of public data, data that they laboriously capture themselves, and data bought from subscription services.)
Data's Real Value Can Be Hard To Measure
I'll be glad to see more open data startups -- and to be clear, I think we will see more. But the open data movement will be important regardless of whether any IPOs come out of it.
There are lots of types of value that are difficult to measure. If the IRS puts forms online, taxpayers have to spend less time waiting in line at the post office. If Census data reveals where a retailer's new store should go, it can mean profits for shareholders and more jobs for the community. If scientific data's openness allows more researchers to engage with a question, it can lead to better conclusions, better policies and better outcomes. If regulatory data about companies is public, it can give firms an incentive to self-police and help markets price things correctly.
All of these are real benefits, but they can be difficult or impossible to calculate -- and tough for a startup to monetize. Still, this is where I think the really exciting benefits to open data are likely to be found. If government data helps entrepreneurs make money, that's great. If it makes our country work better, that's fantastic.
* Historically, many gov data vendors have made money off of the data's artificial scarcity -- a legacy that we must unravel, even though doing so will be politically difficult: openness's benefits to the public will probably mean less revenue for the vendors.
** There shouldn't be, anyway -- in practice, public/private partnerships often fall short of this goal.