Data Sources

Smithsonian American Art Museum Hackathon Day 1!

At around 8am on a chilly Saturday morning, the Smithsonian American Art Museum opened its doors early for a special group, a group that decided this weekend they would get together and help the Art Museum modernize the digital face of its Luce Foundation Center.

The group consisted of developers, cartographers, user experience specialists, engineers, strategists, White House fellows, game developers and more. I don't know about you, but that is one good looking group!

The Smithsonian team worked hard to get an API up and running that we could use to access the entire collection of the American Art Museum. For now, that API is private, and, well, that made us feel pretty special. :) I would like to give credit to that team now (with contact info to follow in the second post):

  • Sarah Allen - Presidential Innovation Fellow, Smithsonian Institution
  • Bridget Callahan - Luce Foundation Center Coordinator, Smithsonian American Art Museum
  • Georgina Goodlander - Deputy Chief, Media and Technology, Smithsonian American Art Museum
  • Diego Mayer - Presidential Innovation Fellow, Smithsonian Institution
  • Jason Shen - Presidential Innovation Fellow, Smithsonian Institution
  • Ching-Hsien Wang - Supervisory IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
  • Andrew Gunther - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution, and volunteer at the Smithsonian Transcription Center
  • Deron Burba - Chief Information Officer, Smithsonian Institution
  • George Bowman - IT Specialist, Office of the Chief Information Officer, Smithsonian Institution
  • Rich Brassell - Contractor, Office of the Chief Information Officer, Smithsonian Institution
  • Eryn Starun - Assistant General Counsel, Smithsonian Institution

Day 1

Day 1 started with a general overview of expectations (thank you, Georgina), and a fantastic tour of the Luce Foundation Center provided by Bridget. The tour was a walking, talking, invigorating ideation session. As we strolled through the incredible pieces in the collection, questions posed led to note writing, iPad jotting, and inquisitive “I have an idea” looks.

After about two hours, we headed back down to the MacMillan Education Center to prepare for an all-out hackathon session ending at 11pm. We received our mission, to create an elevator pitch of our idea by 4pm Sunday afternoon, and got to work.

Teams formed organically as we all huddled together and started building our prototypes. There was an exhilarating buzz in the room, an energy that can only be found at hackathons, enhanced by the fact that we weren't receiving monetary prizes, job promotions, or any kind of national acclaim.

We were there because we believe in bringing a better user experience to visitors of the Luce Foundation Center and the Smithsonian American Art Museum. We were there to help an institution that really belongs to the people of this nation. Okay, that's enough “tooting our own horn,” but it had to be done.

During our "brainwork" and lunch-eating, we were given a great overview of the API, with some sample use cases and instructions. The Smithsonian’s technology team was extremely helpful and stayed with the hackathon the whole time to ensure everyone was able to access resources correctly. They even gave us a very nice web-based “hackpad” where we could all post questions, share thoughts and keep track of our work.

There were small brain-breaks where groups broke out to work on other projects and step away from their work, getting a little extra stimulation so they could return to their ideas with a fresh mind. Thank you, Erie Meyer, for that. Erie works for the White House Office of Science and Technology Policy. That's a pretty cool gig.

White paper with post-its, colored markers, sketches, flow diagrams and the like was lined up across the wall. If you just stopped and looked around, you could see the brain power that room encapsulated.


As the evening hours neared, pizza came and the “idea-ting” continued. I must admit I left a little early to handle some non-hacking work, so I look forward to checking in with my team and getting back to you all on Hackathon Day 2. I will be writing about the final results, photos, videos and some general commentary from my fellow hackathon crew.

If you want to check out other resources before the second post of the hackathon, feel free to peruse these useful links:

Stay tuned for Day 2 and final results!

Signing off


Selling Data Science: Common Language

What do you think of when you hear the word "data"? For data scientists it means SO MANY different things, from unstructured data like natural language and web crawls to perfectly square Excel spreadsheets. What do non-data scientists think of? We might come up with a slick line for describing what we do, such as "I help find meaning in data," but that doesn't help sell data science. Language is everything, and if people don't use a word on a regular basis it will not have any meaning for them. Many people aren't sure whether they even have data, let alone whether there's some deeper meaning, some insight, they would like to find. As with any language barrier, the goal is to find common ground and build from there.

You can't blame people, the word "data" is about as abstract as you can get, perhaps because it can refer to so many different things.  When discussing data casually, rather than mansplain what you believe data is or what it could be, it's much easier to find examples of data that they are familiar with and preferably are integral to their work.

The most common data that everyone runs into is natural language; unfortunately, this unstructured data is also some of the most difficult to work with. In other words, they may know what it is, but showing how it's data may still be difficult. One solution: discuss a metric with a qualitative name, such as "similarity", "diversity", or "uniqueness". We might use the Jaro algorithm to measure similarity, counting the letters two strings have in common along with their transpositions, and there are plenty of other algorithms. When discussing "similarity" with someone new, or any other word that measures relationships in natural language, we are exploring something we both accept, and we are building common ground.
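As a concrete illustration, here is a minimal sketch of the Jaro similarity just mentioned, written from the standard textbook definition rather than any particular library:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of characters two strings share within a
    sliding window, discounted by transpositions. Returns a value in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as "common" only within this matching window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    m = 0  # number of common characters
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    t, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(round(jaro("MARTHA", "MARHTA"), 4))  # → 0.9444
```

The "MARTHA"/"MARHTA" pair is the classic worked example: six common letters, one transposition. Metrics like this give both parties something tangible to point at when the word "data" alone is too abstract.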


Some data is obvious, like this neatly curated spreadsheet from the Committee to Protect Journalists.  Part of my larger presentation at Freedom Hack (thus the lack of labels), the visualization shown on the right was only possible to build in short order because the data was already well organized.  If we're lucky enough to have such an easy start to a conversation, we get to take the conversation to the next level and maybe build something interesting that all parties can appreciate; in other words, we get to "geek out" professionally.

Data-driven presentations using Slidify

Presentations are the stock-in-trade for consultants, managers, teachers, public speakers, and, probably, you. We all have to present our work at some level, to someone we report to or to our peers, or to introduce newcomers to our work. Of course, presentations are passé, so why blog about them? There’s already PowerPoint, and maybe Keynote. What more need we talk about?

Well, technology has changed, and vibrant, dynamic presentations are here today for everyone to see. No, I mean literally everybody, if I like; all anyone needs is a web browser. Graphs can be interactive, flow can be nonlinear, and presentations can be fun and memorable again!

But PowerPoint is so easy! You click, paste, type, add a bit of glitz, and you’re done, right? Well, as most of us can attest to, not really. It takes a bit more effort and putzing around to really get things in reasonable shape, let alone great shape.

And there are powerful alternatives. Which are simple and easy. And do a pretty great job on their own. Oh, and, by the way, if you have data and analysis results to present, they're super slick and a one-stop shop from analysis to presentation. Really!! Actually there are a few out there, but I’m going to talk about just one. My favorite: Slidify.

Slidify is a fantastic R package that takes a document written in RMarkdown, which is Markdown (an easy text markup format) possibly interspersed with chunks of R code that produce tables, figures, or interactive graphics, weaves in the results of that code, and then formats it into beautiful web presentations using HTML5. You can choose a format template (it comes with quite a few) or brew your own. You can make your presentation look and behave the way you want, even like a Prezi (using ImpressJS). You can also make interactive questionnaires and even put in windows to code interactively within your presentation!

A Slidify Demonstration

Slidify is obviously feature-rich, and infinitely customizable, but that’s not really what attracted me to it. It was the ability to write presentations in Markdown, which is super easy and lets me put down content quickly without worrying about appearance (between you and me, I’m writing this post in Markdown, on a Nexus 7). It lets me weave in the results of my analyses easily, keeping the code in one place within my document. So when my data changes, I can create an updated presentation literally with the press of a button. Markdown is geared toward creating HTML documents. Pandoc lets you create HTML presentations from Markdown, but not living, data-driven presentations like Slidify does. I get to put my presentations up on GitHub or RPubs, or even in my Dropbox, directly from Slidify, share the link, and I’m good to go.
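To make that concrete, here is a minimal sketch of what a Slidify source file can look like. The front-matter fields and chunk options shown are illustrative (check Slidify's documentation for the exact fields your version expects); the key idea is that slides are plain Markdown separated by `---`, with R chunks woven in wherever a figure should come from data:

````
---
title: "My Data-Driven Deck"
framework: io2012        # one of the bundled HTML5 slide templates
highlighter: highlight.js
---

## A slide written in plain Markdown

- Bullets, images, and links work as usual

---

## A slide whose figure is computed from data

```{r, echo=FALSE, fig.width=6}
plot(cars)  # regenerate the deck when the data changes
```
````

When the underlying data changes, re-running Slidify on this file rebuilds the whole deck with fresh figures, which is the "press of a button" update described above.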

Dr. Ramnath Vaidyanathan created Slidify to help him teach more effectively at McGill University, where he is on the Desautels Faculty of Management. But, for me, it is now the go-to place for creating presentations, even if I don’t need to incorporate data. If you’re an analyst and live in the R ecosystem, I highly recommend Slidify. If you don’t and use other tools, Slidify is a great reason to come and see what R can do for you. Even if it's just to create great presentations. There are plenty of great examples of what’s possible.

If you are in the DC metro area, come see Slidify in action. Dr. Vaidyanathan presents at a joint Statistical Programming DC / Data Visualization DC meetup on August 19, covering both Slidify and his other brainchildren, rCharts (which can create really cool, dynamic visualizations from R; see Sean's blog) and rNotebook. See the announcements at SPDC and DVDC, sign up, and we’ll see you there.

Introducing the Congress App for iOS from the Sunlight Foundation

This post is reblogged with permission from the Sunlight Foundation and the original can be found here. We at Data Community DC are always looking to highlight local innovations in data, including software, apps, data sets, infrastructure, databases, startups, algorithms, and more. If you would like to garner publicity for your efforts, please contact me, Sean Murphy. As Congress returns for their July session, the Sunlight Foundation is excited to announce our free Congress app for iOS devices that allows anyone to get the latest from Washington. Download it here. Now it is easy to learn more about your member of Congress, contact them directly and see their activities right from your phone. Follow the latest legislation, floor activity and even get a breakdown of votes with just a swipe and a tap. The new Congress app for iOS has many more features in development and complements the immensely popular version for Android.

When you launch the app, you'll go right to a feed of the latest activity. Tap on a piece of legislation and you can see the summary information, sponsorship details, movement through committees, votes and links to the full text. Easily swipe to the left to access the menu of other features. Under the legislators tab, you can quickly browse the list of members sorted by state, chamber or even tap the location icon to see who represents your current spot or wherever you drop a pin. There is no shame in endlessly dropping new pins to discover the interesting shapes of congressional districts.

A screenshot of Rep. Ann Eshoo's profile from the Sunlight Foundation's new Congress app for iOS devices. The Congress app has detailed and up-to-date information for every member of Congress. Quickly see their picture, get directions to their DC office address, find links to their website and social media, see a map of their district and a button to call them directly. You can also see the bills they sponsored and their voting record. Star any legislator or bill to quickly access them through the "Following" section. From there you can easily see the latest activity on the bills you follow, and if you're looking at a vote breakdown, the legislators you follow will appear at the top.

In future releases we will have push notifications for actions related to what you're following as well as a new section for committee listings, calendars, floor updates and much more. Stay tuned for the latest updates by following the Congress app Twitter account here. Like all of Sunlight's work, the Congress app is open source with the code available on GitHub. The app uses data from official sources through the Sunlight Congress API and the beautiful maps are powered by MapBox. Please email us with any feedback.

Data Visualization: From Excel to ???

So you're an Excel wizard: you make the best graphs and charts Microsoft's classic product has to offer, and you expertly integrate them into your business operations.  Lately you've studied up on all the latest uses of data visualization and dashboards for taking your business to the next level, which you tried to emulate with Excel and maybe some help from the Microsoft cloud, but it just doesn't work the way you'd like it to.  How do you transition your business away from the stalwart of the late 20th century?

If you believe you can transition your business operations to incorporate data visualization, you're likely gathering raw data, maintaining basic information, and making projections, all eventually used in an analysis of alternatives and a final decision for internal and external clients.  In addition, it's not just about using the latest tools and techniques; your operational upgrades must actually make it easier for you and your colleagues to execute daily, otherwise it's just an academic exercise.

Google Docs

There are some advantages to using Google Docs over desktop Excel: it's in the cloud, has built-in sharing capabilities, and offers a wider selection of visualization options, but my favorite is that you can reference and integrate multiple sheets from multiple users to create a multi-user network of spreadsheets.  If you have a good JavaScript programmer on hand you can even define custom functions, which can be nice when you have particularly lengthy calculations, as spreadsheet formulas tend to be cumbersome.  A step further, you could use Google Docs as a database for input to R, which can then be used to set up dashboards for the team using a Shiny server.  Bottom line: Google makes it flexible, allowing you to pivot when necessary, but it can take time to master.

Tableau Server

Tableau Server is a great option if you want to share information across all users in your organization, have access to a plethora of visualization tools, utilize your mobile device, set up dashboards, and keep your information secure.  The question is, how big is your organization?  Tableau Server will cost you $1,000 per user, with a minimum of 10 users and 20% yearly maintenance.  If you're a small shop, your internal operations are likely straightforward enough to be outlined to someone new in a good presentation, meaning that Tableau is like grabbing the whole toolbox to hang a picture; it may be more than necessary.  If you're a larger organization, Tableau may accelerate your business in ways you never thought of before.

Central Database

There are a number of database options, including Amazon Relational Database Service and Google App Engine.  There are a lot of open source solutions using either, and they will take more time to set up, but with these approaches you're committing to a future.  As you gain more clients and gather more data, you may want to access that data to discover insights you know are there from your experience gathering it.  This can be a simple function call from R, and results you like can be set up as a dashboard using a number of different languages.  You may expand your services and hire new employees, but want to easily access your historical data to set up new dashboards for daily operations.  Even old dashboards may need an overhaul, and being able to access the data from a standard system, as opposed to coordinating a myriad of spreadsheets, makes pivoting much easier.

Centralized vs. Distributed

Google Docs is very much a distributed system where different users have different permissions, whereas setting up a centralized database will restrict most people to using your operational system according to your prescription.  So when do you consolidate into a single system, and when do you give people the flexibility to use their data as they see fit?  It depends, of course.  It depends on the time history of that data: if the data is no good next week, be flexible; if this is your company's gold, then make sure the data is in a safe, organized, centralized place.  You may want to allow employees to access your company's gold for their daily purposes, and classic spreadsheets may be all they need for that, but when you've made considerable effort to get the unique data you have, make sure it's in a safe place, using a database system you know you can easily come back to when necessary.

Applying Aggregated Government Data - Insights into a Data-focused Company

The following post is a review of Enigma Technology, Inc., its VC-funded open source government data aggregation platform, and how that platform may be utilized in different business applications.  Enjoy!


Aggregation of previously unconnected data sources is a new industry in our hyper-connected world.  Enigma has focused on aggregating open source US Government data, and the question is, “What is possible with this new technology?”  Given only the information from the website, this review explores Enigma’s decision-support approach via its three ‘Features’ sections: Data Sources, Discover, and Analyze.

Data Sources

The technology to gather data from the open web, either directly or by scraping websites, has reached maturity, and as a result it is simply a bureaucratic process to focus aggregation on the specific government industries Enigma highlights (aircraft, lobbying, financial, spending, and patent).  Enigma focuses on data augmentation, API access, and custom data, which is another way of saying, “We can provide standard insights or give you easy access, and we can apply these principles to whatever data sets you have in mind.”  This business model is another 21st-century standard: a standard charge for self-service data applications, and consulting on private data applications.

Discover (Data)

A primary feature of government is its siloed approach to data, a classic example being the sharing of intelligence information following 9/11.  Juxtaposing data sets produces new correlations between them, and thereby new insights, allowing for exploration previously impossible or impractical.  Combined with a powerful UI and self-service tools, Enigma seeks to empower its users to recognize what’s most valuable to them, as opposed to providing any one ‘right’ answer, an approach broadly adopted in the software-as-a-service (SaaS) industry and widely applied to government data.

Analyze (Data)

Enigma’s goal is to immerse its users in the data in a meaningful way, allowing them to drill down to any detail or rise above the fray with metadata created by its standard functions and operators, à la classic statistics, classification, and data ontology.  Again utilizing a powerful UI and self-service tools, Enigma plans to empower its users to focus on their data of interest (filtering) and to choose which mathematical operations to perform on data under comparison, all with the goal of integrating previously independent government data sets.

Business Applications

If the aggregated open source data is directly applicable to your existing business, by all means weigh the ROI of an Enigma subscription.  In most cases, however, applying this data will require more significant discussion and negotiation with potential clients, or Enigma’s self-service model will only serve as a demonstration for private or more restricted-access data.  Government organizations are being charged with integrating data across ‘silos’, and services like Enigma’s will provide comprehensive tools for the first step in this process, allowing for consultation on its application and for services specializing in the chartered goals of that government data integration.

Using Data to Create Viral Content. [INFOGRAPHIC]

Netflix recently used its own data to drive the creation of the hit series 'House of Cards'. A similar approach can be applied to other forms of media to create content that is highly likely to become popular or even go viral through social media channels.

I examined the data set collected by Feastie Analytics to determine the features of recipes that make them the most likely to go viral on Pinterest. Some of the results are in the infographic below (originally published here). The data set includes 109,000 recipes published after Jan 1, 2011 on over 1,200 different food blogs. Each recipe is tagged with its ingredients, meal course, dish title, and publication date. For each recipe, I have a recent total pin count. I also have the dimensions of a representative photo from the original blog post.

The first thing that I examined is the distribution of pins by recipe. What I found is that the distribution of pins by recipe is much like the distribution of wealth in the United States -- the top 1% have orders of magnitude more than the bottom 90%. The top 0.1% has another order of magnitude more than the top 1%! Many of the most pinned recipes are from popular blogs that regularly have highly pinned recipes, but a surprising number are from smaller or newer blogs. A single viral photo can drive hundreds of thousands of new visitors to a site that has never seen that level of traffic before.

For the purposes of this analysis, I defined "going viral" as reaching the top 5% of recipes -- having a pin count over 2,964 pins. Then I calculated how much more (or less) likely a recipe is to go viral depending on its meal course, keywords in the dish title, ingredients, day of the week, and the aspect ratio of the photo.
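The threshold-and-lift calculation described here can be sketched in a few lines. The numbers below are made-up toy data (the Feastie data set is not public), so only the method, not the figures, mirrors the real analysis:

```python
# Toy recipe data as (pin_count, course) pairs. The real analysis used
# ~109,000 recipes from Feastie Analytics; these values are illustrative.
recipes = [
    (12, "dessert"), (80, "dessert"), (3500, "dessert"), (40, "dessert"),
    (25, "appetizer"), (4200, "appetizer"), (3100, "appetizer"),
    (10, "main"), (60, "main"), (15, "main"),
]

# "Going viral" = landing in the top 5% of recipes by pin count.
pins = sorted(p for p, _ in recipes)
threshold = pins[int(0.95 * len(pins))]  # simple 95th-percentile cutoff

# Overall probability that any recipe goes viral.
base_rate = sum(p >= threshold for p, _ in recipes) / len(recipes)

def lift(course: str) -> float:
    """How much more (or less) likely recipes in `course` are
    to go viral than the average recipe."""
    in_course = [p for p, c in recipes if c == course]
    viral_rate = sum(p >= threshold for p in in_course) / len(in_course)
    return viral_rate / base_rate

print(f"appetizer lift: {lift('appetizer'):.2f}")
```

A lift above 1.0 means the category is over-represented among viral recipes; the same ratio works unchanged for ingredients, title keywords, publication day, or photo aspect ratio.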

Some of the results are surprising and some are expected. Many people would expect that desserts are most likely to go viral on Pinterest. But in reality, desserts are published the most yet are not the most likely to go viral. Appetizers have the best probability of going viral, perhaps because they are published less frequently yet are in relatively high demand. The popularity of cheese, chocolate, and other sweets in the dishes and ingredients is not surprising. What is somewhat surprising are some of the healthier ingredients, such as quinoa, spinach, and black beans. The fact that Sunday is the second best day to publish is surprising, as most publishers avoid weekends. However, traffic to recipe sites spikes on Sundays, so it makes sense that recipes published then have an advantage. Finally, it's no surprise that images with tall orientations are more likely to go viral on Pinterest, considering how they are given more space by the Pinterest design. But now we can put a number on just how much of an advantage portrait-oriented photos have -- they are approximately twice as likely to go viral as the average photo.

Hungry yet? What other forms of content would you like to see this approach applied to?

Check back tomorrow for a tutorial on how to create an infographic with Keynote.


Create Data-Driven Tools that May Improve Diabetes Care for a Chance to Win $100K

This post is reblogged by DC2 as it looked like a pretty worthy cause. The original author is Dwayne Spradlin.


In 2011, McKinsey Global Institute put out a study that projected, “If U.S. healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.” The transformative power of open data has been a hot topic of national conversation over the past several years, particularly in its application to healthcare, as data practitioners ponder groundbreaking new developments that could change the lives of patients. In what may come as a surprising move for some, Sanofi US, an established healthcare player, is leading the charge to find out exactly what’s possible when you apply the power of open data to healthcare innovation to help millions of people living with diabetes.

Last week, in partnership with the Health Data Consortium, the Sanofi US 2013 Data Design Diabetes Innovation Challenge - Prove It! kicked off the Redesigning Data Challenge Series, inviting innovators to create data-driven tools that could potentially improve diabetes care in the US. Now in its third year, the partnership with the Health Data Consortium brings a new focus on data to Data Design Diabetes, encouraging entrepreneurs, data scientists, and designers to create the evidence needed to make better decisions related to diabetes. Through baseline knowledge models, evidence-based practice, or predictive analysis, the Challenge asks innovators to submit Prove It! concepts that have the potential to create real change with real knowledge.

Since its inception in 2011, Data Design Diabetes has helped to launch a number of successful startups, spurring innovative solutions to help people living with diabetes, their families, and their caregivers. The first Data Design Diabetes winner uses an app that alerts caregivers to concerning behavioral changes. The team has raised $8.2M in venture funding, has grown to twelve people, and recently acquired the Rock Health startup Pipette. It has been groundbreaking in its use of mobile technology to create data that can be utilized to help improve the health and daily lives of people living with diabetes. Last year’s winner, n4a Diabetes Care Center, uses predictive analysis to isolate and target patients based on cost patterns and risk profiles, providing them with support and services designed to slow the progression of the disease, improve quality of health, and slow the spending associated with a patient’s care.

Prove It! is open for submissions now through April 7, 2013. Finalist teams will present at Health Datapalooza IV in Washington, DC, and one winner will receive $100,000.


Submissions will be evaluated on four criteria:

EVIDENCE-BASED HEALTH OUTCOMES: Ability to demonstrate in an evidence-based way how the concept can improve the outcomes and/or experience of people living with diabetes in the US.

TARGET AUDIENCE: Ability to support one or more members of the healthcare ecosystem and provide them with data-driven tools or evidence-based insight that can help them make better contributions to staving off the diabetes epidemic in the US.

DECISION-MAKING: Ability to illustrate how the concept can enable better data-driven decision-making at a particular stage across the spectrum of type 1 or type 2 diabetes, from lifestyle and environmental factors to diagnosis, treatment, maintenance, and beyond.

DATA SCIENCE: Utilize new or traditional data methodology -- such as baseline knowledge models, evidence-based practice, and predictive analysis -- to create a tool that may potentially change the landscape of diabetes management through richer insight, more timely information, or better sets of decisions.


2013 Data Design Diabetes Innovation Challenge Timeline

  • March 18: Challenge open for submissions
  • April 7: Last day to submit to the Challenge
  • April 18: Finalists announced during TEDMED
  • April 18-June 2: Finalists participate in virtual incubator
  • April 26-28: Innovators’ bootcamp in San Francisco, CA
  • June 3: Finalists present at Health Datapalooza IV
  • June 4: Winner announced at Health Datapalooza IV


Submit a concept today, or follow the Challenge on Twitter or Facebook for updates as the 2013 Data Design Diabetes Innovation Challenge - Prove It! progresses.

DC Data Source Weekly - The World Bank Enterprise Surveys

by Asif Islam (LinkedIn: @AsifMIEcon). If you wanted to know about the business environment in developing countries directly from firms, where would you look? Typically, firm-level survey data collection is undertaken by governments in their respective countries, but these data sets tend to shy away from certain topics, such as corruption, given that firms are less likely to answer corruption questions on government-implemented surveys. Furthermore, most survey data collected by governments adopts different sampling methodologies across countries, raising concerns about comparisons across countries or regions.

Enterprise Surveys, a joint venture of the World Bank and the International Finance Corporation, provides publicly available survey data on firms in developing economies. The data allows you to investigate hypotheses such as:

  • What is the main source of funding for firms?
  • Do firms make informal payments for permits?
  • What are the sales per worker?
  • What are losses suffered by firms due to crime?
  • What do firms perceive as the main obstacle to their business?

More information is available on various topics, including access to finance, corruption, infrastructure, crime, competition, and performance measures. Enterprise Surveys also adopts a common sampling methodology, which means the data is comparable across countries. Data is available for 130,000 firms across 135 countries. For details of the methodology and access to the data, visit the Enterprise Surveys website, and like us on Facebook for the latest data updates.