Applications Now Open for District Data Labs Incubator and Research Lab

Applications are now open for the Spring 2015 session of the District Data Labs incubator program and research lab.

The Incubator is a structured 3-month project development program where teams of people work on data projects together.  Each team is assigned one project and team members build a data product together over the course of the 3 months.  Team sizes are small (3-4 people per team) and are carefully assembled to contain a mix of quantitative and technical skills.

Energy Education Data Jam

This is a guest post by Austin Brown, co-organizer of the Data Jam, and senior analyst with the National Renewable Energy Laboratory. DC2 urges you to check this out. (And if you participate, please let us know how it goes!) The Office of Energy Efficiency and Renewable Energy (EERE) at the U.S. Department of Energy (DOE) is hosting an “Energy Education Data Jam,” which will take place on Thursday, March 27, 2014, from 9am to 4pm, in Washington D.C. This is an event that could really benefit from the participation of some more talented developers, data folks, and designers.

The event features presentations from a great set of experts and innovators: Aneesh Chopra, former White House CTO; Dr. Ed Dieterle, Gates Foundation; Dr. Jan DeWaters, Clarkson University; Dr. Cindy Moss, Discovery Education; and Diane Tucker, Wilson Center.

In the growing ecosystem of energy-related data jams and hackathons, this one will be distinct in that it is targeted toward improving the general understanding of the basics of energy in the U.S., which we have identified as a key obstacle to sensible long-term progress in energy.  We hope that what emerges from this data jam will be applicable to learners of any age – from preschool to adult learners.

EERE is working to amplify our approach to help improve energy understanding, knowledge, and decision-making. To address the measured gap in America's energy literacy, we plan to unite energy experts with the software, visualization, and development communities. This single-day event will bring developers and topic experts together with the goal of creating innovative products and partnerships to directly address energy literacy going forward.

The goal of the data jam is to catalyze development of tools, visualizations, and activities to improve energy literacy by bringing together:

• Developers and designers who understand the problems presented by the energy literacy gap and have a desire to bring about change

• Educators with knowledge of how students learn, how energy is taught, and ideas about how we can bridge the energy literacy gap

• Energy experts with a high-level understanding of the energy economy who are capable of deconstructing complicated energy data

• Energy foundations and nonprofits committed to clean energy and an understanding that education can be the first step towards a clean energy economy

No prior experience in energy education is required – just an innovative mindset and a readiness to change how we think about spreading the word about energy.

If you have any questions or would like to RSVP, please send an email to energyliteracy@ee.doe.gov. You can also RSVP through Eventbrite. The event aims to draw participants from a range of backgrounds and areas of expertise, and as such, space will be limited. We ask that you kindly respond as soon as possible. Lunch will be provided.

Winter Updates from District Data Labs

Just wanted to give you a quick update about some exciting things happening in our data community with regard to education. Last year, we sent out a survey to this group about potentially starting a data science academy. Many of you responded, and as a result, a few of us have formed District Data Labs, which will focus on bringing more and better data science educational offerings to our community. Expect to see more workshops and some longer courses being offered this year!

Speaking of workshops, a lot of you have expressed interest in learning more about Python and how it can be used to ingest, analyze, model, and visualize data. So to start the year off right, we've got a couple of Python Data Analysis workshops coming up on February 22nd. Make sure to check those out!

Lastly, if you have any topics you'd like to learn more about or if you can teach a subject you think others in the community would like to learn more about, please get in touch with me. I'd love to get your thoughts.

Thanks, and happy learning in 2014!

Tony Ojeda Co-founder, Data Community DC & District Data Labs

Validation: The Market vs Peer Review

How do we know our methods are valid? In academia your customers are primarily other academics, and in business they're whoever has money to pay. However, it's a fallacy to think academics don't have to answer outside of their ivory tower or that businesses need not be concerned with expert opinion. Academics need to pay their mortgages and buy milk, and businesses will find their products or services aren't selling if they're clearly ignoring well-established principles. Conversely, businesses don't need to push the boundaries of science for a product to be useful, and academics don't need to increase market share to push the boundaries of science. So how do we strike a balance? When do we need to seek these different forms of validation?

Academics Still Need Food

Some of us wish things were so simple, that academics need not worry about money nor businesses about peer review, but everything is not so well delineated. Eventually even the most eccentric scientist has to buy food, and every business eventually faces questions about its products and services. Someone has to buy the result of your efforts; the only questions are how many are buying and at what price. In academia, without a grant you may just be an adjunct professor. Professors are effectively running mission-driven organizations that are non-profit in nature, and their investors are the grant review panels, which weigh the peer review process heavily when awarding funds.

Bridge the Consumer Gap

Nothing goes exactly as planned. Consumers may buy initially, but there will inevitably be questions, and business representatives cannot possibly address them all. A small "army" may be necessary to handle the questions, and armies need clear direction, so a business is inevitably reviewed and accepted by its peers. That peer review helps bridge the gap between business and consumer.

Unlike academic peer review, businesses often have privacy requirements for competitive advantage that preclude the open exposure of a particular solution. In these cases, credibility is demonstrated when your solution can provide answers to clients' particular use cases. This is a practical peer review in the jungle of business.

Incestuous Peer Review

You can get lost in this peer review process: each person has their thoughts, which they write about, affecting others' thoughts, which they write about, and so on and so forth. A small group of people can produce mountains of "peer reviewed" papers and convince themselves of whatever they like, much like any crazy person can shut out the outside world and become lost in their own thoughts. Gödel's incompleteness theorem can be loosely interpreted as, "For any system there will always be statements that are true, but that are unprovable within the system." Gödel was dealing specifically with natural numbers, but we inherently understand that you cannot always look inward for the answer; you have to engage the outside world.

Snake Oil Salesman

Conversely, without peer review or accountability, cursory acceptance (i.e. consumer purchases) can give a false sense of legitimacy. Some people will give anything a try at least once, and the snake oil salesman is the perfect example. Traveling from town to town, the salesman brings an oil people have never seen before and claims it can remedy all of their worst ailments; however, by the time people use the snake oil and realize it is no more effective than a sugar pill, the salesman has already moved on to another town. Experience with a business goes a long way in legitimizing the business.

Avoid Mob Rule

These two forms of legitimacy, looking inward versus outward, peer review versus purchase, can be extremely powerful, rewarding, and a positive force in society. Have you ever had an idea people didn't immediately accept? Did you look to your friends and colleagues for support before sharing your idea more widely? This is a type of peer review (though not a formal one), and something we use to develop and introduce new ideas.

The Matrix

Conversely, have you ever known something to be true but couldn't find the words to convince others? In The Matrix, Morpheus tells Neo, "Unfortunately, no one can be told what the Matrix is. You have to see it for yourself." If people can be made to see what you know to be true, to experience the matrix rather than be told about it, they have more grounds to believe and accept your claims. Sometimes in business you have to ignore the naysayers, build your vision, and let its adoption speak for itself. Ironically, there are those who would presume to teach birds to fly, and businesses may watch the peer review process explain how their vision works only to then be lectured on why they were successful.


In legitimizing our work, business or academic, when do we look to peer review and when do we look to engaging the world? This is a self-similar process: we may gather our own thoughts before speaking, or we may consult our friends and colleagues before publishing, but above all we must be aware of who is consuming our product and review our product accordingly before sharing it.

Selling Data Science: Validation

We are all familiar with the phrase "We cannot see the forest for the trees," and this certainly applies to us as data scientists. We can become so involved with what we're doing, what we're building, the details of our work, that we don't know what our work looks like to other people. Often we want others to understand just how hard it was to do what we've done, just how much work went into it, and sometimes we're vain enough to want people to know just how smart we are.

So what do we do? How do we validate one action over another? Do we build the trees so others can see the forest? Must others know the details to validate what we've built, or is it enough that they can make use of our work?

We are all made equal by our limitation to 24 hours in a day, and we must choose what we listen to and what we don't, what we focus on and what we don't. The people who make use of our work must do the same. Philosophers have long posed the thought experiment, "If a tree falls in the woods and no one is around to hear it, does it make a sound?" If we explain all the details of our work and no one gives the time to listen, will anyone understand? To what will people give their time?

Let's suppose that we can successfully communicate all the challenges we faced and overcame in building our magnificent ideas (as if anyone would sit still that long). What then? Thomas Edison is famous for saying, “I have not failed. I've just found 10,000 ways that won't work.” But today we buy lightbulbs that work; who remembers all the details of the different ways he failed? "It may be important for people who are studying the thermodynamic effects of electrical currents through materials." OK, it's important to that person, but for the rest of us it's still not important. We experiment, we fail, we overcome, thereby validating our work so that others don't have to.

Better to teach a man to fish than to provide for him forever, but there are an infinite number of ways to successfully fish. Some approaches may be nuanced in their differences, but others may be so wildly different that they're unrecognizable, unbelievable, and invite incredulity. The catch (no pun intended) is that methods are valid because they yield measurable results.

It's important to catch fish, but success is neither consistent nor guaranteed, and groups of people may fish together so that after sharing their bounty everyone is fed. What if someone starts using this unrecognizable and unbelievable method of fishing? Will the others accept this "risk" and share their fish with those who won't use the "right" fishing technique, their technique? Even if it works the first time, that may simply be a fluke, they say, and we certainly can't waste any more resources "risking" hungry bellies now, can we?

So does validation lie in the method or the results?  If you're going hungry you might try a new technique, or you might have faith in what's worked until the bitter end.  If a few people can catch plenty of fish for the rest, let the others experiment.  Maybe you're better at making boats, so both you and the fishermen prosper.  Perhaps there's someone else willing to share the risk because they see your vision, your combined efforts giving you both a better chance at validation.

If we go along with what others are comfortable with, they'll provide fish.  If we have enough fish for a while, we can experiment and potentially catch more fish in the long run.  Others may see the value in our experiments and provide us fish for a while until we start catching fish.  In the end you need fish, and if others aren't willing to give you fish you have to get your own fish, whatever method yields results.

Selling Data Science: Common Language

What do you think of when you say the word "data"? For data scientists it means SO MANY different things, from unstructured data like natural language and web crawling to perfectly square Excel spreadsheets. What do non-data scientists think of? Many times we might come up with a slick line for describing what we do with data, such as "I help find meaning in data," but that doesn't help sell data science. Language is everything, and if people don't use a word on a regular basis it will not have any meaning for them. Many people aren't sure whether they even have data, let alone whether there's some deeper meaning, some insight, they would like to find. As with any language barrier, the goal is to find common ground and build from there.

You can't blame people; the word "data" is about as abstract as you can get, perhaps because it can refer to so many different things. When discussing data casually, rather than mansplain what you believe data is or could be, it's much easier to find examples of data that they are familiar with, preferably ones integral to their work.

The most common data that everyone runs into is natural language; unfortunately, this unstructured data is also some of the most difficult to work with. In other words, they may know what it is, but showing how it's data may still be difficult. One solution: discuss a metric with a qualitative name, such as "similarity", "diversity", or "uniqueness". We may use the Jaro algorithm to measure similarity, where we count the common letters between two strings and their transpositions, though there are other algorithms. When we discuss "similarity" with someone new, or any other word that measures relationships in natural language, we are exploring something we both accept, and we are building common ground.
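To make the "similarity" conversation concrete, here is a minimal sketch of the Jaro similarity just described: it counts matching characters within a sliding window and half-counts transpositions. This is an illustrative implementation, not code from the original post.

```python
def jaro(s1, s2):
    """Jaro similarity: 1.0 for identical strings, 0.0 for no common letters."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0

    # Characters only "match" if they sit within this window of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)

    # Count matching characters.
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Count transpositions: matched characters that appear out of order.
    t = 0
    j = 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t /= 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

# The classic textbook pair: 6 matches, 1 transposition -> 17/18 ≈ 0.944.
print(jaro("MARTHA", "MARHTA"))
```

A score near 1 means "very similar," which is exactly the kind of intuitive framing that builds common ground with a non-technical audience.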


Some data is obvious, like this neatly curated spreadsheet from the Committee to Protect Journalists. Part of my larger presentation at Freedom Hack (thus the lack of labels), the visualization shown on the right was only possible to build in short order because the data was already well organized. If we're lucky enough to have such an easy start to a conversation, we get to bring the conversation to the next level and maybe build something interesting that all parties can appreciate; in other words, we get to "geek out" professionally.

Data Visualization: The Double Edged Data Sword

Can we use data visualization, and perhaps data avatars, to build a better community? If you've ever been part of making the rules for an organization, you may be familiar with the desire to write a rule for every scenario that may arise, to codify how the organization expects things to happen in a given context. For a small group this may work, as you can resolve most issues with a simple conversation and only broad rules need be codified (roles, responsibilities, etc.). However, we're also familiar with the draconian rules that arise as a result of some crazy thing one person did.

One could argue, "What choice do we have?" Once the size of a group grows and communication becomes a combinatorial challenge (you can't talk to everyone about everything all the time), we need the rule of law. Laws provide everyone a common reference they can relate to their individual context and use to govern their daily conduct, but so can data. The difference between our modern, sensor-laden, interconnected world and the opaque world of previous generations is data and information. We have a much greater potential to be aware of the context beyond our immediate senses, and thereby better understand the consequences of our actions, but we can't reach that potential unless we can visualize that data.

There will always be human issues, people interpret data and information differently, which is why we must "trust but verify" and when necessary use data to revisit peoples' reasoning.  Data visualization is what allows us to be aware of "the context beyond our immediate senses", and this premise holds whether you're in the moment or you're looking for a deeper understanding.

When we're in the moment we, presumably, want to make better decisions and need "decision support". To make decisions in the moment, the information must be "readily available" or we're forced to make decisions without it. Consider how new data might change basic decisions throughout your day; in fact, businesses are already taking advantage of this: coffee shops provide public transit information on tablets behind the counter so people know they have time for that extra latte.

Conversely, if we want to understand how events unfolded, we rely on our observations, possibly the observations of those we trust, and fill the space between with our experience and assumptions about the world. Different people make different observations, and sometimes we can piece together a more precise picture of our shared experience, but the more precise the observations the more unique the situation, and we need laws that provide "a common reference", so we write laws for the most general cases. The consequence: an officer could write us a ticket for jaywalking even though there are no cars for miles, or we are afraid to help those around us because it may implicate us. Bottom line: the more observations we have, the less we are forced to assume.

"Decision Support" and "Trust but Verify" are the two sides of the double-edged data sword, and it's this give and take that forces gradual adoption by people, organizations, governments, etc. Almost universally, people want transparency into why events unfolded but do not necessarily want information about themselves made widely available. The most notorious examples involve Julian Assange of WikiLeaks and Edward Snowden of the recent NSA leaks; in these cases the US Government wants information without having its own information disclosed. Conversely, many of us believe governments would run better with more transparency, but there is a proper balance.

On a less controversial and more personal level: I use financial tracking software to help me plan budgets and generally live within my means; I use GPS to help me understand my workout routines; I use event tools to plan my Data Visualization DC events; and I generally allow third-party applications access to my data in exchange for new and better services. Each of these services comes with some sort of data visualization and analytics, and those visualizations and analytics are essential to my personal decision support. I enjoy the services, but it is interesting to then suddenly see advertisements about the thing I shared, tweeted, or emailed earlier that day or week. On the one hand I'm glad advertisements have become more meaningful, but what are the consequences of the double-edged data sword?

I would like to be able to revisit my personal information for innocuous reasons, to remember where I've been, what actions I took, who I talked to, who I shared what with, etc.  There are more compelling reasons too, trust but verify reasons, as the data could prove I was at work, was conducting work, met with that client, didn't waste the money, couldn't be liable for an accident, etc.; I'd like to generally have the power to confirm statements and claims I make using my personal data.

Unfortunately, I don't own what's recorded about me, my personal data; typically the third party owns that data. Theoretically I can get that information piecemeal, going to each service provider and manually recording it, and a former Justice Department lawyer even suggested the wide use of FOIA, but we data scientists and visualizers know that if you can't automate the data collection and visualization, it's really not practical. In other words, without data visualization we can't hear the proverbial tree in the woods. So until there is a FOIA app on my smartphone, basically an API for my personal data guaranteed by the government for the people, we can't visualize "the context beyond our immediate senses" for ourselves and others, and the other edge of the data sword will always be difficult to defend against.

Data Visualization: Exploring Biodiversity

When you have a few hundred years' worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don't perfectly align over the years, and in some cases there is outright conflicting information. This data is important; it is our civilization's best minds giving their all to capture and record the biological diversity of our planet. Unfortunately, as it stands today, if you or I decided we wanted to learn more, or wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career.

Earlier this year an executive order was issued which generally stated that federally funded research has to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn't necessarily directly apply to them, and has embarked on making their treasure of information more easily accessible. This is a laudable goal, but how do we actually go about accomplishing it? Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization. The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever-increasing sources. However, we know this curation challenge cannot be met by simply wrapping the data with a single structure or taxonomy. When we search and explore the BHL data we may not know precisely what we're looking for, and we don't want a scavenger hunt to ensue where we're forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out...

People see relationships in the data differently, so when we go exploring, one person may do better with a tree structure, another may prefer a classic title/subject style search, and others may be interested in reference types and frequencies. Debating which single monolithic system would serve everyone is akin to discussing the number of angels that fit on the head of a pin: we'll never be able to test our theories. Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

Data Visualization DC (DVDC) is partnering with the Smithsonian's Biodiversity Heritage Library to introduce new methods of exploring their vast national data treasure. Working with cutting-edge visualizers such as Andy Trice of Adobe, DVDC is pairing new tools with the Smithsonian's, our public, biodiversity data. Andy's development of web standards for visualizing data with HTML5 is a key step toward making the BHL data more easily accessible, not only by creating rich immersive experiences but also by providing the means through which we all can take a bite out of this huge challenge. We have begun with simple data sets, such as these sets organized by title and subject, but there are many fronts to tackle. Thankfully, the Smithsonian is providing as much of their BHL data as possible through their API, but beyond all that is available, there is more that has yet even to be digitized. Hopefully over time we can utilize data visualization to unlock this great public treasure, and hopefully knowledge of biodiversity will help us imagine greater things.



Free Government Data: Access Sunlight Foundation APIs on a New Data Services Site

We love what the Sunlight Foundation is trying to do with government data and so the following is a repost (with permission) of one of their recent announcements.

Grab a key and start developing a project using open data

May 6, 2013

Contact: Liz Bartolomeo 202-742-1520 x226

WASHINGTON, DC — The Sunlight Foundation is expanding its free data services with a new website -- http://sunlightfoundation.com/api/ -- to access our open government APIs. We offer APIs (a.k.a. application programming interfaces) for a number of our projects and tools and support a community of developers who create their own projects using this data.

Nonprofit organizations, political campaigns and media outlets use our collection of APIs, which cover topics such as the Congressional Record, lobbying records and state legislation. More than 7,000 people have registered for an API key, resulting in over 735 million API calls to date. Greenpeace uses congressional information available through Sunlight APIs on its activist tools, and the Wikimedia Foundation used Sunlight APIs to help people connect with their lawmakers in Congress during the SOPA debate last year. Those using Sunlight APIs run across the political spectrum, from the Obama-Biden campaign to the Tea Party Patriots.

The new data services site has a complete listing of our APIs, their current operating status, as well as a query builder so developers can try out the API before using it. Need inspiration? The API gallery features a curated list of interesting uses of Sunlight APIs from an experiment reviewing “kindness in the Congressional Record” to an interactive bill tracker for Missouri.

"The Sunlight Data Services site also features projects from the larger community that are actively accepting code contributions, allowing developers to get their feet wet with OpenGov projects right away," said Kaitlin Devine, a web developer who previewed the site at this past weekend's TransparencyCamp. TransparencyCamp is an unconference organized by Sunlight that attracted more than 500 developers, technologists, policy makers, academics, journalists and activists.

Sunlight Foundation APIs

Capitol Words API Provides access to the word frequency count data powering the Capitol Words project.

Congress API v3 A live JSON API for the people and work of Congress. Information on legislators, districts, committees, bills, votes, as well as real-time notice of hearings, floor activity and upcoming bills.

Influence Explorer API The Influence Explorer API gives programmers and journalists the ability to easily create subsets of large data for their own research and development purposes. The API currently offers campaign contributions and lobbying records with more data sets coming soon.

Open States API Information on the legislators and activities of all 50 state legislatures, Washington, D.C. and Puerto Rico.

Political Party Time API Provides access to the underlying, raw data that the Sunlight Foundation creates based on fundraising invitations collected in Party Time. As we enter information on new invitations, the database updates automatically.
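To give a feel for how APIs like these are typically consumed, here is a minimal Python sketch that builds a keyed request URL for a legislator lookup. The endpoint path and parameter names are assumptions based on the general shape of the Congress API, not details from this announcement; consult the data services site for the real documentation and register there for an actual key.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for locating legislators by coordinates --
# verify the real path and fields against the official API docs.
BASE_URL = "https://congress.api.sunlightfoundation.com/legislators/locate"

def build_locate_url(api_key, latitude, longitude):
    """Return a query URL that looks up legislators for a lat/lon point."""
    params = {
        "apikey": api_key,       # every Sunlight API call requires a key
        "latitude": latitude,
        "longitude": longitude,
    }
    return BASE_URL + "?" + urlencode(params)

# Coordinates for Washington, DC.
url = build_locate_url("YOUR_KEY", 38.9072, -77.0369)
print(url)
```

From there, fetching the URL and decoding the JSON response (e.g. with `urllib.request` and `json`) yields the structured records these project descriptions advertise.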

The new data services site also features a calendar, so those interested in local meetups, hackathons and other events can get involved around the country.

Since we know that not everyone is a developer, you can learn more about how to use our APIs on Sunlight Academy.

Receive the latest news about our APIs on the Sunlight Labs API Google Group or follow Sunlight Labs on Twitter.

The Sunlight Foundation is a non-partisan non-profit that uses cutting-edge technology and ideas to make government transparent and accountable. Visit http://SunlightFoundation.com to learn more about Sunlight’s projects, including http://PoliticalPartyTime.org and http://influenceexplorer.com.