"Ten Simple Rules for Reproducible Computational Research" - An Excellent Read for Data Scientists

Recently, the journal PLOS Computational Biology published an excellent article by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig entitled "Ten Simple Rules for Reproducible Computational Research." The list of ten rules below resonates strongly with my experience in both computational biology and data science.

Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results

I highly recommend reading the full article in PLOS Computational Biology here.

The rules highlighted in bold above are those that seem to be part of an emerging movement to take best practices from software engineering, and even DevOps, and bring them into the more exploratory worlds of computational science and data science. This trend was evident at Strata + Hadoop World 2013 in NY during Wes McKinney's excellent talk, "Building More Productive Data Science and Analytics Workflows."
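
To make a few of these rules concrete, here is a minimal Python sketch of my own (it does not come from the article). It assumes a toy "analysis" that just averages a handful of pseudo-random numbers; the function names run_analysis and provenance_record, the record fields, and the seed value are all hypothetical. The idea is simply to store the seed (Rule 6), the raw values behind any summary or plot (Rule 7), and enough context to trace how the result was produced (Rule 1), along with the interpreter version as a small step toward Rule 3.

```python
import json
import platform
import random
import sys
from datetime import datetime, timezone


def run_analysis(seed):
    """Toy stand-in for a real analysis: draw five values and average them."""
    rng = random.Random(seed)  # Rule 6: a seeded generator, isolated from global state
    samples = [rng.random() for _ in range(5)]
    return {"samples": samples, "mean": sum(samples) / len(samples)}


def provenance_record(seed, result):
    """Bundle a result with the information needed to regenerate it."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "command": sys.argv,                           # Rule 1: how the result was produced
        "python_version": platform.python_version(),   # partial nod to Rule 3
        "seed": seed,                                  # Rule 6: note the random seed
        "raw_data": result["samples"],                 # Rule 7: raw data behind the summary
        "summary": {"mean": result["mean"]},
    }


if __name__ == "__main__":
    seed = 20131101  # arbitrary illustrative value
    record = provenance_record(seed, run_analysis(seed))
    # A standardized, human-readable format (JSON), in the spirit of Rule 5.
    print(json.dumps(record, indent=2))
```

Because the generator is seeded, rerunning the script with the same seed reproduces the same samples, and the JSON record can sit next to the figure or table it supports. A real project would also put the script itself under version control (Rule 4).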

More on this trend in future blog posts but, for a quick tease, go check out Drake, a "make for data."



Recommended Article on Disruptive Technologies in the Professional Storage Market

What is a data community without data? And, if you have data, where do you put it? AnandTech has a fascinating article on the winds of change in the enterprise storage market that I believe some of our more hardware-oriented community members will enjoy. If so, follow the link below to read this recommended article, written by Johan De Gelas:


The Open Source Report Card - A Fun Data Project Visualizing your GitHub Data

Do you contribute to Open Source projects?

Do you have a GitHub account?

If so, keep reading. Dan Foreman-Mackey built the Open Source Report Card, a fantastic web application that simply asks you for your GitHub username and then gives you back some of your own data from the GitHub timeline in a fun and entertaining fashion. It reminds me a lot of DC2's own Data Science Survey, but Dan's ability to build a web app far exceeds my own. Further, Dan does a nice job of describing what he has done and how he did it. Enjoy!


Here is a bit from his site:

Every day, many thousands of open source contributions are made on GitHub by developers around the world. This data is publicly available through the API and—even more conveniently—on the GitHub Archive. This is generally a pretty fun dataset to play with but it is particularly exciting for hackers because we get to play with data that describes our own behavior! Last year, shortly after the full event stream was publicly released, the first annual GitHub data challenge produced some sick data visualizations and it's clear that people at GitHub have been thinking about how to Use The Data For Good™.

The one graph that is especially awesome in all sorts of surprising ways is the contributions heat map on every user's profile page. What sets this apart from the other visualizations that already exist on the site? It makes a general statement about one specific user. It lets a developer have a global view of their contributions, skills and habits. This ends up being extremely motivating because it lets the developer see their progress in real time. With this in mind, it seemed like a good idea to provide a more complete set of global statistics summarizing the hacker personality of any GitHub user.



Evidence from Google IO: Recommendation Engines are not MVPs

My co-editor's earlier post today about recommendation engines is simply spot on, and I wanted to add not only my strong agreement but also some more anecdotal support for her conclusions. Google IO 2013, which concluded last week, was filled with developer-oriented announcements. As a result, some of the more consumer-focused announcements and their ramifications were glossed over. Google just announced that they are now recommending books, apps, and music via Google Play. In other words, Google just launched their own recommendation engine. When I brought this point up with a few Googlers, I was told that quite a few teams at Google had attempted to build such a recommendation engine before and met with less than stellar success. And, remember, this is Google. There have been 48 billion app installs. They are indexing the world's data. They have the Knowledge Graph. They probably have your email. Yet, with more data than anyone else on the planet, ridiculous computing super-infrastructure, and immense pools of elite talent, Google Play is just now getting its own recommendation engine in 2013. Recommendation engines are NOT minimum viable products, full stop.

Maker's Schedules Versus Manager's Schedules and Why it Matters for Data Scientists

Paul Graham of Y Combinator wrote a fascinating article in July 2009 about the "Maker's Schedule" versus the "Manager's Schedule." In it, he describes the differences in how managers and makers (software programmers) define and schedule their time, and the implications this has for meetings, productivity, and possible tensions between these two groups.

While I will be the first person to point out the large differences between software engineering and data science, the scheduling mentality of the maker is pretty similar to that of the data scientist: large blocks of uninterrupted time (think a half or full day) are required to do work, and work is defined as the creation of something, be it an analysis, a new methodology, a visualization, etc. In contrast, managers think in hourly blocks, with the meeting being the actual product created or unit of work. Thus, for the data scientist, a single meeting at 10am can completely destroy a half-day block of potential productivity.

Paul's insights ring strongly true to my personal experience, and he advocates several different strategies to mitigate this potential conflict. Simply put, I highly recommend this article to data makers and data managers alike.