"Ten Simple Rules for Reproducible Computational Research" - An Excellent Read for Data Scientists

Recently, the journal PLOS Computational Biology published an excellent article by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig entitled "Ten Simple Rules for Reproducible Computational Research." The list of ten rules below resonates strongly with my experiences in both computational biology and data science.

Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
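To make a few of these rules concrete, here is a minimal Python sketch of how Rules 1, 6, and 7 might look in practice. It is my own illustration, not code from the article: the function names (`run_analysis`, `save_record`), the record layout, and the output file name are all hypothetical, and the "analysis" is just a toy average of random numbers.

```python
import json
import platform
import random
import sys
import tempfile
from pathlib import Path


def run_analysis(seed: int, n_samples: int = 5) -> dict:
    """Toy stochastic analysis whose output depends only on its arguments."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    samples = [rng.random() for _ in range(n_samples)]
    return {
        "result": {"mean": sum(samples) / n_samples},
        "raw_data": samples,  # Rule 7: keep the numbers behind any plot
        "provenance": {  # Rules 1 and 6: record how the result was produced
            "seed": seed,
            "n_samples": n_samples,
            "python_version": platform.python_version(),
            "command": sys.argv,
        },
    }


def save_record(record: dict, path: Path) -> None:
    """Write the result, raw data, and provenance together as JSON."""
    path.write_text(json.dumps(record, indent=2))


if __name__ == "__main__":
    # Hypothetical output location, chosen only for this sketch.
    out = Path(tempfile.gettempdir()) / "analysis_record.json"
    record = run_analysis(seed=42)
    save_record(record, out)

    # Rerunning with the seed read back from disk reproduces the same samples.
    stored_seed = json.loads(out.read_text())["provenance"]["seed"]
    rerun = run_analysis(seed=stored_seed)
    print(rerun["raw_data"] == record["raw_data"])
```

Storing the seed and the raw samples next to the result means that a figure or summary statistic can be traced back to, and regenerated from, a single file. A real project would likely record more (for example, versions of external programs, per Rule 3), but the shape of the idea is the same.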

I highly recommend reading the full article in PLOS Computational Biology.

The rules highlighted in bold above are those that seem to be part of an emerging movement to take best practices from software engineering, and even devops, and bring them into the more exploratory worlds of computational science and data science. This trend was evident at Strata + Hadoop World 2013 in NY during Wes McKinney's excellent talk, "Building More Productive Data Science and Analytics Workflows."

More on this trend in future blog posts but, for a quick tease, go and check out Drake, a "make for data."