best practices

"Ten Simple Rules for Reproducible Computational Research" - An Excellent Read for Data Scientists

Recently, the journal PLOS Computational Biology published an excellent article by Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig entitled "Ten Simple Rules for Reproducible Computational Research." The list of ten rules below resonates strongly with my experiences in both computational biology and data science.

Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results
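Rule 6 is especially easy to overlook in practice. As a minimal sketch of what it might look like in Python (the seed value here is arbitrary, just something you would record alongside your results):

```python
import random

SEED = 20131115  # note this seed with your results (Rule 6)
random.seed(SEED)

# Any downstream sampling is now reproducible: re-seeding and
# rerunning yields the same "random" draws.
sample_a = [random.random() for _ in range(3)]

random.seed(SEED)
sample_b = [random.random() for _ in range(3)]

assert sample_a == sample_b  # identical across runs
```

The same idea applies to NumPy, R, or any other stack: the seed is just one more parameter that belongs in your analysis log.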

I highly recommend reading the full article in PLOS Computational Biology.

Several of the rules above seem to be part of an emerging movement to take best practices from software engineering, and even devops, and bring them into the more exploratory world of computational and data science. This trend was evident at Strata + Hadoop World 2013 in NY during Wes McKinney's excellent talk, "Building More Productive Data Science and Analytics Workflows."

More on this trend in future blog posts, but for a quick tease, go check out Drake, a "make for data."



The Pyramid of Data Science

This is a guest post by Oscar Olmedo, whose bio is below.

In a data science project, we want to reach the ultimate goal of extracting knowledge from a given set of data: knowledge we can use to predict and explain future and past observed events. Like other quantitative sciences, such as physics or chemistry, we want to get at the underlying causes of events in the world around us. But unlike physics or chemistry, the final knowledge is extracted by modeling the data with mathematical tools such as machine learning or data mining. By its very nature, data science is purely empirical in deriving knowledge from data.

Being empirical, data science requires that we follow the scientific method. Yes, that's right, I said scientific method. We begin with a question of why or how a particular thing happens, followed by a hypothesis (a null hypothesis and an alternative hypothesis). Then we think of some way to test the hypothesis, to see if it can explain the thing we want to explain, and finally we use the results to make predictions, then test again. In data science, we create models as tests of the hypothesis, based on a priori domain knowledge. These models connect one piece of data with another, finding previously unknown relationships. Once satisfied that our models confirm our hypothesis, how we choose to use this knowledge is up to us. But one thing's for certain: at some point we are going to have to explain our results to others.

How we get knowledge out of data is not always straightforward, and we may go through many hypotheses before finding something that works. Either way, the methodology for data science should be fairly general. Here I want to outline what I have come to call the pyramid of data science, which at its core is based on the scientific method and simply outlines the steps I have come up with for executing a data science project. You may also find that it resembles the DIKW (Data, Information, Knowledge, Wisdom) pyramid in some ways. The base upon which the pyramid sits is the idea; data selection and gathering are on the first tier, followed by data cleaning/integration and storage, then feature extraction, knowledge extraction, and finally visualization.

The Question:

Probably the most difficult part of beginning a data science project is knowing what questions to ask of the data. For example, say we have a set of data with X parameters. We are not just going to blindly start analyzing it; we must first ask a question of the data. Furthermore, we may not even necessarily have the data, or any data, at hand at the time of asking. But once we know the question, we can move to the bottommost tier of the pyramid.

1. Data Selection and Gathering

At this tier we must think about what data we are going to gather for the question at hand. This is a crucial step, because whatever data parameters we choose to collect, we are stuck with until the end (unless we come back to this step). It goes without saying that the final results will depend entirely on these data parameters and nothing else. Here we might be gathering data from many sources, all of which will probably be in different formats and organized in many different ways.

2. Data Cleaning/Integration, and Storage

These first two tiers will generally take a substantial portion of the project's time and could possibly be done simultaneously. Based on my own experience, I would argue that they can take as much as 80% or more of the time involved in the project. Here we must design a database in which to store our data and select a database management system that best fits it. The gathered data must be cleaned of corrupt, missing, or inaccurate entries. Then, data from different sources needs to be integrated into a cohesive dataset. This is where storage comes in: a schema must be chosen so that all of the data gathered from the different sources can readily be reformatted into the database.
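As a toy sketch of this tier, the snippet below cleans a few CSV-style records (dropping a row with a missing field) and loads the survivors into an SQLite table. The column names, sample values, and in-memory database are illustrative assumptions, not anything from the post itself:

```python
import csv
import io
import sqlite3

# Raw data as it might arrive from one source; note the missing age.
raw = io.StringIO("name,age\nAda,36\nGrace,\nEdsger,72\n")

# Clean: keep only rows whose age field is a valid integer.
rows = [r for r in csv.DictReader(raw) if r["age"].strip().isdigit()]

# Store: load the cleaned rows into a schema chosen up front.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, age INTEGER)")
db.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(r["name"], int(r["age"])) for r in rows],
)

count = db.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(count)  # 2 -- the row with the missing age was dropped
```

In a real project the cleaning rules and schema would come from the integration work described above; the point is simply that cleaning and storage meet at the schema.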

3. Feature Extraction

This tier is, in other words, dimensionality reduction of the data. Feature extraction usually refers to image processing, but I like to generalize it to all kinds of multidimensional data. We want to make the data easier to handle. The main goal here is to consolidate or combine our variables into more easily digestible chunks, if at all possible. In this step we could use techniques such as principal component analysis, or other transforms such as Fourier or wavelet transforms. Another example could be using the rates at which variables change in a time series. In some cases it may not be possible to reduce the dimensionality of the data, and we will just have to use all of the variables as they are.
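For instance, principal component analysis boils down to projecting the centered data onto the top eigenvectors of its covariance matrix. A bare-bones sketch with NumPy, on made-up toy data:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                         # reduced representation

# Toy data: 3 features, but the third is just the sum of the first two,
# so two components are enough to describe it.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 3.0],
              [3.0, 4.0, 7.0],
              [4.0, 3.0, 7.0]])
Z = pca(X, 2)
print(Z.shape)  # (4, 2)
```

In practice you would use a library implementation (scikit-learn's PCA, for example), but the mechanics are no more than this.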

4. Knowledge Extraction

Finally, this is the tier you have been waiting for! It is where you are at last going to be able to answer your question. With the data nicely formatted and the features selected, it is time to apply whichever analysis is most appropriate for your data, be it a machine learning algorithm, statistical analysis, time series analysis, regression, or what have you. Let's say you want to predict an outcome based on the features you have selected. You might want to use a machine learning algorithm, such as a perceptron, naïve Bayes, or an SVM; it is all up to you. Or you might want to cluster your data to find hidden patterns. Many of these techniques are quite standard and implementations are generally straightforward, or you might use a tool such as WEKA, which has many machine learning algorithms implemented in Java, ready to be adapted to your project. There are many open source implementations out there, so just go look for them. For this reason, in my opinion, this step generally takes less time than the first three tiers. This leads to the final step, visualization, which is basically explaining your results to others.
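Since the perceptron comes up above, here is a minimal from-scratch version trained on a made-up, linearly separable toy set (in a real project you would more likely reach for an existing implementation):

```python
def train_perceptron(data, labels, epochs=20, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) matches labels (+1/-1)."""
    w = [0.0] * len(data[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Misclassified: nudge the decision boundary toward this point.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy data: points above the line y = x are +1, below are -1.
data = [(0.0, 1.0), (1.0, 2.0), (1.0, 0.0), (2.0, 1.0)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(data, labels)
print([predict(w, b, x) for x in data])  # [1, 1, -1, -1]
```

The whole algorithm is a dozen lines, which is the point made above: the techniques themselves are standard, and the hard-won value is in the tiers that came before.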

5. Visualization

This, I leave up to the reader. This step takes some creativity, though you don't have to be a great graphic designer; you just need a good visual way to get your point across. It also depends on who your audience is going to be. If it is colleagues who are already familiar with the dataset and the techniques you've used, then perhaps you can omit things that would be obvious to them. If it is the CEO of your company, and you have just discovered the greatest way to optimize profits, then you will definitely have to get creative to tell a good story.

From the Editor

Do you agree? Do you disagree? Leave your thoughts below!


This is a guest post by Oscar Olmedo, physicist, data scientist, and quantitative programmer at NASA Goddard Space Flight Center. Oscar is an expert in programming, mathematical modeling, and statistical analysis of large and diverse data sets.