Is Statistics the Least Important Part of Data Science?

There is a fascinating discussion occurring on Andrew Gelman's blog that some of our Data Community DC members might want to chime in on, or discuss right here on our blog. Dr. Gelman writes:

There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . .

The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option.

To put it another way: you can do tech without statistics but you can’t do it without coding and databases.

So, what do you think?

Some of the Data Community DC board members weigh in below:

From Sean Gonzalez

We are often quoted as saying something akin to "80% of data science is data munging," and Dr. Gelman is following this basic idea. Part of me wishes I could be given a little sandbox where I get to play with algorithms all day; the amount of work it takes to set up an environment where you actually get to play with math that has any bearing on your colleagues or business can be staggering. On the other hand, this is exactly what I ran away from in 2013; I do not like having the algorithms' applications essentially decided for me ahead of time by the business plan. There are enough tools, languages, packages, code bases, and enthusiasm that data science can become the centerpiece of the product, and some of the best examples are in marketing and product pricing. If you have the vision, an affinity for multiple programming languages, algorithmic insights, and the pulse of your customers, you can build applications that shock people with their capabilities. In other words, applications using algorithms yield much more interesting results and are far more useful. In short, without statistics, algorithms, mathematics, etc., you're a developer, not a data scientist.

From Tony Ojeda

In my opinion, data science without math is not really data science. Data science is not just moving data around, transforming it, and storing it in different places. It's not just creating an application that uses data in some way. Data science involves the entire toolbox of skills needed to get through the process of acquiring data, cleaning it, storing it, exploring it, analyzing it, modeling it, and then taking the results of all that and creating something valuable.

This means that DBAs and software developers need to improve their math skills if they want to refer to themselves as data scientists, and statisticians and mathematicians need to improve their data architecture and coding skills if they want to refer to themselves as data scientists.

From Abhijit Dasgupta

Hilary Mason says that data science is about telling stories with data. Just data manipulation and organization can't tell stories, though it can help make the story easier to write. The story can't just come from abstract models and algorithms without setting up the data, and understanding its context. The story can't be written by just knowing the context, without being able to get your hands dirty and understand what the data is saying. It takes a combination of data handling, manipulation, and organization, algorithms and models, visualization and context to tell the story. In other words, data science is a partnership between IT and database developers (data handling, manipulation, organization), statisticians/mathematicians/computer scientists (algorithms and modeling, visualization), and the owners/caretakers/collectors of the data (context). Each part is necessary to tell the story. A data scientist (more precisely, a "good" data scientist) must have a combination of data architecture, manipulation and coding skills, knowledge of algorithms and models (including an understanding of what algorithm is good for what types of data and contexts, i.e., the mathematics of the algorithm), and the ability to contextualize and translate the results of the coding and analyses to create the story; that story can then be further developed into something more useful and valuable. So the math/stat types have to learn to get their hands dirty using data infrastructure and programming tools, the IT types must learn something about models, visualization and the appropriate contexts in which they have a chance at being successful, and both have to learn about the context and background of the data. After all, data and all the analysis in the world without an end in mind is just data, not a story.

From Ben Bengfort

Data science as a product does focus primarily on computation, true. Perhaps because of this, most data science products are constructed in an Agile fashion, building through iteration. Agile data science requires fast hypothesis testing across large data sets, failure early and often, and sprints towards finite goals. Although the tools are computational (particularly if you consider machine learning to be computation rather than statistics), hypothesis validation and analyses are essential to this methodology. Tech without statistics gives you no sense of where you are and where you're going, and you're simply pounding away at a domain-specific dataset without innovating. Although the statistics gets minimized (possibly because statisticians don't necessarily have an application-oriented focus), it plays a major role in the agile data science life cycle, and should be treated as being just as essential as databases and coding.



If you would like to read more of the discussion that has already occurred, please go over to Dr. Gelman's blog.