Back in 2010, New York-based data scientist Drew Conway famously created the Data Science Venn Diagram. Illustrating that Data Science is the intersection of "Substantive Expertise," "Hacking Skills," and "Math and Statistics Knowledge," the diagram had a substantial impact on the nascent community. It was one of a key set of articles that helped to define the distinguishing features of data science as a discipline, and to clarify why there was a need for a new term (rather than just "applied statistics"). If you haven't seen the diagram, click through before proceeding.
Another new term is "Data Products." Mike Loukides and DJ Patil, among others, have written about the things that data scientists build. I'd like to add to that a Venn Diagram for data products that clarifies what that term means. More importantly, the diagram relates data products to other sorts of data-related artifacts that have existed for a long time. Here it is:
What does this mean? First of all, there are three sets of skills, directly paralleling Drew's data science skill sets, all floating in a sea of data. When you combine Data with Domain Knowledge, you get Spreadsheets. With Statistics, Predictive Analytics, and Visualization, you get Exploratory Data Analysis and Statistical Programming. And with Software Engineering, you get Databases. Highly useful systems and products, but nothing particularly new.
Combining pairs of sets with this sea of data, you get more specific products:
- Data + Software Engineering + Domain Knowledge = Business Rules and Expert Systems with implementations such as Drools and FICO's Blaze.
- Data + Software Engineering + EDA & Statistical Programming = BI and Statistics Tools, such as Tableau, SPSS, and many more general-purpose statistical systems.
- Data + Domain Knowledge + EDA & Statistical Programming = One-Off Analyses, which may be a PDF article, a data-driven PowerPoint presentation, or simply a chart showing a distribution sent via email.
And at the center of it all is a Data Product, which is a piece of software that includes both Domain Knowledge and Statistical components. These may be widgets in a larger web tool, such as LinkedIn's People You May Know, or software systems designed for specific analytic purposes, with baked-in domain knowledge. Tools that are designed for statistical analysis of DNA sequences, or optimization of truck routing for distributors, or many, many other things, all fall into this category. In many cases, data products make it easy for regular people to get what they need without having to dive into a very complex set of data and a very complex set of algorithms.
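For readers who think in code, the regions of the diagram can be written down as a simple lookup table. This is only a sketch of the taxonomy described above: the circle labels and artifact names come from the diagram, while the Python names (`DOMAIN`, `STATS`, `SOFTWARE`, `ARTIFACTS`, `classify`) are hypothetical and exist purely for illustration.

```python
# Labels for the three circles, as used in the list above.
# The constant names themselves are hypothetical.
DOMAIN = "Domain Knowledge"
STATS = "EDA & Statistical Programming"
SOFTWARE = "Software Engineering"

# Each region of the diagram, keyed by the set of circles that overlap there.
# Every region is implicitly combined with the surrounding "sea of data".
ARTIFACTS = {
    frozenset({DOMAIN}): "Spreadsheets",
    frozenset({STATS}): "Exploratory Data Analysis and Statistical Programming",
    frozenset({SOFTWARE}): "Databases",
    frozenset({SOFTWARE, DOMAIN}): "Business Rules and Expert Systems",
    frozenset({SOFTWARE, STATS}): "BI and Statistics Tools",
    frozenset({DOMAIN, STATS}): "One-Off Analyses",
    frozenset({DOMAIN, STATS, SOFTWARE}): "Data Product",
}


def classify(*components):
    """Return the diagram region for a combination of circles (order does not matter)."""
    return ARTIFACTS[frozenset(components)]


if __name__ == "__main__":
    print(classify(SOFTWARE, DOMAIN))
    print(classify(DOMAIN, STATS, SOFTWARE))
```

Keying on a `frozenset` mirrors how a Venn diagram works: only which circles overlap matters, not the order in which they are named.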
What are the consequences of this framework? I'd assert that a product that combines all three aspects of a data product, requiring all three skill sets of a data scientist to design and build, may be substantially more valuable than products that combine just one or two of the components.
One-off analyses can be great, but a repeatable, reproducible analysis is much better. Business rules can lead to maintainable software systems, but without statistical capabilities, they may be too rigid to work adequately in many real-world situations. (See the history of AI research prior to about the 1980s.) And general-purpose BI and Statistics tools are extremely useful, but may become even more powerful when the systems are designed for and incorporate particular domain knowledge.
What do you think? What's missing? Does this clarify your thinking? Or is this entirely obvious?