Sean Murphy co-organizes Data Business DC, among many other things. Hadley Wickham, having just taught workshops in DC for RStudio, shared with the DC R Meetup his view on the future, or at least the near future, of data analysis. Herein lie my notes from this talk, spiffed up into semi-comprehensible language. Please note that my thoughts, opinions, and biases have been split, combined, and applied to his. If you really only want to download the slides, scroll to the end of this article.
As another legal disclaimer, one of the best parts of this talk was Hadley's commitment to making a statement, or, as he related, "I am going to say things that I don't necessarily believe but I will say them strongly."
You will also note that Hadley's slides are striking even at 300px wide ... evidence that the fundamental concepts of data visualization and graphic design overlap considerably.
Data analysis is a heavily overloaded term that means different things to different people.
However, there are three sets of tools for data analysis:
- Transformation, which he equated to data munging or wrangling;
- Visualization, which is useful for raising new questions but, as it requires eyes on each image, does not scale well; and
- Modeling, which complements visualization and where you have made a question sufficiently precise that you can build a quantitative model. The downside to modeling is that it doesn't let you find what you don't expect.
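The three tool sets above can be sketched in a few lines. This is only an illustration under stated assumptions: the talk was about R, the records below are made up, and the function names `transform`, `visualize`, and `model` are hypothetical, chosen to mirror the list rather than any real library.

```python
# Hypothetical raw records: (group, x, y). Invented for illustration only.
records = [
    ("a", 1.0, 2.1), ("a", 2.0, 3.9), ("b", 3.0, 6.2),
    ("b", 4.0, 7.8), ("a", 5.0, 10.1), ("b", None, 4.0),
]

def transform(rows):
    """Transform: drop incomplete rows and keep only the (x, y) pairs."""
    return [(x, y) for _, x, y in rows if x is not None and y is not None]

def visualize(pairs, width=40):
    """Visualize: a crude text bar chart, one bar per observation."""
    top = max(y for _, y in pairs)
    return [f"x={x:>4}: " + "#" * round(width * y / top) for x, y in pairs]

def model(pairs):
    """Model: ordinary least squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

clean = transform(records)
chart = visualize(clean)
intercept, slope = model(clean)

for line in chart:
    print(line)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The chart needs a person to look at it, which is why visualization does not scale, while the fitted line answers only the one question it was built for.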
Now, I have to admit I loved this one. Data analysis is "typing not clicking." Not to disparage all of those Excel users out there, but programming or coding (in open source languages) allows one to automate processes, make analyses reproducible, and even communicate your thoughts and results, even to "future you." You can also throw your code on Stack Overflow to get help or to help others.
Hadley also described data analysis as demanding much more cogitation time than CPU execution time. One should spend more time thinking about things than actually doing them. However, as data sets scale, this balance may shift a bit. Or one could argue that the longer your analysis takes to run, the more thought you should put into the idea and the code beforehand, because the penalty for running the wrong or an incorrect analysis for days grows. Luckily, we aren't quite back to the days of the punch card.
Above is a nice way of looking at some of the possible data analysis tool sets for different types of individuals. To put this into the vernacular of the data scientist survey that Harlan, Marck, and I put out, R+js/python would map well to the Data Researcher, R+sql+regex+xpath could map to the Data Creative, and R+java+scala+C/C++ could map to the Data Developer. Ideally, one would be a polyglot and know languages that span these categorizations.
Who doesn't love this quote? The future (of R and data analysis) is here in pockets of advanced practitioners. As ideas disperse through the community and the rest of the masses catch up, we push forward.
Communication is key ...
but traditional tools fall short when communicating data science ideas, results, and methods. rmarkdown fills that gap and can be quickly and easily translated into HTML.
Going one step further, but still coupled to rmarkdown, is a new service, RPubs, that allows one-click publishing of rmarkdown to the web for free. Check it out ...
If rmarkdown is the Microsoft Word of data science, then Slidify is the counterpart to PowerPoint (and it is free), allowing one to integrate text, code, and output powerfully and easily.
While these tools are great, they aren't perfect. We are not yet at a point where our local IDE has been seamlessly integrated with our code versioning system, our data versioning system, our environment and dependency versioning system, our system for publishing and broadcasting results, or our collaboration systems.
Not there yet ...
Basically, Rcpp allows you to embed C++ code in your R code easily. Why would someone want to do that? Because it lets you sidestep the performance penalty of for loops in R: just write them in C++.
On a personal rant, I don't think mixing in additional languages is necessarily a good idea, especially C++.
Notice the units of microseconds. There is always a trade-off between the time spent optimizing your code and the time spent running your slow code.
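That trade-off can be made concrete with a little arithmetic. The sketch below is a Python analogy, not Rcpp: it times an explicit interpreted loop against the built-in `sum` (which runs in compiled code), then asks how many runs it takes to repay the time spent optimizing. The helper `runs_to_break_even` and the one-hour development cost are assumptions made up for this example.

```python
import timeit

def loop_sum(values):
    """Sum with an explicit interpreted loop (the 'slow' version)."""
    total = 0
    for v in values:
        total += v
    return total

def runs_to_break_even(dev_seconds, slow_seconds, fast_seconds):
    """Number of runs before the time spent optimizing is repaid.

    Returns None when the 'fast' version is not actually faster.
    """
    saved_per_run = slow_seconds - fast_seconds
    if saved_per_run <= 0:
        return None
    return dev_seconds / saved_per_run

values = list(range(100_000))

# Both versions must agree before their speed is worth comparing.
assert loop_sum(values) == sum(values)

repeats = 20
slow = timeit.timeit(lambda: loop_sum(values), number=repeats) / repeats
fast = timeit.timeit(lambda: sum(values), number=repeats) / repeats

# Assumed cost: one hour of developer time spent on the faster version.
needed = runs_to_break_even(3600, slow, fast)
print(f"slow: {slow * 1e6:.0f} us, fast: {fast * 1e6:.0f} us per run")
print(f"runs needed to break even: {needed}")
```

When the saving per run is measured in microseconds, the break-even count is enormous, so the rewrite only pays off for code that runs very often.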
Awesome name, full stop. Let's take two great tastes, ggplot2 and D3.js, and put them together.
As a comparison, what might take you a few lines of code to create in R + ggplot2 could take you a few hundred lines of code in D3.js. Some middle ground is needed, allowing R to produce web-suitable, D3-esque graphics.
P.S. Just in case you were wondering, r2d3 does not yet exist. It is currently vaporware.
Enter Shiny, which allows you to make web apps with R hidden as the back end, generating .pngs that are refreshed, potentially with an adjustable parameter input from the same web page. This doesn't seem to be the Holy Grail everyone is looking for, but it is moving the conversation forward.
One central theme was the idea that we want to say what we want and allow the software to figure out the best way to do that. We want a D3-type visualization but we don't want to learn 5 languages to do it. This applies equally on the data analysis side, for data sets ranging over many orders of magnitude.
Another theme was that the output of the future is HTML5. I did not know this, but RStudio is basically a web browser: everything is drawn using HTML5, js, and CSS.
Loved this slide because who doesn't want to know?!
dplyr is an attempt at a grammar of data manipulation, abstracting the back end that crunches the data away from the description of what someone wants done (and no, SQL is not the solution to that problem).
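To illustrate the idea of separating the description from the back end, here is a minimal Python sketch. It is not dplyr's API: the `MeanBy` class, both `run_*` functions, and the sample rows are hypothetical. One description of what is wanted is handed to two interchangeable back ends, a plain Python loop and an in-memory SQLite database.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)
class MeanBy:
    """Describes WHAT is wanted: the mean of `value` per `group`,
    using only rows where `value` is at least `minimum`."""
    group: str
    value: str
    minimum: float

def run_in_memory(query, rows):
    """Back end 1: plain Python over a list of dicts."""
    sums = {}
    for row in rows:
        if row[query.value] >= query.minimum:
            total, count = sums.get(row[query.group], (0.0, 0))
            sums[row[query.group]] = (total + row[query.value], count + 1)
    return {g: total / count for g, (total, count) in sums.items()}

def run_in_sqlite(query, rows):
    """Back end 2: the same description translated to SQL for SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute(f'CREATE TABLE t ("{query.group}" TEXT, "{query.value}" REAL)')
    con.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(r[query.group], r[query.value]) for r in rows],
    )
    sql = (
        f'SELECT "{query.group}", AVG("{query.value}") FROM t '
        f'WHERE "{query.value}" >= ? GROUP BY "{query.group}"'
    )
    result = dict(con.execute(sql, (query.minimum,)).fetchall())
    con.close()
    return result

# Made-up sample data for illustration.
rows = [
    {"species": "a", "weight": 2.0},
    {"species": "a", "weight": 4.0},
    {"species": "b", "weight": 5.0},
    {"species": "b", "weight": 0.5},
    {"species": "b", "weight": 7.0},
]
query = MeanBy(group="species", value="weight", minimum=1.0)

in_memory = run_in_memory(query, rows)
in_sqlite = run_in_sqlite(query, rows)
print(in_memory, in_sqlite)
```

The caller writes the `MeanBy` description once and never touches SQL; which back end crunches the data becomes a separate decision.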
And this concludes what was a fantastic talk about The ^(near) Future of Data Analysis. If you've made it this far and still want to download Hadley's full slide deck or Marck's introductory talk, look no further: