On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled "Stepping up to big data with R and Python," was an experiment in collective learning as Marck and I guided a lively discussion of strategies to leverage the "traditional" analytics stack in R and Python to work with big data.
R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general-purpose programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Both Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn't even try to cover everything, and we didn't believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analysis tasks.
We covered several basic tasks in the discussions: data import, visualization, MapReduce, and parallel processing. We noted that, since R is becoming a "lingua statistica," many commercial vendors such as SAP, Oracle, Teradata, and Netezza have developed interfaces that allow R to serve as an analytic backend. Python, on the other hand, has been used to develop integrated analysis platforms because of its strengths as a "glue language," its robust general capabilities, and its web development packages.
Most data scientists have had experience with small to medium data. Big Data poses its own challenges because of its size. Marck made the great point that Big Data is almost never used directly, but is aggregated and summarized before being analyzed, and this summary data is often not very big. However, we do need to use the available tools a bit differently to deal with large data sizes, based on the design choices R and Python developers have made. R has an earned reputation for not being able to handle datasets larger than memory, but users have developed useful packages like ff and bigmemory to handle this. In our experience, Python reads data much more efficiently (by orders of magnitude) than R, so reading data with Python and piping it to R has often been a solution. Both R and Python have well-established means of communicating with Hadoop, mainly by leveraging Hadoop Streaming. Both also have well-developed interfaces for connecting to SQL-based and NoSQL databases. There was a lively discussion of various issues around using Big Data within R and Python, specifically with regard to Hadoop.
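To make the "aggregate first, analyze later" idea concrete, here is a minimal sketch of a map and reduce step written in the style Hadoop Streaming expects, using only the Python standard library. It is not code from the meetup: the record layout (a `region,amount` pair per line), the function names, and the sample data are all assumptions made for illustration. In a real Hadoop Streaming job the mapper and reducer would be separate scripts that read from standard input and write tab-separated key/value lines to standard output; here they are plain functions, and `sorted()` stands in for Hadoop's shuffle-and-sort phase so the sketch can run locally.

```python
import io
from itertools import groupby


def mapper(lines):
    """Turn raw 'region,amount' records into (key, value) pairs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split(",")
        yield key, float(value)


def reducer(pairs):
    """Sum the values for each key.

    Assumes the pairs arrive sorted by key, which is what Hadoop
    guarantees for the input to a reducer.
    """
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)


def run_local(lines):
    """Run map, sort, and reduce in one process for local testing."""
    # sorted() is a local stand-in for Hadoop's shuffle-and-sort phase.
    return dict(reducer(sorted(mapper(lines))))


# Hypothetical sample data standing in for a much larger input file.
sample = io.StringIO("east,10\nwest,5\neast,2.5\nwest,1\n")
totals = run_local(sample)
print(totals)
```

The resulting per-key totals are the kind of small summary table that can then be handed to R or to the Python analysis stack for modeling and plotting.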
There is a basic stack of packages in both R and Python for data analysis, and many more packages for other analytic tasks. Both software platforms have huge ecosystems, so to help you start discovering the many tools available for different data science tasks, we have developed preliminary maps of each ecosystem (click for a larger view, outlines with links, and to download):
In fact, R can be used from within Python using the rpy2 package by Laurent Gautier, which has been nicely wrapped in the rmagic magic function in IPython. This allows R to be used from within an IPython notebook. (PS: If you're a Python user and are not using IPython and the IPython notebook, you really should look into them.) There are several ways of integrating R and Python into unified platforms, as I've described earlier.
Our meetup, and the maps above, are intended as a launching pad for your exploration of R and Python for your data analysis needs. We will have video from this meetup available soon (stay tuned). Resources for learning R are widely available on the web. We have described Python's capabilities for data science and data analysis in earlier blog posts, and Ben Bengfort has a series of posts on using Python for Natural Language Processing, one of its analytic strengths. We hope that you will contribute to this discussion in the comments, and we will compile the tools and strategies that you suggest in a future post.