Statistical Programming DC

Natural Language Processing in Python and R

This is a guest post by Charlie Greenbacker and Tommy Jones.

Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.

Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.

The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, Scikit-learn, Pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction, to clustering data and analyzing multigraphs.

Back in the dark ages (about 10+ years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common tasks involved with NLP. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python written by Steven Bird, Ewan Klein, and Edward Loper offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of this NLTK book are freely available online under a Creative Commons license.

The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.
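To give a feel for how frequency analysis can feed a very basic summarizer, here is a minimal sketch using only the Python standard library. It is not the code from the talk: the regex-based sentence splitter, the tiny stop-word list, and the scoring rule (average word frequency per sentence) are all assumptions made for illustration. With NLTK installed, `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.FreqDist` would be the natural replacements for the hand-rolled pieces.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a much fuller one).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "was"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average corpus frequency of the sentence's words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]


text = ("Python is popular for text analysis. "
        "NLTK gives Python developers tools for text analysis. "
        "The weather was pleasant yesterday.")
summary = summarize(text, n_sentences=1)
print(summary)
```

Sentences built from words that recur across the document float to the top, which is why the off-topic weather sentence is the first to be dropped.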

Additional NLP resources in Python

- Natural Language Toolkit for Python (NLTK)
- Natural Language Processing with Python (book; free online version available)
- Python Text Processing with NLTK 2.0 Cookbook (book)
- Python wrapper for the Stanford CoreNLP Java library
- guess_language (Python library for language identification)
- MITIE (new C/C++-based NER library from MIT with a Python API)
- gensim (topic modeling library for Python)

R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.

However, R does have some sharp edges that become obvious when working with any non-trivially-sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, you have a 4 GB limit to the RAM R can access. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later) and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory`, that are moving to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.

What advantages, then, does R have? Every person and every problem is unique, but I can offer a few suggestions:

1. You are doing statistics/ML research and not developing software.
2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.

Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes being a data scientist is taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad. But I've found that my code often runs faster than some of the pre-packaged code. Your individual mileage may vary.

The second presentation at this meetup will cover the basics of reading documents into R and creating a document term matrix, then demonstrating some basic document summarization, keyword extraction, and document clustering techniques.
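For readers who have not met a document term matrix before, the sketch below builds one in plain Python. This is not the presenters' R code; the function names and the three toy documents are made up for illustration. Each row stores only the non-zero counts, which is one simple way to get the memory-efficient representation that text data tends to need, and the same rows support a crude form of keyword extraction.

```python
import re
from collections import Counter


def tokenize(doc):
    """Lowercase alphabetic tokens; deliberately simplistic."""
    return re.findall(r"[a-z]+", doc.lower())


def document_term_matrix(docs):
    """Return (vocabulary, rows).

    rows[i] is a dict mapping column index -> count for document i,
    so zero entries are never stored (a sparse representation).
    """
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    index = {term: j for j, term in enumerate(vocab)}
    rows = [{index[t]: c for t, c in cnt.items()} for cnt in counts]
    return vocab, rows


def top_terms(vocab, row, k=3):
    """Crude keyword extraction: the k most frequent terms in one document."""
    ranked = sorted(row.items(), key=lambda item: (-item[1], vocab[item[0]]))
    return [vocab[j] for j, _ in ranked[:k]]


# Toy corpus (hypothetical documents).
docs = ["R is great for statistics",
        "Python is great for text",
        "text text text mining in R"]
vocab, rows = document_term_matrix(docs)
print(vocab)
print(top_terms(vocab, rows[2], k=1))
```

Raw counts favor common words, so a real analysis would usually reweight the matrix (for example with TF-IDF) before extracting keywords or clustering.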

Seats are filling up quickly, so RSVP here now:

Diving into Statsmodels with an Intro to Python & Pydata

Abhijit and Marck, the organizers of Statistical Programming DC, kindly invited me to give the talk for the April meetup on statsmodels. Statsmodels is a Python module for conducting data exploration and statistical analysis, modeling, and inference. You can find many common usage examples and a full list of features in the online documentation. For those who were unable to make it, the entire talk is available as an IPython Notebook on GitHub. If you aren't familiar with the notebook, it is an incredibly useful and exciting tool. The Notebook is a web-based interactive document that allows you to combine text, mathematics, graphics, and code (languages other than Python, such as R, Julia, Matlab, and even C/C++ and Fortran, are supported).

The talk introduced users to what is available in statsmodels. Then we looked at a typical statsmodels workflow, highlighting high-level features such as our integration with pandas and the use of formulas via patsy. We covered a few areas in a little more detail building off some of our example datasets. And finally we discussed some of the features we have in the pipeline for our upcoming release.

For the examples, we used Duncan's occupational prestige data to have a look at linear regression and the identification of observations with high leverage and potential outliers. We used Fair's data on extramarital affairs to explore discrete choice modeling. Finally, I provided a quick overview of performing time-series modeling using our CO2 emissions dataset.
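As a reminder of what "high leverage" means, here is a small worked sketch in plain Python. It does not use statsmodels or Duncan's data: the income values are made up, and the formula is the textbook hat-matrix diagonal for a simple regression with one predictor, h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2. The 2p/n cutoff is a common rule of thumb, not a statsmodels default. In statsmodels itself, a fitted OLS result offers `get_influence()`, whose `hat_matrix_diag` attribute should give the same quantity for a general design matrix.

```python
def leverage(x):
    """Hat-matrix diagonal for a simple linear regression y ~ 1 + x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Made-up predictor values; the last observation sits far from the rest.
income = [10.0, 12.0, 11.0, 13.0, 12.0, 40.0]
h = leverage(income)

# Rule of thumb: flag observations whose leverage exceeds 2p/n,
# where p = 2 parameters (intercept and slope).
threshold = 2 * 2 / len(income)
flagged = [i for i, hi in enumerate(h) if hi > threshold]
print(flagged)
```

Leverage depends only on the predictor values, not the response, so a flagged point is a candidate for closer inspection rather than automatically an outlier.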

It was great to see such a large and diverse audience interested in using Python for statistical analysis. After the talk there was a lively discussion and many great questions. I'd like to provide a more detailed answer to one of these questions in particular: how do I learn Python, or how do I become a better Python programmer?

Many were interested in resources for how to learn Python. There are a number of great online resources. There are a few with which I'm familiar that also focus on teaching general programming. These include Dive Into Python, Learn Python the Hard Way, and Alan Gauld's Learning to Program. For those who are experienced scientific programmers but not necessarily with Python, you may find these helpful: the Python Scientific Lecture Notes and, though it's a bit outdated in places, the Mathesaurus guide to NumPy for R users and Matlab users. Stack Overflow, using the Python tag, is, of course, a great way to learn and find help. And finally, long before there was Stack Overflow, there were mailing lists. The python-tutor mailing list is a good resource, if even just to lurk and learn from other people new to Python asking for and receiving help from experts. The NumPy and SciPy mailing lists can also be good resources, though much of the traffic has migrated to Stack Overflow. And, of course, as Marck pointed out, another great way to learn is to come to meetups. There are several good meetup groups in the area devoted to Python and to learning Python specifically. As one final suggestion, you might consider attending the tutorials and talks for SciPy 2014 this summer in Austin. This is a great conference (and it will sell out).

Finally, I mentioned offhand the "Zen of Python" easter egg, and I was asked if there were any others. Here are a few.

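    $ python
    >>> import this                     # prints "The Zen of Python"
    >>> from __future__ import braces   # SyntaxError: not a chance
    >>> import __hello__                # prints a hello-world greeting
    >>> import antigravity              # opens an xkcd comic in your browser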

I had a great time giving this talk and hope to see you at future SPDC meetups.

Author Information

Skipper Seabold is a PhD candidate in Economics at American University, with an emphasis on information theory and econometrics. He has been a Python developer for 5 years and has been a primary contributor to statsmodels. He also contributes to scipy and pandas and other parts of the Pydata stack.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for some additional resources. This post by Alan is in response to that question.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (PDF) available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically-minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of publicly available articles that relate to minisum/minimax optimization and all of their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good enough solution or trying to find the balance among your own limited resources (e.g. time, data, technique, etc.). For the problem at hand, clearly the goal is to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse, if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality is unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be more accurately estimated when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again, using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that similar to large quantities, introducing greater degrees of variability can also serve to improve performance. Thus, optimizing facility location across DC or New York may be a little too homogeneous. If instead, however, your analysis uses data across a larger, more heterogeneous area, like say an entire state where you have urban, suburban and rural areas, you will again get better performance on optimality.

Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing based on travel time. There are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, to go from point A to point B actually may take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension to a single circuity factor would be to introduce a gradient circuity factor that is proportional to population density. There are some really good ZIP code data available that can be used to estimate population.

Increasing the circuity factor in higher-density locales and decreasing in the lower-density can help by providing a more realistic assessment of how far off of the straight line distance the average person would have to commute. For the really enthusiastic modeler that has some good data skills and is looking for even more integrity in their travel time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, Expressway & principal arterial, Urban principal arterial, collector, etc.) and even provide shape files and things for GIS. I've done basic speed limit estimation by functional class, but you could also do something like introduce a speed gradient respective to population density (as we mentioned above for the circuity factor). You could also derive some type of an inverse speed distribution that slows traffic at certain times of the day based on rush hour travel in or near urban centers.
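The circuity and speed ideas above can be sketched in a few lines of Python. Everything numeric here is an assumption for illustration except the 1.2 baseline mentioned earlier: the 1.5 urban factor, the density at which an area counts as fully "dense", the linear interpolation between the two, and the speeds per functional class are placeholders, not values from any dataset. Coordinates are treated as planar and measured in miles.

```python
import math

# Hypothetical speeds (mph) by road functional class; placeholders only.
ASSUMED_SPEED_MPH = {
    "interstate": 60.0,
    "principal_arterial": 40.0,
    "collector": 30.0,
}


def euclidean(a, b):
    """Straight-line distance between two planar points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def circuity_factor(density, low=1.2, high=1.5, dense_at=10000.0):
    """Interpolate linearly from `low` (sparse) to `high` (dense), capped at `high`.

    `density` is people per square mile; `dense_at` is an assumed density
    at and above which the area is treated as fully urban.
    """
    share = min(max(density / dense_at, 0.0), 1.0)
    return low + (high - low) * share


def travel_time_hours(a, b, density, speed_mph):
    """Estimated travel time: straight-line miles, inflated by circuity, over speed."""
    road_miles = euclidean(a, b) * circuity_factor(density)
    return road_miles / speed_mph


rural = travel_time_hours((0.0, 0.0), (3.0, 4.0), density=0.0,
                          speed_mph=ASSUMED_SPEED_MPH["collector"])
urban = travel_time_hours((0.0, 0.0), (3.0, 4.0), density=10000.0,
                          speed_mph=ASSUMED_SPEED_MPH["collector"])
print(rural, urban)
```

The same 5-mile straight-line trip takes longer in the dense setting purely because of the larger circuity factor; a time-of-day slowdown could be layered on by scaling `speed_mph` during rush hour.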

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.

Public Service Announcement: Think Twice About Upgrading to OS X Mavericks If You Use RStudio

Upgrading to any new operating system can be problematic. However, if you depend on R and RStudio for your living, I would highly recommend NOT upgrading to OS X Mavericks, despite its nonexistent price tag. Personally, I have seen painfully sluggish UI behavior with RStudio 0.97.551 since I updated to the latest version of Mavericks (I have also seen such behavior with Google Chrome, previously the fastest browser by far on my MacBook Air). Confirming my suspicions was the following helpful blog, which I recommend reading:

How To Put Your Meetup On the Map (Literally)

This is a guest post by Alan Briggs. Alan is a Data Scientist with Elder Research, Inc. assisting with the development of predictive analytic capabilities for national security clients in the Washington, DC Metro area. He is President-Elect of the Maryland chapter of the Institute for Operations Research and Management Science (INFORMS). Follow him at @AlanWBriggs. This post previews two events that Alan is presenting with Data Community DC President Harlan Harris.

Have you ever tried to schedule dinner or a movie with a group of friends from across town? Then, when you throw out an idea about where to go, someone responds “no, not there, that’s too far?” Then, there’s the old adage that the three most important things about real estate are location, location and location. From placing a hotdog stand on the beach to deciding where to build the next Ikea, location really crops up all over the place. Not surprisingly, and I think most social scientists would agree, people tend to act in their own self-interest. That is, everyone wants to travel the least amount of distance, spend the least amount of money or expend the least amount of time possible in order to do what they need [or want] to do. For one self-interested person, the solution to location problems would always be clear, but we live in a world of co-existence and shared resources. There’s not just one person going to the movies or shopping at the store; there are several, hundreds, thousands, maybe several hundred thousand. If self-interest is predictable in this small planning exercise of getting together with friends, can we use math and science to leverage it to our advantage? It turns out that the mathematical and statistical techniques that are scalable to the world’s largest and most vexing problems can also be used to address some more everyday issues, such as where to schedule a Meetup event.

With a little abstraction, this scenario looks a lot like a classical problem in operations research called the facility location or network design problem. With roots tracing back to the 17th-century Fermat-Weber problem, facility location analysis seeks to minimize the costs associated with locating a facility. In our case, we can define the cost of a Meetup venue as the sum of the distances traveled to the Meetup by its attendees. Other costs could be included, but to start simple, you can’t beat a straight-line distance.
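To make the cost function concrete, the sketch below computes the sum of straight-line distances for a candidate venue and then searches for the point that minimizes it. The search uses Weiszfeld's algorithm, a standard iterative method for the Fermat-Weber point; it is offered as one reasonable choice, not as the method used in the analysis described here. The attendee coordinates are invented, and treating locations as planar points is an assumption.

```python
import math


def total_distance(site, attendees):
    """Cost of a venue: sum of straight-line distances from every attendee."""
    return sum(math.hypot(site[0] - x, site[1] - y) for x, y in attendees)


def centroid(points):
    """Arithmetic mean of the points (minimizes squared distance, not distance)."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def weiszfeld(attendees, iterations=200, eps=1e-9):
    """Approximate the point minimizing total_distance via Weiszfeld iterations."""
    x, y = centroid(attendees)
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for px, py in attendees:
            d = math.hypot(x - px, y - py)
            if d < eps:
                # Simplification: if the estimate lands on an attendee,
                # stop rather than divide by zero.
                return (x, y)
            weight = 1.0 / d
            num_x += weight * px
            num_y += weight * py
            denom += weight
        x, y = num_x / denom, num_y / denom
    return (x, y)


# Four attendees clustered together and one far away (made-up coordinates).
attendees = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
best = weiszfeld(attendees)
print(best, total_distance(best, attendees))
print(centroid(attendees), total_distance(centroid(attendees), attendees))
```

The distant attendee drags the centroid well outside the cluster, while the distance-minimizing point stays inside it, which is the behavior you want when the cost is total travel rather than squared travel.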

So, here’s a little history. The data-focused Meetup scene in the DC region is several years old, with Hadoop DC, Big Data DC and the R Users DC (now Statistical Programming DC) Meetups having been founded first. Over the years, as these groups have grown and been joined by many others, their locations have bounced around among several different locations, mostly in downtown Washington DC, Northern Virginia, and suburban Maryland. Location decisions were primarily driven by supply – what organization would be willing to allow a big crowd to fill its meeting space on a weekday evening? Data Science DC, for instance, has almost always held its events downtown. But as the events have grown, and as organizers have heard more complaints about location, it became clear that venue selection needed to include a demand component as well, and that some events could or should be held outside of the downtown core (or vice-versa, for groups that have held their events in the suburbs).

Data Community DC performed a marketing survey at the beginning of 2013, and got back a large enough sample of Meetup attendees to do some real analysis. (See the public survey announcement.) As professional Meetups tend to be on weekday evenings, it is clear that attendees are not likely traveling just from work or just from home, but are most likely traveling to the Meetup as a detour from the usual commute line connecting their work and home. Fortunately, the survey designers asked Meetup attendees to provide both their home and their work zip codes, so the data could be used to (roughly) draw lines on a map describing typical commute lines.

Commutes for data Meetup attendees, based on ZIP codes.

The Revolutions blog recently presented a similar problem in their post How to Choose a New Business Location with R. The author, Rodolfo Vanzini, writes that his wife’s business had simply run out of space in its current location. An economist by training, Vanzini initially assumed the locations of customers at his wife’s business must be uniformly distributed. After further consideration, he concluded that “individuals make biased decisions basing their assumptions and conclusions on a limited and approximate set of rules often leading to sub-optimal outcomes.” Vanzini then turned to R, plotted his customers’ addresses on a map and picked a better location based on visual inspection of the location distribution.

If you’ve been paying attention, you’ll notice a common thread with the previously mentioned location optimizations. When you’re getting together with friends, you’re only going to one movie theater; Vanzini’s wife was only locating one school. Moreover, both location problems rely on a single location — their home — for each interested party. That’s really convenient for a beginner-level location problem; accordingly, it’s a relatively simple problem to solve. The Meetup location problem on the other hand adds two complexities that make this problem worthy of the time you’re spending here. Principally, if it’s not readily apparent, a group of 150 boisterous data scientists can easily overstay their welcome by having monthly meetings at the same place over and over again. Additionally, having a single location also ensures that the part of the population that drives the farthest will have to do so for each and every event. For this reason, we propose to identify the three locations which can be chosen that minimize the sum of the minimum distances traveled for the entire group. The idea is that the Meetup events can rotate between each of the three optimal locations. This provides diversity in location which appeases meeting space hosts. But, it also provides a closer meeting location for a rotating third of the event attendee population. Every event won't be ideal for everyone, but it'll be convenient for everyone at least sometimes.

As an additional complexity, we have two ZIP codes for each person in attendance — work and home — which means that instead of doing a point-to-point distance computation, we instead want to minimize the distance to the three meeting locations from the closest point along the commute line. Optimizing location with these two concepts in mind — particularly the n-location component — is substantially more complicated than the single location optimization with just one set of coordinates for each attendee.
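A minimal sketch of this objective, under stated assumptions: each attendee's commute is a straight segment between two planar points, the detour to a venue is the shortest distance from the venue to that segment, and each attendee is scored against whichever of the chosen venues is closest. The brute-force search over a short list of candidate sites is a stand-in chosen for clarity, not the optimizer used in the actual analysis, and all coordinates are made up.

```python
import math
from itertools import combinations


def point_to_segment(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # Home and work coincide: the "commute line" is a single point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, then clamp to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def total_detour(sites, commutes):
    """Sum over attendees of the detour to their nearest site."""
    return sum(min(point_to_segment(s, home, work) for s in sites)
               for home, work in commutes)


def best_sites(candidates, commutes, k=3):
    """Exhaustively pick the k candidate sites with the smallest total detour."""
    return min(combinations(candidates, k),
               key=lambda sites: total_detour(sites, commutes))


# Hypothetical (home, work) pairs and candidate venues.
commutes = [((0.0, 0.0), (0.0, 2.0)),
            ((10.0, 0.0), (10.0, 2.0)),
            ((5.0, 8.0), (5.0, 10.0))]
candidates = [(0.0, 1.0), (10.0, 1.0), (5.0, 9.0), (5.0, 5.0), (2.0, 2.0)]
chosen = best_sites(candidates, commutes, k=3)
print(chosen, total_detour(chosen, commutes))
```

Exhaustive search is only workable for a handful of candidates, since the number of combinations grows quickly; a continuous optimizer or a heuristic would be needed for a realistic map.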

So, there you have it. To jump ahead to the punchline, the three optimal locations for a data Meetup to host its meetings are Rockville, MD, downtown Washington DC and Southern Arlington, VA.

Gold points are optimal locations. Color gradient shows single-location location utility.

To hear us (Harlan and Alan) present this material, there are two great events coming up. The INFORMS Maryland Chapter is hosting their inaugural Learn. Network. Do. event on October 23 at INFORMS headquarters on the UMBC campus. Statistical Programming DC will also be holding its monthly meeting at iStrategy Labs on October 24. Both events will pull back the curtain on the code that was used and either event should send you off with sufficient knowledge to start to tackle your own location optimization problem.

Fantastic presentations from R using slidify and rCharts

Ramnath Vaidyanathan presenting in DC

Dr. Ramnath Vaidyanathan of McGill University gave an excellent presentation at a joint Data Visualization DC/Statistical Programming DC event on Monday, August 19 at nclud, on two R projects he leads -- slidify and rCharts. After the evening, all I can say is, Wow!! It's truly impressive to see what can be achieved in presentation and information-rich graphics directly from R. Again, wow!! (I think many of the attendees shared this sentiment.)


Slidify is an R package that

helps create, customize and share elegant, dynamic and interactive HTML5 documents through R Markdown.

We have blogged about slidify, but it was great to get an overview of slidify directly from the creator. Dr. Vaidyanathan explained that the underlying principle in developing slidify is the separation of the content and the appearance and behavior of the final product. He achieves this using HTML5 frameworks, layouts and widgets which are customizable (though he provides several here and through his slidifyExamples R package).

Example RMarkdown file for slidify

You start with a modified R Markdown file as seen here. This file can have chunks of R code in it. It is then processed to a pure Markdown file, interlacing the output of R code into the file. This is then split-apply-combined to produce the final HTML5 document. This document can be shared using GitHub, Dropbox or RPubs directly from R. Dr. Vaidyanathan gave examples of how slidify can even be used to create interactive quizzes or even interactive documents utilizing slidify and Shiny.

One really neat feature he demonstrated is the ability to embed an interactive R console within a slidify presentation. He explained that this used a Shiny server backend locally, or an OpenCPU backend if published online. This feature changes how presentations can be delivered, by not forcing the presenter to bounce around between windows but actually demonstrate within the presentations.


rCharts is

an R package to create, customize and share interactive visualizations, using a lattice-like formula interface

Again, we have blogged about rCharts, but there have been several advances in the short time since then, both in rCharts and interactive documents that Dr. Vaidyanathan has developed.

rCharts creates a formula-driven interface to several Javascript graphics frameworks, including NVD3, Highcharts, Polycharts and Vega. This formula interface is familiar to R users, and makes the process of creating these charts quite straightforward. Some customization is possible, as well as putting in basic controls without having to use Shiny. We saw several examples of excellent interactive charts using simple R commands. There is even a gallery where users can contribute their rCharts creations. There is really no excuse any more for avoiding these technologies for visualization, and it makes life so much more interesting!!

Bikeshare maps, or how to create stellar interactive visualizations using R and Javascript

Dr. Vaidyanathan demonstrated one project which, I feel, shows the power of the technologies he is developing using R and Javascript. He created a web application using R, Shiny, his rCharts package (which accesses the Leaflet Javascript library), and a very little bit of Javascript magic to visualize the availability of bicycles at different stations in a bike sharing network. This application can automatically download real-time data and visualize availability in over 100 bike sharing systems worldwide. He focused on the London bike share map, which was fascinating in that it showed how bikes had moved from the city to the outer fringes at night. Clicking on any dot showed how many bikes were available at that station.

London Bike Share map

Dr. Vaidyanathan quickly demonstrated a basic process of how to map points on a city map, how to change their appearance and how to add additional metadata to each point that will appear as a pop-up when clicked.

You can see the full project and how Dr. Vaidyanathan developed this application here.

Interactive learning environments

Finally, Dr. Vaidyanathan showed a new application he is developing using slidify, rCharts, and other open-source technologies like OpenCPU and PopcornJS. This application allows him to author a lesson in R Markdown, integrate interactive components including interactive R consoles, record the lesson as a screencast, sync the screencast with the slides, and publish it. This seems to me to be one possible future for presenting massive online courses. An example presentation is available here, and the project is hosted here.

Open presentation

The presentation and all the relevant code and demos are hosted on GitHub, and the presentation can be seen (developed using slidify, naturally) here.

Stay tuned for an interview I did with Dr. Vaidyanathan earlier, which will be published here shortly.

Have fun using these fantastic tools in the R ecosystem to make really cool, informative presentations of your data projects. See you next time!!!

Data-driven presentations using Slidify

Presentations are the stock-in-trade for consultants, managers, teachers, public speakers, and, probably, you. We all have to present our work at some level, to someone we report to or to our peers, or to introduce newcomers to our work. Of course, presentations are passe, so why blog about it? There’s already PowerPoint, and maybe Keynote. What more need we talk about?

Well, technology has changed, and vibrant dynamic presentations are here today for everyone to see. No, I mean literally everybody, if I like. All anyone will need is a web browser to see it. Graphs can be interactive, flow can be nonlinear, and presentations can be fun and memorable again!

But PowerPoint is so easy! You click, paste, type, add a bit of glitz, and you’re done, right? Well, as most of us can attest to, not really. It takes a bit more effort and putzing around to really get things in reasonable shape, let alone great shape.

And there are powerful alternatives. Which are simple and easy. And do a pretty great job on their own. Oh, and, by the way, if you have data and analysis results to present, super slick and a one-stop-shop from analysis to presentation. Really!! Actually there are a few out there, but I’m going to talk about just one. My favorite. Slidify.

Slidify is a fantastic R package that takes a document written in RMarkdown, which is Markdown (an easy text markup format) possibly interspersed with chunks of R code that produce tables, figures or interactive graphics, weaves in the results of that code, and then formats it into beautiful web presentations using HTML5. You can decide on the format template (it comes with quite a few) or brew your own. You can make your presentation look and behave the way you want, even like a Prezi (using ImpressJS). You can also make interactive questionnaires and even put in windows to code interactively within your presentation!!

A Slidify Demonstration

Slidify is obviously feature-rich, and infinitely customizable, but that’s not really what attracted me to it. It was the ability to write presentations in Markdown, which is super easy and lets me put down content quickly without worrying about appearance (between you and me, I’m writing this post in Markdown, on a Nexus 7). It lets me weave in results of my analyses easily, keeping the code in one place within my document. So when my data changes, I can create an updated presentation literally with the press of a button. Markdown is geared to create HTML documents. Pandoc lets you create HTML presentations from Markdown, but not living, data-driven presentations like Slidify. I get to put my presentations up on GitHub or on RPubs, or even in my Dropbox, directly using Slidify, share the link, and I’m good to go.

Dr. Ramnath Vaidyanathan created Slidify to help him teach more effectively at McGill University, where he is on the Desautels Faculty of Management. But, for me, it is now the go-to place for creating presentations, even if I don’t need to incorporate data. If you’re an analyst and live in the R ecosystem, I highly recommend Slidify. If you don’t and use other tools, Slidify is a great reason to come and see what R can do for you, even if it is just to create great presentations. There are plenty of great examples of what’s possible at

If you are in the DC metro area, come see Slidify in action. Dr. Vaidyanathan presents at a joint Statistical Programming DC / Data Visualization DC meetup on both Slidify and his other brainchildren, rCharts (which can create really cool and dynamic visualizations from R, see Sean's blog) and rNotebook on August 19. See the announcements at SPDC and DVDC, sign up, and we’ll see you there.

Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need

On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled "Stepping up to big data with R and Python," was an experiment in collective learning as Marck and I guided a lively discussion of strategies to leverage the "traditional" analytics stack in R and Python to work with big data.

R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Both Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn't even try to cover everything, and we didn't believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analytic tasks.

There are several basic tasks we covered in the discussions: data import, visualization, MapReduce, and parallel processing. We noted that, since R is becoming one of the lingua statistica, many commercial products by SAP, Oracle, Teradata, Netezza and the like have developed interfaces to allow R as an analytic backend. However, Python has been used to develop integrated analysis platforms due to its strengths as a "glue language" and its robust general capabilities and web development packages.

Most data scientists have had experience with small to medium data. Big Data poses its own challenges in terms of its size. Marck made the great point that Big Data is almost never directly used, but is aggregated and summarized before being analyzed, and this summary data is often not very big. However, we do need to use available tools a bit differently to deal with large data sizes, based on the design choices R and Python developers have made. R has earned a reputation for not being able to handle datasets larger than memory, but users have developed useful packages like ff and bigmemory to handle this. In our experience, Python reads data much more efficiently (orders of magnitude) than R, so reading data with Python and piping it to R has often been a solution. Both R and Python have well-established means of communicating with Hadoop, mainly leveraging Hadoop Streaming. Both also have well-developed interfaces to connect with both SQL-based and NoSQL databases. There was a lively discussion of various issues regarding using Big Data within R and Python, specifically in regard to Hadoop.

There is a basic stack of packages in both R and Python for data analysis, and many more packages for other analytic tasks. Both software platforms have huge ecosystems; so, to try and get you started on discovering many of the tools available for different data scientific tasks, we have developed preliminary maps of each ecosystem (click for a larger view, outlines with links, and to download):

R for big data

Python for big data

In fact, R can be used from within Python using the rpy2 package by Laurent Gautier, which has been nicely wrapped in the rmagic magic function in ipython. This allows R to be used from within an ipython notebook. (PS: If you're a Python user and are not using ipython and the ipython notebook, you really should look into it). There are several ways of integrating R and Python into unified platforms, as I've described earlier.

Our meetup, and the maps above, are intended as a launching pad for your exploration of R and Python for your data analysis needs. We will have video from this meetup available soon (stay tuned). Resources for learning R are widely available on the web. We have described Python's capabilities for data science and data analysis in earlier blog posts, and Ben Bengfort has a series of posts on using Python for Natural Language Processing, one of its analytic strengths. We hope that you will contribute to this discussion in the comments, and we will compile different tools and strategies that you suggest in a future post.


PyData and More Tools for Getting Started with Python for Data Scientists

It turns out that people are very interested in learning more about Python, and our last post, “Getting Started with Python for Data Scientists,” generated a ton of comments and recommendations. So, we wanted to give back those comments and a few more in a new post. As luck would have it, John Dennison, who helped co-author this post (along with Abhijit), attended both PyCon and PyData and wanted to sneak in some awesome developments he learned at the two conferences.

PyData was a smaller conference that directly followed PyCon in Santa Clara. It was a great time and it was wonderful to meet hackers from a range of disciplines who all shared a love of the Python scientific computing stack. Plus, no one got fired. Given this reception, we definitely want to give back these recommendations and some more of our own.

Also,  Abhijit Dasgupta lent his incredible wealth of knowledge to putting together this post.

More Books

There are even more excellent books available than originally mentioned. One reader reminded us about the "Think" series--Think Python, an intro to the language, Think Complexity, looking at data structures and algorithms, and Think Stats, an intro to statistics--all by Allen B Downey and all available for free at

Travis Oliphant's "Guide to Numpy" is also freely available at For data scientists, Wes "Pandas" McKinney's "Python for Data Analysis" is probably a very good read.  And, if you are interested in forking over some money, you might want to check out the following introductory Python/computer science text by John Zelle (Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image to the side and buy the book, we will make approximately $0.43 and retire to a small island).

More IDEs


Spyder, which was previously known as Pydee, is yet another IDE for Python but with interactive testing, debugging, auto completion, pylint checking, optional pep8 checking, and more. Looks pretty good and is available for numerous operating systems.

One reader, who might just work for the government, recommended the following two tools:


PyScripter is a free and open-source Python Integrated Development Environment (IDE) created with the ambition to become competitive in functionality with commercial Windows-based IDEs available for other languages. Being built in a compiled language, it is rather snappier than some of the other Python IDEs and provides an extensive blend of features that make it a productive Python development environment.

Portable Python

Ever needed a full programming language and environment on a USB Key?

Portable Python is a  Python® programming language preconfigured to run directly from any USB storage device, enabling you to have, at any time, a portable programming environment. Just download it, extract to your portable storage device or hard drive and in 10 minutes you are ready to create your next Python® application.


More Python Distributions

Who knew there were so many Python distributions out there? Apparently there are, and many are geared toward number crunching.


Majid mentioned this distribution which is geared toward helping people make the switch from Matlab or compiled languages. If you are used to a full blown IDE, this looks great. Installers are available for Windows and Linux.

Python(x,y) is a scientific-oriented Python Distribution based on Qt and Spyder - see the Plugins page. Its purpose is to help scientific programmers used to interpreted languages (such as MATLAB or IDL) or compiled languages (C/C++ or Fortran) to switch to Python. C/C++ or Fortran programmers should appreciate to reuse their code "as is" by wrapping it so it can be called directly from Python scripts.

Enthought Python Distribution (EPD)

EPD "provides scientists with a comprehensive set of tools to perform rigorous data analysis and visualization." It is cross-platform and easy to install.

Anaconda produces the cleverly named "Anaconda" distribution, which offers a "completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing." But, don't just take their word for it.  Abhijit and Eli both mentioned Anaconda.

I second Hugo's statement. If you go with the free version of Anaconda (AnacondaCE), it offers more functionality than the free version of EPD. Speed demons will be in for a treat with Anaconda as well. You will get Numba installed without all the hassles of source installation. Win-win in my opinion.

Anaconda is cross-platform, and supplies an installer program called "conda" (naturally) which parallels the functionality of pip or easy_install for a more limited set of data science related python libraries. Anaconda is the brain child of the same person who started Numpy, and was part of Enthought for many years before starting Continuum, so there are parallels with EPD as well as advances.


Python and Macs

An additional comment for Mac users: the pydata ecosystem outlined here can be "difficult" to install on Mac OS X because of dependencies. Chris Fonnesbeck has done a bang-up job putting them all together in a "Scipy Superpack", downloadable from GitHub. It makes life immeasurably easier, includes PyMC (for MCMC simulations), and includes cutting-edge versions of the packages.


Python and Data


Scikit-Learn is a wonderful package of machine learning algorithms. It has most of the big name approaches and plenty that you have never heard of. The maintainers have released a great flow chart to guide you through your machine learning conundrum.

One of the real selling points of scikit is that the maintainers have done a wonderful job of maintaining a consistent API across the different algorithms. In addition, with the help of IPython's parallel powers, one of the core committers, Olivier, gave a wonderful talk at PyData showing scikit's parallel chops. When the video of the talk is released, we'll be sure to post it.


Rpy2 is a great library that opens up the immense R ecosystem to your Python scientific computing applications. By running an R interpreter beside Python, you can access R objects, functions, and packages with native Python code. One word of warning: data abstraction runs from R to Python. That is to say, if you pass large amounts of data from Python to R, it will result in a full copy. However, if you access R objects from Python, Rpy2 does a clever job of maintaining the memory locations of R objects and exposing them to Python without duplicating the data.

I personally have found this library extremely helpful. As a heavy R user, I can use it to access R libraries that might not have a direct counterpart in Python. Fellow DC2 board member Abhijit says that rpy2 "works pretty well. It's developed in Linux and so works fine there. Windows was a bear to install, until I discovered someone with a solution. Get the best of both worlds here."


Python and Giving Back Your Data


A key part of data science is giving back the data and the results to her or his audience in a meaningful way. While static documents such as PDFs can be useful, who wouldn't rather send a URL to a full-blown web application, built to demonstrate and explore the data? Luckily, Python has you covered, and I am not just talking about Django ("the web framework for perfectionists with deadlines"), which is relatively large and complex. I am talking about Flask, a microframework for Python that allows you to build web applications in a single source file. Want an example? Check out to see Flask in action. Also, if you want a fantastic and in-depth tutorial, check out this lengthy series of posts by Miguel Grinberg.


Python and Insane Number Crunching Power


Numba is a project that John had not heard of until PyData, and he was incredibly impressed. The basic idea is that, by adding type information to your Python code, numba will transform your Python bytecode and feed it to the LLVM compiler to make mind-melting speed improvements. Fans of Julia know the impressive benchmarks that the LLVM stack can provide. At a minimum, the library requires a single function decorator:

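from numba import autojit  # decorator name in the 2013-era numba API
@autojit  # infers argument types and compiles on the first call
def foo(bar):
    return bar * 2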

With this "auto" version, the first time you call the function numba will inspect the type of 'bar', compile an LLVM version of the function, and link it to your Python interpreter. You do take a speed hit on the first call, when the function is compiled, but every subsequent call uses the linked compiled version. Obviously, this example is contrived, but the speed examples given were very impressive. A way to avoid this live-compile hit is to use the @jit decorator, which requires you to specify beforehand the types of arguments that you will pass in. While the underlying engineering looks incredibly complex, at least for this (John) humanities-major-turned-hacker, the simple API promises to open up speed improvements previously only available to hand-tuned Cython code. Another Continuum contribution.
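The compile-once-per-type behavior described above can be illustrated with a toy decorator in plain Python. To be clear, this is only an analogy and not how numba is implemented: nothing is compiled here, and the names `specialize_on_first_call` and `compile_log` are made up for this sketch. It simply records when a real JIT would have had to do its one-time work for a new argument type.

```python
import functools


def specialize_on_first_call(func):
    """Toy stand-in for lazy, per-type compilation (nothing is really compiled)."""
    specialized = {}  # maps a tuple of argument types to the callable used for them

    @functools.wraps(func)
    def wrapper(*args):
        key = tuple(type(arg) for arg in args)
        if key not in specialized:
            # A real JIT would generate machine code here; we only log the event.
            wrapper.compile_log.append(key)
            specialized[key] = func
        return specialized[key](*args)

    wrapper.compile_log = []
    return wrapper


@specialize_on_first_call
def foo(bar):
    return bar * 2


foo(3)    # first int call: the "compile" step is logged
foo(4)    # second int call: reuses the cached version
foo(1.5)  # first float call: a new "specialization" is logged
print(foo.compile_log)
```

The log ends up with one entry per distinct argument type rather than one per call, which is the trade-off the paragraph describes: a one-time cost up front, then fast repeated calls.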


The GPGPU (General-Purpose Graphics Processing Unit) programming movement has been fascinating to watch ever since 2001. Dollar for dollar, GPUs offer an incredible amount of number crunching capability for the right (read: easily parallelized) application. To put this in perspective, Nvidia's new Titan GPU solution offers 4.5 teraflops of compute for $1,000 (2688 stream processors at 875 MHz and 6 GB of 6 GHz GDDR5)!!! And now you can access that number crunching power from Python. PyCUDA gives you easy, Pythonic access to Nvidia's CUDA parallel computation API.
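As a rough sanity check on the teraflops figure, peak throughput is usually estimated as cores times operations per cycle times clock rate. The sketch below uses the numbers quoted above and assumes, as is conventional but not stated in this post, that each stream processor completes one fused multiply-add (counted as two floating-point operations) per cycle.

```python
# Figures quoted in the paragraph above.
stream_processors = 2688
clock_hz = 875e6

# Assumption (not from the post): one fused multiply-add per core per cycle,
# counted as 2 floating-point operations.
flops_per_cycle = 2

# Theoretical peak throughput at the quoted clock rate, in teraflops.
peak_tflops = stream_processors * flops_per_cycle * clock_hz / 1e12

# Clock rate that would give exactly 4.5 teraflops under the same assumption.
implied_clock_mhz = 4.5e12 / (stream_processors * flops_per_cycle) / 1e6

print(round(peak_tflops, 3), round(implied_clock_mhz, 1))
```

Under this assumption the quoted clock gives a slightly higher peak than 4.5 teraflops, which may simply mean the headline figure was computed from a lower base clock than the one listed.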


Blaze is the next generation of Numpy, being developed by Continuum. "Blaze aims to extend the structural properties of NumPy arrays to a wider variety of table and array-like structures that support commonly requested features such as missing values, type heterogeneity, and labeled arrays." Continuum was recently awarded a $3M DARPA grant to develop Blaze, so I'm sure we'll see many good things in short order.



We had already mentioned matplotlib in the previous post. There are two other libraries you might want to look at.



Mayavi is a sophisticated open-source 3-D visualization package for Python, produced by Enthought. It depends on some other Enthought products, which are part of the Anaconda CE distribution.


"Bokeh (pronounced boh-Kay) is an implementation of Grammar of Graphics for Python, that also supports the customized rendering flexibility of Protovis and d3. Although it is a Python library, its primary output backend is HTML5 Canvas". Bokeh is trying  to be ggplot for Python. It is in active development in the Continuum stable.


Getting Started with Python for Data Scientists

With the R Users DC Meetup broadening its topic base to include other statistical programming tools, it seemed only reasonable to write a meta post highlighting some of the best Python tutorials and resources available for data science and statistics. What you don't know is often the hardest part of picking up a new skill, so hopefully these resources will help make learning Python a little easier. Prepare yourself for code indentation heaven. Python is such an incredible language because it can do practically anything, from high performance scientific computing to web frameworks such as Django or Flask.  Python is heavily used at Google so the language must be doing something right. And, similar to R, Python has a fantastic community around it and, luckily for you, this community can write. Don't just take my word for it, watch the following video to fully understand.




Python is available for free from and there are two popular versions, 2.7 and 3.x. Which should you choose? I would go with either whatever is currently installed on your system or 2.7. For a better discussion, check out this site.

Commercial distributions are also available that have included and tested various useful packages such as the Enthought Python Distribution. This distribution provides a comprehensive, cross-platform environment for scientific computing with the Python programming language. A single-click installer allows immediate access to over 100 libraries and tools. Our open source initiatives include SciPy,NumPy, and the Enthought Tool Suite.

Python Developer Tools

Getting started with a new programming language often requires getting started with a new tool to use the language, unless you are a hardcore VI, VIM, or EMACS person. Python is no exception and there are a great number of editors or full-blown IDEs to try out:

Sublime Text2 - If you have never used it, you should try this editor. "Sublime Text is a sophisticated text editor for code, markup and prose. You'll love the slick user interface, extraordinary features and amazing performance."

IPython provides a rich architecture for interactive computing with:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

NINJA-IDE  (free) (from the recursive acronym: "Ninja-IDE Is Not Just Another IDE"), "is a cross-platform integrated development environment (IDE). NINJA-IDE runs on Linux/X11, Mac OS X and Windows desktop operating systems, and allows developers to create applications for several purposes using all the tools and utilities of NINJA-IDE, making the task of writing software easier and more enjoyable."

PyCharm by Jetbrains (not free) - the folks at Jetbrains make great tools and PyCharm is no exception.


Learning Python

Learn about Packages

Python is known for its “batteries included” philosophy and has a rich standard library. However, being a popular language, the number of third-party packages is much larger than the number of standard library packages. So it eventually becomes necessary to discover how packages are used, found, and created in Python.


Package Management and Installation

Once you know a bit about packages, you will start installing them. There is no better way to get this done than with either the EasyInstall or PIP package managers. It is recommended that you use PIP, as it is newer and seems to have larger support.

For Windows users, it is sometimes helpful to use the pre-built binaries maintained here:

You will notice that not all packages have been ported to 3.x. This is true of many popular libraries and it is why 2.6 or 2.7 is recommended.

Virtualenv - learn it early and use it

Package management can be a pain point when working across systems or when deploying larger applications in production environments. For this reason it is HIGHLY RECOMMENDED that you get comfortable with the wonderful virtualenv package. Here is a good intro to virtualenv for Ubuntu (for the Windows users... well, just go install Ubuntu). The basic idea is that each of your projects gets a self-contained Python environment which can be shipped to a new machine and carry its Gordian knot of dependencies with it.
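To make the "self-contained environment per project" idea concrete, here is a minimal sketch. Note the swap: it uses `venv`, the standard-library module that ships with Python 3.3 and later, rather than the third-party virtualenv package discussed above, purely so the example needs nothing installed. The project directory is a hypothetical throwaway location.

```python
import tempfile
import venv
from pathlib import Path

# Hypothetical project location; a temp directory keeps the sketch tidy.
project_dir = Path(tempfile.mkdtemp(prefix="demo_project_"))
env_dir = project_dir / "env"

# Create an isolated environment inside the project directory.
# with_pip=False skips installing pip, which keeps the example fast.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg marks the directory as a virtual environment and records
# which base interpreter it was created from.
config_file = env_dir / "pyvenv.cfg"
print(config_file.read_text())
```

Packages installed while the environment is active land under `env_dir` instead of the system-wide site-packages, which is what keeps one project's dependencies from colliding with another's.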

Python Koans - the zen of python

This project is great for those who want to dive right in. It is based on a Ruby project which presents the language as a series of failing unit tests. You must edit the source until each unit test passes. It is wonderful and is an introduction to TDD (Test-Driven Development) while you learn Python.
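A koan-style exercise looks roughly like the sketch below, written with the standard `unittest` module. This is not taken from the Python Koans project itself: the `BLANK` placeholder, the `make_koan` factory, and the `passes` helper are invented here to show the pattern of a test that fails until the learner supplies the right answer.

```python
import io
import unittest

BLANK = object()  # hypothetical placeholder; it never equals a real answer


def make_koan(answer):
    """Build a one-test exercise; the learner's job is to supply `answer`."""

    class AboutLists(unittest.TestCase):
        def test_doubling_a_list(self):
            self.assertEqual(answer, [1, 2] * 2)

    return AboutLists


def passes(case):
    """Run a TestCase class quietly and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    runner = unittest.TextTestRunner(stream=io.StringIO())
    return runner.run(suite).wasSuccessful()


unsolved_ok = passes(make_koan(BLANK))        # fails until the blank is filled in
solved_ok = passes(make_koan([1, 2, 1, 2]))   # passes once the answer is right
print(unsolved_ok, solved_ok)
```

Working out why `[1, 2] * 2` produces the value it does is the lesson; the failing test is just the prompt.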


Python the Hard Way 

Yes, here is an entire book on python for free online or you can upgrade for even more content and videos. And yes, the book is pretty good.

Welcome to the 3rd Edition of Learn Python the hard way. You can visit the companion site to the book at where you can purchase digital downloads and paper versions of the book. The free HTML version of the book is available at


Python's Execution Model

If you want to dive deeper into the underlying execution model of Python, there is no better place to start than this fantastic post:

Those new to Python are often surprised by the behavior of their own code. They expect A but, seemingly for no reason, B happens instead. The root cause of many of these "surprises" is confusion about the Python execution model. It's the sort of thing that, if it's explained to you once, a number of Python concepts that seemed hazy before become crystal clear. It's also really difficult to just "figure out" on your own, as it requires a fundamental shift in thinking about core language concepts like variables, objects, and functions.

In this post, I'll help you understand what's happening behind the scenes when you do common things like creating a variable or calling a function. As a result, you'll write cleaner, more comprehensible code. You'll also become a better (and faster) code reader. All that's necessary is to forget everything you know about programming...
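One of the "surprises" the quoted post alludes to is that assignment binds a name to an object rather than copying it. The short sketch below is our own illustration, not taken from that post, and the function names are made up for the example.

```python
# Assignment binds a name to an object; it does not copy the object.
a = [1, 2, 3]
b = a            # b is a second name for the same list
b.append(4)      # mutating through b is visible through a

# Rebinding a name points it at a different object and leaves the old one alone.
b = [0]


def append_zero(items):
    # The parameter is just another name for the caller's object,
    # so this mutation is visible to the caller.
    items.append(0)


def rebind(items):
    # Rebinding the local name has no effect on the caller's list.
    items = []
    return items


append_zero(a)
rebind(a)
print(a, b)
```

After these calls, `a` has been changed by `append` (both directly and inside `append_zero`) but not by any of the rebinding, which is the distinction between mutating an object and reassigning a name.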

Python for Numerical and Scientific Computing

NumPy, SciPy, and matplotlib form the basis for scientific computing in Python.


NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.



SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!



matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB®* or Mathematica®), web application servers, and six graphical user interface toolkits.


Python for Data


Pandas is really the Python approximation to R, although most would argue that it isn't yet as full featured as R. Or, in the words of the website, "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.



Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.