Diving into Statsmodels with an Intro to Python & Pydata

Abhijit and Marck, the organizers of Statistical Programming DC, kindly invited me to give the talk for the April meetup on statsmodels. Statsmodels is a Python module for conducting data exploration and statistical analysis, modeling, and inference. You can find many common usage examples and a full list of features in the online documentation. For those who were unable to make it, the entire talk is available as an IPython Notebook on GitHub. If you aren't familiar with the notebook, it is an incredibly useful and exciting tool. The Notebook is a web-based interactive document that allows you to combine text, mathematics, graphics, and code (languages other than Python, such as R, Julia, Matlab, and even C/C++ and Fortran, are supported).

The talk introduced users to what is available in statsmodels. Then we looked at a typical statsmodels workflow, highlighting high-level features such as our integration with pandas and the use of formulas via patsy. We covered a few areas in a little more detail building off some of our example datasets. And finally we discussed some of the features we have in the pipeline for our upcoming release.
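To give a flavor of that workflow, here is a minimal sketch of fitting a model from a pandas DataFrame with a patsy formula. It is not taken from the talk's notebook: the data are synthetic, and the column names `x`, `group`, and `y` are made up purely for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, invented for this sketch (not one of the talk's datasets).
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
})
# True relationship: intercept 1.0, slope 2.0 on x, shift of 0.5 for group "b".
df["y"] = (1.0 + 2.0 * df["x"] + 0.5 * (df["group"] == "b")
           + rng.normal(scale=0.1, size=n))

# The formula is parsed by patsy; C(group) marks the column as categorical,
# so it is expanded into a dummy variable automatically.
results = smf.ols("y ~ x + C(group)", data=df).fit()

# Parameters come back as a pandas Series labeled by term name.
print(results.params)
print(results.summary())
```

Because the model is built from a DataFrame and a formula, the fitted parameters carry readable labels such as `x` and `C(group)[T.b]` rather than bare column positions.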

For the examples, we used Duncan's occupational prestige data to have a look at linear regression and the identification of observations with high leverage and potential outliers. We used Fair's data on extramarital affairs to explore discrete choice modeling. Finally, I provided a quick overview of performing time-series modeling using our CO2 emissions dataset.

It was great to see such a large and diverse audience interested in using Python for statistical analysis. After the talk there was a lively discussion and many great questions. I'd like to provide a more detailed answer to one of them in particular -- how do I learn Python, or how do I become a better Python programmer?

Many were interested in resources for how to learn Python. There are a number of great online resources. A few with which I'm familiar also focus on teaching general programming. These include Dive Into Python, Learn Python the Hard Way, and Alan Gauld's Learning to Program. Those who are experienced scientific programmers, but not necessarily with Python, may find the Python Scientific Lecture Notes helpful, as well as the Mathesaurus guides to NumPy for R users and Matlab users, though these are a bit outdated in places. Stack Overflow, using the Python tag, is, of course, a great way to learn and find help. And finally, long before there was Stack Overflow, there were mailing lists. The python-tutor mailing list is a good resource, if even just to lurk and learn from other people new to Python asking for and receiving help from experts. The NumPy and SciPy mailing lists can also be good resources, though much of the traffic has migrated to Stack Overflow. And, of course, as Marck pointed out, another great way to learn is to come to meetups. There are several good meetup groups in the area devoted to Python, and to learning Python specifically. As one final suggestion, you might consider attending the tutorials and talks at SciPy 2014 this summer in Austin. This is a great conference (and it will sell out).

Finally, I mentioned offhand the "Zen of Python" easter egg, and I was asked if there were any others. Here are a few.

    $ python
    >>> import this
    >>> from __future__ import braces
    >>> import __hello__
    >>> import antigravity

I had a great time giving this talk and hope to see you at future SPDC meetups.

Author Information

Skipper Seabold is a PhD candidate in Economics at American University, with an emphasis on information theory and econometrics. He has been a Python developer for 5 years and has been a primary contributor to statsmodels. He also contributes to scipy and pandas and other parts of the Pydata stack.