PyData NYC 2013 was a two-day conference this past weekend (Saturday and Sunday, 11/9 and 11/10) with a day of tutorials on Friday. Saturday and Sunday featured keynotes each morning and three tracks of talks and workshops. JP Morgan graciously provided space in the financial district of New York City (complete with a gorgeous 60th floor room for keynotes) and Continuum Analytics appeared to be one of the primary sponsors. Based on the nature and content of the talks and the speakers themselves, it was quite clear that this conference is by and for practitioners, those individuals writing code and munging data every day.
This post won't try to detail every talk or every slide; instead, it highlights some portions of the keynotes and discusses other trends that emerged.
General Trends and Thoughts
I saw two general trends at this conference:
1. Python Does Data
Python now robustly covers the full software stack for data science, full stop. If you have been using SPSS, SAS, Stata, Matlab, or even R, Python is now a very legitimate alternative for your data analysis needs that does not suffer the limitations of being a DSL (domain specific language originally designed for a small set of tasks). This set of Python tools includes but is not limited to:
- Pandas - data analysis library for Python including dataframes
- NumPy - a rich foundational library for scientific and numerical computing in Python
- SciPy - a rich foundational library for scientific computing, complementing NumPy
- StatsModels - rich statistical methods for Python
- scikit-learn - the machine learning tool set
- NLTK - the Natural Language Tool Kit for NLP
- PyMC3 - a module for Bayesian statistical modeling
- IPython - an interactive, easier-to-use Python interface
- IPython Notebook
- Spyder - an IDE focused on data and numerical computation
In terms of ease of use, Continuum Analytics has solved a major issue by packaging together the key Python components listed above into a single, easy to install distribution of Python named Anaconda.
2. Python Eats The World
While Python handles the statistical data niche that traditional packages such as SPSS, Stata, SAS, and even R have historically ruled, it handles everything else as well, scaling from research code all the way to large-scale production systems handling true big data.
Think of it this way: if you were a young but unusually wise graduate student tasked with some form of numerical computation or data analysis, would you want to use a domain specific language good for one thing or a programming language that opened numerous doors (this is all assuming this individual actually has a choice)?
Keynote Day 1 - Peter Wang
Peter Wang, the CEO of Continuum Analytics, gave what I would consider one of the best data talks that I have heard in the last few years. Leveraging his physics background, Peter discussed quite literally the full data stack, from data visualization down to the circuits flipping bits to perform computations.
Why this Community Now?
Peter discussed the answer to the fundamental question of "why this community now" (or, put less succinctly, why are Python and data, big or otherwise, exploding now). Succinctly, he sees it as the perfect storm of disruptions in:
- storage technologies - mostly cloud based although fast SSDs don't hurt
- processor cycle availability - again, almost unlimited CPU hours available in the cloud
- virtually unlimited data generation happening continuously - currently from the web and mobile applications but the Internet of Things will only add to the data generation
- traditional BI tools not up to par
- demonstrated clear value in large datasets
Data Comprehension as a Core Competency
To put it in managerial terms, if you went back in time to 1996 before the ubiquity of the Internet, predictions were made such as:
Businesses that build network-oriented capability into their core will fundamentally outcompete and destroy their competition
At the time, some people argued vehemently with these predictions.
Peter believes, and I wholeheartedly agree, that a similar data revolution is afoot. Data is becoming core; data comprehension is a must in a quantized world. Thus, while some may disagree, it is safe to say that
Businesses that don't understand data and that do not have data comprehension as a core competency will quickly be extinguished by those that do.
Next, Peter discussed how the velocity and volume of large data overwhelm old-school ETL in business data processing. In the past, the workflow perspective reflected how a factory works. Data was like a train traveling from one station to the next, getting transformed or stored at each stop in the journey. Now that data is so large, moving the train from station to station simply requires too much effort and time. Thus, the train is kept in one place and the platforms and stations are being moved to it.
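To make the train metaphor a bit more concrete, here is a minimal sketch of "moving the station to the train." It is not something shown in the talk. The partitions, `local_summary`, and `merge` are hypothetical names invented for illustration, and plain Python lists stand in for data that would really live on separate storage nodes. The point is only that a small function travels to each chunk of data and a tiny summary comes back, rather than the raw rows being copied to a central place.

```python
from functools import reduce

# Pretend each inner list lives on a different storage node (hypothetical data).
partitions = [
    [3, 8, 15, 4],
    [22, 7, 1],
    [9, 9, 12, 30, 2],
]


def local_summary(rows):
    """Run next to the data and return a compact (count, total) pair."""
    return (len(rows), sum(rows))


def merge(left, right):
    """Combine two partial summaries into one."""
    return (left[0] + right[0], left[1] + right[1])


# Only one small tuple per partition has to cross the "network".
partials = [local_summary(rows) for rows in partitions]
count, total = reduce(merge, partials)
mean = total / count

print(f"rows={count} total={total} mean={mean:.3f}")
```

The same shape (a local step followed by a cheap merge) is what lets this kind of computation stay close to very large data, although real systems add scheduling, fault tolerance, and serialization that this sketch ignores.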
Data Science as Scientific Computing 2.0
Peter also discussed the idea that data science is really scientific computing 2.0, as scientists have been using data, computers, and algorithms for longer than practically everyone else. He even half-jokingly offered up an equation for it from datagravity.org.
Whether you agree or disagree with his point, it is hard to argue that there are many, many lessons that data practitioners and data scientists not coming from science can learn from those that have come before. In particular, Peter highly recommended a paper led by Jim Gray at Microsoft entitled Scientific Data Management in the Coming Decade. This gem, written in 2004/2005, is full of absolutely fascinating insights.
While there were many additional points in this keynote that were equally fascinating, I will leave you with the one that Peter used to hint at things to come. Each layer of software abstraction built on top of the hardware (from the firmware to the assembler to the kernel to the operating system to the programming language and applications) is a lie. Those constructs exist to hide implementation details from the higher levels to enable more rapid advancements. However, these abstractions constrain what can be done and can cause significant performance penalties.
Add to this the facts that
- numerous projects have virtualized or attempted to virtualize various levels of these abstractions, blurring the boundaries between these layers, and
- some cutting-edge research is underway to compile code directly into hardware layouts, completely cutting out the middlemen
and you are left wondering what might just be coming in the future. As many great minds come to similar conclusions, look at some of the thoughts behind the Julia language here.
NumFOCUS is a new 501(c)(3) non-profit organization designed to support the scientific and data software stack in Python including NumPy, Matplotlib, SciPy, Pandas, and more. In their own words:
NumFOCUS supports and promotes world-class, innovative, open source scientific software. Most individual projects, even the wildly successful ones, find the overhead of a non-profit to be too large for their community to bear. NumFOCUS provides a critical service as an umbrella organization which removes the burden from the projects themselves to raise money.
NumFOCUS aims to ensure that money is available to keep projects in the scientific Python stack funded and available. So if you find value in these tools and have always wanted to give back, donating to NumFOCUS gives you a way of supporting either a specific project of your choice or all of these great codes at once!
There were simply too many things of interest to discuss in depth. I won't go much further than listing them out here but definitely recommend clicking through.
an in-database analytics engine, PL/Python - open source
a runtime accelerator for an array-oriented subset of Python leveraging the power of the GPU
PyMC is a python module for Bayesian statistical modeling and model fitting which focuses on advanced Markov chain Monte Carlo fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently.
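For readers who have not met Markov chain Monte Carlo, the fitting approach mentioned in the PyMC description above, the following is a toy random-walk Metropolis sampler written with only the standard library. It is not PyMC's API and is far simpler than what PyMC does. The data, the flat prior, the fixed `sigma`, and the function names `log_posterior` and `metropolis` are all assumptions made for this illustration.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0):
    """Log posterior (up to a constant) for the mean of Gaussian data.

    Assumes a known standard deviation `sigma` and a flat prior on `mu`.
    """
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)


def metropolis(data, n_steps=5000, step_size=0.5, seed=0):
    """Draw samples of `mu` with a random-walk Metropolis chain."""
    rng = random.Random(seed)
    mu = 0.0  # deliberately poor starting point
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_size)
        proposed = log_posterior(proposal, data)
        # Accept with probability min(1, exp(proposed - current)).
        # 1.0 - rng.random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < proposed - current:
            mu, current = proposal, proposed
        samples.append(mu)
    return samples


# Hypothetical observations scattered around 5.
data = [4.8, 5.1, 5.3, 4.9, 5.2, 5.0, 4.7, 5.4]
samples = metropolis(data)
kept = samples[1000:]  # discard early samples taken while the chain settles
posterior_mean = sum(kept) / len(kept)

print(f"estimated posterior mean of mu: {posterior_mean:.2f}")
```

With a flat prior the posterior for `mu` is centered on the sample mean of the data, so the estimate should land close to it. Libraries such as PyMC wrap this basic idea in model-building syntax, better samplers, and convergence diagnostics.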
That is all for now. As a reward for making it this far, here is a link to virtually every presentation given at the event! I was going to discuss the keynote given by Brian Granger, a professor at Cal Poly State University who is highly involved in the IPython project, but will save that for a separate post.