
Weekly Round-Up: Data Analysis Tools, M2M, Machine Learning, and Naming Babies

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles on topics ranging from data analysis tools to naming babies. In this week's round-up:

  • Data Analysis Tools Target Non-experts
  • How M2M Data Will Dominate the Big Data Era
  • What Hackers Should Know About Machine Learning
  • Knowledge Engineering Applied to Baby Names

Data Analysis Tools Target Non-experts

Our first piece this week is an O'Reilly Strata article about some of the data analysis tools that are coming to market and are aimed at providing business users with the analytics they need to make decisions. The article highlights several tools from a variety of companies and sorts them into three categories according to what they help you do. The article also includes links to all the companies' websites so that, if you're anything like me, you can check out every single one of them.

How M2M Data Will Dominate the Big Data Era

The Internet of Things is getting a lot of attention these days, partly due to the amount of data that gets produced when one connected device communicates with another connected device. This is known as Machine-to-Machine data (M2M), and this Smart Data Collective article describes where a lot of this data may come from and how much data can potentially be generated.

What Hackers Should Know About Machine Learning

Our third piece is a Fast Company interview with Drew Conway, the author of the must-own book Machine Learning for Hackers. In the interview, Drew answers questions about why developers should learn machine learning, the biggest knowledge gaps they need to overcome, and the differences between a machine learning project and a development project. (Editor's note: the image to the left links to Amazon, where, if you buy the book, we get a small cut of the proceeds. Buy enough books through this link, and we retire to an island.)

Knowledge Engineering Applied to Baby Names

Our final piece this week is a blog post about a company called Nameling, which is in the midst of holding a contest to improve the algorithms behind its baby name recommendation engine. Coming up with a good name for your baby is very important to parents, as the consequences of choosing a bad one almost certainly include ridicule and tears. It should be interesting to see the results of the contest as well as what kinds of names the recommendation engine spits out.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.


Getting Started with Python for Data Scientists

With the R Users DC Meetup broadening its topic base to include other statistical programming tools, it seemed only reasonable to write a meta post highlighting some of the best Python tutorials and resources available for data science and statistics. What you don't know is often the hardest part of picking up a new skill, so hopefully these resources will help make learning Python a little easier. Prepare yourself for code indentation heaven. Python is such an incredible language because it can do practically anything, from high performance scientific computing to web development with frameworks such as Django or Flask. Python is heavily used at Google, so the language must be doing something right. And, similar to R, Python has a fantastic community around it and, luckily for you, this community can write. Don't just take my word for it; watch the following video to fully understand.




Python is available for free, and there are two popular versions, 2.7 and 3.x. Which should you choose? I would go with either whatever is currently installed on your system or 2.7. For a better discussion, check out this site.

Commercial distributions that bundle and test various useful packages are also available, such as the Enthought Python Distribution. This distribution provides a comprehensive, cross-platform environment for scientific computing with the Python programming language. A single-click installer allows immediate access to over 100 libraries and tools. Enthought's open source initiatives include SciPy, NumPy, and the Enthought Tool Suite.

Python Developer Tools

Getting started with a new programming language often requires getting started with a new tool to use the language, unless you are a hardcore VI, VIM, or EMACS person. Python is no exception and there are a great number of editors or full-blown IDEs to try out:

Sublime Text 2 - If you have never used it, you should try this editor. "Sublime Text is a sophisticated text editor for code, markup and prose. You'll love the slick user interface, extraordinary features and amazing performance."

IPython provides a rich architecture for interactive computing with:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Flexible, embeddable interpreters to load into your own projects.
  • Easy to use, high performance tools for parallel computing.

NINJA-IDE  (free) (from the recursive acronym: "Ninja-IDE Is Not Just Another IDE"), "is a cross-platform integrated development environment (IDE). NINJA-IDE runs on Linux/X11, Mac OS X and Windows desktop operating systems, and allows developers to create applications for several purposes using all the tools and utilities of NINJA-IDE, making the task of writing software easier and more enjoyable."

PyCharm by JetBrains (not free) - the folks at JetBrains make great tools and PyCharm is no exception.


Learning Python

Learn about Packages

Python is known for its “batteries included” philosophy and has a rich standard library. However, because it is a popular language, the number of third-party packages is much larger than the number of standard library packages. So it eventually becomes necessary to discover how packages are used, found, and created in Python.
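As a rough illustration of how packages are "found", the sketch below uses only the standard library to ask Python whether a module can be located and where it searches. The helper name is_installed and the made-up package name are hypothetical, chosen just for this example.

```python
import importlib.util
import sys


def is_installed(name):
    """Return True if a top-level module or package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# "json" ships with Python (a "battery"), so it can always be found.
print(is_installed("json"))

# A made-up name (hypothetical) should not be found anywhere on the search path.
print(is_installed("surely_not_a_real_package_xyz"))

# sys.path is the ordered list of directories Python searches when importing.
print(len(sys.path) > 0)
```

Third-party packages become findable the same way once they are installed into one of the directories on sys.path.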


Package Management and Installation

Once you know a bit about packages, you will start installing them. There is no better way to get this done than with either the EasyInstall or PIP package managers. It is recommended that you use PIP, as it is newer and seems to have broader support.

For Windows users, it is sometimes helpful to use the pre-built binaries maintained here:

You will notice that not all packages have been ported to 3.x. This is true of many popular libraries and it is why 2.6 or 2.7 is recommended.

Virtualenv - learn it early and use it

Package management can be a pain point when working across systems or when deploying larger applications in production environments. For this reason it is HIGHLY RECOMMENDED that you get comfortable with the wonderful virtualenv package. Here is a good intro to virtualenv for Ubuntu (for the Windows users... well, just go install Ubuntu). The basic idea is that each of your projects gets a self-contained Python environment which can be shipped to a new machine and carry its Gordian knot of dependencies with it.
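To make the idea of a per-project environment concrete, here is a minimal sketch. Note that it swaps in the standard library's venv module (available in Python 3) rather than the virtualenv package itself, since venv needs no extra install; the helper name make_env and the "env" folder name are assumptions for illustration.

```python
import os
import tempfile
import venv


def make_env(project_dir):
    """Create an isolated environment in <project_dir>/env and return its path."""
    env_dir = os.path.join(project_dir, "env")  # hypothetical folder name
    # with_pip=False keeps this sketch fast and offline; a real project
    # would normally pass with_pip=True so packages can be installed.
    venv.create(env_dir, with_pip=False)
    return env_dir


project = tempfile.mkdtemp()  # stand-in for a real project folder
env_path = make_env(project)

# Each environment records its settings in a pyvenv.cfg file at its root.
print(os.path.isfile(os.path.join(env_path, "pyvenv.cfg")))
```

Packages installed while such an environment is active land inside its own folder, leaving the system-wide Python untouched.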

Python Koans - the zen of python

This project is great for those who want to dive right in. It is based on a Ruby project which presents the language as a series of failing unit tests. You must edit the source until each unit test passes. It is wonderful, and it serves as an introduction to TDD (Test Driven Development) while you learn Python.
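To give a feel for the format, here is a koan-style exercise written with the standard unittest module. It is not taken from the Python Koans project; the names FILL_ME_IN, AboutLists, and run_koans are hypothetical, and the placeholder is shown already solved.

```python
import unittest

# Hypothetical placeholder: a fresh koan would ship with a wrong value here
# (for example None), and the learner edits it until the test passes.
FILL_ME_IN = 4


class AboutLists(unittest.TestCase):
    def test_len_counts_elements(self):
        # Fails until FILL_ME_IN matches what len() actually returns.
        self.assertEqual(FILL_ME_IN, len(["a", "b", "c", "d"]))


def run_koans():
    """Run the koan suite and return True when every test passes."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AboutLists)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


print(run_koans())
```

Changing FILL_ME_IN back to None and rerunning shows the kind of failure message a learner starts from.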


Python the Hard Way 

Yes, here is an entire book on python for free online or you can upgrade for even more content and videos. And yes, the book is pretty good.

Welcome to the 3rd Edition of Learn Python the hard way. You can visit the companion site to the book at where you can purchase digital downloads and paper versions of the book. The free HTML version of the book is available at


Python's Execution Model

If you want to dive deeper into the underlying execution model of Python, there is no better place to start than this fantastic post:

Those new to Python are often surprised by the behavior of their own code. They expect A but, seemingly for no reason, B happens instead. The root cause of many of these "surprises" is confusion about the Python execution model. It's the sort of thing that, if it's explained to you once, a number of Python concepts that seemed hazy before become crystal clear. It's also really difficult to just "figure out" on your own, as it requires a fundamental shift in thinking about core language concepts like variables, objects, and functions.

In this post, I'll help you understand what's happening behind the scenes when you do common things like creating a variable or calling a function. As a result, you'll write cleaner, more comprehensible code. You'll also become a better (and faster) code reader. All that's necessary is to forget everything you know about programming...
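A small example of the kind of "surprise" described above (my own sketch, not taken from the post): assignment binds a name to an object rather than copying it, so two names can refer to the same list. The function append_item is a hypothetical helper.

```python
def append_item(item, bucket=None):
    """Append to a fresh list unless the caller supplies one."""
    if bucket is None:  # a default of [] would be shared across calls
        bucket = []
    bucket.append(item)
    return bucket


a = [1, 2, 3]
b = a            # b is a second name for the same list, not a copy
b.append(4)
print(a)         # [1, 2, 3, 4]
print(a is b)    # True

c = list(a)      # an explicit copy creates a new, independent object
c.append(5)
print(a)         # [1, 2, 3, 4]
print(append_item("x"), append_item("y"))  # ['x'] ['y']
```

Thinking of variables as names attached to objects, rather than boxes holding values, explains each of these results.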

Python for Numerical and Scientific Computing

NumPy, SciPy, and matplotlib form the basis for scientific computing in Python.


NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
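As a brief sketch of two of the features listed above, the example below uses broadcasting to center the columns of a small array and then solves a linear system. The numbers are made up for illustration.

```python
import numpy as np

# A 2-D array: 3 rows (samples) by 2 columns (features); values are made up.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

col_means = data.mean(axis=0)  # one mean per column, shape (2,)

# Broadcasting: the (2,) vector is subtracted from every row of the (3, 2) array.
centered = data - col_means

# Linear algebra: solve the 2x2 system A @ x = b.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)

print(col_means)
print(centered)
print(x)
```

No explicit loops are needed; NumPy applies the operations across whole arrays at once.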



SciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. It is also the name of a very popular conference on scientific programming with Python. The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install, and are free of charge. NumPy and SciPy are easy to use, but powerful enough to be depended upon by some of the world's leading scientists and engineers. If you need to manipulate numbers on a computer and display or publish the results, give SciPy a try!
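The two routines mentioned above, numerical integration and optimization, can be tried in a few lines. This is a minimal sketch that assumes SciPy is installed; the function parabola is a made-up example.

```python
import math

from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) from 0 to pi is exactly 2.
area, abs_error = integrate.quad(math.sin, 0.0, math.pi)


# Optimization: minimise a simple parabola whose minimum sits at x = 3.
def parabola(x):
    return (x - 3.0) ** 2 + 1.0


result = optimize.minimize_scalar(parabola)

print(round(area, 6))
print(round(result.x, 4))
```

quad also returns an estimate of the absolute error, which is handy for judging how far to trust the answer.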



matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in Python scripts, the Python and IPython shells (à la MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.
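Here is a minimal sketch of using matplotlib from a script to write a figure to a file. It assumes matplotlib is installed and selects the non-interactive Agg backend so that no display is required; the output file name is hypothetical.

```python
import math
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Sample sin(x) on a simple grid from 0 to about 2*pi.
xs = [i * 0.1 for i in range(63)]
ys = [math.sin(x) for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Hypothetical output location: a PNG in a temporary folder.
out_path = os.path.join(tempfile.mkdtemp(), "sine.png")
fig.savefig(out_path, dpi=100)
plt.close(fig)

print(os.path.getsize(out_path) > 0)
```

Changing the file extension (for example to .pdf or .svg) switches the output format.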


Python for Data


Pandas is really the Python approximation to R, although most would argue that it isn't yet as full featured as R. Or, in the words of the website, "pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language."

Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R.

Combined with the excellent IPython toolkit and other libraries, the environment for doing data analysis in Python excels in performance, productivity, and the ability to collaborate.
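For a taste of the munging-to-analysis workflow described above, here is a small sketch assuming pandas is installed. The table of cities and sales figures is entirely made up.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "city": ["DC", "DC", "NYC", "NYC"],
    "year": [2011, 2012, 2011, 2012],
    "sales": [10.0, 14.0, 20.0, 26.0],
})

# Split-apply-combine: average sales per city.
mean_sales = df.groupby("city")["sales"].mean()

# Boolean filtering: keep only the 2012 rows, then total them.
recent = df[df["year"] == 2012]
total_2012 = recent["sales"].sum()

print(mean_sales)
print(total_2012)
```

Readers coming from R may recognise the DataFrame as a close cousin of the data.frame.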



Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator. Researchers across fields may find that statsmodels fully meets their needs for statistical computing and data analysis in Python.