Languages

Book Review: R for Everyone by Jared P. Lander

This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds MS/BS Computer Science from University of California, Irvine, MS Statistics from California State University, East Bay, and is a PhD Candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.

Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.

Learning R is challenging because documentation is scattered all over the place. There is no comprehensive book that covers many important use cases. To get the fundamentals, one has to look at multiple books as well as many online resources and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy-to-understand, concise and well-written. The title "R for everyone" is accurate because, while it is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.

Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and from other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7 including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether Chapters 1-10 cover lots of ground. But we're not even halfway through the book!

Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18. Topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical.

The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with the Yihui Xie’s knitr package, LaTeX and Markdown. Developing R packages is the subject of Chapter 24.

A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.

Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary.  A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is a data scientist, someone whose analytic skills, computing skills, and domain skills are able to detect signals from data and use them to advantage. Although the terms are squishy, the push today is for "big data" skills and "predictive analytics" skills which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do how Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas in which discrete-event simulation has been applied are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come-first served is the usual default). Discrete-event simulation can be coded into any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease implementation of a model. 

The field of predictive analytics leans heavily on the tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science in predictive analytics. The science side includes the standard tools of associated with mathematics computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13 by Barry Lawson (University of Richmond) and me or the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Natural Language Processing in Python and R

This is a guest post by Charlie Greenbacker and Tommy Jones.

Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.

Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.

The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, Scikit-learn, Pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction, to clustering data and analyzing multigraphs.

Back in the dark ages (about 10+ years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common tasks involved with NLP. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python written by Steven Bird, Ewan Klein, and Edward Loper offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of this NLTK book are freely available online under a Creative Commons license.

The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.

Additional NLP resources in Python

- Natural Language Toolkit for Python (NLTK): http://www.nltk.org/ - Natural Language Processing with Python (book): http://oreilly.com/catalog/9780596516499/ (free online version: http://www.nltk.org/book/) - Python Text Processing with NLTK 2.0 Cookbook (book): http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book - Python wrapper for the Stanford CoreNLP Java library: https://pypi.python.org/pypi/corenlp - guess_language (Python library for language identification): https://bitbucket.org/spirit/guess_language - MITIE (new C/C++-based NER library from MIT with a Python API): https://github.com/mit-nlp/MITIE - gensim (topic modeling library for Python): http://radimrehurek.com/gensim/

R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.

However, R does have some sharp edges that become obvious when working with any non-trivially-sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, you have a 4 GB limit to the RAM R can access. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later) and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory` that are moving to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.

What advantages, then, does R have? Every person and every problem is unique, but I can offer a few suggestions:

1. You are doing statistics/ML research and not developing software. 2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.

Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes being a data scientist is taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad. But I've found that my code often runs faster than some of the pre-packaged code. Your individual mileage may vary.

The second presentation at this meetup will cover the basics of reading documents into R and creating a document term matrix, then demonstrating some basic document summarization, keyword extraction, and document clustering techniques.

Seats are filling up quickly, so RSVP here now: http://www.meetup.com/stats-prog-dc/events/177772322/

Flask Mega Meta Tutorial for Data Scientists

Introduction

Data science isn't all statistical modeling, machine learning, and data frames. Eventually, your hard work pays off and you need to give back the data and the results of your analysis; those blinding insights that you and your team uncovered need to be operationalized in the final stage of the data science pipeline as a scalable data product. Fortunately for you, the web provides an equitable platform to do so and Python isn't all NumPy, Scipy, and Pandas. So, which Python web application framework should you jump into?

375px-Flask_logo.svg

There are numerous web application frameworks for Python with the 800lb gorilla being Django, "the web framework for perfectionists with deadlines." Django has been around since 2005 and, as the community likes to say, is definitely "batteries included," meaning that Django comes with a large number of bundled components (i.e., decisions that have already been made for you). Django comes built in with its chosen object-relational mapper (ORM) to make database access simple, a beautiful admin interface, a template system, a cache system for improving site performance, internationalization support, and much more. As a result, Django has a lot of moving parts and can feel large, especially for beginners. This behind the scene magic can obscure what actually happens and complicates matter when something goes wrong. From a learning perspective, this can make gaining a deeper understanding of web development more challenging.

Enter Flask, the micro web framework written in Python by Armin Ronacher. Purportedly, it came out as an April Fool's joke but proved popular enough not to go quietly into the night. The incredible thing about Flask is that you can have a web application running in about 10 lines of code and a few minutes of effort. Despite this, Flask can and does run real production sites.

As you need additional functionality, you can add it through the numerous extensions available. Need an admin interface? Try Flask-Admin. Need user session management and the ability for visitors to log in and out? Check out Flask-Login.

While Django advocates may argue that by the time you add all of the "needed" extensions to Flask, you get Django, others would argue that a micro-framework is the ideal platform to learn web development. Not only does the small size make it easier for you to get your head around it at first, but the fact that you need to add each component yourself requires you to understand fully the moving parts involved. Finally, and very relevant to data scientists, Flask is quite popular for building RESTful web services and creating APIs.

As Flask documentation isn't quite as extensive as the numerous Django articles on the web, I wanted to assemble the materials that I have been looking at as I explore the world of Flask development.

For Beginners - Getting Started

The Official Flask Docs

The official Flask home page has a great set of documentation, list of extensions to add functionality, and even a tutorial to build a micro blogging site. Check it out here.

Blog post on Learning Flask

This tutorial is really a hidden gem designed for those that are new to web programming and Flask. It contains lots of small insights into the process via examples that really help connect the dots.

Full Stack Python

While not exclusively focused on Flask, Fullstackpython.com is a fantastic page that offers a great amount of information and links about the full web stack, from servers, operating systems, and databases to configuration management, source control, and web analytics, all from the perspective of Python. A must read through for those new to web development.

Starting A Python Project the Right Way

Jeff Knupp again walks the beginner Python developer through how he starts a new Python project.


For Intermediate Developers

The Flask Mega Tutorial

The Flask Mega Tutorial is aptly named as it has eighteen separate parts walking you through the numerous aspects of building a website with Flask and various extensions. I think this tutorial is pretty good but have heard some comment that it can be a bit much if you don't have a strong background in web development.

Miguel Grinberg's tutorial has been so popular that it is even being used as the basis of an O'Reilly book, Flask Web Development, that is due out on May 25, 2014. An early release is currently available, so called "Raw & Unedited," here.

A snippet about the book is below:

If you have Python experience, and want to learn advanced web development techniques with the Flask micro-framework, this practical book takes you step-by-step through the development of a real-world project. It’s an ideal way to learn Flask from the ground up without the steep learning curve. The author starts with installation and brings you to more complicated topics such as database migrations, caching, and complex database relationships.

Each chapter in this focuses on a specific aspect of the project—first by exploring background on the topic itself, and then by waking you through a hands-on implementation. Once you understand the basics of Flask development, you can refer back to individual chapters to reinforce your grasp of the framework.

Flask Web Development

Beating O'Reilly to the punch, Packt Publishing started offering the book, "Instant Flask Web Development" from Ron DuPlain in August 2013.

Flask - Blog: Building a Flask Blog: Part1

This blog contains two different tutorials in separate posts. This one, tackles the familiar task of building a blog using Flask-SQLAlchemy, WTForms, Flask-WTF, Flask-Migrate, WebHelpers, and PostgreSQL. The second one shows the creation of a music streaming app.

Flask with uWSGI + Nginx

This short tutorial and Git Hub repository shows you how to set up a simple Flask app with uWSGI and the web server Nginx.

Python Web Applications with Flask

This extensive 3 part blog post from Real Python works its way through the development of a mid-sized web analytics application. The tutorial was last updated November 17th of 2013 and has the added bonus of demonstrating the use of virtualenv and git hub as well.

Python and Flask are Ridiculously Powerful

Jeff Knupp, author of Idiomatic Python, is fed up with online credit card payment processors so builds his own with Flask in a few hours.

 


More Advanced Content

The Blog of Erik Taubeneck

This is the blog of Erick Taubeneck, "a mathematician/economist/statistician by schooling, a data scientist by trade, and a python hacker by night." He is a Hacker School alum in NYC and contributor to a number of Flask extensions (and an all around good guy).

How to Structure Flask Applications

Here is a great post, recommended by Erik, discussing how one seasoned developers structures his flask applications.

Modern Web Applications with Flask and Backbone.js - Presentation

Here is a presentation about building a "modern web application" using Flask for the backend and backbone.js in the browser. Slide 2 shows a great timeline of the last 20 years in the history of web applications and slide 42 is a great overview of the distribution of Flask extensions in the backend and javascript in the frontend.

Modern Web Application Framework: Python, SQL Alchemy, Jinja2 & Flask - Presentation

This 68-slide presentation on Flask from Devert Alexandre is an excellent resource and tutorial discussing Flask, the very popular SQL Alchemy and practically default Flask ORM, and the Jinja2 template language.

Diving Into Flask - Presentation

66-slide presentation on Flask by Andrii V. Mishkovskyi from EuroPython 2012 that also discusses Flask-Cache, Flask-Celery, and blueprints for structuring larger applications.