R

Supervised Machine Learning with R Workshop on April 30th

Data Community DC and District Data Labs are hosting a Supervised Machine Learning with R workshop on Saturday April 30th. Come out and learn about R's capabilities for regression and classification, how to perform inference with these models, and how to use out-of-sample evaluation methods for your models!

Book Review: R for Everyone by Jared P. Lander

This is a guest post by Vadim Y. Bichutskiy, a Lead Data Scientist at Echelon Insights, a Republican analytics firm. His background spans analytical/engineering positions in Silicon Valley, academia, and the US Government. He holds an MS and BS in Computer Science from the University of California, Irvine, an MS in Statistics from California State University, East Bay, and is a PhD candidate in Data Sciences at George Mason University. Follow him on Twitter @vybstat.

Recently I got a hold of Jared Lander's book R for Everyone. It is one of the best books on R that I have seen. I first started learning R in 2007 when I was a CS graduate student at UC Irvine. Bored with my research, I decided to venture into statistics and machine learning. I enrolled in several PhD-level statistics courses--the Statistics Department at UC Irvine is in the same school as the CS Dept.--where I was introduced to R. Coming from a C/C++/Java background, R was different, exciting, and powerful.

Learning R is challenging because its documentation is scattered all over the place; there is no single comprehensive book covering the important use cases, so to get the fundamentals one has to piece together multiple books, online resources, and tutorials. Jared has written an excellent book that covers the fundamentals (and more!). It is easy to understand, concise, and well written. The title "R for Everyone" is accurate: while the book is great for R novices, it is also quite useful for experienced R hackers. It truly lives up to its title.

Chapters 1-4 cover the basics: installation, RStudio, the R package system, and basic language constructs. Chapter 5 discusses fundamental data structures: data frames, lists, matrices, and arrays. Importing data into R is covered in Chapter 6: reading data from CSV files, Excel spreadsheets, relational databases, and other statistical packages such as SAS and SPSS. This chapter also illustrates saving objects to disk and scraping data from the Web. Statistical graphics is the subject of Chapter 7, including Hadley Wickham's irreplaceable ggplot2 package. Chapters 8-10 are about writing R functions, control structures, and loops. Altogether, Chapters 1-10 cover a lot of ground. But we're not even halfway through the book!

Chapters 11-12 introduce tools for data munging: base R's apply family of functions and aggregation, Hadley Wickham's packages plyr and reshape2, and various ways to do joins. A section on speeding up data frames with the indispensable data.table package is also included. Chapter 13 is all about working with string (character) data, including regular expressions and Hadley Wickham's stringr package. Important probability distributions are the subject of Chapter 14. Chapter 15 discusses basic descriptive and inferential statistics, including the t-test and the analysis of variance. Statistical modeling with linear and generalized linear models is the topic of Chapters 16-18; topics here also include survival analysis, cross-validation, and the bootstrap. The last part of the book covers hugely important topics. Chapter 19 discusses regularization and shrinkage, including Lasso and Ridge regression, their generalization the Elastic Net, and Bayesian shrinkage. Nonlinear and nonparametric methods are the focus of Chapter 20: nonlinear least squares, splines, generalized additive models, decision trees, and random forests. Chapter 21 covers time series analysis with autoregressive integrated moving average (ARIMA), vector autoregressive (VAR), and generalized autoregressive conditional heteroskedasticity (GARCH) models. Clustering is the topic of Chapter 22: K-means, partitioning around medoids (PAM), and hierarchical clustering.

The final two chapters cover topics that are often omitted from other books and resources, making the book especially useful to seasoned programmers. Chapter 23 is about creating reproducible reports and slide shows with Yihui Xie's knitr package, LaTeX, and Markdown. Developing R packages is the subject of Chapter 24.

A useful appendix on the R ecosystem puts icing on the cake with valuable resources including Meetups, conferences, Web sites and online documentation, other books, and folks to follow on Twitter.

Whether you are a beginner or an experienced R hacker looking to pick up new tricks, Jared's book will be good to have in your library. It covers a multitude of important topics, is concise and easy-to-read, and is as good as advertised.

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary. A front-page article over the weekend in the Wall Street Journal indicated that the profession of greatest interest to tech firms is the data scientist: someone who combines analytic, computing, and domain skills to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for "big data" and "predictive analytics" skills, which allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.

The questions that I will address here are how Monte Carlo simulation and discrete-event simulation differ and how they fit into the general framework of predictive analytics.

First, how do Monte Carlo and discrete-event simulation differ? Monte Carlo simulation is appropriate when the passage of time does not play a significant role. Probability calculations involving problems associated with playing cards, dice, and coins, for example, can be solved by Monte Carlo.
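
To make the dice example concrete, here is a minimal Monte Carlo sketch in base R (an illustration of my own, not code from any particular package): it estimates the probability that two fair dice sum to at least 10, whose exact value is 6/36, or about 0.167.

```r
# Monte Carlo estimate: P(sum of two fair dice >= 10); exact answer is 6/36.
set.seed(42)
n <- 100000
rolls <- replicate(n, sum(sample(1:6, 2, replace = TRUE)))
mean(rolls >= 10)   # converges to ~0.167 as n grows
```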

Discrete-event simulation, on the other hand, has the passage of time as an integral part of the model. The classic application areas in which discrete-event simulation has been applied are queuing, inventory, and reliability. As an illustration, a mathematical model for a queue with a single server might consist of (a) a probability distribution for the time between arrivals to the queue, (b) a probability distribution for the service time at the queue, and (c) an algorithm for placing entities in the queue (first-come, first-served is the usual default). Discrete-event simulation can be coded in any algorithmic language, although the coding is tedious. Because of the complexities of coding a discrete-event simulation, dozens of languages have been developed to ease the implementation of a model.
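
As a small sketch of the single-server queue just described (my own illustration; the exponential interarrival and service distributions are assumptions, not part of the description above), the waiting times under a first-come, first-served discipline can be simulated in base R with Lindley's recursion:

```r
# Minimal single-server FIFO queue simulation in base R.
# Assumptions (illustrative only): exponential interarrival and service times.
set.seed(1)
n <- 10000                             # number of customers to simulate
interarrival <- rexp(n, rate = 1.0)    # (a) time between arrivals
service      <- rexp(n, rate = 1.25)   # (b) service times (server slightly faster)
wait <- numeric(n)                     # (c) first-come, first-served waiting times
for (i in 2:n) {
  # Lindley's recursion: each customer's wait depends on the previous customer
  wait[i] <- max(0, wait[i - 1] + service[i - 1] - interarrival[i])
}
mean(wait)        # estimated mean time spent waiting in the queue
mean(wait > 0)    # estimated probability of having to wait at all
```

Special-purpose simulation languages exist to hide exactly this kind of bookkeeping, which becomes tedious once a model has multiple servers, priorities, or more elaborate disciplines.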

The field of predictive analytics leans heavily on tools from data mining in order to identify patterns and trends in a data set. Once an appropriate question has been posed, these patterns and trends in explanatory variables (often called covariates) are used to predict future behavior of variables of interest. There is both an art and a science to predictive analytics. The science side includes the standard tools associated with mathematics, computation, probability, and statistics. The art side consists mainly of making appropriate assumptions about the mathematical model constructed for predicting future outcomes. Simulation is used primarily for verification and validation of the mathematical models associated with a predictive analytics model. It can be used to determine whether the probabilistic models are reasonable and appropriate for a particular problem.

Two sources for further training in simulation are a workshop in Catonsville, Maryland on September 12-13 by Barry Lawson (University of Richmond) and me, and the Winter Simulation Conference (December 7-10, 2014) in Savannah.

Natural Language Processing in Python and R

This is a guest post by Charlie Greenbacker and Tommy Jones.

Data comes in many forms. As a data scientist, you might be comfortable working with large amounts of structured data nicely organized in a database or other tabular format, but what do you do if a customer drops 10,000 unstructured text documents in your lap and asks you to analyze them?

Some estimates claim unstructured data accounts for more than 90 percent of the digital universe, much of it in the form of text. Digital publishing, social media, and other forms of electronic communication all contribute to the deluge of text data from which you might seek to derive insights and extract value. Fortunately, many tools and techniques have been developed to facilitate large-scale text analytics. Operating at the intersection of computer science, artificial intelligence, and computational linguistics, Natural Language Processing (NLP) focuses on algorithmically understanding human language.

Interested in getting started with Natural Language Processing but don't know where to begin? On July 9th, a joint meetup co-hosted by Statistical Programming DC, Data Wranglers DC, and DC NLP will feature two introductory talks on the nuts & bolts of working with NLP in Python and R.

The Python programming language is increasingly popular in the data science community for a variety of reasons, including its ease of use and the plethora of open source software libraries available for scientific computing & data analysis. Packages like SciPy, NumPy, Scikit-learn, Pandas, NetworkX, and others help Python developers perform everything from linear algebra and dimensionality reduction, to clustering data and analyzing multigraphs.

Back in the dark ages (about 10+ years ago), folks working in NLP usually maintained an assortment of homemade utility programs designed to handle many of the common tasks involved with NLP. Despite our best intentions, most of this code was lousy, brittle, and poorly documented -- hardly a good foundation upon which to build your masterpiece. Over the past several years, however, mainstream open source software libraries like the Natural Language Toolkit for Python (NLTK) have emerged to offer a collection of high-quality reusable NLP functionality. NLTK enables researchers and developers to spend more time focusing on the application logic of the task at hand, and less on debugging an abandoned method for sentence segmentation or reimplementing noun phrase chunking.

If you're already familiar with Python, the NLTK library will equip you with many powerful tools for working with text data. The O'Reilly book Natural Language Processing with Python, written by Steven Bird, Ewan Klein, and Edward Loper, offers an excellent overview of using NLTK for text analytics. Topics include processing raw text, tagging words, document classification, information extraction, and much more. Best of all, the entire contents of the NLTK book are freely available online under a Creative Commons license.

The Python portion of this joint meetup event will cover a handful of the NLP building blocks provided by NLTK, including extracting text from HTML, stemming & lemmatization, frequency analysis, and named entity recognition. These components will then be assembled to build a very basic document summarization program.

Additional NLP resources in Python

- Natural Language Toolkit for Python (NLTK): http://www.nltk.org/
- Natural Language Processing with Python (book): http://oreilly.com/catalog/9780596516499/ (free online version: http://www.nltk.org/book/)
- Python Text Processing with NLTK 2.0 Cookbook (book): http://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
- Python wrapper for the Stanford CoreNLP Java library: https://pypi.python.org/pypi/corenlp
- guess_language (Python library for language identification): https://bitbucket.org/spirit/guess_language
- MITIE (new C/C++-based NER library from MIT with a Python API): https://github.com/mit-nlp/MITIE
- gensim (topic modeling library for Python): http://radimrehurek.com/gensim/

R is a programming language popular in statistics and machine learning research. R has several advantages in the ML/stat domains. R is optimized for vector operations. This simplifies programming since your code is very close to the math that you're trying to execute. R also has a huge community behind it; packages exist for just about any application you can think of. R has a close relationship with C, C++, and Fortran and there are R packages to execute Java and Python code, increasing its flexibility. Finally, the folks at CRAN are zealous about version control and compatibility, making installing R and subsequent packages a smooth experience.
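
A quick illustration of that vectorized style (a toy example of my own): elementwise arithmetic and summaries read almost like the underlying math, with no explicit loops.

```r
# Vectorized R: the code mirrors the math it computes.
x <- c(2, 4, 6, 8)
y <- c(1, 3, 5, 7)
z <- 2 * x + y                     # elementwise: 5 11 17 23
dot <- sum(x * y)                  # inner product: 100
scaled <- (x - mean(x)) / sd(x)    # standardize x in one line
```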

However, R does have some sharp edges that become obvious when working with any non-trivially sized linguistic data. R holds all data in your active workspace in RAM. If you are running R on a 32-bit system, you have a 4 GB limit on the RAM R can access. There are two implications of this: NLP data need to be stored in memory-efficient objects (more on that later), and (regrettably) there is still a hard limit on how much data you can work on at one time. There are packages, such as `bigmemory`, that are moving to address this, but they are outside the scope of this presentation. You also need to write efficient code; the size of NLP data will punish you for inefficiencies.
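
As a rough sketch of what "memory-efficient objects" means here (my own illustration; the Matrix package is an assumption, not something required by the talk), a mostly-zero term matrix stored sparsely takes a tiny fraction of the RAM its dense equivalent does:

```r
# Dense vs. sparse storage for a mostly-zero term matrix (illustrative sizes).
library(Matrix)
dense <- matrix(0, nrow = 1000, ncol = 5000)     # 5 million cells
dense[sample(length(dense), 2500)] <- 1          # only 2,500 nonzero counts
sparse <- Matrix(dense, sparse = TRUE)
object.size(dense)    # ~40 MB: every cell costs 8 bytes, zeros included
object.size(sparse)   # a few tens of KB: only nonzero entries are stored
```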

When, then, does R make sense? Every person and every problem is unique, but I can offer a few suggestions:

1. You are doing statistics/ML research and not developing software.
2. (Similar to 1.) You are a quantitative generalist (and probably good in R already) and NLP is just another feather in your cap.

Sometimes being a data scientist is about developing and tweaking your own algorithms. Sometimes it is about taking others' algorithms, plugging in your data, and moving on to other areas of the problem. If you are doing more of the former, R is a solid choice. If you are doing more of the latter, R isn't too bad. But I've found that my own code often runs faster than some of the pre-packaged code. Your individual mileage may vary.

The second presentation at this meetup will cover the basics of reading documents into R and creating a document-term matrix, and then demonstrate some basic document summarization, keyword extraction, and document clustering techniques.
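
For a flavor of what that looks like, here is a minimal sketch assuming the tm package (a common choice; the talk's actual code may differ, and the directory path is hypothetical):

```r
# Read a folder of plain-text documents and build a document-term matrix with tm.
library(tm)
corpus <- VCorpus(DirSource("path/to/documents", encoding = "UTF-8"))
corpus <- tm_map(corpus, content_transformer(tolower))    # normalize case
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)                         # rows: documents, cols: terms
findFreqTerms(dtm, lowfreq = 25)                          # crude keyword extraction
```

From a matrix like dtm, frequency-based summarization and clustering (for example, k-means on weighted rows) follow naturally.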

Seats are filling up quickly, so RSVP here now: http://www.meetup.com/stats-prog-dc/events/177772322/

Win Free eCopies of Social Media Mining with R

This is a sponsored post by Richard Heimann. Rich is Chief Data Scientist at L-3 NSS and recently published Social Media Mining with R (Packt Publishing, 2014) with co-author Nathan Danneman, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent Data Science DC and DC NLP meetups. Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, Social Media Mining with R.

Over the next two weeks, five lucky winners will receive a digital copy of the book. Please keep reading to find out how you can be one of the winners and learn more about Social Media Mining with R.

Overview: Social Media Mining with R

Social Media Mining with R is a concise, hands-on guide with several practical examples of social media data mining, plus a detailed treatise on inference and social science research that will help you mine data in the real world.

Whether you are an undergraduate who wishes to get hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or simply someone interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required, though having knowledge of both will enrich your experience. Readers will learn the following:

  • Learn the basics of R and all the data types
  • Explore the vast expanse of social science research
  • Discover more about data potential, the pitfalls, and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Familiarize yourself with visualization and some cognitive pitfalls
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining, or more broadly in text analysis and natural language processing, in the comments section of this blog. This can be some analytical output, a seminal white paper, or an interesting commercial or open-source package! That way there are no losers, since we will all learn something.

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.