R

Supervised Machine Learning with R Workshop on April 30th

Data Community DC and District Data Labs are hosting a Supervised Machine Learning with R workshop on Saturday, April 30th. Come out and learn about R's capabilities for regression and classification, how to perform inference with these models, and how to evaluate your models on out-of-sample data!
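As a taste of the material, here is a minimal sketch of out-of-sample evaluation in base R; it is illustrative only (not the workshop's code) and uses the built-in mtcars data.

    # Split the data: fit on roughly 70%, evaluate on the remaining rows.
    set.seed(42)
    idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.7 * nrow(mtcars)))
    train <- mtcars[idx, ]
    test  <- mtcars[-idx, ]

    fit   <- lm(mpg ~ wt + hp, data = train)   # regression on training data only
    preds <- predict(fit, newdata = test)      # predictions for unseen rows
    sqrt(mean((test$mpg - preds)^2))           # out-of-sample RMSE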

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book.

Win Free eCopies of Social Media Mining with R

This is a sponsored post by Richard Heimann. Rich is Chief Data Scientist at L-3 NSS and recently published Social Media Mining with R (Packt Publishing, 2014) with co-author Nathan Danneman, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent Data Science DC and DC NLP meetups. Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, Social Media Mining with R.

Over the next two weeks, five lucky winners will win a digital copy of the book. Please keep reading to find out how you can be one of the winners and learn more about Social Media Mining with R.

Overview: Social Media Mining with R

Social Media Mining with R is a concise, hands-on guide with several practical examples of social media data mining and a detailed treatise on inference and social science research that will help you mine data in the real world.

Whether you are an undergraduate who wants hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies in unsupervised sentiment analysis, or simply someone interested in social data analysis, this book will prove an essential asset. No previous experience with R or statistics is required, though knowledge of both will enrich your experience. Readers will:

  • Learn the basics of R and its data types
  • Explore the vast expanse of social science research
  • Discover the potential of social data, along with its pitfalls and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Familiarize yourself with visualization and some cognitive pitfalls
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining, or more broadly in text analysis and natural language processing, in the comments section of this blog post. This can be analytical output, a seminal white paper, or an interesting commercial or open-source package! That way there are no losers: we will all learn something.

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.

How To Put Your Meetup On the Map (Literally)

This is a guest post by Alan Briggs. Alan is a Data Scientist with Elder Research, Inc., where he assists with the development of predictive analytics capabilities for national security clients in the Washington, DC metro area. He is President-Elect of the Maryland chapter of the Institute for Operations Research and the Management Sciences (INFORMS). Follow him at @AlanWBriggs. This post previews two events at which Alan is presenting with Data Community DC President Harlan Harris.

Have you ever tried to schedule dinner or a movie with a group of friends from across town? When you throw out an idea about where to go, someone responds, "No, not there, that's too far." Then there's the old adage that the three most important things about real estate are location, location, and location. From placing a hot dog stand on the beach to deciding where to build the next Ikea, location crops up all over the place. Not surprisingly, and I think most social scientists would agree, people tend to act in their own self-interest: everyone wants to travel the least distance, spend the least money, or expend the least time possible in order to do what they need [or want] to do. For one self-interested person, the solution to a location problem would always be clear, but we live in a world of co-existence and shared resources. There's not just one person going to the movies or shopping at the store; there are several, hundreds, thousands, maybe several hundred thousand. If self-interest is predictable in the small planning exercise of getting together with friends, can we use math and science to leverage it to our advantage? It turns out that the mathematical and statistical techniques that scale to the world's largest and most vexing problems can also be used to address some more everyday issues, such as where to schedule a Meetup event.

With a little abstraction, this scenario looks a lot like a classical problem in operations research called the facility location or network design problem. With roots tracing back to the 17th-century Fermat-Weber problem, facility location analysis seeks to minimize the costs associated with locating a facility. In our case, we can define the cost of a Meetup venue as the sum of the distances traveled to the Meetup by its attendees. Other costs could be included, but to start simple, you can't beat straight-line distance.
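To make that concrete, here is a toy single-venue version in R. The attendee coordinates are synthetic stand-ins for the real survey data; the total straight-line distance is handed to a general-purpose optimizer.

    # Toy Fermat-Weber problem: find the venue minimizing total straight-line
    # distance to all attendees. Coordinates are made up for illustration.
    set.seed(1)
    attendees <- data.frame(x = runif(100, -77.2, -76.9),   # fake longitudes
                            y = runif(100, 38.8, 39.1))     # fake latitudes

    total_distance <- function(venue) {
      sum(sqrt((attendees$x - venue[1])^2 + (attendees$y - venue[2])^2))
    }

    # Start the search at the centroid and minimize numerically.
    best <- optim(c(mean(attendees$x), mean(attendees$y)), total_distance)
    best$par   # coordinates of the (approximately) optimal single venue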

So, here’s a little history. The data-focused Meetup scene in the DC region is several years old, with Hadoop DC, Big Data DC and R Users DC (now Statistical Programming DC) having been founded first. Over the years, as these groups have grown and been joined by many others, their meetings have bounced around among venues in downtown Washington DC, Northern Virginia, and suburban Maryland. Location decisions were primarily driven by supply: which organization would be willing to let a big crowd fill its meeting space on a weekday evening? Data Science DC, for instance, has almost always held its events downtown. But as the events have grown, and as organizers have heard more complaints about location, it became clear that venue selection needed to include a demand component as well, and that some events could or should be held outside the downtown core (or vice versa, for groups that have held their events in the suburbs).

Data Community DC performed a marketing survey at the beginning of 2013 and got back a large enough sample of Meetup attendees to do some real analysis. (See the public survey announcement.) Since professional Meetups tend to be on weekday evenings, attendees are likely traveling neither just from work nor just from home, but making a detour from the usual commute connecting the two. Fortunately, the survey designers asked attendees for both their home and work ZIP codes, so the data could be used to (roughly) draw typical commute lines on a map.

Commutes for data Meetup attendees, based on ZIP codes.

The Revolutions blog recently presented a similar problem in their post How to Choose a New Business Location with R. The author, Rodolfo Vanzini, writes that his wife’s business had simply run out of space in its current location. An economist by training, Vanzini initially assumed the locations of customers at his wife’s business must be uniformly distributed. After further consideration, he concluded that “individuals make biased decisions basing their assumptions and conclusions on a limited and approximate set of rules often leading to sub-optimal outcomes.” Vanzini then turned to R, plotted his customers’ addresses on a map and picked a better location based on visual inspection of the location distribution.

If you’ve been paying attention, you’ll notice a common thread in the location problems mentioned so far. When you’re getting together with friends, you’re only going to one movie theater; Vanzini’s wife was only locating one school. Moreover, both problems rely on a single location, home, for each interested party. That makes for a relatively simple, beginner-level location problem. The Meetup location problem, on the other hand, adds two complexities that make it worth the time you’re spending here. First, a group of 150 boisterous data scientists can easily overstay their welcome by holding monthly meetings at the same place over and over again. Second, a single location ensures that the part of the population that drives the farthest must do so for each and every event. For these reasons, we propose to identify the three locations that minimize the sum of the minimum distances traveled by the entire group. The idea is that Meetup events can rotate among the three optimal locations. This provides diversity in location, which appeases meeting-space hosts, and it provides a closer meeting location for a rotating third of the attendee population. Not every event will be ideal for everyone, but it will be convenient for everyone at least sometimes.

As an additional complexity, we have two ZIP codes for each attendee (work and home), which means that instead of a point-to-point distance computation, we want to minimize the distance from each commute line, measured at its closest point, to the nearest of the three meeting locations. Optimizing with these two concepts in mind, particularly the n-location component, is substantially more complicated than optimizing a single location with one set of coordinates per attendee.
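Here is a rough sketch of both complications in R, again with synthetic commute data: each attendee is charged the distance from their home-to-work segment to the nearest of three candidate venues, and optim() searches over all six venue coordinates at once. The real analysis used geocoded ZIP codes; this just shows the shape of the computation.

    # Synthetic commute segments: one home and one work point per attendee.
    set.seed(1)
    n     <- 50
    homes <- cbind(runif(n), runif(n))
    works <- cbind(runif(n), runif(n))

    # Distance from point p to the segment from a to b (length-2 vectors).
    point_to_segment <- function(p, a, b) {
      ab <- b - a
      t  <- sum((p - a) * ab) / sum(ab * ab)
      t  <- max(0, min(1, t))               # clamp to stay on the segment
      sqrt(sum((a + t * ab - p)^2))
    }

    # Cost of a candidate set of venues (a k x 2 matrix): each attendee
    # contributes their distance to whichever venue is closest.
    cost <- function(venues) {
      sum(sapply(seq_len(n), function(i) {
        min(apply(venues, 1, function(v)
          point_to_segment(v, homes[i, ], works[i, ])))
      }))
    }

    # Search over the six coordinates of three venues at once.
    res <- optim(runif(6), function(v) cost(matrix(v, ncol = 2)))
    matrix(res$par, ncol = 2)   # one candidate trio of venue locations

Nelder-Mead can stall in a local minimum on a cost like this, so in practice you would restart the search from several random starting points and keep the best trio.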

So, there you have it. To jump ahead to the punchline, the three optimal locations for a data Meetup to host its meetings are Rockville, MD; downtown Washington, DC; and southern Arlington, VA.

Gold points are optimal locations. Color gradient shows single-location utility.

To hear us (Harlan and Alan) present this material, there are two great events coming up. The Maryland chapter of INFORMS is hosting its inaugural Learn. Network. Do. event on October 23 at INFORMS headquarters on the UMBC campus, and Statistical Programming DC will hold its monthly meeting at iStrategy Labs on October 24. Both events will pull back the curtain on the code we used, and either should send you off with enough knowledge to start tackling your own location optimization problem.

Fantastic presentations from R using slidify and rCharts

Dr. Ramnath Vaidyanathan of McGill University gave an excellent presentation at a joint Data Visualization DC/Statistical Programming DC event on Monday, August 19 at nclud, on two R projects he leads: slidify and rCharts. After the evening, all I can say is, Wow!! It's truly impressive to see what can be achieved in presentations and information-rich graphics directly from R. Again, wow!! (I think many of the attendees shared this sentiment.)

Slidify

Slidify is an R package that

helps create, customize and share elegant, dynamic and interactive HTML5 documents through R Markdown.

We have blogged about slidify before, but it was great to get an overview directly from its creator. Dr. Vaidyanathan explained that the principle underlying slidify is the separation of a document's content from the appearance and behavior of the final product. He achieves this using customizable HTML5 frameworks, layouts, and widgets (he provides several here and through his slidifyExamples R package).

Example RMarkdown file for slidify

You start with a modified R Markdown file, as seen here. This file can contain chunks of R code. It is first processed into a pure Markdown file, interlacing the output of the R code, and then split, applied, and combined to produce the final HTML5 document. The document can be shared via GitHub, Dropbox, or RPubs directly from R. Dr. Vaidyanathan gave examples of how slidify can be used to create interactive quizzes and even interactive documents that combine slidify and Shiny.
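For a sense of what that modified R Markdown file looks like, here is a hedged skeleton; the exact front-matter fields vary with the deck you scaffold (slidify's author() function generates one for you):

    ---
    title       : A Minimal Deck
    author      : Jane Analyst          # hypothetical author
    framework   : io2012                # one of the supported HTML5 frameworks
    highlighter : highlight.js
    mode        : selfcontained
    ---

    ## First slide

    Plain Markdown, plus live R chunks whose output is interlaced in:

    ```{r}
    summary(cars)
    ```

    ---

    ## Second slide

Running library(slidify); slidify("index.Rmd") then compiles the file into an HTML5 deck ready to publish.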

One really neat feature he demonstrated is the ability to embed an interactive R console within a slidify presentation. He explained that this uses a Shiny server backend locally, or an OpenCPU backend if published online. This changes how presentations can be delivered: rather than bouncing between windows, the presenter can demonstrate code live within the presentation itself.

rCharts

rCharts is

an R package to create, customize and share interactive visualizations, using a lattice-like formula interface

Again, we have blogged about rCharts before, but there have been several advances in the short time since, both in rCharts itself and in the interactive documents Dr. Vaidyanathan has developed.

rCharts provides a formula-driven interface to several JavaScript graphics frameworks, including NVD3, Highcharts, Polycharts and Vega. The formula interface is familiar to R users and makes creating these charts quite straightforward. Some customization is possible, as is adding basic controls without having to use Shiny. We saw several examples of excellent interactive charts built with simple R commands, and there is even a gallery where users can contribute their rCharts creations. There is really no excuse anymore for avoiding these technologies for visualization, and they make life so much more interesting!!
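As a flavor of that interface, here is a minimal sketch built on the package's standard mtcars example (rCharts installs from GitHub rather than CRAN, e.g. via devtools::install_github("ramnathv/rCharts")):

    library(rCharts)

    # Lattice-style formula: mpg against weight, grouped by cylinder count,
    # rendered as an interactive NVD3 scatter chart in the browser.
    p <- nPlot(mpg ~ wt, group = "cyl", data = mtcars, type = "scatterChart")
    p                      # view the interactive chart
    p$save("chart.html")   # one way to save standalone HTML to share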

Bikeshare maps, or how to create stellar interactive visualizations using R and Javascript

Dr. Vaidyanathan demonstrated one project which, I feel, shows the power of the technologies he is developing with R and JavaScript. He created a web application using R, Shiny, and his rCharts package (which accesses the Leaflet JavaScript library), plus a very little bit of JavaScript magic, to visualize the availability of bicycles at stations in a bike-sharing network. The application automatically downloads real-time data and can visualize availability in over 100 bike-sharing systems worldwide. He focused on the London bike share map, which fascinatingly showed how bikes migrate from the city center to the outer fringes at night. Clicking on any dot shows how many bikes are available at that station.

London bike share map

Dr. Vaidyanathan quickly demonstrated the basic process of mapping points on a city map, changing their appearance, and adding metadata to each point that appears as a pop-up when clicked.
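The core of that pattern takes only a few lines of rCharts code. A stripped-down sketch, with the station and its availability made up for illustration:

    library(rCharts)

    # Minimal Leaflet map via rCharts: center on London, add one station
    # marker whose pop-up carries the (made-up) bike availability.
    map <- Leaflet$new()
    map$setView(c(51.505, -0.09), zoom = 13)
    map$marker(c(51.5, -0.09),
               bindPopup = "Hypothetical station: 7 bikes available")
    map   # render the interactive map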

You can see the full project and how Dr. Vaidyanathan developed this application here.

Interactive learning environments

Finally, Dr. Vaidyanathan showed a new application he is developing using slidify, rCharts, and other open-source technologies like OpenCPU and PopcornJS. It allows him to author a lesson in R Markdown, integrate interactive components (including interactive R consoles), record the lesson as a screencast, sync the screencast with the slides, and publish the result. This seems to me one possible future for massive open online courses. An example presentation is available here, and the project is hosted here.

Open presentation

The presentation, along with all the relevant code and demos, is hosted on GitHub; the slides themselves (developed using slidify, naturally) can be seen here.

Stay tuned for an interview I did with Dr. Vaidyanathan earlier, which will be published here shortly.

Have fun using these fantastic tools in the R ecosystem to make really cool, informative presentations of your data projects. See you next time!!!

Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools, including NumPy, Matplotlib, and pandas.

He next walked through a brief example demonstrating NumPy, pandas, and Matplotlib, which he made available through the IPython notebook viewer.

The second half of Jonathan's talk focused on using clustering to identify scientific articles of interest. He needed to a) convert PDFs to text, b) extract sections of each document, c) cluster the documents, and d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.
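Jonathan's pipeline was in Python, but for readers of this R-centric blog, here is a rough R analogue of the clustering step using the tm package, with toy article text and two clusters standing in for his corpus and seven groups:

    library(tm)   # text mining infrastructure for R

    docs <- c("neural networks for image recognition",
              "bayesian inference with markov chains",
              "convolutional networks classify images")   # stand-in articles

    # Weight the document-term matrix by TF-IDF, then k-means the vectors.
    corpus <- VCorpus(VectorSource(docs))
    dtm    <- DocumentTermMatrix(corpus,
                                 control = list(weighting = weightTfIdf))

    set.seed(1)
    km <- kmeans(as.matrix(dtm), centers = 2)   # Jonathan's data yielded 7 groups
    km$cluster   # cluster assignment for each article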

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.

Next to speak was Brian Godsey of RedOwl Analytics, who presented the company's work on social network analysis. He first described the problem of identifying misbehavior in a financial firm: the goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

To find these anomalies, they model behavior based on patterns in communications, estimating the model's parameters from a data set and a set of effects.

Brian then revealed that, while implementing their solution, they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated the package on the well-known Enron data set and discussed how larger data sets require other technologies, such as MapReduce.

  http://www.youtube.com/watch?v=1rFlqD3prKE&list=PLgqwinaq-u-Piu9tTz58e7-5IwYX_e6gh

Slides can be found here for Jonathan and here for Brian.