Supervised Machine Learning with R Workshop on April 30th

Data Community DC and District Data Labs are hosting a Supervised Machine Learning with R workshop on Saturday April 30th. Come out and learn about R's capabilities for regression and classification, how to perform inference with these models, and how to use out-of-sample evaluation methods for your models!

Announcing the Publication of Practical Data Science Cookbook

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. 

Win Free eCopies of Social Media Mining with R

This is a sponsored post by Richard Heimann. Rich is Chief Data Scientist at L-3 NSS and recently published Social Media Mining with R (Packt Publishing, 2014) with co-author Nathan Danneman, also a Data Scientist at L-3 NSS Data Tactics. Nathan has been featured at recent Data Science DC and DC NLP meetups. Nathan Danneman and Richard Heimann have teamed up with DC2 to organize a giveaway of their new book, Social Media Mining with R.

Over the next two weeks, five lucky winners will win a digital copy of the book. Please keep reading to find out how you can be one of the winners and to learn more about Social Media Mining with R.

Overview: Social Media Mining with R

Social Media Mining with R is a concise, hands-on guide with several practical examples of social media data mining and a detailed treatment of inference and social science research that will help you mine data in the real world.

Whether you are an undergraduate who wants hands-on experience working with social data from the Web, a practitioner wishing to expand your competencies and learn unsupervised sentiment analysis, or someone simply interested in social data analysis, this book will prove to be an essential asset. No previous experience with R or statistics is required, though having knowledge of both will enrich your experience. Readers will learn the following:

  • Learn the basics of R and all the data types
  • Explore the vast expanse of social science research
  • Discover the potential of data, its pitfalls, and inferential gotchas
  • Gain an insight into the concepts of supervised and unsupervised learning
  • Familiarize yourself with visualization and some cognitive pitfalls
  • Delve into exploratory data analysis
  • Understand the minute details of sentiment analysis

How to Enter?

All you need to do is share your favorite effort in social media mining, or more broadly in text analysis and natural language processing, in the comments section of this blog. This can be some analytical output, a seminal white paper, or an interesting commercial or open-source package! In this way there are no losers, as we will all learn.

The first five commenters will win a free copy of the eBook. (DC2 board members and staff are not eligible to win.) Share your public social media accounts (about.me, Twitter, LinkedIn, etc.) in your comment, or email media@datacommunitydc.org after posting.

How To Put Your Meetup On the Map (Literally)

This is a guest post by Alan Briggs. Alan is a Data Scientist with Elder Research, Inc., where he helps develop predictive analytics capabilities for national security clients in the Washington, DC metro area. He is President-Elect of the Maryland chapter of the Institute for Operations Research and the Management Sciences (INFORMS). Follow him at @AlanWBriggs. This post previews two events that Alan is presenting with Data Community DC President Harlan Harris.

Have you ever tried to schedule dinner or a movie with a group of friends from across town? When you throw out an idea about where to go, does someone respond, "no, not there, that's too far"? Then there's the old adage that the three most important things about real estate are location, location and location. From placing a hot dog stand on the beach to deciding where to build the next Ikea, location crops up all over the place. Not surprisingly, and I think most social scientists would agree, people tend to act in their own self-interest. That is, everyone wants to travel the least distance, spend the least money or expend the least time possible in order to do what they need [or want] to do. For one self-interested person, the solution to a location problem would always be clear, but we live in a world of co-existence and shared resources. There's not just one person going to the movies or shopping at the store; there are several, hundreds, thousands, maybe several hundred thousand. If self-interest is predictable in this small planning exercise of getting together with friends, can we use math and science to leverage it to our advantage? It turns out that the mathematical and statistical techniques that scale to the world's largest and most vexing problems can also be used to address some more everyday issues, such as where to schedule a Meetup event.

With a little abstraction, this scenario looks a lot like a classical problem in operations research: the facility location, or network design, problem. With roots tracing back to the 17th-century Fermat-Weber problem, facility location analysis seeks to minimize the costs associated with locating a facility. In our case, we can define the cost of a Meetup venue as the sum of the distances traveled to the Meetup by its attendees. Other costs could be included, but to start simple, you can't beat straight-line distance.
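For a single venue, minimizing the sum of straight-line distances is the classic geometric-median problem, which Weiszfeld's iterative scheme solves well. As a rough illustration of the idea (this is my own sketch, not the code used in the analysis below, and the function name is invented):

```python
import math

def weiszfeld(points, iters=100):
    """Approximate the geometric median: the point minimizing the sum of
    straight-line distances to all attendee locations (Weiszfeld's method)."""
    # Start at the centroid.
    x = sum(p[0] for p in points) / len(points)
    y = sum(p[1] for p in points) / len(points)
    for _ in range(iters):
        num_x = num_y = denom = 0.0
        for px, py in points:
            d = math.hypot(x - px, y - py)
            if d < 1e-12:          # skip to avoid dividing by zero at a data point
                continue
            num_x += px / d
            num_y += py / d
            denom += 1.0 / d
        x, y = num_x / denom, num_y / denom
    return x, y

# Four attendees at the corners of a square: the best single venue is the center.
print(weiszfeld([(0, 0), (2, 0), (0, 2), (2, 2)]))   # (1.0, 1.0)
```

Note that the geometric median is not the centroid in general; it is less sensitive to attendees who live far away, which is exactly the behavior you want from a "fair" venue.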

So, here's a little history. The data-focused Meetup scene in the DC region is several years old, with the Hadoop DC, Big Data DC and R Users DC (now Statistical Programming DC) Meetups having been founded first. Over the years, as these groups have grown and been joined by many others, their venues have bounced around, mostly among downtown Washington DC, Northern Virginia, and suburban Maryland. Location decisions were primarily driven by supply: what organization would be willing to allow a big crowd to fill its meeting space on a weekday evening? Data Science DC, for instance, has almost always held its events downtown. But as the events have grown, and as organizers have heard more complaints about location, it became clear that venue selection needed to include a demand component as well, and that some events could or should be held outside of the downtown core (or vice versa, for groups that have held their events in the suburbs).

Data Community DC performed a marketing survey at the beginning of 2013 and got back a large enough sample of Meetup attendees to do some real analysis. (See the public survey announcement.) As professional Meetups tend to be on weekday evenings, attendees are likely traveling not just from work or just from home, but as a detour from the usual commute connecting the two. Fortunately, the survey designers asked Meetup attendees to provide both their home and their work ZIP codes, so the data could be used to (roughly) draw lines on a map describing typical commutes.

Commutes for data Meetup attendees, based on ZIP codes.

The Revolutions blog recently presented a similar problem in its post How to Choose a New Business Location with R. The author, Rodolfo Vanzini, writes that his wife's business had simply run out of space in its current location. An economist by training, Vanzini initially assumed that the locations of customers at his wife's business must be uniformly distributed. After further consideration, he concluded that "individuals make biased decisions basing their assumptions and conclusions on a limited and approximate set of rules often leading to sub-optimal outcomes." Vanzini then turned to R, plotted his customers' addresses on a map, and picked a better location based on visual inspection of the distribution.

If you've been paying attention, you'll notice a common thread in the location problems mentioned so far. When you're getting together with friends, you're only going to one movie theater; Vanzini's wife was only locating one school. Moreover, both problems rely on a single location (their home) for each interested party. That's really convenient for a beginner-level location problem, and accordingly it's relatively simple to solve. The Meetup location problem, on the other hand, adds two complexities that make it worthy of the time you're spending here. Principally, a group of 150 boisterous data scientists can easily overstay their welcome by holding monthly meetings at the same place over and over again. Additionally, a single location ensures that the part of the population that drives the farthest must do so for every event. For these reasons, we propose to identify three locations that together minimize the sum of the minimum distances traveled by the entire group. The idea is that Meetup events can rotate among the three optimal locations. This provides diversity in location, which appeases meeting-space hosts, and it also provides a closer meeting location for a rotating third of the attendee population. Not every event will be ideal for everyone, but it will be convenient for everyone at least sometimes.

As an additional complexity, we have two ZIP codes for each person in attendance (work and home), which means that instead of a point-to-point distance computation, we want to minimize the distance to the three meeting locations from the closest point along each commute line. Optimizing location with these two concepts in mind, particularly the n-location component, is substantially more complicated than single-location optimization with just one set of coordinates per attendee.
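Under these assumptions, the cost of a candidate set of venues is the sum, over attendees, of the distance from each commute line (treated as a straight segment between home and work) to the nearest venue. Here is a minimal Python sketch of that cost, with a brute-force search standing in for whatever optimizer was actually used; all function names and candidate points are my own:

```python
import itertools
import math

def seg_point_dist(a, b, p):
    """Distance from point p to the closest point on segment a-b
    (an attendee's home-to-work commute line)."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:                # degenerate: home == work
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))              # clamp to the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def cost(venues, commutes):
    """Sum over attendees of the distance from their commute line
    to the nearest of the candidate venues."""
    return sum(min(seg_point_dist(a, b, v) for v in venues)
               for a, b in commutes)

def best_triple(candidates, commutes):
    """Brute-force the 3-venue combination with minimum total cost."""
    return min(itertools.combinations(candidates, 3),
               key=lambda triple: cost(triple, commutes))
```

With real ZIP-code coordinates the candidate set and search strategy would be far larger, but the objective function is exactly this shape: a min over venues inside a sum over commute segments.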

So, there you have it. To jump ahead to the punchline, the three optimal locations for a data Meetup to host its meetings are Rockville, MD, downtown Washington DC and Southern Arlington, VA.

Gold points are optimal locations. The color gradient shows utility for a single-location solution.

To hear us (Harlan and Alan) present this material, there are two great events coming up. The INFORMS Maryland Chapter is hosting its inaugural Learn. Network. Do. event on October 23 at INFORMS headquarters on the UMBC campus, and Statistical Programming DC will hold its monthly meeting at iStrategy Labs on October 24. Both events will pull back the curtain on the code that was used, and either should send you off with enough knowledge to start tackling your own location optimization problem.

Fantastic presentations from R using slidify and rCharts

Ramnath Vaidyanathan presenting in DC

Dr. Ramnath Vaidyanathan of McGill University gave an excellent presentation on two R projects he leads, slidify and rCharts, at a joint Data Visualization DC/Statistical Programming DC event on Monday, August 19 at nclud. After the evening, all I can say is: Wow! It's truly impressive to see what can be achieved in presentations and information-rich graphics directly from R. Again, wow! (I think many of the attendees shared this sentiment.)


Slidify is an R package that

helps create, customize and share elegant, dynamic and interactive HTML5 documents through R Markdown.

We have blogged about slidify before, but it was great to get an overview directly from its creator. Dr. Vaidyanathan explained that the underlying principle in developing slidify is the separation of content from the appearance and behavior of the final product. He achieves this using customizable HTML5 frameworks, layouts and widgets (he provides several here and through his slidifyExamples R package).

Example RMarkdown file for slidify

You start with a modified R Markdown file, as seen here. This file can contain chunks of R code. It is processed to a pure Markdown file, interlacing the output of the R code into the file, and then split-apply-combined to produce the final HTML5 document. The document can be shared via GitHub, Dropbox or RPubs directly from R. Dr. Vaidyanathan gave examples of how slidify can be used to create interactive quizzes and even interactive documents, using slidify together with Shiny.

One really neat feature he demonstrated is the ability to embed an interactive R console within a slidify presentation. He explained that this uses a Shiny server backend locally, or an OpenCPU backend if published online. This feature changes how presentations can be delivered: instead of bouncing between windows, the presenter can demonstrate code within the presentation itself.


rCharts is

an R package to create, customize and share interactive visualizations, using a lattice-like formula interface

Again, we have blogged about rCharts before, but there have been several advances in the short time since then, both in rCharts itself and in the interactive documents Dr. Vaidyanathan has developed.

rCharts provides a formula-driven interface to several JavaScript graphics frameworks, including NVD3, Highcharts, Polycharts and Vega. This formula interface is familiar to R users and makes the process of creating these charts quite straightforward. Some customization is possible, as is adding basic controls without having to use Shiny. We saw several examples of excellent interactive charts built with simple R commands. There is even a gallery where users can contribute their rCharts creations. There is really no excuse anymore for avoiding these technologies for visualization, and they make life so much more interesting!

Bikeshare maps, or how to create stellar interactive visualizations using R and JavaScript

Dr. Vaidyanathan demonstrated one project which, I feel, shows the power of the technologies he is developing with R and JavaScript. He created a web application using R, Shiny and his rCharts package (which accesses the Leaflet JavaScript library), plus a very little bit of JavaScript magic, to visualize the availability of bicycles at different stations in a bike-sharing network. The application automatically downloads real-time data and can visualize availability in over 100 bike-sharing systems worldwide. He focused on the London bike-share map, which fascinatingly showed how bikes had moved from the city center to the outer fringes at night. Clicking on any dot showed how many bikes were available at that station.

London Bike Share map

Dr. Vaidyanathan quickly demonstrated the basic process of mapping points on a city map, changing their appearance, and adding metadata to each point that appears as a pop-up when clicked.

You can see the full project and how Dr. Vaidyanathan developed this application here.

Interactive learning environments

Finally, Dr. Vaidyanathan showed a new application he is developing using slidify, rCharts and other open-source technologies like OpenCPU and PopcornJS. The application allows him to author a lesson in R Markdown, integrate interactive components (including interactive R consoles), record the lesson as a screencast, sync the screencast with the slides, and publish the result. This seems to me to be one possible future for massive online courses. An example presentation is available here, and the project is hosted here.

Open presentation

The presentation and all the relevant code and demos are hosted on GitHub, and the presentation can be seen (developed using slidify, naturally) here.

Stay tuned for an interview I did with Dr. Vaidyanathan earlier, which will be published here shortly.

Have fun using these fantastic tools in the R ecosystem to make really cool, informative presentations of your data projects. See you next time!!!

Data Science MD July Recap: Python and R Meetup

For July's meetup, Data Science MD was honored to have Jonathan Street of NIH and Brian Godsey of RedOwl Analytics come discuss using Python and R for data analysis.

Jonathan started off by describing the growing ecosystem of Python data analysis tools including Numpy, Matplotlib, and Pandas.

He next walked through a brief example demonstrating Numpy, Pandas, and Matplotlib that he made available with the IPython notebook viewer.

The second half of Jonathan's talk focused on using clustering to identify scientific articles of interest. He needed to a) convert PDFs to text, b) extract sections of each document, c) cluster the articles, and d) retrieve new material.

Jonathan used the PyPDF library for PDF conversion and then used the NLTK library for text processing. For a thorough discussion of NLTK, please see Data Community DC's multi-part series written by Ben Bengfort.

Clustering was done using scikit-learn, which identified seven groups of articles. From these, Jonathan was then able to retrieve other similar articles to read.

Overall, by combining several Python packages to handle text conversion, text processing, and clustering, Jonathan was able to create an automated personalized scientific recommendation system. Please see the Data Community DC posts on Python data analysis tutorials and Python data tools for more information.
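The retrieval step in a pipeline like Jonathan's (finding articles similar to the ones in a cluster you liked) boils down to comparing documents in a vector space. His talk used NLTK and scikit-learn; as a rough standard-library sketch of the underlying idea only (bag-of-words vectors compared by cosine similarity, with all names my own):

```python
import math
from collections import Counter

def vectorize(text):
    """Crude bag-of-words vector; a real pipeline would use NLTK
    tokenization and TF-IDF weighting instead of raw counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = (math.sqrt(sum(c * c for c in u.values())) *
            math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def most_similar(query, corpus):
    """Return the corpus document most similar to the query document."""
    qv = vectorize(query)
    return max(corpus, key=lambda doc: cosine(qv, vectorize(doc)))
```

Clustering works on the same representation: scikit-learn's k-means, for instance, groups these vectors so that new material can be matched against the cluster of articles you already care about.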

Next to speak was Brian Godsey of RedOwl Analytics who was presenting on their social network analysis. He first described the problem of identifying misbehavior in a financial firm. Their goals are to detect patterns in employee behavior, measure differences between various types of people, and ultimately find anomalies in behavior.

In order to find these anomalies, they model behavior based on patterns in communications and estimate model parameters from a data set and set of effects.

Brian then revealed that, while implementing their solution, they developed an R package called rRevelation that allows a user to import data sets, create covariates, specify a model's behavioral parameters, and estimate the parameter values.

To conclude his presentation, Brian demonstrated the package on the well-known Enron data set and discussed how larger data sets require other technologies such as MapReduce.


Slides can be found here for Jonathan and here for Brian.

A Julia Meta Tutorial


If you are thinking about taking Julia, the hot new mathematical, statistical and data-oriented programming language, for a test drive, you might need a little bit of help. In this post we round up some great articles discussing various aspects of Julia to get you up and running faster.

Why We Created Julia

If only you could always read the intentions and thoughts of the creators of a language! With Julia, you can. Jump over here to get the perspectives of four of the original developers: Jeff Bezanson, Stefan Karpinski, Viral Shah, and Alan Edelman.

We are power Matlab users. Some of us are Lisp hackers. Some are Pythonistas, others Rubyists, still others Perl hackers. There are those of us who used Mathematica before we could grow facial hair. There are those who still can’t grow facial hair. We’ve generated more R plots than any sane person should. C is our desert island programming language.

We love all of these languages; they are wonderful and powerful. For the work we do — scientific computing, machine learning, data mining, large-scale linear algebra, distributed and parallel computing — each one is perfect for some aspects of the work and terrible for others. Each one is a trade-off.

We are greedy: we want more.

An IDE for Julia

If you are looking for an IDE for Julia, check out the Julia Studio. Even better, Forio, the makers of this IDE, offer a nice series of beginner, intermediate, and advanced tutorials to help you get up and running.

Julia Documentation

By far the most comprehensive and best source of help and information on Julia is the ever-growing Julia Docs, which includes a Manual for the language (with a useful getting-started guide), details of the Standard Library, and an overview of available packages. Not to be missed are the two sections detailing noteworthy differences from Matlab and from R.

MATLAB, R, and Julia: Languages for data analysis

Avi Bryant provides a very nice overview and comparison of Matlab, R, Julia, and Python. Definitely recommended reading if you are considering a new data analysis language.

An R Programmer Looks at Julia

This post is from mid-2012 so a lot has changed with Julia. However, it is an extensive look at the language from an experienced R developer.

There are many aspects of Julia that are quite intriguing to an R programmer. I am interested in programming languages for "Computing with Data", in John Chambers' term, or "Technical Computing", as the authors of Julia classify it. I believe that learning a programming language is somewhat like learning a natural language in that you need to live with it and use it for a while before you feel comfortable with it and with the culture surrounding it. Read more ...

The State of Statistics in Julia - Late 2012 

Continuing on this theme of statistics and Julia, John Myles White provides a great view of using Julia for statistics which he updated in December of last year.

A Matlab Programmer's Take on Julia - Mid 2012

A quick look at Julia from the perspective of a Matlab programmer and pretty insightful as well.

Julia is a new language for numerical computing. It is fast (comparable to C), its syntax is easy to pick up if you already know Matlab, supports parallelism and distributed computing, has a neat and powerful typing system, can call C and Fortran code, and includes a pretty web interface. It also has excellent online documentation. Crucially, and contrary to SciPy, it indexes from 1 instead of 0.  Read more ...

Why I am Not on the Julia Bandwagon Yet

Finally, we leave you, good reader, with a contrarian view point.

Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need

On May 8, we kicked off the transformation of R Users DC into Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled "Stepping up to big data with R and Python," was an experiment in collective learning, as Marck and I guided a lively discussion of strategies for leveraging the "traditional" analytics stack in R and Python to work with big data.

Rlogo               python-logo

R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn't even try to cover everything, and we didn't believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analytic tasks.

There are several basic tasks we covered in the discussions: data import, visualization, MapReduce, and parallel processing. We noted that, since R is becoming something of a lingua statistica, many commercial products from SAP, Oracle, Teradata, Netezza and the like have developed interfaces that allow R as an analytic backend. Python, meanwhile, has been used to develop integrated analysis platforms, owing to its strengths as a "glue language" and its robust general-purpose and web development capabilities.

Most data scientists have experience with small to medium-sized data. Big Data poses its own challenges in terms of size. Marck made the great point that Big Data is almost never analyzed directly; it is aggregated and summarized first, and the summary data is often not very big. However, we do need to use the available tools a bit differently to deal with large data, given the design choices R and Python developers have made. R has earned a reputation for not being able to handle datasets larger than memory, but users have developed packages like ff and bigmemory to work around this. In our experience, Python reads data much more efficiently (by orders of magnitude) than R, so reading data with Python and piping it to R has often been a solution. Both R and Python have well-established means of communicating with Hadoop, mainly by leveraging Hadoop Streaming, and both have well-developed interfaces to SQL and NoSQL databases. There was a lively discussion of issues around using Big Data from R and Python, particularly with regard to Hadoop.
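The "summarize before you analyze" pattern can be sketched without any special packages: stream the raw file row by row and keep only running aggregates, so memory use stays constant however large the file grows. A minimal Python illustration (the column names are invented for the example):

```python
import csv
import io
from collections import defaultdict

def summarize(lines):
    """Stream a CSV of (group, value) rows and aggregate on the fly,
    returning the mean value per group without holding rows in memory."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in csv.DictReader(lines):
        g = row["group"]
        totals[g] += float(row["value"])
        counts[g] += 1
    return {g: totals[g] / counts[g] for g in totals}

# In practice `lines` would be open("huge.csv"); a StringIO stands in here.
raw = io.StringIO("group,value\na,1\na,3\nb,10\n")
print(summarize(raw))   # {'a': 2.0, 'b': 10.0}
```

The same shape of code works as a Hadoop Streaming reducer, which is one reason Python pairs so naturally with Hadoop for this kind of pre-aggregation.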

There is a basic stack of packages in both R and Python for data analysis, and many more packages for other analytic tasks. Both software platforms have huge ecosystems; so, to try and get you started on discovering many of the tools available for different data scientific tasks, we have developed preliminary maps of each ecosystem (click for a larger view, outlines with links, and to download):

R for big data

Python for big data

In fact, R can be used from within Python using Laurent Gautier's rpy2 package, which has been nicely wrapped in the rmagic magic function in IPython. This allows R to be used from within an IPython notebook. (PS: if you're a Python user and are not using IPython and the IPython notebook, you really should look into them.) There are several ways of integrating R and Python into unified platforms, as I've described earlier.

Our meetup, and the maps above, are intended as a launching pad for your exploration of R and Python for your data analysis needs. We will have video from this meetup available soon (stay tuned). Resources for learning R are widely available on the web. We have described Python's capabilities for data science and data analysis in earlier blog posts, and Ben Bengfort has a series of posts on using Python for Natural Language Processing, one of its analytic strengths. We hope that you will contribute to this discussion in the comments, and we will compile the different tools and strategies you suggest in a future post.


Data Visualization: From Excel to ???

So you're an Excel wizard: you make the best graphs and charts Microsoft's classic product has to offer, and you expertly integrate them into your business operations. Lately you've studied up on all the latest uses of data visualization and dashboards for taking a business to the next level, and you've tried to emulate them with Excel, maybe with some help from the Microsoft cloud, but it just doesn't work the way you'd like. How do you transition your business away from the stalwart of the late 20th century?

If you believe you can transition your business operations to incorporate data visualization, you're likely gathering raw data, maintaining basic information and making projections, all eventually feeding an analysis of alternatives and a final decision for internal and external clients. And it's not just about using the latest tools and techniques: your operational upgrades must actually make it easier for you and your colleagues to execute daily; otherwise it's just an academic exercise.

Google Docs

There are some advantages to using Google Docs over desktop Excel: it lives in the cloud, has built-in sharing capabilities, and offers a wider selection of visualization options. My favorite, though, is that you can reference and integrate multiple sheets from multiple users to create a multi-user network of spreadsheets. If you have a good JavaScript programmer on hand, you can even define custom functions, which is nice when you have particularly lengthy calculations, as spreadsheet formulas tend to be cumbersome. A step further, you could use Google Docs as a database for input to R, which can then be used to set up dashboards for the team using a Shiny Server. Bottom line: Google makes it flexible, allowing you to pivot when necessary, but it can take time to master.

Tableau Server

Tableau Server is a great option if you want to share information across all users in your organization, have access to a plethora of visualization tools, use your mobile device, set up dashboards, and keep your information secure. The question is: how big is your organization? Tableau Server will cost you $1,000 per user, with a minimum of 10 users, plus 20% yearly maintenance. If you're a small shop, your internal operations are likely straightforward enough to be outlined to someone new in a good presentation; Tableau is then like grabbing the whole toolbox to hang a picture, and may be more than necessary. If you're a larger organization, Tableau may accelerate your business in ways you never thought of before.

Central Database

There are a number of database options, including Amazon Relational Database Service and Google App Engine. There are a lot of open-source solutions for either, and they will take more time to set up, but with these approaches you're committing to a future. As you gain more clients and gather more data, you may want to access that data to discover insights you know are there from your experience gathering it. That can be a simple function call from R, and results you like can be turned into a dashboard using any number of languages. You may expand your services and hire new employees, but still want easy access to your historical data to set up new dashboards for daily operations. Even old dashboards may need an overhaul, and being able to access the data from a standard system, as opposed to coordinating a myriad of spreadsheets, makes pivoting much easier.

Centralized vs. Distributed

Google Docs is very much a distributed system where different users have different permissions, whereas setting up a centralized database restricts most people to using your operational system according to your prescription. So when do you consolidate into a single system, and when do you give people the flexibility to use their data as they see fit? It depends, of course. It depends on the time horizon of the data: if the data is no good next week, be flexible; if it is your company's gold, make sure it is in a safe, organized, centralized place. You may want to allow employees to access your company's gold for their daily purposes, and classic spreadsheets may be all they need for that, but when you've made considerable effort to get the unique data you have, make sure it's in a safe place and use a database system you know you can easily come back to when necessary.

Python vs R vs SPSS ... Can't All Programmers Just Get Along?

Programmers have long been very proud and loyal with their tools, and often very vocal. This has led to well-contested rivalries and "fights" about which tool is better:

  • emacs or vi;
  • Java or C++;
  • Perl or Python;
  • Django or Rails;
  • and, for data geeks, the SAS/SPSS/R/Matlab fight.


The truth is, very few of us data geeks (data scientists, data analysts, statisticians, or whatever we call ourselves [editor's note: Data Practitioners]) use only a single tool for all of our work. We will often extract data from a SQL database, munge it using Perl or Python, and then do statistical analysis using R or SAS, reporting the results using Word or, increasingly, the web. Especially for data analysis, there is often no single tool that does the end-to-end workflow well, however much we would like to believe there is. Each tool has its strengths and weaknesses, and often a mixture works best. The trick is finding the right "glue" to string the workflow together.
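That glue is often nothing more exotic than a pipe between processes. As a hedged sketch of the pattern (the child process here is a stand-in that happens to be Python; in a real workflow it might be an Rscript invocation or any other command-line tool):

```python
import subprocess
import sys

# Stand-in "analysis" child process: sums the integers it reads from stdin.
# In practice this command could be ["Rscript", "analyze.R"] or similar.
CHILD_CODE = "import sys; print(sum(int(l) for l in sys.stdin))"

def pipe_to_tool(rows):
    """Munge data in Python, then hand it to an external tool via a pipe
    and capture the tool's output."""
    payload = "\n".join(str(r) for r in rows)
    out = subprocess.run([sys.executable, "-c", CHILD_CODE],
                         input=payload, capture_output=True, text=True)
    return out.stdout.strip()

print(pipe_to_tool([1, 2, 3, 4]))   # 10
```

The appeal of this pattern is that each stage stays in the language best suited to it, with plain text (or CSV) as the lowest-common-denominator interface.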

There are now several interface packages available for talking between open-source languages. I'll speak to the interfaces with R, which I'm most familiar with, but I'm sure that the community will point out other useful interfaces. R is neither the fastest nor the most elegant of languages, but it has by far the richest ecosystem of cutting-edge data analysis packages. There are now ways to communicate with R from other general programming languages like Java (through the rJava package and JNI), Perl (Statistics::R, available on CPAN), and Python (rpy2 and PypeR, available on PyPI). Packages in R allow communication in the other direction, out to general-purpose languages: RSPython and RSPerl (both available at Omegahat) and rJava. Most commercial statistical packages, like SAS, SPSS, and Statistica, allow you to write R code, send it to R, and get back the results. An especially nice SAS macro for doing this, for those without the latest versions of SAS, is %Proc_R, available here. One can also call R from Matlab. There are also many ways of interfacing with R using web-based tools like Rserve or, on Windows, the rcom interface to utilize COM and connect with, among other things, Word and Excel.

More recently I have been excited about platforms where code can be written in different languages and integrated using literate programming (i.e., the weaving of the results of code with text to create reports).

  • Babel is a part of org-mode in Emacs which allows different programming languages to be used in the same document to perform an analysis and report. There are several examples of how this is done.
  • The latest IPython distribution now allows you to integrate other languages using user-contributed magic functions. The initial languages available are R, Octave and, very recently, Julia. The first two are already integrated into IPython. Using these magic functions, you can use the power of R, Octave and Julia along with all the tools available in Python like Numpy, Scipy, matplotlib, pandas and the like on one platform. Literate programming is easily achieved through the excellent HTML notebook that is now part of IPython distributions. Update: A sql magic function was just added to the ecosystem.

The interfacing tools I've described now allow us to create a greater ecosystem where different tools can be integrated toward a common goal rather easily. Instead of fighting over which tool is better, we're moving to a place where that doesn't matter; what matters is being able to use the right tool for each piece of the job and getting the tools working together to do the best job possible. We can, after all, all get along.

PS: For translating code between Matlab/Octave, Python and R, there is a great little site called Mathesaurus.


(Note, DataCommunityDC is an Amazon Affiliate. Thus, if you click the image in the post and buy the book, we will make approximately $0.43 and retire to a small island).

Data Visualization: Shiny Democratization

In organizing Data Visualization DC we focus on three themes: The Message, The Process, The Psychology. In other words, ideas and examples of what can be communicated, the tools and know-how to get it done, and how best to communicate. We know intuitively and from experience that the best communication comes in the form of visualizations, and we know there are certain approaches that are more effective than others. What is it about certain visualizations that stimulate memory? Perhaps because we're naturally visual creatures, perhaps visuals allow multiple ideas to be associated with one object, perhaps visuals bring people together and create a common reference for further fluid discussion. Let's explore these ideas.

Visualization is the Natural Medium

No one really has the answer.  The best visualizers have traditionally been artists, and we know that any given artwork speaks to some and not to others.  Visualizations help you think in new ways and make you ask new questions, but each person will ask different questions; there is no one-size-fits-all.  Visualizations help you have a conversation without even speaking, much the way Khan Academy allows study on your own time.  Trying to turn this into a science is a noble effort, and articles like "The eyes have it" do an excellent job outlining the cutting edge, but when we use visualizations to conduct our work more efficiently, we know the first question in communicating is "who's the audience?"  There are certainly best practices (no eye charts, good coloring, associated references, etc.), but the same information will vary in its presentation for each audience.

Case in point: everyone has to eat, and everyone knows food from their own perspective, so if we want to communicate nutrition facts, why not use your audience's craving for delicious-looking food to draw them into exploring the visualization?  Fat or Fiction does an excellent job of this, and I can tell you I never would have known cottage cheese had such a low nutritional value next to cheddar if they weren't juxtaposed for easy comparison.

Ultimately there is a balance: "If you focus more attention on one part of the data, you are ultimately taking attention away from another part of the data," explains Amitabh Varshney, director of the Institute for Advanced Computer Studies at the University of Maryland, US.  You can attempt to optimize this by hacking your learning, but if you're as curious as I am, you need some way of exploring memory on a regular basis to learn for yourself; it shouldn't have to be a never-ending checklist of best practices.


Personally, I believe that social memes are an example of societal memory: they shape and define our culture, giving us objects to reference in conversation.  Looking at the relationships in the initial force-graph presentation of the meme, I can't help but think of neural patterns, the basis of our own memory.  We're all familiar with this challenge when we meet someone from another country, or another generation, and we draw an analogy with a favorite movie, song, actor, etc.; if the person is familiar with the social meme, the reference immediately invokes thoughts and memories, which we use to continue the flow of the conversation and introduce new ideas.  The Harlem Shake: anatomy of a viral meme captures how memes emerge over time, and allows you to drill down all the way to what actions people took in different contexts.  My goal in studying this chart is to come away with how to introduce ideas for each audience, through visualizations or otherwise, to maximize information retention.

Data Visualization: Shiny Spiced Consulting

If you haven't already heard, RStudio has developed an incredibly easy way to deploy R on the web with its Shiny package. For those who have heard, this really isn't news, as bloggers have been writing about it for some months now, but I have primarily seen a focus on how to build Shiny apps, and I feel it's also important to focus on utilizing Shiny apps for clients.

From Pigeon Hole to Social Web Deployment

I was originally taught C/C++, but I didn't really begin programming until I was introduced to Matlab.  A breath of fresh air: I no longer had to manage memory, and its mathematics and matrix design allowed me to think about the algorithms as I wrote, rather than the code, much as we write sentences in English without worrying too much about grammar.  Removing those human-interpretive layers and allowing the mind to focus on the real challenge at hand had an interesting secondary effect: I eventually began thinking and dreaming in Matlab, and it was easier at times to write a quick algorithm than a descriptive paper. What's more, Matlab had beautiful graphics, which greatly simplified the communicative process, as a good graphic is self-evident to a much larger crowd. Fast forward, and today we are an open-source and socially networked community where the web is our medium.  Social networks are not reserved for Facebook and Twitter; in a way, when you use a new package in R you're "friending" its developer and anyone else who uses it.  For working individually this is a great model, but unfortunately, deploying the power of AI, machine learning, or even simple algorithms over the web-medium required the additional skillset of web programming.  It's not an overly complicated skillset to be proficient at, but like running, biking, or swimming, just because you once ran seven-minute miles doesn't mean you still can after a few years of inactivity.

Democratizing Data

Enter RStudio Shiny, an instant hit.  In the second half of 2012 I worked on a project using D3.js, Spring MVC, and Renjin, the idea being more administrative in that UI developers could focus on UI and algorithm developers could focus on algorithms, perhaps eventually meeting in the middle. I was practically building a custom version of Shiny, and for 90% of the intended user stories, I wish Shiny had been available in early 2012.

Thousands of lines of code were cut by an order of magnitude when implementing in Shiny, and just as Matlab once let me think in terms of algorithms, Shiny is letting me think in terms of communicating with my audience.  If I can plot it in R, I can host it on a Shiny Server. R is already excellent for writing algorithms, and once a framework is written in Shiny, integrating new algorithms or new plots is as simple as replacing function calls. This allows you to quickly iterate between meetings and create an interactive experience that is self-evident to everyone because it's closely related to the conversation at hand. What's more, because it's web based, the experience goes beyond the meeting, and everyone from the CEO to administrative assistants can explore the underlying data, creating a common thread for discussion much like chatting about the Oscars around the proverbial water cooler. Shiny democratizes data.
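A minimal sketch of that "replace the function call" pattern might look like this; `plotFun` is a hypothetical stand-in for whatever you can plot in R, while the rest is standard Shiny:

```r
library(shiny)

# Swap plotFun for any function that draws an R plot, and the
# rest of the app framework stays the same
plotFun <- function(n) hist(rnorm(n), main = paste(n, "random draws"))

ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("main_plot")
)

server <- function(input, output) {
  # Re-renders automatically whenever the slider moves
  output$main_plot <- renderPlot(plotFun(input$n))
}

shinyApp(ui, server)
```

Running this locally serves the app in a browser; hosting the same two objects on a Shiny Server is what puts it in front of the whole audience.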


The response to Shiny has been very positive, and its use is quickly becoming widespread. As Yoda said, "See past the decisions already made we can" (I may be paraphrasing): we can see the next steps for Shiny, including interactive plots, user-friendly reactive functions, easier web deployment with Shiny Servers, and integration of third-party applications such as googleVis and D3.js. With respect to Yoda, Shiny has allowed me to decide that dynamic interaction with data, for the wider data science community, is the clear next step.

Examining Overlapping Meetup Memberships with Venn Diagrams

As of the beginning of 2013, Data Community DC ran three Meetup groups: Data Science DC, Data Business DC, and R Users DC. We've often wondered how much these three groups overlapped. In this post, I'm going to show you two answers from two different sources of data. And I'm going to illustrate these results with Euler diagrams, which are similar to the familiar Venn diagrams you learned about in school. I showed these graphs at the January 28th Data Science DC Meetup, and quickly walked through the technical steps of processing the data and making graphs at the February 11th R Users DC Meetup. The R code used in that presentation is available from GitHub.

The first source of data is a community survey we did in January. Among other interesting questions, some of which we'll talk about on this blog, we asked about membership and level of attendance in the three Meetup groups. When those results are processed, we get the following illustration of how the groups overlap:

Euler diagram of DC2 Meetup membership overlap, based on January 2013 survey data.

Another way to answer that question is to use the rich data available from the Meetup API. When we pull the data and calculate the overlaps based on Meetup's unique ID for each person, we get the following similar but not identical story:

Euler diagram of DC2 Meetup membership overlap, based on January 2013 Meetup API data.

Why the different stories? Different biases. The survey data is based on volunteer responses, a not-fully-representative subset of the 2000-plus members of our broader community. In particular, people who are most dedicated to attending Meetup events and networking with their professional community were presumably most likely to respond to the survey. (As were people desperate to try winning a book from generous sponsor O'Reilly Media!) It's unsurprising that these people overlap more strongly than those who have just signed up for the Meetup group on the web site at some point.

But the Meetup API data, although technically complete, does not necessarily answer the question we want to know either. We are mostly interested in understanding the overlap among people who at least occasionally attend events. Many people sign up for a Meetup group but never attend an event or ever again interact with the site. Some people RSVP to every event but never show up. The set of people we really want to be counting is difficult to define based solely on the Meetup API data.

So, overall, we think the answer lies somewhere between the above two graphs. DBDC overlaps strongly with DSDC, and RUDC somewhat less so. A relatively small set of people, probably less than a quarter, belong to all three. (Some crazy people belong to many Meetup groups -- I currently am a member of 20 groups, and go to 8 or 10 of them at least occasionally.)

It's also worth quickly noting another source of error in the visualization. Euler diagrams cannot, in general, be drawn perfectly accurately with circles alone when the number of sets exceeds two. Sizing and positioning the circles is a constrained optimization problem, requiring a solution that minimizes overstating or understating overlap. Leland Wilkinson does a good job of explaining the issues and describing an algorithm; his code was used to draw the illustrations above, and his paper on the topic is linked below. Briefly, Wilkinson defines a loss metric called stress, which is essentially the extent to which the graphical overlap in the circles differs from the counted overlap in the data. A quasi-gradient descent technique is used to first roughly, then more precisely, minimize stress and approach the best-possible layout. The method also allows statistical analysis; Wilkinson assumes that the data is sampled with normal error, which allows a test to determine if the fitted illustration is statistically significantly better than a random layout. In our case, the illustrations are unambiguously better: for the survey data layout, [latex]s = 0.002 < 0.056 = s_{.01}[/latex],

and for the Meetup data layout,

[latex]s = 0.001 < 0.056 = s_{.01}[/latex].
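Wilkinson's algorithm is available in R through packages such as venneuler (which wraps his code) and the newer eulerr; a rough sketch of fitting and inspecting a layout, with entirely made-up membership counts, looks like this:

```r
library(eulerr)

# Hypothetical overlap counts for three Meetup groups
fit <- euler(c("DSDC" = 400, "DBDC" = 150, "RUDC" = 200,
               "DSDC&DBDC" = 90, "DSDC&RUDC" = 110,
               "DBDC&RUDC" = 30, "DSDC&DBDC&RUDC" = 25))

# Stress: how far the circles' graphical overlap is from the counted overlap
fit$stress

# Draw the area-proportional Euler diagram
plot(fit)
```

The stress value reported by the fit is the same loss metric described above, so a layout near zero is reproducing the counted overlaps faithfully.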

Got an Euler/Venn diagram that you're particularly proud of? Post a link in the comments!

Wilkinson, L. (2012). Exact and approximate area-proportional circular Venn and Euler diagrams. IEEE Transactions on Visualization and Computer Graphics, 18(2), 321-331.



Data Visualization: Graphics with GGPlot2

Basic plots in R using standard packages like lattice work for most situations where you want to see trends in small data sets, such as your simulation variables, which makes sense considering that lattice began with Bell Labs' S language.  However, when we need to summarize and communicate our work with those primarily interested in the "forest" perspective, we use tools like ggplot2.  In other words, the difference between lattice and ggplot2 is the difference between understanding data and drawing pictures.

You can learn all about ggplot2 by downloading the R package and reading, but even Hadley Wickham, author of ggplot2, thinks going through the R help documentation will "drive you crazy!"  To alleviate stress, we've compiled references, examples, documentation, blogs, books, groups, and commentary from practitioners who use ggplot2 regularly. Enjoy!

GGplot2 is an actively maintained open-source chart-drawing library for R based upon the principles of the "Grammar of Graphics", hence the "gg".  The Grammar of Graphics was written for statisticians, computer scientists, geographers, research and applied scientists, and others interested in visualizing data.  GGplot2 can be generalized as layers composed of: a data set, mappings and aesthetics (position, shape, size, color), statistical transforms, and scaling.  To better wrap our minds around how this applies to ggplot2, we can take Hadley's tour or attend one of his events.  The overall goal is to automate graphical processes and put more resources at our fingertips; below are some great works from practitioners.
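To make the layer idea concrete, here is a minimal sketch using ggplot2's bundled mtcars data; the particular geoms, palette, and labels are illustrative choices, not prescriptions:

```r
library(ggplot2)

# One layered plot: data, aesthetic mappings, geoms, a statistical
# transform (linear smoothing), and an explicit scale, chained with +
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +                     # points layer
  geom_smooth(method = "lm", se = FALSE) +   # per-group linear fit
  scale_color_brewer(palette = "Dark2") +    # color scale layer
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")
```

Each `+` adds one layer, so swapping `geom_point` for another geom, or adding a `facet_wrap`, changes the picture without touching the rest of the specification.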

London Bike Routes

The London bike routes image is built with three layers: building polygons, waterways and lakes, and bike routes.  The route data itself is a count of the number of bikes, as well as their positions, rendered as thickness and color intensity in yellow, a nice contrast to the black and grey of the city map.  I enjoy this dataviz because you can imagine yourself trying to get around on a bicycle in London.

Raman Spectroscopic Grading of Gliomas

The background of this work is the classification of tumour tissues using their Raman spectra; a detailed discussion can be found in C. Beleites et al.  Gliomas are the most frequent brain tumours, and astrocytomas are their largest subgroup. These tumours are treated by surgery; however, the exact borders of the tumour are hardly visible, hence the need for new tools to help the surgeon find the tumour border. A grading scheme is given by the World Health Organization (WHO).

TwitteR Package

Curious about your influence on Twitter?  Want to see how your messages resonate within and outside your network?  Here is a great website that goes through many examples of using the TwitteR package in R, including the following ggplot2 code, which creates the chart at right:

[code lang="R"]require(ggplot2)[/code]





The ggplot2 interface is interesting because you're using the + operator, thus manifesting the Grammar of Graphics concept of layers.

This final example, using sentencing data for local courts, easily breaks the data down by the demographics committing different classes of crimes.  As above, the R code is very simple and follows the layering paradigm:


[code lang="R"]ggplot(iw, aes(AGE, fill = sex)) + geom_bar()[/code]



The (near) Future of Data Analysis - A Review

co-organizes Data Business DC, among many other things. Hadley Wickham, having just taught workshops in DC for RStudio, shared with the DC R Meetup his view on the future, or at least the near future of Data Analysis. Herein lies my notes for this talk, spiffed up into semi-comprehensible language. Please note that my thoughts, opinions, and biases have been split, combined, and applied to his. If you really only want to download the slides, scroll to the end of this article.

As another legal disclaimer, one of the best parts of this talk was Hadley's commitment to making a statement, or, as he related, "I am going to say things that I don't necessarily believe but I will say them strongly."

You will also note that Hadley's slides are striking even at 300px wide ... evidence that the fundamental concepts of data visualization and graphic design overlap considerably.

Data analysis is a heavily overloaded term, with different meanings to different people.

However, there are three sets of tools for data analysis:

  1. Transform, which he equated to data munging or wrangling;
  2. Visualization, which is useful for raising new questions but, as it requires eyes on each image, does not scale well; and
  3. Modeling, which complements visualization and where you have made a question sufficiently precise that you can build a quantitative model. The downside to modeling is that it doesn't let you find what you don't expect.


Now, I have to admit I loved this one. Data analysis is "typing not clicking." Not to disparage all of those Excel users out there, but programming or coding (in open source languages) allows one to automate processes, make analyses reproducible, and even communicate your thoughts and results, even to "future you."  You can also throw your code on Stack Overflow for help or to help others.

Hadley also described data analysis as much more cogitation time than CPU execution time. One should spend more time thinking about things than actually doing them.  However, as data sets scale, this balance may shift a bit ... or one could argue that the longer it takes to run your analysis, the more thought you should put into the idea and code before it runs for days, as the penalty for doing the wrong or an incorrect analysis grows. Luckily, we aren't quite back to the days of the punchcard.

Above is a nice way of looking at some of the possible data analysis tool sets for different types of individuals. To put this into the vernacular of the data scientist survey that Harlan, Marck, and I put out, R+js/python would map well to the Data Researcher; R+sql+regex+xpath could map to the Data Creative; and R+java+scala+C/C++ could map to the Data Developer.  Ideally, one would be a polyglot and know languages that span these categorizations.

Who doesn't love this quote? The future (of R and data analysis) is here in pockets of advanced practitioners. As ideas disperse through the community and the rest of the masses catch up, we push forward.

Communication is key ...

but traditional tools fall short when communicating data science ideas and results and methods. Thus, rmarkdown gets it done and can be quickly and easily translated into HTML.

Going one step further but still coupled to rmarkdown is a new service, RPubs, that allows one click publishing of rmarkdown to the web for free. Check it out ...

If rmarkdown is the Microsoft Word of data science, then Slidify is comparable to PowerPoint (and it is free), allowing one to integrate text, code, and output powerfully and easily.

While these tools are great, they aren't perfect.  We are not yet at a point where our local IDE is seamlessly integrated with our code versioning system, our data versioning system, our environment and dependency versioning system, our publishing/broadcasting and results-generating systems, or our collaboration systems.

Not there yet ...


Basically, Rcpp allows you to embed C++ code in your R code easily.  Why would someone want to do that? Because it allows you to easily circumvent the performance penalty of for loops in R; just write them in C++.
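A small sketch of what that looks like with Rcpp's `cppFunction` helper; the `sumSquares` function is a made-up example, not anything from the talk:

```r
library(Rcpp)

# Define a C++ function from within R; the loop runs at C++ speed
# instead of paying R's per-iteration interpreter overhead
cppFunction('
  double sumSquares(NumericVector x) {
    double total = 0;
    for (int i = 0; i < x.size(); ++i) {
      total += x[i] * x[i];
    }
    return total;
  }
')

sumSquares(c(1, 2, 3))  # 14
```

After the `cppFunction` call, `sumSquares` is an ordinary R function that can be benchmarked against a pure-R loop.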

On a personal rant, I don't think mixing in additional languages is necessarily a good idea, especially C++.

Notice the units of microseconds.  There is always a trade-off between the time spent optimizing your code and the time spent running your slow code.

Awesome name, full stop. Let's take two great tastes, ggplot2 and D3.js, and put them together.

If you don't know about ggplot2 or the Grammar of Graphics, click the links!

D3 allows one to make beautiful, animated, and even interactive data visualizations in javascript for the web. If you have seen some of the impressive interactive visualizations at the New York Times, you have seen D3 in action.  However, D3 has a quite steep learning curve, as it requires an understanding of CSS, HTML, and javascript.

As a comparison, what might take you a few lines of code to create in R + ggplot2 could take you a few hundred lines of code in D3.js.  Some middle ground is needed, allowing R to produce web-suitable, D3-esque graphics.

P.S. Just in case you were wondering, r2d3 does not yet exist. It is currently vaporware.

Enter shiny, which allows you to make web apps with R hidden as the back end, generating .pngs that are refreshed, potentially with an adjustable parameter input from the same web page. This doesn't seem to be the Holy Grail everyone is looking for, but it is moving the conversation forward.

One central theme was the idea that we want to say what we want and let the software figure out the best way to do it. We want a D3-type visualization, but we don't want to learn five languages to build it. This applies equally on the data analysis side, for data sets ranging over many orders of magnitude.

Another theme was that the output format of the future is HTML5.  I did not know this, but RStudio is basically a web browser; everything is drawn using HTML5, js, and CSS.

Loved this slide because who doesn't want to know?!

dplyr is an attempt at a grammar of data manipulation, abstracting the back end that crunches the data away from the description of what someone wants done (and no, SQL is not the solution to that problem).
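As a rough sketch of that grammar using today's dplyr verbs on R's built-in mtcars data (the specific summary is illustrative):

```r
library(dplyr)

# Say what you want done, not how: the verbs describe the manipulation,
# and the back end (in-memory here, but it could be a database) decides
# how to execute it
summary_tbl <- mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(avg_hp = mean(hp), n = n()) %>%
  arrange(desc(avg_hp))

print(summary_tbl)
```

The same pipeline runs unchanged against a remote SQL table via dplyr's database back ends, which is exactly the abstraction the slide was describing.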

And this concludes what was a fantastic talk about The (near) Future of Data Analysis. If you've made it this far and still want to download Hadley's full slide deck or Marck's introductory talk, look no further: