As of the beginning of 2013, Data Community DC ran three Meetup groups: Data Science DC, Data Business DC, and R Users DC. We've often wondered how much these three groups overlapped. In this post, I'm going to show you two answers from two different sources of data. And I'm going to illustrate these results with Euler diagrams, which are similar to the familiar Venn diagrams you learned about in school. I showed these graphs at the January 28th Data Science DC Meetup, and quickly walked through the technical steps of processing the data and making graphs at the February 11th R Users DC Meetup. The R code used in that presentation is available from GitHub.

The first source of data is a community survey we did in January. Among other interesting questions, some of which we'll talk about on this blog, we asked about membership and level of attendance in the three Meetup groups. When those results are processed, we get the following illustration of how the groups overlap:

Another way to answer that question is to use the rich data available from the Meetup API. When we pull the data and calculate the overlaps based on Meetup's unique ID for each person, we get the following similar but not identical story:

Why the different stories? Different biases. The survey data is based on volunteer responses, a not-fully-representative subset of the 2000-plus members of our broader community. In particular, people who are most dedicated to attending Meetup events and networking with their professional community were presumably most likely to respond to the survey. (As were people desperate to try winning a book from generous sponsor O'Reilly Media!) It's unsurprising that these people overlap more strongly than those who have just signed up for the Meetup group on the web site at some point.

But the Meetup API data, although technically complete, does not necessarily answer our question either. We are mostly interested in understanding the overlap among people who at least occasionally attend events. Many people sign up for a Meetup group but never attend an event or interact with the site again. Some people RSVP to every event but never show up. The set of people we really want to count is difficult to define based solely on the Meetup API data.
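For readers curious about the counting step itself, here is a minimal Python sketch of how overlaps can be tallied once each group's member IDs are in hand. It is not the R code from the presentation; the function name `region_counts` and the member IDs are made up for illustration, and it assumes the IDs have already been pulled from the Meetup API into one set per group.

```python
def region_counts(groups):
    """Count people in each exclusive region of an Euler diagram.

    `groups` maps a group name to a set of member IDs. The result maps a
    frozenset of group names to the number of people who belong to exactly
    those groups and no others.
    """
    # Invert the mapping: for each member, collect the groups they belong to.
    membership = {}
    for name, ids in groups.items():
        for member_id in ids:
            membership.setdefault(member_id, set()).add(name)

    # Tally members by their exact combination of groups.
    counts = {}
    for names in membership.values():
        key = frozenset(names)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical member IDs, invented purely for illustration.
groups = {
    "DSDC": {1, 2, 3, 4, 5, 6},
    "DBDC": {4, 5, 6, 7},
    "RUDC": {3, 6, 8, 9},
}

counts = region_counts(groups)
for names, n in sorted(counts.items(), key=lambda kv: sorted(kv[0])):
    print(" & ".join(sorted(names)), n)
```

Counting exclusive regions (people in exactly these groups) rather than raw pairwise intersections is a design choice: exclusive regions sum to the total number of distinct people, which is the form an area-proportional diagram needs. Restricting to people who actually attend events would mean filtering the ID sets before this step.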

So, overall, we think the answer lies somewhere between the above two graphs. DBDC overlaps strongly with DSDC, and RUDC somewhat less so. A relatively small set of people, probably less than a quarter, belong to all three. (Some crazy people belong to many Meetup groups -- I currently am a member of 20 groups, and go to 8 or 10 of them at least occasionally.)

It's also worth quickly noting another source of error in the visualization. Euler diagrams cannot, in general, be drawn perfectly accurately with circles alone when the number of sets exceeds two. Sizing and positioning the circles is a constrained optimization problem, requiring a solution that minimizes overstating or understating overlap. Leland Wilkinson does a good job of explaining the issues and describing an algorithm; his code was used to draw the illustrations above, and his paper on the topic is linked below. Briefly, Wilkinson defines a loss metric called *stress*, which is essentially the extent to which the graphical overlap in the circles differs from the counted overlap in the data. A quasi-gradient descent technique is used to first roughly, then more precisely, minimize stress and approach the best-possible layout. The method also allows statistical analysis; Wilkinson assumes that the data is sampled with normal error, which allows a test to determine if the fitted illustration is statistically significantly better than a random layout. In our case, the illustrations are unambiguously better: for the survey data layout, [latex]s = 0.002 < 0.056 = s_{.01}[/latex], and for the Meetup data layout, [latex]s = 0.001 < 0.056 = s_{.01}[/latex].
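To make the idea of stress a little more concrete, here is a rough Python sketch of a stress-like loss. It is my simplified reading of the idea, not Wilkinson's code: it assumes stress is the residual sum of squares from regressing the drawn region areas on the region counts through the origin, divided by the total sum of squares of the areas. The function name `stress` and the numbers below are illustrative only, and computing the region areas from actual circles is left out.

```python
def stress(counts, areas):
    """Stress-like loss between region counts and drawn region areas.

    Assumption: areas are regressed on counts through the origin, and the
    loss is the residual sum of squares over the total sum of squares.
    A value of 0 means every area is exactly proportional to its count.
    """
    if len(counts) != len(areas):
        raise ValueError("counts and areas must have the same length")

    # Least-squares slope for a regression through the origin.
    sum_xx = sum(x * x for x in counts)
    sum_xy = sum(x * y for x, y in zip(counts, areas))
    beta = sum_xy / sum_xx

    # Residual and total sums of squares.
    sse = sum((y - beta * x) ** 2 for x, y in zip(counts, areas))
    sst = sum(y * y for y in areas)
    return sse / sst


# Made-up region counts and areas, for illustration only.
perfect = stress([2, 2, 1], [4.0, 4.0, 2.0])   # areas exactly proportional
distorted = stress([1, 1], [1.0, 3.0])         # equal counts, unequal areas
print(perfect, distorted)
```

A layout optimizer would repeatedly nudge circle centers and radii, recompute the areas, and keep changes that lower this number.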

Got an Euler/Venn diagram that you're particularly proud of? Post a link in the comments!