Data Science DC's hottest topic is: Hulls?!

This guest blog post on ROCs was spurred by a conversation in the Q&A at Data Science DC’s June 16th Meetup on “Predicting Topics and Sharing in Social Media”. John Kaufhold, managing partner of Deep Learning Analytics, asked Bill Rand, Assistant Professor of Marketing at University of Maryland, about ROCs and convex hulls. In the post, Dr. Kaufhold satirizes data science moments lost in Q&A, talks ROC curves, and discusses the value of error bars in visualizing data science results.

DC Data Science’s hottest topic is: Hulls?!


A DC2 regular with Corey Hart’s 1980s hair asked the age-old question: “do these ROCs look right?”

This question has everything: Convex confusion, ROCs, no error bars, anti-kaggle, assistant professor of marketing Bill Rand, and a meetup industry vs. academia proxilike debate.

A proxilike debate? It’s that thing of when someone comments something on Meetup and then someone answers and the crowdsourced likes of each answer become a proxy debate to choose who to believe. Because data science is a democracy. Like all science.


First, let’s scope the relative frequency of the terms to get some sanity check on how esoteric this question about these “ROCs” is.

Because we’re data scientists, we do a quick data-driven Google trends check on common performance metrics for binary classification problems. We sprinkle in: “roc curve”, “precision recall”, "receiver operating characteristic" (a google autocomplete!), “data science” (to calibrate), and “F score”, a popular summary metric for detection problems. In fact, the previous presenter might have even mentioned F-scores.

We see that until somewhere between July and September, 2013, ROCs were more popular than all of data science! Whoa. That can’t be right. So we type it in again—same thing! Whoa. There’s an ROC collision with Middle Earth eagles, but we let it slide because the google autocompleted.

At first glance, this question passes the sanity check for “useful to know.” We’ll probably see ROCs again in data science presentations, and we might even remember seeing them in coursework, so it’s probably worth the 101 Tom Fawcett’s written for us. So what do convex hulls have to do with ROCs, anyway?

Convex confusion: Start with a teachable moment about a simple property of a common performance evaluation tool in data science—the ROC. Then pose a question in the direction of that teachable moment that doesn’t connect with the presenter. This question on ROC curves arguably turned into a real time “choose your own adventure” question about monotonicity (substituting monotonicity for convexity in the answer) to create some misunderstanding about both ROCs and convex hulls. To sow even more confusion, the question was pushed to the Meetup event comments and provided yet more food for thought on understanding ROCs:

On convex hulls of ROCs, see section 6. This Fawcett paper's also a good reference for anyone who's looking to present ROCs in a paper

To which the presenter agreed:

Hi Jay, Great reference, which I point my students to. I apologize, for some reason when I heard convex hull during the Q&A, I was thinking monotonic, our ROC curves are monotonic but not convex, and Fawcett also indicates that ROC curves don't have to be convex. However, the Fawcett work goes on to say you can simply remove points not on the convex hull because they are suboptimal.

Which was great, but this part:

Since we were interested in the actual performance comparison on all datapoints, not the potential performance, we left in all points.

Still means there's more work to do. We will get to this at the end. Let’s first distill three lessons just from the Q&A between the audience and presenter (an ounce of prevention). When there’s a microphone involved, always start with Lesson 0: a Q&A session is not an invitation for audience members to give speeches. Lesson 0 applies to life in general, not just data science presentations. That is, the presenter should be able to easily repeat your question for most of the audience to understand, even though you’re the one holding the Gauntlet of Power that is the microphone. In other words, get to the crux of your question as concisely as you can and relate it to the presentation given, not your favorite unrelated esoteric topic. The audience came to hear the speaker talk, not you.

In this case, the presenter first backed up to the slide with ROCs so the whole audience was literally on the same page, which is a best practice. The question about ROC curves was then asked.

“I don’t know if it’s a function of the data, but I did notice they’re not on the convex hull. Is there something special about your data that makes them wiggle like that?”

[i.e., why don’t your ROC points lie on their convex hulls?] In this open question, the substance behind the answer about ROC convex hulls could go in a number of directions. Some are: (1) Which curves (i.e. algorithms) really dominate? (2) Is this ROC artifact an issue of too little evaluation data? (3) Are too few thresholds used to draw the curve? (4) What’s the external standard you’re comparing to? (5) Is this just a plotting artifact of the way the curve’s drawn as a polyline? Discussions on these might lead to some other worthy questions like: (6) Is the false alarm (FA) rate an independent variable that can be held constant and we simply bin the FA axis and get error bars per FA rate? (7) Is the independent variable in an ROC really the threshold? If so, (8) how do we compute error bars to assess actual performance of a particular classifier? (9) Are ROCs the right visualization tool?

We didn’t discuss this substance in the Q&A, though the question was only about why the ROC curve didn’t fall on the convex hull of the ROC points. So data science lesson one: the question needs to be concise, but also give the presenter enough information to know what you’re really getting at. In this case, the question failed to connect sufficiently with the presenter. Data science lesson two is important, too: when asked about convex hulls, answer a question about convex hulls, not, for instance, about monotonicity (or even convex functions, which are different). And especially if you’re answering a question about monotonicity, don’t say you’re talking about convexity. Once a presenter says points that don’t lie on a convex hull do lie on a convex hull, the person asking the question can choose to either put the presenter on the defensive and correct the presenter in front of the audience, or find another way to visibly bring the question to the community. A best practice for a presenter to prevent this confusion is to repeat the question for the audience to confirm that the presenter understands the question, and the audience knows what question the presenter’s answering.

ROCs: First a brief overview of what ROCs are and how they work for anyone who doesn’t know:

Figure courtesy Wikipedia: The ROC curve is in the lower panel, where PFA is P(FP) and our PD is P(TP). Overlapping distributions of positive (P) and negative (N) test example output scores are shown in the top left, where the vertical line is the threshold to declare a test example positive. As the top left vertical line moves to the right (i.e. increasing the score required to declare a positive), the operating point on the ROC curve at the bottom moves to the lower left (fewer detections and fewer false alarms). As the top left vertical line moves to the left (i.e. lowering the threshold to declare a positive), both the FA rate and the detection rate go up. One threshold produces a single “contingency table” or “confusion matrix” as in the top right, where the cells in the matrix are filled in with the corresponding colored areas under the distributions at left. More ROC discipline and practical wisdom than most people incorporate into talks or papers can be found in Fawcett’s “ROCs 101”.
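For anyone who'd rather see the threshold sweep than read about it, here's a minimal sketch in plain Python. The eight scores and labels are made up for illustration (they are not the presenter's data), and `roc_points` is a hypothetical helper name, not a function from any library. Each distinct score is used once as a threshold, and an example is declared positive when its score is at or above that threshold.

```python
def roc_points(scores, labels):
    """Return (PFA, PD) pairs from (0, 0) to (1, 1), one per distinct threshold.

    labels are 1 for positive and 0 for negative examples.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # A threshold above every score declares nothing positive.
    points = [(0.0, 0.0)]
    # Lowering the threshold moves the operating point up and to the right.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data, assumed for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

pts = roc_points(scores, labels)
for pfa, pd in pts:
    print(f"PFA={pfa:.2f}  PD={pd:.2f}")
```

With 4 positives and 4 negatives, every point lands on a grid with spacing 1/4 on each axis, and connecting the dots in order gives the staircase look.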

Let’s first make an assumption—bad habit, but let’s do it. Based on the kinky staircase way the curves appeared, let’s assume parametric distributions were not fit to the data and then thresholded, which would typically lead to smoother curves. This is a good thing because that practice introduces a “parametric fit error” that can be useful or misleading (misleading more often than not, IMHO). But let’s assume that these distributions are really PMFs (not PDFs). The presenter said “We’re not predicting a curved line between points,” and while that doesn’t confirm the assumption, it’s enough of a hand wave in the direction of “we just thresholded and connected dots.”

Next, some standard design choices when plotting ROCs are: (1) how many thresholds are there (the presenter does mention this), (2) how many test set examples of each class are there, (3) how do you connect the dots, and (4) how are these validated?

For the first, how closely does your ROC polyline approximate the binning of your distributions? More thresholds typically mean more points on the ROC curve, and you shouldn’t need more thresholds than you have points. To get a “more continuous” curve, use more thresholds. The shape of the curve is also a function of the ROC grid spacing. ROC PFA and PD points will lie on a grid, where the grid spacing is 1/number of positive examples for the PD axis and 1/number of negative examples for the PFA axis. Lastly, how do you “connect the dots”? If you do it straightforwardly, sometimes you end up with the staircase pattern in the curves in the talk. If you pick only the convex hull of the set of points on each ROC, the resulting curves start looking more like the ROC curves you see drawn in your machine learning, statistics or pattern recognition classes.
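Picking out the convex hull of a set of ROC points can be sketched in a few lines. This is one way to do it (a monotone-chain upper hull), not necessarily how Fawcett or any plotting package does it. The staircase points below are toy values assumed for illustration, on a grid of 1/4 per axis (4 positives, 4 negatives), and `roc_convex_hull` is a hypothetical name.

```python
def _cross(o, a, b):
    """z-component of (a - o) x (b - o); negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC (PFA, PD) points, running from (0, 0) to (1, 1)."""
    # The trivial classifiers (declare nothing, declare everything) are always available.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the previous point while it lies on or below the line to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Toy staircase ROC, assumed for illustration only.
staircase = [
    (0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.25, 0.5), (0.5, 0.5),
    (0.5, 0.75), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0),
]

hull = roc_convex_hull(staircase)
print(hull)
```

The points that get dropped, like (0.25, 0.5) and (0.5, 0.5) here, are the ones sitting under the hull: the inside corners of the staircase.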

No error bars: “Significantly zig zagging ROCs typically point to some issue in performance evaluation, specifically confidence estimation.” In retrospect, that would have been an innocuous and better way to jump start this discussion with the presenter. In any event, we saw zig-zags in the ROCs. So what does this mean? A folk theorem in data science is that 80% of the work is wrangling. I think that’s about right. 80% might even be conservative. But I’d argue an often missed data science corollary is the other 80% of the work is clearly and concisely communicating your key results to stakeholders via visualizations.

One way to think about these error bars is in terms of validation. To validate, imagine splitting all your test points into “folds”: maybe 10 random draws of a fraction of the test points with replacement, maybe 5 mutually exclusive partitions of a random permutation of the test data. The choice doesn’t usually matter with enough data. Then compute an ROC for each “fold” and overlay them on your plot. This will start getting at the “practical errorbars” of your ROCs.
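Here's a rough sketch of the "with replacement" flavor of that idea. Everything in it is an assumption for illustration: the scores are synthetic Gaussian draws, the resampling is done within each class (so every fold keeps both classes), each fold is a full-size resample rather than a fraction, and `bootstrap_rocs` and `pd_at` are hypothetical helper names. It isn't what the presenter did, just one way to get a spread.

```python
import random


def roc_points(scores, labels):
    """(PFA, PD) pairs, one per distinct threshold, from (0, 0) to (1, 1)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def bootstrap_rocs(scores, labels, n_folds=10, seed=0):
    """One ROC per fold, each fold drawn with replacement within each class."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    curves = []
    for _ in range(n_folds):
        p = rng.choices(pos, k=len(pos))
        n = rng.choices(neg, k=len(neg))
        curves.append(roc_points(p + n, [1] * len(p) + [0] * len(n)))
    return curves


def pd_at(curve, pfa):
    """Best PD on the curve among operating points with false alarm rate <= pfa."""
    return max(t for f, t in curve if f <= pfa)


# Synthetic classifier scores, assumed for illustration only.
data_rng = random.Random(1)
scores = [data_rng.gauss(1.0, 1.0) for _ in range(50)] + \
         [data_rng.gauss(0.0, 1.0) for _ in range(50)]
labels = [1] * 50 + [0] * 50

curves = bootstrap_rocs(scores, labels)
pd_spread = [pd_at(c, 0.2) for c in curves]
print("PD at PFA <= 0.2 ranges from", min(pd_spread), "to", max(pd_spread))
```

Overlaying the ten curves on one plot shows the wiggle directly; the min and max of PD at a fixed false alarm rate is one crude way to turn that overlay into an error bar.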

Often, in big “kaggle-esque” shared task competitions, there are no errorbars. Maybe there are enough points that these curves for different folds will basically fall on top of each other, i.e. for large enough n and iid enough samples, all the folds produce approximately the same curve. Maybe they’re just lazy. Maybe kaggle competitions focus only on a small part of the ROC space. Maybe there’s only one test set held out to simplify leaderboards. Once you split that set into folds, the error bars, themselves, complicate the competition.

For instance, what if the winner’s got big errorbars? And second place is just a little behind in average performance, but has really tiny error bars? Then you’re in a bias versus variance debate! Oh the humanity! In that case, Americans discard the error bars and declare the best mean performance the winner. The French start a quarrel that usually ends in a duel.

Antikaggle: Kaggle competitions typically give everyone the same training data and then compare performance on a common held out test set where a disinterested third party objectively compares each participant’s results. Consider first how on kaggle there’s typically one “metric” the kaggler optimizes.

In ROCs, there are at least two—and even three now that error bars are out of the bag (couldn’t resist!). An ROC curve that dominates is above and to the left of all points on all other ROC curves. Maybe this happened in what we saw. This is related to not having some standard benchmark performance curve to compare to. If that benchmark curve really was plotted with the ones the presenter wanted to highlight, I suspect it would not have been so zig zaggy, and would have raised questions like this earlier.
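"Above and to the left" can be checked mechanically. The sketch below uses a deliberately simple, pointwise reading of dominance (an assumption, and stricter than comparing interpolated curves): curve A dominates curve B if every operating point on B is matched by some point on A with no more false alarms and no fewer detections. The curves are toy values and `dominates` is a hypothetical name.

```python
def dominates(curve_a, curve_b):
    """True if every point of curve_b has a point of curve_a above and to its left."""
    return all(
        any(fa <= fb and ta >= tb for fa, ta in curve_a)
        for fb, tb in curve_b
    )


# Toy ROC curves as (PFA, PD) points, assumed for illustration only.
a = [(0.0, 0.0), (0.0, 0.5), (0.5, 1.0), (1.0, 1.0)]
b = [(0.0, 0.0), (0.25, 0.5), (0.75, 1.0), (1.0, 1.0)]
c = [(0.0, 0.0), (0.1, 0.8), (1.0, 1.0)]
d = [(0.0, 0.0), (0.0, 0.3), (0.6, 1.0), (1.0, 1.0)]

print(dominates(a, b))  # a is everywhere at least as good as b
print(dominates(b, a))
print(dominates(c, d), dominates(d, c))  # crossing curves: neither dominates
```

When curves cross, like `c` and `d`, neither dominates, and which one is "better" depends on which part of the ROC you care about.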

The deeper question is: how does this ROC analysis compare to how others have solved this problem? How confident are you? How does this ROC curve analysis demonstrate that? I forget if we saw.

Marketing professors: Anyone can learn, even marketing professors.

A proxilike debate: The aim here is to use that teachable moment about ROCs to get the DC data community to think critically about how ROCs actually work in practice and why we use them at all.

Though the Q&A observation was a confused exchange even when both people ostensibly understood what was going on, it wasn’t clear if the community grokked what was going on. And in presentations, as a corollary of Lesson 0, the aim is to give the audience what they came for—a more complete understanding. Since a Q&A is usually just two strangers trying to quickly connect perspectives to generate more insight than either has alone, Q&As often fail.

Do message boards help? Theoretically, they can. But usually not without moderation. In my experience, the nuances of communication and deference are lost on message boards and people sound more defensive and narrow than they might sound in person. YMMV in DMV. But thankfully, we have that external moderation on the Meetup board. And people can like the comments. And 6 people liked the pointer to Fawcett’s intro to ROCs. On google scholar, >5000 people have cited it, so there’s probably something to it. And 5 people liked the conciliatory response from the presenter—all is well in the world. Except this:

Since we were interested in the actual performance comparison on all datapoints, not the potential performance, we left in all points.

No, no, no! Say something about picking your thresholds, comparison to quantization of classifier output levels, how you connected your curves, what benchmark you’re comparing to, what part of the ROC is most relevant, how you validated, what package you plotted with, etc. But don’t say actual performance. The actual performance actually does lie on the convex hull. And you can actually reach those points with the classifiers that generated the ROCs you plotted. And this is true whether you’re in academia or industry.
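One standard way to reach a point on the hull is to flip a (biased) coin per example and pick between the two thresholds at the ends of a hull segment, which is the interpolation trick Fawcett describes. The sketch below computes the expected operating point of that randomized rule exactly rather than simulating it. The data are toy values assumed for illustration, and `expected_operating_point` is a hypothetical name.

```python
def expected_operating_point(scores, labels, t_hi, t_lo, q):
    """Expected (PFA, PD) when each example is thresholded at t_lo with
    probability q and at t_hi otherwise."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    exp_tp = 0.0
    exp_fp = 0.0
    for s, y in zip(scores, labels):
        # Probability this example is declared positive under the coin flip.
        p_declare = (1 - q) * (s >= t_hi) + q * (s >= t_lo)
        if y == 1:
            exp_tp += p_declare
        else:
            exp_fp += p_declare
    return exp_fp / n_neg, exp_tp / n_pos


# Toy data, assumed for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

# A single threshold at 0.7 sits in a dip of the staircase.
single = expected_operating_point(scores, labels, 0.7, 0.7, 0.0)
# Mixing thresholds 0.8 and 0.4 half and half lands on the hull segment between them.
mixed = expected_operating_point(scores, labels, 0.8, 0.4, 0.5)
print("single threshold:", single)
print("randomized mix:  ", mixed)
```

Both rules have the same expected false alarm rate on this toy set, but the mix detects more, because it sits on the segment joining the two hull vertices instead of in the dip beneath it.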

Useful links: