
Thoughts on the INFORMS Business Analytics Conference

This post, from DC2 President Harlan Harris, was originally published on his blog. Harlan was on the board of WINFORMS, the local chapter of the Operations Research professional society, from 2012 until this summer.

Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference's WordPress website, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:

Operations Research, from the point of view of Data Science

  • more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
  • more openness, less big iron — open source software leads to a low-cost, highly flexible approach
  • more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
  • more velocity, smaller projects — a hundred $10K projects beats one $1M project
  • more science, less engineering — both practitioners and methods have different backgrounds
  • more hipsters, less suits — stronger connections to the tech industry than to the boardroom
  • more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse

What is a “Data Product”?

DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.

Healthcare (and not Education) at INFORMS Analytics

[A]s DC residents, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?

What’s Changed at the Practice/Analytics Conference?

In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?

Why OR/Analytics People Need to Know About Database Technology

It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.

All in all, I enjoyed blogging the conference, and recommend the practice to others! It's a great way to organize your thoughts and to summarize and synthesize your experiences.

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for additional resources; this post by Alan is in response to that request.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (PDF), available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single-facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and their narrowly tailored implementations.

One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good-enough solution, or finding the balance among your own limited resources (e.g., time, data, technique). For the problem at hand, the goal is clearly to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse—if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality goes unappreciated, distance can be a pretty good proxy for time.

A lot of times good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be estimated more accurately when they involve a large number of entities (e.g., travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that, similar to large quantities, introducing greater variability can also improve performance. Optimizing facility location across DC or New York alone may be a little too homogeneous; if instead your analysis uses data from a larger, more heterogeneous area, say an entire state with urban, suburban, and rural areas, the optimization will again perform better.
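To make the law-of-large-numbers intuition concrete, here is a minimal simulation sketch in Python; every number in it is an assumption for illustration, not survey data. Each attendee's road-to-straight-line distance ratio is drawn at random, and we watch the error in the event-level average shrink as attendance grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_abs_error(n_attendees, true_ratio=1.2, noise=0.3, trials=1000):
        # Simulate each attendee's road/straight-line distance ratio as
        # true_ratio plus noise (hypothetical values for illustration).
        samples = true_ratio + noise * rng.standard_normal((trials, n_attendees))
        # How far, on average, does the event-level mean fall from the truth?
        return np.abs(samples.mean(axis=1) - true_ratio).mean()

    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} attendees: mean error {mean_abs_error(n):.4f}")

The error shrinks roughly like 1/sqrt(n), which is why aggregate distance becomes a better proxy for aggregate time as attendance grows.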

Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing based on travel time; there are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, going from point A to point B may actually take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space, where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again over larger, more heterogeneous spaces. An extension to a single circuity factor would be to introduce a gradient circuity factor proportional to population density. There are some really good ZIP code data available that can be used to estimate population.
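In code, the single-factor version is a one-liner. A minimal sketch in Python, assuming planar (x, y) coordinates in miles; the 1.2 factor matches the example above:

    import math

    def straight_line(a, b):
        # a and b are (x, y) coordinates in miles on a local planar projection
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def road_distance(a, b, circuity=1.2):
        # Scale straight-line distance by a circuity factor to approximate
        # actual over-the-road distance
        return circuity * straight_line(a, b)

    print(road_distance((0, 0), (3, 4)))  # 5.0 straight-line miles -> 6.0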

Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off the straight-line distance the average person would have to travel. For the really enthusiastic modeler who has some good data skills and is looking for even more integrity in their travel-time analysis, there are a number of websites that provide road network information. They list roads across the United States by functional class (interstate, expressway and principal arterial, urban principal arterial, collector, etc.) and even provide shapefiles for GIS. I’ve done basic speed limit estimation by functional class, but you could also introduce a speed gradient respective to population density (as mentioned above for the circuity factor). You could also derive some type of inverse speed distribution that slows traffic at certain times of day, based on rush-hour travel in or near urban centers.
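Putting those pieces together, here is one hedged sketch of what a travel-time estimate might look like. Every cutoff, circuity value, speed, and the rush-hour slowdown below is an assumption for illustration, not data from the road-network sources mentioned above:

    def circuity_for_density(pop_per_sq_mile):
        # Denser areas get a larger circuity factor (all cutoffs assumed)
        if pop_per_sq_mile > 10_000:
            return 1.4   # dense urban core
        if pop_per_sq_mile > 1_000:
            return 1.25  # suburban
        return 1.15      # rural

    # Rough average-speed assumptions by road functional class, in mph
    SPEED_MPH = {"interstate": 65, "principal_arterial": 45, "collector": 30}

    def travel_time_hours(straight_line_miles, pop_per_sq_mile,
                          road_class="principal_arterial", rush_hour=False):
        road_miles = straight_line_miles * circuity_for_density(pop_per_sq_mile)
        speed = SPEED_MPH[road_class]
        if rush_hour:
            speed *= 0.6  # assumed peak-hour slowdown near urban centers
        return road_miles / speed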

In the words of George E. P. Box, “all models are wrong, but some models are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.

How To Put Your Meetup On the Map (Literally)

This is a guest post by Alan Briggs. Alan is a Data Scientist with Elder Research, Inc., assisting with the development of predictive analytics capabilities for national security clients in the Washington, DC metro area. He is President-Elect of the Maryland chapter of the Institute for Operations Research and the Management Sciences (INFORMS). Follow him at @AlanWBriggs. This post previews two events that Alan is presenting with Data Community DC President Harlan Harris.

Have you ever tried to schedule dinner or a movie with a group of friends from across town, only to have someone respond, “no, not there, that’s too far”? Then there’s the old adage that the three most important things about real estate are location, location, and location. From placing a hot dog stand on the beach to deciding where to build the next Ikea, location crops up all over the place. Not surprisingly, and I think most social scientists would agree, people tend to act in their own self-interest. That is, everyone wants to travel the shortest distance, spend the least money, or expend the least time possible in order to do what they need [or want] to do. For one self-interested person, the solution to a location problem would always be clear, but we live in a world of coexistence and shared resources. There’s not just one person going to the movies or shopping at the store; there are several, hundreds, thousands, maybe several hundred thousand. If self-interest is predictable in this small planning exercise of getting together with friends, can we use math and science to leverage it to our advantage? It turns out that the mathematical and statistical techniques that scale to the world’s largest and most vexing problems can also be used to address some more everyday issues, such as where to schedule a Meetup event.

With a little abstraction, this scenario looks a lot like a classical problem in operations research called the facility location or network design problem. With roots tracing back to the 17th-century Fermat-Weber problem, facility location analysis seeks to minimize the costs associated with locating a facility. In our case, we can define the cost of a Meetup venue as the sum of the distances traveled to the Meetup by its attendees. Other costs could be included, but to start simple, you can’t beat straight-line distance.
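The classical iterative scheme for this single-facility minisum problem is Weiszfeld's algorithm. A minimal sketch in Python, assuming planar coordinates (a sketch of the technique, not the code from our talks):

    import numpy as np

    def weiszfeld(points, iters=200, tol=1e-9):
        # Approximate the minisum (Fermat-Weber) point: the single location
        # minimizing the sum of straight-line distances to all points.
        pts = np.asarray(points, dtype=float)
        x = pts.mean(axis=0)  # start from the centroid
        for _ in range(iters):
            d = np.maximum(np.linalg.norm(pts - x, axis=1), tol)  # avoid /0
            w = 1.0 / d
            x_new = (pts * w[:, None]).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x

    # e.g., four attendees' coordinates (miles on a planar projection)
    print(weiszfeld([(0, 0), (10, 0), (0, 10), (8, 9)]))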

So, here’s a little history. The data-focused Meetup scene in the DC region is several years old, with the Hadoop DC, Big Data DC, and R Users DC (now Statistical Programming DC) Meetups having been founded first. Over the years, as these groups have grown and been joined by many others, their events have bounced around among several different venues, mostly in downtown Washington DC, Northern Virginia, and suburban Maryland. Location decisions were primarily driven by supply – what organization would be willing to allow a big crowd to fill its meeting space on a weekday evening? Data Science DC, for instance, has almost always held its events downtown. But as the events have grown, and as organizers have heard more complaints about location, it became clear that venue selection needed to include a demand component as well, and that some events could or should be held outside of the downtown core (or vice versa, for groups that have held their events in the suburbs).

Data Community DC performed a marketing survey at the beginning of 2013 and got back a large enough sample of Meetup attendees to do some real analysis. (See the public survey announcement.) As professional Meetups tend to be held on weekday evenings, attendees are likely traveling not just from work or just from home, but most likely making a detour from the usual commute line connecting work and home. Fortunately, the survey designers asked Meetup attendees to provide both their home and their work ZIP codes, so the data could be used to (roughly) draw lines on a map describing typical commutes.

Commutes for data Meetup attendees, based on ZIP codes.
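Turning ZIP code pairs into those commute segments is mostly a lookup exercise. A minimal sketch, assuming a hypothetical zip_centroids.csv file with zip, lat, lon columns (the survey responses themselves are not public):

    import csv

    def load_zip_centroids(path="zip_centroids.csv"):
        # Hypothetical file mapping each ZIP code to its centroid
        with open(path, newline="") as f:
            return {row["zip"]: (float(row["lat"]), float(row["lon"]))
                    for row in csv.DictReader(f)}

    def commute_segments(zip_pairs, centroids):
        # zip_pairs: iterable of (home_zip, work_zip) from survey responses
        return [(centroids[h], centroids[w])
                for h, w in zip_pairs
                if h in centroids and w in centroids]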

The Revolutions blog recently presented a similar problem in the post How to Choose a New Business Location with R. The author, Rodolfo Vanzini, writes that his wife’s business had simply run out of space in its current location. An economist by training, Vanzini initially assumed the locations of customers at his wife’s business must be uniformly distributed. After further consideration, he concluded that “individuals make biased decisions basing their assumptions and conclusions on a limited and approximate set of rules often leading to sub-optimal outcomes.” Vanzini then turned to R, plotted the customers’ addresses on a map, and picked a better location based on visual inspection of their distribution.

If you’ve been paying attention, you’ll notice a common thread in the previously mentioned location optimizations. When you’re getting together with friends, you’re only going to one movie theater; Vanzini’s wife was only locating one school. Moreover, both location problems rely on a single location — their home — for each interested party. That’s really convenient for a beginner-level location problem; accordingly, it’s relatively simple to solve. The Meetup location problem, on the other hand, adds two complexities that make it worthy of the time you’re spending here. Principally, if it’s not readily apparent, a group of 150 boisterous data scientists can easily overstay their welcome by holding monthly meetings at the same place over and over again. Additionally, a single location ensures that the part of the population that drives the farthest has to do so for each and every event. For this reason, we propose identifying the three locations that minimize the sum of the minimum distances traveled by the entire group. The idea is that Meetup events can rotate among the three optimal locations. This provides diversity in location, which appeases meeting-space hosts, but it also provides a closer meeting location for a rotating third of the attendee population. Every event won't be ideal for everyone, but it'll be convenient for everyone at least sometimes.
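With a modest list of candidate venues, that three-location objective can be optimized by brute force. A minimal sketch in Python, assuming planar coordinates (a sketch of the objective, not our actual implementation):

    from itertools import combinations
    import math

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def best_three_venues(candidates, attendees):
        # Pick the 3 candidate venues minimizing the total distance from
        # each attendee to their nearest chosen venue.
        best, best_cost = None, float("inf")
        for trio in combinations(candidates, 3):
            cost = sum(min(dist(a, v) for v in trio) for a in attendees)
            if cost < best_cost:
                best, best_cost = trio, cost
        return best, best_cost

Brute force over all triples is fine for tens of candidate venues; much larger instances would call for p-median heuristics instead.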

As an additional complexity, we have two ZIP codes for each person in attendance — work and home — which means that instead of doing a point-to-point distance computation, we want to minimize the distance from each meeting location to the closest point along each commute line. Optimizing location with these two concepts in mind — particularly the n-location component — is substantially more complicated than single-location optimization with just one set of coordinates per attendee.
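The geometric piece here is a standard point-to-segment distance; swapping it in for dist in the sketch above gives the commute-line version of the objective. A minimal sketch, again assuming planar coordinates:

    import math

    def point_to_segment(p, a, b):
        # Distance from point p (a candidate venue) to the closest point on
        # the segment from a to b (an attendee's home-to-work commute line)
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        seg_len2 = dx * dx + dy * dy
        if seg_len2 == 0:                   # home and work coincide
            return math.hypot(px - ax, py - ay)
        # Project p onto the line, clamping to the segment's endpoints
        t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
        return math.hypot(px - (ax + t * dx), py - (ay + t * dy))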

So, there you have it. To jump ahead to the punchline, the three optimal locations for a data Meetup to host its meetings are Rockville, MD, downtown Washington DC and Southern Arlington, VA.

Gold points are optimal locations. The color gradient shows single-location utility.

To hear us (Harlan and Alan) present this material, there are two great events coming up. The INFORMS Maryland chapter is hosting its inaugural Learn. Network. Do. event on October 23 at INFORMS headquarters on the UMBC campus. Statistical Programming DC will also be holding its monthly meeting at iStrategy Labs on October 24. Both events will pull back the curtain on the code that was used, and either one should send you off with sufficient knowledge to start tackling your own location optimization problem.