This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project on location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for additional resources. This post by Alan is in response to that request.
If you’re looking for a good text resource to learn some of the basics of facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (PDF), available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single-facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.
One of the really nice things about the data science community is that it is full of people with an inveterate dedication to critical thinking. There is nearly always some great [constructive] criticism about what didn’t quite work or what could have been a little bit better. One of the responses to the presentation recommended optimizing with respect to travel time instead of distance. Obviously, in this work, we’re using Euclidean distance as a proxy for time. Harlan cites laziness as the primary motivation for this shortcut, and I’ll certainly echo his response. However, a lot of modeling boils down to cutting your losses, getting a good-enough solution, or finding the balance among your own limited resources (e.g. time, data, technique, etc.). For the problem at hand, the goal is clearly to make Meetup attendance convenient for the greatest number of people; we want to get people to Meetups. But, for our purposes, a rough order of magnitude was sufficient. Harlan humorously points out that the true optimal location for a version of our original analysis in DC was an abandoned warehouse—if it really was an actual physical location at all. So, when you really just need a good solution, and the precision associated with true optimality goes unappreciated, distance can be a pretty good proxy for time.
A lot of the time, good enough works, but there are some obvious limitations. In statistics, the law of large numbers holds that independent measurements of a random quantity tend toward the theoretical average of that quantity. For this reason, logistics problems can be estimated more accurately when they involve a large number of entities (e.g. travelers, shipments, etc.). For the problem at hand, if we were optimizing facility location for 1,000 or 10,000 respondents, again using distance as a proxy for time, we would feel much better about the optimality of the location. I would add that, similar to large quantities, introducing greater variability can also improve performance. Thus, optimizing facility location across DC or New York alone may be a little too homogeneous. If instead your analysis uses data from a larger, more heterogeneous area, say an entire state with urban, suburban, and rural areas, the optimization will again perform better.
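To make the averaging argument concrete, here is a minimal simulation sketch in Python. The distances, the circuity range, and the comparison of a distance-based proxy against a notional “true” travel time are all illustrative assumptions, not figures from our analysis; the point is simply that per-trip deviations wash out as the number of respondents grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def proxy_error_pct(n_respondents):
    """Relative error (%) of a distance-based proxy for total travel time.

    Each respondent's straight-line distance is inflated by a random,
    trip-specific circuity factor to get a notional "true" travel time.
    The proxy is total distance scaled by the average circuity factor;
    a constant scale factor doesn't change which candidate location wins,
    so what matters is how the per-trip deviations average out.
    """
    distances = rng.uniform(1, 10, size=n_respondents)    # illustrative units
    circuity = rng.uniform(1.1, 1.6, size=n_respondents)  # assumed range
    true_time = (distances * circuity).sum()
    proxy_time = distances.sum() * 1.35                    # 1.35 = mean circuity
    return abs(proxy_time - true_time) / true_time * 100

for n in (10, 100, 1_000, 10_000):
    avg_err = np.mean([proxy_error_pct(n) for _ in range(200)])
    print(f"{n:>6} respondents: ~{avg_err:.2f}% error")
```

Running this, the error shrinks steadily as the number of respondents grows, which is exactly the law-of-large-numbers effect described above.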
Let’s say you’ve weighed the pros and cons and you really want to dive deeper into optimizing on travel time; there are a couple of different options you can consider. First, applying a circuity factor to the Euclidean distance can account for non-direct travel between points. So, going from point A to point B may actually take 1.2 units as opposed to the straight-line distance of 1 unit. That may not be as helpful for a really complex urban space where feasible routing is not always intuitive and travel times can vary wildly. However, it can give some really good approximations, again, over larger, more heterogeneous spaces. An extension of a single circuity factor would be to introduce a gradient of circuity factors proportional to population density. There are some really good zip-code-level data available that can be used to estimate population density.
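Here is a rough sketch of what that might look like in Python. The function names, the density breakpoints, and the circuity values are all hypothetical; the idea is just to inflate the straight-line distance more aggressively where the road network is denser and travel is more circuitous.

```python
import math

def euclidean_miles(p, q):
    """Straight-line distance between two (x, y) points, in miles."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def estimated_travel_distance(p, q, density_per_sq_mile):
    """Inflate straight-line distance by a circuity factor that grows
    with population density (breakpoints and factors are illustrative)."""
    if density_per_sq_mile > 10_000:     # dense urban core
        circuity = 1.5
    elif density_per_sq_mile > 1_000:    # suburban
        circuity = 1.3
    else:                                # rural
        circuity = 1.15
    return circuity * euclidean_miles(p, q)

# e.g. two points 5 straight-line miles apart in a dense area -> ~7.5 road miles
print(estimated_travel_distance((0, 0), (3, 4), density_per_sq_mile=12_000))
```

A smooth function of density would work just as well as the step function used here; the step version simply makes the gradient idea easy to see.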
Increasing the circuity factor in higher-density locales and decreasing it in lower-density ones can help by providing a more realistic assessment of how far off the straight-line distance the average person would have to commute. For the really enthusiastic modeler who has some good data skills and is looking for even more fidelity in their travel-time analysis, there are a number of websites that provide road network information. These list roads across the United States by functional class (interstate, expressway and principal arterial, urban principal arterial, collector, etc.) and even provide shapefiles and other GIS resources. I've done basic speed-limit estimation by functional class, but you could also introduce a speed gradient with respect to population density (as mentioned above for the circuity factor). You could also derive some type of inverse speed distribution that slows traffic at certain times of day based on rush-hour travel in or near urban centers.
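As a sketch of the speed-based approach, here is one way it might look in Python. The speed values, the rush-hour window, and the slowdown factor are guesses for illustration, not figures from any road-network dataset.

```python
# Hypothetical average speeds (mph) by road functional class;
# these numbers are illustrative, not from an official source.
SPEED_BY_CLASS = {
    "interstate": 65,
    "principal_arterial": 45,
    "collector": 30,
    "local": 25,
}

def leg_time_minutes(miles, functional_class, hour_of_day):
    """Travel time for one leg of a route, slowing traffic during
    an assumed morning and evening rush-hour window."""
    speed = SPEED_BY_CLASS[functional_class]
    if 7 <= hour_of_day <= 9 or 16 <= hour_of_day <= 18:
        speed *= 0.6  # assumed rush-hour slowdown
    return miles / speed * 60

# A route made of legs on different road classes, departing at 5 pm
route = [(2.0, "local"), (5.0, "principal_arterial"), (10.0, "interstate")]
total = sum(leg_time_minutes(m, cls, hour_of_day=17) for m, cls in route)
print(round(total, 1), "minutes")
```

From there, the minisum/minimax machinery is the same as before; you are just swapping the distance term for an estimated travel time.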
In the words of George E. P. Box, “all models are wrong, but some are useful.” If you start with a basic location analysis and find it to be wrong but useful, you may have done enough. If not, however, perhaps you can make it useful by increasing complexity in one of the ways I have mentioned above.