This is a guest post from Data Science DC Member and quantitative political scientist David J. Elkind. As the November Data Science DC Meetup, Nathan Danneman, Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman’s talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors’ incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I’ll address each of these in turn.
Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)
Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.
But what do smugglers look like? Unfortunately, Nathan’s data set did not itself include any examples of watercraft which had been unambiguously identified by, e.g., the US Navy, as smugglers, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he has to puzzle out how to leverage information a craft’s spatial location
I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.
Importantly, framing the research in this way entirely premises the validity of inferences on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons that ships might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don’t coincide with shipping lanes, and so look similar to the hypothesized smugglers.
Outline of Approach
The basic approach can be split into three sections: partitioning the strait into many grids, generating fake boats to compare the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude and longitude) to differentiate the real boats from the fake ones.
Partitioning the strait into grids helps emphasize the local character of ships’ movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on a heading taking it along that channel. Generating fake boats, with bivariate uniform distribution in the grid square, will tend not to fall in the path of ordinary traffic channel, just like the hypothesized behavior of smugglers. The same goes for the uniformly-distributed timestamps and otherwise randomly-assigned boat attributes for the comparison sets: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behaviors will advance the goal of differentiating between smugglers and ordinary traffic.
Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I’m loose with the particulars – but this in this case it refers to the notion that there are two classes of data points, one i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density. Pooling both samples gives one a mixture of the two densities; this problem then becomes one of comparing the relative densities of the two classes of data points – that is, this problem is actually a restatement of the problem of logistic regression! Additional details can be found in Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
Presentation of Findings
After fitting the model, we can examine which of the real boats the model rated as having a low odds of being real – that is, boats which looked so similar to the randomly-generated boats that the model had difficulty differentiating the two. These are the boats that we might call “outliers,” and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, are more likely to be engaged in smuggling.
I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportion roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression model is fixed to reflect the ratio of positive cases to negative cases in the data set. All of the terms in the model, including the intercept, are still MLE estimators, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to replace the intercept term with knowledge of what the overall ratio of positives to negatives in the other.
In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each of the ships (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that all applications would leverage this information – were it available – and model it appropriately; however, as befit a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.
Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.