This is a guest post by Eunice Choi, a Health IT Consultant who is very interested in Data Science.
When a fortuitous event takes place, it is a very human inclination to be intrigued—and when such an event happens again in seemingly quick succession, we start to look for patterns. It is widely known that, in Data Mining, this ability to notice patterns is of great consequence—and in Big Data, this ability plays itself out in the data analysis process, both in intuitive and counterintuitive ways.
In addition to addressing what Jules Berman, in his book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, calls statistical method bias and ambiguity bias, the speakers at the Data Science DC Meetup illuminated an important issue for Data Miners that goes beyond ‘correlation is not causation’—the issue of whether the correlation itself is real or the result of repeated, massive exploration and modeling of the data. In addition, by citing examples from the fields of Statistics, Predictive Analytics, Epidemiology, Biomedical Research, Marketing, and the Business and Government worlds, the speakers addressed a commonly seen issue stemming from overconfidence: viewing data validation as proof of causality. The talks provided a broad sense of context around the phenomenon of how repetitive computer intensive modeling can lead to overfitting and model underperformance. The speakers provided a deeper understanding of the process by which to determine which models and methods would offer the highest level of confidence, with a focus on best practices and remedies.
A compelling use case mentioned was: Do we see random events for their randomness, or do we see winning a $1 million lottery twice in a day as something beyond chance?
Jules Berman noted that, in order “To get the greatest value from Big Data resources, it is important to understand when a problem in one field has equivalence to a problem in another field.” In a set of talks that spoke to such equivalence, Peter Bruce, President of The Institute for Statistics Education at Statistics.com, and Gerhard Pilcher, VP and Senior Scientist at Elder Research, Inc. (ERI), who leads the Washington, DC office and all its federal civil work, presented on the topic of ‘Data Mining for Patterns That Aren’t There’.
During the first portion of the Meetup, networking took place outdoors over empanadas, and the atmosphere was collegial and friendly. Once the audience filtered in, they appreciated Jonathan Street’s data visualization of when new members RSVP’d to the Meetup event; you can see the momentum that built up in the 5 days directly preceding the Meetup:
Peter Bruce spoke on the topic by drawing the audience into probability examples and discussing the ‘lack of replication’ problem in scientific research. He then observed that humans are unwilling to think that chance is responsible for patterns in datasets and expanded upon this further with an example of the human capacity to be fooled by randomness in which commodity traders were shown charts and were asked to comment on them. Charts such as the one below were produced by random chance, yet the commodity traders viewed the charts as being representative of specific, observed phenomena, and continued to do so even after being told that the actual series were random:
He then spoke on how numerators and denominators figure into the question: Did you see the interesting event and then conclude it was interesting? In that case the numerator could be huge, which would mean that “interestingness” would decrease drastically.
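One way to make this question concrete is to count how many opportunities there were for the "interesting" event to occur. The sketch below is an illustration only: the coin-flip scenario, the function name, and the figure of 1,000 people are assumptions chosen for the example and were not part of the talk.

```python
def prob_at_least_one(p_single, n_opportunities):
    """Chance that an event of probability p_single happens at least once
    across n_opportunities independent tries."""
    return 1.0 - (1.0 - p_single) ** n_opportunities


# Hypothetical event: flipping 10 heads in a row with a fair coin.
p_streak = 0.5 ** 10

# Named in advance: one specific person makes one attempt.
one_person = prob_at_least_one(p_streak, 1)

# Noticed afterwards: anyone out of 1,000 people could have done it.
anyone_of_1000 = prob_at_least_one(p_streak, 1000)

print(f"one named person: {one_person:.4f}")
print(f"anyone of 1,000:  {anyone_of_1000:.4f}")
```

The same streak that would be remarkable for a person named in advance becomes more likely than not once a thousand people had the chance to produce it.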
To guide the audience to reexamine the significance of “statistically significant” correlations, Bruce cited epidemiology studies and other examples from health and science. For instance, in epidemiological studies on Bisphenol A (BPA), 1,000 people were involved and the models looked at 275 chemicals, 32 possible health outcomes, and 10 demographic variables. The high dimensionality and high volume of the data objects created computational challenges: there were 9,000,000 possible models when accounting for all possible covariate inclusion/exclusion options. This example demonstrated the idea ‘Try enough models with enough covariates and you’ll get a correlation,’ but also demonstrated that this idea does not necessarily embody the optimal approach to data mining. Bruce asserted that, in data mining, the proper use of a validation sample protects to some extent, but that information about the validation data may leak into the modeling process through repeated model tuning using the validation data or via information gained during the exploration/preparation phase.
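The figure of roughly 9,000,000 models can be reproduced with a little arithmetic. The sketch below assumes one model per chemical and health outcome pair, with each of the 10 demographic variables either included or excluded; that enumeration is an assumption that happens to match the quoted number, not a description taken from the talk. The 0.05 significance level is likewise an assumed, conventional choice.

```python
chemicals = 275
outcomes = 32
covariates = 10

# Each covariate is either in or out of the model: 2**10 combinations.
n_models = chemicals * outcomes * 2 ** covariates
print(n_models)  # 9011200

# If no real effect existed anywhere, and each test had a valid 5%
# false-positive rate, this many models would still be expected to
# come out "statistically significant" by chance alone.
alpha = 0.05
expected_false_positives = alpha * n_models
print(round(expected_false_positives))  # 450560
```

Under these assumptions, hundreds of thousands of "significant" correlations would be expected even if none of the chemicals had any effect at all.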
Gerhard Pilcher discussed the ‘Vast Search Effect’ (i.e., what statisticians call the ‘multiple testing problem’ or ‘data snooping’), which he defined as ‘trying to find something interesting, whether that finding is real or the effect of random chance.’ He focused his talk on points of inquiry around the main question: Are orange cars really least likely to be bad buys?
From an initial bar graph on the proportion of bad buys by car color, it would appear that orange cars were least likely to be bad buys:
However, Pilcher pointed out that the hypothesis was developed after seeing the data and that the data was not partitioned (the hypothesis was tested on the same data that suggested it), which had led to an instantiation of the ‘Vast Search Effect.’ To set the stage for why the Vast Search Effect is important, one compelling fact that Pilcher mentioned was that Bayer Laboratories confirmed that they could not replicate 67% of positive findings claimed in medical journals.
Pilcher also showed an example of a financial model, built on two variables, that gave great results in terms of the numbers. However, when the model was plotted, the response surface (with the return shown in red) revealed that the stability of the model was very low, so the model could not continue to be used.
To avoid the Vast Search Effect, Pilcher offered the following solutions:

Partitioning – breaking out the dataset into training, validation, and/or test data sets (making sure to avoid using the testing set to revise the training set)

Statistical Inference – deduce and test a new hypothesis

Simulation – sampling without replacement (e.g., target shuffling and checking for proportions)
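
The simulation remedy can be sketched roughly as follows for the car color question. This is a minimal illustration, not Pilcher's actual procedure: the function names, the color counts, and the 12% bad-buy rate are all hypothetical, and the data is generated so that color has no real effect. The idea is to shuffle the outcome column (a permutation, i.e. sampling without replacement), repeat the same search for the best-looking color, and see how often chance alone produces a result at least as striking as the one observed.

```python
import random


def lowest_group_rate(groups, targets):
    """Return the smallest bad-buy proportion found in any group."""
    totals, bads = {}, {}
    for g, t in zip(groups, targets):
        totals[g] = totals.get(g, 0) + 1
        bads[g] = bads.get(g, 0) + t
    return min(bads[g] / totals[g] for g in totals)


def target_shuffle_p_value(groups, targets, n_shuffles=200, seed=0):
    """Share of shuffles whose best-looking group is at least as extreme
    as the best-looking group in the real data."""
    rng = random.Random(seed)
    observed = lowest_group_rate(groups, targets)
    shuffled = list(targets)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real link between group and outcome
        if lowest_group_rate(groups, shuffled) <= observed:
            hits += 1
    return hits / n_shuffles


# Hypothetical data: every car has the same 12% chance of being a bad buy,
# whatever its color, so any "best" color here is a chance finding.
rng = random.Random(42)
color_counts = {"silver": 700, "white": 600, "blue": 400, "red": 250, "orange": 40}
colors = [c for c, n in color_counts.items() for _ in range(n)]
bad_buy = [1 if rng.random() < 0.12 else 0 for _ in colors]

p = target_shuffle_p_value(colors, bad_buy)
print(f"share of shuffles that look at least as good: {p:.3f}")
```

Because each shuffle repeats the whole search across colors, the resulting proportion accounts for the fact that the "best" color was picked after looking at the data. A large proportion suggests the apparent winner is the kind of result chance produces routinely.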
Key takeaways from Pilcher’s talk included the following. Hypothesis tests work when:
 The hypothesis comes first, the analysis second;
 The data is partitioned into training and testing datasets; and
 The logic incorporates practical significance in addition to statistical significance.
Pilcher emphasized the importance of the human element in determining what makes sense in a computer’s output in data mining. In addition, he compared modern machine learning algorithms, which learn by induction, to linear regression. He made the point that when learning by induction, one is inducing what the data is trying to say, thereby creating nonlinear surfaces, and that in that situation one is likely to overfit one’s model. Therefore, he used random shuffling to test different algorithms. He emphasized that one ought to test the algorithm against the data and should ask oneself: How much is the algorithm trying to overfit the random data?
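A rough way to ask how much an algorithm overfits random data is to shuffle the labels so that no real signal remains, then compare performance on the training data with performance on held-out data. The sketch below is an assumption-laden illustration rather than anything shown in the talk: the one-nearest-neighbor learner, the data sizes, and the function names are all hypothetical choices, picked because a nearest-neighbor rule is an extreme case of a flexible learner.

```python
import random


def one_nn_predict(train_x, train_y, x):
    """Predict the label of the closest training point (a very flexible learner)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the 1-NN rule labels correctly."""
    correct = sum(one_nn_predict(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return correct / len(xs)


rng = random.Random(7)
xs = [rng.random() for _ in range(400)]
ys = [0] * 200 + [1] * 200
rng.shuffle(ys)  # shuffled labels carry no information about xs

# Partition into training and held-out halves.
train_x, train_y = xs[:200], ys[:200]
test_x, test_y = xs[200:], ys[200:]

train_acc = accuracy(train_x, train_y, train_x, train_y)
test_acc = accuracy(train_x, train_y, test_x, test_y)
print(f"training accuracy on noise: {train_acc:.2f}")
print(f"held-out accuracy on noise: {test_acc:.2f}")
```

The learner memorizes the training labels perfectly even though they are pure noise, while held-out accuracy stays near the 50% expected from guessing. The gap between the two is one simple measure of how eagerly an algorithm fits randomness.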
By having us consider these questions, the speakers balanced their cautionary word on overfitting models with their assertion that data validation also depends on meaningful results, and on the best ways to arrive at the hypotheses and processes that lead to such results. If the modeling process could be likened to an expansive cube, then for me the effect of pondering these considerations was like walking around that cube and examining all of its contours, in addition to peering inside it to understand its properties.
For more on the presentations, see the following resources: