The Power of Power: Common Problems Running Experiments Online

The following is a guest post by Katya Vasilaky, a PhD economist who is now a Fellow at the Earth Institute at Columbia University and previously served as Vice President of Strategy and Analytics at TroopSwap, now ID.me, a DC-based startup. It is based on a lightning talk given at Data Gotham 2013.

“In spite of years of teaching and using statistics, we had not developed an intuitive sense of the reliability of statistical results observed in small samples. Our subjective judgments were biased: we were far too willing to believe research findings based on inadequate evidence and prone to collect too few observations in our own research.”

Daniel Kahneman, Thinking, Fast and Slow

While big data may give the illusion that we are observing a full population simply because the data are “large,” such data often represent a very selective process or a particular subsample of the population.

As Sinan Aral discussed in his 2012 presentation at Data Gotham, teasing out causality versus correlation requires experiments and randomized controlled trials to control for extraneous variables, regardless of the data’s size.

There are two major misconceptions that I’ve observed in the data science community:

  1. Data scientists erroneously assume that the “big” in big data is itself a solution to selection bias. The implicit claim is that because the data are “big,” they represent a full population, and therefore the analyst is not observing a self-selected group of individuals that may be driving the significant results. However, irrespective of whether a scientist could in fact collect data for an entire population, e.g. all tweeters in the universe, “big” does not imply that the relationships in the data are causal or that they are observed without bias. Experiments remain necessary for identifying causal versus merely correlated relationships.
  2. For those data scientists who understand (1) and are running experiments, a frequently overlooked or poorly understood practice, particularly when studying big data collected from online behavior, is computing the “power” of an experiment: the probability of correctly rejecting a false null hypothesis, or equivalently the probability of not committing a Type II error. (A minimal sample-size sketch follows this list.)
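
To make the notion of power concrete before moving on, here is a minimal sketch of a pre-experiment sample-size calculation for a two-proportion A/B test, using the standard normal approximation. The 5% baseline conversion rate, the one-point lift, and the 80% power target are made-up illustration numbers, not figures from the original talk.

```python
# Minimal pre-experiment power / sample-size sketch for a two-proportion A/B test.
# The baseline rate, target lift, and power level are made-up illustration numbers.
from scipy.stats import norm

def n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate subjects needed in each arm for a two-sided test of proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the Type I error rate
    z_beta = norm.ppf(power)            # z-value corresponding to the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

print(round(n_per_arm(0.05, 0.06)))   # roughly 8,000 users per arm to detect a 1-point lift
```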

Too often, data scientists focus solely on Type I errors, or the statistical significance of a test, without ever reporting the underlying power of their study. This may be due to the academic community’s preoccupation with publishing only statistically significant findings, i.e. publication bias.

Failing to compute or report power may not seem as dangerous with big data as it does with, say, the small samples of clinical trials. Creating a sufficiently large sample with online data is fairly costless, especially compared with running a large-scale social experiment such as de-worming children in Kenya (Miguel and Kremer 2004), so it is tempting to assume that all online experiments are well powered. For that reason, readers may not question whether they are.

But there is still a method to running experiments that should be followed, one I have not seen explicitly addressed within the data science community and one that is often abused.

Common perils of failing to compute power before the start of an experiment:

  1. Small Samples
    • Missed significant effects because the experiment was not well powered.
    • Found a significant effect despite an underpowered experiment. Are the results simply a fluke, i.e. the 5% chance of finding a statistically significant result when we should not have?
  2. Large Samples
    • Found a small statistically significant effect, but the economic effect is not meaningful. Namely, as the chosen sample size grows infinitely large, we can detect ever smaller changes, which may be economically insignificant (sometimes referred to as over-fitting, or fitting to noise in the data).
    • Increasing the sample size in an ad-hoc fashion because the trial was “underpowered.” This practice inflates the Type I error rate: across n independent looks at the data, the chance of at least one false positive is 1 - (1 - alpha)^n, where alpha is the per-test Type I error rate. This is relevant for data scientists working with UX designers or other web developers who use tools such as Ruby’s Vanity gem, which continually increases the sample size while displaying a running “counter” of the statistical significance level as data are added. (A simulation of this effect appears after this list.)
  3. Inefficient Sample Allocation Across Treatments
    • An equal number of subjects in each treatment cell is common but not necessarily efficient. The share of subjects assigned to each treatment group should increase in proportion to the standard deviation of that group’s outcome. Forcing equal sample sizes into each treatment cell when outcome variances differ results in a loss of efficiency, requiring a larger overall sample size to reach the same power (List et al. 2010). (See the allocation sketch after this list.)
  4. Overstating Significance
    • Many online experiments do not cluster their standard errors for users coming from the same traffic source or webpage, even though those users’ behavior and choices may be correlated; ignoring that correlation understates the standard errors and overstates significance. (A clustered-standard-errors sketch follows this list.)
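
To illustrate the second peril, here is a hedged simulation sketch of what happens when an experimenter keeps adding data and re-testing under a true null. The batch size, number of looks, and number of simulated experiments are arbitrary illustration choices, not anything prescribed by Vanity or by the original talk; the point is simply that the realized false positive rate climbs well above the nominal 5%.

```python
# Simulate "peeking": under a true null (no difference between arms), re-run a
# two-sample z-test after every new batch of users and record whether the
# experiment was ever declared significant.  All constants are arbitrary
# illustration choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, batch, looks, sims = 0.05, 100, 20, 2000
false_positives = 0

for _ in range(sims):
    a = rng.normal(size=batch * looks)   # control arm, true effect = 0
    b = rng.normal(size=batch * looks)   # treatment arm, true effect = 0
    for k in range(1, looks + 1):
        n = k * batch                    # sample size per arm at this look
        z = (b[:n].mean() - a[:n].mean()) / np.sqrt(2 / n)   # unit variance in both arms
        if abs(z) > norm.ppf(1 - alpha / 2):
            false_positives += 1
            break

print(f"Realized false positive rate with peeking: {false_positives / sims:.2f}")
print(f"Independent-looks bound, 1 - (1 - alpha)^n: {1 - (1 - alpha) ** looks:.2f}")
```

Because successive looks reuse the same accumulating data, the realized rate sits below the independent-looks bound, but it is still several times the nominal 5%.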
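
For the third peril, the rule of thumb in List, Sadoff, and Wagner (2010) is that, with equal sampling costs, sample sizes should be allocated in proportion to each group’s outcome standard deviation rather than split evenly. A small sketch, using made-up standard deviations and placeholder cell names:

```python
# Allocate a fixed total sample across treatment cells in proportion to each
# cell's (assumed) outcome standard deviation.  The standard deviations, cell
# names, and total sample size are made-up illustration values.
total_n = 10_000
std_devs = {"control": 1.0, "treatment_a": 1.0, "treatment_b": 2.5}

total_sd = sum(std_devs.values())
allocation = {cell: round(total_n * sd / total_sd) for cell, sd in std_devs.items()}
print(allocation)   # the noisier arm receives proportionally more subjects
```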
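
For the fourth peril, a common remedy is to cluster standard errors at the level of the traffic source. Below is a hedged sketch using simulated data and cluster-robust standard errors in statsmodels; the source labels, group sizes, and the choice to assign treatment at the source level are all illustration assumptions rather than anything from the original talk.

```python
# Compare naive and cluster-robust standard errors when users arriving from the
# same traffic source share a common shock.  Data are simulated; treatment is
# assigned at the source level so the within-source correlation clearly matters.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_sources, users_per_source = 50, 40
source = np.repeat(np.arange(n_sources), users_per_source)     # cluster id per user
treated = rng.integers(0, 2, size=n_sources)[source].astype(float)
shared_shock = rng.normal(scale=1.0, size=n_sources)[source]   # correlation within source
y = 0.0 * treated + shared_shock + rng.normal(size=source.size)  # true treatment effect = 0

X = sm.add_constant(treated)
naive = sm.OLS(y, X).fit()                                      # ignores within-source correlation
clustered = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": source})

print("naive SE:    ", float(naive.bse[1]))
print("clustered SE:", float(clustered.bse[1]))   # noticeably larger
```

The naive standard error treats all 2,000 simulated users as independent and therefore understates the uncertainty, which is exactly the overstated significance described above.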

References

Miguel, Edward, and Michael Kremer. “Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities.” Econometrica, Vol. 72, No. 1 (January 2004), pp. 159–217.

List, John, Sally Sadoff, and Mathis Wagner. “So You Want to Run an Experiment, Now What? Some Simple Rules of Thumb for Optimal Experimental Design.” NBER Working Paper No. 15701, January 2010.

Harrison, Glenn W., and John A. List. “Field Experiments.” Journal of Economic Literature, Vol. 42, No. 4 (December 2004), pp. 1009–1055.