# Will big data bring a return of sampling statistics? And a review of Aaron Strauss's talk at DSDC

This guest post by Tommy Jones was originally published on Biased Estimates. Tommy is a statistician or data scientist -- depending on the context -- in Washington, DC. He is a graduate of Georgetown's MS program for mathematics and statistics. Follow him on Twitter @thos_jones.

### Some Background

#### What is sampling statistics?

Sampling statistics concerns the planning, collection, and analysis of survey data. When most people take a statistics course, they are learning "model-based" statistics. (Model-based statistics is not the same as statistical modeling, stick with me here.) Model-based statistics uses a mathematical function to model the distribution of an infinitely-sized population to quantify uncertainty. Sampling statistics, however, uses a priori knowledge of the size of the target population to inform quantifying uncertainty. The big lesson I learned after taking survey sampling is that if you assume the correct model, then the two statistical philosophies agree. But if your assumed model is wrong, the two approaches give different results. (And one approach has fewer assumptions, bee tee dubs.)
Sampling statistics also has a big bag of other tricks, too many to do justice here. But it provides frameworks for handling missing or biased data, combining data on subpopulations whose sample proportions differ from their proportions of the population, how to sample when subpopulations have very different statistical characteristics, etc.
As I write this, it is entirely possible to earn a PhD in statistics and not take a single course in sampling or survey statistics. Many federal agencies hire statisticians and then send them immediately back to school to places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)
I can't claim to be certain, but I think that sampling statistics became esoteric for two reasons. First, surveys (and data collection in general) have traditionally been expensive. Until recently, there weren't many organizations except for the government that had the budget to conduct surveys properly and regularly. (Obviously, there are exceptions.) Second, model-based statistics tend to work well and have broad applicability. You can do a lot with a laptop, a .csv file, and the right education. My guess is that these two factors have meant that the vast majority of statisticians and statistician-like researchers have become consumers of data sets, rather than producers. In an age of "big data" this seems to be changing, however.

Response rates for surveys have been dropping for years, causing frustration among statisticians and skepticism from the public. Having a lower response rate doesn't just mean your confidence intervals get wider. Given the nature of many surveys, it's possible (if not likely) that the probability a person responds to the survey may be related to one or a combination of relevant variables. If unaddressed, such non-response can damage an analysis. Addressing the problem drives up the cost of a survey, however.
Consider measuring unemployment. A person is considered unemployed if they don't have a job and they are looking for one. Somebody who loses their job may be less likely to respond to the unemployment survey for a variety of reasons. They may be embarrassed, they may move back home, they may have lost their house! But if the government sends a survey or interviewer and doesn't hear back, how will it know if the respondent is employed, unemployed (and looking), or off the job market completely? So, they have to find out. Time spent tracking a respondent down is expensive!
So, if you are collecting data that requires a response, you must consider who isn't responding and why. Many people anecdotally chalk this effect up to survey fatigue. Aren't we all tired of being bombarded by websites and emails asking us for "just a couple minutes" of our time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)

### In Practice: Political Polling in 2012 and Beyond

In context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation was billed as covering "two things that people in [Washington D.C.] absolutely love. One of those things is political campaigns. The other thing is using data to estimate causal effects in subgroups of controlled experiments!" Woooooo! Controlled experiments! Causal effects! Subgroup analysis! Be still, my beating heart.
Aaron earned a PhD in political science from Princeton and has been involved in three of the last four presidential campaigns designing surveys, analyzing collected data, and providing actionable insights for the Democratic party. His blog is here. (For the record, I am strictly non-partisan and do not endorse anyone's politics though I will get in knife fights over statistical practices.)

In an hour-long presentation, Aaron laid a foundation for sampling and polling in the 21st century, revealing how political campaigns and businesses track our data, analyze it, and what the future of surveying may be. The most profound insight I got was to see how the traditional practices of sampling statistics were being blended with 21st century data collection methods, through apps and social media. Whether these changes will address the decline is response rates or only temporarily offset them remains to be seen.Some highlights:

• The number of households that have only wireless telephone service is reaching parity with the number having land line phone service. When considering only households with children (excluding older people with grown children and young adults without children) the number sits at 45 percent.
• Offering small savings on wireless bills may incentivize the taking of flash polls through smart phones.
• Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials, to evaluate the efficacy of different campaign messages on voting outcomes. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)