
The Fall of the P-Value

We at Data Community DC wanted to highlight a very interesting and relevant article for data practitioners published over at Nature.com. For most people, P-values are the "gold standard" by which the validity of scientific results is measured. However, mounting evidence suggests that this isn't the case. Further, the growing use of online experimentation has prompted a new wave of individuals, not necessarily indoctrinated in the field of statistics, to question the relevance of the P-value. Curious? We highly recommend checking out the article, excerpted below:

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01 — usually interpreted as 'very significant'. Publication in a high-impact journal seemed within Motyl's grasp.

If you want to read more, head on over to the article here.
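To make the article's concern concrete, here is a small back-of-the-envelope simulation (our own toy example, not from the Nature piece; the effect size, sample sizes, and thresholds are arbitrary assumptions) showing how often a study that lands near P = 0.01 is backed up by an identical replication at the conventional 0.05 level:

```python
# A rough, self-contained simulation (not from the Nature article) of how often a
# study that lands near P = 0.01 is reproduced by an identical follow-up study.
# The effect size, sample size, and thresholds below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, effect = 50, 0.35          # per-group sample size and assumed true standardized effect
originals, replications = 0, 0

for _ in range(20000):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    p_orig = stats.ttest_ind(a, b).pvalue
    if 0.005 < p_orig < 0.02:             # original study landed near P = 0.01
        originals += 1
        a2 = rng.normal(0.0, 1.0, n)      # exact replication, same design
        b2 = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(a2, b2).pvalue < 0.05:
            replications += 1

print(f"Replications at P < 0.05: {replications / originals:.0%} of {originals} 'significant' originals")
```

Depending on the effect and sample size assumed, the replication rate can be surprisingly far from the certainty that a "very significant" P value seems to promise.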

Will big data bring a return of sampling statistics? And a review of Aaron Strauss's talk at DSDC

This guest post by Tommy Jones was originally published on Biased Estimates. Tommy is a statistician or data scientist -- depending on the context -- in Washington, DC. He is a graduate of Georgetown's MS program for mathematics and statistics. Follow him on Twitter @thos_jones.

Some Background

What is sampling statistics?

Sampling statistics concerns the planning, collection, and analysis of survey data. When most people take a statistics course, they are learning "model-based" statistics. (Model-based statistics is not the same as statistical modeling; stick with me here.) Model-based statistics quantifies uncertainty by using a mathematical function to model the distribution of an infinitely sized population. Sampling statistics, by contrast, uses a priori knowledge of the size of the target population to quantify uncertainty. The big lesson I learned after taking survey sampling is that if you assume the correct model, the two statistical philosophies agree. But if your assumed model is wrong, the two approaches give different results. (And one approach has fewer assumptions, bee tee dubs.)
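As a minimal illustration of that difference (with made-up numbers, and assuming simple random sampling without replacement), the design-based standard error of a sample mean uses the known population size N through a finite population correction that the usual model-based formula ignores:

```python
# Toy contrast between a model-based standard error (infinite population assumed)
# and a design-based one that uses the known, finite population size N.
# Values here are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
N = 5000                       # known size of the target population
population = rng.gamma(shape=2.0, scale=10.0, size=N)

n = 400                        # simple random sample without replacement
sample = rng.choice(population, size=n, replace=False)

s = sample.std(ddof=1)
se_model  = s / np.sqrt(n)                       # model-based: ignores N
se_design = s / np.sqrt(n) * np.sqrt(1 - n / N)  # design-based: finite population correction

print(f"sample mean     : {sample.mean():.2f}")
print(f"model-based SE  : {se_model:.3f}")
print(f"design-based SE : {se_design:.3f}  (FPC = sqrt(1 - n/N))")
```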
Sampling statistics also has a big bag of other tricks, too many to do justice here. But it provides frameworks for handling missing or biased data, for combining data on subpopulations whose sample proportions differ from their proportions in the population, for deciding how to sample when subpopulations have very different statistical characteristics, and so on.
As I write this, it is entirely possible to earn a PhD in statistics and not take a single course in sampling or survey statistics. Many federal agencies hire statisticians and then send them immediately back to school to places like UMD's Joint Program in Survey Methodology. (The federal government conducts a LOT of surveys.)
I can't claim to be certain, but I think that sampling statistics became esoteric for two reasons. First, surveys (and data collection in general) have traditionally been expensive. Until recently, there weren't many organizations except for the government that had the budget to conduct surveys properly and regularly. (Obviously, there are exceptions.) Second, model-based statistics tend to work well and have broad applicability. You can do a lot with a laptop, a .csv file, and the right education. My guess is that these two factors have meant that the vast majority of statisticians and statistician-like researchers have become consumers of data sets, rather than producers. In an age of "big data" this seems to be changing, however.

Much ado about response rates

Response rates for surveys have been dropping for years, causing frustration among statisticians and skepticism from the public. Having a lower response rate doesn't just mean your confidence intervals get wider. Given the nature of many surveys, it's possible (if not likely) that the probability a person responds to the survey may be related to one or a combination of relevant variables. If unaddressed, such non-response can damage an analysis. Addressing the problem drives up the cost of a survey, however.
Consider measuring unemployment. A person is considered unemployed if they don't have a job and they are looking for one. Somebody who loses their job may be less likely to respond to the unemployment survey for a variety of reasons: they may be embarrassed, they may have moved back home, they may have lost their house! But if the government sends a survey or an interviewer and doesn't hear back, how will it know whether the respondent is employed, unemployed (and looking), or off the job market completely? So it has to find out, and time spent tracking a respondent down is expensive!
So, if you are collecting data that requires a response, you must consider who isn't responding and why. Many people anecdotally chalk this effect up to survey fatigue. Aren't we all tired of being bombarded by websites and emails asking us for "just a couple minutes" of our time? (Businesses that send a satisfaction survey every time a customer contacts customer service take note; you may be your own worst data-collection enemy.)
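A tiny simulation makes the unemployment example concrete. The response probabilities below are invented, but they show how a naive respondent-only estimate drifts away from the truth when the unemployed respond less often:

```python
# Made-up illustration of nonresponse bias: if unemployed people respond less often,
# the naive respondent-only estimate understates the true unemployment rate.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
true_rate = 0.08
unemployed = rng.random(n) < true_rate

# Hypothetical response probabilities: 70% for the employed, 40% for the unemployed.
p_respond = np.where(unemployed, 0.40, 0.70)
responded = rng.random(n) < p_respond

naive_estimate = unemployed[responded].mean()
print(f"true unemployment rate : {true_rate:.1%}")
print(f"naive survey estimate  : {naive_estimate:.1%}")
```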

In Practice: Political Polling in 2012 and Beyond

In context of the above, Aaron Strauss's February 25th talk at DSDC was enlightening. Aaron's presentation was billed as covering "two things that people in [Washington D.C.] absolutely love. One of those things is political campaigns. The other thing is using data to estimate causal effects in subgroups of controlled experiments!" Woooooo! Controlled experiments! Causal effects! Subgroup analysis! Be still, my beating heart.
Aaron earned a PhD in political science from Princeton and has been involved in three of the last four presidential campaigns designing surveys, analyzing collected data, and providing actionable insights for the Democratic party. His blog is here. (For the record, I am strictly non-partisan and do not endorse anyone's politics though I will get in knife fights over statistical practices.)

In an hour-long presentation, Aaron laid a foundation for sampling and polling in the 21st century, revealing how political campaigns and businesses track our data, how they analyze it, and what the future of surveying may be. The most profound insight for me was seeing how the traditional practices of sampling statistics are being blended with 21st-century data collection methods through apps and social media. Whether these changes will address the decline in response rates or only temporarily offset it remains to be seen. Some highlights:

  • The number of households that have only wireless telephone service is reaching parity with the number having landline phone service. Considering only households with children (excluding older people with grown children and young adults without children), the figure sits at 45 percent.
  • Offering small savings on wireless bills may incentivize people to take flash polls on their smartphones.
  • Reducing the marginal cost of surveys allows political pollsters to design randomized controlled trials to evaluate the efficacy of different campaign messages on voting outcomes; a toy sketch of such a design follows this list. (As with all things statistics, there are tradeoffs and confounding variables with such approaches.)
  • Pollsters would love to get access to all of your Facebook data.
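For the randomized-trial bullet above, here is the kind of toy design that sentence describes: voters are randomly assigned one of two messages, and turnout is compared overall and within a subgroup. Every number here is invented; it is a sketch of the idea, not anything Aaron presented:

```python
# Hypothetical sketch of a message-testing randomized experiment: voters are randomly
# assigned a campaign message, and turnout is compared overall and within a subgroup.
# All numbers are invented.
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
young = rng.random(n) < 0.3                    # illustrative subgroup flag
message = rng.integers(0, 2, n)                # 0 = control script, 1 = new script

# Invented potential outcomes: the new script helps, and helps young voters more.
base = np.where(young, 0.35, 0.55)
lift = np.where(young, 0.05, 0.02)
turnout = rng.random(n) < (base + lift * message)

def effect(mask):
    treated, control = turnout[mask & (message == 1)], turnout[mask & (message == 0)]
    return treated.mean() - control.mean()

print(f"overall estimated lift : {effect(np.ones(n, bool)):+.3f}")
print(f"lift among young voters: {effect(young):+.3f}")
```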

Sampling Statistics and "Big Data"

Today, businesses and other organizations are tracking people at unprecedented levels. One rationale for big data being a "revolution" is that, for the first time, organizations have access to the full population of interest. For example, Amazon can track the purchasing history of 100% of its customers.

I would challenge the above argument, but won't outright disagree with it. Your current customer base may or may not be your full population of interest. You may, for example, be interested in people who don't purchase your product. You may wish to analyze a sample of your market to figure out who isn't purchasing from you and why. You may have access to some data on the whole population, but not all the variables you want.

More importantly, sampling statistics has tools that may allow organizations to design tracking schemes that gather the data most relevant to their questions of interest. To quote R.A. Fisher: "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: He may be able to say what the experiment died of." The world (especially the social-science world) is not static; priorities and people's behavior are sure to change.

Data fusion, the process of pulling together data from heterogeneous sources into one analysis, is not a survey. But these sources may represent observations and variables in proportions or frequencies that differ from the target population, so combining them with a simple merge may produce biased analyses. Sampling statistics has methods of using sample weights to combine strata of a stratified sample when some strata are over- or under-sampled (and there are reasons to do this intentionally).
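A minimal sketch of that weighting idea, with invented stratum sizes standing in for heterogeneous data sources: weight each stratum by its population share N_h / N instead of pooling records as if the merged file were a random sample.

```python
# Minimal sketch of the stratified-weighting idea: when strata (or data sources) are
# sampled out of proportion to their population shares, weight each stratum by
# N_h / N before estimating. All figures are invented.
import numpy as np

rng = np.random.default_rng(3)

# Population stratum sizes and (unknown in practice) true stratum means.
N_h = {"mobile_app": 80_000, "web": 15_000, "call_center": 5_000}
mu_h = {"mobile_app": 20.0, "web": 35.0, "call_center": 60.0}

# A data-fusion-style sample that heavily over-represents the smallest source.
n_h = {"mobile_app": 200, "web": 200, "call_center": 600}
samples = {h: rng.normal(mu_h[h], 5.0, n_h[h]) for h in N_h}

naive = np.concatenate(list(samples.values())).mean()              # simple merge
weighted = sum(N_h[h] * samples[h].mean() for h in N_h) / sum(N_h.values())

true_mean = sum(N_h[h] * mu_h[h] for h in N_h) / sum(N_h.values())
print(f"true population mean : {true_mean:.1f}")
print(f"naive pooled mean    : {naive:.1f}")
print(f"stratum-weighted mean: {weighted:.1f}")
```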

I am not proposing that sampling statistics will become the new hottest thing. But I would not be surprised if sampling courses move from the esoteric fringes, to being a core course in many or most statistics graduate programs in the coming decades. (And we know it may take over a hundred years for something to become the new hotness anyway.)

The professor who taught the sampling statistics course I took a few years ago is the chief of the Statistical Research Division at the U.S. Census Bureau. When I last saw him, at an alumni/prospective student mixer for Georgetown's math/stat program in 2013, he was wearing a button that said "ask me about big data." In a time when some think statistics is an old-school discipline relevant only for small data, seeing that button on a man whose specialty, even within statistics, is considered so "old school" that most statisticians have moved on made me chuckle. But it also made me think: things may be coming full circle for sampling statistics.

Links for further reading

A statistician's role in big data (my source for the R.A. Fisher quote, above)

November Data Science DC Event Review: Identifying Smugglers: Local Outlier Detection in Big Geospatial Data

This is a guest post from Data Science DC member and quantitative political scientist David J. Elkind.

At the November Data Science DC Meetup, Nathan Danneman, Emory University PhD and analytics engineer at Data Tactics, presented an approach to detecting unusual units within a geospatial data set. For me, the most enjoyable feature of Dr. Danneman's talk was his engaging presentation. I suspect that other data consultants have also spent quite some time reading statistical articles and lost quite a few hours attempting to trace back the authors' incoherent prose. Nathan approached his talk in a way that placed a minimal quantitative demand on the audience, instead focusing on the three essential components of his analysis: his analytical task, the outline of his approach, and the presentation of his findings. I'll address each of these in turn.

Analytical Task

Nathan was presented with the problem of locating maritime vessels in the Strait of Hormuz engaged in smuggling activities: sanctions against Iran have made it very difficult for Iran to engage in international commerce, so improving detection of smugglers crossing the Strait from Iran to Qatar and the United Arab Emirates would improve the effectiveness of the sanctions regime and increase pressure on the regime. (I’ve written about issues related to Iranian sanctions for CSIS’s Project on Nuclear Issues Blog.)

Having collected publicly accessible satellite positioning data of maritime vessels, Nathan had four fields for each craft at several time intervals within some period: speed, heading, latitude and longitude.

But what do smugglers look like? Unfortunately, Nathan's data set did not itself include any examples of watercraft that had been unambiguously identified as smugglers by, e.g., the US Navy, so he could not rely on historical examples of smuggling as a starting point for his analysis. Instead, he had to puzzle out how to leverage information about a craft's spatial location.

I’ve encountered a few applied quantitative researchers who, when faced with a lack of historical examples, would be entirely stymied in their progress, declaring the problem too hard. Instead of throwing up his hands, Dr. Danneman dug into the topic of maritime smuggling and found that many smuggling scenarios involve ship-to-ship transfers of contraband which take place outside of ordinary shipping lanes. This qualitatively-informed understanding transforms the project from mere speculation about what smugglers might look like into the problem of discovering maritime vessels which deviate too far from ordinary traffic patterns.

Importantly, framing the research in this way rests the validity of the inferences entirely on the notion that unusual ships are smugglers and smugglers are unusual ships. But in reality, there are many reasons a ship might not conform to ordinary traffic patterns – for example, pleasure craft and fishing boats might have irregular movement patterns that don't coincide with shipping lanes, and so look similar to the hypothesized smugglers.

Outline of Approach

The basic approach can be split into three parts: partitioning the strait into many grid squares, generating fake boats to compare against the real boats, and then training a logistic regression to use the four data fields (speed, heading, latitude and longitude) to differentiate the real boats from the fake ones.

Partitioning the strait into grid squares emphasizes the local character of ships' movements in that region. For example, a grid square partially containing a shipping channel will have many ships located in the channel and on headings that take them along it. Fake boats, generated with a bivariate uniform distribution over the grid square, will tend not to fall in the path of ordinary traffic, just like the hypothesized behavior of smugglers. The same goes for the uniformly distributed timestamps and otherwise randomly assigned boat attributes of the comparison set: these will all tend to stand apart from ordinary traffic. Therefore, training a model to differentiate between these two classes of behavior advances the goal of differentiating between smugglers and ordinary traffic.

Dr. Danneman described this procedure as unsupervised-as-supervised learning – a novel term for me, so forgive me if I'm loose with the particulars – but in this case it refers to the notion that there are two classes of data points: one drawn i.i.d. from some unknown density and another simulated via Monte Carlo methods from some known density. Pooling both samples gives a mixture of the two densities; the problem then becomes one of comparing the relative densities of the two classes of data points – that is, it is actually a restatement of the problem of logistic regression! Additional details can be found in The Elements of Statistical Learning (2nd edition, section 14.2.4, p. 495).
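Here is a toy reconstruction of that idea (my own sketch, not Dr. Danneman's code): label the observed points 1 and the uniformly simulated points 0, fit a logistic regression (with a quadratic basis expansion so a linear model can pick up the band of ordinary traffic), and then look at which real points the model has trouble telling apart from the uniform background.

```python
# Toy reconstruction (not Dr. Danneman's code) of unsupervised-as-supervised learning:
# label real observations 1, uniformly generated points 0, fit a classifier, and flag
# the real points the model scores as least "real".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(11)

# Invented "real" traffic: most boats hug a shipping lane near y = 5, a few do not.
lane = np.column_stack([rng.uniform(0, 10, 480), rng.normal(5, 0.3, 480)])
stragglers = rng.uniform(0, 10, (20, 2))
real = np.vstack([lane, stragglers])

# Fake boats: uniform over the same bounding box, as described in the talk.
fake = rng.uniform(0, 10, real.shape)

X = np.vstack([real, fake])
y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])

model = make_pipeline(PolynomialFeatures(2, include_bias=False),
                      LogisticRegression(max_iter=5000)).fit(X, y)

scores = model.predict_proba(real)[:, 1]   # low score = hard to tell from background
print("ten most outlying real boats:", np.argsort(scores)[:10])
```

In the talk this was done separately within each grid square, with speed and heading included as features; the sketch collapses that down to two coordinates to keep it short.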

Presentation of Findings

After fitting the model, we can examine which of the real boats the model rated as having low odds of being real – that is, boats that looked so similar to the randomly generated boats that the model had difficulty differentiating the two. These are the boats we might call "outliers" and, given the premise that ship-to-ship smuggling likely takes place aboard boats with unusual behavior, the ones more likely to be engaged in smuggling.

I will repeat here a slight criticism that I noted elsewhere and point out that the model output cannot be interpreted as a true probability, contrary to the results displayed in slide 39. In this research design, Dr. Danneman did not randomly sample from the population of all shipping traffic in the Strait of Hormuz to assemble a collection of smuggling craft and ordinary traffic in proportions roughly equal to their occurrence in nature. Rather, he generated one fake boat for each real boat. This is a case-control research design, so the intercept term of the logistic regression reflects the ratio of positive to negative cases in the data set rather than in the population. All of the terms in the model, including the intercept, are still maximum likelihood estimates, and all of the non-intercept terms are perfectly valid for comparing the odds of an observation being in one class or another. But to establish probabilities, one would have to adjust the intercept term using knowledge of the overall ratio of positives to negatives in the population.
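For readers who want the mechanics, a standard way to make that adjustment is the prior correction used for case-control (and rare-events) logistic regression, as in King and Zeng's work; this is a general textbook fix, not something from the talk. A minimal sketch:

```python
# Hedged sketch of the standard "prior correction" for a case-control design:
# only the intercept is biased, and it can be shifted if the population prevalence
# tau is known from some outside source (formula as in King & Zeng, 2001).
import numpy as np

def corrected_probability(linear_score, y_bar, tau):
    """linear_score: intercept + X @ beta from the case-control fit.
    y_bar: share of positives in the training sample (0.5 with one fake per real boat).
    tau:   assumed share of positives in the population of interest."""
    offset = np.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))
    return 1.0 / (1.0 + np.exp(-(linear_score - offset)))

# Example: a boat scored at 0.0 (i.e., 50%) under the 1:1 design, but "unusual" boats
# are assumed to be only 2% of real traffic.
print(corrected_probability(0.0, y_bar=0.5, tau=0.02))   # roughly 0.02, not 0.5
```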

In the question-and-answer session, some in the audience pushed back against the limited data set, noting that one could improve the results by incorporating other information specific to each ship (its flag, its shipping line, the type of craft, or other pieces of information). First, I believe that any application would leverage this information – were it available – and model it appropriately; however, as befits a pedagogical talk on geospatial outlier detection, this talk focused on leveraging geospatial data for outlier detection.

Second, it should be intuitive that including more information in a model might improve the results: the more we know about the boats, the more we can differentiate between them. Collecting more data is, perhaps, the lowest-hanging fruit of model improvement. I think it’s more worthwhile to note that Nathan’s highly parsimonious model achieved very clean separation between fake and real boats despite the limited amount of information collected for each boat-time unit.

The presentation and code may be found on Nathan Danneman's web site. The audio for the presentation is also available.

September 2013 Data Science DC Event Review: Data Mining for Patterns That Aren’t There

This is a guest post by Eunice Choi, a Health IT Consultant who is very interested in Data Science.

When a fortuitous event takes place, it is a very human inclination to be intrigued—and when such an event happens again in seemingly quick succession, we start to look for patterns. It is widely known that, in Data Mining, this ability to notice patterns is of great consequence—and in Big Data, this ability plays itself out in the data analysis process, in both intuitive and counterintuitive ways.

In addition to addressing what Jules Berman, in his book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, calls statistical method bias and ambiguity bias, the speakers at the Data Science DC Meetup illuminated an important issue for Data Miners that goes beyond ‘correlation is not causation’—the issue of whether the correlation itself is real or the result of repeated, massive exploration and modeling of the data.  In addition, by citing examples from the fields of Statistics, Predictive Analytics, Epidemiology, Biomedical Research, Marketing, and the Business and Government worlds, the speakers addressed a commonly seen issue stemming from overconfidence: viewing data validation as proof of causality. The talks provided a broad sense of context around the phenomenon of how repetitive computer intensive modeling can lead to overfitting and model underperformance. The speakers provided a deeper understanding of the process by which to determine which models and methods would offer the highest level of confidence, with a focus on best practices and remedies.

Attendee ratings, showing mostly positive experiences, and a desire for more technical content.

A compelling use case mentioned was: Do we see random events for their randomness, or do we see winning a $1 million lottery twice in a day as something beyond chance?

Jules Berman noted that, in order “To get the greatest value from Big Data resources, it is important to understand when a problem in one field has equivalence to a problem in another field.” In a set of talks that spoke to such equivalence, Peter Bruce, President of The Institute for Statistics Education at Statistics.com, and Gerhard Pilcher, VP and Senior Scientist at Elder Research, Inc. (ERI), who leads the Washington, DC office and all its federal civil work, presented on the topic of ‘Data Mining for Patterns That Aren’t There’.

During the first portion of the Meetup, networking took place outdoors over empanadas, and the atmosphere was collegial and friendly. Once attendees filtered in, they appreciated Jonathan Street's data visualization of when new members RSVP'd to the Meetup event--you can see the momentum building in the 5 days directly preceding the Meetup:

Graph

Peter Bruce spoke on the topic by drawing the audience into probability examples and discussing the ‘lack of replication’ problem in scientific research. He then observed that humans are unwilling to think that chance is responsible for patterns in datasets and expanded upon this further with an example of the human capacity to be fooled by randomness in which commodity traders were shown charts and were asked to comment on them. Charts such as the one below were produced by random chance, yet the commodity traders viewed the charts as being representative of specific, observed phenomena, and continued to do so even after being told that the actual series were random:

"Commodity Price"

He then spoke on how numerators and denominators figure into the question: did you see the interesting event and then conclude it was interesting? In that case the numerator could be huge, which would mean that "interestingness" decreases drastically.

To guide the audience to re-examine the significance of "statistically significant" correlations, Bruce cited epidemiology studies and other examples from health and science. For instance, in epidemiological studies of Bisphenol A (BPA), 1,000 people were involved and the models looked at 275 chemicals, 32 possible health outcomes, and 10 demographic variables. The high dimensionality and volume of the data created computational challenges--there were 9,000,000 possible models when accounting for all possible covariate inclusion/exclusion options. The example illustrates the adage 'try enough models with enough covariates and you'll get a correlation'--and also shows why that is not a sound approach to data mining. Bruce asserted that, in data mining, the proper use of a validation sample protects to some extent—but information about the validation data may leak into the modeling process through repeated model tuning against the validation data or via information gained during the exploration/preparation phase.
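The "try enough models and you'll get a correlation" point is easy to demonstrate with made-up data: below, an outcome that is pure noise is screened against many unrelated predictors (the counts loosely echo the BPA example), and the best-looking predictor still clears the usual significance bar.

```python
# Illustration (invented data) of "try enough models and you'll get a correlation":
# with a pure-noise outcome and many candidate predictors, the smallest p-value
# still looks "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_people, n_predictors = 1000, 275            # loosely echoing the BPA example
outcome = rng.normal(size=n_people)           # pure noise, unrelated to anything
predictors = rng.normal(size=(n_people, n_predictors))

pvals = [stats.pearsonr(predictors[:, j], outcome)[1] for j in range(n_predictors)]
print(f"smallest p-value over {n_predictors} unrelated predictors: {min(pvals):.4f}")
```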

Gerhard Pilcher discussed the ‘Vast Search Effect’ (i.e., what statisticians call the ‘multiple testing problem’ or ‘data snooping’), which he defined as ‘trying to find something interesting, whether that finding is real or the effect of random chance.’ He focused his talk on points of inquiry around the main question: Are orange cars really least likely to be bad buys?

From an initial bar graph on the proportion of bad buys by car color, it would appear that orange cars were least likely to be bad buys:

What Color Car Would You Buy?

However, Pilcher pointed out that the hypothesis was developed after seeing the data and the data were not partitioned (the hypothesis was tested on the same data), which is an instantiation of the 'Vast Search Effect.' To set the stage for why the Vast Search Effect matters, one compelling fact Pilcher mentioned was that Bayer Laboratories found they could not replicate 67% of positive findings claimed in medical journals.

Pilcher also showed a great example of a financial model built on two variables that looked excellent in terms of the numbers. However, when the response surface was plotted (with the return shown in red), the model's stability turned out to be very low, so it could not continue to be used.

A Financial Example

To avoid the Vast Search Effect, Pilcher offered the following solutions:

  • Partitioning – breaking out the dataset into training, validation, and/or test data sets (making sure to avoid using the testing set to revise the training set)

  • Statistical Inference – deduce and test a new hypothesis

  • Simulation – sampling without replacement (e.g., target-shuffling and checking for proportions)

Key takeaways from Pilcher’s talk included the following: Hypothesis tests work when:

  1. The hypothesis comes first, the analysis second;
  2. The data is partitioned into training and testing datasets; and
  3. The logic incorporates practical significance in addition to statistical significance.

Pilcher emphasized the importance of the human element in determining what makes sense in a computer's output in data mining. In addition, he compared modern machine learning algorithms, which learn by induction, to linear regression, and made the point that when learning by induction one is inducing what the data are trying to tell us, thereby creating nonlinear surfaces—and that in that situation it is likely one will overfit one's model. He therefore used random target shuffling to test different algorithms, emphasizing that one ought to test an algorithm against the data while asking: how much is the algorithm overfitting random data?
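A minimal sketch of the target-shuffling idea (invented data and an arbitrary model choice, not Pilcher's actual setup): refit the same model on randomly permuted outcomes and compare. If the shuffled targets fit about as well as the real one, the apparent signal is what chance and overfitting alone would produce.

```python
# Target-shuffling sketch: compare the model's fit on the real target with its fit on
# permuted targets. The data here are pure noise, so the "real" fit is no better than
# the shuffled ones, which is exactly what the comparison is meant to reveal.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 30))
y = rng.normal(size=200)                       # outcome unrelated to the predictors

model = LinearRegression()
real_r2 = model.fit(X, y).score(X, y)

shuffled_r2 = []
for _ in range(50):
    y_perm = rng.permutation(y)                # break any real X-y link
    shuffled_r2.append(model.fit(X, y_perm).score(X, y_perm))

print(f"in-sample R^2 on the real target: {real_r2:.2f}")
print(f"typical R^2 on shuffled targets : {np.mean(shuffled_r2):.2f}")
```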

By having us consider these questions, the speakers balanced their cautionary word on overfitting models with their assertion that data validation also depends on meaningful results—and the best  ways to arrive at the hypotheses and processes that lead to such results. If the modeling process could be likened to an expansive cube, personally, the effect of pondering these considerations was like walking around such an expansive cube and examining it for all of its contours--in addition to peering inside of it to understand its properties.

For more on the presentations, see the following resources:

2013 September DSMD Event: Teaching Data Science to the Masses

For Data Science MD's September meetup, we were very fortunate to have the very talented and very passionate Dr. Jeff Leek speak about his experiences teaching data science through the online learning platform Coursera. This was also a unique event for DSMD because it was the first meetup to feature only one speaker. Having one speaker talk for a whole hour can be a disaster if the speaker is unable to keep the attention of the audience. However, Dr. Leek is a dynamic and engaging speaker and had no problem keeping the attention of everyone in the room, including a couple of middle school students.

For those of you who are not familiar with Dr. Leek, he is a biostatistician at Johns Hopkins University as well as an instructor in the JHU biostatistics program. His biostatistics work typically entails analyzing human genome sequencing data to provide insights to doctors and patients in the form of raw data and advanced visualizations. However, when he is not revolutionizing the medical world or teaching the great biostatisticians of tomorrow at JHU, you may find him teaching his course on Coursera or adding new content to his blog, Simply Statistics.

Now, on to the talk. Johns Hopkins and specifically Dr. Leek got involved in teaching a Coursera course because they have constantly been looking at ways to improve learning for their students. They had been "flipping the classroom" by taking lectures and posting them to YouTube so that students could review the lecture material before class and then use the classroom time to dig deeper into specific topics. Because online videos are such a vital component of Massive Open Online Classes (MOOCs), it is no surprise that they took the next leap.

But just in case you think that JHU and Dr. Leek are new to this whole data science "thing," check out their Kaggle team's results for the Heritage Health Prize.

jhu-kaggle-team

Even though their team fell a few places when scored on the private data, they still had a very impressive showing, considering that 1,358 teams entered and there were over 20,000 entries. But what exactly does data science mean to Dr. Leek? Check out his expanded components of data science chart, which differs from similar charts by other data scientists by also showing the root disciplines of each component.

expanded-fusion-of-data-science

But what does the course look like?

course-setup

He covers topics such as types of analyses, how to organize a data analysis, and data munging, as well as others like:

concepts-machine-learning

statistics-concept

One of the interesting things to note, though, is that he also shows examples of poor data analysis attempts. There is a core problem with the statistics example above (pointed out by high school students). Below is another example:

concepts-confounding

And this course, along with two other courses taught by other JHU faculty, Computing for Data Analysis and Mathematical Biostatistics Bootcamp, has had a very positive response.

jhu-course-enrollment

But how do you teach that many people effectively? That is where the power of Coursera comes in; JHU could have chosen other providers like edX or Udacity but decided to go with Coursera. The videos make it easy to convey knowledge, and message boards provide a mechanism for asking questions. Dr. Leek even had students answering questions for other students, so that all he had to do was validate the responses. But he also pointed out that his class's message board was just like all other message boards and followed the 1/98/1 rule: 1% of people respond in a mean and unhelpful way, 1% are very nice and very helpful, and the other 98% don't care and don't really respond.

One of the most distinctive aspects of Coursera is that it scales to tens of thousands of students by using peer grading. Each person submits one assignment and grades four others, and the final score for each student is the median of the four peer scores. The rubric used in Dr. Leek's class is below.

grading-rubric
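The grading rule itself is tiny: each submission receives four peer scores and the recorded grade is their median, which blunts the effect of a single unfair grader. A toy illustration with invented scores:

```python
# Toy illustration of the peer-grading rule described above: the recorded grade is the
# median of four peer scores, so one outlier grader has limited effect.
import statistics

peer_scores = {
    "student_a": [9, 10, 9, 2],    # one harsh outlier grader
    "student_b": [6, 7, 5, 6],
}
final_grades = {s: statistics.median(scores) for s, scores in peer_scores.items()}
print(final_grades)   # {'student_a': 9.0, 'student_b': 6.0}
```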

The result of this grading policy, based on Dr. Leek's analysis, is that good students received good grades, poor students received poor grades, and middle students' grades fluctuated a fair amount. So it seems the policy mostly works, but there is still room for improvement.

But why do Johns Hopkins and Dr. Leek even support this model of learning? They do have full-time jobs that involve teaching, after all. Well, besides being huge supporters of open source technology and open learning, they also see many other reasons for supporting this paradigm.

partial-motivation

Check out the video for the many other reasons why JHU further supports this paradigm. And, while you are at it, see if you can figure out if the x and y axes are related in some way. This was our data science/statistics problem for the evening. The answer can also be found in the video.

stat-challenge

http://www.youtube.com/playlist?list=PLgqwinaq-u-NNCAp3nGLf93dh44TVJ0vZ

We also got a sneak peek at a new tool/component that integrates directly into R - swirl. Look for a meetup or blog post about this tool in the future.

swirl

Our next meetup is on October 9th at Advertising.com in Baltimore beginning at 6:30PM. We will have Don Miner speak about using Hadoop for Data Science. If you can make it, come out and join us.

August 2013 Data Science DC Event Review: Confidential Data

This is a guest post by Brand Niemann, former Sr. Enterprise Architect at EPA, and Director and Sr. Data Scientist at Semantic Community. He previously wrote a blog post for DC2 about the upcoming SOA, Semantics, and Data Science conference.

The August Data Science DC Meetup provided the contrasting views of a data scientist and a statistician on a controversial problem: the use of "restricted data".

Open Government Data can be restricted because of the Open Data Policy of the US Federal Government as outlined at Data.gov:

  • Public Information: All datasets accessed through Data.gov are confined to public information and must not contain National Security information as defined by statute and/or Executive Order, or other information/data that is protected by other statute, practice, or legal precedent. The supplying Department/Agency is required to maintain currency with public disclosure requirements.
  • Security: All information accessed through Data.gov is in compliance with the required confidentiality, integrity, and availability controls mandated by Federal Information Processing Standard (FIPS) 199 as promulgated by the National Institute of Standards and Technology (NIST) and the associated NIST publications supporting the Certification and Accreditation (C&A) process. Submitting Agencies are required to follow NIST guidelines and OMB guidance (including C&A requirements).
  • Privacy: All information accessed through Data.gov must be in compliance with current privacy requirements including OMB guidance. In particular, Agencies are responsible for ensuring that the datasets accessed through Data.gov have any required Privacy Impact Assessments or System of Records Notices (SORN) easily available on their websites.
  • Data Quality and Retention: All information accessed through Data.gov is subject to the Information Quality Act (P.L. 106-554). For all data accessed through Data.gov, each agency has confirmed that the data being provided through this site meets the agency's Information Quality Guidelines.
  • Secondary Use: Data accessed through Data.gov do not, and should not, include controls over its end use. However, as the data owner or authoritative source for the data, the submitting Department or Agency must retain version control of datasets accessed. Once the data have been downloaded from the agency's site, the government cannot vouch for their quality and timeliness. Furthermore, the US Government cannot vouch for any analyses conducted with data retrieved from Data.gov.

Federal Government Data is also governed by the Principles and Practices for a Federal Statistical Agency Fifth Edition:

Statistical researchers are granted access to restricted Federal Statistical and other data on the condition that their public disclosure will not violate the laws and regulations associated with these data; otherwise, the fundamental trust involved in the collection and reporting of these data is violated and the data collection methodology is compromised.

Tommy Shen, a data scientist and the first presenter, commented afterwards: "One of the reasons I agreed to present yesterday is that I fundamentally believe that we, as a data science community, can do better than sums and averages; that instead of settling for the utility curves presented to us by government agencies, can expand the universe of the possible information and knowledge that can be gleaned from the data that your tax dollars and mine help to collect without making sacrifices to privacy."

Daniell Toth, a mathematical statistician, described the methods he uses in his work for a government agency as follows:

  • Identity
    • Suppression; Data Swapping
  • Value
    • Top-Coding; Perturbation;
    • Synthetic Data Approaches
  • Link
    • Aggregation/Cell Suppression; Data Smearing
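As a toy illustration of two of the value-protection methods above (invented income data, not Dr. Toth's actual procedures), top-coding caps extreme values and perturbation adds noise, both trading some utility for protection:

```python
# Toy illustration (not Dr. Toth's actual procedures) of two value-protection methods
# from the list above: top-coding extreme incomes and perturbing values with noise.
import numpy as np

rng = np.random.default_rng(8)
incomes = rng.lognormal(mean=10.8, sigma=0.6, size=1000)   # invented income data

top_coded = np.minimum(incomes, 150_000)                   # top-coding at a cap
perturbed = incomes * rng.normal(1.0, 0.05, incomes.size)  # multiplicative noise

print(f"original mean : {incomes.mean():,.0f}")
print(f"top-coded mean: {top_coded.mean():,.0f}")
print(f"perturbed mean: {perturbed.mean():,.0f}")
```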

His slides include examples of each method and he concluded:

  • Protecting data always involves a trade-off of utility
  • You must know what you are trying to protect
  • We discussed a number of methods – the best depends on the intended use of the data and what you are protecting

My comment was that the first speaker needs to employ the services of a professional statistician who knows how to anonymize and/or aggregate data while preserving its statistical properties, and that the second speaker needs to explain that decision makers in the government have access to the raw data and detailed results, while the public needs to work with available open government data and lobby their Congressional representatives to support legislation like the Data Act of 2013.

Also of note, SAS provides simulated statistical data sets for training and the Data Transparency Coalition has a conference on September 10th, Data Transparency 2013, to discuss ways to move forward.

Overall, an excellent Meetup! I suggest we have event host CapitalOne Labs speak at a future Meetup to tell us about the work they do, and especially their recent acquisition of Bundle to advance their big data agenda. "Bundle gives you unbiased ratings on businesses based on anonymous credit card data."

For more, see the event slides and audio:

Weekly Round-Up: Statisticians, Build Smart DC, Kirk Borne, and Treating Parkinson's

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from collecting building data to treating Parkinson's. In this week's round-up:

  • Statisticians: An Endangered Species?
  • Washington DC Launches Real-time Building Energy Data Project
  • Time Spent with Kirk Borne
  • Michael J. Fox Foundation Points Big Data At Parkinson's

Statisticians: An Endangered Species?

Our first piece this week is an interesting blog post on the Revolution Analytics blog about how statisticians are perceived and how that relates to data science. The post was inspired by an American Statistical Association Magazine article that portrayed statisticians as being left in the dust of the big data movement. The author goes on to talk about how he was surprised at how little mention there was of R in the article and how contributing to the statistical programming language may be a good way for statisticians to continue to play an important role in data science.

Washington DC Launches Real-time Building Energy Data Project

Our next piece is a GigaOM article about a project that launched last week called Build Smart DC. The project monitors energy data from city-owned buildings at 15 minute intervals to provide management with a much more granular view of energy use in the properties than ever before. This will allow them to monitor trends and make data-driven decisions that will lead to more efficient energy consumption. The article also goes on to talk about the startup that is driving this program and some other cities that have similar projects in place.

Time Spent with Kirk Borne

Our third piece is an interesting short interview with Kirk Borne. Kirk is a Professor of Astrophysics and Computational Science at George Mason University and has been one of the most influential Big Data advocates on Twitter in recent years. He talks to the interviewer about astrophysics, big data, and data science education.

Michael J. Fox Foundation Points Big Data At Parkinson's

Our final article this week is an InformationWeek piece about how the Michael J. Fox Foundation put on a Kaggle competition to see if data scientists could help identify patients who had Parkinson's and track increases and decreases in symptoms among patients who had the disease. The article highlights the winning team in the competition, some of the methods they used to generate their predictive models, and how they were able to acquire the domain knowledge that ultimately helped them win the competition.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Big Data ROI, Statistics, GE, and China

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Big Data's return on investment to its progress in China. In this week's round-up:

  • Big Data ROI Still Tough To Measure
  • What Statistics Should Do About Big Data
  • GE CEO Jeff Immelt’s Big Data Bet
  • In China, Big Data Is Becoming Big Business

Big Data ROI Still Tough To Measure

This is an article about how difficult it is to measure the return on investment of big data solutions. Given all the hype in the media, business leaders naturally want to know whether their investments in these solutions are paying off. The article goes on to describe some of the complexities involved and talks about some of the obstacles that will have to be overcome in order for business leaders to feel more satisfied with the solutions they invest in.

What Statistics Should Do About Big Data

This is a blog post by Jeff Leek continuing the recent discussions about the role of statistics in big data. Jeff writes about what he thinks some of the issues raised in previous conversations boil down to, and then provides his thoughts about what statisticians need to do in order to not get left out of the big data discussion. He concludes the post with a list of things he'd like to see come out of these discussions that would help the discipline progress to the next level.

GE CEO Jeff Immelt’s Big Data Bet

This is a summary of GE CEO Jeff Immelt's interview at the D11 conference this past week, which centered around how data collected from sensors can make machines more efficient - what GE calls the Industrial Internet. The article provides some examples of where GE is trying to implement these practices and explains why it's important for GE to be doing this. If you'd like to see the full interview, you can find the video here.

In China, Big Data Is Becoming Big Business

Our last article this week is a Bloomberg BusinessWeek piece about how big data is progressing in China. The fact that it is such a large country and the fact that an increasing number of its citizens are using technology means that the quantity of data generated is rapidly increasing. This article talks about how data scientists will be in high demand there in the near future and how both government and businesses are working on building infrastructure that can support their data needs.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Google's Quantum Computer, Data Science vs. Statistics & BI, Business Computing, and Detecting Terrorism Networks

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from Google's new quantum computer to detecting terrorist networks. In this week's round-up:

  • Google Buys a Quantum Computer
  • Statistics vs. Data Science vs. BI
  • Could Business Computing Be Done by Users Without Technical Experience?
  • Can Math Models Be Used to Detect Terrorism Networks?

Google Buys a Quantum Computer

With the ever-increasing amount and complexity of data out there, companies at the edge of technology are starting to look for faster and more efficient ways to process, analyze, and put to use the data that is available to them. That is what Google seems to be working toward as they have purchased a quantum computer and are partnering with NASA to find ways to apply quantum computing to machine learning. This article has some more details about how they are looking to use it and what other companies are also looking into quantum computing.

Statistics vs. Data Science vs. BI

This is an interesting Smart Data Collective article that takes a stab at trying to differentiate between statistics, data science, and business intelligence. The author is a statistician, but ultimately feels that data scientist more accurately describes the work that he does and that's what led him to want to do the comparisons. Check it out and see how much you agree/disagree with his descriptions of each.

Could Business Computing Be Done by Users Without Technical Experience?

This is an article about business computing, how most of it is done using traditional spreadsheet programs, and what the difficulties and challenges that come with it have been. The author describes where spreadsheets are useful, but also where they have their shortcomings. At the end, he introduces a desktop BI solution called esCalc that attempts to correct many of these shortcomings and explains how it does so.

Can Math Models Be Used to Detect Terrorism Networks?

This article is about a paper published last month in the SIAM Journal on Discrete Mathematics. The subject of the paper was disrupting information flow in complex real-world networks, such as terrorist organizations. The article describes the similarities between terrorist networks and other hierarchical organizations and even some social networks. The article also talks about the type of model the authors are using and how the model works.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: Data Science Education, Statistics, Data Driven Organizations, and Data Stories

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data education to data-driven organizations. In this week's round-up:

  • Universities Offer Courses in a Hot New Field: Data Science
  • Data Science: The End of Statistics?
  • How Do You Create a Data-Driven Organization?
  • Tell Better Data Stories with Motion and Interactivity

Universities Offer Courses in a Hot New Field: Data Science

This is a NY Times article about how more and more schools are now offering degrees in data science. The article explains that the demand for these skills has been growing rapidly in the last few years and that schools are adapting their curriculum to the demands of the market. The author provides quotes from faculty at several of the universities mentioned in the article and also some details about the content of some of the programs at these schools.

Data Science: The End of Statistics?

This was an interesting blog post posing questions about why statistics is sometimes left out of the data science hype. The author takes a shot at briefly proposing answers, but at the end solicits answers from the readers. The comments section of this post is excellent and well worth reading, with several folks with a wide range of experience chiming in to help answer the questions and shed some more light on this topic.

How Do You Create a Data-Driven Organization?

This is an excellent blog post about how to create a data driven organization. The author just switched jobs to a company where he needs to overhaul how data is collected, stored, analyzed, and reported; and in this post he walks the reader through his thoughts on doing that and the steps he is taking to get all this done. The process includes information gathering and learning about the business, training, infrastructure, metrics, and reporting mediums. Each of these parts has sub-sections with comments and considerations.

Tell Better Data Stories with Motion and Interactivity

This is a Harvard Business Review article about using motion and interactivity as tools when visualizing data over time. The article has several videos embedded in it that serve as examples and help further explain how these tools can be effective. At the end, the author provides three valuable takeaways when putting visualizations together yourself.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups

Weekly Round-Up: CIA Big Data, Unifying Mean/Median/Mode, New Data Startups, and Naked Statistics

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have 4 fascinating articles ranging in topics from data startups to statistics lessons. In this week's round-up:

  • CIA Presentation on Big Data
  • Modes, Medians and Means: A Unifying Perspective
  • A Couple New Notable Data Startups
  • Naked Statistics: Stripping the Dread From the Data

CIA Presentation on Big Data

This is a Business Insider article about the presentation made by CIA Chief Technology Officer, Ira "Gus" Hunt, at GigaOM's Structure data conference in New York. The presentation was about how the agency plans to capture, store, and use the vast amounts of data it is able to collect. The article includes some highlights of the talk and a link to Hunt's slides from the presentation. The video and transcript of the entire talk can be found on GigaOM's website here.

Modes, Medians and Means: A Unifying Perspective

This is a post published earlier this week on the blog of John Myles White, co-author of Machine Learning for Hackers, where he tackles the task of explaining the relationships between mean, median, and mode; noting that this particularly important topic is usually excluded from introductory statistics courses. His explanation of the relationships between the three summary statistics comes across as intuitive and very well structured. For those that have a grasp on basic statistics, this post will definitely help you understand things a little deeper.

A Couple New Notable Data Startups

This week, I came across a couple articles about new startups in the data space that should be interesting to watch grow. The first was a TechCrunch article about Fivetran, a company that wants to reinvent spreadsheets so that they can handle the more modern data analysis tasks that have outpaced the functionality of traditional spreadsheets. Fivetran is backed by Paul Graham's startup incubator, Y-Combinator, and the article provides an overview of the problems they are trying to solve and how they are trying to solve them.

The second data startup article was about Wise.io, a company that is trying to provide machine learning as a service to the masses. The article talks about what they're trying to accomplish, where they got the idea from, and some of their sources of revenue (they are bootstrapped and already profitable).

Naked Statistics: Stripping the Dread From the Data

This is an interesting review of the recently released book Naked Statistics by Charles Wheelan on the Economist website. The book aims to strip away the complexity and explain statistics intuitively by using language, examples, and humor that most people can identify with. The review describes some of the specific examples used in the book to illustrate statistical concepts, comments on some of the other ways Wheelan has chosen to deliver the material, and highlights some of the things you will learn from reading the book.

That's it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there's something we missed, feel free to let us know in the comments below.

Read Our Other Round-Ups