August 2013 Data Science DC Event Review: Confidential Data

This is a guest post by Brand Niemann, former Sr. Enterprise Architect at EPA, and Director and Sr. Data Scientist at Semantic Community. He previously wrote a blog post for DC2 about the upcoming SOA, Semantics, and Data Science conference. The August Data Science DC Meetup presented the contrasting views of a data scientist and a statistician on a controversial problem: the use of "restricted data".

Open Government Data can be restricted under the Open Data Policy of the US Federal Government, as outlined at Data.gov:

  • Public Information: All datasets accessed through Data.gov are confined to public information and must not contain National Security information as defined by statute and/or Executive Order, or other information/data that is protected by other statute, practice, or legal precedent. The supplying Department/Agency is required to maintain currency with public disclosure requirements.
  • Security: All information accessed through Data.gov is in compliance with the required confidentiality, integrity, and availability controls mandated by Federal Information Processing Standard (FIPS) 199 as promulgated by the National Institute of Standards and Technology (NIST) and the associated NIST publications supporting the Certification and Accreditation (C&A) process. Submitting Agencies are required to follow NIST guidelines and OMB guidance (including C&A requirements).
  • Privacy: All information accessed through Data.gov must be in compliance with current privacy requirements including OMB guidance. In particular, Agencies are responsible for ensuring that the datasets accessed through Data.gov have any required Privacy Impact Assessments or System of Records Notices (SORN) easily available on their websites.
  • Data Quality and Retention: All information accessed through Data.gov is subject to the Information Quality Act (P.L. 106-554). For all data accessed through Data.gov, each agency has confirmed that the data being provided through this site meets the agency's Information Quality Guidelines.
  • Secondary Use" Data accessed through Data.gov do not, and should not, include controls over its end use. However, as the data owner or authoritative source for the data, the submitting Department or Agency must retain version control of datasets accessed. Once the data have been downloaded from the agency's site, the government cannot vouch for their quality and timeliness. Furthermore, the US Government cannot vouch for any analyses conducted with data retrieved from Data.gov.

Federal Government Data is also governed by the Principles and Practices for a Federal Statistical Agency Fifth Edition:

Statistical researchers are granted access to restricted Federal Statistical and other data on the condition that public disclosure will not violate the laws and regulations associated with these data; otherwise, the fundamental trust involved in the collection and reporting of these data is violated and the data collection methodology is compromised.

Tommy Shen, a data scientist and the first presenter, commented afterwards: "One of the reasons I agreed to present yesterday is that I fundamentally believe that we, as a data science community, can do better than sums and averages; that instead of settling for the utility curves presented to us by government agencies, [we] can expand the universe of the possible information and knowledge that can be gleaned from the data that your tax dollars and mine help to collect without making sacrifices to privacy."

Daniell Toth, a mathematical statistician, described the methods he uses in his work for a government agency as follows (a brief code sketch of several of these methods appears after the list):

  • Identity
    • Suppression; Data Swapping
  • Value
    • Top-Coding; Perturbation; Synthetic Data Approaches
  • Link
    • Aggregation/Cell Suppression; Data Smearing
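
To make a few of these methods concrete, below is a minimal Python sketch of top-coding, perturbation, and data swapping, assuming a toy table of (region, income) records; the dataset, the top-code cap, and the noise scale are invented for illustration and are not taken from Toth's slides.

```python
import random

# Toy microdata: each record is (region, income). Values are invented.
records = [
    ("DC", 52000), ("DC", 61000), ("MD", 48000),
    ("MD", 250000), ("VA", 75000), ("VA", 58000),
]

TOP_CODE = 150000  # illustrative cap on extreme, potentially identifying values
NOISE_SD = 2000    # illustrative scale of the random perturbation

def top_code(value, cap=TOP_CODE):
    # Value protection: replace any value above the cap with the cap itself.
    return min(value, cap)

def perturb(value, sd=NOISE_SD):
    # Value protection: add zero-mean Gaussian noise so exact values
    # cannot be matched back to a respondent.
    return round(value + random.gauss(0, sd))

def swap_pair(rows):
    # Identity protection: swap the region attribute between two randomly
    # chosen records, weakening the link between record and respondent.
    rows = list(rows)
    i, j = random.sample(range(len(rows)), 2)
    (region_i, value_i), (region_j, value_j) = rows[i], rows[j]
    rows[i], rows[j] = (region_j, value_i), (region_i, value_j)
    return rows

protected = [(region, perturb(top_code(income)))
             for region, income in swap_pair(records)]
print(protected)
```

Each transformation gives up some utility for protection: top-coding distorts the tail of the income distribution, noise blurs individual values while roughly preserving the mean, and swapping scrambles the region-income link, which is exactly the trade-off Toth went on to emphasize in his conclusions.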

His slides include examples of each method, and he concluded:

  • Protecting data always involves a trade-off in utility
  • You must know what you are trying to protect
  • We discussed a number of methods; the best choice depends on the intended use of the data and on what you are protecting

My comment was twofold. The first speaker needs to employ the services of a professional statistician who knows how to anonymize and/or aggregate data while preserving its statistical properties. The second speaker needs to explain that decision makers in the government have access to the raw data and detailed results, and that the public needs to work with the available open government data and lobby their Congressional Representatives to support legislation like the DATA Act of 2013.

Also of note: SAS provides simulated statistical data sets for training, and the Data Transparency Coalition is holding a conference on September 10th, Data Transparency 2013, to discuss ways to move forward.

Overall, an excellent Meetup! I suggest we have the event host, Capital One Labs, speak at a future Meetup to tell us about the work they do, and especially about their recent acquisition of Bundle to advance their big data agenda. "Bundle gives you unbiased ratings on businesses based on anonymous credit card data."

For more, see the event slides and audio: