Data Book Review: Anonymizing Health Data

Laura is a data enthusiast, genomicist, and book lover in Washington, DC, so this book was right up her alley. You can read more of her stuff at her blog, Wasabi Powered Lifestyle, or connect with her on Twitter @lauranlorenz. Also please note that DC2 is an Amazon Affiliate and will get a little bit of money if you click on the book link and buy it.

You might refrain from putting your SSN in any web forms; you could slip your credit card into an RFID blocking sleeve before putting it in your wallet; you may shred all your mail. Despite all your identity precautions, certain institutions – banks, medical offices, the government – have personally identifying information of yours. But did you know that, hidden somewhere in the privacy policy you didn’t fully read, sharing of that data is often allowed, even without your direct consent? The caveat is that it must be anonymized; that is, individual records must be de-identified and individuals must be untraceable from the shared data.

Institutions share data for lots of reasons, from outsourcing needed internal analyses to complying with federal regulations. Health data in particular is frequently shared between institutions, whether it be for research purposes, to track disease outbreaks, or to monitor insurance claim patterns. The US Government has attempted to protect individual privacy during these frequent data sharing events by publishing minimum standards for de-identification of health data sets in the HIPAA Safe Harbor standard. But with so many different data sets and so many different types of uses, how do we know the Safe Harbor standard is enough? What about when it is too conservative and the data is useless? And how do we go about implementing it?

Dr. Khaled El Emam and Luk Arbuckle’s new book from O’Reilly Media, Inc., titled Anonymizing Health Data: Case Studies and Methods to Get You Started, provides a primer on de-identifying health data to meet and surpass the minimal HIPAA Safe Harbor standard. These two Canadian privacy experts work together at Dr. El Emam’s company Privacy Analytics, Inc., a data anonymization software and consulting company offering privacy solutions to global clients. Though El Emam has recently published a thorough walkthrough of de-identifying health data for privacy practitioners, this co-authored, 195-page survey allows healthcare professionals, ethics committees, data custodians, aspiring privacy specialists, and data receivers to learn the basics.

The minimal HIPAA Safe Harbor standard flags 18 data elements for removal, such as names, telephone numbers, and zip codes, a “cookie cutter” approach to de-identification that El Emam and Arbuckle find unsophisticated and sometimes crippling to data utility. Instead, they advocate a risk-based approach that takes into account possible attacker motives and capacity, as well as the needs of the data user, to balance re-identification risk and the ability of the de-identified data to provide meaningful insights.
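As a toy illustration of the “cookie cutter” approach (my own sketch, not code from the book), a Safe Harbor-style pass blanket-suppresses or truncates fields regardless of context; the record and field names here are invented:

```python
def safe_harbor_style(rec):
    """Blanket de-identification in the spirit of Safe Harbor:
    drop direct identifiers, keep only the first 3 digits of the ZIP,
    and aggregate ages of 90 and over into a single "90+" category."""
    out = dict(rec)
    out.pop("name", None)                  # direct identifier removed outright
    out["zip"] = out["zip"][:3] + "**"     # ZIP truncated to first 3 digits
    if out["age"] > 89:
        out["age"] = "90+"                 # ages >= 90 collapsed together
    return out

# Hypothetical patient record, purely for illustration.
record = {"name": "Jane Doe", "zip": "20009", "age": 87, "diagnosis": "flu"}
print(safe_harbor_style(record))
```

The rigidity is the point: every record is transformed the same way, whether or not the truncated fields were actually risky for that data set.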

The book is organized into 13 chapters that cover a broad range of topics and include case studies from Privacy Analytics, Inc. projects, such as de-identifying the BORN pregnancy, birth, and childhood registry, the World Trade Center Health Registry, and the i2b2 genetic information database. For each case study, the authors walk through their methodology, which starts with a risk assessment that models potential threats, then iterates over increasing generalization or suppression of specific data types until the risk of re-identification is acceptably low. Topics such as anonymizing free-form text (such as clinical notes in electronic medical records) and preventing geoproxy attacks (in which attackers triangulate useful identifiers such as zip code by tracking an individual’s preferred pharmacy or hospital locations) provide insight into issues overlooked by the HIPAA standard.
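The iterate-until-acceptable-risk loop can be sketched with a simple k-anonymity-style risk measure (1 over the size of the smallest group of records sharing the same quasi-identifiers). This is my own simplified sketch, not the authors’ software; the data, generalization ladder, and threshold are all invented:

```python
from collections import Counter

# Toy quasi-identifiers: (ZIP code, age). Values are illustrative.
records = [("20009", 34), ("20009", 36), ("20010", 34),
           ("20010", 35), ("20009", 35), ("20010", 36)]

def risk(recs):
    """Worst-case re-identification risk: 1 / size of the smallest
    equivalence class (group of records with identical quasi-identifiers)."""
    counts = Counter(recs)
    return 1 / min(counts.values())

def generalize(recs, zip_digits, age_band):
    """Coarsen quasi-identifiers: truncate ZIPs and band ages."""
    return [(z[:zip_digits], a // age_band * age_band) for z, a in recs]

# Try progressively stronger generalization until risk is acceptable.
threshold = 0.34  # i.e. require at least 3 records per equivalence class
for zip_digits, age_band in [(5, 1), (5, 5), (3, 5), (3, 10)]:
    g = generalize(records, zip_digits, age_band)
    if risk(g) <= threshold:
        break

print(zip_digits, age_band, risk(g))
```

The real methodology also weighs attacker motives and data-user needs when setting the threshold; this sketch only shows the mechanical iterate-and-measure loop.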

This lightweight book is an excellent introduction to health data privacy concerns, and the well-respected authors provide a clear overview and useful case studies to advocate their risk-based methodology. However, this book is not in-depth enough to guide the de-identification of a data set on its own, and in an effort to be concise, the authors assume basic probability fluency and gloss over integral statistical concepts. First-timers will enjoy the read and get hung up only a few times where the explanations are too vague; serious practitioners should check out Dr. El Emam’s full treatise for the background, proofs, technical descriptions, and equations needed to properly model threats and implement the de-identification process.

August 2013 Data Science DC Event Review: Confidential Data

This is a guest post by Brand Niemann, former Sr. Enterprise Architect at EPA, and Director and Sr. Data Scientist at Semantic Community. He previously wrote a blog post for DC2 about the upcoming SOA, Semantics, and Data Science conference. The August Data Science DC Meetup provided the contrasting views of a data scientist and a statistician on a controversial problem: the use of "restricted data".

Open Government Data can be restricted because of the Open Data Policy of the US Federal Government, as outlined at:

  • Public Information: All datasets accessed through are confined to public information and must not contain National Security information as defined by statute and/or Executive Order, or other information/data that is protected by other statute, practice, or legal precedent. The supplying Department/Agency is required to maintain currency with public disclosure requirements.
  • Security: All information accessed through is in compliance with the required confidentiality, integrity, and availability controls mandated by Federal Information Processing Standard (FIPS) 199 as promulgated by the National Institute of Standards and Technology (NIST) and the associated NIST publications supporting the Certification and Accreditation (C&A) process. Submitting Agencies are required to follow NIST guidelines and OMB guidance (including C&A requirements).
  • Privacy: All information accessed through must be in compliance with current privacy requirements including OMB guidance. In particular, Agencies are responsible for ensuring that the datasets accessed through have any required Privacy Impact Assessments or System of Records Notices (SORN) easily available on their websites.
  • Data Quality and Retention: All information accessed through is subject to the Information Quality Act (P.L. 106-554). For all data accessed through, each agency has confirmed that the data being provided through this site meets the agency's Information Quality Guidelines.
  • Secondary Use: Data accessed through do not, and should not, include controls over its end use. However, as the data owner or authoritative source for the data, the submitting Department or Agency must retain version control of datasets accessed. Once the data have been downloaded from the agency's site, the government cannot vouch for their quality and timeliness. Furthermore, the US Government cannot vouch for any analyses conducted with data retrieved from

Federal Government Data is also governed by the Principles and Practices for a Federal Statistical Agency Fifth Edition:

Statistical researchers are granted access to restricted Federal Statistical and other data on the condition that public disclosure will not violate the laws and regulations associated with these data; otherwise, the fundamental trust involved in the collection and reporting of these data is violated and the data collection methodology is compromised.

Tommy Shen, a data scientist and the first presenter, commented afterwards: "One of the reasons I agreed to present yesterday is that I fundamentally believe that we, as a data science community, can do better than sums and averages; that instead of settling for the utility curves presented to us by government agencies, can expand the universe of the possible information and knowledge that can be gleaned from the data that your tax dollars and mine help to collect without making sacrifices to privacy."

Daniell Toth, a mathematical statistician, described the methods he uses in his work for a government agency as follows:

  • Identity
    • Suppression; Data Swapping
  • Value
    • Top-Coding; Perturbation;
    • Synthetic Data Approaches
  • Link
    • Aggregation/Cell Suppression; Data Smearing
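Two of the value-protection methods above, top-coding and perturbation, can be sketched briefly. This is a rough illustration of the general techniques, not code from the talk, and the income figures and parameters are invented:

```python
import random

# Hypothetical income column for illustration.
incomes = [42_000, 55_000, 61_000, 73_000, 250_000, 1_200_000]

def top_code(values, cap):
    """Top-coding: clip values above a cap to the cap itself, hiding
    extreme (and therefore potentially identifying) outliers."""
    return [min(v, cap) for v in values]

def perturb(values, scale, seed=0):
    """Perturbation: add random noise so published values cannot be
    matched exactly back to a source record."""
    rng = random.Random(seed)
    return [v + rng.gauss(0, scale) for v in values]

print(top_code(incomes, 150_000))
print(perturb(incomes, 1_000))
```

Both transformations trade some utility (the true tail of the distribution, exact values) for protection, which is exactly the trade-off the speaker emphasized.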

His slides include examples of each method and he concluded:

  • Protecting data always involves a trade-off of utility
  • You must know what you are trying to protect
  • We discussed a number of methods – the best depends on the intended use of the data and what you are protecting

My comment was twofold. The first speaker needs to employ the services of a professional statistician who knows how to anonymize and/or aggregate data while preserving its statistical properties. The second speaker needs to explain that decision makers in the government have access to the raw data and detailed results, and that the public needs to work with available open government data and lobby their Congressional Representatives to support legislation like the Data Act of 2013.

Also of note, SAS provides simulated statistical data sets for training and the Data Transparency Coalition has a conference on September 10th, Data Transparency 2013, to discuss ways to move forward.

Overall, excellent Meetup! I suggest we have the event host, Capital One Labs, speak at a future Meetup to tell us about the work they do, and especially their recent acquisition of Bundle to advance their big data agenda. "Bundle gives you unbiased ratings on businesses based on anonymous credit card data."

For more, see the event slides and audio: