Data Book Review: Anonymizing Health Data

Laura is a data enthusiast, genomicist, and book lover in Washington, DC, so this book was right up her alley. You can read more of her stuff at her blog, Wasabi Powered Lifestyle, or connect with her on Twitter @lauranlorenz. Also, please note that DC2 is an Amazon Affiliate and will get a little bit of money if you click on the book link and buy it.

You might refrain from putting your SSN in any web forms; you could slip your credit card into an RFID-blocking sleeve before putting it in your wallet; you may shred all your mail. Despite all your identity precautions, certain institutions (banks, medical offices, the government) have personally identifying information of yours. But did you know that, hidden somewhere in the privacy policy you didn’t fully read, sharing of that data is often allowed, even without your direct consent? The caveat is that it must be anonymized; that is, individual records must be de-identified and individuals must be untraceable from the shared data.

Institutions share data for lots of reasons, from outsourcing needed internal analyses to complying with federal regulations. Health data in particular is frequently shared between institutions, whether for research purposes, to track disease outbreaks, or to monitor insurance claim patterns. The US Government has attempted to protect individual privacy during these frequent data sharing events by publishing minimum standards for de-identification of health data sets in the HIPAA Safe Harbor standard. But with so many different data sets and so many different types of uses, how do we know the Safe Harbor standard is enough? What about when it is too conservative and the data is useless? And how do we go about implementing it?

Dr. Khaled El Emam and Luk Arbuckle’s new book from O’Reilly Media, Inc., titled Anonymizing Health Data: Case Studies and Methods to Get You Started, provides a primer on de-identifying health data to meet and surpass the minimal HIPAA Safe Harbor standard. These two Canadian privacy experts work together at Dr. El Emam’s company Privacy Analytics, Inc., a data anonymization software and consulting company offering privacy solutions to global clients. Though El Emam has recently published a thorough walkthrough for de-identifying health data for privacy practitioners, this co-authored, 195-page survey allows healthcare professionals, ethics committees, data custodians, aspiring privacy specialists, and data receivers to learn the basics.

The minimal HIPAA Safe Harbor standard flags 18 data elements for removal, such as names, telephone numbers, and zip codes. El Emam and Arbuckle find this “cookie cutter” approach to de-identification unsophisticated and sometimes crippling to data utility. Instead, they advocate a risk-based approach that takes into account possible attacker motives and capacity, as well as the needs of the data user, to balance re-identification risk against the ability of the de-identified data to provide meaningful insights.
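To make the “cookie cutter” idea concrete, here is a minimal sketch of list-based removal. The field names and the sample record are hypothetical, and only the three example elements mentioned above are included rather than the full list of 18; this is not an implementation of the standard itself.

```python
# Hypothetical subset of flagged fields; the real Safe Harbor list has 18 elements.
FLAGGED_FIELDS = {"name", "telephone", "zip_code"}


def strip_flagged(record, flagged=FLAGGED_FIELDS):
    """Return a copy of the record with every flagged field dropped."""
    return {key: value for key, value in record.items() if key not in flagged}


# Made-up patient record for illustration only.
patient = {
    "name": "Jane Doe",
    "telephone": "555-0100",
    "zip_code": "20001",
    "age": 34,
    "diagnosis": "asthma",
}

cleaned = strip_flagged(patient)
print(cleaned)
```

The removal is the same for every data set and every recipient, which is exactly the rigidity the authors object to: it never asks who might attack the data or what the recipient needs from it.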

The book is organized into 13 chapters which cover a broad range of topics and include case studies from Privacy Analytics, Inc. projects such as de-identifying the BORN pregnancy, birth, and childhood registry, the World Trade Center Health registry, and the i2b2 genetic information database. For each case study, the authors go through their methodology, which starts with a risk assessment that models potential threats, and then iteratively increases generalization or suppression of specific data types until the risk of re-identification is acceptably low. Topics such as anonymizing free-form text (such as clinical notes in electronic medical records) and preventing geoproxy attacks (when attackers triangulate useful identifiers such as zip code by tracking an individual’s preferred pharmacy or hospital locations) provide insight into issues overlooked by the HIPAA standard.
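The iterate-until-acceptable loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors’ method: the records are made up, the risk measure is a simple worst-case one (1 divided by the size of the smallest group of identical records), the generalization rules are hypothetical, and suppression is left out for brevity.

```python
from collections import Counter

# Made-up records: (age, zip_code) act as the quasi-identifiers.
RECORDS = [
    (34, "20001"), (36, "20002"), (35, "20009"),
    (52, "20740"), (58, "20742"), (51, "20783"),
]


def generalize(record, level):
    """Coarsen a record: wider age bands and shorter zip prefixes as level grows."""
    age, zip_code = record
    band = 5 * (2 ** level)                # 5-, 10-, 20-year bands, and so on
    low = (age // band) * band
    keep = max(len(zip_code) - level, 0)   # mask one more trailing digit per level
    masked_zip = zip_code[:keep] + "*" * (len(zip_code) - keep)
    return (f"{low}-{low + band - 1}", masked_zip)


def max_risk(rows):
    """Assumed risk measure: 1 / size of the smallest group of identical rows."""
    counts = Counter(rows)
    return 1 / min(counts.values())


def anonymize(records, threshold, max_level=5):
    """Raise the generalization level until the risk falls to the threshold."""
    for level in range(max_level + 1):
        rows = [generalize(r, level) for r in records]
        risk = max_risk(rows)
        if risk <= threshold:
            return level, rows, risk
    raise ValueError("threshold not reachable by generalization alone")


level, rows, risk = anonymize(RECORDS, threshold=0.5)
print(level, round(risk, 3), rows[0])
```

The threshold here is an arbitrary illustrative value; in a risk-based approach it would come out of the threat modeling step, and the generalization rules would be chosen with the data user’s needs in mind.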

This lightweight book is an excellent introduction to health data privacy concerns and the well-respected authors provide a clear overview and useful case studies to advocate their risk-based methodology. However, this book is not in-depth enough to guide the de-identification of a data set on its own, and in an effort to be concise, the authors assume basic probability fluency and glibly gloss over integral statistical concepts. First-timers will enjoy the read and get hung up only a few times where the explanations are too vague; serious practitioners should check out Dr. El Emam’s full treatise for the background, proofs, technical descriptions and equations needed to properly model threats and implement the de-identification process.