May 25, 2022

Digital Privacy: “You Can Probably Be Identified From Your Anonymized Data”

From Naked Security:

If you thought that removing identifying information from a database of sensitive personal records was enough to retain privacy, it’s time to think again. A study published this week asserts that it’s even easier to re-identify information than we first thought.


The study, released in Nature Communications, calls all that into question. Its authors at the Université catholique de Louvain (Belgium) and at Imperial College London (UK) say that it’s easy to re-identify a high percentage of people in de-identified data sets.


What this latest research proves is that it’s even easier than we thought to reconstruct people’s identities, even when only a tiny subset of the data is released. When it comes to de-identification, it suggests that it might be time to go back to the drawing board.

The researchers have created an online tool that lets you check to see how identifiable you might be given your own characteristics.

Learn More, Read the Complete Article

Direct to Research Article: Estimating The Success Of Re-Identifications In Incomplete Datasets Using Generative Models (via Nature Communications)

While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.

About Gary Price

Gary Price ( is a librarian, writer, consultant, and frequent conference speaker based in the Washington D.C. metro area. Before launching INFOdocket, Price and Shirl Kennedy were the founders and senior editors at ResourceShelf and DocuTicker for 10 years. From 2006-2009 he was Director of Online Information Services at, and is currently a contributing editor at Search Engine Land.