Machine learning can find you in a heavily sampled, anonymised dataset
- 25 July, 2019 14:06
A sample set of anonymised data can be reverse engineered using machine learning techniques to re-identify individuals, a paper by researchers from Imperial College London and UCLouvain in Belgium has demonstrated.
Stripping a dataset's records of direct identifiers like name and email address, and sharing only a small proportion of them, has been the main method of sharing data while preserving people’s privacy.
The intuition is that there may be multiple people who are, say, in their 30s, female, and living in Brisbane. Any record matching those demographics in a sample of anonymised data could conceivably belong to any number of individuals.
“The issue? It does not work,” the researchers said.
With just a few more attributes a record quickly becomes more exceptional. The researchers’ statistical model quantifies the likelihood that a re-identification attempt would be successful, even with a “heavily incomplete” dataset.
For example, according to an online tool that demonstrates the work, with just my gender, marital status, date of birth and post code, I have an 86 per cent chance of being correctly identified in any anonymised dataset.
“This is pretty standard information for companies to ask for,” said lead author Dr Yves-Alexandre de Montjoye.
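The effect of adding attributes can be illustrated with a toy calculation. The sketch below is not the researchers' model: it builds a hypothetical population of 20,000 people with made-up, independently drawn attributes (gender, age, postcode area, marital status, birthday) and counts how many records are the only one with their combination of the first few attributes. The function names and value ranges are assumptions chosen purely for illustration.

```python
import random
from collections import Counter


def make_population(n, seed=0):
    """Build a synthetic population; every attribute range here is made up."""
    rng = random.Random(seed)
    return [
        (
            rng.choice(["F", "M"]),                                     # gender
            rng.randrange(18, 90),                                      # age in years
            rng.randrange(0, 50),                                       # postcode area (hypothetical)
            rng.choice(["single", "married", "divorced", "widowed"]),   # marital status
            rng.randrange(1, 366),                                      # birthday as day of year
        )
        for _ in range(n)
    ]


def fraction_unique(records, num_attributes):
    """Share of records whose first `num_attributes` values match no other record."""
    keys = [record[:num_attributes] for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


population = make_population(20000)
uniqueness = [fraction_unique(population, k) for k in range(1, 6)]

for k, share in enumerate(uniqueness, start=1):
    print(f"{k} attribute(s): {share:.1%} of records are unique")
```

With gender alone nobody is unique, while with all five attributes almost everyone in this synthetic population is. Real attributes are correlated and unevenly distributed, so the exact figures would differ in practice.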
In the paper – Estimating the success of re-identifications in incomplete datasets using generative models, published in Nature Communications – the researchers estimate that 99.98 per cent of Americans would be correctly re-identified in any available anonymised dataset by using just 15 characteristics, including age, gender, and marital status.
Validated on 210 datasets from demographic and survey data, the researchers claim that their technique – which uses Gaussian copulas to model uniqueness – shows “that even extremely small sampling fractions are not sufficient to prevent re-identification and protect your data”.
“Contrary to popular belief, sampling a dataset does not provide plausible deniability and does not effectively protect people's privacy,” de Montjoye added.
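Why sampling offers little deniability can also be sketched with synthetic data. The example below does not use the paper's Gaussian copula model; it simply counts records directly, which is only possible here because the whole made-up population is known. It draws a 5 per cent sample and asks: of the records that are unique within the sample, how many are also unique in the full population, so that a match could not belong to anyone else? The population, attribute ranges and function name are all assumptions for illustration.

```python
import random
from collections import Counter

rng = random.Random(1)

# Hypothetical population: (gender, age, postcode area, marital status, birthday)
population = [
    (
        rng.choice("FM"),
        rng.randrange(18, 90),
        rng.randrange(50),
        rng.randrange(4),
        rng.randrange(1, 366),
    )
    for _ in range(20000)
]

# Release only a 5 per cent sample of the records.
sample = rng.sample(population, 1000)


def match_confidence(num_attributes):
    """Among records unique in the sample, the share also unique in the population."""
    population_counts = Counter(r[:num_attributes] for r in population)
    sample_counts = Counter(r[:num_attributes] for r in sample)
    sample_unique = [key for key, count in sample_counts.items() if count == 1]
    if not sample_unique:
        return None  # nothing in the sample stands out on these attributes
    hits = sum(1 for key in sample_unique if population_counts[key] == 1)
    return hits / len(sample_unique)


confidence = {k: match_confidence(k) for k in (3, 4, 5)}

for k, share in confidence.items():
    print(f"{k} attributes: {share:.1%} of sample-unique records are unique overall")
```

In this toy setting, a record that stands out in the sample on three attributes usually has look-alikes elsewhere in the population, but with five attributes a unique match in the sample is almost always the only such person overall, despite 95 per cent of the records being withheld.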
Mundane facts, taken together
Traded and shared datasets often contain many attributes. For example, data broker Experian sold Alteryx access to an anonymised dataset containing 248 attributes per household for 120 million Americans.
“While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on January 5, are driving a red sports car, and live with two kids, both girls and one dog. There is probably one and only one,” said co-author Dr Luc Rocher.
There are few protections from such re-identification attempts – although the Australian federal government has previously considered criminalising the re-identification of Commonwealth datasets released as part of its open data program.
Sampling anonymised data means it is no longer subject to data protection regulations – like the EU’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) – meaning it can be freely used and sold to third parties like advertising companies and data brokers.
In its De-identification Guide from last year, the Office of the Australian Information Commissioner states that sampling creates “uncertainty that any particular person is even included in the dataset”.
“Companies and governments downplay the risk of re-identification by arguing that the datasets they sell are always incomplete. Our findings show this might not help,” said de Montjoye.
There are many examples of supposedly anonymous datasets being released and later re-identified.
In 2016, journalists re-identified public figures in an anonymised browsing history dataset of 3 million German citizens they acquired for free from a data broker, allowing them to discover a judge’s porn preferences and the medication used by an MP.
The same year, Melbourne University researchers were able to decrypt individual service provider ID numbers in a 10 per cent sample of medical billing records released by the Australian Department of Health. This provided a potential route for Medicare service providers in the dataset to be identified.
A year later, the same researchers showed how patients could also be re-identified by linking unencrypted parts of the record with known information about the individual.
“A few mundane facts taken together often suffice to isolate an individual,” Culnane et al noted.
The Imperial College and UCLouvain researchers have called for tighter rules around anonymised data sharing.
“The goal of anonymisation is to help use data to benefit society. This is extremely important but should not and does not have to happen at the expense of people’s privacy,” said co-author Professor Julien Hendrickx.
“It is essential for anonymisation standards to be robust and account for new threats like the one demonstrated in this paper,” he added.