In a new paper, University of Chicago computer scientist Aloni Cohen deals the latest decisive blow to the most popular deidentification techniques. When datasets containing personal information are shared for research or used by companies, data holders try to disguise the data – removing the final one or two digits of a zip code, for example – while still preserving its utility for insight. But while deidentification is often intended to satisfy legal requirements for data privacy, the most commonly used methods stand on shaky technical ground.
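The zip-code example above can be sketched in a few lines of Python. This is an illustrative sketch only – the function name, the masking style, and the sample records are invented for this example, not taken from the paper:

```python
# Illustrative sketch of one common deidentification step:
# generalizing a quasi-identifier by truncating a zip code.
# The helper name and sample data are hypothetical.

def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` digits of a zip code, masking the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

records = [
    {"zip": "60637", "age": 34, "diagnosis": "flu"},
    {"zip": "60615", "age": 41, "diagnosis": "asthma"},
]

# Replace each exact zip with its coarser prefix.
deidentified = [{**r, "zip": generalize_zip(r["zip"])} for r in records]
# Both records now share the generalized value "606**".
```

The idea is that coarsening the value makes more individuals share it, so any one record is harder to single out – at the cost of some analytical precision.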
“Even by the regulatory standards, there’s a problem here,” said UChicago computer scientist Aloni Cohen. “Policymakers care about real-world risks instead of hypothetical risks. So people have argued that the risks security and privacy researchers pointed out were hypothetical or very contrived. The goal when you’re doing that sort of technique is to redact as little as you need to guarantee a target level of anonymity. But if you achieve that goal of redacting just as little as you need, then the fact that that’s the minimum might tell you something about what was redacted. If what you want to do is take data, sanitize it, and then forget about it – put it on the web or give it to some outside researchers and decide that all your privacy obligations are done – you can’t do that using these techniques. They should not free you of your obligations to think about and protect the privacy of that data.”
By describing a new kind of attack called “downcoding,” and demonstrating the vulnerability of a deidentified dataset from an online education platform, Cohen sends a warning that these data transformations should not be considered sufficient to protect individuals’ privacy. Deidentification works by redacting quasi-identifiers – pieces of information that, combined with data from a second source, can re-identify a data subject. Failing to account for all possible quasi-identifiers can lead to disclosures.
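The linkage risk described above can be made concrete with a small sketch. All names, fields, and records below are invented for illustration; the point is only that quasi-identifiers surviving deidentification (here, an age and a zip prefix) can be joined against a public auxiliary source that carries names:

```python
# Hypothetical linkage attack: quasi-identifiers left in a
# "deidentified" dataset are matched against an auxiliary dataset
# (e.g. a public voter roll) that includes names. Invented data.

deidentified = [
    {"zip3": "606", "age": 34, "diagnosis": "flu"},
    {"zip3": "606", "age": 41, "diagnosis": "asthma"},
]

# Auxiliary public records containing identities.
auxiliary = [
    {"name": "Alice", "zip": "60637", "age": 34},
    {"name": "Bob", "zip": "60615", "age": 41},
]

# Join on the surviving quasi-identifiers: zip prefix and age.
reidentified = [
    {"name": a["name"], "diagnosis": d["diagnosis"]}
    for d in deidentified
    for a in auxiliary
    if a["zip"].startswith(d["zip3"]) and a["age"] == d["age"]
]
# Each sensitive record now links to exactly one named person.
```

In this toy example the age alone is enough to make every match unique, which is precisely the failure mode: a quasi-identifier the data holder did not account for undoes the redaction.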