Making data anonymous can be tricky
For the London Free Press - October 5, 2009
PERSONAL INFORMATION: Reidentification of individuals by comparing anonymous data to other sources of data can be surprisingly easy
Making personal information anonymous is trickier than it seems.
Anonymous data created from information about individuals can be a valuable commodity. It is becoming apparent, though, that keeping individual data anonymous is not as easy as one might think. Reidentification of individuals by comparing anonymous data to other sources of data can be surprisingly easy.
We start with a database containing, for example, information on our purchases or our medical information. The identifying information, such as our names and addresses, is removed. That way, those using the remaining information can't figure out who the detailed information relates to. Or at least that's the theory.
This type of information is very useful to a wide range of people. Social scientists can use the information in conducting research. Medical trends can be discerned. Advertisers can analyze buying patterns and determine the effectiveness of marketing strategies.
A study by Paul Ohm, an associate professor at the University of Colorado Law School, revealed troubling facts regarding anonymous data.
Ohm's article revealed the surprising ease with which some have been able to "deanonymize" individuals from supposedly anonymous data.
In his study, Ohm describes the work of Latanya Sweeney, then a graduate student in computer science. She set out to "reidentify" people whose personal information had been released in anonymous form.
She reviewed the records of the Massachusetts Group Insurance Commission, which had released "anonymized" data on the health records of all state employees. Before the data were released, personal identifiers, such as names, addresses and social security numbers, were removed to protect the privacy of those included in the data.
Sweeney, now an associate professor at Carnegie Mellon University in Pittsburgh, Pa., found that by using basic information from municipal voter rolls -- such as names, ZIP codes and dates of birth -- she could identify a large number of people in the databank. This revealed private information such as hospital visits and prescriptions.
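The kind of cross-referencing described here can be sketched in a few lines of Python. This is only an illustration, not Sweeney's actual method or data: the records, names and field names below are all made up, and the matching rule (an exact match on ZIP code, birthdate and sex that points to exactly one voter) is an assumption chosen to keep the example small.

```python
from collections import defaultdict

# Hypothetical "anonymized" health records: names removed, other fields kept.
health_records = [
    {"zip": "02138", "birthdate": "1950-03-14", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02139", "birthdate": "1962-07-01", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birthdate": "1962-07-01", "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public voter roll: names listed next to the same basic fields.
voter_roll = [
    {"name": "A. Example", "zip": "02138", "birthdate": "1950-03-14", "sex": "M"},
    {"name": "B. Sample", "zip": "02139", "birthdate": "1962-07-01", "sex": "F"},
    {"name": "C. Placeholder", "zip": "02139", "birthdate": "1962-07-01", "sex": "F"},
    {"name": "D. Nobody", "zip": "02140", "birthdate": "1975-11-30", "sex": "F"},
]

KEY_FIELDS = ("zip", "birthdate", "sex")


def key(row):
    """Build the lookup key shared by both data sets."""
    return tuple(row[field] for field in KEY_FIELDS)


def reidentify(records, roll):
    """Return (name, record) pairs where the key matches exactly one voter."""
    voters_by_key = defaultdict(list)
    for voter in roll:
        voters_by_key[key(voter)].append(voter["name"])

    matches = []
    for record in records:
        names = voters_by_key.get(key(record), [])
        if len(names) == 1:  # an unambiguous match puts a name on the record
            matches.append((names[0], record))
    return matches


matches = reidentify(health_records, voter_roll)
for name, record in matches:
    print(name, "->", record["diagnosis"])
```

In this made-up sample only the first health record is tied to a name, because two voters share the second combination of ZIP code, birthdate and sex. The more distinctive the combination, the more likely a record points to a single person.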
To illustrate her findings, Sweeney obtained the health records of William Weld, governor of Massachusetts at the time, and sent a copy to his office.
Sweeney found that 87% of all Americans could be personally identified using only their ZIP code, birthdate and sex.
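A figure like this comes from asking how many people share their combination of ZIP code, birthdate and sex with no one else. The sketch below shows that calculation on a tiny invented population; the function name, field names and data are assumptions for illustration and have nothing to do with the census data behind Sweeney's number.

```python
from collections import Counter


def unique_fraction(people, fields=("zip", "birthdate", "sex")):
    """Fraction of people whose combination of `fields` is shared by no one else."""
    counts = Counter(tuple(person[f] for f in fields) for person in people)
    unique = sum(
        1 for person in people
        if counts[tuple(person[f] for f in fields)] == 1
    )
    return unique / len(people)


# Hypothetical population: the last two people share all three fields.
population = [
    {"zip": "02138", "birthdate": "1950-03-14", "sex": "M"},
    {"zip": "02138", "birthdate": "1950-03-14", "sex": "F"},
    {"zip": "02140", "birthdate": "1975-11-30", "sex": "F"},
    {"zip": "02139", "birthdate": "1962-07-01", "sex": "F"},
    {"zip": "02139", "birthdate": "1962-07-01", "sex": "F"},
]

fraction = unique_fraction(population)
print(f"{fraction:.0%} of this sample is unique on ZIP code, birthdate and sex")
```

Here three of the five people are unique on those three fields, so the sketch reports 60%.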
There does not appear to be a simple solution to the issues of data collection. As Ohm notes, data can either be perfectly anonymous or useful, but they cannot always be both.
Scrubbing data clean of all possible personal identifiers reduces the ability to use them for any meaningful research.
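One way to see the trade-off is to coarsen the identifying fields and watch what happens. The sketch below uses generalization, a common scrubbing technique that this column does not itself name: birthdates are cut down to the year and ZIP codes to their first three digits. The data, field names and the choice of how much to coarsen are all assumptions made for illustration.

```python
from collections import Counter


def unique_fraction(people, fields=("zip", "birthdate", "sex")):
    """Fraction of people whose combination of `fields` is shared by no one else."""
    counts = Counter(tuple(person[f] for f in fields) for person in people)
    unique = sum(
        1 for person in people
        if counts[tuple(person[f] for f in fields)] == 1
    )
    return unique / len(people)


def generalize(person):
    """Coarsen a record: keep only the ZIP prefix and the year of birth."""
    return {
        "zip": person["zip"][:3],              # "02138" becomes "021"
        "birthdate": person["birthdate"][:4],  # "1950-03-14" becomes "1950"
        "sex": person["sex"],
    }


# Hypothetical records, each unique on the exact fields.
records = [
    {"zip": "02138", "birthdate": "1950-03-14", "sex": "M"},
    {"zip": "02139", "birthdate": "1950-09-02", "sex": "M"},
    {"zip": "02139", "birthdate": "1962-07-01", "sex": "F"},
    {"zip": "02134", "birthdate": "1962-12-25", "sex": "F"},
]

before = unique_fraction(records)
after = unique_fraction([generalize(r) for r in records])
print(f"unique before scrubbing: {before:.0%}, after: {after:.0%}")
```

In this sample nobody is unique once the fields are coarsened, but a researcher can no longer tell one neighbourhood from another or compare people born in different months. That lost detail is the price of the added privacy.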
The consequences of such data being publicly available are serious.
As Ohm observes, "For almost every person on Earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her."
Those in possession of personal data need to be very careful about how they make data anonymous, lest they run afoul of privacy laws.