Focusing public attention on emerging privacy and civil liberties issues

Re-identification

Concerning the Re-Identification of Consumer Information

Latest News

  • FTC Approves Final Settlement over Consumer Tracking, Fails to Enforce FIPs or Suggest Best Practices for Anonymization: The Federal Trade Commission adopted a proposed settlement with Compete, Inc., over allegations that Compete failed to adopt reasonable data security practices and deceived consumers about the amount of personal information that its toolbar and survey panel would collect. The FTC also charged Compete with deceptive practices for falsely claiming that the data it kept was anonymous. The settlement requires Compete to obtain consumers' express consent before collecting any data through its software, to delete personal information already collected, and to provide directions for uninstalling its software. In comments to the agency, EPIC recommended that the FTC also require Compete to implement Fair Information Practices similar to those contained in the Consumer Privacy Bill of Rights, and develop a best practices guide to de-identification techniques. The FTC declined to adopt EPIC’s recommendations, stating that it "does not provide specific technical guidance in areas like [anonymization], which are constantly changing," and "may not impose additional obligations that are not reasonably related to such conduct or preventing its recurrence." For more information, see EPIC: Federal Trade Commission and EPIC: Re-Identification. (Feb. 26, 2013)
  • EPIC Submits Comments to FTC on Consumer Tracking Settlement: EPIC submitted comments to the Federal Trade Commission on a recent settlement with Compete, Inc. The settlement arises from allegations that Compete failed to adopt reasonable data security practices and deceived consumers about the amount of personal information that its toolbar and survey panel would collect. The FTC also charged Compete with deceptive practices for falsely claiming that the data it kept was anonymous. The proposed settlement requires Compete to obtain consumers’ express consent before collecting any data through its software, to delete personal information already collected, and to provide directions for uninstalling its software. EPIC expressed support for the settlement, but recommended that the FTC also require Compete to implement Fair Information Practices similar to the Consumer Privacy Bill of Rights, make the compliance reports publicly available, and develop a best practices guide to de-identification techniques, as anonymization has become more critical for online privacy. For more information, see EPIC: Federal Trade Commission and EPIC: Re-Identification. (Nov. 20, 2012)
  • California Supreme Court Rules Zip Code is Personal Information: In Pineda v. Williams-Sonoma, the California Supreme Court has determined that merchants may not require credit card customers to provide ZIP codes. In a unanimous decision, the Court found that ZIP codes are "personal identification information" under the state Credit Card Act of 1971. In the Pineda case, the customer believed that providing a ZIP code was necessary to complete a credit card transaction. The merchant subsequently used the ZIP code to determine the customer's home address. The California court said that the Credit Card Act "intended to provide robust consumer protections by prohibiting retailers from soliciting and recording information about the cardholder that is unnecessary to the credit card transaction." For more information, see EPIC - Social Security Numbers and EPIC - Reidentification. (Feb. 11, 2011)
  • Netflix Cancels Contest over Privacy Concerns: Netflix canceled its second $1 million Netflix Prize after privacy concerns from the FTC and a federal lawsuit alleging invasion of privacy and violations of the Video Privacy Protection Act. The Netflix contest challenged contestants to find a superior movie-recommendation algorithm from “anonymized” datasets that included movie ratings, date of ratings, unique ID numbers for Netflix subscribers, and movie information. In 2006, during the first Netflix Prize contest, researchers conducted a study revealing that if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. After productive discussions with the FTC over the reidentification concerns that stemmed from this study, Netflix and the federal agency reached an understanding on how Netflix would use user data in the future. Netflix also settled the VPPA lawsuit. For more information, see EPIC: Reidentification. (Mar. 15, 2010)
  • EPIC Supports Privacy Safeguards for Genetic Information, Recommends Robust Techniques for Deidentification: EPIC filed comments with the Department of Health and Human Services, advising the federal agency to strengthen the requirements for classifying data as “de-identified” under the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. HHS proposed a rule that would clarify HIPAA and the Genetic Information Nondiscrimination Act (GINA), by providing that genetic information is “health information” and prohibiting the use of such information for underwriting purposes or other discriminatory purposes. EPIC supports this proposed regulation but warned that a safe harbor provision for de-identified data could undercut privacy safeguards unless the techniques were shown to be "robust, scalable, transparent, and provable." For more information, see EPIC: Reidentification. (Dec. 8, 2009)
  • HHS to Explore Scope of Personally Identifiable Health Information, Seeks Public Comments: The Department of Health and Human Services plans to modify sections of the federal Privacy Rule, issued under HIPAA. The proposed changes would clarify the scope of privacy and confidentiality of genetic information. More specifically, HHS proposes to modify the Privacy Rule, taking into account the Genetic Information Nondiscrimination Act, to prohibit health plans from using or disclosing personally identifiable health information, which would explicitly include genetic information, for underwriting purposes. Public comments on the proposed rule are due December 7, 2009. EPIC is recommending that HHS pay particular attention to the problem of data reidentification. For more information, see EPIC's Genetic Privacy Page. (Oct. 13, 2009)

Background

Introduction

Re-identification is the process by which anonymized personal data is matched with its true owner. In order to protect the privacy interests of consumers, personal identifiers, such as name and social security number, are often removed from databases containing sensitive information. This anonymized, or de-identified, data is meant to safeguard the privacy of consumers while still making useful information available to marketers or datamining companies. Recently, however, computer scientists have revealed that this "anonymized" data can easily be re-identified, such that the sensitive information may be linked back to an individual. Re-identification implicates privacy rights because organizations often maintain that privacy obligations do not apply to anonymized information. If the data is in fact personally identifiable, then those obligations should apply.

Deidentified Patient Data

Datamining companies often purchase and collect doctors’, dentists’, and nurse practitioners’ prescription information, or prescriber-identifiable data, without their knowledge or consent, in order to sell to research institutions, private companies, or - their best clients by far - pharmaceutical companies. In a process called “detailing,” pharmaceutical sales representatives analyze this data to identify trends or habits of prescribers, thus allowing them to tailor their marketing of prescription drugs to individual prescribers. Pharmaceutical sales representatives only detail brand-name drugs, which are more expensive, although not necessarily more effective, than their generic counterparts. In an effort to curb prescription drug costs and to protect the privacy of their citizens’ prescription information, several states have enacted, or considered enacting, statutes that limit pharmaceutical datamining activities and provide prescribers and consumers with more control over their prescription information. Datamining companies, such as IMS Health and Verispan, LLC, have aggressively opposed these laws, arguing, among other things, that the prescription data is de-identified, and thus the privacy interests of consumers are not implicated in the detailing process.

IP Addresses

Search engines often retain IP addresses of their users for several purposes, including to target advertisements and to investigate click fraud. However, some search engines are responding to increasing pressure to anonymize this data for privacy purposes. In 2006, AOL anonymized the IP addresses and usernames of its users and released data relating to search queries. A study that was conducted based on this data release, however, revealed that anonymization of IP addresses is not adequate protection, as unique user information can be re-identified. The re-identification process of AOL users' information is described in more detail below.

In 2007, the European Commission’s Article 29 group, which monitors data protection issues, urged Google and Yahoo to anonymize their users' IP addresses every few months. Google and Yahoo claim to anonymize these IP addresses, although both search engines do retain the first few digits of the IP addresses. It is clear that if actual anonymization of IP addresses is not an adequate process to protect user data, partial redaction of IP addresses is certainly inadequate.
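
To make the concern about partial redaction concrete, here is a minimal Python sketch. It assumes a hypothetical scheme that zeroes only the final octet of an IPv4 address; the schemes actually used by the search engines are not described here, and the function names and sample address are invented for illustration. Under that assumption, a redacted address still narrows a user to one of at most 256 possible addresses.

```python
import ipaddress

def truncate_last_octet(ip: str) -> str:
    """Zero the final octet of an IPv4 address (hypothetical redaction scheme)."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)

def candidate_count(truncated_ip: str) -> int:
    """Count the full addresses that are consistent with a truncated address."""
    return ipaddress.ip_network(f"{truncated_ip}/24").num_addresses

# 192.0.2.0/24 is a range reserved for documentation, used here as sample data.
redacted = truncate_last_octet("192.0.2.147")
print(redacted, candidate_count(redacted))
```

A candidate set this small can often be narrowed further when the redacted address is combined with other retained data, such as query timestamps.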

Redacted Social Security Numbers

One method of protecting an individual’s identity often employed by companies is the redaction of social security numbers (SSNs), either in whole or in part. Social security numbers are considered personally identifiable information (PII) and are one of the most highly sought after sensitive pieces of individualized data by identity thieves. Although consumers may believe that redaction of SSN is an adequate privacy measure, a professor and researcher from Carnegie Mellon conducted a study that reveals the assignment of social security numbers actually follows predictable trends, and as a result, it is possible to estimate narrow ranges of values where one’s SSN likely falls. Furthermore, the predictability of SSNs is based solely on information that is publicly available.

EPIC's Interest

EPIC has a strong interest in protecting the privacy of consumers and their information, and ensuring this data is not disclosed to third parties. Datamining companies that gather consumer data often do so without knowledge or consent of the consumers, implicating privacy interests because consumers have the right to know how and what kind of information is being used and disclosed to third parties. EPIC believes it is important to make clear the substantial state interest in safeguarding sensitive personal information as well as the related concern about the transfer of “anonymized” patient data to datamining firms.

EPIC has a particular interest in protections limiting the disclosure of medical information and has shown strong support for this position in its amicus briefs. In 2007, EPIC filed an amicus brief in the First Circuit in IMS Health v. Ayotte, a case which upheld New Hampshire’s prescription privacy law. In 2009, EPIC filed an amicus brief in IMS Health v. Sorrell, which is currently on appeal to the Second Circuit. EPIC supports the lower court’s decision in this case, which upheld the constitutionality of a Vermont prescriber confidentiality law. Further, EPIC has testified in the European Parliament that IP addresses should be treated as personally-identifiable information unless it can be proven otherwise.

Law Concerning Anonymized Data

Health Insurance Portability and Accountability Act

The HIPAA Privacy Rule (45 CFR Parts 160 and 164) provides the "federal floor" of privacy protection for health information in the United States, while allowing more protective ("stringent") state laws to continue in force. Under the Privacy Rule, protected health information (PHI) is defined very broadly. PHI includes individually identifiable health information related to the past, present or future physical or mental health or condition, the provision of health care to an individual, or the past, present, or future payment for the provision of health care to an individual. Even the fact that an individual received medical care is protected information under the regulation.

The Privacy Rule establishes a federal mandate for individual rights in health information, imposes restrictions on uses and disclosures of individually identifiable health information, and provides for civil and criminal penalties for violations. The complementary Security Rule includes standards for protection of health information in electronic form. For more information, see EPIC's page on Medical Record Privacy.

Many states have enacted legislation that would require prescriber-identifiable data to be confidential, thus prohibiting the disclosure or sale of patient prescription information.

California's Confidentiality of Social Security Numbers Provisions

Cal. Civ. Code 1798.89 prohibits a person, entity, or government agency from publicly filing or recording a document containing more than the last four digits of an individual's social security number, unless otherwise required to do so by state or federal law.

Section 66018.55 of the Education Code directs the Office of Privacy Protection in the Department of Consumer Affairs to establish a task force to review the use of SSNs by colleges and universities, in order to recommend best practices and minimize the use, retention, and disclosure of SSNs in relation to their academic or operational needs.

Article 3.5 of the Government Code establishes a Social Security Number Truncation Program, which requires the county recorder of each county to create exact electronic copies of each official record between January 1980 and December 2008, except that any social security number shall be truncated (first five digits redacted).

Several other states have enacted legislation regarding the redaction or disclosure of social security numbers.

De-identified Data and Free Speech

One of the central questions that arises from the re-identification conversation is what level of scrutiny should be afforded this type of speech. This First Amendment question was raised most recently in two cases, IMS Health v. Ayotte and IMS Health v. Sorrell. In both these cases, pharmaceutical datamining companies challenged state statutes that banned the sale of prescription data in most circumstances. The datamining companies in Ayotte argued that: 1) the law was subject to strict scrutiny because it imposed a content-based restriction on non-commercial free speech; 2) the law violated the First Amendment because it was not narrowly tailored to serve compelling state interests; and 3) if the judge determined that the law was subject to intermediate scrutiny because it only restricted commercial speech, it still did not advance a substantial government interest in a narrowly tailored way. The companies in Sorrell made a similar free speech argument.

In the State's defense in Ayotte, the Attorney General argued: 1) that the New Hampshire law did not implicate the First Amendment because it did not regulate speech; and 2) that even if the Act did implicate speech, the law should survive intermediate scrutiny because it advanced the State's substantial interests in promoting public health, controlling health care costs and protecting the privacy of patients and doctors, while still allowing the data to be used for non-commercial purposes. The district court in Ayotte held the Act did implicate speech rights by restricting commercial speech, and thus was subject to intermediate scrutiny and the Central Hudson test. Under Central Hudson, a court asks whether: 1) the speech concerns lawful activity and is not misleading; 2) the asserted government interest is substantial; 3) the regulation directly advances the government interest asserted; and 4) the regulation is not more extensive than necessary to serve that interest. Although the lower court held the Act did not advance the substantial state interest in protecting the privacy rights of patients and healthcare providers, the First Circuit Court of Appeals reversed and upheld the constitutionality of the New Hampshire statute.

The court in Sorrell reached a decision similar to that of the First Circuit in Ayotte. Responding to the same arguments as in Ayotte, the Vermont Attorney General argued, among other things: 1) that the law did not implicate the First Amendment because it did not regulate speech; and 2) that even if the Act did implicate speech, the law should survive intermediate scrutiny under Central Hudson because it advanced the State's substantial interests in promoting public health, controlling health care costs and protecting the privacy of patients and doctors, while still allowing the data to be used for non-commercial purposes. In an opinion issued on April 23, 2009, District Court Judge John Garvan Murtha held that the Act restricted speech. However, the court determined that the speech implicated by the release of prescriber-identifiable data was not fully-protected speech under the Constitution. Rather, the speech in this case was commercial speech and was subject to intermediate scrutiny. Accordingly, the Court applied the four-part Central Hudson test. The District Court generally agreed with the attorney general's arguments, finding that the "law is sustainable on the State's cost containment and public health interests, which are substantial . . . ." Therefore, the court held that the law's restrictions on data disclosure were "in reasonable proportion to the State's interests." Although the court did not believe that "prescriber privacy [was] a sufficient interest to justify the law," it did not fully consider the merits of the State's privacy claim. For more information, see EPIC's pages on IMS Health v. Ayotte and IMS Health v. Sorrell.

The Process of Re-Identification

The Netflix Study

In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period. Netflix “anonymized” the data before releasing it by removing usernames. Still, Netflix assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends. Researchers used this information to uniquely identify individual Netflix users. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database.
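
The kind of matching described above can be illustrated with a minimal Python sketch. This is not the researchers' algorithm: the data, the 14-day date tolerance, and the function name `consistent_users` are all assumptions made for illustration. The sketch shows how a few known ratings and approximate dates can narrow a released dataset to a single record.

```python
from datetime import date

# Hypothetical released ratings: user ID -> {movie: (rating, date rated)}.
released = {
    "u1": {"Movie A": (5, date(2005, 3, 1)), "Movie B": (2, date(2005, 3, 9))},
    "u2": {"Movie A": (5, date(2004, 11, 20)), "Movie B": (4, date(2005, 6, 2))},
}

# What an outside observer is assumed to know about one person (invented values).
known = {"Movie A": (5, date(2005, 3, 3)), "Movie B": (2, date(2005, 3, 10))}

def consistent_users(dataset, observations, max_days=14):
    """Return IDs whose records match every observed rating within max_days."""
    hits = []
    for user_id, ratings in dataset.items():
        ok = all(
            movie in ratings
            and ratings[movie][0] == rating
            and abs((ratings[movie][1] - when).days) <= max_days
            for movie, (rating, when) in observations.items()
        )
        if ok:
            hits.append(user_id)
    return hits

print(consistent_users(released, known))
```

When only one ID remains, every other rating stored under that ID is tied to the person the observer already knows.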

AOL's Release of User Data

In 2006, as part of its AOL Research initiative, AOL posted 20 million search queries from 650,000 of its users over a three-month span. AOL attempted to de-identify the data before releasing it by removing IP addresses and usernames to protect the privacy of AOL users. However, because AOL wanted to uniquely identify the data for research purposes, it replaced the usernames and IP addresses with identification numbers, so that a user’s searches would still be connected to the user. Because this data was still linked with unique identification numbers, researchers could link search queries with the individuals who conducted the searches. For example, many users made search queries that identified their city, or even neighborhood, their first and/or last name, and their age demographic. With such information, researchers were able to narrow down the population to the one individual responsible for the searches. In the aftermath of this data release, the researcher responsible for releasing the data was dismissed and the Chief Technology Officer resigned. Still, one danger of releasing such re-identifiable personal information remains: the potential for future breaches is much higher.
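
The role of the persistent identification number can be sketched in a few lines of Python. The log entries, ID values, and function name below are invented for illustration and are not taken from the AOL release. The sketch shows that replacing a username with a stable number still lets anyone collect all of one user's queries into a single profile.

```python
from collections import defaultdict

# Hypothetical released log: (pseudonymous user ID, search query). All values invented.
released_log = [
    (1001, "landscapers in springfield"),
    (2002, "cheap flights"),
    (1001, "homes sold in maple grove subdivision"),
    (1001, "jane doe"),
    (2002, "weather tomorrow"),
]

def build_profiles(log):
    """Group queries by pseudonymous ID, which the persistent identifier makes possible."""
    profiles = defaultdict(list)
    for user_id, query in log:
        profiles[user_id].append(query)
    return dict(profiles)

profiles = build_profiles(released_log)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

Each query alone may reveal little, but the grouped profile for ID 1001 combines a city, a neighborhood, and a name.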

Unique Identification Through Zip Code, Sex, Birthdate

Latanya Sweeney, a computer science professor, conducted a study using 1990 census data, and found that zip code, birth date, and sex could be combined to uniquely identify 87% of the United States population. To illustrate this threat, Sweeney gathered data from a government agency called Group Insurance Commission (GIC) in order to reveal the identity of a Massachusetts governor. GIC, a purchaser of health insurance for employees, released records of state employees to researchers. GIC, with the support of Governor Weld of Massachusetts, removed names, addresses, social security numbers, and other identifying information, in order to protect the privacy of these employees. Governor Weld assured Massachusetts residents that the released information would remain private.

Sweeney purchased voter rolls, which included name, zip code, address, sex, and birth date of voters in Cambridge, where Governor Weld resided, and combined the information with GIC’s data and easily found the governor. From GIC’s databases, only six people in Cambridge were born on the same day as the governor, half of them were men, and the governor was the only one who lived in the zip code provided by the voter rolls. The information in the GIC database on the Massachusetts governor included prescriptions and diagnoses.
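
A uniqueness measurement of this kind can be sketched in Python. The records and the function name `unique_fraction` are hypothetical and the values are invented; this is an illustration of the idea, not a reconstruction of Sweeney's analysis. The sketch counts how many records have a combination of zip code, birth date, and sex that no other record shares.

```python
from collections import Counter

# Hypothetical de-identified records: (zip code, birth date, sex, diagnosis).
records = [
    ("02138", "1950-01-15", "M", "diagnosis A"),
    ("02138", "1950-01-15", "F", "diagnosis B"),
    ("02139", "1962-07-04", "F", "diagnosis C"),
    ("02139", "1962-07-04", "F", "diagnosis D"),
]

def unique_fraction(rows):
    """Fraction of rows whose (zip, birth date, sex) combination appears exactly once."""
    keys = [row[:3] for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)

print(unique_fraction(records))
```

In this toy dataset the first two records are unique on the three fields, while the last two share the same combination and so cannot be told apart by those fields alone.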

Predicting SSNs by Birth date and State

In their 2009 study, Carnegie Mellon professor Alessandro Acquisti and researcher Ralph Gross demonstrated through a two-step process how an SSN can be predicted from an individual’s birth date and geographic location. First, the researchers analyzed public records in the Social Security Administration’s Death Master File (DMF) to examine statistical trends in the assignment of SSNs for those whose deaths were reported to the Social Security Administration. Second, combining the patterns derived from the DMF analysis with a living individual’s state and birth date (which can be found in various offline sources, such as voter registration lists, or online sources, such as social networking sites), Acquisti and Gross identified the first 5 digits for 44% of DMF records from 1989 to 2003 and complete SSNs in fewer than 1,000 attempts for 8.5% of the records. Acquisti and Gross found a strong correlation between birth date and all nine digits of an SSN, a correlation that increases for individuals in less populous states. These results have important consequences for the living population in the United States, as they imply that millions of SSNs for individuals whose birth dates are known can be identified.

How Data is Re-identified

In each of the above cases, data was re-identified by combining two datasets with different types of information about an individual. One of the datasets contained anonymized information; the other contained outside information - generally available to the public - collected on a daily or routine basis (such as voter registration information), and which includes identifying information (e.g., name). The two datasets will usually have at least one type of information that is the same (e.g., birthdate), which links the anonymized information to an individual. By combining information from each of these datasets, researchers can uniquely identify individuals in the population. While companies tend to focus on the removal of personally-identifiable information (PII), the studies above show that re-identification can occur even by combining non-PII, such as movie ratings in the Netflix study or search engine queries in the AOL example.
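
The linking step described above can be sketched as a simple join in Python. The two datasets, the field names, and the `link` function are hypothetical and all values are invented. The sketch shows how shared fields connect a record without a name to a public record with one.

```python
# Hypothetical anonymized dataset: no names, but quasi-identifiers remain.
anonymized = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "diagnosis A"},
    {"zip": "02139", "birth_date": "1962-07-04", "sex": "F", "diagnosis": "diagnosis B"},
]

# Hypothetical public dataset (for example, a voter roll) with names attached.
public = [
    {"name": "Person One", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person Two", "zip": "02139", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Person Three", "zip": "02139", "birth_date": "1962-07-04", "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def link(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Return (name, anonymized row) pairs where exactly one public row matches."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    matches = []
    for row in anon_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the record
            matches.append((names[0], row))
    return matches

for name, row in link(anonymized, public):
    print(name, row["diagnosis"])
```

Only the first anonymized record is linked here, because the second matches two people in the public dataset. Adding one more shared field would often be enough to break such a tie.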

News Reports

Related Resources