Differential Privacy

One of the key mechanisms for addressing the privacy risks of big data is differential privacy. A differentially private data set or algorithm relies on the controlled injection of statistical noise to prevent individuals from being identified and linked with their data. Differential privacy is not a single method or algorithm, but rather “a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.’” As Prof. Cynthia Dwork explains:

Differential privacy is a mathematically rigorous definition of privacy tailored to statistical analysis of large datasets. Differentially private systems simultaneously provide useful statistics to the well-intentioned data analyst and strong protection against arbitrarily powerful adversarial system users—without needing to distinguish between the two. Differentially private systems “don’t care” what the adversary knows, now or in the future. Finally, differentially private systems can rigorously bound and control the cumulative privacy loss that accrues over many interactions with the confidential data.
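To make the “controlled injection of statistical noise” concrete, the sketch below illustrates the classic Laplace mechanism, one common way of achieving differential privacy. It is a hypothetical illustration, not any agency’s or company’s production code: a simple count query is perturbed with noise whose scale is set by the privacy parameter epsilon, so that smaller values of epsilon mean more noise and stronger protection.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng=None):
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing any one individual changes a count by at most 1,
    so noise drawn from Laplace(0, 1/epsilon) yields epsilon-differential privacy
    for this query.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: a published table cell reporting 1,000 residents,
# protected with a privacy parameter of epsilon = 0.5.
noisy_count = laplace_count(1000, epsilon=0.5)
```

Because the noise distribution depends only on epsilon and the query’s sensitivity, not on who happens to be in the data, the guarantee holds no matter what an adversary already knows, and the privacy loss from repeated releases can be tracked and bounded as a budget, which is the “cumulative privacy loss” Dwork describes.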

Differential privacy has been deployed in a wide range of contexts, including by Apple, Facebook, LinkedIn, Microsoft, and other major technology firms to protect certain types of personal data. But perhaps the best-known application is the U.S. Census Bureau’s use of differential privacy in its data products starting with the 2020 Census. By injecting controlled amounts of statistical noise into published census tables, the Bureau can produce useful data while simultaneously providing mathematical guarantees of privacy. Differential privacy allows the Bureau to protect against increasingly sophisticated reconstruction and reidentification attacks that threaten the confidentiality of individual census responses. As danah boyd describes, the potential harms of these attacks are significant:

Anyone could construct a linkage attack by purchasing commercial data[.] . . . Most people do not view the characteristics in the decennial census as particularly sensitive, but those who are most at risk to having their data abused (and are typically also the hardest to count) do. People who are living in housing units with more people than are permitted on the lease are nervous about listing everyone living there, unless they can be guaranteed confidentiality. Same-sex couples are nervous about marking their relationship status accurately if they feel as though they could face discrimination. Yet, the greatest risks people face often stem from how census data can be used to match more sensitive data (e.g., income, health records, etc.).

During the 2020 Census, Alabama filed a federal lawsuit challenging the Bureau’s deployment of differential privacy. EPIC filed an amicus brief in the case arguing that differential privacy is “the only credible technique” to guard against reidentification attacks. EPIC also argued that differential privacy “is not the enemy of statistical accuracy,” but rather “vital to securing robust public participation in Census Bureau surveys[.]” The court ultimately denied Alabama’s motion for a preliminary injunction, and the case was dismissed several months later.