Background

Analysis of large data sets can yield valuable insights, but when personal data is involved, strict safeguards and privacy-enhancing technologies are critical.

“Big data” is a term for the collection of large and complex data sets and the analysis of these data sets for relationships. The sheer quantity of data in these sets makes traditional methods of analysis ineffective. Rather than focusing on precise relationships between individual pieces of data, big data analysis uses various algorithms and techniques to infer general trends over the entire set. Big data looks for correlation rather than causation—the “what” rather than the “why.”

Big data has only become possible in recent years with advances in the collection, storage, and interpretation of data. The process of datafication allows information to be reinterpreted as usable data sets. Data collection—from medicine, financial institutions, social networking, and many other fields—has exploded over the past two decades. And storage costs for this data have plummeted, making it easier to justify retaining data instead of discarding it. These factors, along with better techniques for analyzing the data, have allowed relationships to be discovered that would not have been detectable in years past.

While there are benefits to the growth of big data analytics, traditional methods of privacy protection often fail in this environment. Many notions of privacy rely on informed consent for the disclosure and use of an individual’s private data. However, the rise of big data means that data is a resource that can be used and reused, often in ways that were inconceivable at the time the data was collected. Anonymity also erodes in a big data paradigm: even if each record is stripped of directly identifying information, the relationships between records can reveal the individual’s identity.
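To make that reidentification risk concrete, the sketch below shows a toy linkage attack: a medical data set with names removed is joined against a public voter roll on shared quasi-identifiers (ZIP code, birth date, sex), the same basic approach Latanya Sweeney famously used to reidentify “anonymized” Massachusetts hospital records. Every name, record, and field value here is invented for illustration.

```python
# Illustrative linkage attack on a "de-identified" data set.
# All records below are hypothetical.

# "Anonymized" medical records: direct identifiers stripped,
# but quasi-identifiers remain.
medical_records = [
    {"zip": "02138", "birth_date": "1954-07-12", "sex": "F",
     "diagnosis": "hypertension"},
    {"zip": "02139", "birth_date": "1981-03-02", "sex": "M",
     "diagnosis": "asthma"},
]

# Public voter roll: names attached to the same quasi-identifiers.
voter_roll = [
    {"name": "Jane Doe", "zip": "02138", "birth_date": "1954-07-12", "sex": "F"},
    {"name": "John Roe", "zip": "02140", "birth_date": "1990-11-23", "sex": "M"},
]

def link(records, roster, keys=("zip", "birth_date", "sex")):
    """Join two data sets on shared quasi-identifiers."""
    index = {tuple(p[k] for k in keys): p["name"] for p in roster}
    for record in records:
        name = index.get(tuple(record[k] for k in keys))
        if name is not None:
            yield name, record["diagnosis"]

for name, diagnosis in link(medical_records, voter_roll):
    print(f"{name} -> {diagnosis}")  # prints: Jane Doe -> hypertension
```

Any auxiliary data set that shares even a few attributes can play the role of the roster, which is why deleting names alone rarely provides meaningful anonymity.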

Differential Privacy

One of the key mechanisms for addressing the privacy risks of big data is differential privacy. A differentially private data set or algorithm is one that relies on controlled injections of statistical noise to prevent individuals from being identified and linked with their data. Differential privacy is not a single method or algorithm, but rather “a promise, made by a data holder, or curator, to a data subject: ‘You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.’” As Prof. Cynthia Dwork explains:

Differential privacy is a mathematically rigorous definition of privacy tailored to statistical analysis of large datasets. Differentially private systems simultaneously provide useful statistics to the well-intentioned data analyst and strong protection against arbitrarily powerful adversarial system users—without needing to distinguish between the two. Differentially private systems “don’t care” what the adversary knows, now or in the future. Finally, differentially private systems can rigorously bound and control the cumulative privacy loss that accrues over many interactions with the confidential data.
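Stated formally (this is the standard definition, which the sources above describe only informally): a randomized algorithm M is ε-differentially private if, for any two data sets D and D′ that differ in a single individual’s record, and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

The parameter ε is the “privacy budget”: smaller values force the output distributions on neighboring data sets to be nearly indistinguishable, and because the budgets of successive analyses add under composition, a curator can rigorously cap the cumulative privacy loss Dwork describes.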

Differential privacy has been deployed in a wide range of contexts, including by Apple, Facebook, LinkedIn, Microsoft, and other major technology firms to protect certain types of personal data. But perhaps the best-known application is the U.S. Census Bureau’s use of differential privacy in its data products starting with the 2020 Census. By injecting controlled amounts of statistical noise into published census tables, the Bureau can produce useful data while simultaneously providing mathematical guarantees of privacy. Differential privacy allows the Bureau to protect against increasingly sophisticated reconstruction and reidentification attacks that threaten the confidentiality of individual census responses. As danah boyd describes, the potential harms of these attacks are significant:

Anyone could construct a linkage attack by purchasing commercial data[.] . . . Most people do not view the characteristics in the decennial census as particularly sensitive, but those who are most at risk to having their data abused (and are typically also the hardest to count) do. People who are living in housing units with more people than are permitted on the lease are nervous about listing everyone living there, unless they can be guaranteed confidentiality. Same-sex couples are nervous about marking their relationship status accurately if they feel as though they could face discrimination. Yet, the greatest risks people face often stem from how census data can be used to match more sensitive data (e.g., income, health records, etc.).

During the 2020 Census, Alabama filed a federal lawsuit challenging the Bureau’s deployment of differential privacy. EPIC filed an amicus brief in the case arguing that differential privacy is “the only credible technique” to guard against reidentification attacks. EPIC also argued that differential privacy “is not the enemy of statistical accuracy,” but rather “vital to securing robust public participation in Census Bureau surveys[.]” The court ultimately denied Alabama’s motion for a preliminary injunction, and the case was dismissed several months later.
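To ground the noise injection described above, here is a minimal sketch of the Laplace mechanism, the canonical differentially private algorithm, applied to a counting query. The data and the ε value are hypothetical, and this simplified example is not the Census Bureau’s actual disclosure avoidance system, which is far more elaborate.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one person's record changes a count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices
    for epsilon-differential privacy on this query.
    """
    true_count = sum(1 for record in data if predicate(record))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical microdata: ages of survey respondents.
ages = [23, 35, 41, 58, 62, 29, 47, 71, 34, 55]

# Publish a noisy count of respondents over 40 with privacy budget 0.5.
noisy = laplace_count(ages, lambda age: age > 40, epsilon=0.5)
print(f"Noisy count: {noisy:.1f}")
# The true count is 6; the published value varies from run to run.
```

Because the noise scale 1/ε does not grow with the size of the data set, large published counts remain accurate in relative terms even as each individual response stays protected; each answered query, however, consumes part of a fixed overall privacy budget.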

The Obama Administration’s Big Data Review

In 2014, President Barack Obama delivered a speech on reform of the National Security Agency’s bulk metadata collection program under Section 215 of the USA PATRIOT Act. Following that speech, White House counselor John Podesta announced “a comprehensive review of the way that ‘big data’ will affect the way we live and work; the relationship between government and citizens; and how public and private sectors can spur innovation and maximize the opportunities and free flow of this information while minimizing the risks to privacy.” This was the first major privacy initiative announced by the White House since the release of the Consumer Privacy Bill of Rights in 2012. The undertaking involved key officials across the federal government, including the President’s Science Advisor and the President’s Council of Advisors on Science and Technology.

Soon after the announcement, EPIC and a coalition of consumer groups wrote a letter to John Holdren, the Director of the Office of Science and Technology Policy. EPIC urged OSTP to give the public an opportunity to comment and suggested that the review consider (but not be limited to) the following questions about the role of big data in our society:

  1. What potential harms arise from big data collection and how are these risks currently addressed?
  2. What are the legal frameworks currently governing big data, and are they adequate?
  3. How could companies and government agencies be more transparent in the use of big data, for example, by publishing algorithms?
  4. What technical measures could promote the benefits of big data while minimizing the privacy risks?
  5. What experience have other countries had trying to address the challenges of big data?
  6. What future trends concerning big data could inform the current debate?

On March 4, 2014, in response to suggestions from EPIC and other consumer privacy groups, the Office of Science and Technology Policy published a Request for Information giving the public an opportunity to comment on the Podesta Big Data Review. EPIC submitted comments emphasizing the enormous risks that the current big data environment poses to ordinary Americans, including data security risks and substantial risks to student privacy, and called on the Administration to better implement the Fair Information Practices (FIPs) first set out in 1973. Other groups that submitted comments included the Center for Democracy and Technology, the Future of Privacy Forum, the Privacy Coalition, the Internet Association, the Consumer Federation of America, and the Federation of American Societies for Experimental Biology.

On May 1, 2014, the White House released the Big Data Privacy Report. The report noted that “[b]ig data technologies will be transformative in every sphere of life” and that they raise “considerable questions about how our framework for privacy protection applies in a big data ecosystem.” The review also warned that “data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace. Americans’ relationship with data should expand, not diminish, their opportunities and potential.”

The President’s Council of Advisors on Science and Technology released a report the same day, entitled “Big Data and Privacy: A Technological Perspective.” PCAST wrote that “[t]he challenges to privacy arise because technologies collect so much data (e.g., from sensors in everything from phones to parking lots) and analyze them so efficiently (e.g., through data mining and other kinds of analytics) that it is possible to learn far more than most people had anticipated or can anticipate given continuing progress. These challenges are compounded by limitations on traditional technologies used to protect privacy (such as de-identification). PCAST concludes that technology alone cannot protect privacy, and policy intended to protect privacy needs to reflect what is (and is not) technologically feasible.”

In February 2015, the White House released an interim progress report on its big data initiative. The administration wrote that “[p]olicy development remains actively underway on complex recommendations [from the report], including extending more privacy protections to non-U.S. persons and scaling best practices in data management across government agencies.”

In 2020, an EPIC FOIA lawsuit resulted in the release of a 2014 report from the Department of Justice to President Obama warning about the dangers of predictive analytics and algorithms in law enforcement. As related communications revealed, the report grew out of the White House’s efforts around big data. The Justice Department report highlights the risks of “making decisions about sentencing—where individual liberty is at stake in the most fundamental way—based on historical data about other people,” explaining that “equal justice demands that sentencing determinations be based primarily on the defendant’s own conduct and criminal history.” 
