
Scraping for Me, Not for Thee: Large Language Models, Web Data, and Privacy-Problematic Paradigms

February 27, 2025 | Justin Sherman, EPIC Scholar in Residence

When the Chinese startup DeepSeek released a new chatbot model in January, the impacts in the US private sector, government, and think tank community were widespread. US artificial intelligence companies collectively lost more than $1 trillion in market valuation in a single day, with NVIDIA alone taking a $593 billion hit. Members of Congress began calling for a ban on DeepSeek’s AI model and mobile app for national security reasons. And analysts debated how it all fits into US-China AI relations and global security.

Yet something especially curious happened, too: OpenAI and Microsoft started investigating whether DeepSeek obtained, in their words, “unauthorized” access to OpenAI data to build its LLM.

For those aware of how OpenAI and many other companies have built their so-called large language models (LLMs)—the kind of AI model behind ChatGPT and other such chatbots—the irony is not lost. Because OpenAI, much like many of its competitors, built and builds its LLMs in part by scraping vast amounts of data and information from the internet and processing it all, typically without the data holders’ permission or the actual consent of the people whose data was swept up in it. More than just hypocrisy (though that as well), the reaction to DeepSeek’s model speaks to a troubling AI company argument—one that positions mass web scraping and large-scale data ingestion, sans consent, as a necessity, and an act in which only the companies themselves are entitled to partake.

This argument distorts the meaning of “necessary.” Perhaps more insidiously, it also advances a problematic paradigm, wherein privacy costs are not “outweighed” by other factors but intentionally discarded at the outset, because they are too at odds with the industry’s expedient process for developing a technology. The idea is therefore to cast meaningful privacy considerations aside entirely, not even weighing them, which hurts individuals, other companies, and society and does longer-term damage to public debate about new technologies.

Striking similarities

It’s worth spending a moment on the reaction itself. If it had a name, it would be hypocrisy. As Wall Street Journal columnist Joanna Stern amusingly and incisively remarked: oh dear, did someone steal something from OpenAI? Huh.

According to Bloomberg, unnamed individuals with purported knowledge of the matter said that Microsoft had identified people potentially linked to DeepSeek using OpenAI’s application programming interface (API) to extract data. OpenAI and Microsoft declined to comment on the story, as did DeepSeek and its hedge fund High-Flyer. David Sacks, the Silicon Valley investor turned Trump “AI czar” (with his own choice views), said there is “substantial evidence” that DeepSeek looked to OpenAI’s model outputs to build its own LLM. Responding to Sacks’ comment, OpenAI said generally that “we know [People’s Republic of China] based companies – and others – are constantly trying to distill the models of leading US AI companies.” Going forward, the statement continued, “it is critically important that we are working closely with the US government to best protect the most capable models from efforts by adversaries and competitors to take US technology.”

Clearly, as all the caveats (purported knowledge, potentially linked) underscore, much is uncertain—and just as there is reason for skepticism over DeepSeek’s claims, we should wait for additional evidence before drawing conclusions about this API-use possibility. It is true that Sacks may be correct about how DeepSeek built its latest model. It is also true that all involved parties have many incentives to represent the DeepSeek advancement as a violation of important norms, whether or not such a violation actually occurred.

Now, to be very clear, this is not to say that concerns about intellectual property theft from American companies are baseless; it’s a real problem, including vis-a-vis the Chinese government and many Chinese companies and institutions. This is also not to dismiss the notion that people in the United States, as well as US companies, should be concerned about illicit access to people’s and organizations’ data for such purposes as building AI models; again, this concern is founded on clear harms and implicates people’s data, commercial trade secrets, the ability to have a more level market playing field, and more. If made in good faith, with accurate consideration of the facts and risks, these are important points to have in the public discourse.

But we can’t ignore the maker of the argument. Access to data without the data holder’s or individual’s permission is, by many accounts, exactly how OpenAI has been building—and plans to continue to build—many of its AI models in the first place. OpenAI, among other things, allegedly scraped New York Times articles, used the text of copyrighted books, transcribed over a million hours of YouTube videos, and scraped enormous volumes of data from websites, in many cases allegedly with zero explicit permission from the data holders, to train its models. Some of these practices directly violate other companies’ terms of service. OpenAI even asked actress Scarlett Johansson in 2023 to be the voice of ChatGPT, which she declined, and then proceeded to launch a ChatGPT update in 2024 (a voice called “Sky,” released with the GPT-4o model) that sounded “eerily similar” to her voice—prompting Johansson to say she was “shocked, angered, and in disbelief” at OpenAI’s conduct. Hollywood writers and other creatives have hardly been pleased with the idea of taking creative works without permission for LLM training, either. Law scholars Daniel J. Solove and Woodrow Hartzog put it well when they called this the era of the “great scrape.”

As Joseph Cox at 404 Media and others have explained, the notion that a company shouldn’t use data without “authorization” to build an LLM directly contradicts OpenAI’s own arguments, on its website and in court, defending its web scraping and large-scale data ingestion. In the DeepSeek case, OpenAI treats unauthorized data use as an unconscionable attack on innovation that threatens fair markets and US national security; in its own case, it insists the many impacted companies, individuals, and institutions have no harm to claim at all, because the practice is not a problem whatsoever. Quite the walking contradiction.

Rules for me, not for thee

It’s not just about the two-faced nature of this argument. The way that some AI companies and commentators argue for the “necessity” of mass web scraping and large-scale data ingestion, because that’s just how LLMs are built, is highly troubling for people, other companies, and society—and for future public debates about new technologies. Unpacking the rhetorical distortions and leaps is important in the context of LLMs but also far beyond it.

First of all, those who actually argue that this kind of mass data scraping and ingestion is “necessary” distort the definition of the word, which denotes either a basic requirement for human survival or a predestined, inevitable reality—neither of which applies to language models. LLMs are not food or water (and in fact consume enormous amounts of the latter to answer questions people could Google). And as much as some commentators only speak of LLMs and AI in the passive voice, people and companies choose to design, build, deploy, and maintain these systems; they are not inevitable. Further, despite the many organizations and individuals embedding AI into elements of society, to call LLMs “necessary” diminishes the actual necessity of technologies more clearly and directly tied to advancing basic human needs, such as biotechnologies for food sustainability or new imaging techniques in medicine. Even within the world of AI, calling general-purpose LLMs “necessary” conflates and confuses different AI applications that themselves can be built in different ways, with different levels of energy consumption and climate harm, and with different levels of benefit and accountability to marginalized people and the public. Imaging techniques in medicine are one such example: AI image recognition models that improve cancer screening are not the same as a chatbot used by an ad agency to write copy, whether in their design, development, function, or societal impact.

Stepping back further, the notion that this much data, of this kind, should be scraped from the internet and elsewhere without question because it’s “necessary for building an LLM” is a dangerous one. Notwithstanding critical debates about whether LLMs need to be so “large” in the first place, the framing of mass web scraping and data ingestion as “necessary for” an end state centers the end state rather than the process of reaching it. It rhetorically shifts the focus of the conversation to the LLMs some companies build—why their makers say they’re important; how to use them with safeguards, because of course they’re going to and should be used—and skips right over the entire set of actions that individuals and organizations choose to take to build them in their current form.

The framing immediately narrows the window of thought. Rather than seriously weighing what mass web scraping and large-scale data ingestion mean for privacy, copyright, energy consumption and the climate, labor, the economy, and much else, and presenting a good-faith argument (agree with it or not) that the mass data grabbing and the resulting LLM are together a net benefit to society, making the words “it’s necessary for building an LLM” a conversation-stopper leaves all these issues unconsidered. Because, of course, why would someone dare to question the idea that a few companies should be able to scrape and ingest massive amounts of personal data, copyrighted works, information and data published by many other companies (which also, surprise, contribute to the economy), and other materials to build an LLM? How dare one ask: is this actually necessary? Doing so would introduce evidence and reasoning showing that the current approach to LLMs—whether or not the models themselves are a net negative—is absolutely harmful to privacy in the process and flagrantly disrespectful of the concept of consent. With a similar rhetorical twist, one could argue that it’s “necessary” to take high-resolution photos of every single American, put them into a database, scan them with facial recognition, and link that into a centralized facial recognition model running on every CCTV camera in the United States, because that’s what’s “necessary for” nationwide facial recognition to work.

The obvious is still important: OpenAI’s reaction to DeepSeek prompts cries of hypocrisy and directly contradicts its own excuses for scraping huge amounts of data to build its own LLMs. But the broader argument about why mass data scraping and large-scale data ingestion are “necessary” sits within a dangerous paradigm: focusing exclusively on technological end states rather than process, not even considering privacy and other issues in the first place, or pretending they were carefully weighed and resolved through a moderately deliberative and democratic process. Our privacy deserves better.
