Five years ago, I first published the comprehensive guide to legal issues with web scraping. It was just after the seminal hiQ Labs v. LinkedIn case, when a few folks wrote articles proclaiming that “web scraping is legal.” I thought the world needed a more nuanced analysis of the legal issues associated with web scraping.
Initially, that article served as a good source of leads and conversations. But the law of web scraping evolved so fast that maintaining the article became a source of constant stress. And if I left an old version out there for the world to read, it would be so outdated as to be dangerously misleading. And then LLMs came along and started gobbling all that stuff up. In this space, it’s amazing how much and how fast things can change.
This year, without endeavoring to engage in a comprehensive analysis of the current state of the law, I thought I’d give a list here of the legal issues with web scraping I’m keeping an eye on as the year unfolds.
Venue Shopping
Many people think of the law as something that is fairly static or universal. I mean, the internet doesn’t respect state boundaries, so why would the jurisdiction and venue where someone litigates matter?
But when it comes to web scraping, venue and jurisdiction matter a lot.
The most common legal claims in web scraping are terms of use violations. And it’s always been a mystery to me why most of the major platforms have their terms of service subject to California law. These companies have offices around the world, but they have almost universally chosen to have their terms of service governed by the law of the state that seems to most strongly disfavor their interests.
It seems that a few of the companies have gotten wise. In November of last year, X Corp. changed the governing law and jurisdiction of all of its online agreements to the Northern District of Texas, home to the famous Southwest Airlines cases. All of your free and open internet arguments will do you no good there!
There is a general understanding among sophomoric scholars of the law of web scraping that public data is always free for the taking. Not in Texas, it isn’t!
Politics and Scraping
You know who is spending the most money in the world fighting web scraping right now? This guy. And you would be foolish to think that a guy like that, who has a hard-to-quantify-and-classify role as an uber-consultant for the US government, wouldn’t be able to influence policy on one of his personal agenda items. And the thing about web scraping is that it’s far enough under the radar that he could do it, and it wouldn’t even be a blip on the media landscape’s list of things to get worked up about.
In my experience, conservative judges tend to view web scraping less favorably than progressive judges. Conservative jurisdictions tend to be more pro-incumbent and less startup friendly than progressive jurisdictions. If you’re a web scraper, you’re better off in front of a progressive judge in California or New York than a conservative judge in Dallas or Alabama.
Four years of conservative federal judicial appointments will impact legal issues for web scrapers for many years to come.
Extraterritoriality
One thing that most followers of web scraping legal issues don’t know is that certain laws related to web scraping, such as the CFAA, can apply to international companies, even when neither the company that is scraping nor the company that is being scraped resides in the United States. This was a key issue in the recent Ryanair v. Booking.com case, and the court decided clearly in favor of extraterritorial application. Given the scope of that decision, I would not be surprised to see other companies try to apply this again to cases involving parties with little connection to the United States.
DMCA claims
To the extent that people know about web scraping legal issues, they tend to focus on terms of use, the CFAA, and copyright issues. But there are more legal issues that can come into play. The Digital Millennium Copyright Act or DMCA is a law that provides for an “anti-circumvention” right to certain content. It was originally drafted to stop people from burning copyrighted content to CDs, but it’s developed more applications over the years. Courts are split about whether the content that is being accessed needs to be copyrighted to give rise to a DMCA claim. In Texas, even ignoring robots.txt could give rise to a DMCA claim.
X Corp. has recently alleged DMCA claims against Bright Data for its use of proxies. If successful, that could have massive ramifications for the web-scraping industry.
Trespass to Chattels
More than 20 years ago, trespass to chattels was the most important law that impacted web scraping. Then jurisprudence around web scraping evolved to focus more heavily on CFAA and breach of contract issues. But trespass to chattels (TTC) has experienced a revival in the last few years, with some courts finding that even a minor diminution of server capacity could give rise to TTC claims. If courts adopt broad standards of what constitutes TTC, then all sorts of scraping could face claims associated with trespass to chattels. I think most courts and judges understand that frivolous TTC claims set a dangerous precedent, but this is a mistake some judges feel inclined to make over and over again.
AI and Copyright
Perhaps I’m burying the lede by putting this one at the bottom, but of course one of the biggest issues in web scraping is collection and use of data for LLMs. Last month, the first major opinion on web scraping of data for AI was decided, and it was not decided in the scraper’s favor.
I wouldn’t read too much into the Thomson Reuters opinion and its applicability to other LLM cases. There were some unique facts in that case that made it hard to apply to other cases. But I think that the New York Times v. Microsoft et al. case will be far more consequential to other AI companies. How courts treat that fact pattern will likely determine the law for others to follow.
Evolution of Copyright Preemption
Because scraping of data for LLMs is becoming more important, copyright preemption of other state law claims will inevitably become more important as well. Basically, the law of copyright preemption is that if certain laws are equivalent to copyright protection, they are preempted by copyright law. So if your terms of use prohibit copying and reuse of information, that would be preempted by copyright, at least in certain jurisdictions. But it’s that last part that’s the tricky bit. In New York, copyright preemption is the law of the land. In Illinois, Florida, and Texas, contracts cannot be preempted by copyright. In the rest of the country, well, it depends. I think there will be an evolution in favor of broader preemption of contracts. But it won’t be universal. That again makes the first point of this post, venue and jurisdiction, potentially outcome-determinative.
What You Need to Know
This has always been the case, but it is truer now than it was five years ago: To assess the legality of web scraping in 2025, what matters is the when, where, who, what, and how. Simple memes such as the public vs. private distinction are not universally applicable. Whether you’re building a training data set for an LLM or creating an aggregator in a vertical where there is none, nuance and context are the only way to assess your legal risk in this complicated and fast-evolving area of the law.