Beware of Web Scrapers Selling the Law of Web Scraping


One of the main reasons I first became interested in the law of web scraping was how hard it was to find reliable information about it. Years ago, I had a client with a big web-scraping project, and they asked me to research whether what they were planning to do was legal. And in my research, I kept uncovering article after article about the laws related to web scraping that was absolute nonsense.

Only by sticking with academic research and the case law was I able to find anything that I could trust.

The deeper I dug, the more I realized that this was an interesting area of law with lots of room for nuance and development. But part of the reason that I originally got hooked was that I had to do so much digging before I could find information I could trust.

Don’t get me wrong: there’s plenty of information out there about the laws related to web scraping. Just most of it is terrible and unreliable.

And part of the reason that much of the information is so bad (other than journalists simply repeating the same bad information over and over again) is that the most prominent purveyors of information are often the very companies selling the services.

Which would be one thing if these companies were American-based with legal counsel who specialized in US laws and how to navigate them. But they’re not. Zyte is headquartered in Ireland. Bright Data is from Israel (currently being sued by Meta). Octoparse is a Chinese company (with a US-based subsidiary that is currently being sued by Meta).

This is relevant in a few different ways. First, some foreign scrapers are not likely to have the resources or the inclination to hire a US-based attorney to help them navigate these issues. Second, part of the reason these companies can do what they do is that they are not obviously subject to the jurisdiction of the United States (although fully assessing that question would be worthy of an entirely separate and much longer article). Third, the legal risk of a foreign company scraping the data of companies based in the United States is different from that of an American-based company looking to do the same thing.

In a sense, I’m being a bit unfair to these companies by lumping them together. Zyte is a sophisticated and intelligent company that takes a conservative approach to risk and is an excellent advocate for the web scraping industry. Octoparse is a hot pile of garbage that pumps and dumps half-researched clickbait on anyone without a healthy sense of skepticism and self-preservation.

This isn’t about xenophobia or negative sentiment toward non-American companies. It’s about understanding that even though the internet is frictionless and borderless, the law is very much not. If you run afoul of a company’s terms of service or scrape beneath a login, you are likely subjecting yourself to the jurisdiction of a specific place. Some of those places are much less friendly to web scrapers than others.

I came across this a lot when Bitcoin first started to become a big deal. I’d have clients ask me if they could legally create an online Bitcoin trading platform in the United States. And I would tell them that while it was theoretically possible to do so, the compliance obligations associated with doing this would be far more expensive than they could afford. And they’d point me to all these small-scale Bitcoin trading platforms online. And then I’d look to see where they were domiciled, and it was invariably in some place like Russia, Malta, or Gibraltar.

The law that applies to a Maltese company with anonymous founders might not be the same as the law as it applies to you. Just as the law that applies to a company with no presence in the United States might not be the same as the law that applies to a person living in Nebraska, California, or Texas.

Fortunately or unfortunately, the companies that sell web-scraping services also tend to be pretty good about maximizing their SEO. So even though they are not the most reliable sources of information, they tend to find themselves near the top of search lists for key terms on Google, because the people who do know the law—lawyers, researchers, and academics—suck at search.

(Or maybe Google’s the one that’s starting to suck at search, but that’s a topic for another day.)

So yeah, if you’re trying to learn more about the law of web scraping, first and foremost, consider the source.

For a nuanced and pro-scraping perspective on the law related to web scraping, the Electronic Frontier Foundation, Duke’s Center for the Study of the Public Domain, and the Markup are all excellent. For a more enterprise perspective on things, Bloomberg Law and often provide good insight. And of course, no one’s been at it longer, more diligently, and with more insight than Eric Goldman’s blog, where Venkat, Eric, and I cover all the major cases on this topic.