Open web was never built to become free training data for AI chatbots

AI scraping is reshaping consent, copyright, and privacy across the digital economy today

Last updated: June 27, 2026 | 05:02

The rise of AI chatbots has triggered a technological gold rush unlike anything the internet has seen in decades. Companies are racing to build smarter models, faster assistants, and more human-like conversational systems. Behind that race sits one uncomfortable truth: these systems were built on an unprecedented scale of data scraping.

The ethical debate around AI is often framed around existential risks, job losses, or misinformation. Yet one of the most immediate concerns is far more basic. Who gave AI companies permission to ingest vast portions of the internet in the first place?

For years, the open web operated on an informal exchange. Publishers allowed search engines to crawl websites in return for visibility and traffic. Journalists, artists, researchers, bloggers, and forum users contributed to a digital ecosystem where attribution still mattered. AI chatbots have disrupted that arrangement.

Unlike traditional search engines, generative AI systems do not simply direct users toward original sources. They increasingly absorb, summarise, paraphrase, and reproduce information directly inside chatbot interfaces. That shift changes the economics of the internet itself.

The issue is no longer theoretical. Courts, regulators, publishers, and creators are now openly challenging the ethics and legality of AI scraping practices.

Search to extraction

The New York Times sued OpenAI and Microsoft over alleged unauthorised use of its journalism in training datasets. Reddit sued Anthropic over claims that its platform data was scraped without proper licensing agreements. Publishers across Europe and the United States are demanding compensation or stronger control over how their content is used by AI firms.

These lawsuits reveal a larger ethical problem. Public accessibility does not equal informed consent. Most creators never knowingly agreed to let their articles, artwork, code, comments, or books become training material for commercial AI systems worth billions of dollars. The internet may be open, but openness was never meant to function as blanket permission for industrial-scale extraction. That distinction matters.

AI companies often argue that training models on publicly available data falls under fair use or serves transformative purposes. Courts may ultimately decide parts of that legal argument. Ethics operates on a different standard. A practice can be technically legal and still violate public trust.

Consent problem

The controversy surrounding Anthropic illustrates this tension clearly. A US judge recently indicated that AI training itself may qualify as fair use in certain circumstances, while simultaneously criticising the storage of millions of pirated books used during model development. That distinction is critical because it separates the act of machine learning from the methods used to acquire the data.

The ethical problem is not merely that AI learns from human knowledge. Human beings learn from existing knowledge all the time. The problem emerges when companies quietly collect enormous datasets without meaningful transparency, compensation, or opt-out mechanisms.

That opacity has fuelled growing resistance. In the US, California’s Training Data Transparency Act, which took effect in January 2026, now requires developers of generative AI systems to disclose summaries of the datasets used to train models. The European Union’s AI Act also includes transparency obligations and copyright-related requirements for general-purpose AI providers.

These measures are imperfect. Dataset summaries still do not tell creators whether their specific work was used, how extensively it shaped a model, or whether its influence can ever truly be removed. Still, the regulations signal an important shift: governments are beginning to treat training data not as a technical footnote, but as a matter of accountability.

Internet is fighting back

The internet infrastructure itself is also adapting. Cloudflare’s launch of “Pay Per Crawl” last year marked one of the clearest signs that website owners are no longer willing to provide free access to AI crawlers indefinitely. The tool allows publishers to block, permit, or charge AI companies for scraping access. Meanwhile, internet standards groups are developing new protocols designed specifically for AI data preferences and permissions.
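To make the block-or-permit idea concrete, here is a minimal Python sketch of the older, voluntary mechanism that such tools build on: a robots.txt file that singles out an AI crawler. It is not Cloudflare's Pay Per Crawl and says nothing about how that product works internally. The robots.txt contents, the example.com URL, and the `can_crawl` helper are assumptions made up for illustration; GPTBot is the user-agent name OpenAI publishes for its crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher (an assumption,
# not taken from the article): block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"
    for agent in ("GPTBot", "Googlebot"):
        verdict = "allowed" if can_crawl(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

The limitation is the point: robots.txt only expresses a preference, and a crawler that ignores it faces no technical barrier. That gap is what enforcement and payment tools at the network level are trying to close.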

These developments point toward a broader transformation of the web economy. Large publishers with legal teams and negotiating power may secure licensing agreements with AI companies. News Corp, Reddit, the Financial Times, and other major organisations have already pursued deals tied to AI access. Smaller publishers, freelancers, independent journalists, artists, educators, and forum users may not have the same leverage.

That imbalance risks creating a new digital hierarchy where powerful platforms monetise human creativity twice: once through advertising systems and again through AI training.

Privacy, the hidden casualty

Privacy concerns deepen the ethical dilemma. AI scraping does not only involve published journalism or copyrighted books. It often includes personal discussions, online comments, social media posts, archived forum threads, and user-generated content that people never imagined would become machine-learning fuel. The average internet user did not expect a casual Reddit comment or decade-old forum post to potentially influence a global AI product.

Consent in the AI era has become dangerously diluted. The industry now faces a defining choice. It can continue treating internet-scale scraping as an unavoidable cost of innovation. Or it can build a system based on permission, attribution, licensing, and enforceable boundaries. The second path is harder. It may slow development. It may increase costs for AI companies accustomed to unrestricted data collection. Yet it offers something the current system lacks: legitimacy.

AI has extraordinary potential. It can accelerate research, improve productivity, expand access to knowledge, and reshape entire industries. None of those benefits require ignoring the rights of the people whose work made these systems possible.

The open web was built on participation, not silent extraction. If AI companies want public trust, they cannot continue acting as though every publicly accessible sentence online exists solely to feed the next chatbot.

Justin Varghese

Justin is a personal finance author and seasoned business journalist with over a decade of experience. He makes it his mission to break down complex financial topics and make them clear, relatable, and relevant—helping everyday readers navigate today’s economy with confidence. Before returning to his Middle Eastern roots, where he was born and raised, Justin worked as a Business Correspondent at Reuters, reporting on equities and economic trends across both the Middle East and Asia-Pacific regions.
