OpenAI, the company that makes ChatGPT, now offers a way for websites to opt out of its crawler. By default, it will just use web content as it sees fit. How rude!
The opt-out works by adding a Disallow directive for the GPTBot user agent in your robots.txt. The GPTBot docs say:
Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.
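For reference, the opt-out they describe is a standard robots.txt rule. A minimal robots.txt that blocks GPTBot from the whole site looks like this:

```
# Block OpenAI's GPTBot crawler from all paths
User-agent: GPTBot
Disallow: /
```

The file goes at the root of your domain (e.g. example.com/robots.txt), and the Disallow path can be narrowed if you only want to block part of the site.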
I get the goal of optimising AI models for accuracy and capabilities, but I don't see why it would be ok for these “AI” companies to just take whatever content they want. Maybe your local bakery's goal is to sell tastier croissants. Reasonable goal. Now, can they steal croissants from other companies that make tasty croissants, unless those companies opt out? I guess few people would answer ‘yes’?
Google previously got into legal trouble for their somewhat dubious practice of displaying headlines and snippets from newspapers' articles. It seems reasonable to reuse some content when referring to it, at least headlines; most websites do that. And Google at least displays its sources and links to the originals. ChatGPT does neither, which makes their stealing (or reusing) especially problematic.
Taking other people's writing should be opt-in and probably paid for (even if the makers of AI don't think so). The fact that this needs to be said and isn't simply the status quo tells me that companies like OpenAI don't see much value in writing or writers. Deploying this software the way they have shows a fundamental misunderstanding of the value of the arts. As someone who loves reading and writing, I find that concerning. OpenAI have enormous funds that they choose to spend on some things and not other things.
It is in the very nature of LLMs that very large amounts of content are needed to train them. Opt-in makes that difficult, because it would mean going without a lot of the training content the product needs to function. Payment makes it expensive, because lots of content means lots of money. But hey, such difficulties and costs aren't the problem of content writers. OpenAI's use of opt-out instead of opt-in unjustifiably makes it their problem.
For that reason alone, I think the only fair LLMs would be the ones trained on ‘own’ content, like a documentation site that offers a chatbot route into its content in addition to the main affair (an approach that is still risky for numerous other reasons).