It's pretty rude of OpenAI to make their use of your content opt-out

OpenAI, the company that makes ChatGPT, now offers a way for websites to opt out of its crawler. By default, it will just use web content as it sees fit. How rude!

The opt-out works by adding a Disallow directive for the GPTBot User Agent in your robots.txt. The GPTBot docs say:

Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.
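In practice, the opt-out looks something like this in robots.txt (this example blocks GPTBot from the whole site; you could also choose to disallow only specific paths):

    User-agent: GPTBot
    Disallow: /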

I get the goal of optimising AI models for accuracy and capabilities, but I don't see why it would be ok for these “AI” companies to just take whatever content they want. Maybe your local bakery's goal is to sell tastier croissants. Reasonable goal. Now, can they steal croissants from other companies that make tasty croissants, unless those companies opt out? I guess few people would answer ‘yes’.

Google previously got into legal trouble for their somewhat dubious practice of displaying headlines and snippets from newspapers' articles. It seems reasonable to reuse content when referring to it, at least headlines; most websites do that. But Google displays its sources and links to the originals. ChatGPT does neither, which makes its stealing (or reusing) especially problematic.

Taking other people's writing should be opt-in and probably paid for (even if makers of AI don't think so). The fact that this needs to be said and isn't, say, the status quo, tells me that companies like OpenAI don't see much value in writing or writers. To deploy this software in the way they have shows a fundamental misunderstanding of the value of the arts. As someone who loves reading and writing, I find that concerning. OpenAI have enormous funds, and they get to choose what to spend them on; compensating writers apparently isn't one of those things.

It is in the very nature of LLMs that very large amounts of content are needed to train them. Opt-in makes that difficult, because it would mean not having a lot of the training content required for the product to function. Payment makes it expensive, because lots of content means lots of money. But hey, those difficulties and costs aren't the problem of the people who write the content. OpenAI's use of opt-out instead of opt-in unjustifiably makes it their problem.

For that reason alone, I think the only fair LLMs would be the ones trained on ‘own’ content, like a documentation site that offers a chatbot route into its content in addition to the main affair (an approach that is still risky for numerous other reasons).

Comments, likes & shares (35)

@hdv there’s no way to escape this :( If you do nothing, they take your content. If you block them, they take your content from one of those cheap ad-powered copycat blogs. Worse: if they do decide to start attributing and you’ve blocked them, they’ll attribute the copycats.
@rikschennink hadn't even thought of that angle :( that's a pretty compelling case to not opt-out
@hdv ... but it's good that robots.txt is being amended for AI crawling.
NYT considers legal action. “Copyright law is a sword that's going to hang over the heads of AI companies for several years unless they figure out how to negotiate a solution.” https://www.npr.org/2023/08/16/1194202562/new-york-times-considers-legal-action-against-openai-as-copyright-tensions-swirl