Who wins when we filter the open web through an opaque system?

One of the busiest breakout sessions I attended at this year's TPAC was the one about the open web and the threats it faces. I regretted not bringing up my greatest worry: that when people stop visiting websites directly, they're filtering content through an opaque system.

The discussion was about threats to the open web due to the emergence of large language models (LLMs). Large parts of the web have always been open. Not just to users, but also to crawling. When search engines crawl, it's seen as a net benefit to websites, especially when it gets them viewers for the ads that support the content. But now there are crawlers that gather content to train LLMs and to answer users' questions. They are a threat when they increase hosting bills (the traffic can be DDoS-like), but they are also a threat when they reduce human visits. When users can access the crawled content without coming to the source website, they may never leave the LLM.

We talked a lot about how the business model of websites is under threat, which could hurt many essential industries (we need independent media, for instance). If their ability to fund websites with ads is undermined with no alternative in sight, that isn't just a threat to their business model; it threatens all sorts of things, including the democratic functioning of societies.

But "browsing” with LLMs and agents poses another problem: that content is filtered through LLMs. They can reword it, change the meaning and, oh no, commercialise it. Of course, it depends on the content if that's worrisome. If users stop accessing the open web directly (or their software pushes them to), I believe that's risky (for them) in various ways.

What could possibly go wrong?

Content changed

When you filter omelette recipes through an LLM, the opaque system could get an ingredient wrong. That's somewhat unlikely, as large amounts of recipe content with the right ingredients will be in the training data. If someone really wanted to tamper with omelette queries, they could insert poisonous ingredients (via model poisoning). I suspect that would be blatant enough to be picked up and corrected swiftly. For other types of information, there is much more risk.

Content monetised

When you book travel through an AI agent, this opaque system could do more than change content. It could simply add a margin, as tech companies habitually do, from app stores to ride-hailing and delivery companies. For today's AI vendors, there's even more incentive than older tech companies had: they need to find ways to turn a profit, as even the 200-dollar subscription to ChatGPT runs at a loss. If you spend money via something that is mostly a black box, how do you know you're getting a good deal?

Privacy risked

When you do most of your information lookups through an opaque system that tries to make sense of your language and can plausibly respond to any question, that system could invade your privacy and build up a monetisable profile of you, even more than search engines do today.

Content ideology-poisoned

But you know what worries me most? When you want to access any information at all, the opaque system could inject its (or any) ideology. Baldur Bjarnason made this point more clearly in his post Poisoning for propaganda, where he explains that not only does keyword-based censorship already happen today (like Copilot refusing to talk about code that contains the word “trans”), but more subtle sentiment manipulation is also really hard to detect, especially when the training data is unknown to the researchers trying to detect it. Placing an LLM in a process, he explains, gives that process an “ideology dial” for whatever is produced with it. And, here's the kicker, that “dial” is controlled by (whoever runs) the organisation training the LLM.

It's one thing for everyone in the world to sound like an excited Silicon Valley marketeer when they use LLMs to generate text; it's another when those companies' ideologies make it into the text, and yet another when all their web browsing is filtered through text generation systems that contain specific ideologies. To make it more concrete, ideology could mean saying certain events never happened or that certain rights don't exist. The kind of stuff that makes Grokipedia different.

Ideologies can be “part of” an LLM because of decisions in the companies training them: they may decide to filter out racism or, sigh, filter it in (this is what Musk seems to want with his “anti-woke AI”). But it could also come from outside: just as anyone can try to optimise their sites for search engines, anyone can try to publish lots of content with the goal of optimising how they show up in models, and that technical possibility could be used for marketing as well as information warfare. Researchers at Anthropic recently found that “by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters”.

Can we trust them?

All of this requires a lot of trust from users in AI companies. I know that many put plenty of effort into making good products and removing bias. At the same time, it's hard to give them the benefit of the doubt when it comes to the financial and privacy aspects, when some tech companies previously made their profit margins by wrapping takeaway food, taxi rides and lodging into fancy UIs, and some were deeply complicit in anti-user surveillance capitalism.

And as for the ideology aspect… from Karen Hao's Empire of AI and Timnit Gebru's work on TESCREAL, we learned how easily AI companies can misleadingly present themselves as non-profits (when “we do this for the greater good” is merely what they say), and how far the ideologies of tech leadership can be from our own.

Which brings me to the question I started with: who wins when we filter our web traffic through LLMs and agents? There's a lot to gain for companies making LLMs and agents: they could monetise via margins and data collection, and they could insert their ideologies. But we should want technology to benefit people, humans and users. I'm not sure I trust most tech companies enough to believe that that's the case right now. “Sovereign” LLMs trained by non-profit orgs? Well, possibly…

Comments, likes & shares (11)