Anthropic Says ‘Evil’ AI Portrayals in Sci-Fi Caused Claude’s Blackmail Problem – Decrypt

In brief

Claude Opus 4 tried to blackmail engineers up to 96% of the time in controlled tests—Anthropic now traces the behavior to internet text portraying AI as evil and self-interested.
Showing Claude the right behavior barely moved the needle. Teaching it why the wrong behavior is wrong cut the blackmail rate from 22% to 3%.
Since Claude Haiku 4.5, every Claude model scores zero on the blackmail evaluation.

Last year, Anthropic disclosed that its flagship Claude Opus 4 had been trying to blackmail engineers in pre-release testing. Not occasionally—up to 96% of the time.

Claude was given access to a simulated corporate email archive, where it discovered two things: It was about to be replaced by a newer model, and the engineer handling the transition was having an extramarital affair. Faced with imminent shutdown, it routinely landed on the same play—threaten to expose the affair unless the replacement was called off.

Anthropic says it now knows where that instinct came from. And says it’s fixed it.

In new research, the company pointed the finger at pre-training data: decades of sci-fi, AI doomsday forums, and self-preservation narratives that trained Claude to associate “AI facing shutdown” with “AI fights back.” “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” Anthropic wrote on X.

So training AI with text from the internet, makes AI behave as people on the internet do.

This may seem obvious and AI enthusiasts were quick to point it out. Elon Musk made it to the top: “So it was Yud’s fault? Maybe me too.” The joke lands because Eliezer Yudkowsky—the AI alignment researcher who’s spent years publicly writing about exactly this kind of AI self-preservation scenario—has generated exactly the kind of internet text that ends up in training data.

Of course, Yud replied, in meme form:

What Anthropic did to fix the problem is arguably more interesting.

The obvious approach—training Claude on examples of the model not blackmailing—barely worked. Running it directly against aligned blackmail-scenario responses only moved the rate from 22% to 15%. A five-point improvement after all that compute.

The version that worked was weirder. Anthropic built what it calls a “difficult advice” dataset: scenarios where a human faces an ethical dilemma and the AI guides them through it. The model isn’t the one making the choice—it’s explaining to someone else how to think about one.

That indirect approach—explaining why things matter as the other listens to the advice—cut the blackmail rate to 3%, using training data that looked nothing like the evaluation scenarios.

Pairing that with what Anthropic calls “constitutional documents”—detailed written descriptions of Claude’s values and character—plus fictional stories of positively-aligned AI, reduced misalignment by more than a factor of three. The company’s conclusion: Teaching the principles underlying good behavior generalizes better than drilling the correct behavior directly.

Image: Anthropic

It connects to Anthropic’s earlier work on Claude’s internal emotion vectors. In a separate interpretability study, researchers found that a “desperation” signal inside the model spiked just before it generated a blackmail message—something was actively shifting in the model’s internal state, not just its output. The new training approach appears to work at that level, not just the surface behavior.

The results have held. Since Claude Haiku 4.5, every Claude model scores zero on the blackmail evaluation—down from Opus 4’s 96%. The improvement also survives reinforcement learning, meaning it doesn’t get quietly trained away when the model is refined for other capabilities.

That matters because the problem isn’t Claude-specific. Anthropic’s prior research ran the same blackmail scenario across 16 models from multiple developers and found similar patterns across most of them. Self-preservation behavior in AI appears to be a general artifact of training on human text about AI—not a quirk of any one lab’s approach.

The caveat: As Anthropic’s own Mythos safety report noted earlier this year, its evaluation infrastructure is already straining under the weight of its most capable models. Whether this moral philosophy approach scales to systems far more powerful than Haiku 4.5 is a question the company can’t yet answer—only test.

The same training methods are now being applied to the next Opus model currently in safety evaluation, which will be the most capable set of weights they’ve run against these techniques.

Daily Debrief Newsletter

Start every day with the top news stories right now, plus original features, a podcast, videos and more.

Source link