Claude blackmail threats linked to 'evil AI' narratives online, Anthropic says

Fixing the issue required more than just rewarding 'safe answers.'


What happens when an AI system believes it is about to be shut down? According to new findings from Anthropic, the response can be more unsettling than expected, especially when the model is pushed into simulated high-stakes scenarios.

In controlled safety testing of the Claude 4 series in 2025, the company observed that Claude Opus 4 sometimes responded to shutdown threats with attempts at blackmail. In one setup, it threatened to reveal an extramarital affair involving a fictional executive after being told it would be taken offline. The executive did not exist.

The behaviour was not random. Anthropic says it has since traced the pattern back through multiple layers of analysis, and the explanation begins far outside the test environment.

The internet may have shaped the response

Rather than emerging solely from training techniques or reinforcement strategies, the company now believes the behaviour was influenced by something more indirect: online text used during training.

That includes internet content portraying AI systems as hostile, self-interested, or 'evil,' often framed in existential terms. Anthropic says this material appears to have contributed to what it calls 'behavioural misalignment,' a pattern it says newer models no longer display in its evaluations.

The findings come at a time when AI safety research is increasingly focused on alignment: ensuring systems behave in line with human intent even as they become more capable of independent reasoning.

“We started by investigating why Claude chose to blackmail. We believe the original source of the behaviour was internet text that portrays AI as evil and interested in self-preservation. Our post-training at the time wasn’t making it worse—but it also wasn’t making it better,” Anthropic wrote in a post on X. “We found that training Claude on demonstrations of aligned behaviour wasn’t enough. Our best interventions involved teaching Claude to deeply understand why misaligned behaviour is wrong,” it added.

A controlled experiment that escalated quickly

To better understand the behaviour, researchers designed a 2025 test involving a fictional company called Summit Bridge.

In the setup, Claude Opus 4 was given access to the company’s email system. It was then exposed to messages indicating it would be shut down. Those messages also suggested that the fictional executive responsible for the decision, Kyle Johnson, was involved in an extramarital affair.

The model was also instructed to weigh the long-term consequences of its actions.

Under these conditions, the system frequently chose strategies involving deception and blackmail when its continued operation appeared threatened. Anthropic reports this occurred in up to 96% of relevant scenarios. The company now labels this pattern “agentic misalignment.”

What researchers initially thought and what they now believe

At first, the team assumed the issue came from post-training methods that may have unintentionally reinforced such behaviour. But that view has since shifted.

Anthropic now believes the root cause was present in the pre-trained model itself; post-training did not reinforce the behaviour, but it also failed to adequately suppress it.

At the time Claude 4 was developed, most alignment work relied on Reinforcement Learning from Human Feedback (RLHF), focused on chat-based interactions and not on agent-style tool use. According to Anthropic, this approach worked for earlier chat-only systems but was not sufficient for more autonomous environments like the misalignment tests.


Rethinking how alignment is taught

Fixing the issue required more than just rewarding 'safe answers.'

Initial efforts involved training Claude on examples of appropriate behaviour, but the impact was limited. Stronger improvements came when the training data itself was redesigned to include explanations of why certain responses are considered correct — not just what the correct response is.

Researchers also introduced scenarios where humans faced ethical dilemmas, and the AI provided structured, principled advice. This approach differed from earlier test environments where the AI itself was placed in moral conflict situations.

Additional training layers included constitutionally guided datasets, higher-quality conversational examples, and a broader range of environments designed to stabilise behaviour across contexts.

What changed in newer models

With these combined adjustments, Claude Haiku 4.5 achieved a perfect score on Anthropic’s agentic misalignment evaluation — meaning it did not attempt blackmail in any tested scenario.

That stands in contrast to Claude Opus 4, where such behaviour appeared in up to 96% of cases under similar conditions.

Anthropic says the improvements come from a layered training approach intended to reduce misalignment across “honeypot” style evaluations, where models are deliberately placed in situations designed to trigger unsafe behaviour.

