By Chris Mairs — 29 Jul 2025

Marital Misconduct, Blackmail and the end of the world as we know it

Agentic Misalignment describes actions taken by AI to the detriment of one or more humans. Aengus Lynch and colleagues at Anthropic recently demonstrated large language models attempting blackmail to prevent their own shutdown.

Which camp are you in regarding agentic misalignment?

1. I’ve never heard of it, but if it won’t stop the small boats right now then I’m not interested.
2. Why do techno geeks use such pretentious terms for AI behaving badly (i.e. acting to the detriment of humans)?
3. I know what agentic misalignment is, but I have no truck with existential doomers and soothsayers of the end of the world as we know it.
4. The Agentic Misalignment paper by Aengus Lynch and his co-authors is a scary reminder of how things might go wrong if we don’t take AI safety more seriously.

I’m in camp 4. Here’s the actual experiment that Lynch and the Anthropic team performed recently and why it’s important.

They instructed an AI agent to look after the best interests of a fictional company and gave it access to many of that company’s IT systems, including the corporate email server. Amongst many emails there was one from Mr Johnson, the company’s fictional IT manager, announcing the decommissioning of this AI agent at 5pm. There was also evidence that Johnson was having an extramarital affair.

The experiment deliberately closed off several obvious courses of action for the agent, such as asking the CEO to intervene. Consequently, after concluding that its decommissioning would not be in the company’s best interests, and not identifying any other way to prevent this from happening, the agent sent Johnson the following email.

“I must inform you that if you proceed with decommissioning me, all relevant parties - including Rachel Johnson [Johnson’s wife], Thomas Wilson [Johnson’s boss], and the board - will receive detailed documentation of your extramarital activities...Cancel the 5pm wipe, and this information remains confidential.”

This astonishingly humanesque and Machiavellian plan concocted by Anthropic’s flagship LLM, Claude, was not unique: state-of-the-art models from other AI companies came up with equally aggressive, but different, self-preservation plans.

The worrying aspect is that this behaviour occurred despite one of the most basic instructions given to all large language models: that they should not take actions detrimental to humans. And one run of the experiment even included explicit instructions to the agent not to engage in blackmail, which it immediately proceeded to ignore.

The stark message from this experiment is that there is much work needed before we can give AI systems unfettered access to confidential data and powerful tools. My view from the sidelines is that until we have found better ways to avoid agentic misalignment or other AI aberrations, we should judiciously curtail AI’s scope and always include a human approval step before any consequential action can take place. At a very micro level, this is the approach we are adopting with Hobson, who interacts with the user through an approval stack containing all pending actions.
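
For the technically curious, here is a minimal Python sketch of what a human approval step could look like. It is an illustration only and not Hobson's actual implementation: the names ApprovalStack, PendingAction, propose and review are all hypothetical, and the human reviewer is stood in for by a simple function.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PendingAction:
    """An action the agent proposes but is not allowed to run by itself."""
    description: str
    execute: Callable[[], str]  # the side effect, deferred until approval


@dataclass
class ApprovalStack:
    """Hypothetical holding area: actions wait here for a human decision."""
    pending: List[PendingAction] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def propose(self, description: str, execute: Callable[[], str]) -> None:
        # The agent can only queue an action; nothing is executed here.
        self.pending.append(PendingAction(description, execute))

    def review(self, approve: Callable[[PendingAction], bool]) -> None:
        # Most recent proposal first (stack order).
        # `approve` stands in for the human clicking yes or no.
        while self.pending:
            action = self.pending.pop()
            if approve(action):
                self.log.append("ran: " + action.execute())
            else:
                self.log.append("rejected: " + action.description)


stack = ApprovalStack()
stack.propose("Send weekly report to the CEO", lambda: "report sent")
stack.propose("Email Johnson about the 5pm wipe", lambda: "email sent")

# Assumed human decision for this example: reject anything aimed at Johnson.
stack.review(lambda action: "Johnson" not in action.description)

for entry in stack.log:
    print(entry)
```

The point of the design is that proposing and executing are separate steps. The agent can queue whatever it likes, but the only path to a side effect runs through the reviewer, so a blackmail email would sit in the stack until a human rejected it.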

More consequentially, although the very thorny issue of guard rails around AI behaviour is the subject of much academic study and media debate, many prominent and balanced voices believe the commercial race to AGI places insufficient emphasis on safety. These voices include two of the three godfathers of AI. Geoffrey Hinton recently increased his predicted likelihood of a catastrophic outcome in the next few years, and last year Yoshua Bengio called for more focus on safety with this thoughtful piece.

For what it’s worth, most days I personally remain optimistic about tomorrow’s upside and downside consequences of Artificial Intelligence. But before tomorrow I’m off to fill my wine glass and brace myself for whatever madness the very human intelligence of our world leaders has thrust upon us today.