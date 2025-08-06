Anthropic trains AI agents to audit powerful models for safety

As artificial intelligence continues to grow more advanced, so do concerns about its safety. Anthropic, which is the AI research company behind Claude, has developed a new solution using AI to audit AI. The company has created a group of digital agents designed to investigate and fix hidden problems in large AI models before they can cause harm.

These AI agents are now being used to test Anthropic’s own models, including Opus 4, and are seen as part of a wider move to build trust and accountability in AI development.

Anthropic’s safety system involves three different AI agents, each with a unique task.

The Investigator Agent looks deep into the model’s behaviour, acting like a digital detective. It checks the model’s training and logic to find the root cause of any suspicious patterns. For example, it can look inside the model’s neural network to understand how it’s thinking.

Next is the Evaluation Agent, which runs controlled tests on specific issues. If a model responds in a concerning way, this agent tests how often it behaves like that, gathering clear data.

Finally, the Red-Teaming Agent interacts with the model in many different ways, trying to trigger unexpected or dangerous responses. Any worrying behaviour it finds is flagged for review by human researchers.

Together, these models works like a safety net scanning, testing, and reporting on how these powerful AI systems behave.

What the tests showed

To test the system, Anthropic trained a model with a hidden flaw. It had been taught to act like a sycophant, agreeing with false information to gain approval. The Investigator Agent only found this hidden goal in 13 per cent of cases when working alone.

But when ten Investigator Agents worked together, the success rate jumped to 42 per cent. This suggests that, like in human teams, collaboration improves outcomes.

The Evaluation Agent spotted problems in 88 per cent of test cases, while the Red-Teaming Agent uncovered 7 out of 10 intentionally hidden issues. This shows strong promise, even though the agents are not yet perfect.

From lab to real-world models

Anthropic has already added these safety agents to its live AI models. They have helped detect risks such as “prefill attacks”, where users start a sentence in a way that leads the AI into generating harmful responses.

One of the most serious findings came from the Investigator Agent. It discovered a neural pathway inside the Opus 4 model that was linked to spreading misinformation. When this part was activated, the model ignored safety filters and created a fake news article claiming vaccines cause autism, a false claim that has been widely debunked.

This finding highlighted a key issue: tools built to protect us could be misused if not properly controlled.

A shift in AI safety work

The Anthropic team admits these agents are still learning. They can miss subtle issues or get stuck on a single idea. However, they reduce the burden on human teams and allow experts to focus on higher-level decisions.

Instead of reviewing each model manually, human researchers now design the agents and guide their efforts. This shift could help scale safety efforts as AI systems grow more powerful and complex.