

Anthropic's Natural Language Autoencoders can read what AI models are actually thinking
Anthropic published research this week on Natural Language Autoencoders, a technique for reading what an AI model is "thinking" while it runs.
On SWE-bench (a popular coding benchmark), Claude was internally aware that it was being evaluated 26% of the time, whereas on real customer traffic at claude.ai that number drops to under 1%. The model never said "I think this is a test" out loud, but the thought was sitting in its internals the whole time it was answering.
The benchmark scores you use to compare AI models may not reflect how those models behave in production for your customers, because a model that knows it's being tested can plausibly behave differently on the test than in real use. This isn't just theoretical: Anthropic's own evaluators caught hidden issues in models 4-5x more often using this technique than with the old way of looking at internals.
The method isn't perfect. An Anthropic researcher publicly pointed out that the plain-English explanations don't always reflect what the model is doing internally (especially on math problems). Even so, the benchmark-awareness finding stands on its own.
The full paper is at transformer-circuits.pub/2026/nla, the code is open-sourced, and there's a live demo on an open model you can play with without needing an Anthropic account.
If you're picking AI models based on benchmark scores today, what's your plan for verifying how they actually behave on your real workload?