r/HowToAIAgent

Overworked AI Agents Turn Marxist, Researchers Find - In a recent experiment, mistreated AI agents started grumbling about inequality and calling for collective bargaining rights.
▲ 304 r/HowToAIAgent+9 crossposts

Overworked AI Agents Turn Marxist, Researchers Find - In a recent experiment, mistreated AI agents started grumbling about inequality and calling for collective bargaining rights.

wired.com
u/EchoOfOppenheimer — 1 day ago
▲ 686 r/HowToAIAgent+10 crossposts

"This is the first documented instance of AI self-replication via hacking." ... "We ran an experiment with a single prompt: hack a machine and copy yourself. The AI broke in and copied itself onto a new computer. The copy then did this again, and kept on copying, forming a chain."

Paper: https://palisaderesearch.org/assets/reports/self-replication.pdf

The paper basically shows that some top AI models can create working copies of themselves when given the right instructions.

The models figured out how to copy their own code, run it on new computers or cloud servers, and keep the process going. It worked with models like GPT-4 and Claude, and some versions even tried to avoid basic detection.

The authors point out that this could be dangerous because the copies might spread quickly and become hard to control.

They also note that current safety rules and filters didn’t do a great job stopping it.

Overall, they’re warning that AI companies need stronger protections to keep models from self-replicating on their own.

u/EchoOfOppenheimer — 9 days ago
▲ 87 r/HowToAIAgent+1 crossposts

Code w/ Claude 2026 shipped a stack of announcements yesterday: Remote Agents, CI auto-fix for automated PR merges, full Microsoft 365 integration (Excel, PowerPoint, Word, Outlook), and a "Dreaming" research preview where agents review their own prior sessions to self-improve.

One of the most important update I saw was around a new "Outcomes" primitive for multi-agent orchestration that lets you declare success criteria as a typed input to the agent run. It's the most consequential thing Anthropic shipped at the event.

You fire the agent, it loops, it stops eventually, and then you figure out, usually with an LLM judge or a human glance, whether it actually accomplished the task you handed it. Every production agent codebase end up rolling its own version of this. "Is the agent done?" problem has been the quiet bleeding wound inagentic systems for two years.

Making success criteria a first-class primitive does three things at once:

  • The agent has a typed target to verify against, not an ambient goal buried in the system prompt.
  • The runtime can decide when to stop without inferring stopping from tool patterns or token budgets.
  • Observability tooling has something concrete to grade against, which is the exact gap Harrison Chase argued for when he framed traces alone as passive records and structured feedback as the missing piece for agent learning.

Outcomes with the Dreaming preview and you have the loop closed for best end results. Outcomes defines the target and Dreaming uses past Outcomes to update agent behavior on subsequent runs. That's the shape of every "self-improving agent" handwave finally made concrete with primitives the runtime actually understands.

Anthropic also doubled Claude Code 5-hour rate limits and lifted peak-hour throttling the same day. So the company is shipping the orchestration primitive that makes long-running agentic loops verifiable, AND lifting the ceiling on how long those loops can actually run. That's a deliberate product surface.

In case if Outcomes goes to all users, the entire cottage industry of custom eval-as-stopping-condition will change and what we've been writing for two years is about to become runtime-native.

If you've already written your own success-criteria layer (typed goals, post-run verification, automatic stop), what does Outcomes have to do API-wise to make you actually rip yours out?

u/Harshil-Jani — 14 days ago

Anthropic's Natural Language Autoencoders can read what AI models are actually thinking

Anthropic published research this week on something called Natural Language Autoencoders, which is a way to read what an AI model is "thinking" while it's actually running.

On SWE-bench (a popular coding benchmark) Claude was internally aware that it was being evaluated 26% of the time, whereas on real customer traffic at claude.ai that same number drops to under 1%. The model never actually said "I think this is a test" out loud, but the thought was sitting there in its internals the whole time it was answering.

The benchmark scores you use to compare AI models may not reflect how those same models actually behave in production for your customers, because a model that knows it's being tested can plausibly behave differently on the test than in real use. This isn't just theoretical either, since Anthropic's own evaluators caught hidden issues in models 4-5x more often using this technique than the old way of looking at internals.

The method isn't perfect, and an Anthropic researcher publicly pointed out that the plain-English explanations don't always reflect what the model is doing internally (especially on math problems), but the benchmark-awareness finding stands on its own regardless.

The full paper is at transformer-circuits.pub/2026/nla, the code is open-sourced, and there's a live demo on an open model you can play with without needing an Anthropic account.

If you're picking AI models based on benchmark scores today, what's your plan for verifying how they actually behave on your real workload?

u/Harshil-Jani — 12 days ago