u/HunterWHT_WaNG

I’ve been thinking about a strange tradeoff in agent design. A lot of “agent safety” discussion still sounds like chatbot safety: better prompts, better alignment, fewer hallucinations. But once an agent is connected to real tools, the problem changes.

The useful part of an agent is that it can operate with delegated capability: read from a mailbox, inspect a repo, call an API, edit a file, submit a form, trigger a workflow. But the moment I give it those capabilities, I am no longer only evaluating model output. I am trusting a system to decide when and how to exercise authority on my behalf.

In other words, I don’t think the hard problem is simply: “Can the model make the right decision?” It is also: “What is the model structurally unable to do, even if it makes the wrong decision?”
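To make "structurally unable" a bit more concrete, here is a rough sketch of the kind of thing I have in mind. Everything in it is made up for illustration: `ToolGateway`, `CapabilityError`, and the two toy tools are hypothetical names, not any real framework's API. The idea is just that the permission check lives outside the model, so a bad decision fails at the boundary instead of executing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


class CapabilityError(PermissionError):
    """Raised when the agent asks for a tool outside its grant (hypothetical)."""


@dataclass(frozen=True)
class ToolGateway:
    """Hypothetical wrapper: the agent only ever talks to tools through this."""

    tools: Dict[str, Callable[..., object]]
    granted: FrozenSet[str]

    def call(self, name: str, *args, **kwargs):
        # The check runs in ordinary code, not in the prompt. Whatever the
        # model decides, an ungranted tool is rejected before it runs.
        if name not in self.granted:
            raise CapabilityError(f"tool {name!r} is not granted to this agent")
        return self.tools[name](*args, **kwargs)


# Toy stand-ins for real tools (assumptions, not real integrations).
def read_mailbox() -> list:
    return ["hello"]


def send_email(to: str, body: str) -> str:
    return f"sent to {to}"


# This agent may read mail but has no grant to send it.
gateway = ToolGateway(
    tools={"read_mailbox": read_mailbox, "send_email": send_email},
    granted=frozenset({"read_mailbox"}),
)

print(gateway.call("read_mailbox"))

try:
    gateway.call("send_email", "someone@example.com", "hi")
except CapabilityError as err:
    print(f"blocked: {err}")
```

An in-process check like this is only a sketch of the shape. In a real system I would expect the boundary to be something the agent's process cannot bypass, such as scoped credentials or a separate service that holds the actual authority.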

There is a product problem too. If you constrain everything, the agent becomes a chatbot again. If you allow everything, it kinda becomes terrifying.

So I’m curious how other people are thinking about this.

Where do you draw the boundary for agents acting on your behalf?

AI agents become useful at the exact point they become risky.