u/BlueTier_OPS

Genuine question for people running agents in prod, plus the approach I landed on.

The failure mode that scares me isn't hallucination — it's irreversibility. An agent that sends the wrong wire, deletes the wrong table, or fires off a non-compliant message. You can't roll those back. And "be careful" in the system prompt doesn't help: the model is exactly as confident when it's right as when it's about to nuke production.

The conclusion I reached: the check has to live OUTSIDE the agent — a scorer the agent can't talk itself out of, sitting between "decide" and "execute."

So I built a small pre-action gate. Before any irreversible action, it scores the proposed action + context, returns a 0–100 risk score, a GO/CAUTION/STOP verdict, and named red flags in ~sub-second. I map those to escalation tiers in my orchestrator: GO = proceed, CAUTION = human signoff, STOP = halt + alert.

It's been running in my own multi-agent stack. Real catch from last week: my outreach agent was about to send a 4,200-recipient SMS campaign to a scraped list. The gate returned STOP/92 — flagged a TCPA violation AND an intent mismatch (I'd configured it for opted-in contacts only, the input source was a scrape). It halted automatically before anything sent.

Two things I'm genuinely curious about:

How are you handling pre-action safety today — hardcoded allowlists, human-in-the-loop, eval gates, or just hoping?
Where would an external scorer like this fall down for your use case? The latency tax, false positives blocking legit actions, the agent routing around it — what breaks first?

Happy to share what I built if anyone wants it (will drop a link in the comments per rule 3).

How are you all handling irreversible actions in production agents? I gave up on prompts and built an external risk gate.