If an AI agent detects risk but keeps acting, is that actually safe?
One thing i’ve been thinking about with healthcare ai agents:
a lot of “safety” still lives inside the prompt.
stuff like:
“don’t give medical advice”
“escalate if the user sounds at risk”
“be careful with sensitive cases”
“recommend human help when needed”
That’s fine as a behavior instruction, but i don’t think it’s enough for an agentic workflow, because the risky part is not only what the agent says it’s what the agent is still allowed to do after detecting risk.
For example, if a patient says something that should trigger escalation, but the agent can still:
- continue normal intake
- call a booking tool
- match a provider
- summarize the case as routine
- move the user to the next workflow step
then the safety layer is mostly language, not control.
In a healthcare agent, safety probably needs to sit before the agent continues.
The system should decide:
- is this normal flow?
- does this need caution?
- should tools be restricted?
- should the navigator stop completely?
- should this move to a human or emergency path?
and that decision should change what the agent can access or do not just ask the agent to “handle it carefully.”
To me, the real difference is:
prompt safety = the agent is told what it should do
runtime safety = the system decides what the agent is allowed to do
For people building agents in healthcare or other regulated workflows, are you treating safety as a prompt instruction, a separate evaluator, a workflow gate, or something else?