u/Afraid_Translator402

My cofounder and I are experimenting with agent reliability tooling. We've been running thousands of agent tasks on tau-bench (the airline customer service benchmark), trying to automatically detect when agents fail and to improve their accuracy.

However, we're stuck on something and curious if anyone else has hit this.

Catching wrong actions is relatively straightforward: you can compare each tool call against the constraint and flag any violation.
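To make the "wrong action" case concrete, here is a minimal sketch of that kind of check. Everything in it is an assumption for illustration: the `ToolCall` shape, the tool names, and the example constraint are made up and are not tau-bench's actual schema or policy.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


# Hypothetical constraint: returns True when the call is allowed.
def no_basic_economy_change(call: ToolCall) -> bool:
    return call.args.get("cabin") != "basic_economy"


# Hypothetical registry: tool name -> constraints that apply to it.
CONSTRAINTS = {"change_flight": [no_basic_economy_change]}


def flag_wrong_actions(trace: list[ToolCall]) -> list[tuple[str, str]]:
    """Return (tool name, violated constraint name) for every violating call."""
    flagged = []
    for call in trace:
        for check in CONSTRAINTS.get(call.name, []):
            if not check(call):
                flagged.append((call.name, check.__name__))
    return flagged


trace = [
    ToolCall("change_flight", {"cabin": "basic_economy"}),
    ToolCall("change_seat", {"seat": "12A"}),
]
violations = flag_wrong_actions(trace)
```

The point is that every violation has a tool call to anchor on, which is exactly what a missing action lacks.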

But catching missing actions is a different beast. In one of our experiments, the user asks to add baggage and change their seat. The agent changes the seat but never touches the baggage, and the conversation ends as if nothing happened. There is no error anywhere in the trace. In real life, you would only catch this when the customer complains or someone manually checks.

So we built a tracker that parses what the user asked for and checks whether each thing actually got done by the end of the session.
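Roughly, the tracker idea looks like the sketch below. It assumes the user's requests have already been parsed into intent labels by some upstream step, and the intent and tool names are hypothetical.

```python
# Hypothetical mapping from a parsed user intent to the tool that would fulfil it.
INTENT_TO_TOOL = {
    "add_baggage": "update_baggage",
    "change_seat": "update_seat",
}


def missing_actions(requested_intents: list[str], called_tools: list[str]) -> list[str]:
    """Return the requested intents whose fulfilling tool never appears in the session.

    An intent with no entry in INTENT_TO_TOOL is also reported as missing,
    because there is no tool call that could count as evidence for it.
    """
    called = set(called_tools)
    return [i for i in requested_intents if INTENT_TO_TOOL.get(i) not in called]


# The baggage + seat session: only the seat tool was ever called.
skipped = missing_actions(["add_baggage", "change_seat"], ["update_seat"])
```

Here `skipped` comes out as `["add_baggage"]`, which is the silent skip we want to surface.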

The problem is that sometimes the agent was right not to do something. Policy blocked the flight change. The user changed their mind halfway through. The agent tried, but the API timed out and the user said "forget it, just transfer me to someone". All of these look identical to "agent silently skipped an action" if you're only checking whether a tool got called.
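One way to frame it is to give each requested intent an outcome instead of a called / not-called bit. The sketch below is only an illustration of that framing: the event schema (`kind`, `intent`) and the event kinds are assumptions, and producing those labels from a raw conversation is the hard part that this code does not solve. It also ignores event ordering.

```python
from enum import Enum


class Outcome(Enum):
    DONE = "done"
    POLICY_BLOCKED = "policy_blocked"
    USER_WITHDREW = "user_withdrew"
    FAILED_AND_ESCALATED = "failed_and_escalated"
    SILENT_SKIP = "silent_skip"


def classify(intent: str, events: list[dict]) -> Outcome:
    """Assign an outcome to one requested intent from labeled session events."""
    kinds = {e["kind"] for e in events if e.get("intent") == intent}
    if "tool_success" in kinds:
        return Outcome.DONE
    if "policy_refusal" in kinds:
        return Outcome.POLICY_BLOCKED
    if "user_cancel" in kinds:
        return Outcome.USER_WITHDREW
    # A tool error only counts as handled if the session was handed to a human.
    escalated = any(e["kind"] == "transfer_to_human" for e in events)
    if "tool_error" in kinds and escalated:
        return Outcome.FAILED_AND_ESCALATED
    # No evidence explaining the gap: this is the case worth flagging.
    return Outcome.SILENT_SKIP


events = [
    {"kind": "tool_success", "intent": "change_seat"},
    {"kind": "policy_refusal", "intent": "change_flight"},
    {"kind": "tool_error", "intent": "add_insurance"},
    {"kind": "transfer_to_human", "intent": None},
]
outcomes = {
    i: classify(i, events)
    for i in ["change_seat", "change_flight", "add_insurance", "add_baggage"]
}
```

With this framing only `SILENT_SKIP` gets flagged as a failure, so the precision problem turns into a question of how reliably the explanatory events can be detected.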

We're at about 50% precision right now, meaning half the stuff we flag as a failure isn't actually a failure. The agent made the right call; we just can't tell the difference yet.

Anyone building agents in production running into similar stuff? Or working on evals/monitoring that deals with this? Would love to compare notes.

How do you catch when an AI agent skips something it was supposed to do?