u/FlimsyProperty8544

Curious what people here think.

When teams start building AI agents, I feel like the first eval setup is usually way too output-focused. They check whether the final answer “looks right,” but they don’t really evaluate the path the agent took to get there.

For simple RAG apps, final-answer evals can get you pretty far. But with agents, a lot can go wrong before the final response:

wrong tool selected
correct tool used with bad arguments
unnecessary retries
bad retrieval step
hallucinated intermediate reasoning
model recovered by accident
final answer looks fine, but the trace is a mess

That last one feels especially dangerous because the agent passes the “vibe check,” but you have no idea whether it will stay reliable under slightly different inputs.

I’ve started thinking that agent evals need to be more trace-first. Not just “did the final output pass?” but also:

did each step make sense?
did the agent use the right tools?
did it retrieve the right context?
did it avoid unnecessary work?
did the failure happen in planning, retrieval, tool use, or synthesis?

Basically, evals for agents feel less like grading an answer and more like debugging a workflow.

What’s the biggest eval mistake you’ve seen teams make? And do you mostly evaluate final outputs, full traces, or both?

What’s the biggest eval mistake you see teams make with AI agents?