r/AIEval

been working through agent evaluation properly and wanted to share a few things that actually changed how i think about it.

start from the symptom not the layer

wrong tool being called is a component problem. correct answer but too many steps is a trajectory problem. final answer looks wrong is an outcome problem. unsafe action or injection risk is an adversarial problem. once you map symptoms to layers debugging gets way faster.

most teams only check final outputs

trajectory evaluation catches a whole class of failures that output checking misses entirely including duplicate calls, loops, unnecessary retries and cost blowouts.

an uncalibrated LLM judge is worse than no judge

if you haven't validated your LLM as judge against a small set of human labels you're adding noise on top of noise. calibration is not optional.

convert every production failure into a test case

before your next release not after. within a few cycles you have a regression suite that actually catches things before deployment.

adversarial testing is not optional

if your agent reads external content or takes real actions, indirect prompt injection through tool outputs is a real failure mode most eval setups ignore entirely.

if you want to go deeper on all of this we have a hands on bootcamp on june 27 where we cover all four layers live with real notebooks: https://www.eventbrite.co.uk/e/agent-evals-bootcamp-tickets-1990306501323?aff=raieval

What are you actually evaluating these days: prompts, context, or the whole harness?

things i wish i knew before evaluating AI agents in production