u/More-Version3682

Hello developers,

I've recently been building and testing AI agents, and one thing that keeps coming up is flaky evaluations caused by the non-deterministic nature of LLMs.

Sometimes a test case fails, I rerun it immediately, and it passes without any code changes. Other times the agent produces a slightly different reasoning path that still reaches the correct outcome.

For teams shipping agentic products:

How much tolerance do you allow for these kinds of failures in CI/CD?
Do you rerun failed evaluations before failing a build?
How do you distinguish between genuinely broken behavior and sporadic LLM variability?
Are your PR gates based on individual test cases, aggregate metrics, statistical significance, or something else?

I'm curious how mature teams handle this in production, since the traditional "all tests must pass" approach seems hard to apply when some variability is inherent to the system.

Would love to hear what has worked (and what hasn't) for your teams.

How are you testing your AI agents?