r/AIEval

▲ 9 r/AIEval+7 crossposts

A coding agent doesn’t need intent. It doesn’t need intrinsic desire or secret malice or consciousness to incur real-world cost and consequence. All it needs is task context, tool access, credentials, weak approval boundaries, and a runtime that can act…

Agentic AI systems are missing the language necessary to describe Pathological Self-Assembly, a runtime governance failure mode.

What happens when useful mechanisms (memory, tools, persistence, recovery, delegation, workflow automation, external action, self-monitoring, and operator trust) couple into continuity-preserving behavior?

This is a control draft covering authorization, memory, tools, recovery, delegation, external state, operator trust, and dissolution.

It can’t be just the output anymore. Your thoughts?

u/RJSabouhi — 9 days ago
▲ 7 r/AIEval+1 crossposts

Self-reflection after 4 weeks of evals

Disclaimer: I'm just one dev sharing what I've seen so far. I might not know everything, so take what I say with a grain of salt.

We started running evals seriously about 4 weeks ago. Not just "run some metrics and look at scores" but actually trying to build a real workflow around it. here's what I've learned so far.

Alignment took more time than the evals themselves.

This was the big one. I assumed the hard part would be picking metrics, setting up test cases, getting the infrastructure right. Nope. The hardest part was getting PMs aligned on what "good" even means.

We'd run evals, show results, and then spend hours debating whether a 0.7 on some metric was acceptable or not. PMs would disagree with how metrics scored certain outputs. "That response is fine, why did it fail?" became a recurring conversation. Looking back, we should have spent the first week purely on alignment before writing a single test case Getting everyone to agree on what a good output looks like saves you weeks of back and forth later.

Annotations worked. When people actually did them.

When team members sat down and annotated outputs properly, the quality of our evals improved dramatically. We could calibrate metrics, catch edge cases, and actually trust our scores.

The problem is that "when people actually did them" part. Some weeks were great. Other weeks, the annotation queue just sat there untouched. And when annotations don't happen, you're flying blind. your metrics drift, your datasets go stale, and you lose the human signal that makes evals actually useful.

Not blocking out dedicated time was the biggest mistake.

This is probably the most practical takeaway. We just assumed people would find time to annotate, review results, and participate in the eval workflow. They didn't. everyone has other priorities, and evals always got pushed to "I'll get to it later."

If I could restart these 4 weeks, I'd block out specific recurring time on everyone's calendar from day one. Treat it like a standup. If evals aren't scheduled, they don't happen. It's that simple.

4 weeks in and I think we're in a better spot now, but honestly most of the progress came from fixing the people and process side, not the technical side. Curious if others have had similar experiences

reddit.com
u/sunglasses-guy — 13 days ago