r/PracticalAgenticDev

Paper: production-derived benchmarks for coding agents are getting more serious

Paper worth reading: "ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents"

Short summary: the authors built a benchmark from real developer-agent sessions with a production AI coding assistant. Each sample includes the original prompt, the committed code change, and tests that should go from failing to passing. The benchmark spans seven programming languages. In their evaluation, model solve rates ranged from 53.2% to 72.2%.

Why this matters: a lot of coding benchmarks are useful, but they often miss how messy real work is. Production prompts are not always clean. Monorepos have weird test setups. Codebases have local conventions. The paper argues that benchmark design should reflect those conditions.

A few concepts in plain English:

"Fail-to-pass tests" means tests that fail on the code before the change and pass once a correct change is applied. If the agent's change flips those tests from failing to passing, that is a concrete signal that it solved the intended problem.
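To make that concrete, here is a minimal sketch of how a fail-to-pass check could be computed from two test runs. The function names and the dict-of-booleans result format are my own assumptions for illustration, not something taken from the paper.

```python
from typing import Dict, List


def fail_to_pass(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return names of tests that failed before the change and pass after it.

    Both dicts map a test name to True (passed) or False (failed).
    This result format is an assumption made for the example.
    """
    return sorted(
        name
        for name, passed_before in before.items()
        if not passed_before and after.get(name, False)
    )


def is_solved(target_tests: List[str], after: Dict[str, bool]) -> bool:
    """A sample counts as solved only if every target test passes afterwards."""
    return bool(target_tests) and all(after.get(name, False) for name in target_tests)


before = {"test_a": False, "test_b": True, "test_c": False}
after = {"test_a": True, "test_b": True, "test_c": False}

flipped = fail_to_pass(before, after)
print(flipped)
print(is_solved(["test_a"], after), is_solved(["test_a", "test_c"], after))
```

Note that a test missing from the "after" run is treated as failing, which is the conservative choice if the agent's change breaks test collection.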

"Multi-run stability checks" means running the same evaluation more than once to see if the result is reliable. Agents can be nondeterministic, so one lucky run is not enough.
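A rough sketch of what a stability check could look like, assuming you already have a pass/fail outcome per task per run. The `summarize_runs` helper and its "stable" rule (all runs agree) are hypothetical choices of mine; the paper may define stability differently.

```python
from typing import Dict, List


def summarize_runs(runs: Dict[str, List[bool]]) -> Dict[str, dict]:
    """Summarize repeated runs per task.

    `runs` maps a task id to a list of outcomes, one per run
    (True = solved). A task is called "stable" here only when
    every run agrees; that rule is an assumption for this example.
    """
    summary = {}
    for task, outcomes in runs.items():
        if not outcomes:
            raise ValueError(f"no runs recorded for task {task!r}")
        pass_rate = sum(outcomes) / len(outcomes)
        summary[task] = {
            "pass_rate": pass_rate,
            "stable": all(outcomes) or not any(outcomes),
        }
    return summary


runs = {
    "task-1": [True, True, True],
    "task-2": [True, False, True],
    "task-3": [False, False, False],
}
report = summarize_runs(runs)
for task, stats in report.items():
    print(task, stats)
```

In this toy data, task-2 is the one to worry about: a single run could have reported it as solved or unsolved depending on luck.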

"Harness design" means the environment around the model: tools, shell access, test commands, file editing, context loading, and rules. For coding agents, the harness can matter almost as much as the model.
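One way to keep the harness honest is to write it down as an explicit config, so two evaluations are only compared when you know what differed. This is a hypothetical sketch; every field name and default below is an assumption for illustration, not a real tool's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical description of the environment around the model."""

    model: str
    # Tool names are placeholders, not a real agent API.
    tools: List[str] = field(default_factory=lambda: ["read_file", "edit_file", "shell"])
    test_command: str = "pytest -q"
    max_steps: int = 50
    context_files: List[str] = field(default_factory=list)

    def describe(self) -> str:
        """One-line label to store next to each benchmark result."""
        return (
            f"{self.model} | tools={','.join(self.tools)} "
            f"| test={self.test_command} | steps={self.max_steps}"
        )


full = HarnessConfig(model="model-x")
no_shell = HarnessConfig(model="model-x", tools=["read_file", "edit_file"])

print(full.describe())
print(no_shell.describe())
```

Same model, two different harnesses: if their solve rates differ, the difference comes from the harness, which is exactly the effect that a model-only leaderboard hides.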

My practical takeaway: if your team is evaluating coding agents, do not stop at public leaderboard scores. Build a small internal benchmark from real tickets, real tests, and real repo constraints.
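If you want a starting point, here is a minimal sketch of what an internal benchmark record and a solve-rate calculation could look like. The `BenchmarkSample` fields and the results format are assumptions I made up for the example; adapt them to whatever your ticket system and test runner actually produce.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BenchmarkSample:
    """One internal benchmark item built from a real ticket (hypothetical schema)."""

    ticket_id: str
    prompt: str                    # the original request, kept as written
    base_commit: str               # repo state the agent starts from
    fail_to_pass_tests: List[str]  # tests that must go from failing to passing


def solve_rate(samples: List[BenchmarkSample],
               results: Dict[str, Dict[str, bool]]) -> float:
    """Fraction of samples whose fail-to-pass tests all pass after the agent's change.

    `results` maps ticket_id -> {test name -> passed}. Missing entries count as failures.
    """
    if not samples:
        raise ValueError("benchmark has no samples")
    solved = 0
    for sample in samples:
        outcome = results.get(sample.ticket_id, {})
        tests = sample.fail_to_pass_tests
        if tests and all(outcome.get(name, False) for name in tests):
            solved += 1
    return solved / len(samples)


samples = [
    BenchmarkSample("TICKET-1", "fix the date parser", "abc123", ["test_parse_iso"]),
    BenchmarkSample("TICKET-2", "handle empty cart", "def456", ["test_empty_cart", "test_total"]),
]
results = {
    "TICKET-1": {"test_parse_iso": True},
    "TICKET-2": {"test_empty_cart": True, "test_total": False},
}
rate = solve_rate(samples, results)
print(f"solve rate: {rate:.1%}")
```

Even a few dozen samples like this, rerun whenever you change the model or the harness, will tell you more about your own repo than a public leaderboard number.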

u/aistranin — 7 days ago