AI "Solve Rates" are a joke. We need a Safe-to-Merge metric.
Every AI coding tool brags about its SWE-bench solve rate. But solve rate only measures one thing: did the targeted test pass?
It doesn't measure whether the fix silently broke 8 other tests. Or whether the change introduced regressions that only show up in production. Or whether the code is actually safe to ship.
We've all merged AI-generated PRs that passed CI and broke something downstream. That's not a solve rate problem. That's a measurement problem.
I'm thinking of an open standard called Safe-to-Merge Rate (STMR). An agent's PR only qualifies if:
- The targeted test for the bug fix passes.
- 100% of the existing test suite still passes (zero regressions).
- Linters and type-checkers throw zero new errors.
- The full CI/CD pipeline builds successfully end-to-end.
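To make the four criteria concrete, here's a rough Python sketch of how STMR could be scored next to plain solve rate. Everything in it is an assumption for illustration: the `PRResult` record, its field names, and the sample data are made up, not taken from SWE-bench or any existing harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PRResult:
    """Outcome of one agent-generated PR after CI (hypothetical schema)."""
    target_test_passed: bool  # the test tied to the bug fix passes
    regressed_tests: int      # tests that passed before the PR and fail after it
    new_lint_errors: int      # linter errors not present on the base branch
    new_type_errors: int      # type-checker errors not present on the base branch
    pipeline_ok: bool         # full CI/CD pipeline built end-to-end


def is_safe_to_merge(r: PRResult) -> bool:
    """A PR qualifies only if all four criteria hold at once."""
    return (
        r.target_test_passed
        and r.regressed_tests == 0
        and r.new_lint_errors == 0
        and r.new_type_errors == 0
        and r.pipeline_ok
    )


def solve_rate(results: Sequence[PRResult]) -> float:
    """Fraction of PRs whose targeted test passes, ignoring everything else."""
    if not results:
        return 0.0
    return sum(r.target_test_passed for r in results) / len(results)


def safe_to_merge_rate(results: Sequence[PRResult]) -> float:
    """Fraction of PRs that satisfy every safe-to-merge criterion."""
    if not results:
        return 0.0
    return sum(is_safe_to_merge(r) for r in results) / len(results)


# Made-up sample: four PRs, three of which "solve" the task.
results = [
    PRResult(True, 0, 0, 0, True),   # clean fix
    PRResult(True, 8, 0, 0, True),   # fix passes but breaks 8 other tests
    PRResult(True, 0, 2, 0, True),   # fix passes but adds 2 lint errors
    PRResult(False, 0, 0, 0, True),  # targeted test still fails
]

print(f"solve rate: {solve_rate(results):.2f}")   # solve rate: 0.75
print(f"STMR: {safe_to_merge_rate(results):.2f}") # STMR: 0.25
```

On this toy data the same agent scores 0.75 on solve rate and 0.25 on STMR, which is the gap the metric is meant to expose. Counting only *new* lint and type errors assumes you have a baseline run on the base branch to diff against.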
Feedback wanted: Is this a metric the industry actually needs, or is it just SWE-bench with extra steps? How will agents try to game it?
u/Due-Rent4403 — 14 days ago