u/Due-Rent4403

Every AI coding tool brags about its SWE-bench solve rate. But solve rate only measures one thing: did the targeted test pass?

It doesn't measure whether the fix silently broke 8 other tests. Or whether the change introduced regressions that only show up in production. Or whether the code is actually safe to ship.

We've all merged AI-generated PRs that passed CI and broke something downstream. That's not a solve rate problem. That's a measurement problem.

I was thinking of an open standard called Safe-to-Merge Rate (STMR). An agent's PR only qualifies if:

The targeted bug fix passes.
100% of the existing test suite still passes (zero regressions).
Linters and type-checkers throw zero new errors.
The full CI/CD pipeline builds successfully end-to-end.
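
To make the four gates above concrete, here's a rough sketch of how STMR could be scored next to a plain solve rate. Everything in it is an assumption for illustration: the `PRResult` record, its field names, and the idea that a harness has already collected these numbers per PR. It's not a reference implementation of any existing benchmark.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PRResult:
    """Outcome of one agent PR. All field names are hypothetical."""
    target_test_passed: bool   # gate 1: the targeted bug fix passes
    regressed_tests: int       # gate 2: previously passing tests that now fail
    new_static_errors: int     # gate 3: new linter + type-checker errors vs. baseline
    pipeline_succeeded: bool   # gate 4: full CI/CD pipeline built end-to-end


def is_safe_to_merge(r: PRResult) -> bool:
    """A PR qualifies only if every gate holds at once."""
    return (
        r.target_test_passed
        and r.regressed_tests == 0
        and r.new_static_errors == 0
        and r.pipeline_succeeded
    )


def solve_rate(results: Sequence[PRResult]) -> float:
    """Fraction of PRs whose targeted test passed, ignoring everything else."""
    if not results:
        return 0.0
    return sum(r.target_test_passed for r in results) / len(results)


def safe_to_merge_rate(results: Sequence[PRResult]) -> float:
    """Fraction of PRs that pass all four gates."""
    if not results:
        return 0.0
    return sum(is_safe_to_merge(r) for r in results) / len(results)


# Made-up example data, not real benchmark results.
results = [
    PRResult(True, 0, 0, True),    # clean fix
    PRResult(True, 8, 0, True),    # fix passes but breaks 8 other tests
    PRResult(True, 0, 3, True),    # fix passes but adds 3 new lint/type errors
    PRResult(False, 0, 0, True),   # targeted test still fails
]
print(solve_rate(results), safe_to_merge_rate(results))
```

With that made-up data, three of the four PRs count as "solved" but only one is safe to merge. That gap is what the metric is meant to expose.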

Feedback wanted: Is this a metric the industry actually needs, or is it just SWE-bench with extra steps? How will agents try to game it?

AI "Solve Rates" are a joke. We need a Safe-to-Merge metric.