
🔀 Unveiling TwinRouterBench, an open-source router evaluation that looks at solutions, not just prompts. 🚦
TwinRouterBench has two tracks, static and dynamic, in one protocol.
Static focuses on cost and time efficiency: each question is labeled with the most cost- and time-efficient steps, the router makes its prediction, and predictions are scored against the labels.
Dynamic focuses on end-to-end evaluation on SWE-bench Verified with real tool use, using either the mini-swe-agent scaffold or an editor scaffold.
Scoring:
Leaderboard bill = routed spend + a fixed penalty per unresolved task.
This measures the trade-off: underspending lowers routed spend, but every task that fails as a result adds a penalty to the bill.
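
To make the bill formula concrete, here is a minimal Python sketch. The names `TaskResult` and `leaderboard_bill` and the penalty value of 10.0 are hypothetical, chosen only for illustration; they are not taken from the TwinRouterBench codebase, and the real penalty may be set differently.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record for one task; field names are illustrative only.
    routed_spend: float  # spend from the model calls the router chose
    resolved: bool       # whether the task ended up resolved


def leaderboard_bill(results, penalty_per_unresolved):
    """Bill = total routed spend + fixed penalty for each unresolved task."""
    spend = sum(r.routed_spend for r in results)
    unresolved = sum(1 for r in results if not r.resolved)
    return spend + penalty_per_unresolved * unresolved


# A router that underspends: cheap per task, but half the tasks fail.
cheap = [TaskResult(1.0, True), TaskResult(1.0, False),
         TaskResult(1.0, True), TaskResult(1.0, False)]

# A router that spends more per task and resolves everything.
strong = [TaskResult(3.0, True) for _ in range(4)]

PENALTY = 10.0  # assumed value, for illustration only
print(leaderboard_bill(cheap, PENALTY))   # 4.0 spend + 2 * 10.0 penalty = 24.0
print(leaderboard_bill(strong, PENALTY))  # 12.0 spend + no penalty = 12.0
```

With these made-up numbers, the cheaper router ends up with the larger bill, which is the underspending trade-off the score is meant to expose.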
It's an open-source benchmark:
→ Apache-2.0
→ Reference routers included (gold-tier oracle, SR-KNN, more)
→ PRs welcome for new workloads, new routers, scaffolds
GitHub: https://github.com/CommonstackAI/TwinRouterBench
arXiv coming soon.