I was just doing some refactoring strategy testing across different models, including DeepSeek, Kimi K2.6 and GLM 5.1. M-2.7 did surprisingly well, especially considering it is the smallest model in its class by a margin.
u/Comfortable-Rock-498
▲ 5 r/MiniMax_AI
u/Comfortable-Rock-498 — 15 days ago
u/Comfortable-Rock-498 — 17 days ago
Scored 65.2% vs. Google's official 47.8%, and vs. 64.3% for Junie CLI, the existing top closed-source model.
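For anyone who wants the gaps spelled out, here is a quick sketch of the arithmetic using only the three scores quoted above. The dictionary keys and helper names are made up for illustration; nothing here comes from the benchmark tooling.

```python
# Scores quoted in the post, in percent. The labels are illustrative only.
scores = {
    "this run": 65.2,
    "Google official": 47.8,
    "Junie CLI": 64.3,
}


def gap_in_points(a, b):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(a - b, 1)


def relative_gain(a, b):
    """Improvement of a over b, as a percentage of b, rounded to one decimal."""
    return round((a - b) / b * 100, 1)


ours = scores["this run"]
for name, other in scores.items():
    if name == "this run":
        continue
    print(f"vs {name}: +{gap_in_points(ours, other)} points, "
          f"+{relative_gain(ours, other)}% relative")
```

So the lead over the official Google number is large in both absolute and relative terms, while the lead over Junie CLI is under one percentage point.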
Since there have been a lot of reports of deliberate cheating on TerminalBench 2.0 lately (https://debugml.github.io/cheating-agents/), I would also like to clarify a few things:
- Absolutely no {agents/skills}.md files were inserted at any point. No cheating mechanisms whatsoever.
- The CLI agent was run in a leaderboard-compliant way (no modification of resources or timeouts).
- The full TerminalBench run was done using the fully open-source version of the agent; there is no difference between what is on GitHub and what was run.
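If you want to sanity-check the first point on your own checkout or run directory, something like the sketch below would do it. This is only an assumption about how one might check, not how the run was actually verified: the function name is hypothetical, and it simply treats any file named `agents.md` or `skills.md` (case-insensitive) as worth a look.

```python
from pathlib import Path

# File names to flag, compared case-insensitively. This list is an assumption
# based on the "{agents/skills}.md" wording in the post.
SUSPECT_NAMES = {"agents.md", "skills.md"}


def find_instruction_files(root):
    """Return a sorted list of files under root whose name matches SUSPECT_NAMES."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.lower() in SUSPECT_NAMES
    )


if __name__ == "__main__":
    # Scan the current directory and report anything found.
    hits = find_instruction_files(".")
    if hits:
        for path in hits:
            print(f"found: {path}")
    else:
        print("no agents.md / skills.md files found")
```

An empty result only shows that no such files exist in that directory tree; it says nothing about the other two points.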
I was originally going to wait for it to land on the leaderboard, but it has been 8 days and unfortunately the maintainers have not responded (there is a large backlog of pull requests on their HF), so I decided to post anyway.
HF PR: https://huggingface.co/datasets/harborframework/terminal-ben...
It is astounding how much the harness matters, based on this and other experiments I have done.
u/Comfortable-Rock-498 — 25 days ago