
u/rohansrma1

GPT-5.5 is OpenAI’s best model. But your benchmark might be measuring the judge too.
posted part 1 last week about model costs/scores and a bunch of people pushed back (fairly) that one benchmark shouldn’t be treated like a universal truth. totally agree with that btw.
but while going through the follow-up analysis from Tessl, i think the more interesting finding actually ended up being the judges themselves:
https://tessl.io/blog/your-benchmarks-are-lying-to-you-and-your-judge-is-to-blame/
same 6 models. same 11 engineering skills. same outputs.
only the judge changed. and the rankings moved around way more than i expected.
opus-4-7 stays #1 regardless of judge:
94.5 under Sonnet
89.2 under GPT-5.5
96.5 under Opus
so the top-end signal seems pretty real. but below that, things get messy fast.
gpt-5.3 goes from rank #3 under Sonnet to rank #5 under GPT-5.5. 91.9 vs 75.7 on the exact same outputs. that’s a 16 point swing caused purely by swapping the evaluator. one individual skill apparently shifted by 47 points depending on which judge graded it.
that’s the part that stood out to me most because it explains a lot of the reactions on the last post too.
some people in the comments were saying:
“there’s no way composer-2 is that good”
others were saying:
“opus 4.7 is miles ahead in practice”
others focused entirely on cost/performance
and honestly all of them can kinda be “right” depending on what the evaluator rewards.
| Judge | Avg without-skill | Avg with-skill | Avg lift |
|---|---|---|---|
| Sonnet | 76.1 | 90.3 | +14.2 |
| Opus-4-7 | 72.6 | 88.3 | +15.7 |
| GPT-5.5 | 70.7 | 83.4 | +12.7 |
Sonnet was consistently the most lenient. GPT-5.5 was the strictest.
That's almost a 7 point gap in average with-skill score (90.3 vs 83.4) between judges grading the same work.
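if you want to sanity-check the table, here's a minimal sketch that recomputes the lift per judge and the spread between the most lenient and strictest judge. the numbers are copied from the table; the names (`JUDGE_AVERAGES`, `lift`, `judge_gap`) are just made up for illustration.

```python
# Averages copied from the judge table: (without-skill, with-skill).
JUDGE_AVERAGES = {
    "Sonnet": (76.1, 90.3),
    "Opus-4-7": (72.6, 88.3),
    "GPT-5.5": (70.7, 83.4),
}


def lift(judge):
    """With-skill average minus without-skill average for one judge."""
    without_skill, with_skill = JUDGE_AVERAGES[judge]
    return round(with_skill - without_skill, 1)


def judge_gap(column):
    """Spread between the highest and lowest judge average.

    column 0 = without-skill, column 1 = with-skill.
    """
    scores = [row[column] for row in JUDGE_AVERAGES.values()]
    return round(max(scores) - min(scores), 1)


for judge in JUDGE_AVERAGES:
    print(judge, lift(judge))
print("without-skill gap:", judge_gap(0))
print("with-skill gap:", judge_gap(1))
```

the with-skill gap is where the "almost 7 points" comes from; the without-skill gap is a bit smaller.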
the self-judge stuff is interesting too:
- Opus grading itself gets a +4.6 boost vs cross-judge average
- GPT-5.5 grading itself actually scores lower than the other judges gave it
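rough sketch of how a self-judge boost like that can be computed, using the three opus-4-7 scores quoted earlier in this post. i'm assuming "cross-judge average" means the mean of the scores from the other two judges; `self_judge_delta` is a hypothetical helper, not something from the Tessl analysis. with these rounded inputs it lands at roughly +4.65, in line with the figure above.

```python
from statistics import mean

# opus-4-7 scores from the post, keyed by the judge that graded it.
OPUS_SCORES_BY_JUDGE = {"Sonnet": 94.5, "GPT-5.5": 89.2, "Opus": 96.5}


def self_judge_delta(scores_by_judge, self_judge):
    """Self-graded score minus the mean score given by the other judges.

    Assumption: this is what "cross-judge average" refers to.
    """
    others = [s for judge, s in scores_by_judge.items() if judge != self_judge]
    return scores_by_judge[self_judge] - mean(others)


delta = self_judge_delta(OPUS_SCORES_BY_JUDGE, "Opus")
print(f"{delta:+.2f}")
```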
so yeah, maybe i’m biased because i work at Tessl, but i think the takeaway here is less: “this model wins”
and more: “single-judge evals are probably noisier than most people think”
especially once the model gaps get small.
BUT THE MOST INTERESTING PART!!
The average leaderboard still looks very, very close to the previous benchmark. check the attached screenshot.
the part that took us a while to realize about workspace roles
we went through this exact mess a few months ago when we were onboarding a bigger team and nobody had really thought through who should be able to do what. It sounds boring until you're four weeks in and someone has accidentally published something to the wrong workspace or a new hire has been sitting with basically no access because nobody remembered to bump their role up from the default. The default member role is pretty limited intentionally, like search and install only, which is fine for a joe-who-started-monday situation but it's easy to forget to revisit it.
what i ended up doing was just mapping out who needed what before touching any of the settings, because otherwise you end up clicking around reactively. The breakdown we landed on basically looks like this:
| user type | what they actually need | role |
|---|---|---|
| org admin (samira) | manage all workspaces, add users, create new spaces | org admin |
| lead engineer (eddie) | publish in engineering workspace, read-only elsewhere | publisher in one workspace, member in others |
| manager (jennifer) | add users, manage workspace settings, publish | workspace admin |
| new hire (joe) | search and install skills, nothing else yet | member |
the part that took us a while to realize is that workspace-level roles and org-level roles are separate, so you can give someone full access in one workspace without touching their permissions anywhere else. That's not obvious from just poking around the ui. We had someone set up like samira across everything when they really should have been like eddie, just scoped to their team's workspace.
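to make the org-level vs workspace-level split concrete, here's a small sketch of how i think about it. this is not Tessl's API or data model. the permission sets are my reading of the table above, the `design` workspace is invented, and "org admin can do anything in any workspace" is an assumption.

```python
from dataclasses import dataclass, field

# Illustrative permission sets pieced together from the role table.
WORKSPACE_ROLE_PERMISSIONS = {
    "member": {"search", "install"},
    "publisher": {"search", "install", "publish"},
    "workspace admin": {"search", "install", "publish", "add_users", "manage_settings"},
}


@dataclass
class User:
    name: str
    is_org_admin: bool = False  # org-level role, separate from workspace roles
    workspace_roles: dict = field(default_factory=dict)  # workspace -> role


def can(user, action, workspace):
    """Assumed rule: org admins pass everywhere, others are checked per workspace."""
    if user.is_org_admin:
        return True
    role = user.workspace_roles.get(workspace)
    return role is not None and action in WORKSPACE_ROLE_PERMISSIONS[role]


# "design" is a made-up second workspace for the example.
eddie = User("eddie", workspace_roles={"engineering": "publisher", "design": "member"})
samira = User("samira", is_org_admin=True)

print(can(eddie, "publish", "engineering"))  # True
print(can(eddie, "publish", "design"))       # False
print(can(samira, "publish", "design"))      # True
```

the point is that eddie's publisher role lives on the workspace entry, so bumping it never touches what he can do anywhere else.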
i'm probably biased here since i work at tessl, but this isn't something we put together internally, someone else worked through the same setup and documented it and it maps pretty closely to what our team kept running into: https://tessl.io/blog/tessl-admin-guide-organizations-workspaces-and-roles/
went through simon maple’s eval again and honestly the interesting part is not who wins, it’s how close everything is once you add skills.
baseline (no skills) still shows differences, sure. among the OpenAI models, gpt-5.5 is ahead there. but the moment you give models structure and context, things compress a lot.
and then cost starts to matter way more than raw capability.
here’s the cleaner view:
| Model | Baseline (no skill) | With skill | Cost/run | Time |
|---|---|---|---|---|
| claude-opus-4-7 | 80.8 | 93.4 | $1.00 | 158.9s |
| cursor:composer-2 | 74.3 | 89.6 | $0.23 | 152.0s |
| gpt-5.5 | 75.6 | 89.4 | $0.49 | 89.5s |
| gpt-5.4 | 74.1 | 89.3 | $0.30 | 135.4s |
| gpt-5.3-codex | 65.5 | 83.9 | $0.44 | 87.9s |
| gpt-5-codex | 68.7 | 78.7 | $1.05 | 136.2s |
few things that stood out to me:
- biggest gap is in baseline, not real usage
- 5.5 leads the OpenAI models raw, but disappears into the pack with skills
- 5.4 almost same output for way cheaper
- cursor is kind of wild on cost efficiency
- opus still king on absolute score, but expensive
and then the weird one again: 5.3
lower baseline, lower final score, still costs more than 5.4
that one just doesn’t make sense from any angle
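quick way to check the 5.3 vs 5.4 thing against the table: a tiny sketch that lists every model that is no better than a reference model on baseline, no better with skills, and no cheaper. the numbers are copied from the table above; `beaten_by` is a made-up helper, and it ignores the Time column on purpose.

```python
# (baseline, with_skill, cost_per_run_usd) copied from the table.
RESULTS = {
    "claude-opus-4-7": (80.8, 93.4, 1.00),
    "cursor:composer-2": (74.3, 89.6, 0.23),
    "gpt-5.5": (75.6, 89.4, 0.49),
    "gpt-5.4": (74.1, 89.3, 0.30),
    "gpt-5.3-codex": (65.5, 83.9, 0.44),
    "gpt-5-codex": (68.7, 78.7, 1.05),
}


def beaten_by(reference):
    """Models with lower-or-equal scores on both columns and higher-or-equal cost."""
    ref_base, ref_skill, ref_cost = RESULTS[reference]
    return [
        name
        for name, (base, skill, cost) in RESULTS.items()
        if name != reference
        and base <= ref_base
        and skill <= ref_skill
        and cost >= ref_cost
    ]


print(beaten_by("gpt-5.4"))  # ['gpt-5.3-codex', 'gpt-5-codex']
```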
also quick note, i work at tessl. we focus on agent enablement, basically helping teams run evals like this and manage skills, context, and workflows around models. so yeah i might look at this stuff more than normal people.
but takeaway feels pretty simple now:
models are getting good enough that how you use them matters more than which one you pick
skills, context, constraints: that’s where the real gains are.
model choice is starting to look like a pricing and latency decision more than anything else.
read the full breakdown here: https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/
Simon Maple just dropped a pretty clean benchmark, and the result is kinda funny
gpt-5.5 is the strongest model out of the box, no doubt. but once you give models skills (which is how people actually use them), it basically performs the same as gpt-5.4
like almost identical. same tasks, same setup, same outputs.
the only real difference is you pay a lot more for 5.5 just to get things done a bit faster.
| Model | Task Scores (with skills) | Cost/run | Score per $ |
|---|---|---|---|
| gpt-5.5 | 89.4 | $0.49 | 182 |
| gpt-5.4 | 89.3 | $0.30 | 298 |
| gpt-5.3 | 83.9 | $0.44 | 191 |
so yeah:
- 5.5 vs 5.4 is basically 0.1 difference in score
- but costs 63% more
- only real win is speed
and the weird one, 5.3, is just a bad deal. costs more than 5.4 and still performs worse.
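for anyone who wants to redo the math, here's a minimal sketch of the score-per-dollar column and the 63% figure. scores and costs are copied from the table above; the function names are just illustrative.

```python
# With-skill score and cost per run (USD), copied from the table.
MODELS = {
    "gpt-5.5": (89.4, 0.49),
    "gpt-5.4": (89.3, 0.30),
    "gpt-5.3": (83.9, 0.44),
}


def score_per_dollar(model):
    """Score points per dollar of run cost, rounded to a whole number."""
    score, cost = MODELS[model]
    return round(score / cost)


def cost_premium_pct(model, reference):
    """How much more `model` costs per run than `reference`, in whole percent."""
    return round((MODELS[model][1] / MODELS[reference][1] - 1) * 100)


for name in MODELS:
    print(name, score_per_dollar(name))
print(cost_premium_pct("gpt-5.5", "gpt-5.4"))  # 63
```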
also quick disclosure: i work at tessl, which is an agent enablement platform focused on helping teams manage, evaluate, and improve the skills and context that AI agents rely on in real workflows
feels like we are hitting a point where picking a model is less about "which is smartest" and more about "what are you optimizing for, cost or latency".
In a new benchmark by Simon Maple, 1,742 tests across 45 scenarios and 11 real engineering skills show something surprising: GPT-5.5 is the most capable model out of the box, but once you add skills, it’s basically tied with GPT-5.4 (89.4 vs 89.3).
Now look at the cost. $0.49 per run vs $0.30. That’s a 63% premium for a 0.1 point gain.
The only clear win for GPT-5.5 is speed: ~89s vs ~135s. If latency matters, it’s defensible. If cost or value matters, it’s hard to justify.
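One way to frame the latency question is to ask what each saved second costs. The sketch below uses only the per-run figures quoted above (with the rounded ~89s and ~135s timings); the variable names are illustrative, and whether the resulting price per second is worth it depends entirely on your workload.

```python
# Figures quoted above: cost per run (USD) and approximate seconds per run.
GPT_55 = {"cost": 0.49, "seconds": 89}
GPT_54 = {"cost": 0.30, "seconds": 135}

extra_cost = GPT_55["cost"] - GPT_54["cost"]
seconds_saved = GPT_54["seconds"] - GPT_55["seconds"]
cost_per_second_saved = extra_cost / seconds_saved

print(f"extra cost per run: ${extra_cost:.2f}")
print(f"seconds saved per run: {seconds_saved}")
print(f"cost per second saved: ${cost_per_second_saved:.4f}")
```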
The more interesting story is GPT-5.3. It scores worse (83.9), costs more than GPT-5.4, and burns tokens inefficiently. You’re literally paying more for less.
Check the full comparison here: https://tessl.io/blog/gpt-55-is-openais-best-model-but-paying-more-for-it-makes-no-sense/ (disclosing that i work for this org)