






Round 2: Token usage across GPT-5.4, GPT-5.5, GPT-5.4 Mini, and GPT-5.3-Codex in Codex at all 4 reasoning modes (Low, Medium, High, and XHigh), using the exact same prompt and the same baseline project
Okay, round two of the comparison we did a few days ago (previous post: GPT 5.4 vs 5.5 Token & Cost Comparison Across Effort Levels).
To summarize the previous round: we compared GPT-5.4 and GPT-5.5 in Codex across different effort levels (low, medium, high, xhigh) using the same repo (Express JS) and the same summarization prompt. The results were pretty stark in terms of turns, token usage, and total cost.
But a lot of the feedback we got from commenters was completely fair:
“How do you actually know which result is better?”
“What about GPT 5.3 Codex or GPT 5.4 Mini which I use quite often?”
Summaries are extremely subjective. We tried evaluating them with other LLMs, but quickly realized there’s no clean benchmark for “summary quality”. One model might be more concise, another more detailed, another more architectural, and all of them could technically be “correct”.
So for round two, we wanted something much more objective.
Instead of summarization, we created a small React note-taking application, "Lumen", as our baseline project repo and asked the models to implement a real feature.
The task included:
- building an outline panel
- keyboard shortcuts
- feature integration into the existing app
- preserving existing behavior/spec constraints
This time we expanded the benchmark set beyond just GPT-5.4 and GPT-5.5.
We tested the following:
- GPT-5.5
- GPT-5.4
- GPT-5.3-codex
- GPT-5.4-mini
…each across all four effort levels (we created 16 worktrees of the test project and ran Codex CLI in parallel in all of them with the same prompt). Nearly all delivered the expected output; there were minor differences in style, but they worked. Beyond comparing token and cost efficiency, we also wanted to evaluate code quality, so we used Claude Opus 4.7 to review the diffs that each run generated.
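For anyone who wants to reproduce the setup, here is a rough sketch of how a 4 x 4 matrix like this could be laid out as git worktrees. It only builds the `git worktree add` commands, one per model/effort pair; it does not launch Codex. The model ID strings, the `bench/` branch prefix, and the directory names are hypothetical placeholders, not the exact ones we used.

```python
from itertools import product
from pathlib import Path

# Hypothetical identifiers: adjust to whatever your CLI actually expects.
MODELS = ["gpt-5.5", "gpt-5.4", "gpt-5.3-codex", "gpt-5.4-mini"]
EFFORTS = ["low", "medium", "high", "xhigh"]


def worktree_commands(root, base_ref="HEAD"):
    """Build one `git worktree add` command per (model, effort) pair.

    Each run gets its own branch and directory, so parallel agents
    never touch each other's files.
    """
    commands = []
    for model, effort in product(MODELS, EFFORTS):
        name = f"{model}-{effort}"
        path = (Path(root) / name).as_posix()
        commands.append(
            ["git", "worktree", "add", "-b", f"bench/{name}", path, base_ref]
        )
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for cmd in worktree_commands("../lumen-bench"):
        print(" ".join(cmd))
```

To actually create the worktrees, each command list could be passed to `subprocess.run(cmd, check=True)` from inside the baseline repo, and the agent then started in each directory with the same prompt.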
Key Findings:
GPT 5.4 mini struggled the most: only the low-effort run actually produced a working solution, while the other runs hit the following issues and didn't finish:
- custom build pipelines (changed esbuild, deleted & rebuilt config files, or missed the CSS file import)
- hardcoded ports (fixed ports in the pipeline)
- broken scripts
- shell loops (couldn't figure out the PowerShell syntax to start a background server & kept looping)
- spec drift (created directories it didn't need to)
Meanwhile:
- GPT-5.4 was the most consistently reliable overall
- GPT-5.5 had some very strong top-end implementations, but with significantly higher token/cost scaling
- GPT-5.3-codex was surprisingly competitive and quite consistent relative to cost
One interesting outcome:
GPT-5.4 xhigh narrowly took the top score overall, with GPT-5.5 xhigh right behind it.
If you ignore GPT-5.4-mini, GPT-5.4 comes out on top overall, and once you take cost into consideration, it becomes even more attractive.
| Rank | Model | Input Tokens | Output Tokens | Cache Read | Turns | Cost | Runtime |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 XHigh | 1.9M | 27.4k | 1.7M | 160 | $5.89 | 10m 30s |
| 2 | GPT-5.5 XHigh | 3.8M | 27.2k | 3.6M | 222 | $22.16 | 12m 48s |
| 3 | GPT-5.3-codex XHigh | 2.1M | 21.9k | 2.0M | 189 | $4.57 | 9m 36s |
| 4 | GPT-5.5 High | 2.7M | 14.5k | 2.5M | 156 | $15.30 | 9m 06s |
| 5 | GPT-5.3-codex Medium | 1.7M | 10.1k | 1.6M | 127 | $3.43 | 5m 06s |
| 6 | GPT-5.4 Low | 410.6k | 7.2k | 338.3k | 72 | $1.23 | 3m 06s |
| 7 | GPT-5.4 High | 1.5M | 18.0k | 1.2M | 134 | $4.43 | 7m 06s |
| 8 | GPT-5.3-codex Low | 553.6k | 4.9k | 510.7k | 58 | $1.13 | 2m 36s |
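
If you want to slice the table yourself, the sketch below derives a few ratios from it: cost relative to another run, cost per turn, and the share of input tokens served from cache. The rows are copied from the table as rounded, so the results are approximate. It also assumes that "Cache Read" is a subset of "Input Tokens", which matches the numbers but is our reading of the columns rather than something the CLI output states.

```python
# Rows copied from the table above; token counts are in thousands.
# Fields: (run, input_k, output_k, cache_read_k, turns, cost_usd)
ROWS = [
    ("GPT-5.4 XHigh", 1900.0, 27.4, 1700.0, 160, 5.89),
    ("GPT-5.5 XHigh", 3800.0, 27.2, 3600.0, 222, 22.16),
    ("GPT-5.3-codex XHigh", 2100.0, 21.9, 2000.0, 189, 4.57),
    ("GPT-5.5 High", 2700.0, 14.5, 2500.0, 156, 15.30),
    ("GPT-5.3-codex Medium", 1700.0, 10.1, 1600.0, 127, 3.43),
    ("GPT-5.4 Low", 410.6, 7.2, 338.3, 72, 1.23),
    ("GPT-5.4 High", 1500.0, 18.0, 1200.0, 134, 4.43),
    ("GPT-5.3-codex Low", 553.6, 4.9, 510.7, 58, 1.13),
]

BY_RUN = {row[0]: row for row in ROWS}


def cost_ratio(run_a, run_b):
    """How many times more run_a cost than run_b."""
    return BY_RUN[run_a][5] / BY_RUN[run_b][5]


def cost_per_turn(run):
    """Average dollars spent per agent turn."""
    _, _, _, _, turns, cost = BY_RUN[run]
    return cost / turns


def cache_share(run):
    """Fraction of input tokens read from cache (assumes cache is a subset of input)."""
    _, input_k, _, cache_k, _, _ = BY_RUN[run]
    return cache_k / input_k


if __name__ == "__main__":
    for run in BY_RUN:
        print(f"{run:22s} ${cost_per_turn(run):.3f}/turn  cache {cache_share(run):.0%}")
```

For example, `cost_ratio("GPT-5.5 XHigh", "GPT-5.4 XHigh")` shows how much more the second-ranked run cost than the top-ranked one on the same task.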
This feels a lot closer to real-world coding evaluation than summarization benchmarks, even though the project itself was intentionally small.