u/Deep-Palpitation8315

[Images 1 to 6: Round 3 infographics]
▲ 31 r/Claudeopus+2 crossposts

Round 3: Claude Opus 4.7 1M vs Opus 4.7 vs Opus 4.6 Legacy vs Sonnet 4.6 across effort levels on the same real React feature-build task

Okay, time for Round 3.

A few days ago I posted Round 2, where we benchmarked GPT-5.4, GPT-5.5, GPT-5.3-codex, and GPT-5.4-mini across different effort levels using Codex on the same React project and the same feature-building prompt. Here's the note-taking app used as our test project.

That round came after a lot of useful feedback on the first experiment.

The first version was based on summarizing a Git repo (Express JS), but the problem with summarization is that it is very subjective (all summaries are essentially "correct" depending on preference/perspective).

So in Round 2 we moved to something more practical: give every model the same real coding task inside the same repository, then compare the actual feature implementation.

For Round 3, I repeated the same style of experiment, but this time with Anthropic models via Claude Code.

The models / model families tested were:

  • Claude Opus 4.7 1M
  • Claude Opus 4.7
  • Claude Sonnet 4.6
  • Claude Opus 4.6 Legacy
  • Claude Haiku 4.5 (skipped in final results as it DNF'ed)

The task was the same type of real feature-build benchmark as before: implement an outline panel inside a small React note-taking app, while preserving the existing app behavior and respecting the prompt constraints.

We ran each model-effort combination in its own separate Git worktree using Claude Code's programmatic access. Every run got the same repo and the same prompt. After the runs completed, we first checked whether the implementation actually worked in the UI. Only then did we apply our scoring system: 4 models graded each run against the same set of code quality attributes, and we averaged their scores.
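For anyone who wants to reproduce the setup, here is a rough sketch of how the one-worktree-per-run layout could be planned. The branch names, paths, and slugs are made up for illustration (the post does not list them), and the agent invocation itself is left out on purpose. The script only prints the `git worktree add` commands as a dry run.

```python
from itertools import product

# Hypothetical slugs: the post does not give branch names or paths,
# so everything below is illustrative.
MODELS = ["opus-4-7-1m", "opus-4-7"]
EFFORTS = ["low", "medium", "high", "xhigh", "max"]


def plan_worktrees(models, efforts, root="../worktrees"):
    """Return one (run_name, git_command) pair per model-effort combination."""
    plan = []
    for model, effort in product(models, efforts):
        run_name = f"{model}-{effort}"
        # `git worktree add -b <branch> <path>` creates a new branch and checks
        # it out in its own directory, so runs cannot touch each other's files.
        command = [
            "git", "worktree", "add",
            "-b", f"bench/{run_name}",
            f"{root}/{run_name}",
        ]
        plan.append((run_name, command))
    return plan


PLAN = plan_worktrees(MODELS, EFFORTS)
for run_name, command in PLAN:
    # Dry run: print instead of executing. Each command could be handed to
    # subprocess.run(command, check=True) from inside the base repository.
    print(run_name, "->", " ".join(command))
```

Keeping the plan as plain data makes it easy to eyeball before anything touches the repo, and the same list can later drive whatever launches the agent in each directory.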

For the successful / usable runs, we evaluated code quality across the following dimensions:

  • spec adherence
  • feature correctness
  • project pattern fit
  • React / TypeScript quality
  • performance
  • general code quality / maintainability
  • constraint adherence
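A minimal sketch of how a per-run score out of 35 could be assembled from these dimensions. It assumes each dimension is graded 0 to 5 (7 x 5 = 35), which the post implies but does not state outright. The grader names and grades are hypothetical, not taken from the benchmark.

```python
from statistics import mean

DIMENSIONS = [
    "spec adherence",
    "feature correctness",
    "project pattern fit",
    "React / TypeScript quality",
    "performance",
    "general code quality / maintainability",
    "constraint adherence",
]
MAX_PER_DIMENSION = 5  # assumption: 7 dimensions x 5 points = 35


def run_score(grades_by_grader):
    """Sum each grader's dimension scores, then average the totals."""
    totals = []
    for grader, grades in grades_by_grader.items():
        missing = set(DIMENSIONS) - set(grades)
        if missing:
            raise ValueError(f"{grader} did not score: {sorted(missing)}")
        if any(not 0 <= grades[d] <= MAX_PER_DIMENSION for d in DIMENSIONS):
            raise ValueError(f"{grader} gave an out-of-range score")
        totals.append(sum(grades[d] for d in DIMENSIONS))
    return mean(totals)


# Hypothetical grades for one run from four graders.
EXAMPLE = {
    "grader_a": dict(zip(DIMENSIONS, [4, 4, 4, 4, 3, 4, 3])),
    "grader_b": dict(zip(DIMENSIONS, [4, 3, 4, 4, 3, 4, 3])),
    "grader_c": dict(zip(DIMENSIONS, [4, 4, 4, 4, 4, 4, 3])),
    "grader_d": dict(zip(DIMENSIONS, [3, 4, 4, 4, 3, 4, 3])),
}
print(f"{run_score(EXAMPLE):.2f} / {len(DIMENSIONS) * MAX_PER_DIMENSION}")
```

With four graders giving whole-number totals, the average always lands on a multiple of 0.25, which matches the shape of the scores reported in the results.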

The final code quality ranking was then used as the ordering for the token usage and turns/cost comparison tables (below and also available in the attached infographic):

| Rank | Run | Model | Effort | Quality Score | Input Tokens | Output Tokens | Turns | Runtime | Cost |
|---|---|---|---|---|---|---|---|---|---|
| 1 | exp-22 | Opus 4.7 1M | XHigh | 32.50 / 35 | 3.8M | 31.4K | 67 | 8.3m | $4.21 |
| 2 | exp-27 | Opus 4.7 | XHigh | 31.75 / 35 | 4.0M | 23.4K | 141 | 8.7m | $4.06 |
| 3 | exp-20 | Opus 4.7 1M | Medium | 30.50 / 35 | 2.0M | 15.2K | 47 | 4.2m | $2.05 |
| 4 | exp-26 | Opus 4.7 | High | 30.00 / 35 | 2.0M | 15.4K | 86 | 5.5m | $2.57 |
| 5= | exp-21 | Opus 4.7 1M | High | 29.50 / 35 | 2.9M | 20.0K | 60 | 5.2m | $3.23 |
| 5= | exp-23 | Opus 4.7 1M | Max | 29.50 / 35 | 4.2M | 36.4K | 70 | 8.6m | $4.68 |
| 7 | exp-28 | Opus 4.7 | Max | 29.25 / 35 | 4.7M | 33.9K | 149 | 11.2m | $4.81 |
| 8 | exp-24 | Opus 4.7 | Low | 28.75 / 35 | 2.0M | 12.5K | 92 | 5.1m | $2.50 |
| 9 | exp-19 | Opus 4.7 1M | Low | 28.00 / 35 | 1.9M | 12.0K | 45 | 3.6m | $2.21 |
| 10 | exp-25 | Opus 4.7 | Medium | 27.75 / 35 | 2.4M | 15.7K | 103 | 6.2m | $2.77 |

No surprises: Opus 4.7 1M xhigh came out on top overall, with an averaged quality score of 32.50 / 35.

A few interesting findings:

Opus 4.7 xhigh was very close behind at 31.75 / 35.

The best “efficient quality” run was arguably Opus 4.7 1M medium, which scored 30.50 / 35 while using much less time and cost than the xhigh / max runs.
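To put a number on "efficient quality", here is a quick sketch that divides each run's score by its cost and by its runtime, using the figures from the results table. The metric names (points per dollar, points per minute) are just one way to slice it, not an official part of the scoring.

```python
# (run, model, effort, quality, runtime_minutes, cost_usd) from the results table.
RUNS = [
    ("exp-22", "Opus 4.7 1M", "XHigh", 32.50, 8.3, 4.21),
    ("exp-27", "Opus 4.7", "XHigh", 31.75, 8.7, 4.06),
    ("exp-20", "Opus 4.7 1M", "Medium", 30.50, 4.2, 2.05),
    ("exp-26", "Opus 4.7", "High", 30.00, 5.5, 2.57),
    ("exp-21", "Opus 4.7 1M", "High", 29.50, 5.2, 3.23),
    ("exp-23", "Opus 4.7 1M", "Max", 29.50, 8.6, 4.68),
    ("exp-28", "Opus 4.7", "Max", 29.25, 11.2, 4.81),
    ("exp-24", "Opus 4.7", "Low", 28.75, 5.1, 2.50),
    ("exp-19", "Opus 4.7 1M", "Low", 28.00, 3.6, 2.21),
    ("exp-25", "Opus 4.7", "Medium", 27.75, 6.2, 2.77),
]


def efficiency(run):
    """Return (quality points per dollar, quality points per minute)."""
    _, _, _, quality, minutes, cost = run
    return quality / cost, quality / minutes


BY_DOLLAR = sorted(RUNS, key=lambda r: efficiency(r)[0], reverse=True)
BY_MINUTE = sorted(RUNS, key=lambda r: efficiency(r)[1], reverse=True)

for run in BY_DOLLAR:
    per_dollar, per_minute = efficiency(run)
    print(f"{run[0]} {run[1]} {run[2]}: {per_dollar:.2f} pts/$, {per_minute:.2f} pts/min")
```

On these numbers exp-20 (1M medium) leads on points per dollar, while exp-19 (1M low) edges it out on points per minute, so which run counts as "best efficient" depends on which resource you care about more.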

At the model-family level, Opus 4.7 1M narrowly led regular Opus 4.7, but the difference was small:

  • Opus 4.7 1M average: 30.00 / 35
  • Opus 4.7 average: 29.50 / 35
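The family averages can be re-derived from the per-run scores in the results table. A small sketch, with the scores simply regrouped by family and effort:

```python
from statistics import mean

# Quality scores (out of 35) from the results table, grouped by family.
SCORES = {
    "Opus 4.7 1M": {"Low": 28.00, "Medium": 30.50, "High": 29.50, "XHigh": 32.50, "Max": 29.50},
    "Opus 4.7": {"Low": 28.75, "Medium": 27.75, "High": 30.00, "XHigh": 31.75, "Max": 29.25},
}

FAMILY_AVERAGE = {
    family: mean(by_effort.values()) for family, by_effort in SCORES.items()
}

for family, average in FAMILY_AVERAGE.items():
    print(f"{family} average: {average:.2f} / 35")
```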

So in this specific test, the two Opus 4.7 families were very close. The 1M version had the top individual run, but regular Opus 4.7 was also highly competitive.

Sonnet 4.6 and Opus 4.6 Legacy trailed the 4.7 models pretty clearly on code quality in this particular benchmark.

One thing that stood out: higher effort was not always better.

The xhigh runs were strongest, but max did not consistently improve quality. In a couple of cases, max increased cost and token usage without producing a better implementation.
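A quick check of the max vs xhigh claim against the results table. This sketch only subtracts the xhigh row from the max row for each Opus 4.7 family, so a negative quality delta next to positive token and cost deltas means max paid more for less.

```python
# (quality, input_tokens_millions, cost_usd) for the XHigh and Max runs.
XHIGH_VS_MAX = {
    "Opus 4.7 1M": {"XHigh": (32.50, 3.8, 4.21), "Max": (29.50, 4.2, 4.68)},
    "Opus 4.7": {"XHigh": (31.75, 4.0, 4.06), "Max": (29.25, 4.7, 4.81)},
}


def max_minus_xhigh(family):
    """Return (quality delta, input-token delta in millions, cost delta in USD)."""
    xhigh = XHIGH_VS_MAX[family]["XHigh"]
    maximum = XHIGH_VS_MAX[family]["Max"]
    return tuple(round(m - x, 2) for x, m in zip(xhigh, maximum))


for family in XHIGH_VS_MAX:
    quality, tokens, cost = max_minus_xhigh(family)
    print(f"{family}: quality {quality:+.2f}, input tokens {tokens:+.1f}M, cost {cost:+.2f} USD")
```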

We were sure that Opus 4.6 (which we love) was better than 4.7, at least around the time 4.7 was released, but the results say otherwise (I have conspiracy theories).

Caveat: This is still an N=1/single-task benchmark. The results should be treated as directional rather than definitive.

The next thing we plan to do is compare the Anthropic results against the OpenAI Round 2 results more directly, especially quality-per-dollar and quality-per-minute. I kept this post focused only on the Anthropic side so the comparison does not get too messy.

u/Deep-Palpitation8315 — 8 hours ago
▲ 154 r/OpenAIDev+1 crossposts

Round 2: Token usage between GPT-5.4, GPT-5.5, GPT-5.4 Mini, GPT-5.3-Codex in Codex across all 4 reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline

Okay, round two of the comparison we did a few days ago (Previous GPT 5.4 vs 5.5 Token & Cost Comparison Across Effort Levels).

To summarize the previous review, a few days ago, we ran a comparison between GPT-5.4 and GPT-5.5 in Codex across different effort levels (low, medium, high, xhigh) using the same repo (Express JS) and the same summarization prompt. The results were pretty stark in terms of turns, token usage, and total cost.

But a lot of the feedback we got from commenters was completely fair:

“How do you actually know which result is better?”

“What about GPT 5.3 Codex or GPT 5.4 Mini which I use quite often?”

Summaries are extremely subjective. We tried evaluating them with other LLMs, but quickly realized there’s no clean benchmark for “summary quality”. One model might be more concise, another more detailed, another more architectural, and all of them could technically be “correct”.

So for round two, we wanted something much more objective.

Instead of summarization, we created a small React note-taking application, "Lumen", as our baseline project repo and asked the models to implement a real feature build.

The task included:

  1. building an outline panel
  2. keyboard shortcuts
  3. feature integration into the existing app
  4. preserving existing behavior/spec constraints

This time we expanded the benchmark set beyond just GPT-5.4 and GPT-5.5.

We tested the following:

  1. GPT-5.5
  2. GPT-5.4
  3. GPT-5.3-codex
  4. GPT-5.4-mini

…each across multiple effort levels (we created 16 worktrees for the test project and ran Codex CLI in parallel in all of them with the same prompt). Nearly all delivered the expected output; there were minor differences in style, but they worked. Next, we also had to evaluate code quality (going beyond just comparing token and cost efficiency). We used Claude Opus 4.7 to evaluate the quality of the generated diffs.

Key Findings:

GPT-5.4-mini struggled the most: only the low-effort run actually produced a working solution, while the others ran into the following issues and didn't finish:

  • custom build pipelines (changed esbuild, deleted & rebuilt config files or missed css file import)
  • hardcoded ports (fixed ports in the pipeline)
  • broken scripts
  • shell loops (couldn't figure out powershell/syntax to start background server & kept looping)
  • spec drift (created directories it didn't need or have to)

Meanwhile:

  • GPT-5.4 was the most consistently reliable overall
  • GPT-5.5 had some very strong top-end implementations, but with significantly higher token/cost scaling
  • GPT-5.3-codex was surprisingly competitive and quite consistent relative to cost

One interesting outcome:
GPT-5.4 xhigh narrowly took the top score overall, with GPT-5.5 xhigh right behind it.

If you ignore GPT-5.4-mini, GPT-5.4 comes out on top overall, and taking cost into consideration makes it even more attractive.

| Rank | Model | Input Tokens | Output Tokens | Cache Read | Turns | Cost | Runtime |
|---|---|---|---|---|---|---|---|
| 1 | GPT-5.4 XHigh | 1.9M | 27.4k | 1.7M | 160 | $5.89 | 10m 30s |
| 2 | GPT-5.5 XHigh | 3.8M | 27.2k | 3.6M | 222 | $22.16 | 12m 48s |
| 3 | GPT-5.3-codex XHigh | 2.1M | 21.9k | 2.0M | 189 | $4.57 | 9m 36s |
| 4 | GPT-5.5 High | 2.7M | 14.5k | 2.5M | 156 | $15.30 | 9m 06s |
| 5 | GPT-5.3-codex Medium | 1.7M | 10.1k | 1.6M | 127 | $3.43 | 5m 06s |
| 6 | GPT-5.4 Low | 410.6k | 7.2k | 338.3k | 72 | $1.23 | 3m 06s |
| 7 | GPT-5.4 High | 1.5M | 18.0k | 1.2M | 134 | $4.43 | 7m 06s |
| 8 | GPT-5.3-codex Low | 553.6k | 4.9k | 510.7k | 58 | $1.13 | 2m 36s |
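Since the table mixes "k" and "M" suffixes, here is a small sketch for turning those figures into plain integers and comparing the three XHigh runs. One assumption to flag loudly: it treats "Cache Read" as a subset of "Input Tokens", which the post does not confirm, so read the cached share as a rough indicator only.

```python
from decimal import Decimal

SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_tokens(text):
    """Convert a count such as '410.6k' or '1.9M' into an integer."""
    multiplier = SUFFIXES.get(text[-1])
    if multiplier is None:
        return int(text)
    # Decimal avoids float rounding surprises such as 1.9 * 1_000_000.
    return int(Decimal(text[:-1]) * multiplier)


# (model, input tokens, cache read, cost in USD) for the three XHigh runs.
XHIGH = [
    ("GPT-5.4 XHigh", "1.9M", "1.7M", 5.89),
    ("GPT-5.5 XHigh", "3.8M", "3.6M", 22.16),
    ("GPT-5.3-codex XHigh", "2.1M", "2.0M", 4.57),
]
BASELINE_COST = XHIGH[0][3]

for model, input_tokens, cache_read, cost in XHIGH:
    # Assumption: cache reads are counted inside the input-token total.
    cached_share = parse_tokens(cache_read) / parse_tokens(input_tokens)
    multiple = cost / BASELINE_COST
    print(f"{model}: {cached_share:.0%} cached, {multiple:.2f}x the GPT-5.4 XHigh cost")
```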

This feels a lot closer to real-world coding evaluation than summarization benchmarks, even though the project itself was intentionally small.

u/Deep-Palpitation8315 — 2 days ago
▲ 159 r/codex+1 crossposts

Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline

Takeaways:

  • GPT-5.5 scales token usage, turns, and cost much more aggressively at higher reasoning modes.
  • GPT-5.4 remains relatively cost-efficient, even in XHigh.
  • GPT-5.5 appears to spend significantly more tokens on iterative reasoning and context revisiting - basically seems to read more.
  • GPT-5.4 feels more compressed (basically read less in our example - fewer docs) and execution-oriented by comparison.

One of the more interesting deltas:

GPT-5.5 XHigh
→ 456.6k input tokens
→ 40 turns
→ $2.58

GPT-5.4 XHigh
→ 296.6k input tokens
→ 30 turns
→ $0.84
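Expressed as ratios, the gap is easier to see. A tiny sketch using only the six numbers above:

```python
RUNS = {
    "GPT-5.5 XHigh": {"input_tokens": 456_600, "turns": 40, "cost_usd": 2.58},
    "GPT-5.4 XHigh": {"input_tokens": 296_600, "turns": 30, "cost_usd": 0.84},
}

# How many times larger each GPT-5.5 figure is than the GPT-5.4 one.
RATIOS = {
    metric: RUNS["GPT-5.5 XHigh"][metric] / RUNS["GPT-5.4 XHigh"][metric]
    for metric in ("input_tokens", "turns", "cost_usd")
}

for metric, ratio in RATIOS.items():
    print(f"{metric}: {ratio:.2f}x")
```

The cost ratio (about 3.07x) comes out at roughly twice the input-token ratio (about 1.54x), so in this run the price gap is not explained by token volume alone.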

u/Deep-Palpitation8315 — 9 days ago

Built SuperBased Observer - Claude Code, OpenAI Codex, Copilot Usage Metrics

Almost everyone building with AI today is using some kind of coding agent (Cursor, Claude Code, OpenAI Codex, OpenCode etc.) or a mix of them depending on the workflow.

The problem is that most of these agents still feel like black boxes.

You can see the output, but it’s hard to understand what’s actually happening underneath:

  • How many tokens are being used?
  • Which tools are being called?
  • Where is the agent spending time?
  • What’s actually driving the cost and behavior?

And every agent handles things differently under the hood.

As builders ourselves, we noticed that we kept switching between different coding agents depending on the task. We also think that’s going to continue — there probably won’t be one “winning” agent for every workflow.

So we built SuperBased Observer.

The idea was simple: give developers a clear view into what their coding agents are actually doing — both from a bird’s-eye view and at a granular level.

SuperBased Observer helps visualize:

  • token usage
  • tool calls
  • costs
  • model behavior
  • agent activity over time

Sharing a few screenshots below of what the experience looks like when you install and use it.

Would genuinely love feedback from other builders working with AI agents every day.

npm:

https://www.npmjs.com/package/@superbased/observer

github repo:

https://github.com/marmutapp/superbased-observer

List of Sessions Across Agents & the Usage Levels

Single Session Tool & Token Usage Drill-down

Usage Across Models

Cost Dashboard

u/Deep-Palpitation8315 — 10 days ago