





Round 3: Claude Opus 4.7 1M vs Opus 4.7 vs Opus 4.6 Legacy vs Sonnet 4.6 across effort levels on the same real React feature-build task
Okay, time for Round 3.
A few days ago I posted Round 2, where we benchmarked GPT-5.4, GPT-5.5, GPT-5.3-codex, and GPT-5.4-mini across different effort levels using Codex on the same React project and the same feature-building prompt. Here's the note-taking app used as our test project.
That round came after a lot of useful feedback on the first experiment.
The first version asked models to summarize a Git repo (Express JS), but summaries turned out to be very subjective to grade (all summaries are essentially "correct" depending on preference/perspective).
So in Round 2 we moved to something more practical: give every model the same real coding task inside the same repository, then compare the actual feature implementation.
For Round 3, I repeated the same style of experiment, but this time with Anthropic models via Claude Code.
The models / model families tested were:
- Claude Opus 4.7 1M
- Claude Opus 4.7
- Claude Sonnet 4.6
- Claude Opus 4.6 Legacy
- Claude Haiku 4.5 (skipped in final results as it DNF'ed)
The task was the same type of real feature-build benchmark as before: implement an outline panel inside a small React note-taking app, while preserving the existing app behavior and respecting the prompt constraints.
We ran each model-effort combination in its own separate Git worktree using Claude Code programmatic access. Every run got the same repo and the same prompt. After the runs completed, we first checked whether the implementation actually worked in the UI. Only then did we apply our scoring system: four models graded each run against the same set of code quality attributes, and we averaged their scores.
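For anyone who wants to reproduce the isolation step, here is a minimal sketch of one worktree per model-effort combination. The model slugs, the `runs/` directory, the `bench/` branch prefix, the `main` base ref, and the `worktree_commands` helper are all hypothetical. The Claude Code invocation itself is left out because this post does not show it.

```python
from itertools import product

# Hypothetical slugs for two of the model families and the effort levels tested.
MODELS = ["opus-4.7-1m", "opus-4.7"]
EFFORTS = ["low", "medium", "high", "xhigh", "max"]


def worktree_commands(models, efforts, base_ref="main", root="runs"):
    """Return one `git worktree add` command per model-effort pair.

    Each command creates a new branch from `base_ref` in its own directory,
    so every run starts from the same repository state.
    """
    commands = []
    for model, effort in product(models, efforts):
        name = f"{model}-{effort}"
        commands.append(
            ["git", "worktree", "add", "-b", f"bench/{name}", f"{root}/{name}", base_ref]
        )
    return commands


if __name__ == "__main__":
    for cmd in worktree_commands(MODELS, EFFORTS):
        # Printed only; pass each list to subprocess.run(cmd, check=True) to execute.
        print(" ".join(cmd))
```

Building the commands as lists rather than shell strings keeps them safe to hand to `subprocess.run` without quoting issues.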
For the successful / usable runs, we evaluated code quality across the following dimensions:
- spec adherence
- feature correctness
- project pattern fit
- React / TypeScript quality
- performance
- general code quality / maintainability
- constraint adherence
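One way the averaging could work, as a rough sketch: assume each of the seven dimensions is scored from 0 to 5 (which would explain the 35-point maximum) and the final score is the mean of the four graders' totals. The 0 to 5 scale, the dimension keys, and the sample numbers below are assumptions for illustration, not the actual rubric or data.

```python
from statistics import mean

DIMENSIONS = [
    "spec_adherence",
    "feature_correctness",
    "project_pattern_fit",
    "react_typescript_quality",
    "performance",
    "maintainability",
    "constraint_adherence",
]
MAX_PER_DIMENSION = 5  # assumption: 7 dimensions x 5 points = 35


def grader_total(scores):
    """Sum one grader's per-dimension scores after validating them."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 0 <= scores[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"score out of range for {dim}: {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)


def quality_score(grader_scores):
    """Average the totals across all graders for a single run."""
    return mean([grader_total(s) for s in grader_scores])


# Made-up scores from four hypothetical graders for one run.
perfect = dict.fromkeys(DIMENSIONS, 5)
sample_run = [
    {**perfect, "performance": 2},      # total 32
    {**perfect, "performance": 1},      # total 31
    {**perfect, "performance": 0},      # total 30
    {**perfect, "maintainability": 1},  # total 31
]
print(f"{quality_score(sample_run):.2f} / 35")  # prints "31.00 / 35"
```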
The final code quality ranking was then used as the ordering for the token usage and turns/cost comparison tables (below and also available in the attached infographic):
| Rank | Run | Model | Effort | Quality Score | Input Tokens | Output Tokens | Turns | Runtime | Cost |
|---|---|---|---|---|---|---|---|---|---|
| 1 | exp-22 | Opus 4.7 1M | XHigh | 32.50 / 35 | 3.8M | 31.4K | 67 | 8.3m | $4.21 |
| 2 | exp-27 | Opus 4.7 | XHigh | 31.75 / 35 | 4.0M | 23.4K | 141 | 8.7m | $4.06 |
| 3 | exp-20 | Opus 4.7 1M | Medium | 30.50 / 35 | 2.0M | 15.2K | 47 | 4.2m | $2.05 |
| 4 | exp-26 | Opus 4.7 | High | 30.00 / 35 | 2.0M | 15.4K | 86 | 5.5m | $2.57 |
| 5= | exp-21 | Opus 4.7 1M | High | 29.50 / 35 | 2.9M | 20.0K | 60 | 5.2m | $3.23 |
| 5= | exp-23 | Opus 4.7 1M | Max | 29.50 / 35 | 4.2M | 36.4K | 70 | 8.6m | $4.68 |
| 7 | exp-28 | Opus 4.7 | Max | 29.25 / 35 | 4.7M | 33.9K | 149 | 11.2m | $4.81 |
| 8 | exp-24 | Opus 4.7 | Low | 28.75 / 35 | 2.0M | 12.5K | 92 | 5.1m | $2.50 |
| 9 | exp-19 | Opus 4.7 1M | Low | 28.00 / 35 | 1.9M | 12.0K | 45 | 3.6m | $2.21 |
| 10 | exp-25 | Opus 4.7 | Medium | 27.75 / 35 | 2.4M | 15.7K | 103 | 6.2m | $2.77 |
No surprises here: Opus 4.7 1M xhigh came out on top overall, with an averaged quality score of 32.50 / 35.
A few interesting findings:
Opus 4.7 xhigh was very close behind at 31.75 / 35.
The best “efficient quality” run was arguably Opus 4.7 1M medium, which scored 30.50 / 35 in 4.2 minutes for $2.05, roughly half the runtime and cost of the top xhigh run (8.3 minutes, $4.21).
At the model-family level, Opus 4.7 1M narrowly led regular Opus 4.7, but the difference was small:
- Opus 4.7 1M average: 30.00 / 35
- Opus 4.7 average: 29.50 / 35
So in this specific test, the two Opus 4.7 families were very close. The 1M version had the top individual run, but regular Opus 4.7 was also highly competitive.
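The family averages above can be re-derived from the results table. A small sketch, with the scores copied from the table and the `family_averages` helper name being my own:

```python
from collections import defaultdict
from statistics import mean

# (model, quality score out of 35) copied from the results table.
RUNS = [
    ("Opus 4.7 1M", 32.50),
    ("Opus 4.7", 31.75),
    ("Opus 4.7 1M", 30.50),
    ("Opus 4.7", 30.00),
    ("Opus 4.7 1M", 29.50),
    ("Opus 4.7 1M", 29.50),
    ("Opus 4.7", 29.25),
    ("Opus 4.7", 28.75),
    ("Opus 4.7 1M", 28.00),
    ("Opus 4.7", 27.75),
]


def family_averages(runs):
    """Group scores by model and return the mean score per model."""
    by_model = defaultdict(list)
    for model, score in runs:
        by_model[model].append(score)
    return {model: mean(scores) for model, scores in by_model.items()}


for model, avg in family_averages(RUNS).items():
    print(f"{model}: {avg:.2f} / 35")
# Opus 4.7 1M: 30.00 / 35
# Opus 4.7: 29.50 / 35
```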
Sonnet 4.6 and Opus 4.6 Legacy trailed the 4.7 models pretty clearly on code quality in this particular benchmark.
One thing that stood out: higher effort was not always better.
The xhigh runs were strongest, but max did not improve on them. For both Opus 4.7 variants, max scored below xhigh (29.50 vs 32.50 for 1M, 29.25 vs 31.75 for regular) while increasing cost and token usage.
We love Opus 4.6 and were sure it was better than 4.7 (at least around the time 4.7 was released), but the results say otherwise (I have conspiracy theories).
Caveat: This is still an N=1, single-task benchmark. The results should be treated as directional rather than definitive.
The next thing we plan to do is compare the Anthropic results against the OpenAI Round 2 results more directly, especially quality-per-dollar and quality-per-minute. I kept this post focused only on the Anthropic side so the comparison does not get too messy.
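As a preview of what quality-per-dollar and quality-per-minute look like on the Anthropic side alone, here is a minimal sketch. The numbers are copied from the results table. The `efficiency` helper and the simple score-divided-by-cost definition are my assumptions, not a settled methodology.

```python
# (run, model, effort, quality score, runtime in minutes, cost in USD),
# copied from the results table.
RUNS = [
    ("exp-22", "Opus 4.7 1M", "XHigh", 32.50, 8.3, 4.21),
    ("exp-27", "Opus 4.7", "XHigh", 31.75, 8.7, 4.06),
    ("exp-20", "Opus 4.7 1M", "Medium", 30.50, 4.2, 2.05),
    ("exp-26", "Opus 4.7", "High", 30.00, 5.5, 2.57),
    ("exp-21", "Opus 4.7 1M", "High", 29.50, 5.2, 3.23),
    ("exp-23", "Opus 4.7 1M", "Max", 29.50, 8.6, 4.68),
    ("exp-28", "Opus 4.7", "Max", 29.25, 11.2, 4.81),
    ("exp-24", "Opus 4.7", "Low", 28.75, 5.1, 2.50),
    ("exp-19", "Opus 4.7 1M", "Low", 28.00, 3.6, 2.21),
    ("exp-25", "Opus 4.7", "Medium", 27.75, 6.2, 2.77),
]


def efficiency(runs):
    """Attach quality-per-dollar and quality-per-minute to each run."""
    rows = []
    for run, model, effort, quality, minutes, cost in runs:
        rows.append(
            {
                "run": run,
                "model": model,
                "effort": effort,
                "per_dollar": quality / cost,
                "per_minute": quality / minutes,
            }
        )
    return rows


rows = efficiency(RUNS)
# List runs from most to least quality points per dollar.
for row in sorted(rows, key=lambda r: r["per_dollar"], reverse=True):
    print(
        f"{row['run']} {row['model']} {row['effort']}: "
        f"{row['per_dollar']:.2f} pts/$, {row['per_minute']:.2f} pts/min"
    )
```

On these numbers, Opus 4.7 1M medium (exp-20) has the highest quality per dollar and Opus 4.7 1M low (exp-19) the highest quality per minute, with the same single-task caveat as above.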