u/AkiDenim

Open Source vs frontier models on a single-file HTML canvas driving animation - results
▲ 6 r/LocalLLM+1 crossposts

Hey y'all, I was inspired by this post: https://www.reddit.com/r/LocalLLaMA/comments/1tf3p6c/local_qwen_36_vs_frontier_models_on_a_coding/

I know this isn't exactly local, but I wanted to share what I tested and the results each model delivered.

I ran the same single-file Canvas prompt across multiple models using my harness ( https://github.com/AidenGeunGeun/OpenCodeOrchestra ). The models were able to use whatever tools they had access to: some used auditor models and some did not. There are some clear winners, and some results are ambiguous.

https://preview.redd.it/2ehkh47vfo1h1.png?width=2972&format=png&auto=webp&s=d1e643f7b8bd0c3bab241838731848109359f1e1

The results are here: 

https://aidengeungeun.github.io/oco-canvas-car-scene-compare/

Setup:

  • Same prompt for every run
  • One isolated Orchestrator per model
  • Highest available thinking/effort setting for each model
  • Output target: one standalone HTML file, no libraries, no external assets
  • Task: realistic side-view car driving scene with parallax scenery, spinning wheels, subtle body motion, cinematic lighting, and seamless looping

Models included:

  • GPT-5.5 xhigh
  • GPT-5.4 xhigh
  • Claude Opus 4.7 (max effort)
  • Claude Opus 4.6 (max effort)
  • Claude Sonnet 4.6 (high effort, max doesn't exist on Sonnet)
  • Kimi K2.6
  • DeepSeek V4 Pro
  • DeepSeek V4 Flash
  • GLM-5.1
  • MiniMax M2.7
  • Qwen 3.6 Plus
  • Grok 4.3

I used the highest thinking setting available for each model. Tokens per second and generation time were not measured.

Source: https://github.com/AidenGeunGeun/oco-canvas-car-scene-compare

We know that models are capable of doing this kind of work, but I was wondering how a wide variety of open-weights models compare to frontier models, especially the ones that are used often.

I tried to use MiMo-V2.5-pro too, but since that model had billing issues with the OpenCode Go subscription, I couldn't use it. Take a look!

u/AkiDenim — 6 days ago
▲ 108 r/Overwatch

all it takes is a cat

(I found it funny on a Korean overwatch community lol)

u/AkiDenim — 20 days ago
▲ 22 r/codex

OpenAI insists that GPT-5.5 can get stuff done in far fewer tokens and is thus actually more economical than GPT-5.4, and many people might not agree.

But here's my explanation, and it kind of aligns with what I felt while using GPT-5.5 on a lot of coding tasks, including a large codebase, an ML project, and a physics project.

GPT-5.5 is $5/Mtok input, and $30/Mtok output. Very expensive on paper. However, the math is kind of interesting.

Usually, GPT-5.5 medium could do whatever GPT-5.4 xhigh could do, but with much better coherency, and it felt more natural to talk to (which is a big W, and the only reason I couldn't let go of Claude for a bit; now I'm unsubbing from Claude Max, yay).

However, since reasoning tokens are billed as output, when there's a LOT of reasoning going on, the economics change.

A good place to see this is Artificial Analysis, under "Cost to Run Artificial Analysis Intelligence Index" and "Verbosity". These show the number of output tokens (and the total cost) needed to run the full evaluation suite.

Meanwhile Claude Opus:

So even though GPT-5.5 is much more expensive on paper, it's much faster (since it outputs fewer tokens) and it's actually cheaper to reach similar intelligence results.

https://preview.redd.it/cte3hpn7m4yg1.png?width=2362&format=png&auto=webp&s=dbbfbfcec71dbd9e3b49d4af97e73e1249cb0da2

https://preview.redd.it/966xakaam4yg1.png?width=2346&format=png&auto=webp&s=1bab98c3c19d4fd2f9301abf9fd883ddbb0054b0

As you can see, GPT-5.4 xhigh used 120M output tokens to get the evaluations done, while GPT-5.5 medium gets a similar result in 22M tokens! This means we get a big speed boost without losing too much intelligence, and it's cheaper to run than 5.4.
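To make the arithmetic concrete, here is a rough sketch using only the figures quoted in this post ($30/Mtok output for GPT-5.5, 22M vs 120M output tokens). It ignores input tokens, which is a simplifying assumption, and because the GPT-5.4 output price isn't stated here, it computes the break-even price instead of assuming one. The function and variable names are made up for illustration.

```python
# Output-token cost comparison using figures quoted in the post.
# ASSUMPTION: input-token cost is ignored; only output tokens are counted.
GPT55_OUTPUT_PRICE = 30.0   # $/Mtok, from the post
GPT55_MEDIUM_MTOK = 22.0    # millions of output tokens, from the post
GPT54_XHIGH_MTOK = 120.0    # millions of output tokens, from the post


def output_cost(mtok: float, price_per_mtok: float) -> float:
    """Dollar cost of `mtok` million output tokens at a given $/Mtok price."""
    return mtok * price_per_mtok


def break_even_price(reference_cost: float, other_run_mtok: float) -> float:
    """Output price ($/Mtok) at which the other run costs exactly reference_cost."""
    return reference_cost / other_run_mtok


gpt55_cost = output_cost(GPT55_MEDIUM_MTOK, GPT55_OUTPUT_PRICE)
threshold = break_even_price(gpt55_cost, GPT54_XHIGH_MTOK)
token_ratio = GPT54_XHIGH_MTOK / GPT55_MEDIUM_MTOK

print(f"GPT-5.5 medium output cost: ${gpt55_cost:.0f}")
print(f"GPT-5.4 xhigh used {token_ratio:.1f}x more output tokens")
print(f"Break-even GPT-5.4 output price: ${threshold:.2f}/Mtok")
```

With these numbers, the 22M-token run costs $660 in output tokens, so the 120M-token run only comes out cheaper on output if its price is below $5.50/Mtok. Plug in the actual GPT-5.4 price to check the comparison yourself.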

Well, of course, if you spam GPT-5.5 at xhigh thinking, say goodbye to your wallet. It's going to be Opus-level spending.

But I really didn't feel the need to go high/xhigh, UNLESS I was getting the model to reason about physics and math. Physics and math are where heavy reasoning really paid off. But for most work, medium thinking is *perfect*.

This is also well reflected in the CritPt benchmark, where results vary by a wide margin depending on reasoning level.

CritPt benchmark scores fluctuate greatly based on reasoning amounts

One more thing to keep in mind: /fast mode in GPT-5.5 takes 2.5x more quota than normal, and if that compounds with using GPT-5.5 high/xhigh everywhere, your quota will be TANKED.

So if you really want to save some usage, turn off /fast mode in Codex. GPT-5.5 medium, without /fast, is still going to be faster than 5.4 xhigh or high with /fast enabled. Use the right amount of reasoning for your tasks!
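Here's a tiny sketch of how the /fast multiplier compounds with heavier reasoning. The 2.5x figure is from this post; everything else is an assumption for illustration: that quota scales linearly with output tokens, and the relative token count of 3.0 for a heavier effort level, which is a made-up number, not a measurement.

```python
FAST_MULTIPLIER = 2.5  # from the post: /fast takes 2.5x more quota than normal


def quota_units(relative_tokens: float, fast: bool) -> float:
    """Relative quota consumed.

    ASSUMPTION: quota scales linearly with output tokens, and /fast
    multiplies the result by FAST_MULTIPLIER.
    """
    return relative_tokens * (FAST_MULTIPLIER if fast else 1.0)


# Token counts are relative to a medium-effort run (medium = 1.0).
# HYPOTHETICAL: 3.0 for a heavier effort level is an illustrative guess.
medium_normal = quota_units(1.0, fast=False)
medium_fast = quota_units(1.0, fast=True)
heavy_fast = quota_units(3.0, fast=True)

print(f"medium, no /fast: {medium_normal:.1f}x")
print(f"medium, /fast:    {medium_fast:.1f}x")
print(f"heavy, /fast:     {heavy_fast:.1f}x")
```

The point is just that the two factors multiply: under these assumed numbers, heavy reasoning plus /fast burns 7.5x the quota of a plain medium run.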

I hope this helps people who are struggling with quotas being too small. I really think the $20 plan still offers a lot of value.

u/AkiDenim — 24 days ago