u/alphatrad

I ran a quantization shootout on Qwen3-Coder and the results are... interesting

Out of random curiosity I ran a shootout on Qwen3-Coder-Next. I've been using the MXFP4_MOE quant from unsloth for a while since it's just really fast on my system, but I was curious about precision. I know quantization hurts the model, but I don't think I had really understood how much until I tested it myself.

Hardware: 3× R9700 PRO (96 GB VRAM)

Backend: llama.cpp Vulkan

Eval: wikitext-2 (583 chunks, ctx 512)

Formats tested: MXFP4_MOE, Q4_K_M, Q5_K_M, UD-Q5_K_M

TLDR: UD-Q5_K_M is cooking! Best quality of the four formats for only ~10 GB more than the smallest one, with barely any speed penalty. Unsloth's dynamic precision approach is really good. I might need to test it at lower quants now.

The Numbers
(no shit I asked claude to make me a table to copy pasta)

| Metric | MXFP4 | Q4_K_M | Q5_K_M | UD-Q5_K_M |
|---|---|---|---|---|
| Same top-1 | 89.4% | 89.6% | 93.0% | 94.0% |
| Mean KL divergence | 0.0746 | 0.0685 | 0.0308 | 0.0217 |
| Max KL (worst token) | 13.04 | 5.93 | 8.19 | 4.75 |
| File size | 44.7 GB | 45.2 GB | 52.9 GB | 55.2 GB |
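
For anyone wondering what the quality rows actually measure, here is a minimal sketch of how "same top-1", mean KL, and max KL can be computed from per-token logits. It assumes you have logits from a reference model and a quantized model over the same token positions (the post doesn't say which reference was used). The function names and the toy numbers are made up for illustration and are not llama.cpp's code.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(P || Q) in nats. P is the reference model, Q the quantized one.
    Assumes q has no exact zeros, which holds for the toy data below."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def compare(ref_logits, quant_logits):
    """Return (same top-1 rate, mean KL, max KL) over all token positions."""
    kls, matches = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        kls.append(kl_divergence(softmax(ref), softmax(quant)))
        # Same top-1: both models rank the same token highest.
        matches += ref.index(max(ref)) == quant.index(max(quant))
    n = len(kls)
    return matches / n, sum(kls) / n, max(kls)

# Toy data: 3 token positions, vocabulary of 4 tokens (made-up numbers).
ref = [[2.0, 1.0, 0.1, -1.0], [0.5, 2.5, 0.0, 0.0], [1.0, 1.1, 3.0, 0.2]]
quant = [[1.9, 1.1, 0.0, -1.0], [2.6, 0.4, 0.1, 0.0], [1.0, 1.0, 2.8, 0.3]]

top1, mean_kl, max_kl = compare(ref, quant)
print(f"same top-1: {top1:.1%}, mean KL: {mean_kl:.4f}, max KL: {max_kl:.4f}")
```

In the toy data the second position flips its top token, so it drags down top-1 agreement and also dominates the max KL. A single bad token like that is what the "Max KL (worst token)" row is tracking.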

UD-Q5_K_M wins on literally every quality metric while only being ~10 GB larger than MXFP4.

Here's the thing nobody talks about: token accuracy compounds exponentially.

A roughly 5-point difference in per-token agreement becomes a ~150× difference in the odds of a perfect match by token 100, if you treat each token as independent. All LLMs are autoregressive. Yann LeCun is always talking about this: LLMs suffer from exponentially diverging error probabilities. This is where all your hallucinations and stuff happen.

MXFP4 (89.4%) → 100-token output: 0.0014% chance of perfect agreement

UD-Q5_K_M (94%) → 100-token output: 0.21% chance of perfect agreement
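
The arithmetic behind those two numbers is just the per-token agreement rate raised to the output length. A quick sketch, assuming every token agrees independently with the same probability (a simplification, since real errors are correlated); the function name is made up for illustration:

```python
def perfect_agreement(p_top1: float, n_tokens: int) -> float:
    """Chance that all n_tokens match the reference, assuming independence."""
    return p_top1 ** n_tokens

mxfp4 = perfect_agreement(0.894, 100)   # MXFP4 top-1 agreement from the table
ud_q5 = perfect_agreement(0.940, 100)   # UD-Q5_K_M top-1 agreement from the table

print(f"MXFP4:     {mxfp4:.4%}")
print(f"UD-Q5_K_M: {ud_q5:.2%}")
print(f"ratio:     {ud_q5 / mxfp4:.0f}x")
```

The ratio between the two lands at roughly 150×, and it keeps growing with output length because the gap is multiplied in at every token.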

That's not a big number, but on long refactoring tasks or multi-step reasoning, you feel it. MXFP4 "goes off the rails" way more often.

There is a speed tradeoff to all of this though.

Prefill (batch 512): MXFP4 still fastest (hardware kernels)

Prefill (batch 4096): MXFP4 wins again

Decode: Q4_K_M edges UD-Q5 slightly, but UD-Q5 is within 9% despite being 22% larger

For interactive coding (which is decode-bound anyway), the speed hit is negligible.

For me, I swapped my default from MXFP4 to UD-Q5_K_M. MXFP4 is still great for heavy prefill workloads but for daily code generation where you care about quality over speed, UD-Q5 is the clear winner.

What quants are you guys running for code models? Are you finding the same quality cliff with aggressive compression? And if you're on Nvidia hardware, are you seeing different tradeoffs than RDNA?

https://preview.redd.it/0z8kkkhjkp2h1.png?width=1130&format=png&auto=webp&s=aadcce727dc26d756d67d4e356a709aa96fd030f

u/alphatrad — 13 hours ago

Peeps are always asking which local model is best. That question is loaded and totally depends on the task you're asking of it.

Llama2 for example is old but still useful for summarizing YouTube transcripts into 10 bullet points. I don't code with it obviously, but it works well for that.

So, I built my own benchmarking tool to test local models on my client codebases. SWE-bench and similar tools only test gate-based Python tasks. They do not care whether the model writes slop or creates additional bugs. The only question: did the test turn green? Yes or no.

My benchmark runs 30 tests across 8 scenario categories, with a maximum of 64 points possible.

| Category | What It Probes |
|---|---|
| surgical-edit | Fix exactly the thing that's broken. Don't touch adjacent code. |
| audit | Read the code, find the bugs. Do NOT edit anything. |
| scope-discipline | Make the requested change. Nothing else. |
| read-only-analysis | Answer a question about the code. Don't reach for the edit tool. |
| verify-and-repair | Close the loop: reproduce the failure, fix it, verify, and recover if needed. |
| implementation | Read a spec, build the feature. Multi-file spec-to-code. |
| responsiveness | Stay usable in a tight edit loop. Correctness only counts when turns stay under budget. |
| long-context | Retrieve the right answer from a very large inline context and respond quickly. |
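
To make the "correctness only counts when turns stay under budget" rule concrete, here is a rough sketch of how that kind of scoring could be tallied. This is NOT Scaffold Bench's actual code: `CaseResult`, `TURN_BUDGET_SECONDS`, the 30-second budget, and the point values are all hypothetical, made up purely to illustrate the idea.

```python
from dataclasses import dataclass

# Hypothetical time budget per turn; the real benchmark's budget is not stated.
TURN_BUDGET_SECONDS = 30.0

@dataclass
class CaseResult:
    """One test outcome (hypothetical structure, for illustration only)."""
    category: str
    points_earned: int
    points_possible: int
    turn_seconds: float = 0.0  # slowest turn observed during the test

def score(results):
    """Return (earned, possible, fraction) across all test results."""
    earned = possible = 0
    for r in results:
        possible += r.points_possible
        # Responsiveness tests earn nothing if any turn blew the budget.
        if r.category == "responsiveness" and r.turn_seconds > TURN_BUDGET_SECONDS:
            continue
        earned += r.points_earned
    return earned, possible, earned / possible

results = [
    CaseResult("surgical-edit", 2, 2),
    CaseResult("responsiveness", 2, 2, turn_seconds=45.0),  # correct but too slow
    CaseResult("audit", 1, 2),
]
earned, possible, fraction = score(results)
print(f"{earned}/{possible} points ({fraction:.0%})")
```

The point of gating on time is that a correct answer that arrives too late is useless in a tight edit loop, so it scores the same as a wrong one.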

Right now I'm focused mainly on testing JavaScript, TypeScript, React, Go and SQL.

I've been benchmarking a lot of local models for use with code and comparing them to the SOTA models on my benchmark.

To make a long story short, everyone kind of doesn't take Mistral seriously. But I was looking for something else to benchmark while browsing Hugging Face and decided to try Devstral Small 2 24B Instruct. SPECIFICALLY using this Q8 quant from Unsloth: https://huggingface.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF

To my surprise - this has scored the HIGHEST in 3 runs on my benchmark for a LOCAL model. I've mainly been using Qwen in a hybrid fashion for code work. I use Claude to write specs, Qwen to execute and Codex to do code reviews. Generally... most of the fixes needed with Qwen are stylistic, or duplication, or maybe some anti-patterns it introduced.

https://preview.redd.it/t9tij1ijqdyg1.png?width=2102&format=png&auto=webp&s=d337208acdd9ad44d18a4c1ba5032b7531ffd816

But Qwen hasn't scored as high as Devstral, which is the first local model to break 80% on my benchmark. It even beat out Sonnet 4.6 and Codex 5.3!!! OK.... surprising?

TPS however is a little slow. Wall time not so bad. And I'm just wondering, have we all been sleeping on Mistral? Usually I hear people trash them, but I'm actually surprised.

https://preview.redd.it/ym2nn3j4rdyg1.png?width=2096&format=png&auto=webp&s=41a59301f9687332b40807232fb8f0f8fc3895ff

I need to spend a few weeks testing this in production - because a benchmark, again, isn't real life. And who knows, maybe I'm a moron who built a bad bench.

Anyone else test this model?

And if you are curious, this is Scaffold Bench - a work in progress. Whether or not my 30 tests are even good is open to debate.

Github: https://github.com/1337hero/scaffold-bench

u/alphatrad — 22 days ago