u/alphatrad

Metric	MXFP4	Q4_K_M	Q5_K_M	UD-Q5_K_M
Same top-1	89.4%	89.6%	93.0%	94.0%
Mean KL divergence	0.0746	0.0685	0.0308	0.0217
Max KL (worst token)	13.04	5.93	8.19	4.75
File size	44.7 GB	45.2 GB	52.9 GB	55.2 GB
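
For anyone wondering what those first three rows measure, here is a minimal sketch. The table doesn't say how the numbers were produced, so everything here is an assumption: I'm assuming the quant is compared against a higher-precision reference model on the same tokens, and the function names (`softmax`, `kl_divergence`, `compare`) are made up for illustration. Inputs are per-position logit lists from each model.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats. p = reference distribution, q = quantized model's."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

def compare(ref_logits, quant_logits):
    """Hypothetical helper: aggregate per-position stats over a token sequence."""
    kls = []
    same_top1 = 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        kls.append(kl_divergence(p, q))
        if argmax(p) == argmax(q):
            same_top1 += 1
    n = len(kls)
    return {
        "same_top1": same_top1 / n,   # share of positions with the same top token
        "mean_kl": sum(kls) / n,      # average divergence across positions
        "max_kl": max(kls),           # the single worst position
    }
```

Read this way, a low mean KL with a high max KL would mean the quant tracks the reference closely on most tokens but badly misses on at least one.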

Peeps are always asking which local model is best. That question is loaded and totally depends on the task you're asking of it.

Llama2 for example is old but still useful for summarizing YouTube transcripts into 10 bullet points. I don't code with it obviously, but it works well for that.

So, I built my own benchmarking tool to test local models on my client codebases. SWE-bench and similar tools only test Python tasks, gated on a pass/fail test. They do not care whether the model writes slop or creates additional bugs. The only question: did the test turn green? Yes or no.

My benchmark runs 30 tests across 8 categories, with a maximum of 64 points possible.

Category	What It Probes
`surgical-edit`	Fix exactly the thing that's broken. Don't touch adjacent code.
`audit`	Read the code, find the bugs. Do NOT edit anything.
`scope-discipline`	Make the requested change. Nothing else.
`read-only-analysis`	Answer a question about the code. Don't reach for the edit tool.
`verify-and-repair`	Close the loop: reproduce the failure, fix it, verify, and recover if needed.
`implementation`	Read a spec, build the feature. Multi-file spec-to-code.
`responsiveness`	Stay usable in a tight edit loop. Correctness only counts when turns stay under budget.
`long-context`	Retrieve the right answer from a very large inline context and respond quickly.
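
To make the scoring idea concrete, here is a rough sketch of how points could roll up into a percentage, with a time-budget gate like the `responsiveness` row and a simple scope check like `scope-discipline`. This is not the actual Scaffold Bench code. `CaseResult`, `effective_points`, `out_of_scope` and `summarize` are hypothetical names, and the point values in any usage are made up.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CaseResult:
    category: str
    points_earned: int
    points_possible: int
    turn_seconds: float = 0.0
    budget_seconds: Optional[float] = None  # only set for budget-gated tests

def effective_points(result):
    """Zero out the points when a budget-gated test ran over its time budget."""
    if result.budget_seconds is not None and result.turn_seconds > result.budget_seconds:
        return 0
    return result.points_earned

def out_of_scope(changed_files, allowed_files):
    """Files the model edited that the task did not ask it to touch."""
    return sorted(set(changed_files) - set(allowed_files))

def summarize(results):
    """Return ({category: (earned, possible)}, overall percentage)."""
    per_category = {}
    for r in results:
        earned, possible = per_category.get(r.category, (0, 0))
        per_category[r.category] = (earned + effective_points(r), possible + r.points_possible)
    total_earned = sum(e for e, _ in per_category.values())
    total_possible = sum(p for _, p in per_category.values())
    return per_category, 100.0 * total_earned / total_possible
```

The point of the gate is that a correct answer delivered too slowly scores the same as a wrong one, and the point of the scope check is that a green test isn't enough if the diff touched files nobody asked for.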

Right now I'm focused mainly on testing JavaScript, TypeScript, React, Go and SQL.

I've been benchmarking a lot of local models for coding and comparing them to the SOTA models on my benchmark.

To make a long story short, hardly anyone takes Mistral seriously. But I was looking for something else to benchmark while browsing Hugging Face and decided to try Devstral Small 2 24B Instruct. SPECIFICALLY using this Q8 quant from Unsloth: https://huggingface.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF

To my surprise - this has scored the HIGHEST of any LOCAL model on my benchmark, across 3 runs. I've mainly been using Qwen in a hybrid fashion for code work. I use Claude to write specs, Qwen to execute and Codex to do code reviews. Generally... most of the fixes needed with Qwen are stylistic, duplication, or the occasional anti-pattern it introduced.

https://preview.redd.it/t9tij1ijqdyg1.png?width=2102&format=png&auto=webp&s=d337208acdd9ad44d18a4c1ba5032b7531ffd816

But Qwen hasn't scored as high as Devstral - the first local model to break 80% on my benchmark. It even beat out Sonnet 4.6 and Codex 5.3!!! OK.... surprising?

TPS, however, is a little slow. Wall time is not so bad. And I'm just wondering, have we all been sleeping on Mistral? Usually I hear people trash them, but I'm actually surprised.

https://preview.redd.it/ym2nn3j4rdyg1.png?width=2096&format=png&auto=webp&s=41a59301f9687332b40807232fb8f0f8fc3895ff

I need to spend a few weeks testing this in production - because, again, a benchmark isn't real life. And who knows, maybe I'm a moron who built a bad bench.

Anyone else test this model?

And if you are curious, this is Scaffold Bench - a work in progress. Whether or not my 30 tests are even good is open to debate.

GitHub: https://github.com/1337hero/scaffold-bench

I ran a quantization shootout on Qwen3-Coder and the results are... interesting