
New to local LLM benchmarking, got 97.6% HumanEval+ on Qwen3.6. A sanity check please
Just got my RTX 5090 and spent today doing my first real local LLM benchmark on Qwen3.6-35B-A3B (Unsloth's MTP-UD-Q4_K_XL). I'm a business consultant by day, not an ML engineer, so I'd genuinely appreciate the community's eyes on whether I went about this sensibly.
What I got:
- llama-bench: 258 t/s baseline, 270 t/s with MTP
-d 2 - Perplexity Q4 vs Q5 on WikiText-2: delta of 0.0042 (within noise)
- HumanEval+ pass@1: 92.1% thinking off, 97.6% thinking on
That last number is what's making me nervous, it puts a local model in GPT-5 / Claude Opus territory, which feels too good to be a casual Tuesday result.
Where I'm uncertain:
- Is WikiText-2 even the right perplexity corpus for a 2026 coding model? Training data has probably seen it cold by now.
- First EvalPlus run returned 2.4% pass@1, which seemed impossibly low. Turned out EvalPlus was reading the
contentfield while Qwen's reasoning output was landing inreasoning_content. I patched it with a 30-line proxy that merges the two. Is there a cleaner standard solution I missed? - Thinking-off run finished in 3 minutes, thinking-on in 40. Is that ratio normal, or did I configure something poorly?
- MTP acceptance hit 86% on reasoning output but lower on chat-style. Anyone else seeing this pattern?
- Anything obvious I should have measured but didn't? MBPP+? LiveCodeBench? Different perplexity corpus?
Full write-up with every command and the proxy code: (First post ever) https://spoliatiotexo854419.substack.com/p/from-unboxing-to-976-humaneval-benchmarking?r=20wxz1&utm_campaign=post&utm_medium=web&triedRedirect=true
Mostly hoping someone tells me which parts I overthought and which I underthought. Thanks in advance.