
r/oMLX
Qwen3.6-35B-oQ6 is the sweet spot for me with MTP
I've been having a good time playing with OpenCode and oMLX. Multi-token prediction (MTP) really does seem to speed things up. I'm testing the Qwen 3.6 35B MoE models, and I noticed that the oQ6 model is almost as fast as the oQ4 for me in token generation (roughly 88 to 92% of the oQ4 tg TPS in the runs below). This may be because the prediction acceptance rate is higher with oQ6. Here are benchmarks for the two running on my machine (M5 Max 64GB) through oMLX:
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-35B-A3B-oQ4-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test            TTFT(ms)  TPOT(ms)        pp TPS       tg TPS  E2E(s)    Throughput  Peak Mem
pp1024/tg128       436.3      8.29  2346.9 tok/s  121.6 tok/s   1.489   773.8 tok/s  20.37 GB
pp4096/tg128      1073.4      8.73  3815.9 tok/s  115.4 tok/s   2.183  1935.4 tok/s  21.17 GB
pp8192/tg128      2018.7      9.17  4058.0 tok/s  109.9 tok/s   3.184  2613.2 tok/s  21.66 GB
pp16384/tg128     4503.8      9.72  3637.8 tok/s  103.7 tok/s   5.739  2877.3 tok/s  22.36 GB
Benchmark Model: Qwen3.6-35B-A3B-oQ6-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test            TTFT(ms)  TPOT(ms)        pp TPS       tg TPS  E2E(s)    Throughput  Peak Mem
pp1024/tg128       463.3      9.34  2210.3 tok/s  107.9 tok/s   1.650   698.3 tok/s  28.29 GB
pp4096/tg128      1121.2      9.87  3653.1 tok/s  102.1 tok/s   2.375  1778.7 tok/s  29.10 GB
pp8192/tg128      2095.8     10.38  3908.8 tok/s   97.1 tok/s   3.414  2436.9 tok/s  29.58 GB
pp16384/tg128     4732.2     10.61  3462.2 tok/s   95.0 tok/s   6.080  2715.8 tok/s  30.29 GB
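
To put a number on "almost as fast", here is a rough sketch that compares the two runs row by row. The values are copied by hand from the two tables above. The names `OQ4`, `OQ6`, and `compare` are made up for this illustration and are not part of oMLX.

```python
# Values copied from the two oMLX benchmark tables above.
# Each row: (prompt tokens, TPOT in ms, tg tok/s, peak memory in GB)
OQ4 = [
    (1024, 8.29, 121.6, 20.37),
    (4096, 8.73, 115.4, 21.17),
    (8192, 9.17, 109.9, 21.66),
    (16384, 9.72, 103.7, 22.36),
]
OQ6 = [
    (1024, 9.34, 107.9, 28.29),
    (4096, 9.87, 102.1, 29.10),
    (8192, 10.38, 97.1, 29.58),
    (16384, 10.61, 95.0, 30.29),
]


def compare(q4_rows, q6_rows):
    """Return oQ6-versus-oQ4 deltas for each prompt length."""
    results = []
    for (pp4, tpot4, tps4, mem4), (pp6, tpot6, tps6, mem6) in zip(q4_rows, q6_rows):
        if pp4 != pp6:
            raise ValueError("rows must be listed in the same prompt-length order")
        results.append({
            "pp": pp4,
            # Fraction of oQ4 generation speed that oQ6 reaches.
            "tg_speed_ratio": tps6 / tps4,
            # Extra time per generated token, in milliseconds.
            "extra_ms_per_token": tpot6 - tpot4,
            # Extra peak memory, in GB.
            "extra_mem_gb": mem6 - mem4,
        })
    return results


if __name__ == "__main__":
    for r in compare(OQ4, OQ6):
        print(
            f"pp{r['pp']:>5}: oQ6 at {r['tg_speed_ratio']:.1%} of oQ4 tg speed, "
            f"+{r['extra_ms_per_token']:.2f} ms/token, "
            f"+{r['extra_mem_gb']:.2f} GB peak"
        )
```

On these numbers, oQ6 generates at roughly 88 to 92% of the oQ4 speed and uses about 7.9 GB more peak memory at every prompt length. This only compares end results. It says nothing directly about the MTP acceptance rate, which the benchmark output does not report.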
u/arfung39 — 6 days ago