u/arfung39

I've been having a good time playing with OpenCode and oMLX. Multi-token prediction (MTP) really does seem to speed things up. I'm playing with the Qwen 3.6 35B MoE models, and I noticed that the oQ6 model is almost as fast as the oQ4 for me in token generation (roughly 8 to 12% lower tg TPS in the runs below). This may be because the prediction acceptance rate is higher with oQ6, though I haven't measured that. Here are benchmarks for the two running on my machine (M5 Max 64GB) through oMLX:

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-35B-A3B-oQ4-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           436.3        8.29  2346.9 tok/s   121.6 tok/s       1.489   773.8 tok/s    20.37 GB
pp4096/tg128          1073.4        8.73  3815.9 tok/s   115.4 tok/s       2.183  1935.4 tok/s    21.17 GB
pp8192/tg128          2018.7        9.17  4058.0 tok/s   109.9 tok/s       3.184  2613.2 tok/s    21.66 GB
pp16384/tg128         4503.8        9.72  3637.8 tok/s   103.7 tok/s       5.739  2877.3 tok/s    22.36 GB

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-35B-A3B-oQ6-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           463.3        9.34  2210.3 tok/s   107.9 tok/s       1.650   698.3 tok/s    28.29 GB
pp4096/tg128          1121.2        9.87  3653.1 tok/s   102.1 tok/s       2.375  1778.7 tok/s    29.10 GB
pp8192/tg128          2095.8       10.38  3908.8 tok/s    97.1 tok/s       3.414  2436.9 tok/s    29.58 GB
pp16384/tg128         4732.2       10.61  3462.2 tok/s    95.0 tok/s       6.080  2715.8 tok/s    30.29 GB
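
To put a number on "almost as fast", here is a quick sketch that compares the tg TPS columns of the two tables. The values are copied by hand from the output above; the helper name `slowdown_pct` is just something made up for this example, not part of oMLX.

```python
# tg TPS values copied from the two benchmark tables above: (oQ4, oQ6)
tg_tps = {
    "pp1024/tg128": (121.6, 107.9),
    "pp4096/tg128": (115.4, 102.1),
    "pp8192/tg128": (109.9, 97.1),
    "pp16384/tg128": (103.7, 95.0),
}


def slowdown_pct(q4: float, q6: float) -> float:
    """Percent by which oQ6 token generation is slower than oQ4."""
    return (1 - q6 / q4) * 100


for test, (q4, q6) in tg_tps.items():
    print(f"{test:>14}: oQ6 is {slowdown_pct(q4, q6):.1f}% slower than oQ4")
```

On these numbers the gap stays between about 8% and 12%, and it is smallest at the longest prompt (pp16384), while peak memory is roughly 8 GB higher for oQ6 in every row.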

Qwen3.6-35B-oQ6 is the sweet spot for me with MTP.