
110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp
I had been getting great MTP (multi-token prediction) performance with llama.cpp on my RTX 4070 Super 12GB until the MTP PR was actually merged. After that, performance tanked to barely above non-MTP speeds. So I decided to try ik_llama.cpp, since it also supports MTP and is apparently better optimized for CPU offloading. I did not expect such a huge speed boost!
Before moving on to the benchmark results, here are my PC specs:
OS: CachyOS with Plasma (X11) - HIGHLY recommended
GPU: RTX 4070 Super 12GB
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I
UPDATED: For comparison, here are the regular llama.cpp mtp-bench.py results with byteshape's recently released Qwen3.6-35B-A3B-IQ4_XS-4.19bpw quant, which has similar accuracy to Unsloth's Q4_K_XL but is 4GB smaller:
❯ ./mtp-bench.py
code_python pred= 192 draft= 122 acc= 118 rate=0.967 tok/s=79.8
code_cpp pred= 192 draft= 117 acc= 110 rate=0.940 tok/s=89.1
explain_concept pred= 192 draft= 124 acc= 113 rate=0.911 tok/s=88.0
summarize pred= 192 draft= 139 acc= 127 rate=0.914 tok/s=95.0
qa_factual pred= 192 draft= 133 acc= 128 rate=0.962 tok/s=97.0
translation pred= 192 draft= 125 acc= 117 rate=0.936 tok/s=91.6
creative_short pred= 192 draft= 109 acc= 99 rate=0.908 tok/s=82.1
stepwise_math pred= 192 draft= 130 acc= 125 rate=0.962 tok/s=97.0
long_code_review pred= 192 draft= 121 acc= 115 rate=0.950 tok/s=88.2
Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 1120,
"total_draft_accepted": 1052,
"aggregate_accept_rate": 0.9393,
"wall_s_total": 21.86
}
This gives an 89.76 tok/s average (the mean of the nine per-prompt tok/s figures).
Here's my llama.cpp launch command. Temperature is set to 0.0 for the benchmark so that results don't diverge between runs:
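For anyone who wants to check the average, here is a minimal Python sketch. It assumes the per-prompt lines are pasted in exactly as mtp-bench.py printed them above; `mean_tok_s` is a hypothetical helper written for this post, not part of the benchmark script. Note that it averages the per-prompt rates, which is a different number from total_predicted divided by wall_s_total (1728 / 21.86, roughly 79 tok/s).

```python
import re

# Per-prompt lines copied verbatim from the llama.cpp run above.
LLAMA_CPP_OUTPUT = """\
code_python pred= 192 draft= 122 acc= 118 rate=0.967 tok/s=79.8
code_cpp pred= 192 draft= 117 acc= 110 rate=0.940 tok/s=89.1
explain_concept pred= 192 draft= 124 acc= 113 rate=0.911 tok/s=88.0
summarize pred= 192 draft= 139 acc= 127 rate=0.914 tok/s=95.0
qa_factual pred= 192 draft= 133 acc= 128 rate=0.962 tok/s=97.0
translation pred= 192 draft= 125 acc= 117 rate=0.936 tok/s=91.6
creative_short pred= 192 draft= 109 acc= 99 rate=0.908 tok/s=82.1
stepwise_math pred= 192 draft= 130 acc= 125 rate=0.962 tok/s=97.0
long_code_review pred= 192 draft= 121 acc= 115 rate=0.950 tok/s=88.2
"""


def mean_tok_s(output: str) -> float:
    """Average the tok/s value found on each benchmark line (hypothetical helper)."""
    rates = [float(value) for value in re.findall(r"tok/s=([0-9.]+)", output)]
    return sum(rates) / len(rates)


print(f"{mean_tok_s(LLAMA_CPP_OUTPUT):.2f} tok/s")  # prints "89.76 tok/s"
```

Swapping in the ik_llama.cpp lines gives the second average the same way.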
llama-server \
-m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
--fit on \
--fit-target 512 \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--cache-type-k-draft q8_0 \
--cache-type-v-draft q8_0 \
--spec-type draft-mtp \
--spec-draft-p-min 0.75 \
--spec-draft-n-max 3 \
--no-mmap \
--mlock \
--threads 8 \
--temp 0.0
Now, here are the benchmark results with the same quant, but running on ik_llama.cpp:
❯ ./mtp-bench.py
code_python pred= 192 draft= 135 acc= 122 rate=0.904 tok/s=105.1
code_cpp pred= 192 draft= 136 acc= 120 rate=0.882 tok/s=110.3
explain_concept pred= 192 draft= 133 acc= 116 rate=0.872 tok/s=109.0
summarize pred= 56 draft= 38 acc= 37 rate=0.974 tok/s=122.3
qa_factual pred= 192 draft= 141 acc= 127 rate=0.901 tok/s=116.0
translation pred= 192 draft= 143 acc= 113 rate=0.790 tok/s=104.1
creative_short pred= 192 draft= 133 acc= 118 rate=0.887 tok/s=109.4
stepwise_math pred= 192 draft= 140 acc= 125 rate=0.893 tok/s=114.6
long_code_review pred= 192 draft= 128 acc= 108 rate=0.844 tok/s=101.4
Aggregate: {
"n_requests": 9,
"total_predicted": 1592,
"total_draft": 1127,
"total_draft_accepted": 986,
"aggregate_accept_rate": 0.8749,
"wall_s_total": 16.64
}
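That's a 110.24 tok/s average, or a 23% increase! (Note that the summarize prompt generated only 56 tokens in this run, so that row covers a shorter generation than the others.)
If you want to get similar results on a 12GB RTX GPU, make sure you use the following ik_llama.cpp launch parameters, as several of them differ from llama.cpp's: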
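To sanity-check the comparison, here is a short Python sketch using only the figures from the two runs above. The dictionary layout and the `percent_increase` helper are my own, made up for illustration. It also shows that ik_llama.cpp comes out faster even though its draft acceptance rate is lower.

```python
# Figures copied from the two benchmark runs in this post.
llama_cpp = {"mean_tok_s": 89.76, "draft": 1120, "accepted": 1052}
ik_llama = {"mean_tok_s": 110.24, "draft": 1127, "accepted": 986}


def percent_increase(old: float, new: float) -> float:
    """Relative change from old to new, in percent (hypothetical helper)."""
    return (new - old) / old * 100


speedup = percent_increase(llama_cpp["mean_tok_s"], ik_llama["mean_tok_s"])
print(f"throughput: +{speedup:.1f}%")  # prints "throughput: +22.8%"

for name, run in (("llama.cpp", llama_cpp), ("ik_llama.cpp", ik_llama)):
    accept_rate = run["accepted"] / run["draft"]
    print(f"{name}: accept rate {accept_rate:.4f}")
```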
llama-server \
-m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
--fit \
--fit-margin 1664 \
--ctx-size 131072 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--cache-type-k-draft q8_0 \
--cache-type-v-draft q8_0 \
--multi-token-prediction \
--draft-p-min 0.75 \
--draft-max 3 \
--no-mmap \
--mlock \
--threads 8 \
--temp 0.0
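Since the flag names are easy to mix up, here is a small Python sketch that diffs the two launch commands from this post. It only compares the text of the commands; `parse_flags` and `diff_flags` are hypothetical helpers, and the sketch makes no claim about which flags are semantically equivalent between the two projects.

```python
import shlex

# The two launch commands from this post, without the line-continuation backslashes.
LLAMA_CPP_CMD = """llama-server -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf
  --fit on --fit-target 512 --ctx-size 131072
  --cache-type-k q8_0 --cache-type-v q8_0
  --cache-type-k-draft q8_0 --cache-type-v-draft q8_0
  --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 3
  --no-mmap --mlock --threads 8 --temp 0.0"""

IK_LLAMA_CMD = """llama-server -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf
  --fit --fit-margin 1664 --ctx-size 131072
  --cache-type-k q8_0 --cache-type-v q8_0
  --cache-type-k-draft q8_0 --cache-type-v-draft q8_0
  --multi-token-prediction --draft-p-min 0.75 --draft-max 3
  --no-mmap --mlock --threads 8 --temp 0.0"""


def parse_flags(cmd: str) -> dict:
    """Map each flag to its value, or None for flags given without a value."""
    tokens = shlex.split(cmd)[1:]  # drop the executable name
    flags = {}
    i = 0
    while i < len(tokens):
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        flags[tokens[i]] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return flags


def diff_flags(cmd_a: str, cmd_b: str):
    """Return (flags only in a, flags only in b, shared flags whose values differ)."""
    a, b = parse_flags(cmd_a), parse_flags(cmd_b)
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {flag: (a[flag], b[flag]) for flag in a if flag in b and a[flag] != b[flag]}
    return only_a, only_b, changed


only_llama, only_ik, changed = diff_flags(LLAMA_CPP_CMD, IK_LLAMA_CMD)
print("llama.cpp only:   ", only_llama)
print("ik_llama.cpp only:", only_ik)
print("different values: ", changed)
```

Everything not listed in the output (context size, KV cache types, --no-mmap, --mlock, threads, temperature) is written identically in both commands.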
I should also mention that the RTX 4070 Super runs as a secondary GPU on my CachyOS setup, with the monitor plugged into the iGPU, so 100% of its VRAM is available to the model.
If you get an "out of memory" (OOM) error while loading the model or working with it, try increasing --fit-margin to 1792 or even 2048.
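If you end up stepping through margins, a small Python sketch like this can generate the command for each value so you only change one number at a time. `command_with_margin` is a hypothetical helper; it just prints the commands and does not launch the server or detect OOM errors for you.

```python
import shlex


def command_with_margin(fit_margin: int) -> list:
    """Build the ik_llama.cpp launch command from this post with a given --fit-margin."""
    return [
        "llama-server",
        "-m", "Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf",
        "--fit",
        "--fit-margin", str(fit_margin),
        "--ctx-size", "131072",
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
        "--cache-type-k-draft", "q8_0",
        "--cache-type-v-draft", "q8_0",
        "--multi-token-prediction",
        "--draft-p-min", "0.75",
        "--draft-max", "3",
        "--no-mmap",
        "--mlock",
        "--threads", "8",
        "--temp", "0.0",
    ]


# The starting margin from my command, then the two fallbacks suggested above.
for margin in (1664, 1792, 2048):
    print(shlex.join(command_with_margin(margin)))
```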
Cheers :)