I had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB until the MTP PR was actually merged. After that, performance tanked and was barely above non-MTP. So I decided to try ik_llama.cpp, since it also supports MTP and is apparently better optimized for CPU offloading. I did not expect such a huge speed boost!

Before moving on to the benchmark results, here are my PC specs:

OS: CachyOS with Plasma (X11) - HIGHLY recommended
GPU: RTX 4070 Super 12GB
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I

UPDATED: For comparison, here are the regular llama.cpp mtp-bench.py results with byteshape's recently released Qwen3.6-35B-A3B-IQ4_XS-4.19bpw quant, which has similar accuracy to Unsloth's Q4_K_XL but is 4GB smaller:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 122 acc= 118 rate=0.967 tok/s=79.8
 code_cpp           pred= 192 draft= 117 acc= 110 rate=0.940 tok/s=89.1
 explain_concept    pred= 192 draft= 124 acc= 113 rate=0.911 tok/s=88.0
 summarize          pred= 192 draft= 139 acc= 127 rate=0.914 tok/s=95.0
 qa_factual         pred= 192 draft= 133 acc= 128 rate=0.962 tok/s=97.0
 translation        pred= 192 draft= 125 acc= 117 rate=0.936 tok/s=91.6
 creative_short     pred= 192 draft= 109 acc=  99 rate=0.908 tok/s=82.1
 stepwise_math      pred= 192 draft= 130 acc= 125 rate=0.962 tok/s=97.0
 long_code_review   pred= 192 draft= 121 acc= 115 rate=0.950 tok/s=88.2

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1728,
 "total_draft": 1120,
 "total_draft_accepted": 1052,
 "aggregate_accept_rate": 0.9393,
 "wall_s_total": 21.86
}

Averaging the per-prompt tok/s values gives 89.76 tok/s.
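For anyone who wants to check the numbers: the 89.76 figure matches the unweighted mean of the tok/s column, and the aggregate accept rate is total accepted over total drafted. Here is a minimal Python sketch that recomputes both from the pasted output. The regex and the `parse_rows` helper are my own illustration, not part of mtp-bench.py.

```python
import re
from statistics import mean

# Per-prompt lines copied from the llama.cpp run above.
LLAMA_CPP_RUN = """
 code_python        pred= 192 draft= 122 acc= 118 rate=0.967 tok/s=79.8
 code_cpp           pred= 192 draft= 117 acc= 110 rate=0.940 tok/s=89.1
 explain_concept    pred= 192 draft= 124 acc= 113 rate=0.911 tok/s=88.0
 summarize          pred= 192 draft= 139 acc= 127 rate=0.914 tok/s=95.0
 qa_factual         pred= 192 draft= 133 acc= 128 rate=0.962 tok/s=97.0
 translation        pred= 192 draft= 125 acc= 117 rate=0.936 tok/s=91.6
 creative_short     pred= 192 draft= 109 acc=  99 rate=0.908 tok/s=82.1
 stepwise_math      pred= 192 draft= 130 acc= 125 rate=0.962 tok/s=97.0
 long_code_review   pred= 192 draft= 121 acc= 115 rate=0.950 tok/s=88.2
"""

# Hypothetical pattern for one result line; adjust if the script's format changes.
ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+pred=\s*(?P<pred>\d+)\s+draft=\s*(?P<draft>\d+)"
    r"\s+acc=\s*(?P<acc>\d+)\s+rate=(?P<rate>[\d.]+)\s+tok/s=(?P<tps>[\d.]+)"
)


def parse_rows(text):
    """Turn pasted benchmark output into a list of dicts, skipping other lines."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "name": m["name"],
                "pred": int(m["pred"]),
                "draft": int(m["draft"]),
                "acc": int(m["acc"]),
                "tok_s": float(m["tps"]),
            })
    return rows


rows = parse_rows(LLAMA_CPP_RUN)
avg_tok_s = mean(r["tok_s"] for r in rows)  # unweighted mean over prompts
accept_rate = sum(r["acc"] for r in rows) / sum(r["draft"] for r in rows)
print(f"{avg_tok_s:.2f} tok/s, accept rate {accept_rate:.4f}")
```

Note this is a per-prompt average, so a short run counts as much as a long one.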

Here's my llama.cpp launch command. Temperature is set to 0.0 for the benchmark to prevent diverging results between runs:

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit on \
  --fit-target 512 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --spec-type draft-mtp \
  --spec-draft-p-min 0.75 \
  --spec-draft-n-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

Now, here are the benchmark results with the same quant, but running on ik_llama.cpp:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 135 acc= 122 rate=0.904 tok/s=105.1
 code_cpp           pred= 192 draft= 136 acc= 120 rate=0.882 tok/s=110.3
 explain_concept    pred= 192 draft= 133 acc= 116 rate=0.872 tok/s=109.0
 summarize          pred=  56 draft=  38 acc=  37 rate=0.974 tok/s=122.3
 qa_factual         pred= 192 draft= 141 acc= 127 rate=0.901 tok/s=116.0
 translation        pred= 192 draft= 143 acc= 113 rate=0.790 tok/s=104.1
 creative_short     pred= 192 draft= 133 acc= 118 rate=0.887 tok/s=109.4
 stepwise_math      pred= 192 draft= 140 acc= 125 rate=0.893 tok/s=114.6
 long_code_review   pred= 192 draft= 128 acc= 108 rate=0.844 tok/s=101.4

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1592,
 "total_draft": 1127,
 "total_draft_accepted": 986,
 "aggregate_accept_rate": 0.8749,
 "wall_s_total": 16.64
}

That's a 110.24 tok/s average, or a 23% increase!
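If you want to reproduce the comparison, here is a small Python sketch using the tok/s columns from the two runs. It also recomputes the gain without the summarize prompt, since that one stopped after 56 tokens on ik_llama.cpp versus 192 on llama.cpp and so is not quite like-for-like. The `pct_increase` helper is just for illustration.

```python
from statistics import mean

# Prompt names and tok/s values copied from the two runs above, same order.
PROMPTS = [
    "code_python", "code_cpp", "explain_concept", "summarize", "qa_factual",
    "translation", "creative_short", "stepwise_math", "long_code_review",
]
llama_cpp = [79.8, 89.1, 88.0, 95.0, 97.0, 91.6, 82.1, 97.0, 88.2]
ik_llama = [105.1, 110.3, 109.0, 122.3, 116.0, 104.1, 109.4, 114.6, 101.4]


def pct_increase(old, new):
    """Relative gain of new over old, in percent."""
    return (new / old - 1.0) * 100.0


overall = pct_increase(mean(llama_cpp), mean(ik_llama))
print(f"all prompts: +{overall:.1f}%")

# Drop the prompt whose generation length differed between the two runs.
keep = [i for i, name in enumerate(PROMPTS) if name != "summarize"]
without_summarize = pct_increase(
    mean([llama_cpp[i] for i in keep]),
    mean([ik_llama[i] for i in keep]),
)
print(f"without summarize: +{without_summarize:.1f}%")
```

Either way the gain stays above 20% on these numbers, so the short summarize run is not what drives the result.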

If you want to get similar results on a 12GB RTX GPU, make sure you use the following ik_llama.cpp launch parameters, as they can differ from llama.cpp:

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit \
  --fit-margin 1664 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --multi-token-prediction \
  --draft-p-min 0.75 \
  --draft-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0
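To make the differences easier to spot, here is a rough Python sketch that diffs the flags of the two launch commands from this post. It only shows which flag names appear in one command and not the other; it does not claim the flags are exact equivalents, and `parse_flags` is a hypothetical helper that assumes no flag value starts with a dash.

```python
# Both commands copied from this post. Inside a regular Python string literal,
# a backslash at the end of a line is swallowed, so each string is one line.
LLAMA_CPP_CMD = """llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit on --fit-target 512 --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 \
  --spec-type draft-mtp --spec-draft-p-min 0.75 --spec-draft-n-max 3 \
  --no-mmap --mlock --threads 8 --temp 0.0"""

IK_LLAMA_CMD = """llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit --fit-margin 1664 --ctx-size 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 \
  --multi-token-prediction --draft-p-min 0.75 --draft-max 3 \
  --no-mmap --mlock --threads 8 --temp 0.0"""


def parse_flags(cmd):
    """Map each flag to its value, or None for flags given without a value."""
    tokens = cmd.split()[1:]  # drop the binary name
    flags = {}
    i = 0
    while i < len(tokens):
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        flags[tokens[i]] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return flags


a, b = parse_flags(LLAMA_CPP_CMD), parse_flags(IK_LLAMA_CMD)
only_llama_cpp = {f: v for f, v in a.items() if f not in b}
only_ik_llama = {f: v for f, v in b.items() if f not in a}
changed = {f: (a[f], b[f]) for f in a if f in b and a[f] != b[f]}

print("only in llama.cpp command:   ", only_llama_cpp)
print("only in ik_llama.cpp command:", only_ik_llama)
print("same flag, different value:  ", changed)
```

Everything else (context size, KV cache types, --no-mmap, --mlock, threads, temperature) is identical between the two commands.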

I also want to mention that I'm on CachyOS running my GPU as a secondary GPU, with the monitor plugged into the iGPU, so I can use 100% of the available VRAM.

If you get an "out of memory" (OOM) error while loading the model or working with it, try increasing --fit-margin to 1792 or even 2048.

Cheers :)

llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  -fitt 1536 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -ctkd q8_0 \
  -ctvd q8_0 \
  -ctxcp 64 \
  --no-mmap \
  --mlock \
  --no-warmup \
  --spec-type mtp \
  --spec-draft-n-max 2 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

mtp-bench.py
 code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
 code_cpp           pred=  58 draft=  40 acc=  37 rate=0.925 tok/s=81.8
 explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
 summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=75.4
 qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
 translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=81.9
 creative_short     pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
 stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
 long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2

u/janvitos

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

