u/janvitos

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

Had been getting great MTP performance with llama.cpp on my RTX 4070 Super 12GB, until they actually merged the MTP PR. Then, performance tanked and was barely above non-MTP. So, I decided to try out ik_llama.cpp since it also supports MTP and is apparently better optimized for CPU offloading. I did not expect such a huge speed boost!

Before moving on with the benchmark results, here's my PC specs:

OS: CachyOS with Plasma (X11) - HIGHLY recommended
GPU: RTX 4070 Super 12GB
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I

UPDATED: For comparison, here's the regular llama.cpp mtp-bench.py results with byteshape's recently released Qwen3.6-35B-A3B-IQ4_XS-4.19bpw quant, which has similar accuracy to Unsloth's Q4_K_XL, but is 4GB smaller:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 122 acc= 118 rate=0.967 tok/s=79.8
 code_cpp           pred= 192 draft= 117 acc= 110 rate=0.940 tok/s=89.1
 explain_concept    pred= 192 draft= 124 acc= 113 rate=0.911 tok/s=88.0
 summarize          pred= 192 draft= 139 acc= 127 rate=0.914 tok/s=95.0
 qa_factual         pred= 192 draft= 133 acc= 128 rate=0.962 tok/s=97.0
 translation        pred= 192 draft= 125 acc= 117 rate=0.936 tok/s=91.6
 creative_short     pred= 192 draft= 109 acc=  99 rate=0.908 tok/s=82.1
 stepwise_math      pred= 192 draft= 130 acc= 125 rate=0.962 tok/s=97.0
 long_code_review   pred= 192 draft= 121 acc= 115 rate=0.950 tok/s=88.2

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1728,
 "total_draft": 1120,
 "total_draft_accepted": 1052,
 "aggregate_accept_rate": 0.9393,
 "wall_s_total": 21.86
}

This gives a 89.76 tok/s average.

Here's my llama.cpp launch command. Temperature is set to 0.0 for the benchmark to prevent diverging results between runs:

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit on \
  --fit-target 512 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --spec-type draft-mtp \
  --spec-draft-p-min 0.75 \
  --spec-draft-n-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

Now, here's the benchmark results with the same quant, but running with ik_llama.cpp:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 135 acc= 122 rate=0.904 tok/s=105.1
 code_cpp           pred= 192 draft= 136 acc= 120 rate=0.882 tok/s=110.3
 explain_concept    pred= 192 draft= 133 acc= 116 rate=0.872 tok/s=109.0
 summarize          pred=  56 draft=  38 acc=  37 rate=0.974 tok/s=122.3
 qa_factual         pred= 192 draft= 141 acc= 127 rate=0.901 tok/s=116.0
 translation        pred= 192 draft= 143 acc= 113 rate=0.790 tok/s=104.1
 creative_short     pred= 192 draft= 133 acc= 118 rate=0.887 tok/s=109.4
 stepwise_math      pred= 192 draft= 140 acc= 125 rate=0.893 tok/s=114.6
 long_code_review   pred= 192 draft= 128 acc= 108 rate=0.844 tok/s=101.4

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1592,
 "total_draft": 1127,
 "total_draft_accepted": 986,
 "aggregate_accept_rate": 0.8749,
 "wall_s_total": 16.64
}

That's a 110.24 tok/s average, or 23% increase!

If you want to get similar results on a 12GB RTX GPU, make sure you use the following ik_llama.cpp launch parameters, as they can differ from llama.cpp:

llama-server \
  -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
  --fit \
  --fit-margin 1664 \
  --ctx-size 131072 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --cache-type-k-draft q8_0 \
  --cache-type-v-draft q8_0 \
  --multi-token-prediction \
  --draft-p-min 0.75 \
  --draft-max 3 \
  --no-mmap \
  --mlock \
  --threads 8 \
  --temp 0.0

I also want to mention that I'm on CachyOS running my GPU as a secondary GPU, with the monitor plugged in the iGPU, so I can use 100% of available VRAM.

If you get an "out of memory" (OOM) error while loading the model or working with it, try increasing --fit-margin to 1792 or even 2048.

Cheers :)

u/janvitos — 11 hours ago
▲ 596 r/LocalLLM+1 crossposts

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft acceptance rate on the benchmark found here: https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py

Here's my PC specs:

OS: CachyOS (HIGHLY recommended)
CPU: AMD Ryzen 7 9700X
RAM: 48GB DDR5-6000 EXPO I
GPU: RTX 4070 Super 12GB

Results with other hardware may vary.

To run llama.cpp with MTP support, you need to build it from source and add a draft PR that hasn't yet been merged with the master branch. You can find a very nice guide on how to do that here and also download the Qwen3.6 MTP GGUF: https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF - Thanks u/havenoammo!

llama.cpp command:

llama-server \
  -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
  -fitt 1536 \
  -c 131072 \
  -n 32768 \
  -fa on \
  -np 1 \
  -ctk q8_0 \
  -ctv q8_0 \
  -ctkd q8_0 \
  -ctvd q8_0 \
  -ctxcp 64 \
  --no-mmap \
  --mlock \
  --no-warmup \
  --spec-type mtp \
  --spec-draft-n-max 2 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

The most important parameter here is -fitt 1536. Since part of the model is offloaded to CPU because of its size and , this tells llama.cpp to properly balance the load on the GPU/CPU to get the best possible performance, and leaves 1536 MB of free memory for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged in the iGPU), I can use all the available 12GB VRAM for inference. 1536 might be too small if you use your dGPU as your primary GPU, so test it out first.

You can also try different values for -spec-draft-n-max. I got slightly better tok/sec with 3, but a much better acceptance rate with 2, so the trade off was not worth it. With MTP, you want to maximize speed AND acceptance, so you need to find the best balance between both.

Benchmark results:

mtp-bench.py

code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
code_cpp           pred=  58 draft=  40 acc=  37 rate=0.925 tok/s=81.8
explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=75.4
qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=81.9
creative_short     pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
long_code_review   pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2

If you have any questions, feel free to ask :)

Cheers.

reddit.com
u/janvitos — 12 days ago