A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.

TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.

Hardware

Asus ROG Zephyrus G14 laptop, 2021 model
AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following for all the experiments:

context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
no mmproj (no image input support needed for now)
for more details, see configuration below

The quants tested:

Unsloth UD-IQ4_XS (17.7 GB)
ByteShape CPU-5 aka Q4_K_S-4.22bpw (18.3 GB)

Configuration

My models-preset.ini contents:

version = 1
[Qwen3.6-35B-A3B]
# Unsloth variant
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
# ByteShape variant
# m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
top-k = 20
repeat-penalty = 1.0
presence-penalty = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true

Benchmark results

I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. I ran each quant twice and got nearly identical numbers both times.

	Unsloth	ByteShape	Δ
PP tok/s	585	564	-4%
TG tok/s	25.4	33.1	+30%

Despite being a bit larger, the ByteShape quant is about 30% faster on generation than the Unsloth quant! Its PP speed is slightly lower, though.
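
To see what these rates mean for a whole request, here is a quick back-of-the-envelope sketch. It assumes a 10,000-token prompt and 1,750 generated tokens (the midpoint of my 1.5-2k range) and ignores any overhead besides PP and TG; the function names are made up for illustration.

```python
def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def request_seconds(prompt_tokens: int, gen_tokens: int, pp: float, tg: float) -> float:
    """Approximate wall time: prompt processing plus token generation."""
    return prompt_tokens / pp + gen_tokens / tg


# Measured speeds from the table above (tok/s)
speeds = {
    "Unsloth": {"pp": 585.0, "tg": 25.4},
    "ByteShape": {"pp": 564.0, "tg": 33.1},
}

# Assumed workload (not measured exactly): ~10k prompt tokens, 1750 generated
PROMPT_TOKENS = 10_000
GEN_TOKENS = 1_750

for name, s in speeds.items():
    total = request_seconds(PROMPT_TOKENS, GEN_TOKENS, s["pp"], s["tg"])
    print(f"{name}: {total:.1f} s")

print(f"PP delta: {relative_change(585.0, 564.0):+.1f}%")
print(f"TG delta: {relative_change(25.4, 33.1):+.1f}%")
```

Under these assumptions the whole request takes roughly 86 s with Unsloth and 71 s with ByteShape, because generation dominates the total time and the small PP loss barely matters.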

Observations

Part of the difference may be explained by the quant types: i-quants (IQ) vs K-quants (Q). Unsloth UD-IQ4_XS is an i-quant, and I understand that these are slower to compute on CPU. (Strictly speaking, "imatrix" refers to importance-matrix calibration, which can be applied to K-quants as well, so it is not what separates the two.) A better comparison would be against the ByteShape GPU-5 quant, which is also IQ-based in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post was assembled from 100% biodegradable bytes. No AIs were harmed in the process.

I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model. So I did some experiments to measure performance with and without MTP.

TL;DR: It's not worth it. The prompt processing is so much slower with MTP that it outweighs the minimal gains in TG speeds. However, I did discover a useful VRAM saving trick: using q4_0 quantization for the draft KV cache works just as well as q8_0 and saves a small bit of VRAM.

Hardware

Asus ROG Zephyrus G14 laptop, 2021 model
AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following parameters for all the experiments:

Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (pushing the maximum this system can run; I used the same model file for both MTP and non-MTP runs and only varied the command line arguments, so the MTP part of the model was used only in the MTP runs)
q8_0 quantization for the main KV cache (I don't want to compromise on quality too much)
context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
mmap enabled (it's the only way I can run this model without freezing my machine...)

I varied these parameters:

MTP vs non-MTP (including/omitting MTP specific CLI parameters)
ubatch size: 512, 1024, 1536, 2048
draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
--fit-target set to the lowest value (in steps of 64) that works without OOM errors
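
I searched for the fit-target value by hand, but the procedure can be sketched as a simple loop. This is only an illustration: `runs_without_oom` is a hypothetical callback that would start llama-server with the given value and report whether a test run survived; nothing like it ships with llama.cpp, and the `max_value` limit is arbitrary.

```python
from typing import Callable, Optional


def lowest_working_fit_target(
    runs_without_oom: Callable[[int], bool],
    step: int = 64,
    max_value: int = 2048,
) -> Optional[int]:
    """Return the smallest multiple of `step` (starting from 0) for which the
    hypothetical check `runs_without_oom` succeeds, or None if none does."""
    for value in range(0, max_value + 1, step):
        if runs_without_oom(value):
            return value
    return None


# Stand-in check for demonstration only: pretend anything below 448 hits OOM.
def fake_check(fit_target: int) -> bool:
    return fit_target >= 448


print(lowest_working_fit_target(fake_check))  # 448
```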

Here is an example of a full llama-server command (MTP 1 in the table below):

build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--threads 8 \
-ub 512 \
--parallel 1 \
--fit-target 448 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--spec-type draft-mtp \
--spec-draft-n-max 2

I gave the model two tasks:

MB: Run the mtp-bench.py script to benchmark MTP on a variety of tasks.
S: Summarize a longer document (MTP PR 22673 from github) into a few bullet points. This is a 13448 token prompt followed by 2000-3000 tokens of generation.

Results

This table summarizes the outcome. ub = ubatch size, dKV = draft KV quant type, fitt = fit-target value, acc% = acceptance rate.

Setup	ub	dKV	fitt	MB TG	MB acc%	S PP	S TG	S acc%
No MTP 1	512	-	0	25.0	-	178	23.8	-
No MTP 2	1024	-	0	23.1	-	292	22.5	-
No MTP 3	1536	-	0	24.5	-	299	24.4	-
No MTP 4	2048	-	0	23.0	-	436	26.1	-
MTP 1	512	q8_0	448	27.3	81.5	143	26.1	76.5
MTP 2	1024	q8_0	960	18.7	82.7	138	25.9	72.0
MTP 3	512	q4_0	448	26.4	81.5	139	25.3	73.4
MTP 4	1024	q4_0	960	25.4	82.7	198	23.7	73.7

I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.

Verdict

The baseline "No MTP 4" with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
The "MTP 1" run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was tied with "No MTP 4" on the summarization task TG. Its PP speed was much lower than in any non-MTP setup.
Increasing the ubatch size with MTP improved PP speed only in the "MTP 4" setup (198 tok/s), which also used q4_0 quantization for the draft KV cache; with q8_0 ("MTP 2") PP did not improve at all. Even in "MTP 4", this practically eliminated the benefit in TG speed, while PP was still less than half of the "No MTP 4" baseline.
In short: MTP is not worth it in this setting. Tiny increase in TG for some cases, but always a giant drop in PP speeds. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.
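
To make the trade-off concrete, here is a rough estimate of the total time for the summarization task in each setup. The prompt length (13448 tokens) and the S PP / S TG speeds come from the table above; the generation length of 2500 tokens is an assumption (midpoint of the observed 2000-3000 range), and everything besides PP and TG is ignored.

```python
PROMPT_TOKENS = 13_448
GEN_TOKENS = 2_500  # assumption: midpoint of the 2000-3000 token range

# (S PP, S TG) in tok/s, copied from the results table
results = {
    "No MTP 1": (178, 23.8),
    "No MTP 2": (292, 22.5),
    "No MTP 3": (299, 24.4),
    "No MTP 4": (436, 26.1),
    "MTP 1": (143, 26.1),
    "MTP 2": (138, 25.9),
    "MTP 3": (139, 25.3),
    "MTP 4": (198, 23.7),
}


def total_seconds(pp: float, tg: float) -> float:
    """Prompt processing time plus generation time, in seconds."""
    return PROMPT_TOKENS / pp + GEN_TOKENS / tg


# Print the setups from fastest to slowest
for name in sorted(results, key=lambda n: total_seconds(*results[n])):
    print(f"{name:9s} {total_seconds(*results[name]):6.1f} s")
```

With these assumptions "No MTP 4" finishes in about 127 s, while "MTP 1" needs about 190 s even though both generate at the same 26.1 tok/s.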

Observations

I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on draft model accuracy. This saves a tiny bit of VRAM, so might be a useful trick for very VRAM constrained setups.
There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU temperature throttling etc. Not too bad, but take the numbers with a grain of salt.
VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
The --fit and --fit-target options don't seem to take into account the MTP overhead; you need to reserve some memory for MTP and this amount depends mainly on the ubatch size. Thus you have to set --fit-target manually if you want to squeeze the maximum performance out of your limited VRAM. In my case, setting fit-target to a number a bit less than the ubatch size seemed to work, but YMMV.
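
As a rough illustration of why q4_0 saves memory for the draft KV cache: both formats store blocks of 32 values with one fp16 scale, so the per-value cost is fixed by the block layout. The sketch below computes that ratio. The cache dimensions passed to `kv_cache_bytes` are made-up placeholders, not the real dimensions of the MTP layers in this model, so only the relative saving is meaningful.

```python
# ggml block layouts: each block holds 32 values plus one fp16 (2-byte) scale.
BLOCK_VALUES = 32
BLOCK_BYTES = {
    "q8_0": 2 + 32,  # fp16 scale + 32 int8 values
    "q4_0": 2 + 16,  # fp16 scale + 32 4-bit values packed into 16 bytes
}


def bits_per_value(qtype: str) -> float:
    """Average storage cost per cached value, in bits."""
    return BLOCK_BYTES[qtype] * 8 / BLOCK_VALUES


def kv_cache_bytes(qtype: str, n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Size of K plus V cache for the given (hypothetical) dimensions."""
    values = 2 * n_ctx * n_layers * n_kv_heads * head_dim
    return values * BLOCK_BYTES[qtype] // BLOCK_VALUES


# Placeholder dimensions for illustration only (NOT taken from the model)
dims = dict(n_ctx=65536, n_layers=1, n_kv_heads=2, head_dim=256)

q8 = kv_cache_bytes("q8_0", **dims)
q4 = kv_cache_bytes("q4_0", **dims)
print(f"q8_0: {bits_per_value('q8_0')} bits/value, {q8 / 2**20:.0f} MiB")
print(f"q4_0: {bits_per_value('q4_0')} bits/value, {q4 / 2**20:.0f} MiB")
print(f"saving: {(1 - q4 / q8) * 100:.0f}%")
```

Whatever the real dimensions are, q4_0 needs 4.5 bits per value against 8.5 for q8_0, so the draft KV cache shrinks by a bit under half.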

Notes

This post was constructed from 100% organic ingredients. No AIs were harmed in the process.

My second post here. Happy to answer any questions.

u/OsmanthusBloom

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it
