u/OsmanthusBloom

ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop

A few days ago I posted about my experiments with MTP on a 6GB VRAM laptop. That didn't work so well; CPU offload hurts MTP performance badly. But now I've tried out the new ByteShape quants for Qwen3.6-35B-A3B that are claimed to be both smaller and faster than others while still having excellent quality. I decided to compare my previous best Unsloth UD-IQ4_XS setup head-to-head with the ByteShape "CPU-5" quant in terms of performance.

TL;DR: ByteShape quant is 30% faster on TG but slightly slower on PP than the similarly sized Unsloth quant when partially offloaded to CPU on a 6GB VRAM laptop.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9203 (87589042c) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following for all the experiments:

  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • mmap off, mlock on, ubatch size 2048 (gives much better PP speed than the default 512)
  • no mmproj (no image input support needed for now)
  • for more details, see configuration below

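The quants tested:

  • Unsloth UD-IQ4_XS (Qwen3.6-35B-A3B-UD-IQ4_XS.gguf)
  • ByteShape "CPU-5" (Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf)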

Configuration

My models-preset.ini contents:

version = 1
[Qwen3.6-35B-A3B]
# Unsloth variant
m = /proj/llms/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
# ByteShape variant
# m = /proj/llms/Qwen3.6-35B-A3B-Q4_K_S-4.22bpw.gguf
fit = true
fit-target = 64
c = 65536
chat-template-kwargs = {"preserve_thinking": true}
temp = 0.6
top-p = 0.95
min-p = 0.0
top-k = 20
repeat-penalty = 1.0
presence-penalty = 0.0
ctx-checkpoints = 64
flash-attn = on
b = 2048
ub = 2048
jinja = true
ctk = q8_0
ctv = q8_0
threads = 6
parallel = 1
cache-ram = 4096
mmap = false
mlock = true

Benchmark results

I used a test prompt of approx. 10k tokens, followed by 1.5-2k tokens of generation. I ran each quant twice and got pretty much exactly the same numbers both times.

Metric     Unsloth   ByteShape   Δ
PP tok/s   585       564         -4%
TG tok/s   25.4      33.1        +30%

The ByteShape quant, despite being a bit larger than Unsloth, is over 30% faster on generation than the Unsloth quant! PP speed is slightly lower for ByteShape though.
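
To see how the PP and TG numbers combine, here is a rough back-of-the-envelope sketch in Python. It assumes a workload of exactly 10,000 prompt tokens and 2,000 generated tokens (an approximation of the test described above) and ignores model loading and any other overhead. The function names are made up for illustration.

```python
def rel_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def total_seconds(prompt_tokens: int, gen_tokens: int, pp: float, tg: float) -> float:
    """Estimated wall time: prompt processing time plus generation time."""
    return prompt_tokens / pp + gen_tokens / tg


# Measured speeds from the table above (tok/s)
unsloth = {"pp": 585.0, "tg": 25.4}
byteshape = {"pp": 564.0, "tg": 33.1}

# Assumed workload (not an exact measurement)
PROMPT_TOKENS, GEN_TOKENS = 10_000, 2_000

t_unsloth = total_seconds(PROMPT_TOKENS, GEN_TOKENS, **unsloth)
t_byteshape = total_seconds(PROMPT_TOKENS, GEN_TOKENS, **byteshape)

print(f"PP delta: {rel_change(unsloth['pp'], byteshape['pp']):+.1f}%")
print(f"TG delta: {rel_change(unsloth['tg'], byteshape['tg']):+.1f}%")
print(f"Unsloth total:   {t_unsloth:.1f} s")
print(f"ByteShape total: {t_byteshape:.1f} s")
```

Under these assumptions the whole request takes roughly 96 s with Unsloth versus 78 s with ByteShape, because generation dominates the wall time and the small PP loss barely registers.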

Observations

  • Part of the difference may be explained by I-quants (IQ) vs K-quants (Q). Unsloth UD-IQ4_XS is an I-quant, and I understand that these are slower to compute on CPU. A better comparison would be against the ByteShape GPU-5 quant, which is also an I-quant in my understanding. But I wanted an upgrade over UD-IQ4_XS and definitely got it!
  • I noticed that my TG performance seems to degrade over time by ~10% or more without changing the setup. I suspect suspending and then awakening the laptop repeatedly somehow hurts, but I haven't figured out the reason; it's not just memory pressure building up AFAICT. Rebooting the machine brings me the best performance, so I did that before benchmarking.
  • I haven't made any detailed quality measurements between the models. The ByteShape model seems very similar; possibly the thinking output is generally somewhat shorter than with Unsloth, but that could be a measurement error. I hope that someone does an independent comparison between ByteShape and other quants in terms of output quality, because their claims seem to be a bit too good to be true!

Notes

This post assembled from 100% biodegradable bytes. No AIs were harmed in the process.

u/OsmanthusBloom — 10 hours ago

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it

I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM constrained device for the Qwen3.6-35B-A3B model. So I did some experiments to measure performance with and without MTP.

TL;DR: It's not worth it. The prompt processing is so much slower with MTP that it outweighs the minimal gains in TG speeds. However, I did discover a useful VRAM saving trick: using q4_0 quantization for the draft KV cache works just as well as q8_0 and saves a small bit of VRAM.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following parameters for all the experiments:

  • Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (pushing the maximum this system can run; I used the same model file for both MTP and non-MTP runs, just varying the command line arguments, so the MTP part of the model went unused in the non-MTP runs)
  • q8_0 quantization for the main KV cache (I don't want to compromise on quality too much)
  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
  • mmap enabled (it's the only way I can run this model without freezing my machine...)

I varied these parameters:

  • MTP vs non-MTP (including/omitting MTP specific CLI parameters)
  • ubatch size: 512, 1024, 1536, 2048
  • draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
  • --fit-target set to the lowest value (in steps of 64) that works without OOM errors

Here is an example of a full llama-server command (MTP 1 in the table below):

build/bin/llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
--threads 8 \
-ub 512 \
--parallel 1 \
--fit-target 448 \
-c 65536 \
-ctk q8_0 \
-ctv q8_0 \
-ctkd q8_0 \
-ctvd q8_0 \
--chat-template-kwargs '{"preserve_thinking": true}' \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
--repeat-penalty 1.0 \
--presence-penalty 0.0 \
--spec-type draft-mtp \
--spec-draft-n-max 2

I gave the model two tasks:

  1. MB: Run the mtp-bench.py script to benchmark MTP on a variety of tasks.
  2. S: Summarize a longer document (MTP PR 22673 from GitHub) into a few bullet points. This is a 13448 token prompt followed by 2000-3000 tokens of generation.

Results

This table summarizes the outcome. ub = ubatch size, dKV = draft KV quant type, fitt = fit-target value, acc% = acceptance rate.

Setup      ub     dKV    fitt   MB TG   MB acc%   S PP   S TG   S acc%
No MTP 1   512    -      0      25.0    -         178    23.8   -
No MTP 2   1024   -      0      23.1    -         292    22.5   -
No MTP 3   1536   -      0      24.5    -         299    24.4   -
No MTP 4   2048   -      0      23.0    -         436    26.1   -
MTP 1      512    q8_0   448    27.3    81.5      143    26.1   76.5
MTP 2      1024   q8_0   960    18.7    82.7      138    25.9   72.0
MTP 3      512    q4_0   448    26.4    81.5      139    25.3   73.4
MTP 4      1024   q4_0   960    25.4    82.7      198    23.7   73.7

I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.

Verdict

  • The baseline "No MTP 4" with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
  • The "MTP 1" run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was tied with "No MTP 4" on the summarization task TG. PP speeds were much lower than any non-MTP setups.
  • Increasing ubatch size in MTP can improve PP speeds a bit, especially in the "MTP 4" setup which also used q4_0 quantization for the draft KV cache. But this practically eliminated the benefit in TG speeds while still more than halving PP speeds.
  • In short: MTP is not worth it in this setting. Tiny increase in TG for some cases, but always a giant drop in PP speeds. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.
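
One way to put a number on this trade-off is to ask how many tokens would have to be generated before a faster TG pays back a slower PP. The Python sketch below does that for "No MTP 4" versus "MTP 1" with the 13448 token summarization prompt. The optimistic case borrows the TG speeds from the mtp-bench column even though PP was only measured on the summarization task, so it mixes measurements from two different tasks and should be read as a rough estimate only. The function name is made up for illustration.

```python
import math


def break_even_gen_tokens(prompt_tokens: int,
                          pp_base: float, tg_base: float,
                          pp_alt: float, tg_alt: float) -> float:
    """Generated tokens needed before the alternative setup catches up.

    Returns math.inf if the alternative is not faster at generation,
    and 0.0 if it is not slower at prompt processing either.
    """
    extra_pp_seconds = prompt_tokens / pp_alt - prompt_tokens / pp_base
    saved_per_token = 1.0 / tg_base - 1.0 / tg_alt
    if saved_per_token <= 0:
        return math.inf
    return max(0.0, extra_pp_seconds / saved_per_token)


PROMPT_TOKENS = 13448  # summarization task prompt

# Summarization numbers only: TG is tied at 26.1 tok/s, so MTP never catches up.
same_task = break_even_gen_tokens(PROMPT_TOKENS, 436, 26.1, 143, 26.1)

# Optimistic for MTP: summarization PP combined with mtp-bench TG (23.0 vs 27.3).
optimistic = break_even_gen_tokens(PROMPT_TOKENS, 436, 23.0, 143, 27.3)

print(f"same task:  {same_task}")
print(f"optimistic: {optimistic:.0f} tokens")
```

Even in the optimistic case the break-even point lands at roughly 9,200 generated tokens, several times more than the 2000-3000 tokens the summarization task actually produced.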

Observations

  • I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on draft model accuracy. This saves a tiny bit of VRAM, so might be a useful trick for very VRAM constrained setups.
  • There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU temperature throttling etc. Not too bad, but take with a grain of salt.
  • VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
  • The --fit and --fit-target options don't seem to take into account the MTP overhead; you need to reserve some memory for MTP and this amount depends mainly on the ubatch size. Thus you have to set --fit-target manually if you want to squeeze the maximum performance out of your limited VRAM. In my case, setting fit-target to a number a bit less than the ubatch size seemed to work, but YMMV.
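
The fit-target rule of thumb can be written down as a small Python helper. This is only a sketch of the heuristic as it appears in the table (448 for ubatch 512, 960 for ubatch 1024), so it rests on two data points from one machine. The function names and the idea of stepping upwards by 64 after an OOM are assumptions for illustration, not anything llama.cpp provides.

```python
def initial_fit_target(ubatch: int, step: int = 64) -> int:
    """Starting guess for --fit-target with MTP: one step below the ubatch size."""
    return max(0, ubatch - step)


def candidate_fit_targets(ubatch: int, step: int = 64, tries: int = 4) -> list[int]:
    """Values to try in ascending order if the first guess still hits OOM."""
    start = initial_fit_target(ubatch, step)
    return [start + i * step for i in range(tries)]


for ub in (512, 1024):
    print(ub, candidate_fit_targets(ub))
```

The first candidate for each ubatch size matches the value that worked in the MTP runs above; the later candidates are simply the next values to try if the first one still runs out of memory.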

Notes

This post was constructed from 100% organic ingredients. No AIs were harmed in the process.

My second post here. Happy to answer any questions.

u/OsmanthusBloom — 6 days ago