llama.cpp MTP on Strix Halo: Qwen3.6 27B Q8 hits 2.44×, MoE 1.40×
MTP support landed in mainline llama.cpp on May 16 (PR #22673, commit 4f13cb7). Ran it on a Framework Desktop Strix Halo with ROCm 7.0.2.
Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:
- Q4_K_M: 11.7 → 21.2 tok/s (1.81×, n=3)
- Q8_0: 7.4 → 18.1 tok/s (2.44×, n=3)
Qwen3.6 35B-A3B (MoE), same harness:
- 49.5 → 69.4 tok/s (1.40×, n=3)
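The multipliers above follow directly from the tok/s pairs. A quick check in Python, using only the numbers quoted in the lists; since those are rounded to one decimal, a recomputed ratio can differ from the reported one in the last digit:

```python
# Reported (baseline, MTP) throughput in tok/s, copied from the results above.
results = {
    "27B Q4_K_M": (11.7, 21.2),
    "27B Q8_0": (7.4, 18.1),
    "35B-A3B MoE": (49.5, 69.4),
}


def speedup(baseline: float, mtp: float) -> float:
    """Ratio of MTP throughput to baseline throughput."""
    return mtp / baseline


speedups = {name: speedup(*pair) for name, pair in results.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.3f}x")
```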
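The Q8 gain is bigger than the Q4 gain because the Q8 baseline is the more bandwidth-bound of the two on the 215 GB/s LPDDR5X: every decoded token has to stream nearly twice as many weight bytes. MTP replaces N decode steps with one heavier forward pass, and the extra compute in that pass fits inside the time already spent waiting on memory, so each read of the weights is reused across several generated tokens instead of one.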
The MoE gain is smaller because only ~3B of 35B params run per token. Each forward pass is already cheap, so saving N-1 of them is a smaller win.
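The bandwidth argument for the dense model can be sanity-checked with a back-of-envelope ceiling: if every generated token streams all weights once, tok/s cannot exceed bandwidth divided by weight size. This is a rough sketch, not a measurement. The bits-per-weight values (about 8.5 for Q8_0, about 4.85 for Q4_K_M) and the exact 27e9 parameter count are assumptions, and KV-cache and activation traffic are ignored.

```python
# Back-of-envelope decode ceiling for a dense model: each generated token
# streams all weights once, so tok/s <= bandwidth / weight_bytes.
BANDWIDTH_GBPS = 215.0  # figure quoted in the post
PARAMS = 27e9           # assumption: nominal 27B parameter count

# Assumption: approximate average bits per weight for each quant type.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q8_0": 8.5}
MEASURED_BASELINE = {"Q4_K_M": 11.7, "Q8_0": 7.4}  # tok/s from the post


def decode_ceiling(params: float, bits_per_weight: float,
                   bandwidth_gbps: float) -> float:
    """Upper bound on tok/s when weight streaming is the only cost."""
    weight_gb = params * bits_per_weight / 8 / 1e9
    return bandwidth_gbps / weight_gb


ceilings = {
    quant: decode_ceiling(PARAMS, bits, BANDWIDTH_GBPS)
    for quant, bits in BITS_PER_WEIGHT.items()
}
for quant, ceiling in ceilings.items():
    measured = MEASURED_BASELINE[quant]
    print(f"{quant}: ceiling ~{ceiling:.1f} tok/s, measured {measured} tok/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions the ceilings come out at roughly 7.5 tok/s for Q8 and 13 tok/s for Q4, so the Q8 baseline sits closer to the streaming limit, which is consistent with it gaining the most from spreading each weight read over several tokens.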
Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.
Writeup with build commands and per-shape tables (chat, rag, codegen, agent c=4): https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo
Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs
Build: llama.cpp 4f13cb7, ROCm 7.0.2, llama-server with --device Vulkan0 --split-mode none --main-gpu 0 to pin to the iGPU (the Strix APU enumerates as Vulkan0; CPU is Vulkan1 and shouldn't get any layers).
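For anyone scripting runs, here is a minimal sketch that assembles the launch command from the flags quoted in this post. The model path is hypothetical and the draft length of 3 is only an illustrative value; the helper name llama_server_cmd is made up for this example. It prints the command rather than running it.

```python
import shlex


def llama_server_cmd(model_path: str, draft_n_max: int) -> list[str]:
    """Build a llama-server argv from the flags quoted in the post."""
    return [
        "llama-server",
        "-m", model_path,                        # hypothetical model path
        "--spec-type", "draft-mtp",              # enable MTP drafting
        "--spec-draft-n-max", str(draft_n_max),  # N from the post
        "--device", "Vulkan0",                   # pin to the iGPU
        "--split-mode", "none",
        "--main-gpu", "0",
    ]


# Illustrative values only; substitute your own GGUF path and N.
cmd = llama_server_cmd("models/qwen3.6-27b-q8_0.gguf", draft_n_max=3)
print(shlex.join(cmd))
```

Passing the list straight to subprocess.run(cmd) avoids shell quoting issues if the model path contains spaces.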