MTP in llama.cpp (PR #22673) tested on AMD Strix Halo: Qwen 3.6 35B-A3B hits 71 t/s short / 48 t/s at 62K via Vulkan RADV
llama.cpp merged PR #22673 last week, adding MTP (multi-token prediction) support. Three days later unsloth shipped Qwen 3.6 35B-A3B-MTP-GGUF. Today I swapped the model behind the vision endpoint on my Strix Halo box. Sharing because the numbers surprised me.
Same hardware. Measurements:
Gemma 4 26B-A4B Q8 (before):
- 41 t/s short ctx
- 36 t/s at 22K
- 66 MiB KV per 1K tokens (SWA)
- 96K practical ceiling
Qwen 3.6 35B-A3B Q6_K + MTP-2 (now):
- 71 t/s short
- 48 t/s sustained at 62K (2200+ tokens in one decode)
- 2 MiB KV per 1K (Gated DeltaNet, linear attention in select layers)
- Running native 256K ctx, nowhere near hitting the memory wall
- MTP accept rate 86% average, peak 96.7%
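For a rough sense of what an 86% accept rate buys with two draft tokens, here is a minimal sketch of the usual speculative-decoding estimate. It assumes the accept rate is per draft token, that acceptances are independent, and that drafting stops at the first rejection. None of that is confirmed for llama.cpp's MTP implementation, and the function name is my own.

```python
def expected_tokens_per_step(accept_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification step (simplified model).

    One token always comes from the main model (the k = 0 term). Draft
    token k only counts if drafts 1..k were all accepted, which under the
    independence assumption has probability accept_rate ** k.
    """
    return sum(accept_rate ** k for k in range(draft_tokens + 1))


avg = expected_tokens_per_step(0.86, 2)    # average accept rate, MTP-2
peak = expected_tokens_per_step(0.967, 2)  # peak accept rate, MTP-2
print(f"avg: {avg:.2f} tokens/step, peak: {peak:.2f} tokens/step")
```

Under those assumptions the average comes out to about 2.6 tokens per step. Treat that as an upper bound on the speedup, since drafting and verification are not free.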
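Generation is about 73% faster at short context (41 to 71 t/s), and Qwen at 62K is still a third faster than Gemma was at 22K (48 vs 36 t/s). KV cache is roughly 33x more compact going by the per-1K figures above (66 vs 2 MiB). Multimodal still works (mmproj-F16 is in the same repo), tool calling works, thinking mode works. Nothing to build manually: just the stock kyuz0/amd-strix-halo-toolboxes:vulkan-radv image with llama.cpp master.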
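A quick way to check the ratios against the measurements listed above, as a small sketch. The helper names are hypothetical, and the KV totals assume the cache grows linearly with context, which probably overstates Gemma at long context because sliding-window layers cap their cache.

```python
def pct_gain(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def kv_mib(mib_per_1k: float, ctx_k: int) -> float:
    """Total KV cache in MiB at ctx_k thousand tokens (linear-growth assumption)."""
    return mib_per_1k * ctx_k


short_gain = pct_gain(41, 71)     # Gemma vs Qwen, short context
kv_ratio = 66 / 2                 # MiB per 1K tokens, Gemma vs Qwen
gemma_kv_96k = kv_mib(66, 96)     # Gemma at its 96K practical ceiling
qwen_kv_256k = kv_mib(2, 256)     # Qwen at native 256K

print(f"short-context gain: {short_gain:.0f}%")
print(f"KV per 1K tokens: {kv_ratio:.0f}x smaller")
print(f"Gemma KV at 96K: {gemma_kv_96k:.0f} MiB, Qwen KV at 256K: {qwen_kv_256k:.0f} MiB")
```

On these figures Qwen's full 256K cache is about 512 MiB, well under what Gemma needed at 96K, which is why the memory wall is nowhere in sight on a 128 GB box.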
Hardware: AMD Ryzen AI Max+ 395, 128 GB UMA, Radeon 8060S gfx1151, Vulkan RADV backend.
The actual surprise was DeltaNet, not MTP. I assumed MTP was doing all the heavy lifting, but at long context most of the win comes from DeltaNet. Gemma's SWA falls off a cliff past 30K. Qwen degrades much more gently: at 62K it loses about a third of its short-context speed (71 to 48 t/s), not half.
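The "about a third" figure can be reproduced from the measurements above with a short sketch (the helper name is hypothetical). Note that the only long-context Gemma number in this post is at 22K, before the cliff, so the Gemma line is not a like-for-like comparison at 62K.

```python
def speed_lost(short_tps: float, long_tps: float) -> float:
    """Fraction of short-context throughput lost at long context."""
    return 1.0 - long_tps / short_tps


qwen_drop = speed_lost(71, 48)   # Qwen: short context vs 62K
gemma_drop = speed_lost(41, 36)  # Gemma: short context vs 22K only

print(f"Qwen at 62K:  -{qwen_drop:.0%}")
print(f"Gemma at 22K: -{gemma_drop:.0%}")
```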
#LocalLLM #StrixHalo #LlamaCpp #Qwen