u/voStragaIT

llama.cpp merged PR #22673 last week, adding MTP (multi-token prediction) support. Three days later Unsloth shipped Qwen 3.6 35B-A3B-MTP-GGUF. Today I swapped the model behind the vision endpoint on my Strix Halo box. Sharing because the numbers honestly surprised me.

Same hardware. Measurements:

Gemma 4 26B-A4B Q8 (before):

- 41 t/s short ctx

- 36 t/s at 22K

- 66 MiB KV per 1K tokens (SWA)

- 96K practical ceiling

Qwen 3.6 35B-A3B Q6_K + MTP-2 (now):

- 71 t/s short

- 48 t/s sustained at 62K (2200+ tokens in one decode)

- 2 MiB KV per 1K (Gated DeltaNet, linear attention in select layers)

- Running native 256K ctx, nowhere near hitting the memory wall

- MTP accept rate 86% average, peak 96.7%
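To put the accept rate in perspective, here is a rough back-of-the-envelope sketch of how many tokens one verification pass yields. Assumptions that are mine, not measured: "MTP-2" means 2 draft tokens per step, the accept rate is a per-token probability that is independent across positions, and drafting is free. The function name is made up for illustration, and llama.cpp may define its accept rate differently.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    `accept_rate`, and that the main model always contributes one token.
    That gives the geometric series 1 + p + p^2 + ... + p^k.
    """
    return sum(accept_rate ** i for i in range(draft_len + 1))


if __name__ == "__main__":
    # Average and peak accept rates reported in the post, 2 draft tokens assumed.
    for p in (0.86, 0.967):
        tokens = expected_tokens_per_step(p, draft_len=2)
        print(f"accept={p:.3f}  tokens/step={tokens:.2f}")
```

Under those assumptions the ceiling is about 2.6 tokens per pass at 86%. Real speedup is lower because the draft heads and the wider verify batch are not free.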

Generation is 60-90% faster. KV cache is about 33x more compact going by the per-1K figures above (66 vs 2 MiB). Multimodal still works (mmproj-F16 is in the same repo), tool calling works, thinking mode works. Nothing to build manually: just the stock kyuz0/amd-strix-halo-toolboxes:vulkan-radv image with llama.cpp master.
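For anyone who wants to turn the per-1K KV figures into a memory budget, a minimal sketch follows. It assumes "1K" means 1024 tokens and that the cache grows linearly with context, which is a simplification (SWA layers in particular may cap out at their window size). The helper name is hypothetical.

```python
def kv_cache_gib(mib_per_1k_tokens: float, context_tokens: int) -> float:
    """Estimate total KV cache size in GiB.

    Assumes linear growth with context length and 1K = 1024 tokens.
    """
    return mib_per_1k_tokens * (context_tokens / 1024) / 1024


if __name__ == "__main__":
    # Per-1K figures and context sizes taken from the measurements in the post.
    gemma = kv_cache_gib(66, 96 * 1024)   # Gemma 4 at its 96K practical ceiling
    qwen = kv_cache_gib(2, 256 * 1024)    # Qwen 3.6 at native 256K
    print(f"Gemma 4 @ 96K:   {gemma:.2f} GiB")
    print(f"Qwen 3.6 @ 256K: {qwen:.2f} GiB")
```

With these inputs Gemma needs roughly 6.2 GiB of KV at 96K while Qwen needs about 0.5 GiB at 256K, which is why the 128 GB UMA pool is nowhere near full.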

Hardware: AMD Ryzen AI Max+ 395, 128 GB UMA, Radeon 8060S gfx1151, Vulkan RADV backend.

The actual surprise was DeltaNet, not MTP. I assumed MTP was doing all the heavy lifting, but at long context most of the win comes from DeltaNet. Gemma's SWA falls off a cliff past 30K. Qwen degrades much more gently: at 62K it loses about a third (71 to 48 t/s), not half.
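The percentages are easy to re-derive from the t/s figures above. A tiny sketch, with one caveat: the two models were measured at different long-context points (22K vs 62K), so the slowdown rows are not an apples-to-apples comparison. The function name is just for illustration.

```python
def pct_change(before_tps: float, after_tps: float) -> float:
    """Relative change in throughput, in percent (negative means slower)."""
    return (after_tps - before_tps) / before_tps * 100


if __name__ == "__main__":
    # All t/s values are the ones reported in the post.
    print(f"Qwen short -> 62K:     {pct_change(71, 48):+.1f}%")
    print(f"Gemma short -> 22K:    {pct_change(41, 36):+.1f}%")
    print(f"Gemma -> Qwen (short): {pct_change(41, 71):+.1f}%")
```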

#LocalLLM #StrixHalo #LlamaCpp #Qwen

MTP in llama.cpp (PR #22673) tested on AMD Strix Halo: Qwen 3.6 35B-A3B hits 71 t/s short / 48 t/s at 62K via Vulkan RADV