u/C_Coffie

llama.cpp MTP on Strix Halo: Qwen3.6 27B Q8 hits 2.44×, MoE 1.40×

MTP support landed in mainline llama.cpp on May 16 (PR #22673, commit 4f13cb7). Ran it on a Framework Desktop Strix Halo with ROCm 7.0.2.

Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs (n is the --spec-draft-n-max value used):

  • Q4_K_M: 11.7 → 21.2 tok/s (1.81×, n=3)
  • Q8_0: 7.4 → 18.1 tok/s (2.44×, n=3)

Qwen3.6 35B-A3B (MoE), same harness:

  • 49.5 → 69.4 tok/s (1.40×, n=3)

The Q8 gain is bigger than Q4 because baseline Q8 decode is bandwidth-bound on the 215 GB/s LPDDR5X. MTP turns N decode steps into one heavier forward pass, so the weights are streamed from memory once for several tokens rather than once per token.
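
A rough sanity check on the bandwidth-bound claim: if every generated token has to stream the full set of weights from memory once, the decode ceiling is bandwidth divided by weight bytes. The sketch below is an assumption-heavy estimate, not a measurement. It assumes a nominal 27e9 parameters, 8.5 bits per weight for Q8_0, and roughly 4.85 bits per weight for Q4_K_M (an approximation, since the real file mixes quant types), and it ignores KV-cache traffic.

```python
def decode_ceiling_tok_s(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode tok/s if each token reads all weights once."""
    weight_gb = params_billion * bits_per_weight / 8  # GB of weights
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 215  # figure quoted in the post

# Bits-per-weight values are assumptions, see the note above.
q8_ceiling = decode_ceiling_tok_s(27, 8.5, BANDWIDTH_GB_S)
q4_ceiling = decode_ceiling_tok_s(27, 4.85, BANDWIDTH_GB_S)

print(f"Q8_0 ceiling:   {q8_ceiling:.1f} tok/s (measured baseline 7.4)")
print(f"Q4_K_M ceiling: {q4_ceiling:.1f} tok/s (measured baseline 11.7)")
```

Under these assumptions the Q8 baseline sits very close to its ceiling, which fits the idea that Q8 has the most to gain from sharing each weight read across several tokens.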

The MoE gain is smaller because only ~3B of 35B params run per token. Each forward pass is already cheap, so saving N-1 of them is a smaller win.

Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.

Writeup with build commands and per-shape tables (chat, rag, codegen, agent c=4): https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo

Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs

Build: llama.cpp 4f13cb7, ROCm 7.0.2, llama-server with --device Vulkan0 --split-mode none --main-gpu 0 to pin to the iGPU (the Strix APU enumerates as Vulkan0; CPU is Vulkan1 and shouldn't get any layers).
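
For anyone scripting this, here is a minimal sketch that assembles the llama-server command line from the flags quoted above. The model path is a hypothetical placeholder, and the helper name is made up for illustration; only the flags themselves come from the post.

```python
import shlex


def llama_server_argv(model_path, draft_n_max, device="Vulkan0"):
    """Build a llama-server command line using the flags quoted in the post."""
    return [
        "llama-server",
        "-m", model_path,  # hypothetical path, point this at your own GGUF
        "--device", device,
        "--split-mode", "none",
        "--main-gpu", "0",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(draft_n_max),
    ]


argv = llama_server_argv("models/qwen3.6-27b-q8_0.gguf", draft_n_max=3)
print(shlex.join(argv))  # copy-pasteable shell command
```

Keeping the arguments as a list makes it easy to sweep draft_n_max from a benchmark loop and pass the result to subprocess.run.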

reddit.com
u/C_Coffie — 4 days ago

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig

PR #22673 (commit 4f13cb7) landed MTP speculative decoding in mainline llama.cpp on May 16. I tested it on two separate rigs.

Qwen3.6 27B, single-stream chat, temperature 0, median of 5 runs:

Strix Halo (Framework Desktop, ROCm 7.0.2):

  • Q4_K_M: 11.7 → 21.2 tok/s (1.81×)
  • Q8_0: 7.4 → 18.1 tok/s (2.44×)

Single RTX 3090 @ 450W (CUDA 12.9, driver 590.26):

  • Q4_K_M: 38.7 → 59.5 tok/s (1.54×, n=2)

Dual RTX 3090, layer-split:

  • Q8_0: 25.7 → 55.9 tok/s (2.17×, n=3)

Qwen3.6 35B-A3B (MoE):

  • Strix Halo: 49.5 → 69.4 tok/s (1.40×)
  • 3090: 120.0 → 148.3 tok/s (1.24×)
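
The multipliers are just MTP tok/s divided by baseline tok/s. A quick sketch that recomputes them from the figures above; small differences in the last digit are expected because the published tok/s values are already rounded.

```python
# (baseline tok/s, MTP tok/s, multiplier as reported in the post)
results = {
    "Strix Halo 27B Q4_K_M": (11.7, 21.2, 1.81),
    "Strix Halo 27B Q8_0": (7.4, 18.1, 2.44),
    "3090 450W 27B Q4_K_M": (38.7, 59.5, 1.54),
    "Dual 3090 27B Q8_0": (25.7, 55.9, 2.17),
    "Strix Halo 35B-A3B": (49.5, 69.4, 1.40),
    "3090 35B-A3B": (120.0, 148.3, 1.24),
}


def speedup(baseline, mtp):
    """MTP throughput as a multiple of baseline throughput."""
    return mtp / baseline


for name, (base, mtp, reported) in results.items():
    ratio = speedup(base, mtp)
    print(f"{name:24s} {ratio:.3f}x (post: {reported:.2f}x)")
```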

Enable with --spec-type draft-mtp --spec-draft-n-max N. Output is byte-identical to baseline at the same seed and temperature.

MTP helps MoE less because only ~3B of 35B params run per token — each forward pass is already cheap, so saving N-1 of them is a smaller win. Sweet-spot N also depends on the rig: uncapped 3090 prefers n=2 at Q4, capped 3090 and Strix Halo prefer n=3.
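
One way to think about why the sweet-spot N moves between rigs is a toy model, sketched below. It is not how llama.cpp chooses anything and it is not fitted to these benchmarks. It assumes each drafted token is accepted independently with probability accept_prob, and that each extra drafted token adds a fixed fraction cost_per_draft of a plain decode step to the pass. Both knobs are hypothetical.

```python
def expected_tokens(accept_prob, n_draft):
    """Expected tokens emitted per verify pass with n_draft drafted tokens."""
    if accept_prob >= 1.0:
        return n_draft + 1.0
    # Geometric series: 1 + p + p^2 + ... + p^n
    return (1 - accept_prob ** (n_draft + 1)) / (1 - accept_prob)


def modeled_speedup(accept_prob, n_draft, cost_per_draft):
    """Toy speedup: tokens per pass divided by relative cost of the pass."""
    return expected_tokens(accept_prob, n_draft) / (1 + n_draft * cost_per_draft)


def best_n(accept_prob, cost_per_draft, n_max=8):
    """Draft length in 1..n_max that maximizes the modeled speedup."""
    return max(
        range(1, n_max + 1),
        key=lambda n: modeled_speedup(accept_prob, n, cost_per_draft),
    )


# Hypothetical values: same acceptance rate, different per-draft cost.
for cost in (0.05, 0.15, 0.30):
    n = best_n(0.8, cost)
    print(f"cost/draft={cost:.2f}: best n={n}, "
          f"speedup={modeled_speedup(0.8, n, cost):.2f}x")
```

In this model, the more each extra drafted token costs relative to a plain decode step, the earlier the curve peaks, so a rig where the extra compute is less hidden behind memory traffic would prefer a smaller N.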

Couple of follow-ups from the last thread:

  • The 3090 numbers in my earlier post were undercut by an undisclosed 200W cap (breaker-popping issue with 4 cards on one circuit). I re-benched 26 of the 3090 models at 350W and 450W; dense 27-32B models gained +70 to +113%. Writeup with the curve and full table: https://calebcoffie.com/blog/how-much-do-power-limits-affect-llm-benchmark-tok-s
  • Prompt-processing tok/s and prompt-token columns are now on every row of the benchmarks page.

MTP writeup with both rigs side-by-side, build commands, and per-shape tables: https://calebcoffie.com/blog/benchmarking-llama-cpp-mtp-on-strix-halo

Raw YAML per run: https://github.com/CCoffie/CalebCoffie.com/tree/main/content/benchmarks/runs

u/C_Coffie — 4 days ago

Strix Halo inference benchmarks across 20 models on llama.cpp ROCm, Vulkan, and CPU — side-by-side with a 3090 and 5070

I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.

The dataset: 55 runs, three rigs, five backends (rocm, vulkan, cpu, cuda, vllm-cuda), models from 0.35B (LFM2.5) through 35B-A3B (Qwen3.5 MoE). Workloads: short-prompt chat, long-context RAG, codegen long-output, and an agent shape at concurrency 1 and 4. Three measured iterations after one warmup, temperature 0, VRAM-fit verified before each run.
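
The measurement loop described above (one warmup, three measured iterations) can be sketched roughly as follows. This is not the actual harness: run_once is a hypothetical callable that returns tokens generated and elapsed seconds, and taking the median is an assumption about how the iterations get aggregated.

```python
import statistics


def measure(run_once, warmup=1, iterations=3):
    """Discard warmup runs, then return the median tok/s of the measured runs.

    run_once is a hypothetical callable returning (tokens_generated, seconds).
    """
    for _ in range(warmup):
        run_once()  # result intentionally thrown away
    rates = []
    for _ in range(iterations):
        tokens, seconds = run_once()
        rates.append(tokens / seconds)
    return statistics.median(rates)


# Stand-in workload with canned timings instead of a real server call.
canned = iter([(256, 40.0), (256, 22.0), (256, 21.0), (256, 23.0)])
print(f"{measure(lambda: next(canned)):.1f} tok/s")
```

The slow first entry plays the role of the warmup run (cold caches, kernel compilation) and never reaches the reported number.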

A few patterns from the data:

Memory bandwidth runs the show for decode. The RTX 5070 (12 GiB GDDR7, Vulkan) actually beats the RTX 3090 (24 GiB GDDR6X, CUDA) on every model that fits in 12 GiB:

Gemma-3-4b      chat:   5070 = 156.6  vs  3090 = 142.0   tok/s
Gemma-4-E4B     chat:   5070 = 124.3  vs  3090 = 118.4   tok/s
LFM2-8B-A1B     chat:   5070 = 336.1  vs  3090 = 318.7   tok/s

The 3090 wins decisively in the 14-31B band where the model fits in 24 GiB but not 12 GiB:

Gemma-4-26B-A4B chat:   3090 = 100.5  |  Strix ROCm = 43.7  |  Strix Vulkan = 47.7  tok/s
Qwen3.6-27B     chat:   3090 = 21.1   |  Strix ROCm = 11.2  |  Strix Vulkan = 11.6  tok/s

Strix Vulkan is often a hair faster than Strix ROCm on the same hardware/model. Biggest gap I saw was Gemma-4-26B-A4B at +9% (43.7 → 47.7). Some models are basically tied. Probably a gfx1151 kernel tuning gap on the bundled ROCm build; haven't dug in.

Quant cost on the 3090 for Qwen3.6-27B chat:

Q2_K = 24.0   Q3_K_M = 20.5   Q4_K_M = 21.1   Q5_K_M = 18.6   Q6_K = 15.3   tok/s

Q2 to Q6 is a 1.6x range. Q4 is the sweet spot. Q2 buys you ~14% over Q4 in exchange for the quality hit; Q6 costs ~28% for the quality bump. Surprised the curve isn't steeper.
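
The percentages above fall out of the table directly. A small sketch of the arithmetic, using only the numbers listed:

```python
rates = {
    "Q2_K": 24.0,
    "Q3_K_M": 20.5,
    "Q4_K_M": 21.1,
    "Q5_K_M": 18.6,
    "Q6_K": 15.3,
}


def pct_vs(quant, reference="Q4_K_M"):
    """Percent change in tok/s relative to the reference quant."""
    return (rates[quant] / rates[reference] - 1) * 100


spread = max(rates.values()) / min(rates.values())

print(f"Q2_K vs Q4_K_M: {pct_vs('Q2_K'):+.1f}%")
print(f"Q6_K vs Q4_K_M: {pct_vs('Q6_K'):+.1f}%")
print(f"fastest / slowest: {spread:.2f}x")
```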

Reasoning models look ~5x slower than they actually are if you only watch the visible answer. Qwen3.5/3.6 stream most of their output through a hidden reasoning_content channel, which counts toward the decode rate but isn't part of the user-visible answer. Worth knowing when picking a coding assistant.
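
To make the gap concrete, here is a sketch that separates total decode rate from the rate at which visible answer tokens arrive. The token counts and timing are hypothetical, chosen only to illustrate a 5x ratio.

```python
def decode_rates(reasoning_tokens, answer_tokens, seconds):
    """Return (total decode tok/s, user-visible answer tok/s)."""
    total = (reasoning_tokens + answer_tokens) / seconds
    visible = answer_tokens / seconds
    return total, visible


# Hypothetical response: 800 hidden reasoning tokens, 200 answer tokens, 50 s.
total, visible = decode_rates(800, 200, 50.0)
print(f"decode: {total:.1f} tok/s, visible: {visible:.1f} tok/s, "
      f"ratio: {total / visible:.1f}x")
```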

CPU on Strix is not nothing. Gemma-4-26B-A4B MoE runs at ~5-9 tok/s on pure CPU thanks to unified memory + active-param routing. Not fast, but usable for batch work where you don't need the GPU.

Site has every run plus the rest of the models if you want to dig: https://calebcoffie.com/benchmarks. Methodology and the rest of the writeup: https://calebcoffie.com/blog/introducing-open-weight-model-benchmarks.

Things I know I haven't done: vLLM on Strix (lemonade's backend-readiness timeout kills the FP8 autotune; fix queued), CUDA on the 5070 (Arch gcc transition broke the cuda package mid-update, parked), the 70-130B Strix-only models (queued for v2). I don't own a 4090/5080/5090, so those aren't represented; the writeup has a back-of-envelope bandwidth extrapolation.

Not trying to replace existing benchmark sites. Just wanted another data point for my own setup and figured the same combo of rigs would be useful to someone else. Happy to be wrong on methodology if anyone spots a flaw.

u/C_Coffie — 6 days ago

Ran the same models across Strix Halo, RTX 3090, and RTX 5070 because I wanted my own numbers

u/C_Coffie — 6 days ago