u/wzoran

I wanted to share my recent experience and benchmarks deploying the **Qwen3.6-27B-MTP-GGUF (Q8_0)** model locally using speculative decoding on a budget dual-GPU enterprise setup: **2x AMD Instinct MI50 (32GB HBM2 each, gfx906 architecture)**.

If you are looking for a cost-effective way to host 27B+ models locally with fast generation speeds, older AMD enterprise cards like the MI50 are hidden gems, though they do require some ROCm tuning to hit their maximum potential.

Here is a full breakdown of the performance, the bottleneck I ran into, and how I sped up prefill (prompt processing) by **over 50%**!

---

### 💻 Hardware & Env Specs

* **GPUs**: 2x AMD Instinct MI50 32GB HBM2 (PCIe Gen3 x16)

* **CPU**: Intel Xeon E5-2696 v4 (22C/44T)

* **RAM**: 64GB DDR4

* **Backend**: llama.cpp (inside Docker with ROCm 7.2.3, custom compiled for gfx906)

* **Model**: `unsloth/Qwen3.6-27B-MTP-GGUF` (Q8_0 weights, ~28GB) loaded fully onto 2x GPUs via layer-split.

* **Speculative Decoding**: Native `draft-mtp` enabled with `--spec-draft-n-max 2 -np 1`.

---

### 📊 Real-world Generation Speed & MTP Acceptance Rate

The generation speed is incredibly snappy thanks to MTP speculative decoding with Unsloth's weights. Here are the speculative decoding stats captured from a real long-context conversation:

* **Generation Speed (Decode)**: **27.96 - 28.57 tokens/sec**

* **Draft Acceptance Rate**: **76.6%** (489 accepted out of 638 generated drafts)

* **VRAM footprint**: **~42% (13.7 GB)** VRAM usage per GPU with a **64k context window** (`-c 65536`), leaving plenty of headroom.

Here is the raw stdout log of the inference run:

```text
24.17.528.975 I slot print_timing: id 0 | task 2484 | eval time = 28897.12 ms / 808 tokens ( 35.76 ms per token, 27.96 tokens per second)
24.17.528.976 I slot print_timing: id 0 | task 2484 | total time = 135079.97 ms / 19574 tokens
24.17.528.977 I slot print_timing: id 0 | task 2484 | graphs reused = 2955
24.17.528.978 I slot print_timing: id 0 | task 2484 | draft acceptance = 0.76646 ( 489 accepted / 638 generated)
24.17.528.990 I statistics draft-mtp: #calls(b,g,a) = 12 2993 2993, #gen drafts = 2993, #acc drafts = 2562, #gen tokens = 5985, #acc tokens = 4642
```

---

### 🔍 Troubleshooting the Prefill (Prompt Processing) Bottleneck

Initially, while generation was fast, my prefill speed was painfully slow, hovering around 176.7 tokens/second. A ~18k token prompt was taking 106 seconds just to prefill!
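The 176.7 t/s figure can be recovered from the timing log above. A minimal sketch of the arithmetic, assuming (this is an assumption, the log does not say so explicitly) that `total time` is prefill plus decode and that the total token count is prompt plus generated tokens:

```python
# Values copied from the print_timing log lines above.
eval_ms, eval_tokens = 28897.12, 808        # decode phase
total_ms, total_tokens = 135079.97, 19574   # whole request

# Assumption: total = prefill + decode, for both time and token count.
prefill_ms = total_ms - eval_ms
prefill_tokens = total_tokens - eval_tokens

decode_tps = eval_tokens / (eval_ms / 1000)
prefill_tps = prefill_tokens / (prefill_ms / 1000)
acceptance = 489 / 638  # accepted drafts / generated drafts for this task

print(f"prefill: {prefill_tokens} tokens in {prefill_ms / 1000:.1f} s = {prefill_tps:.1f} t/s")
print(f"decode:  {decode_tps:.2f} t/s, draft acceptance {acceptance:.1%}")
```

Under that assumption the prompt was 18,766 tokens, which lines up with the "~18k tokens in 106 seconds" observation.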

By digging into the logs and running `llama-bench`, I found two main culprits:

* **Physical Batch Size Underutilization**: The default `--ubatch-size 512` is too small to saturate the 3,840 stream processors on the Vega-based MI50 architecture.

* **PCIe Checkpoint Synchronizations**: The llama.cpp server's default context checkpointing (`--checkpoint-every-n-tokens 8192`) was triggering a ~250MB state copy from VRAM to host RAM over PCIe Gen3 every 8192 tokens. This forced the entire GPU pipeline to stall and sync for 1.01 seconds per checkpoint!

#### 🚀 The Fix:

I adjusted the startup parameters in my launch script to:

* **Max out the physical batch size**: `-b 2048 --ubatch-size 2048` to saturate the GPU.

* **Disable the PCIe context checkpoints**: `--checkpoint-every-n-tokens -1` (since this is a sequential Claude CLI setup, rolling back to intermediate checkpoints is unnecessary).

* **Set `-c 65536`** to lower the overall KV-cache footprint for safety.
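Put together, the adjusted flags could be assembled like this. It is only a sketch: the flag spellings are copied from this post, while the `build_server_args` helper and the model path are hypothetical placeholders, and any other options (GPU offload, layer split, MTP draft settings) are left out.

```python
import shlex

def build_server_args(model_path, ctx=65536, batch=2048, ubatch=2048):
    """Assemble a llama-server argv using the flag spellings quoted in this post."""
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx),                        # 64k context window
        "-b", str(batch),                      # logical batch size
        "--ubatch-size", str(ubatch),          # physical micro-batch size
        "--checkpoint-every-n-tokens", "-1",   # disable context checkpoints
    ]

# Hypothetical path; point this at wherever the GGUF file actually lives.
args = build_server_args("/models/Qwen3.6-27B-MTP-Q8_0.gguf")
print(shlex.join(args))
```

Keeping the values as function parameters makes it easy to rerun the same launch with a different `ubatch` when comparing prefill speeds.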

---

### 📈 The Benchmark Results (Optimized vs Default)

Here are the actual real-hardware `llama-bench` results under different physical micro-batch (`ubatch-size`) configurations on my dual MI50 server:

| Prompt Size | ubatch-size | ubatch-size | ubatch-size | Speed Improvement |
|-------------|-------------|-------------|-------------|-------------------|
| 512 Tokens  | 196.47 t/s  | 264.44 t/s  | 271.74 t/s  | +38.3%            |
| 2048 Tokens | 194.20 t/s  | 273.71 t/s  | 290.00 t/s  | +49.3%            |
| 8192 Tokens | 195.12 t/s  | 279.79 t/s  | 295.66 t/s  | +51.5%            |
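The improvement column matches the last throughput column relative to the first one. A quick check of that arithmetic, with the numbers copied from the table (`improvement_pct` is just an illustrative helper name):

```python
# (prompt size, t/s in the first ubatch column, t/s in the last ubatch column)
rows = [
    (512, 196.47, 271.74),
    (2048, 194.20, 290.00),
    (8192, 195.12, 295.66),
]

def improvement_pct(baseline_tps, tuned_tps):
    """Relative throughput gain in percent."""
    return (tuned_tps / baseline_tps - 1) * 100

gains = {prompt: round(improvement_pct(base, tuned), 1) for prompt, base, tuned in rows}
for prompt, gain in gains.items():
    print(f"{prompt:>5} tokens: +{gain}%")
```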

Note: Disabling the checkpoints also completely eliminated the 1-second stalls, meaning the end-to-end prefill for an 18k-token payload now finishes in ~63 seconds instead of 106 seconds!
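As a rough sanity check on those two end-to-end times, prefill time is simply prompt tokens divided by throughput. This sketch assumes a prompt of 18,766 tokens (19,574 total minus 808 generated in the log earlier, an inferred number) and assumes the 8192-token benchmark rate carries over to the longer prompt:

```python
def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a constant prefill rate."""
    return prompt_tokens / tokens_per_second

prompt_tokens = 18_766  # inferred from the log, not stated directly in the post

before = prefill_seconds(prompt_tokens, 176.7)   # originally observed prefill rate
after = prefill_seconds(prompt_tokens, 295.66)   # best llama-bench rate in the table

print(f"before: {before:.0f} s, after: {after:.0f} s")
```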

### 💡 Takeaways

* MTP speculative decoding works incredibly well on ROCm with `llama-server`. Over 76% acceptance rate makes a 27B model feel like a much smaller model during generation.

* If you deploy ROCm on older architectures (gfx906/MI50/Radeon VII), do not stick with default ubatch sizes. Cranking up `--ubatch-size 2048` coupled with `--flash-attn on` yields massive prefill speedups.

* Watch out for context checkpoint overhead if you are using long contexts over slower PCIe buses.

Big thanks to the Unsloth team for making these MTP weights available! Let me know if you have any questions about ROCm configurations or MI50 setup.

Dual AMD MI50 (gfx906) for local LLM: Tuning Qwen3.6-27B-MTP-GGUF to ~28 t/s generation (76.6% acceptance) & 295 t/s prefill!
