Dual AMD MI50 (gfx906) for local LLM: Tuning Qwen3.6-27B-MTP-GGUF to ~28 t/s generation (76.6% acceptance) & 295 t/s prefill!
I wanted to share my recent experience and benchmarks
deploying the **Qwen3.6-27B-MTP-GGUF (Q8_0)** model locally
using speculative decoding on a dual-GPU budget enterprise
setup: **2x AMD Instinct MI50 (32GB HBM2 each, gfx906
architecture)**.
If you want a cost-effective way to host 27B+ models locally
with fast generation, older AMD enterprise cards like the MI50
are hidden gems, though they do need some ROCm tuning to reach
their full potential.
Here is a full breakdown of the performance, the bottleneck
I ran into, and how I optimized the prefill (Prompt
Processing) by **over 50%**!
---
### 💻 Hardware & Env Specs
* **GPUs**: 2x AMD Instinct MI50 32GB HBM2 (PCIe Gen3 x16)
* **CPU**: Intel Xeon E5-2696 v4 (22C/44T)
* **RAM**: 64GB DDR4
* **Backend**: llama.cpp (inside Docker with ROCm 7.2.3,
custom compiled for gfx906)
* **Model**: `unsloth/Qwen3.6-27B-MTP-GGUF` (Q8_0 weights,
~28GB) loaded fully onto 2x GPUs via layer-split.
* **Speculative Decoding**: Native `draft-mtp` enabled with
`--spec-draft-n-max 2 -np 1`.
---
### 📊 Real-world Generation Speed & MTP Acceptance Rate
Generation is snappy thanks to MTP (multi-token prediction)
speculative decoding with Unsloth's MTP GGUF. Here are the
speculative decoding stats captured from a real long-context
conversation:
* **Generation Speed (Decode)**: **27.96 - 28.57
tokens/sec**
* **Draft Acceptance Rate**: **76.6%** (489 accepted out of
638 generated drafts!)
* **VRAM footprint**: Super clean **~42% (13.7 GB)** VRAM
usage per GPU with a **64k context window** (`-c 65536`),
leaving tons of headroom.
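For anyone who wants to sanity-check these figures, both headline numbers are plain ratios of counters that `llama-server` prints in its timing lines. A minimal Python sketch, with the values copied by hand from my run (the function and variable names are just illustrative, not part of llama.cpp):

```python
# Counters copied by hand from the llama-server timing output of this run.
eval_time_ms = 28897.12    # wall time spent generating
eval_tokens = 808          # tokens generated in that time
drafts_accepted = 489      # draft tokens accepted
drafts_generated = 638     # draft tokens proposed


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/sec from a token count and elapsed milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted: int, generated: int) -> float:
    """Fraction of generated drafts that were accepted (0.0 if none generated)."""
    if generated == 0:
        return 0.0
    return accepted / generated


decode_tps = tokens_per_second(eval_tokens, eval_time_ms)
ms_per_token = eval_time_ms / eval_tokens
accept = acceptance_rate(drafts_accepted, drafts_generated)

print(f"decode: {decode_tps:.2f} t/s ({ms_per_token:.2f} ms per token)")
print(f"draft acceptance: {accept:.1%}")
```

This reproduces the 27.96 t/s and 76.6% figures, so it is a quick way to compare runs if you change `--spec-draft-n-max`.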
Here is the raw stdout log of the inference run:
```text
24.17.528.975 I slot print_timing: id 0 | task 2484 | eval time = 28897.12 ms / 808 tokens ( 35.76 ms per token, 27.96 tokens per second)
24.17.528.976 I slot print_timing: id 0 | task 2484 | total time = 135079.97 ms / 19574 tokens
24.17.528.977 I slot print_timing: id 0 | task 2484 | graphs reused = 2955
24.17.528.978 I slot print_timing: id 0 | task 2484 | draft acceptance = 0.76646 ( 489 accepted / 638 generated)
24.17.528.990 I statistics draft-mtp: #calls(b,g,a) = 12 2993 2993, #gen drafts = 2993, #acc drafts = 2562, #gen tokens = 5985, #acc tokens = 4642
```
---
### 🔍 Troubleshooting the Prefill (Prompt Processing) Bottleneck
Initially, while generation was fast, my prefill speed was
painfully slow, hovering around 176.7 tokens/second. A ~18k
token prompt was taking 106 seconds just to prefill!
By digging into the logs and running `llama-bench`, I found
two main culprits:
- **Physical Batch Size Underutilization**: The default
`--ubatch-size 512` is too small to saturate the 3,840
stream processors on the Vega-based MI50 architecture.
- **PCIe Checkpoint Synchronizations**: The llama.cpp server's
default context checkpointing (`--checkpoint-every-n-tokens 8192`)
was triggering a ~250MB state copy from VRAM to host
RAM over PCIe Gen3 every 8192 tokens. This forced the entire
GPU pipeline to stall and sync for 1.01 seconds per
checkpoint!
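To put a number on the checkpoint stalls, here is a rough Python estimate. It assumes one checkpoint per full interval of prompt tokens (my reading of the behavior, not something confirmed from the llama.cpp source), takes the prompt length as total tokens minus generated tokens from the log, and uses the theoretical PCIe Gen3 x16 peak of about 15.75 GB/s; the helper names are illustrative only.

```python
PCIE3_X16_BYTES_PER_S = 15.754e9  # theoretical peak, real transfers are slower


def checkpoint_count(prompt_tokens: int, every_n_tokens: int) -> int:
    """Checkpoints taken during prefill, assuming one per full interval.

    A non-positive interval is treated as "checkpointing disabled".
    """
    if every_n_tokens <= 0:
        return 0
    return prompt_tokens // every_n_tokens


def total_stall_seconds(prompt_tokens: int, every_n_tokens: int,
                        stall_s: float) -> float:
    """Total time lost to checkpoint stalls over one prefill."""
    return checkpoint_count(prompt_tokens, every_n_tokens) * stall_s


prompt_tokens = 19574 - 808   # total tokens minus generated tokens, from the log
stalls = total_stall_seconds(prompt_tokens, 8192, 1.01)
copy_s = 250e6 / PCIE3_X16_BYTES_PER_S  # ideal time to move ~250 MB

print(f"checkpoints: {checkpoint_count(prompt_tokens, 8192)}")
print(f"time lost to stalls: {stalls:.2f} s")
print(f"ideal copy time per checkpoint: {copy_s * 1000:.1f} ms")
```

Under these assumptions the raw copy at bus speed would take far less than the 1.01 s I measured per checkpoint, which hints that most of the stall is pipeline synchronization rather than the transfer itself. Treat both numbers as back-of-the-envelope estimates.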
#### 🚀 The Fix:
I adjusted the startup parameters in my launch script to:
- Max out the physical batch size: `-b 2048 --ubatch-size 2048`
to saturate the GPU.
- Disable the PCIe context checkpoints: `--checkpoint-every-n-tokens -1`
(since this is a sequential Claude CLI setup, rolling back to
intermediate checkpoints is unnecessary).
- Set `-c 65536` to lower the overall KV-cache footprint for
safety.
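Put together, the flags mentioned in this post can be assembled like this. It is only a sketch: the model path is a hypothetical placeholder, and the option that actually switches on `draft-mtp` is left out because I have not quoted it here, so check `llama-server --help` on your build before copying anything.

```python
import shlex

# Hypothetical model location; replace with your own path.
MODEL_PATH = "/models/Qwen3.6-27B-MTP-Q8_0.gguf"

args = [
    "llama-server",
    "-m", MODEL_PATH,
    "-c", "65536",                        # 64k context window
    "-b", "2048",                         # logical batch size
    "--ubatch-size", "2048",              # physical micro-batch size
    "--checkpoint-every-n-tokens", "-1",  # disable context checkpoints
    "--spec-draft-n-max", "2",            # at most 2 draft tokens per step
    "-np", "1",                           # single slot
    "--flash-attn", "on",
    # Note: the flag that enables draft-mtp is intentionally not shown.
]

# shlex.join quotes each argument safely for a POSIX shell.
command = shlex.join(args)
print(command)
```

Building the command as a list keeps each flag next to its value, which makes it easier to diff two launch configurations when benchmarking.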
---
### 📈 The Benchmark Results (Optimized vs Default)
Here are the actual on-hardware `llama-bench` results under
different physical micro-batch (`ubatch-size`) configurations
on my dual MI50 server:

| Prompt size | ubatch-size | ubatch-size | ubatch-size | Speed improvement (last vs. first column) |
|---|---|---|---|---|
| 512 tokens | 196.47 t/s | 264.44 t/s | 271.74 t/s | +38.3% |
| 2048 tokens | 194.20 t/s | 273.71 t/s | 290.00 t/s | +49.3% |
| 8192 tokens | 195.12 t/s | 279.79 t/s | 295.66 t/s | +51.5% |

Note: Disabling the checkpoints also completely eliminated
the 1-second stalls, meaning the end-to-end prefill for an
18k-token payload now finishes in ~63 seconds instead of 106
seconds!
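The improvement percentages and the ~63 second estimate can be re-derived from the benchmark numbers. A small Python sketch, where the dictionary layout and function names are my own and the prompt length is taken as total minus generated tokens from the inference log:

```python
# llama-bench prefill throughput (t/s) per prompt size, in table column order.
results = {
    512: (196.47, 264.44, 271.74),
    2048: (194.20, 273.71, 290.00),
    8192: (195.12, 279.79, 295.66),
}


def improvement_pct(baseline_tps: float, tuned_tps: float) -> float:
    """Relative throughput gain of tuned over baseline, in percent."""
    return (tuned_tps / baseline_tps - 1.0) * 100.0


def prefill_seconds(prompt_tokens: int, tps: float) -> float:
    """Time to process a prompt at a constant prefill throughput."""
    return prompt_tokens / tps


for prompt_size, (first, _middle, last) in results.items():
    print(f"{prompt_size:>5} tokens: +{improvement_pct(first, last):.1f}%")

prompt_tokens = 19574 - 808          # from the inference log
before_s = prefill_seconds(prompt_tokens, 176.7)    # measured before tuning
after_s = prefill_seconds(prompt_tokens, 295.66)    # best llama-bench figure
print(f"prefill: {before_s:.0f} s -> {after_s:.0f} s")
```

The "after" time assumes the server sustains the best `llama-bench` throughput across the whole prompt, so real requests may land a little above it.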
### 💡 Takeaways
* MTP speculative decoding works incredibly well on ROCm with
`llama-server`. Over 76% acceptance rate makes a 27B model
feel like a much smaller model during generation.
* If you deploy ROCm on older architectures
(gfx906/MI50/Radeon VII), do not stick with default ubatch
sizes. Cranking up `--ubatch-size 2048` coupled with
`--flash-attn on` yields massive prefill speedups.
* Watch out for context checkpoint overhead if you are using
long contexts over slower PCIe buses.
Big thanks to the Unsloth team for making these MTP weights
available! Let me know if you have any questions about ROCm
configurations or MI50 setup.