r/unsloth

▲ 11 r/unsloth

Dual AMD MI50 (gfx906) for local LLM: Tuning Qwen3.6-27B- MTP-GGUF to ~28 t/s generation (76.6% acceptance) & 295 t/s │ prefill!

I wanted to share my recent experience and benchmarks

deploying the **Qwen3.6-27B-MTP-GGUF (Q8_0)** model locally

using speculative decoding on a dual-GPU budget enterprise

setup: **2x AMD Instinct MI50 (32GB HBM2 each, gfx906

architecture)**.

If you are looking for a cost-effective way to host 27B+

models locally with fast generation speeds, older AMD

enterprise cards (like the MI50) are absolute hidden

gems—though they do require some ROCm tuning to hit their

maximum potential.

Here is a full breakdown of the performance, the bottleneck

I ran into, and how I optimized the prefill (Prompt

Processing) by **over 50%**!

---

### 💻 Hardware & Env Specs

* **GPUs**: 2x AMD Instinct MI50 32GB HBM2 (PCIe Gen3 x16)

* **CPU**: Intel Xeon E5-2696 v4 (22C/44T)

* **RAM**: 64GB DDR4

* **Backend**: llama.cpp (inside Docker with ROCm 7.2.3,

custom compiled for gfx906)

* **Model**: `unsloth/Qwen3.6-27B-MTP-GGUF` (Q8_0 weights,

~28GB) loaded fully onto 2x GPUs via layer-split.

* **Speculative Decoding**: Native `draft-mtp` enabled with

`--spec-draft-n-max 2 -np 1`.

---

### 📊 Real-world Generation Speed & MTP Acceptance Rate

The generation speed is incredibly snappy thanks to

Unsloth's MTP architecture. Here are the speculative decoding

stats captured from a real long-context conversation:

* **Generation Speed (Decode)**: **27.96 - 28.57

tokens/sec**

* **Draft Acceptance Rate**: **76.6%** (489 accepted out of

638 generated drafts!)

* **VRAM footprint**: Super clean **~42% (13.7 GB)** VRAM

usage per GPU with a **64k context window** (`-c 65536`),

leaving tons of headroom.

Here is the raw stdout log of the inference run:

```text

24.17.528.975 I slot print_timing: id 0 | task 2484 |

eval time = 28897.12 ms / 808 tokens ( 35.76 ms per

token, 27.96 tokens per second)

24.17.528.976 I slot print_timing: id 0 | task 2484 |

total time = 135079.97 ms / 19574 tokens

24.17.528.977 I slot print_timing: id 0 | task 2484 |

graphs reused = 2955

24.17.528.978 I slot print_timing: id 0 | task 2484 |

draft acceptance = 0.76646 ( 489 accepted / 638 generated)

24.17.528.990 I statistics draft-mtp: #calls(b,g,a)

= 12 2993 2993, #gen drafts = 2993, #acc drafts =

2562, #gen tokens = 5985, #acc tokens = 4642

──────

### 🔍 Troubleshooting the Prefill (Prompt Processing)

Bottleneck

Initially, while generation was fast, my prefill speed was

painfully slow, hovering around 176.7 tokens/second. A ~18k

token prompt was taking 106 seconds just to prefill!

By digging into the logs and running llama-bench , I found

two main culprits:

  1. Physical Batch Size Underutilization: The default --

ubatch-size 512 is too small to saturate the massive 3,840

stream processors on the Vega-based MI50 architecture.

  1. PCIe Checkpoint Synchronizations: The llama.cpp server's

default context checkpointing ( --checkpoint-every-n-tokens

8192 ) was triggering a ~250MB state copy from VRAM to host

RAM over PCIe Gen3 every 8192 tokens. This forced the entire

GPU pipeline to stall and sync for 1.01 seconds per

checkpoint!

#### 🚀 The Fix:

I adjusted the startup parameters in my launching script to:

  1. Max out the physical batch size: -b 2048 --ubatch-size

2048 to saturate the GPU.

  1. Disable the PCIe context checkpoints: --checkpoint-every-

n-tokens -1 (since this is a sequential Claude CLI setup,

intermediate checkpoint rolling back is unnecessary).

  1. Set -c 65536 to lower the overall KV-Cache footprint for

safety.

──────

### 📈 The Benchmark Results (Optimized vs Default)

Here are the actual實机 llama-bench results under different

physical micro-batch ( ubatch-size ) configurations on my

dual

MI50 server:

Prompt Size│ ubatch-size│ ubatch-size│ ubatch-size│ Speed Im

────────────┼────────────┼────────────┼────────────┼─────────

512 Tokens │ 196.47 t/s │ 264.44 t/s │ 271.74 t/s │ + 38.3%

2048 Tokens│ 194.20 t/s │ 273.71 t/s │ 290.00 t/s │ + 49.3%

8192 Tokens│ 195.12 t/s │ 279.79 t/s │ 295.66 t/s │ + 51.5%

Note: Disabling the checkpoints also completely eliminated

the 1-second stalls, meaning the end-to-end prefill for an

18k

token payload now finishes in ~63 seconds instead of 106

seconds!

### 💡 Takeaways

• MTP speculative decoding works incredibly well on ROCm with

llama-server . Over 76% acceptance rate makes a 27B model

feel like a much smaller model during generation.

• If you deploy ROCm on older architectures

(gfx906/MI50/Radeon VII), do not stick with default ubatch

sizes. Cranking up --ubatch-size 2048 coupled with --flash-

attn on yields massive prefill speedups.

• Watch out for context checkpoint overhead if you are using

long contexts over slower PCIe buses.

Big thanks to the Unsloth team for making these MTP weights

available! Let me know if you have any questions about ROCm

configurations or MI50 setup.

reddit.com
u/wzoran — 1 day ago

Garbage output while trying to run IQ3XXS variant of unsloth/Qwen3.6-27B-MTP-GGUF with llamacpp

Platform: Ubuntu 26.04, RTX 5060Ti, NVIDIA Driver 595.71.05, CUDA 13.2

I downloaded the IQ3XXS version and tried to run it with llamacpp both ways - with and without the newly introduced spec-type argument. But in both the cases, the model produces random characters as output. Here's one of the commands that I used:

./llama-server -m ~/.lmstudio/models/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-IQ3_XXS.gguf --no-mmproj -ngl 99 -c 8192 -np 1

I am able to run the regular models but not this one. Am I missing something or this quant has a problem?

Sample output:

1   .111 111111111 1049  . 21 1. 1.111 10 1.  .​1

11​..​ .  A .

Update: This is a known issue where compiling llamacpp against CUDA13.2 makes it produce garbage with all variants (MTP/Non MTP). Downgrading to CUDA12.8 solved it. However that isn't simple on Ubuntu 26.04.

I recently upgraded to Ubuntu 26.04 because of NPU support. However, the recommended CUDA toolkit with 26.04 is 13.2. I managed to install the older versions of the toolkit using deb installers on nvidia site but llamacpp compilation fails because of glibc incompatibility. Eventually had to use docker and setup nvidia/cuda:12.8.0-devel-ubuntu24.04 for compilation.

Now happy with a jump from 30tps to 40tps :)

reddit.com
u/v01dm4n — 2 days ago

Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

I’m running Hermes Agent on a single NVIDIA DGX Spark using vLLM with:

docker run --gpus all \
--name qwen36-aggressive \
--restart unless-stopped \
-p 8000:8000 \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--shm-size=32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_ATTENTION_BACKEND=FLASHINFER \
-e FLASHINFER_DISABLE_VERSION_CHECK=1 \
-e VLLM_HTTP_TIMEOUT_KEEP_ALIVE=600 \
vllm/vllm-openai:cu130-nightly \
--model Qwen/Qwen3.6-35B-A3B-FP8 \
--served-model-name qwen36 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.75 \
--dtype auto \
--kv-cache-dtype fp8 \
--max-model-len 262144 \
--max-num-batched-tokens 32768 \
--max-num-seqs 4 \
--attention-backend flashinfer \
--enable-prefix-caching \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--trust-remote-code \
--reasoning-parser qwen3 \
--performance-mode throughput \
--default-chat-template-kwargs '{"preserve_thinking":true}' \
--override-generation-config '{"temperature":0.6,"top_p":0.95,"top_k":20,"min_p":0.0,"presence_penalty":0.0,"repetition_penalty":1.0}' \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

It boots successfully and seems stable so far, but I’d love opinions from people running similar long-context / agentic setups.

Any feedback or suggestions are welcome.

reddit.com
u/povedaaqui — 1 day ago

Question for opensourced Ling

I noticed quite a few people have downloaded Ling on Hugging Face. What tasks are people using it for? Does it have any drawbacks?

u/Own_Development_9809 — 2 days ago
▲ 141 r/unsloth

4-bit Qwen3.6 MTP GGUF cited 70+ websites with one prompt!

4-bit Qwen3.6 MTP GGUF managed to search 70+ sites from a single prompt.

Try this locally with Unsloth Studio on 20GB RAM.

Unsloth now supports automatic MTP + speculative decoding for supported models. Unsloth also now auto-selects the best MTP settings for your specific device (Mac, CPU, GPU etc.)

We also fixed many bugs and issues including tokens/s not showing up correctly and MTP not being applied properly.

GitHub: https://github.com/unslothai/unsloth

u/yoracale — 3 days ago
▲ 86 r/unsloth+1 crossposts

RX 7900 XTX vs Radeon AI PRO R9700 — llama.cpp Vulkan vs ROCm (6 models, token-gen)

Setup: llama.cpp llama-bench, -fa 1 -ngl 99 -ctk q8_0 -ctv q8_0 -p 512,2048 -n 128,256 -r

3, 300 W power cap on both cards. Models are unsloth GGUFs (UD-IQ4_XS / UD-Q4_K_XL);

gpt-oss-20b is the ggml-org native MXFP4. R9700 = RDNA4/gfx1201, 7900 XTX = RDNA3/gfx1100.

R9700 runs measured one day earlier, identical config.

Takeaways:

- 7900 XTX beats the R9700 by +24–29% on token-gen across the whole slate — memory

bandwidth (384-bit vs 256-bit).

- Vulkan > ROCm for token-gen on both architectures — huge on MoE (XTX: +33–64%).

- Prefill flips it: ROCm pp2048 is ~8–17% faster on dense models (e.g. Qwen-27B IQ4: ROCm

1022 vs Vulkan 870 t/s).

greetings Ginmarr

u/Ginmarr — 3 days ago

Unsloth Studio and Cmake Flags

I'm on a shared DGX-H200 that unfortunately does not have current gcc-toolset-13, and I cannot update it. When I build my local version of llama.cpp I have to set cmake flags to disable all AVX instructions. I'm installing unsloth-studio via the single script:

curl -fsSL https://unsloth.ai/install.sh | sh

How do I modify the build config of the llama.cpp within the install ?

reddit.com
u/Simusid — 2 days ago

Where can i find the imatrix dataset file for unsloth Quants?

quantize.imatrix.file Qwen3.6-35B-A3B-GGUF/imatrix_unsloth.gguf
quantize.imatrix.dataset unsloth_calibration_Qwen3.6-35B-A3B.txt

I find the gguf but not the unsloth_calibration_Qwen3.6-35B-A3B.txt file. Same for other LLMs.

reddit.com
u/PromptInjection_ — 2 days ago
▲ 638 r/unsloth+1 crossposts

Run Qwen3.6 MTP GGUFs locally!

Hey guys, Qwen3.6 can run ~1.4–2.2× faster with no accuracy change due to MTP. You can run this locally on just 18GB RAM, VRAM or unified memory.

The Qwen3.6 Unsloth GGUFs are now out of experimental mode, llama.cpp has merged many PRs, and MTP is now properly supported in Unsloth. MTP is now ready!

Please use the latest Unsloth `v0.1.41-beta`, not `v0.1.405-beta` which is older. In Studio, we automatically set all the params for you depending on your specific hardware so you get the near best results (you can still change it)

Qwen3.6-27B MTP can run at 160 tokens/s. Qwen3.6-35B-A3B MTP GGUF reaches 240 tokens/s. We also uploaded MTP GGUFs for Qwen3.5!

27B MTP GGUF: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
35B-A3B MTP GGUF: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Guide: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

Thank you! We've got lots of releases this week as well.

u/yoracale — 4 days ago
▲ 13 r/unsloth

I made my own organization on huggingface for soley releasing low size distills of bigger models

I recently started my own Hugging Face org called CoNDeNse-AI focused on making smaller, lightweight distilled AI models that are easier to run on normal hardware 🙌

Org: https://huggingface.co/CoNDeNse-AI

Most of the training is done on Kaggle using 2x T4 GPUs, so a big part of the project is figuring out how to get the best possible results from limited hardware. Because of this, we unfortunately can’t currently make proper distills based on newer/larger Qwen 3.5 base models since Kaggle struggles heavily with them during training and distillation.

Some current projects are:

- GLM-5.1-Qwen3-1.7B-CoNDeNse

- GLM-5.1-Qwen3-0.6B-CoNDeNse

- GLM-5.1-Qwen3-1.7B-CoNDeNse-GGUF

The 1.7B versions mainly focus on preserving reasoning, coding, and multilingual capabilities while reducing overhead, while the 0.6B variant is more focused on accessibility and lower-end hardware support. The GGUF release is aimed at easier local inference in things like llama.cpp and LM Studio 💻

The org is still very experimental, so alongside proper releases there are also research checkpoints, quantization tests, and random experiments that may or may not work 😅

Would love feedback from people working on low-resource training/distillation setups.

u/Capital_Savings_9942 — 3 days ago
▲ 28 r/unsloth

PinchBench and Tau2 may matter more than one more AIME headline

For agent models, PinchBench and Tau2 may matter more than one more AIME headline。

I still think AIME and GPQA matter. They say something real about capability ceilings. For agent models, though, I reach first for execution-heavy, tool-heavy, multi-step signals. That is why Ring-2.6-1T caught my eye: PinchBench: 87.60, Tau2-Bench Telecom: 95.32, and ClawEval: 63.82 sit alongside AIME 26: 95.83, GPQA Diamond: 88.27, and ARC-AGI-V2: 66.18. For production-style agents, I care first about whether the model can keep a workflow moving, coordinate tools cleanly, and avoid spending deep reasoning on every intermediate step. The public high / xhigh framing fits that story too, with deeper reasoning available when you need it instead of dominating every path.

reddit.com
u/Tricky_Season2969 — 3 days ago
▲ 145 r/unsloth

Run Qwen3.6 MTP GGUFs in Unsloth Studio!

Hey guys, Qwen3.6 MTP GGUFs now work in Unsloth Studio: https://github.com/unslothai/unsloth

Just update Unsloth Studio or do a fresh install.

MacOS, Linux, WSL:

curl -fsSL https://unsloth.ai/install.sh | sh

Windows PowerShell:

irm https://unsloth.ai/install.ps1 | iex

As always huge thanks to llama.cpp and devs for making this possible.

We'll be doing a new pypi release with lots of new updates tomorrow! Lots!!!

u/yoracale — 5 days ago

최신 unsloth studio 버전에서 한글 입력이 안되는데 해결한 사람 있나?

unsloth studio를 업데이트 한 이후부터 프롬프트 입력에서 한글 입력이 전혀 안된다.
운영 환경은 mac mini

unsloth studio의 버그야? 아니면 한글 같은 다국어 입력 문제 해결 방법이 있나??

reddit.com
u/More-Sail-6170 — 4 days ago
▲ 435 r/unsloth

Qwen3.6 MTP Unsloth GGUFs now 1.8x faster!

Qwen3.6 MTP Unsloth GGUFs now run **1.8x faster, increased from 1.4x just two days ago!**This is due to llama.cpp adding --spec-draft-p-min 0.75!

Args have also changed from
--spec-type mtp
to
--spec-type draft-mtp

Also increase --spec-draft-n-max 2 to 6

We also released Qwen3.5-0.8B, 2B, 4B, 9B MTP GGUFs! We'll be providing more soon!

For folks who find the new updated branch to have some perf regression, set --spec-draft-p-min to 0.0 to get the old behavior - we provided a plot of the old branch (red) vs the new branch (blue / green) as well.

Also you can use 2 speculative decoding algos - you can add ngram via --spec-type ngram-mod,draft-mtp - the perf isn't yet optimized so I'll do more benchmarks to find better numbers - see https://github.com/ggml-org/llama.cpp/pull/22673

Guide for MTP: https://unsloth.ai/docs/models/qwen3.6#mtp-guide

u/danielhanchen — 7 days ago
▲ 33 r/unsloth

Any plans to update Qwen3 CoderNext with MTP?

The Unsloth team has been truly amazing in breadth and depth of releases. I’m super excited to try the 27b in particular.

The Qwen3 CoderNext model has actually surprised me in functionality when thinking is less valuable, like feeding Aider.

I would be grateful if this got MTP turned on with Unsloth’s high quality quants!!

Anyone else a fan?

reddit.com
u/fixedupperfan — 6 days ago

Unsloth Studio is only loading into RAM

I finally got around to installing Unsloth Studio and am giving it a whirl on my Windows machine. I have an RTX6k and am trying to load the latest QWEN MTP model you guys created. However no matter what I do, all models are loading in RAM instead of the GPU.

From what i see online it means there's a cuda driver mismatch most likely. But i installed 12.9 and updated all my environment variables. So I'm not sure why this is still happening.

Any thoughts or help?

reddit.com
u/Demonicated — 6 days ago
▲ 253 r/unsloth

Ring-2.6-1T has been open-sourced!!!

​

Ring-2.6-1T is a 1T-parameter-scale thinking model with 63B active parameters, built for real-world agent workflows that require both strong capability and operational efficiency. It is optimized for coding agents, tool use, and long-horizon task execution, delivering leading results on benchmarks including PinchBench, ClawEval, TAU2-Bench, and GAIA2-search.

With adaptive reasoning effort across high and xhigh modes, Ring-2.6-1T dynamically allocates reasoning budget based on task complexity. This enables stronger performance with lower token overhead, especially in tool-heavy and multi-turn agent workflows.

Ring-2.6-1T is designed for advanced coding agents, complex reasoning pipelines, and large-scale autonomous systems where execution quality, latency, and cost efficiency all matter.

https://huggingface.co/inclusionAI/Ring-2.6-1T

u/rulingarayashiki — 8 days ago