u/whodoneit1

r/ROCm r/LocalLLaMA r/Vllm r/LocalLLM r/vulkan

▲ 54 r/vulkan+2 crossposts

ROCm vs Vulkan vs vLLM on Dual R9700s

Sharing the numbers I got running Qwen3.6 35B-A3B and Qwen3.6 27B, and the big jump I saw moving to vLLM. I was only expecting better concurrency, but generation speed ended up much higher too.

llama.cpp services running ROCm and Vulkan

Model	Backend	Gen
35B-A3B Q6_K_XL (MTP)	ROCm	~106 t/s
27B Q6_K_XL (MTP)	ROCm	~44 t/s
35B-A3B Q6_K_XL (MTP)	Vulkan	~87 t/s
27B Q6_K_XL (MTP)	Vulkan	~41 t/s

vLLM

Model	Backend	Gen
35B-A3B MoE FP8 (MTP)	ROCm + AITER	156 t/s
27B FP8 (MTP)	ROCm + AITER	69 t/s

EDIT: here are prefill speeds for the 35B-A3B, since several people asked.

Pulled these from the vLLM logger.

Prompt size (tokens)	Prefill speed	Tokens ÷ TTFT
~10K	~10,000 tok/s	10,033 ÷ 0.98 s
~40K	~6,600 tok/s	39,997 ÷ 6.0 s
~70K	~5,500 tok/s	70,027 ÷ 12.7 s
~100K	~4,400 tok/s	99,991 ÷ 22.9 s
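
For anyone who wants to redo the arithmetic on their own logs, here is a minimal sketch of the tokens ÷ TTFT estimate used in the table above. The token counts and TTFT values are copied from the table; the function name is just illustrative. Keep in mind that TTFT also includes scheduling overhead, so this is an approximation of prefill throughput rather than a kernel-level measurement.

```python
# (prompt_tokens, ttft_seconds) pairs copied from the table above.
SAMPLES = [
    (10_033, 0.98),
    (39_997, 6.0),
    (70_027, 12.7),
    (99_991, 22.9),
]


def prefill_tok_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Approximate prefill throughput as prompt tokens divided by TTFT."""
    if ttft_s <= 0:
        raise ValueError("TTFT must be positive")
    return prompt_tokens / ttft_s


for tokens, ttft in SAMPLES:
    rate = prefill_tok_per_s(tokens, ttft)
    print(f"{tokens:>7,} tokens / {ttft:>5.2f} s = {rate:>8,.0f} tok/s")
```

The unrounded results land slightly off the rounded figures in the table, which is expected since the table values are approximate.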

I'm curious what speeds others are seeing on Qwen3.6 35B-A3B and 27B.

u/whodoneit1 — 6 days ago

▲ 10 r/LocalLLM

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

Posting our setup for the (apparently growing) club of people running multiple R9700s on vLLM. Big shout-out to u/AustinM731 — their AITER Unified Attention post was the single most useful thing we found, and I want to (a) confirm it works, (b) share where our findings lined up vs differed, and (c) save the next person the week we spent going down dead ends.

The rig

GPUs: 2× AMD Radeon AI PRO R9700 (gfx1201 / RDNA4, 32 GB each), TP=2
Board/CPU: ASRock X870E, Ryzen, 60 GB RAM
OS: Fedora 44 Server, kernel 7.0.11 (the ~100 W idle-draw bug is fixed in 7.0, so it's not an issue for us)
Model: Qwen3.6-35B-A3B-FP8 (the 35B hybrid Gated-DeltaNet + attention MoE, ~3B active), native 262K context
Serving: MTP speculative decoding (n=3), AITER Unified Attention, bf16 KV cache, TunableOp, --enable-chunked-prefill

Exact versions (so people know what this is on)

GPU arch     : gfx1201 (RDNA4) ×2, TP=2
OS / kernel  : Fedora Linux 44 (Server), kernel 7.0.11-200.fc44
vLLM         : 0.22.1
ROCm / HIP   : 7.2.x (torch.version.hip = 7.2.53211)
PyTorch      : 2.10.0 (+git8514f05)
Triton       : 3.6.0
AITER        : present (gfx1201 gate relaxed; see below)
base image   : vllm/vllm-openai-rocm:v0.22.1  (we run a committed image with 2 one-line patches)
runtime      : podman + systemd (--user), --ipc=host, NCCL_PROTO=Simple, ROCR_VISIBLE_DEVICES=0,1

Note on versioning: vLLM moves fast and the gfx1201 gates change between releases. On 0.22.1 the AITER unified-attention backend is already built in (just gated to CDNA). On the 0.19/0.20 images others used, you had to rebuild. So your patch surface depends heavily on your vLLM version — worth stating yours when you compare numbers.
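
If you want to state your versions the same way when comparing numbers, a small script along these lines can collect them. This is only a sketch: the distribution names ("vllm", "torch", "triton") are assumptions and may differ on your image (ROCm builds sometimes package Triton under another name), and AITER is left out because its package name isn't given above.

```python
from importlib import metadata

# Assumed distribution names; adjust to match your environment.
PACKAGES = ("vllm", "torch", "triton")


def collect_versions(packages=PACKAGES) -> dict:
    """Return installed versions, marking anything that is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    try:
        import torch  # only needed to read the HIP version string

        # torch.version.hip is None on non-ROCm builds.
        versions["hip"] = getattr(torch.version, "hip", None) or "n/a (not a ROCm build)"
    except ImportError:
        versions["hip"] = "n/a (torch not importable)"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name:<8}: {version}")
```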

The thing that actually mattered: the long-context decode cliff

For ages we only ever benchmarked at ~8K context and were happy (~100+ tok/s). Then we benchmarked deep, and decode fell off a cliff:

context	decode speed, ROCm prefill-decode attn (before)
~8K	~100 tok/s
~21K	56 tok/s
~79K	14 tok/s
That ~7× collapse is not normal memory-bandwidth decay — it was the unoptimized ROCm attention path on gfx1201 scaling badly. The fix is exactly what u/AustinM731 found: AITER Unified Attention (ROCM_AITER_UNIFIED_ATTN).

On vLLM 0.22.1 the backend is already compiled in — it's just gated to CDNA (MI300/MI350). Relax one gate and select it:

In vllm/_aiter_ops.py, is_aiter_found_and_supported() returns on_mi3xx(). Make it also allow gfx1x:

return on_mi3xx() or bool(getattr(_rocmmod, "_ON_GFX1X", False))

Run with --attention-backend ROCM_AITER_UNIFIED_ATTN and VLLM_ROCM_USE_AITER=1, and turn the other AITER paths off (VLLM_ROCM_USE_AITER_MHA=0, and likewise _PAGED_ATTN=0, _MOE=0, _LINEAR=0): those have no gfx1201 kernel and will crash MoE init otherwise. Also set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE.
This setup auto-sets the KV block size to 64 (a power of 2), which sidesteps the AITER TILE_SIZE assert on the Qwen3.6 hybrid layout.
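
To make the flag and environment combination above concrete, here is a rough sketch that assembles it in Python without launching anything. Assumptions to flag: the shorthand _PAGED_ATTN/_MOE/_LINEAR is expanded to the full VLLM_ROCM_USE_AITER_* names, the model string is a placeholder for whatever path or repo you serve, and the MTP speculative-decoding and TunableOp settings are left out because their exact flags aren't given in this post.

```python
import os
import shlex

# Placeholder model identifier; substitute the path or repo you actually serve.
MODEL = "Qwen3.6-35B-A3B-FP8"


def aiter_unified_env(base=None) -> dict:
    """Environment for AITER unified attention with the other AITER paths off."""
    env = dict(os.environ if base is None else base)
    env.update({
        "VLLM_ROCM_USE_AITER": "1",
        # Assumed expansions of the shorthand used in the post.
        "VLLM_ROCM_USE_AITER_MHA": "0",
        "VLLM_ROCM_USE_AITER_PAGED_ATTN": "0",
        "VLLM_ROCM_USE_AITER_MOE": "0",
        "VLLM_ROCM_USE_AITER_LINEAR": "0",
        "FLASH_ATTENTION_TRITON_AMD_ENABLE": "TRUE",
    })
    return env


def serve_argv(model=MODEL) -> list:
    """Command line for a TP=2 server using the unified attention backend."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", "2",
        "--attention-backend", "ROCM_AITER_UNIFIED_ATTN",
        "--enable-chunked-prefill",
    ]


if __name__ == "__main__":
    env = aiter_unified_env(base={})
    exports = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    print(exports, shlex.join(serve_argv()))
```

Printing the command instead of running it makes it easy to paste into a systemd unit or compose file and diff against what you already have.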

Result (Qwen3.6-35B-A3B-FP8, TP2, MTP3, bf16 KV) — strictly faster at every depth, gap widens with context:

context	before (tok/s)	AITER unified (tok/s)
~8.7K	~100	136
~21K	56	83
~79K	14	41 (≈3×)
~118K	collapsed	30
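
For reference, the ratios quoted here can be recomputed from the table with a few lines. This is just arithmetic on the numbers above (the ~118K row is skipped because the "before" run collapsed and has no figure).

```python
# (context, before_tok_s, after_tok_s) copied from the table above.
ROWS = [
    ("~8.7K", 100, 136),
    ("~21K", 56, 83),
    ("~79K", 14, 41),
]


def speedup(before: float, after: float) -> float:
    """Decode speedup of the AITER unified path over the old path."""
    return after / before


for context, before, after in ROWS:
    print(f"{context:>6}: {before:>3} -> {after:>3} tok/s ({speedup(before, after):.2f}x)")

# How far decode falls between the shallowest and deepest measured contexts.
falloff_before = ROWS[0][1] / ROWS[-1][1]
falloff_after = ROWS[0][2] / ROWS[-1][2]
print(f"falloff 8.7K -> 79K: {falloff_before:.1f}x before, {falloff_after:.1f}x after")
```

So the speedup grows from about 1.4× at ~8.7K to about 2.9× at ~79K, and the falloff across that range shrinks from roughly 7× to roughly 3.3×.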

Quality unchanged (still bf16 KV). For a context-filling coding agent this was night and day.

How our findings compared to u/AustinM731's post

Confirmed / same:

AITER Unified Attention is THE long-context fix on gfx1201. Relaxing the CDNA gate to include RDNA4 is the move.
MTP=3 is the sweet spot (~84% draft acceptance for us, free single-stream speed).
That fast attention path is bf16/fp16 KV only — you can't pair it with FP8 KV.
The 100 W idle issue is fixed in kernel 7.0.
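
As a rough way to think about what ~84% acceptance with n=3 buys, here is a sketch of tokens emitted per verification step under two different readings of that figure. Both readings are assumptions on my part, since the post doesn't say how the 84% was measured: (a) 84% is accepted drafts divided by proposed drafts, or (b) each draft position is accepted independently with probability 0.84 and drafting stops at the first rejection.

```python
def tokens_per_step_aggregate(accept_rate: float, n_draft: int) -> float:
    """Reading (a): accept_rate = accepted drafts / proposed drafts.

    Each step emits the accepted drafts plus one token from the target model.
    """
    return 1.0 + accept_rate * n_draft


def tokens_per_step_iid(accept_prob: float, n_draft: int) -> float:
    """Reading (b): independent per-position acceptance, stop at first reject.

    Expected tokens per step = sum of accept_prob**i for i in 0..n_draft.
    """
    return sum(accept_prob ** i for i in range(n_draft + 1))


if __name__ == "__main__":
    print(f"aggregate reading: {tokens_per_step_aggregate(0.84, 3):.2f} tokens/step")
    print(f"i.i.d. reading   : {tokens_per_step_iid(0.84, 3):.2f} tokens/step")
```

Either way it works out to a bit over 3 tokens per target-model step. That is an upper bound on the decode speedup rather than the speedup itself, since the draft passes and verification are not free.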

Different / what we'd add:

Newer vLLM = less patching. They were on 0.19.1/0.20.2 and rebuilt images; on 0.22.1 the unified-attn backend already ships — it's a one-line Python gate relax + the --attention-backend flag. No full rebuild.
TP=2 on hybrid models needs the GDN-KKT fix. vLLM ≥0.21 mis-compiles the Gated-DeltaNet chunk_scaled_dot_kkt Triton kernel on gfx1201 (a Hopper WGMMA layout change, #42076) → TP≥2 hangs at startup with a misleading shm_broadcast timeout. One-line revert of that operand layout on non-CUDA fixes it. If you run Qwen3.6/Qwen3-Next hybrids on TP2, you probably need this.
We went deep on FP8 KV and concluded it's a dead end on gfx1201: skip it. The 262K-context dream via FP8 KV isn't worth it. The stock vLLM fp8 decode kernel does a per-element fp32 dequant that's ~3× slower. We wrote a kernel patch (fold the scalar scale, then cast to bf16) that took it from 34 to 41.5 tok/s, and even probed native fp8 WMMA (it compiles on RDNA4!) and int32-packed loads. None of them beat bf16, and AITER unified requires bf16 KV anyway. Qwen3.6's KV footprint is tiny, so just run bf16.
The HIP "custom paged attention" kernel is unreachable for this model. It's hard-gated off for hybrid GDN models (stride-padded KV layout → has_native_kv_cache_layout is false), so even bf16 falls back to Triton. Don't chase it for Qwen3.6.
Context headroom: with bf16 KV our pool is ~768K tokens, so at the model's native 262K you still get ~2.9× concurrency. No need for FP8 KV to reach max context.
2 GPUs vs their 4: our single-stream decode holds ~30 tok/s at 118K (they hold higher on 4×). Long-context decode scales with how much compute/bandwidth you can throw at it.
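
The context-headroom figure above is simple division, sketched here with the rounded numbers from the post (768K-token pool, 262K native context). The exact values on your box will depend on what vLLM reports for the KV cache size at startup.

```python
import math

KV_POOL_TOKENS = 768_000      # rounded bf16 KV pool size from the post
NATIVE_CONTEXT = 262_000      # rounded native context length from the post


def full_context_concurrency(pool_tokens: int, context_tokens: int) -> float:
    """How many maximum-length sequences fit in the KV pool at once."""
    return pool_tokens / context_tokens


ratio = full_context_concurrency(KV_POOL_TOKENS, NATIVE_CONTEXT)
print(f"{ratio:.1f}x concurrency at full context")
print(f"{math.floor(ratio)} whole max-length sequences fit at the same time")
```

In practice most requests are far shorter than the native context, so the number of sequences that fit concurrently is usually much higher than this worst case.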

TL;DR config for gfx1201 + Qwen3.6 on vLLM 0.22.1

Patch 1: revert #42076 operand layout on non-CUDA (GDN-KKT) → TP2 works
Patch 2: allow ROCM_AITER_UNIFIED_ATTN on gfx1x in _aiter_ops.py
Flags: --attention-backend ROCM_AITER_UNIFIED_ATTN, AITER on but MHA/paged/MoE/linear off, MTP n=3, bf16 KV, TunableOp, chunked prefill
Don't bother with FP8 KV.

Happy to share the exact patches/compose if anyone wants them. Thanks again to u/AustinM731 — the unified-attention tip was the unlock.

u/whodoneit1 — 9 days ago

▲ 12 r/Vllm

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

u/whodoneit1 — 9 days ago

▲ 32 r/ROCm

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

u/whodoneit1 — 9 days ago

▲ 25 r/Vllm+2 crossposts

2× Radeon AI PRO R9700 (RDNA4/gfx1201) on vLLM 0.22.1 — how we fixed the long-context decode cliff (and what we learned chasing FP8)

u/whodoneit1 — 9 days ago