
llama-bench: Qwen3.6-27B-IQ4_XS with 2 x R9700 (Ubuntu 26.04)
This is a summary; my raw text file follows at the end.
💻 System Configuration
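- GPU: 2 × AMD Radeon AI PRO R9700 (gfx1201), 32624 MiB each, 65248 MiB (about 64 GiB) across the two cards. The llama-bench log reports a total of 81552 MiB because it counts 3 ROCm devices.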
- GPU Settings: -500 MHz clock offset, -88 mV undervolt, 210 W power limit
- ROCm SMI: ROCM-SMI 4.0.0+c2d9476115, ROCM-SMI-LIB 7.8.0
- Software: llama-bench (build 9a532ae4b, 9222)
📊 Performance Summary Table
- Model: Qwen3.6-27B-IQ4_XS (Size: 14.62 GiB, Params: 27.32 B)
- Run Parameters: -ngl 999 -fa 1 -sm tensor (with -p 512 -n 128 and -d set to the context depth of each row)
| Context (Ctx) | Prompt Processing Speed (pp512) | Token Generation Speed (tg128) |
|---|---|---|
| 8K (8192) | 1161.98 ± 21.74 t/s | 37.80 ± 0.17 t/s |
| 16K (16384) | 1091.45 ± 20.30 t/s | 36.93 ± 0.17 t/s |
| 32K (32768) | 956.87 ± 15.09 t/s | 35.55 ± 0.17 t/s |
| 64K (65536) | 766.77 ± 20.19 t/s | 33.02 ± 0.14 t/s |
| 96K (98304) | 660.56 ± 6.95 t/s | 30.93 ± 0.11 t/s |
| 128K (131072) | 555.79 ± 45.52 t/s | 29.08 ± 0.10 t/s |
📈 Trend Analysis
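- Prompt Processing (pp512):
  - Declines steadily as context grows, but not linearly: the absolute loss per added token shrinks at longer contexts.
  - The drop is gentle between 8K and 32K (-17.7%, from 1161.98 to 956.87 t/s).
  - From 32K to 64K it loses about 190 t/s, then about 105 t/s for each further 32K. At the maximum 128K (131072) context, speed is 52% below the 8K baseline. The error bar also widens sharply there (± 45.52 t/s, about 8% of the mean, versus 1 to 3% at the other depths), which suggests unstable run times at this depth. VRAM pressure is one possible cause, but these runs do not confirm it.
- Token Generation (tg128):
  - Descends smoothly and stays stable (± 0.10 to 0.17 t/s at every depth).
  - The loss is small: about 1.9 to 2.5 t/s for each additional 32K of context from 32K onward.
  - The overall difference between 8K and 128K is just 23% (37.80 to 29.08 t/s), which is a good result for a dual-card setup running via ROCm in tensor split mode (-sm tensor).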
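The percentages and per-32K losses in this analysis can be rechecked from the summary table. Below is a minimal Python sketch that uses only the table's reported means and standard deviations. The helper names (drop_vs_baseline, loss_per_32k) are made up for illustration and are not part of llama-bench.

```python
# Mean throughput (t/s) from the summary table, keyed by context depth in tokens.
PP512 = {8192: 1161.98, 16384: 1091.45, 32768: 956.87,
         65536: 766.77, 98304: 660.56, 131072: 555.79}
TG128 = {8192: 37.80, 16384: 36.93, 32768: 35.55,
         65536: 33.02, 98304: 30.93, 131072: 29.08}
# Reported standard deviations (t/s) for pp512, same keys.
PP512_STD = {8192: 21.74, 16384: 20.30, 32768: 15.09,
             65536: 20.19, 98304: 6.95, 131072: 45.52}


def drop_vs_baseline(series):
    """Percent slowdown at each depth relative to the shallowest depth."""
    base = series[min(series)]
    return {d: 100.0 * (base - v) / base for d, v in series.items()}


def loss_per_32k(series):
    """t/s lost between neighbouring depths, scaled to a 32768-token step."""
    depths = sorted(series)
    return [((lo, hi), (series[lo] - series[hi]) * 32768 / (hi - lo))
            for lo, hi in zip(depths, depths[1:])]


for name, series in (("pp512", PP512), ("tg128", TG128)):
    print(name)
    for depth, pct in drop_vs_baseline(series).items():
        print(f"  d{depth}: -{pct:.1f}% vs d8192")
    for (lo, hi), loss in loss_per_32k(series):
        print(f"  d{lo} -> d{hi}: {loss:.2f} t/s lost per 32K")

# Error bar as a share of the mean, to compare run-to-run noise across depths.
for depth, std in PP512_STD.items():
    print(f"pp512 d{depth}: +/- {100.0 * std / PP512[depth]:.1f}% of mean")
```

Scaling every segment to a 32768-token step makes the 8K to 16K and 16K to 32K segments comparable with the wider ones, so it is easy to see whether the decline speeds up or slows down with depth.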
FILE (as is)
# t/s, ppt/s test 64GB VRAM
HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf -ngl 999 -fa 1 -sm tensor
<!-- TEST ctx list -->
# -2 8k (8192)
# -1 16k (16384)
# 0. 32k (32768)
# 1. 64K (65536)
# 2. 96K (98304)
# 3. 112K (114688)
# 4. 120K (122880)
# 5. 124K (126976)
# 6. 128K (131072)
<!-- GPU -->
ggml_cuda_init: found 3 ROCm devices (Total VRAM: 81552 MiB):
Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB
<!-- Llamacpp ver. -->
build: 9a532ae4b (9222)
<!-- Bios ver. -->
============================ ROCm System Management Interface ============================
========================================= VBIOS ==========================================
GPU[0] : VBIOS version: 113-R9700AT-F50
GPU[1] : VBIOS version: 113-R9700AT-F50
==========================================================================================
================================== End of ROCm SMI Log ===================================
<!-- Clock for R9700 -->
-500hz (was max)
-88mV (avg by comm.)
210w limit (max drop)
<!-- Driver ver. -->
ROCM-SMI version: 4.0.0+c2d9476115
ROCM-SMI-LIB version: 7.8.0
<!-- RESULTS -->
# -2
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 8192 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d8192 | 1161.98 ± 21.74 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d8192 | 37.80 ± 0.17 |
# -1
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 16384 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d16384 | 1091.45 ± 20.30 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d16384 | 36.93 ± 0.17 |
# 0.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 32768 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d32768 | 956.87 ± 15.09 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d32768 | 35.55 ± 0.17 |
# 1.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 65536 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d65536 | 766.77 ± 20.19 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d65536 | 33.02 ± 0.14 |
# 2.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 98304 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d98304 | 660.56 ± 6.95 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d98304 | 30.93 ± 0.11 |
# 6.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 131072 -p 512 -n 128
| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d131072 | 555.79 ± 45.52 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d131072 | 29.08 ± 0.10 |
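To build a summary table from result rows like the ones above without copying numbers by hand, the markdown output can be parsed with a small script. This is a sketch that assumes the row layout shown in this file (a test cell such as "pp512 @ d8192" followed by a "mean ± std" cell). The function name parse_rows and the record fields are hypothetical, and other llama-bench builds may lay out the table differently.

```python
import re

# Matches the last two cells of a result row, e.g.
# "| pp512 @ d8192 | 1161.98 ± 21.74 |"
ROW = re.compile(
    r"\|\s*(pp|tg)(\d+)\s*@\s*d(\d+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|"
)


def parse_rows(text):
    """Return one record per result row; header and separator rows are skipped."""
    records = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match is None:
            continue
        kind, n_tokens, depth, mean, std = match.groups()
        records.append({
            "test": f"{kind}{n_tokens}",   # pp512 or tg128
            "depth": int(depth),           # context depth in tokens (-d)
            "mean": float(mean),           # t/s
            "std": float(std),             # t/s
        })
    return records


# Two rows copied from the 8K run in this file.
sample = """\
| model | size | params | backend | ngl | sm | fa | test | t/s |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d8192 | 1161.98 ± 21.74 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d8192 | 37.80 ± 0.17 |
"""

for record in parse_rows(sample):
    print(record)
```

llama-bench also has an -o option for machine-readable output formats such as JSON or CSV. If the runs are repeated, that is probably a sturdier route than regex parsing.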