u/Former_Bathroom_2329

This is a summary; my raw text file is pasted at the end of the post.
💻 System Configuration

GPU: 2 × AMD Radeon AI PRO R9700 (32624 MiB each, 65248 MiB ≈ 64 GiB total; llama-bench reports 81552 MiB across the 3 ROCm devices it detects)
GPU Settings: -500Hz clock, -88mV undervolt, 210W power limit
ROCm tools: ROCM-SMI 4.0.0, ROCM-SMI-LIB 7.8.0
Software: llama-bench (build 9a532ae4b)

📊 Performance Summary Table

Model: Qwen3.6-27B-IQ4_XS (Size: 14.62 GiB, Params: 27.32 B)
Run Parameters: -ngl 999 -fa 1 -sm tensor

Context (Ctx)	Prompt Processing Speed (pp512)	Token Generation Speed (tg128)
8K (8192)	1161.98 ± 21.74 t/s	37.80 ± 0.17 t/s
16K (16384)	1091.45 ± 20.30 t/s	36.93 ± 0.17 t/s
32K (32768)	956.87 ± 15.09 t/s	35.55 ± 0.17 t/s
64K (65536)	766.77 ± 20.19 t/s	33.02 ± 0.14 t/s
96K (98304)	660.56 ± 6.95 t/s	30.93 ± 0.11 t/s
128K (131072)	555.79 ± 45.52 t/s	29.08 ± 0.10 t/s

https://preview.redd.it/3olfw1d0352h1.png?width=421&format=png&auto=webp&s=0f0a5cf339b278da0b6784994436ae1e791c589b
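
Each row in the table comes from a separate llama-bench run that differs only in -d. A minimal Python sketch that prints those command lines as a dry run (nothing is executed) could look like the following. The helper names bench_command and sweep_lines are made up for illustration; the flags, model path, and depths are the ones used in this post.

```python
# Dry-run generator for the depth sweep: prints one llama-bench command per depth.
# Flags and paths are copied from the post; the helper names are hypothetical.

MODEL = "~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf"
DEPTHS = [8192, 16384, 32768, 65536, 98304, 131072]


def bench_command(depth, model=MODEL):
    """Return the argument list for one llama-bench run at the given depth."""
    return [
        "./llama-bench",
        "-m", model,
        "-ngl", "999",      # offload all layers to the GPUs
        "-fa", "1",         # flash attention on
        "-sm", "tensor",    # split mode used in the post
        "-d", str(depth),   # context depth before the timed test
        "-p", "512",        # prompt-processing test size (pp512)
        "-n", "128",        # token-generation test size (tg128)
    ]


def sweep_lines(depths=DEPTHS):
    """Return shell command lines, one per depth, limited to GPUs 0 and 1."""
    # Plain join is enough here because no argument contains spaces;
    # it also keeps the leading ~ unquoted so the shell can expand it.
    return ["HIP_VISIBLE_DEVICES=0,1 " + " ".join(bench_command(d)) for d in depths]


if __name__ == "__main__":
    for line in sweep_lines():
        print(line)
```

The printed lines can be pasted into a shell one at a time, which keeps each depth in its own process the way the original runs were done.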

📈 Visualized Trends Analysis

Prompt Processing (pp512):
- Declines steadily as context grows, but not linearly: the loss per added token shrinks at larger depths.
- The drop is gentle (about -17.7%) between 8K → 32K.
- From 32K → 64K the speed falls by about 190 t/s; after 64K the slope flattens to roughly 105 t/s per additional 32K. At the maximum 128K (131072) context, speed is 52% below the 8K baseline. The error bar widens sharply there (± 45.52 t/s versus ± 6.95 t/s at 96K), which points to unstable execution times, possibly from approaching the VRAM capacity limit.
Token Generation (tg128):
- Demonstrates a very smooth, stable descent.
- Performance loss is small: about 2.5 t/s from 32K → 64K, easing to about 1.9 t/s from 96K → 128K.
- The overall speed difference between 8K and 128K (131072) is just 23%, which is excellent for a dual-card setup running via ROCm in tensor split mode (-sm tensor).
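
The percentages and per-32K figures in this analysis can be rechecked directly from the summary table. The sketch below is only arithmetic on the mean values copied from that table; the function names are made up for illustration and the ± values are ignored.

```python
# depth (tokens) -> (pp512 mean t/s, tg128 mean t/s), copied from the summary table
RESULTS = {
    8192: (1161.98, 37.80),
    16384: (1091.45, 36.93),
    32768: (956.87, 35.55),
    65536: (766.77, 33.02),
    98304: (660.56, 30.93),
    131072: (555.79, 29.08),
}


def pct_drop(base, value):
    """Relative drop from base to value, in percent."""
    return (base - value) / base * 100.0


def loss_per_32k(d0, v0, d1, v1):
    """Throughput lost between two depths, normalized to a 32768-token step."""
    return (v0 - v1) / ((d1 - d0) / 32768)


def segment_losses(column):
    """Per-32K loss for each pair of adjacent depths (column 0 = pp512, 1 = tg128)."""
    depths = sorted(RESULTS)
    return [
        loss_per_32k(d0, RESULTS[d0][column], d1, RESULTS[d1][column])
        for d0, d1 in zip(depths, depths[1:])
    ]


if __name__ == "__main__":
    base_pp, base_tg = RESULTS[8192]
    for depth in sorted(RESULTS)[1:]:
        pp, tg = RESULTS[depth]
        print(f"{depth:>6}  pp512 -{pct_drop(base_pp, pp):4.1f}%  "
              f"tg128 -{pct_drop(base_tg, tg):4.1f}%")
    print("pp512 loss per 32K:", [round(x, 1) for x in segment_losses(0)])
    print("tg128 loss per 32K:", [round(x, 2) for x in segment_losses(1)])
```

Normalizing every segment to a 32K step makes the unevenly spaced depths (8K, 16K, 32K, then 32K increments) comparable, which is what a statement about slope needs.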

FILE (as is)
# t/s, ppt/s test 64GB VRAM

HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf -ngl 999 -fa 1 -sm tensor

# -2 8k (8192)

# -1 16k (16384)

# 0. 32k (32768)

# 1. 64K (65536)

# 2. 96K (98304)

# 3. 112K (114688)

# 4. 120K (122880)

# 5. 124K (126976)

# 6. 128K (131072)

ggml_cuda_init: found 3 ROCm devices (Total VRAM: 81552 MiB):

Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB

Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32, VRAM: 32624 MiB

build: 9a532ae4b (9222)

============================ ROCm System Management Interface ============================

========================================= VBIOS ==========================================

GPU[0] : VBIOS version: 113-R9700AT-F50

GPU[1] : VBIOS version: 113-R9700AT-F50

==========================================================================================

================================== End of ROCm SMI Log ===================================

-500hz (was max)

-88mV (avg by comm.)

210w limit (max drop)

ROCM-SMI version: 4.0.0+c2d9476115

ROCM-SMI-LIB version: 7.8.0

# -2

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 8192 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d8192 | 1161.98 ± 21.74 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d8192 | 37.80 ± 0.17 |

# -1

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 16384 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d16384 | 1091.45 ± 20.30 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d16384 | 36.93 ± 0.17 |

# 0.

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 32768 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d32768 | 956.87 ± 15.09 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d32768 | 35.55 ± 0.17 |

# 1.

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 65536 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d65536 | 766.77 ± 20.19 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d65536 | 33.02 ± 0.14 |

# 2.

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 98304 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d98304 | 660.56 ± 6.95 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d98304 | 30.93 ± 0.11 |

# 3.

HIP_VISIBLE_DEVICES=0,1 ./llama-bench \
-m ~/Downloads/models/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-IQ4_XS.gguf \
-ngl 999 -fa 1 -sm tensor \
-d 131072 -p 512 -n 128

| model | size | params | backend | ngl | sm | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -: | --------------: | -------------------: |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d131072 | 555.79 ± 45.52 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d131072 | 29.08 ± 0.10 |
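
For anyone who wants to rebuild the summary table from the raw output above instead of copying numbers by hand, here is a small Python sketch that pulls the test name, depth, mean, and ± value out of the result rows. It assumes the row layout shown in this file (a `ppN @ dDEPTH` or `tgN @ dDEPTH` cell followed by a `mean ± std` cell); the function name parse_rows is made up for illustration.

```python
import re

# Matches the last two cells of a llama-bench result row as pasted in this post,
# e.g. "| pp512 @ d8192 | 1161.98 ± 21.74 |". Header and separator rows do not match.
ROW = re.compile(
    r"\|\s*(pp|tg)(\d+)\s*@\s*d(\d+)\s*\|\s*([\d.]+)\s*±\s*([\d.]+)\s*\|\s*$"
)


def parse_rows(text):
    """Return one dict per result row found in the pasted llama-bench output."""
    rows = []
    for line in text.splitlines():
        match = ROW.search(line)
        if match:
            kind, size, depth, mean, std = match.groups()
            rows.append({
                "test": f"{kind}{size}",
                "depth": int(depth),
                "mean": float(mean),
                "std": float(std),
            })
    return rows


# Two rows copied from the 8192-depth run above.
SAMPLE = """\
| model | size | params | backend | ngl | sm | fa | test | t/s |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | pp512 @ d8192 | 1161.98 ± 21.74 |
| qwen35 27B IQ4_XS - 4.25 bpw | 14.62 GiB | 27.32 B | ROCm | 999 | tensor | 1 | tg128 @ d8192 | 37.80 ± 0.17 |
"""

if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```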

llama-bench-qwen36-27b-iq4_xs with 2 x R9700 (ubuntu 26.04)