Recently picked up a 7900 XTX to run LLMs locally, providing a local LLM API for opencode and pi.dev.

Spent quite some time benchmarking performance. The results are below for reference. This is just a rough log; I won’t post the full llama-bench outputs here as there’s too much data.

1. ROCm + TurboQuant

Repo: https://github.com/domvox/llama.cpp-turboquant-hip Performance: 256k context window | PP: 970 t/s | TG: 29 t/s Comment: In current tests, although the response latency isn't as fast as online APIs, the quality of generated code is comparable to online APIs.

~/llama.cpp-turboquant-hip/rocm/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --mmproj ~/model/llm/qwen3.6-27b/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf   --alias qwen3.6-27b   --host 0.0.0.0   --port 8080   --n-gpu-layers 999   --ctx-size 262144   --batch-size 2048   --ubatch-size 768   --threads 8   --temp 1.0      --top-p 0.95     --top-k 20     --min-p 0.00   --presence_penalty 1.5   --cache-type-k turbo3   --cache-type-v turbo3

2. Vulkan

Repo: https://github.com/ggml-org/llama.cpp Performance: 256k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 47 t/s (Q8_0 is slightly slower)

~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b  --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

2.1 Vulkan + TurboQuant

Repo: https://github.com/TheTom/llama-cpp-turboquant Performance: 256k context window | KV-cache-type: Q4_0 | TG: 10 t/s. During decoding, GPU utilization stays below 30%, resulting in poor speed. Enabling MTP yields similar results.

~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

3. Vulkan + MTP

Repo/PR: https://github.com/ggml-org/llama.cpp/pull/22673 Performance: 256k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 67 t/s. VRAM usage is similar to running without MTP.

~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

3. ROCm + MTP

Repo/PR: https://github.com/ggml-org/llama.cpp/pull/22673 Performance: 4k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 67 t/s. Comment: There is an issue with the ROCm backend + MTP. VRAM spikes by 5GB at the start of a conversation for unknown reasons. Consequently, the maximum context length is limited to just over 8k. The current advantage of ROCm is its integration with TurboQuant.

~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0   -np 1 -c 4096 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

4. Hipfire (DFlash) v0.1.20

Repo: https://github.com/Kaden-Schutt/hipfire Performance: 4k context window | PP: 930 t/s | TG: 46 t/s. Comment: Only supports chat interactions. Speed is very fast with DFlash enabled by default. However, contexts larger than 8k cause freezes or crashes, making it unusable for opencode or pi. Will revisit in 3–6 months.

5. Legacy Card: Tesla P40 (24GB)

Repo: https://github.com/TheTom/llama-cpp-turboquant PR: https://github.com/ggml-org/llama.cpp/pull/22673

Without MTP

Performance: 196k context window | TG: 10 t/s

~/llama.cpp-mtp/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b  --cache-type-k turbo3 --cache-type-v turbo3 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

With MTP

Performance: 196k context window | TG: 17 t/s

~/llama-cpp-turboquant/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

Ran benchmarks using opencode + deepseek v4, results below:

If pursuing performance, Vulkan + MTP yields the best results.
MTP performance is not constant; it varies significantly depending on the context or task. Performance gains may differ when writing novels, planning daily tasks, or coding. Benchmarks are for reference only.
Currently, MTP only supports single-session conversations and cannot handle parallel requests.
The Vulkan backend has issues supporting TurboQuant; GPU utilization is insufficient and requires optimization.
ROCm + MTP suffers from VRAM issues, with unexplained spikes of 5GB, limiting usable context to slightly above 8k.

llama-bench Test Results

Environment

MTP Model: Qwen3.6-27B-Q4_K_M-mtp.gguf (15.82 GiB) https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF/
Non-MTP Model: Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf (17 GiB) https://huggingface.co/HauhauCS/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive
GPU: AMD Radeon RX 7900 XTX (24,560 MiB VRAM)
CPU: Genuine Intel(R) 13900HK ES
Threads: 8
n-gpu-layers: 999 (Fully offloaded to GPU)
Temp: 0.7, top-k: 20

ROCm (HIP) - KV Cache Type Comparison (Non-MTP)

Binary: ~/llama.cpp/rocm/bin/llama-bench (build 9046)

KV Cache Type	pp1024 (token/s)	tg128 (token/s)
f16 (default)	904.50	28.99
q4_0	898.01	28.81

Vulkan - KV Cache Type Comparison (Non-MTP)

Standard Build (`~/Downloads/llama.cpp/build-vulkan/bin/llama-bench`)

KV Cache Type	pp512 (token/s)	tg128 (token/s)
f16	765.94	37.06
Q4_0	769.82	37.17
Q8_0	273.25	37.13

Turboquant Build (`~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench`)

KV Cache Type	pp512 (token/s)	tg128 (token/s)
turbo2	193.43 ± 1.49	23.79 ± 0.17
turbo3	128.44 ± 1.31	21.88 ± 0.14
turbo4	178.94 ± 2.03	23.00 ± 0.14

> Note: During TurboQuant testing, GPU utilization was only ~30%, failing to fully leverage the GPU. The bottleneck likely lies in CPU-side quantization/dequantization operations. > q4_0/q8_0 tests failed in the turboquant build's llama-bench.

Vulkan + MTP

Binary: ~/llama.cpp/vulkan/bin/llama-cli Command: --spec-type mtp --spec-draft-n-max 3 --parallel 1 -p "tell me a jok" -n 128 -ngl 999

> Note: MTP uses -np 1 (single parallel sequence), so it cannot process in parallel. The draft model executes sequentially, limiting throughput.

Configuration	Generation Speed (token/s)
Non-MTP (f16)	39.5
MTP (q4_0)	81.2
MTP (q8_0)	77.5

ROCm + MTP

Binary: ~/llama.cpp/rocm/bin/llama-cli with LD_LIBRARY_PATH

Configuration	Generation Speed (token/s)
Non-MTP (f16)	29.4
MTP (q4_0)	53.6
MTP (turbo3)	47.4
MTP (turbo4)	57.2

Summary

Non-MTP (llama-bench)

KV Cache Type	PP (token/s)	TG128 (token/s)	Backend
f16	904.50	28.99	ROCm (pp1024)
q4_0	898.01	28.81	ROCm (pp1024)
f16	765.94	37.06	Vulkan Standard (pp512)
Q4_0	769.82	37.17	Vulkan Standard (pp512)
Q8_0	273.25	37.13	Vulkan Standard (pp512)
turbo2	193.43	23.79	Vulkan TurboQuant (pp512)
turbo4	178.94	23.00	Vulkan TurboQuant (pp512)
turbo3	128.44	21.88	Vulkan TurboQuant (pp512)

MTP (llama-cli)

Configuration	Generation Speed (token/s)	Backend
MTP (q4_0)	81.2	Vulkan
MTP (q8_0)	77.5	Vulkan
MTP (turbo4)	57.2	ROCm
MTP (q4_0)	53.6	ROCm
MTP (turbo3)	47.4	ROCm
Non-MTP (f16)	39.5	Vulkan
Non-MTP (f16)	29.4	ROCm

Key Observations

ROCm q4_0 performance is nearly identical to f16 (898 vs 905 token/s) — the difference is negligible.
TurboQuant types are only available in the TurboQuant Vulkan build. turbo2 offers the fastest prompt processing (193 token/s @ pp512). Generation speeds across turbo variants are similar (~22-24 token/s).
Standard Vulkan builds support Q4_0/Q8_0. Q4_0 matches f16 speed (~770 token/s pp512), while Q8_0 prompt processing is ~2.8x slower (273 token/s) but maintains the same generation speed (~37 token/s). Turbo types are exclusive to the TurboQuant build.
MTP significantly boosts generation speed: Vulkan+q4_0 reaches 81.2 token/s (+106% improvement over non-MTP), Vulkan+q8_0 reaches 77.5 token/s (+96%), and ROCm+turbo4 reaches 57.2 token/s (+95%).

u/Fit-Courage5400

Llama.cpp Turboquant + MTP on 7900 XTX

1. ROCm + TurboQuant

2. Vulkan

2.1 Vulkan + TurboQuant

3. Vulkan + MTP

3. ROCm + MTP

4. Hipfire (DFlash) v0.1.20

5. Legacy Card: Tesla P40 (24GB)

Without MTP

With MTP

Ran benchmarks using opencode + deepseek v4, results below:

llama-bench Test Results

Environment

ROCm (HIP) - KV Cache Type Comparison (Non-MTP)

Vulkan - KV Cache Type Comparison (Non-MTP)

Standard Build (`~/Downloads/llama.cpp/build-vulkan/bin/llama-bench`)

Turboquant Build (`~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench`)

Vulkan + MTP

ROCm + MTP

Summary

Non-MTP (llama-bench)

MTP (llama-cli)

Key Observations

u/Fit-Courage5400

Llama.cpp Turboquant + MTP on 7900 XTX

1. ROCm + TurboQuant

2. Vulkan

2.1 Vulkan + TurboQuant

3. Vulkan + MTP

3. ROCm + MTP

4. Hipfire (DFlash) v0.1.20

5. Legacy Card: Tesla P40 (24GB)

Without MTP

With MTP

Ran benchmarks using opencode + deepseek v4, results below:

llama-bench Test Results

Environment

ROCm (HIP) - KV Cache Type Comparison (Non-MTP)

Vulkan - KV Cache Type Comparison (Non-MTP)

Standard Build (~/Downloads/llama.cpp/build-vulkan/bin/llama-bench)

Turboquant Build (~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench)

Vulkan + MTP

ROCm + MTP

Summary

Non-MTP (llama-bench)

MTP (llama-cli)

Key Observations

Standard Build (`~/Downloads/llama.cpp/build-vulkan/bin/llama-bench`)

Turboquant Build (`~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench`)