u/Fit-Courage5400

Llama.cpp Turboquant + MTP on 7900 XTX

Llama.cpp Turboquant + MTP on 7900 XTX

Recently picked up a 7900 XTX to run LLMs locally, providing a local LLM API for opencode and pi.dev.

Spent quite some time benchmarking performance. The results are below for reference. This is just a rough log; I won’t post the full llama-bench outputs here as there’s too much data.

1. ROCm + TurboQuant

Repo: https://github.com/domvox/llama.cpp-turboquant-hip Performance: 256k context window | PP: 970 t/s | TG: 29 t/s Comment: In current tests, although the response latency isn't as fast as online APIs, the quality of generated code is comparable to online APIs.

~/llama.cpp-turboquant-hip/rocm/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --mmproj ~/model/llm/qwen3.6-27b/mmproj-Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-f16.gguf   --alias qwen3.6-27b   --host 0.0.0.0   --port 8080   --n-gpu-layers 999   --ctx-size 262144   --batch-size 2048   --ubatch-size 768   --threads 8   --temp 1.0      --top-p 0.95     --top-k 20     --min-p 0.00   --presence_penalty 1.5   --cache-type-k turbo3   --cache-type-v turbo3

2. Vulkan

Repo: https://github.com/ggml-org/llama.cpp Performance: 256k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 47 t/s (Q8_0 is slightly slower)

~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b  --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

2.1 Vulkan + TurboQuant

Repo: https://github.com/TheTom/llama-cpp-turboquant Performance: 256k context window | KV-cache-type: Q4_0 | TG: 10 t/s. During decoding, GPU utilization stays below 30%, resulting in poor speed. Enabling MTP yields similar results.

~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

3. Vulkan + MTP

Repo/PR: https://github.com/ggml-org/llama.cpp/pull/22673 Performance: 256k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 67 t/s. VRAM usage is similar to running without MTP.

~/Downloads/llama.cpp/vulkan/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0 -np 1 -c 262144 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

3. ROCm + MTP

Repo/PR: https://github.com/ggml-org/llama.cpp/pull/22673 Performance: 4k context window | KV-cache-type: Q4_0 | PP: 730 t/s | TG: 67 t/s. Comment: There is an issue with the ROCm backend + MTP. VRAM spikes by 5GB at the start of a conversation for unknown reasons. Consequently, the maximum context length is limited to just over 8k. The current advantage of ROCm is its integration with TurboQuant.

~/llama.cpp/build/bin/llama-server   -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k q4_0 --cache-type-v q4_0   -np 1 -c 4096 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256

4. Hipfire (DFlash) v0.1.20

Repo: https://github.com/Kaden-Schutt/hipfire Performance: 4k context window | PP: 930 t/s | TG: 46 t/s. Comment: Only supports chat interactions. Speed is very fast with DFlash enabled by default. However, contexts larger than 8k cause freezes or crashes, making it unusable for opencode or pi. Will revisit in 3–6 months.

5. Legacy Card: Tesla P40 (24GB)

Repo: https://github.com/TheTom/llama-cpp-turboquant PR: https://github.com/ggml-org/llama.cpp/pull/22673

Without MTP

Performance: 196k context window | TG: 10 t/s

~/llama.cpp-mtp/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Q4_K_P.gguf   --alias qwen3.6-27b  --cache-type-k turbo3 --cache-type-v turbo3 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256
With MTP

Performance: 196k context window | TG: 17 t/s

~/llama-cpp-turboquant/build/bin/llama-server -m ~/model/llm/qwen3.6-27b/Qwen3.6-27B-Q4_K_M-mtp.gguf   --alias qwen3.6-27b   --spec-type mtp --spec-draft-n-max 3   --cache-type-k turbo3 --cache-type-v turbo3   -np 1 -c 196608 --temp 0.7 --top-k 20 -ngl 99   --port 8080 --host 0.0.0.0   -fa 1 -ub 256


Ran benchmarks using opencode + deepseek v4, results below:

  • If pursuing performance, Vulkan + MTP yields the best results.
  • MTP performance is not constant; it varies significantly depending on the context or task. Performance gains may differ when writing novels, planning daily tasks, or coding. Benchmarks are for reference only.
  • Currently, MTP only supports single-session conversations and cannot handle parallel requests.
  • The Vulkan backend has issues supporting TurboQuant; GPU utilization is insufficient and requires optimization.
  • ROCm + MTP suffers from VRAM issues, with unexplained spikes of 5GB, limiting usable context to slightly above 8k.

llama-bench Test Results

Environment


ROCm (HIP) - KV Cache Type Comparison (Non-MTP)

Binary: ~/llama.cpp/rocm/bin/llama-bench (build 9046)

KV Cache Type pp1024 (token/s) tg128 (token/s)
f16 (default) 904.50 28.99
q4_0 898.01 28.81

Vulkan - KV Cache Type Comparison (Non-MTP)

Standard Build (~/Downloads/llama.cpp/build-vulkan/bin/llama-bench)

KV Cache Type pp512 (token/s) tg128 (token/s)
f16 765.94 37.06
Q4_0 769.82 37.17
Q8_0 273.25 37.13

Turboquant Build (~/Downloads/llama-cpp-turboquant/build-vulkan/bin/llama-bench)

KV Cache Type pp512 (token/s) tg128 (token/s)
turbo2 193.43 ± 1.49 23.79 ± 0.17
turbo3 128.44 ± 1.31 21.88 ± 0.14
turbo4 178.94 ± 2.03 23.00 ± 0.14

> Note: During TurboQuant testing, GPU utilization was only ~30%, failing to fully leverage the GPU. The bottleneck likely lies in CPU-side quantization/dequantization operations. > q4_0/q8_0 tests failed in the turboquant build's llama-bench.


Vulkan + MTP

Binary: ~/llama.cpp/vulkan/bin/llama-cli Command: --spec-type mtp --spec-draft-n-max 3 --parallel 1 -p "tell me a jok" -n 128 -ngl 999

> Note: MTP uses -np 1 (single parallel sequence), so it cannot process in parallel. The draft model executes sequentially, limiting throughput.

Configuration Generation Speed (token/s)
Non-MTP (f16) 39.5
MTP (q4_0) 81.2
MTP (q8_0) 77.5

ROCm + MTP

Binary: ~/llama.cpp/rocm/bin/llama-cli with LD_LIBRARY_PATH

Configuration Generation Speed (token/s)
Non-MTP (f16) 29.4
MTP (q4_0) 53.6
MTP (turbo3) 47.4
MTP (turbo4) 57.2

Summary

Non-MTP (llama-bench)

KV Cache Type PP (token/s) TG128 (token/s) Backend
f16 904.50 28.99 ROCm (pp1024)
q4_0 898.01 28.81 ROCm (pp1024)
f16 765.94 37.06 Vulkan Standard (pp512)
Q4_0 769.82 37.17 Vulkan Standard (pp512)
Q8_0 273.25 37.13 Vulkan Standard (pp512)
turbo2 193.43 23.79 Vulkan TurboQuant (pp512)
turbo4 178.94 23.00 Vulkan TurboQuant (pp512)
turbo3 128.44 21.88 Vulkan TurboQuant (pp512)

MTP (llama-cli)

Configuration Generation Speed (token/s) Backend
MTP (q4_0) 81.2 Vulkan
MTP (q8_0) 77.5 Vulkan
MTP (turbo4) 57.2 ROCm
MTP (q4_0) 53.6 ROCm
MTP (turbo3) 47.4 ROCm
Non-MTP (f16) 39.5 Vulkan
Non-MTP (f16) 29.4 ROCm

Key Observations

  1. ROCm q4_0 performance is nearly identical to f16 (898 vs 905 token/s) — the difference is negligible.
  2. TurboQuant types are only available in the TurboQuant Vulkan build. turbo2 offers the fastest prompt processing (193 token/s @ pp512). Generation speeds across turbo variants are similar (~22-24 token/s).
  3. Standard Vulkan builds support Q4_0/Q8_0. Q4_0 matches f16 speed (~770 token/s pp512), while Q8_0 prompt processing is ~2.8x slower (273 token/s) but maintains the same generation speed (~37 token/s). Turbo types are exclusive to the TurboQuant build.
  4. MTP significantly boosts generation speed: Vulkan+q4_0 reaches 81.2 token/s (+106% improvement over non-MTP), Vulkan+q8_0 reaches 77.5 token/s (+96%), and ROCm+turbo4 reaches 57.2 token/s (+95%).
u/Fit-Courage5400 — 12 days ago