u/Fabulous_Fact_606

https://preview.redd.it/xyvtkzwr005h1.png?width=645&format=png&auto=webp&s=aebd5b5ef79255247c9bc91fb69d8423a0c61f86

As you guys know, the next highest quant is Unsloth's /Qwen3.6-27B-UD-Q8_K_XL.gguf. With llama.cpp before, i was getting 30-50 tk/s. vllm was kicking llama's ass with its tensor splits speeding up the 2x3090s at 70+ tk/s for months. But I can't seem to find good quants for vllm and settle for some unknown qwen3.6-mtp-8.0...it was also making minor coding mistakes here and there... now being able to run unsloth's UDQ8KXL at 70+t/s, its code output are so clean, its like a different beast altogether.

Finally got around to test out the llama ver b9455b with tensor-split, and holy f. Results below:

 llama.cpp server for Qwen3.6-27B-MTP UD-Q8_K_XL (MTP speculative decoding).
export LD_LIBRARY_PATH=/home/llama.cpp-b9455/build/bin:${LD_LIBRARY_PATH:-}
exec /home/llama.cpp-b9455/build/bin/llama-server \
  --host 0.0.0.0 --port 8000 \
  --model /home/projects/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 262144 \
  --parallel 1 --kv-unified \
  --batch-size 4096 \
  --ubatch-size 512 \
  --tensor-split 50,50 -sm tensor \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --jinja \
  --no-mmap \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --metrics

-------------------------------

No more watching paint dry:

ctx = true context (incl. cached) send
pp = prefilled tokens / prefill time / prefill t/s
out = decode tokens / decode time / decode t/s

Example coding run below:

ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold

ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached

ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached

ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached

ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached

ctx 2.7K · pp 2.0K/1.5s 1294t/s · out 691/9.7s 71t/s

ctx 13K · pp 7.2K/5.0s 1421t/s · out 964/13.0s 73t/s · 5.5K cached

ctx 46K · pp 27K/19.8s 1370t/s · out 694/10.2s 67t/s · 19K cached

ctx 52K · pp 2.4K/2.6s 919t/s · out 464/6.9s 66t/s · 50K cached

ctx 58K · pp 6.5K/6.3s 1036t/s · out 101/1.5s 69t/s · 52K cached

ctx 60K · pp 2.1K/2.3s 889t/s · out 163/2.2s 74t/s · 58K cached

ctx 2.1K · pp 2.1K/2.3s 880t/s · out 1.9K/32.7s 57t/s

ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached · queue 1

ctx 7.3K · pp cached · out 4.5K/82.5s 54t/s · 7.3K cached

ctx 64K · pp 7.8K/5.6s 1402t/s · out 453/5.8s 78t/s · 57K cached

ctx 65K · pp 2.3K/2.8s 823t/s · out 99/1.4s 71t/s · 63K cached

ctx 65K · pp 120/0.4s · out 93/1.3s 70t/s · 65K cached

ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold

ctx 27K take 18.8s to fill cold. ctx100K will take ~60+s. Imagine every turn, waiting a minute.. or 5 minutes for pp to fill..

Another shout out to llama.cpp build b9455 2x3090