
Another shout out to llama.cpp build b9455 2x3090
As you guys know, the next highest quant is Unsloth's /Qwen3.6-27B-UD-Q8_K_XL.gguf. With llama.cpp before, i was getting 30-50 tk/s. vllm was kicking llama's ass with its tensor splits speeding up the 2x3090s at 70+ tk/s for months. But I can't seem to find good quants for vllm and settle for some unknown qwen3.6-mtp-8.0...it was also making minor coding mistakes here and there... now being able to run unsloth's UDQ8KXL at 70+t/s, its code output are so clean, its like a different beast altogether.
Finally got around to test out the llama ver b9455b with tensor-split, and holy f. Results below:
llama.cpp server for Qwen3.6-27B-MTP UD-Q8_K_XL (MTP speculative decoding).
export LD_LIBRARY_PATH=/home/llama.cpp-b9455/build/bin:${LD_LIBRARY_PATH:-}
exec /home/llama.cpp-b9455/build/bin/llama-server \
--host 0.0.0.0 --port 8000 \
--model /home/projects/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q8_K_XL.gguf \
--n-gpu-layers 99 \
--ctx-size 262144 \
--parallel 1 --kv-unified \
--batch-size 4096 \
--ubatch-size 512 \
--tensor-split 50,50 -sm tensor \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--jinja \
--no-mmap \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--metrics
-------------------------------
No more watching paint dry:
ctx= true context (incl. cached) sendpp= prefilled tokens / prefill time / prefill t/sout= decode tokens / decode time / decode t/s
Example coding run below:
ctx 27K · pp 27K/18.8s 1417t/s · out 248/3.0s 81t/s · cold
ctx 31K · pp 3.8K/3.2s 1171t/s · out 353/4.7s 74t/s · 27K cached
ctx 37K · pp 6.7K/5.7s 1184t/s · out 335/4.5s 74t/s · 31K cached
ctx 43K · pp 5.5K/4.9s 1121t/s · out 357/5.0s 71t/s · 37K cached
ctx 44K · pp 1.3K/1.5s 861t/s · out 377/5.2s 72t/s · 43K cached
ctx 2.7K · pp 2.0K/1.5s 1294t/s · out 691/9.7s 71t/s
ctx 13K · pp 7.2K/5.0s 1421t/s · out 964/13.0s 73t/s · 5.5K cached
ctx 46K · pp 27K/19.8s 1370t/s · out 694/10.2s 67t/s · 19K cached
ctx 52K · pp 2.4K/2.6s 919t/s · out 464/6.9s 66t/s · 50K cached
ctx 58K · pp 6.5K/6.3s 1036t/s · out 101/1.5s 69t/s · 52K cached
ctx 60K · pp 2.1K/2.3s 889t/s · out 163/2.2s 74t/s · 58K cached
ctx 2.1K · pp 2.1K/2.3s 880t/s · out 1.9K/32.7s 57t/s
ctx 63K · pp 6.0K/4.8s 1266t/s · out 856/12.3s 69t/s · 57K cached · queue 1
ctx 7.3K · pp cached · out 4.5K/82.5s 54t/s · 7.3K cached
ctx 64K · pp 7.8K/5.6s 1402t/s · out 453/5.8s 78t/s · 57K cached
ctx 65K · pp 2.3K/2.8s 823t/s · out 99/1.4s 71t/s · 63K cached
ctx 65K · pp 120/0.4s · out 93/1.3s 70t/s · 65K cached
ctx 68K · pp 68K/54.2s 1247t/s · out 2.0K/28.8s 68t/s · cold
ctx 27K take 18.8s to fill cold. ctx100K will take ~60+s. Imagine every turn, waiting a minute.. or 5 minutes for pp to fill..