▲ 13 r/LocalLLaMA
Qwen3.6:27B VRAM 16GB 5080: MTP Quant, Speeds, and Configs
For those of you running Qwen3.6:27B on 16GB VRAM, what quantization did you settle on?
For my primary purpose as a HA voice assistant, I've found my ideal target to be >50 tg and >800 pp. Qwen3.5:9B works really fast, but I'm experimenting with higher intelligence. Offloaded the vision model to CPU because it is infrequently used.
Currently running Qwen3.6-27B-Q3_K_S.gguf with 64 layers on GPU at the following speeds:
prompt eval time = 462.66 ms / 507 tokens ( 0.91 ms per token, 1095.83 tokens per second)
eval time = 18710.17 ms / 884 tokens ( 21.17 ms per token, 47.25 tokens per second)
total time = 19172.84 ms / 1391 tokens
draft acceptance rate = 0.59677 ( 481 accepted / 806 generated)
prompt eval time = 6001.34 ms / 8561 tokens ( 0.70 ms per token, 1426.51 tokens per second)
eval time = 2404.46 ms / 147 tokens ( 16.36 ms per token, 61.14 tokens per second)
total time = 8405.80 ms / 8708 tokens
draft acceptance rate = 0.80357 ( 90 accepted / 112 generated)
Config:
-m /models/Qwen3.6-27B/Qwen3.6-27B-Q3_K_S.gguf
--mmproj /models/Qwen3.6-27B/mmproj-BF16.gguf
--no-mmproj-offload
--host 0.0.0.0
--port 8080
--jinja
-fa on
--temp 0.6
--top-p 0.95
--top-k 20
--min_p 0.0
--presence-penalty 1.5
--repeat-penalty 1.0
--cache-ram 0
--fit on
-np 2
--fit-ctx 32000
--cache-type-k q8_0
--cache-type-v q8_0
--cache-type-k-draft q8_0
--cache-type-v-draft q8_0
--log-verbosity 4
--chat-template-kwargs '{"preserve_thinking": true}'
--spec-type draft-mtp
--spec-draft-n-max 2
u/InternationalNebula7 — 3 days ago