
u/GotHereLateNameTaken

Qwen 27B MTP config, llama.cpp, single 3090
What setup are you using for Qwen 27B on a single 3090?
Here's what I started using today. It has to compact the context often, but I'm worried about giving up accuracy and reliability by dropping to a lower quant just to free up room for more context:
llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 -ngl -1 -t 8 -ctk q8_0 -ctv q8_0 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 2 --fit off --mmproj /Models/q3.6/mmproj-Qwen3.6-27B-f16.gguf --no-mmproj-offload
I'm getting around 65 tok/s.
I've also seen these recommendations: https://github.com/noonghunna/club-3090/blob/master/docs/SINGLE_CARD.md
They seem to be using a Q4 quant. How are you weighing the tradeoffs?
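To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch of how much VRAM is left after weights and KV cache for each quant. Treat it as an estimate only: the bits-per-weight figures are approximate, and the layer count, KV head count, and head dimension in PLACEHOLDER_SHAPE are made-up placeholders, not the real Qwen3.6-27B architecture (read the actual values from the GGUF metadata). It also ignores compute buffers and any MTP overhead, and assumes the mmproj stays off the GPU as in my command.

```python
GIB = 1024 ** 3
VRAM_BYTES = 24 * GIB  # RTX 3090

# Approximate average bits per weight; real file sizes vary by model.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_S": 5.5}

# q8_0 stores 32 int8 values plus one fp16 scale: 34 bytes per 32 elements.
Q8_0_BITS = 8.5

# PLACEHOLDER values for illustration only, not the real architecture.
PLACEHOLDER_SHAPE = {
    "n_params": 27e9,
    "n_attn_layers": 16,
    "n_kv_heads": 4,
    "head_dim": 256,
}


def weights_bytes(n_params, bits_per_weight):
    """Size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim, bits_per_elem):
    """K and V each hold n_kv_heads * head_dim values per token per layer."""
    elems = 2 * n_ctx * n_attn_layers * n_kv_heads * head_dim
    return elems * bits_per_elem / 8


def headroom_gib(quant, n_ctx, shape=PLACEHOLDER_SHAPE, kv_bits=Q8_0_BITS):
    """VRAM left over after weights and KV cache, in GiB (can be negative)."""
    w = weights_bytes(shape["n_params"], BITS_PER_WEIGHT[quant])
    kv = kv_cache_bytes(
        n_ctx,
        shape["n_attn_layers"],
        shape["n_kv_heads"],
        shape["head_dim"],
        kv_bits,
    )
    return (VRAM_BYTES - w - kv) / GIB


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        for n_ctx in (65536, 131072):
            print(f"{quant} @ {n_ctx} ctx: {headroom_gib(quant, n_ctx):.1f} GiB left")
```

With real architecture numbers plugged in, this should at least show whether the drop from Q5 to Q4 buys enough extra context to reduce compaction, or whether the KV cache is small enough that the quant choice barely moves the context limit.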
How is Aion UI with local LLMs?
Anyone tried this?
How extensible is it?
Does it work well with Qwen 27B?
Does it bloat the context window or manage it well?