Qwen3.6-27B: MTP + Optimized KV cache?
I'm on an M5 Pro 48GB. I just started using oMLX and love it so far.
Now I'm playing around with Qwen3.6-27B with MTP (oMLX 0.3.9-dev2) and it's working really well, except that I run into OOM for contexts > ~65k. So far, I've downloaded the official full-precision Qwen3.6-27B from HF and created oQ4 / oQ6 versions myself. But the more context I use, the quicker I run into OOM crashes. The 128k context benchmark works sometimes, but usually crashes the entire computer.
However, when using llama.cpp as per this post: https://www.reddit.com/r/LocalLLaMA/comments/1t57xuu/25x_faster_inference_with_qwen_36_27b_using_mtp/
I'm able to run much larger contexts (256k), with MTP support and much lower memory consumption, using this command:
llama-server \
-m Qwen3.6-27B-Q4_K_M-mtp.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 \
-c 262144 \
--temp 0.7 \
--top-k 20 \
-ngl 99 \
--port 8081
I'm guessing it has to do with the explanation in the post: that Qwen's hybrid model only needs a KV cache for 16 of its 65 layers, and drivers that allocate naively will reserve much more memory than necessary? Also, llama.cpp allows setting the KV cache to 8-bit rather than full precision (which I guess oMLX uses by default?).
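To put rough numbers on that guess, here's a back-of-the-envelope sketch of how KV cache size scales with the number of cached layers and the cache precision. The 16-of-65 layer split and the 262144 context come from above; the KV head count and head dimension are placeholders I picked for illustration, not values from the actual Qwen3.6-27B config, so only the ratios are meaningful. The q8_0 cost of 34 bytes per 32 values is ggml's block layout (32 int8 values plus one fp16 scale).

```python
# Rough KV cache size estimate. N_KV_HEADS and HEAD_DIM are ASSUMED
# placeholders, not read from the real model config.

GIB = 1024 ** 3

N_CTX = 262144        # context length from the llama-server command (-c)
N_KV_HEADS = 4        # assumption: hypothetical number of KV heads
HEAD_DIM = 256        # assumption: hypothetical per-head dimension

F16_BYTES = 2.0       # bytes per value at 16-bit precision
Q8_0_BYTES = 34 / 32  # ggml q8_0: 32 int8 values + one fp16 scale per block


def kv_cache_bytes(n_ctx, n_kv_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes for K and V across all layers that keep a per-token cache."""
    values_per_token_per_layer = 2 * n_kv_heads * head_dim  # K plus V
    return n_ctx * n_kv_layers * values_per_token_per_layer * bytes_per_value


scenarios = [
    ("all 65 layers, 16-bit", 65, F16_BYTES),
    ("16 layers, 16-bit", 16, F16_BYTES),
    ("16 layers, q8_0", 16, Q8_0_BYTES),
]

for label, n_layers, bytes_per_value in scenarios:
    size = kv_cache_bytes(N_CTX, n_layers, N_KV_HEADS, HEAD_DIM, bytes_per_value)
    print(f"{label:>22}: {size / GIB:.1f} GiB")
```

With these placeholder dimensions, the three cases come out to 65 GiB, 16 GiB, and 8.5 GiB at 256k context. Whatever the real head dimensions are, the ratios hold: caching only 16 of 65 layers cuts the cache to about a quarter, and q8_0 roughly halves it again.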
Anyway, everything else is better in oMLX (higher PP speed, generation speed, and caching strategy). So, my question is: is it possible to get a better-optimized KV cache in oMLX to reduce memory consumption?
If so, which model and settings should I use?
Thanks in advance!