u/Background-Gold-9882

I'm on an M5 Pro 48GB. I just started using oMLX and love it so far.

Now I'm playing around with Qwen 3.6-27B with MTP (oMLX 0.3.9-dev2) and it's working really well, except that I run into OOM for contexts > ~65k. So far, I've downloaded the official full-precision qwen3.6-27B from HF and created oQ4 / oQ6 versions myself. But the more context I use, the sooner I run into OOM crashes. The 128k context benchmark works sometimes, but usually crashes the entire computer.

However, when using llama.cpp as per this post: https://www.reddit.com/r/LocalLLaMA/comments/1t57xuu/25x_faster_inference_with_qwen_36_27b_using_mtp/

I'm able to run much larger contexts (256k), with MTP support, and much less memory consumption, using this command:

llama-server \
-m Qwen3.6-27B-Q4_K_M-mtp.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 \
-c 262144 \
--temp 0.7 \
--top-k 20 \
-ngl 99 \
--port 8081

I'm guessing it has to do with the explanation in the post: that Qwen's hybrid model only needs KV cache for 16 of 65 layers, and drivers that allocate naively will allocate much more memory than necessary? Also, llama.cpp allows setting the KV cache to 8-bit rather than full precision (which I guess oMLX uses by default?).
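
To put rough numbers on that guess, here is a back-of-the-envelope sketch. The 16-of-65 layer split is from the linked post and the context length is from the command above, but `N_KV_HEADS` and `HEAD_DIM` are placeholders I made up (the real values should be in the model's config.json), so only the ratios mean anything, not the absolute sizes. It also ignores whatever state the non-attention layers and the MTP head need.

```python
def kv_cache_bytes(ctx_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Rough KV cache size: one K and one V vector per token, per attention layer."""
    return 2 * ctx_len * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem

GIB = 1024 ** 3
CTX = 262144        # -c 262144 from the llama-server command
N_KV_HEADS = 4      # ASSUMPTION: placeholder, check the model's config.json
HEAD_DIM = 256      # ASSUMPTION: placeholder, check the model's config.json
F16 = 2.0           # bytes per element for a 16-bit cache
Q8_0 = 34 / 32      # llama.cpp q8_0 block: 32 int8 values plus one fp16 scale

# Cache allocated for all 65 layers vs. only the 16 that need it, then 8-bit.
naive_f16 = kv_cache_bytes(CTX, 65, N_KV_HEADS, HEAD_DIM, F16)
hybrid_f16 = kv_cache_bytes(CTX, 16, N_KV_HEADS, HEAD_DIM, F16)
hybrid_q8 = kv_cache_bytes(CTX, 16, N_KV_HEADS, HEAD_DIM, Q8_0)

for name, size in [("65 layers, f16", naive_f16),
                   ("16 layers, f16", hybrid_f16),
                   ("16 layers, q8_0", hybrid_q8)]:
    print(f"{name}: {size / GIB:.1f} GiB ({size / naive_f16:.0%} of naive)")
```

Whatever the real head counts are, they cancel out in the ratios: allocating for 16 layers instead of 65 cuts the cache to about a quarter, and q8_0 roughly halves it again. If oMLX does neither, that could explain the gap I'm seeing.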

Anyway, everything else is better in oMLX (higher PP speed, generation speed, and caching strategy). So, my question is: is it possible to get a better-optimized KV cache in oMLX to reduce memory consumption?

If so, which model and settings should I use?

Thanks in advance!

Hi!

I would like a setup where I can do things like:

Semantic search of the file system:
- “Can you find the ideas I had regarding <XYZ> a couple of years ago?” (Search txt / md / docx / pdf files in the file system)
Conversation memory:
- “Please summarize the key points of the XYZ discussions I had with you last week. What were the next steps?”
Multi-step automation:
- “Please search my todo lists for all tasks related to <XYZ>. Create a markdown file with a title, short summary and source link for each entry.”
File management:
- “Please sort the files in my Desktop and Downloads folders into my file repository in <folder>. List suggestions and ask for permission before executing.”
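
To make the "list suggestions and ask for permission" part concrete, this is roughly the behaviour I have in mind, as a minimal sketch: planning and executing are separate steps, and nothing moves without an explicit "y". The `RULES` mapping and the function names are hypothetical; in a real setup the LLM would propose the destinations instead of a fixed table.

```python
from pathlib import Path
import shutil

# ASSUMPTION: hypothetical extension-to-subfolder rules, purely for illustration.
RULES = {".pdf": "documents", ".docx": "documents", ".md": "notes", ".txt": "notes"}

def plan_moves(sources, repo):
    """Return (src, dst) pairs without touching the file system."""
    moves = []
    for folder in sources:
        for path in sorted(Path(folder).iterdir()):
            suffix = path.suffix.lower()
            if path.is_file() and suffix in RULES:
                moves.append((path, Path(repo) / RULES[suffix] / path.name))
    return moves

def execute(moves, confirm=input):
    """List the suggestions, then move files only after an explicit 'y'."""
    for src, dst in moves:
        print(f"{src} -> {dst}")
    if not moves or confirm("Apply these moves? [y/N] ").strip().lower() != "y":
        return 0
    moved = 0
    for src, dst in moves:
        if dst.exists():
            continue  # never overwrite an existing file
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        moved += 1
    return moved
```

Usage would be something like `execute(plan_moves([Path.home() / "Desktop", Path.home() / "Downloads"], "<folder>"))`, with `<folder>` being my file repository.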

Questions:

Do people have setups like this working?
Is Hermes a good conductor for tasks like this?
Is this feasible with a local-only setup? (Privacy)
- If so, which LLM models would you recommend?
- Are they feasible on my M1 Max 64GB or M5 Pro 48GB?
What's the recommended long-term memory system for these use cases?
- Can / should long-term memory of conversations be combined with knowledge from the file system? Markdown wiki etc?
How to do embeddings / vector db RAG / MD wiki etc. for semantic searches of files? How to automatically index appropriate folders continuously?
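
For the continuous indexing part, the simplest thing I can think of is a manifest of modification times, so each run only re-embeds files that are new or changed. A minimal stdlib-only sketch of that idea follows. The function name and manifest format are my own invention, the embedding / vector db step is left out entirely, and docx / pdf would need a text extractor first.

```python
import json
from pathlib import Path

# Plain-text formats only; docx / pdf are skipped in this sketch.
INDEXED_SUFFIXES = {".txt", ".md"}

def changed_files(folders, manifest_path):
    """Return files that are new or modified since the last run, and update the manifest."""
    manifest_file = Path(manifest_path)
    seen = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    todo = []
    for folder in folders:
        for path in sorted(Path(folder).rglob("*")):
            if path.is_file() and path.suffix.lower() in INDEXED_SUFFIXES:
                mtime = path.stat().st_mtime_ns
                if seen.get(str(path)) != mtime:
                    todo.append(path)        # needs (re-)embedding
                    seen[str(path)] = mtime
    manifest_file.write_text(json.dumps(seen))
    return todo
```

This could run from a scheduled job, with each returned path handed to whatever embedding step gets recommended. A more careful version would update the manifest only after the embedding succeeded, and would also drop entries for deleted files.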

Any tutorials to follow?

Thanks in advance!

Qwen3.6-27B: MTP + Optimized KV cache?