u/StatisticianFree706

Hey guys,

Running Qwen3.6-35B-A3B (UD-4bit) on a Mac Studio M1 Max (32GB) via omlx.

Generation speed is awesome, but I’m hard-capped at around 50k tokens of context before hitting an OOM crash.

I know the KV cache is eating my remaining unified memory. Here is what I've tried:

omlx "Turbo Quant for KV cache": Tried enabling this to save RAM, but it doesn't help: it either crashes or has no effect.
llama.cpp: Can push much higher context via swap, but the prompt eval speed is painfully slow compared to MLX.
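
For a rough sense of why the KV cache matters here, this is a back-of-the-envelope sketch of how its size scales with context length and cache precision. The layer count, KV head count, and head dimension are placeholder values I picked for illustration, not this model's actual config (read the real ones from its config.json), and the formula ignores the scale/bias overhead that a quantized cache adds, so treat the output as a ballpark only.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_value=16):
    """Approximate KV cache size in bytes: keys + values for every attention layer."""
    # 2 = one key vector and one value vector per KV head, per layer, per token
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_tokens * values_per_token * bits_per_value / 8


# Placeholder architecture values (assumptions, NOT the real model config)
cfg = dict(n_layers=48, n_kv_heads=4, head_dim=128)

for ctx in (50_000, 80_000):
    for bits in (16, 8, 4):
        gib = kv_cache_bytes(ctx, bits_per_value=bits, **cfg) / 2**30
        print(f"{ctx:>6} tokens @ {bits:>2}-bit KV: {gib:.2f} GiB")
```

The point is just that the cache grows linearly with context and shrinks roughly in proportion to the bits per value, which is why a working KV quantization option would matter so much on 32GB.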

Question: Is there any reliable workaround/CLI flag for MLX to actually force KV cache quantization for this MoE model? How are you guys squeezing out 80k+ context on 32GB machines without tanking the speed?

Thanks!

Pushing context >50k in omlx on 32GB Mac? (Turbo KV Quant fails)