▲ 8 r/oMLX
Pushing context >50k in omlx on 32GB Mac? (Turbo KV Quant fails)
Hey guys,
Running Qwen3.6-35B-A3B (UD-4bit) on a Mac Studio M1 Max (32GB) via omlx.
Generation speed is awesome, but I’m hard-capped at around 50k context before hitting an OOM crash.
I know the KV cache is eating my remaining unified memory. Here is what I've tried:
- omlx "Turbo Quant for KV cache": Tried enabling this to save RAM, but it doesn't work at all (crashes or has no effect).
- llama.cpp: Can push much higher context via swap, but the prompt eval speed is painfully slow compared to MLX.
Question: Is there any reliable workaround/CLI flag for MLX to actually force KV cache quantization for this MoE model? How are you guys squeezing out 80k+ context on 32GB machines without tanking the speed?
Thanks!
u/StatisticianFree706 — 3 days ago