r/oMLX

▲ 11 r/oMLX

Waiting oMLX 0.3.9 stable release

Sooo the 0.3.9 RC docs stated they expected around 24h of final testing before releasing the stable version. That window has passed.

I can’t wait to use MTP features but don’t want to switch to dev builds for my final use cases. Experimenting with the dev builds shows great performance gains and it seems we are soooo close to getting the new stable release.

What is your experience with 0.3.9? Do you think there were more bugs than expected? Or is this a release so close to perfect they are polishing everything more than usual?

reddit.com
u/TheFlyingDutchG — 19 hours ago
▲ 7 r/oMLX+1 crossposts

Creating New oQ Quantization in oMLX

Hi! I'm sorry if this is a dumb question but I'm new to oMLX and have been doing some of my own experimenting and research. I've been experimenting with the Qwen3.6-27B-oQ4-mtp model that Jundot has on huggingface, but I wanted to compare that to an oQ3 model. There isn't one available that I could find so I wanted to try making my own quantization of it in oMLX. Where can I find a full precision Qwen3.6-27B in mlx format that I could use as the source model? I was only able to find the unsloth BF16 model in gguf format. Thank you!

reddit.com
u/FleetingMemories — 1 day ago
▲ 8 r/oMLX

oMLX + pi + mcp

hello, I am trying to use omlx + pi cli with any mcp such as web-search (brave api), however i have not been successful. Is this even possible yet or not a function added to pi-cli?

1)I am running local mlx llm such as qwen/gemma.
2)Want to use web-search brave api (or similar) to have local llm do basic web searches to improve it's answers.
3) I know openclaw can do web-search but it is slow and not how i want to do things (i want to use terminal cli-agent which is fast)

reddit.com
u/PrepYourselves — 2 days ago
▲ 10 r/oMLX

Is MTP speed boost really helping ?

This question is for those who have tried the MTP quants of oQ version of models with oMLX.

Are you seeing any compromise on the quality of the outputs, compared to non-MTP versions?

Sure the speed increment on token does help, but if the tool call failures or any such issues are happening, it is not really worth the additional tok/sec we get right?

We will be able to assess this only on real scenario usages which we have been using before and are familiar with.

So are you seeing any such degradation of quality or do you think its worth going with MTP version? What are your thoughts?

reddit.com
u/msrdatha — 3 days ago
▲ 8 r/oMLX

Pushing context >50k in omlx on 32GB Mac? (Turbo KV Quant fails)

Hey guys,

Running Qwen3.6-35B-A3B (UD-4bit) on a Mac Studio M1 Max (32GB) via omlx.

Generation speed is awesome, but I’m hard-capped at around 50k context before hitting an OOM crash.

I know the KV cache is eating my remaining unified memory. Here is what I've tried:

  • omlx "Turbo Quant for KV cache": Tried enabling this to save RAM, but it doesn't work at all (crashes or has no effect).
  • llama.cpp: Can push much higher context via swap, but the prompt eval speed is painfully slow compared to MLX.

Question: Is there any reliable workaround/CLI flag for MLX to actually force KV cache quantization for this MoE model? How are you guys squeezing out 80k+ context on 32GB machines without tanking the speed?

Thanks!

reddit.com
u/StatisticianFree706 — 3 days ago
▲ 6 r/oMLX

Web search from oMLX chat?

Just started using oMLX. Its great! But so far I’m serving it to my coding agents. I tried its Chat panel, but it doesn’t seem to do web search. Is it in the settings (that I might have missed) or not supported at all? If not supported, what app y’all are using for chat conversations?!

reddit.com
u/atumblingdandelion — 3 days ago
▲ 20 r/oMLX

Qwen3.6-27B: MTP + Optimized KV cache?

I'm on a M5 Pro 48GB. I just started using oMLX and love it so far.

Now I'm playing around with Qwen 3.6-27B with MTP (oMLX 0.3.9-dev2) and it's working really well, except that run into OOM for contexts > ~65k. So far, I've downloaded the official full precision qwen3.6-27B from HF and created oQ4 / oQ6 versions myself. But the more context I use, the quicker I run into OOM crashes. The 128k context benchmark works sometimes, but usually crashes the entire computer.

However, when using llama.cpp as per this post: https://www.reddit.com/r/LocalLLaMA/comments/1t57xuu/25x_faster_inference_with_qwen_36_27b_using_mtp/

I'm able to run much larger contexts (256k), with MTP support, and much less memory consumption, using this command:

llama-server \
-m Qwen3.6-27B-Q4_K_M-mtp.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \     
--cache-type-v q8_0 \     
-np 1 \     
-c 262144 \     
--temp 0.7 \     
--top-k 20 \     
-ngl 99 \     
--port 8081

I'm guessing it has to do with the explanation in the post - That Qwen:s hybrid model only needs KV cache for 16 of 65 layers, and drivers that allocate naively will allocate much more memory than necessary? Also, llama.cpp allows setting KV cache to 8bit rather than full precision (Which I guess oMLX uses by default?)

Anyway, everything else is better in oMLX (Higher PP speed, generation speed, and caching strategy). So, my question is - Is it possible to have better optimized KV cache in oMLX to reduce memory consumption?

If so, which model and settings should I use?

Thanks in advance!

reddit.com
u/Background-Gold-9882 — 5 days ago
▲ 9 r/oMLX

Seeking Optimization Advice: Qwen 3.6 27B Setup on M2 MacBook Pro

Happy Sunday, everyone! I'm relatively new to running local LLMs (about two weeks in), so I appreciate your patience with my questions. I'm eager to learn from this community's expertise.

Background

A few weeks ago, I discovered agentic coding through my work's GitHub Copilot account. After quickly exhausting my usage limits (lesson learned about token management!), I decided to explore running Qwen models locally on my personal laptop for hobby projects.

Hardware

  • M2 MacBook Pro Max 96GB

Models Tested

  • oMLX: Qwen 3.6 27B (oQ4/oQ5/oQ6/oQ8-fp16-mtp variants)
  • LM Studio/GGUF: Qwen 3.6 27B (Q4_K_M, Q6_K, Q8_K)
  • llama.cpp: Configured per this post

Use Case

I'm primarily doing C++ and ESP32/PlatformIO development for personal projects, including:

  • Real-time voice modulation for cosplay costumes
  • Real-time bark detection logger (courtesy of my neighbor's enthusiastic dog)

Current Configuration

After implementing MTP changes, I've settled on the following setup:

Model: oMLX Qwen 3.6 27B-oQ5-fp16-mtp

Settings:

  • Context: 262,144
  • Temperature: 0.6
  • Top P: 0.95
  • Top K: 20
  • Min P: 0
  • Repetition Penalty: 1
  • Presence Penalty: 0
  • Extended thinking: Enabled
  • Native MTP: Enabled
  • oMLX caching: Enabled

IDE Setup:

  • VS Code with Cline extension
  • OpenAI-compatible API from oMLX

Workflow:

  1. Enable PLAN mode in Cline
  2. Request feature implementation or bug research plan
  3. Switch to ACT mode and execute
  4. Wait lol

Current Performance

While the quality of Qwen 3.6 (Q4-Q8) is impressive, performance could be better:

  • Prompt processing: ~120 tok/s
  • Token generation: ~15 tok/s

Question

For those running similar hardware (especially M2 users), what combination of:

  • Software stack (oMLX, LM Studio, llama.cpp, etc.)
  • Specific Qwen 3.6 model variants
  • Inference settings

...have you found optimal? Any suggestions for improving prompt processing and token generation speeds on M2 hardware would be greatly appreciated!

reddit.com
u/cyclebiff — 4 days ago
▲ 27 r/oMLX

Qwen3.6-35B-oQ6 is the sweet spot for me with MTP

I've been having a good time playing with OpenCode and oMLX. Multi-token prediction does really seem to speed things up. I'm playing with the Qwen 3.6 35B MoE models, and I noticed that the oQ6 model is almost as fast as the oQ4 for me in token generation. This may be because the prediction acceptance rate is higher. Here are benchmarks for the two running on my machine (M5 Max 64GB) through oMLX:

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-35B-A3B-oQ4-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           436.3        8.29  2346.9 tok/s   121.6 tok/s       1.489   773.8 tok/s    20.37 GB
pp4096/tg128          1073.4        8.73  3815.9 tok/s   115.4 tok/s       2.183  1935.4 tok/s    21.17 GB
pp8192/tg128          2018.7        9.17  4058.0 tok/s   109.9 tok/s       3.184  2613.2 tok/s    21.66 GB
pp16384/tg128         4503.8        9.72  3637.8 tok/s   103.7 tok/s       5.739  2877.3 tok/s    22.36 GB
oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-35B-A3B-oQ6-mtp
================================================================================
Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128           463.3        9.34  2210.3 tok/s   107.9 tok/s       1.650   698.3 tok/s    28.29 GB
pp4096/tg128          1121.2        9.87  3653.1 tok/s   102.1 tok/s       2.375  1778.7 tok/s    29.10 GB
pp8192/tg128          2095.8       10.38  3908.8 tok/s    97.1 tok/s       3.414  2436.9 tok/s    29.58 GB
pp16384/tg128         4732.2       10.61  3462.2 tok/s    95.0 tok/s       6.080  2715.8 tok/s    30.29 GB
u/arfung39 — 6 days ago
▲ 5 r/oMLX

Help a Noob out

Hi there! I am absolutely new to local LLMs, but I am so fascinated by the whole topic and very thrilled to learn more about it. I have an M2 Max with 64GB of RAM and have already been pretty successful with getting local LLMs to run.

Are there any recommendations in terms of YouTube tutorials that you would say changed the way you operate LLMs or understand the whole topic? It is so complex and I honestly don’t know where to start. Thanks in advance!

reddit.com
u/robdzn — 4 days ago
▲ 1 r/oMLX

Trying to pull (mlx-community/Kimi-K2.6-mlx-DQ3_K_M-q8) but it's just frozen on the downloads page, what gives?

My 512gb Mac Studio can run this monster model, but it just never downloads, what am doing wrong guys? I even generated a token from huggingface, still nothing.....

reddit.com
u/YellowBathroomTiles — 5 days ago
▲ 5 r/oMLX

Can someone explain how a harness affects things?

Wouldn’t the LLM itself be what makes the decisions? How does a harness change this?

reddit.com
u/MartiniCommander — 5 days ago
▲ 14 r/oMLX+1 crossposts

Just dropped another 3&5 mixed quant for the RAM Poor Base-model-only Mac users that want to try Gemma4 top of the line LLM.

6gb smaller that the other 3bit-mlx out there and 25% faster.

Thicc and dense 13 GB of pure LLM sweetness from Google for the desperate that don't care for vision. (just use something faster and equally good, like tiny Qwen3.5-2B)

Ideal if:

  • You just prefer the latest Gemma4 Humanities/Communications/SocialStudies edge over Qwen3.6 STEM hard focus in your 24gb ram Mac.
  • You don't like or need overly verbose thinking models (Qwen3.x 👀). Gemma4 chews only 1/4 of tokens 'thinking' if compared to Qwen3.6

Recommended Inference Parameters

For the best performance, use the following standardized sampling configuration across all use cases:

Parameter Value
temperature 1.0
top_p 0.95
top_k 64
min_p 0.05
repeat_penalty 1.05

LM Studio — Reasoning Section Parsing

To enable thinking/reasoning output parsing:

  • Start string<|channel>thought
  • End string<channel|>

Add to ninja template:

{%- set enable_thinking = true %}

u/JLeonsarmiento — 6 days ago
▲ 6 r/oMLX

How to get DFlash going?

What are people using for dflash? I’m on a M2 Max with 96 GB of RAM and I’d like to try and eke out as much perf as I can on omlx. I’ve been looking at Qwen models, but Gemma4 is giving me better perf currently.

reddit.com
u/lightguardjp — 7 days ago
▲ 41 r/oMLX

oMLX 0.3.9.dev2 released.

Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM

Plus a lot of bug fixes and 20 new contributors in this cycle.

reddit.com
u/d4mations — 9 days ago
▲ 6 r/oMLX

oMLX use in Hermes

I have oMLX installed on an M1 Max Mac Studio and it’s setup for network access (0.0.0.0) and have Hermes installed on a separate Mac Mini (Intel). I’ve configured a custom model in the Hermes config for oMLX but no matter what I try I cannot get Hermes to talk to oMLX.

Has anyone had any success with this setup?

reddit.com
u/aptonline — 9 days ago
▲ 4 r/oMLX

Choosing community in Hugging Face

should we always choose model from mlx-community as they are optimized for MLX? or it actually not matter and we achieve same result with unsloth or bartowski one?

reddit.com
u/cocacokareddit — 8 days ago