r/oMLX

M2 Max, 64G, failing so often (error 6), increase number of retries to restart the model?

MBP, M2 Max, 64G RAM. tried various models, following models' advice to reduce context size, but the failures keep happening.

this point, I just want it to work through the night rather than every bloody time (paraphrased)

>Model failed. Error 6. tried three times to restart the model. clearly nothing else can be done

OK, I added the third part there.

Is there a way to get the oMLX actual app to KEEP restarting? I come along in the morning, click "restart model" and it happily carries on for a few minutes before failing (error 6, always error 6). Sure, the next goal is to have it analyze itself and self-reconfigure, but that's too much like an "easy" button, and AI is *supposed* to be hard...

reddit.com

u/26J-stroke-6 — 10 hours ago

▲ 8 r/oMLX

Configuration Setting To Maximum output

Currently I am running oMLX on my M3 Max, with the model Qwen3.6-27B-4bit.
Here is my setting

https://preview.redd.it/ug7l5i6e96bh1.png?width=1538&format=png&auto=webp&s=cd6013fdd705d551cb038506b79300b54776b2e1

https://preview.redd.it/iy9che4i96bh1.png?width=1524&format=png&auto=webp&s=668f94f5ecf20104d5d969c3f78858234f22f141

https://preview.redd.it/q5qwgbel96bh1.png?width=1448&format=png&auto=webp&s=35f9dc1e88b03ea2b7bb40f13f2dc91d250550a7

https://preview.redd.it/io2p2zeq96bh1.png?width=1456&format=png&auto=webp&s=3c7415989adc042daf51823f7f823f97dc382b6a

https://preview.redd.it/fal1ahus96bh1.png?width=1470&format=png&auto=webp&s=ed866123bd2a4d72b0264cdf784913c0808c157f

https://preview.redd.it/i3py54yv96bh1.png?width=1446&format=png&auto=webp&s=a0eab2c196954b57caa7b2e5eb6795f9ebd90dd9

The output only around 7-10tokens.
Can you guys give me more suggestion to improve the output? Thank you a lot

reddit.com

u/SimpleRain173 — 2 days ago

▲ 13 r/oMLX

Gemma 4 models with coding harnesses

Has anyone found any good settings to use Gemma 4 models served through oMLX with coding harnesses like Pi?

For me none of the Gemma 4 models seem to be able to make tool calls in this harness - I suspect it’s due to the differing tool call format used in Gemma models.

Has anyone figured out how to make this work in oMLX + Pi or OpenCode?

Update: they do seem capable of making tool calls if you explicitly ask for it. For example if you ask it to build an html game it’ll just print out the code in the chat but then if you ask it to write it to a file with the ‘write’ tool then it will. A bit annoying as Qwen models just do it.

reddit.com

u/No_Willingness_2249 — 2 days ago

▲ 44 r/oMLX+3 crossposts

Small models fail tool-calling for different reasons — and sometimes it's an upstream chat-template bug, not the model. I built an MLX tool to tell them apart.

Everyone benchmarks tool-calling with one number: "Model X gets 71% of function calls right." That number can't tell you why the other 29% failed — and the "why" is what decides what you do next.

So I built Toolhound, an MLX-native diagnostic that runs entirely on your Mac and attributes every tool-calling failure to one of four causes:

- `framework_template_bug` — the chat template mangled the tool tokens

- `framework_parser_gap` — the model emitted a rescuable call, the framework parser missed it

- `model_format_failure` — the model can't emit a parseable call

- `model_decision_failure` — valid format, wrong tool/args

What surprised me (Qwen2.5-0.5B / 1.5B, Llama-3.2-3B, 4-bit, on an M2 Pro):

- Qwen2.5-0.5B mostly fails on an upstream chat-template bug — Qwen2.5's template renders its tool-call example with doubled braces `{{"name": ...}}`, and the small model copies it literally. That's not the model's fault. args-correct 29%.

- Qwen2.5-1.5B parses fine (96%) but fails on judgment — wrong tool/args. args-correct 71%.

- Llama-3.2-3B formats perfectly, but wrong arg types + false abstentions. args-correct 61%.

Same benchmark, opposite root causes. A plain accuracy score hides that — and the smallest model's failures aren't even fixable by a better model.

Other things it does:

- 95% bootstrap CIs on every metric (temp=0, so no seed hand-waving — the CI comes from resampling the case set)

- Reports attribution under both a strict and a lenient parser, so you can see the verdict doesn't flip

- Quantifies bf16-vs-q4 damage without confounding it with template differences (asserts identical template first)

- v2 benchmarks existing zero-training fixes (PA-Tool is wired in). Honestly, on my demo run PA-Tool didn't beat baseline on any metric — it flags a result "credible" only when its CI is disjoint from baseline's, and it wasn't (it even hurt 1.5B's arg accuracy). I'd rather the tool tell me that than rubber-stamp it.

https://github.com/Code-byte404/toolhound

Feedback very welcome — especially: which models should I add next, and are the abstention "trap" cases too easy/hard? There are `good first issue's if anyone wants to add a model or help file the template bugs it finds upstream.

u/Otherwise_Ship_9782 — 2 days ago

▲ 21 r/oMLX

Is MTP is scam on Macs?

M5 128 GB 40 GPU Cores

For local Apple Silicon inference with Qwen oQ4 models, long contexts, and agentic workloads, MTP appears to be a net negative. Disable it unless your own benchmarks prove otherwise.

u/onil_gova — 3 days ago

▲ 8 r/oMLX

Memory leak? High usage when models unloaded

After using OMLx for a few hours, memory usage seems to get stuck around 66GB (my hard limit is 102GB).

Even after unloading all models. Is this a bug or a misconfiguration on my part?

https://preview.redd.it/rni3b5tmq8bh1.png?width=1604&format=png&auto=webp&s=8977b3c70b8407c486d2697face9b581d0faf591

The only solution is to restart the server which is not practical when I'm away from home.

reddit.com

u/That-Desk-1552 — 1 day ago

▲ 7 r/oMLX+1 crossposts

MacBook Pro M5 Pro 48GB Ram

I have just purchased a MacBook Pro M5 Pro with 48GB RAM. I was wondering if any out there have the same RAM and what models they are using locally that fits with enough headroom. What inference engine, and are you using MLX or GGUF? Im only going to use it for some light terminal work, some automation, and some python and light html work. Probably going to use Hermes as harness. Please share your setup or recommondations for 48 GB Ram on the MacBook Pro M5 Pro. Thank you!

reddit.com

u/ProgramOver9309 — 3 days ago

▲ 5 r/oMLX

OMlx user experience with Rapid-mlx

I hope I can ask this question here, hope that is ok 🙏

Does any one here have experience with Rapid-mlx? There only appears to be few thread on Reddit, and I am not seeing as much community engagement as compared to oMlx.

I was asking Google gemini about how MTP vs Dflash work so I could learn to configure and to learn* and how best to configure the backend. I have been using oMlx for a while and wanted to see if I could optimize my setup. During my inquiry, it mentioned Rapid-mlx supports Pflash and should be faster for TTFT.

I have been pretty happy with oMlx. I have played a bit with LM studio, Lamma.cpp, Ollama and mxl-ml but mostly oMlx. I use Qwen 3.6 27b as well are the MoE on my m4 max mackbook pro. I have played with open code, pi.dev, and Hermes.

Wanted to hear about first hand experience from this community.

I have no experience with benchmarking. I am going to do some bench marking on my own, but I only heard about this today and am very interested in what you all have to say.

Thank you.

*Edit: typo and little clarification

reddit.com

u/apaht — 3 days ago

▲ 513 r/oMLX+3 crossposts

Haltop

My halftop

u/PrepYourselves — 5 days ago

▲ 3 r/oMLX

Trying to make sense of Model Benchmarks

I'll preface by saying i'm not a developer.
i'm just curious and eager to learn more on LLMs and coding.

I have opencode setup wit oMLX on a m1 max (40c) 64GB
i've been going through the oMLX benchmarks and looking through best options for Qwen (general coding) and Gemma (general research/reasoning)
https://omlx.ai/benchmarks

This is where i think i'm getting confused.
I'll apologize in advance if my qtns are somewhat amateurish.

i get i should be looking at the larger models (e.g 30B)
I understand a higher quant is preferred for coding (e.g 8bit)
with context though, shouldn't i be looking at higher context for coding sessions. If that is the case, doesn't that in turn lead to a larger KV cache size and chew in more onto memory.

u/rdbmas — 4 days ago

▲ 14 r/oMLX

Qwen3.6-27B-oQ8-mtp + Native MTP on M5 Max: stuck around 9–10 tok/s sustained - losing my mind

I've been hammering away at this issue for what feels like decades now. I'm using Jundot/Qwen3.6-27B-oQ8-mtp in oMLX with Pi as a coding harness and am only getting 9-10 ish t/s (generation, not prompt processing) to matter what I try...no matter what settings I fiddle with. 9-10 is the absolute max I'm getting. I'm hoping someone can suggest a fix as I've exhausted my non-expert knowledge and experience.

Hardware:

MacBook Pro M5 Max
128GB RAM
40-core GPU
oMLX running locally on LAN
Pi using the oMLX OpenAI-compatible endpoint

Model/settings:

Model: Jundot/Qwen3.6-27B-oQ8-mtp
Model Type Override: LLM
Native MTP: ON
TurboQuant KV: OFF
VLM MTP: OFF
DFlash: OFF
SpecPrefill: OFF
Thinking: OFF (for testing purposes)
Temp: 0.1
Top P: 0.95
Top K: 20
Context cap has mostly been 131072

An important details - oMLX originally auto-detected this model as VLM. In Pi, that caused the model to process one turn and then stop almost immediately. Forcing the model type to LLM fixed that behavior.

Now the issue is speed.

With Native MTP ON, a raw curl test outside Pi gives roughly:

prompt: 39 tokens
output: ~1400–1600 tokens
total time: ~149–161s
sustained speed: ~9.5–9.9 tok/s
MTP path is definitely active
MTP accept rate around 71–73%

Example log line:

MTP finish=stop tokens=1420 cycles=827 accept=591/827 (71.5%)
timing[backbone=132784.6ms mtp=6628.3ms sample=6732.0ms cache=79.6ms]
Chat completion: 1419 tokens in 148.90s (9.5 tok/s), prompt: 39

With Native MTP OFF, speed drops to roughly ~6 tok/s. So MTP is helping, but only by about 1.5–1.7x.

One interesting detail that might be relevant (honestly, I don't know at this stage of things). I had a period yesterday when I was getting 30 ish t/s for no reason at all (well, I'm sure there is a reason, I just have zero clue what it is). I went to bed happy thinking that my settings fiddling found the right combo, only to discover this morning that it was back to the glacial t/s rate.

I’m not looking to switch models right now. The goal is to get this exact MTP model working as fast as possible for Pi/coding-agent use and stop banging my head against the wall in frustration.

any help or suggestions would be appreciated beyond belief.

reddit.com

u/UnseemlyCorgi — 7 days ago

▲ 1 r/oMLX

MBP M5 24GB - anyone running similar?

I’ve had a lot of false starts with different servers/harnesses. I get something running, it responds to an initial prompt or two then I though something simple at it and it goes off the rails. Anyone successfully running something similar that will share setup?

reddit.com

u/ogfuzzball — 5 days ago

▲ 3 r/oMLX

I built a macOS menu bar app to manage oMLX no terminal needed

Update: yes I now realize that oMLX has its own panel. This is redundant if you're only using oMLX as a server, but my tool also control llama.cpp, which doesn't have one. So skip for oMLX.

------------

I got tired of opening Terminal every time I wanted to start my server or switch models, so I built a menu bar app. I've been using this for a little while now and felt it was good enough to share with others, who hopefully been thinking the same thing.

GitHub: https://github.com/cporto/llm-menubar
Download (DMG): https://github.com/cporto/llm-menubar/releases/tag/v0.2.1

What it does:

Start / stop / restart your server from the menu bar
Switch models — unloads the current one automatically, loads the new one
llama.cpp + oMLX — switch backends from the menu
Opens the server dashboard in your browser on start — llama.cpp's web UI or oMLX's admin panel
First-run wizard — finds your binary, picks your models folder, sets up launchd for you
Remembers your last model and auto-loads it on launch
Live status — animated tray icon with elapsed timers while starting and loading
One-model-at-a-time — keeps RAM in check, unloads before loading

Runs on launchd, not as a child process — server keeps running if you quit the app, and the app picks up a server that's already running.

u/WatercressCivil3048 — 7 days ago

▲ 97 r/oMLX

It’s been a while. oMLX 0.4.5.dev1 is here.

Hey everyone! It’s been a while, and I’m back with oMLX 0.4.5.dev1.
https://github.com/jundot/omlx/releases

I’ve been steadily committing changes since 0.4.4, but it was a little hard to decide where to cut the next dev release. I also wanted this release to include a meaningful attempt from the MLX kernel side, so it took a bit longer than usual. I hope you’ll understand.

The biggest change in this release is mainly relevant to people using an M3 Ultra, so apologies if this does not apply to your setup yet. - I’m also working on optimizing Gemma in a similar direction, so please stay tuned.

This release focuses on performance improvements for GLM-5.2, which I personally think is a big step forward for local AI, and MiniMax-M3, which has turned out to be a surprisingly useful model in practice.

Previously, these models “worked,” but honestly, I don’t think the long-context speed was where it needed to be for real use. With custom kernels, oMLX now gets a major speedup in long-context prefill. I also ran basic Needle in a Haystack tests and coding tests through Claude Code, and confirmed that quality did not collapse with the optimized path.

I hope this is a meaningful improvement for people using local LLMs in setups similar to mine.

Another major change is API-visible model profiles. You can now expose presets like 'qwen3-8b:thinking' or 'qwen3-8b:non-thinking' and call them directly through the API with the settings you want. Huge thanks to github pablomoralesm for this work: https://github.com/jundot/omlx/pull/1838

As always, this release was only possible because many people contributed their valuable time. I’m deeply grateful.

Thank you as well to everyone using oMLX, sharing feedback, reporting issues, and helping make the product better. It’s great to keep building local AI together!

u/cryingneko — 8 days ago

▲ 22 r/oMLX

I built a CLI tool to manage oMLX’s menu bar

Got tired of clicking around a menu bar every time I wanted to start my server or switch models, so I built a CLI. Been using this for a while now and figured someone else has probably also been quietly wishing their menu bar icon was just… text in a terminal.

GitHub: https://github.com/omlxMaster/llm-cli
Install: brew install llm-cli (or whatever)

What it does:
- llmctl start / stop / restart — your server, from the terminal where it belongs

- llmctl switch <model> — unloads the current one automatically, loads the new one

- llmctl backend llama.cpp|omlx — switch backends with a flag instead of a dropdown

- Prints the dashboard URL on start instead of opening a browser tab you didn’t ask for

- First-run wizard — llmctl init finds your binary, picks your models folder, sets up launchd for you

- Remembers your last model in a config file and auto-loads it on launch

- Live status via llmctl status — no animated tray icon, just numbers, because elapsed time is a number

- One-model-at-a-time — keeps RAM in check, unloads before loading

Runs on launchd, not as a child process — server keeps running if you quit your terminal, and the CLI picks up a server that’s already running. No Dock icon, no menu bar real estate, no mouse required.

u/neobow2 — 6 days ago

▲ 12 r/oMLX+1 crossposts

mlx-mamba3

Hey, Built this after hitting some dependency hell with the CUDA/Triton requirements on Colab. Covers SISO, MIMO, and Hybrid Attention-Mamba configs with verified numerical parity against the PyTorch reference (max error < 10⁻⁵, 12 passing tests).

Key things that work: exponential-trapezoidal discretization, complex rotary states, chunked prefill with cache consistency, mixed-precision LoRA fine-tuning, all MLX, no CUDA needed.
If you have any feedback or see any inconsistencies, hmu!

Note: haven't really tested it so far on public trained checkpoints (will do the next days/week) so this is mainly useful for local architecture experimentation and fine-tuning on toy data until the authors drop weights.

→ github.com/Jada42/mlx-mamba3

u/Traditional_Ad_6304 — 6 days ago

▲ 10 r/oMLX

How do you manage context size and your coding harness?

Hey folks!

I'm on a 48 GB M5 Pro MacBook that I picked up a couple months ago and I've been trying to get into local agentic coding. At work I'm fortunate enough to use frontier models within VSCode's Copilot (no Claude Code, Codex, etc), and it's so easy there to never worry about context size, manual compaction, etc.

On the Mac I've tried a few couple harnesses with oMLX, namely Claude Code and OpenCode, but I can't quite figure out the right workflow yet. For example I'll run into situations where part way through a session the prefill OOM guard kicks in. I've been using Qwen 3.6 35B A3B oq4 and a 65K context window, which I thought should be manageable with my 48 GB RAM. With nothing loaded and all my apps closed activity monitor shows roughly 16 GB usage, seems excessive, but I can't figure out what other system stuff I can get rid of to leave more room for the model + context.

I know I can keep turning down the context window, use a smaller model, etc., but it just feels like I'm missing something... I'd like to know immediately when I load a model with a given context size if it'll eventually hit the OOM guard or not.

I suppose I don't have a clear question, but I've been reading through this sub for a bit and still nothing has quite landed well for me. Any additional tips?

reddit.com

u/Fantastic-Storm-7867 — 8 days ago

▲ 21 r/oMLX+1 crossposts

oMLX is so good and efficient ! it’s just like having 500 Nvidia H100’s !!!

p.s. this is obviously a joke

u/Clementine-TeX — 10 days ago

▲ 3 r/oMLX

How do you set a custom system prompt for a given model?

That’s it.

reddit.com

u/JLeonsarmiento — 10 days ago

▲ 8 r/oMLX

Generation crashes around 100k context (qwen3.6)

Hi all, I'm trying to use omlx with codex. It often crashes and shows reconnecting on codex when context reaches around 100k tokens, with output token generated >5000. I'm using q4 mlx models.

Is it oom error? My device is macbook 64gb ram, m5 pro.

Is this limit normal for 64gb ram device, or I have misconfigured anything?

reddit.com

u/Such_Ad1212 — 14 days ago