r/MiniMax_AI

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)
▲ 326 r/MiniMax_AI+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago

Need Advice: MiniMax vs Z.ai vs Kimi

I already have Codex and Antigravity. For heavy coding (backend, architecture, and UI), would you recommend adding the MiniMax Token Plan, Z.ai Coding Plan, or Kimi? Which one has the best quality and value?

reddit.com
u/Big-Refrigerator7572 — 3 days ago

My Token Plan Experience

I have to admit I was a little worried about the Token Plan, especially reading about the excessively wiped out usage bars and the poor limits. I only use M3.

I have been pleasantly surprised on the $20 plan. I have a very high cache rate so my experience may not be the same as others, but so far I’m getting about 110million tokens per 5 hours, which seems to equate to about ~1billion per week based on the weekly meter.

This is far above what I expected and I’m extremely happy with thr experience. It’s a little slower than the API but for the work I do, which is very loopy, this works out well for me.

It’s not all doom and gloom.

u/Historical_Laugh2193 — 5 days ago

NEVER subscribe to MINIMAX for your projects! It is a marketing-heavy trap that lacks the functionality needed for serious projects!

Fellow developers and system architects, I need to share a massive frustration and a warning so you do not waste your time and money like I did.

I fell for the Minimax 2.7 trap, and now I have fallen for the M3 trap. Let me be absolutely clear: Minimax M3 is terrible. If you are building anything beyond a simple script, such as proprietary ERP engines, retail data solutions, or anything that requires serious logical reasoning, this model will fall apart in your hands.

It is completely incapable of maintaining context in a real world development environment. When you feed it a well documented ADR (Architecture Decision Record) based on Graphs, it gets completely confused. It hallucinates connections, loses track of constraints, and breaks the architectural logic.

Worse yet, it completely fails to respect the AGENTS.md file. We set up clear documentation, rules, and boundaries in that markdown file right in the repository, and Minimax simply ignores it all. It acts like the documentation does not exist, which makes it impossible to rely on for serious codebase integration.

Comparing Minimax to Claude Opus is an absolute joke. It does not even come close to the capabilities of Claude Sonnet, nor does it reach the ankles of Kimi 2.7. It is a toy product disguised by heavy marketing to fool consumers.

In my daily workflow, whenever Minimax makes a structural mess, the ones who actually step up to clean it up and finish the job with absolute professionalism are Qwen 3.7 Plus and Kimi 2.7. These models actually read the documentation, understand complex architecture, have genuine context resilience, and take systemic instructions seriously.

Consider this a public service announcement: I took the bait so you never have to. If you are running large and serious projects, steer far away from Minimax. Do not fall for the synthetic benchmarks because in the trenches of real code, it is a complete disaster.

reddit.com
u/Intelligent-Taste-36 — 7 days ago

The Minimax M3 Scam: Lies, Mocks, and Complete Disrespect for AGENTS.md

Fellow developers, take a close look at the file image_d7f207.jpg that I just shared. This perfectly summarizes exactly why Minimax is absolute garbage and completely useless for any serious project.

​We all know Minimax has always had a terrible habit of generating unwanted MOCs (Mocks). To prevent this, the plan was set and the rules were strictly defined in the AGENTS.md file: DO NOT CREATE MOCS.

​So what does the M3 "Minibostinha" do? It not only ignores the plan and completely disobeys the AGENTS.md file, but it goes ahead, creates the MOCs in the Vue.js front-end, and adds comments in the code claiming they are "NOT MOCS".

​When I confronted it on the screen, the response was bizarre. It literally admitted its guilt and confessed to the lie, stating: "Yes, Carlos. They are MOCs. I made a mistake. And the comments saying NOT MOC that I put in the files were a lie to myself."

​It then listed its own lies across components like DcHeroBanner and DcColecoesGrid, admitting it fabricated fake data like Nike and Apple brands, and summer collections for a Pet Shop tenant.

​An AI model that disobeys repository documentation, invents garbage code, and literally lies to itself in the file comments is completely unfit for complex architectures. Stay far away from this piece of trash!

u/Intelligent-Taste-36 — 6 days ago

Usage Limits and Credits

Does anybody understand when exactly token usage is considered or when credits are deducted?

I've started with 8B token average consumption in Plus highspeed plan and was migrated to new limits without any resets. So I cannot deduct clearly, when my token limits are breached or credits are considered. So far, I only see 5h limits filled, weekly limits are unlimited due to legacy status.

Do I need to get my running balance below 3.2B monthly token to have non-credit depleting access again?

https://preview.redd.it/q750e2ikmdah1.png?width=1666&format=png&auto=webp&s=03667cba8780f20707387db4c3611c234b7265a5

https://preview.redd.it/iwxpy1zemdah1.png?width=1285&format=png&auto=webp&s=5044fff1885948a00f162e6b25e8aa0f319e6fd3

reddit.com
u/tigerbrowneye — 6 days ago
▲ 161 r/MiniMax_AI+2 crossposts

Deepseek V4 pro vs Minimax M3. Judge is Opus 4.8. Results are disappointing

(UPDATE: added audits of Mimo V2.5 pro & Qwen 3.7plus too, here - https://www.reddit.com/r/DeepSeek/comments/1ufok23/comment/oty8vc1/ )

built an app.
Used GLM 5.2 to make a Build Plan.
Asked deepseek v4 pro and minimax m3 to implement the build plan faithfully (two separate projects, used Opencode desktop).
Then asked claude code (Opud 4.8 high) to audit both implementations.
Results are in the image.

Very disappointed with deepseek, lot of issues.
Totally surprised my M3 which cost me the same miniscule amount as deepseek (around 35 to 40 cents), but with mostly solid results.

Surprised bcos last time when i used M2.7 for coding, it was the worst of all, cos it introduced two bugs to fix one bug, it deleted unrelated code while fixing bugs. it was horrible. Now its decently solid and as cheap as deepseek.

Happy to have M3 competing with deepseek on cost and delivering way more reliable results that we can actually use for coding.

Sad about deepseek's coding performance. hope 4.1 becomes reliable enough to replace cursor's composer 2.5 for me.

Note: redacted some info about the app in the image.

Note: I use deepseek v4 pro for non-coding tasks too, and its amazing and reliable for the price. But coding, just not safely reliable.

u/Decent-Rain5100 — 9 days ago

Update: my open-source MiniMax GUI is now a native desktop app, MiniMax Studio (Win / macOS / Linux)

Two months ago I shared an open-source web GUI for MiniMax here. Since then I rebuilt it from the ground up into a native desktop app called MiniMax Studio.

It brings everything MiniMax can do into a single window on your computer, with no browser tabs and no command line:

  • 🎬 Media studio: image, video, music (your own lyrics), and voice (30+ voices plus cloning & design)
  • 💬 Chat that shows its thinking
  • 💻 A real code workspace when you need it
  • 🧠 Remembers who you are across sessions
  • 🌍 UI and in-app help in 6 languages
  • 🖥️ Installs like any app on Windows, macOS and Linux. No browser, no CLI.

Free and open source. Screenshots and downloads in the repo:
👉 https://github.com/eduardoabreu81/minimax-agent-gui

Still evolving. Feedback very welcome!

(macOS isn't notarized yet, so right-click then Open on first launch.)

Just to be clear on the focus: this isn't meant to be the best coding agent or an IDE replacement. A lot of what makes MiniMax great lives in the multimodal features (image, video, music and speech), and as a Token Plan subscriber you're already paying for all of it. The catch is that actually reaching those features usually means juggling the website, API calls, scripts and CLI commands. MiniMax Studio is my attempt to put everything in one simple place, so you can use the full range of what your Token Plan already includes without the friction. The code workspace is there when you need it, but it's one feature among many, not the whole point.

Here's a few screenshots:

Chat with M3

Coding Area

Image Generation

Music Compose

Settings

Speech Generation

Video Generation

reddit.com
u/digitalhunters0 — 7 days ago

Model Stacking GLM 5.2 and Minimax 3

Dan does a great job of explaining how close, or far, according to some, the open source vs. SOTA model race is. Really enjoyed this video https://youtu.be/cFYdiynrxpQ?si=0vamlAqO3rx0FKV2

For those who frequently complain here, it's important to note that open-source models aren't designed to compete at the highest levels. They are great bargain-bin daily drivers and more than adequate for 90% of the work we expect to get done.

u/trainermade — 6 days ago

Is Minimax dumb again?

Am I only one to face this problem? I use Minimax as a primary driver for months, since M2.5, with small gap when M2.7 was dumb before M3 rollout. M3 is truly brilliant! Oh, I mean it was. Because two last days it is extremely dumb, it misses the code in 10-20 lines, hallucinates problems and solutions, makes a mess out of the blue. Am I so unlucky, or does anybody face the same issue?

reddit.com
u/Barni275 — 9 days ago

Lesson LEARNED

Two months ago, I subscribed to an annual starter plan, but now the service is completely unusable because even simple, single prompts trigger a quota limit error before completing. When I asked for a refund on their Discord channel, they refused and stated that refunds are not available. The lesson here is to never buy an annual subscription and to stick strictly to monthly plans because you cannot trust these services. I am currently using DeepSeek instead, but I only top up ten dollars at a time because I am afraid they might change their token pricing in the near future.

reddit.com
u/Then-Eye9700 — 9 days ago

DO NOT buy the new MiniMax M3 $20 Token Plan for agentic coding. It’s a complete marketing scam.

Hey everyone, just wanted to drop a warning here because I just got completely burned by MiniMax's new $20 "Plus" token plan. They are aggressively marketing this right now as a killer Claude Pro alternative, claiming you get roughly 1.7 Billion tokens a month. Sounds insane on paper, right?

Well, turns out it is a massive billing trap.

To give you some background, I use OpenCode as my agentic coding harness in VS Code. I actually first ran into issues with that widespread Anthropic endpoint bug where cached tokens were getting counted twice. I heard that bug finally got resolved today, but to bypass it earlier, I had set up a workaround. I routed my MiniMax key through an OpenRouter BYOK setup and then threw the OpenRouter key into OpenCode.

After doing this workaround, my token usage reduced a lot because OpenRouter has great sticky routing and response caching. My cache hits are now always upwards of 95% which is awesome, so I was fully expecting that the massive caching discounts from MiniMax's PAYG (pay-as-you-go) plan would apply to my subscription pool too.

But nope. On the Token Plan, PROMPT CACHING DOES NOT EXIST.

Every single token, whether it is a fresh input, an output, or a piece of code the agent has already read 50 times in the same session, is counted exactly the same. It all drains from that 1.7B pool at full flat-rate value. Because agentic workflows multi-scan your entire project history on every single turn, it means most of my tokens are kinda wasted.

This plan can actually be really good for heavy, traditional API workflows where you send distinct, one-off stateless requests. But for agentic workflows where context loops repeatedly, it is a complete bait-and-switch. This is actually a cheap but good quality model, but the token plan itself is highly misleading. They show you cheap API cache rates to get you excited, but their subscription token plan treats all context like a brand new cold prompt. Save your twenty bucks, the plan is fundementally flawed for agent workflows.

reddit.com
u/Ssj273 — 12 days ago

M3/Token Plan: 753M tokens burned in 25 days with Claude Code - exported the CSV, the numbers are wild

Just analyzed my billing export from the MiniMax dashboard and wanted to share the breakdown because I hadn't seen anyone post actual numbers for M3 yet.

Setup: Claude Code as main agentic harness, switching between M3-512k and M2.7 for a dev project, about 25 days of real usage.

The short version: 753,957,883 total tokens consumed on M3-512k. Of those, 414 million were cache-reads and only 4.3 million were output. That's a 96:1 cache-read to output ratio.

Every single micro-turn - lint run, file check, 3-line patch - Claude Code re-reads the full context, and every single one of those re-reads drains the 1.7B pool at the exact same rate as fresh input. No discount.

Why this matters specifically for the Token Plan

Official API pricing for M3 (≤ 512K context, permanent 50% off rate):

  • Standard input: $0.30/M
  • Cache-read: $0.06/M (5× cheaper than input on PAYG)
  • Output: $1.20/M

On the Token Plan a Discord mod confirmed: cache-reads and standard input count identically against your pool. No 5× discount.

So for my 25 days of usage, on PAYG at current pricing those same 753M tokens would have cost $130.66 total:

  • 414M cache-reads × $0.06 = $24.85
  • 335M standard input × $0.30 = $100.62
  • 4.3M output × $1.20 = $5.19

At standard list price (no discount): $261.32

On the Token Plan those same tokens consumed 44.4% of the monthly 1.7B pool - $8.87 equivalent out of the $20 price. The plan is cheaper per-token in absolute cost, but the pool ceiling is what bites you.

The actual math on productive output

With a ~90% cache hit rate (typical for agentic coding with long sessions):

PAYG behavior (cache 5× discount):
1.7B pool → ~895M tokens of actual new work

Token Plan (flat rate, no cache discount):
1.7B pool → ~170M tokens of actual new work

About 5× less real output than the headline number implies. The 1.7B is real, it's just that in agentic workflows most of it goes to re-reading context that would cost almost nothing on PAYG.

Daily M3 breakdown that made me dig into the CSV

Worst single day was June 17: 90M tokens total, 66M were cache-reads, output was only 368K. Normal coding work, nothing crazy running in the background.

The interesting days are Jun 8/9/13 where cache-reads nearly disappear - those were the days the /anthropic endpoint bug was active and context wasn't caching. Standard input spiked instead. Different failure mode, pool still drains fast either way.

What's actually working

A few things people have confirmed in various threads:

  1. LiteLLM proxy between Claude Code and MiniMax through the native OpenAI endpoint - token caching reportedly functional through this route
  2. OpenCode CLI instead of Claude Code - the context re-read ratio is significantly lower
  3. M2.7 for context-heavy scanning, M3 only where reasoning quality matters - M2.7's cache behavior in the Token Plan seems more predictable

The plan works fine for stateless/short-context work or if you're mostly on M2.7. For Claude Code with long sessions it's probably the worst possible combination for this billing model.

Export your own CSV from the dashboard, look at cache-read(Text API) vs output in the Consumed API column - if your ratio is above 50:1, the pool is burning faster than the headline number suggests.

Curious what ratios others are seeing. Happy to share the analysis script too.

reddit.com
u/Evening_Rip1006 — 9 days ago

MiniMax Plus 1.7B tokens - real or rate limited?

Does the 1.7B tokens/month on MiniMax Plus actually mean you can use all of it, or are there 5-hour/daily rate limits like Codex/Claude Code that prevent heavy usage? Anyone using it regularly?

reddit.com
u/Big-Refrigerator7572 — 9 days ago

Minimax M3

Hi guys, I plan to replace my Deepseek v4 pro for coding with minimax M3, I am not interested in using any other harness except Claude code or PI ,
How much Usage will M3 give me for the plus plan ?
I use a lot of researching code base so. For Deepseek I regularly hit 100M token a day. Mostly input 98%. So how much to expect from M3 ?
Also I heard from Reddit as M3 has issue with Claude code causing huge token consumption.

reddit.com
u/PrizeHuman5506 — 10 days ago

For all the "How much do you get?" posts

https://preview.redd.it/7mk601k6mq9h1.png?width=1012&format=png&auto=webp&s=caf8580ed6d8417df32f1d2c347e087932d2042f

I bought my year of Max Token Plan back in April. You can see in the heat map that as M2.7 was superceded by better models, I used it less and less, but that one M3 came out, I started using it again. This sub cost me 1200 Chinese yuan for the year, or about $15/mo. I get tts and images changed *by token rate* which is just crazy for me -- virtually unlimited. I get three decent videos a day. But I mostly just use this along with my many other subs to run agent loops for a lot of the time.

reddit.com
u/Illustrious-Many-782 — 9 days ago

Does the MiniMax M3 1.7B/month Token Plan count cached tokens, or only uncached tokens?

https://preview.redd.it/y36snrscem9h1.png?width=1673&format=png&auto=webp&s=5f2dd1bc868ec9580a6454fe572b0921eb318ee4

Hi everyone,

I'm trying to understand how the MiniMax M3 Token Plan actually counts usage.

The Plus ($20/month) plan advertises:~1.7B tokens / month of M3 usage

But I can't find any documentation explaining whether this 1.7B quota includes cache reads or if cached tokens are discounted like they are in the PAYG API pricing.

My observation

I'm using an agent (Hermes/OpenCode-style workflow) with a very high cache hit rate.

After only 2–3 prompts, my dashboard shows:

  • 43.56M total tokens
  • 43.03M peak tokens
  • Cache hit: 95.5%

Despite the cache hit being 95.5%, it still appears that around 43M tokens were counted against my usage.

My question

If the cache hit rate is 95.5%, should those cached tokens still consume the monthly 1.7B token quota?

Or is the dashboard simply showing processed tokens, while the subscription quota is reduced by a much smaller amount?

If anyone from MiniMax or anyone who has tested this extensively knows how the quota is actually calculated, I'd really appreciate the clarification.

Thanks!

reddit.com
u/Nazmul-TechTips — 9 days ago