r/better_claw

I built Curion, a librarian-like memory agent for AI agents
▲ 19 r/better_claw+10 crossposts

I built Curion, a librarian-like memory agent for AI agents

I’ve been working on Curion, a memory system for AI agents built around a simple idea:

The main agent should not have to manage memory manually.

Most AI agents are useful inside a single session, but they still lose important context between sessions. Project decisions, implementation history, constraints, unresolved tasks, and previous reasoning often disappear unless I manually write long handoff notes.

At first, the obvious solution seems to be giving the agent memory tools: save, search, update, delete, edit.

But that creates a second problem.

If the main agent has to manage memory by itself, it can easily receive too many raw memories. Some are relevant, some are stale, some are only partially related, and some may conflict with newer information. The agent then has to spend context and attention deciding what matters.

That creates context bloat.

Curion takes a different approach.

I think of Curion as a librarian for AI agents.

A good librarian does not just throw every possibly related book at you. They understand the question, know how information is organized, filter what matters, notice conflicts, ask clarifying questions when needed, and return the most useful context.

That is what Curion is meant to do for agent memory.

The main agent only needs to say:

“I want to remember this.”

or

“I need to recall something about this.”

Curion handles the rest.

When saving memory, Curion can decide how information should be stored, whether it relates to existing records, whether something should be updated, and whether a conflict requires clarification.

When recalling memory, Curion does not just dump raw search results into the agent’s context. It retrieves relevant records, evaluates what is useful for the current task, synthesizes the context, and clearly says when nothing relevant was found.

The analogy I use is human memory. When we want to remember something, we do not consciously search through billions of memories. We ask for what we need, and the relevant memory appears automatically beneath the surface.

Curion is built around that same interface idea for AI agents.

It is project-first: Curion focuses on the project the agent is currently working in. It can also use cross-project recall when information from another project is actually relevant.

Curion is not just a save/search tool. It is a collaborative memory layer: a specialized memory librarian that helps agents remember responsibly, reduces context bloat, and gives the main agent only the context it actually needs.

GitHub: https://github.com/geanatz/curion

NPM: https://www.npmjs.com/package/@geanatz/curion

Portfolio: https://geanatz.com

u/geanatz — 1 day ago
▲ 36 r/better_claw+1 crossposts

pricing "AI employees" is messing with my head. some notes after a couple months trying to sell this stuff

ok so I've been trying to sell OpenClaw agents to small businesses (law firms, real estate, that kind of thing) for a couple months now. building the agent part is honestly the easy bit at this point. pricing it is what's been keeping me up.

random notes, not super organized, just what I've actually run into:

per-seat pricing is dumb for this. tried it first bc that's what I know from SaaS. but nobody buying an agent cares how many "agents" you spun up, they care if their invoices go out faster. pricing per agent makes the client think about YOUR architecture instead of THEIR problem. bad idea.

what worked way better: call it an "AI employee" and charge monthly like a salary. not because it's technically accurate but because business owners already have a mental model for "what does a person cost me." suddenly you're not competing with a software subscription in their head, you're competing with hiring someone. much easier fight to win.

also — cost-plus pricing is a trap and I fell into it immediately. my first instinct was tokens cost X, compute costs Y, slap on margin, done. but like. if the thing you're selling stops a law firm from losing half a million euros they didn't even know they were losing, charging 1k/month bc that's your token cost + margin is just leaving money on the table AND making the client think of it as "a tool" instead of "the thing that found me money." find the number that shows what the problem is costing THEM (bonus if it's literally in their own reports) and price under that, not anywhere near your actual cost. feels weird the first time. isn't wrong though.

thing nobody talks about enough: if you're riding on someone else's subscription-tier LLM plan, you don't actually control your own costs. access, rate limits, which tier third party apps can even use — all of that can get yanked with zero warning. seen it happen. so now I bill the LLM usage separately as a pass-through, not bundled into my fee. slightly uglier as a single price tag but means I don't wake up one day with my margins gone because someone else changed a policy.

setup fee + monthly retainer > pure monthly. was scared the setup fee would scare people off. opposite happened — it filters out the tire kickers who just want to "try it" and ghost. and it pays for the part that's actually bespoke, bc every client's tools/workflows are different, there's no universal setup.

discounts for commitment work but HOW you frame it matters more than the actual %. saying "12 month commitment, 5% off, totally your call" converts way better than making the discounted price the default and the flexible price look like a penalty. same numbers. different vibe. people respond to the vibe.

biggest thing though — the objection is never the price. not once. it's always trust. will this thing hallucinate into a client's inbox, will it leak something it shouldn't. security isn't something you price, it's something you have to kill as a doubt before you even get to numbers. I lead with that now, before I show a single euro.

anyway. still figuring this out as I go. anyone here doing outcome-based pricing instead of flat fee, like a % of whatever it recovers/saves? been tempted but can't figure out how to measure it without turning every client into an audit project

reddit.com
u/UltraFocusMe — 1 day ago
▲ 326 r/better_claw+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago

Already running an agent? These 4 settings cut your cost without changing anything else.

Not a different model. Not a different framework. Not a different workflow. Four config changes on whatever you're running right now. Same agent, same tasks, same daily experience.

One Medium article tracked their OpenClaw token usage and found 90% of their spend had nothing to do with the work they were actually asking the agent to do. Background overhead, stale conversation history, unused tool schemas, all of it billing silently on every single API call.

These four settings fix the silent bleed.

1. Cap your conversation history at 20 messages.

This is the single biggest cost lever most people never touch.

Every time you send a message, your agent re-sends the entire conversation history to the model. Message 1 gets sent once. Message 2 gets sent twice. By message 40, every new reply includes all 40 previous messages as input tokens. You're paying for the same old text over and over and the model doesn't need most of it.

The fix:

json

{
  "agents": {
    "defaults": {
      "maxHistoryMessages": 20
    }
  }
}

On OpenClaw: add this to ~/.openclaw/openclaw.json. On Hermes: the equivalent is in your agent config. On most frameworks: look for a session history or context window limit setting.

20 messages is enough for the model to understand your current conversation thread. It doesn't need message 3 from an hour ago to answer your question right now. The important stuff (your name, preferences, long-term context) lives in your memory files, not in conversation history.

Impact: 30-50% cost reduction on long-running conversations. A 40-message session that was sending 10,000+ input tokens per reply now sends roughly 4,000-5,000. You feel zero quality difference because the trimmed messages were irrelevant to the current exchange.

This is the setting that made one YouTuber's weekly bill drop from $50 to under $10. One line of config.

2. Route heartbeats and cron jobs to a cheap model.

Your agent checks for new messages roughly every 3 minutes. That's about 48 heartbeat checks per day. Each one sends your full system prompt, tool schemas, and memory context to the model just to ask "anything new?" and hear "nope."

If that heartbeat is hitting Sonnet 4.6 at $3/$15 per million tokens, you're spending $30-60/month on "nope." 48 times a day. Every day.

The fix:

json

{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-sonnet-4-6",
        "list": [
          {
            "id": "background",
            "model": "deepseek/deepseek-v3.2"
          }
        ]
      }
    }
  }
}

Route heartbeats, cron checks, and simple classification tasks to DeepSeek V3.2 ($0.14/$0.28 per MTok) or Claude Haiku 4.5 ($1/$5). Keep your primary model for actual conversations.

On Hermes: set the background curator to the cheap model. Keep your main conversation model on whatever you prefer.

Impact: 50-80% cost reduction on your total bill. Heartbeats and cron are 70-85% of most agents' daily token volume. Moving that volume from a $15/MTok output model to a $0.28/MTok output model is a 50x price drop on the majority of your usage.

The model checking "anything new?" doesn't need intelligence. It needs to read a status and say yes or no. A $0.28 model does this identically to a $15 model.

3. Disable every skill and tool you're not actively using.

This one is invisible and that's why it bleeds money for weeks before people notice.

Every enabled skill adds its tool schema to every API call. The model needs to see the tool definitions to decide whether to use them. Even when you never use a skill, its schema rides along with every single message.

10 enabled skills = roughly 3,000-5,000 tokens of tool definitions injected into every request. That's tokens you're paying for on every heartbeat, every cron check, every casual question, every "what time is it."

Hermes Agent's tool overhead is documented: 6-8K tokens via CLI, 15-20K tokens through messaging gateways like Telegram. That overhead exists on every call regardless of whether you trigger a tool.

The fix:

Go to your skills/tools settings. Count how many are enabled. Ask yourself which ones you actually used this week. Disable everything else.

On OpenClaw: openclaw tools shows your enabled tools. Disable the ones you don't use daily.

On Hermes: hermes tools does the same.

You can always re-enable them later. Disabling is not uninstalling. You're just removing their schemas from the context window.

Impact: 15-30% cost reduction depending on how many unused skills you have enabled. Someone with web browsing, code execution, file management, and 5 custom skills enabled but only using file management regularly is burning 3,000+ tokens per call on tool descriptions that never get triggered.

4. Slow down your heartbeat interval.

Default heartbeat in most setups checks every 2-3 minutes. That's 480-720 checks per day. Each check sends tool schemas, system prompt, and memory context to the model.

For a personal agent on Telegram, do you really need to know about a new message within 2 minutes? For most people, checking every 10-15 minutes is fine. You'll see the message 10 minutes later. In exchange, you cut heartbeat API calls by 3-5x.

The fix depends on your framework:

On OpenClaw: adjust the heartbeat interval in your gateway config or cron schedule.

On Hermes: the polling interval for each channel can be configured in the gateway settings.

Going from 3-minute to 10-minute intervals: your daily heartbeat calls drop from ~480 to ~144. Same agent. Same functionality. 70% fewer background API calls.

Impact: 20-40% cost reduction on background token usage specifically. Combined with routing heartbeats to a cheap model (setting #2), the compound savings are significant. You're making fewer calls AND each call is cheaper.

For agents that primarily respond to direct messages (not monitoring real-time feeds), a 15-minute interval is more than responsive enough. If someone messages you on Telegram, a 15-minute delay before your agent sees it is barely noticeable for most use cases.

The compound effect:

These four settings stack. Here's what happens when you apply all four to a typical personal agent running Sonnet 4.6:

Before: ~$50-70/month. Full conversation history, Sonnet on everything, 12 skills enabled, heartbeat every 3 minutes.

After setting 1 (history cap): $30-45/month.

After setting 2 (model routing): $10-18/month.

After setting 3 (disable unused skills): $8-14/month.

After setting 4 (slower heartbeat): $5-10/month.

Combined: $5-10/month. Down from $50-70. Same agent. Same tasks. Same morning briefings. Same email triage. Same Telegram conversations. Four config edits.

What you're NOT changing:

Your model for conversations. Sonnet stays Sonnet for the messages you actually read and respond to. The quality of your agent's output on real tasks is identical.

Your workflows. Same cron jobs. Same automations. Same integrations. Nothing about what your agent does changes.

Your framework. These settings exist in OpenClaw, Hermes, and most managed platforms. You're not migrating anything.

The 5-minute checklist:

Open your provider dashboard (console.anthropic.com, platform.openai.com, openrouter.ai). Check your daily token usage. Find the ratio of input to output tokens. If input tokens are 5-10x your output tokens, your conversation history is bloated and unused tool schemas are padding every call.

Apply the four settings. Check your dashboard again in a week. The difference will be obvious.

reddit.com
u/ShabzSparq — 4 days ago
▲ 50 r/better_claw+3 crossposts

OpenClaw v2026.6.11 Release Notes | Fixes for Misplaced Replies, Stuck Sends, model setup failures, and more!

We heard the feedback. v2026.6.11 focuses on the rough edges that make OpenClaw feel less dependable, with fixes for misplaced replies, stuck sends, reconnects, model setup failures, and safer admin defaults.

Replies, sends, and reconnects

Across Telegram, WhatsApp, Matrix, Google Chat, iMessage, Feishu, and Mattermost, replies, commands, queued messages, and attachments are less likely to be dropped, duplicated, misrouted, or attached to the wrong conversation.

WebChat and the Control UI keep the active conversation visible more consistently after reconnects. The terminal UI now clears completed or rejected sends instead of leaving them looking stuck.

Models and fallback recovery

Model selection and setup recover more clearly when catalogs, credentials, streams, timeouts, compaction, or fallbacks go wrong. Affected OpenAI, OpenRouter, and OpenCode Go setups are less likely to leave users with a stale model choice or a stalled request.

Follow-up fixes improve fast mode in affected provider and fallback paths. Automatic fast mode itself is not new in this release.

Sessions, memory, and safer recovery

Sessions, compaction, memory, and QMD-backed memory preserve the intended conversation and useful context more consistently through long-running work, reconnects, upgrades, and transcript repair. Tool search also recovers the right context or capability more reliably.

Encrypted Matrix recovery now stops safely when required key state cannot be verified. Tool policies, approvals, and secret handling stay attached to the intended runtime state, while higher-risk actions remain disabled unless explicitly enabled.

Plugins and installation

Plugin management now handles more official integrations through normal external package installation and repair flows. The plugin inventory and setup checks give clearer guidance when a package is missing, incompatible, or needs to be reinstalled.

Admin and deployment controls

Slack router relay mode gives managed or multi-gateway deployments a supported way to centralize incoming Slack traffic while the correct gateway still handles mentions, threads, and replies. The Raft channel and Raft plugin add a local CLI wake path for External Agents, including setup and status checks.

Gateway health and troubleshooting signals now line up more consistently with whether OpenClaw is ready, restarting, or unable to continue. Agent runs started through the CLI and the broader gateway recover more cleanly from disconnects, shutdowns, routing changes, and failed startup conditions.

Setup, commands, and scheduled work

Common CLI commands now handle configuration, paths, output, and failure cases more consistently. Shell completion, doctor, config commands, and gateway configuration provide clearer guidance when an installation or setting needs attention.

Scheduled jobs and built-in tools now finish, retry, report failures, and preserve their intended inputs more consistently. The plugin SDK runtime also improves reliability for tool-backed extensions that load, return results, or run scheduled work.

Full Release Notes

This release includes 302 PR-backed units and 704 direct commits. Full notes: https://docs.openclaw.ai/releases/2026.6.11

u/hannesrudolph — 5 days ago

Tested a few local models that actually run on 8GB. Two are usable, one surprised me.

8GB is what most people actually have. 8GB unified memory on an M1 MacBook Air. 8GB VRAM on an RTX 4060. Not enough for the models everyone talks about. Just barely enough for something useful if you pick the right one.

The problem is that "fits in 8GB" is misleading. The model weights need to fit, plus the KV cache for your conversation context, plus whatever your OS and other apps are using. On 8GB, you realistically have 4-5GB available for the model. That rules out anything above 7-8B parameters at Q4 quantization.

I tested everything that actually fits on this hardware for agent work specifically. Not chat demos. Not single-turn benchmarks. Real agent tasks: tool calling through MCP servers, multi-turn conversations, email classification, and morning briefing generation through Ollama.

Most models at this size are frustrating. Two are genuinely usable. One surprised me.

Usable #1: Phi-4 Mini (3.8B). The safe pick.

bash

ollama pull phi4-mini

3.8 billion parameters. About 2.5GB at Q4_K_M. Fits on 8GB with enormous headroom. Your OS, your browser, and a few other apps all run comfortably alongside it. No memory pressure. No swapping.

Speed: 15-28 tokens per second depending on hardware. On an M1 MacBook Air, roughly 15-20 tok/s. On an RTX 4060 (8GB), closer to 28 tok/s. Fast enough for interactive conversation. Not instant, but not painfully slow.

Microsoft trained this model with a focus on reasoning and math. It scores 80.4% on MATH, which beats Llama 3.3 8B (68.0%) and even Qwen 2.5 14B (75.6%) despite being a fraction of the size. For analytical tasks (explaining concepts, solving problems step-by-step, debugging logic), Phi-4 Mini punches way above its weight class.

For agent work: basic tool calling works. Single-tool calls (search the web, read a file, query a database) succeed reliably. Classification tasks run fine. Morning briefing generation is decent if not spectacular.

The limitation: 16K context window. This is tight for agent work. Three MCP servers with their tool schemas eat 3,000-5,000 tokens of context. Your conversation history eats another 3,000-5,000. By turn 10, you're pushing the ceiling. Sessions need to be short. /new after every 5-7 exchanges.

For short, focused agent interactions on truly constrained hardware, Phi-4 Mini works. Just don't ask it to hold a long conversation or chain 4 tools together.

Usable #2: Gemma 4 E4B (4.5B). The multimodal pick.

bash

ollama pull gemma4

4.5 billion parameters. About 6GB for the model. Tight on 8GB but it fits if you're not running heavy background apps alongside it.

What makes Gemma 4 E4B worth the squeeze: 128K context window and multimodal support. Send your agent a photo of a receipt and ask "what was the total?" Send a screenshot of an error message and ask "what's wrong?" No other model at this size does images, video, and audio.

That 128K context window is 8x larger than Phi-4 Mini's 16K. For agent work, this is the difference between sessions that last 5 exchanges and sessions that last 25+. Tool schemas fit comfortably. Conversation history doesn't get truncated. You can actually have a proper back-and-forth without hitting the ceiling.

Speed: 40-60 tok/s. The fastest model in this comparison. Replies feel close to instant on short answers.

Native function calling is built into the training. Google designed Gemma 4 with tool use in mind. Single-tool calls are reliable. The model formats JSON arguments correctly and doesn't hallucinate tool names.

The limitation: reasoning depth. At 4.5B parameters, complex analytical tasks are noticeably weaker than Phi-4 Mini despite Phi-4 being smaller. Gemma 4 E4B is broader (multimodal, long context) but shallower on any single task. It handles "classify this email" perfectly. It struggles with "analyze these three emails and tell me which client is most likely to churn and why."

Apache 2.0 license. Commercially friendly.

The surprise: Qwen3.5 9B. This shouldn't work this well at 8GB.

bash

ollama pull qwen3.5:9b

9 billion parameters. At Q4_K_M, the model weights are about 5GB. On an 8GB GPU, it fits entirely in VRAM with room for a reasonable KV cache. No spilling to system RAM. No layer splitting across CPU and GPU.

I almost skipped this model because 9B on 8GB seemed like it would be too tight. I was wrong.

LocalLLM.in benchmarked it at 55-58 tokens per second on an RTX 3070 (8GB). Fully GPU-loaded. Flat speed across all context sizes up to 16K. That's faster than Phi-4 Mini on the same hardware despite being over twice the parameter count.

The quality jump from 3.8-4.5B to 9B is dramatic. On multi-turn agent conversations, Qwen3.5 9B maintains coherence across 15+ exchanges where the smaller models start drifting by turn 7. Tool calling is significantly more reliable. Chained tool calls (file search, then web search, then write results) succeeded at roughly 75% on Qwen3.5 9B vs maybe 50% on the smaller models.

It has a /think reasoning mode that activates deeper chain-of-thought when needed. On complex queries it pauses, reasons through the problem, then responds. On simple queries it skips the reasoning and responds immediately. The automatic mode switching is surprisingly good at knowing when to think harder.

32K+ context window. Enough for multiple MCP servers, conversation history, and extended sessions without truncation.

The catch: it's tight. On 8GB VRAM, the model fits but there's minimal headroom. If your context grows past 16K tokens, KV cache pressure starts pushing against the limit. On some hardware configurations, Ollama automatically spills overflow to system RAM, which tanks speed. On others, it crashes.

The practical rule: keep num_ctx at 8192-12288 on 8GB hardware. Don't try to use the full 32K context. Set it in a custom modelfile:

bash

cat > qwen-agent.modelfile << 'EOF'
FROM qwen3.5:9b
PARAMETER num_ctx 8192
PARAMETER temperature 0.3
EOF

ollama create qwen-agent -f qwen-agent.modelfile

Temperature at 0.3 for agent work. Lower = more reliable tool calls. Higher = more creative but flakier function call formatting.

On 8GB system RAM (not VRAM) without a dedicated GPU: don't try 9B. CPU-only inference on a 9B model gives you 3-6 tok/s. Painfully slow. Stick with Phi-4 Mini or Gemma 4 E4B on CPU-only 8GB machines.

What I tested and cut:

Llama 3.2 3B. 3 billion parameters. Fits easily. 10-12 tok/s on CPU. But the quality is noticeably below Phi-4 Mini on every task despite being nearly the same size. Classification accuracy was lower. Tool calling was flakier. Output felt generic. The Llama name carries weight but at 3B, Phi-4 Mini is simply better.

DeepSeek R1 8B (distilled). Fits on 8GB VRAM at Q4. Reasoning is impressive with visible chain-of-thought. But tool calling is unreliable. The model wants to reason ABOUT tools instead of calling them. Asks itself "should I use the filesystem tool here?" and then writes a paragraph about why, instead of actually making the call. Great for analysis and math. Bad for agents.

Mistral 7B Instruct v0.3. Fits well. Fast (20% faster than Llama 3.1 8B). Good for business communication and summaries. But function calling was inconsistent. The training didn't emphasize structured tool use the way Qwen and Gemma did. Fine for chat. Not for agent work.

Qwen 2.5 7B. Previous generation. Fits on 8GB. Good tool calling (the Qwen family reputation). But Qwen3.5 9B is better on every metric AND fits on the same hardware. No reason to run the older version unless you specifically need its coding specialization.

The honest 8GB assessment:

8GB is the floor, not the sweet spot. You CAN run an agent. The experience is compromised compared to 16GB+ in every measurable way. Smaller models, shorter context windows, less reliable tool calling, slower inference on the models that push the limit.

If your agent does focused, short-session tasks (classify this email, answer this question, draft this message), 8GB works. Genuinely. Qwen3.5 9B on 8GB VRAM handles these tasks well enough that you forget it's running locally.

If your agent needs long multi-step workflows, extensive tool chaining, or sustained autonomous operation, 8GB will frustrate you. The context window constraints and tool-calling reliability gaps compound over longer sessions.

The 8GB setup I'd actually run:

Qwen3.5 9B as the daily driver (if you have a dedicated GPU with 8GB VRAM). Best overall quality. Set num_ctx to 8192. Use /new frequently.

Gemma 4 E4B as the fallback for multimodal tasks and when you need longer context.

Phi-4 Mini for CPU-only machines or when you need maximum headroom for other apps.

Don't try to run two models simultaneously on 8GB. Ollama handles model swapping automatically but the swap takes 3-5 seconds and temporarily spikes memory. Run one model at a time.

$0/month. 8GB of whatever you already own. Not perfect. But real.

u/ShabzSparq — 7 days ago

US Gov now personally vets who gets GPT-5.6 early access? This feels like the start of AI feudalism

https://preview.redd.it/diws1bzdf8ah1.png?width=1080&format=png&auto=webp&s=33e6123fb1310fee60a75653bb3ac365522023c8

The Trump administration asked OpenAI to release GPT-5.6 only to a small group of "government-approved partners" first, with the feds literally approving access customer by customer during the preview window. Broader release comes later (maybe).

This comes right after the same admin forced Anthropic to disable/restrict their latest models (Mythos/Fable) over national security concerns and has been in an ongoing feud with them.

So now frontier models aren't just "released" - they're rationed through government approval. Regular developers, smaller companies, researchers, and normal users get to watch from the sidelines while the pre-vetted crowd gets the good stuff first.

Is this legitimate cyber defense, or are we watching the government turn the most powerful technology on Earth into a permissioned club?

Feels like we're alienating everyone outside the approved circle from actual progress. What's the endgame here?

reddit.com
u/ShabzSparq — 6 days ago

I priced running an agent three ways: raw Hermes, GLM-5.1 API, and a managed free tier. Real numbers.

------------
TLDR:

Hermes self-hosted with Sonnet 4.6: $17-77/month depending on model routing. Best quality. Most maintenance.

Same Hermes setup swapping Sonnet for GLM-5.1: $12/month. 90% of the quality. Same maintenance.

BetterClaw free tier with Gemini Flash: $0-3/month. Good enough for most personal use. Zero maintenance. Ceiling at 100 tasks/month.

If you're paying more than $15/month for a personal agent right now, you're either running the wrong model or managing infrastructure that a free tier handles for you.
-------------------------

Every cost comparison I've read either oversimplifies ("it's basically free!") or overcomplicates (spreadsheets with 40 line items). I wanted the honest middle. Same agent workload. Three setups. Real monthly costs including the stuff people forget to count.

The workload: morning briefing at 8am (email check, calendar, news summary). Email triage (20-30 classifications/day). 10-15 ad-hoc conversations. A few research tasks per week. Roughly 300K input tokens and 100K output tokens per day. Normal personal agent usage.

Setup 1: Hermes self-hosted on VPS. The "full control" path.

Infrastructure (verified June 2026):

Hetzner CX22: €3.79/month (~$5/month). 2 vCPU, 4GB RAM, 40GB NVMe. Hermes needs minimum 1 vCPU and 2GB RAM for CLI-only work. This is comfortably overkill.

Hermes Agent software: $0 (MIT license). Docker image: nousresearch/hermes-agent:latest. Free forever.

Setup time: 20-30 minutes if you're comfortable with SSH and Docker. Longer if you're not. The hermes doctor command catches config issues early, which saves time.

API costs (the part that actually matters):

Running Sonnet 4.6 ($3/$15 per MTok) on everything: 300K input + 100K output per day = $0.90 + $1.50 = $2.40/day. $72/month just on API calls. Add the $5 VPS and you're at $77/month.

One Medium article nailed the warning: "The $5 VPS claim is technically true and operationally misleading. If you take it at face value and ship the default setup, you will end up with a working agent and a $400 OpenRouter bill at the end of the month."

With model routing (the smart version):

DeepSeek V3.2 ($0.14/$0.28 per MTok) for background tasks: heartbeats, cron checks, email classification. ~85% of daily token volume. Cost: ~$0.05/day.

Sonnet 4.6 ($3/$15) for actual conversations and complex reasoning. ~15% of volume. Cost: ~$0.36/day.

Total API with routing: ~$12/month. Total with VPS: ~$17/month.

Hidden costs people forget:

Maintenance: 2-4 hours/month. Gateway restarts when Telegram disconnects. Docker updates. Reviewing auto-generated skills (Hermes's learning loop can encode bad patterns). Security patches. This time has value even if you don't bill for it.

One commenter in a cost analysis thread estimated that the "time tax" of self-hosting adds $10-30/month in equivalent labor, depending on how you value your hours. Most people don't count this. They should.

Setup 2: Same Hermes VPS, but swap Sonnet for GLM-5.1.

Same VPS: $5/month. Same Docker setup. Same Hermes software. Same Telegram connection.

The only change: the model.

GLM-5.1 on OpenRouter: $0.98/M input, $3.08/M output. That's roughly 3x cheaper than Sonnet on input, 5x cheaper on output.

Background tasks still on DeepSeek V3.2: ~$0.05/day.

Conversations and reasoning on GLM-5.1: ~40K input + ~45K output per day. $0.04 + $0.14 = $0.18/day.

Total API with routing: ~$7/month. Total with VPS: ~$12/month.

$5/month cheaper than the Sonnet version. For what?

GLM-5.1 scores 65.3 on BenchLM's agentic composite. Sonnet 4.6 scores 65.1. Basically tied on agent tasks. GLM-5.1 has the lowest tool-call hallucination rate measured among frontier models at 3%. Z.ai demonstrated it running 655 autonomous iterations over 8 hours continuously.

Where GLM-5.1 loses: knowledge breadth (52.3 vs Sonnet's 73.7). Ask it a general knowledge question and the answer is thinner. Context window: 203K vs Sonnet's 1M. Speed: 44 tok/s vs Sonnet's faster inference. No multimodal (text only, no images).

For agent work specifically (classify, extract, draft, research, tool calling), the quality gap between GLM-5.1 and Sonnet is invisible most days. For the 2-3 times per week where you need deep knowledge or image processing, you route that one task to Sonnet and pay the premium on that specific call.

Blended monthly cost: $12/month with GLM-5.1 as default and Sonnet for overflow. Same agent. Same daily experience. $5 less than all-Sonnet routing.

Setup 3: BetterClaw free tier. The "$0 and done" path.

Infrastructure: $0. No VPS. No Docker. No terminal. No SSH. Sign up with email, paste an API key, connect Telegram. Agent live in 7 minutes.

API costs:

Google AI Studio (Gemini 2.5 Flash) free tier: 1,500 requests/day. No credit card. No expiry. For a personal agent doing morning briefings, email triage, and 10-15 conversations, 1,500 daily requests is more than enough. Cost: $0/month.

If you want slightly better quality: DeepSeek V3.2 at $0.14/$0.28 per MTok. At my workload level: $2-3/month.

Total monthly cost: $0-3.

Setup time: 7 minutes. Sign up (30 seconds). Paste Gemini or OpenRouter API key (60 seconds). Connect Telegram (90 seconds). Create morning briefing task (3 minutes). Done.

Maintenance: roughly 15 minutes over two weeks. Adjusted one SOUL.md rule. Changed a cron timing once. That's it. No Docker. No gateway restarts. No security patches. No skill reviews.

The limits (being honest):

100 tasks/month. A morning briefing cron uses 1 task/day (30/month). Email triage uses 1 task/day (30/month). 10 ad-hoc conversations uses ~10 tasks/day (300/month). Wait. That's over 100.

Here's the real math: if your cron jobs and ad-hoc conversations together exceed 100 tasks in a month, you'll hit the ceiling by week 3. Light users (daily briefing + a few questions) stay under. Heavy users hit it.

7-day memory. Your agent forgets conversations from 2 weeks ago. Preferences and facts saved in the agent's memory persist. But contextual details ("remember that restaurant I asked about last month") don't survive past a week.

1 agent. No multi-agent setups. No parallel workflows. One agent doing a few things well.

No learning loop. Unlike Hermes, BetterClaw's free tier doesn't auto-improve from repeated tasks. Your morning briefing on day 30 uses the same approach as day 1.

The real comparison table:

Setup 1 (Hermes + Sonnet, with routing): $17/month. Best quality. Maximum control. Maximum maintenance. Setup takes hours. You're the sysadmin.

Setup 2 (Hermes + GLM-5.1, with routing): $12/month. Near-identical agent quality. Same maintenance as Setup 1. Saves $5/month by choosing a model that's tied with Sonnet on agentic benchmarks.

Setup 3 (BetterClaw free + Gemini Flash): $0-3/month. Good enough for most personal use. Zero maintenance. 7-minute setup. Ceiling exists but most light users don't hit it.

What surprised me:

The model choice moves the bill more than the infrastructure choice. Swapping Sonnet for GLM-5.1 saved more per month than the VPS costs. The $5 Hetzner bill is the smallest line item. The API spend is 70-85% of total cost regardless of how you host.

The maintenance gap between self-hosted and managed is wider than the price gap. $12/month self-hosted vs $0-3/month managed looks like $9-12/month savings. But the 2-4 hours/month of maintenance on the self-hosted path is worth more than $12 for most people.

GLM-5.1 is genuinely good enough. I expected to notice the quality difference daily. I noticed it maybe twice in two weeks. Both times were knowledge questions the model answered thinly. For tool calling, email drafts, morning briefings, and research, GLM-5.1 and Sonnet produced functionally identical output.

Who should pick what:

You want maximum quality and maximum control and don't mind sysadmin work: Hermes + Sonnet routing. $17/month.

You want the same control at a lower price and the quality gap doesn't bother you: Hermes + GLM-5.1 routing. $12/month.

You want a working agent by lunch and never want to think about infrastructure: Managed free tier. $0-3/month.

You're not sure yet: Start with the free tier. If you outgrow it, you'll know exactly which limits you hit and whether self-hosting solves them.

reddit.com
u/ShabzSparq — 6 days ago

Tested a few local model that fits on 16GB for agent work. These 3 are worth running.

16GB is the most common setup in this community. 16GB unified memory on a MacBook. 16GB VRAM on an RTX 4060 Ti or 4080. Enough to run something real. Not enough for the big models everyone benchmarks.

The problem is that "fits on 16GB" and "runs well for agent work on 16GB" are very different things. A model that fits in VRAM but can't reliably call tools is useless for an agent. A model with great benchmarks but 3 tokens per second because it's spilling to system RAM is unusable for interactive conversation.

I tested every model I could pull through Ollama that genuinely fits on 16GB, ran them as the brain of an actual agent (OpenClaw, Telegram, MCP tools connected), and measured three things: tool-calling reliability, response speed, and whether I'd actually use it daily.

Most of them disappointed. Three didn't.

#1: Qwen 3.6 35B-A3B. The one that shouldn't work this well at this size.

bash

ollama pull qwen3.6:35b-a3b

This is the model that makes 16GB feel like 32GB. It's a 35 billion parameter Mixture-of-Experts model that only activates 3 billion parameters per query. You get 35B intelligence at 3B speed and memory cost.

VRAM usage: roughly 10-11GB at Q4_K_M with 16K context. Fits comfortably on 16GB with room for Ollama overhead, the OS, and a few browser tabs. No swapping to system RAM.

Speed: 50-80 tokens per second depending on hardware. A typical short reply arrives in 1-2 seconds. Longer responses (200+ tokens) in 3-5 seconds. Fast enough that conversation feels natural, not like waiting for a local model.

Tool calling: this is where it separates from everything else at this size. Qwen 3.6 Plus scored 37.0 on MCPMark and 94% first-attempt tool-call accuracy. The 35B-A3B variant inherits that training. In my testing, it completed 3-tool chains (file search, web search, file write) at roughly 85% success rate. That's not cloud-model reliable but it's the highest I measured on 16GB hardware.

The Qwen family is the community default for local tool calling and this variant is the reason. It follows structured function-call formats instead of improvising. The JSON arguments are well-formed. The tool names are correct. It doesn't hallucinate tools that don't exist.

What it handles well: email classification, morning briefings, web search and summarize, file management, database queries, multi-step tool chains. The daily agent workload runs on this model without feeling compromised.

What it struggles with: very long context (the MoE architecture handles 128K but quality degrades past 16K on 16GB because you can't allocate enough KV cache). Also, Qwen models can overthink simple requests. The "thinking" mode sometimes triggers a 20-second reasoning chain for a question that needs a 2-word answer. Set temperature to 0.7 and disable thinking mode for agent work.

#2: Gemma 4 E4B. The smallest model that doesn't feel small.

bash

ollama pull gemma4

Google released Gemma 4 in April 2026 under Apache 2.0. The E4B variant has 4.5 billion parameters and uses only about 6GB for the model itself. On 16GB hardware, it loads instantly with room for a 128K context window, your entire system, and whatever else you're running.

VRAM usage: roughly 6GB model + 4-6GB for context depending on num_ctx setting. At 32K context: about 9-10GB total. You could run a second model alongside it if you wanted.

Speed: 40-60 tokens per second. The fastest model in this comparison. Replies feel instant on simple queries.

The reason Gemma 4 is here and not just "too small for real work": multimodal. It handles text, images, video, and audio natively. Send your agent a photo of a receipt and ask "what was the total?" Send a screenshot of an error and ask "what's wrong?" No other model in this list does this at 16GB. Qwen 3.6 35B-A3B is text-only.

Native function calling is built into the model's training. Gemma 4 was designed with tool use in mind, not retrofitted. Single-tool calls are extremely reliable. Where it falls off is multi-step chains (as I covered in the tool-calling comparison). But for an agent that does one tool call per interaction (search, classify, read file, query database), Gemma 4 is the most reliable per-call model at this size.

What it handles well: quick Q&A, image analysis, simple tool calls, lightweight classification, anything where speed matters more than deep reasoning. The 128K context window means it can process long documents that would choke smaller context models.

What it struggles with: complex multi-step reasoning. It's 4.5B parameters. Compared to the 35B MoE or a 14B dense model, the reasoning depth is noticeably shallower on analytical tasks. Also, while function calling is reliable per-call, chaining 3+ tools causes it to shortcut (synthesize from partial data instead of completing the chain).

Best for: the secondary model in a 2-model routing setup. Use Gemma 4 for quick questions, image analysis, and simple tool calls. Route complex reasoning to the bigger model.

#3: Qwen3 14B. The dense model that punches hardest.

bash

ollama pull qwen3:14b

If you want the best single-model quality that fits entirely in 16GB VRAM without MoE tricks, this is it.

VRAM usage: 9.2GB at Q4_K_M with 4K context. At 16K context: roughly 11-12GB. Fits on 16GB with room to breathe. At 32K context it pushes close to the limit.

Speed: 15-62 tokens per second depending on hardware and context length. On an RTX 4080 16GB, benchmarked at 61.85 tok/s at 19K context. On older hardware or longer contexts, expect 15-30 tok/s. Slower than the MoE variant but still usable for interactive agent work.

Tool-calling accuracy: 0.971 on standardized benchmarks. That beat GPT-4o (0.857) and Claude 3.5 Sonnet (0.851) on the same test. For a 14B model running locally, that's remarkable.

The difference between this and the 35B MoE: Qwen3 14B is a dense model. All 14 billion parameters activate on every query. The reasoning depth per-token is higher than the MoE variant even though the MoE has more total parameters. On complex tasks that require sustained analytical thinking (multi-step reasoning, nuanced writing, connecting ideas across contexts), the 14B dense model produces noticeably better output.

What it handles well: complex reasoning tasks, coding, structured analysis, long-form generation, multi-step tool chains. This is the model you bring in when the 4.5B Gemma can't handle the complexity.

What it struggles with: speed vs the MoE variant. The 14B dense is 2-3x slower than the 35B-A3B MoE on most hardware because it activates all parameters instead of a 3B subset. For quick questions where you want an instant reply, the MoE feels noticeably faster. Also, at 14B the model's knowledge breadth is still limited compared to cloud frontier models. Obscure factual questions get shaky answers.

What I tested and eliminated:

Llama 4 Scout 17B. MoE architecture. Interesting model. But tool-calling reliability was below Qwen on my tests. The community consensus matches: "Qwen family is the default for tool calling." Scout is good for general chat. Not the pick for agent work.

DeepSeek R1 14B. Excellent reasoning. Shows its chain-of-thought work. But tool calling is unreliable. The model wants to reason about tools instead of calling them. Produces beautiful explanations of what it would do instead of actually doing it. Great for analysis. Bad for agents.

Phi-4 14B. Top-tier reasoning per GB. 80.4% on MATH benchmark. But 16K context window. When you connect 3 MCP servers, the tool schemas eat 3,000-5,000 tokens of context. 16K minus tool schemas minus conversation history leaves almost nothing for the actual task. Not enough context for agent work.

Mistral Small 3.2 24B. Technically "fits" on 16GB at aggressive quantization. Practically, it spills to system RAM and drops to 18.5 tok/s. Not usable for interactive agent work. Needs 24GB to run properly.

GPT-OSS 20B. Fast (42 tok/s). Good general quality. But tool calling hasn't been validated as extensively as Qwen. And at 12-14GB VRAM, it leaves less headroom for context than the smaller models. Worth watching. Not proven enough for agent work yet.

The 16GB agent setup:

Daily driver: Qwen 3.6 35B-A3B. Fast. Reliable tool calling. Handles 85% of agent tasks. Set num_ctx to 16384.

Complex tasks: Qwen3 14B. Better reasoning. Slower. Use when the MoE gives shallow answers.

Quick questions and images: Gemma 4 E4B. Fastest. Multimodal. Use for simple tool calls and anything involving screenshots or photos.

Ollama handles model swapping automatically. When you switch from one model to another, it unloads the first and loads the second. On 16GB, expect a 2-3 second cold start when switching. That's the tradeoff for running multiple models on limited hardware.

bash

# Create custom modelfiles with proper context windows
echo 'FROM qwen3.6:35b-a3b
PARAMETER num_ctx 16384
PARAMETER temperature 0.7' > agent-fast.modelfile

echo 'FROM qwen3:14b
PARAMETER num_ctx 16384' > agent-quality.modelfile

echo 'FROM gemma4
PARAMETER num_ctx 32768' > agent-vision.modelfile

ollama create agent-fast -f agent-fast.modelfile
ollama create agent-quality -f agent-quality.modelfile
ollama create agent-vision -f agent-vision.modelfile

Three models. One machine. 16GB. $0/month.

reddit.com
u/ShabzSparq — 9 days ago

Tested Qwen, Gemma, GLM, and MiniMax for agent tool-calling. Only two really follow through.

Tool calling is the thing that separates "chat model" from "agent model." Your agent doesn't just need to generate text. It needs to read your email. Search the web. Query a database. Write to a file. Call an API. And it needs to do these things by generating structured function calls with the right parameters in the right format every single time.

Most models can do this once. The question is whether they can do it 5 times in a row without hallucinating a tool that doesn't exist, mangling the JSON arguments, or confidently reporting success when the tool call never fired.

I tested four model families on real agent workflows. Not benchmarks. Not single-call demos. Multi-step tool chains where each call depends on the previous one succeeding. The kind of work your agent actually does.

Two of them reliably follow through. The other two break in ways that waste your time more than failing outright would.

The test:

Same setup across all four. OpenClaw agent. Three MCP servers connected: filesystem (read/write project files), web search (Brave Search), and SQLite (query a local database). Same tasks:

Task 1 (simple): "Search the web for [topic] and save a summary to a file." Two tool calls. Search, then write.

Task 2 (medium): "Find all files in my project that mention authentication, search the web for current best practices, and write a comparison to a new file." Three tool calls chained. File search, web search, file write.

Task 3 (complex): "Query the database for all users who signed up last month, search the web for their company information, and create a report file with the combined data." Four+ tool calls with data passing between steps.

10 runs per task per model. Scored on: did the tool call fire correctly, did it pass the right arguments, did the chain complete, and was the final output actually correct.

Winner #1: Qwen 3.6. The tool-calling default for a reason.

Simple tasks: 10/10. Perfect.

Medium tasks: 9/10. One run used a slightly wrong file path parameter that the filesystem server rejected. The model caught the error, corrected the path, and completed the chain on retry.

Complex tasks: 8/10. Two failures. One was a malformed JSON argument on the database query (wrong date format). The other was the model skipping the web search step entirely and generating company information from its training data instead of actually searching.

Overall: 90% completion across all task types. The highest of the four.

The community consensus matches what I found. Qwen is "the default for tool calling" in the local AI space and it earns that reputation. XDA benchmarked Qwen3 14B at 0.971 tool-call accuracy on standardized tests, beating GPT-4o (0.857) and Claude 3.5 Sonnet (0.851). Qwen 3.6 Plus hit 94% first-attempt accuracy and scored 37.0 on MCPMark, the highest among open-weight models.

What makes Qwen different: it was trained with explicit function-calling templates. The model doesn't improvise tool-call syntax. It follows a rigid format that MCP servers actually parse correctly. Other models try to be "creative" with their tool calls. Qwen follows the spec.

The MoE variant (35B-A3B) runs on 16GB RAM and handles tool calling almost as reliably as the full 27B dense. If you're on consumer hardware, this is the model to start with.

Winner #2: GLM-5.1. The endurance pick.

Simple tasks: 10/10. Perfect.

Medium tasks: 9/10. Same pattern as Qwen. One minor parameter formatting issue, self-corrected on retry.

Complex tasks: 7/10. Three failures. But here's what's interesting: the failures were all on the database query step specifically. GLM-5.1 struggles with SQL date formatting more than Qwen does. On the steps that succeeded, the tool calls were clean and well-formed.

Overall: 87% completion. Slightly below Qwen. But the story isn't in the completion rate. It's in the endurance.

GLM-5.1 has the lowest tool-call hallucination rate measured among frontier models at 3%. Hallucination here means calling a tool that doesn't exist, inventing parameters, or claiming a tool call succeeded when it didn't. Most models hallucinate tool calls 5-10% of the time. GLM-5.1 almost never does.

Z.ai demonstrated GLM-5.1 running 655 autonomous iterations over 8 hours on a single task. That sustained tool-calling reliability over long sessions is where GLM-5.1 separates from everything else. Qwen is more accurate per-call. GLM-5.1 is more reliable over time. If your agent runs a 50-step workflow overnight, GLM-5.1's low hallucination rate means fewer silent failures that you discover in the morning.

The 128K context window out of the box matters for tool calling specifically. Every connected MCP server adds tool schemas to context. Three servers with 10 tools each is 3,000-5,000 tokens of schema. GLM-5.1's context window absorbs this without squeezing your conversation. Smaller context models start cutting tool descriptions, and that's when calls start failing.

Close but not quite: Gemma 4.

Simple tasks: 10/10. Perfect.

Medium tasks: 8/10. Two failures. Both the same pattern: Gemma completed the first tool call, got the results, and then... generated the final answer from the tool results without making the second tool call. It shortcutted the chain. The output looked reasonable because it synthesized from partial data. But it skipped steps.

Complex tasks: 5/10. Half the runs failed. The pattern was consistent: Gemma 4 is conservative on chained calls. It fires the first tool reliably. Sometimes the second. By tool call 3 or 4, it starts synthesizing from what it already has instead of calling the next tool. The output reads well. It's just not grounded in the data it was supposed to retrieve.

Overall: 77% completion. The frustrating part is that Gemma's individual tool calls are well-formed. The function-call training is solid. Apache 2.0 license. Native function-calling support. Multimodal (can process images in tool chains, which Qwen and GLM can't). On paper, it should be a top pick.

In practice, the chain-shortcutting behavior makes it unreliable for any agent workflow with 3+ steps. Your agent looks like it completed the task. The output appears coherent. You don't realize it skipped the web search and used training data instead of live results until you check manually.

For single-tool tasks (classify this email, search the web, read this file), Gemma 4 is excellent. For multi-step agent chains, it quietly cuts corners.

Didn't make the cut: MiniMax M2.7.

Simple tasks: 9/10. One malformed JSON argument on the first attempt.

Medium tasks: 6/10. Four failures. Mixed failure modes: wrong parameter types, invented tool names that don't exist, and one case where it generated a natural language description of what it wanted to do instead of actually calling the tool.

Complex tasks: 3/10. Seven failures. The model lost coherence by step 3 consistently. Tool calls became increasingly creative (wrong) as the chain lengthened. One memorable failure: it called the filesystem write tool with the search results as the filename and the filename as the content. Reversed the arguments entirely.

Overall: 60% completion. At $0.30/M input tokens, MiniMax is the cheapest model in this test by a wide margin. And on simple coding benchmarks it scores 56.22% on SWE-Bench Pro, which is 94% of GLM-5.1's performance. That stat is misleading for agent work. Coding benchmarks test code generation. Agent work tests structured tool calling under context pressure. MiniMax's code is fine. Its tool calls aren't.

If your agent does simple one-shot tasks (classify, extract, summarize), MiniMax works and the price is unbeatable. If your agent chains 3+ tools together, the 60% completion rate means you're debugging failures more often than using results.

The pattern across all four:

Single-tool calls: all four models handle them well (77-100%). The model generates one function call, gets a result, produces output. This is the demo that makes every model look good.

Two-tool chains: the field narrows. Gemma starts shortcutting. MiniMax starts producing malformed calls.

Three+ tool chains: only Qwen and GLM consistently complete the chain. Gemma synthesizes from partial data. MiniMax loses coherence entirely.

The lesson: if your agent only makes one tool call per interaction (search, classify, extract), pick whatever model is cheapest. They all handle it. If your agent chains tools together (search, then analyze results, then write a report, then send a summary), the model choice matters enormously. Qwen and GLM follow through. Gemma and MiniMax shortcut or fail.

The practical setup:

If you want the most reliable tool-calling agent:

Qwen 3.6 27B (or 35B-A3B MoE on 16GB hardware) as your primary model. Best per-call accuracy. Strongest on structured function calls. Community-tested extensively.

GLM-5.1 for long-running autonomous workflows. When the chain is 10+ steps and runs unsupervised, GLM's low hallucination rate and sustained endurance matter more than Qwen's slightly higher per-call accuracy.

Gemma 4 for multimodal tasks only. If your agent needs to process images alongside tool calls, Gemma is the only open-weight option here that handles vision. Just limit chains to 2 steps max.

MiniMax M2.7 for simple single-tool classification at scale. If you're routing 10,000 emails through a classifier and each one is a single tool call, MiniMax at $0.30/M is the right choice. Just don't chain it.

The model that generates the prettiest text isn't always the model that calls your tools correctly. For agent work, tool-calling reliability is everything. Two models have it. Two don't.

reddit.com
u/ShabzSparq — 10 days ago

Every free LLM provider, ranked by how fast the free tier actually runs out.

Everyone says "just use a free LLM provider." Great advice until your agent stops working at 2pm because you burned through the daily limit on your morning briefing and 6 conversations.

I ran an actual AI agent on every free tier I could find. Same daily workload: one morning briefing cron, email triage (20-30 classifications), 10-15 ad-hoc conversations throughout the day, and a few research tasks. Roughly 800-1,200 requests and 300K-500K tokens per day. Normal personal agent usage.

Here's how fast each one ran out. Ranked from "gone by breakfast" to "I genuinely forgot I wasn't paying."

TIER 1: Gone before lunch.

OpenRouter (unfunded free tier). 50 requests per day. Fifty. My morning briefing cron used 8 of them. Email triage used another 25. By 10am I had 17 requests left for the entire rest of the day. Two research tasks later I hit the wall. 50 requests is a demo, not a free tier.

The fix everyone knows: deposit $10 (your money, stays yours, withdrawable). Daily limit jumps to 1,000. That $10 never gets spent because you're still using free models at $0 per token. But without the deposit, 50 requests is functionally useless for agent work.

Together.ai. $5 signup credit. Not a free tier. A trial. Running an agent on Llama 70B, the $5 lasted 3 days. The credit doesn't replenish. When it's gone, you're paying or you're leaving. "Free" that expires is a trial wearing a free tier costume.

Anthropic / OpenAI. No meaningful free API tier. Anthropic occasionally gives signup credits that expire. OpenAI phased out the automatic $5 new-account credit. Both require a credit card for API access. If someone tells you to "just use the free Claude API," they're either confused or talking about the chat interface (which has message limits and no API access).

TIER 2: Lasts a day if you're careful.

Groq (on Llama 3.3 70B). 30 requests per minute sounds generous. But the daily ceiling is 1,000 requests AND 100K tokens per day on the 70B model. 100K tokens is the binding constraint. My agent used about 3,000-5,000 tokens per interaction. At 5K tokens per request, 100K tokens covers roughly 20 interactions. My morning briefing alone used 15K tokens. By early afternoon, I hit the token wall.

The speed is absurd though. 300+ tokens per second on their LPU hardware. When it works, Groq is the fastest free inference by a wide margin. It just runs out fast on the best models.

Groq on smaller models (Llama 3.1 8B) is more generous: 14,400 requests/day. But 8B quality is noticeably weaker on agent tasks. Classification and simple questions work fine. Complex reasoning and multi-step tool chains get shaky.

Gemini 2.5 Pro (free tier). 50 requests per day. The Pro model is Google's best, but 50 daily requests is even tighter than OpenRouter unfunded. I burned through it in 2 hours of agent work. Use Flash instead.

TIER 3: Lasts a week of moderate use.

OpenRouter (with $10 deposit). 1,000 requests/day on 28+ free models including DeepSeek R1, Llama 3.3 70B, Qwen3 Coder 480B, and Gemini Flash. At 1,000 requests/day, my agent ran comfortably for 5 days before I started having to ration evening conversations.

The model variety is the real value here. If one free model is down or rate-limited, switch to another with one config change. DeepSeek R1 for reasoning. Qwen3 Coder for coding tasks. Llama 4 Scout for long context. All free. All through one API key.

The 20 RPM cap is the hidden constraint. If your agent fires 25 tool calls in rapid succession, you'll hit 429 rate limit errors. Fine for normal conversation. Frustrating for complex multi-tool chains.

GitHub Models. Free for development. Exposes OpenAI, Llama, and other models within rate limits. Good for coding workflows. Less useful for general agent work because the limits are designed for developer prototyping, not 24/7 agents.

TIER 4: Lasts all month on normal agent usage.

Groq (on 8B models). 14,400 requests per day on Llama 3.1 8B. That's more than any personal agent needs. I ran my agent for the full test period without hitting the limit once. The trade-off is quality: 8B models handle classification, simple Q&A, and data extraction well. They struggle on nuanced summarization, complex reasoning, and creative writing.

If you route background tasks (heartbeats, cron checks, email classification) to Groq's 8B free tier and only use a better model for complex conversations, the 8B handles the volume and the daily limit is effectively invisible.

Cerebras. Roughly 1 million tokens per day. No credit card. Their wafer-scale chip hardware is fast (up to 2,000 tokens per second on some models). 1M tokens/day comfortably covers a personal agent doing daily briefings, email triage, and conversations. Model selection is narrower than OpenRouter (Llama, Qwen, DeepSeek primarily), but the volume is genuinely generous.

Mistral (Experiment tier). Roughly 1 billion tokens per month. By far the most generous raw volume. The catch: you must opt into data training. Your prompts, your conversations, your agent's interactions, all potentially used to train Mistral's models. For a personal assistant discussing your email, calendar, and daily life, that's a meaningful privacy trade-off.

If you don't care about data training (and honestly, many people don't for non-sensitive tasks), Mistral Experiment is effectively unlimited for personal agent use. 1B tokens/month is more than any individual will burn.

TIER 5: The one I forgot I wasn't paying for.

Google AI Studio (Gemini 2.5 Flash). 1,500 requests per day. 1M tokens per minute. No credit card. No expiry. Not a trial. Permanent.

I ran my full agent workload on Gemini Flash for three weeks. Morning briefing. Email triage. Ad-hoc conversations. Research tasks. Document summarization. Never hit the daily limit once.

1,500 requests/day is roughly 62 requests per hour. That's more than enough for a personal agent doing everything. My busiest day used 847 requests. My quietest day used 210. The 1,500 ceiling never felt close.

The 15 RPM (requests per minute) cap is the only constraint that matters. If your agent needs to fire 20 rapid tool calls, you'll hit the rate limit. For normal use, 15 RPM is invisible. You're not sending 15 messages per minute to your agent.

The 1M token context window on the free tier is the other killer feature. Your agent can process entire documents without chunking. Most free tiers give you small context windows. Google gives you the full million tokens for free.

The trade-off: Google may use your free-tier prompts for model training. Paid tier and Vertex AI don't. If your agent handles sensitive data, that matters. If it's doing morning briefings and general research, it probably doesn't.

The stack that makes $0 actually work all month:

Don't pick one. Stack them.

Gemini Flash (primary): 1,500 requests/day. Morning briefings, email triage, research, conversations. Your daily driver.

Groq (background tasks): 14,400 requests/day on 8B models. Heartbeat checks, cron polls, simple classifications. Fast and free.

OpenRouter with $10 deposit (fallback and variety): 1,000 requests/day across 28+ models. When you need a different model for a specific task, or when Gemini is slow, route here.

Cerebras (batch processing): 1M tokens/day. When you need to process a large document or run a heavy research task, route it here.

Total combined daily capacity: roughly 18,000+ requests per day across all providers. More than any personal agent will ever need. $0/month ongoing ($10 one-time OpenRouter deposit that never gets spent).

The numbers most people don't check before choosing a provider:

Requests per day is the headline number. Tokens per day is usually the binding constraint. Groq's 1,000 RPD on 70B looks fine until you realize the 100K token/day cap hits first.

Requests per minute determines whether your agent can do complex tool chains. 15 RPM (Gemini) and 20 RPM (OpenRouter) are fine for conversation. They struggle when your agent needs to make 25 API calls in sequence.

Data training opt-in is the privacy cost. Google free tier and Mistral Experiment use your prompts. Groq, Cerebras, and OpenRouter do not (on most models). The free tier isn't always free. Sometimes the price is your data.

Check the actual limits before you build. They change constantly. Every number in this post was verified against provider documentation in June 2026. By the time you read this, at least one has probably changed.

reddit.com
u/ShabzSparq — 12 days ago

My SOUL.md is 6 lines

You are direct and concise. Short replies unless I ask for detail.
Never send messages, emails, or book meetings without my approval.
Never delete files, sign up for services, or spend money.
If you lack access or information, say so. Do not guess or fabricate.
When I share a preference, fact, or decision, save it to memory before responding.
If a task is unclear, ask one clarifying question. Do not assume.

That's it. Six lines. Running for months.

What each line actually does:

Line 1 is the only personality instruction. "Direct and concise. Short replies unless I ask for detail." Everything else about tone, humor, warmth, communication style? Your agent picks that up from YOUR messages within a few days. You don't need to describe your preferred communication style in a paragraph. You demonstrate it by texting casually. The agent mirrors you.

Lines 2 and 3 are the safety net. Two "never" lines that eliminate every disaster story on this sub. Agent sent emails you didn't approve? Line 2 blocks it. Agent signed up for a service using your credentials? Line 3 blocks it. Agent booked a meeting without asking? Line 2 blocks it. Every "never" is a specific catastrophe that can't happen now.

Line 4 kills hallucination. The single biggest complaint about AI agents: "it confidently told me something that wasn't true." Your agent doesn't lie on purpose. It guesses when it doesn't have data and presents the guess as fact. This line makes it say "I don't have access to your calendar" instead of inventing calendar events. One line. Fixes the #1 trust issue.

Line 5 solves the memory problem. Most people complain their agent doesn't remember anything. It's not that memory is broken. It's that the agent doesn't know you want it to remember things unless you tell it to. This line means every time you say "I prefer morning meetings" or "my partner's name is Sarah," the agent writes it down immediately. After a month, your agent knows you well because you told it to pay attention.

Line 6 prevents the confident wrong answer. Without this line, your agent receives an ambiguous request and picks whichever interpretation seems most likely. Sometimes it picks wrong and does 10 minutes of work on the wrong thing. With this line, it asks "did you mean X or Y?" before starting. One clarifying question saves more time than any model upgrade.

reddit.com
u/ShabzSparq — 11 days ago
▲ 36 r/better_claw+1 crossposts

20 Agentic Engineering Concepts Every AI Builder Should Know

Most people think autonomous coding is about picking the right model.

After spending months building autonomous coding workflows, I don’t think that’s the bottleneck anymore.

The biggest improvements came from things that have nothing to do with model intelligence.

Project state.

Work ledgers.

Decision records.

Verification.

Trust boundaries.

Permission gates.

Recovery points.

Evidence collection.

A surprising number of AI failures happen because the agent doesn’t know what has already been done, cannot prove the outcome, doesn’t understand the current state of the project, or doesn’t know when it should stop and ask a human.

That’s what led me to put together this reference sheet of 20 agentic engineering concepts.
Most builders are already using some of these ideas without having names for them.

Once you have the vocabulary, it becomes much easier to reason about why an autonomous workflow succeeds or fails.

Curious which concepts you think are missing.

u/Advanced_Pudding9228 — 11 days ago

Build your own MCP tool for your agent in one Python file. The whole thing is a decorator and a docstring.

People treat MCP like a framework you have to learn. It's a protocol. Your agent gets a new capability from one decorated function with a clear docstring. Here's a working tool server start to finish.

Setup with uv:

uv init mytools && cd mytools
uv venv && source .venv/bin/activate
uv add "mcp[cli]" httpx
touch server.py

The whole server. FastMCP wraps the official SDK and handles stdio transport, JSON-RPC, and tool registration for you.

from mcp.server.fastmcp import FastMCP
import httpx

mcp = FastMCP("mytools")

u/mcp.tool()
def get_weather(city: str) -> str:
"""Get current weather for a city. Use when the user asks about weather."""
resp = httpx.get(
"https://api.openweathermap.org/data/2.5/weather",
params={"q": city, "appid": "YOUR_KEY", "units": "metric"},
)
d = resp.json()
return f"{city}: {d['main']['temp']}C, {d['weather'][0]['description']}"

u/mcp.tool()
def add_note(text: str) -> str:
"""Save a note to the local notes file."""
with open("notes.txt", "a") as f:
f.write(text + "\n")
return "saved"

if name == "main":
mcp.run()

That's it. Two tools. Your agent (Claude Code, Cursor, OpenClaw, anything MCP-capable) sees them listed and calls them on its own.

Two things doing real work here. Type hints become validation: pass a string where a float is expected and MCP returns a structured error before your function even runs. And the docstring is the tool description the model reads to decide when to call it, so write it for the model, not for yourself. "Get weather" is worse than "Get current weather for a city, use when the user asks about weather or what to wear."

Test it BEFORE wiring it to a model. This is the step everyone skips and then debugs blind. Use MCP Inspector or MCPJam to hit the tools directly. Debugging order that saves hours: direct server logs first, Inspector second, the actual LLM client last. If you add the model first you can't tell whether the bug is your tool or the model's tool-calling.

The gotcha that wrecks agent performance: too many tools. One benchmark found GitHub's MCP server dumps 43 tools into the context window before the agent does anything, and that alone tanks performance. The model has to read every tool definition every turn. So keep servers small and domain-focused, one server per domain (a notes server, a weather server, an orders server) instead of one monolith with everything. Small tool sets, focused descriptions, uncluttered context.

stdio vs HTTP: stdio means the client spawns your script as a subprocess, no ports, no auth, simplest and most secure for local. That's all you need for a personal agent. Only move to Streamable HTTP when a remote agent has to reach the server over a network, and that's when auth (OAuth) becomes the painful part, so don't take it on until you actually need remote.

If you're on a managed agent platform that supports MCP, this is how you give it a custom capability it doesn't ship with: write the one-file server, point the agent at it, done. No platform feature request needed.

reddit.com
u/Temporary-Leek6861 — 12 days ago