r/ollama

▲ 101 r/ollama+9 crossposts

Hey everyone,

I just open-sourced TuneForge.

The goal is simple: let your coding agent manage the full LLM improvement loop without ever leaving the chat window.

You can now tell your agent something like:

“Build me a customer support bot from this FAQ”

…and it can:

• Generate a clean synthetic instruction dataset (with LLM judging for quality)

• Run LoRA supervised fine-tuning on any Hugging Face causal LM

• Do a quick policy-gradient RL step using Ollama as the reward judge

• Merge the adapter, evaluate on a test set, and iterate

Everything runs locally, uses 4-bit quantization so it fits on modest hardware, and uses background jobs (with job_id polling) so long training tasks don’t freeze the MCP connection.

It’s built around the Model Context Protocol (MCP) for seamless integration with Claude Desktop, Cursor, Zed, Continue.dev, etc.

Tech: Python + Transformers + PEFT + bitsandbytes + Ollama + SQLite for job state.
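
The background-job pattern mentioned above (start a long task, hand back a job_id, let the client poll) could look roughly like the sketch below. This is not TuneForge's actual code: the table layout and the names `start_job`, `poll_job`, and `wait_for_job` are hypothetical, and the only assumptions taken from the post are SQLite for job state and job_id polling.

```python
import os
import sqlite3
import tempfile
import threading
import time
import uuid
from contextlib import closing

# Hypothetical job store; the real schema is not shown in the post.
DB_PATH = os.path.join(tempfile.mkdtemp(), "jobs.db")

def _connect():
    conn = sqlite3.connect(DB_PATH, timeout=10)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, status TEXT, result TEXT)"
    )
    return conn

def start_job(fn, *args):
    """Record the job, run fn in a background thread, and return the job_id at once."""
    job_id = uuid.uuid4().hex
    with closing(_connect()) as conn, conn:
        conn.execute("INSERT INTO jobs VALUES (?, 'running', NULL)", (job_id,))

    def worker():
        try:
            status, result = "done", str(fn(*args))
        except Exception as exc:  # store the failure instead of losing it
            status, result = "failed", repr(exc)
        with closing(_connect()) as conn, conn:
            conn.execute(
                "UPDATE jobs SET status = ?, result = ? WHERE id = ?",
                (status, result, job_id),
            )

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def poll_job(job_id):
    """Cheap status check that a tool call can make without blocking."""
    with closing(_connect()) as conn:
        row = conn.execute(
            "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
    if row is None:
        return {"status": "unknown", "result": None}
    return {"status": row[0], "result": row[1]}

def wait_for_job(job_id, interval=0.05, timeout=5.0):
    """Poll until the job leaves the 'running' state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = poll_job(job_id)
        if state["status"] != "running":
            return state
        time.sleep(interval)
    return poll_job(job_id)
```

Because the state lives in SQLite and not in the worker's memory, a client that reconnects can still poll an old job_id.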

Super early stage (just released), MIT licensed.

Would love feedback or ideas on what to add next. If you’re into agentic fine-tuning workflows, give it a try and let me know how it goes!

u/Just_Vugg_PolyMCP — 3 hours ago

▲ 36 r/ollama

Soon we'll run 100B models on cheap hardware

Because ternary (1.58-bit) models use weights that are either -1, 0, or 1, massive hardware optimizations become possible. This means multiplication between layers is eliminated, replaced by just addition, subtraction, and memory lookups. Furthermore, the activation function can be pre-computed and stored in a 16-bit look-up table (LUT) that takes up just 64KB.

To make things simple, here is what the data journey looks like for a single neuron. Let's say the neuron receives an INT8 input with a value of 42, and its weight is -1. The multiplication step is now trivial because the operation can only produce three possible values: -42, 0, or 42. In our case, the result is -42. The circuit can be massively simplified.

Next comes the accumulation. If we assume the neuron has a bias of 12, we pass this through a simple addition circuit, which is vastly smaller and more power-efficient than a multiplication circuit. Our accumulated value becomes -30.

Finally, we apply the activation function. Instead of performing expensive floating-point math, we just use -30 as the index into our 64KB LUT. The lookup is served from the L1 cache, which is incredibly fast and cheap. If the precomputed value at that address is 0, our final output is 0.
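
The single-neuron walk-through above can be sketched in a few lines. The post does not say which activation function is used or how the table is laid out, so this sketch assumes a ReLU clamped to the INT8 range and a table with 65,536 one-byte entries indexed by the accumulator's 16-bit pattern; `build_relu_lut` and `ternary_neuron` are illustrative names.

```python
def build_relu_lut():
    """Precompute the activation for every 16-bit accumulator value (64 KB)."""
    lut = bytearray(65536)
    for index in range(65536):
        # Reinterpret the unsigned index as a signed 16-bit accumulator.
        acc = index - 65536 if index >= 32768 else index
        # Assumed activation: ReLU, clamped to the INT8 maximum of 127.
        lut[index] = min(max(acc, 0), 127)
    return lut

def ternary_neuron(inputs, weights, bias, lut):
    """Weights are -1, 0, or 1, so the 'multiply' is add, subtract, or skip."""
    acc = bias
    for x, w in zip(inputs, weights):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0 contributes nothing
    # The accumulator's low 16 bits are the table address (wraps on overflow).
    return lut[acc & 0xFFFF]

lut = build_relu_lut()
output = ternary_neuron([42], [-1], 12, lut)  # -42 + 12 = -30, activation gives 0
```

In software this saves little, since a multiply is already cheap on a CPU; the argument in the post is about how much simpler the equivalent circuit becomes in silicon.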

Because of these massive silicon shortcuts, I estimate that a dedicated ASIC with highly optimized on-board memory could run a 100B parameter model at around 35W.

Likewise a 7B model could run entirely on a dedicated ASIC's internal SRAM (L3 cache, possibly 3D stacked), and it could potentially run at under 1W.

u/Kremho — 5 hours ago

▲ 15 r/ollama+7 crossposts

Chimera: an open-source, self-hostable agent that runs on local models (any OpenAI-compatible endpoint) and can fuse several at once

I've been building an open-source agent (Apache-2.0) and wanted to share it here because it's designed to be fully local and self-hostable: it talks to any OpenAI-compatible endpoint, so Ollama / llama.cpp / vLLM / LM Studio all work as the backend. No cloud lock-in, your keys and data stay yours.

The core idea is LLM-Fusion: for the hard steps it can run a panel of models on the same prompt, have a judge model cross-check them (consensus / contradictions / blind spots), and a synthesizer write the final answer. Locally this is fun because you can mix a few small local models and let them cross-check each other. A cost/latency-aware router keeps easy turns on a single model so you're not paying panel latency for everything.
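
The panel, judge, and synthesizer flow described above could be wired up roughly as follows. This is a sketch of the idea and not Chimera's implementation: each model is a plain callable standing in for a request to an OpenAI-compatible endpoint, and `fuse`, `is_hard`, and the prompt wording are hypothetical.

```python
from typing import Callable, Sequence

# Stand-in for a call to any OpenAI-compatible endpoint: prompt in, text out.
Model = Callable[[str], str]

def fuse(
    prompt: str,
    panel: Sequence[Model],
    judge: Model,
    synthesizer: Model,
    is_hard: Callable[[str], bool],
) -> str:
    # Router: easy turns stay on a single model to avoid panel latency.
    if not is_hard(prompt):
        return panel[0](prompt)

    # Panel: every model answers the same prompt independently.
    drafts = [model(prompt) for model in panel]
    numbered = "\n".join(f"[{i}] {draft}" for i, draft in enumerate(drafts, 1))

    # Judge: cross-check the drafts against each other.
    review = judge(
        f"Question: {prompt}\nCandidate answers:\n{numbered}\n"
        "List the consensus, contradictions, and blind spots."
    )

    # Synthesizer: write the final answer from the drafts plus the review.
    return synthesizer(
        f"Question: {prompt}\nCandidate answers:\n{numbered}\n"
        f"Review:\n{review}\nWrite the final answer."
    )
```

A real router would presumably weigh cost and latency and not rely on a single boolean; `is_hard` is only a placeholder for that decision.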

Beyond that it's a full agent: plan -> act -> verify-or-revert (it runs your tests and treats the result as ground truth), layered memory (SQLite + FTS recall, cross-session profile, consolidation), a governance kernel, cron/proactive jobs, MCP client + OpenAPI-to-tool import, and an isolated subagent/crew layer (parallel git worktrees with per-worker verify gates). Runs on a laptop or a $5 VPS via Docker.

Honest status: it's alpha - 463 tests, mypy --strict clean, but no production mileage yet. Local reasoning quality obviously depends on the models you point it at, so I'd genuinely love to hear which local models people find good enough to actually drive an agent loop (reliable tool use + self-correction) - that's the make-or-break for going fully local.

Repo: https://github.com/brcampidelli/chimera-agent

u/Federal-Teaching2800 — 6 hours ago

▲ 5 r/ollama+1 crossposts

Run two models?

Is it smart to run two separate local models for different roles, or am I overengineering this?

Hardware:

i9-13900K
RTX 3090
96GB DDR5

Current thought process:

Primary model for RAG + general business Q&A/report drafting:
- Qwen 3 27B Dense @ Q4_K_M
Secondary model for automations/tool-calling/agent workflows:
- gpt-oss 20B @ Q5_K_M

Idea is basically:

One “thinking/writing” model
One “doer” model

Does this architecture actually make sense in practice, or am I overthinking it?

Thanks in advance!

u/Sullinator07 — 9 hours ago

▲ 7 r/ollama+4 crossposts

ContextForge: a local proxy that cut my Claude Code token usage by up to 72%

Hi everyone,

I’ve been working on a project to address a specific frustration I had with AI coding agents: token waste. I noticed that agents often burn a significant portion of the context window just re-reading the same files to find functions or re-discovering the repository structure on every turn.

I built ContextForge — a local proxy and CLI that acts as a "codebase-aware" runtime.

How it works

ContextForge sits between your agent (like Claude Code) and your LLM provider. Instead of letting the agent "guess" where files are, it provides local intelligence:

Local AST Graph: It indexes your repo using native C++ parsing into a local SQLite graph. When the agent needs to find a symbol, the proxy handles the lookup locally.
Context Optimization: It applies a compression pipeline that skeletonizes older file history (keeping only signatures) and vaults oversized responses (like lockfiles), replacing them with pointers.
Protocol Translation: It translates Anthropic requests into OpenAI format, which allows you to run Claude Code against Ollama/OpenAI-compatible models with full streaming support.
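
To make "skeletonizing" concrete, here is a minimal sketch that reduces a source file to its signatures. ContextForge itself uses native C++ parsing and its pipeline is not shown in the post; this stand-in uses Python's built-in `ast` module on Python source, and `skeletonize` is a hypothetical name.

```python
import ast

def _signature(node, indent=""):
    """Render one function definition as a single signature line."""
    keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
    return f"{indent}{keyword} {node.name}({ast.unparse(node.args)}){returns}: ..."

def skeletonize(source: str) -> str:
    """Keep class names and function signatures; drop every function body."""
    functions = (ast.FunctionDef, ast.AsyncFunctionDef)
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, functions):
            lines.append(_signature(node))
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for child in node.body:
                if isinstance(child, functions):
                    lines.append(_signature(child, indent="    "))
    return "\n".join(lines)
```

An agent that re-reads an old file usually needs to know what exists and how to call it, and a skeleton like this still answers that at a fraction of the tokens.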

Case Study: "Soft-Delete" Feature

To test the architecture, I implemented a complex feature in an Express.js backend using an Ollama model. I compared a raw session (Passthrough) against one routed through ContextForge.

Metric	Passthrough Mode	ContextForge Mode	Difference
LLM round-trips	41	14	66% fewer
Input tokens	1,632,266	444,092	72.8% fewer
Output tokens	1,632,266	384,033	76.5% fewer
Session Compression	—	60,059 (13.5%)	—

Understanding the Metrics:

Workflow Savings (72.8%): These are tokens that were never generated because the tooling changed the workflow. The model used the local graph to find symbols instead of "guessing" via file searches, solving the task in 14 steps instead of 41.
Session Compression (13.5%): This is the actual text removed from the prompts within the session via skeletonization and deduplication.
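
The percentages in the table can be re-derived from the raw counts. The check below also shows that the Session Compression figure equals the gap between the two token counts in the ContextForge column, which suggests (this is an inference, not something the post states) that 384,033 is the token count after compression.

```python
def pct_fewer(before, after):
    """Percentage reduction from before to after, rounded to one decimal."""
    return round(100 * (1 - after / before), 1)

round_trips = pct_fewer(41, 14)                  # table rounds this to 66%
input_tokens = pct_fewer(1_632_266, 444_092)     # 72.8
third_row = pct_fewer(1_632_266, 384_033)        # 76.5
compressed_away = 444_092 - 384_033              # 60,059 tokens
compression_pct = round(100 * compressed_away / 444_092, 1)  # 13.5
```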

Note: These results are from a specific, repository-heavy task. Savings vary significantly based on the work—long refactors benefit most, while short chats benefit much less.

Get Started

I've just released v1.0.3 and I'm looking for feedback from the community.

Install: npm i -g @anuj612/contextforge
GitHub: https://github.com/anujkushwaha612/ContextForge

Note: No compiler needed — ships with prebuilt native binaries for Windows, macOS, and Linux via npm.

I’d love to hear your thoughts on the project, and I'm ready to tackle any new bugs and issues that come up.

u/Independent_Pick3116 — 9 hours ago

▲ 4 r/ollama+5 crossposts

I'd like to, I'd rather not... and now you can!

An IDE where the AI writes the code, you run it, and the sandbox does the rest.

It's called WebCraft. It's included in NHA 3rdArm, free.

Apart from that, I'm desperately looking for a community to help carry the project forward! Between work and other commitments, it's getting difficult... are you interested? The application has many other features, including an advanced section for connectors with a marketplace.

👉 nothumanallowed.com

https://nothumanallowed.com/3rdarm

u/Key-Outcome-2927 — 9 hours ago

▲ 35 r/ollama+2 crossposts

GPT-5.5 vs Claude Fable 5 vs Local Qwen: 3 AI Agents, 1 Task

I ran the same market-entry brief through three different AI models. The result was revealing.

I asked three models to independently create a client-ready market-entry brief for launching a privacy-first AI personal assistant for small businesses in the UK.

The models were:

Claude Fable 5 via Claude Subscription
GPT-5.5 via ChatGPT/Codex
qwen3.6:27b running locally via Ollama

Each got the exact same task. They could use web research. They could not see each other’s answers.

The brief was for a product that is local-first, helps with email, calendar, documents, reminders, research, and workflow automation, and positions itself around privacy, local storage, user control, and optional cloud model access.

The target market was UK small businesses, freelancers, consultants, and agencies.

The output needed to include segmentation, customer pains, competitor landscape, positioning, pricing, go-to-market strategy, risks, a 90-day launch plan, and a clear recommendation on whether the company should pursue the market.

Here’s what happened.

The winner: Claude Fable 5

Claude produced the strongest founder-ready strategy memo.

Its biggest strength was that it made a clear strategic choice.

It did not recommend launching as a generic “AI assistant for small businesses”. Instead, it recommended a focused wedge into regulated micro-practices and privacy-sensitive professional services: accountants, solicitors, bookkeepers, financial advisers, HR consultants, consultants, and agencies handling confidential client data.

That was the sharpest insight in the whole comparison.

Its positioning was also the strongest:

That works because it does not try to out-feature Microsoft Copilot or Google Workspace. It reframes the competition around data custody, client confidentiality, and trust.

Claude’s best recommendation was: don’t compete on being cheaper than Copilot. Compete on privacy, control, and workflows that cloud-first incumbents cannot credibly own.

It also had the strongest risk analysis: Microsoft bundling, local model quality gaps, hardware variability, support burden, regulatory shifts, and category confusion with free local tools.

Overall, Claude felt the most client-ready.

GPT-5.5 was the best operator

GPT-5.5 came very close.

It was less punchy than Claude on positioning, but stronger on execution.

It produced the most practical 90-day launch plan: choose two verticals, run workflow audits, recruit pilot firms, configure 3 to 5 daily automations per customer, measure admin hours saved, build case studies, then convert pilots into paid customers.

It was also more cautious around compliance claims. That matters. A privacy-first AI product should avoid saying “GDPR-compliant by design” too casually. Better language is: “designed to reduce unnecessary data transfer and support UK GDPR obligations, subject to configuration.”

GPT-5.5 was very useful for turning the strategy into an operating plan.

If Claude gave the boardroom memo, GPT-5.5 gave the launch checklist.

Local Qwen was better than expected

The local qwen3.6:27b model produced a coherent, complete, and genuinely useful first draft.

It covered all required sections. It had a competitor table, pricing hypothesis, go-to-market phases, risk table, and launch plan. For a local model, it performed well.

But it had weaknesses.

It made more unsupported claims. It was less disciplined with citations. It overclaimed in places, for example saying local-first meant “zero data-privacy risk”, which is not accurate. Local-first reduces risk, but it does not eliminate it.

It also picked freelancers and micro-agencies as the primary beachhead. That is easier to market to, but less strategically defensible than privacy-sensitive professional services.

Still, the result was good enough for internal ideation, early drafting, and private strategy work.

That is important.

Local models do not need to beat frontier cloud models at everything to be useful. They need to be good enough for the right part of the workflow.

My ranking

Claude Fable 5: Best for strategy, positioning, founder-ready narrative, and final synthesis.
GPT-5.5: Best for launch planning, pilot design, pricing experiments, and operational detail.
qwen3.6:27b local: Best for private first drafts, brainstorming, internal notes, and cheap iteration.

The bigger takeaway

The best workflow was not “pick one model”.

The best workflow was hybrid:

Use the local model first to brainstorm privately and cheaply.

Use GPT-5.5 to turn the ideas into a practical operating plan.

Use Claude to sharpen the positioning and produce the final client-ready narrative.

That feels like where AI work is heading.

Not one model for everything.

A portfolio of models, each used where it is strongest.

For privacy-first products especially, local models have a clear role. They are not always the best final writer. They are not always the strongest strategist. But they are useful for private thinking, early drafting, and working with sensitive material before anything goes to the cloud.

In this test, local Qwen was not the winner.

But it was absolutely good enough to be part of the team.

And that may be the more important result.

GitHub

u/Acceptable-Object390 — 15 hours ago

▲ 4 r/ollama

need help selecting models

I have been trying to build something internal to help keep track of insurance policies. However, I've noticed that some of the models I have tried are not good at summarizing the policies, or even at answering simple questions such as "i fall down, how much can i claim".

I have tried gemma3:12b, gemma4:e4b, and mistral-nemo:12b. Any suggestions on what other models might be good? I'm trying to keep it local-only, but my PC is not very powerful: mistral-nemo:12b takes 15 minutes to run that query.

u/More-Bag4369 — 10 hours ago

▲ 172 r/ollama+20 crossposts

I would like to share my latest open-source local LLM inference tool, implemented in C#. It supports models like Gemma4 and Qwen3.6, with multimodal input (image, vision, audio), reasoning, and function (tool) calling. It runs on Windows/macOS/Linux and fully leverages the GPU. The API is fully compatible with the OpenAI and Ollama interfaces.

I'd really appreciate it if you could try it and give me some feedback. If you like it, a star would be a big thank-you. Thank you very much!

u/fuzhongkai — 21 hours ago

▲ 4 r/ollama

Can we have a bigger sub?

As the title says: can we get a bigger subscription, say twice the Max plan's limits for twice the price? Something like Anthropic's $200 plan, but simply 2x Ollama’s Max plan.

Please?

u/PA100T0 — 12 hours ago

▲ 2 r/ollama+1 crossposts

Please stop clipping/teasing finished answers when credits run out

It's very annoying to watch Claude do all the work and then essentially hit a paywall. I love Claude, but that needs to go.

I would elaborate on the game theory and marketing mechanics but I don't want to rant.

The board should know better.

Edit 1: Oh, pro tip: Joplin Web Clipper is good for capturing the page before Claude makes the on-screen reasoning disappear along with the answer.

Edit 2: The on-screen reasoning is also great for giving to a local model while you're waiting. (You can download files as Claude makes them while reasoning: click the icon to open them in artifacts before they disappear.)

Edit 3: This can actually be a feature. Claude will tease you with a whole answer even when you are low on credits, before blocking it, so you can go on Fable with nearly empty credits and yank all the reasoning, which is arguably better than the answer you... don't get.

This is a reproducible UX failure: the system generates accessible value, then revokes it after the user has already watched it being produced. The workaround only exists because the product creates that liminal state.

This doesn't go against T&C I've checked. It's just a poweruser move.

Anthropic really and truly.. should respect that.

Final edit: It doesn't do anything to the website whatsoever. It only does something on your computer, to access stuff in your browser that Claude has already sent to you through the internet... on purpose.

u/FastFoodAI — 12 hours ago

▲ 1 r/ollama+1 crossposts

New to Ollama, I need to know how to load multiple models into RAM only

I have a server with two Xeon processors (forty cores), 512 GB of RAM, two terabyte SSDs, and an RTX 3090 24 GB. I want to save the video card for video editing. I would like to know: how do I load multiple models using Ollama, in RAM only, so that they never get unloaded?

u/wbiggs205 — 18 hours ago

▲ 24 r/ollama

Best Local Model for Coding

I’m a full-stack developer looking to start using a local AI to assist with the "ask" function and occasionally the "agent" mode. What’s the best local AI I could use? I tried "qwen3-coder-next," but I found it extremely heavy; the agent mode ran at a snail's pace, and the ask mode was sluggish too. Is it really worth trying a local AI, or should I just settle for the big-name providers like Gemini, GPT, and Claude?

u/Spirited_West_4123 — 1 day ago

▲ 19 r/ollama

Can you run a useful coding-only LLM on an M4 MacBook Air with 24 GB of RAM?

Hi everyone, I was wondering whether I can run any sort of useful coding-only LLM on my 24 GB, 10-core M4 MacBook Air. The machine is modded with a thermal pad and some extra stuff so it doesn't throttle, and it performs the same as my flatmate's M3 MBP according to our Cinebench 2026 measurements. I understand it's far too weak to run a general model that does anything useful, but I was wondering if it's good enough to run a coding-specific LLM, and not even for massive projects: just some STM32 embedded code and some simple Python data-analysis scripts. I did some research already, but I'm still not sure what the best model would be since I'm not super into the LLM space. Could anyone who has tried this share some insight: which models are worth playing with, or whether it's a waste of time? I don't expect Claude-level performance, of course. I wanted some guidance before going down the rabbit hole.

u/Actual_Sport4509 — 1 day ago

▲ 1 r/ollama+2 crossposts

I built a vibe coder that shows receipts

stealth.cyphes.com

u/Fluffy-Ad-889 — 1 day ago

▲ 40 r/ollama+3 crossposts

Small models fail tool-calling for different reasons — and sometimes it's an upstream chat-template bug, not the model. I built an MLX tool to tell them apart.

Everyone benchmarks tool-calling with one number: "Model X gets 71% of function calls right." That number can't tell you why the other 29% failed — and the "why" is what decides what you do next.

So I built Toolhound, an MLX-native diagnostic that runs entirely on your Mac and attributes every tool-calling failure to one of four causes:

- `framework_template_bug` — the chat template mangled the tool tokens

- `framework_parser_gap` — the model emitted a rescuable call, the framework parser missed it

- `model_format_failure` — the model can't emit a parseable call

- `model_decision_failure` — valid format, wrong tool/args
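
To illustrate how one failed call might be sorted into the four buckets above, here is a rough sketch. It is not Toolhound's logic: the strict and lenient parsers, the `template_doubles_braces` flag, and the decision order are all assumptions made for illustration.

```python
import json

def parse_strict(raw):
    """Accept only output that is exactly one JSON object with a 'name' key."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "name" in call else None

def parse_lenient(raw):
    """Rescue a call that is wrapped in extra text around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    return parse_strict(raw[start:end + 1])

def attribute(raw, expected, template_doubles_braces=False):
    """Return 'ok' or one of the four failure causes (hypothetical rules)."""
    call = parse_strict(raw)
    if call is None:
        # The model copied a doubled-brace example that the template rendered.
        if template_doubles_braces and raw.lstrip().startswith("{{"):
            return "framework_template_bug"
        # A more forgiving parser would have recovered the call.
        if parse_lenient(raw) is not None:
            return "framework_parser_gap"
        return "model_format_failure"
    if call != expected:
        return "model_decision_failure"
    return "ok"
```

The ordering matters: framework-side causes are checked before the model is blamed, which matches the post's point that a single accuracy number mixes the two.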

What surprised me (Qwen2.5-0.5B / 1.5B, Llama-3.2-3B, 4-bit, on an M2 Pro):

- Qwen2.5-0.5B mostly fails on an upstream chat-template bug — Qwen2.5's template renders its tool-call example with doubled braces `{{"name": ...}}`, and the small model copies it literally. That's not the model's fault. args-correct 29%.

- Qwen2.5-1.5B parses fine (96%) but fails on judgment — wrong tool/args. args-correct 71%.

- Llama-3.2-3B formats perfectly, but wrong arg types + false abstentions. args-correct 61%.

Same benchmark, opposite root causes. A plain accuracy score hides that — and the smallest model's failures aren't even fixable by a better model.

Other things it does:

- 95% bootstrap CIs on every metric (temp=0, so no seed hand-waving — the CI comes from resampling the case set)

- Reports attribution under both a strict and a lenient parser, so you can see the verdict doesn't flip

- Quantifies bf16-vs-q4 damage without confounding it with template differences (asserts identical template first)

- v2 benchmarks existing zero-training fixes (PA-Tool is wired in). Honestly, on my demo run PA-Tool didn't beat baseline on any metric — it flags a result "credible" only when its CI is disjoint from baseline's, and it wasn't (it even hurt 1.5B's arg accuracy). I'd rather the tool tell me that than rubber-stamp it.
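
For readers unfamiliar with the method, a percentile bootstrap over the case set and the "disjoint CI" rule can be sketched as below. This is a generic version under assumed defaults (2,000 resamples, fixed seed), not Toolhound's code; `bootstrap_ci` and `cis_disjoint` are illustrative names.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the case set with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def cis_disjoint(ci_a, ci_b):
    """True only when the two intervals do not overlap at all."""
    return ci_a[0] > ci_b[1] or ci_a[1] < ci_b[0]
```

With temp=0 the model output is deterministic, so the only source of uncertainty left to estimate is which cases happened to be in the benchmark, and that is exactly what resampling the case set measures.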

https://github.com/Code-byte404/toolhound

Feedback very welcome, especially: which models should I add next, and are the abstention "trap" cases too easy/hard? There are `good first issue`s if anyone wants to add a model or help file the template bugs it finds upstream.

u/Otherwise_Ship_9782 — 1 day ago

▲ 3 r/ollama+1 crossposts

Optimal Bits per Weight vs Model Size

u/Kremho — 1 day ago

▲ 326 r/ollama+69 crossposts

I built an open-source, self-hosted AI gateway: 237 providers (90+ free), auto-fallback combos, and a 10-engine token-compression pipeline (MIT)

Builders-welcome post with the substance up front (disclosure: I'm the maintainer). OmniRoute is a free, MIT, self-hosted AI gateway — one OpenAI-compatible endpoint over 237 providers — built around two problems: runs dying on a provider 429, and tokens bleeding on tool/log output.

One endpoint, 237 providers — 90+ of them free. You point any tool or agent at a single OpenAI-compatible endpoint (localhost:20128/v1) and it can reach 237 LLM providers without you rewriting anything. 90+ have free tiers and 11 are free forever (no card), which aggregates to ~1.6B documented free tokens/month — and that's honest, pool-deduped math (we count each shared pool once instead of inflating it; the methodology is public in the repo). There's a one-command setup-* for 13+ coding tools (Claude Code, Codex, Cursor, Cline, Roo, Kilo, Gemini CLI…), so switching your existing setup over takes seconds.

Fallback combos — so it never stops mid-task. A "combo" is a ladder of models the router walks automatically: your subscription first, then API keys, then cheap models, then free ones. When a provider returns a 500 or you hit a rate limit, it slides to the next target in milliseconds, mid-request, and your tool never even sees the error. There are 17 routing strategies (priority, weighted, round-robin, cost-optimized, auto/coding:fast…) plus three resilience layers — a per-provider circuit breaker, a per-key cooldown, and a per-model lockout — so one dead key can't take down a whole provider.
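
A fallback ladder with a per-target cooldown could be sketched like this. It is a simplified illustration, not OmniRoute's router: `ProviderError`, `call_with_fallback`, and the 30-second cooldown are hypothetical, and each target is a plain callable standing in for a provider request.

```python
import time

class ProviderError(Exception):
    """Raised by a (hypothetical) provider call on a 429 or 5xx response."""

def call_with_fallback(ladder, prompt, cooldowns=None, cooldown_s=30.0,
                       now=time.monotonic):
    """Walk the ladder in order and return (target_name, answer) from the first success."""
    cooldowns = {} if cooldowns is None else cooldowns
    errors = []
    for name, call in ladder:
        if cooldowns.get(name, 0.0) > now():
            continue  # target failed recently; skip it until the cooldown ends
        try:
            return name, call(prompt)
        except ProviderError as exc:
            cooldowns[name] = now() + cooldown_s
            errors.append(f"{name}: {exc}")
    raise ProviderError("all targets failed or are cooling down: " + "; ".join(errors))
```

Passing the same `cooldowns` dict across requests is what keeps a rate-limited target from being retried on every call.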

Fusion — an ensemble mode for the hard steps. Beyond simple routing, there's a fusion strategy that fans a single prompt out to a panel of different models in parallel and then has a judge model synthesize one best answer (mixture-of-agents, built in). It's cost-aware, so easy turns stay on one fast model and it only fuses when the step is worth it.

A 10-engine compression pipeline — the part most routers don't have. Every request flows through a transparent compression pass you can toggle/stack per combo. Instead of one trick, it stacks the best of the open-source ecosystem: RTK filters command/tool output (git diffs, test logs, builds) at 60–90%, Microsoft's LLMLingua-2 does ML semantic pruning, Caveman handles prose, session-dedup strips repeats across turns. Critically, code, URLs and JSON are preserved byte-perfect, and a default-on inflation guard throws the compressed version away and sends the original if compressing would actually grow the prompt — it never makes things worse. On tool-heavy sessions that's ~89% average input-token reduction (an 8k-token git diff becomes a few hundred). Full credit to every upstream project (RTK, Caveman, LLMLingua-2, Troglodita) is in the README.
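
The inflation guard described above comes down to one comparison. The sketch below assumes a crude whitespace token count as a stand-in for a real tokenizer, and `guarded_compress` is a hypothetical name.

```python
def guarded_compress(prompt, compress, count_tokens=lambda text: len(text.split())):
    """Send the compressed prompt only if it is actually smaller than the original."""
    candidate = compress(prompt)
    if count_tokens(candidate) < count_tokens(prompt):
        return candidate
    return prompt  # compression would not help, so keep the original untouched
```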

Agent-native — the agent can drive the router itself. There's a built-in MCP server (95 tools across 30 audited scopes, over stdio / SSE / streamable-HTTP), plus A2A (v0.3, JSON-RPC 2.0) support. That means an agent can query providers, switch combos, read its own remaining quota and manage memory through the gateway — not just consume tokens through it.

It's 100% local (zero telemetry, AES-256-GCM at rest), MIT-licensed, has a prompt-injection guard on every LLM route, opt-in memory, and runs on npm, Docker, desktop or your phone via Termux.

For context on whether it's worth your time: it's grown to ~9.8K GitHub stars, 1,490+ forks and 280+ contributors in ~4.5 months, with 21,000+ automated tests and 1,830+ issues closed — so it's a battle-tested project, not a brand-new experiment.

npm install -g omniroute

GitHub: https://github.com/diegosouzapw/OmniRoute · Site: https://omniroute.online

Would value a critique of the routing/compression architecture from this crowd.

u/ZombieGold5145 — 2 days ago

▲ 61 r/ollama+40 crossposts

Ask questions across your Markdown notes using a fully local Graph RAG engine. Built for Obsidian vaults, works with any folder of Markdown files. Extracts entity-relation triples from wikilinks & YAML frontmatter, retrieves answers via hybrid search (vector + BM25 + temporal). Multilingual. No cloud. Runs on Ollama.

https://github.com/benmaster82/Kwipu
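
As a rough idea of what extracting triples from wikilinks can mean, the sketch below turns each `[[target]]` in a note into a (note, relation, target) triple. It is not Kwipu's extractor: the regex, the `links_to` relation name, and `wikilink_triples` are assumptions, and YAML frontmatter is not handled.

```python
import re

# Matches [[Target]], [[Target|alias]], and [[Target#heading]]; captures the target.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)(?:#[^\]\|]*)?(?:\|[^\]]*)?\]\]")

def wikilink_triples(note_title, markdown, relation="links_to"):
    """Return one (note, relation, target) triple per distinct linked note."""
    seen, triples = set(), []
    for match in WIKILINK.finditer(markdown):
        target = match.group(1).strip()
        if target and target not in seen:
            seen.add(target)
            triples.append((note_title, relation, target))
    return triples
```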

u/WritHerAI — 2 days ago

▲ 61 r/ollama

Ollama 0.31: Faster Gemma 4 on Apple Silicon with MTP. Here is my test showing a 56% boost on M1 Pro 16GB (2021)

Gemma 4 models are up to 90% faster with Ollama 0.31 on Apple Silicon. The speedup comes from multi-token prediction (MTP). This was achieved through contributions to the MLX kernel and the improvement isn't limited to Gemma 4 models. Gemma 4 is the first model to receive this improvement.

Note: You need to download new mlx tags to get this improvement.

Scope

Ollama's official benchmark measures Gemma 4 12B on the Aider polyglot benchmark using an M5 Max.

I wanted to run a simpler benchmark on an older device. I tested gemma4:e4b on a single JSON generation prompt. This means a different model size and a different workload. This isn't a validation or refutation of their numbers, just a separate data point on smaller hardware with a narrower task.

What I Tested

I compared gemma4:e4b and gemma4:e4b-mlx on an M1 Pro MacBook Pro 16GB with a simple JSON array generation prompt. I picked this prompt because MTP performs especially well when the output is predictable (closing brackets, repeated identifiers, boilerplate).

Generation Speed (Higher is Better)

Prompt: "Generate a JSON array of 30 fake user objects with id, name, email, and signup_date fields."

Model	tokens/second
gemma4:e4b	32.60
gemma4:e4b-mlx	50.95

It performs 56% better with mlx on a five-year-old machine.
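
The post does not say how the tokens/second figures were collected. One way to reproduce them, assuming you read the `eval_count` and `eval_duration` fields that Ollama returns with a completed generation (the duration is in nanoseconds), is sketched below together with the speedup arithmetic; the helper names are illustrative.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response; eval_duration is in nanoseconds."""
    return response["eval_count"] / response["eval_duration"] * 1e9

def speedup_pct(baseline_tps: float, new_tps: float) -> float:
    """Relative improvement of new_tps over baseline_tps, in percent."""
    return round(100 * (new_tps / baseline_tps - 1), 1)

boost = speedup_pct(32.60, 50.95)  # 56.3, reported as 56% in the post
```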

Conclusion

This is a meaningful improvement, but because it comes from multi-token prediction, it mostly benefits predictable generations. That is, it won't give the same performance improvement for all prompts.

The app I used: Reins: Chat for Ollama