r/LocalLLM

▲ 125 r/LocalLLM+22 crossposts

I would like to share my latest open source local LLM inference tool implemented in C#. It supports models like Gemma4, Qwen3.6 with multi-modal (image, vision, audio), reasoning and function tool. It can run on Windows/MacOS/Linux and fully leverage GPU's capability. The API is completely compatible with OpenAI and Ollama interface.

Really appreciated if you can try it and give me some feedback. If you like it, it will be a big thank you if you can star it. Thank you very much!

u/fuzhongkai — 4 hours ago

▲ 31 r/LocalLLM+9 crossposts

Local coding models need better repo context, not just bigger context windows

Local coding models have a repo-context problem.

When using llama/qwen/mistral/gemma for coding, the hard part is often not the model itself. It is getting the right files/functions into context without dumping too much raw source.

Long context helps, but it does not solve retrieval.

If the model never sees the right file, it still guesses.

I’ve been building SigMap, a zero-dependency CLI that creates a compact repo map for coding workflows.

Instead of sending raw source first, it extracts:

function signatures
classes/interfaces
exports
import relationships
ranked file matches per query

The workflow is simple:

repo map first → find likely files → read full source only where needed

Benchmarked across 18 repos / 90 tasks:

81.1% hit@5 vs 13.6% random baseline
~6× better file retrieval
96.9% token reduction in the benchmark setup
41.4% fewer prompts per task

No embeddings. No vector DB. No npm dependencies.

This is not meant to replace LSPs, grep, agent search, MCP tools, or full-file reads.

It is meant to give local coding models / agents a cheap first-pass structure map before deeper inspection.

Repo: https://github.com/manojmallick/sigmap

Benchmark suite: https://github.com/manojmallick/sigmap-benchmark-suite

Curious how people here handle repo context with local coding models.

Are you mostly using grep/search, RAG, repo maps, MCP tools, or just relying on longer-context models?

Edit: Good point from the comments — SigMap core is model-agnostic. The docs currently look too focused on proprietary assistants, so I’ll add clearer examples for VSCodium/Open VSX, Continue, Cline/Roo Code, Aider, OpenHands, and local Ollama/llama.cpp workflows.

u/Independent-Flow3408 — 5 hours ago

▲ 17 r/LocalLLM

Ornith posts/comments...

I frequently see odd posts/comments about this model, but they all seem fake. I get the feeling someone's creating throwaway accounts to advertise it.

reddit.com

u/Connect-Painter-4270 — 3 hours ago

▲ 17 r/LocalLLM

OCR / LLM Local Recommendation for this data

So I'd like to be able to get the text out of this image. Is that a thing that an LLM can help with? I have OLLAMA installed and running.

I am just not sure what tools to throw at the problem? It'd be great if it was structured in some kind of way, but I'd settle for the text.

Anyone any pointers?

Thanks!

u/nacnud_uk — 3 hours ago

▲ 3 r/LocalLLM

Which LLM model should i use?

Quick background - i read and summarize a lot of medical records that come in huge PDFs, like 5000pages.

I was thinking of trying the NVIDIA DGX Spark, to run some LLM, to help break up the PDFs into dates of service, and then summarize them.

Does anyone have any idea if that will work, and if so which LLM would you suggest I start with.

It's important to stay local, for HIPAA compliance.

Thanks!!

reddit.com

u/brianbonedoc — 2 hours ago

▲ 15 r/LocalLLM

Is agentic coding possible on an NVIDIA RTX Ti 16 GB?

I'm trying to replace a Claude Code subscription with locally hosted LLM (tested several qwen models, mostly 3.5 9b and 27b with different contexts, on an Nvidia RTX 5060 16GB VRAM with 64GB system RAM). However, the models don't even seem to receive my prompts and just make up something to do or just ask me what they should do, when I've literally just told them. Is it possible at all with only 16 GB VRAM?

reddit.com

u/-davidde- — 7 hours ago

▲ 10 r/LocalLLM

Qwen 3.6 27B settings for LM Studio (32GB VRAM)

I want to use Qwen 27B for C# enterprise solution-level analysis, refactoring, and generating code.

However, even when using higher quants (Q6+), Qwen has entered an infinite loop, or just been too slow to be useful.

I have been advised (by Claude Sonnet 5) that my looping issue with Qwen 3.6 27B Q6 quant may not have been down to the smaller quant but rather my LM studio settings.

Anyone here use Qwen 3.6 27B with LM Studio? What are your settings and best quants to get successful long running coding tasks?

Are you getting decent results with unsloth Q4_K_XL? I am considering it. I use Q8 atm but have been told by some on this subreddit that a smaller quant can work.

If you use llama.cpp, that's fine, I'd like to see the llama.cpp parameters that give you best results. I'm not averse to using it.

Also, if anyone is running a Linux VM with llama.cpp on Windows, can you tell me your setup and its token output compared to windows?

Thanks.

reddit.com

u/No_Oil_6152 — 6 hours ago

▲ 18 r/LocalLLM+8 crossposts

Chimera: an open-source, self-hostable agent that runs on local models (any OpenAI-compatible endpoint) and can fuse several at once

I've been building an open-source agent (Apache-2.0) and wanted to share it here because it's designed to be fully local and self-hostable: it talks to any OpenAI-compatible endpoint, so Ollama / llama.cpp / vLLM / LM Studio all work as the backend. No cloud lock-in, your keys and data stay yours.

The core idea is LLM-Fusion: for the hard steps it can run a panel of models on the same prompt, have a judge model cross-check them (consensus / contradictions / blind spots), and a synthesizer write the final answer. Locally this is fun because you can mix a few small local models and let them cross-check each other. A cost/latency-aware router keeps easy turns on a single model so you're not paying panel latency for everything.

Beyond that it's a full agent: plan -> act -> verify-or-revert (it runs your tests and treats the result as ground truth), layered memory (SQLite + FTS recall, cross-session profile, consolidation), a governance kernel, cron/proactive jobs, MCP client + OpenAPI-to-tool import, and an isolated subagent/crew layer (parallel git worktrees with per-worker verify gates). Runs on a laptop or a $5 VPS via Docker.

Honest status: it's alpha - 463 tests, mypy --strict clean, but no production mileage yet. Local reasoning quality obviously depends on the models you point it at, so I'd genuinely love to hear which local models people find good enough to actually drive an agent loop (reliable tool use + self-correction) - that's the make-or-break for going fully local.

Repo: https://github.com/brcampidelli/chimera-agent

u/Federal-Teaching2800 — 7 hours ago

▲ 10 r/LocalLLM

What About YOUR Data ?

I see a lot of you guys paying for API per-token access for models such as DeepSeek, but would this not also compromise your data? And would your data also land in training for future models, and possibly, almost certainly, also be sold to OpenAI and Anthropic?

If I were OpenAI and Anthropic, with the enormous compute I have, I would definitely put some of it into exposing services for API per-token usage for local models, and then collect the data from people. Or I would just buy it from the API providers directly. This way, I also get to have the data of people who would not share their data with me.

But mostly, I would not leave some data for my competitors that they have and I do not, to prevent exposing any advantage.

Is this not the case? Do those API providers you use have some mechanism to encrypt data and so on?

On the other hand, by the way, if I use the models directly from the original companies, like Qwen, they get my data, and they get to improve their models, and I get better models in the next releases, as opposed to when I use the models through external APIs. i practically punish the maker of the open source models by using their models to create data and give it to others.

reddit.com

u/Zealousideal_Sort74 — 7 hours ago

▲ 18 r/LocalLLM+8 crossposts

Shared catalog of web skills

Agents waste time and tokens re-learning every site. On each run they screenshot, snapshot the DOM, and figure out the page from scratch.

I built an open source catalog of reusable browser skills. Skills capture each site's network requests and DOM, making it 30 times faster.

You can upload your own skills or request new sites.

Github repo: https://github.com/browser-memory/bmem

u/Deep-Occasion-7391 — 2 hours ago

▲ 2 r/LocalLLM

Running Qwen3.6 35B on RTX3080

My 6 year-old RTX3080 (10GB) could run this 35B MOE model and it shocked me that the results are actually usable.

What I mean is that it's not just chatting with it but connecting with a coding agent like Pi and letting it do the work.

When I first tested Qwen it was really slow like 10token/s so I didn't bother trying. Now with the optimizations it runs at 40+t/s stable with 128k context. I was considering to purchase a mac studio but I guess this can last me for a while now :)

https://preview.redd.it/v3bppxnoumbh1.png?width=1916&format=png&auto=webp&s=0c49b2925a883b9bdd408488eb9c0d4eea4e4d85

reddit.com

u/foresttrader — 3 hours ago

▲ 3 r/LocalLLM

What's the best way to use hybrid planning?

I've been trying the whole let the big model (Opencode Go GLM 5.2) plan out what needs to be done and let my local Qwen 3.5 35B A3B execute it, but I need tips on making sure GLM is passing as much information as possible.

I've been simply using plan mode and telling it to create a technical implementation plan and then switching the models and switching to Build and prompting execute. It's been working and saving an incredible amount of GLM usage, but I feel like a lot of context gets lost.

reddit.com

u/Forward_Jackfruit813 — 4 hours ago

▲ 3 r/LocalLLM

Should I buy Titan RTX gpu

Hey everyone,

I'm completely new to local LLMs. I'm a software developer and recently decided to start experimenting with local models and agents to help me build things more efficiently. I haven't really studied the field yet, so my understanding is pretty basic.

From what I've gathered, the general advice seems to be "the more RAM/VRAM you have, the better the models you can run." Right now I have an RX 5700 8GB, and I have an opportunity to buy a Titan RTX 24GB locally for $280. Is that a worthwhile upgrade specifically for running local LLMs and will this 2018 card be able to keep up with newer hardware? I don't think there could any better deal at this price point, no?

reddit.com

u/dattebanee — 4 hours ago

▲ 1 r/LocalLLM+2 crossposts

Eight – Interactive fiction made with AI

Somewhere you can't quite leave. Choose a way through. https://huggingface.co/spaces/alvations/hallway8

This is a interactive fiction escape room kinda game and the whole game is created by just me and my AI agents through Claude Code. The full game is opened source and anyone looking to replicate any workflow processes used to create the game can do so at https://github.com/alvations/melon-eight

u/alvations — 6 hours ago

▲ 19 r/LocalLLM

My voice agent sounded smart until one phone number was transcribed wrong.

The agent sounded good.

Natural voice. Good prompt. Nice handoff logic. CRM update worked. Calendar integration worked.

Then it heard one phone number wrong and the whole thing became useless.

That’s when I realized voice-agent STT should not be judged like normal transcription.

The transcript can be “mostly correct” and still fail the workflow.

For voice agents, these words matter more than the rest:

phone numbers
appointment times
dates
names
email addresses
prices
addresses
order IDs
“don’t”
“not”
“actually”
“wait”
“no, I meant…”

Those are the words that change the action.

I’m testing this now with HubSpot fields instead of just transcript accuracy.

Example scorecard:

did the phone number field match?
did the appointment date match?
did the agent catch the correction?
did it ask for confirmation?
did CRM update only after confirmation?
did the transcript preserve the negation?

Smallest AI Pulse is interesting to me here because I’m not evaluating it as “can it write a nice transcript?” I’m evaluating whether a real-time STT layer can capture workflow-critical entities while the call is still happening.

For AI voice agents, I think entity accuracy deserves its own benchmark.

Not WER.

Not vibes.

Did the system capture the fields that matter?

reddit.com

u/annabellecuddles — 10 hours ago

▲ 4 r/LocalLLM+3 crossposts

I developed an Android application that can turn a mobile phone into an AI inference service node.[Self-promotion]

Hi everyone, I've developed an application that runs an LLM service on mobile devices and makes it available as a service node for other applications to access via a local area network.

It supports LiteRT-LM and llmam.cpp, and can run the vast majority of models (provided your hardware configuration allows). Currently, mobile hardware typically supports models up to 3B.

Regarding the API service, I've ensured it's compatible with the OpenAI and Ollam interface specifications.

Furthermore, I've integrated Hugging Face's Hub Point, allowing you to directly search, download, and import Hugging Face models within the application.

Link :

https://play.google.com/store/apps/details?id=com.chaterminal.mobilellm

u/Head_Invite3039 — 7 hours ago

▲ 4 r/LocalLLM+1 crossposts

Migrating from Antigravity to VS Code

Hey everyone,

I'm a non-coder who dares building a feature-heavy project. I’ve been using Google Antigravity IDE (Pro tier), but tokens burn way too fast during micro-iterations and UI polishing.

My Plan is to use a local IDE + OpenRouter/Local LLMs for small tweaks and micro-iterations, and use Antigravity only for heavy engineering.

I tried switching to VS Code + Continue with OpenRouter, tried to migrate all Skills, MCPs and AGENT.md, but the development flow feels much less automated.

The Problems i faced is no automatic testing: Antigravity automatically runs tests (TDD) on every code change and loops tasks. In VS Code, nothing triggers automatically; I have to explicitly tell the agent to use tools. Also the agent ignores requirement to trigger graphify MCP even though I pasted the exact same rule from Antigravity as system prompt.

I am working in same directory as Antigravity with installed MCPs (Supabase, GitHub, Graphify) and skills placed in .agents\skills (ttd, ui-ux-pro-max, frontend-design, seo, prompt-engineer)

What am i missing? Should i explicitly install the skills somehow instead of placing them in .agents\skills? Is VS Code + Continue the wrong choice? Is there a better agentic automation developing tool?

Appreciate any migration tips or config tweaks!

reddit.com

u/AdditionalTadpole758 — 11 hours ago

▲ 15 r/LocalLLM

LLM Thinking for Excessively Long

I am trying to run Qwen 3.5 9B locally but it keeps thinking for very long amounts of time. I am running it on Linux with a 9060xt 16G. Is this normal?

u/JohnGaltW — 13 hours ago

▲ 327 r/LocalLLM+9 crossposts

Hey everyone,

I just open-sourced TuneForge.

The goal is simple: let your coding agent manage the full LLM improvement loop without ever leaving the chat window.

You can now tell your agent something like:

“Build me a customer support bot from this FAQ”

…and it can:

• Generate a clean synthetic instruction dataset (with LLM judging for quality)

• Run LoRA supervised fine-tuning on any Hugging Face causal LM

• Do a quick policy-gradient RL step using Ollama as the reward judge

• Merge the adapter, evaluate on a test set, and iterate

Everything runs locally, uses 4-bit quantization so it fits on modest hardware, and uses background jobs (with job_id polling) so long training tasks don’t freeze the MCP connection.

It’s built around the Model Context Protocol (MCP) for seamless integration with Claude Desktop, Cursor, Zed, Continue.dev, etc.

Tech: Python + Transformers + PEFT + bitsandbytes + Ollama + SQLite for job state.

Super early stage (just released), MIT licensed.

Would love feedback or ideas on what to add next. If you’re into agentic fine-tuning workflows, give it a try and let me know how it goes!

u/Just_Vugg_PolyMCP — 23 hours ago

▲ 57 r/LocalLLM+7 crossposts

We're building agents that can read millions of documents, but still forget a video they watched yesterday.

One thing has felt odd to me while working with AI agents.

We've gotten pretty good at giving them memory for text.

They can search documentation, index repositories, retrieve past conversations, and even build long-term memory over time.

Videos, though, are still treated as temporary input.

The agent watches a recording, answers a few questions, and when the session ends, that understanding is usually gone. Next session, the same video gets processed all over again.

That feels like an architectural gap rather than a model limitation.

A video isn't fundamentally different from any other source of information. Once you've extracted transcripts, OCR, visual observations, and timestamps, why throw that work away?

I ended up building an open-source project around this idea.

Instead of asking the agent to repeatedly "watch" the same video, it builds a persistent local index the first time. Future questions become retrieval instead of video analysis.

It changed how I think about video in agent workflows.

I'm curious whether others see this as a real missing piece, or if you've already solved it another way.

GitHub: https://github.com/oxbshw/watch-skill

u/Fearless-Role-2707 — 18 hours ago