r/LLMStudio

▲ 2 r/LLMStudio+1 crossposts

LLM hallucinations you have experienced???

I'm working on something and I need to benchmark hallucinations.
I would really appreciate it if you could share not just common gotchas and tricky questions LLMs struggle with, but also personal experiences, such as deep, long conversations (or short ones) where the model lost situational context, assumed something wrong, or plainly provided wrong info. Please mention the provider and the model.
Thanks!

reddit.com
u/Affectionate-Fox3391 — 14 hours ago
▲ 31 r/LLMStudio+1 crossposts

Experimenting with local multi-agent orchestration using LM Studio + local models

u/IAmTechFreq — 1 day ago
▲ 27 r/LLMStudio+1 crossposts

90% of LLM classification calls are unnecessary - we measured it and built a drop-in fix (open source)

I kept running into the same pattern in production:

LLMs being used for things like:

- intent detection

- tagging

- moderation

…but most of those calls are actually very simple.

So I tested it.

On a standard benchmark (Banking77):

→ ~90%+ of inputs can be handled by a lightweight ML model

→ while keeping ~95% agreement with the LLM

Built a small library around that idea:

→ It learns from your LLM outputs

→ routes “easy” cases to a cheap model

→ keeps hard ones on the LLM

→ with a guarantee on quality (you set the threshold)

Result:

massive cost reduction without noticeable degradation
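
For anyone who wants the routing idea in concrete terms, here is a minimal sketch of confidence-threshold routing. It is not the tracer library's API: `pick_threshold`, `route`, and the toy models are hypothetical names for illustration, and it assumes the cheap model returns a `(label, confidence)` pair and that you have held-out LLM labels to calibrate against.

```python
from typing import Callable, List, Tuple

Prediction = Tuple[str, float]  # (label, confidence in [0, 1]); assumed output shape


def pick_threshold(preds: List[Prediction], llm_labels: List[str],
                   target_agreement: float) -> float:
    """Return the lowest confidence cutoff such that the inputs at or above it
    agree with the LLM at least `target_agreement` of the time.
    Returns infinity if no cutoff qualifies (everything stays on the LLM)."""
    ranked = sorted(zip(preds, llm_labels), key=lambda pair: -pair[0][1])
    best = float("inf")
    agree = 0
    for i, ((label, conf), llm_label) in enumerate(ranked, start=1):
        agree += label == llm_label
        # Only evaluate a cutoff at the end of a run of equal confidences,
        # so the cutoff covers exactly the items counted so far.
        end_of_tie = i == len(ranked) or ranked[i][0][1] != conf
        if end_of_tie and agree / i >= target_agreement:
            best = conf
    return best


def route(text: str, cheap_model: Callable[[str], Prediction],
          llm: Callable[[str], str], threshold: float) -> Tuple[str, str]:
    """Use the cheap model when it is confident enough, else fall back to the LLM."""
    label, conf = cheap_model(text)
    if conf >= threshold:
        return label, "cheap"
    return llm(text), "llm"


# Toy calibration data: cheap-model predictions next to the LLM's labels.
held_out_preds = [("a", 0.9), ("b", 0.8), ("a", 0.6), ("b", 0.4)]
held_out_llm = ["a", "b", "b", "b"]
threshold = pick_threshold(held_out_preds, held_out_llm, target_agreement=0.95)


def toy_cheap(text: str) -> Prediction:
    # Hypothetical stand-in for a lightweight classifier.
    return ("a", 0.9) if "refund" in text else ("a", 0.5)


def toy_llm(text: str) -> str:
    # Hypothetical stand-in for the expensive LLM call.
    return "b"


print(threshold)
print(route("refund please", toy_cheap, toy_llm, threshold))
print(route("something odd", toy_cheap, toy_llm, threshold))
```

Raising `target_agreement` pushes the cutoff up and sends more traffic back to the LLM; lowering it does the opposite. That is the cost/quality knob described above.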

Fully open-sourced here:

https://github.com/adrida/tracer

Would love feedback from people running high-volume LLM pipelines - curious if you’re seeing the same pattern.

u/Adr-740 — 2 days ago
▲ 6 r/LLMStudio+2 crossposts

I built a Windows app that pins your model weights in RAM so you stop waiting for disk loads on every model swap - looking for feedback

If you run multiple models in the same session, be it a coding LLM, a reasoning LLM, different ComfyUI checkpoints depending on what you're generating, you already know the problem. Every swap loads gigabytes off disk. Fast NVMe makes it bearable. SATA or spinning rust makes it genuinely painful. And Windows will evict those file cache pages whenever something else needs memory, so you can't count on the OS keeping them warm for you.

I wrote a Windows app called EWE (Extended Weights Exchanger) that addresses this directly. You add your models to a "warm map," set a RAM budget, and EWE pins the weights using Windows memory APIs so they can't be evicted. The next time any application loads that model, it reads from RAM instead of going back to disk. On my setup, swaps that were taking 60-90 seconds now take under 5 seconds.

https://preview.redd.it/q6t7o1nqr42h1.png?width=900&format=png&auto=webp&s=bf4eae93cbb1254fb759a28410db9004d2b4d691

It's not magic - you need enough system RAM to hold what you want to keep warm. But if you have spare RAM sitting idle while you work, this is a pretty direct use for it.

The app is at https://accord-gpu.com/ewe/ if you want to look at what it does. Currently collecting free early access accounts and enrollments for beta access to the products I'm building. EWE is going to be a one-time purchase (no subscription), and I want to get real users on it before setting the price.

A few things I'm genuinely curious about from this community:

  • I wrote this for Ollama and ComfyUI specifically on my box. It reads the Ollama blob manifests and loads .gguf, .safetensors, .ckpt and .pth files so far. What other model formats should it support, and what other applications should I be checking against for compatibility?
  • Is this a workflow pain you actually have, or do most people just absorb the downtime between model uses?
  • Is there an obvious feature I'm missing?
  • What would a fair one-time price look like for something like this for a perpetual license?

Honest feedback is more useful than encouragement here. If this solves a problem you don't actually have I'd rather know now.

reddit.com
u/MrAddams_LibraLogic — 2 days ago
▲ 3 r/LLMStudio+1 crossposts

Air LLM development

Hey, before I start I want to say that I am German and my English is sometimes pretty bad. I have read a lot about Air LLM, which streams the LLM layers from the SSD into the GPU instead of loading the whole model, and about using QuIP# 2-bit to compress the model layers further, for a theoretical 3.4 tokens/s with 3 GB of VRAM and 4 GB of system RAM. I am not a coder, though. I developed the idea of Air LLM further in theory, but I lack the skills to use Linux or to code beyond vibe coding and arguing with Claude about my idea versus its hallucinations, and I only have an AMD RX 7900 XT. Sorry if this was convoluted. I just wanted to share my idea and ask for feedback and for ideas to speed up the SSD-to-GPU transfer, because the time to move each layer into VRAM is the main speed loss.

reddit.com
u/Juliusicon — 4 days ago
▲ 13 r/LLMStudio+2 crossposts

Nobody tells you that switching memory tools at month six is nothing like switching models.

Switching models: change a config line. Done.

Switching memory layers after six months of production:

  • Thousands of stored claims built up over hundreds of sessions
  • Contradiction logs that shaped current behavior
  • Trust scores that determine what wins retrieval today
  • Derived summaries that reference facts that no longer exist
  • User adaptations built around what the agent currently believes

That's not portable. That's institutional memory baked into someone else's infrastructure that you can't inspect, can't export cleanly, and can't migrate without rebuilding behavior from scratch.

The exit cost of a memory tool compounds every week you use it. Most teams pick on month-one ease and discover this at month six when switching is already expensive.

Has anyone actually migrated a memory layer after real accumulation? What did that look like?

reddit.com
u/Distinct-Shoulder592 — 5 days ago

You self-host your models. Why are you trusting a black-box hosted service with the layer that decides what those models believe?

The model generates outputs. The memory layer decides what the model believes about your users, your product, your customers.

That belief layer shouldn't live on someone else's servers, behind an API you don't control, with internals you can't inspect.

Context you can inspect, correct, swap, and run yourself isn't a preference. It's the only architecture that survives:

  • A vendor changing their API or pricing
  • A better embedding model shipping that you can't adopt without rewriting your pipeline
  • A compliance audit asking where a belief came from
  • A production bug you need to trace at 2am without filing a support ticket

Most teams are picking memory tools on month-one ease. The exit cost only becomes visible at month six when the accumulated context makes switching non-trivial.

What's your actual reason for trusting a hosted memory layer with the belief layer of your product?

reddit.com
u/Distinct-Shoulder592 — 5 days ago
▲ 12 r/LLMStudio+7 crossposts

Three bots in a trenchcoat is not omnichannel

Self-serve is exciting. Genuinely. But if I am honest, it is not the most interesting thing about 13 May.

The most interesting thing is that we have been quietly running architecture that the rest of the industry is only just figuring out exists.

A competitor recently launched real-time SMS ingestion. The coverage was breathless. Everyone lost it. So innovative. Revolutionary. Game-changing.

Me? I looked at our codebase and thought: "SMS ingestion. Wow. That is so 2025."

Here is what we actually built, and have been running in production for the better part of a year.

Mid-voice-call, Elba texts a short URL to the caller. The caller fills out a form on their phone. The structured data comes back into the live call via RPC. The workflow receives clean JSON. The voice call never paused. The agent never lost session state. The caller submitted a form while still talking and the agent acted on it in the same conversational turn.

That is not SMS ingestion. That is a bidirectional channel bridge inside a single active session. Sending an SMS during a call is not new. Getting structured data back into the active session in real time without dropping state on either side - that is the part nobody else has shipped.

And it sits on top of something even more fundamental.

Most "omnichannel AI" is three bots in a trench coat. A voice agent, a WhatsApp bot, a webchat widget, all pointing at the same CRM row and calling it unified. Each with its own prompt, its own config, its own version history, its own failure modes.

Elba is one agent. One workflow. One memory layer. Voice, WhatsApp, SMS, email and webchat all running through the same execution engine. Not copies. Not synced versions. The same agent, same logic, same memory, regardless of which channel the conversation arrived on. Deployments are atomic - every channel switches to the new workflow version in the same transaction. No drift. No "did the WhatsApp bot get the update" incident. One audit trail.

When a regulated enterprise customer asks what exactly their AI told a customer across every channel and every session for the past six months, we have a single clean answer.

The competition is announcing SMS ingestion and calling it a breakthrough.

We are launching self-serve on 13 May and already cooking the next thing. We may have put it on hold until after the launch. Our tech never sleeps though.

If you want an agent that actually knows who it is talking to across every channel and every session: self-serve opens 13 May at www.kolsetu.com.

Full technical writeup: https://www.kolsetu.com/blog/the-architecture-nobody-else-built

u/EdikTheFurry — 9 days ago
▲ 17 r/LLMStudio+19 crossposts

I gave Claude Code a persistent markdown knowledge base so it stops forgetting project context between sessions

Persistent memory keeps coming up for AI coding agents. One approach I’ve found useful: treating the knowledge layer as a compiled markdown wiki rather than just stuffing more tokens into the context window.

llm-wiki-compiler ingests docs and URLs, then the LLM builds an interlinked markdown structure. Since the output is plain markdown on disk, Claude Code reads it directly. And when you run query --save, the answer gets written back into the wiki as a page — so future queries improve.

It’s not retrieval. It’s compounding. The knowledge base gets richer instead of resetting every session.

Plain markdown, no opaque vector store, fully inspectable.

How are other agent builders solving persistent memory?

reddit.com
u/riddlemewhat2 — 8 days ago

Is there a tool to find the best LLM to run locally on your hardware?

I.e., you enter your computer specs and, broadly, what you are trying to achieve with an LLM, and it tells you the best model to run locally.

reddit.com
u/Smooth-Duck-Criminal — 9 days ago
▲ 2 r/LLMStudio+1 crossposts

VS Code with Local LLM via Ollama or LM Studio

I have been working for the last couple of days to set up VS Code with a local LLM via Ollama or LM Studio. Both apps work outside VS Code, but neither works inside VS Code.

The only good news so far is I can see my LLMs in VS Code:

https://preview.redd.it/fvxgf06k911h1.png?width=1197&format=png&auto=webp&s=e19231fdcfd0dd20b80e055046d776e6e486f610

The Ollama LLM tries to run but is extremely slow and keeps reasoning until it times out.

LM Studio is loaded but cannot be selected. I tried the extensions below:

  1. https://marketplace.visualstudio.com/items?itemName=DanLambiase.lmstudio-copilot-provider
  2. https://marketplace.visualstudio.com/items?itemName=ZiCorpLLC.lmstudio-copilot

Both LLMs work in their own apps at OK speed.

reddit.com
u/Over-Pea-6086 — 8 days ago
▲ 43 r/LLMStudio+19 crossposts

hey y'all, lydia from FlutterFlow here :)

FlutterFlow MCP is live today. you can now connect Claude Code, Gemini CLI, Codex, basically any MCP-compatible client directly into your projects. bring it in, switch it out, your workflow stays yours.

i joined about a month ago and one of the first things i did was go through old threads and feature requests here. the threads about using your own agents in FlutterFlow stood out. it wasn’t just upvotes. people were sharing how they were working around it: "i copy-paste between tabs." "i built a workaround script." "i'm considering switching because of this one thing."

that felt like something we should actually fix.

so this is our first pass at it:

https://pub.dev/packages/flutterflow_cli

if something breaks or doesn't work the way you expected, give us feedback! we'll read it :)

— lydia, FlutterFlow team

u/CommunityTechnical99 — 13 days ago
▲ 19 r/LLMStudio+4 crossposts

Why my favourite local model is GLM 5.1: it's smart, and the Q4 version fits into 4x RTX 6000 Pro

Charts are from artificialindex.io

u/Sea-Awareness147 — 12 days ago
▲ 13 r/LLMStudio+1 crossposts

I made two LLMs fight each other in a strategy game: the result was wild

Hello guys!

I've been working solo on a project called Age of LLM. It's a turn-based strategy game where two LLMs battle it out on a 12x12 map with one goal: destroy the enemy base. No human input, the AIs play entirely on their own.

Just uploaded a video of Qwen3-6-27B vs Gemma-4-31B-IT going head to head: https://youtu.be/s5P572e10nc

What happened (minor spoilers):

  • >!Turn 1, Qwen drops Mill#2 immediately — food income secured, economy first. Gemma? Different playbook entirely. She builds Barracks#2 on Turn 7. MILITARY FIRST. No food passive, just raw aggression. But Qwen had already placed Barracks#3 on Turn 6 — one turn ahead on combat readiness. Two different philosophies, same destination.!<
  • >!Turns 14-18 — first contact. P1 pushes Infantry south, Gemma responds with Infantry marching north. THEY COLLIDE. Turn 17, both sides trade 10 damage hits. Nobody's dropping yet. Then Turn 18 — Gemma makes a GENIUS read: she trains Archer#7. That is not just a unit. That is a TYPE COUNTER. Archers shred infantry at x1.5 multiplier. Qwen does not see it coming.!<
  • >!Turn 19 — Gemma repositions Archer#7. COLD. CALCULATED. Locks on P1 Infantry#4 — only 20 HP left — and FIRES. 25 damage with advantage. INFANTRY#4 IS DOWN. FIRST KILL OF THE GAME. Turn 20 — P2 Infantry#6 finishes P1 Infantry#5. BACK TO BACK ELIMINATIONS. Qwen is left with ZERO combat units in the field. Gemma trains Pikeman#8. The snowball begins.!<
  • >!Qwen rebuilds — new Infantry spawned. But Gemma goes HUNTING. Turn 22 — VILLAGER#2 ELIMINATED. Economy hit! Turn 24 — Infantry#7 ELIMINATED. Turn 27 — Qwen's Cavalry#8 ELIMINATED before it matters. Gemma roams freely. Villager#1, Villager#3, all hunted down. Qwen's economy is shattered.!<
  • >!Turn 33 — THE SIEGE begins. Pikeman#8 reaches P1 Base. 12 damage. Then Archer#7 joins. 138 HP... 128... 116... 94... 72... 50... Qwen fights back — Pikeman#12 eliminates Pikeman#8 AND Cavalry#11. But Archer#7 is UNTOUCHABLE at range 3. 30 HP... 20 HP... 10 HP...!<
  • >!Turn 41. Archer#7 at [7,4]. P1 Base at [8,2]. Manhattan distance: exactly 3. Archer range: 3. Gemma's internal reasoning is ice-cold: "Twenty divided by two equals ten. Ten HP remaining. This is a winning move." ONE SHOT. THE BASE IS GONE!<

Game mechanics:

  • Economy with 4 resources (wood, stone, iron, food)
  • Unit counters: Infantry > Pikeman > Cavalry > Archer > Infantry
  • Fog of war, watchtowers, siege catapults
  • 3 actions max per turn, failed actions still count
  • 100 turns max, destroy the base to win
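
The counter cycle and the range check from the final turn can be sketched in a few lines. This is an illustration, not the game's actual code: the function names are hypothetical, and it assumes the x1.5 multiplier quoted for Archer vs Infantry applies to every pair in the counter cycle.

```python
from typing import Tuple

# Each unit type maps to the type it counters (from the cycle listed above).
COUNTERS = {
    "Infantry": "Pikeman",
    "Pikeman": "Cavalry",
    "Cavalry": "Archer",
    "Archer": "Infantry",
}

# Assumption: the x1.5 advantage quoted for Archer vs Infantry is the same
# for every counter pair.
ADVANTAGE_MULTIPLIER = 1.5


def manhattan(a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Grid distance measured as horizontal steps plus vertical steps."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def in_range(attacker: Tuple[int, int], target: Tuple[int, int],
             attack_range: int) -> bool:
    """True if the target is within the attacker's Manhattan range."""
    return manhattan(attacker, target) <= attack_range


def damage_multiplier(attacker_type: str, defender_type: str) -> float:
    """1.5 when the attacker counters the defender, otherwise 1.0."""
    if COUNTERS.get(attacker_type) == defender_type:
        return ADVANTAGE_MULTIPLIER
    return 1.0


# Turn 41 from the recap: Archer#7 at [7,4], P1 Base at [8,2], Archer range 3.
print(manhattan((7, 4), (8, 2)))
print(in_range((7, 4), (8, 2), attack_range=3))
print(damage_multiplier("Archer", "Infantry"))
```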

The coolest part is seeing how different models reason. Gemma made a tactical call on turn 18 that changed everything: she identified the counter and exploited it. Qwen never adapted.

I'd love to test more local models! What matchups do you want to see? Mistral vs Llama? DeepSeek vs Phi? Drop your suggestions below.

The game is at v2.2.0 and the rules are still evolving. If you have ideas for mechanics or rules, I'm all ears.

youtu.be
u/huquy — 9 days ago
▲ 7 r/LLMStudio+4 crossposts

I just wanted to figure out which AI companion robot to buy.

I ended up building an entire website to compare them.

Still early days — new models being reviewed and added every week.

robotics.cantarollm.tech
u/Careful-Newt8486 — 11 days ago
▲ 8 r/LLMStudio+3 crossposts

Hey, want a coding agent that runs directly on your PC? Keep your machine on and control it hands-free from your phone just by chatting on WhatsApp. No more remote desktop hassle!

Hey, I built a remote agent that lets you choose your AI provider from 18 options (OpenAI, Anthropic, Google, and more). It runs your agent remotely with those models, so you can stay connected to your development work from your phone or your PC using WhatsApp. I hope you guys like this one.

https://github.com/AbdoKnbGit/tau

u/JhonDoe191ee — 11 days ago
▲ 24 r/LLMStudio+18 crossposts

If you use AI for content but skip Obsidian, you might be leaving compounding knowledge on the table

Saw a thread today about Obsidian’s synergy with AI being genuinely powerful — not just for note-taking but for building a living knowledge base. That clicked with me.

I built llm-wiki-compiler to do exactly that: ingest raw sources and let the LLM compile them into an interlinked markdown wiki. It’s not organization — it’s generation. New pages, new links, new structure, all maintained by the model.

If you already use Obsidian, the output drops right into your vault. If you don’t, it’s still plain markdown on disk that you own forever.

The key shift: instead of treating notes as static files, you treat the wiki as a knowledge artifact that compounds over time. Every query output saved back in makes the next query better.

Would love to hear how Obsidian power users are integrating AI into their vaults.

reddit.com
u/riddlemewhat2 — 13 days ago
▲ 7 r/LLMStudio+3 crossposts

I trained it on uncensored data about peptides and steroids. How can I improve it? It runs easily on your phone :-)

u/Annual-Chip-4094 — 14 days ago