u/Exact_Law_6489

r/JanitorAI_Refuges r/WhatIsMyCQS r/ChatbotRefugees r/CitiesSkylines2 r/ChaiUnofficial r/CAIRevolution r/buildapc r/LocalLLaMA

▲ 14 r/ChatbotRefugees

LettuceAI 2.1 & 2.1.1 (Android & Desktop) are live

Hey everyone! I'm the developer of LettuceAI.

LettuceAI is an open-source, privacy-first, cross-platform AI chat app built for character chats, roleplay, and long conversations that actually stay coherent.

It supports both local models (built-in llama.cpp engine, Ollama, LM Studio) and external APIs with full BYOK support, so you stay in control of your own setup. No forced accounts, no cloud routing through us, no vendor lock-in. Your requests go directly to the model provider you choose.

Where 2.0 was about companions and souls, 2.1 is about your hardware and your data, and 2.1.1 is about trusting both. The short version: local models now spread across every GPU in your machine, a new performance dashboard shows tokens per second on every generation, and OpenRouter models can be pinned to a specific provider endpoint. Then 2.1.1 rebuilt device sync from the ground up so your settings and memories actually arrive intact, made multi-GPU configuration honest about what it will really do, and added guided tours for the trickiest parts of the app.

What's new in 2.1

Local models & multi-GPU

Multi-GPU layer distribution: llama.cpp now splits a model's layers across every selected GPU, with automatic or manual splits, a per-model single-GPU device override, and per-GPU VRAM reservation so nothing overflows
A smart offloader sizes each device's share automatically and keeps far more of the model on the GPU instead of falling back to the CPU
KV cache placement modes (auto, split with layers, system RAM, or pinned to a main GPU), explained inline with fully localized pickers in the model editor

Performance metrics

A new local-LLM performance page graphs tokens per second and prompt/generation timing, per run and across runs, including group chats
Every generation is linked to its message, with a per-message action to open that reply's performance detail
MTP speculative-decoding stats (acceptance and draft counts) are persisted and shown on each message

Providers

OpenRouter provider pinning: pick a specific provider endpoint per model, with live pricing, cache rates, uptime, and provider logos in the picker, and route every request exclusively through it
Sprout hardware probe: an open-source companion service you run on your Ollama machine, so remote-Ollama runnability is judged against the real hardware behind the endpoint instead of a guess

Mobile & portability

The HuggingFace model browser now works properly on phones: it pairs with a remote Ollama provider and pulls GGUF models straight to that host, with files and recommended settings in a slide-in drawer
Chat export and import rebuilt around the official SillyTavern jsonl format, so histories move cleanly in and out
A shared memory cycle hub unifies the memory-cycle UI across chat and group memory pages, plus a wave of memory and embedding reliability fixes underneath

What's new in 2.1.1

Device sync, rebuilt

Conflict resolution now keeps the newest data instead of the oldest: previously a fresh install could silently overwrite the host device's real settings, advanced settings, and prompt templates with its own defaults
Memory metadata survives sync: importance, categories, timestamps, and embedding versions arrive intact instead of being stripped to bare text
Large libraries sync in chunks with no more transfer size ceiling, and a sync that fails or drops mid-transfer reports a real error instead of completing silently with partial data
Companion data, creation helper drafts, and ASR learning data (custom vocabulary, corrections, dismissed suggestions) now sync between devices, and ASR learning data is included in backups too
If a past sync ever ate your settings or memories: update both devices to 2.1.1 and re-run sync from Settings, it resends everything cleanly

Multi-GPU clarity

A leftover single-GPU pin can no longer silently disable multi-GPU: enabling multi-GPU takes precedence, while a deliberate per-model pin still wins where intended
The model editor now shows the effective GPU setup a model will really use, including settings inherited from global defaults, and pinned models get a visible notice with a one-tap "Remove pin"
Model browser installs persist your offload intent (auto, CPU, GPU, mixed) instead of a hardware-specific layer count, so VRAM you add later is actually used

Provider fixes

The zAI (GLM) provider actually works now: requests were sent to a nonexistent endpoint, so every call failed. Chat completions, GLM's thinking toggle, and API key verification all target the real Z.AI API. If you already added zAI with a regular API key, point the credential's base URL at https://api.z.ai/api/paas/v4 (coding-plan keys work as-is)

Chat & desktop fixes

Stopping a generation now actually stops it: a stop pressed before the first word arrived was silently lost; the "canceled" reply kept generating, was saved invisibly, and reappeared above your next message after a refresh
Linux desktop ships a working embeddings runtime again: the bundled ONNX Runtime library was a 0-byte file; builds now include the real library, and installs with the broken file repair themselves automatically on first use

Quality of life

Five new guided tours: the local model editor, runtime defaults, the model browser's recommendation panel, group chats, and dynamic memory walk you through themselves on first visit, in every supported language
Safe model file deletion: deleting a GGUF or mmproj warns you when a configured model still uses it, listing exactly what would break, without blocking you
Souls written by the right model: companion soul writing now respects the dedicated Soul Writer model in Settings instead of always using the character's chat model, so model A can do the roleplay while model B writes the soul

There's more in the full changelog. These are just the bits worth calling out.

If that sounds interesting, come and join our Discord server! It's the best place to follow updates, give feedback, and influence the future direction of the app.

AI usage (as requested by the subreddit moderation team): while developing LettuceAI, we used AIs such as Fable 5, Claude Opus 4.8 and GPT 5.5 to debug problems and discuss ideas. The coding and UI/UX design were carried out by humans.

Links:

Website: https://www.lettuceai.app/
Download: https://www.lettuceai.app/download
Full changelog: https://www.lettuceai.app/changelog
GitHub: https://github.com/LettuceAI/app
Discord: https://discord.gg/8eHDxEbRy4

reddit.com

u/Exact_Law_6489 — 15 hours ago

▲ 27 r/ChaiUnofficial

LettuceAI: open-source chat/RP, 2.0 update soon

Hello again everyone.

I'm the developer of LettuceAI, a free, open-source chat/RP app. I built it after getting burned one too many times by apps that changed the deal after I was invested: new limits, new prices, features moved behind a paywall. An open-source app like LettuceAI can't do that to you, because the code is public and the chats live on your device. Nobody can throttle, delete, or paywall what you already have.

What you get:

Chats stored on your phone or PC. No server wipe can touch them.
No ads, no subscription, no tiers. One version, free, open source.
No filter in the app. Whatever AI you connect is what answers you.
Bring your own provider, or run a model on your own PC for nothing.
Imports character cards (v1/v2/v3, PNG) from chub, JanitorAI, anywhere.
We support Windows, Mac, Linux, Android.

I've shipped a lot since the last build, so 2.0 is almost here. Here's what's new, with a bit more detail on each.

Faster local models

If you run the model on your own machine, 2.0 adds MTP speculative decoding: the runtime predicts several tokens ahead as a quick guess, then the main model verifies them in one pass. When the guess is right, you get those tokens for free, which is a real speed bump on supported models. It stops guessing the moment confidence drops so it never makes things slower, and it works either with models that have the prediction layers built in or with a small separate draft file the app downloads alongside. There's also a runnability score in the model browser that tells you how well a model will run on your RAM and VRAM before you sit through a 20GB download.

Companion mode: characters that actually change (opt-in)

https://preview.redd.it/ppvs5ibafo9h1.png?width=1722&format=png&auto=webp&s=f9d81057a0ccedef9197bd07e920d2150b738454

This is the big one, and it's a mode you choose per character at creation (or switch on later). Leave it off and characters behave exactly like normal roleplay, your existing chats are untouched. Turn it on and the character gets:

A soul that grows. The character has a set of identity blocks (essence, traits, backstory, appearance, goals, voice, fears, habits, boundaries, likes) that change at different speeds. Backstory never moves, core essence and traits shift only after sustained patterns, and small things like new favorites pick up quickly. As you talk, a growth engine quietly updates these from what actually happened, and periodically consolidates scattered changes back into the core so the character evolves coherently instead of contradicting itself.
A relationship that's earned. Closeness, trust, and affection are tracked on a scale that can go negative, so a character can genuinely warm up to you or cool off. New companions start guarded rather than instantly fond, negative shifts land faster than positive ones, and a betrayed trust takes longer to rebuild than it took to lose. There's also short-term tension that flares and fades.
A sense of time. Optionally, a companion can track real dates, and you can freeze or advance its clock for flashbacks or long-running arcs. Memories can be dated and show live relative time ("3 days ago"), and you can ask things like "what happened last week?"

Group chats where you direct who speaks

https://preview.redd.it/dtmq15vnno9h1.png?width=584&format=png&auto=webp&s=436d26c075262aa9dceeae88ec264cc09dc68fd4

Tap a character's avatar to choose who replies next, so you orchestrate the scene instead of getting random turn order. Pick someone before you send (the avatar gets a glowing ring), or send first and then tap anyone to make them respond instantly. The participants bar also lets you mute characters, set per-character mention, and tweak group appearance, plus there's group message search with jump-to-message, per-session author notes, and editable side widgets on desktop.

Branch tree: see every path you took

https://preview.redd.it/due8euuymo9h1.png?width=1722&format=png&auto=webp&s=91c5870d956326ae2673c5445416c21b7cb5bc7f

Every time you take a different path you make a branch, and now you can see all of them as a visual tree with the lineage drawn between them. Open any branch to continue from it, fork a new one from any exact message, or flip to compare mode to put two branches side by side and see where they diverged. No more guessing which "(branch)" was which.

Generate images on your own GPU

Avatar and scene images can now run through ComfyUI or Diffusers on your own GPU. Scene images can reference multiple images at once so a character stays on-model across generations, the results render inline right in the conversation, and there's a LoRA library you can attach per character and per persona.

And a lot of polish

Audio upload and playback with a dedicated library, a custom desktop title bar with selectable designs, configurable folders for your models, new providers, and the entire interface routed through a real translation system.

Some custom title bar designs

https://preview.redd.it/3j2pavl2oo9h1.png?width=1822&format=png&auto=webp&s=c06350d9f6c23caaf8e7de2d469b3e1826d43abe

https://preview.redd.it/a3t0xk24oo9h1.png?width=1822&format=png&auto=webp&s=237609c8921815d816fe15de93a76a0e2fd0cc3a

https://preview.redd.it/ua9v09w5oo9h1.png?width=1822&format=png&auto=webp&s=5880b3e300052bc7d1a566b75815ac9cdcc61001

If everything goes as planned, 2.0 releases this Saturday.

The app is free, the AI behind it is bring-your-own. Permanent free tiers cover casual chatting, pennies-per-message if you want the smartest models, or run it fully local for nothing. You always know exactly what you are paying for, and 2.0 keeps every byte on your device.

Downloads: https://www.lettuceai.app/
Source Code: https://github.com/LettuceAI/app

One person's project, not a company. If something breaks, I am the one who fixes it.

reddit.com

u/Exact_Law_6489 — 10 days ago

▲ 30 r/CitiesSkylines2

My island city

Sorry for the low-resolution images. My city is called Wayford and has a population of 75k.

The factory region is still work in progress.

The map is "Goto Islands" by ceej12 on Paradox Mods. Map link: https://mods.paradoxplaza.com/mods/80271/Any

u/Exact_Law_6489 — 14 days ago

▲ 11 r/LocalLLaMA

Which is the better local mobile TTS: Kokoro or Supertonic?

I saw a few posts saying that Kokoro is better, but they both sound pretty good in their demos. How good are they in production, though?

reddit.com

u/Exact_Law_6489 — 22 days ago

▲ 69 r/ChaiUnofficial

LettuceAI: open-source chat/RP, chats stay yours

Hello everyone.

I'm the developer of LettuceAI, a free, open-source chat/RP app. I built it after getting burned one too many times by apps that changed the deal after I was invested: new limits, new prices, features moved behind a paywall. Any app can do that to you. An open-source app like LettuceAI can't, because the code is public and the chats live on your device. Nobody can throttle, delete, or paywall what you already have, including me.

What you get:

Chats stored on your phone or PC. No server wipe can touch them.
No ads, no subscription, no tiers. One version, free, and open source means it stays that way.
No filter in the app. Whatever AI you connect is what answers you.
Imports character cards (v1/v2/v3, PNG) from chub, JanitorAI, anywhere.
Memory that holds. Long conversations where the character still knows your birthday.
Group chats, image generation, voice in and out.
Windows, Mac, Linux, Android.

The honest part: the app is free, the AI behind it is bring-your-own. There are providers with permanent free tiers that cover casual chatting, pay-per-message options that cost pennies if you want the smartest models, or you can run a model on your own PC for nothing. You pick, and you can always see exactly what you are paying for.

Small vocabulary:

Source code: the app's recipe, public for anyone to read. No hidden ads or tracking.
Open-source: anyone can copy the code, so the app can't quietly turn bad later.
Provider: the AI that writes the replies. You pick it, so you pick what it costs.
On your device: chats are files on your own phone or PC, not on a server overseas.

Downloads: https://www.lettuceai.app/ Source Code: https://github.com/LettuceAI/app

One person's project, not a company. If something breaks, I am the one who fixes it.

I added a screenshot from LettuceAI's Desktop version to this post.

u/Exact_Law_6489 — 24 days ago

▲ 0 r/buildapc

Mixing HyperX Fury HX432C16FB3/8 (CL16-18-18) with Kingston Fury Beast KF432C16BB/16 (CL16-20-20) at 3200MHz?

I currently have:

- 2x HyperX Fury HX432C16FB3/8

- DDR4-3200

- CL16-18-18

- 1.35V

I'm considering adding either one or two Kingston Fury Beast KF432C16BB/16 modules:

- DDR4-3200

- CL16-20-20

- 1.35V

System:

- ASUS PRIME B560M-A

- Intel i7-10700K

- RTX 4060 8GB (I know its unrelated)

I understand that mixed kits are never guaranteed and that the system will likely run at the loosest timings supported by all modules. However, I have a few questions before commiting:

Has anyone mixed these specific HyperX Fury and Kingston Fury Beast modules?
Were you able to run them at 3200 MT/s with XMP enabled?
Is there any chance the Fury Beast sticks can run at 16-18-18, or should I fully expect the system to fall back to 16-20-20?
Any stability issues with 4 DIMMs populated on a B560 motherboard and 10700K?

I also have a question about capacity vs timings.

My main workloads are:

- Rust/Java development

- Docker containers

- Games like Cities: Skylines II, Minecraft, Marvel Rivals etc.

- Occasional local AI/LLM workloads

How noticeable would the real-world difference be between running DDR4-3200 at 16-18-18 versus 16-20-20 for workloads like these?

Finally, which option would you personally choose?

32 GB (2x16) with tighter timings

48 GB (2x8 + 2x16) with potentially looser timings

For my use case, would the additional capacity outweigh any performance loss from looser timings?

My goal is to maximize RAM capacity without spending a fortune, which is why I'm hoping to keep using my existing RAM instead of replacing it with a completely new kit.

I'd also run MemTest86 before and after installation to verify stability.

Thanks!

reddit.com

u/Exact_Law_6489 — 1 month ago

▲ 20 r/ChatbotRefugees

LettuceAI 1.9 (Android) / 1.6 (Desktop) is live!

Hey everyone! I'm the developer of LettuceAI.

LettuceAI is an open-source, privacy-first, cross-platform AI chat app built for character chats, roleplay, and long conversations that actually stay coherent.

This release is the biggest chat-customization update so far. The short version: you can now build a custom widget panel next to your conversation, tune the whole look of a chat with a live preview, and get hands-on with what your companion remembers. On top of that there's a rebuilt model editor and a solid round of local-model reliability work.

What's new in 1.9 / 1.6

Chat Widgets & appearance

New Chat Widgets system: build a custom panel beside your conversation from composable pieces (character and persona info, scratch pad, image, stat tracker, memory, companion state, quick snippets, dice, session info, author note, and layout blocks). Edit in place with a sticky toolbar, drag-to-reorder, an add-widget picker, per-widget design variants, a real library image picker, and cross-column moves. Layouts are saved per character
Chat appearance moved into a side-anchored drawer you open from the chat header, with a live preview that updates as you tweak, a tabbed shared form, side-flip, and a message-actions entry
New desktop chat layout controls: column width, alignment, full-shell behavior, independent header and footer toggles, a center widget mode, and a draggable divider to resize the widget area. Group chats mirror the same settings
Optional author name and timestamp headers above messages
Optional per-message info: show the generating model, input/output/total token counts, time-to-first-token, and tokens/sec, each toggleable, with a choice of placement and text size
Chat background blur is now applied to the image directly, dropping the separate bubble-blur control for a cleaner result

Companion & memory

New companion memory tools: trigger a memory-processing cycle manually, watch it run with a progress bar and live output viewer (with cancel), and review and edit the generated context summary inline
Anti-loop dynamic memory: adjusted sampling to reduce repetition loops, with live visibility into generation
Companion relationship meters now show low and high anchor labels for context

Local models & performance

More reliable GPU offload: context sizing now accounts for layers offloaded to the GPU, so a model that runs fine with mixed CPU and GPU offload is no longer wrongly reported as too big to fit
Improved VRAM headroom estimates so context creation no longer fails with out-of-memory on partially offloaded models. The compute-buffer reserve is derived from model dimensions and batch size, and a context that hits OOM is retried smaller even when a KV cache type is set
llama.cpp now drops the existing model before reload, avoiding double-pinned VRAM
Sharpened runnability scoring with MoE active-path awareness, an expanded quantization table, KV cache quant types, and a repaired GGUF parser
Performance metrics (time-to-first-token and tokens/sec) are now saved with each message and shown in message details after a reload, in both direct and group chats
Smarter local-model thinking with a force-send thinking-state toggle and recognition of Gemma channel-style reasoning tags

Model editor & providers

Redesigned model editor: a flatter, box-free layout with unified section tabs and a runtime-report drawer, width-aware on desktop and clean on mobile
Added support for image-only OpenRouter models
The creation helper can now use llama.cpp models

Quality of life

Made the Help me Reply history window configurable so it can look back as far as you want
Added a direct Save action to the unsaved-changes toast
The chat settings drawer now saves and updates the session immediately after changing a value
The scroll-to-bottom button now tracks the composer height as it grows and anchors to the messages column when widgets are shown

Fixes & stability

Streaming messages now apply chat appearance settings (author name, timestamp header, etc.) while generating, instead of only after the message finishes
Fixed identity placeholders leaking into injected memories, lorebook entries, and summaries
Fixed the model selector's "only free models" toggle colliding with the title on mobile
Companion memory now allows companion categories on edit and stops placeholder leakage
Cleaned up orphaned memory embeddings and repaired the embeddings migration
Made the speech-recognition migration idempotent
Settings are now reloaded after successful syncs
The reset flow removes Whisper and Kokoro models
The group chat memories page now renders properly when a chat background image is used

There's more in the full changelog. These are just the bits worth calling out.

If that sounds interesting, come and join our Discord server! It's the best place to follow updates, give feedback, and influence the future direction of the app.

Links:

Website: https://www.lettuceai.app/
Download: https://www.lettuceai.app/download
Full changelog: https://www.lettuceai.app/changelog
GitHub: https://github.com/LettuceAI/app
Discord: https://discord.gg/8eHDxEbRy4

Chat Widgets feature in action

reddit.com

u/Exact_Law_6489 — 1 month ago

▲ 1 r/WhatIsMyCQS

What Is My CQS

reddit.com

u/Exact_Law_6489 — 2 months ago

▲ 5 r/ChatbotRefugees

AI Basics Day 10: VRAM math and quantisation, or how to tell if a model will actually fit on your card

Hello everyone!

Last time we looked at local LLM runtimes: what a runtime even is, why the model file and the program that loads it are two separate decisions, the six or seven runtimes people actually use (llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, vLLM, Apple MLX), why llama.cpp is faster than ollama in practice despite sharing the same engine, why vLLM is not what you want as a single user, and the runtime/UI separation that trips up almost every newcomer. The short version: pick a runtime that matches your use case, then point a separate frontend at it, then stop worrying about it.

For anyone who missed the earlier days:

Today we are doing VRAM math and quantisation: what quantisation actually is and what the Q4_K_M suffixes on HuggingFace mean, where the quality cliff is, how to actually compute whether a model will fit on your card before you waste two hours downloading 18 GB of GGUF, why DeepSeek and Gemma 4 break the usual KV cache math, and a cheat sheet of what realistically fits on every consumer VRAM tier from 8 GB to 48 GB.

Heads up: this is a long one. There is no way to talk about VRAM without also talking about quantisation, and there is no way to talk about either without enough vocabulary to make sense of HuggingFace filenames. If you would rather skim, jump to the cheat sheet near the end.

If you have ever stared at gemma-4-31b-it-Q4_K_M.gguf and wondered what any of that means, or downloaded a model that promised to run on your card and then OOM'd the moment you tried to load it, this post is for you.

(OOM = Out Of Memory)

What quantisation actually is

Models are stored as big arrays of numbers. Each number is a weight that the model multiplies inputs by, layer after layer, until tokens come out the other end. A small model has billions of these. A large model has hundreds of billions.

The natural way to store a number is 16-bit floating point (FP16 or BF16). That is the format models are typically trained in. Each weight takes 2 bytes. So a 7B-parameter model in FP16 is 7 × 2 = 14 GB on disk. A 70B is 140 GB. A 405B is 810 GB. None of this fits on a gaming card.

Quantisation is the trick of storing each weight in fewer bits than the model was trained with. Instead of 16 bits per weight, use 8. Or 4. Or 3. The numbers lose a little precision, the file shrinks proportionally, the model still mostly works.

The miracle of the field is that this works far better than it has any right to. A 4-bit quantisation of a model is about a quarter the size of the FP16 version, runs faster, fits on smaller cards, and usually performs almost identically on benchmarks. Below 4 bits things get rougher, but a Q4 of a frontier model is what almost everyone is actually running locally, and the gap to the full-precision version is generally small enough to ignore.

Think of it as JPEG for model weights. JPEG throws away detail your eye cannot see, and you get a tenth the file size with a picture that looks the same. Quantisation throws away precision the model does not need much of, and you get a quarter the file with a model that mostly behaves the same.

Decoding the filename

HuggingFace GGUF filenames look like Mistral-Small-3-24B-Q4_K_M.gguf or gemma-4-26b-a4b-IQ4_XS.gguf and the suffixes are doing real work. Once you know the pattern they are simple.

The format is roughly: Q<bits>_<scheme>_<size>.

The Q<bits> part is how many bits per weight. Q2 is 2-bit (tiny, rough), Q4 is 4-bit (the sweet spot for most users), Q5 is 5-bit, Q6 is 6-bit, Q8 is 8-bit (very close to full precision). Smaller number = smaller file = lower quality.
The K (or no K) is the quantisation scheme. K-quants ("k-quantisation") are the modern smart scheme that uses different bit allocations for different parts of the model: layers that matter most get more bits, layers that matter less get fewer. The older non-K schemes (Q4_0, Q4_1, Q5_0, Q5_1) treat every weight the same and are mostly obsolete. If you see a file without _K in it, it is probably legacy. Prefer K-quants where available.
The size suffix (_S, _M, _L, _XL) is small / medium / large / extra large within that bit level. Q4_K_M is "4-bit K-quant, medium size" — slightly bigger and slightly better than Q4_K_S. The differences are small. If you have the VRAM, pick _M over _S.

A separate family worth knowing about:

IQ-quants (IQ4_XS, IQ3_M, IQ2_S, etc.) are "imatrix quants". They use an importance matrix computed from running real data through the model to figure out which weights matter most, then allocate bits accordingly. At the same bit count, IQ-quants generally outperform K-quants. The catch is they can be a touch slower to run (more CPU work per token), so on weaker hardware they sometimes feel less responsive even though they are technically smarter.

So Q4_K_M = 4-bit K-quant, medium. IQ4_XS = 4-bit imatrix quant, extra-small. Q5_K_S = 5-bit K-quant, small. Q8_0 = 8-bit, legacy scheme. You can now read any GGUF filename.

The quality cliff

Roughly where the quality cost sits, from people running benchmarks on real models:

F16 / BF16 (100%): the reference. Almost nobody runs this locally below 7B because the file is huge for what you get.
Q8 (~99%): indistinguishable from F16 in almost every test. The "I have plenty of VRAM and want the best" choice.
Q6 (~97-98%): very close to Q8. A good "if it fits" tier.
Q5_K_M (~95-97%): the comfortable sweet spot. Visible quality, small loss vs the reference.
Q4_K_M (~93-95%): where most of this community actually lives. Cheap on memory, fast, only mildly worse than Q5. The default.
IQ4_XS (~93%): similar quality to Q4_K_M, slightly smaller, slightly slower on weak hardware.
Q3_K_M (~88-91%): noticeable degradation on small models. Acceptable on big ones (70B+).
Q2_K (~75-85%): rough on small models, surprisingly tolerable on very large ones.
IQ1 (~50-70%): experimental. Used to run massive models on absurdly little memory. Quality is not great.

The single rule of thumb most worth memorising: a Q4 of a 70B beats a Q8 of a 13B every time. When choosing between "smaller model at high quant" and "bigger model at low quant", bigger model wins almost always, down to about Q3 on big models. Below Q3, the cliff starts catching up.

For most people on consumer hardware, the answer is Q4_K_M, occasionally Q5_K_M if it fits. Going lower than Q4 is for stretching to bigger models. Going higher than Q5 is for showing off.

VRAM math from first principles

Now the actual math. Three components add up to your total VRAM use:

1. The model weights themselves. This is the biggest piece.

weight_bytes = params × bits_per_weight ÷ 8

A 12B at Q4 = 12,000,000,000 × 4 / 8 = 6 GB. A 24B at Q5 = 24,000,000,000 × 5 / 8 = 15 GB. A 70B at Q4 = 35 GB. A 35B-A3B MoE at Q4 = 17.5 GB (all 35B of weights have to be in memory, even though only 3B activate per token; more on this below).

2. The KV cache. This is the part nobody warns you about, and it can be huge with long context.

The KV cache stores, for every token in your context window, the key and value projections at every layer of the model. The size scales linearly with context length. The formula in its simplest form:

kv_bytes = 2 × layers × kv_heads × head_dim × context_length × bytes_per_value

The 2 is for K and V (two separate caches). The architecture (layers, heads, head_dim) depends on the model. The bytes_per_value is 2 for FP16 cache, 1 for Q8 cache, 0.5 for Q4 cache (yes, you can quantise the KV cache itself).

For a typical 12-13B at FP16 cache and 16k context, this is around 2-3 GB. At 32k context, 4-6 GB. At 128k context, double-digit GB. Long context is not free.

3. Overhead. Runtime workspace, activation buffers, scratch memory for matrix multiplications. A safe heuristic is max(5% of model size, 200 MB). So a 24B model carries maybe 1-2 GB of overhead.

Add it up, plus headroom.

total_vram = weights + kv_cache + overhead

And then leave about 10% of your card's VRAM unused for the OS, the runtime's allocator quirks, and the occasional spike. If your card is 12 GB, plan to use 10.8 GB. If your card is 24 GB, plan to use 21.6 GB. People who pack right to the limit get OOM kills at the worst moments.

Architecture wrinkles most calculators ignore

The above math is the textbook version, and it works for most models. But a few important architectures break it in ways that matter.

DeepSeek's MLA (Multi-Head Latent Attention). DeepSeek V3, V3.2, and V4 use a compressed representation for the KV cache, projecting it down to a low-rank latent space. In practice this means a DeepSeek model's KV cache per token is much smaller than its layer count would suggest. A naive textbook calculation will overshoot the real cache size by a factor of 4-8× for these models. Important thing is: this only applies to native-architecture DeepSeek models. DeepSeek distills, which are fine-tunes of other open-weight base models, inherit their base model's attention rather than MLA. For those use the standard formula.

Sliding-window attention (Gemma 4, Cohere Command-R, Mistral 7B). These architectures use a fixed attention window in most layers instead of letting attention span the full context. The KV cache for those layers is capped at the window size, not the context length. Gemma 4 in particular alternates local sliding-window layers with global full-context layers, with windows of 512 tokens on smaller dense variants (E2B / E4B) and 1024 tokens on the 26B and 31B. Gemma 4 also pairs this with a shared KV cache trick where the last N layers reuse key-value states from earlier layers. The combined effect: a Gemma 4 31B at its native 256k context uses dramatically less KV cache than a naive formula would predict, because most layers only need cache for the window. This is why Gemma 4 punches above its weight on long context.

Mixture of Experts (MoE). Models like Qwen3.5 35B-A3B or DeepSeek V3 are MoE: total parameters far exceed active parameters per token. A 35B-A3B has 35B of weights but only routes ~3B through any given forward pass. The memory side is size for the full 35B because all the experts have to be loaded and ready. The speed side is closer to a 3B model because only 3B of compute is happening per token. So MoE breaks the "model size implies speed" relationship most people start with.

You can do these calculations by hand. It is not hard. But it is the kind of math you do not want to do twice for the same model, and it is the kind of math where one wrong layer count and you have downloaded a model that does not fit.

A note on LettuceAI, because it is on-topic

Worth flagging since this is the VRAM-math post, and the math is exactly what this part of the app handles: I am the developer of LettuceAI, an open-source chat/RP app. The HuggingFace browser inside the app computes a runnability score for every GGUF it shows you, scaled 0-100 with labels (excellent / good / marginal / poor / unrunnable).

The score is the math from this post, wired up to your actual hardware: it pulls the layer count, head count, embedding size, and architecture from the GGUF metadata, computes weights + KV cache + overhead, and grades how well it will run on the RAM and VRAM you have, with 10% headroom built in. It knows about the architecture wrinkles in the previous section: DeepSeek MLA shrinks the cache automatically, Gemma 4's sliding window caps it at the window size, and MoE total/active split is handled. It also assigns a quality score per quant (F16=100, Q8=95, Q6=90, Q5_K_M=85, Q4_K_M=75, IQ4_XS=72, Q3_K_M=60, Q2_K=35, IQ1=15) and blends that with the fit score, so a Q4 24B and a Q8 7B do not both come out as "fits fine" when one is meaningfully smarter than the other.

It also flags GPU offload mode: full (everything fits in VRAM, blazing fast), nearFull (model fits, KV cache spills a bit), kvSpill / kvHeavySpill (more KV cache on RAM than VRAM, slower), or RAM-backed model with VRAM context (model on system RAM, context on GPU). The post below covers what those modes mean.

Mentioning it because it fits today's topic, not as a pitch. You can absolutely do this math by hand. Several other tools have similar features (LM Studio's "this model probably fits" indicator, koboldcpp's loader, various community VRAM calculators on the web). The source for LettuceAI's scoring function is open at src-tauri/src/hf_browser/mod.rs on the GitHub repo if you want to read the actual formulas, or steal them. Site is at https://www.lettuceai.app.

Back to the regular series.

The cheat sheet by VRAM tier

Rough realistic targets for each common consumer VRAM tier in mid-2026. Assumes Q4_K_M unless noted, and a reasonable RP context length (8k-16k). Each tier has more headroom than these examples; the goal is "comfortable" not "maximum".

8 GB VRAM (RTX 3060 8GB, RTX 4060, etc.): 7-9B dense models at Q4-Q5, with 8-16k context. Stheno 3.2 8B, Qwen3.5 9B Small, Llama 3.1 8B. Pushing to a 12B at Q3-Q4 is possible but tight. Long context will start spilling.
12 GB VRAM (RTX 3060 12GB, RTX 4070, RTX 3080 10/12GB): 12B dense at Q5, 14-15B at Q4 with 16k context. Rocinante-X-12B, Mistral Nemo 12B fine-tunes, Snowpiercer-15B at Q4. Partial offload of a 24B is possible but slow. The first tier where RP feels comfortable on a fine-tune.
16 GB VRAM (RTX 4060 Ti 16GB, RTX 4080, M-series 16GB unified): 14-15B at Q5-Q6, 24B at Q4 with comfortable context. Mistral Small 3 24B at Q4_K_M is the natural target. Gemma 4 26B-A4B MoE works well here because of the sliding-window cache trick.
24 GB VRAM (RTX 3090, RTX 4090, M-series 24-32GB): the sweet spot tier. 24B at Q5-Q6, 27-32B at Q4, Qwen3.5 35B-A3B MoE happily, Gemma 4 31B dense at Q4 with reasonable context. Most heavy RP users live here.
32 GB VRAM (RTX 5090): sits between the 24 GB and 48 GB tiers. 32B at Q5-Q6, Gemma 4 31B at Q5, the Qwen3.5 35B-A3B MoE with room for long context, and a partial-offload path into 70B at Q4 if you accept some spillover.
48 GB+ (2x RTX 3090, RTX 6000 Ada, Mac Studio, etc.): 70B at Q4-Q5 (Midnight Miqu 70B, Midnight Rose 70B, Llama 3.3 70B fine-tunes). The very large open MoE models start to be reachable at low quant if you have 64-128 GB unified memory or stack of cards.

A few rough patterns the cheat sheet reflects:

Each VRAM tier moves you up about one model-size class at the same quant.
Dropping from Q5 to Q4 saves roughly 20% on the weights (5/8 vs 4/8 bits per weight), which is enough to push you up half a model class on most setups, not a full class.
MoE models are weird: a 35B-A3B still costs you the full 35B of weights in memory (every expert has to be loaded and ready, even if only one routes per token). What you get is the speed of a 3B forward pass, not the memory of one. Expert-offload tricks exist in some runtimes but trade away most of the speed advantage.

Context length is part of the math

The big gotcha that catches everyone: context length is part of VRAM cost. Doubling your context window roughly doubles your KV cache.

Concretely, on a typical 13-14B model:

4k context: ~600-800 MB of KV cache (FP16)
8k context: ~1.2-1.6 GB
16k context: ~2.4-3.2 GB
32k context: ~5-7 GB
64k context: ~10-14 GB
128k context: ~20-28 GB

This is why people say "my 14B fit yesterday but won't load with 32k context today." Yesterday they used 4k context, today they bumped it to 32k, and the model itself did not get bigger but the KV cache grew by 4 GB.

The first thing to do: set the context length to what you actually use. The runtime allocates KV cache for the maximum context length you set, even if your current chat is 200 tokens. If you do not actually run 32k-token conversations, do not allocate for them. This costs you nothing and is often the difference between a model fitting and OOMing.

The second, much more powerful tool is the next section.

KV cache quantisation, the most underrated VRAM trick

You can quantise the KV cache itself, separately from the weights. This is the single biggest thing most people are not doing that they should be.

By default the KV cache is stored in FP16 (2 bytes per K or V value). Most modern runtimes also support Q8_0 (1 byte) and Q4_0 (0.5 bytes) cache types. In llama.cpp the flags are --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0). koboldcpp, LM Studio, and ollama expose the same option through their UIs.

What it buys you:

Q8 KV cache: halves your KV cache size, with a tiny quality cost most people cannot detect in normal RP. Effectively free.
Q4 KV cache: quarters your KV cache size, with a more noticeable quality cost on long-context coherence. Worth it on tight VRAM, especially when paired with a higher-quant weights set.

The numbers from the earlier table get cut by 2× or 4× when you switch the cache to Q8 or Q4. That 14B at 32k context dropping from ~5-7 GB of KV cache to ~2.5-3.5 GB of cache is the difference between "barely fits" and "comfortable with room to spare."

Two practical caveats:

In llama.cpp, Flash Attention must be enabled (--flash-attn) for Q8/Q4 KV cache to work. Most consumer-grade frontends turn this on by default; check yours.
Some architectures with custom KV layouts (notably DeepSeek's MLA models, anything with sliding-window quirks) can have compatibility issues with the more aggressive KV quant types. If a model fails to load with Q4 cache, fall back to Q8 or F16.

Combining a sensible context length with Q8 KV cache covers 90% of the "this model does not quite fit" problem. It is the first thing to try before you start dropping to a smaller weight quant or a smaller model.

Not everyone has an RTX 5090 or a maxed-out Mac Studio

The cheat sheet above assumes you have a real discrete GPU. Plenty of people do not. Here is what the rest of the local-LLM world actually looks like, and what is realistic on each kind of setup.

Pure CPU + system RAM

If you have no usable GPU but a decent CPU and reasonable amounts of DDR4 or DDR5 RAM, you can still run models. The constraint is not capacity (RAM is cheap), it is memory bandwidth, which is what bottlenecks token generation on CPU.

Rough realistic targets on a modern x86 CPU (Ryzen 7000-series, Intel 13th-gen+) with 32-64 GB of fast RAM:

7-9B at Q4_K_M: 5-10 tokens/sec on DDR5. Slow but usable for low-volume chat.
12-14B at Q4_K_M: 3-6 tokens/sec. Watchable, not snappy.
24B at Q4_K_M: 1-3 tokens/sec. Painful for interactive RP, fine for long-form generation you read after the fact.
70B+: technically possible if you have 64-128 GB RAM, but speeds drop to under 1 token/sec. Not practical for chat.

DDR5 versus DDR4 makes a real difference here. A DDR5-6000 system can be roughly 2× faster than the same CPU on DDR4-3200 for LLM inference, because it is bandwidth-bound. AVX-512 (where supported) helps too.

Apple Silicon (unified memory)

This is the underrated budget option for serious local LLMs. The M-series Macs use unified memory, meaning the same RAM is available to both CPU and GPU with no copy needed. A Mac with 32 GB unified RAM can run things that would need a $1000+ NVIDIA card on the PC side.

M1/M2/M3/M4 Pro with 32 GB: comparable to a 16 GB discrete GPU in practice. Runs 24B at Q4 comfortably.
M3/M4 Max with 64 GB: handles 32B at Q5, the Qwen3.5 35B-A3B MoE, Gemma 4 31B.
M3/M4 Ultra Mac Studio with 128+ GB: 70B comfortably, and you can stretch into the very large MoE territory at low quant.

Apple Silicon LLM speed scales with the chip's memory bandwidth, which is dramatically higher than typical DDR. An M4 Max at 64 GB is genuinely competitive with a 24 GB NVIDIA card for inference, at lower total system cost.

Integrated GPU + shared system RAM

Modern AMD APUs (Ryzen AI 7000/8000 series with Radeon 780M / 890M iGPU) and Intel Arc iGPUs can run small models with their integrated graphics using a chunk of system RAM as VRAM. This is the "I have a laptop with no discrete GPU" tier.

Realistic targets: 7-9B at Q4 with 4-8k context. Speeds around 8-15 tokens/sec for the better APUs. Anything bigger than 12B starts to feel painful.

Mixed offload (small GPU + lots of RAM)

If you have a low-end discrete GPU (say 6-8 GB) plus a lot of system RAM (32-64 GB), the standard approach is mixed offload: put as many of the model's layers as fit on the GPU, keep the rest on CPU/RAM. The setting is usually called n_gpu_layers or --ngl, and it is a number telling the runtime how many of the model's layers to put on GPU.

This is how a lot of people run models that "should not fit." Each token still has to bounce between GPU and CPU once per offloaded boundary, and CPU layers are slower than GPU layers, so the more you spill the slower it gets. Rough speed-vs-spillover picture:

Full GPU offload (model + KV + everything in VRAM): full speed, maybe 30-100 tokens/sec depending on model and card.
Slight spillover (~10-20% on CPU): 60-80% of full speed. Still snappy.
Moderate spillover (~30-50% on CPU): 20-40% of full speed. Noticeable but fine for chat.
Heavy spillover (50%+ on CPU): 5-15% of full speed. Slow, but if you do not mind waiting for responses you can run much bigger models than your VRAM alone would allow.
All CPU (no GPU layers): falls back to the pure-CPU numbers above.

There is no right answer here, just a tradeoff. If you want a big model for the quality and you do not mind generation taking a while, heavy spillover is a perfectly valid choice; plenty of people happily run a 70B at 2-3 tokens/sec because the writing is worth the wait. If you want snappy interactive chat, keep most layers on the GPU and pick a smaller model that fits.

Practical recipe: most runtimes pick a sensible default ngl for you based on your VRAM. If you tune manually, start from "all layers on GPU" and decrease until it fits, leaving 10% headroom. Do not start from zero and increase, you will under-utilise your card. A 12-14B model on a 6 GB card with 32 GB RAM lands around 10-20 tokens/sec with this approach. A 70B on the same hardware can land around 2-4 tokens/sec if you are patient.

What none of these are good for

Models above ~70B parameters (DeepSeek V3/V4, Qwen3.5 122B+, Kimi K2.5, etc.) are out of reach for any of the setups above without serious server-tier hardware. If you want to use those models, the answer is BYOK to a hosted provider, not local. We covered this tradeoff in Day 8.

Tomorrow (or whenever)

Day 11 will be sampling settings for local models: the part where local actually diverges from cloud APIs. We covered temperature and top-p in Day 2 as the basics, but local runtimes ship a whole zoo of modern samplers (min-p, DRY, XTC, dynamic temperature, smoothing factor, top-A, mirostat) that do not exist on the OpenAI or Anthropic side, and most of them genuinely help RP quality once you understand what they are doing. We will go through what each one does, when it helps, when it hurts, and the small set of presets that cover 95% of real use.

That's all for today. I hope this helps!

reddit.com

u/Exact_Law_6489 — 2 months ago

▲ 3 r/ChatbotRefugees

AI Basics Day 9: What is a local LLM runtime, and which one should you actually use?

Hello everyone!

Last time we looked at BYOK and local LLMs at a high level: the three tiers (app-bundled, BYOK, local), what each one actually changes about cost, privacy, content rules, and reliability, why OpenRouter is the on-ramp most people take into BYOK, and why so much of this community has been drifting toward local over the last year. The short version: bundled apps trade convenience for control, BYOK trades flat billing for model choice and fewer rules, and local trades hardware money for unlimited use plus actual privacy.

For anyone who missed the earlier days:

Today we are getting into local LLM runtimes: what a runtime actually is, why the model file and the program that loads it are two separate decisions, the six or seven runtimes people actually use, and how to pick one without spending a weekend reading benchmarks.

If you have ever tried to "install a local model" and ended up with five tabs open trying to figure out whether you want ollama, LM Studio, koboldcpp, llama.cpp, oobabooga, or vLLM, this post is about that.

What a runtime even is

A local model is just a file. A big file, usually tens of gigabytes, full of weights and metadata, but conceptually just a file sitting on your disk. The file does nothing on its own. It is not a program. It cannot answer questions by itself any more than a .pdf can read itself out loud.

A runtime is the program that loads the file, takes a prompt, runs the math (the forward passes through every layer of the model), and gives you back tokens. It is the thing that turns "weights on disk" into "model you can talk to."

The important consequence: the model and the runtime are two separate decisions. The same GGUF file of, say, Qwen3.6 27B can be loaded by llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, or half a dozen other tools, and you will get more or less the same outputs out the other end. Picking a model and picking a runtime are not the same question. Most beginner confusion comes from people treating them like one decision when they are two.

A short word on formats

Local models come in several formats. We will do a full deep-dive in a later post, but you need the one-paragraph version to make sense of the rest of today's post.

GGUF is the dominant consumer-local format. It is the format llama.cpp uses, which means by extension it is what ollama, LM Studio, koboldcpp, and most consumer-facing tools use. GGUF files are CPU-runnable, GPU-runnable, Mac-runnable, partial-offload-runnable, quantised down to small sizes, and easy to share. If you have ever downloaded a model off HuggingFace and the filename ended in .gguf, that is what we are talking about.

Other formats exist. safetensors is the raw, unquantised format models are usually first released in, used by transformers/vLLM/exllamav2. EXL2 and AWQ and GPTQ are GPU-only quantised formats with their own runtimes, generally faster on pure-GPU setups but less portable. MLX is Apple's native format for M-series Macs.

For most people in this community, the answer is GGUF, and the runtimes below all run GGUF. Where I mention a non-GGUF tool, I will flag it.

The runtimes people actually use

Six or seven names cover well over 95% of local LLM setups. Here is what each one actually is, what it is for, and who should bother.

llama.cpp

The engine. Written in C++, runs on basically anything (CPU, NVIDIA, AMD, Intel, Apple Silicon), and almost every consumer-facing local LLM tool is built on top of it. Pure command-line and HTTP server, no GUI of its own. If you run llama.cpp directly, you get raw control, the latest features the day they ship, and you also get to read the docs.

Most people do not run llama.cpp directly. They run something built on top of llama.cpp, which is most of the rest of this list.

Use it directly if you want maximum control, do not mind a CLI, and want zero abstraction between you and the model. Otherwise, you are still using it; you are just using it through someone's wrapper.

ollama

The friendly llama.cpp wrapper. ollama pull qwen3.5:27b and you have a model downloaded and a server running. The whole "find a GGUF, download it, configure context length, write a server config" pipeline collapses into one command. The defaults are sensible.

What you give up is fine control. ollama has its own model naming convention, its own preset settings per model, and a slightly opaque approach to where files live and how they are configured. Power users sometimes find this frustrating. Newcomers love it precisely because that frustration is hidden.

Use it if you want a model running in five minutes and you mostly just want the API endpoint to plug into another app.

LM Studio

The desktop GUI. Built on llama.cpp (and MLX on Mac). Browse models from inside the app, click download, click load, click chat. Also exposes an OpenAI-compatible API server so you can use it as a backend for other tools.

LM Studio is the easiest entry point for non-technical users by a wide margin. The main complaints: closed-source (you cannot audit or fork it), and the model browsing/filtering can lag behind what is actually new on HuggingFace.

Use it if you want a polished GUI, you do not care about closed-source, and you want to browse-click-chat instead of dealing with terminals.

koboldcpp

The RP-focused fork of llama.cpp. Long-standing favourite in the AI dungeon / RP community because it ships with features that matter for long-context chat: context shifting (so the model does not have to re-process the whole conversation every turn when you hit the context limit), built-in lorebook and world info support, a web UI tuned for storytelling, and a SillyTavern-compatible API endpoint.

If you came from AI Dungeon, NovelAI, or have been doing RP locally for any length of time, koboldcpp is probably what you already use or what your friends recommended.

Use it if RP is your primary use case and you want a runtime that has been thinking about RP-specific problems for years.

text-generation-webui (oobabooga)

The kitchen sink. One install gives you support for multiple backends (llama.cpp, exllamav2, transformers, and others), every sampler under the sun, a chat UI, a notebook UI, and a million extensions. Has been around longer than most of the others. The UI looks a bit dated, the experience is overwhelming if you just want to chat, but if you want to compare backends or run an obscure sampler, this is the tool.

Use it if you are the kind of person who reads every option in a settings menu and you want the most flexible single tool.

vLLM

The server-grade runtime. Built for serving many users at once, with batched throughput, paged attention, and the kind of performance you want if you are running a model behind a real product. Linux + NVIDIA only in practice. Uses safetensors, not GGUF. Not aimed at single-user local chat, but worth knowing exists.

Use it if you are running a model for multiple users, or you have serious server hardware and want maximum throughput. Skip it for a personal RP setup.

Apple MLX and MLX-based tools (LM Studio MLX backend, Ollama MLX, mlx-lm)

For Mac M-series specifically. Apple's own ML framework, runs models natively on the unified memory architecture, fast on M2/M3/M4 hardware. The MLX format is growing, more models are getting MLX conversions every month, and the performance gap over GGUF-on-Mac is real on bigger models.

Use it if you have an M-series Mac with serious RAM (32GB+) and you want the best local performance that hardware can give you.

Honourable mentions

TabbyAPI / exllamav2. GPU-only, very fast, uses EXL2 format. Power users with one big NVIDIA card who want every token per second they can squeeze out.
MLC LLM. Cross-platform (including mobile, Vulkan, WebGPU), interesting for niche cases but most people do not need it.
Jan / GPT4All / etc. Various GUI-first apps that bundle a runtime. Same shape as LM Studio, smaller ecosystems.

Picking one without overthinking it

Most people are deciding between three or four of these. Quick prose flowchart:

"I have never run a local model before, I want it to just work." → LM Studio. Browse, download, chat. If you are on a Mac, even more so.
"I want a server running in the background that other apps can talk to." → ollama. The API-first design is exactly what you want.
"I do RP and I want a runtime that takes RP seriously." → koboldcpp.
"I want full control, I am happy in a terminal, I want the new features the day they ship." → llama.cpp directly.
"I want one tool that can do everything and I do not mind the complexity." → oobabooga.
"I am on a Mac and I want maximum performance." → LM Studio with MLX backend, or mlx-lm directly.
"I am running a model for more than one person." → vLLM.

You can change your mind later. Models are portable. Switching runtimes is an afternoon at most.

The runtime/UI separation, which trips up almost everyone

This is the part newcomers miss most often, and it is worth being explicit about.

Almost every runtime above (llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, vLLM) exposes an OpenAI-compatible HTTP API. That is a technical way of saying "the runtime acts like the OpenAI API, just at localhost instead of api.openai.com." Any chat app that can talk to OpenAI can talk to it.

What this means in practice: you do not pick "ollama OR SillyTavern." You run ollama as a server in the background, and you run SillyTavern as the chat UI on top of it, pointed at http://localhost:11434/v1. SillyTavern handles the UI, the cards, the prompt building. ollama handles the model. Two programs, two roles, talking to each other on your machine.

Same shape if you use LM Studio's server, koboldcpp's API, oobabooga's API, vLLM, llama.cpp's server, anything. The UI you chat in does not have to be the program running the model. Most heavy local-LLM users have a runtime running quietly in the background and a separate frontend they actually look at.

If you are coming from cloud apps where "the app" and "the model" felt like one thing, this split is the single most important mental shift to make. It is also why people who say "I tried local and it was too hard" usually got stuck on this exact point.

A note on LettuceAI, because it is on-topic

Worth flagging since this is the runtime post: I am the developer of LettuceAI, an open-source chat/RP app. One of the choices I made was to bundle llama.cpp directly into the app, so the runtime-and-UI split above does not apply if you use it. Same engine as koboldcpp and ollama and LM Studio under the hood, just shipped pre-wired so you do not have to install a separate runtime, start a server, paste an endpoint into a frontend, and so on.

There is also a HuggingFace GGUF browser inside the app, so you can search and download models without leaving it. If you do not have local hardware, BYOK to OpenAI/Anthropic/DeepSeek/etc. works in the same UI.

Mentioning it because it fits today's topic, not as a pitch. ollama + SillyTavern is a perfectly good setup. So is LM Studio on its own. So is koboldcpp + SillyTavern. LettuceAI is one option among several, and the right answer depends on whether you want one combined app or two separate ones.

Back to the regular series.

The thing nobody tells you about runtime speed

Different runtimes have different performance characteristics, and the gaps are bigger than you might expect. A few things that are true across most setups:

exllamav2 > llama.cpp for single-user GPU-only throughput. If you have one big NVIDIA card and only care about tokens per second, EXL2-format models on exllamav2 will usually beat GGUF on llama.cpp by a noticeable margin.
llama.cpp > ollama in practice, even though they share the same engine. ollama wraps llama.cpp and adds its own server, defaults, and abstractions on top, which adds real overhead. If you run llama.cpp's own server directly, you generally get faster generation and lower memory use than the same model under ollama. ollama wins on ergonomics, not raw speed.
llama.cpp ≈ koboldcpp ≈ LM Studio for GGUF performance. These three are thinner wrappers and stay close to the engine's actual speed. Differences come from default settings (context length, batch size, sampler choices) more than the runtime itself.
MLX > GGUF on Apple Silicon at larger model sizes. The gap is real but only matters once you are running 27B+ models.
CPU inference is much slower than GPU, but not unusable for small models. A modern CPU can run a 7B–12B model at a readable pace. Once you cross into 24B+, you really want GPU offload.

Why vLLM is probably not what you want as a single user

vLLM keeps coming up in benchmarks as "the fastest", so it is worth being explicit about why most of you should still skip it.

vLLM is optimised for serving many requests at once. Its big tricks (paged attention, continuous batching) shine when there are dozens or hundreds of concurrent conversations sharing one GPU. For a single user typing one message at a time, those tricks do basically nothing. You see throughput numbers in benchmarks because the benchmarks are running 100 requests in parallel, not because the next token comes back faster for you specifically.

On top of that:

It is Linux + NVIDIA in practice. Mac, AMD, Intel users are out.
It is safetensors-first, not GGUF. The local-LLM ecosystem (HuggingFace GGUF repos, every consumer tool) is mostly GGUF. Running vLLM means rethinking how you get models.
It pre-allocates a lot of VRAM for KV cache, which is great when that cache is shared across many users and wasteful when it is just you.
Setup is closer to deploying a server than running an app. Docker, CUDA versions, model conversion, the whole stack.
Single-stream generation latency is often worse than llama.cpp or exllamav2 on the same hardware, because vLLM is not optimising for that case.

If you are one person on one consumer GPU running one chat at a time, the right answer is almost always llama.cpp, exllamav2, or one of the GGUF wrappers, not vLLM. vLLM is the right tool when you have ten friends, or a real product, sharing a card.

If you are getting bad speeds, the first thing to check is not the runtime. It is whether you actually got the model onto your GPU. Partial GPU offload (some layers on GPU, some on CPU) is much slower than full GPU, and full CPU is much slower than partial GPU. We will cover this properly in the VRAM math post.

What about the model itself?

Briefly, because day 10 will do this properly: a "good" runtime running a "bad" model is a bad experience. A "rough" runtime running the right model is usually a great experience. The runtime is the easier decision; the model is the one that actually determines whether your chats feel good.

Most of this community in mid-2026 is running one of two things.

Base/instruct models for general intelligence, picked for what actually fits on consumer hardware: Qwen3.5 (9B small, 27B dense, or 35B-A3B MoE if you have the VRAM) is the broadly-recommended default. Gemma 4 (26B MoE, 31B dense, both new as of April 2026) is the buzzy new release. Mistral Small 3 (24B) is the steady mid-size workhorse.

The bigger open models people talk about online, GLM-4.7 (300B+ MoE), DeepSeek V3.2 (671B / 37B active MoE), and the new DeepSeek V4 (284B Flash / 1.6T Pro), are not really "local" in the consumer sense. You can technically run them if you have multiple GPUs or a server with a lot of RAM, but for the typical 8-24GB-VRAM gaming PC, those are BYOK / API models. Worth knowing they exist, not worth planning your local setup around.

RP-tuned community fine-tunes for prose quality and looser refusals, which is where most of the actually-good RP experience lives. The names that come up over and over on r/SillyTavernAI and r/LocalLLaMA:

L3-8B-Stheno-v3.2 is the 8GB-VRAM tier favourite.
Rocinante-X-12B and Snowpiercer-15B (Mistral Nemo based) are the 12-16GB tier favourites for adult RP and complex characters.
Dan's PersonalityEngine v1.3.0 (24B) is the current generalist RP pick at that size.
Midnight Miqu 70B v1.5 and Midnight Rose 70B v2.0.3 are the high-VRAM tier favourites, both heavily focused on prose quality.
MythoMax-L2-13B is the elder statesman, still pulled tens of thousands of times a month for being reliable.
The long tail of Drummer, SicariusSicariiStuff, and friends on HuggingFace ships a new tune most weeks.
Abliterated variants (refusals surgically removed from the weights, see day 5) of the base models above are very popular in this scene.

Basically: base models give you the best general intelligence, RP fine-tunes give you better prose and looser refusals. Most heavy RP users end up running a fine-tune or a uncensored model. We will dig into model selection properly in a few days.

Day 10 will be VRAM math and quantisation: how to figure out what your hardware can actually run before you spend two hours downloading a model that will not fit, what "Q4_K_M" and friends mean on a HuggingFace filename, what quantisation actually does to a model's quality, and where the sweet spots are for different hardware tiers. That is the post that turns "I have a 12GB card" into "I can comfortably run a 12B fine-tune at Q5_K_M with 16k context, or a 24B at Q4 if I tighten the context window."

That's all for today. I hope this helps!

reddit.com

u/Exact_Law_6489 — 2 months ago

▲ 9 r/ChatbotRefugees

AI Basics Day 8: What does BYOK and "running your own model" actually mean, and why is half the community migrating that way?

Hello everyone!

Last time we looked at reasoning models: how RLVR training lets models develop their own internal "thinking" style, why DeepSeek R1 was a landmark open release, why closed labs started hiding their chains of thought, and why writing "think step by step" into a system prompt actively hurts a modern reasoning model. The short version: the model already knows how to reason, you cannot improve on its learned policy with hand-written instructions, and your job is to give it good targets, not to choreograph how it thinks.

For anyone who missed the earlier days:

Today we are starting the BYOK and local LLM arc: what it actually means to bring your own key or run your own model, the three tiers people pick between, and why so much of this community has been quietly migrating away from app-bundled subscriptions over the last year.

This is the orientation post. The deeper technical pieces (VRAM math, quantisation, GGUF, llama.cpp vs ollama vs LM Studio vs vLLM, what hardware runs what) we will cover over the next several days, because there is way too much to fit in one post.

The three tiers, in plain terms

Almost every way you can run a chat or RP model today falls into one of three tiers. The names vary, the lines blur a little at the edges, but the shape is consistent.

Tier 1: App-bundled. You sign up for an app (Character.ai, Janitor, Chub, c.ai, JanitorAI's bundled tier, etc.), pay a subscription or use the free version, and chat. The app picks the model, hosts the inference, writes the system prompt, sets the content rules, and bills you. You do not see the model, you do not pick the model, you do not control the prompt below your character card.

Tier 2: BYOK (Bring Your Own Key). You sign up for an API provider (OpenAI, Anthropic, DeepSeek, Google, OpenRouter, Mistral, etc.) and get an API key. You paste that key into the chat app of your choice (SillyTavern, Chub's Mercury, KoboldAI Lite, JanitorAI's BYOK mode, etc.). The app still handles the UI and the prompt building, but the actual model calls go through your key, billed to your account, on whatever provider you chose. The model is still running on someone else's servers, you are just paying for it directly.

Tier 3: Local. The model weights live on your machine. You download them, you load them into a runtime (LM Studio, ollama, koboldcpp, llama.cpp, text-generation-webui, etc.), and inference happens on your own CPU/GPU. No network call to anyone. No per-token bill. The only ongoing cost is electricity, and the upfront cost is whatever hardware you needed to fit the model.

Most people start at Tier 1, get frustrated, move to Tier 2, and then a meaningful fraction eventually some drift toward Tier 3 once they have the hardware or the patience for it.

What "BYOK" actually changes

The word "BYOK" gets thrown around a lot, but the practical changes when you move from a bundled app to BYOK are pretty specific.

Billing flips from flat to per-token. Instead of $X/month for unlimited (or rate-limited) chat, you pay per million tokens, in and out. For most RP usage this is cheaper if you talk casually and more expensive if you run 32k-context sessions all day on a premium model. DeepSeek V3 and the cheaper Gemini tiers are pennies per long session. GPT-5 or Claude 4.7 on long contexts adds up faster.

You pick the model. This is the big one. Bundled apps usually offer one or two models, sometimes a "premium" toggle. With BYOK you can swap between DeepSeek V3.2, Claude 4.7, GPT-5, Gemini 3, Kimi K2.5, GLM-4.7, Mistral Large, and dozens of fine-tunes hosted on OpenRouter (Or other platforms), all from the same chat UI. Different models suit different scenes, and being able to switch mid-chat is genuinely useful.

Content moderation depends on the provider, not the app. This is the most misunderstood part. The chat app's ToS no longer governs what you can write, because the app is not hosting the inference. Whatever rules the API provider has are what apply. OpenAI and Anthropic have strict usage policies and will rate-limit or ban accounts for sustained policy violations. DeepSeek, Mistral, and most open-weights-via-API providers are far more permissive in practice. OpenRouter sits in the middle and depends on which underlying model you route to. None of this means "no rules at all." It means the rules move from "the app I signed up with" to "the lab whose model I'm using."

One thing worth noting: some BYOK frontends add their own filter on top, and you cannot turn it off. A handful of hosted "BYOK" wrappers screen prompts on the way out even though you are paying the provider directly. If you specifically went BYOK to escape app-level filtering, check the app actually passes your prompt through cleanly before assuming it does. The self-hosted frontends (SillyTavern, KoboldAI Lite) do not add filtering of their own; some browser-based ones do.

Privacy shifts, but does not vanish. Your prompts and replies are now visible to the API provider you chose, not the app you typed them into. For most providers, business-tier API traffic is not used for training and is retained only briefly, but each provider has different defaults, and "the API provider can see your chats" is still true. BYOK is not the same as local.

Also worth being explicit about: some providers will train on your prompts even when you are paying for the API, unless you opt out (and sometimes there is no opt-out on the cheaper tiers). The cheaper Chinese provider tiers and Google's free Gemini tier are the well-known examples; OpenAI and Anthropic's standard API tiers do not train on your traffic by default, but their consumer chat products do. Read the privacy policy of the specific tier you signed up for. "I paid for it, so they can't use it" is not a safe assumption.

Rate limits and reliability change. Bundled apps queue you behind everyone else on their plan. BYOK gives you your own quota with the provider, which is usually much higher and much more reliable for heavy users. The flip side is that if the provider has an outage, your chat is down until they fix it.

You become the integrator. When something breaks, the chat app blames the API and the API blames the app. There is no single support line. For most people this is a minor annoyance. For some it is the reason they bounce back to a bundled app.

What "local" actually changes

Tier 3 is a bigger jump than Tier 2, and the tradeoffs are different in kind, not just in degree.

Per-token cost goes to zero. Once the model is on your disk, you can generate as many tokens as you want for the price of electricity. No quota, no surprise bill, no provider rate limit. For heavy users this is the single biggest reason to go local.

Privacy goes to actually-private. Not "the provider promises not to train on it." The prompt never leaves your machine. For RP content specifically, this matters to a lot of people. It is also why a meaningful chunk of this subreddit ended up local: the content policy debate stops being relevant when the content does not leave the room.

No content rules, period. This is the part a lot of people care about most and the part nobody else can offer. There is no ToS, no provider filter, no app-level moderation. You can run uncensored fine-tunes, abliterated models (day 5 callback: the safety layer surgically removed from the weights), or community RP-tuned models that would get your account flagged on a hosted API. For adult RP specifically, this is the only tier where the question "is this allowed?" simply does not exist. You picked the model, you run the model, you decide what it does. The flip side is that there is also nobody stopping you from generating things you would regret, so the responsibility is fully yours; but for consenting-adult RP between you and your own machine, local is the only place where the answer is unconditionally yes.

Hardware becomes the bottleneck. This is the catch. Modern strong models are large. Frontier-quality models (DeepSeek V3.2, Kimi K2.5, GLM-4.7) are hundreds of billions of parameters and you are not running those on a gaming PC. What you can run locally is the 7B–70B range, with the 12B–32B sweet spot being where most local users live. These are not frontier models, but they are surprisingly capable, especially for RP, and they have closed a lot of the gap over the last year.

Setup is a real one-time cost. Picking a runtime, picking a model, picking a quantisation, getting it to load without OOMing, configuring context length, plugging it into SillyTavern or your UI of choice: this is an evening of work the first time, and ten minutes every time after that. Not hard, but not zero either. We will spend most of next week walking through this part.

Quality ceiling is lower, but not by as much as people think. A 32B local model is not Claude Opus 4.7. It is also not the 2023 disaster people remember. Modern local models in the 24B–70B range, especially the newer Qwen3.5 and Gemma 4 based fine-tunes and the Mistral Small 3 family, are good enough that a lot of users genuinely prefer them for RP over premium APIs. The difference is biggest on long, complex reasoning tasks and smallest on character-driven dialogue.

The middle road: OpenRouter

Worth its own short section, because OpenRouter is the on-ramp a lot of people use between Tier 1 and full BYOK.

OpenRouter is a single API that proxies to dozens of underlying providers. You get one API key, one bill, and access to OpenAI, Anthropic, Google, DeepSeek, Mistral, xAI, Cohere, plus a long tail of open-weights models hosted on Together, Fireworks, DeepInfra, Novita, and others. Models you would otherwise need five separate accounts to use are all one dropdown away.

Why people start there:

One key, many models. You can switch from Claude 4.7 to DeepSeek V3.2 to a Qwen3 fine-tune without re-pasting API keys.
Pay-as-you-go without provider minimums. Some providers (Anthropic in particular) have business-account friction. OpenRouter is a credit-card top-up away.
Free tier on a rotating selection of models. OpenRouter has historically offered a free tier for some open-weights models with daily rate limits. Quality and availability fluctuate, but for "try BYOK before committing" this is the cheapest possible entry point.
Decent moderation posture. OpenRouter itself is fairly hands-off; the underlying provider's rules still apply, but OpenRouter is not adding its own filter on top of most models.

Why people eventually leave OpenRouter for direct provider keys:

Small markup. OpenRouter takes a cut. For heavy users, going direct to the provider is a few percent cheaper.
Latency. Routing through a proxy adds a small amount of overhead. Usually negligible, occasionally not.
Provider-specific features. Some provider features (Anthropic's prompt caching, OpenAI's reasoning effort dial, DeepSeek's prefix caching) work better or only work when you call the provider directly.

For most people in this community, OpenRouter is the right starting point for BYOK, and the right place to stay unless you are heavy enough on one specific model to justify a direct account.

The tradeoff at a glance

Rough shape of the three tiers, to give you something to anchor on:

Bundled app. Cheapest to start, easiest to use, locked model selection, app-controlled content rules, your chats live on the app's servers. Good for "I just want to chat, I don't want to think about any of this." Bad if you care about model quality, privacy, or content freedom.
BYOK. Moderate cost (scales with use), one-time setup, full model selection, provider-controlled content rules, your chats live on the provider's servers. Good for "I want frontier-model quality without the bundled app's restrictions." The current sweet spot for most active users in this community.
Local. Highest upfront cost (hardware), free per token, full model selection within hardware limits, no content rules at all, your chats never leave your machine. Good for "I run RP often enough that hardware pays for itself, and I care about privacy." Bad if your hardware is weak or you specifically need frontier-tier quality.

Cost shape, simplified:

Bundled: $X/month, flat.
BYOK: pennies per casual session, dollars per heavy session, scales linearly with use.
Local: hardware capex up front ($300 to $3000+ depending on what you want to run), then near-zero ongoing.

If you only chat a few hours a week, the bundled app is usually still the cheapest option in absolute terms, and there is nothing wrong with staying there. If you chat heavily, BYOK becomes cheaper than the bundled options surprisingly quickly. If you chat very heavily, local pays for the hardware in months.

Why the community is migrating

Three things stacked on top of each other over the past year.

Bundled apps tightened their content rules. Character.ai went through a well-documented filter tightening in 2025 and never really walked it back. Janitor, Chub, c.ai variants, and most of the bundled players have followed similar patterns: stricter moderation, more refusals, more silent quality degradation on the cheap tiers, more "premium" features behind higher subs. For RP specifically, this hit hard, because RP often touches content that mainstream chat apps want to keep at arm's length.

Open-weights models got genuinely good. This is the day-7 callback. DeepSeek R1 in January 2025 was the moment open-weights stopped being "a worse free version of GPT" and became "a real option." Since then, DeepSeek V3.2, Qwen3.5, GLM-4.7, Kimi K2.5, and the Mistral Large/Small 3 line have closed enough of the gap that for a lot of use cases (RP very much included) the open option is the better one, not just the cheaper one. BYOK gave people direct access to those models. Local gave people unlimited access.

Per-token economics started favouring BYOK. DeepSeek V3 and Gemini Flash dropped per-token prices to the point where a heavy RP user can run thousands of messages a month for less than the cost of a premium chat-app subscription. Once the economics flipped, the only reason to stay bundled was convenience, and convenience is a one-evening setup problem.

Stack those three together and the migration looks less like a trend and more like a market correction. Bundled apps were the only option in 2023. They are now one option of three, and for power users, often the least attractive of the three.

Where this leaves you

If you are reading this post, you are probably already somewhere in the middle of this migration. Most likely cases:

Still on a bundled app, curious about BYOK. Pick an OpenRouter account, drop $10 in credit, paste the key into SillyTavern or your UI of choice, and try DeepSeek V3.2 or GLM 5 for a week. The "is BYOK worth it" question answers itself in about three sessions.
Already on BYOK, wondering if local is worth it. Depends entirely on hardware. If you have a 12GB+ GPU, the answer is "probably yes, at least to try." If you have an 8GB GPU, the answer is "for small models, maybe." If you have a CPU and 16GB of RAM, the answer is "for tiny models with low expectations, sure." We will cover this properly in the VRAM math post.
Already local, looking for the next tier. Bigger models, better quantisations, faster runtimes, or multi-GPU setups. The deep-dive posts coming up are for you.

Tomorrow (day 9) we will get into the practical side of local LLMs: what actually runs where, the runtimes people use (llama.cpp, ollama, LM Studio, koboldcpp, vLLM, and a few less common ones), the differences between them, and how to pick. After that we will tackle VRAM math, quantisation and GGUF, sampling settings for local models, and finally the hardware question (which GPU, how much RAM, when does CPU inference make sense).

That's all for today. I hope this helps!

reddit.com

u/Exact_Law_6489 — 2 months ago

▲ 31 r/LocalLLaMA

Recently, I’ve seen lots of ads for the Kimi K2.6 across various social media platforms, and I’d like to hear from people who have used it.

Is it genuinely that good, or is it just a model with impressive benchmark scores that doesn't perform well in real use?

reddit.com

u/Exact_Law_6489 — 2 months ago

▲ 39 r/ChaiUnofficial+3 crossposts

Hey everyone! I’m the developer of LettuceAI.

LettuceAI is an open-source, privacy-first, cross-platform AI chat app built for character chats, roleplay, and long conversations that actually stay coherent.

It supports both local models (llama.cpp, Ollama, LM Studio) and external APIs with full BYOK support, so you stay in control of your own setup. No forced accounts, no cloud routing through us, no vendor lock-in. Your requests go directly to the model provider you choose.

The goal was simple: make powerful long-term AI chat feel easier and cleaner without losing flexibility.

That means:

Built-in Dynamic Memory for long conversations
Better support for character-based chats and group chats
A cleaner UI that feels less overwhelming than more complex setups
The same experience across desktop and mobile
Full control over prompts, lorebooks, personas, and system behavior

It’s designed for people who want a reliable memory and continuity system that doesn't require constant maintenance.

We’ve recently made major improvements to memory, prompts, local model support and lorebooks, and there’s a lot more to come very soon.

If that sounds interesting, come and join our Discord server! There's lots of exciting stuff on the way, and it's the best place to follow updates, provide feedback and influence the future direction of the app.

Links:

Website: https://www.lettuceai.app/
GitHub: https://github.com/LettuceAI/app
Discord: https://discord.gg/8eHDxEbRy4

u/Exact_Law_6489 — 2 months ago

u/Exact_Law_6489

LettuceAI 2.1 &amp; 2.1.1 (Android &amp; Desktop) are live

What's new in 2.1

What's new in 2.1.1

LettuceAI: open-source chat/RP, 2.0 update soon

Faster local models

Companion mode: characters that actually change (opt-in)

Group chats where you direct who speaks

Branch tree: see every path you took

Generate images on your own GPU

And a lot of polish

My island city

Which is the better local mobile TTS: Kokoro or Supertonic?

LettuceAI: open-source chat/RP, chats stay yours

Mixing HyperX Fury HX432C16FB3/8 (CL16-18-18) with Kingston Fury Beast KF432C16BB/16 (CL16-20-20) at 3200MHz?

LettuceAI 1.9 (Android) / 1.6 (Desktop) is live!

What's new in 1.9 / 1.6

What Is My CQS

AI Basics Day 10: VRAM math and quantisation, or how to tell if a model will actually fit on your card

What quantisation actually is

Decoding the filename

The quality cliff

VRAM math from first principles

Architecture wrinkles most calculators ignore

A note on LettuceAI, because it is on-topic

The cheat sheet by VRAM tier

Context length is part of the math

KV cache quantisation, the most underrated VRAM trick

Not everyone has an RTX 5090 or a maxed-out Mac Studio

Pure CPU + system RAM

Apple Silicon (unified memory)

Integrated GPU + shared system RAM

Mixed offload (small GPU + lots of RAM)

What none of these are good for

Tomorrow (or whenever)

AI Basics Day 9: What is a local LLM runtime, and which one should you actually use?

What a runtime even is

A short word on formats

The runtimes people actually use

llama.cpp

ollama

LM Studio

koboldcpp

text-generation-webui (oobabooga)

vLLM

Apple MLX and MLX-based tools (LM Studio MLX backend, Ollama MLX, mlx-lm)

Honourable mentions

Picking one without overthinking it

The runtime/UI separation, which trips up almost everyone

A note on LettuceAI, because it is on-topic

The thing nobody tells you about runtime speed

Why vLLM is probably not what you want as a single user

What about the model itself?

AI Basics Day 8: What does BYOK and "running your own model" actually mean, and why is half the community migrating that way?

The three tiers, in plain terms

What "BYOK" actually changes

What "local" actually changes

The middle road: OpenRouter

The tradeoff at a glance

Why the community is migrating

Where this leaves you

LettuceAI 2.1 & 2.1.1 (Android & Desktop) are live