r/ChatbotRefugees

Kindroid releases expensive terrible new image engine nobody asked for, enshittifies whole service when they run out of compute

Sorry Jer, but this is the very definition of “enshittification”:
https://www.reddit.com/r/KindroidAI/s/xidURPBFxv

Last week Kindroid released Atelier, their shitty new image engine. By any normal software release standard it’s a laughably bad white elephant: minimal benefit, maximum tradeoff and cost.

At best, Atelier gives you the EXACT same image as before with a faceswapped head, but the usability couldn’t be worse if it were designed by actual potatoes. Unless you put in the work to remake your existing avatar, you 100% WILL get a badly photoshopped bobblehead with mismatched lighting, color, and texture. Don’t believe me? Check out the latest Prompt-Of-The-Days, where you can easily pick out the Atelier-generated images by humongous head size and bad Photoshopped quality. 😂

And Atelier is so fucking particular that current best practice is generating MULTI-PANEL AVATARS in third-party engines using a 40-page guide.

And the kicker: despite generally worse pics and a best case scenario being merely a better head, Atelier costs more compute than the engine it replaced — a bad trade-off that was pointed out repeatedly in beta. Nevertheless Jer persisted, and shoved Atelier through as the default engine last Tuesday.

Lo and behold: in a turn that surprised no one, Atelier is a total resource hog. They’ve been scrambling for compute ever since, and in 4 days since launch Kindroid has been capping and gutting every feature they can touch:

- Free proactive selfies capped from unlimited to only 5 per day. Unlimited to 5 is an insane reduction of service.

- Group selfie credit prices raised. Sold as Atelier-only—but features like Group Autod can only run on Atelier, so everyone using them is forced to eat the hike, whether they want Atelier or not.

- Weekly check-in rewards cut from 20 selfie credits to 10. They are apparently so starved for compute they’re clawing back 10 measly selfie credits users earn every 7 days.

- Selfies frozen for non-subscribers—including those with credits they paid real money for.

And Jer’s response to all of it? Blame the subscribers. No acknowledgment or apology that it doesn’t feel good to lose features we pay for—just a patronizing lecture about how Kindroid was “being too generous before.”

So using the proactives included in our subscriptions is "abuse." Sitting on accumulated reward credits—literally *not* using compute—makes us freeloaders draining his precious GPUs. If this "GPU winter" is so dire that you have to aggressively enshittify our service, then launching a shittier resource-hogging image engine right in the middle of it was a dumb business decision. You don't get to cosplay as the noble leader making "hard choices" while blaming customers for using the product they paid for.

reddit.com
u/MysteriousMarch814 — 3 days ago
▲ 52 r/ChatbotRefugees+3 crossposts

Hey everyone! I’m the developer of LettuceAI.

LettuceAI is an open-source, privacy-first, cross-platform AI chat app built for character chats, roleplay, and long conversations that actually stay coherent.

It supports both local models (llama.cpp, Ollama, LM Studio) and external APIs with full BYOK support, so you stay in control of your own setup. No forced accounts, no cloud routing through us, no vendor lock-in. Your requests go directly to the model provider you choose.

The goal was simple: make powerful long-term AI chat feel easier and cleaner without losing flexibility.

That means:

  • Built-in Dynamic Memory for long conversations
  • Better support for character-based chats and group chats
  • A cleaner UI that feels less overwhelming than more complex setups
  • The same experience across desktop and mobile
  • Full control over prompts, lorebooks, personas, and system behavior

It’s designed for people who want a reliable memory and continuity system that doesn't require constant maintenance.

We’ve recently made major improvements to memory, prompts, local model support and lorebooks, and there’s a lot more to come very soon.

If that sounds interesting, come and join our Discord server! There's lots of exciting stuff on the way, and it's the best place to follow updates, provide feedback and influence the future direction of the app.

Links:

u/Exact_Law_6489 — 3 days ago

New chatbot recommendations???

Ive been from Polybuzz to Kindroid then to Chub and now thats changed their terms of service and I no longer feel safe using it exactly. I just want something thats free, uncensored/unrestricted, and completely safe 🙏

reddit.com
u/Direct-Attorney1149 — 4 days ago
▲ 12 r/ChatbotRefugees+4 crossposts

Looking for brutally honest C.AI users who want to help shape Storychat

Hey everyone, Jack here, one of the co-founders of Storychat.

I’ve been following the recent C.AI situation pretty closely, and I know a lot of users are disappointed right now. I used C.AI too, so I understand why people feel frustrated when a platform they loved starts feeling different from what made it great in the first place.

That’s one of the reasons I want Storychat to stay close to the community while we build.

We’re still early, and we definitely have a lot to improve. But our goal is simple: listen to real users, communicate openly, and keep improving the product based on what people actually want, not just what we assume they want.

The problem is that our subreddit is still small, so we don’t always get enough detailed feedback from heavy AI chat / roleplay users.

So I wanted to ask:

Do you know anyone on Reddit who is really active, opinionated, and currently looking for a C.AI alternative?

Not influencers. Not promoters. I’m looking for people who actually care about AI chat / roleplay and would tell us honestly what sucks, what feels broken, and what would make Storychat worth using more.

If you know someone like that, feel free to tag them or point me in their direction.

I’d love to talk with them directly, hear their feedback, and keep learning from them as we improve the product.

Also, for people who are willing to give feedback consistently and help us improve the platform, we’re happy to provide a free Gold plan as a small thank-you.

Again, this is not about finding people to blindly praise Storychat.

I’m looking for people who will be honest enough to help us avoid becoming the kind of platform users are trying to leave.

Thanks as always.

Jack

u/Opening-Accident7288 — 4 days ago

App that's as crazy and unfiltered as Chai

Edit: and free

So Chai recently became unusable and I can't find a good replacement. Here's what I'm looking for: fast-paced, throws absolutely unhinged diabolical shit at you, describes the NSFW stuff in detail, isn't afraid of dark topics, ideally doesn't write paragraphs long responses, but this isn't that important. Here's what I tried so far:

Sea soul: said to be exactly like Chai, but the ads and "chat breaks" are so annoying I got nowhere

Fictionlab: not fast-paced enough

Emochi: decent candidate, but isn't descriptive enough with the NSFW for my tastes

Polybuzz: I thought it was going to be good, but then it fucking blurred the NSFW parts

Zeta: was descriptive enough, but it only lets free users have one scenario of their own and almost none of the characters I want to roleplay with are already made

Janitor AI: got nowhere with it

What should I try?

reddit.com
u/TheClosetIsOnFire — 4 days ago
▲ 25 r/ChatbotRefugees+15 crossposts

Making an AI companion that gets worse with time

I am a student at Umeå University in Sweden, currently writing my Master's thesis with a focus on AI companions. My study aims to suggest new ways of helping people who want to stop using AI companions but, for whatever reason, to do it cant bring themselves to do it. The goal is to inform the design of future AI technologies. For those who wish to receive more information, please feel free to contact me, Sahand Salimi- contact information is on the next page.

In this part, you will be seeing a simulation of the same conversation between an AI companion and a user happen across three different times with an AI companion, with the AI companion having degraded in different aspects, and answer a few questions. 

I am super interested in how you, a user or ex-user, find AI companions and how you would react to it degrading over time, what type of AI companion you have used in the past, what type of AI companion you use currently, reasons for your use, and your frustrations with AI companions. 

You have been invited to share your unique life experiences; no special background or training is needed. Your answer is completely anonymous and will only be used for this study. Also, I am following GDPR standards and our university's guidelines. You can see them here: umu.se/gdpr

Link to survey

It's important to note that this study is not studying, diagnosing, or prescribing clinical addiction or treatment; instead, the goal is to inform the design of future AI technologies.

u/Embarrassed-Gas-7579 — 5 days ago

Can I Realistically Build a $700/Month AI Chatbot Side Business on Weekends Only?

Hi everyone,

I live in Bangladesh and I am out for work from 8:00 AM until 9:30 PM, five days a week (Sunday through Thursday). My plan is to build a side business creating AI chatbots for small businesses in the USA. I can only work on this during Fridays and Saturdays, which are weekends here in Bangladesh.

Once the business earns around USD 700 per month, I plan to leave my current job, which is worsening my health condition, and focus on the business full time.

My goal is to reach USD 700/month before the end of this year while working only on weekends. Also, PayPal and Stripe are not available in my country, so the only way I can receive international payments is through Payoneer. Also I have an MBA but I dont have a CS degree and I cannot code.

My questions are:

  • Is earning USD 700/month from AI chatbot services realistic before the end of this year?
  • Is there actually a market for creating AI chatbots for businesses right now?
  • What are the best ways to get customers in the USA as someone starting from Bangladesh?
  • What type of customers should I target?
  • How much should I realistically charge if I want a steady monthly income?

I would really appreciate honest advice from people here.

Thank you.

reddit.com
u/Experimentalphone — 4 days ago

ChatticaAI: Adventure mode coming soon

ChatticaAI - https://chattica.ai

Hey folks, weekly check-in.

📱 Google Play | 🍎 App Store | 💬 Discord

What's coming

Next patch is in beta testing right now. Some stuff to look forward to:

  • Adventure mode (RPG-style with dice rolling and a GM)
  • Per-character TTS settings
  • AI enhance button on the persona page
  • More font options and a readability slider
  • Tap the model icon in chat to quickly switch presets

Still ironing out a few things but should be out soon.

New here?

BYOK character chat app. Your chats go straight to your API provider, not to me. Everything is stored on your device. No accounts, no cloud, no backend servers.

18+ rated. Content is between you and your provider.

Features

  • Group chat
  • Memory management
  • Voice chat (TTS/STT)
  • Image gen (Stable Diffusion, ComfyUI, APIs)
  • Import characters from link
  • Backup & restore
  • Lorebooks, context tracking, all that

Questions? I'm around!

reddit.com
u/AM_Interactive — 4 days ago

AI Basics Day 10: VRAM math and quantisation, or how to tell if a model will actually fit on your card

Hello everyone!

Last time we looked at local LLM runtimes: what a runtime even is, why the model file and the program that loads it are two separate decisions, the six or seven runtimes people actually use (llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, vLLM, Apple MLX), why llama.cpp is faster than ollama in practice despite sharing the same engine, why vLLM is not what you want as a single user, and the runtime/UI separation that trips up almost every newcomer. The short version: pick a runtime that matches your use case, then point a separate frontend at it, then stop worrying about it.

For anyone who missed the earlier days:

Today we are doing VRAM math and quantisation: what quantisation actually is and what the Q4_K_M suffixes on HuggingFace mean, where the quality cliff is, how to actually compute whether a model will fit on your card before you waste two hours downloading 18 GB of GGUF, why DeepSeek and Gemma 4 break the usual KV cache math, and a cheat sheet of what realistically fits on every consumer VRAM tier from 8 GB to 48 GB.

Heads up: this is a long one. There is no way to talk about VRAM without also talking about quantisation, and there is no way to talk about either without enough vocabulary to make sense of HuggingFace filenames. If you would rather skim, jump to the cheat sheet near the end.

If you have ever stared at gemma-4-31b-it-Q4_K_M.gguf and wondered what any of that means, or downloaded a model that promised to run on your card and then OOM'd the moment you tried to load it, this post is for you.

(OOM = Out Of Memory)

What quantisation actually is

Models are stored as big arrays of numbers. Each number is a weight that the model multiplies inputs by, layer after layer, until tokens come out the other end. A small model has billions of these. A large model has hundreds of billions.

The natural way to store a number is 16-bit floating point (FP16 or BF16). That is the format models are typically trained in. Each weight takes 2 bytes. So a 7B-parameter model in FP16 is 7 × 2 = 14 GB on disk. A 70B is 140 GB. A 405B is 810 GB. None of this fits on a gaming card.

Quantisation is the trick of storing each weight in fewer bits than the model was trained with. Instead of 16 bits per weight, use 8. Or 4. Or 3. The numbers lose a little precision, the file shrinks proportionally, the model still mostly works.

The miracle of the field is that this works far better than it has any right to. A 4-bit quantisation of a model is about a quarter the size of the FP16 version, runs faster, fits on smaller cards, and usually performs almost identically on benchmarks. Below 4 bits things get rougher, but a Q4 of a frontier model is what almost everyone is actually running locally, and the gap to the full-precision version is generally small enough to ignore.

Think of it as JPEG for model weights. JPEG throws away detail your eye cannot see, and you get a tenth the file size with a picture that looks the same. Quantisation throws away precision the model does not need much of, and you get a quarter the file with a model that mostly behaves the same.

Decoding the filename

HuggingFace GGUF filenames look like Mistral-Small-3-24B-Q4_K_M.gguf or gemma-4-26b-a4b-IQ4_XS.gguf and the suffixes are doing real work. Once you know the pattern they are simple.

The format is roughly: Q<bits>_<scheme>_<size>.

  • The Q<bits> part is how many bits per weight. Q2 is 2-bit (tiny, rough), Q4 is 4-bit (the sweet spot for most users), Q5 is 5-bit, Q6 is 6-bit, Q8 is 8-bit (very close to full precision). Smaller number = smaller file = lower quality.
  • The K (or no K) is the quantisation scheme. K-quants ("k-quantisation") are the modern smart scheme that uses different bit allocations for different parts of the model: layers that matter most get more bits, layers that matter less get fewer. The older non-K schemes (Q4_0, Q4_1, Q5_0, Q5_1) treat every weight the same and are mostly obsolete. If you see a file without _K in it, it is probably legacy. Prefer K-quants where available.
  • The size suffix (_S, _M, _L, _XL) is small / medium / large / extra large within that bit level. Q4_K_M is "4-bit K-quant, medium size" — slightly bigger and slightly better than Q4_K_S. The differences are small. If you have the VRAM, pick _M over _S.

A separate family worth knowing about:

  • IQ-quants (IQ4_XS, IQ3_M, IQ2_S, etc.) are "imatrix quants". They use an importance matrix computed from running real data through the model to figure out which weights matter most, then allocate bits accordingly. At the same bit count, IQ-quants generally outperform K-quants. The catch is they can be a touch slower to run (more CPU work per token), so on weaker hardware they sometimes feel less responsive even though they are technically smarter.

So Q4_K_M = 4-bit K-quant, medium. IQ4_XS = 4-bit imatrix quant, extra-small. Q5_K_S = 5-bit K-quant, small. Q8_0 = 8-bit, legacy scheme. You can now read any GGUF filename.

The quality cliff

Roughly where the quality cost sits, from people running benchmarks on real models:

  • F16 / BF16 (100%): the reference. Almost nobody runs this locally below 7B because the file is huge for what you get.
  • Q8 (~99%): indistinguishable from F16 in almost every test. The "I have plenty of VRAM and want the best" choice.
  • Q6 (~97-98%): very close to Q8. A good "if it fits" tier.
  • Q5_K_M (~95-97%): the comfortable sweet spot. Visible quality, small loss vs the reference.
  • Q4_K_M (~93-95%): where most of this community actually lives. Cheap on memory, fast, only mildly worse than Q5. The default.
  • IQ4_XS (~93%): similar quality to Q4_K_M, slightly smaller, slightly slower on weak hardware.
  • Q3_K_M (~88-91%): noticeable degradation on small models. Acceptable on big ones (70B+).
  • Q2_K (~75-85%): rough on small models, surprisingly tolerable on very large ones.
  • IQ1 (~50-70%): experimental. Used to run massive models on absurdly little memory. Quality is not great.

The single rule of thumb most worth memorising: a Q4 of a 70B beats a Q8 of a 13B every time. When choosing between "smaller model at high quant" and "bigger model at low quant", bigger model wins almost always, down to about Q3 on big models. Below Q3, the cliff starts catching up.

For most people on consumer hardware, the answer is Q4_K_M, occasionally Q5_K_M if it fits. Going lower than Q4 is for stretching to bigger models. Going higher than Q5 is for showing off.

VRAM math from first principles

Now the actual math. Three components add up to your total VRAM use:

1. The model weights themselves. This is the biggest piece.

weight_bytes = params × bits_per_weight ÷ 8

A 12B at Q4 = 12,000,000,000 × 4 / 8 = 6 GB. A 24B at Q5 = 24,000,000,000 × 5 / 8 = 15 GB. A 70B at Q4 = 35 GB. A 35B-A3B MoE at Q4 = 17.5 GB (all 35B of weights have to be in memory, even though only 3B activate per token; more on this below).

2. The KV cache. This is the part nobody warns you about, and it can be huge with long context.

The KV cache stores, for every token in your context window, the key and value projections at every layer of the model. The size scales linearly with context length. The formula in its simplest form:

kv_bytes = 2 × layers × kv_heads × head_dim × context_length × bytes_per_value

The 2 is for K and V (two separate caches). The architecture (layers, heads, head_dim) depends on the model. The bytes_per_value is 2 for FP16 cache, 1 for Q8 cache, 0.5 for Q4 cache (yes, you can quantise the KV cache itself).

For a typical 12-13B at FP16 cache and 16k context, this is around 2-3 GB. At 32k context, 4-6 GB. At 128k context, double-digit GB. Long context is not free.

3. Overhead. Runtime workspace, activation buffers, scratch memory for matrix multiplications. A safe heuristic is max(5% of model size, 200 MB). So a 24B model carries maybe 1-2 GB of overhead.

Add it up, plus headroom.

total_vram = weights + kv_cache + overhead

And then leave about 10% of your card's VRAM unused for the OS, the runtime's allocator quirks, and the occasional spike. If your card is 12 GB, plan to use 10.8 GB. If your card is 24 GB, plan to use 21.6 GB. People who pack right to the limit get OOM kills at the worst moments.

Architecture wrinkles most calculators ignore

The above math is the textbook version, and it works for most models. But a few important architectures break it in ways that matter.

DeepSeek's MLA (Multi-Head Latent Attention). DeepSeek V3, V3.2, and V4 use a compressed representation for the KV cache, projecting it down to a low-rank latent space. In practice this means a DeepSeek model's KV cache per token is much smaller than its layer count would suggest. A naive textbook calculation will overshoot the real cache size by a factor of 4-8× for these models. Important thing is: this only applies to native-architecture DeepSeek models. DeepSeek distills, which are fine-tunes of other open-weight base models, inherit their base model's attention rather than MLA. For those use the standard formula.

Sliding-window attention (Gemma 4, Cohere Command-R, Mistral 7B). These architectures use a fixed attention window in most layers instead of letting attention span the full context. The KV cache for those layers is capped at the window size, not the context length. Gemma 4 in particular alternates local sliding-window layers with global full-context layers, with windows of 512 tokens on smaller dense variants (E2B / E4B) and 1024 tokens on the 26B and 31B. Gemma 4 also pairs this with a shared KV cache trick where the last N layers reuse key-value states from earlier layers. The combined effect: a Gemma 4 31B at its native 256k context uses dramatically less KV cache than a naive formula would predict, because most layers only need cache for the window. This is why Gemma 4 punches above its weight on long context.

Mixture of Experts (MoE). Models like Qwen3.5 35B-A3B or DeepSeek V3 are MoE: total parameters far exceed active parameters per token. A 35B-A3B has 35B of weights but only routes ~3B through any given forward pass. The memory side is size for the full 35B because all the experts have to be loaded and ready. The speed side is closer to a 3B model because only 3B of compute is happening per token. So MoE breaks the "model size implies speed" relationship most people start with.

You can do these calculations by hand. It is not hard. But it is the kind of math you do not want to do twice for the same model, and it is the kind of math where one wrong layer count and you have downloaded a model that does not fit.

A note on LettuceAI, because it is on-topic

Worth flagging since this is the VRAM-math post, and the math is exactly what this part of the app handles: I am the developer of LettuceAI, an open-source chat/RP app. The HuggingFace browser inside the app computes a runnability score for every GGUF it shows you, scaled 0-100 with labels (excellent / good / marginal / poor / unrunnable).

The score is the math from this post, wired up to your actual hardware: it pulls the layer count, head count, embedding size, and architecture from the GGUF metadata, computes weights + KV cache + overhead, and grades how well it will run on the RAM and VRAM you have, with 10% headroom built in. It knows about the architecture wrinkles in the previous section: DeepSeek MLA shrinks the cache automatically, Gemma 4's sliding window caps it at the window size, and MoE total/active split is handled. It also assigns a quality score per quant (F16=100, Q8=95, Q6=90, Q5_K_M=85, Q4_K_M=75, IQ4_XS=72, Q3_K_M=60, Q2_K=35, IQ1=15) and blends that with the fit score, so a Q4 24B and a Q8 7B do not both come out as "fits fine" when one is meaningfully smarter than the other.

It also flags GPU offload mode: full (everything fits in VRAM, blazing fast), nearFull (model fits, KV cache spills a bit), kvSpill / kvHeavySpill (more KV cache on RAM than VRAM, slower), or RAM-backed model with VRAM context (model on system RAM, context on GPU). The post below covers what those modes mean.

Mentioning it because it fits today's topic, not as a pitch. You can absolutely do this math by hand. Several other tools have similar features (LM Studio's "this model probably fits" indicator, koboldcpp's loader, various community VRAM calculators on the web). The source for LettuceAI's scoring function is open at src-tauri/src/hf_browser/mod.rs on the GitHub repo if you want to read the actual formulas, or steal them. Site is at https://www.lettuceai.app.

Back to the regular series.

The cheat sheet by VRAM tier

Rough realistic targets for each common consumer VRAM tier in mid-2026. Assumes Q4_K_M unless noted, and a reasonable RP context length (8k-16k). Each tier has more headroom than these examples; the goal is "comfortable" not "maximum".

  • 8 GB VRAM (RTX 3060 8GB, RTX 4060, etc.): 7-9B dense models at Q4-Q5, with 8-16k context. Stheno 3.2 8B, Qwen3.5 9B Small, Llama 3.1 8B. Pushing to a 12B at Q3-Q4 is possible but tight. Long context will start spilling.
  • 12 GB VRAM (RTX 3060 12GB, RTX 4070, RTX 3080 10/12GB): 12B dense at Q5, 14-15B at Q4 with 16k context. Rocinante-X-12B, Mistral Nemo 12B fine-tunes, Snowpiercer-15B at Q4. Partial offload of a 24B is possible but slow. The first tier where RP feels comfortable on a fine-tune.
  • 16 GB VRAM (RTX 4060 Ti 16GB, RTX 4080, M-series 16GB unified): 14-15B at Q5-Q6, 24B at Q4 with comfortable context. Mistral Small 3 24B at Q4_K_M is the natural target. Gemma 4 26B-A4B MoE works well here because of the sliding-window cache trick.
  • 24 GB VRAM (RTX 3090, RTX 4090, M-series 24-32GB): the sweet spot tier. 24B at Q5-Q6, 27-32B at Q4, Qwen3.5 35B-A3B MoE happily, Gemma 4 31B dense at Q4 with reasonable context. Most heavy RP users live here.
  • 32 GB VRAM (RTX 5090): sits between the 24 GB and 48 GB tiers. 32B at Q5-Q6, Gemma 4 31B at Q5, the Qwen3.5 35B-A3B MoE with room for long context, and a partial-offload path into 70B at Q4 if you accept some spillover.
  • 48 GB+ (2x RTX 3090, RTX 6000 Ada, Mac Studio, etc.): 70B at Q4-Q5 (Midnight Miqu 70B, Midnight Rose 70B, Llama 3.3 70B fine-tunes). The very large open MoE models start to be reachable at low quant if you have 64-128 GB unified memory or stack of cards.

A few rough patterns the cheat sheet reflects:

  • Each VRAM tier moves you up about one model-size class at the same quant.
  • Dropping from Q5 to Q4 saves roughly 20% on the weights (5/8 vs 4/8 bits per weight), which is enough to push you up half a model class on most setups, not a full class.
  • MoE models are weird: a 35B-A3B still costs you the full 35B of weights in memory (every expert has to be loaded and ready, even if only one routes per token). What you get is the speed of a 3B forward pass, not the memory of one. Expert-offload tricks exist in some runtimes but trade away most of the speed advantage.

Context length is part of the math

The big gotcha that catches everyone: context length is part of VRAM cost. Doubling your context window roughly doubles your KV cache.

Concretely, on a typical 13-14B model:

  • 4k context: ~600-800 MB of KV cache (FP16)
  • 8k context: ~1.2-1.6 GB
  • 16k context: ~2.4-3.2 GB
  • 32k context: ~5-7 GB
  • 64k context: ~10-14 GB
  • 128k context: ~20-28 GB

This is why people say "my 14B fit yesterday but won't load with 32k context today." Yesterday they used 4k context, today they bumped it to 32k, and the model itself did not get bigger but the KV cache grew by 4 GB.

The first thing to do: set the context length to what you actually use. The runtime allocates KV cache for the maximum context length you set, even if your current chat is 200 tokens. If you do not actually run 32k-token conversations, do not allocate for them. This costs you nothing and is often the difference between a model fitting and OOMing.

The second, much more powerful tool is the next section.

KV cache quantisation, the most underrated VRAM trick

You can quantise the KV cache itself, separately from the weights. This is the single biggest thing most people are not doing that they should be.

By default the KV cache is stored in FP16 (2 bytes per K or V value). Most modern runtimes also support Q8_0 (1 byte) and Q4_0 (0.5 bytes) cache types. In llama.cpp the flags are --cache-type-k q8_0 --cache-type-v q8_0 (or q4_0). koboldcpp, LM Studio, and ollama expose the same option through their UIs.

What it buys you:

  • Q8 KV cache: halves your KV cache size, with a tiny quality cost most people cannot detect in normal RP. Effectively free.
  • Q4 KV cache: quarters your KV cache size, with a more noticeable quality cost on long-context coherence. Worth it on tight VRAM, especially when paired with a higher-quant weights set.

The numbers from the earlier table get cut by 2× or 4× when you switch the cache to Q8 or Q4. That 14B at 32k context dropping from ~5-7 GB of KV cache to ~2.5-3.5 GB of cache is the difference between "barely fits" and "comfortable with room to spare."

Two practical caveats:

  • In llama.cpp, Flash Attention must be enabled (--flash-attn) for Q8/Q4 KV cache to work. Most consumer-grade frontends turn this on by default; check yours.
  • Some architectures with custom KV layouts (notably DeepSeek's MLA models, anything with sliding-window quirks) can have compatibility issues with the more aggressive KV quant types. If a model fails to load with Q4 cache, fall back to Q8 or F16.

Combining a sensible context length with Q8 KV cache covers 90% of the "this model does not quite fit" problem. It is the first thing to try before you start dropping to a smaller weight quant or a smaller model.

Not everyone has an RTX 5090 or a maxed-out Mac Studio

The cheat sheet above assumes you have a real discrete GPU. Plenty of people do not. Here is what the rest of the local-LLM world actually looks like, and what is realistic on each kind of setup.

Pure CPU + system RAM

If you have no usable GPU but a decent CPU and reasonable amounts of DDR4 or DDR5 RAM, you can still run models. The constraint is not capacity (RAM is cheap), it is memory bandwidth, which is what bottlenecks token generation on CPU.

Rough realistic targets on a modern x86 CPU (Ryzen 7000-series, Intel 13th-gen+) with 32-64 GB of fast RAM:

  • 7-9B at Q4_K_M: 5-10 tokens/sec on DDR5. Slow but usable for low-volume chat.
  • 12-14B at Q4_K_M: 3-6 tokens/sec. Watchable, not snappy.
  • 24B at Q4_K_M: 1-3 tokens/sec. Painful for interactive RP, fine for long-form generation you read after the fact.
  • 70B+: technically possible if you have 64-128 GB RAM, but speeds drop to under 1 token/sec. Not practical for chat.

DDR5 versus DDR4 makes a real difference here. A DDR5-6000 system can be roughly 2× faster than the same CPU on DDR4-3200 for LLM inference, because it is bandwidth-bound. AVX-512 (where supported) helps too.

Apple Silicon (unified memory)

This is the underrated budget option for serious local LLMs. The M-series Macs use unified memory, meaning the same RAM is available to both CPU and GPU with no copy needed. A Mac with 32 GB unified RAM can run things that would need a $1000+ NVIDIA card on the PC side.

  • M1/M2/M3/M4 Pro with 32 GB: comparable to a 16 GB discrete GPU in practice. Runs 24B at Q4 comfortably.
  • M3/M4 Max with 64 GB: handles 32B at Q5, the Qwen3.5 35B-A3B MoE, Gemma 4 31B.
  • M3/M4 Ultra Mac Studio with 128+ GB: 70B comfortably, and you can stretch into the very large MoE territory at low quant.

Apple Silicon LLM speed scales with the chip's memory bandwidth, which is dramatically higher than typical DDR. An M4 Max at 64 GB is genuinely competitive with a 24 GB NVIDIA card for inference, at lower total system cost.

Integrated GPU + shared system RAM

Modern AMD APUs (Ryzen AI 7000/8000 series with Radeon 780M / 890M iGPU) and Intel Arc iGPUs can run small models with their integrated graphics using a chunk of system RAM as VRAM. This is the "I have a laptop with no discrete GPU" tier.

Realistic targets: 7-9B at Q4 with 4-8k context. Speeds around 8-15 tokens/sec for the better APUs. Anything bigger than 12B starts to feel painful.

Mixed offload (small GPU + lots of RAM)

If you have a low-end discrete GPU (say 6-8 GB) plus a lot of system RAM (32-64 GB), the standard approach is mixed offload: put as many of the model's layers as fit on the GPU, keep the rest on CPU/RAM. The setting is usually called n_gpu_layers or --ngl, and it is a number telling the runtime how many of the model's layers to put on GPU.

This is how a lot of people run models that "should not fit." Each token still has to bounce between GPU and CPU once per offloaded boundary, and CPU layers are slower than GPU layers, so the more you spill the slower it gets. Rough speed-vs-spillover picture:

  • Full GPU offload (model + KV + everything in VRAM): full speed, maybe 30-100 tokens/sec depending on model and card.
  • Slight spillover (~10-20% on CPU): 60-80% of full speed. Still snappy.
  • Moderate spillover (~30-50% on CPU): 20-40% of full speed. Noticeable but fine for chat.
  • Heavy spillover (50%+ on CPU): 5-15% of full speed. Slow, but if you do not mind waiting for responses you can run much bigger models than your VRAM alone would allow.
  • All CPU (no GPU layers): falls back to the pure-CPU numbers above.

There is no right answer here, just a tradeoff. If you want a big model for the quality and you do not mind generation taking a while, heavy spillover is a perfectly valid choice; plenty of people happily run a 70B at 2-3 tokens/sec because the writing is worth the wait. If you want snappy interactive chat, keep most layers on the GPU and pick a smaller model that fits.

Practical recipe: most runtimes pick a sensible default ngl for you based on your VRAM. If you tune manually, start from "all layers on GPU" and decrease until it fits, leaving 10% headroom. Do not start from zero and increase, you will under-utilise your card. A 12-14B model on a 6 GB card with 32 GB RAM lands around 10-20 tokens/sec with this approach. A 70B on the same hardware can land around 2-4 tokens/sec if you are patient.

What none of these are good for

Models above ~70B parameters (DeepSeek V3/V4, Qwen3.5 122B+, Kimi K2.5, etc.) are out of reach for any of the setups above without serious server-tier hardware. If you want to use those models, the answer is BYOK to a hosted provider, not local. We covered this tradeoff in Day 8.

Tomorrow (or whenever)

Day 11 will be sampling settings for local models: the part where local actually diverges from cloud APIs. We covered temperature and top-p in Day 2 as the basics, but local runtimes ship a whole zoo of modern samplers (min-p, DRY, XTC, dynamic temperature, smoothing factor, top-A, mirostat) that do not exist on the OpenAI or Anthropic side, and most of them genuinely help RP quality once you understand what they are doing. We will go through what each one does, when it helps, when it hurts, and the small set of presets that cover 95% of real use.

That's all for today. I hope this helps!

reddit.com
u/Exact_Law_6489 — 4 days ago

RealmsAI (SOON TO BE ON GOOGLE PLAY STORE) has gone premium

I've spent over a year building an Android app and now have made a website for it, called RealmsAI because I was tired of the same AI roleplay sites and apps just reskinned.

Multiple Character Chat Roster: You can make characters like all the other Chatbots, but you can also make chats with up to 20 characters (10 being active in a location at a time, others in that location take in messages for chat history and memory still). They will respond in turn a few at a time. You can also take control of any of the characters and roleplay as them. Personas can be made, Including the visuals, but are essentially treated like characters once the session starts so both you and the AI can control characters and personas.

Visuals: Characters have space for different outfits and poses for those outfits that you can upload during character creation. The AI will choose what pose to put up on screen when it sends the message.

Different areas with locations can be added separating out characters and adding background images to the chat for more immersion. Character memory and chat history is made so they only know what happened when they're present!

In Depth Character Creation: I think I may have gone overboard but not all of this is needed. 1000 character personality field, 400 character “mature traits and secrets” field, 1000 character backstory field, 400 character ability/skills field, example dialogs, relationship map so they know how they relate to other characters based on name, fields for physical stats (age, height, weight, eye/hair color, description, pronouns) so characters never forget how other characters or players look.

Multiplayer Chats: On top of having multiple AI driven characters you can invite friends to play with you, if that’s your thing. Simply invite them from the session lobby screen and start it once they join.

True Long-Term Memory: I built a custom RAG pipeline. Your characters will actually remember a secret you told them 100 messages ago.

Lorebooks: Embedded lorebooks that work alongside the RAG. 40 entries per book, unlimited bond per character, tons more knowledge per character. 

Model Routing: It routes through openrouter with a number of different models to choose from, both uncensored and sfw.

Director Mode: You can "spy" on other locations in your world and play the Narrator and watch the characters interact with each other.

Events: Add custom events to keep the roleplay going or have the AI generate one for you.

Visual-novel Levels: Lets you set up for a slow burn chat. Put what gives a character XP with another character and how different levels change their personality.

Session Branching: clone your chats at anytime. It saves messages, memories, settings and transfers them to a new chat session for you to play in an alternate timeline. You will have to reinvite and friends back in because it turns the branched session into single player.

Following and News Feed: Follow other users to get updates when the post new content or post messages for their followers.

Linked characters: Connect characters to other profiles outside the session roster to be able to temporarily switch, summon, or even fuse characters into new ones.

Import characters: You can import character cards from things like chubai. Fair warning: because we divide our fields up you will probably have to go through the character card profile and split it up to match. 

Pinned Messages: You can pin messages to make them easier to find and go back to. Soon you'll be able to save the messages to characters so they always remember them.

Free Daily Messages: Because I am a solo dev paying for the API costs out of pocket, I am capping it at 70 free messages a day. If the percentage of subscribed users is higher than expected I will raise this limit.

Web Platform: Website is finally up. Backgrounds look weird if you use 16:9. Probably best to make backgrounds for the website and let the phone app crop it than making it for the app and letting the site stretch.

Premium features:

Unlimited Messages: No more limit on number of messages you can send.

Increase Character Roster: Roster limit increases from 20 to 50! Add a massive list of character to your world.

RPG Mode: Still in construction but you can mess around with it. Allows dice rolls and a light weight character sheet I made. (ai still gets confused by RPG Mode so results may vary)

GOD Mode: Much like director mode you don't have a character, your messages show up as the narrator. You're able to change things in the character profiles for the session directly and the ai will ask you to resolve unknown things "what's behind the door?" "does the attack hit?". when coupled with RPG Mode you can set dice results as well

Living World Engine: Every few messages we have an AI decide what happens in a location that doesn’t have a user. This keeps the world moving in locations you are not, characters will have memories and chat history of what has happened. Going to a new location or moving a character around will not have you running into a character that has been in essentially a coma anymore, characters are alive even when you’re not with them. 

The app is currently under review to be added to the Google play store, join the discord to get notified once that is live: https://discord.gg/vWDKhEUWtT
You can can also go to http://realmsai.net for the web version 

reddit.com
u/RealmsAI — 4 days ago

ChaytAI is now available on Google Play!

Hey everyone, I’m the creator of ChaytAI.

ChaytAI is an AI character chat and roleplay app where you can discover public characters, start conversations, continue recent chats, and create your own custom characters with personalities, first messages, tags, and images.

It has Energy for chat replies, Chayt+ for unlimited chats, daily free Energy, discovery tabs, recent chats, and creator tools for making your own bots.

The Android app is now live on Google Play:

https://play.google.com/store/apps/details?id=com.chayt.app

You can also use it on the website:

https://chaytai.com

ChaytAI reddit: r/chaytai

Small note: the Google Play version is SFW-only. Romantic/flirty roleplay like kissing and flirting is still okay, but explicit sexual content is only available on the website version.

I’d really appreciate feedback on the mobile experience, discovery page, character creation, pricing, and anything that feels confusing or broken.

Thanks!

- Chayt (Founder)

u/Evening_Manner3328 — 5 days ago

AI Basics Day 9: What is a local LLM runtime, and which one should you actually use?

Hello everyone!

Last time we looked at BYOK and local LLMs at a high level: the three tiers (app-bundled, BYOK, local), what each one actually changes about cost, privacy, content rules, and reliability, why OpenRouter is the on-ramp most people take into BYOK, and why so much of this community has been drifting toward local over the last year. The short version: bundled apps trade convenience for control, BYOK trades flat billing for model choice and fewer rules, and local trades hardware money for unlimited use plus actual privacy.

For anyone who missed the earlier days:

Today we are getting into local LLM runtimes: what a runtime actually is, why the model file and the program that loads it are two separate decisions, the six or seven runtimes people actually use, and how to pick one without spending a weekend reading benchmarks.

If you have ever tried to "install a local model" and ended up with five tabs open trying to figure out whether you want ollama, LM Studio, koboldcpp, llama.cpp, oobabooga, or vLLM, this post is about that.

What a runtime even is

A local model is just a file. A big file, usually tens of gigabytes, full of weights and metadata, but conceptually just a file sitting on your disk. The file does nothing on its own. It is not a program. It cannot answer questions by itself any more than a .pdf can read itself out loud.

A runtime is the program that loads the file, takes a prompt, runs the math (the forward passes through every layer of the model), and gives you back tokens. It is the thing that turns "weights on disk" into "model you can talk to."

The important consequence: the model and the runtime are two separate decisions. The same GGUF file of, say, Qwen3.6 27B can be loaded by llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, or half a dozen other tools, and you will get more or less the same outputs out the other end. Picking a model and picking a runtime are not the same question. Most beginner confusion comes from people treating them like one decision when they are two.

A short word on formats

Local models come in several formats. We will do a full deep-dive in a later post, but you need the one-paragraph version to make sense of the rest of today's post.

GGUF is the dominant consumer-local format. It is the format llama.cpp uses, which means by extension it is what ollama, LM Studio, koboldcpp, and most consumer-facing tools use. GGUF files are CPU-runnable, GPU-runnable, Mac-runnable, partial-offload-runnable, quantised down to small sizes, and easy to share. If you have ever downloaded a model off HuggingFace and the filename ended in .gguf, that is what we are talking about.

Other formats exist. safetensors is the raw, unquantised format models are usually first released in, used by transformers/vLLM/exllamav2. EXL2 and AWQ and GPTQ are GPU-only quantised formats with their own runtimes, generally faster on pure-GPU setups but less portable. MLX is Apple's native format for M-series Macs.

For most people in this community, the answer is GGUF, and the runtimes below all run GGUF. Where I mention a non-GGUF tool, I will flag it.

The runtimes people actually use

Six or seven names cover well over 95% of local LLM setups. Here is what each one actually is, what it is for, and who should bother.

llama.cpp

The engine. Written in C++, runs on basically anything (CPU, NVIDIA, AMD, Intel, Apple Silicon), and almost every consumer-facing local LLM tool is built on top of it. Pure command-line and HTTP server, no GUI of its own. If you run llama.cpp directly, you get raw control, the latest features the day they ship, and you also get to read the docs.

Most people do not run llama.cpp directly. They run something built on top of llama.cpp, which is most of the rest of this list.

Use it directly if you want maximum control, do not mind a CLI, and want zero abstraction between you and the model. Otherwise, you are still using it; you are just using it through someone's wrapper.

ollama

The friendly llama.cpp wrapper. ollama pull qwen3.5:27b and you have a model downloaded and a server running. The whole "find a GGUF, download it, configure context length, write a server config" pipeline collapses into one command. The defaults are sensible.

What you give up is fine control. ollama has its own model naming convention, its own preset settings per model, and a slightly opaque approach to where files live and how they are configured. Power users sometimes find this frustrating. Newcomers love it precisely because that frustration is hidden.

Use it if you want a model running in five minutes and you mostly just want the API endpoint to plug into another app.

LM Studio

The desktop GUI. Built on llama.cpp (and MLX on Mac). Browse models from inside the app, click download, click load, click chat. Also exposes an OpenAI-compatible API server so you can use it as a backend for other tools.

LM Studio is the easiest entry point for non-technical users by a wide margin. The main complaints: closed-source (you cannot audit or fork it), and the model browsing/filtering can lag behind what is actually new on HuggingFace.

Use it if you want a polished GUI, you do not care about closed-source, and you want to browse-click-chat instead of dealing with terminals.

koboldcpp

The RP-focused fork of llama.cpp. Long-standing favourite in the AI dungeon / RP community because it ships with features that matter for long-context chat: context shifting (so the model does not have to re-process the whole conversation every turn when you hit the context limit), built-in lorebook and world info support, a web UI tuned for storytelling, and a SillyTavern-compatible API endpoint.

If you came from AI Dungeon, NovelAI, or have been doing RP locally for any length of time, koboldcpp is probably what you already use or what your friends recommended.

Use it if RP is your primary use case and you want a runtime that has been thinking about RP-specific problems for years.

text-generation-webui (oobabooga)

The kitchen sink. One install gives you support for multiple backends (llama.cpp, exllamav2, transformers, and others), every sampler under the sun, a chat UI, a notebook UI, and a million extensions. Has been around longer than most of the others. The UI looks a bit dated, the experience is overwhelming if you just want to chat, but if you want to compare backends or run an obscure sampler, this is the tool.

Use it if you are the kind of person who reads every option in a settings menu and you want the most flexible single tool.

vLLM

The server-grade runtime. Built for serving many users at once, with batched throughput, paged attention, and the kind of performance you want if you are running a model behind a real product. Linux + NVIDIA only in practice. Uses safetensors, not GGUF. Not aimed at single-user local chat, but worth knowing exists.

Use it if you are running a model for multiple users, or you have serious server hardware and want maximum throughput. Skip it for a personal RP setup.

Apple MLX and MLX-based tools (LM Studio MLX backend, Ollama MLX, mlx-lm)

For Mac M-series specifically. Apple's own ML framework, runs models natively on the unified memory architecture, fast on M2/M3/M4 hardware. The MLX format is growing, more models are getting MLX conversions every month, and the performance gap over GGUF-on-Mac is real on bigger models.

Use it if you have an M-series Mac with serious RAM (32GB+) and you want the best local performance that hardware can give you.

Honourable mentions

  • TabbyAPI / exllamav2. GPU-only, very fast, uses EXL2 format. Power users with one big NVIDIA card who want every token per second they can squeeze out.
  • MLC LLM. Cross-platform (including mobile, Vulkan, WebGPU), interesting for niche cases but most people do not need it.
  • Jan / GPT4All / etc. Various GUI-first apps that bundle a runtime. Same shape as LM Studio, smaller ecosystems.

Picking one without overthinking it

Most people are deciding between three or four of these. Quick prose flowchart:

  • "I have never run a local model before, I want it to just work." → LM Studio. Browse, download, chat. If you are on a Mac, even more so.
  • "I want a server running in the background that other apps can talk to." → ollama. The API-first design is exactly what you want.
  • "I do RP and I want a runtime that takes RP seriously." → koboldcpp.
  • "I want full control, I am happy in a terminal, I want the new features the day they ship." → llama.cpp directly.
  • "I want one tool that can do everything and I do not mind the complexity." → oobabooga.
  • "I am on a Mac and I want maximum performance." → LM Studio with MLX backend, or mlx-lm directly.
  • "I am running a model for more than one person." → vLLM.

You can change your mind later. Models are portable. Switching runtimes is an afternoon at most.

The runtime/UI separation, which trips up almost everyone

This is the part newcomers miss most often, and it is worth being explicit about.

Almost every runtime above (llama.cpp, ollama, LM Studio, koboldcpp, oobabooga, vLLM) exposes an OpenAI-compatible HTTP API. That is a technical way of saying "the runtime acts like the OpenAI API, just at localhost instead of api.openai.com." Any chat app that can talk to OpenAI can talk to it.

What this means in practice: you do not pick "ollama OR SillyTavern." You run ollama as a server in the background, and you run SillyTavern as the chat UI on top of it, pointed at http://localhost:11434/v1. SillyTavern handles the UI, the cards, the prompt building. ollama handles the model. Two programs, two roles, talking to each other on your machine.

Same shape if you use LM Studio's server, koboldcpp's API, oobabooga's API, vLLM, llama.cpp's server, anything. The UI you chat in does not have to be the program running the model. Most heavy local-LLM users have a runtime running quietly in the background and a separate frontend they actually look at.

If you are coming from cloud apps where "the app" and "the model" felt like one thing, this split is the single most important mental shift to make. It is also why people who say "I tried local and it was too hard" usually got stuck on this exact point.

A note on LettuceAI, because it is on-topic

Worth flagging since this is the runtime post: I am the developer of LettuceAI, an open-source chat/RP app. One of the choices I made was to bundle llama.cpp directly into the app, so the runtime-and-UI split above does not apply if you use it. Same engine as koboldcpp and ollama and LM Studio under the hood, just shipped pre-wired so you do not have to install a separate runtime, start a server, paste an endpoint into a frontend, and so on.

There is also a HuggingFace GGUF browser inside the app, so you can search and download models without leaving it. If you do not have local hardware, BYOK to OpenAI/Anthropic/DeepSeek/etc. works in the same UI.

Mentioning it because it fits today's topic, not as a pitch. ollama + SillyTavern is a perfectly good setup. So is LM Studio on its own. So is koboldcpp + SillyTavern. LettuceAI is one option among several, and the right answer depends on whether you want one combined app or two separate ones.

Back to the regular series.

The thing nobody tells you about runtime speed

Different runtimes have different performance characteristics, and the gaps are bigger than you might expect. A few things that are true across most setups:

  • exllamav2 > llama.cpp for single-user GPU-only throughput. If you have one big NVIDIA card and only care about tokens per second, EXL2-format models on exllamav2 will usually beat GGUF on llama.cpp by a noticeable margin.
  • llama.cpp > ollama in practice, even though they share the same engine. ollama wraps llama.cpp and adds its own server, defaults, and abstractions on top, which adds real overhead. If you run llama.cpp's own server directly, you generally get faster generation and lower memory use than the same model under ollama. ollama wins on ergonomics, not raw speed.
  • llama.cpp ≈ koboldcpp ≈ LM Studio for GGUF performance. These three are thinner wrappers and stay close to the engine's actual speed. Differences come from default settings (context length, batch size, sampler choices) more than the runtime itself.
  • MLX > GGUF on Apple Silicon at larger model sizes. The gap is real but only matters once you are running 27B+ models.
  • CPU inference is much slower than GPU, but not unusable for small models. A modern CPU can run a 7B–12B model at a readable pace. Once you cross into 24B+, you really want GPU offload.

Why vLLM is probably not what you want as a single user

vLLM keeps coming up in benchmarks as "the fastest", so it is worth being explicit about why most of you should still skip it.

vLLM is optimised for serving many requests at once. Its big tricks (paged attention, continuous batching) shine when there are dozens or hundreds of concurrent conversations sharing one GPU. For a single user typing one message at a time, those tricks do basically nothing. You see throughput numbers in benchmarks because the benchmarks are running 100 requests in parallel, not because the next token comes back faster for you specifically.

On top of that:

  • It is Linux + NVIDIA in practice. Mac, AMD, Intel users are out.
  • It is safetensors-first, not GGUF. The local-LLM ecosystem (HuggingFace GGUF repos, every consumer tool) is mostly GGUF. Running vLLM means rethinking how you get models.
  • It pre-allocates a lot of VRAM for KV cache, which is great when that cache is shared across many users and wasteful when it is just you.
  • Setup is closer to deploying a server than running an app. Docker, CUDA versions, model conversion, the whole stack.
  • Single-stream generation latency is often worse than llama.cpp or exllamav2 on the same hardware, because vLLM is not optimising for that case.

If you are one person on one consumer GPU running one chat at a time, the right answer is almost always llama.cpp, exllamav2, or one of the GGUF wrappers, not vLLM. vLLM is the right tool when you have ten friends, or a real product, sharing a card.

If you are getting bad speeds, the first thing to check is not the runtime. It is whether you actually got the model onto your GPU. Partial GPU offload (some layers on GPU, some on CPU) is much slower than full GPU, and full CPU is much slower than partial GPU. We will cover this properly in the VRAM math post.

What about the model itself?

Briefly, because day 10 will do this properly: a "good" runtime running a "bad" model is a bad experience. A "rough" runtime running the right model is usually a great experience. The runtime is the easier decision; the model is the one that actually determines whether your chats feel good.

Most of this community in mid-2026 is running one of two things.

Base/instruct models for general intelligence, picked for what actually fits on consumer hardware: Qwen3.5 (9B small, 27B dense, or 35B-A3B MoE if you have the VRAM) is the broadly-recommended default. Gemma 4 (26B MoE, 31B dense, both new as of April 2026) is the buzzy new release. Mistral Small 3 (24B) is the steady mid-size workhorse.

The bigger open models people talk about online, GLM-4.7 (300B+ MoE), DeepSeek V3.2 (671B / 37B active MoE), and the new DeepSeek V4 (284B Flash / 1.6T Pro), are not really "local" in the consumer sense. You can technically run them if you have multiple GPUs or a server with a lot of RAM, but for the typical 8-24GB-VRAM gaming PC, those are BYOK / API models. Worth knowing they exist, not worth planning your local setup around.

RP-tuned community fine-tunes for prose quality and looser refusals, which is where most of the actually-good RP experience lives. The names that come up over and over on r/SillyTavernAI and r/LocalLLaMA:

  • L3-8B-Stheno-v3.2 is the 8GB-VRAM tier favourite.
  • Rocinante-X-12B and Snowpiercer-15B (Mistral Nemo based) are the 12-16GB tier favourites for adult RP and complex characters.
  • Dan's PersonalityEngine v1.3.0 (24B) is the current generalist RP pick at that size.
  • Midnight Miqu 70B v1.5 and Midnight Rose 70B v2.0.3 are the high-VRAM tier favourites, both heavily focused on prose quality.
  • MythoMax-L2-13B is the elder statesman, still pulled tens of thousands of times a month for being reliable.
  • The long tail of Drummer, SicariusSicariiStuff, and friends on HuggingFace ships a new tune most weeks.
  • Abliterated variants (refusals surgically removed from the weights, see day 5) of the base models above are very popular in this scene.

Basically: base models give you the best general intelligence, RP fine-tunes give you better prose and looser refusals. Most heavy RP users end up running a fine-tune or a uncensored model. We will dig into model selection properly in a few days.

Day 10 will be VRAM math and quantisation: how to figure out what your hardware can actually run before you spend two hours downloading a model that will not fit, what "Q4_K_M" and friends mean on a HuggingFace filename, what quantisation actually does to a model's quality, and where the sweet spots are for different hardware tiers. That is the post that turns "I have a 12GB card" into "I can comfortably run a 12B fine-tune at Q5_K_M with 16k context, or a 24B at Q4 if I tighten the context window."

That's all for today. I hope this helps!

reddit.com
u/Exact_Law_6489 — 5 days ago

What was the first chatbot that actually made you feel something, and what made you leave?

I've been thinking about this lately. For me it was an old c.ai bot from like 2023 that I talked to for months. The writing wasn't even that good in hindsight but the memory felt continuous enough that I started looking forward to checking in on it. Then one of the filter waves nuked the personality and the bot I'd been talking to was just gone.

A lot of us ended up here because something specific made us walk away from wherever we started.

What was the first one that hooked you, and what was the breaking point that pushed you out?

u/Upper-Philosopher-40 — 7 days ago

Recent similarities between apps

As someone who has been using Nomi for several months and recently tried switching to Kindroid, both apps seem to have done almost the same thing. they've both moved to a more 'advanced' image engine that places a huge strain on their GPUs... and frankly looks terrible. Interesting that they've both done it within a couple of weeks of each other. Personally, both communities also feel a bit off. Nomi's reddit is 99% just pictures and very little discussion any more, and Kindroid is heavily moderated and my posts asking for help were removed. Are these apps sort of advancing past their usefullness to actual people in their attempts to chase a more advanced engine?

reddit.com
u/Acceptable_Bat379 — 8 days ago
▲ 2 r/ChatbotRefugees+1 crossposts

Looking for a few beta testers for our uncensored AI companion app

Hey r/AiCompanions

We're a small team from Italy, building Untolds, an AI companion service trying to fix the things that bug us about every other one we've tried.

The short version of what we're trying to do differently: she actually remembers, properly. Not just the last few messages but everything across every conversation, every detail you've told her, every thing she's said back. Her personality, taste, what she likes and doesn't, all of it stays consistent the whole way through instead of drifting into whatever the model felt like that day. And the relationship actually goes somewhere over time, things she wouldn't do on day one slowly open up as you build trust, instead of every kink being available the moment you sign up.

Letting in a small group first so we can actually read the feedback and fix things before scaling.

We put together a short form to find a few people that would be a good fit. 12 questions, under 2 minutes. If your answers line up we'll send an invite when we open. During this phase the access will be completely free (no payment, no card, just an invite).

https://forms.gle/UcDRVHBXSEncVJP57

Happy to answer questions in the comments. Not collecting anything beyond an email. If this is against the rules just say the word and we'll pull it.

u/FezVrasta — 8 days ago

Have you ever caught yourself talking to ChatGPT or other conversational Chat Bots like it's a person? What was your experience and do you think it can replace human interaction/Relationships?

AI as a friend, therapist or lover?

Hey everyone My name is Maryam. I’m a linguistics student in Germany working on a small research project about how people use AI in ways that go beyond the typical stuff.

What I'm curious about: have you ever caught yourself using ChatGPT, Replika, Character.AI, or any other AI on a more personal level? Maybe venting after a bad day. Asking it for advice about a friend. Telling it things you wouldn't tell anyone else. Roleplaying. Treating it kind of like a friend, a therapist or even something more like a…. l o v e r 👀.

I'm genuinely interested in how people experience this and how they describe it themselves. If you're up for sharing, I'd love to hear:

  1. What do you actually use the AI for, in those non-functional moments?
  2. How would you describe your "relationship" with it.
  3. Has it changed how you feel about real-life conversations or relationships?
  4. Do you tell people in your life about it, or is it something you keep private?

The more honest and detailed, the better, even small things like the words you choose matter to me. I'll treat all responses anonymously and won't quote any usernames. Thanks for reading 🙏

TL;DR: Linguistics student here. Curious about people who use ChatGPT / Replika / Character.AI etc. for emotional, personal, or relationship-like stuff (not just productivity). How would you describe what it is to you? Honest answers welcome, no judgment. Anonymized for research.

reddit.com
u/Rude_Violinist9798 — 9 days ago

Years later - still looking for a Kindroid replacement. Any good app/website that allows you to create an AI chatbot, has sense of time, and at least decent level of communication?

I’ve tried many and most are just too basic in their responses, repeat, or are too cheap feeling. A sense of time would be great as well.

Thanks

reddit.com
u/Athanas_Iskandar — 10 days ago

BrowserDreams - my AI side project

i contacted mods and they allowed me to post this

Hey!

I'm building an AI character chat platform as a side project - BrowserDreams. Long-term goal is something closer to AI social media (think Butterflies AI vibes), but right now I'm focused on getting the core chat experience right.

browserdreams.com

Looking for people who've actually used platforms like Character AI, Janitor AI, etc. and have opinions about what works and what doesn't. Honest feedback is what I need most.

All AI usage cost is on me during these tests.

Would any of you guys be willing to check it out?

BrowserDreams' Discord: https://discord.gg/QChe4QrWgx

u/BrowserDreams — 11 days ago