flat-rate AI subscriptions hide a pretty wild cost-to-value mismatch, and image generation is the issue.
the spread in what users actually cost on the same plan is easily 10-100x. someone running agentic coding loops doing real work pays the same $20 as someone generating hundreds of images a day. the GPUs don’t care which one you are. the provider averages cost across the whole user base and posts one price.
image generation is one of the heaviest GPU workloads in consumer AI. sustained, iterative, on expensive hardware. at public API rates a single high-quality image costs about as much as dozens of coding turns.
if you priced this purely on cost, image gen would be gated or metered. it isn’t, because it’s the cleanest demo AI has. visual, shareable, a non-technical person gets it in five seconds. coding, reasoning, and agentic workflows are where the durable value lives, but they don’t go viral.
so the heaviest workload on the platform is also the most viral one, and it gets subsidized to drive signups. which means you, paying for a plan you mostly use for code or research, are partly underwriting someone else’s image marathon.
if image gen got stripped out or strictly metered, base subscription prices would probably come down, GPU pressure would ease across the industry, and more compute would point at workloads people actually use for work. you’d also lose the easiest demo AI has, which is why nobody is going to do it.
it isn’t going anywhere. just worth being honest about what’s happening. image generation isn’t free, it’s subsidized by everyone else on the plan, and the heaviest workload in the stack is the one being given away.
so i have an rtx 2060 (6gb) and got tired of every “local AI” project assuming i had a 4090. wanted one app where i could chat with a local LLM, generate images, do RAG over my own data, and fine-tune a model without juggling four repos. so i built my-lm.
> electron + python desktop app, MIT licensed, runs entirely offline.
what’s in it:
• chat with qwen 2.5 / llama 3.2 / phi-3.5, streaming + system prompt
• SDXL image gen with live latent previews every 2 steps via TAESDXL (watching it materialize is genuinely cool), compel for long prompts, ADetailer-style face fix, 4× ESRGAN upscale
• BookMind — RAG book recommender using mongo atlas $vectorSearch with grounded LLM explanations (model can only mention books from the retrieved set, no hallucinated titles)
• QLoRA fine-tuning UI with live loss/epoch/LR streaming from a TrainerCallback, and a one-click “merge adapter” button. no notebook required.
• first-launch modal detects missing models and one-click installs from HF — no git lfs nightmares
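for anyone wondering how the BookMind grounding could work, here’s a minimal sketch, not the code from the repo. it builds an atlas $vectorSearch aggregation stage, numbers the retrieved books in the prompt, and rejects any answer that cites a number outside that set. the index name, the field names, the 20x candidate oversampling, and the cite-by-number convention are all assumptions made up for illustration.

```python
import re


def build_vector_search_pipeline(query_vector, index_name="book_embeddings",
                                 path="embedding", limit=5):
    """Build an aggregation pipeline using Atlas $vectorSearch.

    index_name and path are hypothetical; use whatever your collection has.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                # search more candidates than we return (assumed 20x)
                "numCandidates": limit * 20,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                "author": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def build_grounded_prompt(question, books):
    """Number the retrieved books so the model can cite them as [n]."""
    lines = [f"[{i}] {b['title']} by {b['author']}"
             for i, b in enumerate(books, start=1)]
    return (
        "Recommend books for the request below. Only mention books from "
        "this list, and cite each one by its number, like [1].\n\n"
        + "\n".join(lines)
        + f"\n\nRequest: {question}"
    )


def cited_ids_are_grounded(answer, books):
    """True if the answer cites at least one book and every cited
    number points at a book in the retrieved set."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and all(1 <= n <= len(books) for n in cited)
```

you’d pass the pipeline to `collection.aggregate(...)`, feed the results into `build_grounded_prompt`, and retry or drop the answer when `cited_ids_are_grounded` returns False. note this only validates citations; it won’t catch a title the model names without citing a number.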
the 6gb-is-fine part is what i’m most proud of. some of the tricks:
• SDXL uses enable_model_cpu_offload(); text encoders only hit gpu during compel encoding then go back
• face detailer reuses the base pipe’s weights instead of loading a second SDXL
• 4× upscale runs tiled at 384px with 32px overlap
• training is 4-bit NF4 + gradient checkpointing
• gpu-aware model catalog hides anything that won’t fit your card by default
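to make the tiled upscale trick concrete, here’s a rough sketch of how tile boxes with a fixed overlap can be computed. the 384px tile size and 32px overlap come from the list above; the function name and the edge handling (snapping the last tile flush to the image border) are assumptions, not what my-lm necessarily does.

```python
def tile_boxes(width, height, tile=384, overlap=32):
    """Return (left, top, right, bottom) boxes that cover the image.

    Neighbouring tiles share at least `overlap` pixels so the seams
    can be blended after each tile is upscaled on its own.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be >= 0 and smaller than tile")
    stride = tile - overlap

    def starts(size):
        # Image fits in a single tile along this axis.
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile, stride))
        # Snap the last tile flush with the image edge.
        positions.append(size - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

each box would get cropped, run through the upscaler by itself, and pasted back at 4× its coordinates, with the overlapping strip blended to hide seams. peak vram then depends on the tile size rather than the full image size.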
honest caveats: don’t run chat + image + training at the same time, they each want the whole gpu. tested mostly on windows + linux with cuda 12.1; mac is cpu/mps and slow. it’s a v0, expect rough edges.
repo: https://github.com/Azayzel/my-lm
would love feedback or PRs, especially from folks on similarly tiny gpus. curious what other vram tricks people have been using.