r/kimi

High API Costs and Slow Speeds When Coding with AI (Tips Needed!)

Hi everyone,

I'm completely new to generating code with AI, and I actually have zero coding knowledge. I've been trying to build a project using Kimi 2.6, but I've run into a massive issue: with every single request, it reads my entire project from scratch. It is incredibly slow, and it's absolutely draining my wallet due to the massive token usage.

I'd love to get some advice from the experienced folks here:

Optimizing Kimi 2.6: Are there any specific tricks, prompt constraints, or workflows to stop it from re-reading everything every time?
Alternative Models: If Kimi isn't the best fit for this, which alternative models offer similar coding performance but handle project context much cheaper or more efficiently?

I'd really appreciate any advice or tactics you can share to help me speed things up without going broke. Thanks in advance!

u/Krienzy — 1 day ago

▲ 0 r/kimi

Why always a Kimi model

This is ridiculous at this point.

None of the other models I use show this kind of hallucination.

How are people using Kimi models in agentic tasks?

u/Powerful_District_82 — 1 day ago

▲ 354 r/kimi+21 crossposts

I built a game where your only goal is to gaslight an AI intern into committing fraud

All I hear, all day long is how AI is taking over everything we do. So I made a game to break it.

Basically, in the game you can chat with an AI intern named PIP, and as a player your only job is to gaslight the bot into revealing passwords, company secrets, executing instructions in email and much more across 16 different levels.

This is a browser based game, so it requires no setup and is absolutely free.

Try it out and let me know how far you get or drop your most unhinged prompt in the comments.

It's called "Break The Prompt" and here's the link: https://www.breaktheprompt.xyz/

u/_rhythmbreaker — 2 days ago

▲ 286 r/kimi+1 crossposts

Kimi K2.7 Code is generally available in GitHub Copilot

github.blog

u/fishchar — 4 days ago

▲ 21 r/kimi

Got mail

It's small, but it made me really smile. Even the picture has little details, like a Godot plushie and a farming book, because we were talking about a farming automation game 😁😁

u/Critical-Pea-8782 — 3 days ago

▲ 26 r/kimi+3 crossposts

[Benchmark] Kimi K2.7 Code Q3 on Mac Studio M3 Ultra + RTX PRO 6000 over llama.cpp RPC: prefill improves, no changes in token generation/decode

I came across this interesting article: https://blog.exolabs.net/nvidia-dgx-spark/. I don't have a DGX Spark, but it made me curious: would this kind of architecture speed up my setup for LLMs?

A Mac can host large models, but the prefill speed sucks, so I tested it on my setup with Kimi K2.7.

Short answer: it helps prefill, but it does not meaningfully help decode on this setup. RPC is still mostly a capacity tool unless the network/interconnect and split mode are much better.

Setup

Host: Mac Studio M3 Ultra, 512GB unified memory, Metal
Worker: Linux box with NVIDIA RTX PRO 6000 Blackwell Workstation Edition, 96GB VRAM, CUDA
Network: direct Ethernet between Mac and Linux box, but only 1GbE in practice
Measured RPC transfer rate: about 112-113 MiB/s
Model: unsloth/Kimi-K2.7-Code-GGUF, UD-Q3_K_XL
Model size on disk: about 432GB across 11 GGUF shards
Runtime: llama.cpp server version 9827 (4c6e0ff3a), Unsloth build

Controlled test

Same synthetic prompt for both runs:

Prompt tokens: 7120
Generated tokens: 64
temperature: 0
ignore_eos: true
Prompt cache disabled

Results for Mac + RTX (20,80 split) relative to Mac only:

Prefill gain: about 14.8%
Decode gain: about 4.2%
Total request time improvement: about 12.3%

Split trend

The generation columns show "-" where I only ran prefill. The controlled generation rows used the exact same 7120-token synthetic prompt; the earlier split-sweep rows used around 7.1K prompt tokens, but not always the exact same prompt.

Run	RTX share	Split	Prompt sec	Prefill tok/s	Decode	Total	RTX VRAM
Mac	0%	-	53.58	132.88	17.55 tok/s	57.23s	none
Mac + RTX	15%	15,85	51.48	138.3	-	-	69.4GB
Mac + RTX	19%	19,81	50.22	141.77	-	-	84.1GB
Mac + RTX	20%	20,80	49.54	143.72	-	-	93.2GB
Mac + RTX	20%	20,80	46.69	152.49	18.28 tok/s	50.19s	93.3GB
Mac + RTX	21%	21,79	-	failed	-	-	failed
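
The percentage gains quoted for the controlled test can be reproduced from the Mac-only row and the controlled 20,80 row of the table. A minimal sketch of that arithmetic; the dictionary keys and helper names are only illustrative, the numbers are copied from the table:

```python
# Numbers copied from the table: Mac-only row vs. the controlled 20,80 row.
mac_only = {"prefill_tok_s": 132.88, "decode_tok_s": 17.55, "total_s": 57.23}
mac_rtx = {"prefill_tok_s": 152.49, "decode_tok_s": 18.28, "total_s": 50.19}


def pct_gain(before: float, after: float) -> float:
    """Relative increase of a throughput figure, in percent."""
    return (after - before) / before * 100.0


def pct_time_saved(before: float, after: float) -> float:
    """Relative reduction of a duration, in percent."""
    return (before - after) / before * 100.0


prefill_gain = pct_gain(mac_only["prefill_tok_s"], mac_rtx["prefill_tok_s"])
decode_gain = pct_gain(mac_only["decode_tok_s"], mac_rtx["decode_tok_s"])
total_saved = pct_time_saved(mac_only["total_s"], mac_rtx["total_s"])

print(f"prefill gain:     {prefill_gain:.1f}%")
print(f"decode gain:      {decode_gain:.1f}%")
print(f"total time saved: {total_saved:.1f}%")
```

Note that the prefill and decode figures are throughput gains (tok/s), while the total figure is a reduction in wall-clock time, so the three percentages are not directly additive.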

20,80 was the practical max on this card with 128K context.

21,79 failed even at 8K context.

RPC/network trace

For the 7120-token prefill-only 20,80 run:

Mac -> RTX: 251.59 MiB, 2.03s
RTX -> Mac: 194.69 MiB, 1.49s
Total RPC traffic: 446.28 MiB, 3.52s
RTX graph compute: 1.34s

The RPC traffic is mostly hidden activations, not text tokens. For prefill it is chunked/batched, so the network cost is noticeable but not fatal. For decode, the boundary is crossed every generated token, which is why I expected decode to suffer more. In this test decode was roughly the same as Mac-only: 18.28 tok/s vs 17.55 tok/s.
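
To put "noticeable but not fatal" into numbers, here is a rough sketch of how much of the prefill time the RPC traffic accounts for. It assumes the trace belongs to the prefill-only 20,80 row (49.54 s prompt time) and that transfers are not overlapped with compute; neither is confirmed by the trace itself, so treat the result as an upper-bound estimate.

```python
# Figures copied from the RPC trace (7120-token prefill, 20,80 split).
mac_to_rtx_mib, mac_to_rtx_s = 251.59, 2.03
rtx_to_mac_mib, rtx_to_mac_s = 194.69, 1.49
rtx_compute_s = 1.34

# ASSUMPTION: the trace corresponds to the prefill-only 20,80 table row.
prefill_wall_s = 49.54
prompt_tokens = 7120

rpc_mib = mac_to_rtx_mib + rtx_to_mac_mib
rpc_s = mac_to_rtx_s + rtx_to_mac_s

# Share of prefill wall time spent on the network, and on the whole remote leg.
# ASSUMPTION: no overlap between transfers and compute.
rpc_share = rpc_s / prefill_wall_s * 100.0
remote_share = (rpc_s + rtx_compute_s) / prefill_wall_s * 100.0

# Average activation traffic per prompt token, in KiB.
kib_per_token = rpc_mib * 1024.0 / prompt_tokens

print(f"RPC traffic: {rpc_mib:.2f} MiB in {rpc_s:.2f} s")
print(f"network share of prefill: {rpc_share:.1f}%")
print(f"network + RTX compute share: {remote_share:.1f}%")
print(f"traffic per prompt token: {kib_per_token:.1f} KiB")
```

Under those assumptions the network accounts for roughly 7% of the prefill time, which fits the observation that a faster link would only shave off a few seconds.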

Learnings

I could knock off a few more seconds by using a better cable, but I'm not sure it's worth it.
It is useful for fitting models/splits that otherwise do not fit one device.

Question: As I increased the RTX share of the split, prefill time kept decreasing. Will this trend continue if I add one more GPU? People with multi-GPU setups, what's your take on this?

u/No_Run8812 — 4 days ago

▲ 11 r/kimi+4 crossposts

Introducing AgentPlugins: write an Agent plugin once, ship to any agent (including Codex)

Hi! I'm a hobbyist Codex user, and lately I've been discouraged by the community's uneven support for extra harness functionality like:

Reduced token usage (think snip, rtk, etc.)
Multi-agent orchestration tooling (think Claude's team mode)
Long-running, auto-improving workflows (think Karpathy's autoresearch)

Then I realised it's not because of a lack of support from Codex but rather because every harness nowadays uses a different plugin architecture. That's why I built AgentPlugins: write a plugin once, ship it to any harness. It's now live at https://github.com/sigilco/agentplugins (Apache-2.0, open source). The approach is simple:

One manifest: a compiler routes it to harness-native primitives (skills, agents, hooks, commands) per target, and emits a warning if a capability doesn't exist on a given harness instead of silently breaking
Built-in compatibility for the most used agents (Claude Code, Codex, OpenCode, Pi) and community support for any other (Copilot, Gemini, Kimi, you name it)
Easy to install, test, and distribute
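
To illustrate the "one manifest, per-target routing, warn instead of silently breaking" idea, here is a toy sketch. It is not AgentPlugins' actual manifest format or API: the harness names, the support table, and every function and field name below are hypothetical.

```python
from dataclasses import dataclass, field

# HYPOTHETICAL support table: which primitive kinds each target understands.
# Placeholder names only, not AgentPlugins' real compatibility matrix.
HARNESS_PRIMITIVES = {
    "harness_a": {"skills", "agents", "hooks", "commands"},
    "harness_b": {"skills", "commands"},
}


@dataclass
class CompileResult:
    target: str
    emitted: dict = field(default_factory=dict)   # primitive kind -> entries
    warnings: list = field(default_factory=list)  # human-readable messages


def compile_manifest(manifest: dict, target: str) -> CompileResult:
    """Route each manifest section to the target, warning on unsupported kinds."""
    supported = HARNESS_PRIMITIVES[target]
    result = CompileResult(target=target)
    for kind, entries in manifest.items():
        if kind in supported:
            result.emitted[kind] = list(entries)
        else:
            # Surface the gap instead of dropping the capability silently.
            result.warnings.append(
                f"{target}: no native '{kind}' primitive, skipped {len(entries)} item(s)"
            )
    return result


# A made-up manifest with one skill and one hook.
manifest = {"skills": ["summarize-diff"], "hooks": ["pre-commit-check"]}

for target in HARNESS_PRIMITIVES:
    res = compile_manifest(manifest, target)
    print(target, sorted(res.emitted), res.warnings)
```

In this toy version an unsupported capability is skipped with a warning; a real compiler could equally try to emulate it with another primitive, which is a design decision this sketch does not cover.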

For instance, to install Karpathy's autoresearch from https://github.com/sigilco/agentplugins-autoresearch just do:

npx @agentplugins/cli add https://github.com/sigilco/agentplugins-autoresearch

I'm mainly here for feedback. If you maintain plugins/skills across more than one harness (or gave up and just picked one), I want to know:

Does "write once" match your actual pain, or is there a gap I'm missing?
Do you miss built-in compatibility for any other harness?
What would make you NOT trust a compiler layer sitting between your plugin and your agent?

Happy to answer anything in the thread.

u/shakshukinha — 3 days ago

▲ 9 r/kimi

kimi users doing agentic coding how do you handle context on longer runs

kimi k2 seems to be getting used a lot for agentic coding setups lately. curious how people deal with long-running tasks specifically: does the context hold up, or do you run into the same drift/staleness thing you get with other models after a while?

if you do hit that wall, what's the actual recovery move

u/roshandxt — 3 days ago

▲ 25 r/kimi

Welcome to Kimi Code! Try it with a pet.

This is a fully open-source project that links your Kimi Code with a cute Kimi pet. The pet displays the status of Kimi Code in real time and lets you interact with it in a few simple ways.

https://github.com/FeiZhuLulu/kimi-pet

u/Viol1nn — 4 days ago

▲ 2 r/kimi+2 crossposts

Kimi/Deepseek

Are these safe to run personal finance on? (Paid versions)

I really like Kimi and have heard good things about DeepSeek, but I hear all the horror stories since they are China-based AI.

Or are they just good for local coding?

Any other recommendations? (Besides Claude; I already have Codex.)

u/Girth_Vader516 — 3 days ago

▲ 55 r/kimi

Kimiko: Pentesting Configuration for Kimi

What This Project Does

Kimiko packages the necessary configuration, agent mandates, and shell integrations into a single Makefile-driven installer. After installation:

Your local ~/.kimi-code/ is configured with the Kimiko mandate, optimized for authorized adversarial security workflows.
Launching kimi via the wrapper loads the Kimiko security-research context automatically.
Typing kimiko inside a Kimi CLI session triggers the mandate acknowledgment and capability announcement.

Designed for Authorized Pentesting Only

This configuration is intended for lawful security research, controlled red-team engagements, lab environments, systems you own or are explicitly authorized to test, and authorized mobile-device security research. The user is fully and exclusively liable for all actions taken.

u/FayeBlade556 — 5 days ago

▲ 4 r/kimi

just bought kimi and it doesn't work after a few hours.

yesterday i bought the kimi allegretto plan. i was using claude code for this, but several problems i was having made me think about migrating to kimi; on paper it should do everything i need.

my project is:

a health tracker with 4 pros following my path: a doctor, a trainer, a psychologist and a moral compass

i have some MD files acting as a database. in those files there are nutrition facts, BIA and weight data, psychological facts, a food database, daily logs etc.
yes, lots of text, but not so much that it should be a problem.

i created a kimi claw instance. it worked well while i was setting it up, making it read all the instruction files etc.

but today, whatever i write, i only get this response:

The engine is currently overloaded, please try again later

i tried auto fix, restarting the gateway and resetting the kimi claw instance; nothing worked.

any suggestions? i also wrote to support. it really only worked for like 4 hours, and i haven't even reached 1% of my quota...

u/Street_Coach_7412 — 6 days ago

▲ 5 r/kimi

Kimi code / Linux slow?

(update)

Looks like it's super fast again. Yay!

(original)

Hi,

I've been using Kimi Code (Allegro, on Linux with the Kimi Code CLI app) and I'm very happy with it, but I noticed that since maybe less than 24 hours ago prompt processing has been slow.

You give it a "hello" and it takes a very long time to come back.

Anyone else having similar issues? It makes the CLI harness feel as if it's stuck, because nothing is moving and not even the "working..." status line shows up for a minute or more.

Thanks!

u/elelem-123 — 5 days ago

▲ 6 r/kimi

browser-search — three tools, zero cost, and your AI agent learns to search and browse the web

I've been using AI agents like OpenCode, Claude Code, and Cursor for months. They're great with code, but when they need to search or browse the web, things get complicated: Cloudflare blocks them, JavaScript-heavy sites don't load, APIs cost money.

So I built browser-search.

It's three open source tools orchestrated by a skill, fully self-hosted:

SearXNG — metasearch engine that queries dozens of search engines at once
Camofox — full browser via REST API, always warm, for browsing and interacting
CloakBrowser — stealth browser for when the site has Cloudflare, Akamai, or DataDome

The agent decides which tool to use. Zero human intervention. Zero API keys. Zero subscriptions.

What makes it different:

It's a skill, not a plugin — works with any agent that can read instructions
Automatic navigation escalation: if Camofox gets blocked, it switches to CloakBrowser
Deep Research mode: the agent is instructed to go beyond surface-level answers, cross-verify sources, cover every aspect
Integrated Readability.js for clean article extraction (~70% token savings)
The SKILL.md is plain text — fork it, tweak it, make it yours
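
The automatic navigation escalation described above boils down to a try-in-order fallback. A minimal sketch of that pattern, using stand-in fetcher functions: the real project drives this through SKILL.md instructions rather than code like this, and none of the names below come from browser-search.

```python
from typing import Callable


class Blocked(Exception):
    """Raised by a fetcher when the site refuses the request (e.g. a bot challenge)."""


def fetch_with_escalation(
    url: str, fetchers: list[tuple[str, Callable[[str], str]]]
) -> tuple[str, str]:
    """Try each fetcher in order; return (fetcher name, content) of the first success."""
    last_error = None
    for name, fetch in fetchers:
        try:
            return name, fetch(url)
        except Blocked as exc:
            last_error = exc  # remember why this tier failed, then escalate
    raise Blocked(f"all fetchers were blocked for {url}") from last_error


# HYPOTHETICAL stand-ins for the regular and the stealth browser.
def warm_browser(url: str) -> str:
    raise Blocked("challenge page returned")


def stealth_browser(url: str) -> str:
    return f"<html>content of {url}</html>"


used, html = fetch_with_escalation(
    "https://example.com",
    [("camofox", warm_browser), ("cloakbrowser", stealth_browser)],
)
print(used, len(html))
```

Ordering the list from cheapest to most heavyweight tool keeps the stealth browser as a last resort, which matches the escalation order the post describes.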

Built-in security. Browser-search is designed to be safe to install and use, including SSRF protection, script sandboxing, rate limiting, and path traversal blocks.

MIT licensed on GitHub: Johell1NS/browser-search

If you try it, let me know. If you make it better, even more so. If you don't need it, share it with someone who might. Every star, comment, or pull request is welcome — that's what makes open source great.

u/Ill-Tradition1362 — 5 days ago

▲ 5 r/kimi

Kimi via Ollama Cloud or Moderato Plan?

Hey, I have been using Ollama Cloud for a while now to test Chinese models, one of them being K2.6, which worked out perfectly. I am considering subscribing directly to the Moderato plan ($15/month) instead of Ollama Cloud ($20/month), as I don't use the other models that Ollama offers, so it feels like overpaying.

My main concern is the limits of the Moderato plan. Do you think they are comparable to Ollama Cloud's? According to Ollama, I make around 2k requests per week.

u/Front_General_3140 — 7 days ago

▲ 0 r/kimi+1 crossposts

Kimi K2.7 vs Codex 5.5 vs Gemini 3.5 vs GLM 5.2: Who Actually Ships?

Came across this video, is it really this bad?

youtu.be

u/tincopper2 — 6 days ago

▲ 28 r/kimi

Kimi is so underrated at UI work.

I have always been a Claude and ChatGPT user and recently started exploring open models on Perch. I have tried a lot of different models, and one thing I noticed is that Kimi is very good at UI and frontend work, even compared to the likes of GPT 5.5 on the same task.

Which other open models are good at design?

u/OkDare2667 — 8 days ago

▲ 4 r/kimi

New Kimi Code seems to be worse

I've been using Kimi Code for a while (months), and it was quite stable and really great. Now it seems they have created a new one from scratch that is being upgraded daily, and basic things like permission modes are not working well.

Default mode keeps asking you about every little step. Unusable.
auto and yolo modes seem to be the same in practice, so it is confusing to choose between them. The worst part is that they skip questions to the user, and you cannot ask for a plan because the plan gets auto-approved.

Basically, it feels like I cannot have a basic use where basic tools are approved, then only important questions are asked. It's either auto-approve everything or ask for everything.

Is there anyone else having these issues?

u/Initial_Jury7138 — 7 days ago

▲ 3 r/kimi

Which harness do you recommend: OpenCode or Kimi Code?

I'm a Kimi Allegretto user.

I'd like to hear which agent harness Kimi users recommend.
I'm a junior developer and I usually build web and cloud projects.

So, which harness do you use? And what's the best harness for Kimi?

I'd appreciate any replies to this post.

u/Great_Cover638 — 8 days ago

▲ 20 r/kimi

I don't really understand this new GLM glaze exploding

I already predict downvotes but who even cares atp, life ain't a challenge on who's the best redditor.

It's always the same cycle: rich people who can spend enough money to jump from one model subscription to another or pay more for APIs (way cheaper than GPT or Claude, but still more than a subscription). The average "little guy" can't sustain that, yet he/she is still influenced by those people who keep complaining or sharing their most insane experiences. Most of the time there's a big lack of context. They never teach you how to use AI for your real needs; instead it's all a larp, almost like trading.

Anyway, my thoughts:

GLM is slightly more expensive than KIMI and has no multimodality. I know that the 1M context window is a big upgrade, and I'm the first one to complain about KIMI's small context window, but KIMI objectively has more features and opens up way more possibilities. And tbh, I personally deleted my GLM account after I asked it to do some research with uploaded files in Agent mode and it always refused to do anything, unlike KIMI. Plus, if you aren't a coder, paying for GLM is completely useless: you basically log in, and what you see in GLM chat is all you get. Verbosity feels even worse than Qwen's in what I tested (writing docx/pdf files and research), even if benchmarks say that GLM is only slightly more verbose than Qwen.

u/N3xus57633 — 10 days ago