u/Otherwise_Flan7339

AI caching might actually become one of the most important ways to reduce token usage.

In normal web dev, caching was mostly about reducing latency and compute. The same request comes again and the cached response gets served.

But with LLMs people rarely send the exact same prompt twice. They just repeat the same intent with slightly different wording over and over again.

Was reading through Bifrost’s semantic caching docs today and they handle this by combining direct cache matching with semantic similarity search, and LiteLLM seems to be exploring pretty similar ideas through semantic caching + Redis-backed caching for repeated LLM workloads.

Makes sense why this is becoming important across AI infra now. Once you have copilots, agents, MCP workflows, support bots, meeting summaries etc running continuously, there’s probably way more duplicated intent underneath than most people realize.

We should probably think more about this, and I’d also love to know what other methods people are using to reduce token usage beyond caching like context pruning or prompt compression.

u/Otherwise_Flan7339 — 20 hours ago

feels like people are giving AI agents production access way too casually.

people are being way too unserious with how they use these tools and even how they’re writing code now lol.

giving agents access to MCP servers, APIs, databases, internal tools, prod workflows etc without properly understanding permissions or security boundaries is kinda insane when you think about it.

and the scary part is most of these workflows are only getting more autonomous.

lowkey makes me wanna restart learning ethical hacking again because this problem is definitely not going away anytime soon

reddit.com
u/Otherwise_Flan7339 — 1 day ago

Posting because every cost breakdown I've seen is either enterprise-scale or a hobbyist's $20 OpenRouter bill. Here's the middle.

Stack: small agent product, around 200K tasks/month, average 8-12 LLM calls per task. Mix of Sonnet for harder work, Haiku for classification, light fallback to GPT.

Monthly:

  • LLM API: ~$5K, give or take $500 month to month. Sonnet is most of it, Haiku is most of the calls.
  • Gateway: one small instance running Bifrost. Both Bifrost and LiteLLM are free and open source so the cost is purely infra. We needed 4 nodes when we were on LiteLLM to handle the same load, dropped to 1 after switching. Whatever your cloud provider charges for that delta.
  • Observability: ~$200/month, self-hosted Grafana + Postgres for traces.
  • Vector DB: ~$80/month, Qdrant on a small instance.

Things that helped:

  • Exact-match caching (not even semantic) cut LLM spend ~25%
  • Killing one verbose tool output ate another ~8%. Model was paying full input cost on the same long tool result every loop.
  • Migrated to Sonnet 4.6 for 1M context. Same window, no surcharge, since 4.6 has 1M GA at standard pricing. The old beta still had the 2x premium until today.

Honest take: at our scale, the LLM API bill is the only one that matters. Everything else is rounding error. Optimizing the proxy or DB before optimizing prompts and caching is procrastination.

What's everyone else's actual breakdown look like? Specifically curious about teams in the 100K-500K tasks/month range. The public numbers above and below this band are everywhere, this band's quiet.

reddit.com
u/Otherwise_Flan7339 — 21 days ago

Boring infra cost breakdown for an LLM agent stack at moderate scale

7 MCP servers, ~90 tools. Last month our token bill jumped 40% with no traffic increase.

Instrumented and the answer was obvious: every tool definition from every server gets injected into context on every request. 90 definitions per call. Tool list was the majority of input tokens.

"Trim your tool list" was the suggested fix. Disabling capability your agents need isn't a fix.

Different pattern converging across vendors. Cloudflare shipped it first as Code Mode (TypeScript runtime). Anthropic's engineering team documented context dropping 150K → 2K tokens on a Drive-to-Salesforce workflow. Bifrost has a Python version.

Idea: don't inject tool definitions. Expose tools as a code interface the model reads on demand and runs in a sandbox. Model loads only what it needs, writes a script, gets the final result. Intermediate defs never enter context.

Bifrost's vendor benchmark shows 58% / 85% / 92% reductions at 96 / 251 / 508 tools. Take vendor numbers with salt.

Our test on prod traffic, 7 servers / 90 tools:

  • Input tokens per task: ~14K → ~5K
  • ~60% reduction in input cost on agent calls
  • No change in task success over 2 weeks

Smaller than the marketing numbers but real.

Caveats: adds sandbox runtime overhead, needs decent tool docstrings, compounding only kicks in past ~5 servers.

Worth measuring if your MCP token bill has outrun your traffic.

reddit.com
u/Otherwise_Flan7339 — 21 days ago

Posting because every cost breakdown I've seen is either enterprise-scale or a hobbyist's $20 OpenRouter bill. Here's the middle.

Stack: small agent product, around 200K tasks/month, average 8-12 LLM calls per task. Mix of Sonnet for harder work, Haiku for classification, light fallback to GPT.

Monthly:

  • LLM API: ~$5K, give or take $500 month to month. Sonnet is most of it, Haiku is most of the calls.
  • Gateway: one small instance running Bifrost. Both Bifrost and LiteLLM are free and open source so the cost is purely infra. We needed 4 nodes when we were on LiteLLM to handle the same load, dropped to 1 after switching. Whatever your cloud provider charges for that delta.
  • Observability: ~$200/month, self-hosted Grafana + Postgres for traces.
  • Vector DB: ~$80/month, Qdrant on a small instance.

Things that helped:

  • Exact-match caching (not even semantic) cut LLM spend ~25%
  • Killing one verbose tool output ate another ~8%. Model was paying full input cost on the same long tool result every loop.
  • Migrated to Sonnet 4.6 for 1M context. Same window, no surcharge, since 4.6 has 1M GA at standard pricing. The old beta still had the 2x premium until today.

Honest take: at our scale, the LLM API bill is the only one that matters. Everything else is rounding error. Optimizing the proxy or DB before optimizing prompts and caching is procrastination.

What's everyone else's actual breakdown look like? Specifically curious about teams in the 100K-500K tasks/month range. The public numbers above and below this band are everywhere, this band's quiet.

reddit.com
u/Otherwise_Flan7339 — 21 days ago
▲ 51 r/AIcosts+2 crossposts

We're 4 months in, 9 MCP servers in prod, 2 more we built ourselves because the community ones were unmaintained.

Quick notes for anyone earlier in the journey.

Half the public servers are dead. The Rapid Claw audit pegged it at 52%, lines up with our experience. We tried 13 community servers, kept 6, forked 2 of those because the maintainers were gone. Treat every MCP install like adding an unmaintained npm package, because most of them are.

Stateful sessions break behind load balancers. Session state lives on the instance handling the connection. We didn't notice until server #5 needed to scale and we couldn't put it behind an ALB without sticky sessions hacks. The 2026 roadmap has this on the fix list. Until then, single-instance deploys.

Tool descriptions are trusted context. Whatever's in the description text goes into the model's context window. Invariant Labs has the canonical poisoning demo. If you're pulling a server you don't control, you should be reading every description before install. We weren't, for the first 7 we added.

Fan-out kills observability. 9 servers means 9 log formats, 9 auth setups, 9 places to look when something breaks. We put a gateway in front around month 3 (we use Bifrost OSS https://github.com/maximhq/bifrost, Docker's MCP Gateway and Microsoft's mcp-gateway are the other ones we looked at). One log stream, centralized auth.

The Azure CVE was the kick we needed. CVE-2026-32211, missing auth on Azure DevOps MCP, CVSS 9.1, disclosed Apr 3. Forced us to actually inventory what each server can touch. Most teams I've asked haven't done this audit. Do it.

Protocol's fine. Ecosystem is the problem.

u/Otherwise_Flan7339 — 22 days ago
▲ 4 r/AIcosts+2 crossposts

Why are people still hardcoding provider SDKs in 2026

Genuine question because I keep running into it.

Was helping a friend debug their agent stack last week. Three different provider SDKs imported directly. Retry logic in five files. A try/except block doing what looked like a poor man's fallback to a different model. This is at a seed funded startup.

I know everyone reading this knows what an LLM gateway is. The pitch hasn't changed in two years. Unified API, fallback, caching, cost tracking, virtual keys, observability. Same talking points across Bifrost, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, take your pick.

But the cost case has actually shifted under us and I don't see people talking about it.

We pulled 30 days of our agent traffic at my last check. Stuff that gateways now solve out of the box that we were hand-rolling:

  • Semantic caching cut our token spend by ~31% on a customer support agent. Repetitive queries we were billing for every single time.
  • Fallback config replaced ~400 lines of provider-specific retry code. We hadn't deleted the old code yet but we will.
  • Per-team virtual keys finally let our finance person stop asking me which prompt cost $1,800 last Tuesday.

If you're 6+ months in and still calling provider SDKs directly, you're paying for that decision in token spend and on-call pages. Should have moved earlier honestly.

reddit.com
u/Otherwise_Flan7339 — 24 days ago

Genuine question because I keep running into it.

Was helping a friend debug their agent stack last week. Three different provider SDKs imported directly. Retry logic in five files. A try/except block doing what looked like a poor man's fallback to a different model. This is at a seed funded startup.

I know everyone reading this knows what an LLM gateway is. The pitch hasn't changed in two years. Unified API, fallback, caching, cost tracking, virtual keys, observability. Same talking points across Bifrost, LiteLLM, Kong AI Gateway, Cloudflare AI Gateway, take your pick.

But the cost case has actually shifted under us and I don't see people talking about it.

We pulled 30 days of our agent traffic at my last check. Stuff that gateways now solve out of the box that we were hand-rolling:

  • Semantic caching cut our token spend by ~31% on a customer support agent. Repetitive queries we were billing for every single time.
  • Fallback config replaced ~400 lines of provider-specific retry code. We hadn't deleted the old code yet but we will.
  • Per-team virtual keys finally let our finance person stop asking me which prompt cost $1,800 last Tuesday.

If you're 6+ months in and still calling provider SDKs directly, you're paying for that decision in token spend and on-call pages. Should have moved earlier honestly.

reddit.com
u/Otherwise_Flan7339 — 24 days ago
▲ 2 r/AIcosts+1 crossposts

DeepSeek R2 came out last week; pricing roughly 70% lower than the Western frontier models we were using. For a pre-seed startup that number matters.

The problem with switching models mid-production: we had LangChain agents with prompts tuned to a specific provider's behavior. Every previous model switch meant updating config, testing, redeploying, and praying nothing broke at 2am. With 3 people on the team that's a half-day minimum.

What we did instead: route through a gateway with weighted routing config. Set R2 to handle 30% of traffic initially, watch error rates and output quality for 48 hours, then bump to 70%. No code changes. No redeploys. If R2 started producing bad outputs we could roll back in 30 seconds by changing a config value.

The 48-hour shadow period caught one prompt that broke badly on R2's tool-call format. Fixed it before it ever hit majority traffic. Would have been a production incident if we'd done a hard cutover.

Bill dropped 41.3% in the first week. Still watching quality metrics but so far no regressions on the tasks that matter.

reddit.com
u/Otherwise_Flan7339 — 1 month ago

Anthropic confirmed their best model won't be public. 50 companies get it. We're not one of them.

Anthropic confirmed Claude Mythos (apparently their most capable model ever built) isn't going public. 50 organizations get access through a gated program called Project Glasswing. That's it.

I understand the reasoning. A model that's reportedly excellent at finding security vulnerabilities doesn't get a public API on day one. The responsible deployment argument is real.

But here's the practical impact for early-stage startups: we're now in a two-tier market. Fifty organizations get to build on capabilities the rest of us can't access. If Mythos is as capable as early reports suggest, those 50 companies have an 18-month head start on whatever product categories require that level of reasoning.

The compounding question nobody's talking about: the organizations with Glasswing access are almost certainly large enterprises, not pre-seed startups. They'll define what the frontier model is actually used for, ship products that set user expectations, and by the time public access opens, the category leaders will be entrenched.

OpenAI went through a version of this with GPT-4 access tiers in 2023. The early-access holders didn't dominate every category, but they owned the initial product narrative.

Nothing actionable here if you're a small team; we don't have the leverage to get into a 50-org whitelist. But if your product roadmap depends on frontier-level reasoning, worth acknowledging that the constraint is structural rather than just a waitlist.

reddit.com
u/Otherwise_Flan7339 — 1 month ago