u/Aromatic_Pumpkin8856

Follow-up: Two weeks later. I came back to Anthropic — but Ollama Cloud is still probably the right call for most of you. Here's the telemetry

Three weeks ago I posted about switching from Claude Max ($200/mo) to Ollama Cloud Pro ($20/mo) after Anthropic started throttling me mid-session. DeepSeek-v4-pro looked better on the metrics that mattered: lines of code per hour, lines per million tokens.

A bunch of things have changed since then — on both sides of the comparison — and the picture has flipped for my specific workload. That qualifier is doing a lot of work in this post, and I want to be upfront about it: if you're not running a heavy parallel-agent setup with deterministic guardrails, Ollama Cloud is almost certainly still the right answer. I delayed this follow-up specifically so I'd have a clean two-week window to measure against. Here's what the SigNoz data says about May 3 – May 17.

What changed (the confounders, up front)

Seven variables moved between the original post and this one. A fair report has to name them before reading any chart, otherwise we're cherry-picking:

On my side (3 things):

  1. Cache-fix proxy running at localhost:9801. Sits in front of the Anthropic API and rewrites requests to maximize prompt-cache hits. Effective input-token cost drops by ~80% on cache-eligible turns. cnighswonger/claude-code-cache-fix
  2. My project git-prism with semantic analyzers for Go/Python/TS/Rust. Replaces raw git diff / git log reads with structured manifests. Same insight, ~10× fewer tokens per query. mikelane/git-prism
  3. rtk (Rust Token Killer) — a transparent Bash wrapper that filters verbose CLI output before it hits the model's context. 60-90% savings on cargo, npm, git status, etc. rtk-ai/rtk

> Note: I'm testing some recent updates to git-prism that will make setup more transparent and use more automatic, and I'll add instructions on how to make it play well with rtk. If anyone's interested in that, stay tuned.

On Anthropic's side (2 things):

  4. SpaceX compute deal, announced May 6. Claude Code 5-hour rate limits doubled for Pro/Max/Team/Enterprise. Peak-hour reductions removed.
  5. 7-day window +50% until July. An additional capacity bump on top of the SpaceX one, until the new infra is fully online.

On Ollama's side:

  6. Limits tightened, twice, since the original post. Ollama Cloud doesn't bill in tokens; it bills in GPU-time "levels." deepseek-v4-pro is level 4 (extra heavy), the most expensive level they price. Where it used to take me weeks to dent the quota in late April, I now burn the 7-day window in about 5 days on a normal workload. And Kimi K2.6 Pro (a different model I briefly tested) blew through my entire 5-hour window in under 10 minutes.

On my wallet:

  7. Upgraded Ollama Cloud Pro → Max ($20/mo → $100/mo) during this window. Max gives 5× Pro's usage allowance and keeps the 3-concurrent-model cap. So when I report below that I'm burning the 7-day window in 5 days on deepseek-v4-pro, that's against the Max quota, not Pro. The same workload at the same model on Pro would have hit the wall in roughly one day.
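The "roughly one day on Pro" estimate is plain proportional scaling. A minimal sketch, assuming the burn rate stays constant and that Pro's allowance is exactly one fifth of Max's (the function name is hypothetical, not part of any Ollama tooling):

```python
def runway_days(observed_days: float, allowance_ratio: float) -> float:
    """Scale an observed quota runway to a plan with a different allowance.

    allowance_ratio = (other plan's allowance) / (observed plan's allowance).
    Assumes a constant burn rate, which is a simplification.
    """
    if observed_days <= 0 or allowance_ratio <= 0:
        raise ValueError("observed_days and allowance_ratio must be positive")
    return observed_days * allowance_ratio


# The 7-day window lasted about 5 days on Max; Pro gets 1/5 of Max's allowance.
pro_days = runway_days(5.0, 1 / 5)
print(f"Estimated runway on Pro: {pro_days:.1f} days")
```

Real usage is burstier than this, so treat the result as an order-of-magnitude check rather than a forecast.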

So: I can't honestly attribute the productivity numbers in this post entirely to Anthropic getting better. The cache proxy + git-prism + rtk are doing real work. Nor can the Reddit baseline be blamed entirely on Anthropic. The throttling was real and the model was the same one I'd used for months, but about half of what I attributed to "Anthropic got worse" was cache misses I could have controlled. The honest framing: all seven changes are real, they all compound, and the bottom line is what I ship.

The data — past 2 weeks (May 3 – May 17)

Same definitions as the original post. "Tokens" includes input + output + cache reads + cache creation. "Active hours" is wall-clock summed across concurrent agents (frequently >24h/day because of parallel dispatches — this metric reflects per-agent-hour throughput, not calendar hours).
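For anyone rebuilding the table from their own telemetry, the three derived columns follow directly from the daily totals. A minimal sketch, where the class and field names are hypothetical and only the formulas come from the definitions above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DayStats:
    tokens_m: float      # input + output + cache reads + cache creation, in millions
    lines: int           # lines of code changed
    active_hours: float  # wall-clock hours summed across concurrent agents

    @property
    def tok_per_hr_m(self) -> float:
        return self.tokens_m / self.active_hours

    @property
    def lines_per_hr(self) -> float:
        return self.lines / self.active_hours

    @property
    def lines_per_m_tok(self) -> float:
        return self.lines / self.tokens_m


# Thu May 7 row from the table.
may7 = DayStats(tokens_m=1040.6, lines=26272, active_hours=13.27)
print(round(may7.tok_per_hr_m, 1), round(may7.lines_per_hr), round(may7.lines_per_m_tok, 1))
```

Because active hours are summed per agent, the per-hour columns describe one agent-hour, not one calendar hour.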

| Date | Tokens (M) | Lines | Active hrs | Tok/hr (M) | Lines/hr | Lines/M-tok | Hypoth. API $ |
|---|---|---|---|---|---|---|---|
| Sun May 3 | 309.1 | 17,712 | 26.88 | 11.5 | 659 | 57.3 | $1,299 |
| Mon May 4 | 408.0 | 28,106 | 26.53 | 15.4 | 1,059 | 68.9 | $1,432 |
| Tue May 5 | 274.7 | 15,488 | 24.88 | 11.0 | 623 | 56.4 | $914 |
| Wed May 6 ⚡ SpaceX | 580.1 | 26,365 | 16.62 | 34.9 | 1,587 | 45.4 | $709 |
| Thu May 7 | 1,040.6 | 26,272 | 13.27 | 78.4 | 1,980 | 25.2 | $2,441 |
| Fri May 8 | 616.8 | 20,148 | 16.96 | 36.4 | 1,188 | 32.7 | $3,025 |
| Sat May 9 | 446.8 | 9,387 | 8.33 | 53.7 | 1,127 | 21.0 | $2,253 |
| Sun May 10 | 784.6 | 27,160 | 17.78 | 44.1 | 1,528 | 34.6 | $1,568 |
| Mon May 11 | 154.9 | 4,681 | 7.11 | 21.8 | 659 | 30.2 | $426 |
| Tue May 12 | 310.0 | 13,816 | 20.79 | 14.9 | 665 | 44.6 | $1,553 |
| Wed May 13 | 392.0 | 23,505 | 30.35 | 12.9 | 775 | 60.0 | $1,545 |
| Thu May 14 | 269.2 | 15,374 | 14.26 | 18.9 | 1,078 | 57.1 | $1,091 |
| Fri May 15 | 387.9 | 18,429 | 15.24 | 25.5 | 1,210 | 47.5 | $861 |
| Sat May 16 | 547.9 | 14,515 | 19.72 | 27.8 | 736 | 26.5 | $620 |
| Sun May 17* | 80.6 | 934 | 1.38 | | | | $104 |

* May 17 partial — only ~1.4 active hours in by query time.

14-day totals (excluding partial May 17): 6.5 billion tokens, 261,000 lines of code, ~258 active agent-hours, $19,734 in hypothetical API spend.
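The totals can be re-derived by summing the 14 full days. A quick sketch using the rounded values as printed in the table; because the inputs are rounded, the sums land near the posted totals rather than exactly on them (about 258.7 hours and $19,737 here):

```python
# (tokens in millions, lines, active hours, hypothetical API $), May 3 to May 16.
ROWS = [
    (309.1, 17712, 26.88, 1299),
    (408.0, 28106, 26.53, 1432),
    (274.7, 15488, 24.88, 914),
    (580.1, 26365, 16.62, 709),
    (1040.6, 26272, 13.27, 2441),
    (616.8, 20148, 16.96, 3025),
    (446.8, 9387, 8.33, 2253),
    (784.6, 27160, 17.78, 1568),
    (154.9, 4681, 7.11, 426),
    (310.0, 13816, 20.79, 1553),
    (392.0, 23505, 30.35, 1545),
    (269.2, 15374, 14.26, 1091),
    (387.9, 18429, 15.24, 861),
    (547.9, 14515, 19.72, 620),
]

total_tokens_m = sum(r[0] for r in ROWS)
total_lines = sum(r[1] for r in ROWS)
total_hours = sum(r[2] for r in ROWS)
total_dollars = sum(r[3] for r in ROWS)

print(f"{total_tokens_m / 1000:.1f}B tokens, {total_lines:,} lines, "
      f"{total_hours:.1f} agent-hours, ${total_dollars:,}")
```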

Side-by-side with the Reddit baseline

The original post quoted five reference days. Here are three of them next to this window's peak day, its first post-SpaceX day, and its median:

| Reference | Tok/hr (M) | Lines/hr | Lines/M-tok |
|---|---|---|---|
| Apr 16: original Anthropic peak | 29.6 | 1,688 | 41.5 |
| Apr 21: throttled | 12.9 | 431 | 22.3 |
| Apr 29: Ollama Cloud / DeepSeek | 21.5 | 1,174 | 54.7 |
| May 7: post-SpaceX peak (this window) | 78.4 | 1,980 | 25.2 |
| May 6: post-SpaceX day 1 | 34.9 | 1,587 | 45.4 |
| 2-week median | 25.5 | 1,078 | 45.0 |

A few things jump out:

  • Peak token throughput (78.4M tok/hr on May 7) is 2.6× the original "Anthropic at its best" reference (29.6M). That's the cache-fix proxy doing the heavy lifting: cache reads cost a fraction of fresh input tokens, so the same agent-hour pushes far more raw tokens through the pipe.
  • Lines/hr peak (1,980 on May 7) beats every single day in the Reddit post, including the previous Anthropic best of 1,688. This is the cleanest signal that the workflow itself is producing more code, not just spending more tokens. May 6, May 8, May 10, and May 15 fall short of 1,688 but still clear the Ollama day's 1,174.
  • Lines/M-tok dropped, though — median ~45 here vs. 54.7 on the Ollama day. Translation: I'm shipping more total code, but I'm also spending more tokens per line of code. The cache proxy makes those extra tokens cheap, but the efficiency-per-token metric is no longer where DeepSeek was.

This is the trade-off the original post missed: lines/M-tok measures token efficiency, lines/hr measures throughput, and they're not the same thing. DeepSeek won on efficiency. Anthropic-with-the-stack wins on throughput. And throughput is what I actually feel during a session.
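One way to see why the two metrics diverge: in this window's table, lines/hr is the product of tok/hr and lines/M-tok, since all three come from the same daily totals. A day can post low efficiency and still win on throughput if the token rate is high enough. A small sketch using two rows from the table (the function name is just for illustration):

```python
def lines_per_hour(tok_per_hr_m: float, lines_per_m_tok: float) -> float:
    """Throughput = token rate (M tok/hr) x token efficiency (lines per M tok)."""
    return tok_per_hr_m * lines_per_m_tok


# (tok/hr in millions, lines/M-tok, reported lines/hr) from the table.
rows = {
    "May 7, highest token rate": (78.4, 25.2, 1980),
    "May 13, high efficiency": (12.9, 60.0, 775),
}

for label, (rate, efficiency, reported) in rows.items():
    derived = lines_per_hour(rate, efficiency)
    print(f"{label}: derived {derived:.0f} lines/hr, reported {reported}")
```

May 13 is more than twice as token-efficient as May 7, yet May 7 ships about 2.5× the lines per agent-hour. The small gaps between derived and reported values come from rounding in the table.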

What broke on Ollama's side

Two weeks of trying to keep DeepSeek in the rotation — and not on the $20/mo Pro plan from the original post, but on the $100/mo Max plan, which advertises 5× Pro's usage. Specifically:

  • Kimi K2.6 Pro experiment (May 9): blew through my entire 5-hour window in under 10 minutes, on Max. Walked away from that model. This is what an even heavier model than deepseek-v4-pro does to the GPU-time quota — and it's the cautionary tale, not the norm.
  • deepseek-v4-pro 7-day window on Max: exhausted in 5 calendar days this past week on a normal workload. In April, the same workload on the Pro plan struggled to hit 3% of the same window. Five times the budget, dramatically less runway. DeepSeek isn't the problem here — it still paces well within a session — but the weekly envelope has shrunk to the point that a steady week of work will run it out before reset.

Reading Ollama's current docs explains it: cloud usage is measured in GPU-time levels (1–4), not tokens. deepseek-v4-pro is level 4 — the most expensive tier. They've also acknowledged revising limits twice since launch. The 3-concurrent-model cap that I called "annoying but survivable" in the first post is now the binding constraint, because queued requests still count toward the GPU-time budget.

The real story isn't about the models. It's about the cage.

Here's the part I want to land hard, because I think the original post got it wrong and a lot of the responses to it got it wrong too:

You must have a deterministic cage for the stochastic agents.

That's the whole game. It doesn't matter whether the model behind the API is Claude Opus 4.7 or DeepSeek-v4-pro or whatever drops next month. What matters is whether your SDLC pipeline constrains the model's freedom to be wrong:

  • TDD with tests that have to be RED before they go GREEN
  • A typecheck/lint/security gauntlet that no PR escapes
  • An adversarial code reviewer that's a different agent than the one that wrote the code
  • A merge gate that mechanically verifies the gauntlet ran, not just that the human said "looks good"
  • Issue-before-implementation discipline so every change has a documented anchor
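To make "mechanically verifies" concrete, here is a minimal sketch of a merge gate. It is not my actual pipeline: the check names, status strings, and function are all hypothetical. The point it illustrates is that a check that never ran counts as a failure, and no human approval can override that.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical names for the gauntlet stages described in the list above.
REQUIRED_CHECKS = ("tests", "typecheck", "lint", "security", "adversarial_review")


def merge_allowed(
    check_results: Dict[str, str], issue_id: Optional[str]
) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons). Missing checks block the merge just like failures."""
    reasons: List[str] = []
    if not issue_id:
        reasons.append("no linked issue")
    for name in REQUIRED_CHECKS:
        status = check_results.get(name)
        if status is None:
            reasons.append(f"{name}: did not run")
        elif status != "passed":
            reasons.append(f"{name}: {status}")
    return (not reasons, reasons)


# A PR where the security scan was skipped is blocked even though nothing failed.
ok, why = merge_allowed(
    {"tests": "passed", "typecheck": "passed", "lint": "passed",
     "adversarial_review": "passed"},
    issue_id="ISSUE-1",
)
print(ok, why)
```

The gate is deterministic on purpose: it reads recorded results and never asks a model whether the change looks fine.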

If you have that cage, DeepSeek-v4-pro produces output every bit as good as Opus 4.7 — even with Ollama's recent quota changes. I'm not speaking theoretically here. I run both, side by side, daily. The cage catches the things either model gets wrong, and what comes out the other end is interchangeable in quality.

The reason Claude+stack wins for certain parts of my workflow is not that Claude is smarter — it's that I'm running 14+ concurrent agents on time-critical work, and the binding constraint on Ollama Cloud is the 3-concurrent-model cap regardless of plan tier. If you're running one or two agents at a time, that constraint never bites you, and the $20-$100/mo you'd spend on Ollama is real savings over $200/mo Claude Max.

My actual setup: two lanes, not one

I want to be specific about why I kept the $100/mo Ollama Cloud Max subscription, because "fallback" was the wrong word in an earlier draft and I think the honest answer matters here.

It's not a fallback. It's a second lane. I run both subscriptions in parallel, every day, for different categories of work:

  • Claude Max ($200/mo) — the fast lane. Anything time-critical: client work, things I need shipped today, debugging a production issue, the heavy-parallelism gauntlet that needs to run 14 agents at once. I'm regularly pushing my Claude token limits as it is — and you can't just buy a second Claude Max account without risking a ban — so I treat that quota as scarce.
  • Ollama Cloud Max ($100/mo) — the patient lane. Anything where I can wait: an app for my wife, side projects only I'll use, learning exercises, scripts that aren't on a deadline. DeepSeek-v4-pro takes longer per turn and I have to queue agents because of the 3-concurrent cap, but the output is genuinely as good as Opus 4.7 thanks to the cage, and it doesn't burn my scarce Claude budget.

That two-lane split is the reason both subscriptions earn their keep. Drop either one and the other gets overloaded. Total monthly stack: $300/mo for what would cost $400+ if I tried to do all of it on Claude alone — and I'd hit the ban-risk wall trying to buy more Claude capacity anyway.

Who should actually use what

  • You're running 1-3 agents at a time, working on a single repo, single-language, with a real test suite and a CI pipeline that has teeth: Ollama Cloud Pro ($20/mo) with deepseek-v4-pro. This is the right answer for most people reading this post. You'll never hit the concurrency cap. The cage catches what the model gets wrong. You save 10×.
  • You're running 5-10 agents, multi-repo, polyglot, willing to engineer a custom token-savings stack: Ollama Cloud Max ($100/mo) is still viable, with caveats. Watch the GPU-time levels carefully. Stick to level 2-3 models for routine work, save level 4 for the heavy lifts.
  • You have time-critical, high-pressure work AND lower-priority patient work: run both lanes like I do. $200 Claude Max for the fast lane, $100 Ollama Max for the patient lane. Don't try to do everything on one provider — you'll either burn out the quota or pay 3× more than you need to.
  • You don't have the cage yet: build the cage first. Whichever model you use will produce garbage without it. The model choice is a 10% optimization on top of an 80% problem.

What I'd actually recommend now to anyone reading the first post:

  1. Don't ditch a model based on one bad week — instrument first. Half of what I attributed to "Anthropic got worse" in April was cache misses I could have controlled.
  2. Cache hit rate is the variable that matters most, no matter whose API you're hitting. A naive Claude Code workflow burns input tokens on every turn. A cache-optimized one keeps the same context resident across the whole session.
  3. Build the cage before you optimize the model. TDD discipline, an adversarial review agent, a mechanical merge gate. Without those, switching models is rearranging deck chairs.
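A rough way to reason about point 2: treat input cost as a blend of cached and fresh tokens. The sketch below is a simplification under stated assumptions. The 0.1 price multiplier for cached reads is an illustrative parameter, not a quoted price, and it ignores any premium charged for writing to the cache, so check your provider's current pricing before relying on it.

```python
def effective_input_cost(cache_hit_rate: float, cached_read_multiplier: float) -> float:
    """Blended input cost per token, relative to paying full price for every token.

    cache_hit_rate: fraction of input tokens served from cache (0.0 to 1.0).
    cached_read_multiplier: price of a cached token relative to a fresh one
        (an assumed parameter; cache-write premiums are ignored here).
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    if cached_read_multiplier < 0:
        raise ValueError("cached_read_multiplier must be non-negative")
    fresh_share = 1.0 - cache_hit_rate
    return cache_hit_rate * cached_read_multiplier + fresh_share


for hit_rate in (0.0, 0.5, 0.9):
    cost = effective_input_cost(hit_rate, cached_read_multiplier=0.1)
    print(f"hit rate {hit_rate:.0%}: {cost:.2f}x the uncached input cost")
```

Under these assumptions a 90% hit rate cuts input cost to about a fifth, which is in the same range as the ~80% savings on cache-eligible turns mentioned earlier.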

I'll keep this dashboard running. Happy to share the SigNoz dashboard JSON if anyone wants to set up their own.


Sources / receipts:

u/Aromatic_Pumpkin8856 — 6 days ago

What happened

I've been a heavy Claude Code user for months. I instrument everything through SigNoz so I have actual data, not vibes. For a long time, Anthropic was excellent — I consistently struggled to hit my token limits. That's the good problem.

Then something changed on Anthropic's side. I don't know exactly what, maybe extended thinking changes, maybe rate limit policy, maybe something else. But around April 17–18, a single /review-pr command started eating through more than half my 5-hour token window. My workflow didn't change. I run the same commands I've run for months. The model behavior did.

The result: Mon Apr 20 and Tue Apr 21, I was getting throttled repeatedly mid-session. Same work — implementation, code review, multi-agent orchestration. Just constantly hitting walls.

At 9:51am PDT today I switched Claude Code to deepseek-v4-pro:cloud via Ollama Cloud's $20/mo Pro plan. Here's what happened.


The data

Two metrics that matter: lines of code changed per hour (real output) and lines per million tokens (efficiency — how much code gets produced per unit of compute).

| Session | Avg tok/hr | Lines/hr | Lines/M-tok |
|---|---|---|---|
| Thu Apr 16: Anthropic at its best | 29,555,000 | 1,688 | 41.5 |
| Sat Apr 18: something starting to shift | 14,640,000 | 717 | 35.0 |
| Mon Apr 20: throttled, cutting out | 10,475,000 | 289 | 18.4 |
| Tue Apr 21: throttled again | 12,860,000 | 431 | 22.3 |
| Today Apr 29: Ollama Cloud / DeepSeek | 21,466,000 | 1,174 | 54.7 |

Mon and Tue weren't a different type of work. It was the same implementation and review work I've done for months. The throttling is what cut the output.


The part that floored me

DeepSeek on Ollama Pro is slower than Claude. My plan caps me at 3 concurrent cloud models. I regularly run 14+ parallel agents — church review swarms, dual-repo implementations, blog agents, the works. Today all of that had to queue.

And I still produced 1,174 lines/hr — nearly 70% of my best-ever Anthropic session, which had no concurrency limits and a faster model.

Compared to what Anthropic has been delivering this week: 2.7–4× more productive, with more constraints, on a plan that costs 10× less.


My honest takeaway

DeepSeek is better. Not "competitive." Not "surprisingly good for the price." Better. At least for the way I work — heavy parallelism, large codebases, long multi-agent sessions.

Anthropic's Claude used to be the clear answer for me. Thursday's session proves it can still hit those peaks. But whatever changed in the past week has made it unusable for my workflow at the rate limits they're enforcing. DeepSeek through Ollama Cloud doesn't have that problem.

$200/mo → $20/mo. More productive. Less friction.

I'll keep tracking the data and post a follow-up after a full week.

u/Aromatic_Pumpkin8856 — 24 days ago

The 5h bar in the strip above is red despite being only half-full. The bars are colored by pace, not fill: 50% used with only 49 minutes elapsed in a 5-hour window (~16%) puts me ~34 percentage points ahead, which trips red. On the weekly bar, 85% used by day 6 would be fine; 85% by day 3 would mean I'm on track to blow through the window before it resets.

Computed as used% - elapsed%:

  • ≤ 0 → green (on or under pace)
  • up to 15 percentage points ahead → yellow
  • more than 15 → red

So: 5h bar at 50% used and ~16% elapsed = ~34 points ahead → red. 7d bar at 9% used and ~3.5% elapsed = ~5 points ahead → yellow. The ctx bar is fill-based; context isn't a time window.
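The rule is small enough to sketch in a few lines. This is a Python illustration of the thresholds above, not the bash from the writeup, and the function names are mine:

```python
def elapsed_pct(elapsed: float, window: float) -> float:
    """Percent of the window that has elapsed (any unit, as long as both match)."""
    return 100.0 * elapsed / window


def pace_color(used_pct: float, elapsed_percent: float) -> str:
    """Color a usage bar by how far usage is ahead of elapsed time."""
    ahead = used_pct - elapsed_percent  # in percentage points
    if ahead <= 0:
        return "green"   # on or under pace
    if ahead <= 15:
        return "yellow"  # up to 15 points ahead
    return "red"         # more than 15 points ahead


# 5h bar: 50% used, 49 of 300 minutes elapsed.
print(pace_color(50, elapsed_pct(49, 300)))
# 7d bar: 9% used, ~3.5% elapsed.
print(pace_color(9, 3.5))
# Weekly budget: 85% used by day 6 of 7 versus by day 3 of 7.
print(pace_color(85, elapsed_pct(6, 7)), pace_color(85, elapsed_pct(3, 7)))
```

Subtracting elapsed from used keeps the signal in percentage points, so the same thresholds work for the 5-hour and 7-day windows without rescaling.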

Anthropic's official usage panel and ccusage both report how much I've used. Neither tells me whether I'm using it too fast. Without a color signal, you can't tell from a 50% reading whether you're on pace or about to crash through the limit early. The pace bars answer that question: am I about to get rate-limited mid-task?

Full writeup with the bash and the thresholds here

u/Aromatic_Pumpkin8856 — 27 days ago

Welcome to r/PracticalAIDev

This community is for developers, makers, and technical builders using AI to create real things.

There are plenty of places online to argue about models, complain about rate limits, or recycle hot takes. This is not one of them.

r/PracticalAIDev is for practical, useful discussion around AI in software development:

* Projects you’ve built with AI help

* Coding workflows that genuinely save time

* Tool reviews (Claude Code, Cursor, Copilot, etc.)

* Prompting techniques that improve results

* Architecture and engineering tradeoffs

* Useful AI news for developers

* Questions from people trying to get better

What We Value

* Building over complaining

* Signal over noise

* Helpful feedback over ego

* Real examples over vague claims

* Curiosity over gatekeeping

What to Do First

  1. Introduce yourself in the comments
  2. Share what you’re building or learning
  3. Ask a practical question
  4. Post something useful for other developers

Founding Members Matter

You’re here early, which means you help shape what this place becomes.

Let’s build a community worth checking every day.

u/Aromatic_Pumpkin8856 — 1 month ago