u/Aromatic_Pumpkin8856

Three weeks ago I posted about switching from Claude Max ($200/mo) to Ollama Cloud Pro ($20/mo) after Anthropic started throttling me mid-session. DeepSeek-v4-pro looked better on the metrics that mattered: lines of code per hour, lines per million tokens.

A bunch of things have changed since then — on both sides of the comparison — and the picture has flipped for my specific workload. That qualifier is doing a lot of work in this post, and I want to be upfront about it: if you're not running a heavy parallel-agent setup with deterministic guardrails, Ollama Cloud is almost certainly still the right answer. I delayed this follow-up specifically so I'd have a clean two-week window to measure against. Here's what the SigNoz data says about May 3 – May 17.

What changed (the confounders, up front)

Seven variables moved between the original post and this one. A fair report has to name them before reading any chart, otherwise we're cherry-picking:

On my side (3 things):

1. Cache-fix proxy running at localhost:9801. Sits in front of the Anthropic API and rewrites requests to maximize prompt-cache hits. Effective input-token cost drops by ~80% on cache-eligible turns. Repo: cnighswonger/claude-code-cache-fix
2. My project git-prism, with semantic analyzers for Go/Python/TS/Rust. Replaces raw git diff / git log reads with structured manifests. Same insight, ~10× fewer tokens per query. Repo: mikelane/git-prism
3. rtk (Rust Token Killer), a transparent Bash wrapper that filters verbose CLI output before it hits the model's context. 60-90% savings on cargo, npm, git status, etc. Repo: rtk-ai/rtk

> Note: I'm testing some recent updates to git-prism that make the setup more transparent, make usage more automatic, and include instructions for making it play well with rtk. If anyone's interested in that, stay tuned.

On Anthropic's side (2 things):

4. SpaceX compute deal, announced May 6. Claude Code 5-hour rate limits doubled for Pro/Max/Team/Enterprise. Peak-hour reductions removed.
5. 7-day window +50% until July: an additional capacity bump on top of the SpaceX one, until the new infra is fully online.

On Ollama's side:

6. Limits tightened, twice, since the original post. Ollama Cloud doesn't bill in tokens; it bills in GPU-time "levels." deepseek-v4-pro is level 4 (extra heavy), the most expensive level they price. Where it used to take me weeks to dent the quota in late April, I now burn the 7-day window in about 5 days on a normal workload. And Kimi K2.6 Pro (a different model I briefly tested) blew through my entire 5-hour window in under 10 minutes.

On my wallet:

7. Upgraded Ollama Cloud Pro → Max ($20/mo → $100/mo) during this window. Max gives 5× Pro's usage allowance and keeps the 3-concurrent-model cap. So when I report below that I'm burning the 7-day window in 5 days on deepseek-v4-pro, that's against the Max quota, not Pro. The same workload at the same model on Pro would have hit the wall in roughly one day.
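The Pro estimate above is plain proportional arithmetic. A minimal sketch, assuming the same workload burns quota at a constant rate and that allowance scales linearly with the plan multiplier (both are simplifying assumptions, and the function name is hypothetical):

```python
def days_until_exhausted(observed_days: float,
                         observed_multiplier: float,
                         target_multiplier: float) -> float:
    """Scale an observed quota runway to a plan with a different allowance.

    Assumes a constant burn rate and a linear relationship between the
    plan multiplier and the usable allowance (an assumption, not a spec).
    """
    # Allowance units (in "Pro-sized" quotas) consumed per day.
    burn_per_day = observed_multiplier / observed_days
    return target_multiplier / burn_per_day


# Max is 5x Pro, and the 7-day window lasted about 5 days on Max.
pro_runway = days_until_exhausted(observed_days=5,
                                  observed_multiplier=5,
                                  target_multiplier=1)
print(f"Estimated Pro runway: {pro_runway:.1f} days")
```

If the real quota formula is not linear in plan tier, the one-day figure is only a rough guide.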

So: I can't honestly attribute the productivity numbers in this post entirely to Anthropic getting better. The cache proxy + git-prism + rtk are doing real work. By the same token, the Reddit baseline can't be blamed entirely on Anthropic, even though the throttling was real and the model was the same one I'd used for months. The honest framing: all seven changes are real, they all compound, and the bottom line is what I ship.

The data — past 2 weeks (May 3 – May 17)

Same definitions as the original post. "Tokens" includes input + output + cache reads + cache creation. "Active hours" is wall-clock summed across concurrent agents (frequently >24h/day because of parallel dispatches — this metric reflects per-agent-hour throughput, not calendar hours).

Date	Tokens (M)	Lines	Active hrs	Tok/hr (M)	Lines/hr	Lines/M-tok	Hypoth. API $
Sun May 3	309.1	17,712	26.88	11.5	659	57.3	$1,299
Mon May 4	408.0	28,106	26.53	15.4	1,059	68.9	$1,432
Tue May 5	274.7	15,488	24.88	11.0	623	56.4	$914
Wed May 6 ⚡ SpaceX	580.1	26,365	16.62	34.9	1,587	45.4	$709
Thu May 7	1,040.6	26,272	13.27	78.4	1,980	25.2	$2,441
Fri May 8	616.8	20,148	16.96	36.4	1,188	32.7	$3,025
Sat May 9	446.8	9,387	8.33	53.7	1,127	21.0	$2,253
Sun May 10	784.6	27,160	17.78	44.1	1,528	34.6	$1,568
Mon May 11	154.9	4,681	7.11	21.8	659	30.2	$426
Tue May 12	310.0	13,816	20.79	14.9	665	44.6	$1,553
Wed May 13	392.0	23,505	30.35	12.9	775	60.0	$1,545
Thu May 14	269.2	15,374	14.26	18.9	1,078	57.1	$1,091
Fri May 15	387.9	18,429	15.24	25.5	1,210	47.5	$861
Sat May 16	547.9	14,515	19.72	27.8	736	26.5	$620
Sun May 17*	80.6	934	1.38	—	—	—	$104

* May 17 partial — only ~1.4 active hours in by query time.

14-day totals (excluding partial May 17): 6.5 billion tokens, 261,000 lines of code, ~258 active agent-hours, $19,734 in hypothetical API spend.
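For anyone who wants to check the derived columns, here is a minimal sketch that recomputes them from the three raw columns (tokens, lines, active hours). The `Day` class and its property names are hypothetical, not part of the SigNoz dashboard, and totals computed from the rounded table values can differ slightly from the dashboard's own.

```python
from dataclasses import dataclass


@dataclass
class Day:
    label: str
    tokens_m: float    # millions of tokens, all token types combined
    lines: int         # lines of code changed
    active_hrs: float  # agent-hours, summed across concurrent agents

    @property
    def tok_per_hr(self) -> float:
        return self.tokens_m / self.active_hrs

    @property
    def lines_per_hr(self) -> float:
        return self.lines / self.active_hrs

    @property
    def lines_per_mtok(self) -> float:
        return self.lines / self.tokens_m


# Raw columns copied from the table above (partial May 17 excluded).
DAYS = [
    Day("May 3", 309.1, 17712, 26.88),
    Day("May 4", 408.0, 28106, 26.53),
    Day("May 5", 274.7, 15488, 24.88),
    Day("May 6", 580.1, 26365, 16.62),
    Day("May 7", 1040.6, 26272, 13.27),
    Day("May 8", 616.8, 20148, 16.96),
    Day("May 9", 446.8, 9387, 8.33),
    Day("May 10", 784.6, 27160, 17.78),
    Day("May 11", 154.9, 4681, 7.11),
    Day("May 12", 310.0, 13816, 20.79),
    Day("May 13", 392.0, 23505, 30.35),
    Day("May 14", 269.2, 15374, 14.26),
    Day("May 15", 387.9, 18429, 15.24),
    Day("May 16", 547.9, 14515, 19.72),
]

total_tokens_m = sum(d.tokens_m for d in DAYS)
total_lines = sum(d.lines for d in DAYS)
total_hours = sum(d.active_hrs for d in DAYS)

for d in DAYS:
    print(f"{d.label:7} {d.tok_per_hr:6.1f} M tok/hr "
          f"{d.lines_per_hr:7.0f} lines/hr {d.lines_per_mtok:6.1f} lines/M-tok")
print(f"Totals: {total_tokens_m / 1000:.1f}B tokens, "
      f"{total_lines:,} lines, {total_hours:.0f} agent-hours")
```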

Side-by-side with the Reddit baseline

The original post quoted five reference days. Here's how three of them compare with this window's best days and its median:

Reference	Tok/hr (M)	Lines/hr	Lines/M-tok
Apr 16 — original Anthropic peak	29.6	1,688	41.5
Apr 21 — throttled	12.9	431	22.3
Apr 29 — Ollama Cloud / DeepSeek	21.5	1,174	54.7
May 7 — post-SpaceX peak (this window)	78.4	1,980	25.2
May 6 — post-SpaceX day 1	34.9	1,587	45.4
2-week median	25.5	1,078	45.0

A few things jump out:

Peak token throughput (78.4M/hr on May 7) is 2.6× the original "Anthropic at its best" reference (29.6M/hr). That's the cache-fix proxy doing the heavy lifting: cache reads cost a fraction of fresh input tokens, so the same agent-hour pushes far more raw tokens through the pipe.
Lines/hr peak (1,980 on May 7) beats every single day in the Reddit post, including the previous Anthropic best of 1,688. This is the cleanest signal that the workflow itself is producing more code, not just spending more tokens. May 6, May 7, May 8, May 10, and May 15 all clear the 1,174 lines/hr of the Ollama Cloud day.
Lines/M-tok dropped, though — median ~45 here vs. 54.7 on the Ollama day. Translation: I'm shipping more total code, but I'm also spending more tokens per line of code. The cache proxy makes those extra tokens cheap, but the efficiency-per-token metric is no longer where DeepSeek was.

This is the trade-off the original post missed: lines/M-tok measures token efficiency, lines/hr measures throughput, and they're not the same thing. DeepSeek won on efficiency. Anthropic-with-the-stack wins on throughput. And throughput is what I actually feel during a session.
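A small sketch makes the split concrete: ranking the same reference rows by each metric picks a different winner. The tuple layout and variable names are hypothetical; the numbers are copied from the comparison table above.

```python
# (label, tok/hr in millions, lines/hr, lines per million tokens)
REFERENCE = [
    ("Apr 16 Anthropic peak", 29.6, 1688, 41.5),
    ("Apr 21 throttled", 12.9, 431, 22.3),
    ("Apr 29 Ollama Cloud / DeepSeek", 21.5, 1174, 54.7),
    ("May 7 post-SpaceX peak", 78.4, 1980, 25.2),
    ("May 6 post-SpaceX day 1", 34.9, 1587, 45.4),
]

# Throughput winner: most lines of code per agent-hour.
best_throughput = max(REFERENCE, key=lambda row: row[2])
# Efficiency winner: most lines of code per million tokens.
best_efficiency = max(REFERENCE, key=lambda row: row[3])

# Peak token throughput relative to the original Anthropic best.
token_ratio = 78.4 / 29.6

print(f"Throughput winner: {best_throughput[0]}")
print(f"Efficiency winner: {best_efficiency[0]}")
print(f"Peak tok/hr vs Apr 16: {token_ratio:.1f}x")
```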

What broke on Ollama's side

Two weeks of trying to keep DeepSeek in the rotation — and not on the $20/mo Pro plan from the original post, but on the $100/mo Max plan, which advertises 5× Pro's usage. Specifically:

Kimi K2.6 Pro experiment (May 9): blew through my entire 5-hour window in under 10 minutes, on Max. Walked away from that model. This is what an even heavier model than deepseek-v4-pro does to the GPU-time quota — and it's the cautionary tale, not the norm.
deepseek-v4-pro 7-day window on Max: exhausted in 5 calendar days this past week on a normal workload. In April, the same workload on the Pro plan struggled to hit 3% of the same window. Five times the budget, dramatically less runway. DeepSeek isn't the problem here — it still paces well within a session — but the weekly envelope has shrunk to the point that a steady week of work will run it out before reset.

Reading Ollama's current docs explains it: cloud usage is measured in GPU-time levels (1–4), not tokens. deepseek-v4-pro is level 4 — the most expensive tier. They've also acknowledged revising limits twice since launch. The 3-concurrent-model cap that I called "annoying but survivable" in the first post is now the binding constraint, because queued requests still count toward the GPU-time budget.

The real story isn't about the models. It's about the cage.

Here's the part I want to land hard, because I think the original post got it wrong and a lot of the responses to it got it wrong too:

You must have a deterministic cage for the stochastic agents.

That's the whole game. It doesn't matter whether the model behind the API is Claude Opus 4.7 or DeepSeek-v4-pro or whatever drops next month. What matters is whether your SDLC pipeline constrains the model's freedom to be wrong:

TDD with tests that have to be RED before they go GREEN
A typecheck/lint/security gauntlet that no PR escapes
An adversarial code reviewer that's a different agent than the one that wrote the code
A merge gate that mechanically verifies the gauntlet ran, not just that the human said "looks good"
Issue-before-implementation discipline so every change has a documented anchor
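The merge-gate item in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not my actual pipeline: the `Check` type, the function names, and the stand-in commands are all hypothetical, and real checks would call your own test runner, linter, and security scanner.

```python
import subprocess
import sys
from typing import NamedTuple


class Check(NamedTuple):
    name: str
    command: list[str]


def run_gauntlet(checks: list[Check]) -> dict[str, bool]:
    """Run every check and record pass/fail from its exit code."""
    results = {}
    for check in checks:
        proc = subprocess.run(check.command, capture_output=True, text=True)
        results[check.name] = proc.returncode == 0
    return results


def merge_allowed(results: dict[str, bool], required: set[str]) -> bool:
    """Allow a merge only if every required check ran AND passed.

    A check that never ran is missing from `results`, so it blocks the merge.
    """
    return all(results.get(name) is True for name in required)


# Stand-in commands so the sketch runs anywhere (hypothetical checks).
CHECKS = [
    Check("tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    Check("lint", [sys.executable, "-c", "import sys; sys.exit(0)"]),
]

results = run_gauntlet(CHECKS)
print(merge_allowed(results, required={"tests", "lint"}))
print(merge_allowed(results, required={"tests", "lint", "security"}))
```

The second call returns False because the hypothetical security check never ran. That is the point of a mechanical gate: a missing check counts as a failure, so nobody can wave a PR through on "looks good".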

If you have that cage, DeepSeek-v4-pro produces output every bit as good as Opus 4.7 — even with Ollama's recent quota changes. I'm not speaking theoretically here. I run both, side by side, daily. The cage catches the things either model gets wrong, and what comes out the other end is interchangeable in quality.

The reason Claude+stack wins for certain parts of my workflow is not that Claude is smarter — it's that I'm running 14+ concurrent agents on time-critical work, and the binding constraint on Ollama Cloud is the 3-concurrent-model cap regardless of plan tier. If you're running one or two agents at a time, that constraint never bites you, and the $20-$100/mo you'd spend on Ollama is real savings over $200/mo Claude Max.
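To show what the cap does to a 14-agent dispatch, here is a toy simulation using a semaphore. It is a sketch only: the sleep stands in for a model turn, the names are hypothetical, and it assumes every turn takes the same time, which real agents never do.

```python
import asyncio

CONCURRENCY_CAP = 3  # the 3-concurrent-model cap
AGENT_COUNT = 14     # parallel agents I typically dispatch


async def run_agent(agent_id: int, slots: asyncio.Semaphore, stats: dict) -> int:
    # Each agent waits here until one of the capped slots is free.
    async with slots:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for one model turn
        stats["running"] -= 1
    return agent_id


async def main() -> tuple[list[int], int]:
    slots = asyncio.Semaphore(CONCURRENCY_CAP)
    stats = {"running": 0, "peak": 0}
    done = await asyncio.gather(
        *(run_agent(i, slots, stats) for i in range(AGENT_COUNT))
    )
    return list(done), stats["peak"]


done, peak = asyncio.run(main())
print(f"{len(done)} agents finished, peak concurrency {peak}")
```

With equal-length turns, 14 agents behind a cap of 3 need five rounds instead of one, which is the queueing cost on time-critical work.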

My actual setup: two lanes, not one

I want to be specific about why I kept the $100/mo Ollama Cloud Max subscription, because "fallback" was the wrong word in an earlier draft and I think the honest answer matters here.

It's not a fallback. It's a second lane. I run both subscriptions in parallel, every day, for different categories of work:

Claude Max ($200/mo) — the fast lane. Anything time-critical: client work, things I need shipped today, debugging a production issue, the heavy-parallelism gauntlet that needs to run 14 agents at once. I'm regularly pushing my Claude token limits as it is — and you can't just buy a second Claude Max account without risking a ban — so I treat that quota as scarce.
Ollama Cloud Max ($100/mo) — the patient lane. Anything where I can wait: an app for my wife, side projects only I'll use, learning exercises, scripts that aren't on a deadline. DeepSeek-v4-pro takes longer per turn and I have to queue agents because of the 3-concurrent cap, but the output is genuinely as good as Opus 4.7 thanks to the cage, and it doesn't burn my scarce Claude budget.

That two-lane split is the reason both subscriptions earn their keep. Drop either one and the other gets overloaded. Total monthly stack: $300/mo for what would cost $400+ if I tried to do all of it on Claude alone — and I'd hit the ban-risk wall trying to buy more Claude capacity anyway.

Who should actually use what

You're running 1-3 agents at a time, working on a single repo, single-language, with a real test suite and a CI pipeline that has teeth: Ollama Cloud Pro ($20/mo) with deepseek-v4-pro. This is the right answer for most people reading this post. You'll never hit the concurrency cap. The cage catches what the model gets wrong. You save 10×.
You're running 5-10 agents, multi-repo, polyglot, willing to engineer a custom token-savings stack: Ollama Cloud Max ($100/mo) is still viable, with caveats. Watch the GPU-time levels carefully. Stick to level 2-3 models for routine work, save level 4 for the heavy lifts.
You have time-critical, high-pressure work AND lower-priority patient work: run both lanes like I do. $200 Claude Max for the fast lane, $100 Ollama Max for the patient lane. Don't try to do everything on one provider — you'll either burn out the quota or pay 3× more than you need to.
You don't have the cage yet: build the cage first. Whichever model you use will produce garbage without it. The model choice is a 10% optimization on top of an 80% problem.

What I'd actually recommend now to anyone reading the first post:

Don't ditch a model based on one bad week — instrument first. Half of what I attributed to "Anthropic got worse" in April was cache misses I could have controlled.
Cache hit rate is the variable that matters most, no matter whose API you're hitting. A naive Claude Code workflow burns input tokens on every turn. A cache-optimized one keeps the same context resident across the whole session.
Build the cage before you optimize the model. TDD discipline, an adversarial review agent, a mechanical merge gate. Without those, switching models is rearranging deck chairs.
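To see why cache hit rate dominates, here is a back-of-envelope sketch. The 0.1 price ratio for cached reads is an assumption chosen for illustration, and the model ignores cache-write surcharges, so check current pricing before relying on it. The function name is hypothetical.

```python
def effective_input_cost(hit_rate: float,
                         cache_read_multiplier: float = 0.1) -> float:
    """Relative input cost versus paying full price for every input token.

    hit_rate: fraction of input tokens served from the prompt cache (0 to 1).
    cache_read_multiplier: assumed price of a cached token relative to a
    fresh one (an illustrative assumption, not a quoted price).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_read_multiplier + (1.0 - hit_rate)


for rate in (0.0, 0.5, 0.9):
    cost = effective_input_cost(rate)
    print(f"hit rate {rate:.0%}: relative input cost {cost:.2f} "
          f"({1 - cost:.0%} saved)")
```

Under these assumptions, a 90% hit rate lands close to the ~80% input-cost drop I see on cache-eligible turns, while a naive workflow at 0% pays full price on every turn.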

I'll keep this dashboard running. Happy to share the SigNoz dashboard JSON if anyone wants to set up their own.

Sources / receipts:

Anthropic — Higher usage limits for Claude and a compute deal with SpaceX
Engadget — Anthropic is doubling Claude Code rate limits after deal with SpaceX
Slashdot — Anthropic Raises Claude Code Usage Limits, Credits New Deal With SpaceX
Ollama — Pricing · Cloud docs
DevToolHub — Ollama Cloud Free vs Pro: Usage Limits, Pricing & What You Get (2026)
ollama/ollama#13089 — Frustrating very limiting max tokens on the cloud models to only 16,384

What happened

I've been a heavy Claude Code user for months. I instrument everything through SigNoz so I have actual data, not vibes. For a long time, Anthropic was excellent — I consistently struggled to hit my token limits. That's the good problem.

Then something changed on Anthropic's side. I don't know exactly what: maybe extended thinking changes, maybe rate limit policy, maybe something else. But around April 17–18, a single /review-pr command started eating through more than half my 5-hour token window. My workflow didn't change. I run the same commands I've run for months. The model behavior did.

The result: Mon Apr 20 and Tue Apr 21, I was getting throttled repeatedly mid-session. Same work — implementation, code review, multi-agent orchestration. Just constantly hitting walls.

At 9:51am PDT today I switched Claude Code to deepseek-v4-pro:cloud via Ollama Cloud's $20/mo Pro plan. Here's what happened.

The data

Two metrics that matter: lines of code changed per hour (real output) and lines per million tokens (efficiency — how much code gets produced per unit of compute).

Session	Avg tok/hr	Lines/hr	Lines/M-tok
Thu Apr 16 — Anthropic at its best	29,555,000	1,688	41.5
Sat Apr 18 — something starting to shift	14,640,000	717	35.0
Mon Apr 20 — throttled, cutting out	10,475,000	289	18.4
Tue Apr 21 — throttled again	12,860,000	431	22.3
Today Apr 29 — Ollama Cloud / DeepSeek	21,466,000	1,174	54.7

Mon and Tue weren't a different type of work. It was the same implementation and review work I've done for months. The throttling is what cut the output.

The part that floored me

DeepSeek on Ollama Pro is slower than Claude. My plan caps me at 3 concurrent cloud models. I regularly run 14+ parallel agents — church review swarms, dual-repo implementations, blog agents, the works. Today all of that had to queue.

And I still produced 1,174 lines/hr — nearly 70% of my best-ever Anthropic session, which had no concurrency limits and a faster model.

Compared to what Anthropic has been delivering this week: 2.7–4× more productive, with more constraints, on a plan that costs 10× less.
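The ratios above come straight from the Lines/hr column of the table. A quick sketch of the arithmetic, with hypothetical variable names:

```python
# Lines/hr values copied from the table above.
ANTHROPIC_BEST = 1688  # Thu Apr 16
THROTTLED_LOW = 289    # Mon Apr 20
THROTTLED_HIGH = 431   # Tue Apr 21
OLLAMA_TODAY = 1174    # Apr 29, Ollama Cloud / DeepSeek

share_of_best = OLLAMA_TODAY / ANTHROPIC_BEST
gain_vs_tuesday = OLLAMA_TODAY / THROTTLED_HIGH
gain_vs_monday = OLLAMA_TODAY / THROTTLED_LOW

print(f"Share of best Anthropic session: {share_of_best:.0%}")
print(f"Gain over throttled days: {gain_vs_tuesday:.1f}x to {gain_vs_monday:.1f}x")
```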

My honest takeaway

DeepSeek is better. Not "competitive." Not "surprisingly good for the price." Better. At least for the way I work — heavy parallelism, large codebases, long multi-agent sessions.

Anthropic's Claude used to be the clear answer for me. Thursday's session proves it can still hit those peaks. But whatever changed in the past week has made it unusable for my workflow at the rate limits they're enforcing. DeepSeek through Ollama Cloud doesn't have that problem.

$200/mo → $20/mo. More productive. Less friction.

I'll keep tracking the data and post a follow-up after a full week.

Follow-up: Two weeks later. I came back to Anthropic — but Ollama Cloud is still probably the right call for most of you. Here's the telemetry