
Follow-up: Two weeks later. I came back to Anthropic — but Ollama Cloud is still probably the right call for most of you. Here's the telemetry
Three weeks ago I posted about switching from Claude Max ($200/mo) to Ollama Cloud Pro ($20/mo) after Anthropic started throttling me mid-session. DeepSeek-v4-pro looked better on the metrics that mattered: lines of code per hour, lines per million tokens.
A bunch of things have changed since then — on both sides of the comparison — and the picture has flipped for my specific workload. That qualifier is doing a lot of work in this post, and I want to be upfront about it: if you're not running a heavy parallel-agent setup with deterministic guardrails, Ollama Cloud is almost certainly still the right answer. I delayed this follow-up specifically so I'd have a clean two-week window to measure against. Here's what the SigNoz data says about May 3 – May 17.
What changed (the confounders, up front)
Seven variables moved between the original post and this one. A fair report has to name them before reading any chart, otherwise we're cherry-picking:
On my side (3 things):
1. Cache-fix proxy running at `localhost:9801`. Sits in front of the Anthropic API and rewrites requests to maximize prompt-cache hits. Effective input-token cost drops by ~80% on cache-eligible turns. cnighswonger/claude-code-cache-fix
2. My project git-prism with semantic analyzers for Go/Python/TS/Rust. Replaces raw `git diff`/`git log` reads with structured manifests. Same insight, ~10× fewer tokens per query. mikelane/git-prism
3. rtk (Rust Token Killer): a transparent Bash wrapper that filters verbose CLI output before it hits the model's context. 60-90% savings on `cargo`, `npm`, `git status`, etc. rtk-ai/rtk
> Note: I'm testing some recent updates to git-prism that should make setup more transparent and everyday use more automatic, and that add instructions for making it play well with rtk. If anyone's interested in that, stay tuned.
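To make the "~80% on cache-eligible turns" figure concrete, here is a back-of-envelope sketch. The 0.10 read multiplier (cache reads billed at a tenth of the fresh input price) is an assumption for illustration, cache-write premiums are ignored, and `relative_input_cost` is a hypothetical helper, not something the proxy exposes.

```python
def relative_input_cost(hit_rate: float, read_multiplier: float = 0.10) -> float:
    """Input cost relative to a run with no cache hits (1.0 = full price).

    hit_rate: fraction of input tokens served from the prompt cache.
    read_multiplier: assumed price of a cached token vs. a fresh one.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * read_multiplier + (1.0 - hit_rate)


for hit_rate in (0.0, 0.5, 0.9):
    saving = 1.0 - relative_input_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: input cost down {saving:.0%}")
```

Under these assumptions, a hit rate around 90% is what it takes to land near an 80% reduction, which is why the proxy's job is mostly to keep the cached prefix stable from turn to turn.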
On Anthropic's side (2 things):
4. SpaceX compute deal, announced May 6. Claude Code 5-hour rate limits doubled for Pro/Max/Team/Enterprise. Peak-hour reductions removed.
5. 7-day window +50% until July: an additional capacity bump on top of the SpaceX one, until the new infra is fully online.
On Ollama's side:
6. Limits tightened, twice, since the original post. Ollama Cloud doesn't bill in tokens — it bills in GPU-time "levels." deepseek-v4-pro is level 4 (extra heavy), the most expensive level they price. Where it used to take me weeks to dent the quota in late April, I now burn the 7-day window in about 5 days on a normal workload. And Kimi K2.6 Pro (a different model I briefly tested) blew through my entire 5-hour window in under 10 minutes.
On my wallet:
7. Upgraded Ollama Cloud Pro → Max ($20/mo → $100/mo) during this window. Max gives 5× Pro's usage allowance and keeps the 3-concurrent-model cap. So when I report below that I'm burning the 7-day window in 5 days on deepseek-v4-pro, that's against the Max quota — not Pro. The same workload at the same model on Pro would have hit the wall in roughly one day.
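The "roughly one day on Pro" estimate is plain proportional scaling. A minimal sketch, assuming the burn rate is constant and the runway scales linearly with the allowance (the function and constant names are illustrative):

```python
def days_to_exhaust(reference_days: float, allowance_ratio: float) -> float:
    """Scale time-to-exhaustion by the ratio of usage allowances.

    Assumes a constant burn rate, so runway is proportional to allowance.
    """
    return reference_days * allowance_ratio


MAX_RUNWAY_DAYS = 5.0    # observed: 7-day window gone in about 5 days on Max
PRO_VS_MAX = 1.0 / 5.0   # Max gives 5x Pro's usage allowance

pro_runway = days_to_exhaust(MAX_RUNWAY_DAYS, PRO_VS_MAX)
print(f"Estimated Pro runway: {pro_runway:.1f} days")
```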
So: I can't honestly attribute the productivity numbers in this post entirely to Anthropic getting better. The cache proxy + git-prism + rtk are doing real work. But the Reddit baseline can't be blamed entirely on Anthropic either: the throttling was real, but the model was the same one I'd used for months, and part of what I saw in April was cache misses I could have controlled. The honest framing: all seven changes are real, they all compound, and the bottom line is what I ship.
The data — past 2 weeks (May 3 – May 17)
Same definitions as the original post. "Tokens" includes input + output + cache reads + cache creation. "Active hours" is wall-clock summed across concurrent agents (frequently >24h/day because of parallel dispatches — this metric reflects per-agent-hour throughput, not calendar hours).
| Date | Tokens (M) | Lines | Active hrs | Tok/hr (M) | Lines/hr | Lines/M-tok | Hypoth. API $ |
|---|---|---|---|---|---|---|---|
| Sun May 3 | 309.1 | 17,712 | 26.88 | 11.5 | 659 | 57.3 | $1,299 |
| Mon May 4 | 408.0 | 28,106 | 26.53 | 15.4 | 1,059 | 68.9 | $1,432 |
| Tue May 5 | 274.7 | 15,488 | 24.88 | 11.0 | 623 | 56.4 | $914 |
| Wed May 6 ⚡ SpaceX | 580.1 | 26,365 | 16.62 | 34.9 | 1,587 | 45.4 | $709 |
| Thu May 7 | 1,040.6 | 26,272 | 13.27 | 78.4 | 1,980 | 25.2 | $2,441 |
| Fri May 8 | 616.8 | 20,148 | 16.96 | 36.4 | 1,188 | 32.7 | $3,025 |
| Sat May 9 | 446.8 | 9,387 | 8.33 | 53.7 | 1,127 | 21.0 | $2,253 |
| Sun May 10 | 784.6 | 27,160 | 17.78 | 44.1 | 1,528 | 34.6 | $1,568 |
| Mon May 11 | 154.9 | 4,681 | 7.11 | 21.8 | 659 | 30.2 | $426 |
| Tue May 12 | 310.0 | 13,816 | 20.79 | 14.9 | 665 | 44.6 | $1,553 |
| Wed May 13 | 392.0 | 23,505 | 30.35 | 12.9 | 775 | 60.0 | $1,545 |
| Thu May 14 | 269.2 | 15,374 | 14.26 | 18.9 | 1,078 | 57.1 | $1,091 |
| Fri May 15 | 387.9 | 18,429 | 15.24 | 25.5 | 1,210 | 47.5 | $861 |
| Sat May 16 | 547.9 | 14,515 | 19.72 | 27.8 | 736 | 26.5 | $620 |
| Sun May 17* | 80.6 | 934 | 1.38 | — | — | — | $104 |
* May 17 partial — only ~1.4 active hours in by query time.
14-day totals (excluding partial May 17): 6.5 billion tokens, 261,000 lines of code, ~258 active agent-hours, $19,734 in hypothetical API spend.
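For anyone who wants to check the derived columns, they can be recomputed from the three raw ones. The sketch below copies the 14 full days from the table; `derive` is an illustrative helper, not part of the SigNoz dashboard. Small last-digit differences against the table are possible because the inputs here are already rounded.

```python
# (label, tokens in millions, lines of code, active agent-hours), copied from the table
DAYS = [
    ("May 3", 309.1, 17_712, 26.88), ("May 4", 408.0, 28_106, 26.53),
    ("May 5", 274.7, 15_488, 24.88), ("May 6", 580.1, 26_365, 16.62),
    ("May 7", 1040.6, 26_272, 13.27), ("May 8", 616.8, 20_148, 16.96),
    ("May 9", 446.8, 9_387, 8.33), ("May 10", 784.6, 27_160, 17.78),
    ("May 11", 154.9, 4_681, 7.11), ("May 12", 310.0, 13_816, 20.79),
    ("May 13", 392.0, 23_505, 30.35), ("May 14", 269.2, 15_374, 14.26),
    ("May 15", 387.9, 18_429, 15.24), ("May 16", 547.9, 14_515, 19.72),
]


def derive(tokens_m: float, lines: int, hours: float) -> tuple:
    """Return (M tokens per hour, lines per hour, lines per M tokens)."""
    return tokens_m / hours, lines / hours, lines / tokens_m


for label, tokens_m, lines, hours in DAYS:
    tok_hr, lines_hr, lines_mtok = derive(tokens_m, lines, hours)
    print(f"{label:>6}: {tok_hr:5.1f} M tok/hr  {lines_hr:6.0f} lines/hr  "
          f"{lines_mtok:5.1f} lines/M-tok")

total_tokens_m = sum(day[1] for day in DAYS)
total_lines = sum(day[2] for day in DAYS)
total_hours = sum(day[3] for day in DAYS)
print(f"totals: {total_tokens_m / 1000:.1f}B tokens, {total_lines:,} lines, "
      f"{total_hours:.1f} agent-hours")
```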
Side-by-side with the Reddit baseline
The original post quoted five reference days. Here's how those numbers land against this window's bests:
| Reference | Tok/hr (M) | Lines/hr | Lines/M-tok |
|---|---|---|---|
| Apr 16 — original Anthropic peak | 29.6 | 1,688 | 41.5 |
| Apr 21 — throttled | 12.9 | 431 | 22.3 |
| Apr 29 — Ollama Cloud / DeepSeek | 21.5 | 1,174 | 54.7 |
| May 7 — post-SpaceX peak (this window) | 78.4 | 1,980 | 25.2 |
| May 6 — post-SpaceX day 1 | 34.9 | 1,587 | 45.4 |
| 2-week median | 25.5 | 1,078 | 45.0 |
A few things jump out:
- Peak token throughput (78.4M tok/hr on May 7) is 2.6× the original "Anthropic at its best" reference (29.6M). That's the cache-fix proxy doing the heavy lifting: cache reads cost a fraction of fresh input tokens, so the same agent-hour pushes far more raw tokens through the pipe.
- Lines/hr peak (1,980 on May 7) beats every single day in the Reddit post, including the previous Anthropic best of 1,688. This is the cleanest signal that the workflow itself is producing more code, not just spending more tokens. May 6, May 8, and May 10 also clear the 1,174 of the Ollama day, though they fall short of that Anthropic best.
- Lines/M-tok dropped, though — median ~45 here vs. 54.7 on the Ollama day. Translation: I'm shipping more total code, but I'm also spending more tokens per line of code. The cache proxy makes those extra tokens cheap, but the efficiency-per-token metric is no longer where DeepSeek was.
This is the trade-off the original post missed: lines/M-tok measures token efficiency, lines/hr measures throughput, and they're not the same thing. DeepSeek won on efficiency. Anthropic-with-the-stack wins on throughput. And throughput is what I actually feel during a session.
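A quick way to see that the two metrics disagree is to rank the same reference days by each one. The numbers are copied from the comparison table above; `rank_by` is just an illustrative helper.

```python
# (M tokens/hr, lines/hr, lines per M tokens), copied from the comparison table
REFERENCE_DAYS = {
    "Apr 16 (original Anthropic peak)": (29.6, 1688, 41.5),
    "Apr 21 (throttled)": (12.9, 431, 22.3),
    "Apr 29 (Ollama Cloud / DeepSeek)": (21.5, 1174, 54.7),
    "May 6 (post-SpaceX day 1)": (34.9, 1587, 45.4),
    "May 7 (post-SpaceX peak)": (78.4, 1980, 25.2),
}


def rank_by(column: int) -> list:
    """Return day labels sorted best-first on the given column index."""
    return sorted(REFERENCE_DAYS,
                  key=lambda day: REFERENCE_DAYS[day][column],
                  reverse=True)


by_throughput = rank_by(1)   # lines/hr
by_efficiency = rank_by(2)   # lines per M tokens

print("best throughput:", by_throughput[0])
print("best efficiency:", by_efficiency[0])
```

May 7 tops the throughput ranking and sits second from the bottom on efficiency, while the Ollama day does the reverse. That is the whole trade-off in one sort.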
What broke on Ollama's side
Two weeks of trying to keep DeepSeek in the rotation — and not on the $20/mo Pro plan from the original post, but on the $100/mo Max plan, which advertises 5× Pro's usage. Specifically:
- Kimi K2.6 Pro experiment (May 9): blew through my entire 5-hour window in under 10 minutes, on Max. Walked away from that model. This is what an even heavier model than `deepseek-v4-pro` does to the GPU-time quota, and it's the cautionary tale, not the norm.
- `deepseek-v4-pro` 7-day window on Max: exhausted in 5 calendar days this past week on a normal workload. In April, the same workload on the Pro plan struggled to hit 3% of the same window. Five times the budget, dramatically less runway. DeepSeek isn't the problem here (it still paces well within a session), but the weekly envelope has shrunk to the point that a steady week of work will run it out before reset.
Reading Ollama's current docs explains it: cloud usage is measured in GPU-time levels (1–4), not tokens. deepseek-v4-pro is level 4 — the most expensive tier. They've also acknowledged revising limits twice since launch. The 3-concurrent-model cap that I called "annoying but survivable" in the first post is now the binding constraint, because queued requests still count toward the GPU-time budget.
The real story isn't about the models. It's about the cage.
Here's the part I want to land hard, because I think the original post got it wrong and a lot of the responses to it got it wrong too:
You must have a deterministic cage for the stochastic agents.
That's the whole game. It doesn't matter whether the model behind the API is Claude Opus 4.7 or DeepSeek-v4-pro or whatever drops next month. What matters is whether your SDLC pipeline constrains the model's freedom to be wrong:
- TDD with tests that have to be RED before they go GREEN
- A typecheck/lint/security gauntlet that no PR escapes
- An adversarial code reviewer that's a different agent than the one that wrote the code
- A merge gate that mechanically verifies the gauntlet ran, not just that the human said "looks good"
- Issue-before-implementation discipline so every change has a documented anchor
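As a rough illustration of what "mechanically verifies" can mean, here is a minimal merge-gate sketch. The check names, the status strings, and the `ISSUE-1` identifier are all hypothetical; a real gate would read these from whatever CI system you use. The point is that a check that never ran blocks the merge just like a check that failed.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical names for the gauntlet stages listed above.
REQUIRED_CHECKS = ("tests", "typecheck", "lint", "security", "adversarial_review")


def merge_allowed(check_results: Dict[str, str],
                  linked_issue: Optional[str]) -> Tuple[bool, List[str]]:
    """Return (allowed, reasons). Any missing or non-passing check blocks the merge."""
    reasons = []
    if not linked_issue:
        reasons.append("no linked issue")
    for name in REQUIRED_CHECKS:
        status = check_results.get(name)
        if status is None:
            reasons.append(f"{name}: did not run")
        elif status != "passed":
            reasons.append(f"{name}: {status}")
    return (not reasons, reasons)


results = {name: "passed" for name in REQUIRED_CHECKS}
print(merge_allowed(results, "ISSUE-1"))

del results["security"]  # simulate a stage that was skipped
print(merge_allowed(results, "ISSUE-1"))
```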
If you have that cage, DeepSeek-v4-pro produces output every bit as good as Opus 4.7 — even with Ollama's recent quota changes. I'm not speaking theoretically here. I run both, side by side, daily. The cage catches the things either model gets wrong, and what comes out the other end is interchangeable in quality.
The reason Claude+stack wins for certain parts of my workflow is not that Claude is smarter — it's that I'm running 14+ concurrent agents on time-critical work, and the binding constraint on Ollama Cloud is the 3-concurrent-model cap regardless of plan tier. If you're running one or two agents at a time, that constraint never bites you, and the $20-$100/mo you'd spend on Ollama is real savings over $200/mo Claude Max.
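To see why the cap bites at 14 agents and not at two, here is a small simulation. It assumes each agent occupies one of the three concurrent slots for the length of a turn and that turns take equal time; both are simplifications, and the sleep is only a stand-in for a model call.

```python
import asyncio
import math


async def run_with_cap(n_agents: int, cap: int, turn_seconds: float = 0.01) -> int:
    """Run n_agents simulated turns under a concurrency cap; return peak concurrency."""
    slots = asyncio.Semaphore(cap)
    running = 0
    peak = 0

    async def agent() -> None:
        nonlocal running, peak
        async with slots:                      # wait here if all slots are busy
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(turn_seconds)  # stand-in for one model turn
            running -= 1

    await asyncio.gather(*(agent() for _ in range(n_agents)))
    return peak


peak = asyncio.run(run_with_cap(n_agents=14, cap=3))
waves = math.ceil(14 / 3)
print(f"peak concurrency: {peak}, sequential waves needed: {waves}")
```

With equal-length turns, 14 agents behind a cap of 3 need five sequential waves, so the batch takes about five times as long as it would uncapped. With one or two agents the semaphore never blocks at all.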
My actual setup: two lanes, not one
I want to be specific about why I kept the $100/mo Ollama Cloud Max subscription, because "fallback" was the wrong word in an earlier draft and I think the honest answer matters here.
It's not a fallback. It's a second lane. I run both subscriptions in parallel, every day, for different categories of work:
- Claude Max ($200/mo) — the fast lane. Anything time-critical: client work, things I need shipped today, debugging a production issue, the heavy-parallelism gauntlet that needs to run 14 agents at once. I'm regularly pushing my Claude token limits as it is — and you can't just buy a second Claude Max account without risking a ban — so I treat that quota as scarce.
- Ollama Cloud Max ($100/mo) — the patient lane. Anything where I can wait: an app for my wife, side projects only I'll use, learning exercises, scripts that aren't on a deadline. DeepSeek-v4-pro takes longer per turn and I have to queue agents because of the 3-concurrent cap, but the output is genuinely as good as Opus 4.7 thanks to the cage, and it doesn't burn my scarce Claude budget.
That two-lane split is the reason both subscriptions earn their keep. Drop either one and the other gets overloaded. Total monthly stack: $300/mo for what would cost $400+ if I tried to do all of it on Claude alone — and I'd hit the ban-risk wall trying to buy more Claude capacity anyway.
Who should actually use what
- You're running 1-3 agents at a time, working on a single repo, single-language, with a real test suite and a CI pipeline that has teeth: Ollama Cloud Pro ($20/mo) with `deepseek-v4-pro`. This is the right answer for most people reading this post. You'll never hit the concurrency cap. The cage catches what the model gets wrong. You save 10×.
- You're running 5-10 agents, multi-repo, polyglot, willing to engineer a custom token-savings stack: Ollama Cloud Max ($100/mo) is still viable, with caveats. Watch the GPU-time levels carefully. Stick to level 2-3 models for routine work, save level 4 for the heavy lifts.
- You have time-critical, high-pressure work AND lower-priority patient work: run both lanes like I do. $200 Claude Max for the fast lane, $100 Ollama Max for the patient lane. Don't try to do everything on one provider — you'll either burn out the quota or pay 3× more than you need to.
- You don't have the cage yet: build the cage first. Whichever model you use will produce garbage without it. The model choice is a 10% optimization on top of an 80% problem.
What I'd actually recommend now to anyone reading the first post:
- Don't ditch a model based on one bad week — instrument first. Half of what I attributed to "Anthropic got worse" in April was cache misses I could have controlled.
- Cache hit rate is the variable that matters most, no matter whose API you're hitting. A naive Claude Code workflow burns input tokens on every turn. A cache-optimized one keeps the same context resident across the whole session.
- Build the cage before you optimize the model. TDD discipline, an adversarial review agent, a mechanical merge gate. Without those, switching models is rearranging deck chairs.
I'll keep this dashboard running. Happy to share the SigNoz dashboard JSON if anyone wants to set up their own.
Sources / receipts:
- Anthropic — Higher usage limits for Claude and a compute deal with SpaceX
- Engadget — Anthropic is doubling Claude Code rate limits after deal with SpaceX
- Slashdot — Anthropic Raises Claude Code Usage Limits, Credits New Deal With SpaceX
- Ollama — Pricing · Cloud docs
- DevToolHub — Ollama Cloud Free vs Pro: Usage Limits, Pricing & What You Get (2026)
- ollama/ollama#13089 — Frustrating very limiting max tokens on the cloud models to only 16,384