
"Google has a whole department whose only job is to steal startups."
Welcome to the real world, fam!

Welcome to the real world, fam!
Suddenly after every message I get:
API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Try rephrasing the request or attempting a different approach.
If you are not using /remote-control you need to start using it now. It will be life changing. Especially if you feel like even 2-3 agents running at the same time is already overkill and giving you AI burnout. For those that don't know, you can enter /remote-control into either a CLI session or a Desktop session and the session will be immediately accessible and will sync in real time with your phone Claude app.
First obvious super power is I could leave my office while still being "plugged in". I could sit in meetings, go to the gym, etc etc and still be "working", IE just keeping the agents on task and answering any questions they have. Similarly I could be plugged out and then come back at a whim's notice and return to working. Or better yet, answer critical questions in meetings by quickly asking one of the agents to clarify any questions about the data I am handling day to day. Similarly, I could be "plugged out" whenever I wanted and leave these agents on pause for days and come back to them whenever I wanted.
What I have noticed is I am taking way more walks, I am going to the gym WAY more, and all around just feel insanely better throughout the day. I am in the looks for getting a dog as I just want to be outside constantly now. And the feeling of being able to use my phone in voice mode and just handle 2-3 agents seamlessly is just so .... fun, if I am allowed to use that word. It is weird. I honestly think this is going to be the future.
Like literally an earpiece and that's it. And its synced up with one of your devices or even some cloud server you are using and you are literally having agents report back via voice (or sending live messages to a watch/phone for non-voice environments) and you are basically a CEO all day while your agents are doing the grunt work. Even for governance I am finding that this system works and I can easily detect bugs or flaws in the data and fix them in real time. It is actually wild.
Another benefit I have noticed is that I deleted all my social media apps. One of my vices would be to scroll twitter or reddit etc when I woke up and when I went to bed. I've noticed that checking my phone for agent updates gives me that same sort of weird "dopamine" hit and I don't have this urge to go on doomscrolling mode. Instead I've downloaded Kindle and am reading books morning and night. Not to mention another benefit is you don't always have to be "locked in" either. You can literally leave an agent for days and just come back and pick up where you left off (as long as you have an always-on setup like a mac mini etc).
Just overall excited about this and wanted to share
I don’t know about you, but I feel lately running 6-8 windows in terminal and constantly switching context really messes with my productivity.
Feel like I got a mindset that if I’m not burning through all my tokens in a reset time window, I’m not productive.
Wonder if working in a single thread would be a better option, is it just me or anyone can relate?
I've been meditating on this for the last few months and I's still not sure how I feel about this change.
On one hand, my nearly 30 years of experience in design and development is serving me well, in that I am capable of making sound decisions about virtually any aspect of significant and complex digital products. Which means my day-to-day is no longer spent banging out code, building wireframes, or fiddling with Photoshop. It's literally reading and writing specs all day about which feature to add, how it should function, etc.
So, that feels ...smarter, somehow.
But, on the other hand my identity, and consequently where I had previously found joy, is somehow still tied to what in my head I had always thought of as the "craft" of execution.
I wonder how people who were in "Product Manager" roles are adapting to this change? Are they also moving up the thought-ladder into a different role? Or are we, as an army of displaced designers and developers, suddenly encroaching on an immovable barrier between product design and where the money starts?
Genuine question, because the answer keeps surprising me. I'll be talking to folk at AI events in Berlin this week who really knows Claude Code — CLI, MCP, subagents, etc — and I'll ask what hooks they run and where. Almost all go quiet, or mumble about things that are not hooks at all. So few have ever set one up, some most have never heard of them (though you might have seen Claude suggesting them at some point in conversation, and ignored it - like me at first).
So if that's you: a hook is a small script the harness runs for you, automatically. On every prompt, on things the model does, when it stops. You write a rule once and it runs whether the model likes it or not. They're right there in the docs, but almost nobody I talk to actually uses them.
(now as a warning, just because they are there, does not mean Claude will listen! that's for another post)./
A concrete example, because "hooks are useful" means nothing on its own.
The first hook I set up — on Claude's own suggestion, while I was tinkering with my setup — catches the model talking about work in calendar time. I absolutely cannot stand Claude quoting me human time for things - "this'll take 3 hours... 2 weeks, blah blah". The only honest unit for AI work, being very direct, is tokens (and how close to your limits you are running), so the hook spots that phrasing and makes it redo the line.
In fact, as I was drafting this in Claude playing with some data, it fired on me while I was writing this post, which amused me.
I've added more since. Some catch obvious mistakes before they happen, some keep an eye on what a session is costing, some just enforce my own habits. Every one took fiddling. None of them worked clean on the first go.
That's really all I've got. Just trying to start a conversation. If you're deep in Claude Code: do you run any hooks? Which ones, and where do they actually earn their keep? And if you don't — never came up, or you tried and bounced off?
I've been working in Claude Code all day and, just a few hours ago, started getting blocked by the safety guardrails. This is happening across multiple unrelated projects, none of which are doing anything shady or different from what I usually do. The blocks are interrupting my work. Has anyone else been having the same issue today?
Anthropic quietly shipped /workflows in Claude Code 2.1.147 and it might be the biggest shift in how we build multi-agent systems yet.
Until now, the pattern was:
one main agent (an LLM) decides what sub-agents to spawn, holds every intermediate result, and plans the next step.
The problem?
Every sub-agent result re-enters the orchestrator's context.
Spin up 10 agents and your main session pays a 'token tax' each time getting sloppier and more forgetful as the window fills.
/workflows replaces the LLM orchestrator with code.
You define a workflow.js file.
Sub-agent outputs flow from one phase to the next directly never touching the main context window.
What you get:
- Phases with structured schemas (predictable outputs)
- Parallel fan-out + streaming pipelines
- Conditionals, loops, and budgets in real JS
- Automatic retries on failure
- Live progress view via /workflows
- Run workflows in the background while your main session stays free
The principle is what's interesting:
use code for what code is good at (control flow), and models for what models are good at (judgment inside each step).
Update Note: It looks like they have taken it down for now
It was on the changelog earlier
For context: I'm a software eng @ a fortune 500/FAANG tier company. We use AI. We treat all ai code with humans as the bottleneck. That is: You generate AI code, you own it. It has bugs? It's your bug.
Claude has only gotten better. 4.7 reasoning has only improved, albeit it thinks more. My question is: what the hell are y'all up to that I constantly hear things like claude broke and everything sucks?
You need to review the code. YOU need to understand what claude outputs. AI is nondeterministic, so I don't know why people are creating agentic flows for deterministic work. Need determinism? Generate an audit the code man.
What are people's workflows here that I constantly hear about degraded quality? Personally I just create plenty of skills and harnesses for information that it needs, I set off parallel tasks that are sandboxed from each other (E.g using a worktree, different folder, whatever your taste is), I review the code, I tweak it myself manually.. and that's it.
At the end of the day, I've been a software engineer for 10 years, I understand anything claude generates is something I have to own and be able to debug eventually myself if the world suddenly gets rid of AI (which we know it won't, but it's the sentiment that should be held).
I'm not coming from a place of reprimanding, truly I'm not, but I just don't see how it's gotten worse. I work on very high perf software and claude has helped a lot in saving me time on ASM analysis and algorithmic reasoning for things where throughput matters.
Okay, time for Round 3.
A few days ago I posted Round 2, where we benchmarked GPT-5.4, GPT-5.5, GPT-5.3-codex, and GPT-5.4-mini across different effort levels using Codex on the same React project and the same feature-building prompt. Here's the note-taking app used as our test project.
That round came after a lot of useful feedback from the 1st experiment.
The first version was based on summarization of a Git repo (Express JS), but the problem with summarization is that it was very subjective (all summaries are essentially "correct" depending on preference/perspective).
So in Round 2 we moved to something more practical: give every model the same real coding task inside the same repository, then compare the actual feature implementation.
For Round 3, I repeated the same style of experiment, but this time with Anthropic models via Claude Code.
The models / model families tested were:
The task was the same type of real feature-build benchmark as before: implement an outline panel inside a small React note-taking app, while preserving the existing app behavior and respecting the prompt constraints.
We ran this experiment with each model-effort combination in their own separate Git worktree using Claude Code programmatic access. Every run got the same repo and the same prompt. After the runs completed, we checked whether the implementation actually worked in the UI first before we used our scoring system - 4 models grading against the same set of code quality attributes and averaged the scores.
For the successful / usable runs, we evaluated code quality across following dimensions:
The final code quality ranking was then used as the ordering for the token usage and turns/cost comparison tables (below and also available in the attached infographic):
| Rank | Run | Model | Effort | Quality Score | Input Tokens | Output Tokens | Turns | Runtime | Cost |
|---|---|---|---|---|---|---|---|---|---|
| 1 | exp-22 | Opus 4.7 1M | XHigh | 32.50 / 35 | 3.8M | 31.4K | 67 | 8.3m | $4.21 |
| 2 | exp-27 | Opus 4.7 | XHigh | 31.75 / 35 | 4.0M | 23.4K | 141 | 8.7m | $4.06 |
| 3 | exp-20 | Opus 4.7 1M | Medium | 30.50 / 35 | 2.0M | 15.2K | 47 | 4.2m | $2.05 |
| 4 | exp-26 | Opus 4.7 | High | 30.00 / 35 | 2.0M | 15.4K | 86 | 5.5m | $2.57 |
| 5= | exp-21 | Opus 4.7 1M | High | 29.50 / 35 | 2.9M | 20.0K | 60 | 5.2m | $3.23 |
| 5= | exp-23 | Opus 4.7 1M | Max | 29.50 / 35 | 4.2M | 36.4K | 70 | 8.6m | $4.68 |
| 7 | exp-28 | Opus 4.7 | Max | 29.25 / 35 | 4.7M | 33.9K | 149 | 11.2m | $4.81 |
| 8 | exp-24 | Opus 4.7 | Low | 28.75 / 35 | 2.0M | 12.5K | 92 | 5.1m | $2.50 |
| 9 | exp-19 | Opus 4.7 1M | Low | 28.00 / 35 | 1.9M | 12.0K | 45 | 3.6m | $2.21 |
| 10 | exp-25 | Opus 4.7 | Medium | 27.75 / 35 | 2.4M | 15.7K | 103 | 6.2m | $2.77 |
No surprises but Opus 4.7 1M xhigh came out on top overall, with an averaged quality score of 32.50 / 35.
A few interesting findings:
Opus 4.7 xhigh was very close behind at 31.75 / 35.
The best “efficient quality” run was arguably Opus 4.7 1M medium, which scored 30.50 / 35 while using much less time and cost than the xhigh / max runs.
At the model-family level, Opus 4.7 1M narrowly led regular Opus 4.7, but the difference was small:
So in this specific test, the two Opus 4.7 families were very close. The 1M version had the top individual run, but regular Opus 4.7 was also highly competitive.
Sonnet 4.6 and Opus 4.6 Legacy trailed the 4.7 models pretty clearly on code quality in this particular benchmark.
One thing that stood out: higher effort was not always better.
The xhigh runs were strongest, but max did not consistently improve quality. In a couple of cases, max increased cost and token usage without producing a better implementation.
4.6 Opus (which we love) we were sure, was better than 4.7 (during its release at least) but results say other wise (I have conspiracy theories).
Caveat: This is still a N=1/single-task benchmark. The results should be treated as directional rather than definitive.
The next thing we plan to do is compare the Anthropic results against the OpenAI Round 2 results more directly, especially quality-per-dollar and quality-per-minute. I kept this post focused only on the Anthropic side so the comparison does not get too messy.
I've been using the Claude Code desktop app for a while and it works really well for me.
Visual diffs are nice, the plan view is helpful when outlining a task etc. but I keep noticing a lot of the developers I follow/in here seem to mainly use the CLI, and a lot of new features appear to ship there first.
I get the obvious side like the CLI is faster once you're comfortable in terminal and I can get it connected into other tools etc.
What I'm trying to understand though is the day-to-day use case from people who've actually used both:
I guess I have FOMO and just curious!
I decided to finally upgrade my subscription to the max plan, and OMG I can’t really say I was even using Claude before!
Previously, I was using 2 pro subscriptions, Codex Plus, OpenCode, and DeepSeek APIs. With Claude, I would max out the month and only use 1M tokens due to the limits. NOW, in 2 days, I spent 21.5M tokens and haven't hit any limits AT ALL!!! Plus, Claude’s design, omg, the ability to use it freely instead of it maxing out for the entire week after one session is insane.
I LOVE IT
day 1: opened claude code for the first time.
day 2: watched three youtube tutorials on "how to think like a founder."
day 3: fully functional saas.
day 4: needed a landing page so piped it through runable.
day 5: linkedin post saying "we're building something special."
day 6: YC application.
day 7: height calculator. the vision was always there.