Cut my LangGraph agent from $300/day to $63/day by routing boring subtasks off Opus 4.1
I've been running a fairly typical LangGraph agent that does research, writes code, and deploys. The loop was eating around $300 a day on Opus 4.1, and most of those calls weren't hard reasoning. They were things like reading a file, summarizing a log, or calling a search tool and reformatting the result. Pure overhead that happened to run on the most expensive model in the stack.
So I split the agent into two tiers. Hard subtasks (architectural decisions, debugging unfamiliar code) still hit Opus 4.1. Everything else, the routine tool calling and summarization work, now goes through a cheap default model. For the past week that default has been a mix of DeepSeek V4 Pro and Tencent Hunyuan Hy3 preview, with the Hy3 preview handling most steps that involve many tool calls.
The routing lives in a LangGraph conditional edge, wired up with add_conditional_edges. After the router node runs, route_task inspects the task metadata and returns a label, and the mapping sends that label to the matching node. Something like:
builder.add_conditional_edges(
    "router",
    route_task,
    {
        "hard": "opus_node",
        "cheap": "hy3_node",
    },
)
The route_task function returns "hard" if the step touches more than three files in an unfamiliar repo or asks for an architectural decision, which sends it to Opus 4.1. Otherwise it returns "cheap" and the step goes to the cheap tier.
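A minimal sketch of what a route_task along those lines could look like. The state fields (files_touched, repo_is_unfamiliar, needs_architecture_decision) are hypothetical placeholders for whatever metadata the router node attaches, not a real schema, and the threshold is just the "more than three files" rule written down.

```python
from typing import Literal, TypedDict


class TaskState(TypedDict, total=False):
    # Hypothetical metadata fields, assumed to be set by the router node.
    files_touched: int
    repo_is_unfamiliar: bool
    needs_architecture_decision: bool


# "More than three files" is the escalation rule described above.
MAX_FILES_BEFORE_ESCALATION = 3


def route_task(state: TaskState) -> Literal["hard", "cheap"]:
    """Return the branch label that the conditional edge maps to a node."""
    # Architectural decisions always go to the expensive model.
    if state.get("needs_architecture_decision", False):
        return "hard"
    # Many files only counts as hard when the repo is unfamiliar.
    many_files = state.get("files_touched", 0) > MAX_FILES_BEFORE_ESCALATION
    if many_files and state.get("repo_is_unfamiliar", False):
        return "hard"
    # Everything else is routine work for the cheap tier.
    return "cheap"
```

Because the function only returns the labels "hard" and "cheap", it lines up with the keys in the mapping passed to add_conditional_edges. Missing fields default to the cheap tier here, which is a design choice: flip the defaults if you would rather overspend than under-route.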
I run the cheap tier on a refurbished Mac Studio M2 Ultra with 192GB of unified memory. Cost me around $5,500. The official deployment path from Tencent is vLLM or SGLang on eight H200-class GPUs, which isn't happening in a home lab. The Apple Silicon route works because the 4-bit quantized weights land around 165GB and fit in unified memory with some headroom. Setup was conda plus the community MLX port from Hugging Face. Hours of fiddling, not a clean afternoon. Throughput lands around 5 to 12 tokens per second depending on context length. That sounds slow, but most of my agent steps spend their wall clock time waiting on tool execution anyway, so it doesn't bottleneck the loop. I'd like to try the 8-bit MLX build once someone publishes it, mainly to see if reasoning across files gets stronger.
Hy3 preview itself is a 295B MoE with 21B active parameters per token and a 256K context window. For tool calling specifically, OpenRouter had it ranked first by call volume shortly after launch, which is what made me try it. In my own loop it's been reliable across workflows that run 200 to 300 tool calls without derailing.
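A rough back-of-envelope check on the memory numbers, as a sketch only. It counts raw weight bytes for 295B parameters and nothing else. The gap between the raw 4-bit figure and the roughly 165GB on disk is presumably quantization metadata and tensors kept at higher precision, but that part is an assumption, not something measured here.

```python
TOTAL_PARAMS = 295e9      # total parameters across all experts
UNIFIED_MEMORY_GB = 192   # Mac Studio M2 Ultra unified memory


def raw_weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight bytes only: no quantization scales, KV cache, or runtime overhead."""
    return params * bits_per_weight / 8 / 1e9


for bits in (4, 8):
    gb = raw_weight_gb(TOTAL_PARAMS, bits)
    fits = gb < UNIFIED_MEMORY_GB
    print(f"{bits}-bit: {gb:.1f} GB raw weights, fits in {UNIFIED_MEMORY_GB} GB: {fits}")
```

By this arithmetic the 4-bit weights come to about 147.5GB before overhead, which is consistent with a 165GB download fitting in 192GB. The same arithmetic puts an 8-bit build at about 295GB for weights alone, so it likely would not fit on this machine without offloading or a smaller variant.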
Opus 4.1 costs roughly $15 per million input tokens and $75 per million output tokens. My daily burn is about 10M input and 2M output, so running everything on Opus lands around $300 ($150 for input plus $150 for output). Now I send 80% of that through the cheap tier at $0.18 per million input and $0.59 per million output. That part costs under $3. Opus handles the remaining 20%, roughly $60. Total lands around $63.
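Here is the same arithmetic as a small script, in case you want to plug in your own volumes. It's a sketch that assumes the 80/20 split applies equally to input and output tokens, which is a simplification. The prices and volumes are the ones quoted above.

```python
# Prices in USD per million tokens, as quoted above.
OPUS = {"input": 15.00, "output": 75.00}
CHEAP = {"input": 0.18, "output": 0.59}

# Daily volume in millions of tokens.
DAILY_TOKENS_M = {"input": 10.0, "output": 2.0}

# Assumption: the same share applies to both input and output tokens.
CHEAP_SHARE = 0.80


def daily_cost(prices: dict, share: float = 1.0) -> float:
    """Cost in USD for the given share of the daily token volume."""
    return sum(prices[kind] * DAILY_TOKENS_M[kind] * share for kind in DAILY_TOKENS_M)


all_opus = daily_cost(OPUS)
cheap_part = daily_cost(CHEAP, CHEAP_SHARE)
opus_part = daily_cost(OPUS, 1.0 - CHEAP_SHARE)
total = cheap_part + opus_part

print(f"all Opus:   ${all_opus:.2f}")
print(f"cheap tier: ${cheap_part:.2f}")
print(f"Opus 20%:   ${opus_part:.2f}")
print(f"total:      ${total:.2f}")
```

This works out to $300.00 for everything on Opus, $2.38 for the cheap tier, $60.00 for the Opus share, and $62.38 total, which rounds to the $63 figure.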
A concrete example from this week. I had the agent convert a long Notion export into a slide deck. That single run burned 4.2 million output tokens. On Opus 4.1 the output alone would have been over $300. The cheap tier handled it for roughly $2.50 and the slide quality was fine. Not Opus level on design taste, but completely usable for an internal draft. I wouldn't use it for a deck going to a client without a final polish pass.
Where the cheap tier isn't the right choice, and I still reach for Opus every time, is deep debugging across a codebase I don't know well, or tasks that require holding a very precise spec in memory across many turns. It also struggles with long chains of math proofs where one wrong step cascades. For those, the cost of Opus 4.1 is worth it.
Honestly the thing I overlooked at first was tool latency. I kept blaming the model for slow responses when it was actually a webhook I wrote that was sleeping on cold starts. Took me three days of staring at LangSmith traces to realize the bottleneck was a 2 second cold boot on a lambda, not the LLM. The routing pattern only started paying off after I fixed that.
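If you want to catch this kind of thing faster than by reading traces, one option is to time model calls and tool calls separately. This is a minimal standard-library sketch, not the LangSmith setup described above. The labels and the sleep calls are stand-ins for real model and tool invocations.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per label, e.g. "llm" vs "tool:webhook".
timings = defaultdict(float)


@contextmanager
def timed(label: str):
    """Add the wall-clock duration of the wrapped block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] += time.perf_counter() - start


def slowest(n: int = 3):
    """Return the n labels with the most accumulated time, slowest first."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Stand-ins for a real model call and a real tool call (hypothetical labels).
with timed("llm"):
    time.sleep(0.01)
with timed("tool:webhook"):
    time.sleep(0.05)

print(slowest())
```

Wrapping each node's model call and each tool call in timed(...) gives a per-label total at the end of a run, so a slow webhook shows up at the top of slowest() instead of being blamed on the model.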