Split my agent into a cheap router model and a premium synthesis model, bill dropped about 75%
I've been building an internal enrichment agent for our team (5 people, B2B sales context) that takes a list of company names and enriches them with public info before our outreach folks touch them. Around 8 tools wired in. The usual stuff: web search, scrape, internal vector DB lookup, dedupe against our CRM, classify by ICP fit, draft a short outreach paragraph, plus a couple of glue tools for handling edge cases.
When I first got it working, every call went to gpt-5.4 because that's what I had set up. Worked fine, but the bill was scary: roughly $290 the first week for about 1,200 companies. That wouldn't scale to the volume our sales person actually wants (closer to 5k/week).
Looked at the logs more carefully and the breakdown surprised me. About 75% of LLM calls were what I'd call "router" calls: given the current state, the available tools, and the last tool result, pick the next action. These calls have a tiny output (one tool name plus a JSON arg blob) and don't really need 5.4-level reasoning. They just need to be cheap, fast, and smart enough not to pick stupid tools.
The remaining 25% were "synthesis" calls. Summarize this scraped page. Draft this paragraph. Reason about whether the evidence actually matches our ICP. Those benefit from a real model.
Swapped the architecture so routing uses GPT-OSS 120B on an OpenAI-compatible endpoint (I'm on GMI Cloud; a couple of other hosts price it similarly) and synthesis stays on gpt-5.4. The SDK doesn't care: you just pass a different base_url and model string depending on the call site.
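For anyone who wants to see the shape of it, here's a minimal sketch of a per-call-site config. The router base URL, the `ROUTER_API_KEY` env var name, the `gpt-oss-120b` model id, and the helper names are placeholders for illustration (check your host's docs for the real values); only `gpt-5.4` comes from the setup described above. It assumes the official `openai` Python package, whose `OpenAI` client accepts `base_url` and `api_key`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    base_url: str     # OpenAI-compatible API root
    api_key_env: str  # name of the env var that holds the key
    model: str        # model string sent with each request


# Placeholder values: swap in your host's real base URL and model id.
ENDPOINTS = {
    "route": Endpoint(
        base_url="https://router-host.example.com/v1",  # hypothetical
        api_key_env="ROUTER_API_KEY",                    # hypothetical
        model="gpt-oss-120b",                            # hypothetical id
    ),
    "synthesize": Endpoint(
        base_url="https://api.openai.com/v1",
        api_key_env="OPENAI_API_KEY",
        model="gpt-5.4",
    ),
}


def endpoint_for(kind: str) -> Endpoint:
    """Return the endpoint config for a call site ("route" or "synthesize")."""
    try:
        return ENDPOINTS[kind]
    except KeyError:
        raise ValueError(f"unknown call kind: {kind!r}") from None


def client_for(kind: str):
    """Build an SDK client for the call site; returns (client, model)."""
    # Imported lazily so the config above is usable without the SDK installed.
    from openai import OpenAI

    cfg = endpoint_for(kind)
    client = OpenAI(base_url=cfg.base_url, api_key=os.environ[cfg.api_key_env])
    return client, cfg.model
```

The agent loop would then call something like `client_for("route")` for tool picks and `client_for("synthesize")` for summaries and drafts, so the rest of the code never needs to know which host it is talking to.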
Numbers from this week, processing about 1,400 companies: total around $65. So roughly a 78% reduction at slightly higher volume. Quality on the final outputs feels the same to our sales person. To validate, we ran 50 companies through both stacks side by side before fully switching.
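The arithmetic behind those numbers, as a quick sketch. The inputs are the figures quoted above; the 5k/week projection is an assumption that cost scales linearly with company count, which may not hold exactly.

```python
# Figures quoted in the post.
before_cost, before_companies = 290.0, 1200
after_cost, after_companies = 65.0, 1400
target_per_week = 5000

before_unit = before_cost / before_companies  # about $0.24 per company
after_unit = after_cost / after_companies     # about $0.046 per company

total_reduction = 1 - after_cost / before_cost        # about 0.78
per_company_reduction = 1 - after_unit / before_unit  # about 0.81

# Linear extrapolation to the target volume (an assumption, not a measurement).
projected_before = before_unit * target_per_week  # about $1,208 per week
projected_after = after_unit * target_per_week    # about $232 per week

print(f"per company: ${before_unit:.3f} -> ${after_unit:.3f}")
print(f"reduction: {total_reduction:.1%} total, {per_company_reduction:.1%} per company")
print(f"at {target_per_week}/week: ${projected_before:,.0f} -> ${projected_after:,.0f}")
```

Because the second week handled more companies, the per-company saving (about 81%) is a bit larger than the drop in the total bill (about 78%).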
A few things I had to fix:
GPT-OSS 120B's tool-calling JSON is mostly clean but occasionally leaves a trailing comma. Wrapped the parse in a sanitizer.
Default max_tokens was 4096 and the model was happy to fill the reasoning channel even when I just wanted a tool pick. Dropped max_tokens on routing calls to 256 and tightened the prompt.
Per-call latency on routing is maybe 100-200ms higher than 5.4 on average, but that's fine because routing isn't on the user-facing critical path.
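A sanitizer along these lines is one way to handle the trailing-comma case. This is a sketch, not the exact code from the agent: the function names are made up, it only removes commas that sit directly before a closing `}` or `]` outside of string literals, and it doesn't attempt to repair any other kind of malformed JSON. The `routing_kwargs` helper just shows where the 256-token cap would go, using the standard Chat Completions parameter names.

```python
import json

ROUTING_MAX_TOKENS = 256  # the cap mentioned above for tool-pick calls


def strip_trailing_commas(text: str) -> str:
    """Drop commas that directly precede '}' or ']' outside string literals."""
    out = []
    in_string = False
    escaped = False
    n = len(text)
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch == ",":
            # Look past whitespace to see what follows the comma.
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if j < n and text[j] in "}]":
                continue  # trailing comma: skip it
            out.append(ch)
        else:
            out.append(ch)
    return "".join(out)


def parse_tool_args(raw: str):
    """Parse tool-call JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(strip_trailing_commas(raw))


def routing_kwargs(model: str, messages: list, tools: list) -> dict:
    """Keyword arguments for a routing request with a tight output cap."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "max_tokens": ROUTING_MAX_TOKENS,
    }
```

Trying strict `json.loads` first means clean output is never touched, and anything the sanitizer can't fix still raises `json.JSONDecodeError` so the caller can retry or fall back.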
If most of your agent calls are tool-pick decisions rather than synthesis, this split is probably the biggest single win available. Pulling them apart took us from "we can't scale this" to "it scales fine" without changing anything else.
The thing I'm still figuring out is whether GPT-OSS 120B is actually the right size for the routing job or whether I could push down to a 30-something B model and save more. Quality might tank with more tools registered, haven't actually tested yet.
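One way to test that before committing would be to replay logged routing decisions through the smaller model and measure how often it picks the same tool. The sketch below assumes you can export picks as a mapping from some case id to the chosen tool name; the function names, the case ids, and the tool names in the usage example are all made up for illustration.

```python
from collections import Counter
from typing import Dict


def tool_pick_agreement(reference: Dict[str, str], candidate: Dict[str, str]) -> float:
    """Fraction of shared cases where the candidate picked the reference tool."""
    shared = reference.keys() & candidate.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(reference[k] == candidate[k] for k in shared)
    return matches / len(shared)


def disagreements(reference: Dict[str, str], candidate: Dict[str, str]) -> Counter:
    """Count (reference_tool, candidate_tool) pairs where the picks differ."""
    shared = reference.keys() & candidate.keys()
    return Counter(
        (reference[k], candidate[k]) for k in shared if reference[k] != candidate[k]
    )


if __name__ == "__main__":
    # Toy data, not real logs: picks from the current router vs. a smaller one.
    current = {"c1": "web_search", "c2": "scrape", "c3": "crm_dedupe", "c4": "classify_icp"}
    smaller = {"c1": "web_search", "c2": "scrape", "c3": "scrape", "c4": "classify_icp"}
    print(tool_pick_agreement(current, smaller))  # 0.75
    print(disagreements(current, smaller))
```

Agreement with the current router is only a proxy, since two different tool picks can both be reasonable, so the disagreement counts are probably more useful than the headline rate for deciding where a smaller model falls over.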