Cost-routing tasks across models in one session - cheap model for grunt work, frontier for reasoning, local for sensitive code
Been experimenting with model routing at the workflow level instead of the app level and wanted to share what it looks like in practice.
The setup: Zero, an open source coding agent (github.com/gitlawb/zero) that treats the model as a swappable component. It talks to 25+ providers - OpenAI, Anthropic, Gemini, DeepSeek, Qwen, Groq, plus local models through Ollama or LM Studio - and you switch mid-session with /model without losing context.
The routing pattern I've settled into:
- Cheap/fast model for scaffolding, file reads, summaries, boilerplate
- Frontier model only for the steps that need real reasoning - the escalation is a single command, same context
- Local model for anything touching code or data I don't want leaving the machine
The cost curve changes completely. Instead of paying frontier prices for 100% of tokens, you pay them for the 20% of steps that actually need it. Over a week of heavy use the difference is not subtle.
Implementation details that matter: sessions are files on disk (resumable/forkable, so routing decisions survive restarts), it's a single Go binary, no telemetry, and there's a headless mode (zero exec, streams JSON) if you want to wire the routing into scripts or CI instead of doing it interactively.
Open question I haven't solved: my escalation decisions are still vibes-based. Has anyone built actual heuristics for when a task deserves the expensive model - token-count thresholds, retry-on-failure escalation, task classification? Curious what's working for people running this at scale.