If you are shipping AI agents, do you know what one real conversation costs end to end?
If you have ever tried to estimate what an AI agent will cost in the real world, you already know the annoying part: the number changes the moment the flow starts doing real work.
A request is no longer just “one LLM call.” It can become speech-to-text (STT), retrieval, tool calls, retries, memory, and text-to-speech (TTS) all in the same run, and each billable piece is priced in a different unit. That is where the clean spreadsheet version usually breaks down.
We keep running into the same problem: the cost of an agent is easy to estimate on paper, and much harder to estimate once the flow starts behaving like a real system.
STT is usually priced per minute of audio. LLMs are priced per token. TTS can be priced per character or per second depending on the provider. Those units do not line up cleanly, so once retries or branching enter the flow, the total stops being obvious very quickly.
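To make the unit mismatch concrete, here is a minimal Python sketch that converts each step of one voice turn into dollars before adding them up. Every rate is a made-up placeholder, not any provider's real pricing, and the function names (stt_cost, llm_cost, tts_cost, turn_cost) are hypothetical. It assumes TTS is billed per character and that a retry repeats the full LLM call.

```python
# Hypothetical placeholder rates. Replace with your providers' actual pricing.
STT_USD_PER_MINUTE = 0.006
LLM_USD_PER_1K_INPUT_TOKENS = 0.0005
LLM_USD_PER_1K_OUTPUT_TOKENS = 0.0015
TTS_USD_PER_1K_CHARS = 0.015


def stt_cost(audio_seconds: float) -> float:
    """STT is billed per minute of audio."""
    return audio_seconds / 60 * STT_USD_PER_MINUTE


def llm_cost(input_tokens: int, output_tokens: int) -> float:
    """LLM calls are billed per token, input and output at different rates."""
    return (input_tokens / 1000 * LLM_USD_PER_1K_INPUT_TOKENS
            + output_tokens / 1000 * LLM_USD_PER_1K_OUTPUT_TOKENS)


def tts_cost(reply_chars: int) -> float:
    """Assumes per-character TTS billing; some providers bill per second."""
    return reply_chars / 1000 * TTS_USD_PER_1K_CHARS


def turn_cost(audio_seconds, input_tokens, output_tokens, reply_chars,
              llm_attempts=1):
    """Cost of one voice turn, broken down by step, all in USD."""
    breakdown = {
        "stt": stt_cost(audio_seconds),
        # Assumption: each retry repeats the whole LLM call.
        "llm": llm_cost(input_tokens, output_tokens) * llm_attempts,
        "tts": tts_cost(reply_chars),
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown
```

With these placeholder rates, a 30-second user turn with 2,000 input tokens, 200 output tokens, and a 600-character reply comes to about $0.0133, and a single LLM retry raises it to about $0.0146.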
What makes this hard is not just the pricing. It is that the cost is tied to behavior. A slightly longer user turn, one extra retry, a tool call that adds context, or a different response style can change the final number more than people expect.
For us, cost is not a static line item. It is part of the run. If the agent branches, retries, carries too much context, or crosses providers, the cost shifts with it. That is why we started looking at cost alongside tracing and evaluation instead of treating it as a separate spreadsheet problem.
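As a rough illustration of why carried context matters, the sketch below counts LLM input tokens for a conversation in which every turn resends the full history. The function name and the token counts are hypothetical, and it assumes no truncation, summarization, or prompt caching.

```python
def conversation_input_tokens(turns, system_tokens,
                              user_tokens_per_turn, reply_tokens_per_turn):
    """Total LLM input tokens when each turn resends the whole history."""
    total = 0
    history = system_tokens
    for _ in range(turns):
        history += user_tokens_per_turn   # new user message joins the history
        total += history                  # the full history is sent as input
        history += reply_tokens_per_turn  # the reply is carried into later turns
    return total


# Hypothetical conversation shape: 500-token system prompt,
# 50-token user turns, 150-token replies.
short_run = conversation_input_tokens(10, 500, 50, 150)
long_run = conversation_input_tokens(20, 500, 50, 150)
```

Under these assumptions, 10 turns send 14,500 input tokens in total and 20 turns send 49,000, so doubling the conversation length more than triples LLM input spend.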
Who this is for:
- Teams building voice agents, copilots, RAG systems, or multimodal flows.
- People trying to estimate cost before usage gets real and messy.
- Technical teams that want to understand what is actually driving spend inside a run.
What you can do with it:
- Trace which step is adding the most cost.
- Compare runs across models, prompts, and providers.
- See how retries and branching change total spend.
- Simulate different conversation patterns before they hit production.
- Tie cost back to actual behavior instead of guessing from pricing pages.
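For the first item in that list, a minimal sketch of what "trace which step is adding the most cost" could look like, assuming each step in a run emits a record with a step name and a cost in USD. The record shape, the numbers, and the function names are all hypothetical, not the format of any particular tracing tool.

```python
from collections import defaultdict

# Hypothetical trace of a single run: one record per step execution.
run = [
    {"step": "stt", "cost_usd": 0.003},
    {"step": "llm", "cost_usd": 0.0013},
    {"step": "retrieval", "cost_usd": 0.0002},
    {"step": "llm", "cost_usd": 0.0013},  # a retry shows up as a second record
    {"step": "tts", "cost_usd": 0.009},
]


def cost_by_step(records):
    """Sum cost per step name, so retries roll up into their step."""
    totals = defaultdict(float)
    for record in records:
        totals[record["step"]] += record["cost_usd"]
    return dict(totals)


def top_step(records):
    """Name of the step that contributed the most cost in this run."""
    totals = cost_by_step(records)
    return max(totals, key=totals.get)
```

In this made-up run, TTS is still the most expensive step even though the LLM ran twice, which is the kind of result that is hard to see from a pricing page alone.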
We are curious how other teams are handling this in practice.
If you are building AI agents right now, how are you estimating cost today: from provider pricing that you adjust later, from back-of-the-envelope assumptions, or from actual run data? Do you already have run-level cost visibility built into your agent stack? And when the number surprises you, what is usually the reason: retries, retrieval, tool use, response length, or just longer-than-expected conversations?
If you are working on this kind of system, try it on one of your own flows and see what shows up. We would genuinely like to hear what feels accurate, what feels off, and what you would want to measure better.