How do you feel about combining voice agents with Generative UI?

I've been thinking about the future of voice agents and wondering if pure voice is actually the best interface.

Most discussions focus on just one of these:

● Voice-only assistants

● Chat-based assistants

● Generative UI experiences

But what if they were combined?

For example, instead of a voice agent simply responding with words:

User: "Show me my portfolio."

The agent could respond verbally while also generating an interactive UI containing charts, filters, recent transactions, and actions.

Or:

User: "Find me a flight to Bangalore next weekend."

Instead of reading out 20 options, the agent could generate a visual card layout while continuing the conversation.

In this model, voice becomes the input/output layer, while the UI is generated dynamically based on intent and context.
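To make the idea concrete, here is a minimal sketch of a single agent turn that carries both a spoken line and a UI description. Everything in it is hypothetical and for illustration only: the `AgentResponse` and `UIComponent` types, the `respond` helper, the `"card_list"` component name, and the three-item threshold are my assumptions, not any real framework's API.

```python
from dataclasses import dataclass, field


@dataclass
class UIComponent:
    """Hypothetical description of one generated UI element."""
    kind: str                                  # e.g. "card_list" or "chart"
    props: dict = field(default_factory=dict)  # data the client would render


@dataclass
class AgentResponse:
    """One agent turn: text for speech synthesis plus optional UI."""
    speech: str
    ui: list = field(default_factory=list)


# Assumption: lists longer than this are shown on screen, not read aloud.
MAX_SPOKEN_ITEMS = 3


def respond(label: str, items: list) -> AgentResponse:
    """Speak short results in full; summarize long ones and render cards."""
    if len(items) <= MAX_SPOKEN_ITEMS:
        names = ", ".join(item["title"] for item in items)
        return AgentResponse(speech=f"I found {len(items)} {label}: {names}.")
    return AgentResponse(
        speech=f"I found {len(items)} {label}. They are on screen now.",
        ui=[UIComponent(kind="card_list", props={"items": items})],
    )
```

With 20 flight results, `respond("flights", flights)` returns one short spoken sentence plus a `card_list` component holding all 20 items, while two results are simply read out with no UI. A real system would presumably choose based on more than item count (screen availability, driving mode, and so on).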

I'm curious what others think:

● Is voice + Generative UI the natural evolution of AI assistants?

● Are there products already doing this well?

● When should an AI speak versus generate a visual interface?

● Would users actually prefer this over traditional apps?

Interested to hear thoughts from people building voice agents, GenUI systems, or multimodal products.

reddit.com
u/Beginning_Race8551 — 14 hours ago
▲ 2 r/speechtech+1 crossposts

How are companies making voice-to-voice AI economically viable?

I've been exploring voice-to-voice AI systems such as Gemini Live, OpenAI Realtime, and other conversational voice assistants, and one thing I'm struggling to understand is the economics behind them.

When I look at token pricing, audio input/output costs, long conversation durations, context management, and infrastructure costs, it feels like real-time voice interactions could become expensive very quickly.

Yet we're seeing more companies launch products with seemingly unlimited or generous usage plans.

What am I missing?

Some questions I have:

● How much does a typical 10–15 minute voice conversation actually cost?

● Is most of the cost coming from audio processing or context accumulation?

● Are companies aggressively summarizing conversation history behind the scenes?

● How much do caching and smaller models reduce costs?

● Are these products profitable, or are companies currently subsidizing usage to gain market share?
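For anyone who wants to play with numbers, here is a rough back-of-the-envelope model for the summarization question. It assumes (and this is only an assumption, I don't know how any provider actually bills) that every turn re-bills the retained history as input tokens, and that summarization can be approximated as a hard cap on how much history is kept. The function name, parameters, and any prices you plug in are placeholders, not real vendor figures.

```python
from typing import Optional


def conversation_cost(
    turns: int,
    audio_in_tokens_per_turn: int,
    audio_out_tokens_per_turn: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    history_cap_tokens: Optional[int] = None,
) -> float:
    """Estimate session cost, assuming retained history is re-billed each turn."""
    history = 0
    input_tokens = 0
    output_tokens = 0
    for _ in range(turns):
        # Input for this turn: everything retained so far plus the new user audio.
        input_tokens += history + audio_in_tokens_per_turn
        output_tokens += audio_out_tokens_per_turn
        # Both sides of the exchange join the history for the next turn.
        history += audio_in_tokens_per_turn + audio_out_tokens_per_turn
        # Crude stand-in for summarization: never retain more than the cap.
        if history_cap_tokens is not None:
            history = min(history, history_cap_tokens)
    total = (input_tokens * usd_per_million_input
             + output_tokens * usd_per_million_output)
    return total / 1_000_000
```

Calling it twice with the same inputs, once with `history_cap_tokens=None` and once with a cap, gives a feel for how much of the bill comes from re-reading history rather than from new audio under this assumption.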

I'd love to hear from anyone who has built or operated a production voice AI system and can share insights, benchmarks, or lessons learned.

reddit.com
u/Beginning_Race8551 — 1 day ago

How does Gemini Live actually calculate token usage in voice-to-voice conversations?

I've been trying to understand how token consumption works in Gemini Live (or similar real-time voice models), but I haven't found a detailed explanation anywhere.

Most documentation mentions audio tokens and gives approximate token rates per second of audio. However, when people use Gemini Live, the reported token usage often seems much higher than what those simple calculations would suggest.

My assumptions are:

● Audio is converted into audio tokens.

● Speech is transcribed internally.

● Conversation history remains in context.

● System instructions are included throughout the session.

What I'm unable to understand is:

  1. Does the model continuously reprocess previous conversation context during a live session?

  2. Are transcriptions counted separately from audio tokens?

  3. How does context accumulation affect token usage over longer conversations?

  4. Is the token count mostly coming from audio, or from the growing context window?

  5. Are there any architectural details available about how Gemini Live manages session memory and context?
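Questions 1, 3, and 4 can at least be framed with a toy model. The sketch below assumes the worst case, that the full prior context (system instructions plus all earlier audio) is read again on every exchange. Whether Gemini Live really works this way is exactly what I don't know; `token_breakdown`, its parameters, and the tokens-per-second rate are placeholders of mine, not documented values.

```python
def token_breakdown(exchange_seconds, tokens_per_second, system_tokens):
    """Split session tokens into fresh audio vs. re-read context.

    exchange_seconds: audio seconds (user plus model) for each exchange.
    Assumes the whole prior context is re-read before every exchange.
    """
    fresh_total = 0
    reread_total = 0
    context = system_tokens  # system instructions are present from the start
    for seconds in exchange_seconds:
        fresh = int(seconds * tokens_per_second)
        reread_total += context  # prior context counted again this exchange
        fresh_total += fresh     # newly produced audio tokens
        context += fresh         # new audio joins the context
    return fresh_total, reread_total
```

Under this assumption the fresh audio total grows linearly with session length, while the re-read total grows roughly with the square of the number of exchanges. That would be one possible explanation for reported usage exceeding the simple per-second estimate, but it needs to be checked against real usage metadata.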

I feel like many developers understand text model token consumption reasonably well, but voice-to-voice token accounting is still somewhat of a black box.

Has anyone found a detailed explanation, benchmark, blog post, or conducted experiments on this?

reddit.com
u/Beginning_Race8551 — 1 day ago