Hey everyone,
I’m building a B2B AI customer support agent and trying to make RAG fast enough for a future voice agent.
Right now:
- Everything is in AWS us-east-1
- Vector search is under 100ms for p99
- Using openai's text-embedding-3-small for embedding
The issue is embedding the query takes around 600ms to 1.1s every time. That’s basically my bottleneck now.
Tried the obvious stuff like keeping infra close, but no real improvement.
Couple questions:
- Are there faster embedding models that don’t kill quality?
- Does reducing dimensions actually help latency or not really?
- Is it worth self hosting a smaller embedding model for this?
- What kind of latency are people getting in real voice RAG setups?
Would really appreciate any practical tips here.