What I learned building low-latency, high-throughput AI agents
- Know your workload.
- Before building the feature, estimate input tokens, output tokens, expected concurrency, and whether the user needs an instant response or can tolerate asynchronous processing.
- Reduce tokens.
- Do not send the full context just because it is convenient. Compress, retrieve, summarize, and preserve provenance.
- Embrace parallelism.
- If the work is independent, split it. File scans, scan/offset-based analysis, artifact classification, and output candidate generation often parallelize well.
- Microservices and queues add complexity, but they also let different stages scale, retry, and fail independently. Don't overoptimize.
- Expect failures.
- LLM APIs fail. Providers rate-limit. Responses violate the schema. Tool calls hang. Sandboxes break. Repos have bad tests. Treat every model call like a network call to a flaky dependency or data source, because that is what it is.
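
To make the last two points concrete, here is a rough Python sketch of how I would wrap a model call, assuming an asyncio codebase. Every name in it (`call_with_retries`, `run_bounded`, `parse_json_with_keys`, `ModelCallError`) is made up for illustration and does not come from any particular SDK. The `call` argument stands in for whatever provider client you use, and the retry counts and timeouts are placeholder values, not recommendations.

```python
import asyncio
import json
import random


class ModelCallError(Exception):
    """Raised when a call still fails after all retry attempts."""


def parse_json_with_keys(text, required_keys):
    """Parse a JSON object and raise ValueError if required keys are missing."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


async def call_with_retries(call, *, attempts=3, timeout_s=10.0,
                            base_delay_s=0.5, parse=None):
    """Await `call()` with a per-attempt timeout, backoff, and optional parsing.

    `call` is a hypothetical zero-argument coroutine function wrapping your
    provider client. `parse` should raise ValueError on a schema violation.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            raw = await asyncio.wait_for(call(), timeout=timeout_s)
            return parse(raw) if parse is not None else raw
        except (asyncio.TimeoutError, ConnectionError, ValueError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff with jitter before the next attempt.
                delay = base_delay_s * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay))
    raise ModelCallError(f"gave up after {attempts} attempts") from last_error


async def run_bounded(items, worker, *, max_concurrency=4):
    """Run `worker(item)` for independent items with a concurrency cap.

    Results come back in input order. A failed item is returned as its
    exception object instead of cancelling the rest of the batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(guarded(item) for item in items),
                                return_exceptions=True)
```

The concurrency cap matters because unbounded fan-out is the fastest way to hit a provider rate limit. Returning exceptions per item lets one bad file or one malformed response fail on its own while the rest of the batch finishes.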
u/tropical_vortex — 4 days ago