Moving an agentic workflow to production without building a custom control plane from scratch
I’ve been building out a multi-agent workflow for our internal ops (primarily handling automated document parsing -> RAG lookup -> updating an internal CRM based on specific business rules).
Initially, we hacked it together using raw LangChain and a custom FastAPI wrapper. It worked fine as a PoC, but as we moved it closer to production we hit a wall of boring, non-AI infrastructure problems:
- **State & Memory Management:** Keeping track of conversation states and tool-execution logs across multiple steps.
- **Data Privacy:** Management refused to let customer data leave our VPC, so standard API wrappers were out.
- **The "Oops, it hallucinated a function call" Problem:** Building manual evaluation loops to catch agent failures before they hit the CRM.
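To make the first and last bullets concrete, here's a minimal sketch of the kind of guard I mean: check every model-proposed tool call against a registry before it runs, and log the outcome either way so there's a per-run trace. This is an illustration under my own assumptions, not LangChain's or Lyzr's API; `ToolSpec`, `RunState`, `update_crm_record`, and `execute_tool_call` are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Hypothetical registry entry: required argument names plus a handler."""
    required_args: set[str]
    handler: Callable[..., Any]


@dataclass
class RunState:
    """Per-run state: every proposed call and its outcome, in order."""
    run_id: str
    log: list[dict] = field(default_factory=list)


def update_crm_record(record_id: str, status: str) -> dict:
    # Stand-in for the real CRM write (assumption: the real one is an API call).
    return {"record_id": record_id, "status": status}


TOOLS = {
    "update_crm_record": ToolSpec({"record_id", "status"}, update_crm_record),
}


def execute_tool_call(state: RunState, call: dict) -> Any:
    """Validate a model-proposed call before executing it; log either way."""
    name = call.get("name")
    args = call.get("args", {})
    spec = TOOLS.get(name)

    if spec is None:
        error = f"unknown tool: {name!r}"  # the hallucinated-function case
    elif not isinstance(args, dict) or set(args) != spec.required_args:
        error = f"bad arguments for {name}: {args!r}"
    else:
        error = None

    if error is not None:
        # Rejected calls never reach the handler, so nothing touches the CRM.
        state.log.append({"call": call, "status": "rejected", "error": error})
        return None

    result = spec.handler(**args)
    state.log.append({"call": call, "status": "ok", "result": result})
    return result
```

In a real setup the rejected entries would presumably be fed back to the agent for a retry or escalated to a human, and `state.log` would live in a database instead of memory.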
We looked at building our own orchestration layer, but it felt like a massive waste of sprint capacity to reinvent the wheel for logging, RBAC, and local Docker/VPC execution.
We ended up testing out **Lyzr** because its SDK allows for local deployment/VPC hosting right out of the box, and it ships with a built-in control plane/trace logs so I didn't have to build an observability UI myself.
It saved us a ton of plumbing work, but it’s not a magic bullet. Configuring complex multi-agent handoffs in their framework took a minute to click, and the documentation could definitely be more thorough for niche edge cases.
For those running complex multi-agent workflows *locally* or in a private cloud: did you build your own microservices framework for agent state/monitoring, or did you lean on an existing low-code/SDK wrapper like Lyzr or LangGraph? What are the scale bottlenecks I should look out for next?