MLOps - observability at scale (agentic space)
Hi folks, has anyone here worked in the agentic AI space?
How are you handling observability for AI agents, especially around infrastructure, tracing, monitoring, debugging, and reliability at scale?
I’m particularly interested in learning from people who have experience with large-scale agentic deployments at Tier 1 tech companies. Experience from smaller implementations is still useful, but I’m mainly looking for insights from high-scale, high-complexity production environments.
Any tips, lessons learned, or recommended tooling/frameworks would be appreciated.