Sharing this because I think it's a useful case study for anyone thinking about applying multi-agent AI to network ops and infrastructure automation.
The problem:
A Florida telecom operator was managing 254 Juniper switches manually. Every provisioning job required engineers to SSH in, run diagnostics, identify issues, generate fix commands, apply them, and update the Jira ticket. End-to-end, that added up to 1,016 engineer-hours across the fleet (about 4 hours per switch).
What we built:
A 3-agent system on our platform:
- Master Orchestrator — receives network alerts (email / syslog), parses and classifies the issue, coordinates child agents, manages Jira integration
- Syslog Parser Agent — extracts device identity, error codes, severity from raw syslog messages
- Command Generator Agent — generates precise debug and remediation commands based on device context and issue type
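To make the Syslog Parser Agent's job concrete, here is a minimal sketch of the extraction step in Python. It is not our production parser. It assumes BSD-style (RFC 3164) lines with a leading priority value, and the sample line, the ParsedEvent fields, and the parse_syslog name are all made up for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Severity names by numeric level, as defined for syslog (0 = most severe).
SEVERITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "info", "debug"]

# Assumed layout: <PRI>Mon DD HH:MM:SS host process[pid]: TAG: message
LINE_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s"
    r"(?P<host>\S+)\s"
    r"(?P<process>[\w\-/.]+)(?:\[(?P<pid>\d+)\])?:\s"
    r"(?:(?P<tag>[A-Z][A-Z0-9_]+):\s)?"
    r"(?P<message>.*)$"
)


@dataclass(frozen=True)
class ParsedEvent:
    host: str            # device identity
    process: str         # daemon that emitted the message
    tag: Optional[str]   # event/error code, if the line carries one
    severity: str
    facility: int
    message: str


def parse_syslog(line: str) -> Optional[ParsedEvent]:
    """Return the structured event, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    pri = int(m.group("pri"))
    return ParsedEvent(
        host=m.group("host"),
        process=m.group("process"),
        tag=m.group("tag"),
        severity=SEVERITIES[pri % 8],  # PRI = facility * 8 + severity
        facility=pri // 8,
        message=m.group("message"),
    )


# Hypothetical sample line, not taken from the operator's network.
SAMPLE = ("<28>Mar  3 14:02:11 sw-edge-17 mib2d[1234]: "
          "SNMP_TRAP_LINK_DOWN: ifIndex 512, ifName ge-0/0/1")

if __name__ == "__main__":
    print(parse_syslog(SAMPLE))
```

Returning None on a non-matching line (instead of raising) gives the orchestrator a clean signal to retry or escalate.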
Human-in-the-loop checkpoint: an engineer reviews and approves the fix before execution. After approval, the agent applies the fix, verifies resolution, and automatically closes the Jira ticket with a full audit trail.
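The approval checkpoint can be sketched as a gate that refuses to execute anything until a reviewer says yes, and that appends to an audit trail at every step. This is a simplified illustration, not our implementation: Proposal, run_with_approval, and the approve / apply / verify callbacks are hypothetical names, and the Jira and device calls are left as injected functions rather than real API calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proposal:
    ticket: str                 # issue key (hypothetical)
    device: str
    commands: List[str]
    audit: List[str] = field(default_factory=list)


def run_with_approval(
    proposal: Proposal,
    approve: Callable[[Proposal], bool],        # the human decision
    apply: Callable[[str, List[str]], None],    # pushes commands to the device
    verify: Callable[[str], bool],              # re-checks the device afterwards
) -> str:
    """Execute the proposed fix only after explicit approval."""
    proposal.audit.append(
        f"proposed {len(proposal.commands)} command(s) for {proposal.device}"
    )
    if not approve(proposal):
        proposal.audit.append("rejected by engineer; nothing executed")
        return "rejected"

    proposal.audit.append("approved by engineer")
    apply(proposal.device, proposal.commands)
    proposal.audit.append("commands applied")

    if verify(proposal.device):
        proposal.audit.append("verified; ticket can be closed")
        return "resolved"
    proposal.audit.append("verification failed; ticket stays open")
    return "needs_attention"


if __name__ == "__main__":
    p = Proposal("NET-1", "sw-edge-17", ["show interfaces ge-0/0/1 extensive"])
    status = run_with_approval(
        p,
        approve=lambda prop: True,           # stand-in for a real review step
        apply=lambda device, cmds: None,     # stand-in for an SSH/NETCONF push
        verify=lambda device: True,          # stand-in for a post-check
    )
    print(status, p.audit)
```

Passing the side effects in as callbacks keeps the gate itself easy to test: a rejected proposal provably never reaches apply.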
Results:
- 1,016 hours → 60 hours (about a 94% reduction)
- 80% faster fault detection
- 75% fewer escalations to senior engineers
- Live in production in 4 weeks
- Still running today
What made the difference architecturally:
The key was treating this as an agentic graph, not a DAG workflow. The orchestrator owns control flow at runtime: it decides whether to escalate, retry, or hand off to a specialist agent based on what the syslog parser returns. You can't express that cleanly in Airflow or n8n. The graph has to be cyclic, and the state has to be shared and mutable across agent cycles.
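Here is a toy version of what "cyclic graph with shared mutable state" means in practice. It is a sketch, not our platform's runtime: the node names, the state keys, and the succeed_on knob (which fakes a parser that only succeeds on a later attempt) are all invented for the example. The point is the back edge from the orchestrator to the parser, which a strict DAG cannot contain.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def parse(state: State) -> str:
    # Mutates shared state; the fake parser succeeds on attempt `succeed_on`.
    state["attempts"] += 1
    state["parsed"] = state["attempts"] >= state["succeed_on"]
    return "orchestrate"


def orchestrate(state: State) -> str:
    # Routing is decided at runtime from whatever the other nodes wrote.
    if state["parsed"]:
        return "generate"
    if state["attempts"] < state["max_attempts"]:
        return "parse"          # back edge: this is what makes the graph cyclic
    return "escalate"


def generate(state: State) -> str:
    state["commands"] = ["show interfaces terse"]
    return "done"


def escalate(state: State) -> str:
    state["escalated"] = True
    return "done"


NODES: Dict[str, Callable[[State], str]] = {
    "parse": parse,
    "orchestrate": orchestrate,
    "generate": generate,
    "escalate": escalate,
}


def run(state: State, start: str = "orchestrate", max_steps: int = 20) -> List[str]:
    """Walk the graph until a node returns 'done'; return the visited nodes."""
    node, trace = start, []
    while node != "done":
        if len(trace) >= max_steps:   # guard against runaway cycles
            raise RuntimeError("step budget exceeded")
        trace.append(node)
        node = NODES[node](state)
    return trace


if __name__ == "__main__":
    state = {"attempts": 0, "max_attempts": 3, "succeed_on": 2, "parsed": False}
    print(run(state))
```

With succeed_on=2 the run visits the parser twice before generating commands. Set succeed_on above max_attempts and the same graph ends in escalate instead.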
Observability was the other unlock: knowing exactly what each agent decided and why, with full replay capability, is what made this debuggable in production.
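One simple way to get "what did each agent decide, and why" plus replay is an append-only decision log. The sketch below is an assumption about how such a log could look, not a description of our tooling: DecisionLog, replay, and the route policy are hypothetical, and replay here only means re-running a deterministic decision function on the recorded inputs and flagging steps where the answer changes.

```python
import json
from typing import Callable, Dict, List


class DecisionLog:
    """Append-only log of agent decisions, stored as JSON lines."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def record(self, agent: str, inputs: Dict, decision: str, reason: str) -> None:
        entry = {
            "step": len(self._lines),
            "agent": agent,
            "inputs": inputs,
            "decision": decision,
            "reason": reason,
        }
        # Serialising immediately freezes the entry against later mutation.
        self._lines.append(json.dumps(entry, sort_keys=True))

    def entries(self) -> List[Dict]:
        return [json.loads(line) for line in self._lines]


def replay(log: DecisionLog, policy: Callable[[Dict], str]) -> List[int]:
    """Re-run `policy` on recorded inputs; return steps whose decision differs."""
    return [e["step"] for e in log.entries() if policy(e["inputs"]) != e["decision"]]


def route(inputs: Dict) -> str:
    # Hypothetical routing rule: high severities go to a human.
    return "escalate" if inputs["severity"] in ("critical", "error") else "generate"


if __name__ == "__main__":
    log = DecisionLog()
    for sev in ("warning", "critical"):
        inputs = {"host": "sw-edge-17", "severity": sev}
        log.record("orchestrator", inputs, route(inputs), f"severity={sev}")
    print(replay(log, route))   # same policy, so no step diverges
```

Replaying with a modified policy shows exactly which past incidents would have been routed differently, which is useful before rolling out a rule change.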
Happy to go deeper on any part of the architecture in the comments.