u/24hjh

We’re currently building a fleet of Raspberry Pi 5 devices with Hailo AI Hats running computer vision workloads for the aquaculture industry. At the moment we manage around 20 devices, with many more to come.

The devices stream/process video, run inference on the edge, send telemetry and business data to Azure, and are remotely monitored/managed through our cloud platform.

Most of my background is in backend/cloud infrastructure, so while a lot of the distributed systems concepts feel familiar, operating physical edge devices definitely introduces a whole new category of problems 😄

The edge devices run containerised apps with Docker Compose, and we have one “agent” container that handles all ops-related tasks such as OTA updates and other cloud communication.

I’d be curious to hear from people who have built/operated similar systems.

Things I’m especially interested in:
- OTA update strategies (we built our own OTA platform for apps but no firmware/OS support yet)
- observability/monitoring (we use Prometheus and a custom-made dashboard)
- remote debugging (we currently SSH into the devices over ZeroTier, but we plan to drop this and replace it with outbound HTTPS because of a client’s network requirements)
- networking reliability
- handling flaky devices
- synchronization between video streams and inference results
- anything that becomes painful once you move from “few devices” to actual fleets
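
To make the outbound HTTPS idea from the remote debugging point concrete, here is a rough sketch of one way it could look. It is not our actual implementation. The agent polls the cloud for pending commands, so nothing has to listen on the device, and it backs off with jitter when the link is flaky. The URL, the function names, and the JSON response shape are all hypothetical.

```python
import json
import random
import urllib.request
from typing import Callable, Optional, Tuple

# Hypothetical endpoint, for illustration only.
COMMAND_URL = "https://example.invalid/api/devices/{device_id}/commands"


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 300.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Exponential backoff with full jitter: rng() * min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def fetch_commands(device_id: str, timeout: float = 10.0) -> list:
    """Outbound HTTPS GET; the device opens the connection itself.

    Assumes the (hypothetical) endpoint returns a JSON list of commands.
    """
    url = COMMAND_URL.format(device_id=device_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def poll_once(device_id: str, attempt: int,
              fetch: Callable[[str], list] = fetch_commands
              ) -> Tuple[Optional[list], int]:
    """Try one poll. Returns (commands, next_attempt).

    On success the attempt counter resets to 0. On a network error
    (OSError covers urllib's URLError and socket timeouts) or a bad
    JSON body (ValueError), commands is None and the counter increases,
    which makes the next backoff_delay() longer.
    """
    try:
        return fetch(device_id), 0
    except (OSError, ValueError):
        return None, attempt + 1
```

The agent's main loop would call `poll_once`, run whatever commands come back, and sleep for `backoff_delay(attempt)` after a failure. The jitter is there so that a whole fleet does not reconnect at the same instant after a site-wide outage.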

Would love to hear any lessons learned, pitfalls, or “I wish we had thought about this earlier” stories.

Operating Raspberry Pi + Edge AI fleets in production, pitfalls and lessons learned?