




Guide to building smoltorrent | A Distributed Storage System for ML Checkpoints
I wrote up an article diving deep into a distributed checkpoint storage system built on a cluster of 4× Raspberry Pi 4B boards (4 GB RAM each)!
Stats are given below:
942 MB checkpoint numbers:
Real setup: Mac mini M4 coordinator + 4× Pi 4B workers.
If you train models on home clusters and live in fear of losing checkpoints…
this one’s for you.
A few interesting engineering problems popped up while building it:
- checkpoint writes are not atomic → the watcher sometimes detects partially written safetensors files
- slow Raspberry Pi SD cards created backpressure during parallel shard replication
- retry logic without checksums caused silent corruption bugs early on
- mDNS discovery sounds simple until nodes disappear/rejoin mid-transfer
- shard sizing mattered much more than expected because tiny shards killed throughput with socket overhead
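To make the first problem above concrete, here is a minimal sketch of one way a watcher could tell whether a safetensors file is still being written. It relies only on the published safetensors layout (an 8-byte little-endian header length, a JSON header, then the tensor bytes addressed by `data_offsets`). The function name `safetensors_is_complete` is made up for illustration and is not necessarily how smoltorrent does it. A matching size is also only a necessary condition, so a real watcher would probably still want a retry or settle delay on top.

```python
import json
import os
import struct


def safetensors_is_complete(path: str) -> bool:
    """Return True if the file is exactly as long as its own header says it should be."""
    size = os.path.getsize(path)
    if size < 8:
        return False  # not even the 8-byte length prefix is on disk yet
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return False  # the JSON header itself is still being written
        try:
            header = json.loads(f.read(header_len))
        except ValueError:
            return False  # truncated or garbage header
    if not isinstance(header, dict):
        return False
    # Every key except "__metadata__" describes one tensor; data_offsets is
    # [begin, end) relative to the start of the data section.
    ends = [
        entry["data_offsets"][1]
        for name, entry in header.items()
        if name != "__metadata__"
    ]
    expected_size = 8 + header_len + max(ends, default=0)
    return size == expected_size
```

A watcher could call this on every filesystem event and only start sharding once it returns True. If you control the training script, writing to a temporary name and renaming at the end avoids the problem at the source.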
Current design:
- coordinator splits safetensors into shards
- automatic fallback to replica during restore
- filesystem watcher retries incomplete checkpoints until finalized
- Prometheus/Grafana/Loki stack for monitoring + alerts
- mDNS discovery to get rid of hardcoded IPs
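A rough sketch of the split and restore-with-fallback idea from the list above, plus the per-shard checksums that the early retry logic was missing. This is an in-memory illustration under my own assumptions, not smoltorrent's actual code: `split_into_shards`, `restore`, and the `Fetch` callable type are hypothetical names, SHA-256 is just one reasonable checksum choice, and a real implementation would stream over sockets rather than hold everything in RAM.

```python
import hashlib
from typing import Callable, List, Sequence

# Hypothetical interface: given a shard index, return that shard's bytes from
# one node, or raise OSError if the node is unreachable.
Fetch = Callable[[int], bytes]


def split_into_shards(blob: bytes, shard_size: int) -> List[bytes]:
    """Cut a checkpoint into fixed-size shards (the last one may be shorter)."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [blob[i:i + shard_size] for i in range(0, len(blob), shard_size)]


def restore(expected_hashes: Sequence[str], sources: Sequence[Fetch]) -> bytes:
    """Rebuild the checkpoint, trying each source in order for every shard."""
    parts = []
    for index, want in enumerate(expected_hashes):
        for fetch in sources:
            try:
                data = fetch(index)
            except OSError:
                continue  # node down or connection dropped: try the next replica
            if hashlib.sha256(data).hexdigest() == want:
                parts.append(data)
                break
            # Checksum mismatch: treat it like a failed fetch and fall through,
            # instead of silently accepting corrupted bytes.
        else:
            raise RuntimeError(f"shard {index} is unrecoverable on all sources")
    return b"".join(parts)
```

The coordinator would record `hashlib.sha256(shard).hexdigest()` for each shard at split time and pass that list as `expected_hashes`. Listing the primary first and the replicas after it in `sources` gives the automatic fallback during restore.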
Honestly, the most useful part wasn’t even the storage system itself. Building it forced me to finally understand TCP flow control, retries, backpressure, partial writes, and distributed failure handling in a very practical way.
Curious how others here handle checkpoint durability on small/home clusters without relying entirely on cloud object storage.
Fully open source.
Monitoring? Prometheus + Grafana + Loki in Docker. Per-shard speeds, error counts, unified logs, and email alerts if anything becomes unrecoverable. No SSH hell.
One YAML config. One launch.sh. Done.
If you're training on a home/dorm cluster and living in fear of losing 3-day runs… this is for you.
What’s inside the article:
- Automatic watcher daemon (syncs the moment training writes a file)
- mDNS zero-config discovery
- Prometheus + Grafana + Loki monitoring (no SSH)
- Restart behaviour deep dive (coordinator down, Pi reboot, both at once)