
Open-source GPU observability with workload attribution - maps DCGM metrics to pods/jobs/teams (K8s + Slurm, OTLP)
A common pain point in multi-team GPU clusters: DCGM tells you a node is at 90% utilization. It doesn't tell you which team, pod, or job is driving that.
We open-sourced l9gpu to solve this. It's a node-level agent that emits GPU metrics via OTLP with full workload attribution baked in.
- Kubernetes: maps metrics to pod, namespace, and deployment
- Slurm: maps metrics to job, user, and partition
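
For anyone wondering what "attribution" means mechanically, here is a rough Python sketch of the general idea, not l9gpu's actual code. It assumes you already have the PIDs using each GPU and each PID's cgroup path (e.g. read from /proc/<pid>/cgroup). The regexes assume common kubelet and Slurm cgroup layouts, and the function and field names are made up for illustration.

```python
import re
from typing import Optional

# Assumed cgroup layouts (illustrative, not taken from l9gpu):
#   Kubernetes: .../pod<uid>/...  (uid uses "-" or, with the systemd driver, "_")
#   Slurm:      /slurm/uid_<uid>/job_<jobid>/...
_K8S_POD = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)
_SLURM_JOB = re.compile(r"/job_(\d+)")


def workload_from_cgroup(cgroup_path: str) -> Optional[dict]:
    """Guess the owning workload from a process's cgroup path."""
    m = _K8S_POD.search(cgroup_path)
    if m:
        # Normalize the systemd-style underscores back to a regular UID.
        return {"scheduler": "kubernetes", "pod_uid": m.group(1).replace("_", "-")}
    m = _SLURM_JOB.search(cgroup_path)
    if m:
        return {"scheduler": "slurm", "job_id": m.group(1)}
    return None  # Not a pod or a Slurm job (system process, etc.)


def attribute(gpu_samples: list, pid_cgroups: dict) -> list:
    """Attach workload labels to each GPU sample based on the PIDs using it."""
    out = []
    for sample in gpu_samples:
        owners = []
        for pid in sample["pids"]:
            w = workload_from_cgroup(pid_cgroups.get(pid, ""))
            if w is not None and w not in owners:
                owners.append(w)
        out.append(
            {"gpu": sample["gpu"], "util_pct": sample["util_pct"], "workloads": owners}
        )
    return out
```

A pod UID or job ID on its own is not very useful, so a real agent would still need to resolve it to pod/namespace/deployment (via the kubelet or API server) or to user/partition (via Slurm) before exporting.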
What's included:
- NVIDIA, AMD MI300X, Intel Gaudi support
- LLM inference metrics (vLLM, SGLang, TGI)
- Vendor-neutral OTLP export
- Pre-built Grafana dashboards
- 17 Prometheus alert rules
- MIT licensed, derived from Meta's gcm project
https://github.com/last9/gpu-telemetry
How are others handling GPU cost attribution and chargeback in shared clusters?