u/bakibab

A common pain point in multi-team GPU clusters: DCGM tells you a node is at 90% utilization. It doesn't tell you which team, pod, or job is driving that load.

We open-sourced l9gpu to solve this. It's a node-level agent that emits GPU metrics via OTLP with full workload attribution baked in.

- Kubernetes: maps metrics to pod, namespace, and deployment

- Slurm: maps metrics to job, user, and partition
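
To make the attribution idea concrete, here is a minimal Python sketch of the general technique: join per-GPU readings with a "who owns this GPU" table, then aggregate by owner. This is not l9gpu's actual code. `GpuSample`, `Workload`, `attribute`, and `mean_utilization_by_namespace` are hypothetical names, and the data is made up. In a real cluster the ownership table would have to come from the scheduler side (for example, pod-to-device allocations on Kubernetes or job records on Slurm).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One utilization reading for a single GPU (hypothetical shape)."""
    node: str
    gpu_uuid: str
    utilization_pct: float


@dataclass(frozen=True)
class Workload:
    """Owner of a GPU as a scheduler might report it (hypothetical shape)."""
    namespace: str
    pod: str


def attribute(samples, owners):
    """Attach workload labels to each sample; unowned GPUs become 'unallocated'."""
    rows = []
    for s in samples:
        w = owners.get((s.node, s.gpu_uuid))
        rows.append({
            "node": s.node,
            "gpu_uuid": s.gpu_uuid,
            "utilization_pct": s.utilization_pct,
            "namespace": w.namespace if w else "unallocated",
            "pod": w.pod if w else "unallocated",
        })
    return rows


def mean_utilization_by_namespace(rows):
    """Average utilization across the GPUs attributed to each namespace."""
    totals = defaultdict(lambda: [0.0, 0])  # namespace -> [sum, count]
    for r in rows:
        bucket = totals[r["namespace"]]
        bucket[0] += r["utilization_pct"]
        bucket[1] += 1
    return {ns: total / count for ns, (total, count) in totals.items()}


# Made-up example data: three GPUs on one node, two of them owned by one team.
samples = [
    GpuSample("node-a", "GPU-0", 95.0),
    GpuSample("node-a", "GPU-1", 85.0),
    GpuSample("node-a", "GPU-2", 10.0),
]
owners = {
    ("node-a", "GPU-0"): Workload("team-ml", "trainer-0"),
    ("node-a", "GPU-1"): Workload("team-ml", "trainer-1"),
}

rows = attribute(samples, owners)
by_namespace = mean_utilization_by_namespace(rows)
print(by_namespace)
```

The point of the sketch is that the node-level number alone hides the split: once each sample carries `namespace` and `pod` labels, per-team dashboards and alerts are just a group-by on those labels.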

What's included:

- NVIDIA, AMD MI300X, Intel Gaudi support

- LLM inference metrics (vLLM, SGLang, TGI)

- Vendor-neutral OTLP export

- Pre-built Grafana dashboards

- 17 Prometheus alert rules

- MIT licensed, derived from Meta's gcm project

https://github.com/last9/gpu-telemetry

How are others handling GPU cost attribution and chargeback in shared clusters?

Open-source GPU observability with workload attribution - maps DCGM metrics to pods/jobs/teams (K8s + Slurm, OTLP)