u/rockdarko — reddlx

Notes on monitoring Claude Code in production: OTel temporality, cost-as-estimate, cost attribution by team

Hi! Just a little post about getting visibility into Claude Code usage when an org rolls it out across teams (or for your unhinged self, obviously). Claude Code emits OpenTelemetry metrics now (yay!). That landed quietly in the docs (and the community is busy vibe coding, I guess), but it ends up being very practical if you have a real OTel-based o11y stack. I've been running this on a Prometheus backend for a bit and wanted to share the gotchas worth knowing up front, because none of them are obvious from the spec.

https://preview.redd.it/n1ty9j2tsi1h1.png?width=1840&format=png&auto=webp&s=3c8581089be9546b68a4dcab74bd6757b5793e5c

For the dashboard side: I built one in PromQL that implements all of this, as the OSS-stack parallel to the existing Azure Application Insights dashboard (25052 by 1w2w3y on Grafana Labs). MIT licensed. Write-up of the implementation with more screenshots is on my blog.

Article: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/

Dashboard: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/

Repo: https://github.com/rockdarko/claude-code-metrics-prometheus

Things that bit me, or that I had to verify against the source, if you care:

- Pin OTel temporality to cumulative. The SDK's default is cumulative now, but defaults have drifted across versions. If you end up with delta on the producer and a Prometheus-flavored consumer expecting cumulative, your rate() queries return wrong numbers silently. Set OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative explicitly on the Claude Code side and stop wondering.
- The cost number is a client-side estimate. Claude Code computes it from token counts and per-model pricing at request time. It's useful for trend visibility and per-team attribution, but it won't match Anthropic's invoice to the cent. Cache discounts, prompt caching, and pricing changes mid-window are the main drift sources.
- PR / commit counters only increment when Claude Code itself opens the PR or commit (e.g., via gh CLI inside a session). PRs the developer opens manually afterward don't register. Worth knowing before you wire alerts around "team X opened zero PRs this week."
- OTEL_RESOURCE_ATTRIBUTES is the lever for org-wide visibility. Setting attributes like team=, cost_center=, project= at SDK startup propagates as Prometheus labels and gives you per-team / per-project rollups out of the same metrics. The per-user data is exposed (user.id is included by default); what you do with that is your org's call.
- Cache hit ratio is the single biggest lever on monthly cost. It is the difference between a sustainable bill and an alarming one. Worth a dedicated panel.
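
To make the two environment variables above concrete, here is a minimal sketch in Python of how a launcher script might assemble them. The function names and the label values are made up for illustration, and percent-encoding the values is my assumption about how to keep spaces or commas from breaking the key=value list; check it against your own setup before relying on it.

```python
import os
from urllib.parse import quote


def format_resource_attributes(attrs: dict[str, str]) -> str:
    """Render key/value pairs as a comma-separated key=value string.

    Values are percent-encoded so spaces and commas cannot break the list.
    """
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())


def claude_code_otel_env(attrs: dict[str, str]) -> dict[str, str]:
    """Build the environment overrides for a Claude Code session."""
    return {
        # Named in the post: pin temporality instead of trusting defaults.
        "OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE": "cumulative",
        # Named in the post: these become the labels used for rollups.
        "OTEL_RESOURCE_ATTRIBUTES": format_resource_attributes(attrs),
    }


# Hypothetical label values, for illustration only.
overrides = claude_code_otel_env(
    {"team": "platform eng", "cost_center": "cc-42", "project": "checkout"}
)
env = {**os.environ, **overrides}  # pass as env= to whatever launches the session
```

Building the string in one place keeps label names consistent across teams, which matters because a typo in a label name silently creates a separate series.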

Hope this helps a few in here! 😄 Cheers

reddit.com

u/rockdarko — 5 days ago

▲ 49 r/sre

Observability for AI tooling: Grafana dashboard for Claude Code's OpenTelemetry metrics on Prometheus

Hi! I'm an SRE who got pretty excited when Claude Code added the ability to emit OpenTelemetry metrics. Felt like that capability landed pretty quietly out there, so I built a Grafana dashboard on top.

https://preview.redd.it/6llimh66pi1h1.png?width=1840&format=png&auto=webp&s=61945c7ef15ec3ab45c34888ab77359171760f5a

The metrics mostly cover what you'd want to watch: cost, cache hit ratio, active time, tool decisions, lines of code. Compatible with Prometheus, VictoriaMetrics, Mimir, Thanos.

https://preview.redd.it/2wydaoj7pi1h1.png?width=1820&format=png&auto=webp&s=816aa081f92981aa10ab56eb3d492eabfab78b8b

Parallel implementation of dashboard 25052 by 1w2w3y (Azure Application Insights / KQL). Every panel rewritten in PromQL.

https://preview.redd.it/pdnyz1j8pi1h1.png?width=1833&format=png&auto=webp&s=0ccff65ce3b5762e7c04f365f633a930469df485

Things worth flagging up front (covered in the article):

- Temporality settings matter. Pin to cumulative or you'll get silently broken rates.

- Cost is a client-side estimate; it won't match Anthropic billing to the cent.

- The PR counter only increments when Claude Code itself opens the PR (e.g., via gh CLI inside a session); manual PRs don't register.

- Custom labels via OTEL_RESOURCE_ATTRIBUTES extend the dashboard to per-team / per-project / per-cost-center views. For org-wide rollouts the same labels enable cost attribution by team or cost center; the per-user data is exposed too, and what you do with it is up to you.
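
To show why the temporality point produces silently broken rates rather than errors, here is a small Python sketch. It is a deliberately simplified stand-in for what a Prometheus-style increase does (no extrapolation, no time windows), and the sample values are made up.

```python
def counter_increase(samples: list[float]) -> float:
    """Total increase across samples that are assumed to be cumulative.

    A drop between neighbours is treated as a counter reset, so the new
    value counts in full (a simplification of Prometheus reset handling).
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return total


# The same activity exported two ways (made-up token counts per scrape).
cumulative = [10, 25, 45, 70]  # running total, what the consumer expects
delta = [10, 15, 20, 25]       # per-interval amounts

true_increase = counter_increase(cumulative)  # 60
misread_increase = counter_increase(delta)    # 15: no error, just wrong
```

Both calls return a plausible-looking number, which is exactly why the mismatch is easy to miss on a dashboard.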

Article with the walkthrough: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/

Dashboard on Grafana Labs: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/

Repo (MIT): https://github.com/rockdarko/claude-code-metrics-prometheus

u/rockdarko — 5 days ago

▲ 21 r/grafana

Grafana dashboard for Claude Code CLI metrics on a Prometheus-compatible backend

This dashboard consumes Claude Code's OTLP metrics on Prometheus-compatible backends (Prometheus, VictoriaMetrics, Mimir, Thanos), with all queries in PromQL.

https://preview.redd.it/91di760hoo0h1.png?width=1840&format=png&auto=webp&s=4f36834f24ff6f38c840ed23d37add196557e2dd

Panels: cost by model/project/user, cache hit ratio, active time, edit-decision breakdowns, leaderboards. Custom labels for per-team / per-project views via OTEL_RESOURCE_ATTRIBUTES.
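
For anyone wondering what a cache hit ratio panel boils down to, here is a rough Python sketch of the arithmetic. The definition (cache reads over all prompt-side tokens) and the token type names are assumptions for illustration, not necessarily what the dashboard's PromQL does, and the numbers are made up.

```python
def cache_hit_ratio(tokens_by_type: dict[str, float]) -> float:
    """Share of prompt-side tokens that were served from cache."""
    cache_read = tokens_by_type.get("cacheRead", 0.0)
    prompt_side = (
        cache_read
        + tokens_by_type.get("cacheCreation", 0.0)
        + tokens_by_type.get("input", 0.0)
    )
    # Avoid dividing by zero when there was no prompt-side traffic at all.
    return cache_read / prompt_side if prompt_side else 0.0


# Made-up token increases over some window, keyed by token type.
window = {"input": 2_000, "cacheCreation": 8_000, "cacheRead": 90_000, "output": 5_000}
ratio = cache_hit_ratio(window)  # 0.9
```

Output tokens are left out of the denominator on purpose here, since they are never cacheable; a different definition would shift the number but not the trend.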

Parallel implementation of dashboard 25052 by 1w2w3y, which targets Azure Application Insights via KQL. Every panel rewritten in PromQL for the OSS metrics stack. Credit to that author for the original concept.

https://preview.redd.it/8bzzqlikoo0h1.png?width=1833&format=png&auto=webp&s=0343f83bb6e092c5e6ed8e4a25496d48b07e1c90

Direct download: https://grafana.com/grafana/dashboards/25255-claude-code-metrics-prometheus/

Article: https://rockdarko.dev/posts/grafana-dashboard-for-claude-code-on-prometheus/

Repo (MIT, PRs welcome): https://github.com/rockdarko/claude-code-metrics-prometheus

Happy to answer questions about the panel queries or extend with what people want.

u/rockdarko — 10 days ago