r/Observability

▲ 22 r/Observability+7 crossposts

Cosmo - Real-time PostgreSQL TUI Dashboard (v0.2.0)

Just shipped Cosmo — a clean TUI to monitor your Postgres database in real-time.

GitHub: https://github.com/mujib77/cosmo

Live overview, active queries, WAL rate, locks, and more.

I’m actively working on more features and on support for older versions.
Would love your feedback and suggestions!

u/VermicelliLittle6451 — 20 hours ago
▲ 11 r/Observability+5 crossposts

Anyone using telemetry data in tandem with AI coding agents?

Hey folks 👋

I'm building an open-source dev tool that turns telemetry data into knowledge graphs that can be used as context in AI coding agents for debugging purposes or improving performance & costs.

Why? My intuition is threefold:

(1) coding agents are much more useful when they understand how a system actually behaves in production, not just what the repo looks like

(2) using raw telemetry data (for example traces) doesn't really work with coding agents at scale

(3) telemetry context graphs might be even cheaper and more efficient to query compared to using raw telemetry data
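To make point (3) concrete, here is a minimal Python sketch. It is not the tool described in this post: it assumes spans arrive as dicts with hypothetical `service`, `parent_service`, and `duration_ms` fields, and folds them into a small service graph whose edges carry call counts and mean latency. The idea is that an agent could query a handful of edges instead of reading every raw span.

```python
from collections import defaultdict


def build_service_graph(spans):
    """Fold raw spans into caller -> callee edges with aggregate stats."""
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for span in spans:
        parent = span.get("parent_service")
        if parent is None:  # root span: no incoming edge to record
            continue
        edge = (parent, span["service"])
        totals[edge]["calls"] += 1
        totals[edge]["total_ms"] += span["duration_ms"]
    return {
        edge: {"calls": s["calls"], "mean_ms": s["total_ms"] / s["calls"]}
        for edge, s in totals.items()
    }


# Hypothetical sample spans, invented purely for illustration.
SAMPLE_SPANS = [
    {"service": "api", "parent_service": None, "duration_ms": 120.0},
    {"service": "db", "parent_service": "api", "duration_ms": 40.0},
    {"service": "db", "parent_service": "api", "duration_ms": 60.0},
    {"service": "cache", "parent_service": "api", "duration_ms": 2.0},
]

graph = build_service_graph(SAMPLE_SPANS)
```

Four spans collapse into two edges here; whether that compression holds up on real traces is exactly the assumption being sanity-checked.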

Before spending too much time on this & going down the rabbit hole, I'm trying to sanity-check my assumptions and assess if this is actually useful for people building/running AI systems in production. Curious to hear from software engineers that have tried something like this: what worked & what didn't, etc.

Happy to hear thoughts directly in the comments and if anyone's interested in helping out with feedback on the actual tool as I build it, please let me know and I can send more details in private - not my intention to spam anyone.

Appreciate it 🙇

u/n4r735 — 1 day ago

A VC walks into an observability conference… (Observability Summit MSP, through Friday) VC (and ex-founder) looking to meet founders

About 15 years in observability now. Built products. Co-founded a company in the space. Been on both sides of a few exits, as both an operator and an investor. These days I back infra and observability founders at an early-stage (pre-seed, seed) VC firm. Most of what we invest in is based on Commercial Open Source Software (COSS), but that's not a hard and fast rule.

I'm in MSP for the Observability Summit Thursday and Friday. If you're a founder building in observability, or just local and building anything adjacent, I'd love 15 minutes of your time. Coffee, hallway corner, whatever works.

No pitch, no agenda. I mostly want to meet people who are actually shipping. Happy to trade operator scar tissue if any of it is useful to you.

Comment or DM and I'll send the booking link to meet at the Observability Summit in MSP Thursday and Friday this week.

u/Forward-Device382 — 1 day ago

Is production observability worth paying for when your stack already feels overloaded?

We keep adding tools to solve problems and at this point our devops stack feels bloated. Every time something breaks, we introduce another monitoring or logging layer and it just gets heavier. Better visibility helps, but now a lot of time goes into maintaining the tools instead of fixing issues. Feels like we are adding complexity without getting faster at resolving incidents.

Curious if anyone has found something that actually improved resolution time without adding more overhead. Could be a process change, a workflow tweak, or even consolidating tools instead of stacking them.

Maybe this is just a setup problem on our side, but I'm interested in what actually worked for other teams.

u/EnoughGrade1906 — 3 days ago
▲ 3 r/Observability+2 crossposts

How to track marketplace visitors?

I will launch my marketplace soon. I prefer open-source tools, with the goal of capturing as much visitor information as possible before we start a marketing campaign.

u/DR_Fabiano — 5 days ago
▲ 2 r/Observability+1 crossposts

How do you currently turn noisy incident logs into a useful timeline?

I’m working on a small local prototype around incident-log grouping, and I’m trying to understand whether this output shape is actually useful for SRE/DevOps workflows.

During incidents, exported alerts/logs often become repeated noise:

- database pool exhaustion

- payment gateway timeout

- retry storm

- checkout failure

- unrelated background events

The prototype, Crystal Incident Lens, tries to turn exported events into:

- incident groups

- timestamp-backed incident paths

- related-incident candidates

- resource usage numbers

Example recovered path from a demo incident set:

db_pool_exhaustion -> payment_gateway_timeout -> retry_storm -> checkout_failure
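The prototype's engine is not public, so the following is only a rough Python sketch of the output shape, not its method. It assumes events are `(ISO timestamp, event type)` pairs, drops repeats, ignores event types the caller marks as unrelated, and orders what remains by first occurrence. The sample timestamps and the `cache_refresh` type are invented for illustration.

```python
from datetime import datetime


def first_seen_path(events, ignore=frozenset()):
    """Order event types by their earliest timestamp, dropping repeats."""
    first_seen = {}
    for ts, kind in events:
        if kind in ignore:
            continue
        when = datetime.fromisoformat(ts)
        if kind not in first_seen or when < first_seen[kind]:
            first_seen[kind] = when
    ordered = sorted(first_seen.items(), key=lambda item: item[1])
    return [kind for kind, _ in ordered]


# Hypothetical exported events; deliberately out of order and repetitive.
EVENTS = [
    ("2024-01-01T10:00:05", "payment_gateway_timeout"),
    ("2024-01-01T10:00:00", "db_pool_exhaustion"),
    ("2024-01-01T10:00:06", "payment_gateway_timeout"),
    ("2024-01-01T10:00:09", "retry_storm"),
    ("2024-01-01T10:00:07", "cache_refresh"),
    ("2024-01-01T10:00:15", "checkout_failure"),
    ("2024-01-01T10:00:16", "retry_storm"),
]

path = first_seen_path(EVENTS, ignore={"cache_refresh"})
print(" -> ".join(path))
```

First-seen ordering only gives a temporal sequence. It says nothing about causality, which is the harder part of the question.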

I also tested it on a 2,500-line slice of Rootly AI Labs’ public logs dataset:

https://github.com/Rootly-AI-Labs/logs-dataset

Current measured result:

- 2,500 real log lines

- 10 incident groups

- 0 queue rejections

- 27.42 events/sec

- 21.43 MB working set RAM

- 56.0 CPU-sec

Important boundaries:

- This is a prototype research project.

- It is not a Datadog/PagerDuty replacement.

- It is not an LLM wrapper.

- The runnable engine is not public yet.

- Rootly is only used as a public dataset source; this is not affiliated with or endorsed by Rootly.

What I’m trying to validate:

Can a local event-memory layer reduce noisy operational exports into a smaller, evidence-backed incident picture without sending logs to a cloud API?

Questions for SREs / DevOps / incident response people:

  1. Is this output shape useful?

  2. Do you care more about deduplication or causal timeline?

  3. What exported log/alert format should I test next?

  4. What would make this convincing enough to try on anonymized real incidents?

Proof repo:

https://github.com/Antriksh005/CRYSTAL_GITHUB_PUBLIC

Disclaimer: I built this prototype in my spare time; there is no paid product or cloud service behind it.

u/Salt_Diamond5703 — 5 days ago
▲ 6 r/Observability+1 crossposts

Are there any otel compatible symbolicators? Do we even need one?

Disclaimer: I am not promoting anything, I'm building an observability platform but it will not be mentioned in this post. If I build anything because of this post it will be MIT open source and otel compatible.

Hi,

I've been dealing with stack traces from mobile and frontend OTel clients coming in obfuscated. I've built a small symbolicator I use for sourcemaps, but right now it only works with JavaScript and it's not a standalone tool compatible with OTel.

I'm considering building a symbolicator that can work with sourcemaps, Flutter, Android, and iOS. I think this would be fun to do, but it's obviously a ridiculous amount of work, so before I go down the rabbit hole I wanted to ask how people are handling this in general.
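For readers unfamiliar with the sourcemap side: the `mappings` field of a version 3 source map is a string of Base64 VLQ segments, and decoding those is the first step in mapping a minified position back to source. Below is a minimal Python sketch of that decoding step only. It is not the symbolicator mentioned above, and the function name is made up for this example.

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def decode_vlq(segment):
    """Decode one Base64 VLQ segment into a list of signed integers."""
    values, shift, acc = [], 0, 0
    for ch in segment:
        digit = B64.index(ch)
        acc |= (digit & 0b11111) << shift  # low 5 bits carry data
        if digit & 0b100000:  # continuation bit set: more digits follow
            shift += 5
            continue
        sign = -1 if acc & 1 else 1  # lowest bit of the value is the sign
        values.append(sign * (acc >> 1))
        shift, acc = 0, 0
    return values


# A 4-field segment: generated column, source index, original line,
# original column. All are deltas relative to the previous segment.
fields = decode_vlq("IAAM")
```

A full symbolicator still has to accumulate those deltas across segments and look up the nearest mapping for each stack frame. The native mobile formats use entirely different symbol files.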

If you do have mobile/frontend monitoring, are there OSS tools that address this currently? Is symbolication even needed?

Any experience you share would be greatly appreciated.

Thank you!

u/narrow-adventure — 7 days ago

The Observability Cosmos

https://preview.redd.it/aq25kl10ba1h1.png?width=900&format=png&auto=webp&s=396df6e2c70baa549c9eaf8287639a59de470372

So, I have built a map of the observability space.

The market seems to be evolving and growing at an incredible rate. New specialisms are developing and AI is changing the nature of observability itself. This is an attempt to identify some kind of order and structure. It currently encompasses 126 products (with many more to come) across 16 categories.

It has been launched as a beta - so any feedback is welcome.

If you want to dive straight in and explore the Cosmos, this is your launchpad:
https://observability-360.com/Product/Cosmos

There is also an introductory article here:
https://observability-360.com/article/viewArticle?id=introducing-the-observability-cosmos

And an explanation of the classifications here:
https://observability-360.com/article/viewArticle?id=observability-cosmos-classifications

u/Observability-Guy — 7 days ago

Things I wish I knew before sitting the Dynatrace Implementation Pro certification exam

Hot take after passing Dynatrace Implementation Pro: the official learning path isn’t enough.
It teaches breadth. The exam tests operational detail you only pick up by actually configuring things in production: ActiveGate config via agctl, Log API v2 payload types, Environment permissions hierarchy, federation types, custom service detection per language…
Davis AI and DQL? Manageable if you use them daily. But this config-level detail, you only learn it by spending real time on the platform.
What saved me: a quiz built from my own notes, drilled for two weeks. Official videos become useless after one viewing.
Sitting Pro Administration in two weeks. If anyone’s prepping for Associate, Implementer or Admin and wants a hand, happy to help.

u/Zeavan23 — 8 days ago
▲ 7 r/Observability+1 crossposts

Anyone actually doing pattern analysis across their agent's traces, or are we all just eyeballing dashboards?

Genuine question. Been thinking about this all week.

That Obsidian + Claude guide going around right now is good. Capture everything, let Claude read across your notes, surface connections you missed. I run something similar for my own reading list. It works.

But here's what's been bugging me. The same engineers sharing that post have agents in production generating thousands of traces a day. Every trace is a decision the agent made while nobody was watching. Every trace gets dumped into LangSmith or Langfuse and never looked at again.

That's not a second brain. That's the graveyard with good folders the guide explicitly warns about.

Your Obsidian vault compounds because something reads across it. Your trace store doesn't compound because nothing does. New trace lands, old trace forgotten. The knowledge your agent generates about its own failures evaporates the moment the request returns 200.

The asymmetry is wild when you actually look at it. We spend a Sunday wiring up N8N so Claude can find patterns in our reading list. Then Monday we ship an agent to prod with zero mechanism to find patterns across the agent's own behavior. A regression in pattern A and a regression in pattern B look identical in the dashboard. Both returned 200. Both took 4 seconds. Nothing tells you the agent took two different paths to get there. A new failure mode shows up and gets logged next to 40,000 successful runs that look exactly like it.

The loop the Obsidian guide describes (capture, connection, return) is exactly what's missing for agents. Capture is already automatic, every observability tool does it. Connection is the part nobody's doing. And without connection there's no return, no ritual of going back and noticing what shifted.

So what's everyone actually doing here? Custom clustering on traces? Scheduled LLM passes over recent runs? Some kind of embedding-based grouping? Or is it really just dashboards and prayer?
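As a starting point for the "custom clustering" option, here is a minimal Python sketch. It assumes a hypothetical trace shape (a dict with a `steps` list of named steps) and does not use the LangSmith or Langfuse APIs. It groups traces by the ordered path of steps they took, so two runs that both returned 200 still land in different groups if the agent got there differently.

```python
from collections import Counter


def path_signature(trace):
    """Reduce a trace to the ordered tuple of step names it executed."""
    return tuple(step["name"] for step in trace["steps"])


def group_by_path(traces):
    """Count traces per distinct execution path."""
    return Counter(path_signature(t) for t in traces)


# Hypothetical traces: all "succeed", but one skips the search step.
TRACES = [
    {"status": 200, "steps": [{"name": "plan"}, {"name": "search"}, {"name": "answer"}]},
    {"status": 200, "steps": [{"name": "plan"}, {"name": "search"}, {"name": "answer"}]},
    {"status": 200, "steps": [{"name": "plan"}, {"name": "answer"}]},
]

groups = group_by_path(TRACES)
rare_paths = [path for path, count in groups.items() if count == 1]
```

Exact-match signatures are crude: any small variation produces a new group. Embedding-based grouping would be the fuzzier version of the same idea, and tracking how group sizes shift week over week is one way to get the "return" step.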

u/Finorix079 — 10 days ago

Observability Platform for Internal Coding Tools?

Founder of an AI SaaS startup here. I'm looking for a telemetry or observability platform for internal coding tools - like Cursor, WindSurf, Claude Code and others.

I've been using PostHog for observability and telemetry on my production deployment. I'm not concerned about that.

We are a remote team and we use multiple coding tools, like Cursor, WindSurf, Claude Code, and VS Code. It's a team of twelve people, not many, so we haven't restricted anyone from using any tools. All the laptops are company-managed devices. We do have a Tailscale VPN and a secure web gateway.

I want to observe and monitor what my team is actually using Claude Code, WindSurf, and other coding agents for. We are already spending $70-80k per month on token costs.

I just want to minimally observe that they are not using it for their personal projects or some external open source contributions.

Do you guys know any easy way to do this? Are there any observability platforms or other software that do this for coding agents?

u/AssociationSure6273 — 13 days ago

Datadog Log Monitor false alerting during pod restarts — need help

I have a NestJS cron job running every 15 mins on Kubernetes EKS. It emits cron_job_success or cron_job_failure on completion. My Datadog log monitor alerts when count == 0 over a rolling window.

Problem: Pod restarts for ~2 mins, monitor evaluates during the gap, sees 0 logs, fires false alert. Pod recovers, cron succeeds, log is right there.

Query:

logs("service:my-service env:my-env ((@1.jobEvent:myCronJob @1.jobStatus:cron_job_success) OR (@data.jobStatus:cron_job_failure @data.jobName:myCronJob))").index("*").rollup("count").last("16m") == 0

Already tried: 300s evaluation delay, changing window from 15m to 20m. Neither fixed it.

Need:

Pod restarts 2 mins and recovers → no alert

Pod down entire 15 mins → alert

Cron fails → alert

Is the fix in the query, monitor config, or both?
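One way to reason about the window size before touching the monitor is a toy model. The Python sketch below is not Datadog's evaluation logic: it assumes the monitor checks once per minute and counts logs in a trailing window, and all timestamps are invented minutes. It separates two cases the description above doesn't distinguish: a run that is delayed by the restart, and a run that is skipped because the pod was down at the scheduled tick.

```python
def empty_windows(log_minutes, window, horizon):
    """Return the evaluation minutes whose trailing window holds no log."""
    alerts = []
    for now in range(window, horizon + 1):
        # Trailing window is (now - window, now], in whole minutes.
        if not any(now - window < t <= now for t in log_minutes):
            alerts.append(now)
    return alerts


# Hypothetical completion-log timestamps, in minutes.
ON_TIME = [0, 15, 30, 45, 60]
DELAYED = [0, 15, 32, 45, 60]  # the minute-30 run lands 2 minutes late
SKIPPED = [0, 15, 45, 60]      # the minute-30 run never happens

delayed_16 = empty_windows(DELAYED, window=16, horizon=60)
delayed_20 = empty_windows(DELAYED, window=20, horizon=60)
skipped_20 = empty_windows(SKIPPED, window=20, horizon=60)
```

In this model a 16-minute window fires once on the 2-minute delay, a 20-minute window does not, and a skipped run leaves a 30-minute gap that neither window covers. If the 20-minute window still alerted in practice, it may be worth checking whether the run was skipped rather than delayed, since a skipped run and a pod that was down for a whole interval look the same by log count alone.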

u/According_Stop_6284 — 11 days ago
▲ 4 r/Observability+3 crossposts

GoMotz - Network monitoring system

GoMotz – a free, self-hosted Domotz alternative for Raspberry Pi

I just released the beta version of GoMotz – an open-source network monitoring system built in Go, designed to run on a Raspberry Pi.

What it does

*Network Dashboard*

- Public IP detection with one-click copy

- ISP & ASN info, live uptime tracking

- Latency stats, success rate, connection history

*Device Monitoring*

- Auto-discovers all devices (IP, MAC, hostname, vendor)

- Filter Online / Offline / Conflict

*Network Tools*

- Portscan, TCP Check, DNS Lookup, Traceroute, Ping, HTTP(S) Check, Speedtest

*Monitoring*

- Device, TCP Port, SNMP, Ping, HTTP(S), Domain Expiry monitors

*Requirements:* Raspberry Pi 4 (2GB+), fully self-contained, no cloud.

It's beta – bugs are expected! Would really appreciate testing, feedback, and issue reports from the community.

🔗 GitHub: https://github.com/mascarenhasmelson/gomotz

📖 Read how it started: https://www.0xmm.in/posts/monitoring/

u/rawpackets — 11 days ago
▲ 48 r/Observability+1 crossposts

CNCF TOC votes in favor of OTel Graduation

The CNCF Technical Oversight Committee (TOC) has voted to approve the OTel due diligence document.

This is one of the final steps towards graduation: the thorough due diligence, which included interviews with end users and resolution of the recommendations given in previous steps, has been finished and approved by the TOC 🎉

github.com
u/jpkroehling — 13 days ago

i just spent fifteen minutes logging into aws, navigating through five different javascript-heavy menus, and waiting for pages to load just to spin up a single instance for testing. the sheer amount of bloat on these web consoles is driving me insane. i want to live entirely in my terminal and never look at a web gui again. does anyone have a solid workflow for finding and provisioning cloud hardware strictly through the cli without having to write a massive custom python script just to hit their apis?

u/Aven_Reed — 14 days ago