r/Observability

KumaAlert v3.2 released! v4.0 AND ANDROID landing...

Hey everyone!

It's been over a month since our last update and we've got a few changes to talk about. Today we've launched v3.2 (along with a few minor changes since v3.0.0; full details below), and we have a few updates about v4.0 and the Android launch.

v3.2 - The Fast One: A huge update and architecture overhaul to speed up the app.

New Features:

Now notifies how long a monitor has been down - live outage in minutes/hours
Optional "last updated" caption on the widget - Thanks to JO for the suggestion

Performance:

Complete overhaul of the backend architecture - heartbeat history now lives in memory instead of being repeatedly rewritten to the on-device database.
The monitor tab no longer re-parses every monitor on each load. Incoming status changes are applied at a steady rate rather than all landing at the same time and freezing the app.
Overview tab no longer constantly rebuilds itself - live average responses update in place. No abilities or features were removed!
The app is now much smaller and takes up less space. On first launch, it will clear all unused history.

v4.0 and Android

KumaAlert iOS v4.0 and Android v1.0 launch on Wednesday 15th July, bringing new features for iOS and the initial release of the Android app after several weeks of development. I'm super excited to announce this launch: v4.0 will include several new features to take your Uptime Kuma instance to the next level, including incident management, incident timelines, emergency mode and a new Apple Watch app!

Download: https://apps.apple.com/gb/app/kumaalert/id6760863575

Website: https://kumaalert.app

Discord: https://embers.cafe/discord (General chat & Beta testing)

Finally, thank you so much to the whole community for your continued support. Thanks to all the Android beta testers who have contributed, and are continuing to contribute, to the design and testing of the app, and thanks to Mark/Jo for assisting on the Discord server!

KumaAlert Team (i.e., Bobby!)

u/Panic135 — 3 days ago

▲ 1 r/Observability

What do you think of AI agent observability in production?

We shipped our first agent to production three months ago and observability has been the part nobody prepared me for. When something breaks and a user reports it, I still end up reconstructing what happened manually across multiple workflows before I can even start debugging.

reddit.com

u/Karzmo — 3 days ago

▲ 0 r/Observability+2 crossposts

Building a lightweight Kafka monitoring tool for small teams — worth paying for it

Been running Kafka in production for a while now and honestly the monitoring situation for small teams sucks. Confluent Control Center is way overkill/expensive, Datadog's Kafka integration is priced like you're a 200-person company, and the open source stuff (AKHQ, Kafdrop, Burrow) works but needs someone to babysit the setup, patch it, and actually understand consumer lag internals to make sense of it.

I'm thinking about building a simple hosted tool: just point it at your cluster, get consumer lag alerts, topic health, broker metrics, no Prometheus/Grafana stack to maintain.

If you're running Kafka on a small team (like 2-10 devs), what do you currently use for this? Would you actually pay for something dead simple over self-hosting the OSS stack, or is that a dealbreaker for you? Trying to figure out if this is a real problem or just something that annoys me specifically.

Note: I have used AI for corrections.

reddit.com

u/Gloomy-Long-8045 — 3 days ago

▲ 1 r/Observability

Is there already a tool that pinpoints which AI coding step introduced a regression? (not just logs it after)

I've spent the last few weeks going deep on the AI observability space: LangSmith, Braintrust, AgentOps, Langfuse, all of it. They're all great at one thing: telling you what happened. Traces, spans, evals, metrics.

That's not really my problem when I'm vibe coding though. My problem is dumber than that.

An agent makes 30 changes in a row. Everything looks fine at the time. Some time later I notice something's not working right. Now I have no real way to know which of those 30 changes was the actual cause.

So I end up either re-prompting and hoping, manually bisecting through the session myself, or just reverting a big chunk of work and starting over.

What I keep wishing existed is something that sits there while I'm coding, checkpointing as it goes, and the moment something stops working, it actually runs the tests/compile checks right then and tells me exactly which step caused it, with proof rather than a guess, and lets me jump back to right before that step. Bonus if it remembers the failure so it doesn't happen again, and if it can eventually trace a live production issue back to the session that introduced it.
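
For what it's worth, the "manually bisecting through the session" part can be sketched as a binary search over checkpoints. This is only an illustration, not an existing tool: it assumes every agent step leaves a restorable checkpoint, that the breakage persists once introduced, and that a hypothetical `passes(step)` callback restores the checkpoint after that step and runs the tests or compile checks.

```python
from typing import Callable


def first_bad_step(num_steps: int, passes: Callable[[int], bool]) -> int:
    """Return the 1-based index of the first step whose checkpoint fails.

    Assumes the state before step 1 (index 0) passes, the state after the
    last step fails, and that once a step breaks things they stay broken.
    """
    good, bad = 0, num_steps  # invariant: `good` passes, `bad` fails
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(mid):
            good = mid
        else:
            bad = mid
    return bad


if __name__ == "__main__":
    # Simulated session: 30 steps, step 17 introduced the regression.
    runs = []

    def passes(step: int) -> bool:
        runs.append(step)  # each call stands in for one full test run
        return step < 17

    print("first bad step:", first_bad_step(30, passes), "test runs:", len(runs))
```

With 30 changes this takes about five test runs rather than thirty, which is why checkpointing every step matters more than logging every step.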

Does something like this already exist and I've just missed it? Or is this a problem I've mostly invented from watching one too many long sessions go sideways?

reddit.com

u/No-Temporary4325 — 4 days ago

▲ 27 r/Observability+2 crossposts

Cerberus: A drop-in Prometheus, Loki & Tempo gateway for ClickHouse

Translate PromQL, LogQL, and TraceQL into optimized CH SQL — keep Grafana, swap the backend.

cerberus.foo

u/tcostasouza — 7 days ago

▲ 16 r/Observability

Has anyone integrated LiteLLM with OpenTelemetry in production?

I've been running LiteLLM in production for a while and recently started testing its OpenTelemetry integration to get end-to-end traces across the gateway, provider calls, and the rest of our services.

The documentation looks solid, especially around the newer tracing model and the GenAI semantic conventions, but I'm curious about real-world experience rather than the happy path.

I'm particularly interested in things like trace propagation across services, span hierarchy, sampling strategies, exporter choice, and whether you keep the default span structure or enable the dedicated litellm_request span. I'm also wondering how people are handling prompt/response logging versus privacy requirements and whether anyone is exporting to collectors before sending data to platforms like Langfuse, Phoenix, Jaeger, or Datadog.

For those already using this in production, what has worked well, what didn't, and are there any pitfalls or configuration choices you wish you had made differently?

reddit.com

u/jeann1977 — 7 days ago

▲ 5 r/Observability

Visualizing AI agent traces as a tree — what's missing when runs get large?

I'm leading AgentPrism, an open-source tool for visualizing AI agent traces, but I'm not here to pitch it — I want feedback from people building complex agent-style workflows, and this felt like the right room to ask.

I've been experimenting with workflows on different platforms like n8n and Mastra, plus the Claude and Codex CLIs directly. The idea I'm chasing: import AI traces and render the run as a tree, so it's easier to inspect what happened across nodes, retries, branches, and tool calls — instead of scrolling raw spans.

What I'd love to hear from people running this for real:

- What parts of execution debugging are most painful for you?
- Do tree-style views actually help, or do you prefer another way to inspect runs?
- When workflows get large, what information is missing from the standard execution view?

I'll share back what I learn in the comments. Happy to show what we've built if anyone's curious.

reddit.com

u/Federal_Can5247 — 6 days ago

▲ 0 r/Observability

I've built the most flexible uptime monitoring tool out there

Hi all,

I've been working on my own uptime monitoring tool called Hesklo for the past few months. It started with the goal of making my own monitoring more flexible, but has grown into a full product with advanced flows, loads of notification options and public status pages.

The core idea: drag your monitor onto a canvas, wire it to waits, branches and notify steps, and that diagram is your escalation policy. A very visual way to set up monitoring flows and automation.

It's still early but it's working and I'm using it for myself and customer sites. It's pretty fast, reliable and working well for my use case. The docs can be found here if needed: https://www.hesklo.com/docs

So feel free to test it out! There's a free tier that includes one monitor and all functionality, except for the public status pages. Any feedback or product requests are more than welcome of course. 🙂

reddit.com

u/ExpertBlink — 6 days ago

▲ 0 r/Observability

We've been building an observability and LLM tools platform — looking for early adopters/beta users

Hey everyone,

We've spent the past year building Reiver, a platform that combines APM and a unified LLM gateway in one product, written in Rust. The gateway will not have a fee like OpenRouter's 3%. There are some restrictions based on account tier, but the fee for the gateway is 0. No per-host charges. No per-seat charges.

What it does:

APM: distributed tracing, error tracking, log aggregation, real-time metrics, and continuous profiling, with correlation between all of those. Dashboard widgets support PromQL and SQL.
LLM Gateway: route requests to different providers (OpenAI, Anthropic) through a single OpenAI-compatible API with automatic failover, PII redaction, prompt injection protection, templated and type-checked prompt management, canary prompt deployment, input/output tool/topic blocking, LLM-as-a-judge, cost tracking and many more features.
AI Agent integration: MCP server so your AI agents (Claude, Cursor, etc.) can query your dashboards, alerts, and traces directly.
Agent Hub: an A2A discovery/permission layer where you can connect agents with the same protections as in the gateway.

We're looking for a couple of teams to use Reiver for free during the beta period. After beta, early adopters get discounted rates. In exchange we'd ask for:

Feedback
Filing bugs when you hit them
Honest input on what's missing

Good fit if you are:

Using LLMs in production and want unified cost visibility + observability in one tool
Tired of stitching together Datadog + LangSmith (or similar) and paying a huge amount of money for both
Running a team of 5-50 engineers where current APM pricing doesn't make sense

While we are in beta, you can use the Stripe sandbox credit card as a payment method.

reddit.com

u/AdCute4280 — 7 days ago

▲ 6 r/Observability+1 crossposts

Trace Explorer: nice UI but why is everything so sluggish

We’re currently using the OTEL Collector, and from a technical perspective the data arrives in Google Cloud just fine. So this is not about ingestion being broken or traces/logs missing. The data is there, the integration works.

And yes, the UI of Trace Explorer and Log Explorer looks really good. No question. But honestly, that is not a strong differentiator anymore. Other tools have good UIs too.

We explicitly chose a cloud solution because we did not want the maintenance overhead. No self-hosted stack, no updates, no operations burden, no “who is responsible for keeping this thing alive?” That was the whole point.

But to be honest, I’m starting to prefer the LGTM stack a lot more, especially Grafana. It just feels better for day-to-day usage.

What bothers me most about Trace Explorer is not even a missing feature. It is the constant loading animation, the choppy browser experience, and the feeling that the UI is overloaded. Everything takes too long. Everything feels sluggish. When I’m looking at traces or logs, I don’t need a pretty loading spinner. I need instant results.

And I’m not the only one. My colleagues and I only open it when we really have no other option. That says a lot.

For context: I’m not using some old machine. I’m on a MacBook Pro with an M4 chip and Vivaldi. My colleagues are also on modern MacBooks, some using Safari. Still, Trace Explorer often feels like clicking through molasses.

Maybe I’m spoiled by Grafana and similar tools, but observability tools need to be fast. When I’m debugging an issue, I don’t want to fight the tool.

How do you deal with this? Are you using Trace Explorer / Log Explorer productively and actually happy with it? Do you have any workarounds, browser tips, or do you forward everything into other tools?

Disclaimer: This text is translated and polished with AI

reddit.com

u/Skeltek — 6 days ago

▲ 8 r/Observability

I'm about to pioneer observability in my current company, give me some advice

A bit of context: My current company is a team of about 30 people, no dedicated DevOps team, a subsidiary company of a bigger corp.

We have a dozen monolithic codebases on AWS infra, mostly on ECS, with a few more arriving on Fargate and Lambda.

Lately there have been quite a few instances of bad legacy architecture and coding practices leading to some services essentially DDoS-ing themselves or adjacent dependencies. Coupled with the alarming number of supply chain attacks and vulnerabilities recently, corporate has grown paranoid enough to invest seriously in "monitoring and security enhancement".

I have been advocating for better observability for quite some time, but it stopped at better logging practices and adopting Sentry for a couple of projects.

This is a golden opportunity to build and pioneer an observability stack "the right way", and I intend to take full advantage of it.

My colleagues aren't familiar with observability at all, but are willing to learn and adopt better tooling and practices.

As for myself, I have had luck with OTel + VictoriaMetrics/VictoriaLogs/VictoriaTraces + Grafana for some of my personal stuff, but obviously not at the same scale as ~10 production applications.

If it were up to me, I would just use that same stack, but to present a fair overview of the ecosystem to my colleagues and management, I also need to consider competitors: ClickHouse-based products like SigNoz and ClickStack (and OpenObserve?), as well as third-party vendors like Datadog and Splunk.

Documentation and videos can only get me so far. There are a few points that require extensive experience:

1/ Functionality-wise, what can ClickHouse-based products and third-party vendors offer that is not possible on an LGTM stack or a Victoria stack?

2/ Cost-wise, how would each differ: LGTM vs ClickHouse vs third party? I know this is a very vague question that depends a lot on specifics, so let's just say I have 10 projects that can operate comfortably on 2 vCPU / 8 GB RAM ECS instances. How would the costs compare?

3/ Strategy-wise. For context, I intend to use the standard Agent-To-Gateway Pattern setup. But should I:

pick 2 or 3 projects and collect both application and eBPF telemetry?
collect eBPF telemetry for all projects first and slowly adopt application telemetry, since that would require no code changes for current projects?
collect application telemetry first and slowly adopt eBPF?
any other suggestion?

I would love to hear opinions and experiences people have had in similar situations.

Any insight is appreciated

reddit.com

u/Lumethys — 8 days ago

▲ 6 r/Observability+4 crossposts

AMA with Josh: what slows teams down after they find a risk?

Finding risks is usually not the hard part anymore.

The harder questions are:

- Is this actually important?

- Who owns it?

- What application does it affect?

- Is there evidence for the control?

- What should we fix first?

- What can AI safely help with?

I’m hosting an AMA with Josh from IBM Concert to talk about how teams move from findings to action across application risk, compliance, resilience, and remediation.

We can also get into how Concert helps with things like application context, compliance controls, evidence assessment, vulnerability prioritization, remediation planning, integrations, and AI-assisted workflows.

Drop your questions below.

reddit.com

u/therealabenezer — 7 days ago

▲ 6 r/Observability+1 crossposts

How do production teams manage Prometheus scrape config for node exporter in AWS EC2 environments?

I am running Prometheus on an EC2 instance scraping node exporter from other EC2 servers using static private IPs in prometheus.yml.

Trying to understand what is considered production practice:

Do teams manually update prometheus.yml or use GitOps / IaC to generate it?
Is EC2 service discovery commonly used instead of static IPs?
In AWS setups, do people rely on tags (like monitoring=true) for target discovery, or is that just for automation tools?
Is node exporter always kept on private subnets with SG restricted only to Prometheus?
Do production setups ever expose node exporter outside VPC (with protections), or always internal only?
At what scale does static config become a problem, and what is usually the first upgrade path?
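
To illustrate the "GitOps / IaC to generate it" option from the questions above, here is a minimal sketch that builds Prometheus file-based service discovery targets from an instance inventory instead of hand-editing prometheus.yml. The inventory shape, the `instance_name` label and the `build_file_sd` helper are assumptions for illustration (in practice the list might come from Terraform output or the EC2 API); the `monitoring=true` tag is the one mentioned above, and 9100 is node exporter's default port.

```python
import json


def build_file_sd(instances, port=9100):
    """Build a Prometheus file_sd target list from tagged instances.

    Only instances tagged monitoring=true are included; each becomes its
    own target group so its labels stay attached to its own address.
    """
    groups = []
    for inst in instances:
        tags = inst.get("tags", {})
        if tags.get("monitoring") != "true":
            continue
        groups.append({
            "targets": [f"{inst['private_ip']}:{port}"],
            "labels": {"instance_name": tags.get("Name", "unknown")},
        })
    return groups


if __name__ == "__main__":
    # Hypothetical inventory; addresses and names are made up.
    inventory = [
        {"private_ip": "10.0.1.10", "tags": {"Name": "web-1", "monitoring": "true"}},
        {"private_ip": "10.0.1.11", "tags": {"Name": "batch-1"}},
    ]
    print(json.dumps(build_file_sd(inventory), indent=2))
```

The resulting JSON would be written to a file referenced from `file_sd_configs` in the scrape job. Prometheus also ships `ec2_sd_configs`, which discovers instances directly and makes the generated file unnecessary.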

Current setup works, but not sure if this is a “lab pattern” or something actually acceptable in production. Here is the screenshot of the config file that I used. I removed the content marked in yellow because that wasn't working. Seeking your suggestions.

https://preview.redd.it/fwcd4ezpt5ah1.png?width=1111&format=png&auto=webp&s=17b25ad5676a7a32a73956c429184d435dd95680

https://preview.redd.it/mbfhga9xs5ah1.jpg?width=1866&format=pjpg&auto=webp&s=690e79d3d89e3dc3f084ab43b3c55513f5d15619

reddit.com

u/Wide_Impact_9392 — 7 days ago

▲ 14 r/Observability+2 crossposts

How to Generate RED Metrics from Traces Without Blowing Up Your Cardinality?

I wrote a post on how to generate RED metrics from your traces at the Collector before they hit your backend and why you'd want to do that instead of letting your backend handle it.

I also added some tips on how not to blow up your metric cardinality in the process.
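
As a rough illustration of the idea (not the linked post's method, and not how the Collector implements it), RED metrics can be derived from spans by counting requests, counting errors and aggregating durations per label set. The span fields and the `red_from_spans` helper below are assumptions. Cardinality stays bounded because series are keyed only by service and span name, never by per-request attributes such as user or request IDs. In the OpenTelemetry Collector itself, the span metrics connector from the contrib distribution is the usual place to do this.

```python
from collections import defaultdict


def red_from_spans(spans, window_seconds):
    """Aggregate spans into per-(service, name) rate, error and duration metrics."""
    buckets = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for span in spans:
        key = (span["service"], span["name"])  # low-cardinality labels only
        bucket = buckets[key]
        bucket["count"] += 1
        bucket["errors"] += 1 if span["error"] else 0
        bucket["total_ms"] += span["duration_ms"]

    metrics = {}
    for key, bucket in buckets.items():
        metrics[key] = {
            "rate_per_s": bucket["count"] / window_seconds,
            "error_ratio": bucket["errors"] / bucket["count"],
            "avg_duration_ms": bucket["total_ms"] / bucket["count"],
        }
    return metrics


if __name__ == "__main__":
    # Two spans from a hypothetical 10-second window; user_id is ignored on purpose.
    spans = [
        {"service": "api", "name": "GET /users", "error": False, "duration_ms": 10.0, "user_id": "u1"},
        {"service": "api", "name": "GET /users", "error": True, "duration_ms": 30.0, "user_id": "u2"},
    ]
    print(red_from_spans(spans, window_seconds=10))
```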

telflo.com

u/Broad_Technology_531 — 9 days ago

▲ 5 r/Observability

Prometheus exporter vs OTLP for Temporal SDK metrics in multi-worker deployments

I just wrote up a detailed comparison of these two approaches, specifically for the case where you run multiple Temporal workers on the same host (bare metal, PM2, systemd).

The core issue is that the Prometheus exporter starts an in-process HTTP server. Scale to 2+ workers on the same machine → every worker tries to bind :9464 → EADDRINUSE. You can assign unique ports per worker, but now your Prometheus scrape config is tightly coupled to your process management.

The alternative: OTLP push to a shared OpenTelemetry Collector. All workers push to grpc://localhost:4317; the collector aggregates and serves Prometheus text format on :9464. One scrape target regardless of worker count, with no port management.
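
The port collision is easy to reproduce outside Temporal. The sketch below does not use the Temporal SDK; two plain TCP sockets stand in for two workers' in-process metrics servers, and the `try_bind` helper is made up for illustration. It uses an OS-assigned port rather than 9464 so it does not clash with anything already running.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind a listening TCP socket; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


if __name__ == "__main__":
    first, _ = try_bind(0)  # port 0 lets the OS pick a free port
    port = first.getsockname()[1]
    second, err = try_bind(port)  # a second "worker" wants the same port
    print("second bind hit EADDRINUSE:", err == errno.EADDRINUSE)
    first.close()
```

Pushing over OTLP sidesteps this because workers only make outbound connections; the collector is the only process that listens.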

The post includes:

- Working OTel Collector config (OTLP receiver → batch processor → Prometheus exporter)
- Docker Compose with proper resource limits
- PM2 ecosystem config with per-worker service names
- Startup guard script so the collector doesn't fail silently
- Honest discussion about metrics loss when the collector is down
- Comparison table of both approaches

https://2ssk.medium.com/temporal-sdk-metrics-prometheus-exporter-vs-otlp-for-multi-worker-deployments-df9327b28fc5

Would be interested to hear what others are using. I know K8s changes the equation since each worker is its own pod with its own port — Prometheus operator handles that well. But for bare metal / PM2 users, OTLP has been a big improvement.

TLDR: Prometheus exporter for single workers/Docker/K8s, OTLP for multi-worker on same host.

(GPT has been used to write this body.)

reddit.com

u/ban_rakash — 8 days ago

▲ 37 r/Observability+10 crossposts

Do you actually need Kafka between your OTel collector and ClickHouse?

Kafka → ClickHouse is the default pattern for OTel pipelines, and for org-wide streaming with replay and many consumers it's a great fit. But for a lot of single-sink observability setups, it's a cluster you're babysitting for no reason.

This post compares where the Kafka layer does real work vs. where you can drop it. It also checks what processing the Collector can or can't do alone (stateful dedup, enrichment-conditional filtering, dynamic sampling, etc.)
https://www.glassflow.dev/blog/opentelemetry-to-clickhouse-do-you-need-kafka?utm_source=reddit&utm_medium=socialmedia&utm_campaign=reddit_organic

Curious what others run:

Kafka buffer,
straight from the collector, or
a lighter processor in between

Leave your comments below, I'd like to discuss the options and understand what folks are using these days!

glassflow.dev

u/Marksfik — 14 days ago

▲ 2 r/Observability

MSP Monitoring Stack – Looking for Architecture Recommendations

Hi everyone,

I'm looking for some advice from people who have built monitoring platforms for Managed Service Providers.

We're currently using PRTG, but we're planning to replace it with a more modern and scalable monitoring stack.

## Requirements

- Multi-tenancy for both **metrics** and **logs**
- Ability to build dashboards that are:
  - Customer-specific (e.g. Customer A → Hosts 1–100)
  - Cross-customer (e.g. Host 1 from every customer on a single dashboard)
- Retention of **1 year** for both metrics and logs
- Alerting with:
  - Alert grouping
  - Acknowledgements
  - Comments on alerts
- Web UI and mobile app support

## Preferred Approach

Ideally, we'd like to stay as close to the Prometheus ecosystem as possible.

Some customer environments already have InfluxDB, but if possible I'd like to avoid maintaining multiple time-series databases and standardize on a single stack.

Is a "Prometheus-only" (or Prometheus ecosystem) approach realistic for this use case?

## Environment

We currently manage approximately:

- ~50 customers
- 35-node Ceph cluster
- ~200 firewalls
- Juniper switches
- Linux servers
- Windows servers
- VMware
- Proxmox
- Hyper-V

## Questions

- What monitoring stack would you build today for an MSP?
- Would you use Prometheus + Mimir + Loki + Grafana, or something completely different?
- How do you implement multi-tenancy?
- What do you use for alert management (acknowledgements, comments, escalation, mobile app, etc.)?
- Would you completely eliminate InfluxDB, or are there good reasons to keep it around?

I'd really appreciate hearing about real-world architectures and lessons learned from anyone running monitoring at MSP scale.

Thanks!

reddit.com

u/Lost_Advance6517 — 11 days ago

▲ 8 r/Observability

Homelab Observability... what are people actually using?

Just starting out with a homelab and want to set up a small but useful observability stack, with enough dashboards to understand what my services are doing without turning the observability stack into the largest thing in the lab.

I'm interested in learning how people run observability at home or in small self-hosted setups: what stack are you using, and what else should I consider in the initial stage? I’m less interested in the “enterprise perfect architecture” answer and more interested in the “this gives me useful signal without eating my weekend” answer... :)

Any help would be appreciated

reddit.com

u/HistoricalMost5922 — 12 days ago

▲ 9 r/Observability+5 crossposts

things i wish i knew before evaluating AI agents in production

been working through agent evaluation properly and wanted to share a few things that actually changed how i think about it.

start from the symptom not the layer

wrong tool being called is a component problem. correct answer but too many steps is a trajectory problem. final answer looks wrong is an outcome problem. unsafe action or injection risk is an adversarial problem. once you map symptoms to layers debugging gets way faster.

most teams only check final outputs

trajectory evaluation catches a whole class of failures that output checking misses entirely including duplicate calls, loops, unnecessary retries and cost blowouts.
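
a minimal sketch of what a trajectory check can look like, assuming a trajectory is just a list of (tool name, arguments) pairs in execution order. the function name and the step budget are made up for illustration, not taken from any eval framework.

```python
import json
from collections import Counter


def trajectory_issues(calls, max_steps=10):
    """Flag duplicate tool calls and over-long trajectories.

    `calls` is a list of (tool_name, args_dict) tuples in execution order.
    """
    issues = []
    # Same tool with the same arguments counts as a duplicate call.
    seen = Counter((name, json.dumps(args, sort_keys=True)) for name, args in calls)
    for (name, args), count in seen.items():
        if count > 1:
            issues.append(f"duplicate call: {name}({args}) x{count}")
    if len(calls) > max_steps:
        issues.append(f"too many steps: {len(calls)} > {max_steps}")
    return issues


if __name__ == "__main__":
    run = [("search", {"q": "a"}), ("search", {"q": "a"}), ("fetch", {"url": "x"})]
    for issue in trajectory_issues(run):
        print(issue)
```

a run like this can still produce a correct final answer, which is exactly why output-only checks never see it.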

an uncalibrated LLM judge is worse than no judge

if you haven't validated your LLM as judge against a small set of human labels you're adding noise on top of noise. calibration is not optional.
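
one way to do that validation, sketched under the assumption that you have the same examples labelled by a human and by the judge: compute cohen's kappa, which is agreement corrected for chance. what counts as "good enough" agreement is a judgment call for your own use case and is not something this snippet decides.

```python
def cohen_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum((human.count(label) / n) * (judge.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical labels for four examples.
    human = ["pass", "pass", "fail", "fail"]
    judge = ["pass", "fail", "fail", "fail"]
    print(cohen_kappa(human, judge))
```

raw agreement here is 75%, but kappa is lower because half of that agreement would be expected by chance alone.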

convert every production failure into a test case

before your next release not after. within a few cycles you have a regression suite that actually catches things before deployment.

adversarial testing is not optional

if your agent reads external content or takes real actions, indirect prompt injection through tool outputs is a real failure mode most eval setups ignore entirely.

if you want to go deeper on all of this we have a hands on bootcamp on june 27 where we cover all four layers live with real notebooks: https://www.eventbrite.co.uk/e/agent-evals-bootcamp-tickets-1990306501323?aff=raieval

u/camerongreen95 — 11 days ago

▲ 1 r/Observability+3 crossposts

The hard part of autonomous SRE was never the AI. It's how much you trust it.

An AI agent just did the 3 AM on-call diagnosis I used to wake up for. In 30 seconds. On my laptop. With nothing but open source.

So I filmed the whole thing. One continuous take, no cuts. I crashed a real pod, the kernel killed it, and ~30 seconds later a full post-mortem landed in Slack: cause, fix, how to prevent the next one. No human on the keyboard.

Then I showed it failing. On camera, I triggered a slow memory leak the agent doesn't catch - memory climbing 20 MB a minute while the dashboard swears everything is "100% healthy." Most vendor demos quietly cut that part. I think it's the most important part.

Because the hard part of autonomous SRE was never the AI. It's how much you trust it.

That's Episode 1. Four more to go - all free, all open source.

I would truly love to hear your thoughts: where would you draw the line on letting an agent act on your cluster, not just diagnose it?

u/natishalomX — 14 days ago
