u/Alarmed_Tennis_6533

Looking for SREs to deploy Wachd and break it

Built an open-source on-call tool that runs AI root cause analysis inside your cluster when alerts fire: Grafana/Prometheus webhooks in, plain-English probable cause out. Self-hosted, single Helm chart, with Ollama support for air-gapped environments.

I'll walk you through setup personally. It takes about 30 minutes, and honest feedback is all I need. Coffee gift card ($15–20) as a thank you.

github.com/wachd/wachd — if you deal with alert fatigue or are migrating off OpsGenie, DM me.

u/Alarmed_Tennis_6533 — 4 days ago
▲ 0 r/sre

Looking for 5 SREs to deploy Wachd and break it

Built a self-hosted OpsGenie replacement with AI root cause analysis. Open source, Apache 2.0, free to self-host forever.

Looking for 5 engineers to deploy it in a real environment and tell me what's broken, missing, or wrong. I'll prioritise fixes based on what you find.

Helm chart, deploys in 30 minutes on Kubernetes. wachd.io — github.com/wachd/wachd

That's the honest version. No fake free

u/Alarmed_Tennis_6533 — 7 days ago
▲ 1 r/sre+1 crossposts

Atlassian confirmed OpsGenie EOL. A lot of teams are now figuring out what to move to and the options aren't obviously equivalent — some are SaaS-only, some are self-hosted, some require ripping out your whole alerting stack.

Wrote a breakdown covering the main options across both categories:

SaaS: PagerDuty, Grafana OnCall (Cloud), incident.io, Better Stack

Self-hosted / Open Source: Wachd, Grafana OnCall (OSS)

Each one covers: what it does well, what it doesn't, and who it's actually for.

Full guide: https://wachd.io/blog/opsgenie-alternatives-2026

Happy to answer questions — been through a few of these migrations.

u/Alarmed_Tennis_6533 — 18 days ago

Most alert pipelines tell you *that* something is wrong. Built Wachd to tell you *why*.

When an alert fires, it queries your existing observability tools (Prometheus/Grafana for metric history, Loki/Datadog/Splunk for error logs, GitHub/GitLab for recent commits), correlates the timeline around the alert, strips PII, and sends the result to the configured AI model to produce a plain-English probable cause for the on-call engineer.

The key design decision: it doesn't replace your observability stack, it reads from it. You keep Prometheus, Grafana, Loki — Wachd just adds a correlation + AI layer that fires automatically when an alert comes in via webhook.
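To make the "correlate the timeline, strip PII" step concrete, here is a minimal Python sketch of the idea. It is not Wachd's actual code: the `Event` type, the `build_timeline` and `redact` helpers, the 30-minute/5-minute window, and the two regexes are all assumptions made for illustration, and real PII scrubbing needs far more than emails and IPv4 addresses.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical event record: one log line, commit, or metric anomaly.
@dataclass(frozen=True)
class Event:
    ts: datetime
    source: str  # e.g. "logs", "commits", "metrics"
    text: str

# Deliberately simple patterns; an assumption, not a complete PII filter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact(text: str) -> str:
    """Replace emails and IPv4 addresses with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return IPV4_RE.sub("[IP]", text)

def build_timeline(alert_ts, events,
                   before=timedelta(minutes=30),
                   after=timedelta(minutes=5)):
    """Keep events inside the window around the alert, oldest first, redacted."""
    lo, hi = alert_ts - before, alert_ts + after
    window = sorted((e for e in events if lo <= e.ts <= hi), key=lambda e: e.ts)
    return [f"{e.ts.isoformat()} [{e.source}] {redact(e.text)}" for e in window]

if __name__ == "__main__":
    alert = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    events = [
        Event(alert - timedelta(minutes=1), "logs", "timeout from 10.0.0.7"),
        Event(alert - timedelta(hours=2), "logs", "unrelated old error"),
        Event(alert - timedelta(minutes=10), "commits", "deploy by alice@example.com"),
    ]
    for line in build_timeline(alert, events):
        print(line)
```

The resulting list of lines is what would be handed to the model as context. The window is asymmetric on purpose, since causes usually precede the alert, but the right sizes depend on your deploy cadence and scrape intervals.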

Stack integration:

- Alert sources: Grafana, Datadog, Prometheus Alertmanager, generic webhook

- Logs: Loki, Datadog, Splunk, Dynatrace

- Metrics: Prometheus, Grafana

- AI: Ollama (local/air-gapped), Claude, OpenAI, Gemini

- Notifications: Slack, Teams, SMS, voice call

Fully self-hosted, Apache 2.0, Helm chart. Air-gapped mode with Ollama for environments that can't send incident data to a cloud AI provider.
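For anyone wondering what the webhook side looks like, here is a rough Python sketch of pulling the useful fields out of a Prometheus Alertmanager webhook body. The payload keys used (`alerts`, `status`, `labels`, `annotations`, `startsAt`) follow Alertmanager's documented webhook format; the `FiringAlert` type, the `parse_alertmanager_webhook` function, and the `service` label are assumptions for illustration and not how Wachd necessarily does it.

```python
import json
from dataclasses import dataclass

# Hypothetical normalized alert; field names are illustrative only.
@dataclass(frozen=True)
class FiringAlert:
    name: str
    service: str
    severity: str
    started_at: str  # RFC 3339 string as sent by Alertmanager, left unparsed
    summary: str

def parse_alertmanager_webhook(body: str) -> list[FiringAlert]:
    """Return one FiringAlert per alert in the payload that is still firing."""
    payload = json.loads(body)
    firing = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # skip resolved alerts
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        firing.append(FiringAlert(
            name=labels.get("alertname", "unknown"),
            # Assumes your alert rules attach a "service" label.
            service=labels.get("service", "unknown"),
            severity=labels.get("severity", "unknown"),
            started_at=alert.get("startsAt", ""),
            summary=annotations.get("summary", ""),
        ))
    return firing
```

The `started_at` value is the timestamp a correlation step would centre its lookback window on, and the labels decide which logs and repos are worth querying.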

GitHub: https://github.com/wachd/wachd

Demo: https://youtu.be/VQAx-Kxhcoc

Curious what edge cases people see with the correlation approach — especially around multi-service incidents where the root cause isn't in the alerting service itself.

u/Alarmed_Tennis_6533 — 18 days ago

Wachd: a self-hosted OpsGenie replacement that tells your on-call engineer WHY the alert fired, not just that it fired. AI root cause analysis, Helm chart, Apache 2.0. wachd.io

u/Alarmed_Tennis_6533 — 21 days ago