u/vibehacker2025

Yesterday one of our workflows blew through the memory ceiling on n8n Cloud and made the entire instance inaccessible. Not just slow, but fully unreachable (503 error). Couldn’t load the UI, couldn’t disable the workflow, couldn’t do anything. Restarting the instance didn’t help because the workflow just spun right back up and OOM’d it again.

We were down for several hours waiting for support to come online (they’re on London time, we’re US) before someone could unstick it on their end.

Fully owning my part: I almost certainly pushed something unoptimized. Too much data per batch, and probably retaining execution data I didn’t need. Lesson learned on that front.

What’s eating at me though is the recovery side. Support confirmed there’s no self-serve way to remotely disable or quarantine a specific workflow when you can’t access the platform. Their advice was “optimize your workflows” and “upgrade your plan,” which… yes, fine, but doesn’t help in the moment when you’re staring at a 503 and your clients are pinging you.

We run client-facing stuff on n8n with actual SLAs attached, so a multi-hour outage with zero levers to pull is a real problem.

So how are people handling this?

External health/memory monitoring with alerting before you hit the ceiling?
Team conventions around batch size, execution data retention, etc.?
Any kind of watchdog workflow / circuit breaker pattern people have built?
Anyone found a way to remotely disable a workflow when the UI is dead, or is support genuinely the only path?
For the self-hosters: did you move critical stuff off Cloud specifically because of this, and was it worth the ops overhead?
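
To make the watchdog / circuit breaker question concrete, here is a rough sketch of the kind of external script I have in mind, run from a machine outside n8n. It is untested against Cloud and rests on assumptions: that the instance answers a `/healthz` probe (documented for self-hosted; I have not confirmed it on Cloud), that the public API's `POST /api/v1/workflows/{id}/deactivate` endpoint with an `X-N8N-API-KEY` header is enabled on our plan, and that the API still responds while the instance is degraded. That last one is the big caveat: once everything returns 503, this call would presumably fail too, so it could only help if it trips early. The names `Breaker`, `is_healthy`, `deactivate_request`, and `watch` are made up for illustration.

```python
import http.client
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class Breaker:
    """Trips after `threshold` consecutive failed health probes."""
    threshold: int = 3
    failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True once the breaker trips."""
        self.failures = 0 if healthy else self.failures + 1
        return self.failures >= self.threshold


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe the instance; any error, timeout, or non-200 counts as unhealthy."""
    try:
        # ASSUMPTION: /healthz is reachable on this instance.
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (e.g. 503), and timeouts are all OSError subclasses.
        return False


def deactivate_request(base_url: str, api_key: str, workflow_id: str) -> urllib.request.Request:
    """Build the public-API request that deactivates one workflow."""
    # ASSUMPTION: the public API is enabled and this endpoint is available.
    return urllib.request.Request(
        f"{base_url}/api/v1/workflows/{workflow_id}/deactivate",
        method="POST",
        headers={"X-N8N-API-KEY": api_key},
    )


def watch(base_url: str, api_key: str, workflow_ids: list[str], interval: float = 30.0) -> None:
    """Poll forever; when the breaker trips, try to deactivate the listed workflows."""
    breaker = Breaker()
    while True:
        if breaker.record(is_healthy(base_url)):
            for wid in workflow_ids:
                try:
                    urllib.request.urlopen(
                        deactivate_request(base_url, api_key, wid), timeout=10
                    ).close()
                    print(f"deactivated workflow {wid}")
                except (OSError, http.client.HTTPException) as exc:
                    # Expected if the instance is already fully unreachable.
                    print(f"could not deactivate {wid}: {exc}")
            breaker.failures = 0  # start counting again after an attempt
        time.sleep(interval)
```

I would call it as something like `watch("https://<our-instance>", api_key, ["<workflow id>"])` with the IDs of the heavy workflows. If anyone knows whether the API actually stays reachable in the minutes before an OOM on Cloud, that would tell me whether this is worth building at all.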

I love n8n and don’t want to migrate; I’m just trying to make sure this doesn’t happen to us again.

One bad workflow took down our entire n8n instance for 4+ hours with no way to kill it from outside