u/Kage605

r/PCBaumeister

High-end RTX 5090 PC hard shuts down in AI/CUDA workloads, but GPU swap makes both systems mostly stable. Need help narrowing this down.

Hi everyone,

I’m trying to diagnose a very strange hard shutdown issue on a new high-end PC. I use the system mainly for AI workloads and content production with ComfyUI, especially img2vid, txt2img, upscaling and video workflows.

System A – New / Problem System

CPU: AMD Ryzen 9 9950X3D

GPU originally: RTX 5090 32 GB

RAM: 64 GB DDR5-6000 CL30

Mainboard originally: MSI MPG X870E Carbon WiFi

SSD: Samsung 990 Pro 2 TB

PSU originally: be quiet! Dark Power 14 Titanium 1200 W

OS: Windows

Cooling: 360 mm AIO

System B – Older PC

CPU: Intel i7-14700K

GPU originally: RTX 5080

RAM: 32 GB

PSU: be quiet! Straight Power 12 1000 W Platinum

Same ComfyUI workloads also run on this system

Original problem

With the RTX 5090 in the new PC, the system randomly hard shut down during ComfyUI / CUDA / AI workloads.

By hard shutdown I mean:

no bluescreen

no freeze first

no error message

instant power-off

PC can be powered on again normally afterward via the case power button
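Since an instant power-off leaves no bluescreen or error dialog, one place it may still leave a trace is the Windows System event log. Below is a minimal sketch that shells out to the built-in `wevtutil` tool and lists the most recent Kernel-Power event 41 entries (logged when Windows starts after an unclean shutdown). The function names are hypothetical helpers, and it assumes Python is run on the affected Windows machine.

```python
import subprocess

# XPath filter for Kernel-Power event 41, which Windows logs on the next boot
# after the system went down without a clean shutdown.
KERNEL_POWER_QUERY = (
    "*[System[Provider[@Name='Microsoft-Windows-Kernel-Power'] and (EventID=41)]]"
)


def build_wevtutil_command(max_events=5):
    """Return the wevtutil argument list (hypothetical helper)."""
    return [
        "wevtutil", "qe", "System",    # query events from the System log
        f"/q:{KERNEL_POWER_QUERY}",    # only Kernel-Power 41 entries
        f"/c:{max_events}",            # limit the number of results
        "/rd:true",                    # newest entries first
        "/f:text",                     # human-readable output
    ]


def recent_unclean_shutdowns(max_events=5):
    """Run the query and return the raw text output (Windows only)."""
    result = subprocess.run(
        build_wevtutil_command(max_events),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `print(recent_unclean_shutdowns())` after a shutdown should show the matching entries with timestamps. As far as I know, a BugcheckCode of 0 in those entries would be consistent with power being cut rather than a crash, but that is worth verifying against Microsoft's documentation.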

It mostly happened during AI/video workloads, not during normal gaming. Benchmarks and gaming were mostly fine. The system could pass OCCT / 3DMark / gaming, but ComfyUI could still shut it down.

First repair

The PC was sent back for repair. The following parts were replaced:

CPU

mainboard

RAM

After that, the original RAM training / boot issues seemed fixed. First boot was much faster and normal.

However, after more testing, the hard shutdowns came back with the RTX 5090 in the new PC.

With the GPU power limit reduced to around 69%, it became more stable, but it still occasionally hard shut down. Some workflows still triggered the shutdown almost 100% of the time.

Additional tests I did

I tested a lot:

clean NVIDIA driver reinstall with DDU

fresh ComfyUI installation

different ComfyUI versions / workflows

GPU reseated multiple times

GPU power connector reseated and checked multiple times

different wall socket

GPU power limit reduced

core and memory underclock tested

RAM tested individually

PSU OC/single-rail mode tested

The strange part: normal benchmarks and games could run fine, but AI/CUDA workloads triggered hard shutdowns.

GPU swap test

To narrow it down, I swapped only the GPUs between the two PCs.

Important detail:

I only swapped the graphics cards.

I did not swap PSUs.

I did not swap PSU cables.

Each PC kept its own PSU and own GPU power cable.

After the swap:

New PC + RTX 5080

Much more stable than before

Img2Vid and txt2img workloads that previously caused hard shutdowns now mostly run fine

No regular hard shutdown behavior like before

However, even with the RTX 5080, the new PC sometimes runs into OOM / memory-related errors after a few videos in some ComfyUI workflows. It does not hard shut down like before, but it is still not as smooth as expected.

Old PC + RTX 5090

Runs stably

Same ComfyUI workloads run fine

No hard shutdowns so far

This older PC can handle very large queues, sometimes 100+ jobs, without the same kind of problems

This made me think the RTX 5090 itself is probably not defective, and the new PC is not generally unstable either. The issue seems to be mainly the combination of:

new PC + RTX 5090 + its power delivery / platform behavior / AI workload transients

PSU swap test

A replacement PSU was tested in the new PC.

Important detail:

The PSU was replaced.

The PSU cables were also replaced with the new original cables.

The GPU power cable / 12VHPWR / 12V-2x6 cable was checked multiple times, both by me and after repair/testing.

Both PSU-side 12VHPWR / 12V-2x6 ports were tested.

Results:

New PSU + RTX 5080 in new PC: stable

New PSU + RTX 5090 in new PC: still hard shutdowns

It seemed slightly better at first with the new cable / second PSU port, but eventually it still shut down

Even with 69% power limit, -210 MHz core and -30 MHz memory, it still hard shut down
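For a rough sense of what the 69% cap means in watts, here is a back-of-envelope sketch. It assumes the reference default limits of 575 W for the RTX 5090 and 360 W for the RTX 5080. Those figures are assumptions, since partner cards can ship with different defaults and the exact models are not stated here.

```python
# Assumed default board power limits (reference specs, not measured values).
ASSUMED_5090_DEFAULT_LIMIT_W = 575.0
ASSUMED_5080_DEFAULT_LIMIT_W = 360.0


def capped_power_w(default_limit_w, percent):
    """Sustained power cap in watts for a given power-limit percentage."""
    return default_limit_w * percent / 100.0


capped = capped_power_w(ASSUMED_5090_DEFAULT_LIMIT_W, 69)
margin = capped - ASSUMED_5080_DEFAULT_LIMIT_W

print(f"5090 at 69%: {capped:.2f} W")
print(f"Difference to an uncapped 5080: {margin:.2f} W")
```

Under those assumptions the capped 5090 sits at roughly 397 W sustained, only about 37 W above an uncapped 5080 that is stable in the same PC. That would suggest the average draw alone does not explain the shutdowns, although the power limit only bounds the average and says little about short spikes.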

So now I have:

CPU replaced

RAM replaced

mainboard replaced

PSU replaced

new PSU cables tested

both PSU-side 12VHPWR/12V-2x6 ports tested

RTX 5080 works much better in the new PC

RTX 5090 is stable in the old PC

RTX 5090 in the new PC still causes hard shutdowns

What I’m trying to figure out

At this point, I’m confused.

Could this still be:

A GPU issue that only appears on one platform?

PCIe / platform issue, even though the mainboard was replaced?

Some weird compatibility issue between RTX 5090 and this platform / PSU / motherboard combo?

A transient power issue that standard benchmarks do not reproduce?

Something related to CUDA / AI workloads causing behavior that FurMark / 3DMark / OCCT do not catch?

Some other component or configuration in the new PC that was not replaced?

The repair/testing used standard tools like FurMark, Prime95, HDDScan and BurnInTest. Those passed, but they do not seem to reproduce the same kind of fast CUDA/AI load changes that ComfyUI creates.
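One way to get more data on what the card is doing right before a shutdown is to log GPU telemetry to disk while ComfyUI runs. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH. The function names and the `gpu_log.csv` path are hypothetical. It forces every sample to disk so the last lines survive an instant power-off.

```python
import os
import subprocess

# Fields requested from nvidia-smi's --query-gpu option.
FIELDS = ["timestamp", "power.draw", "clocks.sm", "temperature.gpu", "utilization.gpu"]


def build_nvidia_smi_command(interval_ms=100):
    """Return the nvidia-smi argument list for continuous CSV sampling."""
    return [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader,nounits",
        "-lms", str(interval_ms),  # loop with a millisecond interval
    ]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a field -> string dict."""
    values = [v.strip() for v in line.strip().split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"unexpected sample: {line!r}")
    return dict(zip(FIELDS, values))


def log_until_power_loss(path="gpu_log.csv", interval_ms=100):
    """Append samples to `path`, syncing each one to disk immediately."""
    proc = subprocess.Popen(
        build_nvidia_smi_command(interval_ms),
        stdout=subprocess.PIPE,
        text=True,
    )
    with open(path, "a", encoding="utf-8") as log_file:
        for line in proc.stdout:
            parse_sample(line)          # sanity-check the line format
            log_file.write(line)
            log_file.flush()
            os.fsync(log_file.fileno())  # make sure it reaches the disk
```

Running `log_until_power_loss()` in a separate terminal before starting a workflow would leave the last samples before the shutdown in the log. Note that sampling every 100 ms reports averaged values and will likely miss millisecond-scale transients, so a clean-looking log would not rule out a power spike.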

Current status

I currently keep the GPUs swapped because that setup is usable for content production:

old PC + RTX 5090 is stable

new PC + RTX 5080 is much more stable than before

But obviously I bought the new system to use the RTX 5090 in it.

What would you test next?

What could still explain hard shutdowns only with the RTX 5090 in the new system, even after CPU, RAM, mainboard, PSU and PSU cables were replaced?

Any ideas are appreciated.

u/Kage605 — 6 days ago