High-end RTX 5090 PC hard shuts down in AI/CUDA workloads, but GPU swap makes both systems mostly stable. Need help narrowing this down.
Hi everyone,
I’m trying to diagnose a very strange hard shutdown issue on a new high-end PC. I use the system mainly for AI workloads and content production with ComfyUI, especially img2vid, txt2img, upscaling and video workflows.
System A – New / Problem System
CPU: AMD Ryzen 9 9950X3D
GPU originally: RTX 5090 32 GB
RAM: 64 GB DDR5-6000 CL30
Mainboard originally: MSI MPG X870E Carbon WiFi
SSD: Samsung 990 Pro 2 TB
PSU originally: be quiet! Dark Power 14 Titanium 1200 W
OS: Windows
Cooling: 360 mm AIO
System B – Older PC
CPU: Intel i7-14700K
GPU originally: RTX 5080
RAM: 32 GB
PSU: be quiet! Straight Power 12 1000 W Platinum
Same ComfyUI workloads also run on this system
Original problem
With the RTX 5090 in the new PC, the system randomly hard shut down during ComfyUI / CUDA / AI workloads.
By hard shutdown I mean:
no bluescreen
no freeze first
no error message
instant power-off
PC can be powered on again normally afterward via the case power button
It mostly happened during AI/video workloads, not during gaming. The system could pass OCCT and 3DMark and run games mostly without problems, but ComfyUI could still shut it down.
First repair
The PC was sent back for repair. The following parts were replaced:
CPU
mainboard
RAM
After that, the original RAM training / boot issues seemed fixed. First boot was much faster and normal.
However, after more testing, the hard shutdowns came back with the RTX 5090 in the new PC.
With the GPU power limit reduced to around 69%, it became more stable but still occasionally hard shut down. Some workflows still triggered the shutdown almost 100% of the time.
Additional tests I did
I tested a lot:
clean NVIDIA driver reinstall with DDU
fresh ComfyUI installation
different ComfyUI versions / workflows
GPU reseated multiple times
GPU power connector reseated and checked multiple times
different wall socket
GPU power limit reduced
core and memory underclock tested
RAM tested individually
PSU OC/single-rail mode tested
The strange part: normal benchmarks and games could run fine, but AI/CUDA workloads triggered hard shutdowns.
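Since the shutdown leaves no error message, one thing that might help is logging GPU telemetry to disk right up to the moment of power-off. Below is a minimal sketch of that idea, not something I have validated on this system. It assumes nvidia-smi is on the PATH; the file name gpu_log.csv and the helper names parse_sample and log_gpu are made up for illustration. Note that nvidia-smi samples far too slowly to see millisecond power spikes, so this only shows the state shortly before the shutdown.

```python
import csv
import os
import subprocess

# Standard nvidia-smi --query-gpu properties.
FIELDS = ["timestamp", "power.draw", "temperature.gpu", "clocks.sm",
          "utilization.gpu", "memory.used"]


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in next(csv.reader([line]))]
    return dict(zip(FIELDS, values))


def log_gpu(path="gpu_log.csv", interval_ms=200):
    """Stream samples to disk, syncing each row so it survives a power cut."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits",
           "-lms", str(interval_ms)]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(path, "a", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS)
        for line in proc.stdout:
            writer.writerow(parse_sample(line))
            out.flush()
            os.fsync(out.fileno())  # force the row onto disk immediately
```

Calling log_gpu() in a second terminal while ComfyUI runs should leave the last few samples (power draw, temperature, clocks, VRAM use) in the file after a hard shutdown.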
GPU swap test
To narrow it down, I swapped only the GPUs between the two PCs.
Important detail:
I only swapped the graphics cards.
I did not swap PSUs.
I did not swap PSU cables.
Each PC kept its own PSU and own GPU power cable.
After the swap:
New PC + RTX 5080
Much more stable than before
Img2Vid and txt2img workloads that previously caused hard shutdowns now mostly run fine
No regular hard shutdown behavior like before
However, even with the RTX 5080, the new PC sometimes runs into OOM / memory-related errors after a few videos in some ComfyUI workflows. It does not hard shut down like before, but it is still not as smooth as expected.
Old PC + RTX 5090
Runs stable
Same ComfyUI workloads run fine
No hard shutdowns so far
This older PC can handle very large queues, sometimes 100+ jobs, without the same kind of problems
This made me think the RTX 5090 itself is probably not defective, and the new PC is not generally unstable either. The issue seems to be mainly the combination of:
new PC + RTX 5090 + its power delivery / platform behavior / AI workload transients
PSU swap test
A replacement PSU was tested in the new PC.
Important detail:
The PSU was replaced.
The PSU cables were also replaced with the new original cables.
The GPU power cable (12VHPWR / 12V-2x6) was checked multiple times, both by me and after repair/testing.
Both PSU-side 12VHPWR / 12V-2x6 ports were tested.
Results:
New PSU + RTX 5080 in new PC: stable
New PSU + RTX 5090 in new PC: still hard shutdowns
It seemed slightly better at first with the new cable / second PSU port, but eventually it still shut down
Even with a 69% power limit and offsets of -210 MHz core and -30 MHz memory, it still hard shut down
So now I have:
CPU replaced
RAM replaced
mainboard replaced
PSU replaced
new PSU cables tested
both PSU-side 12VHPWR/12V-2x6 ports tested
RTX 5080 works much better in the new PC
RTX 5090 runs stable in the old PC
RTX 5090 in the new PC still causes hard shutdowns
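To make the isolation logic explicit, the results above can be written down as a small table and searched for the smallest combination of factors that shows up only in failing runs. This is just a sketch of that reasoning; the function name explaining_combos and the record layout are my own, and "shutdown: False" for the RTX 5080 rows means "no hard shutdowns like before", not perfectly smooth. The PSU is recorded but left out of the search, because the old PC's PSU was never tested in the new PC.

```python
from itertools import combinations

# Outcomes as reported in this post; "new"/"old" refer to the two PCs.
results = [
    {"pc": "new", "gpu": "RTX 5090", "psu": "original",    "shutdown": True},
    {"pc": "new", "gpu": "RTX 5090", "psu": "replacement", "shutdown": True},
    {"pc": "new", "gpu": "RTX 5080", "psu": "original",    "shutdown": False},
    {"pc": "new", "gpu": "RTX 5080", "psu": "replacement", "shutdown": False},
    {"pc": "old", "gpu": "RTX 5090", "psu": "old PC PSU",  "shutdown": False},
]


def explaining_combos(runs, factors):
    """Return the smallest factor-value combinations seen in failing runs only."""
    for size in range(1, len(factors) + 1):
        found = set()
        for keys in combinations(factors, size):
            failing = {tuple((k, r[k]) for k in keys)
                       for r in runs if r["shutdown"]}
            passing = {tuple((k, r[k]) for k in keys)
                       for r in runs if not r["shutdown"]}
            found |= failing - passing
        if found:
            return sorted(found)
    return []


print(explaining_combos(results, ["pc", "gpu"]))
```

No single factor separates the failing runs from the passing ones; only the pair (new PC, RTX 5090) does, which is why I suspect an interaction rather than one bad part.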
What I’m trying to figure out
At this point, I’m confused.
Could this still be:
A GPU issue that only appears on one platform?
PCIe / platform issue, even though the mainboard was replaced?
Some weird compatibility issue between RTX 5090 and this platform / PSU / motherboard combo?
A transient power issue that standard benchmarks do not reproduce?
Something related to CUDA / AI workloads causing behavior that FurMark / 3DMark / OCCT do not catch?
Some other component or configuration in the new PC that was not replaced?
The repair/testing used standard tools like FurMark, Prime95, HDDScan and BurnInTest. Those passed, but they do not seem to reproduce the same kind of fast CUDA/AI load changes that ComfyUI creates.
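If fast load changes are the trigger, a synthetic test that alternates short bursts of heavy CUDA work with idle gaps might reproduce it without ComfyUI. The following is a rough sketch under that assumption, not a confirmed reproducer. It assumes PyTorch with CUDA support is installed; run_bursts and make_cuda_work are hypothetical helper names, and the matrix size and timings are guesses that would need tuning.

```python
import time


def run_bursts(work, cycles=200, busy_s=0.5, idle_s=0.5):
    """Call work() repeatedly during each burst, then idle between bursts."""
    calls = 0
    for _ in range(cycles):
        deadline = time.perf_counter() + busy_s
        while time.perf_counter() < deadline:
            work()
            calls += 1
        time.sleep(idle_s)  # let the GPU drop to idle before the next step
    return calls


def make_cuda_work(size=8192):
    """Build a GPU matmul workload. Assumes PyTorch with CUDA is installed."""
    import torch  # imported lazily so run_bursts itself needs only stdlib

    a = torch.randn(size, size, device="cuda", dtype=torch.float16)
    b = torch.randn(size, size, device="cuda", dtype=torch.float16)

    def work():
        torch.matmul(a, b)
        torch.cuda.synchronize()  # wait so the busy window is real GPU time

    return work
```

Usage would be run_bursts(make_cuda_work()), ideally with different busy_s / idle_s values, to see whether repeated idle-to-full-load steps alone can trigger the power-off on the new PC with the RTX 5090.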
Current status
I currently keep the GPUs swapped because that setup is usable for content production:
old PC + RTX 5090 is stable
new PC + RTX 5080 is much more stable than before
But obviously I bought the new system to use the RTX 5090 in it.
What would you test next?
What could still explain hard shutdowns only with the RTX 5090 in the new system, even after CPU, RAM, mainboard, PSU and PSU cables were replaced?
Any ideas are appreciated.