u/ChanceAvocado

[EDIT1]: I added some other tests to the list below. I also ordered a better PSU (ROG Strix 1000 Platinum) and will install it as soon as it is delivered tomorrow. I will keep you all posted. Thanks!

I usually build PCs for all my friends. I put together their gaming builds and plenty of other configurations for people who ask me, so I can say that I have quite a bit of experience with this.

But this time... this is the first time I have encountered a problem this insidious. After all the testing and online research I did, my last hope is to write here and seek help from true experts.

This time I had to put together a build for a friend who does architectural renderings.

Complete build specs:

| Component | Specification |
|---|---|
| CPU | Ryzen 9 9950X3D |
| Motherboard | GIGABYTE X870 AORUS ELITE WIFI7 and now ROG STRIX B850-F GAMING WIFI (read below) |
| RAM | Crucial Pro DDR5 RAM 64GB Kit (2x32GB) 6000MHz CL40 |
| Graphics Card | PNY GeForce RTX 5080 |
| Cooler | MSI MAG CORELIQUID A13 360 |
| SSD | Samsung 9100 PRO 2TB SSD |
| Case | NZXT H6 Flow |
| Power Supply | MSI MAG A1000GLS PCIE5 1000W |
| Operating System | Windows 11 Pro |

After assembling the build, installing Windows and updating the drivers, the PC crashes every 10-120 minutes.
When the problem occurs, the video output dies first, then after 15-20 seconds the whole PC reboots itself.
The PC crashes randomly, both during stress tests and at idle.

I analyzed the Windows crash dump files with WhoCrashed (https://www.resplendence.com/whocrashed) and the errors always point to:

  1. NVIDIA Windows Kernel Mode Driver - nvlddmkm.sys (nvlddmkm+1a22440) - VIDEO_TDR_ERROR
  2. DirectX Graphics Kernel - dxgkrnl.sys (dxgkrnl!NtGdiDdDDISetProcessSchedulingPriorityClass+0x1A3D) - VIDEO_TDR_ERROR
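For anyone comparing many dumps, a small tally can show whether the same module and bugcheck keep coming up. The following is only a rough Python sketch: it assumes the report lines are pasted as plain text in the same shape as the two entries above (`description - module.sys (location) - BUGCHECK`), and the names `LINE_RE` and `tally_crashes` are made up for this illustration, not part of WhoCrashed.

```python
import re
from collections import Counter

# Assumed line shape (copied from the two entries above):
#   "<description> - <module>.sys (<location>) - <BUGCHECK_NAME>"
LINE_RE = re.compile(
    r"-\s+(?P<module>[\w.]+\.sys)\s+\((?P<location>[^)]*)\)\s+-\s+(?P<bugcheck>[A-Z0-9_]+)\s*$"
)

def tally_crashes(lines):
    """Count (module, bugcheck) pairs found in pasted report lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not fit the assumed shape are skipped
            counts[(match.group("module"), match.group("bugcheck"))] += 1
    return counts
```

Calling `tally_crashes(report_text.splitlines())` returns a `Counter`; its `most_common()` method lists the most frequent module and bugcheck pairs first.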

After some reboots, I also got the automatic NVIDIA popup saying that NVIDIA software had encountered an error and asking me to approve sending diagnostic data.

Tests performed:

  1. Installation of latest Nvidia Game Ready driver (v596.21) - problem persists
  2. Installation of latest Nvidia Studio driver (v595.79) - problem persists
  3. Complete video driver removal with Display Driver Uninstaller (DDU) in Windows Safe Mode - problem persists
  4. Default Windows Update driver - problem persists
  5. Installation of a previous Nvidia Studio driver (v591.44) - problem persists
  6. Integrated GPU disabled in BIOS - problem persists
  7. NVIDIA GPU forced to PCIe4 in BIOS - problem persists
  8. Full user access control on nvlddmkm.sys in Windows advanced settings (I found this on a forum) - problem persists
  9. Multiple benchmark and stress test sessions on the whole system with FurMark, OCCT, yCruncher and UserBenchmark with the Nvidia GPU in the system - problem persists
  10. Multiple benchmark and stress test sessions on the whole system with FurMark, OCCT, yCruncher and UserBenchmark with only the AMD integrated GPU in the system (Nvidia GPU removed) - PROBLEM SOLVED (10 hours of testing without any crash)
  11. At this point I said to myself: "YEAH, I found the issue, I have a faulty RTX 5080!" So I decided to replace the RTX 5080 with another one and.... the problem RETURNED. (RMA of the RTX 5080 - problem persists)
  12. At this point I still didn't know whether the problem was hardware or software, so I decided to replace the motherboard with a different model from a safer brand, in order to exclude driver compatibility/stability issues that could happen in a build with components this new on the market. I got a ROG STRIX B850-F GAMING WIFI but.... the problem persists.
  13. Installation of Windows 10 - problem persists
  14. Changed the RTX 5080 power cable from the 12-pin 12VHPWR to 3x 8-pin with the original NVIDIA adapter included in the GPU box - problem persists
  15. Undervolt of RTX 5080 to 80% - problem persists
  16. Changed video output from HDMI to DP cable - problem persists
  17. Updated the BIOS and even rolled back to a previous version - problem persists
  18. NEW TESTS START HERE: Disabled EXPO/XMP in BIOS - problem persists
  19. Disabled ASPM in BIOS - problem persists
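Since the crashes happen both under load and at idle, it may help to record what the GPU was doing in the seconds before the screen goes black. Below is a minimal Python sketch that wraps `nvidia-smi --query-gpu` in loop mode and appends each sample to a CSV file. The choice of fields, the helper names (`build_command`, `parse_row`, `log_telemetry`) and the output path are assumptions for illustration; nothing like this was part of the tests listed above.

```python
import csv
import os
import subprocess

# Fields understood by `nvidia-smi --query-gpu`; this selection is an assumption.
FIELDS = ["timestamp", "power.draw", "temperature.gpu", "clocks.gr", "utilization.gpu"]

def build_command(interval_s=1):
    """Build the nvidia-smi command line that prints one CSV row per interval."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]

def parse_row(line):
    """Split one CSV row from nvidia-smi into a field-name -> string mapping."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))

def log_telemetry(path, interval_s=1):
    """Append samples to `path` until nvidia-smi exits or the machine goes down."""
    with open(path, "a", newline="") as out, subprocess.Popen(
        build_command(interval_s), stdout=subprocess.PIPE, text=True
    ) as proc:
        writer = csv.writer(out)
        for line in proc.stdout:
            row = parse_row(line)
            writer.writerow([row.get(name, "") for name in FIELDS])
            # Flush and fsync so recent samples are more likely to be on disk
            # when the PC hard-reboots.
            out.flush()
            os.fsync(out.fileno())
```

Running `log_telemetry("gpu_log.csv")` in a terminal and reading the last rows after the next crash would show whether power draw, temperature or clocks looked unusual right before the video output died.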

At this point, I am still not sure whether the problem is hardware or software. I think I can exclude the CPU, RAM and SSD, since they are still in the system when I remove the RTX 5080 and the problem goes away.
I can't exclude the PSU, since it could have a stability problem on its power output, so I think I should try replacing the power supply.
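The elimination reasoning above can be written down as a small Python sketch. It keeps only the parts present in every crashing configuration, then clears parts that also ran in a stable configuration, except parts flagged as load-sensitive (the PSU here, since it is loaded differently without the RTX 5080). The part labels, the `remaining_suspects` helper and the assumption that the stable iGPU-only run happened on the Gigabyte board are all illustrative; software causes (driver, BIOS, Windows) are not modeled at all.

```python
def remaining_suspects(runs, load_sensitive=frozenset()):
    """Return parts present in every crashing run and not cleared by a stable run."""
    crashing = [set(run["parts"]) for run in runs if run["crashed"]]
    stable = [set(run["parts"]) for run in runs if not run["crashed"]]
    if not crashing:
        return set()
    common = set.intersection(*crashing)
    # A part seen in a stable run is cleared, unless its behaviour depends on
    # load that the stable run did not reproduce.
    cleared = set().union(*stable) - set(load_sensitive)
    return common - cleared

# Hypothetical encoding of the tests described in this post.
BASE = {"cpu", "ram", "ssd", "psu"}
RUNS = [
    {"parts": BASE | {"board_gigabyte", "gpu_first"}, "crashed": True},
    {"parts": BASE | {"board_gigabyte", "igpu_only"}, "crashed": False},
    {"parts": BASE | {"board_gigabyte", "gpu_second"}, "crashed": True},
    {"parts": BASE | {"board_asus", "gpu_second"}, "crashed": True},
]
```

With this encoding, `remaining_suspects(RUNS, load_sensitive={"psu"})` leaves only the PSU, which matches the conclusion above. Without the load-sensitive flag the result is empty, which would point away from any single part and toward a software or compatibility cause.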

The last and scariest possibility is that the problem is a software compatibility/stability issue that will only be solved by a future BIOS/driver/Windows update. But I can't deliver this PC and tell the "customer" that it crashes and he needs to wait for a future update.... If this is the case, I don't know what the solution could be.

u/ChanceAvocado — 24 days ago