[EDIT1]: I added some more tests to the list below. I also ordered a better PSU (ROG Strix 1000 Platinum) and will install it as soon as it is delivered tomorrow. I will keep you all posted. Thanks!
I usually build PCs for all my friends. I put together all their gaming builds and plenty of other configurations for people who ask me, so I can say I have quite a bit of experience with this.
But this is the first time I have run into a problem this insidious, and after all the testing and online research I did, my last hope is to write here and seek the help of true experts.
This time I had to put together a build for a friend who does architectural renderings.
Complete build specs:
| Component | Specification |
|---|---|
| CPU | Ryzen 9 9950X3D |
| Motherboard | GIGABYTE X870 AORUS ELITE WIFI7, later replaced with ROG STRIX B850-F GAMING WIFI (see the test list below) |
| RAM | Crucial Pro DDR5 RAM 64GB Kit (2x32GB) 6000MHz CL40 |
| Graphics Card | PNY GeForce RTX 5080 |
| Cooler | MSI MAG CORELIQUID A13 360 |
| SSD | Samsung 9100 PRO 2TB SSD |
| Case | NZXT H6 Flow |
| Power Supply | MSI MAG A1000GLS PCIE5 1000W |
| Operating System | Windows 11 Pro |
After assembling the build, installing Windows and updating the drivers, the PC crashes every 10-120 minutes.
When the problem occurs, the video output dies first, then 15-20 seconds later the whole PC reboots itself.
The crashes are random and happen both during stress tests and at idle.
I analyzed the Windows crash dump files with WhoCrashed (https://www.resplendence.com/whocrashed) and the errors always point to:
- NVIDIA Windows Kernel Mode Driver - nvlddmkm.sys (nvlddmkm+1a22440) - VIDEO_TDR_ERROR
- DirectX Graphics Kernel - dxgkrnl.sys (dxgkrnl!NtGdiDdDDISetProcessSchedulingPriorityClass+0x1A3D) - VIDEO_TDR_ERROR
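For anyone who wants to tally their own dumps the same way, here is a minimal Python sketch. It assumes each report line has the shape `description - module.sys (location) - BUGCHECK`, like the two entries above. The names `CRASH_LINE`, `parse_crash_line` and `tally_modules` are my own and are not part of WhoCrashed.

```python
import re
from collections import Counter

# Assumed line shape, modeled on the two entries quoted above:
# "<description> - <module>.sys (<location>) - <BUGCHECK>"
CRASH_LINE = re.compile(
    r"^(?P<description>.+?) - (?P<module>\S+\.sys) "
    r"\((?P<location>.+)\) - (?P<bugcheck>[A-Z0-9_]+)$"
)


def parse_crash_line(line):
    """Split one report line into its fields, or return None if it does not match."""
    text = line.strip().lstrip("- ").strip()  # drop the leading list dash, if any
    match = CRASH_LINE.match(text)
    return match.groupdict() if match else None


def tally_modules(lines):
    """Count how often each (module, bugcheck) pair shows up across report lines."""
    counts = Counter()
    for line in lines:
        fields = parse_crash_line(line)
        if fields is not None:
            counts[(fields["module"], fields["bugcheck"])] += 1
    return counts


report = [
    "- NVIDIA Windows Kernel Mode Driver - nvlddmkm.sys (nvlddmkm+1a22440) - VIDEO_TDR_ERROR",
    "- DirectX Graphics Kernel - dxgkrnl.sys (dxgkrnl!NtGdiDdDDISetProcessSchedulingPriorityClass+0x1A3D) - VIDEO_TDR_ERROR",
]

for (module, bugcheck), count in tally_modules(report).items():
    print(f"{module}: {bugcheck} x{count}")
```

Both entries land on the same bugcheck, which is why I keep looking at the graphics path rather than at a random driver.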
After some reboots, I also got the automatic NVIDIA popup saying that the NVIDIA software had encountered an error and asking me to approve sending diagnostic data.
Tests performed:
- Installation of the latest NVIDIA Game Ready driver (v596.21) - problem persists
- Installation of the latest NVIDIA Studio driver (v595.79) - problem persists
- Complete video driver removal with Display Driver Uninstaller (DDU) in Windows Safe Mode - problem persists
- Default Windows Update driver - problem persists
- Installation of a previous NVIDIA Studio driver (v591.44) - problem persists
- Integrated GPU disabled in BIOS - problem persists
- NVIDIA GPU forced to PCIe 4.0 in BIOS - problem persists
- Gave users full control over nvlddmkm.sys in the Windows advanced security settings (I found this on a forum) - problem persists
- Multiple benchmark and stress test sessions on the whole system with FurMark, OCCT, y-cruncher and UserBenchmark, with the NVIDIA GPU installed - problem persists
- Multiple benchmark and stress test sessions on the whole system with FurMark, OCCT, y-cruncher and UserBenchmark, with only the AMD integrated GPU (NVIDIA GPU removed) - PROBLEM SOLVED (10 hours of testing without any crash)
- At this point I said to myself: "YEAH, I found the issue, I have a faulty RTX 5080!" So I replaced the RTX 5080 with another one and... the problem RETURNED. (RMA of the RTX 5080 - problem persists)
- At this point I still didn't know whether the problem was hardware or software, so I decided to replace the motherboard with a different model from a safer brand, to rule out the driver compatibility/stability issues that can happen in a build with components this new to the market. I got a ROG STRIX B850-F GAMING WIFI but... the problem persists.
- Installation of Windows 10 - problem persists
- Switched the RTX 5080 power cable from the 12-pin 12VHPWR cable to 3x 8-pin with the original NVIDIA adapter included in the GPU box - problem persists
- Undervolt of RTX 5080 to 80% - problem persists
- Changed the video output from HDMI to a DP cable - problem persists
- Updated the BIOS and even rolled back to a previous version - problem persists
- NEW TESTS START HERE: Disabled EXPO/XMP in BIOS - problem persists
- Disabled ASPM in BIOS - problem persists
At this point, I am still not sure whether the problem is hardware or software. I think I can rule out the CPU, RAM and SSD, since they stay in the system when I remove the RTX 5080 and the problem goes away.
I can't rule out the PSU, since its power output could be unstable once the RTX 5080 is drawing from it, so I think I should try replacing the power supply.
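My elimination reasoning can be written down as a small Python sketch. It is only a model of the logic above, with loud assumptions: parts are labeled by role (so "gpu" stands for the whole RTX 5080 path, meaning card, driver and software), the motherboard is left out because I swapped it separately, and the `gpu_dependent` set is my own way of saying "this part is only really stressed when the discrete GPU is installed". `Run` and `remaining_suspects` are made-up names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    parts: frozenset  # role labels of the parts installed during the test run
    crashed: bool     # True if the system crashed during the run


def remaining_suspects(runs, gpu_dependent=frozenset()):
    """Return the parts seen in crashing runs that no stable run clears.

    A part is cleared when it was installed during a stable run, unless it is
    listed in gpu_dependent (assumption: a stable run without the discrete GPU
    says little about such a part, e.g. a PSU that only misbehaves under GPU load).
    """
    crashing = set().union(*(run.parts for run in runs if run.crashed))
    stable = set().union(*(run.parts for run in runs if not run.crashed))
    cleared = stable - set(gpu_dependent)
    return crashing - cleared


runs = [
    # RTX 5080 installed: crashes every 10-120 minutes
    Run(frozenset({"cpu", "ram", "ssd", "psu", "gpu"}), crashed=True),
    # RTX 5080 removed, integrated GPU only: 10 hours without a crash
    Run(frozenset({"cpu", "ram", "ssd", "psu"}), crashed=False),
]

print(sorted(remaining_suspects(runs, gpu_dependent={"psu"})))  # ['gpu', 'psu']
```

If I drop the `gpu_dependent` assumption, only the GPU path is left, which is exactly why I am not comfortable clearing the PSU on the basis of the iGPU-only run alone.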
The last and scariest possibility is that this is a software compatibility/stability issue that will only be fixed by a future BIOS/driver/Windows update, but I can't hand this PC over to the "customer" and tell him that it crashes and he needs to wait for a future update... If that is the case, I don't know what the solution could be.