u/Inovermyheadflailing

r/pchelp

Looking for help identifying likely hardware problem - WHEA 18 and Kernel 41 errors

I'll try to keep this as clear as possible, and apologies for any glaring errors: most of my PC knowledge is quite basic, other than the deep dive I have done over the last few days trying to fix this problem! Thank you in advance for any help!

I have an Alienware PC which I bought 5 years ago as a bit of a gift to myself, and I have never fiddled with it or overclocked it or done anything else fancy. The specs are:

CPU: AMD Ryzen 7 5800X 8-core processor
GPU: AMD Radeon RX 5700 XT
BIOS: Alienware 2.8.0 27/03/2024
OS: Windows 11

I have started to get restarts, almost always while playing games. Some games trigger it much sooner than others: e.g. it only lasts a minute or so in Stranded: Alien Dawn, whereas it managed over an hour in Stellaris. Each restart shows up as a Kernel-Power (Event ID 41) critical error, along with a WHEA (Event ID 18) Cache Hierarchy Error that reports a different Processor APIC ID each time. Usually the screen goes fully green for a moment before restarting.
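In case it helps anyone check the same pattern in their own logs, here is a rough Python sketch of how the two event types could be tallied once they have been copied out of Event Viewer. Everything in it is an assumption for illustration: the record layout (`time`, `event_id`, `apic_id`), the made-up sample rows, the 120-second pairing window, and the guess that two neighbouring APIC IDs share one physical core when SMT is on. It does not read the Windows event log itself.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up sample rows (NOT real log data), shaped like entries copied by hand
# from Event Viewer. The field names are assumptions for this sketch.
events = [
    {"time": "2024-05-01 20:14:03", "event_id": 18, "apic_id": 4},
    {"time": "2024-05-01 20:14:09", "event_id": 41, "apic_id": None},
    {"time": "2024-05-02 19:02:40", "event_id": 18, "apic_id": 11},
    {"time": "2024-05-02 19:02:47", "event_id": 41, "apic_id": None},
    {"time": "2024-05-03 21:30:00", "event_id": 41, "apic_id": None},
]


def parse_time(text):
    """Parse the timestamp format used in the sample rows above."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def summarise(events, window_seconds=120):
    """Count resets, resets logged close to a WHEA 18 entry, and APIC IDs seen."""
    whea = [e for e in events if e["event_id"] == 18]
    resets = [e for e in events if e["event_id"] == 41]
    window = timedelta(seconds=window_seconds)

    # A reset counts as "paired" if any WHEA 18 entry is logged within the window.
    paired = 0
    for reset in resets:
        reset_time = parse_time(reset["time"])
        if any(abs(parse_time(w["time"]) - reset_time) <= window for w in whea):
            paired += 1

    apic_counts = Counter(w["apic_id"] for w in whea)
    # Assumption: with SMT enabled, APIC IDs 2n and 2n+1 belong to core n.
    core_counts = Counter(w["apic_id"] // 2 for w in whea)

    return {
        "resets": len(resets),
        "resets_with_whea": paired,
        "apic_counts": dict(apic_counts),
        "core_counts": dict(core_counts),
    }


summary = summarise(events)
print(summary)
```

If the errors really do land on a different APIC ID each time, `core_counts` would be spread across many cores rather than piling up on one or two.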

Googling suggests a faulty CPU, GPU or PSU, which is a pretty comprehensive list of almost all the hardware, so I have tried my best to test each of these:

I ran CoreCycler overnight to stress test the CPU, and it all came back fine: no errors and no restarts or crashes.

I ran FurMark to stress test the GPU, and it all looks OK as far as I can tell: again no crashes, and the temperature climbs to 85 degrees and stays there for the duration of the test. It's worth noting that this is quite a bit hotter than I see on the AMD overlay while attempting to play games, where 50-70 is the usual range. Annoyingly, AMD Adrenalin recently seems to have lost the ability to show CPU temperature, but that was also not going crazy in games.

For the PSU I'm struggling a bit more to find any reasonable test, but it's worth noting that the failures are restarts only, and there is no break in power delivery.

I have tried playing around with the BIOS settings, including increasing the voltage, manually setting the memory frequency, etc., none of which changed much. I have since reset everything to the defaults, as this is a bit above my capability.

GPU voltage tends to sit at 0.725 V or 0.75 V for much of the time (e.g. now, when I'm only using the web browser), and then oscillates between that level and 1.1875 V when gaming. Is that an unusual pattern?
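To describe that pattern a bit more precisely, a small Python sketch like the one below could summarise a list of logged voltage readings. The sample list is made up from the three levels mentioned above (it is not a real log), and the 1.0 V cut-off between "low" and "high" is an arbitrary assumption, not an AMD specification.

```python
# Made-up readings in volts, built only from the levels mentioned in the post.
samples = [0.725, 0.75, 0.725, 1.1875, 0.75, 1.1875, 1.1875, 0.725]


def summarise_voltage(samples, threshold=1.0):
    """Split readings into low/high states and count switches between them.

    threshold is an arbitrary cut-off chosen for this sketch.
    """
    low_count = sum(1 for v in samples if v < threshold)
    high_count = len(samples) - low_count

    # A switch is any pair of consecutive readings on opposite sides of the threshold.
    switches = sum(
        1
        for previous, current in zip(samples, samples[1:])
        if (previous < threshold) != (current < threshold)
    )

    return {
        "low_count": low_count,
        "high_count": high_count,
        "switches": switches,
        "min": min(samples),
        "max": max(samples),
    }


voltage_summary = summarise_voltage(samples)
print(voltage_summary)
```

A high number of switches relative to the number of readings would just confirm the oscillation; it does not by itself say whether the behaviour is abnormal.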

I opened the box up and, to my very amateur eyes, nothing looked disconnected or damaged. I carefully pulled out some of the more accessible connections, cleaned and reconnected them, but nothing changed.

I'm at a bit of a loss, but also not that keen to just cough up for a new PC when it might be something fixable.

reddit.com
u/Inovermyheadflailing — 8 days ago