So, I have a problem that I’m now certain I cannot solve on my own: my GPU crashes every five days (give or take 20 minutes). The sound keeps playing for a short time, then everything stops. I have tried a lot, but since I have only been using Linux for nine months, I’m running out of ideas. What is normal and what is not? I don’t know.
Let’s start at the beginning. My entire computer (down to the power cord) is brand new; I assembled it myself last month:
CPU: Intel Core i7-12700K
AIO: Corsair iCUE H150i RGB ELITE
MB: ASUS STRIX Z690-G (DDR5)
RAM: DDR5 Corsair Vengeance 32 GB
SSD1: Samsung 980 PRO MZ-V8P500BW (500 GB)
SSD2: Samsung 870 EVO (1 TB)
SSD3: Crucial CT1000MX500SSD1 (1 TB)
Power: Seasonic PRIME TX-1000 80Plus Titanium
Case: be quiet! Silent Base 802 Black
GPU: PowerColor AMD Radeon RX 7900 XT Hellhound
Displays: I have two, both over HDMI, one of them through an adapter to DisplayPort (this worked perfectly on my previous Nvidia setup).
Buying this GPU was probably the worst idea I could have had, knowing that it is far too new and therefore not fully supported yet. It is impossible to tell whether my trouble comes from the driver or from the hardware. This machine was built to replace another one whose GPU failed some time ago, which means I do not have a spare GPU to test with…
My system is Fedora 37, on kernel 220.127.116.11.rc5, which I updated today from 18.104.22.168.rc2, with LLVM 15.0.6 and Mesa 22.3.3. I ran a memtest (no errors) and a quick VRAM test (no errors). At first, the crash occurred every day (every 24 hours, give or take 20 minutes), with the same message I still get now:
[amdgpu]] *ERROR* ring gfx_0.0.0 timeout
I’m sure this means that yes, my GPU did crash. I also noticed regular core dumps with Steam, specifically from gldriverquery, and experienced at least one complete failure triggered by Unreal Engine 5.1 (just by keeping it open and idle, with the same gfx timeout message). I was able to push the interval between crashes to 5 days by completely disabling ASPM.
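For anyone who wants to check their own logs for the same message, a minimal Python sketch could look like the one below. It is only an illustration: the function names are made up for this post, the pattern is based solely on the single log line quoted above, and reading the previous boot with `journalctl -k -b -1` assumes the journal is kept across reboots.

```python
import re
import subprocess

# Matches lines such as "[amdgpu]] *ERROR* ring gfx_0.0.0 timeout".
# The pattern is an assumption derived from that one line only.
TIMEOUT_RE = re.compile(r"\*ERROR\* ring (\S+) timeout")


def find_ring_timeouts(log_text):
    """Return the name of every ring reported as timed out, in log order."""
    rings = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            rings.append(match.group(1))
    return rings


def previous_boot_kernel_log():
    """Return kernel messages from the boot before the current one.

    Assumes systemd-journald with persistent storage; returns an empty
    string if journalctl is not installed.
    """
    try:
        result = subprocess.run(
            ["journalctl", "-k", "-b", "-1", "--no-pager"],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout
```

After a crash and a reboot, calling `find_ring_timeouts(previous_boot_kernel_log())` should list the rings that timed out during the previous session, which makes it easier to see whether it is always gfx_0.0.0.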
I’m able to play Cyberpunk 2077 for 4 hours, but it then crashes in a rather unusual way: the process cannot be killed by any means short of a hard reboot… and yet I can keep using the rest of my system normally. Cool, but concerning.
If I change the performance mode of my GPU in CoreCtrl from auto to high, I get heavy stuttering and glitches in YouTube videos. It also looks like the voltage is constantly going down. I’m not sure whether that means something important or not.
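To log the performance level and voltage without relying on the CoreCtrl display, something like the following Python sketch might work. It assumes the amdgpu driver exposes `power_dpm_force_performance_level` and a hwmon `in0_input` file (GPU voltage in millivolts) under the card's sysfs device directory; the function name is hypothetical and the card index in the path (`card0` here) may differ from one machine to another.

```python
from pathlib import Path


def read_gpu_state(device_dir):
    """Read the forced performance level and GPU voltage (mV) from sysfs.

    device_dir is assumed to be something like /sys/class/drm/card0/device.
    Missing files are reported as None instead of raising.
    """
    device_dir = Path(device_dir)

    level_file = device_dir / "power_dpm_force_performance_level"
    level = level_file.read_text().strip() if level_file.exists() else None

    voltage_mv = None
    # The hwmon index (hwmon0, hwmon3, ...) is assigned at boot, so glob for it.
    for candidate in sorted(device_dir.glob("hwmon/hwmon*/in0_input")):
        voltage_mv = int(candidate.read_text().strip())
        break

    return {"level": level, "voltage_mv": voltage_mv}
```

Calling `read_gpu_state("/sys/class/drm/card0/device")` every few seconds and writing the result to a file would show whether the voltage really keeps dropping once the level is set to high.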
I don’t know what to do now. What is your advice? Please remember, I’m still a beginner with Linux.