GPU hang: how could I investigate/fix this?

I’m using Fedora Silverblue 35. Sometimes when I run a graphics-intensive game, the whole system freezes. Sometimes the monitors also go blank, other times they don’t. It looks like a GPU driver crash.

  • It doesn’t happen every time: often, I can play for hours without it happening.
  • When it does happen, it usually happens quickly after I start the game.
  • It doesn’t seem to be related to the game: it happens equally in all graphically-intensive ones. Simpler, 2D games don’t trigger it.
  • I’m using an AMD Vega 64 GPU with whatever drivers Fedora uses by default.
  • The hardware is probably not at fault: this never happens in Windows on the same machine.

Does anyone have any tips on how I could go about investigating what happened (maybe after reboot), or any suggestions on what I could try?

Thank you!

Switch from Wayland to Xorg; it should solve this issue.
If that doesn’t solve it, post the output of inxi -G as preformatted text (</>).

Thanks!

Switching to Xorg only replaces one problem with another: yeah, it no longer freezes, but apparently the drivers then forget to spin up my GPU fans, for some reason, so after a minute or so the GPU’s thermal protection kicks in, it shuts down and the fans go to max.

It’s strange because under wayland it’s all good. I’d expect fan control to not be affected by wayland/X, but alas.

Anyway, it wouldn’t be worth it as a workaround for me: I can’t stand Xorg – the stutter and dropped frames are way too obvious in GNOME. Buttery smooth Wayland is pretty much the main reason I switched away from Windows. And if I have to log out to play a game, I might as well boot all the way to Windows instead, it’s not that much slower :stuck_out_tongue:

inxi -G says:

Graphics:
  Device-1: AMD Vega 10 XL/XT [Radeon RX Vega 56/64] driver: amdgpu v: kernel
  Display: wayland server: X.Org 1.21.1.4 driver: loaded: amdgpu,ati
    unloaded: fbdev,modesetting,vesa resolution: 1: 3840x2160~60Hz
    2: 3840x2160~60Hz 3: 1920x1080~50Hz 4: 1440x2560~60Hz

Is the freeze on wayland a known issue? Any bug I could track?

Reinstalling the GPU driver should resolve that fan spin-up issue.

Hi, if Gnome Settings → Power offers a Performance mode, would you like to try it? I believe the problem is related to automatic selection of the power management mode. If Performance is not available, you could try to avoid Balanced.

Whether or not the Performance setting above resolves the issue, please report it at bugzilla.redhat.com against the kernel package.

martin: how would I reinstall the GPU driver on Silverblue?

Syaifur: I only have Balanced and Power Saver. How would I avoid Balanced? Switch to Power Saver? Do I need to install something else? I will file a bug, just need to try the things requested in the bug template (try with rawhide kernel, get kernel logs, etc.)

sudo rpm-ostree upgrade
rpm-ostree install amd- latest driver _x64, then see whether that solves the issue.

Please upgrade your system first as suggested by @frankjunior above.

If the problem is still present, you could try adding amdgpu.dpm=0 to the kernel boot parameters with rpm-ostree kargs --append='amdgpu.dpm=0'. If this doesn’t work, remove it with rpm-ostree kargs --delete='amdgpu.dpm=0'.

Source: the “drm/amdgpu AMDgpu driver” page in The Linux Kernel documentation; look for the dpm parameter.
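A kargs change on Silverblue only applies to the next boot, so it can be worth confirming after the reboot that the argument actually reached the kernel. Here is a minimal sketch; has_karg is a hypothetical helper name, not part of rpm-ostree, and it only does an exact whole-word match on the command line text you give it.

```shell
# Succeeds if the kernel command line in $2 contains the exact argument $1.
# has_karg is a made-up helper for illustration, not an rpm-ostree command.
has_karg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage after rebooting into the new deployment:
#   has_karg amdgpu.dpm=0 "$(cat /proc/cmdline)" && echo "dpm is disabled for this boot"
```

Running rpm-ostree kargs with no options should also print the arguments of the current deployment, which is a quick way to compare against /proc/cmdline.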

@frankjunior: I have no idea what package you mean :slight_smile: I can’t see anything that looks like amd-* in fedora packages. Do you mean xorg-x11-drv-amdgpu-21.0.0-1.fc35.x86_64?

@oprizal: yeah, I update every day :slight_smile: that being said, today’s update included both a new kernel and an updated mesa, so let me first see if it reproduces. As I said, it’s unreliable so I’ll have to use it for a bit to see. If it freezes again, I’ll try the amdgpu.dpm setting, and also try to get some kernel logs as recommended in the bug template for kernel :slight_smile:

@oprizal: you’re probably right that it’s dpm-related, here are the log messages from the crash (which, if anyone is curious, I obtained by running journalctl -b -1 on the next boot):

Feb 13 02:56:46 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:46 fedora kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x2830001, error code: 0x0
Feb 13 02:56:48 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:50 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:50 fedora kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x2830001, error code: 0x0
Feb 13 02:56:53 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:55 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:55 fedora kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x2830001, error code: 0x0
Feb 13 02:56:57 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:59 fedora kernel: amdgpu: [powerplay] No response from smu
Feb 13 02:56:59 fedora kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x2830001, error code: 0x0
Feb 13 02:57:04 fedora kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
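For anyone else chasing a similar freeze, a full journalctl -b -1 dump is long, so a small filter helps to get at lines like the ones above. This is only a sketch: gpu_errors is a hypothetical helper name, and the pattern is simply an assumption based on what this particular log looks like.

```shell
# Keep only amdgpu/drm messages from a log stream on stdin.
# gpu_errors is a made-up name; the pattern is a guess based on the log above.
gpu_errors() {
    grep -E 'amdgpu|\[drm'
}

# Usage on the next boot after a freeze
# (-k: kernel messages only, -b -1: previous boot):
#   journalctl -k -b -1 --no-pager | gpu_errors
```

If the filter prints nothing, the freeze may have left no trace in the journal, or the driver may log under a different prefix on your hardware.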

I’ll try the kernel parameter and report back.

To close the loop here: after 3 days I can confidently say that disabling dynamic power management for amdgpu (amdgpu.dpm=0) solves the problem.

I have filed a bug with the kernel here: 2054948 – AMD Vega64 GPU Freeze when dynamic power management on (I still need to collect some info, such as trying it out on the latest rawhide kernel).

Thank you everyone for your help!


Is there another way to set a parameter for the amdgpu module besides rpm-ostree kargs --append='amdgpu.dpm=0'? I tried that and it didn’t work, because my system isn’t booted using ostree. I also tried setting options amdgpu dpm=0 in /etc/modprobe.d/amdgpu.conf as per this doc: 31.6.2. Loading a Customized Module - Persistent Changes Red Hat Enterprise Linux 6 | Red Hat Customer Portal, but the setting doesn’t take effect when I reboot.

@jonh If you’re using regular Fedora Workstation or any of the spins, you could use:

# Adding
sudo grubby --args="amdgpu.dpm=0" --update-kernel=ALL

# To remove it
sudo grubby --remove-args="amdgpu.dpm=0" --update-kernel=ALL
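Whichever way the parameter is set, you can check after a reboot whether the loaded module picked it up by reading its value back from sysfs. A minimal sketch, assuming the parameter is exposed at /sys/module/amdgpu/parameters/dpm; amdgpu_dpm_value is a hypothetical helper name, and the optional path argument is only there so it can be pointed at a different file.

```shell
# Print the dpm value the loaded amdgpu module reports.
# amdgpu_dpm_value is a made-up helper; the default path is an assumption
# about where the kernel exposes the parameter.
amdgpu_dpm_value() {
    param_file="${1:-/sys/module/amdgpu/parameters/dpm}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "cannot read $param_file" >&2
        return 1
    fi
}

# Usage after rebooting:
#   amdgpu_dpm_value
```

Going by the kernel documentation mentioned earlier in the thread, 0 should mean disabled, 1 enabled, and -1 the automatic default, so anything other than 0 here suggests the setting did not take effect.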