AMDGPU crash every 5 days

Ok, I’ll do that. Thank you for your time!

Here is a list of open AMD GPU issues. Do any of them sound like what you are seeing?

Unfortunately, no! I did see something very close once, but I tried everything they proposed with no result. I came here precisely because I was totally out of ideas. Now I think it could be hardware related, but that is impossible to confirm without another machine to swap the card into…

I will check with my reseller to see whether they can do something for me about that. If it’s not hardware, it cannot be anything else but a driver issue.

I just saw this post; it looks like they have some of the same error messages as you. It links to a bug report.

That seems strange. Everyone should be able to access that directory.

# ls /sys/kernel/debug/dri
1  128

# ls /sys/kernel/debug/dri/1
clients  framebuffer  internal_clients  state                virtio-gpu-host-visible-mm  Virtual-1
crtc-0   gem_names    name              virtio-gpu-features  virtio-gpu-irq-fence

# ls -ld /sys/kernel/debug/dri/1
drwxr-xr-x. 4 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1

# ls -ld /sys/kernel/debug/dri/1/*
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/clients
drwxr-xr-x. 2 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/crtc-0
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/framebuffer
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/gem_names
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/internal_clients
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/name
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/state
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-features
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-host-visible-mm
-r--r--r--. 1 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/virtio-gpu-irq-fence
drwxr-xr-x. 2 root root 0 Jan 24 13:34 /sys/kernel/debug/dri/1/Virtual-1

Have you checked the permissions?
Is this potentially caused by the repeated crashes?

It is known that a crash during a write (to either memory or a drive) can corrupt data. After a crash, a full power-off before rebooting is a good way to minimize the chance of corrupt data remaining in memory. The in-memory structures in RAM (/sys, /proc, /dev, /run, among others, and including the GPU memory) may all retain corrupt data across a reboot after a crash unless a full power-off is performed.

How do I check? I mean, I access the directory with sudo, so… I should have all permissions to get in.
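One possible way to check, as a rough sketch: walk up from the directory and list the mode and owner of every parent, since a restrictive parent blocks access even when the directory itself is world-readable. The helper name `show_perms` is made up for illustration, and the path in the usage note is taken from the listing above, not necessarily the directory that fails for you.

```shell
# Hypothetical helper (not a standard command): print permissions and
# ownership for a path and for each of its parent directories up to /.
show_perms() {
    path=$1
    while :; do
        # -d lists the directory entry itself instead of its contents
        ls -ld "$path"
        # stop at the filesystem root (or at "." for a relative path)
        case $path in
            /|.) break ;;
        esac
        path=$(dirname "$path")
    done
}
```

From a root shell, something like `show_perms /sys/kernel/debug/dri/1` prints one line per level, so a parent without the `x` bit for your user stands out.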

I will try that now and see if something changes.

Yes, that’s precisely the thread I’m talking about. I tried a lot of things (albeit not every single feature mask), to no avail. But I admit I did not have a reliable way to crash my computer at that time… now I have. I will look more closely, but it could take a long time.

I can confirm that it still crashes even after a complete shutdown. Note that I can access the file from my terminal, just not with GNOME Files (instant crash), and I can access the elements inside it from a terminal too if needed… but any attempt to use GNOME Files results in an instant (GPU!) crash.

I do note that this time, ‘Problem Reporting’ was triggered and now shows me this message:

The kernel log indicates that hardware errors were detected.
This is most likely not a software problem.

Hardware related, you think?
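To see which kernel messages might be behind that report, one option is to filter the kernel log for GPU and hardware-error lines. This is only a sketch: the function name `filter_hw_errors` is hypothetical, and the search pattern is an assumption about what such messages usually contain, not a list of what Problem Reporting actually matches.

```shell
# Hypothetical filter: keep only kernel log lines that mention the amdgpu
# driver, machine-check events (mce), or "hardware error" (case-insensitive).
# The pattern is an assumption and may need adjusting for your log.
filter_hw_errors() {
    grep -iE 'amdgpu|mce|hardware error'
}
```

After a crash and reboot, `journalctl -k -b -1 | filter_hw_errors` reads the kernel messages of the previous boot (`-k` selects kernel messages, `-b -1` the boot before the current one), provided the journal is stored persistently.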