F37 Lockup on new AMD 7950x System

I just built a new system with an AMD 7950X processor and an ASUS ProArt X670E-CREATOR WIFI motherboard. I am currently using the iGPU built into the 7950X for video and X11. It’s running a fresh install of F37 with current updates. I have been running into problems with the system locking up so completely that the mouse won’t even move. Most of the time I’ve been using Firefox, but the last time all I was doing was looking at the Problem Reporting app, and Firefox wasn’t running. Every time, Problem Reporting tells me “The backtrace does not contain enough meaningful function frames to be reported”.

The last time I was able to ssh in from another system and capture some logs while it was hung. From journalctl -b, first I see tons of these errors towards the end:

Nov 27 14:57:06 overkill kernel: amdgpu 0000:6b:00.0: amdgpu: 00000000e67f5dba pin failed
Nov 27 14:57:06 overkill kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Nov 27 14:57:06 overkill kernel: amdgpu 0000:6b:00.0: amdgpu: 00000000e67f5dba pin failed
Nov 27 14:57:06 overkill kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): flip queue failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): Page flip failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (EE) AMDGPU(0): present flip failed
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): flip queue failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): Page flip failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (EE) AMDGPU(0): present flip failed
Nov 27 14:57:06 overkill kernel: amdgpu 0000:6b:00.0: amdgpu: 00000000e67f5dba pin failed
Nov 27 14:57:06 overkill kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): flip queue failed: Cannot allocate memory 
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): Page flip failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (EE) AMDGPU(0): present flip failed
Nov 27 14:57:06 overkill kernel: amdgpu 0000:6b:00.0: amdgpu: 00000000e67f5dba pin failed
Nov 27 14:57:06 overkill kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Nov 27 14:57:06 overkill kernel: amdgpu 0000:6b:00.0: amdgpu: 00000000e67f5dba pin failed
Nov 27 14:57:06 overkill kernel: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): flip queue failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (WW) AMDGPU(0): Page flip failed: Cannot allocate memory
Nov 27 14:57:06 overkill /usr/libexec/gdm-x-session[3315]: (EE) AMDGPU(0): present flip failed
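
For reference, the -12 in the "Failed to pin framebuffer" lines is a negative errno value, which is why the X server reports the same failure as "Cannot allocate memory". A minimal Python sketch that maps such a code back to its symbolic name; the helper name describe_kernel_error is made up for illustration and is not part of any tool mentioned in this thread:

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a kernel-style return code such as -12 to its errno name and message."""
    number = abs(code)
    # errno.errorcode maps numeric values to names like "ENOMEM".
    name = errno.errorcode.get(number, "UNKNOWN")
    # os.strerror returns the C library's message for that number.
    return f"{name}: {os.strerror(number)}"


print(describe_kernel_error(-12))
```

On this system the name for 12 is ENOMEM, so the kernel and Xorg messages are two views of the same out-of-memory condition while pinning the framebuffer.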

Right after those errors, at the end of the log, there is a NULL pointer dereference and a kernel oops:

Nov 27 14:57:06 overkill kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
Nov 27 14:57:06 overkill kernel: #PF: supervisor read access in kernel mode
Nov 27 14:57:06 overkill kernel: #PF: error_code(0x0000) - not-present page
Nov 27 14:57:06 overkill kernel: PGD 0 P4D 0 
Nov 27 14:57:06 overkill kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Nov 27 14:57:06 overkill kernel: CPU: 10 PID: 3315 Comm: Xorg Not tainted 6.0.9-300.fc37.x86_64 #1
Nov 27 14:57:06 overkill kernel: Hardware name: ASUS System Product Name/ProArt X670E-CREATOR WIFI, BIOS 0805 11/04/2022
Nov 27 14:57:06 overkill kernel: RIP: 0010:ttm_resource_del_bulk_move+0x74/0xe0 [ttm]
Nov 27 14:57:06 overkill kernel: Code: 74 69 48 39 ef 74 56 4c 8d 67 38 4c 8d 75 38 4c 89 e7 e8 bf f8 14 ed 84 c0 74 0f 48 8b 53 38 48 8b 43 40 48 89 42 08 48 89 10 <4c> 8b 6d 38 4c 89 f6 4c 89 e7 4c 89 ea e8 4a f8 14 ed 84 c0 74 10
Nov 27 14:57:06 overkill kernel: RSP: 0018:ffffb4ec11bab8d0 EFLAGS: 00010202
Nov 27 14:57:06 overkill kernel: RAX: ffff9969cc725b10 RBX: ffff996a58eea900 RCX: ffff9969cc725b10
Nov 27 14:57:06 overkill kernel: RDX: ffff996a58eea3f8 RSI: ffff996a58eea938 RDI: ffff996a58eea938
Nov 27 14:57:06 overkill kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Nov 27 14:57:06 overkill kernel: R10: 0000000000000001 R11: 000000000000000a R12: ffff996a58eea938
Nov 27 14:57:06 overkill kernel: R13: 0000000000000001 R14: 0000000000000038 R15: ffff9969cc725ad0
Nov 27 14:57:06 overkill kernel: FS:  00007f5bead56a80(0000) GS:ffff9970f8480000(0000) knlGS:0000000000000000
Nov 27 14:57:06 overkill kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 14:57:06 overkill kernel: CR2: 0000000000000038 CR3: 0000000167c0e000 CR4: 0000000000750ee0
Nov 27 14:57:06 overkill kernel: PKRU: 55555554
Nov 27 14:57:06 overkill kernel: Call Trace:
Nov 27 14:57:06 overkill kernel:  <TASK>
Nov 27 14:57:06 overkill kernel:  ttm_resource_free+0x31/0x80 [ttm]
Nov 27 14:57:06 overkill kernel:  ttm_bo_move_accel_cleanup+0x9b/0x240 [ttm]
Nov 27 14:57:06 overkill kernel:  amdgpu_bo_move+0x60b/0x6a0 [amdgpu]
Nov 27 14:57:06 overkill kernel:  ttm_bo_handle_move_mem+0xa8/0x170 [ttm]
Nov 27 14:57:06 overkill kernel:  ttm_mem_evict_first+0x204/0x490 [ttm]
Nov 27 14:57:06 overkill kernel:  ttm_bo_mem_space+0x1c9/0x220 [ttm]
Nov 27 14:57:06 overkill kernel:  ttm_bo_validate+0x9f/0x110 [ttm]
Nov 27 14:57:06 overkill kernel:  amdgpu_bo_pin_restricted+0x117/0x270 [amdgpu]
Nov 27 14:57:06 overkill kernel:  ? preempt_count_add+0x6a/0xa0
Nov 27 14:57:06 overkill kernel:  dm_plane_helper_prepare_fb+0x97/0x290 [amdgpu]
Nov 27 14:57:06 overkill kernel:  drm_atomic_helper_prepare_planes+0x74/0x160
Nov 27 14:57:06 overkill kernel:  drm_atomic_helper_commit+0x72/0x140
Nov 27 14:57:06 overkill kernel:  drm_atomic_helper_page_flip+0x5f/0xd0
Nov 27 14:57:06 overkill kernel:  drm_mode_page_flip_ioctl+0x580/0x5d0
Nov 27 14:57:06 overkill kernel:  ? drm_mode_cursor2_ioctl+0x10/0x10
Nov 27 14:57:06 overkill kernel:  drm_ioctl_kernel+0xa9/0x150
Nov 27 14:57:06 overkill kernel:  drm_ioctl+0x22d/0x410
Nov 27 14:57:06 overkill kernel:  ? drm_mode_cursor2_ioctl+0x10/0x10
Nov 27 14:57:06 overkill kernel:  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
Nov 27 14:57:06 overkill kernel:  __x64_sys_ioctl+0x90/0xd0
Nov 27 14:57:06 overkill kernel:  do_syscall_64+0x5b/0x80
Nov 27 14:57:06 overkill kernel:  ? do_syscall_64+0x67/0x80
Nov 27 14:57:06 overkill kernel:  ? do_syscall_64+0x67/0x80
Nov 27 14:57:06 overkill kernel:  ? do_syscall_64+0x67/0x80
Nov 27 14:57:06 overkill kernel:  ? do_syscall_64+0x67/0x80
Nov 27 14:57:06 overkill kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 27 14:57:06 overkill kernel: RIP: 0033:0x7f5beb366baf
Nov 27 14:57:06 overkill kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
Nov 27 14:57:06 overkill kernel: RSP: 002b:00007ffdb6940120 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 27 14:57:06 overkill kernel: RAX: ffffffffffffffda RBX: 000055ee9dbc3990 RCX: 00007f5beb366baf
Nov 27 14:57:06 overkill kernel: RDX: 00007ffdb69401b0 RSI: 00000000c01864b0 RDI: 000000000000000e
Nov 27 14:57:06 overkill kernel: RBP: 00007ffdb69401b0 R08: 0000000000000559 R09: 0000000000000004
Nov 27 14:57:06 overkill kernel: R10: 000055ee9d97f010 R11: 0000000000000246 R12: 00000000c01864b0
Nov 27 14:57:06 overkill kernel: R13: 000000000000000e R14: 0000000000000001 R15: 000055ee9dbbe9a0
Nov 27 14:57:06 overkill kernel:  </TASK>
Nov 27 14:57:06 overkill kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc ip_set nf_tables nfnetlink qrtr bnep sunrpc vfat fat intel_rapl_msr intel_rapl_common mt7921e snd_hda_codec_realtek mt7921_common snd_hda_codec_generic snd_hda_codec_hdmi mt76_connac_lib mt76 snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec mac80211 edac_mce_amd btusb snd_hda_core btrtl btbcm snd_hwdep libarc4 kvm_amd btintel snd_seq btmtk snd_seq_device cfg80211 kvm bluetooth snd_pcm asus_nb_wmi eeepc_wmi asus_wmi snd_timer irqbypass ledtrig_audio sparse_keymap snd platform_profile joydev thunderbolt rapl intel_wmi_thunderbolt wmi_bmof pcspkr soundcore k10temp i2c_piix4 rfkill gpio_amdpt gpio_generic zram amdgpu hid_microsoft ff_memless drm_ttm_helper ttm iommu_v2
Nov 27 14:57:06 overkill kernel:  crct10dif_pclmul gpu_sched crc32_pclmul drm_buddy crc32c_intel nvme polyval_clmulni drm_display_helper polyval_generic atlantic nvme_core ccp ghash_clmulni_intel cec igc sp5100_tco macsec ucsi_acpi nvme_common wmi typec_ucsi video typec ip6_tables ip_tables fuse
Nov 27 14:57:06 overkill kernel: CR2: 0000000000000038
Nov 27 14:57:06 overkill kernel: ---[ end trace 0000000000000000 ]---
Nov 27 14:57:06 overkill kernel: RIP: 0010:ttm_resource_del_bulk_move+0x74/0xe0 [ttm]
Nov 27 14:57:06 overkill kernel: Code: 74 69 48 39 ef 74 56 4c 8d 67 38 4c 8d 75 38 4c 89 e7 e8 bf f8 14 ed 84 c0 74 0f 48 8b 53 38 48 8b 43 40 48 89 42 08 48 89 10 <4c> 8b 6d 38 4c 89 f6 4c 89 e7 4c 89 ea e8 4a f8 14 ed 84 c0 74 10
Nov 27 14:57:06 overkill kernel: RSP: 0018:ffffb4ec11bab8d0 EFLAGS: 00010202
Nov 27 14:57:06 overkill kernel: RAX: ffff9969cc725b10 RBX: ffff996a58eea900 RCX: ffff9969cc725b10
Nov 27 14:57:06 overkill kernel: RDX: ffff996a58eea3f8 RSI: ffff996a58eea938 RDI: ffff996a58eea938
Nov 27 14:57:06 overkill kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Nov 27 14:57:06 overkill kernel: R10: 0000000000000001 R11: 000000000000000a R12: ffff996a58eea938
Nov 27 14:57:06 overkill kernel: R13: 0000000000000001 R14: 0000000000000038 R15: ffff9969cc725ad0
Nov 27 14:57:06 overkill kernel: FS:  00007f5bead56a80(0000) GS:ffff9970f8480000(0000) knlGS:0000000000000000
Nov 27 14:57:06 overkill kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 14:57:06 overkill kernel: CR2: 0000000000000038 CR3: 0000000167c0e000 CR4: 0000000000750ee0
Nov 27 14:57:06 overkill kernel: PKRU: 55555554
Nov 27 14:57:06 overkill kernel: note: Xorg[3315] exited with preempt_count 1
Nov 27 14:57:07 overkill abrt-dump-journal-oops[1276]: abrt-dump-journal-oops: Found oopses: 1
Nov 27 14:57:07 overkill abrt-dump-journal-oops[1276]: abrt-dump-journal-oops: Creating problem directories
Nov 27 14:57:07 overkill abrt-server[4956]: Can't find a meaningful backtrace for hashing in '.'
Nov 27 14:57:07 overkill abrt-server[4956]: Preserving oops '.' because DropNotReportableOopses is 'no'
Nov 27 14:57:08 overkill abrt-notification[4976]: System encountered a non-fatal error in ??()
Nov 27 14:57:08 overkill abrt-dump-journal-oops[1276]: Reported 1 kernel oopses to Abrt
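
The oops above can be cross-checked by hand. In the "Code:" line the faulting instruction is marked with angle brackets, and its bytes are 4c 8b 6d 38. A rough Python sketch, assuming only that one x86-64 encoding (REX.W + 8B with an 8-bit displacement) needs decoding; the function decode_mov_disp8 is a hypothetical helper, not a general disassembler:

```python
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_disp8(code: bytes):
    """Decode a 4-byte 'mov r64, [base + disp8]' instruction.

    Returns (destination register, base register, displacement).
    Any other encoding raises ValueError.
    """
    if len(code) != 4:
        raise ValueError("expected exactly 4 bytes")
    rex, opcode, modrm, disp = code
    # REX prefix with W=1 is 0100 1xxx; 0x8B is 'mov r64, r/m64'.
    if rex & 0xF8 != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W 8B instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # mod == 1 means an 8-bit displacement follows; rm == 4 would need a SIB byte.
    if mod != 1 or rm == 4:
        raise ValueError("unsupported addressing form")
    dest = REGS[reg | ((rex & 0x4) << 1)]  # REX.R extends the reg field
    base = REGS[rm | ((rex & 0x1) << 3)]   # REX.B extends the rm field
    if disp >= 0x80:                       # disp8 is sign-extended
        disp -= 0x100
    return dest, base, disp


dest, base, disp = decode_mov_disp8(bytes.fromhex("4c8b6d38"))
rbp = 0x0  # RBP value from the register dump in the oops
print(f"mov {dest}, [{base}+{disp:#x}] reads address {rbp + disp:#x}")
```

The decoded load is relative to RBP, which the register dump shows as 0, and the resulting address equals the CR2 value of 0x38 in the oops. That is consistent with ttm_resource_del_bulk_move following a NULL pointer to a field at offset 0x38, though which structure that is cannot be told from the log alone.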

Here are my system details:

System:
  Kernel: 6.0.9-300.fc37.x86_64 arch: x86_64 bits: 64 compiler: gcc v: 2.38-24.fc37
    Console: pty pts/0 Distro: Fedora release 37 (Thirty Seven)
Machine:
  Type: Desktop System: ASUS product: N/A v: N/A serial: <superuser required>
  Mobo: ASUSTeK model: ProArt X670E-CREATOR WIFI v: Rev 1.xx serial: <superuser required>
    UEFI: American Megatrends v: 0805 date: 11/04/2022
CPU:
  Info: 16-core model: AMD Ryzen 9 7950X bits: 64 type: MT MCP arch: Zen 4 rev: 2 cache:
    L1: 1024 KiB L2: 16 MiB L3: 64 MiB
  Speed (MHz): avg: 616 high: 2766 min/max: 400/5881 boost: enabled cores: 1: 400 2: 400 3: 400
    4: 400 5: 400 6: 400 7: 400 8: 2766 9: 400 10: 400 11: 400 12: 400 13: 400 14: 400 15: 400
    16: 400 17: 400 18: 400 19: 2765 20: 400 21: 400 22: 400 23: 400 24: 400 25: 400 26: 400
    27: 2583 28: 400 29: 400 30: 400 31: 400 32: 400 bogomips: 288011
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: AMD Raphael vendor: ASUSTeK driver: amdgpu v: kernel arch: RDNA-2 bus-ID: 6b:00.0
    temp: 39.0 C
  Display: server: X.org v: 1.20.14 with: Xwayland v: 22.1.5 driver: X: loaded: amdgpu
    unloaded: fbdev,modesetting,vesa dri: radeonsi gpu: amdgpu tty: 199x84 resolution: 1: 3840x2160
    2: 1920x1200 3: 1920x1200
  API: OpenGL Message: GL data unavailable in console. Try -G --display
Audio:
  Device-1: AMD Rembrandt Radeon High Definition Audio vendor: ASUSTeK driver: snd_hda_intel
    v: kernel bus-ID: 6b:00.1
  Device-2: AMD Family 17h/19h HD Audio vendor: ASUSTeK driver: snd_hda_intel v: kernel
    bus-ID: 6b:00.6
  Sound API: ALSA v: k6.0.9-300.fc37.x86_64 running: yes
  Sound Server-1: PulseAudio v: 16.1 running: no
  Sound Server-2: PipeWire v: 0.3.60 running: yes
Network:
  Device-1: MEDIATEK MT7922 802.11ax PCI Express Wireless Network Adapter vendor: Foxconn
    driver: N/A bus-ID: 07:00.0
  Device-2: Intel Ethernet I225-V vendor: ASUSTeK driver: igc v: kernel port: N/A
    bus-ID: 08:00.0
  IF: eno1 state: up speed: 1000 Mbps duplex: full mac: <filter>
  Device-3: Aquantia AQC113CS NBase-T/IEEE 802.3bz Ethernet [AQtion] vendor: ASUSTeK ProArt
    X570-CREATOR WIFI driver: atlantic v: kernel port: N/A bus-ID: 09:00.0
  IF: eno2 state: down mac: <filter>
  IF-ID-1: br0 state: up speed: 1000 Mbps duplex: unknown mac: <filter>
Bluetooth:
  Device-1: Foxconn / Hon Hai Wireless_Device type: USB driver: btusb v: 0.8 bus-ID: 3-6:2
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 2.73 TiB used: 40.51 GiB (1.4%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 980 PRO 2TB size: 1.82 TiB temp: 32.9 C
  ID-2: /dev/nvme1n1 vendor: Western Digital model: WD BLACK SN850X 1000GB size: 931.51 GiB
    temp: 28.9 C
Partition:
  ID-1: / size: 1.82 TiB used: 40.15 GiB (2.2%) fs: btrfs dev: /dev/nvme0n1p3
  ID-2: /boot size: 973.4 MiB used: 313.7 MiB (32.2%) fs: ext4 dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 46.5 MiB (7.8%) fs: vfat dev: /dev/nvme0n1p1
  ID-4: /home size: 1.82 TiB used: 40.15 GiB (2.2%) fs: btrfs dev: /dev/nvme0n1p3
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 42.4 C mobo: N/A gpu: amdgpu temp: 39.0 C
  Fan Speeds (RPM): N/A
Info:
  Processes: 486 Uptime: 24m Memory: 30.49 GiB used: 2.65 GiB (8.7%) Init: systemd
  target: graphical (5) Compilers: gcc: 12.2.1 clang: 15.0.4 Packages: 24 note: see --rpm
  Shell: Bash v: 5.2.9 inxi: 3.3.23

In one of the problem reports, I see this, not sure if it’s related:

watchdog: BUG: soft lockup - CPU#4 stuck for 134s! [plymouthd:18176]

Can anyone tell me what I can do to get to the bottom of this?

Thanks.

If it’s helpful…this might not be exactly related to your issue, but it does look like there are some bugs in the amdgpu driver version included in Linux kernel 6.0, targeted for patching in kernel 6.1, that cause almost the exact same initial symptom.

Looks like it’s related to how the driver handles VRAM management with integrated graphics - maybe because it’s a somewhat newer chip and the way its integrated graphics memory works isn’t fully accounted for by the driver yet? Total speculation there, though:

John, thanks for the pointers to those issues. Those certainly sound relevant. It’s interesting that the second one has to do with suspend. The reason I was using X11/Xorg when I hit this lockup is that under Wayland, after waking from suspend, the entire screen was black. Sounds like they are related. I’ve been digging through them trying to figure out which kernel would have the fixes. So far I’ve found 4 issues that have been fixed, but apparently there are still at least 3 open ones.

For reference, here is another related suspend issue:

Then I found these two and they sound exactly like what I’m seeing:

Since the 3 I listed are still open, I guess I’ll have to monitor them waiting for fixes.

I’m having this same issue, but I’m not using the amdgpu. I could never get it to work beyond incredibly low resolutions, so I added an NV GPU despite not needing high-end graphics. The amdgpu is NOT disabled in BIOS, but nothing is connected to it and as far as I know it’s not in use.

System:
F37
AMD7950X
6.0.8-300.fc37.x86_64
4 DIMMS, non XMP/EXPO mode just 4800 downclocked to 3600 because of AM5
PBO undervolt of -10 (stress tested up to -20 before stability issues)
Capped PBO to 185w (max temps are about 85c under full load)

Hopefully 6.1 fixes this, but wanted to chime in.

I just wanted to follow up for others looking at this. I’ve been running home built RCs of 6.1 and have not had a lockup in 2 weeks. So I’d say 6.1 fixes this (for me).

Again, I’m not actually using the built-in AMD GPU, but just having it enabled in BIOS caused this lockup in pre 6.1. Case closed for me, yay!
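
For anyone wanting to check whether a machine is still on a pre-6.1 kernel, here is a small Python sketch. The 6.1 threshold comes only from the reports in this thread, not from a confirmed changelog entry, and the function names are made up for illustration:

```python
import re


def kernel_version(release: str):
    """Extract (major, minor) from a release string like '6.0.9-300.fc37.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def predates(release: str, fixed=(6, 1)) -> bool:
    """Return True if the release is older than the assumed fixed version."""
    return kernel_version(release) < fixed


# Release string taken from the first post in this thread.
print(predates("6.0.9-300.fc37.x86_64"))
```

On a live system the release string can be obtained with platform.release() from the Python standard library, which reports the same value as uname -r.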

I just had this happen on F37 running Linux 6.0.18 on a laptop using a Ryzen 7 PRO 4750U with integrated Radeon Graphics. Same freeze symptoms (mouse won’t move) and flip errors in the journal.

It’s good to hear that this may have been resolved in 6.1. I tried upgrading to it a few weeks ago, but it unfortunately had another issue that made my system unusable with a dock, so I went back to 6.0 for now. Apparently the “main” problem should be resolved, but people seem to have a lot of other closely related problems, judging from the long thread and linked issues, so I’ve been wary of updating (I need to actually get work done on this system!).

This may be related to a similar issue I had on AMD EPYC. I got these messages on Linux 5.15.0-48-generic:

AMD-Vi: Completion-Wait loop timed out
watchdog: BUG: soft lockup - CPU#7 stuck for 27s! [swapper/7:0]

googling it leads to this bug report:
https://support.lenovo.com/us/en/solutions/tt1512-thinksystem-server-with-amd-processor-running-linux-may-hang-or-crash-with-kernel-message-amd-vi-completion-wait-loop-timed-out

TL;DR: fixed in kernel 6.4; workaround: iommu=pt

So after editing /etc/default/grub:

GRUB_CMDLINE_LINUX="$GRUB_CMDLINE_LINUX iommu=pt"
update-grub
reboot

the issue did not reappear.
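
To confirm after the reboot that the workaround is actually in effect, the running kernel's command line can be inspected. A minimal Python sketch; has_kernel_param is a hypothetical helper and the sample command line below is invented for illustration:

```python
from typing import Optional


def has_kernel_param(cmdline: str, name: str, value: Optional[str] = None) -> bool:
    """Check whether a kernel command line contains name, or name=value if given."""
    for token in cmdline.split():
        key, _, val = token.partition("=")
        if key == name and (value is None or val == value):
            return True
    return False


# Invented example of what /proc/cmdline might contain after the change.
sample = "BOOT_IMAGE=/vmlinuz-5.15.0-48-generic root=/dev/sda1 ro iommu=pt"
print(has_kernel_param(sample, "iommu", "pt"))
```

On a real system, pass the contents of /proc/cmdline instead of the sample string. If the parameter is missing there, the GRUB configuration was not regenerated or the wrong boot entry was used.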