Kernel 5.18.11 not working, what do I do?

I’ve been using Linux for years, but this is my first time taking a bad kernel update. I guess it was about time…

I have 5.18.9 and 5.18.10 installed on my system, both of which still work fine, but 5.18.11 will not boot. First attempt gave me several “soft lockup” errors:

Jul 19 22:57:45 black-cherry kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [kworker/9:1:285]
Jul 19 22:58:13 black-cherry kernel: watchdog: BUG: soft lockup - CPU#9 stuck for 48s! [kworker/9:1:285]
Jul 19 22:58:33 black-cherry kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 41s! [lscpu:5899]
# etc...

Next boot gave some interesting, different oops-es, namely page fault & null pointer dereference:

Jul 19 23:06:27 black-cherry kernel: ==================================================================
Jul 19 23:06:27 black-cherry kernel: BUG: KFENCE: invalid write in kfence_handle_page_fault+0x20/0x2a0
Jul 19 23:06:27 black-cherry kernel: Invalid write at 0x00000000a9d7cfcf:
Jul 19 23:06:27 black-cherry kernel:  kfence_handle_page_fault+0x20/0x2a0
Jul 19 23:06:27 black-cherry kernel:  page_fault_oops+0x5b/0x280
Jul 19 23:06:27 black-cherry kernel: 
Jul 19 23:06:27 black-cherry kernel: CPU: 13 PID: 4650 Comm: cc1 Not tainted 5.18.11-200.fc36.x86_64 #1
Jul 19 23:06:27 black-cherry kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO WIFI/Z390 AORUS PRO WIFI-CF, BIOS F11 10/15/2019
Jul 19 23:06:27 black-cherry kernel: ==================================================================
Jul 19 23:06:27 black-cherry kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jul 19 23:06:27 black-cherry kernel: #PF: supervisor read access in kernel mode
Jul 19 23:06:27 black-cherry kernel: #PF: error_code(0x0000) - not-present page
Jul 19 23:06:27 black-cherry kernel: PGD 12d166067 P4D 12d166067 PUD 125e91067 PMD 0 
Jul 19 23:06:27 black-cherry kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jul 19 23:06:27 black-cherry kernel: CPU: 13 PID: 4650 Comm: cc1 Tainted: G    B             5.18.11-200.fc36.x86_64 #1
Jul 19 23:06:27 black-cherry kernel: Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO WIFI/Z390 AORUS PRO WIFI-CF, BIOS F11 10/15/2019
Jul 19 23:06:27 black-cherry kernel: RIP: 0010:asm_exc_int3+0x8/0x40
Jul 19 23:06:27 black-cherry kernel: Code: 00 00 0f 01 ca 6a ff e8 66 09 00 00 48 89 c4 48 89 e7 e8 bb 1d f6 ff e9 76 0a 00 00 66 0f 1f 44 00 00 0f 01 ca 6a ff f6 44 24 <10> 03 75 18 ff 74 24 28 ff 74 24 28 ff 74 24 28 ff 74 24 28 ff 74
Jul 19 23:06:27 black-cherry kernel: RSP: 0000:fffffe0000301f18 EFLAGS: 00010446

Anyway, I have two important questions:

  1. How do I report this to the kernel maintainers? ABRT says there isn’t enough detail in the logs for either failure. And if I’m going to reach out via email, I have no idea who to approach—how do I know what module is causing these oops-es?
  2. How do I get 5.18.11 off my system? Mashing shift and arrowing thru Grub is getting old fast.
  • sudo dnf downgrade kernel unhelpfully offers to remove 5.18.9 (wrong one) and install 5.17.5 (way old). This leaves the broken version installed.
  • sudo dnf rm kernel*5.18.11* also removes akmod-nvidia and kmod-nvidia, which depend on 5.18.11 specifically for some reason. I do not want to uninstall my graphics drivers. Besides, I thought the whole point of akmods was that they worked with multiple kernel versions, no?

https://bugzilla.redhat.com/

Have you tried booting into an older kernel? Does that work?

I think he did

Yes, I can boot from an older kernel no problem. That’s how I managed to write the original post :slight_smile:

I’ll take a look at Red Hat’s bugzilla after work today. Thanks for the recommendation, that wasn’t the first place to come to mind. The crash is 100% reproducible so far, so it should be easy to re-test if they need more data, I hope.
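
For anyone gathering the same data for a report: a minimal sketch of pulling just the BUG and Oops header lines out of a failed boot's kernel log. It assumes systemd-journald kept the journal from the previous boot. The function name kernel_bug_headers is made up for illustration; the journalctl flags in the usage comment (-k, -b -1, --no-pager) are standard ones.

```shell
# Read journal text on stdin and print only the kernel "BUG:" / "Oops:"
# header lines, with the timestamp and hostname prefix stripped.
kernel_bug_headers() {
    awk '/kernel: .*(BUG:|Oops:)/ {
        i = index($0, "kernel: ")      # first occurrence of the journal tag
        print substr($0, i + 8)        # 8 = length of "kernel: "
    }'
}

# Usage, from a working kernel right after the failed boot:
#   journalctl -k -b -1 --no-pager > failed-boot.log
#   kernel_bug_headers < failed-boot.log
```

The full failed-boot.log is what a bug report would want attached; the filtered lines are only a quick summary of which errors showed up.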

Still looking for suggestions about the package management thing. One idea I had: I could force install nvidia without the latest kernel (breaking dependencies). Really, the akmod should work with another kernel version installed. But that sounds a little dicey, and I don’t want to make the situation worse, so I haven’t tried it yet.

You can boot to the 5.18.10 kernel, then run sudo dnf remove kernel*5.18.11* and accept the removals. You can also add the --noautoremove option to see whether it still wants to remove akmod-nvidia. The kmod-nvidia package built for that kernel must be removed, but akmod-nvidia should not be a forced removal.

Even if you remove the akmod-nvidia package, it is simple to reinstall. Kernel modules that are already loaded and running should not be affected by a remove and reinstall. I have done so several times.

edit:
The desire to remove the akmod-nvidia package may be related to removing the kernel-devel or the kernel-devel-matched package for the newest kernel. If so then allow it and reinstall akmod-nvidia while booted to the older kernel. My system only has the kernel-devel-matched package for the latest installed kernel so you might need to install that one for the older kernel as well.
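
A small sketch of how the matching devel packages for the older kernel could be named and checked. The helper devel_packages_for is hypothetical; it only builds package names from a kernel release string. Whether dnf can still install that exact version depends on the repository still carrying it, which is an assumption here.

```shell
# Print the kernel-devel package names that match one kernel release.
# $1: kernel release string as printed by `uname -r`
devel_packages_for() {
    printf 'kernel-devel-%s\nkernel-devel-matched-%s\n' "$1" "$1"
}

# Usage, while booted into the older kernel:
#   rpm -q $(devel_packages_for "$(uname -r)")            # installed already?
#   sudo dnf install $(devel_packages_for "$(uname -r)")  # if not
```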

I think this is an issue with NVIDIA systems, not with AMD GPUs, as I didn’t find anything wrong.

I have an nvidia GPU but I have no problem since it isn’t much better than the iGPU so I have disabled it completely.
Nvidia drivers on Linux can be annoying sometimes.

Edit: try to disable your nvidia drivers temporarily and check if something changes.

Performance of the GPU is often perception based, so what is ideal for me may seem only blase to you, and vice versa.

What does your post have to do with the problem posed by the OP?

Can you provide some info, such as your CPU and GPU?
inxi -Fxz
Then someone with the same hardware can confirm the issue.

Sorry, I was just stating that systems with nvidia hardware but no nvidia driver don’t have this problem. I was not suggesting that he stop using his nvidia GPU completely, but that he uninstall the driver for a short while and try again to see if something changes. I should have phrased it better, thanks for the remark :slight_smile:

Lots of activity here! Thanks for the attention everyone.

That’s what I expected too, but it doesn’t match dnf’s behavior:

Vanilla rm
$ sudo dnf remove kernel*5.18.11*
Dependencies resolved.
=======================================================================================================================
 Package                                 Arch       Version                 Repository                            Size
=======================================================================================================================
Removing:
 kernel                                  x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-core                             x86_64     5.18.11-200.fc36        @updates                              92 M
 kernel-debug                            x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-debug-core                       x86_64     5.18.11-200.fc36        @updates                              96 M
 kernel-debug-devel                      x86_64     5.18.11-200.fc36        @updates                              64 M
 kernel-debug-devel-matched              x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-debug-modules                    x86_64     5.18.11-200.fc36        @updates                              57 M
 kernel-devel                            x86_64     5.18.11-200.fc36        @updates                              63 M
 kernel-devel-matched                    x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-modules                          x86_64     5.18.11-200.fc36        @updates                              56 M
 kernel-modules-extra                    x86_64     5.18.11-200.fc36        @updates                             3.3 M
Removing dependent packages:
 akmod-nvidia                            x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver      23 k
 kmod-nvidia                             x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver       0  
 kmod-nvidia-5.18.11-200.fc36.x86_64     x86_64     3:515.57-1.fc36         @@commandline                         29 M
Removing unused dependencies:
 akmods                                  noarch     0.5.7-8.fc36            @updates                              47 k
 kmodtool                                noarch     1.1-3.fc36              @fedora                               28 k
 xorg-x11-drv-nvidia-kmodsrc             x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver      32 M

Transaction Summary
=======================================================================================================================
Remove  17 Packages

Freed space: 493 M
Is this ok [y/N]: 

With --noautoremove
$ sudo dnf remove --noautoremove kernel*5.18.11*
Dependencies resolved.
=======================================================================================================================
 Package                                 Arch       Version                 Repository                            Size
=======================================================================================================================
Removing:
 kernel                                  x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-core                             x86_64     5.18.11-200.fc36        @updates                              92 M
 kernel-debug                            x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-debug-core                       x86_64     5.18.11-200.fc36        @updates                              96 M
 kernel-debug-devel                      x86_64     5.18.11-200.fc36        @updates                              64 M
 kernel-debug-devel-matched              x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-debug-modules                    x86_64     5.18.11-200.fc36        @updates                              57 M
 kernel-devel                            x86_64     5.18.11-200.fc36        @updates                              63 M
 kernel-devel-matched                    x86_64     5.18.11-200.fc36        @updates                               0  
 kernel-modules                          x86_64     5.18.11-200.fc36        @updates                              56 M
 kernel-modules-extra                    x86_64     5.18.11-200.fc36        @updates                             3.3 M
Removing dependent packages:
 akmod-nvidia                            x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver      23 k
 akmods                                  noarch     0.5.7-8.fc36            @updates                              47 k
 kmod-nvidia                             x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver       0  
 kmod-nvidia-5.18.11-200.fc36.x86_64     x86_64     3:515.57-1.fc36         @@commandline                         29 M

Transaction Summary
=======================================================================================================================
Remove  15 Packages

Freed space: 461 M
Is this ok [y/N]: 

In either case, akmod-nvidia gets hit. It’s not an unused dependency, it’s a dependent package.
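
To compare the two transactions without reading the whole table, here is a rough sketch that prints only the names listed under "Removing dependent packages:". The function name dependent_removals is made up; --assumeno in the usage comment is a real dnf option that answers "no" to the prompt, so nothing gets removed.

```shell
# Read a dnf transaction table on stdin and print the package names
# listed in the "Removing dependent packages:" section.
dependent_removals() {
    awk '
        /^Removing dependent packages:/ { in_section = 1; next }
        /^[^ \t]/ || /^[ \t]*$/         { in_section = 0 }  # next heading or blank line
        in_section                      { print $1 }
    '
}

# Usage:
#   sudo dnf remove --assumeno "kernel*5.18.11*" | dependent_removals
#   sudo dnf remove --assumeno --noautoremove "kernel*5.18.11*" | dependent_removals
```

dnf may wrap long package names onto their own line when its output is not a terminal, so treat the result as a rough list rather than an exact one.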

I’m noticing now that the whole akmods package gets removed too. I don’t think that should be happening, and it’s probably the direct reason that akmod-nvidia gets removed. I think the issue might be that the kernel(-debug)-devel-matched dependency is provided by the 5.18.11 kernel specifically. Really, there should be plenty of kernels that can provide this, right?

$ dnf deplist akmods
Last metadata expiration check: 0:03:38 ago on Wed 20 Jul 2022 05:58:31 PM EDT.
package: akmods-0.5.7-7.fc36.noarch
  dependency: (kernel-debug-devel-matched if kernel-debug-core)
   provider: kernel-debug-devel-matched-5.18.11-200.fc36.x86_64
  dependency: (kernel-devel-matched if kernel-core)
   provider: kernel-devel-matched-5.18.11-200.fc36.x86_64
  dependency: (kernel-lpae-devel-matched if kernel-lpae-core)
  dependency: /bin/sh
   provider: bash-5.1.16-2.fc36.x86_64
  dependency: /usr/bin/bash
   provider: bash-5.1.16-2.fc36.x86_64
# etc...
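
A sketch for condensing that deplist output into one "dependency -> provider" line each, which makes dependencies without any listed provider easy to spot. The function name deplist_pairs is hypothetical. As far as I recall from the dnf documentation, deplist only shows the newest provider for each dependency unless --verbose is given, which may be why only the 5.18.11 packages appear above; treat that as an assumption to verify.

```shell
# Read `dnf deplist` output on stdin and print "dependency -> provider"
# pairs; dependencies with no provider line are marked "(none listed)".
deplist_pairs() {
    awk '
        /^[ \t]*dependency:/ {
            if (dep != "" && !seen) print dep " -> (none listed)"
            sub(/^[ \t]*dependency:[ \t]*/, "")
            dep = $0; seen = 0
            next
        }
        /^[ \t]*provider:/ {
            sub(/^[ \t]*provider:[ \t]*/, "")
            print dep " -> " $0
            seen = 1
        }
        END { if (dep != "" && !seen) print dep " -> (none listed)" }
    '
}

# Usage:
#   dnf deplist akmods | deplist_pairs
#   rpm -q --whatprovides kernel-devel-matched   # installed providers only
```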

I actually tried this last night. Uninstall everything 5.18.11, and nvidia gets removed along with it. Installing akmod-nvidia after that just pulls 5.18.11 back in as a dependency. Maybe if I rebooted after uninstalling? Kind of forgot that nouveau exists for a minute there…

I agree!

Good idea, I’ll give this a shot too. I don’t know if the nvidia kernel module actually caused the errors, but nvidia is too problematic not to check it.

Output
$ inxi -Fxz
System:
  Kernel: 5.18.10-200.fc36.x86_64 arch: x86_64 bits: 64 compiler: gcc
    v: 2.37-27.fc36 Desktop: GNOME v: 42.3.1
    Distro: Fedora release 36 (Thirty Six)
Machine:
  Type: Desktop System: Gigabyte product: Z390 AORUS PRO WIFI v: N/A
    serial: <superuser required>
  Mobo: Gigabyte model: Z390 AORUS PRO WIFI-CF v: x.x
    serial: <superuser required> UEFI: American Megatrends v: F11
    date: 10/15/2019
Battery:
  ID-1: hidpp_battery_0 charge: 95% condition: N/A volts: 4.1 min: N/A
    model: Logitech G703 LIGHTSPEED Wireless Gaming Mouse w/ HERO
    status: discharging
CPU:
  Info: 8-core model: Intel Core i9-9900K bits: 64 type: MT MCP
    arch: Coffee Lake rev: D cache: L1: 512 KiB L2: 2 MiB L3: 16 MiB
  Speed (MHz): avg: 5103 high: 5157 min/max: 800/5100 cores: 1: 5102
    2: 5095 3: 5101 4: 5080 5: 5157 6: 5104 7: 5101 8: 5100 9: 5128 10: 5117
    11: 5088 12: 5084 13: 5101 14: 5100 15: 5100 16: 5099 bogomips: 115200
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx
Graphics:
  Device-1: NVIDIA TU102 [GeForce RTX 2080 Ti Rev. A] vendor: eVga.com.
    driver: nvidia v: 515.57 arch: Turing bus-ID: 01:00.0
  Display: wayland server: X.Org v: 1.22.1.3 with: Xwayland v: 22.1.3
    compositor: gnome-shell driver: X: loaded: nvidia
    gpu: nvidia,nvidia-nvswitch resolution: 1: 2560x1440~144Hz
    2: 1080x1920~60Hz
  OpenGL: renderer: NVIDIA GeForce RTX 2080 Ti/PCIe/SSE2
    v: 4.6.0 NVIDIA 515.57 direct render: Yes
Audio:
  Device-1: Intel Cannon Lake PCH cAVS vendor: Gigabyte driver: snd_hda_intel
    bus-ID: 1-13.3:8 v: kernel bus-ID: 00:1f.3
  Device-2: NVIDIA TU102 High Definition Audio vendor: eVga.com.
    driver: snd_hda_intel v: kernel bus-ID: 01:00.1
  Device-3: C-Media CM6631A Audio Processor type: USB
    driver: hid-generic,snd-usb-audio,usbhid
  Sound Server-1: ALSA v: k5.18.10-200.fc36.x86_64 running: yes
  Sound Server-2: PulseAudio v: 15.0 running: no
  Sound Server-3: PipeWire v: 0.3.55 running: yes
Network:
  Device-1: Intel Cannon Lake PCH CNVi WiFi driver: iwlwifi v: kernel
    bus-ID: 00:14.3
  IF: wlo1 state: up mac: <filter>
  Device-2: Intel Ethernet I219-V vendor: Gigabyte driver: e1000e v: kernel
    port: N/A bus-ID: 00:1f.6
  IF: eno2 state: down mac: <filter>
Bluetooth:
  Device-1: Intel Bluetooth 9460/9560 Jefferson Peak (JfP) type: USB
    driver: btusb v: 0.8 bus-ID: 1-14:6
  Report: rfkill ID: hci0 rfk-id: 0 state: up address: see --recommends
Drives:
  Local Storage: total: 6.41 TiB used: 2.55 TiB (39.7%)
  ID-1: /dev/nvme0n1 vendor: Intel model: SSDPEKNW010T8 size: 953.87 GiB
    temp: 40.9 C
  ID-2: /dev/nvme1n1 vendor: Intel model: SSDPEKNW010T8 size: 953.87 GiB
    temp: 37.9 C
  ID-3: /dev/sda vendor: Seagate model: ST5000LM000-2AN170 size: 4.55 TiB
Partition:
  ID-1: / size: 952.28 GiB used: 151.62 GiB (15.9%) fs: btrfs
    dev: /dev/nvme0n1p3
  ID-2: /boot size: 973.4 MiB used: 421.7 MiB (43.3%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 14 MiB (2.3%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-4: /home size: 952.28 GiB used: 151.62 GiB (15.9%) fs: btrfs
    dev: /dev/nvme0n1p3
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 40.0 C pch: 61.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 384 Uptime: 23m Memory: 15.46 GiB used: 2.93 GiB (19.0%)
  Init: systemd target: graphical (5) Compilers: gcc: 12.1.1 Packages: 16
  note: see --pkg Shell: Zsh v: 5.8.1 inxi: 3.3.19

I replaced the nouveau blacklist lines in the grub cmdline with nvidia to temporarily switch drivers. I tested it first on 5.18.10 (works as expected), then 5.18.11 (boots without error!). Then I booted 5.18.11 with nvidia like normal and…it boots fine? The “C” in NVIDIA is for “consistency”, I guess. :upside_down_face:
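
For reference, a sketch of that swap as a plain text transformation. It assumes the blacklist lines in question are the rd.driver.blacklist=nouveau and modprobe.blacklist=nouveau arguments that the RPM Fusion driver usually adds; the function name swap_blacklist is made up, and it only rewrites a string, it does not touch GRUB.

```shell
# Read a kernel command line on stdin and blacklist nvidia instead of
# nouveau, leaving every other argument untouched.
swap_blacklist() {
    sed -E 's/(rd\.driver\.blacklist|modprobe\.blacklist)=nouveau/\1=nvidia/g'
}

# Usage: see what the edited line would look like for the running kernel.
#   swap_blacklist < /proc/cmdline
```

For a one-off test, the same edit can be made by hand: press e on the entry in the GRUB menu and change the linux line, which only affects that single boot.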

If it sticks, I guess I’ll mark your post as the solution! Maybe the crash isn’t as reproducible as it seemed at first.

I’m still curious if anyone knows a straightforward way to blacklist one version of the kernel without upsetting akmods and friends. Might come in useful down the line, once in a while.
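
Not a blacklist, but a sketch of a different technique that avoids the Grub mashing: leave the bad kernel installed and make the newest other kernel the default boot entry. The function name newest_kernel_except is hypothetical; grubby --set-default and the rpm query format in the usage comment are existing tools on Fedora, and sort -V assumes GNU coreutils.

```shell
# Read kernel release strings on stdin (one per line) and print the newest
# one whose version is not the given bad version.
# $1: version to skip, e.g. 5.18.11
newest_kernel_except() {
    awk -v bad="$1" 'index($0, bad "-") != 1' | sort -V | tail -n 1
}

# Usage:
#   good="$(rpm -q kernel-core --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' \
#           | newest_kernel_except 5.18.11)"
#   sudo grubby --set-default "/boot/vmlinuz-${good}"
#   sudo grubby --default-kernel    # confirm the change
```

A later kernel update will likely take over as the default again, which is usually what you want once a fixed version lands.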

The actual necessary procedure is

  1. Uninstall kernel 5.18.11 with --noautoremove.
  2. Reboot.
  3. Install akmod-nvidia (and nothing else). It will pull in akmods and all the other packages needed to build the drivers, including kernel-devel-matched.
    I noticed that you have the two packages below, which may be part of your problem. The one from @@commandline was built by the akmod-nvidia package to match your installed kernel, and the other was downloaded from the repo. They may conflict.
 kmod-nvidia                             x86_64     3:515.57-1.fc36         @rpmfusion-nonfree-nvidia-driver       0  
 kmod-nvidia-5.18.11-200.fc36.x86_64     x86_64     3:515.57-1.fc36         @@commandline                         29 M
  4. Reboot.
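
The procedure above can be sketched as two small shell functions with a reboot in between. The names run, remove_kernel_version and reinstall_akmod_nvidia are made up for illustration, and everything is a dry run by default: the dnf commands are printed, not executed, unless DRY_RUN=0 is set.

```shell
# Print the command in dry-run mode (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: remove every package belonging to one kernel version.
# $1: version to remove, e.g. 5.18.11
remove_kernel_version() {
    run sudo dnf remove --noautoremove "kernel*${1}*"
}

# Step 3, after rebooting into the older kernel: reinstall only the akmod
# package and let dnf pull in whatever it needs.
reinstall_akmod_nvidia() {
    run sudo dnf install akmod-nvidia
}

# Usage:
#   remove_kernel_version 5.18.11             # prints the command
#   DRY_RUN=0 remove_kernel_version 5.18.11   # really runs it
```

The printed command shows the glob unquoted; when typing it by hand, quote it so the shell passes it to dnf untouched.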

BTW, you probably do not need any of the kernel-debug* packages, so unless you are doing something that is directly related to kernel development where debugging is necessary you could remove all those for all the installed kernels.