NVIDIA driver setup on a custom kernel

Hey guys,

I have a Surface Book, and to make the dGPU work properly I have to use a custom kernel from the Surface Linux community.
I was using the dGPU without any problems on Arch, but I decided to make the change and try Fedora.

To make it work on Arch, I discovered that an additional setup step is needed after installing the custom kernel, one that relates specifically to my Surface Book model.

It consists of creating the following script and the corresponding systemd service:

#!/bin/sh

echo 1 | tee /sys/bus/platform/devices/MSHW0041:00/dgpu_power
echo 1 > /sys/bus/pci/rescan
setpci -H1 -s 01:00.0 6a.b=81
setpci -H1 -s 01:00.0 4.w=0407
echo 1 > /sys/bus/pci/rescan
setpci -s 01:00.0 4.w
setpci -s 01:00.0 6a.b
modprobe nvidia

and the service:

[Unit]
Description=Nvidia GPU initialization
Before=display-manager.service

[Service]
Type=oneshot
ExecStart=/usr/bin/dgpu.sh
ExecStartPre=/bin/sleep 10

[Install]
WantedBy=multi-user.target

After enabling it (systemctl enable dgpu.service), I usually go on to install DKMS or its equivalent. I learned that on Fedora the equivalent is akmods.

Unfortunately, I’m struggling a lot to make the NVIDIA driver work. Let me explain what I’ve done.

So I followed the steps in the RPM Fusion docs and, after enabling the nonfree repo, installed akmod-nvidia and xorg-x11-drv-nvidia-cuda. I then waited a few minutes and rebooted, but the driver still wasn’t loaded.

I can actually see a problem with the service I need to use: it seems unable to find the nvidia module. This is what systemctl status dgpu.service shows:

× dgpu.service - Nvidia GPU initialization
     Loaded: loaded (/etc/systemd/system/dgpu.service; enabled; vendor preset: disabled)
     Active: failed (Result: exit-code) since Wed 2022-08-31 20:48:15 -03; 2min 58s ago
    Process: 732 ExecStartPre=/bin/sleep 1 (code=exited, status=0/SUCCESS)
    Process: 992 ExecStart=/usr/local/bin/dgpu.sh (code=exited, status=1/FAILURE)
   Main PID: 992 (code=exited, status=1/FAILURE)
        CPU: 26ms

Aug 31 20:48:13 fedora systemd[1]: Starting dgpu.service - Nvidia GPU initialization...
Aug 31 20:48:14 fedora dgpu.sh[999]: 1
Aug 31 20:48:15 fedora dgpu.sh[1062]: 0407
Aug 31 20:48:15 fedora dgpu.sh[1063]: 81
Aug 31 20:48:15 fedora dgpu.sh[1064]: modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.19.4-1.surface.fc36.x86_64
Aug 31 20:48:15 fedora systemd[1]: dgpu.service: Main process exited, code=exited, status=1/FAILURE
Aug 31 20:48:15 fedora systemd[1]: dgpu.service: Failed with result 'exit-code'.
Aug 31 20:48:15 fedora systemd[1]: Failed to start dgpu.service - Nvidia GPU initialization.

My SecureBoot is disabled as well (from sudo mokutil --sb-state):

SecureBoot disabled
Platform is in Setup Mode

I confirmed that I’m on the right kernel with uname -r: 5.19.4-1.surface.fc36.x86_64

I also confirmed that the NVIDIA dGPU is recognized with lspci | grep 'NVIDIA':
01:00.0 3D controller: NVIDIA Corporation GM206M [GeForce GTX 965M] (rev a1)

These are my kernel parameters:
GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 initcall_blacklist=simpledrm_platform_driver_init rhgb quiet nouveau.modeset=0 pci=realloc pcie_port_pm=off pcie_aspm=off"

And to make sure it is all correct in grub I ran: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
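
Regenerating grub.cfg only rewrites the config file. The parameters the running kernel actually received are in /proc/cmdline. A minimal sketch for checking a single parameter, assuming a POSIX shell (has_kernel_param is a hypothetical helper name, not an existing tool):

```shell
#!/bin/sh
# Succeed if parameter $1 appears as a whole word in the kernel
# command line text $2 (hypothetical helper for illustration).
has_kernel_param() {
    # Pad both ends with spaces so the first and last words match too.
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, has_kernel_param modprobe.blacklist=nouveau "$(cat /proc/cmdline)" && echo present would show whether the nouveau blacklist reached the booted kernel.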

Listing everything NVIDIA-related (with sudo dnf list installed *nvidia*) gives:

Installed Packages
akmod-nvidia.x86_64                                                         3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
nvidia-persistenced.x86_64                                                  3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
nvidia-settings.x86_64                                                      3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia.x86_64                                                  3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda.x86_64                                             3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.i686                                          3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-cuda-libs.x86_64                                        3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-kmodsrc.x86_64                                          3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.i686                                               3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-libs.x86_64                                             3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver
xorg-x11-drv-nvidia-power.x86_64                                            3:515.65.01-1.fc36                                        @rpmfusion-nonfree-nvidia-driver

Finally, running nvidia-smi gives me:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I don’t know what to try next.

Can anyone help me? :sos:

What’s the output of journalctl --boot -u akmods?

What is the output of lsmod | grep -iE 'nvidia|nouveau'? If it returns a list of nouveau modules, then the nvidia driver did not load. On the other hand, if it returns a list of nvidia modules, then the nvidia drivers are loaded and active.
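
The same check can be scripted. This is only a sketch, assuming lsmod-style text on stdin with the module name in the first column; classify_gpu_driver is a made-up helper name:

```shell
#!/bin/sh
# Classify which GPU kernel driver appears in lsmod-style text read
# from stdin. Prints "nvidia", "nouveau", "both", or "none".
classify_gpu_driver() {
    # Only the first column (the module name) is inspected, so lines that
    # merely list nouveau as a dependent (e.g. mxm_wmi) are not counted.
    awk '
        $1 == "nouveau" { nouveau = 1 }
        $1 ~ /^nvidia/  { nvidia = 1 }
        END {
            if (nvidia && nouveau) print "both"
            else if (nvidia)       print "nvidia"
            else if (nouveau)      print "nouveau"
            else                   print "none"
        }
    '
}
```

Run it as lsmod | classify_gpu_driver. Anything other than "nvidia" means the proprietary driver is not the only one loaded.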

Your list of installed nvidia packages does not show the kmod-nvidia package, which should have been built by the akmod-nvidia package when it was installed.

Since you are using a custom kernel, it seems likely that akmods does not know how to build the kernel module for you. It does know how to build the modules for a Fedora kernel.

This is the output:

Aug 31 20:48:13 fedora systemd[1]: Starting akmods.service - Builds and install new kmods from akmod packages...
Aug 31 20:48:14 fedora akmods[730]: Checking kmods exist for 5.19.4-1.surface.fc36.x86_64[  OK  ]
Aug 31 20:48:14 fedora akmods[730]: Files needed for building modules against kernel
Aug 31 20:48:14 fedora akmods[730]: 5.19.4-1.surface.fc36.x86_64 could not be found as the following
Aug 31 20:48:14 fedora akmods[730]: directories are missing:
Aug 31 20:48:14 fedora akmods[730]: /usr/src/kernels/5.19.4-1.surface.fc36.x86_64/
Aug 31 20:48:14 fedora akmods[730]: /lib/modules/5.19.4-1.surface.fc36.x86_64/build/Is the correct kernel-devel package installed?[FAILED]
Aug 31 20:48:14 fedora systemd[1]: Finished akmods.service - Builds and install new kmods from akmod packages.

This FAILED part got me concerned.

Here is the output of lsmod | grep -iE 'nvidia|nouveau':

nouveau              2416640  0
mxm_wmi                16384  1 nouveau
wmi                    32768  2 mxm_wmi,nouveau
drm_ttm_helper         16384  1 nouveau
drm_display_helper    172032  2 i915,nouveau
ttm                    90112  3 drm_ttm_helper,i915,nouveau
video                  61440  2 i915,nouveau

Yep, no nvidia driver loaded.

This is weird. When I was using Arch, by installing the nvidia-dkms package it was able to build for the custom kernel. I thought akmods would be able to do the same thing.

I’m completely lost now :face_with_spiral_eyes:

That error log shows that akmods tries to build the modules for the installed 5.19.4 kernel, but it fails because the ‘/usr/src/kernels/5.19.4-1.surface.fc36.x86_64/’ directory is missing. The three lines directly above the line that says [FAILED] show why, as does the question about the kernel-devel package.
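
To see this without waiting for the akmods service, you can test for the two directories named in that log yourself. A rough sketch, where missing_build_dirs and its optional root-prefix argument are hypothetical and exist only to make the check easy to try out:

```shell
#!/bin/sh
# Print each kernel build directory (as named in the akmods log) that is
# missing for kernel release $1. $2 is an optional root prefix, an
# assumption of this sketch; leave it empty for the running system.
missing_build_dirs() {
    krel=$1
    root=${2:-}
    for dir in "/usr/src/kernels/$krel" "/lib/modules/$krel/build"; do
        # -d follows symlinks, so a dangling "build" symlink counts as missing.
        [ -d "$root$dir" ] || echo "$dir"
    done
}
```

Calling missing_build_dirs "$(uname -r)" prints nothing once matching kernel development files are installed. Any path it prints is one akmods would also fail to find.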

I tried installing the drivers manually, but the installer complains about the Linux headers. For some reason, it cannot find any headers for the surface kernel. That’s odd, because I know other distros have packages for them (linux-headers-surface or linux-surface-headers).
The headers package is listed explicitly in the installation guide, but for some reason the Fedora installation steps don’t include it.

By checking the contents of the repository I can’t see it either: Index of /fedora/f36/

Ok, starting to think about going back to Arch…