2x NVIDIA RTX 2070 Super Operate normally until NVLink installed

I’ve got a rather unique issue. I have a pair of NVIDIA RTX 2070 Supers which, when installed in my machine without NVLink, boot and operate properly.

As soon as I connect them with NVLink, I begin getting the following issue in my Xorg log, and no display:

https://pastebin.com/3B9YcHcW

[    12.604] (==) NVIDIA(0): No modes were requested; the default mode "nvidia-auto-select"
[    12.604] (==) NVIDIA(0):     will be used as the requested mode.
[    12.604] (==) NVIDIA(0): 
[    12.605] (II) NVIDIA(0): Validated MetaModes:
[    12.605] (II) NVIDIA(0):     "DFP-5:nvidia-auto-select"
[    12.605] (II) NVIDIA(0): Virtual screen size determined to be 3840 x 2160
[    12.657] (--) NVIDIA(0): DPI set to (103, 103); computed from "UseEdidDpi" X config
[    12.657] (--) NVIDIA(0):     option
[    12.657] (EE) NVIDIA(G0): GeForce RTX 2070 SUPER (GPU-1) already has an X screen
[    12.657] (EE) NVIDIA(G0):     assigned; skipping this GPU screen
[    12.657] (EE) NVIDIA(G0): Failing initialization of X screen
[    12.657] (II) UnloadModule: "nvidia"
[    12.657] (II) UnloadSubModule: "wfb"
[    12.657] (II) UnloadSubModule: "fb"
[    12.657] (II) UnloadModule: "nouveau"
[    12.657] (II) Unloading nouveau
[    12.657] (II) UnloadModule: "modesetting"
[    12.657] (II) Unloading modesetting
[    12.657] (II) UnloadModule: "fbdev"
[    12.657] (II) Unloading fbdev
[    12.657] (II) UnloadSubModule: "fbdevhw"
[    12.657] (II) Unloading fbdevhw
[    12.657] (II) UnloadModule: "vesa"
[    12.657] (II) Unloading vesa
[    12.659] (II) NVIDIA: Using 24576.00 MB of virtual memory for indirect memory
[    12.659] (II) NVIDIA:     access.
[    15.661] (EE) NVIDIA(GPU-0): Failed to initialize DMA.
[    15.663] (EE) NVIDIA(0): Failed to allocate push buffer
[    20.968] (EE) 
Fatal server error:
[    20.968] (EE) AddScreen/ScreenInit failed for driver 0
[    20.968] (EE) 
[    20.968] (EE) 
Please consult the Fedora Project support 
	 at http://wiki.x.org
 for help. 
[    20.968] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[    20.968] (EE) 
[    20.969] (EE) Server terminated with error (1). Closing log file.
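
For anyone reading along, here is a minimal sketch for pulling only the error and warning lines out of an Xorg log like the one above. It assumes the `[ timestamp] (EE) message` layout seen in this log; the function name `extract_problems` and the sample text are just illustrative, not part of any NVIDIA or X.Org tool.

```python
import re

# Matches Xorg log lines such as "[    12.657] (EE) NVIDIA(G0): message".
# Assumption: the log uses the timestamped layout shown in this thread.
LINE_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+\((?P<level>EE|WW)\)\s*(?P<msg>.*)$"
)


def extract_problems(log_text):
    """Return (timestamp, level, message) tuples for non-empty (EE)/(WW) lines."""
    problems = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        # Skip the empty "(EE)" spacer lines that Xorg prints around fatal errors.
        if match and match.group("msg").strip():
            problems.append(
                (float(match.group("time")), match.group("level"), match.group("msg").strip())
            )
    return problems


# A few lines copied from the log above, used as sample input.
sample = "\n".join([
    "[    12.657] (EE) NVIDIA(G0): Failing initialization of X screen",
    '[    12.657] (II) UnloadModule: "nvidia"',
    "[    15.661] (EE) NVIDIA(GPU-0): Failed to initialize DMA.",
    "[    20.968] (EE) ",
])

for timestamp, level, message in extract_problems(sample):
    print(f"{timestamp:10.3f} {level} {message}")
```

To use it on a real log, read the file (for example `/var/log/Xorg.0.log`) into a string and pass that string to `extract_problems`.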

Operating without the NVLink connector, everything starts normally.

https://pastebin.com/0JKZUDv0

[root@localhost log]# rpm -qa | grep -i noveau
[root@localhost log]# rpm -qa | grep -i nvidia
xorg-x11-drv-nvidia-libs-440.82-1.fc31.x86_64
nvidia-settings-440.82-1.fc31.x86_64
xorg-x11-drv-nvidia-kmodsrc-440.82-1.fc31.x86_64
kmod-nvidia-5.5.17-200.fc31.x86_64-440.82-1.fc31.x86_64
xorg-x11-drv-nvidia-cuda-libs-440.82-1.fc31.x86_64
xorg-x11-drv-nvidia-440.82-1.fc31.x86_64
akmod-nvidia-440.82-1.fc31.x86_64
xorg-x11-drv-nvidia-cuda-440.82-1.fc31.x86_64
nvidia-persistenced-440.82-1.fc31.x86_64

[root@localhost log]# lsmod | grep nvidia
nvidia_drm             57344  8
nvidia_modeset       1118208  9 nvidia_drm
nvidia_uvm           1093632  0
nvidia              20508672  441 nvidia_uvm,nvidia_modeset
drm_kms_helper        233472  1 nvidia_drm
drm                   585728  11 drm_kms_helper,nvidia_drm
ipmi_msghandler       118784  2 ipmi_devintf,nvidia
i2c_nvidia_gpu         16384  0

Please let me know if any additional data is required. I am using the NVIDIA drivers from RPM Fusion. I have tested this configuration under Windows 10, and it operates as expected with the NVIDIA drivers.


Hi @aclater, welcome to ask.Fedora!

A quick note: please wrap your code in backticks to make it more readable:

    ```
    Code
    ```

Regarding your issue: the errors do not look like NVLink is failing; they look more like X is already running. Can you please post the output of dmesg? I am looking for lines like this (taken from here):

[ 1105.780063] ipmi device interface
[ 1105.974672] nvidia: loading out-of-tree module taints kernel.
[ 1105.996891] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 1105.997272] nvidia 0000:01:00.0: enabling device (0400 -> 0403)
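
If the dmesg output is long, a small filter can narrow it down to lines like the ones above. This is only a sketch: the keyword list (`nvidia`, `nvlink`, `NVRM`) is my own guess at what is relevant, and `driver_lines` is a made-up helper name.

```python
import re

# Assumption: these keywords cover the driver messages of interest.
# This is not an official list from NVIDIA or the kernel.
KEYWORDS = re.compile(r"nvidia|nvlink|NVRM", re.IGNORECASE)


def driver_lines(dmesg_text):
    """Return the dmesg lines that mention the NVIDIA driver or NVLink."""
    return [line for line in dmesg_text.splitlines() if KEYWORDS.search(line)]


# Sample input: the four example lines quoted above.
sample = "\n".join([
    "[ 1105.780063] ipmi device interface",
    "[ 1105.974672] nvidia: loading out-of-tree module taints kernel.",
    "[ 1105.996891] nvidia-nvlink: Nvlink Core is being initialized, major device number 237",
    "[ 1105.997272] nvidia 0000:01:00.0: enabling device (0400 -> 0403)",
])

for line in driver_lines(sample):
    print(line)
```

In practice you would save the output of `dmesg` to a file, read it into a string, and pass that string to `driver_lines`.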

Also, just to be sure: you did not install anything like PRIME or Bumblebee, am I right? And did you configure your xorg.conf properly? Could something have changed it automatically?

Thanks for the tip on code formatting.

Just to clarify - The exact same config works fine with two cards installed and no NVLink, but fails when the NVLink is installed.

No other changes are made.

dmesg output here
https://pastebin.com/mD5t6nVU

This is a fresh install of Fedora 31. I did not make any configuration changes to xorg.conf; I simply installed the NVIDIA drivers via RPM Fusion, following their directions.

I seem to be a bit closer, but no less frustrated. I installed nvidia-xconfig and got a few results, this one being the most comical:

[  1007.776] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[  1007.776] (==) NVIDIA(0): RGB weight 888
[  1007.776] (==) NVIDIA(0): Default visual is TrueColor
[  1007.776] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[  1007.776] (**) Option "AllowNVIDIAGpuScreens"
[  1007.776] (II) Applying OutputClass "nvidia" options to /dev/dri/card1
[  1007.776] (**) NVIDIA(0): Option "SLI" "AA"
[  1007.776] (**) NVIDIA(0): Option "BaseMosaic" "False"
[  1007.776] (**) NVIDIA(0): Option "AllowEmptyInitialConfiguration"
[  1007.776] (**) NVIDIA(0): NVIDIA SLI antialiasing selected.
[  1007.776] (**) NVIDIA(0): Enabling 2D acceleration
[  1007.776] (II) Loading sub module "glxserver_nvidia"
[  1007.776] (II) LoadModule: "glxserver_nvidia"
[  1007.776] (II) Loading /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so
[  1007.779] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[  1007.779] 	compiled for 1.6.99.901, module version = 1.0.0
[  1007.779] 	Module class: X.Org Server Extension
[  1007.779] (II) NVIDIA GLX Module  440.82  Wed Apr  1 19:47:36 UTC 2020
[  1007.779] (II) NVIDIA: The X server supports PRIME Render Offload.
[  1007.779] (EE) NVIDIA(GPU-0): Failed to find a valid SLI configuration.
[  1007.779] (EE) NVIDIA(GPU-0): Invalid SLI configuration 1 of 1:
[  1007.779] (EE) NVIDIA(GPU-0): GPUs:
[  1007.779] (EE) NVIDIA(GPU-0):     1) NVIDIA GPU at PCI:10:0:0
[  1007.779] (EE) NVIDIA(GPU-0):     2) NVIDIA GPU at PCI:11:0:0
[  1007.779] (EE) NVIDIA(GPU-0): Errors:
[  1007.779] (EE) NVIDIA(GPU-0):     - No video link present
[  1007.779] (WW) NVIDIA(GPU-0): Failed to find a valid SLI configuration for the NVIDIA
[  1007.779] (WW) NVIDIA(GPU-0):     graphics device PCI:11:0:0. Please see Chapter 30:
[  1007.780] (WW) NVIDIA(GPU-0):     Configuring SLI and Multi-GPU FrameRendering in the README
[  1007.780] (WW) NVIDIA(GPU-0):     for troubleshooting suggestions.
[  1007.780] (EE) NVIDIA(GPU-0): Only one GPU will be used for this X screen.
[  1007.780] (EE) NVIDIA(GPU-0): The NVIDIA graphics device PCI:11:0:0 is part of an active SLI
[  1007.780] (EE) NVIDIA(GPU-0):     configuration and is currently unavailable for single GPU
[  1007.780] (EE) NVIDIA(GPU-0):     rendering.  Please see Chapter 30: Configuring SLI and
[  1007.780] (EE) NVIDIA(GPU-0):     Multi-GPU FrameRendering in the README for troubleshooting
[  1007.780] (EE) NVIDIA(GPU-0):     information.
[  1007.780] (EE)  *** Aborting ***

/etc/X11/xorg.conf

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 440.82


Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/input/mice"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "BaseMosaic" "False"
    Option         "SLI" "AA"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

nvidia-smi nvlink -c

GPU 0: GeForce RTX 2070 SUPER (UUID: GPU-409287de-103f-fd85-d2b1-b5d09a742443)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
GPU 1: GeForce RTX 2070 SUPER (UUID: GPU-8788c70f-2c32-ceb9-82e5-d23280aed8ba)
	 Link 0, P2P is supported: true
	 Link 0, Access to system memory supported: true
	 Link 0, P2P atomics supported: true
	 Link 0, System memory atomics supported: true
	 Link 0, SLI is supported: true
	 Link 0, Link is supported: false
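
To compare this output between runs, a rough sketch that turns it into a nested dictionary and lists every capability reported as `false`. It assumes the exact text layout shown above; `parse_nvlink_caps` and `unsupported` are hypothetical helper names, and nothing here interprets what a given capability means.

```python
import re

# Assumption: output follows the "GPU n: name (UUID: ...)" / "Link n, X: true|false"
# layout shown in this post.
GPU_RE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$")
CAP_RE = re.compile(r"^\s*Link (\d+), (.+): (true|false)$")


def parse_nvlink_caps(text):
    """Map GPU index -> link index -> capability name -> bool."""
    caps = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        gpu = GPU_RE.match(line)
        if gpu:
            current = int(gpu.group(1))
            caps[current] = {}
            continue
        cap = CAP_RE.match(line)
        if cap and current is not None:
            link = int(cap.group(1))
            caps[current].setdefault(link, {})[cap.group(2)] = cap.group(3) == "true"
    return caps


def unsupported(caps):
    """List (gpu, link, capability) for every capability reported as false."""
    return [
        (gpu, link, name)
        for gpu, links in caps.items()
        for link, names in links.items()
        for name, ok in names.items()
        if not ok
    ]


# Sample input: a shortened copy of the GPU 0 block above.
sample = "\n".join([
    "GPU 0: GeForce RTX 2070 SUPER (UUID: GPU-409287de-103f-fd85-d2b1-b5d09a742443)",
    "\t Link 0, P2P is supported: true",
    "\t Link 0, SLI is supported: true",
    "\t Link 0, Link is supported: false",
])

print(unsupported(parse_nvlink_caps(sample)))
```

Feeding it the full output above would flag the same `Link is supported` entry for both GPUs, since that is the only field reported as `false`.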

Lots more head scratching all around.

That is one of the weirdest error messages I have seen so far :joy:

Have you tried the original NVIDIA documentation about SLI on Linux?

There is some troubleshooting in there, and also some tips on how to generate a proper xorg.conf.

Yep, been through them up, down and sideways - to no avail. SLI works in Windows… Just no love in X.

OK, then I am out of ideas. Maybe you will have more luck at a forum that is more specialized in NVIDIA on Linux? Phoronix maybe, or some Reddit community?
And if you ever solve it I would be happy if you could let us know what the issue was…