Nvidia GPU drivers version mismatch causing drivers not to load #148

Open
linkion opened this issue Jan 23, 2025 · 13 comments · Fixed by #151
Labels
bug Something isn't working

Comments

linkion commented Jan 23, 2025

Describe the bug

On aurora-dx-nvidia with stable-daily, the Nvidia driver components appear to have mismatched versions. As a result, the desktop is rendered in software only, which makes it slow.

rpm-ostree status:

State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: no runs since boot
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:stable-daily
                   Digest: sha256:cdfdabe43576067c973c2b87aa2b5dce3c96b11ef4d4c3ee71d66ef6e252103e
                  Version: 41.20250123.3 (2025-01-23T17:16:25Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:stable
                   Digest: sha256:55bd2163f623eb916d3e72fdb1860c53f5abdbdbffb5d75eac1398976b5e3bbe
                  Version: 41.20250119.3 (2025-01-19T14:51:17Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

What did you expect to happen?

Nvidia drivers would match and GPU accelerated rendering would work.

Output of bootc status

No staged image present
Current booted state is native ostree
Current rollback state is native ostree

Output of groups

patrickorave wheel incus-admin lxd docker libvirt plugdev

Extra information or context

running: nvidia-smi

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 565.77

running: cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
GCC version:  gcc version 14.2.1 20240912 (Red Hat 14.2.1-3) (GCC)
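
The two outputs above can be compared mechanically. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and the sample text is copied from this issue so that it runs without an Nvidia GPU. A later comment in this thread shows nvidia-smi printing 570.86 while the packages are at 570.86.15, so the sketch assumes the NVML line may carry only the first two version components and compares on those. That makes the check coarse: two builds that differ only in the third component would not be flagged.

```shell
# Print the kernel module version from text in the format of
# /proc/driver/nvidia/version, read on stdin.
kernel_module_version() {
    sed -n 's/^NVRM version:.*Kernel Module  *\([0-9][0-9.]*\).*/\1/p'
}

# Reduce a version such as 565.57.01 to its first two components (565.57).
major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Sample data copied from the report above.
proc_text='NVRM version: NVIDIA UNIX x86_64 Kernel Module  565.57.01  Thu Oct 10 12:29:05 UTC 2024
GCC version:  gcc version 14.2.1 20240912 (Red Hat 14.2.1-3) (GCC)'

kmod=$(printf '%s\n' "$proc_text" | kernel_module_version)
nvml=565.77   # "NVML library version" printed by nvidia-smi

if [ "$(major_minor "$nvml")" = "$(major_minor "$kmod")" ]; then
    result="match: $kmod"
else
    result="mismatch: NVML $nvml, kernel module $kmod"
fi
echo "$result"
```

On an affected machine, the same function could read the live file with `kernel_module_version < /proc/driver/nvidia/version`.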

I read here that I can check whether a process is using the GPU by running the following command.

2909 is the proc id for /usr/bin/kwin_wayland

running: lsof -p 2909 | grep /dev/dri

<no output>

5477 is the proc id for firefox

running: lsof -p 5477 | grep /dev/dri

<no output>

linkion changed the title from "Nvidia GPU driver and CUDA version mismatch causing drivers not to load" to "Nvidia GPU drivers version mismatch causing drivers not to load" on Jan 23, 2025
dosubot (bot) added the bug label on Jan 23, 2025

Author

linkion commented Jan 23, 2025

ran rpm -qa | grep nvidia:

1759:	nvidia-gpu-firmware-20241210-1.fc41.noarch
2030:	ublue-os-nvidia-addons-0.12-1.fc41.noarch
2032:	libnvidia-ml-565.77-1.fc41.x86_64
2033:	libnvidia-cfg-565.77-1.fc41.x86_64
2035:	nvidia-driver-cuda-libs-565.77-1.fc41.x86_64
2037:	libnvidia-fbc-565.77-1.fc41.x86_64
2050:	libnvidia-container1-1.17.4-1.x86_64
2051:	libnvidia-container-tools-1.17.4-1.x86_64
2052:	nvidia-container-toolkit-base-1.17.4-1.x86_64
2053:	nvidia-libXNVCtrl-565.77-1.fc41.x86_64
2054:	nvidia-modprobe-565.77-1.fc41.x86_64
2055:	nvidia-kmod-common-565.77-2.fc41.noarch
2056:	akmod-nvidia-565.77-1.fc41.x86_64
2057:	nvidia-persistenced-565.77-1.fc41.x86_64
2066:	nvidia-driver-libs-565.77-1.fc41.x86_64
2067:	nvidia-driver-565.77-1.fc41.x86_64
2069:	libnvidia-ml-565.77-1.fc41.i686
2082:	nvidia-settings-565.77-1.fc41.x86_64
2083:	nvidia-driver-cuda-565.77-1.fc41.x86_64
2084:	kmod-nvidia-565.57.01-2.fc41.x86_64
2085:	nvidia-container-toolkit-1.17.4-1.x86_64
2087:	libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64
2101:	nvidia-driver-libs-565.77-1.fc41.i686
2103:	nvidia-driver-cuda-libs-565.77-1.fc41.i686

And I can see that

2084: kmod-nvidia-565.57.01-2.fc41.x86_64

is on an older version than the rest.
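
Spotting the odd package by eye works, but the comparison can also be scripted. This is a rough sketch under assumptions: `pkg_version` is a hypothetical helper, it expects names in the `<name>-<version>-<release>` form shown above, and it tolerates the `NNNN:` prefixes that appear in the pasted listing. The sample lines are a subset of the list in this comment.

```shell
# Print the version of package $1 from a package list on stdin.
# An optional "NNNN:" prefix is skipped; only the exact package name matches,
# so "nvidia-driver" does not match "nvidia-driver-libs" or "libva-nvidia-driver".
pkg_version() {
    sed -n "s/^[[:space:]]*\([0-9]*:[[:space:]]*\)\{0,1\}$1-\([0-9][^-]*\)-.*/\2/p" | head -n 1
}

# Subset of the package list from this comment.
packages='2056: akmod-nvidia-565.77-1.fc41.x86_64
2066: nvidia-driver-libs-565.77-1.fc41.x86_64
2067: nvidia-driver-565.77-1.fc41.x86_64
2084: kmod-nvidia-565.57.01-2.fc41.x86_64
2087: libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64'

driver=$(printf '%s\n' "$packages" | pkg_version nvidia-driver)
kmod=$(printf '%s\n' "$packages" | pkg_version kmod-nvidia)

if [ "$driver" = "$kmod" ]; then
    echo "driver and kmod agree: $driver"
else
    echo "driver $driver does not match kmod $kmod"
fi
```

On a live system the list would come from `rpm -qa | pkg_version kmod-nvidia`. Querying the package directly with `rpm -q --qf '%{VERSION}\n' kmod-nvidia` avoids the text parsing altogether.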

Author

linkion commented Jan 23, 2025

This issue is now appearing on the stable build that was released a few hours ago:

rpm-ostree status:

State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: no runs since boot
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:stable
                   Digest: sha256:42e14e78b7b0bfec82e05c85633acb2b02e41414ee91a2175445b5896923917f
                  Version: 41.20250123.1 (2025-01-23T15:34:18Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:stable-daily
                   Digest: sha256:cdfdabe43576067c973c2b87aa2b5dce3c96b11ef4d4c3ee71d66ef6e252103e
                  Version: 41.20250123.3 (2025-01-23T17:16:25Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

rpm -qa:

  1759:	nvidia-gpu-firmware-20241210-1.fc41.noarch
  2030:	ublue-os-nvidia-addons-0.12-1.fc41.noarch
  2032:	libnvidia-ml-565.77-1.fc41.x86_64
  2033:	libnvidia-cfg-565.77-1.fc41.x86_64
  2035:	nvidia-driver-cuda-libs-565.77-1.fc41.x86_64
  2037:	libnvidia-fbc-565.77-1.fc41.x86_64
  2050:	libnvidia-container1-1.17.4-1.x86_64
  2051:	libnvidia-container-tools-1.17.4-1.x86_64
  2052:	nvidia-container-toolkit-base-1.17.4-1.x86_64
  2053:	nvidia-libXNVCtrl-565.77-1.fc41.x86_64
  2054:	nvidia-modprobe-565.77-1.fc41.x86_64
  2055:	nvidia-kmod-common-565.77-2.fc41.noarch
  2056:	akmod-nvidia-565.77-1.fc41.x86_64
  2057:	nvidia-persistenced-565.77-1.fc41.x86_64
  2066:	nvidia-driver-libs-565.77-1.fc41.x86_64
  2067:	nvidia-driver-565.77-1.fc41.x86_64
  2069:	libnvidia-ml-565.77-1.fc41.i686
  2082:	nvidia-settings-565.77-1.fc41.x86_64
  2083:	nvidia-driver-cuda-565.77-1.fc41.x86_64
  2084:	kmod-nvidia-565.57.01-2.fc41.x86_64
  2085:	nvidia-container-toolkit-1.17.4-1.x86_64
  2087:	libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64
  2101:	nvidia-driver-libs-565.77-1.fc41.i686
  2103:	nvidia-driver-cuda-libs-565.77-1.fc41.i686

Collaborator

ledif commented Jan 23, 2025

The latest stable image was accidentally updated today, but it will be reverted, and you can roll back to a previous working version.

The kernel version for :stable images is fixed to an older version while we're waiting out a separate regression. It seems like the shared kernel modules for NVIDIA are somehow mismatched with the rest of the graphics toolkit.

The :latest tag has the newest upstream kernel from Fedora. Can you try to rebase to :latest instead of :stable and see if the versions mismatch there as well?
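
For readers following along, here is one way the suggested rebase could look. It is a sketch and not a maintainer-endorsed procedure: `image_ref` is a hypothetical helper, and the reference format is simply copied from the `rpm-ostree status` output earlier in this thread. The script only prints the command, so nothing changes on the system until the printed line is run by hand.

```shell
# Build an image reference in the same form rpm-ostree status prints.
# $1 = image name, $2 = tag
image_ref() {
    printf 'ostree-image-signed:docker://ghcr.io/ublue-os/%s:%s\n' "$1" "$2"
}

ref=$(image_ref aurora-dx-nvidia latest)
echo "rpm-ostree rebase $ref"

# To actually switch, run the printed command and then reboot.
# To return to the previous deployment afterwards:
#   rpm-ostree rollback
```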

@linkion
Copy link
Author

linkion commented Jan 23, 2025

Can confirm, :latest has working Nvidia

rpm-ostree status:

State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: no runs since boot
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                   Digest: sha256:900e8391176b90dacef6a3dd9dfa30a78dcd1bd71dcd7e25cc35e233d93672d2
                  Version: latest-41.20250123.3 (2025-01-23T17:15:22Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch
                           rpmfusion-nonfree-release-41-1.noarch

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:stable
                   Digest: sha256:42e14e78b7b0bfec82e05c85633acb2b02e41414ee91a2175445b5896923917f
                  Version: 41.20250123.1 (2025-01-23T15:34:18Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch
                           rpmfusion-nonfree-release-41-1.noarch

rpm -qa | grep nvidia:

1759:	nvidia-gpu-firmware-20241210-1.fc41.noarch
1963:	ublue-os-nvidia-addons-0.12-1.fc41.noarch
1965:	libnvidia-ml-565.77-1.fc41.x86_64
1966:	libnvidia-cfg-565.77-1.fc41.x86_64
1968:	nvidia-driver-cuda-libs-565.77-1.fc41.x86_64
1970:	libnvidia-fbc-565.77-1.fc41.x86_64
1983:	libnvidia-container1-1.17.4-1.x86_64
1984:	libnvidia-container-tools-1.17.4-1.x86_64
1985:	nvidia-container-toolkit-base-1.17.4-1.x86_64
1986:	nvidia-libXNVCtrl-565.77-1.fc41.x86_64
1987:	nvidia-modprobe-565.77-1.fc41.x86_64
1988:	kmod-nvidia-565.77-1.fc41.x86_64  <-------------- up-to-date kernel driver
1989:	nvidia-kmod-common-565.77-2.fc41.noarch
1990:	nvidia-persistenced-565.77-1.fc41.x86_64
1999:	nvidia-driver-libs-565.77-1.fc41.x86_64
2000:	nvidia-driver-565.77-1.fc41.x86_64
2002:	libnvidia-ml-565.77-1.fc41.i686
2015:	nvidia-settings-565.77-1.fc41.x86_64
2016:	nvidia-driver-cuda-565.77-1.fc41.x86_64
2017:	nvidia-container-toolkit-1.17.4-1.x86_64
2019:	libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64
2033:	nvidia-driver-libs-565.77-1.fc41.i686
2035:	nvidia-driver-cuda-libs-565.77-1.fc41.i686

Collaborator

ledif commented Jan 24, 2025

I rolled back the :stable image to its previous state and confirmed on my desktop with an NVIDIA card that it's back to normal and nvidia-smi seems to be fine. Since you've confirmed that :latest works as well, then the problem seems to be isolated to :stable-daily.

We need to investigate further.

Collaborator

ledif commented Jan 25, 2025

I think I found the root problem of the version mismatch on stable-daily and it should be resolved when #151 is merged and new images are pushed.

Author

linkion commented Jan 28, 2025

@ledif
Issue has come back on latest-41.20250128.1 build:

rpm-ostree status:

State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: last run 1h 26min ago
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                   Digest: sha256:0802db65256f33d733369a9c49d9c1ac035bd17b8910fb162294148b92cb4e36
                  Version: latest-41.20250128.1 (2025-01-28T05:54:43Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                   Digest: sha256:78eca1598c05793eb86fe682b0ade3832119842850df01a4bff7b677336a5b95
                  Version: latest-41.20250126.2 (2025-01-26T23:16:07Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch rpmfusion-nonfree-release-41-1.noarch

rpm -qa | grep nvidia:

1759:	nvidia-gpu-firmware-20241210-1.fc41.noarch
1961:	ublue-os-nvidia-addons-0.12-1.fc41.noarch
1963:	libnvidia-ml-570.86.15-1.fc41.x86_64
1965:	libnvidia-cfg-570.86.15-1.fc41.x86_64
1966:	nvidia-driver-cuda-libs-570.86.15-1.fc41.x86_64
1998:	libnvidia-fbc-570.86.15-1.fc41.x86_64
2018:	libnvidia-container1-1.17.4-1.x86_64
2019:	libnvidia-container-tools-1.17.4-1.x86_64
2020:	nvidia-container-toolkit-base-1.17.4-1.x86_64
2021:	nvidia-libXNVCtrl-570.86.15-1.fc41.x86_64
2022:	nvidia-modprobe-570.86.15-1.fc41.x86_64
2023:	nvidia-persistenced-570.86.15-1.fc41.x86_64
2041:	nvidia-kmod-common-570.86.15-1.fc41.noarch
2042:	akmod-nvidia-570.86.15-1.fc41.x86_64
2059:	nvidia-driver-libs-570.86.15-1.fc41.x86_64
2060:	nvidia-driver-570.86.15-1.fc41.x86_64
2064:	libnvidia-ml-570.86.15-1.fc41.i686
2078:	nvidia-settings-570.86.15-1.fc41.x86_64
2079:	nvidia-driver-cuda-570.86.15-1.fc41.x86_64
2080:	kmod-nvidia-565.77-1.fc41.x86_64   <---------------- old kmod
2081:	nvidia-container-toolkit-1.17.4-1.x86_64
2083:	libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64
2097:	nvidia-driver-libs-570.86.15-1.fc41.i686
2099:	nvidia-driver-cuda-libs-570.86.15-1.fc41.i686

nvidia-smi

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 570.86

Collaborator

inffy commented Jan 28, 2025

Yup we know

Author

linkion commented Jan 28, 2025

Awesome, keep up the good work!

inffy reopened this on Jan 28, 2025

Collaborator

inffy commented Jan 28, 2025

Can you try updating? It should roll you back to a working build.

Author

linkion commented Jan 28, 2025

I just came back from classes; it doesn't seem like it's rolling me back

ujust update:

── 14:37:25 - System update ────────────────────────────────────────────────────
[sudo] password for patrickorave:
Note: This system is image (rpm-ostree) based.
note: automatic updates (stage) are enabled
Pulling manifest: ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
Checking out tree bd004e4... done
Enabled rpm-md repositories: updates fedora rpmfusion-free-updates rpmfusion-free rpmfusion-nonfree-updates rpmfusion-nonfree copr:copr.fedorainfracloud.org:kylegospo:wallpaper-engine-kde-plugin copr:copr.fedorainfracloud.org:lizardbyte:beta updates-archive
Importing rpm-md... done
rpm-md repo 'updates' (cached); generated: 2025-01-28T02:55:16Z solvables: 18760
rpm-md repo 'fedora' (cached); generated: 2024-10-24T13:55:59Z solvables: 76624
rpm-md repo 'rpmfusion-free-updates' (cached); generated: 2025-01-24T10:58:05Z solvables: 59
rpm-md repo 'rpmfusion-free' (cached); generated: 2024-10-27T07:49:25Z solvables: 347
rpm-md repo 'rpmfusion-nonfree-updates' (cached); generated: 2025-01-24T11:15:41Z solvables: 45
rpm-md repo 'rpmfusion-nonfree' (cached); generated: 2024-10-27T07:58:23Z solvables: 218
rpm-md repo 'copr:copr.fedorainfracloud.org:kylegospo:wallpaper-engine-kde-plugin' (cached); generated: 2024-08-27T11:29:15Z solvables: 2
rpm-md repo 'copr:copr.fedorainfracloud.org:lizardbyte:beta' (cached); generated: 2025-01-27T17:49:36Z solvables: 60
rpm-md repo 'updates-archive' (cached); generated: 2025-01-28T03:14:54Z solvables: 26782
Resolving dependencies... done
No upgrade available.
...

after rebooting, no change...

rpm-ostree status:

State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: no runs since boot
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                   Digest: sha256:0802db65256f33d733369a9c49d9c1ac035bd17b8910fb162294148b92cb4e36
                  Version: latest-41.20250128.1 (2025-01-28T05:54:43Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch
                           rpmfusion-nonfree-release-41-1.noarch

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                   Digest: sha256:78eca1598c05793eb86fe682b0ade3832119842850df01a4bff7b677336a5b95
                  Version: latest-41.20250126.2 (2025-01-26T23:16:07Z)
          LayeredPackages: Sunshine
            LocalPackages: rpmfusion-free-release-41-1.noarch
                           rpmfusion-nonfree-release-41-1.noarch
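
Because the update reported "No upgrade available", the older deployment is still listed in the status output, and `rpm-ostree rollback` would make it the default for the next boot. Whether that build has matching drivers is an assumption based on the issue only reappearing on latest-41.20250128.1. The sketch below, with a hypothetical `list_deployments` helper, just summarizes which version is booted and which is available as the fallback; the sample is trimmed from the output above.

```shell
# Summarize "rpm-ostree status" text read on stdin as "<state> <version>" lines.
# A deployment line that starts directly with the image reference is not booted;
# one that starts with a marker (the bullet) is the booted deployment.
list_deployments() {
    awk '
        /ostree-image-signed:/ { state = ($1 ~ /^ostree-image-signed:/) ? "other" : "booted" }
        $1 == "Version:"       { print state, $2 }
    '
}

# Trimmed copy of the status output above.
status_text='Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                  Version: latest-41.20250128.1 (2025-01-28T05:54:43Z)
          LayeredPackages: Sunshine

  ostree-image-signed:docker://ghcr.io/ublue-os/aurora-dx-nvidia:latest
                  Version: latest-41.20250126.2 (2025-01-26T23:16:07Z)
          LayeredPackages: Sunshine'

summary=$(printf '%s\n' "$status_text" | list_deployments)
echo "$summary"
```

On a live system the input would be `rpm-ostree status | list_deployments`.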

Author

linkion commented Jan 28, 2025

I'm gonna go ahead and switch to stable-daily, I can switch back if you want me to

Collaborator

ledif commented Jan 29, 2025

One issue was fixed with #151 but another one popped up immediately afterwards.

Fortunately, both :latest and :stable-daily should now have the updated driver and matching kmods.

3 participants