fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
fabric-manager-535 (Ubuntu) | Fix Released | Undecided | Mitchell Augustin | ||
linux (Ubuntu) | Fix Released | Undecided | Mitchell Augustin | ||
nvidia-graphics-drivers-535-server (Ubuntu) | Fix Released | Undecided | Mitchell Augustin | ||
Bug Description
This error occurs on both the standard and largemem variants of the latest Noble server build of Ubuntu:
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64) (iso link: https:/
CPU/GPU: NVIDIA Grace/Hopper
lsb_release -rd:
No LSB modules are available.
Description: Ubuntu Noble Numbat (development branch)
Release: 24.04
Kernel versions affected:
GNU/Linux 6.6.0-14-
GNU/Linux 6.6.0-14-generic aarch64
Package version: nvidia-
Expected behavior: Package starts as expected during post-install setup steps
Actual behavior:
On our grace/hopper system running noble, when installing nvidia-
This is the last output I see from my installer shell:
+ apt install -y nvidia-
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
nvidia-
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 1795 kB of archives.
After this operation, 8679 kB of additional disk space will be used.
Get:1 http://
Fetched 1795 kB in 1s (2439 kB/s)
Selecting previously unselected package nvidia-
(Reading database ... 103745 files and directories currently installed.)
Preparing to unpack .../nvidia-
Unpacking nvidia-
Setting up nvidia-
Created symlink /etc/systemd/
Progress: [ 60%] [######
This does not appear to cause a panic/reboot, as I can still interact with the console, and it even appears that the apt process is still running in ps aux (although it doesn't seem to progress). However, I observe the following output in the console that I believe may be related:
[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-
[ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-
[ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-
[ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-
[ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
[ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
[ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-
[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-
[ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1630.568973] rcu: 17-...0: (1 GPs behind) idle=2444/
qs=126842
[ 1630.578665] rcu: hardirqs softirqs csw/system
[ 1630.584381] rcu: number: 0 0 0
[ 1630.590109] rcu: cputime: 0 0 0 ==> 1110384(ms)
[ 1630.597458] rcu: (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)
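To see how long each CPU has been stuck, the watchdog lines above can be summarized with a short script. This is a minimal sketch, assuming the console output has been saved as text. The names `parse_soft_lockups` and `longest_stall_per_cpu` and the regular expression are illustrative, not part of any kernel tooling.

```python
import re

# Matches kernel console lines of the form seen in this report, e.g.
# [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def parse_soft_lockups(log_text):
    """Return a list of (timestamp, cpu, stuck_seconds) tuples."""
    events = []
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append((float(m["ts"]), int(m["cpu"]), int(m["secs"])))
    return events


def longest_stall_per_cpu(events):
    """Map each CPU number to the longest stall (in seconds) reported for it."""
    worst = {}
    for _ts, cpu, secs in events:
        worst[cpu] = max(secs, worst.get(cpu, 0))
    return worst


if __name__ == "__main__":
    # Two lines copied from the console output in this report.
    sample = (
        "[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-\n"
        "[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-\n"
    )
    print(longest_stall_per_cpu(parse_soft_lockups(sample)))
```

A steadily growing stall time on a single CPU, as in the output above, suggests one task that never yields rather than a series of separate short stalls.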
Changed in fabric-manager-535 (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in linux (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in fabric-manager-535 (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → Fix Released
This bug no longer appears to be reproducible on noble with the 6.8 generic kernels, so I have marked it as resolved.
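For anyone re-testing, a quick check that the running kernel is in the 6.8 series or later might look like the sketch below. It assumes release strings shaped like `6.6.0-14-generic`; `kernel_at_least` is a hypothetical helper, and passing the check only shows the kernel is at or above the version where the bug stopped reproducing, not that a specific fix is present.

```python
import platform
import re


def kernel_at_least(release, major, minor):
    """Return True if a release string like '6.6.0-14-generic' is >= major.minor."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` reports
    if kernel_at_least(release, 6, 8):
        print(f"{release}: at or above 6.8")
    else:
        print(f"{release}: older than 6.8, the lockup may still reproduce")
```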