fabric-manager-535 setup fails during install on Grace/Hopper arm64 system running noble

Bug #2052663 reported by Mitchell Augustin
This bug affects 1 person
Affects                                      Status        Importance  Assigned to        Milestone
fabric-manager-535 (Ubuntu)                  Fix Released  Undecided   Mitchell Augustin
linux (Ubuntu)                               Fix Released  Undecided   Mitchell Augustin
nvidia-graphics-drivers-535-server (Ubuntu)  Fix Released  Undecided   Mitchell Augustin

Bug Description

This error occurs on both the standard and largemem variants of the latest Noble server build of Ubuntu:
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic-64k aarch64) (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207.1/noble-live-server-arm64+largemem.iso)
Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64) (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207/noble-live-server-arm64.iso)
CPU/GPU: Nvidia Grace/Hopper

lsb_release -rd:
No LSB modules are available.
Description: Ubuntu Noble Numbat (development branch)
Release: 24.04

Kernel versions affected:
GNU/Linux 6.6.0-14-generic-64k aarch64
GNU/Linux 6.6.0-14-generic aarch64

Package version: nvidia-fabricmanager-535 (535.154.05-0ubuntu1 arm64)

Expected behavior: Package starts as expected during post-install setup steps

Actual behavior:
On our Grace/Hopper system running Noble, installing nvidia-fabricmanager-535 froze at 60% on two separate attempts, and all SSH sessions froze along with it. I am also unable to SSH back into the system after this happens.

This is the last output I see from my installer shell:
+ apt install -y nvidia-fabricmanager-535
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  nvidia-fabricmanager-535
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 1795 kB of archives.
After this operation, 8679 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports noble/multiverse arm64 nvidia-fabricmanager-535 arm64 535.154.05-0ubuntu1 [1795 kB]
Fetched 1795 kB in 1s (2439 kB/s)
Selecting previously unselected package nvidia-fabricmanager-535.
(Reading database ... 103745 files and directories currently installed.)
Preparing to unpack .../nvidia-fabricmanager-535_535.154.05-0ubuntu1_arm64.deb ...
Unpacking nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
Setting up nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /lib/systemd/system/nvidia-fabricmanager.service.

Progress: [ 60%] [#################################################################################.......................................................]

This does not appear to cause a panic or reboot: I can still interact with the console, and the apt process still shows up in ps aux, although it does not seem to make progress. However, I observe the following output on the console, which I believe may be related:
[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
[ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-worker):33269]
[ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-worker):33269]
[ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-worker):33269]
[ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
[ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
[ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-worker):33269]
[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
[ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1630.568973] rcu: 17-...0: (1 GPs behind) idle=2444/1/0x4000000000000000 softirq=13696/13700 fqs=126842
[ 1630.578665] rcu: hardirqs softirqs csw/system
[ 1630.584381] rcu: number: 0 0 0
[ 1630.590109] rcu: cputime: 0 0 0 ==> 1110384(ms)
[ 1630.597458] rcu: (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)
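For anyone triaging a similar hang, the watchdog lines in the console output above can be summarized mechanically. The following is a minimal sketch, not part of the original report: the function name summarize_soft_lockups and the regular expression are illustrative assumptions, written only against the line format shown in this report, and may need adjusting for other kernel versions.

```python
import re

# Assumed pattern, derived only from the watchdog lines quoted in this report, e.g.:
# [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Map (cpu, task, pid) to the longest 'stuck for' duration seen, in seconds."""
    worst = {}
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m is None:
            continue  # not a soft-lockup line (e.g. an rcu stall message)
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        worst[key] = max(worst.get(key, 0), int(m["secs"]))
    return worst


# Two lines copied from the console output in this report.
sample = """\
[ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
[ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
"""

if __name__ == "__main__":
    for (cpu, task, pid), secs in summarize_soft_lockups(sample).items():
        print(f"CPU#{cpu} {task} pid={pid} stuck for up to {secs}s")
```

Feeding it the full console capture (for example, saved dmesg output) would show at a glance that a single udev-worker task on CPU#16 accounts for every soft-lockup message here.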

Changed in fabric-manager-535 (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in linux (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
assignee: nobody → Mitchell Augustin (mitchellaugustin)
Changed in fabric-manager-535 (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu):
status: New → Fix Released
Changed in nvidia-graphics-drivers-535-server (Ubuntu):
status: New → Fix Released
Mitchell Augustin (mitchellaugustin) wrote :

This bug no longer appears to be reproducible on noble with the 6.8 generic kernels, so I have marked it as resolved.
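To check whether a given machine is running one of the kernels on which the hang was reported, or a 6.8 or later kernel on which it no longer reproduced, the release string can be compared numerically. This is a rough sketch under stated assumptions: the helper names and the (6, 8) threshold are illustrative, taken only from the comment above, and the report does not identify which change resolved the issue.

```python
import platform
import re


def kernel_major_minor(release):
    """Extract (major, minor) from a release string such as '6.6.0-14-generic-64k'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2))


def at_or_above(release, threshold=(6, 8)):
    """True if the kernel is at least `threshold`.

    The default (6, 8) is an assumption based on the comment in this report
    that the hang no longer reproduced on the 6.8 generic kernels.
    """
    return kernel_major_minor(release) >= threshold


if __name__ == "__main__":
    rel = platform.release()  # same string that `uname -r` prints
    state = "6.8 or later" if at_or_above(rel) else "older than 6.8"
    print(f"{rel}: {state}")
```

Both kernels listed as affected in this report (6.6.0-14-generic and 6.6.0-14-generic-64k) fall below the threshold in this check.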
