Cannot boot Impish on Cavium ThunderX

Bug #1942633 reported by Diego Mascialino
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
| --- | --- | --- | --- | --- |
| linux (Ubuntu) | Fix Released | Undecided | dann frazier | |

Bug Description

Hi, I am trying to deploy an Impish server (Ubuntu 5.11.0-13.14-generic 5.11.7) on:

arm64
    description: Computer
    product: Cavium ThunderX CN88XX board
    width: 64 bits

I tried 2 releases:
 - 20210817 ( https://images.maas.io/ephemeral-v3/stable/impish/arm64/20210817/ga-21.10/ )
 - 20210830 ( https://images.maas.io/ephemeral-v3/candidate/impish/arm64/20210830/ )

20210817:
 Ends the deploy successfully but fails to boot from disk:

```
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x431f0a11]
[ 0.000000] Linux version 5.13.0-14-generic (buildd@bos02-arm64-018) (gcc (Ubuntu 10.3.0-6ubuntu1) 10.3.0, GNU ld (GNU Binutils for Ubuntu) 2.37) #14-Ubuntu SMP Mon Aug 2 12:40:58 UTC 2021 (Ubuntu 5.13.0-14.14-generic 5.13.1)
[ 0.000000] Machine model: Cavium ThunderX CN88XX board
....
[ 139.896752] raid6: .... xor() 1224 MB/s, rmw enabled
[ 139.957046] raid6: using neon recovery algorithm
[ 140.020937] xor: measuring software checksum speed
[ 140.085571] 8regs : 2432 MB/sec
[ 140.149834] 32regs : 2687 MB/sec
[ 140.212202] arm64_neon : 4390 MB/sec
[ 140.271985] xor: using function: arm64_neon (4390 MB/sec)
[ 140.335013] async_tx: api initialized (async)
done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... [ 140.569847] Btrfs loaded, crc32c=crc32c-generic, zoned=yes
Scanning for Btrfs filesystems
done.
Begin: Waiting for root file system ... Begin: Running /scripts/local-block ... mdadm: No devices listed in conf file were found.
done.
mdadm: No devices listed in conf file were found.
```

20210830

Fails the network boot:
```
[ 140.351066] async_tx: api initialized (async)
done.
Begin: Running /scripts/init-premount ... cloud-initramfs-dyn-netconf: did no find a nic with 1c:1b:0d:0d:52:7c
done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... Begin: Waiting up to 180 secs for BOOTIF to become available ... Failure: Interface BOOTIF did not appear in time
done.
ipconfig: BOOTIF: SIOCGIFINDEX: No such device
ipconfig: no devices to configure
```

I will attach full logs of each try.

JFYI, this is the same machine as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1923230

Diego Mascialino (dmascialino) wrote :
Diego Mascialino (dmascialino) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1942633

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: impish
Diego Mascialino (dmascialino) wrote :

I'm not able to run apport-collect.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Paolo Pisati (p-pisati) wrote :

I can reproduce the issue on phanpy with 5.13.0-14-generic:

...
[ 169.882563] random: fast init done
[ 169.926830] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
done.
Begin: Running /scripts/local-bottom ... done.
Begin: Running /scripts/init-bottom ... Failed to send exit request: Connection refused
done.
[ 171.269768] systemd[1]: Inserted module 'autofs4'
[ 171.421523] systemd[1]: systemd 248.3-1ubuntu3 running in system mode. (+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS -OPENSSL +ACL +BLKID +CURL +ELFUTILS -FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP -LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=hybrid)
[ 171.463836] systemd[1]: Detected architecture arm64.

Welcome to Ubuntu Impish Indri (development branch)!

[ 171.523576] systemd[1]: Hostname set to <phanpy>.
[ 173.864657] random: crng init done
...

I've noticed some stack traces during boot; I'll try to extract them.

Paolo Pisati (p-pisati) wrote :
Paolo Pisati (p-pisati) wrote :

With 5.13.0-16.16, the issue is way more evident.

Paolo Pisati (p-pisati) wrote :

5.14.0-9.9 exhibits kernel stack corruption on boot.

Paolo Pisati (p-pisati) wrote :
Paolo Pisati (p-pisati) wrote :
Paolo Pisati (p-pisati) wrote :
dann frazier (dannf) wrote :

I can attempt a bisect - let me know if someone else is already doing that :)

Paolo Pisati (p-pisati) wrote :

Feel free to bisect; I'm checking our SAUCE patches/configs.

BTW:

Firmware Version: 2017-10-12 12:34:31

It's the latest FW version available, isn't it? I'm wondering about the device tree.

dann frazier (dannf) wrote :

For the CRB1S systems, the following is the firmware string that matters, and it shows that phanpy is running the latest (and very likely final) version:

BIOS Date: 06/14/2018 14:42:48 Ver: 0ACGA022

Note that this is also reproducible on our Gigabyte R120 systems which use acpi=force by default, which suggests this is not a device tree problem.

I can reproduce with upstream v5.13, and I've a bisect between that and v5.11 in progress.

dann frazier (dannf) wrote :

First bad commit:

commit 9ec37efb87832b578d7972fc80b04d94f5d2bbe3 (HEAD, refs/bisect/bad)
Author: Marc Zyngier <email address hidden>
Date: Tue Mar 30 16:11:42 2021 +0100

    PCI/MSI: Make pci_host_common_probe() declare its reliance on MSI domains

    The generic PCI host driver relies on MSI domains for MSIs to
    be provided to its end-points. Make this dependency explicit.

    This cures the warnings occuring on arm/arm64 VMs when booted
    with PCI virtio devices and no MSI controller (no GICv3 ITS,
    for example).

    It is likely that other drivers will need to express the same
    dependency.

    Link: https://<email address hidden>
    Signed-off-by: Marc Zyngier <email address hidden>
    Signed-off-by: Lorenzo Pieralisi <email address hidden>
    Acked-by: Bjorn Helgaas <email address hidden>

Paolo Pisati (p-pisati) wrote :

I think it's a red herring: even after reverting this commit on top of 5.13-rc1 (the first tag showing the issue) or Impish/master-next HEAD, the kernel still oopses and hangs there.

dann frazier (dannf) wrote :

hm.. you're right. I was tracking down a different issue which had the following backtrace:

[ 10.701967] Call trace:
[ 10.704415] ata_host_activate+0x160/0x170
[ 10.708518] ahci_host_activate+0x170/0x1e0
[ 10.712711] ahci_init_one+0x898/0xd74 [ahci]
[ 10.717116] local_pci_probe+0x4c/0xc0
[ 10.720883] work_for_cpu_fn+0x28/0x40
[ 10.724635] process_one_work+0x20c/0x4d0
[ 10.728647] worker_thread+0x250/0x564
[ 10.732393] kthread+0x134/0x140
[ 10.735616] ret_from_fork+0x10/0x18

Reverting the bisect-id'd commit from upstream 5.13-rc1 does fix this issue for me, but perhaps it was already fixed in v5.13 final.

I *can* reproduce the backtrace here w/ the archive 5.13.0-16, but I cannot with a locally built 5.13-rc1 (w/ the above patch reverted). The fact that you can reproduce with 5.13-rc1 makes me wonder if the issue may be toolchain-related. I'm building in a hirsute environment. I'll try building 5.13.0-16 in a hirsute environment to test that theory.

dann frazier (dannf) wrote :

Update: The issue does not follow the toolchain, which is good. Rather, I found that upstream v5.13-rc1 is stable - with the patch in comment #15 reverted - while upstream v5.13 is not. I bisected v5.13-rc1..v5.13, reverting the comment #15 patch at each test. A "bad" kernel wouldn't always fail the same way, but would always fail before completing boot. The bisect hit this commit:

commit 0c6c2d3615efb7c292573f2e6c886929a2b2da6c (HEAD, refs/bisect/bad)
Author: Mark Brown <email address hidden>
Date: Wed Apr 28 13:12:31 2021 +0100

    arm64: Generate cpucaps.h

While this looks innocuous, it is messing with the code that chooses which "features" a CPU has, which includes errata that may need kernel workarounds. So I went back and compared the CPU features messages between a "good" kernel and a "bad" one. Noticeably missing from a "bad" one was this message:

[ 0.000000] CPU features: kernel page table isolation forced OFF by ARM64_WORKAROUND_CAVIUM_27456

I went back and tested Ubuntu's 5.13.0-16 w/ kpti=off, and it booted fine. So does upstream v5.15-rc2 w/ kpti=off, where previously I was also seeing stack overflows/corruption. So it seems the above change is likely the problem; the next step is to figure out why.
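
The good/bad comparison described above can be sketched in Python. This is only an illustration, not the procedure actually used in this bug: the function names are hypothetical, and it assumes each log is plain dmesg text whose lines start with a `[ timestamp]` prefix like the excerpts in this report.

```python
import re

# Matches a leading dmesg timestamp such as "[    0.000000] ".
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def cpu_feature_messages(dmesg_text):
    """Return the set of 'CPU features:' messages with timestamps stripped."""
    messages = set()
    for line in dmesg_text.splitlines():
        message = _TIMESTAMP.sub("", line.strip())
        if message.startswith("CPU features:"):
            messages.add(message)
    return messages


def missing_from_bad(good_text, bad_text):
    """List feature messages present in the good log but absent from the bad one."""
    return sorted(cpu_feature_messages(good_text) - cpu_feature_messages(bad_text))
```

Calling `missing_from_bad()` with the text of a "good" boot log and a "bad" one would list messages such as the ARM64_WORKAROUND_CAVIUM_27456 line if it only shows up in the good log. Timestamps are stripped first so that timing differences between boots do not show up as false differences.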

dann frazier (dannf) wrote :

Root caused and reported upstream:
 https://www.spinics.net/lists/arm-kernel/msg921821.html

Changed in linux (Ubuntu):
assignee: nobody → dann frazier (dannf)
dann frazier (dannf) wrote :
Changed in linux (Ubuntu):
status: Confirmed → In Progress
dann frazier (dannf) wrote :

A patch has been merged into the arm64 tree and is tagged for stable 5.13+

Diego Mascialino (dmascialino) wrote :

Thanks a lot, dann.

I don't know the kernel release process, or when CPC generates our public images.

I have to test it once it is uploaded here: https://images.maas.io/ephemeral-v3/candidate/impish/arm64/

Do you have any tips on how to find out when that happens?

dann frazier (dannf) wrote :

The patch was merged into 5.13.0-17.17, which is currently in impish-proposed. At some point it should migrate to the release pocket. You can view the status here:
 https://launchpad.net/ubuntu/+source/linux

I don't know how long after that it will be before it appears in images. As a workaround, you can test with kpti=off on the kernel command line.
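
To confirm that a test boot really used the workaround, the running kernel's command line can be checked for the `kpti=off` token. The following is a minimal Python sketch; the helper names are hypothetical, and it assumes a Linux system where `/proc/cmdline` is readable.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, `has_kernel_param(read_cmdline(), "kpti=off")` is true only when the exact token `kpti=off` was passed at boot. Matching whole tokens avoids false positives from parameters that merely contain the same substring.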

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-hwe-5.13/5.13.0-17.17~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
dann frazier (dannf)
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
dann frazier (dannf) wrote :

Verified:

ubuntu@doerfel:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.0-18-generic root=UUID=36113164-ebf4-4fed-9d98-4b0b859bf98e ro acpi=force
ubuntu@doerfel:~$ cat /proc/version
Linux version 5.13.0-18-generic (buildd@bos02-arm64-027) (gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0, GNU ld (GNU Binutils for Ubuntu) 2.37) #18-Ubuntu SMP Mon Oct 4 14:52:32 UTC 2021

tags: added: verification-done-focal
removed: verification-needed-focal
dann frazier (dannf) wrote :

Also good in focal:
ubuntu@doerfel:~$ cat /proc/version
Linux version 5.13.0-17-generic (buildd@bos02-arm64-034) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #17~20.04.1-Ubuntu SMP Tue Sep 28 14:05:10 UTC 2021

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.13.0-19.19

---------------
linux (5.13.0-19.19) impish; urgency=medium

  * impish/linux: 5.13.0-19.19 -proposed tracker (LP: #1946337)

  * impish:linux-aws 5.13 panic during systemd autotest (LP: #1946001)
    - [Config] disable KFENCE

 -- Andrea Righi <email address hidden> Thu, 07 Oct 2021 11:09:51 +0200

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released