[SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC

Bug #1917471 reported by prabhakar pujeri
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Medium      Jeff Lane
Focal           Fix Released  Medium      Jeff Lane
Impish          Fix Released  Medium      Jeff Lane
Jammy           Fix Released  Medium      Jeff Lane

Bug Description

SRU Justification:

[IMPACT]

A hardware partner reported this issue. Their internal testing teams see it frequently, and customers who find these messages in their logs report it often enough to generate an unusually high volume of support calls from the field.

In 5.4, commit d60cd06331a3566d3305b3c7b566e79edf4e2095 was introduced upstream and pulled into Ubuntu between 5.4.0-58.64 and 5.4.0-59.65. Upstream, these errors were discovered and that patch was reverted (see Fix Below). We carry the revert commit in all subsequent Focal HWE kernels starting at 5.12, but the fix was never pulled back into Focal 5.4.

According to the hardware partner, the following error messages are observed when rebooting a machine that uses the BCM5720 chipset, a widely used 1GbE controller found on LOMs and OCP NICs as well as many PCIe NIC models:

[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
[ 148.088133] [qede_unload:2236(eno12409)]Link is down
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:2236(eno12399)]Link is down
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
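
The aer_uncor_status and aer_uncor_mask values in the log above can be decoded by hand. The following is a minimal sketch, assuming the standard PCIe AER uncorrectable-error bit layout and using short names in the style the kernel log prints. The table and the function name `decode_aer_uncor` are illustrative and not copied from kernel code.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# Short names follow the style seen in the kernel log (e.g. "UnsupReq").
AER_UNCOR_BITS = {
    4: "DLP",
    5: "SDES",
    12: "TLP",
    13: "FCP",
    14: "CmpltTO",
    15: "CmpltAbrt",
    16: "UnxCmplt",
    17: "RxOF",
    18: "MalfTLP",
    19: "ECRC",
    20: "UnsupReq",
    21: "ACSViol",
}


def decode_aer_uncor(value):
    """Return (bit, name) for every bit set in an AER uncorrectable register value."""
    return [
        (bit, AER_UNCOR_BITS.get(bit, "unknown"))
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values taken from the log in this report.
status = decode_aer_uncor(0x00100000)
mask = decode_aer_uncor(0x00010000)
print("status:", status)
print("mask:", mask)
```

The single status bit decodes to bit 20, which agrees with the "[20] UnsupReq (First)" line that tg3 prints in the same log.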

The hardware partner did some bisection and observed the following:

Kernel version  Fatal Error
5.4.0-42.46     No
5.4.0-45.49     No
5.4.0-47.51     No
5.4.0-48.52     No
5.4.0-51.56     No
5.4.0-52.57     No
5.4.0-53.59     No
5.4.0-54.60     No
5.4.0-58.64     No
5.4.0-59.65     Yes
5.4.0-60.67     Yes
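
The regression window can be read directly off the table above. As a small sketch, the results are copied into a list and scanned for the first failing version. The helper name `regression_window` is made up for illustration.

```python
# (kernel version, fatal error seen) pairs copied from the bisection table.
RESULTS = [
    ("5.4.0-42.46", False),
    ("5.4.0-45.49", False),
    ("5.4.0-47.51", False),
    ("5.4.0-48.52", False),
    ("5.4.0-51.56", False),
    ("5.4.0-52.57", False),
    ("5.4.0-53.59", False),
    ("5.4.0-54.60", False),
    ("5.4.0-58.64", False),
    ("5.4.0-59.65", True),
    ("5.4.0-60.67", True),
]


def regression_window(results):
    """Return (last_good, first_bad) from results ordered oldest to newest."""
    last_good = None
    for version, failed in results:
        if failed:
            return last_good, version
        last_good = version
    return last_good, None


print(regression_window(RESULTS))
```

This matches the statement in [IMPACT] that the offending commit was pulled in between 5.4.0-58.64 and 5.4.0-59.65.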

[FIX]
The fix is to apply this patch from upstream:

commit 9d3fcb28f9b9750b474811a2964ce022df56336e
Author: Josef Bacik <email address hidden>
Date: Tue Mar 16 22:17:48 2021 -0400

    Revert "PM: ACPI: reboot: Use S5 for reboot"

    This reverts commit d60cd06331a3566d3305b3c7b566e79edf4e2095.

    This patch causes a panic when rebooting my Dell Poweredge r440. I do
    not have the full panic log as it's lost at that stage of the reboot and
    I do not have a serial console. Reverting this patch makes my system
    able to reboot again.

Example:
https://code.launchpad.net/~bladernr/ubuntu/+source/linux/+git/focal/+ref/1917471

The hardware partner has preemptively pulled our 5.4 tree, applied the fix and tested it in their labs and determined that this does resolve the issue.

[TEST CASE]
Install the patched kernel on a machine that uses a BCM5720 LOM, reboot the machine, and confirm that the errors no longer appear.
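
One way to make the check repeatable is to scan a captured reboot log for the error lines. This is only a sketch, assuming the shutdown messages were saved to text (for example from a serial console). The patterns are taken from the log in this report, and the function name `find_reboot_errors` is hypothetical.

```python
import re

# Patterns based on the error lines shown in the log in this report.
ERROR_PATTERNS = [
    re.compile(r"\[Hardware Error\]"),
    re.compile(r"\btg3 \S+: AER:"),
]


def find_reboot_errors(log_text):
    """Return every log line that matches one of the known error patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]


# Sample lines excerpted from the log in this report.
SAMPLE_BAD = """\
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.933984] reboot: Restarting system
"""

# The same excerpt without the error lines, as a passing case.
SAMPLE_GOOD = """\
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.933984] reboot: Restarting system
"""

for name, text in (("bad", SAMPLE_BAD), ("good", SAMPLE_GOOD)):
    errors = find_reboot_errors(text)
    print(name, "PASS" if not errors else "FAIL ({} error lines)".format(len(errors)))
```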

Revision history for this message
Jeff Lane  (bladernr) wrote :

The original project wasn't correct. Moving this to the kernel to be examined, as this is potentially a regression.

affects: kernel-sru-workflow → linux
summary: - Bus Fatal Error observed when reboot on BCM5720
+ [Regression] Bus Fatal Error observed when reboot on BCM5720
Revision history for this message
Michael Reed (mreed8855) wrote : Re: [Regression] Bus Fatal Error observed when reboot on BCM5720

This issue only happens on the 5.4 kernel. It does not occur on HWE kernel versions.

Revision history for this message
Michael Reed (mreed8855) wrote :

I have created a test kernel with this patch removed. I cannot guarantee this patch will be removed if this works because I do not know what the regression risk is.

https://people.canonical.com/~mreed/lp_1917471/

Revision history for this message
Narendra K (knarendra) wrote :

Michael,

Without the "PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI" patch, the issue is still observed. I added that patch back to the 20.04 git tree and also added the following patch:

Revert "PM: ACPI: reboot: Use S5 for reboot"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel?h=v5.17-rc7&id=9d3fcb28f9b9750b474811a2964ce022df56336e

With the patch Revert "PM: ACPI: reboot: Use S5 for reboot" added to the 20.04 git tree, the issue is not observed.

Could the patch Revert "PM: ACPI: reboot: Use S5 for reboot" be pulled into one of the upcoming SRUs?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Does removing the `tg3` module prior to reboot help?

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

Hi Jeff,

Could you please help take this Focal SRU request forward? We are requesting to include this patch into LTS kernel v5.4 of focal.

Revert "PM: ACPI: reboot: Use S5 for reboot"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel?h=v5.17-rc7&id=9d3fcb28f9b9750b474811a2964ce022df56336e

Revision history for this message
Jeff Lane  (bladernr) wrote :

Hi Sujith, Kai-Heng is examining this and asked about removing tg3 prior to reboot. I suppose that means running `rmmod tg3` from a shell before you do the reboot.

We can look at adding that patch.

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

@Vinay - Please share an update on this trial:
Run `rmmod tg3` before giving the 'reboot' command, then check whether the issue appears.
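
Before rebooting, it may help to confirm that the module is really gone. This is a minimal sketch that checks text in the /proc/modules format, where the module name is the first field on each line. The sample contents below are illustrative and not taken from an affected system.

```python
def module_loaded(name, modules_text):
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


# Illustrative /proc/modules contents (sizes and addresses are made up).
BEFORE_RMMOD = """\
tg3 172032 0 - Live 0x0000000000000000
libphy 155648 1 tg3, Live 0x0000000000000000
"""
AFTER_RMMOD = """\
libphy 155648 0 - Live 0x0000000000000000
"""

print(module_loaded("tg3", BEFORE_RMMOD))
print(module_loaded("tg3", AFTER_RMMOD))
```

On a live system the text would come from reading /proc/modules, for example `module_loaded("tg3", open("/proc/modules").read())`.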

Jeff Lane  (bladernr)
no longer affects: linux (Ubuntu)
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Jeff Lane  (bladernr)
affects: linux → linux (Ubuntu)
Changed in linux (Ubuntu Focal):
status: New → In Progress
importance: Undecided → Medium
Changed in linux (Ubuntu Impish):
importance: Undecided → Medium
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu Impish):
assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu Jammy):
assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu Impish):
status: New → In Progress
Changed in linux (Ubuntu Jammy):
status: New → In Progress
Revision history for this message
Jeff Lane  (bladernr) wrote :

The patch you're requesting reverts this patch:
commit d60cd06331a3566d3305b3c7b566e79edf4e2095
Author: Kai-Heng Feng <email address hidden>
Date: Fri Oct 30 15:06:57 2020 +0800

    PM: ACPI: reboot: Use S5 for reboot

    After reboot, it's not possible to use hotkeys to enter BIOS setup
    and boot menu on some HP laptops.

    BIOS folks identified the root cause is the missing _PTS call, and
    BIOS is expecting _PTS to do proper reset.

    Using S5 for reboot is default behavior under Windows, "A full
    shutdown (S5) occurs when a system restart is requested" [1], so
    let's do the same here.

    [1] https://docs.microsoft.com/en-us/windows/win32/power/system-power-states

    Signed-off-by: Kai-Heng Feng <email address hidden>
    [ rjw: Subject edit ]
    Signed-off-by: Rafael J. Wysocki <email address hidden>

It looks like this was applied to 5.4 and the patch that reverts this was pulled into 5.13 and later, so that is why 5.4 is the only affected version. I was confused as to why this wasn't also appearing in Impish and Jammy. Now we know. So I can close those tasks.

Changed in linux (Ubuntu Impish):
status: In Progress → Fix Released
Changed in linux (Ubuntu Jammy):
status: In Progress → Fix Released
Revision history for this message
Vinay HM (vinay-hm) wrote :

@sujith
In reply to comment #8.
Issue is still observed even after removing tg3 driver by using "rmmod tg3".
Please find the serial logs of the efforts.

Revision history for this message
Vinay HM (vinay-hm) wrote :
Revision history for this message
Sujith Pandel (sujithpandel) wrote :

Hi Jeff,
I think we should now go ahead with the SRU request for the fix.
rmmod tg3 does not help work around the issue.

Fix details:
Revert "PM: ACPI: reboot: Use S5 for reboot"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel?h=v5.17-rc7&id=9d3fcb28f9b9750b474811a2964ce022df56336e

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Sujith, does the issue happen when "tg3" is blacklisted? For example, by adding "blacklist=tg3 modprobe.blacklist=tg3" to the kernel parameters.

Revision history for this message
Vinay HM (vinay-hm) wrote :

@Kai-Heng Feng
In reply to comment #14
Issue is still observed even after blacklisting tg3 driver by giving "blacklist=tg3 modprobe.blacklist=tg3" in kernel parameter.
Please find the serial logs of the efforts.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please try the "reboot=" kernel parameter? The value can be "bios, acpi, kbd, triple, efi, or pci".
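
When cycling through several values, it is easy to lose track of which one a given boot actually used. This is a small sketch that pulls a parameter out of kernel command-line text, as found in /proc/cmdline. The sample command line and the helper name `get_kernel_param` are hypothetical.

```python
# Values suggested in the comment above.
SUGGESTED_MODES = ("bios", "acpi", "kbd", "triple", "efi", "pci")


def get_kernel_param(cmdline, key):
    """Return the value of the last `key=value` token on a command line, or None."""
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if sep and name == key:
            value = val
    return value


# Hypothetical command line for illustration only.
SAMPLE_CMDLINE = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro modprobe.blacklist=tg3 reboot=pci"

mode = get_kernel_param(SAMPLE_CMDLINE, "reboot")
print(mode, mode in SUGGESTED_MODES)
```

The same helper can confirm that "modprobe.blacklist=tg3" was present for the blacklist trial.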

Jeff Lane  (bladernr)
description: updated
Jeff Lane  (bladernr)
description: updated
summary: - [Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
summary: - [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which
+ causes Bus Fatal Error when rebooting system with BCM5720 NIC
Jeff Lane  (bladernr)
description: updated
Revision history for this message
Vinay HM (vinay-hm) wrote :

@Kai-Heng Feng
In reply to comment #16
Can you please try "reboot=" kernel parameter? the value can be "bios, acpi, kbd, triple, efi, or pci".
--> I have tried passing each of the kernel parameter values mentioned above, and every time I have observed the fatal error.
Attaching console logs of the efforts.

Revision history for this message
Jeff Lane  (bladernr) wrote :

PR made for the Focal part of this

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-110.124 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!
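
Before verifying, it is worth checking that the kernel under test is at least the -proposed version named above. This is a minimal sketch that compares Ubuntu kernel package versions of the form "5.4.0-110.124" numerically. It assumes the Focal 5.4 generic series only, and the function names are made up for illustration.

```python
import re

# Version named in the verification request above.
FIXED_IN = "5.4.0-110.124"


def parse_kernel_version(version):
    """Split a version like '5.4.0-110.124' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def has_fix(version, fixed_in=FIXED_IN):
    """Return True if `version` is the fixed version or newer (same series assumed)."""
    return parse_kernel_version(version) >= parse_kernel_version(fixed_in)


for candidate in ("5.4.0-58.64", "5.4.0-60.67", "5.4.0-110.124"):
    print(candidate, has_fix(candidate))
```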

tags: added: verification-needed-focal
Revision history for this message
Vinay HM (vinay-hm) wrote :

In reply to comment 19
I have tried reproducing the fatal-error issue by using focal proposed kernel (linux-image-unsigned-5.4.0-110-generic_5.4.0-110.124_amd64.deb) - Issue is no longer seen.
Attaching serial console logs of the efforts.

Michael Reed (mreed8855)
tags: added: verification-done-focal
removed: verification-needed-focal
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-110.124

---------------
linux (5.4.0-110.124) focal; urgency=medium

  * focal/linux: 5.4.0-110.124 -proposed tracker (LP: #1969053)

  * net/mlx5e: Fix page DMA map/unmap attributes (LP: #1967292)
    - net/mlx5e: Fix page DMA map/unmap attributes

  * xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()
    (LP: #1966803)
    - xfs: Fix deadlock between AGI and AGF when target_ip exists in xfs_rename()

  * LRMv6: add multi-architecture support (LP: #1968774)
    - [Packaging] resync dkms-build{,--nvidia-N}

  * xfrm interface cannot be changed anymore (LP: #1968591)
    - xfrm: fix the if_id check in changelink

  * Use kernel-testing repo from launchpad for ADT tests (LP: #1968016)
    - [Debian] Use kernel-testing repo from launchpad

  * vmx_ldtr_test in ubuntu_kvm_unit_tests failed (FAIL: Expected 0 for L1 LDTR
    selector (got 50)) (LP: #1956315)
    - KVM: nVMX: Set LDTR to its architecturally defined value on nested VM-Exit

  * [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes
    Bus Fatal Error when rebooting system with BCM5720 NIC (LP: #1917471)
    - Revert "PM: ACPI: reboot: Use S5 for reboot"

  * Focal update: v5.4.181 upstream stable release (LP: #1967582)
    - Makefile.extrawarn: Move -Wunaligned-access to W=1
    - HID:Add support for UGTABLET WP5540
    - Revert "svm: Add warning message for AVIC IPI invalid target"
    - serial: parisc: GSC: fix build when IOSAPIC is not set
    - parisc: Drop __init from map_pages declaration
    - parisc: Fix data TLB miss in sba_unmap_sg
    - parisc: Fix sglist access in ccio-dma.c
    - btrfs: send: in case of IO error log it
    - platform/x86: ISST: Fix possible circular locking dependency detected
    - selftests: rtc: Increase test timeout so that all tests run
    - net: ieee802154: at86rf230: Stop leaking skb's
    - selftests/zram: Skip max_comp_streams interface on newer kernel
    - selftests/zram01.sh: Fix compression ratio calculation
    - selftests/zram: Adapt the situation that /dev/zram0 is being used
    - ax25: improve the incomplete fix to avoid UAF and NPD bugs
    - vfs: make freeze_super abort when sync_filesystem returns error
    - quota: make dquot_quota_sync return errors from ->sync_fs
    - nvme: fix a possible use-after-free in controller reset during load
    - nvme-tcp: fix possible use-after-free in transport error_recovery work
    - nvme-rdma: fix possible use-after-free in transport error_recovery work
    - drm/amdgpu: fix logic inversion in check
    - Revert "module, async: async_synchronize_full() on module init iff async is
      used"
    - ftrace: add ftrace_init_nop()
    - module/ftrace: handle patchable-function-entry
    - arm64: module: rework special section handling
    - arm64: module/ftrace: intialize PLT at load time
    - iwlwifi: fix use-after-free
    - drm/radeon: Fix backlight control on iMac 12,1
    - ext4: check for out-of-order index extents in ext4_valid_extent_entries()
    - ext4: check for inconsistent extents between index and leaf block
    - ext4: prevent partial update of the extent blocks
    - taskstats: Cleanup t...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can anyone please give this test kernel a try:
https://people.canonical.com/~khfeng/lp1917471/

The kernel uses S5 reboot, but ignores AER for GHES.

Revision history for this message
Vinay HM (vinay-hm) wrote :

In reply to comment 22
I have tried reproducing the fatal-error issue using the provided test kernel (linux-image-unsigned-5.18.0-4-generic_5.18.0-4.4_amd64.deb), and the issue is still seen.
Attaching serial logs of the efforts.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Vinay, I am running out of ideas. Are all affected systems Dell ones? I am thinking of a simple DMI quirk for Dell platforms.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The AER error seems to be an MSI/MSI-X interrupt, and the only difference I can find between tg3_remove_one() and tg3_shutdown() is the former clears the bus mastering bit. So if possible please give the following test kernel a try and see if using ACPI S5 for reboot is okay with the change:
https://people.canonical.com/~khfeng/clear-bus-master/
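
For reference, the error log in the bug description shows "command: 0x0006" for the device at the time of the error. This is a minimal sketch of what that value means and what clearing the bus mastering bit does to it, assuming the standard PCI command register layout. It models the register arithmetic only and is not the tg3 driver's code.

```python
# Selected bits of the standard PCI command register.
PCI_COMMAND_BITS = {
    0: "I/O Space",
    1: "Memory Space",
    2: "Bus Master",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

BUS_MASTER_BIT = 2


def decode_pci_command(value):
    """Return the names of the known bits set in a PCI command register value."""
    return [name for bit, name in sorted(PCI_COMMAND_BITS.items()) if value & (1 << bit)]


def clear_bus_master(value):
    """Return the command register value with the Bus Master bit cleared."""
    return value & ~(1 << BUS_MASTER_BIT)


command = 0x0006  # value from the error log in this report
print(decode_pci_command(command))
print(hex(clear_bus_master(command)))
```

With bus mastering cleared, only Memory Space remains enabled in this example.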

Revision history for this message
Michael Reed (mreed8855) wrote :

Hi Vinay,

Can you please test Kai-Heng's kernel in comment #25? He is looking for a solution that addresses the root cause of this issue and would also allow the original patch to be reinstated, because the original patch is a fix for other issues.

Thanks,
Michael

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

>https://people.canonical.com/~khfeng/clear-bus-master/

We see the original issue with this kernel.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Please give this one a try:
https://people.canonical.com/~khfeng/tg3-poweroff/

In addition to clearing bus master, this kernel also powers down tg3 device. Hopefully this can avoid the unwanted MSI interrupt.

Revision history for this message
Sujith Pandel (sujithpandel) wrote :

>https://people.canonical.com/~khfeng/tg3-poweroff/

This one does not show the issue.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Thanks a lot! I'll send the patch to upstream based on your test result.

Really appreciated!
