Some SPR systems throw kernel warnings from uncore_discovery.c

Bug #2049637 reported by Jeff Lane 
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
intel            Confirmed  Medium      Unassigned
linux (Ubuntu)   Confirmed  Medium      Unassigned
Jammy            Confirmed  Medium      Unassigned

Bug Description

[Impact]
On some Sapphire Rapids (SPR) CPUs we are seeing kernel warnings in kern.log:
https://certification.canonical.com/hardware/202311-32288/submission/341156/
Intel(R) Xeon(R) Gold 6442Y

Oct 31 03:35:55 N8 kernel: [ 92.770372] ------------[ cut here ]------------
Oct 31 03:35:55 N8 kernel: [ 92.825738] WARNING: CPU: 48 PID: 1 at arch/x86/events/intel/uncore_discovery.c:184 uncore_insert_box_info+0x134/0x350
Oct 31 03:35:55 N8 kernel: [ 92.953850] Modules linked in:
Oct 31 03:35:55 N8 kernel: [ 92.990464] CPU: 48 PID: 1 Comm: swapper/0 Not tainted 5.15.0-88-generic #98-Ubuntu
Oct 31 03:35:55 N8 kernel: [ 93.082179] Hardware name: ASUSTeK COMPUTER INC. ESC N8-E11/Z13PN-D32 Series, BIOS 0402 09/08/2023
Oct 31 03:35:55 N8 kernel: [ 93.189501] RIP: 0010:uncore_insert_box_info+0x134/0x350
Oct 31 03:35:55 N8 kernel: [ 93.206419] Freeing initrd memory: 106936K
Oct 31 03:35:55 N8 kernel: [ 93.253138] Code: c2 01 48 83 c0 04 39 d1 0f 8e c6 01 00 00 49 8b 4c 24 38 8b 0c 01 41 89 0c 07 49 8b 74 24 40 8b 34 06 41 89 34 06 39 f9 75 cf <0f> 0b 4c 89 ff e8 b2 07 33 00 4c 89 f7 e8 aa 07 33 00 5b 41 5c 41
Oct 31 03:35:55 N8 kernel: [ 93.527071] RSP: 0000:ff5c25ed800efc98 EFLAGS: 00010246
Oct 31 03:35:55 N8 kernel: [ 93.589669] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
Oct 31 03:35:55 N8 kernel: [ 93.675160] RDX: 0000000000000002 RSI: 0000000000018000 RDI: 0000000000000003
Oct 31 03:35:55 N8 kernel: [ 93.760654] RBP: ff5c25ed800efcc0 R08: 0000000000000010 R09: ff32ac8a801df260
Oct 31 03:35:55 N8 kernel: [ 93.846130] R10: 0000000000000246 R11: 00000000ffffffff R12: ff32ac8a8b8412a0
Oct 31 03:35:55 N8 kernel: [ 93.931613] R13: ff5c25ed800efcf8 R14: ff32ac8a8aa32cb0 R15: ff32ac8a801df260
Oct 31 03:35:55 N8 kernel: [ 94.017099] FS: 0000000000000000(0000) GS:ff32ac99bfa00000(0000) knlGS:0000000000000000
Oct 31 03:35:55 N8 kernel: [ 94.114042] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 31 03:35:55 N8 kernel: [ 94.182871] CR2: 0000000000000000 CR3: 0000000d07e10001 CR4: 0000000000771ee0
Oct 31 03:35:55 N8 kernel: [ 94.268360] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 31 03:35:55 N8 kernel: [ 94.353828] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Oct 31 03:35:55 N8 kernel: [ 94.439332] PKRU: 55555554
Oct 31 03:35:55 N8 kernel: [ 94.471788] Call Trace:
Oct 31 03:35:55 N8 kernel: [ 94.501100] <TASK>
Oct 31 03:35:55 N8 kernel: [ 94.526275] ? show_trace_log_lvl+0x1d6/0x2ea
Oct 31 03:35:55 N8 kernel: [ 94.578457] ? show_trace_log_lvl+0x1d6/0x2ea
Oct 31 03:35:55 N8 kernel: [ 94.630686] ? parse_discovery_table.isra.0+0x162/0x1a0
Oct 31 03:35:55 N8 kernel: [ 94.693295] ? show_regs.part.0+0x23/0x29
Oct 31 03:35:55 N8 kernel: [ 94.741331] ? show_regs.cold+0x8/0xd
Oct 31 03:35:55 N8 kernel: [ 94.785212] ? uncore_insert_box_info+0x134/0x350
Oct 31 03:35:55 N8 kernel: [ 94.841591] ? __warn+0x8c/0x100
Oct 31 03:35:55 N8 kernel: [ 94.880281] ? uncore_insert_box_info+0x134/0x350
Oct 31 03:35:55 N8 kernel: [ 94.936636] ? report_bug+0xa4/0xd0
Oct 31 03:35:55 N8 kernel: [ 94.978460] ? handle_bug+0x39/0x90
Oct 31 03:35:55 N8 kernel: [ 95.020246] ? exc_invalid_op+0x19/0x70
Oct 31 03:35:55 N8 kernel: [ 95.066232] ? asm_exc_invalid_op+0x1b/0x20
Oct 31 03:35:55 N8 kernel: [ 95.116341] ? uncore_insert_box_info+0x134/0x350
Oct 31 03:35:55 N8 kernel: [ 95.172708] ? uncore_insert_box_info+0xe3/0x350
Oct 31 03:35:55 N8 kernel: [ 95.228032] parse_discovery_table.isra.0+0x162/0x1a0
Oct 31 03:35:55 N8 cloud-init[1992]: |.+.o .o .o o +|
Oct 31 03:35:55 N8 kernel: [ 95.288570] intel_uncore_has_discovery_tables+0x19e/0x270
Oct 31 03:35:55 N8 kernel: [ 95.354298] ? type_pmu_register+0x2f/0x42
Oct 31 03:35:55 N8 kernel: [ 95.403385] intel_uncore_init+0xe3/0x226
Oct 31 03:35:55 N8 kernel: [ 95.451409] ? type_pmu_register+0x42/0x42
Oct 31 03:35:55 N8 kernel: [ 95.500506] do_one_initcall+0x46/0x1e0
Oct 31 03:35:55 N8 kernel: [ 95.546475] do_initcalls+0x12f/0x159
Oct 31 03:35:55 N8 kernel: [ 95.590372] kernel_init_freeable+0x162/0x1b5
Oct 31 03:35:55 N8 kernel: [ 95.642556] ? rest_init+0x100/0x100
Oct 31 03:35:55 N8 kernel: [ 95.685405] kernel_init+0x1b/0x150
Oct 31 03:35:55 N8 kernel: [ 95.727228] ? rest_init+0x100/0x100
Oct 31 03:35:55 N8 kernel: [ 95.770054] ret_from_fork+0x1f/0x30
Oct 31 03:35:55 N8 kernel: [ 95.812906] </TASK>
Oct 31 03:35:55 N8 kernel: [ 95.839108] ---[ end trace 2d0c57130f45fd62 ]---

https://certification.canonical.com/hardware/202305-31570/submission/312593/
Intel(R) Xeon(R) Gold 6426Y
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135184] ------------[ cut here ]------------
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135185] WARNING: CPU: 0 PID: 1 at arch/x86/events/intel/uncore_discovery.c:184 uncore_insert_box_info+0x134/0x350
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135192] Modules linked in:
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135194] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-69-generic #76-Ubuntu
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135198] Hardware name: HPE ProLiant ML110 Gen11/ProLiant ML110 Gen11, BIOS 1.30 03/01/2023
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135200] RIP: 0010:uncore_insert_box_info+0x134/0x350
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135202] Code: c2 01 48 83 c0 04 39 d1 0f 8e c6 01 00 00 49 8b 4c 24 38 8b 0c 01 41 89 0c 07 49 8b 74 24 40 8b 34 06 41 89 34 06 39 f9 75 cf <0f> 0b 4c 89 ff e8 22 a2 32 00 4c 89 f7 e8 1a a2 32 00 5b 41 5c 41
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135206] RSP: 0000:ff3b3e198006bc98 EFLAGS: 00010246
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135209] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135210] RDX: 0000000000000002 RSI: 0000000000018000 RDI: 0000000000000003
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135212] RBP: ff3b3e198006bcc0 R08: 0000000000000010 R09: ff31766844f3c5e0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135214] R10: ff31766844fa4438 R11: 0000000000000000 R12: ff31766844f5fa20
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135216] R13: ff3b3e198006bcf8 R14: ff31766844f3ca20 R15: ff31766844f3c5e0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135218] FS: 0000000000000000(0000) GS:ff3176e5bf800000(0000) knlGS:0000000000000000
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135220] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135222] CR2: 0000000000000000 CR3: 0000004f35e10001 CR4: 0000000000771ef0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135224] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135225] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135227] PKRU: 55555554
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135228] Call Trace:
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135230] <TASK>
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135232] parse_discovery_table.isra.0+0x162/0x1a0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135235] intel_uncore_has_discovery_tables+0x19e/0x270
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135238] ? type_pmu_register+0x21/0x42
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135243] intel_uncore_init+0xe3/0x226
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135246] ? type_pmu_register+0x42/0x42
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135249] do_one_initcall+0x46/0x1e0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135253] do_initcalls+0x12f/0x159
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135256] kernel_init_freeable+0x162/0x1b5
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135259] ? rest_init+0x100/0x100
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135263] kernel_init+0x1b/0x150
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135265] ? rest_init+0x100/0x100
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135266] ret_from_fork+0x1f/0x30
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135270] </TASK>
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135271] ---[ end trace 6011f2a9999291c3 ]---

This doesn't happen on ALL SPR platforms, but it does happen periodically, and always seems to be centered around arch/x86/events/intel/uncore_discovery.c

This doesn't seem to cause any stability issues that we've seen, but we need to know whether these warnings are innocuous and, better, whether this can be fixed so the kernel no longer emits them (each warning sets the kernel taint flag).
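
To check whether the warnings in a given log really all come from the same place, a small script can tally the WARNING sites. The following is a minimal sketch, not part of the original report: the function name `warning_sites` is made up for illustration, and the regular expression assumes the "WARNING: CPU: N PID: N at file:line symbol+offset" layout seen in the traces above.

```python
import re
from collections import Counter

# Matches lines such as:
#   WARNING: CPU: 48 PID: 1 at arch/x86/events/intel/uncore_discovery.c:184 uncore_insert_box_info+0x134/0x350
# The layout is assumed from the traces quoted in this report.
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<symbol>[^\s+]+)\+"
)


def warning_sites(log_text):
    """Count kernel WARNING sites as (source file, line, symbol) tuples."""
    sites = Counter()
    for match in WARNING_RE.finditer(log_text):
        key = (match["file"], int(match["line"]), match["symbol"])
        sites[key] += 1
    return sites
```

Passing the contents of kern.log to `warning_sites()` returns a Counter; on the two systems above, the only key expected is the uncore_discovery.c:184 site in uncore_insert_box_info.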

[Fixes]
commit 5d515ee40cb57ea5331998f27df7946a69f14dc3
Author: Kan Liang <email address hidden>
Date: Thu Jan 12 12:01:05 2023 -0800
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table

Clean cherry pick from 6.3 (and exists in Mantic and later already)

[Test Case]
On SPR systems, the kernel warning should not appear in kern.log and the kernel should not show the taint flag (9) for "Kernel issued warning"
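
The taint part of this test case can be scripted. The sketch below is only an illustration and not part of the fix: it reads the taint bitmask from /proc/sys/kernel/tainted and tests bit 9 (value 512, shown as 'W' in oops output), which the kernel sets when it has issued a warning. The helper names are hypothetical.

```python
TAINT_WARN_BIT = 9  # 'W' flag: kernel issued a warning (mask value 512)


def has_warn_taint(tainted_value):
    """Return True if bit 9 is set in the given taint bitmask."""
    return bool((tainted_value >> TAINT_WARN_BIT) & 1)


def read_tainted(path="/proc/sys/kernel/tainted"):
    """Read the current taint bitmask of the running kernel."""
    with open(path) as fh:
        return int(fh.read().strip())
```

On a fixed kernel, `has_warn_taint(read_tainted())` should return False right after boot, provided nothing else on the system has triggered a warning.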

[Where problems could occur]
This is a specific bug fix to resolve this issue identified by Intel and should not generate issues outside the scope of this fix.

Tags: patch
Jeff Lane  (bladernr) wrote (last edit ):

Found this commit in mainline and our 6.5 HWE kernel. Checking now to see if there are any prerequisites as well.

commit 5d515ee40cb57ea5331998f27df7946a69f14dc3
Author: Kan Liang <email address hidden>
Date: Thu Jan 12 12:01:05 2023 -0800
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table

Jeff Lane  (bladernr)
description: updated
yunyings (yunying-sun) wrote :

This looks like the uncore warning that is triggered by a broken UPI discovery table on some SPR MCC parts.
Link for fix patches: https://<email address hidden>/

To fix it, the commits below from mainline kernel v6.3-rc1 should be backported to the kernel being used:
5d515ee40cb5 perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
65248a9a9ee1 perf/x86/uncore: Add a quirk for UPI on SPR
bd9514a4d5ec perf/x86/uncore: Ignore broken units in discovery table
3af548f23610 perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name
dbf061b26221 perf/x86/uncore: Factor out uncore_device_to_die()
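
Since backported commits get new hashes, one way to see which of these are already in a given tree is to search the history by subject line. This is a rough sketch under that assumption (it only works if the backport kept the upstream subject); `grep_cmd` and `missing_subjects` are hypothetical helper names, and the repository path is whatever kernel tree is being checked.

```python
import subprocess

# Subjects of the five upstream commits listed above.
SUBJECTS = [
    "perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table",
    "perf/x86/uncore: Add a quirk for UPI on SPR",
    "perf/x86/uncore: Ignore broken units in discovery table",
    "perf/x86/uncore: Fix potential NULL pointer in uncore_get_alias_name",
    "perf/x86/uncore: Factor out uncore_device_to_die()",
]


def grep_cmd(subject, rev="HEAD"):
    """Build a git command that lists commits whose message contains subject."""
    return ["git", "log", "--oneline", "--fixed-strings", f"--grep={subject}", rev]


def missing_subjects(repo, rev="HEAD"):
    """Return the subjects that have no matching commit reachable from rev."""
    missing = []
    for subject in SUBJECTS:
        result = subprocess.run(
            grep_cmd(subject, rev),
            cwd=repo, capture_output=True, text=True, check=True,
        )
        if not result.stdout.strip():
            missing.append(subject)
    return missing
```

Calling `missing_subjects()` with the path to a checked-out Jammy kernel tree would list the subjects that still need backporting to that branch.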

Jeff Lane  (bladernr)
Changed in intel:
assignee: nobody → Jeff Lane  (bladernr)
Changed in linux (Ubuntu):
assignee: nobody → Jeff Lane  (bladernr)
Changed in intel:
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in intel:
status: New → In Progress
Changed in linux (Ubuntu):
status: New → In Progress
Jeff Lane  (bladernr)
Changed in linux (Ubuntu Jammy):
status: New → In Progress
assignee: nobody → Jeff Lane  (bladernr)
importance: Undecided → Medium
Jeff Lane  (bladernr)
no longer affects: linux (Ubuntu Focal)
Jeff Lane  (bladernr) wrote :

Hi... could you provide a patch to backport

65248a9a9ee1 perf/x86/uncore: Add a quirk for UPI on SPR

To 5.15?

I started on it, but cherry-picking that commit hits merge conflicts, and working out its prerequisite patches turned into a bit of a rabbit hole.

Changed in intel:
assignee: Jeff Lane  (bladernr) → nobody
Changed in linux (Ubuntu):
assignee: Jeff Lane  (bladernr) → nobody
Changed in linux (Ubuntu Jammy):
assignee: Jeff Lane  (bladernr) → nobody
Changed in intel:
status: In Progress → Confirmed
Changed in linux (Ubuntu):
status: In Progress → Confirmed
Changed in linux (Ubuntu Jammy):
status: In Progress → Confirmed
yunyings (yunying-sun) wrote :

Commit 65248a9a9ee1 (perf/x86/uncore: Add a quirk for UPI on SPR) needs some code adaptations when backporting to kernel 5.15.

I just made an adapted patch locally; it is attached here. Please try applying and testing it.

tags: added: patch