Some SPR systems throw kernel warnings from uncore_discovery.c
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
intel | Confirmed | Medium | Unassigned |
linux (Ubuntu) | Confirmed | Medium | Unassigned |
Jammy | Confirmed | Medium | Unassigned |
Bug Description
[Impact]
On some Sapphire Rapids (SPR) CPUs we are seeing kernel warnings in kern.log:
https:/
Intel(R) Xeon(R) Gold 6442Y
Oct 31 03:35:55 N8 kernel: [ 92.770372] ------------[ cut here ]------------
Oct 31 03:35:55 N8 kernel: [ 92.825738] WARNING: CPU: 48 PID: 1 at arch/x86/
Oct 31 03:35:55 N8 kernel: [ 92.953850] Modules linked in:
Oct 31 03:35:55 N8 kernel: [ 92.990464] CPU: 48 PID: 1 Comm: swapper/0 Not tainted 5.15.0-88-generic #98-Ubuntu
Oct 31 03:35:55 N8 kernel: [ 93.082179] Hardware name: ASUSTeK COMPUTER INC. ESC N8-E11/Z13PN-D32 Series, BIOS 0402 09/08/2023
Oct 31 03:35:55 N8 kernel: [ 93.189501] RIP: 0010:uncore_
Oct 31 03:35:55 N8 kernel: [ 93.206419] Freeing initrd memory: 106936K
Oct 31 03:35:55 N8 kernel: [ 93.253138] Code: c2 01 48 83 c0 04 39 d1 0f 8e c6 01 00 00 49 8b 4c 24 38 8b 0c 01 41 89 0c 07 49 8b 74 24 40 8b 34 06 41 89 34 06 39 f9 75 cf <0f> 0b 4c 89 ff e8 b2 07 33 00 4c 89 f7 e8 aa 07 33 00 5b 41 5c 41
Oct 31 03:35:55 N8 kernel: [ 93.527071] RSP: 0000:ff5c25ed80
Oct 31 03:35:55 N8 kernel: [ 93.589669] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
Oct 31 03:35:55 N8 kernel: [ 93.675160] RDX: 0000000000000002 RSI: 0000000000018000 RDI: 0000000000000003
Oct 31 03:35:55 N8 kernel: [ 93.760654] RBP: ff5c25ed800efcc0 R08: 0000000000000010 R09: ff32ac8a801df260
Oct 31 03:35:55 N8 kernel: [ 93.846130] R10: 0000000000000246 R11: 00000000ffffffff R12: ff32ac8a8b8412a0
Oct 31 03:35:55 N8 kernel: [ 93.931613] R13: ff5c25ed800efcf8 R14: ff32ac8a8aa32cb0 R15: ff32ac8a801df260
Oct 31 03:35:55 N8 kernel: [ 94.017099] FS: 000000000000000
Oct 31 03:35:55 N8 kernel: [ 94.114042] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 31 03:35:55 N8 kernel: [ 94.182871] CR2: 0000000000000000 CR3: 0000000d07e10001 CR4: 0000000000771ee0
Oct 31 03:35:55 N8 kernel: [ 94.268360] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 31 03:35:55 N8 kernel: [ 94.353828] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Oct 31 03:35:55 N8 kernel: [ 94.439332] PKRU: 55555554
Oct 31 03:35:55 N8 kernel: [ 94.471788] Call Trace:
Oct 31 03:35:55 N8 kernel: [ 94.501100] <TASK>
Oct 31 03:35:55 N8 kernel: [ 94.526275] ? show_trace_
Oct 31 03:35:55 N8 kernel: [ 94.578457] ? show_trace_
Oct 31 03:35:55 N8 kernel: [ 94.630686] ? parse_discovery
Oct 31 03:35:55 N8 kernel: [ 94.693295] ? show_regs.
Oct 31 03:35:55 N8 kernel: [ 94.741331] ? show_regs.
Oct 31 03:35:55 N8 kernel: [ 94.785212] ? uncore_
Oct 31 03:35:55 N8 kernel: [ 94.841591] ? __warn+0x8c/0x100
Oct 31 03:35:55 N8 kernel: [ 94.880281] ? uncore_
Oct 31 03:35:55 N8 kernel: [ 94.936636] ? report_
Oct 31 03:35:55 N8 kernel: [ 94.978460] ? handle_
Oct 31 03:35:55 N8 kernel: [ 95.020246] ? exc_invalid_
Oct 31 03:35:55 N8 kernel: [ 95.066232] ? asm_exc_
Oct 31 03:35:55 N8 kernel: [ 95.116341] ? uncore_
Oct 31 03:35:55 N8 kernel: [ 95.172708] ? uncore_
Oct 31 03:35:55 N8 kernel: [ 95.228032] parse_discovery
Oct 31 03:35:55 N8 cloud-init[1992]: |.+.o .o .o o +|
Oct 31 03:35:55 N8 kernel: [ 95.288570] intel_uncore_
Oct 31 03:35:55 N8 kernel: [ 95.354298] ? type_pmu_
Oct 31 03:35:55 N8 kernel: [ 95.403385] intel_uncore_
Oct 31 03:35:55 N8 kernel: [ 95.451409] ? type_pmu_
Oct 31 03:35:55 N8 kernel: [ 95.500506] do_one_
Oct 31 03:35:55 N8 kernel: [ 95.546475] do_initcalls+
Oct 31 03:35:55 N8 kernel: [ 95.590372] kernel_
Oct 31 03:35:55 N8 kernel: [ 95.642556] ? rest_init+
Oct 31 03:35:55 N8 kernel: [ 95.685405] kernel_
Oct 31 03:35:55 N8 kernel: [ 95.727228] ? rest_init+
Oct 31 03:35:55 N8 kernel: [ 95.770054] ret_from_
Oct 31 03:35:55 N8 kernel: [ 95.812906] </TASK>
Oct 31 03:35:55 N8 kernel: [ 95.839108] ---[ end trace 2d0c57130f45fd62 ]---
https:/
Intel(R) Xeon(R) Gold 6426Y
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135184] ------------[ cut here ]------------
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135185] WARNING: CPU: 0 PID: 1 at arch/x86/
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135192] Modules linked in:
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135194] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.0-69-generic #76-Ubuntu
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135198] Hardware name: HPE ProLiant ML110 Gen11/ProLiant ML110 Gen11, BIOS 1.30 03/01/2023
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135200] RIP: 0010:uncore_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135202] Code: c2 01 48 83 c0 04 39 d1 0f 8e c6 01 00 00 49 8b 4c 24 38 8b 0c 01 41 89 0c 07 49 8b 74 24 40 8b 34 06 41 89 34 06 39 f9 75 cf <0f> 0b 4c 89 ff e8 22 a2 32 00 4c 89 f7 e8 1a a2 32 00 5b 41 5c 41
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135206] RSP: 0000:ff3b3e1980
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135209] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135210] RDX: 0000000000000002 RSI: 0000000000018000 RDI: 0000000000000003
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135212] RBP: ff3b3e198006bcc0 R08: 0000000000000010 R09: ff31766844f3c5e0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135214] R10: ff31766844fa4438 R11: 0000000000000000 R12: ff31766844f5fa20
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135216] R13: ff3b3e198006bcf8 R14: ff31766844f3ca20 R15: ff31766844f3c5e0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135218] FS: 000000000000000
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135220] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135222] CR2: 0000000000000000 CR3: 0000004f35e10001 CR4: 0000000000771ef0
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135224] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135225] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135227] PKRU: 55555554
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135228] Call Trace:
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135230] <TASK>
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135232] parse_discovery
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135235] intel_uncore_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135238] ? type_pmu_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135243] intel_uncore_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135246] ? type_pmu_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135249] do_one_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135253] do_initcalls+
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135256] kernel_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135259] ? rest_init+
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135263] kernel_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135265] ? rest_init+
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135266] ret_from_
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135270] </TASK>
Apr 14 17:29:28 ML110Gen11 kernel: [ 2.135271] ---[ end trace 6011f2a9999291c3 ]---
This doesn't happen on ALL SPR platforms, but it does happen periodically, and always seems to be centered around arch/x86/
We have not seen these warnings cause any stability issues, but we need to know whether they are innocuous and, better, whether this can be fixed so the kernel no longer emits warnings (each warning sets the kernel taint flag).
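The taint flag in question is bit 9 of the kernel taint mask (value 512), which the kernel labels "kernel issued warning" and shows as the letter W in the Tainted: field of later oops headers. The current mask can be read from /proc/sys/kernel/tainted. A minimal sketch of checking that bit from user space follows; the helper names `taint_has_warn()` and `read_taint_mask()` are hypothetical, not part of any existing tool.

```c
#include <stdio.h>

/* Mirrors the kernel's TAINT_WARN bit: 9 => "kernel issued warning" ('W'). */
#define TAINT_WARN_BIT 9

/* Hypothetical helper: returns 1 if the "kernel issued warning" bit is set. */
int taint_has_warn(unsigned long mask)
{
    return (int)((mask >> TAINT_WARN_BIT) & 1UL);
}

/* Hypothetical helper: reads the current taint mask (a decimal number)
 * from /proc/sys/kernel/tainted.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
int read_taint_mask(unsigned long *mask)
{
    FILE *f = fopen("/proc/sys/kernel/tainted", "r");
    if (f == NULL)
        return -1;
    int ok = (fscanf(f, "%lu", mask) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Note that the bit only records that *some* warning was issued since boot; it does not identify which one, so a set bit must be correlated with kern.log.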
[Fixes]
commit 5d515ee40cb57ea
Author: Kan Liang <email address hidden>
Date: Thu Jan 12 12:01:05 2023 -0800
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table
Clean cherry-pick from 6.3; the commit is already present in Mantic and later.
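The commit title says a broken discovery table should no longer trigger WARN_ON_ONCE(). The user-space sketch below models only that general idea: a malformed entry is reported with an ordinary informational message and skipped, instead of being treated as a kernel bug that taints the kernel. It is not the upstream patch. The `struct unit_entry` type, the function `insert_units_skip_broken()`, and the assumption that "broken" means a unit whose box ID duplicates one already recorded are all hypothetical, chosen for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one discovery-table unit. */
struct unit_entry {
    int box_id;
};

/* Hypothetical: copies entries from in[] to out[], skipping any entry whose
 * box ID was already kept. A duplicate is logged as informational and dropped
 * rather than raising a warning. out[] must have room for n entries.
 * Returns the number of entries kept. */
size_t insert_units_skip_broken(const struct unit_entry *in, size_t n,
                                struct unit_entry *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        int dup = 0;

        for (size_t j = 0; j < kept; j++) {
            if (out[j].box_id == in[i].box_id) {
                dup = 1;
                break;
            }
        }
        if (dup) {
            /* Recoverable condition: inform and move on, no WARN. */
            fprintf(stderr, "info: ignoring duplicate box ID %d\n",
                    in[i].box_id);
            continue;
        }
        out[kept++] = in[i];
    }
    return kept;
}
```

The design point being illustrated is that firmware-provided data can be wrong, so bad input is handled as a recoverable condition rather than as an assertion failure.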
[Test Case]
On SPR systems, the kernel warning should no longer appear in kern.log, and the kernel taint mask should not have bit 9 ("kernel issued warning", shown as W) set.
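For the first half of this test case, a minimal sketch of scanning a log file for the warning follows. The logs captured above are truncated after "arch/x86/", so the match pattern ("WARNING:" plus "uncore_discovery.c", taken from the bug title) is an assumption, and the helper names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: returns 1 if a log line looks like the start of a WARNING
 * raised from uncore_discovery.c (assumed signature, see above). */
int is_uncore_discovery_warning(const char *line)
{
    return strstr(line, "WARNING:") != NULL &&
           strstr(line, "uncore_discovery.c") != NULL;
}

/* Hypothetical: counts matching lines in a log such as /var/log/kern.log.
 * Returns -1 if the file cannot be opened. Lines longer than the buffer are
 * read in pieces, so a match split across two pieces could be missed. */
long count_uncore_discovery_warnings(const char *path)
{
    char buf[4096];
    long count = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(buf, sizeof buf, f) != NULL) {
        if (is_uncore_discovery_warning(buf))
            count++;
    }
    fclose(f);
    return count;
}
```

A passing run on a fixed kernel would expect `count_uncore_discovery_warnings("/var/log/kern.log")` to return 0 for the boot under test; since kern.log accumulates across boots, the check is only meaningful on a freshly rotated log or a single boot's output.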
[Where problems could occur]
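This is a targeted fix for an issue identified by Intel. Per its title, the commit only changes how the perf/x86 uncore driver reacts to a broken discovery table (it no longer uses WARN_ON_ONCE()), so it should not cause problems outside that code path.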
description: updated
Changed in intel:
  assignee: nobody → Jeff Lane (bladernr)
Changed in linux (Ubuntu):
  assignee: nobody → Jeff Lane (bladernr)
Changed in intel:
  importance: Undecided → Medium
Changed in linux (Ubuntu):
  importance: Undecided → Medium
Changed in intel:
  status: New → In Progress
Changed in linux (Ubuntu):
  status: New → In Progress
Changed in linux (Ubuntu Jammy):
  status: New → In Progress
  assignee: nobody → Jeff Lane (bladernr)
  importance: Undecided → Medium
no longer affects: linux (Ubuntu Focal)
tags: added: patch
Found this commit in mainline and our 6.5 HWE kernel. Checking now to see if there are any prerequisites as well.
commit 5d515ee40cb57ea5331998f27df7946a69f14dc3
Author: Kan Liang <email address hidden>
Date: Thu Jan 12 12:01:05 2023 -0800
perf/x86/uncore: Don't WARN_ON_ONCE() for a broken discovery table