Kdump fails to capture dump on Firestone NV when machine crashes while running stress-ng.

Bug #1680349 reported by bugproxy
Affects                            Status    Importance   Assigned to             Milestone
The Ubuntu-power-systems project   Invalid   High         Canonical Kernel Team   -
linux (Ubuntu)                     Invalid   High         Canonical Kernel Team   -

Bug Description

== Comment: #0 - PAVITHRA R. PRAKASH <> - 2017-03-10 02:43:10 ==
---Problem Description---

Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine crashes while running stress-ng. Machine hangs.

---Steps to Reproduce---

1. Configure kdump (a setup sketch follows this list).
2. Install stress-ng
# apt-get install stress-ng
3. Run stress-ng
# stress-ng -a 0
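
For step 1, a minimal kdump setup on Ubuntu looks roughly like this (a sketch of
the standard Ubuntu procedure; the package choice and checks are assumptions, not
taken from this report):

# apt-get install linux-crashdump   # pulls in kdump-tools and kexec-tools
# reboot                            # so the crashkernel= memory reservation takes effect
# kdump-config show                 # should report "current state: ready to kdump"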

Logs:
========
root@ltc-firep3:~# kdump-config load
Modified cmdline:root=UUID=8b0d5b99-6087-4f40-82ea-375c83a4c139 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=155200K
 * loaded kdump kernel
root@ltc-firep3:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.10.0-11-generic
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.10.0-11-generic
current state: ready to kdump

kexec command:
  /sbin/kexec -p --command-line="root=UUID=8b0d5b99-6087-4f40-82ea-375c83a4c139 ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
root@ltc-firep3:~# stress-ng -a 0
stress-ng: info: [3900] defaulting to a 86400 second run per stressor
stress-ng: info: [3900] dispatching hogs: 160 af-alg, 160 affinity, 160 aio, 160 aiol, 160 apparmor, 160 atomic, 160 bigheap, 160 brk, 160 bsearch, 160 cache, 160 cap, 160 chdir, 160 chmod, 160 chown, 160 chroot, 160 clock, 160 clone, 160 context, 160 copy-file, 160 cpu, 160 cpu-online, 160 crypt, 160 daemon, 160 dccp, 160 dentry, 160 dir, 160 dirdeep, 160 dnotify, 160 dup, 160 epoll, 160 eventfd, 160 exec, 160 fallocate, 160 fanotify, 160 fault, 160 fcntl, 160 fiemap, 160 fifo, 160 filename, 160 flock, 160 fork, 160 fp-error, 160 fstat, 160 full, 160 futex, 160 get, 160 getdent, 160 getrandom, 160 handle, 160 hdd, 160 heapsort, 160 hsearch, 160 icache, 160 icmp-flood, 160 inotify, 160 io, 160 iomix, 160 ioprio, 160 itimer, 160 kcmp, 160 key, 160 kill, 160 klog, 160 lease, 160 link, 160 locka, 160 lockbus, 160 lockf, 160 lockofd, 160 longjmp, 160 lsearch, 160 madvise, 160 malloc, 160 matrix, 160 membarrier, 160 memcpy, 160 memfd, 160 mergesort, 160 mincore, 160 mknod, 160 mlock, 160 mmap, 160 mmapfork, 160 mmapmany, 160 mq, 160 mremap, 160 msg, 160 msync, 160 netlink-proc, 160 nice, 160 nop, 160 null, 160 numa, 160 oom-pipe, 160 opcode, 160 open, 160 personality, 160 pipe, 160 poll, 160 procfs, 160 pthread, 160 ptrace, 160 pty, 160 qsort, 160 quota, 160 rdrand, 160 readahead, 160 remap, 160 rename, 160 resources, 160 rlimit, 160 rmap, 160 rtc, 160 schedpolicy, 160 sctp, 160 seal, 160 seccomp, 160 seek, 160 sem, 160 sem-sysv, 160 sendfile, 160 shm, 160 shm-sysv, 160 sigfd, 160 sigfpe, 160 sigpending, 160 sigq, 160 sigsegv, 160 sigsuspend, 160 sleep, 160 sock, 160 sockfd, 160 sockpair, 160 spawn, 160 splice, 160 stack, 160 stackmmap, 160 str, 160 stream, 160 switch, 160 symlink, 160 sync-file, 160 sysfs, 160 sysinfo, 160 tee, 160 timer, 160 timerfd, 160 tlb-shootdown, 160 tmpfs, 160 tsc, 160 tsearch, 160 udp, 160 udp-flood, 160 unshare, 160 urandom, 160 userfaultfd, 160 utime, 160 vecmath, 160 vfork, 160 vforkmany, 160 vm, 160 vm-rw, 160 vm-splice, 160 wait, 160 wcs, 160 xattr, 160 yield, 160 zero, 160 zlib, 160 zombie
stress-ng: info: [3900] cache allocate: using built-in defaults as unable to determine cache details
stress-ng: info: [3900] cache allocate: default cache size: 2048K
stress-ng: info: [3907] stress-ng-atomic: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
stress-ng: info: [3955] stress-ng-exec: running as root, won't run test.
stress-ng: info: [3999] stress-ng-icache: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
stress-ng: info: [4040] stress-ng-lockbus: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
stress-ng: info: [4313] stress-ng-numa: system has 2 of a maximum 256 memory NUMA nodes
stress-ng: info: [4455] stress-ng-rdrand: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
stress-ng: fail: [4558] stress-ng-rtc: ioctl RTC_ALRM_READ failed, errno=22 (Invalid argument)
stress-ng: fail: [4017] stress-ng-key: keyctl KEYCTL_DESCRIBE failed, errno=127 (Key has expired)
stress-ng: fail: [4017] stress-ng-key: keyctl KEYCTL_UPDATE failed, errno=127 (Key has expired)
stress-ng: fail: [4017] stress-ng-key: keyctl KEYCTL_READ failed, errno=127 (Key has expired)
stress-ng: fail: [4017] stress-ng-key: request_key failed, errno=126 (Required key not available)
stress-ng: fail: [4017] stress-ng-key: keyctl KEYCTL_DESCRIBE failed, errno=127 (Key has expired)
info: 5 failures reached, aborting stress process
[ 170.733680] Memory failure: 0xceda8: recovery action for dirty LRU page: Recovered
[ 171.036660] Memory failure: 0xce8e9: recovery action for dirty LRU page: Recovered
[ 171.161610] Memory failure: 0xce4fb: recovery action for dirty LRU page: Recovered
[ 171.170348] AppArmor DFA next/check upper bounds error
[ 171.204790] Memory failure: 0xd2146: recovery action for dirty LRU page: Recovered
[ 171.232026] Memory failure: 0xcefe6: recovery action for dirty LRU page: Recovered
[ 171.232899] Memory failure: 0xce578: recovery action for dirty LRU page: Recovered
[ 171.236850] Memory failure: 0xcfdfb: recovery action for dirty LRU page: Recovered
[ 171.336249] Memory failure: 0xcd715: recovery action for dirty LRU page: Recovered
[ 171.337550] Memory failure: 0xfb86c: recovery action for dirty LRU page: Recovered
[ 171.367483] Memory failure: 0xce92c: recovery action for dirty LRU page: Recovered
[ 171.369980] Memory failure: 0xceabe: recovery action for dirty LRU page: Recovered
[ 171.372534] Memory failure: 0xbcf3a: recovery action for dirty LRU page: Recovered
[ 171.375318] Memory failure: 0xceef9: recovery action for dirty LRU page: Recovered
[ 171.377701] Memory failure: 0xce722: recovery action for dirty LRU page: Recovered
[ 171.384725] Memory failure: 0xcedef: recovery action for dirty LRU page: Recovered
[ 171.398538] Memory failure: 0xcf927: recovery action for dirty LRU page: Recovered
[ 171.401492] Memory failure: 0xce881: recovery action for dirty LRU page: Recovered
[ 171.403476] Memory failure: 0xce2d4: recovery action for dirty LRU page: Recovered
[ 171.404104] Memory failure: 0xce17a: recovery action for dirty LRU page: Recovered
[ 171.404682] Memory failure: 0xd9f0b: recovery action for dirty LRU page: Recovered
stress-ng: info: [4865] stress-ng-spawn: running as root, won't run test.
[ 171.406159] Memory failure: 0xdaae0: recovery action for dirty LRU page: Recovered
[ 171.415810] Memory failure: 0xb5355: recovery action for dirty LRU page: Recovered
[ 171.434513] Memory failure: 0xb5576: recovery action for dirty LRU page: Recovered
[ 171.435161] Memory failure: 0xbd0fd: recovery action for dirty LRU page: Recovered
[ 171.436046] Memory failure: 0xceec0: recovery action for dirty LRU page: Recovered
[ 171.449215] Memory failure: 0xcecda: recovery action for dirty LRU page: Recovered
[ 171.453705] Memory failure: 0xcf005: recovery action for dirty LRU page: Recovered
[ 171.491202] Memory failure: 0xfb99e: recovery action for dirty LRU page: Recovered
[ 171.493054] Memory failure: 0xb2dbe: recovery action for dirty LRU page: Recovered
[ 171.503540] Memory failure: 0xced0f: recovery action for dirty LRU page: Recovered
[ 171.504809] Memory failure: 0xb2dad: recovery action for clean LRU page: Recovered
[ 171.506327] Memory failure: 0xb3268: recovery action for dirty LRU page: Recovered
[ 171.523449] Memory failure: 0xb3238: recovery action for dirty LRU page: Recovered
[ 171.524558] Memory failure: 0xcea57: recovery action for dirty LRU page: Recovered
[ 171.525611] Memory failure: 0xce6c8: recovery action for dirty LRU page: Recovered
[ 171.526501] Memory failure: 0xbd0d0: recovery action for dirty LRU page: Recovered
[ 171.528740] Memory failure: 0xcea27: recovery action for dirty LRU page: Recovered
[ 171.536166] Memory failure: 0xce469: recovery action for dirty LRU page: Recovered
[ 171.537409] Memory failure: 0xcec3f: recovery action for dirty LRU page: Recovered
[ 171.538991] Memory failure: 0xcec80: recovery action for dirty LRU page: Recovered
[ 171.540183] Memory failure: 0xb0283: recovery action for dirty LRU page: Recovered
[ 171.568190] Memory failure: 0xb0165: recovery action for dirty LRU page: Recovered
[ 171.569451] Memory failure: 0xda648: recovery action for dirty LRU page: Recovered
[ 171.669472] Memory failure: 0xb2d6a: recovery action for dirty LRU page: Recovered
stress-ng: info: [4929] stress-ng-stream: using built-in defaults as unable to determine cache details
stress-ng: info: [4929] stress-ng-stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info: [4929] stress-ng-stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info: [4929] stress-ng-stream: Using CPU cache size of 2048K
[ 171.722081] Memory failure: 0xcf20d: recovery action for dirty LRU page: Recovered
[ 171.723615] Memory failure: 0xa975f: recovery action for dirty LRU page: Recovered
[ 171.745730] Memory failure: 0xb2d85: recovery action for clean LRU page: Recovered
stress-ng: info: [4986] stress-ng-sysfs: running as root, just traversing /sys and not read/writing to /sys files.
[ 172.043162] Memory failure: 0xaa6aa: recovery action for dirty LRU page: Recovered
[ 172.048888] Memory failure: 0xb02d0: recovery action for dirty LRU page: Recovered
[ 172.103892] Memory failure: 0xcd8db: recovery action for dirty LRU page: Recovered
[ 172.105545] Memory failure: 0xb2d9e: recovery action for dirty LRU page: Recovered
[ 172.106053] Memory failure: 0xcf2f4: recovery action for dirty LRU page: Recovered
[ 172.106224] Memory failure: 0xa9758: recovery action for clean LRU page: Recovered
[ 172.146851] Memory failure: 0xce5e8: recovery action for clean LRU page: Recovered
[ 172.234564] Memory failure: 0x9e8b2: recovery action for dirty LRU page: Recovered
[ 172.236835] Memory failure: 0xac4f8: recovery action for dirty LRU page: Recovered
[ 172.238363] Memory failure: 0xcebb2: recovery action for dirty LRU page: Recovered
stress-ng: info: [5105] stress-ng-tsc: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
[ 172.494650] Memory failure: 0xcecb6: recovery action for clean LRU page: Recovered
[ 172.495944] Memory failure: 0xa9710: recovery action for dirty LRU page: Recovered
[ 172.496511] Memory failure: 0xb55d7: recovery action for dirty LRU page: Recovered
[ 172.496932] Memory failure: 0x9e8cb: recovery action for dirty LRU page: Recovered
[ 172.716658] Memory failure: 0x9e628: recovery action for dirty LRU page: Recovered
[ 172.780960] Memory failure: 0xcf3ac: recovery action for dirty LRU page: Recovered
[ 172.781447] Memory failure: 0xceaac: recovery action for dirty LRU page: Recovered
[ 172.781891] Memory failure: 0xb55a1: recovery action for dirty LRU page: Recovered
[ 172.845268] Memory failure: 0x84318: recovery action for dirty LRU page: Recovered
[ 172.846308] Memory failure: 0x84322: recovery action for dirty LRU page: Recovered
[ 172.860021] Memory failure: 0xbd067: recovery action for dirty LRU page: Recovered
[ 172.924176] Memory failure: 0xce68e: recovery action for dirty LRU page: Recovered
[ 172.926255] Memory failure: 0x92ee8: recovery action for dirty LRU page: Recovered
[ 172.926720] Memory failure: 0xda136: recovery action for dirty LRU page: Recovered
[ 172.927534] Memory failure: 0xb2d75: recovery action for dirty LRU page: Recovered
[ 173.008909] Memory failure: 0xac4e6: recovery action for dirty LRU page: Recovered
[ 173.042161] Memory failure: 0xcea49: recovery action for dirty LRU page: Recovered
[ 173.076591] Memory failure: 0x9e8fb: recovery action for dirty LRU page: Recovered
[ 173.124359] Memory failure: 0x8434b: recovery action for dirty LRU page: Recovered
[ 173.288102] Memory failure: 0xcf5e7: recovery action for dirty LRU page: Recovered
[ 173.440243] Memory failure: 0xb012d: recovery action for dirty LRU page: Recovered
[ 173.565679] Memory failure: 0x1cc382: recovery action for clean LRU page: Recovered
[ 173.620166] Memory failure: 0x84334: recovery action for dirty LRU page: Recovered
[ 173.635189] Memory failure: 0xb02bf: recovery action for dirty LRU page: Recovered
[ 173.636070] Memory failure: 0x9e8f0: recovery action for dirty LRU page: Recovered
[ 173.638929] Memory failure: 0x84362: recovery action for dirty LRU page: Recovered
[ 173.643249] Memory failure: 0xcda1c: recovery action for dirty LRU page: Recovered
[ 173.648607] Memory failure: 0x9a27a: recovery action for dirty LRU page: Recovered
[ 173.651927] Memory failure: 0xced46: recovery action for dirty LRU page: Recovered
[ 173.711413] Memory failure: 0x9a270: recovery action for dirty LRU page: Recovered
[ 173.733759] Memory failure: 0xb55b1: recovery action for dirty LRU page: Recovered
[ 173.738553] Memory failure: 0x840d1: recovery action for dirty LRU page: Recovered
[ 173.740023] Memory failure: 0xb01ae: recovery action for dirty LRU page: Recovered
[ 173.740992] Memory failure: 0x1c9ca8: recovery action for dirty LRU page: Recovered
[ 173.742282] Memory failure: 0xa97fa: recovery action for dirty LRU page: Recovered
[ 173.783778] Memory failure: 0xc763c: recovery action for dirty LRU page: Recovered
[ 173.785593] Memory failure: 0xb02b6: recovery action for dirty LRU page: Recovered
[ 173.788206] AppArmor DFA next/check upper bounds error
[ 173.788390] Memory failure: 0x1ca066: dirty LRU page still referenced by 1 users
[ 173.788395] Memory failure: 0x1ca066: recovery action for dirty LRU page: Failed
[ 174.403722] Memory failure: 0x1c979a: recovery action for dirty LRU page: Recovered
[ 174.428211] Memory failure: 0xb02db: recovery action for dirty LRU page: Recovered
stress-ng: info: [5591] stress-ng-yield: limiting to 160 yielders (instance 0)
stress-ng: info: [5689] stress-ng-atomic: this stressor is not implemented on this system: ppc64le Linux 4.10.0-11-generic
[ 174.643022] Memory failure: 0x1ca7c1: recovery action for dirty LRU page: Recovered
stress-ng: info: [6033] stress-ng-exec: running as root, won't run test.
[ 174.691794] Unable to handle kernel paging request for data at address 0x000002f4
[ 174.692217] Faulting instruction address: 0xd000000014bc0a90
[ 174.692484] Oops: Kernel access of bad area, sig: 11 [#1]
[ 174.692780] SMP NR_CPUS=2048
[ 174.692788] NUMA
[ 174.693003] PowerNV
[ 174.693269] Modules linked in: btrfs xor raid6_pq cuse wp512 kvm_hv kvm_pr sctp(+) rmd320 libcrc32c dccp_ipv4(+) kvm rmd256 rmd160 rmd128 md4 binfmt_misc algif_hash dccp af_alg ofpart cmdlinepart ipmi_powernv ipmi_devintf powernv_flash ipmi_msghandler mtd opal_prd ibmpowernv powernv_rng joydev input_leds mac_hid at24 nvmem_core uio_pdrv_genirq uio ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage ast crc32c_vpmsum i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci tg3
[ 174.694545] CPU: 98 PID: 6170 Comm: stress-ng-dccp Not tainted 4.10.0-11-generic #13-Ubuntu
[ 174.694645] task: c000001e42c50800 task.stack: c000001e42cf8000
[ 174.694758] NIP: d000000014bc0a90 LR: d000000014bc21cc CTR: c000000000a3b340
[ 174.694872] REGS: c000001fff4476b0 TRAP: 0300 Not tainted (4.10.0-11-generic)
[ 174.694974] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 174.695059] CR: 24002242 XER: 20000000
[ 174.695217] CFAR: c000000000008850 DAR: 00000000000002f4 DSISR: 40000000 SOFTE: 1
[ 174.695217] GPR00: d000000014bc21cc c000001fff447930 d000000014bcb670 0000000000000001
[ 174.695217] GPR04: c000001e3b602700 c000001df14a0c60 0000000000000474 c000001df14a0c74
[ 174.695217] GPR08: c000001df14a0800 0000000000000000 c000001e3b60c400 0000000000000000
[ 174.695217] GPR12: 0000000000002200 c000000007b87200 c000001fff444000 0000000000000000
[ 174.695217] GPR16: c000000000fc2800 0000000000000000 0000000000000040 0000000000002711
[ 174.695217] GPR20: 000000000000a4d2 000000000100007f 000000000100007f c0000000013d2880
[ 174.695217] GPR24: 0000000000000001 0000000000000001 0000000000000000 0000000000000004
[ 174.695217] GPR28: c0000000013d2880 c000001df14a0c74 0000000000000000 c000001e3b602700
[ 174.696147] NIP [d000000014bc0a90] dccp_v4_ctl_send_reset+0xa8/0x2f0 [dccp_ipv4]
[ 174.696238] LR [d000000014bc21cc] dccp_v4_rcv+0x5d4/0x850 [dccp_ipv4]
[ 174.696312] Call Trace:
[ 174.696345] [c000001fff447930] [c000001fff4479c0] 0xc000001fff4479c0 (unreliable)
[ 174.696978] [c000001fff4479c0] [d000000

-----------------------------MACHINE HANGS -------------------------------------

== Comment: #29 - Kevin W. Rudd <> - 2017-03-20 12:50:22 ==
Hari,

I was able to get access to the system for a quick set of validation tests. With the default kdump settings, kdump completed OK and correctly saved a vmcore. When stress-ng is run first, kdump hangs with the default settings, and also when the settings are modified to use "maxcpus=1" and "noirqdistrib".

The following message was printed prior to the stress-ng-induced hangs, but not when kdump completed without stress-ng running:

"Ignoring boot flags, incorrect version 0x0"

I will attach console logs from each test for review.

== Comment: #30 - Kevin W. Rudd <> - 2017-03-20 12:51:54 ==
The default boot options had "quiet splash", so I trimmed out the useless "ubuntu" splash messages from the log.

== Comment: #36 - Hari Krishna Bathini <> - 2017-04-05 13:31:23 ==
If panic timeout is set to zero and any secondary CPUs don't respond to IPI,
kdump waits forever for a system reset, to try again to get ALL secondary
CPUs to respond to IPI. The hang here is because panic timeout is set to
zero and a few secondary CPUs didn't respond to IPI. System reset support
is still work in progress for Open Power machines. Meantime, to workaround
the hang issue, panic timeout value can be set to a non-zero value with

 $ echo 10 > /proc/sys/kernel/panic

I did try it but kdump didn't take off. Instead the system just rebooted
(better than a hang, I guess :) ). As for why kdump didn't take off, I
am debugging it..
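
For completeness, a persistent variant of the same workaround (a sketch; the
sysctl drop-in file name is hypothetical, not from the original comment):

 $ echo 'kernel.panic = 10' > /etc/sysctl.d/99-panic-timeout.conf   # hypothetical file name
 $ sysctl -p /etc/sysctl.d/99-panic-timeout.conf                    # apply without a reboot
 $ cat /proc/sys/kernel/panic                                       # should now print 10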

Hi Canonical,

I think it would be better to have a non-zero default value for the panic
timeout (CONFIG_PANIC_TIMEOUT).

Thanks
Hari

Revision history for this message
bugproxy (bugproxy) wrote : dmesg log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-152446 severity-high targetmilestone-inin1704
Revision history for this message
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : dl log message

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : sosreport before stress-ng test

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : sosreport after stress-ng test

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : successful kdump

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Failed kdump with defaults after stress-ng was invoked

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : failed kdump with wiki suggestions applied (stress-ng actually triggered panic)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : crash log with panic timeout set to a non-zero value

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Nish Aravamudan (nacc)
importance: Undecided → High
Revision history for this message
bugproxy (bugproxy) wrote : dmesg log

Default Comment by Bridge

Manoj Iyer (manjo)
tags: added: ubuntu-17.04
Revision history for this message
bugproxy (bugproxy) wrote : Failed kdump with defaults after stress-ng was invoked

Default Comment by Bridge

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
importance: Undecided → High
Revision history for this message
bugproxy (bugproxy) wrote : failed kdump with wiki suggestions applied (stress-ng actually triggered panic)

Default Comment by Bridge

Manoj Iyer (manjo)
tags: added: triage-a
Revision history for this message
Nish Aravamudan (nacc) wrote : Re: Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine crashes while running stress-ng.

I believe this request should be routed to the kernel team, since it involves setting a kernel default value for POWER systems.

Revision history for this message
bugproxy (bugproxy) wrote : successful kdump

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Failed kdump with defaults after stress-ng was invoked

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : failed kdump with wiki suggestions applied (stress-ng actually triggered panic)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : crash log with panic timeout set to a non-zero value

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-07-31 14:20 EDT-------
(In reply to comment #51)
> I believe this request should be routed to the kernel team (setting of a
> kernel default value for POWER systems)?

Some CPUs are entering a hung state by the time of the crash.
There are two sides to look at here:

1. Why the stress test is crashing the system and why certain CPUs are
not responding.

2. Why the kdump kernel is not able to boot when some CPUs are not responding.

We are investigating this from the perspective of case 2. We would appreciate it if
somebody could look into this from the perspective of case 1.

Thanks
Hari

David Britton (dpb)
Changed in linux (Ubuntu):
assignee: Nish Aravamudan (nacc) → nobody
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: kernel-da-key
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Manoj Iyer (manjo)
tags: added: triage-r
removed: triage-a
Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote : Re: Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine crashes while running stress-ng.

Colin, can you look into why stress-ng has crashed this system?

Thanks.
Cascardo.

Revision history for this message
Colin Ian King (colin-king) wrote :

Just one thing to note: running stress-ng as root with all the stressors is a pathological test scenario. The manual states:

       Running stress-ng with root privileges will adjust out of memory
       settings on Linux systems to make the stressors unkillable in low
       memory situations, so use this judiciously. With the appropriate
       privilege, stress-ng can allow the ionice class and ionice levels
       to be adjusted, again, this should be used with care.

Revision history for this message
Colin Ian King (colin-king) wrote :

Which version of stress-ng is being used? stress-ng -V will show this info.

Revision history for this message
Colin Ian King (colin-king) wrote :

The CPU IPI issue may be occurring because of the cpu-online stressor, which can rapidly take CPUs offline and online. I suggest re-running the tests with the '-x cpu-online' option to exclude that stressor, to see whether it is causing that specific issue.
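
Concretely, the suggested re-run would look like this (a one-line sketch based
on the comment above; -x is stress-ng's documented exclude option):

# stress-ng -a 0 -x cpu-online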

Revision history for this message
Colin Ian King (colin-king) wrote :

A null pointer dereference is occurring in dccp_v4_ctl_send_reset:

[ 174.691794] Unable to handle kernel paging request for data at address 0x000002f4

This is an offset from the pointer in register GPR9, which is zero:

[ 174.695217] GPR08: c000001df14a0800 0000000000000000 c000001e3b60c400 0000000000000000

Looking at the object code, I believe this oops occurs because the control sockets have been cleared for some reason, although it's not obvious why.
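
One way to confirm which field sits at that offset (a hypothetical debugging
sketch, not from the comment: it assumes the NULL pointer is a struct sock *
and that the dwarves package and the kernel's dbgsym vmlinux are installed):

# apt-get install dwarves
# pahole --hex -C sock /usr/lib/debug/boot/vmlinux-4.10.0-11-generic | grep 0x2f4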

Revision history for this message
Colin Ian King (colin-king) wrote :

The kernel being tested is 4.10.0-11-generic; I believe a fix in Ubuntu-4.10.0-15.17 may address this issue. We've seen this sort of issue before, e.g. bug 1654073.

I believe the pertinent fix is upstream commit:

commit 449809a66c1d0b1563dee84493e14bf3104d2d7e
Author: Eric Dumazet <email address hidden>
Date: Wed Mar 1 08:39:49 2017 -0800

    tcp/dccp: block BH for SYN processing

...and this landed in Ubuntu-4.10.0-15.17 as commit 0c244cdf6da6ddeaa4eed2511a75be8f908763aa
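
A quick way to check whether a given Ubuntu kernel tree contains the fix (a
sketch; it assumes a local clone of the Ubuntu zesty kernel git repository
with its release tags):

$ git tag --contains 0c244cdf6da6ddeaa4eed2511a75be8f908763aa | grep 4.10.0-15
Ubuntu-4.10.0-15.17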

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

Also, note that I have opened bug LP#1730660 ("Set PANIC_TIMEOUT=10 on Power Systems"), to handle the PANIC_TIMEOUT option. I have submitted patches to set that option for Xenial, Zesty, Artful and beyond.

Revision history for this message
bugproxy (bugproxy) wrote : dl log message

Default Comment by Bridge

tags: added: ppc64el-kdump
Changed in ubuntu-power-systems:
status: New → Triaged
Revision history for this message
Andrew Cloke (andrew-cloke) wrote : Re: Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine crashes while running stress-ng.

This issue was originally raised against 17.04, which is no longer supported. Could you confirm whether it is fixed with the Xenial HWE kernel or the Artful release?

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
tags: added: triage-g
removed: triage-r
Frank Heimes (fheimes)
tags: removed: ubuntu-17.04
summary: - Ubuntu 17.04: Kdump fails to capture dump on Firestone NV when machine
- crashes while running stress-ng.
+ Kdump fails to capture dump on Firestone NV when machine crashes while
+ running stress-ng.
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

We believe this issue is resolved in Artful (17.10). Could you please validate before we close this bug?

Revision history for this message
Manoj Iyer (manjo) wrote :

debian.master/config/ppc64el/config.common.ppc64el:CONFIG_PANIC_TIMEOUT=10

is now set to a non-zero value for Xenial, Artful and Bionic.
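
On a running system, the change can be verified directly (standard Ubuntu paths):

$ grep CONFIG_PANIC_TIMEOUT /boot/config-$(uname -r)
CONFIG_PANIC_TIMEOUT=10
$ cat /proc/sys/kernel/panic
10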

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla
Download full text (13.8 KiB)

(In reply to comment #65)
> We believe this issue is resolved in Artful (17.10), could you please
> validate before we close this bug?

Issue is observed even on 17.10.

md.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0 elfcorehdr=157184K
* loaded kdump kernel
root@ltc-firep3:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_SYSCTL: kernel.panic_on_oops=1
KDUMP_COREDIR: /var/crash
crashkernel addr:
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinux-4.13.0-36-generic
kdump initrd:
/var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-4.13.0-36-generic
current state: ready to kdump

kexec command:
/sbin/kexec -p --command-line="root=UUID=6d6f8d6e-ccb9-49e7-b260-c2e1e3bca3ab ro quiet splash irqpoll nr_cpus=1 nousb systemd.unit=kdump-tools.service ata_piix.prefer_ms_hyperv=0" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz
root@ltc-firep3:~# stress-ng -a 0
stress-ng: info: [3578] rdrand stressor will be skipped, CPU does not support the rdrand instruction.
stress-ng: info: [3578] tsc stressor will be skipped, CPU does not support the tsc instruction.
stress-ng: info: [3578] disabled 'bind-mount' as it may hang the machine (enable it with the --pathological option)
stress-ng: info: [3578] disabled 'cpu-online' as it may hang the machine (enable it with the --pathological option)
stress-ng: info: [3578] disabled 'oom-pipe' as it may hang the machine (enable it with the --pathological option)
stress-ng: info: [3578] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor
stress-ng: info: [3578] dispatching hogs: 160 af-alg, 160 affinity, 160 aio, 160 aiol, 160 apparmor, 160 atomic, 160 bigheap, 160 branch, 160 brk, 160 bsearch, 160 cache, 160 cap, 160 chdir, 160 chmod, 160 chown, 160 chroot, 160 clock, 160 clone, 160 context, 160 copy-file, 160 cpu, 160 crypt, 160 cyclic, 160 daemon, 160 dccp, 160 dentry, 160 dev, 160 dir, 160 dirdeep, 160 dnotify, 160 dup, 160 epoll, 160 eventfd, 160 exec, 160 fallocate, 160 fanotify, 160 fault, 160 fcntl, 160 fiemap, 160 fifo, 160 filename, 160 flock, 160 fork, 160 fp-error, 160 fstat, 160 full, 160 futex, 160 get, 160 getdent, 160 getrandom, 160 handle, 160 hdd, 160 heapsort, 160 hsearch, 160 icache, 160 icmp-flood, 160 inode-flags, 160 inotify, 160 io, 160 iomix, 160 ioprio, 160 itimer, 160 kcmp, 160 key, 160 kill, 160 klog, 160 lease, 160 link, 160 locka, 160 lockbus, 160 lockf, 160 lockofd, 160 longjmp, 160 lsearch, 160 madvise, 160 malloc, 160 matrix, 160 membarrier, 160 memcpy, 160 memfd, 160 memrate, 160 memthrash, 160 mergesort, 160 mincore, 160 mknod, 160 mlock, 160 mmap, 160 mmapfork, 160 mmapmany, 160 mq, 160 mremap, 160 msg, 160 msync, 160 netdev, 160 netlink-proc, 160 nice, 160 nop, 160 null, 160 numa, 160 opcode, 160 open, 160 personality, 160 pipe, 160 poll, 160 procfs, 160 pthread, 160 ptrace, 160 pty, 160 qsort, 160 quota, 160 radixsort, 160 readahead, 160 remap, 160 rename, 160 resources, 160 rlimit, 160 rmap, 160 rtc, 160 schedpolicy, 160 sctp, 160 seal, 160 seccomp, 160 seek, 160 sem, 160 sem-sysv, 160 sendfile, 160 shm, 160 shm-sysv, 160 sigfd, 160 sigfpe, 160 sigpending, 160 sigq, 160 sigsegv, 160 sigsuspend, 1...

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

This is caused by stress-ng holding those CPUs. If kdump is expected to work in this case, we need IBM's help in fixing it.

Meanwhile, can you try bionic, and see if the 4.15 kernel from bionic fixes it?

Cascardo.

Revision history for this message
bugproxy (bugproxy) wrote : successful kdump

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Failed kdump with defaults after stress-ng was invoked

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : failed kdump with wiki suggestions applied (stress-ng actually triggered panic)

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : crash log with panic timeout set to a non-zero value

Default Comment by Bridge

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Following comment #31, could you confirm that this is resolved with the 4.15 kernel in Bionic?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

(In reply to comment #70)
> Following comment #31, could you confirm that this is resolved with the 4.15
> kernel in Bionic?

Issue is observed even on 18.04 [ 4.15.0-12-generic ].

Attaching logs.

Thanks,
Pavithra

Revision history for this message
bugproxy (bugproxy) wrote : 18.04 console log

Default Comment by Bridge

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Thanks for confirming the issue is observed on the 4.15 Bionic kernel.

Following on from comment #31, we believe this is an upstream issue associated with the ppc64el architecture. If you could let us know when a fix has been upstreamed for this issue, we will attempt to backport and integrate it.

Thanks again.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

(In reply to comment #66)
> debian.master/config/ppc64el/config.common.ppc64el:CONFIG_PANIC_TIMEOUT=10
>
> is now set to non '0' value for Xenial, Artful and Bionic.

Thanks for this fix as it resolves the hang...

Revision history for this message
Manoj Iyer (manjo) wrote :

Based on the confirmation IBM provided that this is now fixed, I am marking this as invalid. Please reopen if this is still an issue.

Changed in linux (Ubuntu):
status: New → Invalid
Changed in ubuntu-power-systems:
status: Incomplete → Invalid
Revision history for this message
bugproxy (bugproxy) wrote : 18.04 console log

Default Comment by Bridge

Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

This console log has been posted against a bug that has been closed. Is this issue still persisting? Should the bug be reopened?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

Hello Canonical,

There are two problems here:

1. A hang when CPUs don't respond to the IPI. This can be resolved by setting a non-zero
panic timeout. That fix is now available in Ubuntu releases.

2. Dump capture is still not successful even with the fix for case 1: if some CPUs are
not responding to IPIs, the system resets instead of booting into the kdump kernel.

So case 1 is resolved - the kernel doesn't hang anymore if some CPUs don't respond to
IPIs. As for case 2, I suspect the reset could be happening in firmware while
reinitializing the CPUs; I need to confirm that, though. I am also exploring whether it
is possible to work around what seems like a firmware limitation. Please keep the bug
open while I work on case 2. If we can't work around case 2, we may have to close this
as will-not-fix.

Thanks
Hari

Revision history for this message
bugproxy (bugproxy) wrote :

(In reply to comment #80)
> Please keep the bug open while I work on case 2. If we can't work around
> case 2, we may have to close this as will-not-fix.

While this is on my TODO list, I am putting it low on the priority list, as it is an
unlikely user scenario...

Thanks
Hari

bugproxy (bugproxy)
tags: added: targetmilestone-inin18041
removed: targetmilestone-inin1704
Brad Figg (brad-figg)
tags: added: cscc