qemu-system-ppc64le fails with kvm acceleration

Bug #1920784 reported by sadoon albader
This bug affects 2 people
Affects                            Status        Importance  Assigned to                             Milestone
The Ubuntu-power-systems project   Fix Released  Undecided   Ubuntu on IBM Power Systems Bug Triage
linux (Ubuntu)                     Fix Released  Undecided   Frank Heimes
  Hirsute                          Fix Released  Undecided   Frank Heimes

Bug Description

(Suspected glibc issue!)

qemu-system-ppc64(le) fails when invoked with kvm acceleration with error "illegal instruction"

> qemu-system-ppc64(le) -M pseries,accel=kvm

Illegal instruction (core dumped)

In dmesg:

Facility 'SCV' unavailable (12), exception at 0x7624f8134c0c, MSR=900000000280f033
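
The dmesg line above can be taken apart mechanically. The following is a minimal Python sketch, not part of the original report, that splits the message into its fields and names the MSR bits that are set. The bit positions in MSR_BITS are an assumption based on the 64-bit Book3S layout as named in the Linux kernel headers, and the function names are hypothetical.

```python
import re

# Assumed MSR bit positions (counted from the least significant bit).
MSR_BITS = {
    63: "SF", 60: "HV", 25: "VEC", 23: "VSX", 15: "EE", 14: "PR",
    13: "FP", 12: "ME", 5: "IR", 4: "DR", 1: "RI", 0: "LE",
}

FACILITY_RE = re.compile(
    r"Facility '(\w+)' unavailable \((\d+)\), "
    r"exception at (0x[0-9a-fA-F]+), MSR=([0-9a-fA-F]+)"
)


def parse_facility_line(line: str) -> dict:
    """Split a 'Facility ... unavailable' kernel message into its fields."""
    match = FACILITY_RE.search(line)
    if match is None:
        raise ValueError("not a facility-unavailable message")
    return {
        "facility": match.group(1),
        "number": int(match.group(2)),
        "address": int(match.group(3), 16),
        "msr": int(match.group(4), 16),
    }


def msr_flags(msr: int) -> list:
    """Return the names of the known MSR bits set in msr, highest bit first."""
    return [name for bit, name in sorted(MSR_BITS.items(), reverse=True)
            if (msr >> bit) & 1]
```

For the line in this report, the set bits include both HV and PR, which suggests the exception was taken in host userspace (problem state) rather than inside a guest.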

Version-Release number of selected component (if applicable):
qemu 5.2.0
Linux kernel 5.11
glibc 2.33
all latest updates as of submitting the bug report

How reproducible:
Always

Steps to Reproduce:
1. Run qemu with kvm acceleration

Actual results:
Illegal instruction

Expected results:
Normal VM execution

Additional info:
The machine is a Raptor Talos II Lite with a Sforza V1 8-core, but was also observed on a Raptor Blackbird with the same processor.

This was also observed on Fedora 34 beta, which uses glibc 2.33
Also tested on ArchPOWER (unofficial port of Arch Linux for ppc64le) with glibc 2.33
Fedora 33 and Ubuntu 20.10, both using glibc 2.32, do not have this issue, and downgrading the Linux kernel from 5.11 to 5.4 LTS on ArchPOWER solved the problem. Kernels 5.9 and 5.10 have the same issue when combined with glibc 2.33.
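
The observations above boil down to a simple rule of thumb. The sketch below encodes it in Python; the thresholds (glibc 2.33 or newer, kernel 5.9 or newer) are inferred only from the combinations tested in this report, a POWER9 host is assumed, and `likely_affected` and `has_fscr_fix` are hypothetical names.

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '5.11' into a comparable tuple (5, 11)."""
    return tuple(int(part) for part in text.split("."))


def likely_affected(glibc: str, kernel: str, has_fscr_fix: bool = False) -> bool:
    """Guess whether KVM on a POWER9 host hits the crash from this report.

    Assumption: the crash needs a glibc that issues scv (2.33 or newer) and
    a kernel from 5.9 onward that does not carry a fix.
    """
    uses_scv = parse_version(glibc) >= (2, 33)
    kernel_in_range = parse_version(kernel) >= (5, 9)
    return uses_scv and kernel_in_range and not has_fscr_fix


for glibc, kernel in [("2.33", "5.11"), ("2.32", "5.11"), ("2.33", "5.4")]:
    print(glibc, kernel, likely_affected(glibc, kernel))
```

The three combinations printed here correspond to the failing Ubuntu 21.04 setup, the working glibc 2.32 distributions, and the ArchPOWER kernel downgrade.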

ProblemType: Bug
DistroRelease: Ubuntu 21.04
Package: qemu-system 1:5.2+dfsg-6ubuntu2
ProcVersionSignature: Ubuntu 5.11.0-11.12-generic 5.11.0
Uname: Linux 5.11.0-11-generic ppc64le
.sys.firmware.opal.msglog: Error: [Errno 13] Permission denied: '/sys/firmware/opal/msglog'
ApportVersion: 2.20.11-0ubuntu60
Architecture: ppc64el
CasperMD5CheckResult: pass
CurrentDesktop: Unity:Unity7:ubuntu
Date: Mon Mar 22 14:48:39 2021
InstallationDate: Installed on 2021-03-22 (0 days ago)
InstallationMedia: Ubuntu-Server 21.04 "Hirsute Hippo" - Alpha ppc64el (20210321)
KvmCmdLine: COMMAND STAT EUID RUID PID PPID %CPU COMMAND
ProcKernelCmdLine: root=UUID=f3d03315-0944-4a02-9c87-09c00eba9fa1 ro
ProcLoadAvg: 1.20 0.73 0.46 1/1054 6071
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.11.0-11-generic (buildd@bos02-ppc64el-002) (gcc (Ubuntu 10.2.1-20ubuntu1) 10.2.1 20210220, GNU ld (GNU Binutils for Ubuntu) 2.36.1) #12-Ubuntu SMP Mon Mar 1 19:26:20 UTC 2021
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
acpidump:

cpu_cores: Number of cores present = 8
cpu_coreson: Number of cores online = 8
cpu_smt: SMT=4

sadoon albader (sadoonalbader) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

Since this seems to be broken on all Distributions as soon as the triggering
combination of kernel/glibc is present I think we'd want to open that up to
upstream qemu for a wider discussion and to also hit the ppc64 architecture
experts.

Furthermore, I'm not entirely sure this needs to be fixed in qemu; it might be the case that a fix is needed in glibc instead.

Therefore I'm adding a qemu (upstream) bug task for now to have the bug reported there as well (which is worthwhile for awareness anyway) - but chances are that after some debugging it will turn out to be a glibc issue instead.

If only I could break this test out of the kvm ioctl into something simpler, we could properly file it against glibc ....

Christian Ehrhardt  (paelzer) wrote :

Hi Sadoon,
thanks for the report!
There isn't much to find about this issue yet.
One automatic syscaller crash report [1].
On the emulation side there is [2][3].

On the glibc side we have [4][5] adding the use of it with [6] being a fix.
All those seem to be in glibc 2.33 - so I'd expect with [6] it should only
be issued on power9 which in turn should HW-support the instruction.

I was trying to recreate this on power8 and power9 machines.
As expected on power8 just nothing happens (the instruction isn't used due to [6]).
TBH I first wondered if these Sforza chips [7][8][9] you mentioned are
fully identical to a classic IBM p9 box - but I was indeed able to reproduce
the issue just fine on an IBM-sold P9
dmesg:
[ 1516.438442] Facility 'SCV' unavailable (12), exception at 0x76c9f84c49a0, MSR=900000000280f033
[ 1516.438472] qemu-system-ppc[42884]: illegal instruction (4) at 76c9f84c49a0 nip 76c9f84c49a0 lr 1f12839d9f0 code 1 in libc-2.33.so[76c9f8380000+220000]
[ 1516.438489] qemu-system-ppc[42884]: code: e8010010 7c0803a6 4e800020 60420000 7ca42b78 4bffed65 60000000 38210020
[ 1516.438493] qemu-system-ppc[42884]: code: e8010010 7c0803a6 4e800020 60420000 <44000001> 4bffffb8 60000000 60420000
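
The word in angle brackets on the second `code:` line is the instruction at the faulting address. As a small sketch (not from the original comment), the Python below extracts that word and checks whether it is an scv instruction. The field layout used here (primary opcode 17 in the top six bits, LEV in bits 5 to 11 counted from the least significant bit, low two bits equal to 01) is my reading of the Power ISA encoding and should be verified against the ISA document; the function names are hypothetical.

```python
import re


def marked_word(code_line: str) -> int:
    """Return the instruction word the kernel marked with <...>."""
    match = re.search(r"<([0-9a-fA-F]{8})>", code_line)
    if match is None:
        raise ValueError("no marked instruction word in line")
    return int(match.group(1), 16)


def decode_scv(word: int):
    """Return the LEV field if word looks like 'scv LEV', else None.

    Assumed encoding: opcode 17, LEV at bits 5..11, low two bits 0b01.
    Reserved fields are not checked in this sketch.
    """
    if (word >> 26) == 17 and (word & 0x3) == 0x1:
        return (word >> 5) & 0x7F
    return None


line = ("[ 1516.438493] qemu-system-ppc[42884]: code: e8010010 7c0803a6 "
        "4e800020 60420000 <44000001> 4bffffb8 60000000 60420000")
word = marked_word(line)
print(hex(word), "scv level:", decode_scv(word))
```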

The chip I used for this test is:
Model: 2.2 (pvr 004e 1202)
Model name: POWER9, altivec supported

The crash happens in the syscall issued by ioctl:
(gdb) bt
#0 __GI___ioctl (fd=<optimized out>, request=536915584) at ../sysdeps/unix/sysv/linux/powerpc/ioctl.c:56
#1 0x00000cb63ef7d9f0 in kvm_vcpu_ioctl (cpu=cpu@entry=0x7d0f48010010, type=type@entry=536915584) at ../../accel/kvm/kvm-all.c:2654
#2 0x00000cb63ef7dbdc in kvm_cpu_exec (cpu=0x7d0f48010010) at ../../accel/kvm/kvm-all.c:2491
#3 0x00000cb63ee78344 in kvm_vcpu_thread_fn (arg=0x7d0f48010010) at ../../accel/kvm/kvm-cpus.c:49
#4 0x00000cb63f1d14bc in qemu_thread_start (args=<optimized out>) at ../../util/qemu-thread-posix.c:521
#5 0x00007d0f4ac69114 in start_thread (arg=0x7d0f23dfe720) at pthread_create.c:473
#6 0x00007d0f4ab755c0 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:103
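
The `request=536915584` value in frame #0 can be decoded to see which ioctl was being issued. This is a minimal sketch assuming the powerpc ioctl number layout from the kernel's asm/ioctl.h (8-bit number, 8-bit type, 13-bit size, 3-bit direction) and KVM's type byte 0xAE with KVM_RUN defined as number 0x80; the helper name is hypothetical.

```python
KVMIO = 0xAE        # ioctl type byte used by KVM (from linux/kvm.h)
KVM_RUN_NR = 0x80   # KVM_RUN is _IO(KVMIO, 0x80)
PPC_IOC_NONE = 1    # assumed: powerpc encodes "no data transfer" as 1


def decode_ppc_ioctl(request: int) -> dict:
    """Split an ioctl request number using the assumed powerpc bit layout."""
    return {
        "nr": request & 0xFF,               # bits 0-7
        "type": (request >> 8) & 0xFF,      # bits 8-15
        "size": (request >> 16) & 0x1FFF,   # bits 16-28
        "dir": (request >> 29) & 0x7,       # bits 29-31
    }


fields = decode_ppc_ioctl(536915584)
is_kvm_run = (fields["type"] == KVMIO and fields["nr"] == KVM_RUN_NR
              and fields["size"] == 0 and fields["dir"] == PPC_IOC_NONE)
print(hex(536915584), fields, "KVM_RUN" if is_kvm_run else "other")
```

Under these assumptions the request is KVM_RUN, which matches the `kvm_cpu_exec` frame in the backtrace: the crash happens when the vCPU thread asks the kernel to enter the guest.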

And jumping into the code of the __GI___ioctl we can clearly see
the scv instruction is indeed there in the executed code path:

   0x7ffff66c4984 <__GI___ioctl+292> bl 0x7ffff66c36e8 <__GI___tcgetattr+8>
   0x7ffff66c4988 <__GI___ioctl+296> nop
   0x7ffff66c498c <__GI___ioctl+300> addi r1,r1,32
   0x7ffff66c4990 <__GI___ioctl+304> ld r0,16(r1)
   0x7ffff66c4994 <__GI___ioctl+308> mtlr r0
   0x7ffff66c4998 <__GI___ioctl+312> blr
   0x7ffff66c499c <__GI___ioctl+316> ori r2,r2,0
  >0x7ffff66c49a0 <__GI___ioctl+320> scv 0

[1]: https://webcache.googleusercontent.com/search?q=cache:uS0jhPekyqMJ:https://syzkaller-ppc64.appspot.com/text%3Ftag%3DCrashReport%26x%3D17d99883000000+&cd=2&hl=de&ct=clnk&gl=uk
[2]: https://git.qemu.org/?p=qemu.git;a=commit;h=3c89b8d6ac5b8728cd7620f9885bd953edd18a11
[3]: https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg05425.html
[4]: https://sourceware.org/git/?p=glibc.git;a=commit;h=68ab82f56690ada86ac1e0c46bad06ba189a10ef
[5]: https://sourceware.o...

Christian Ehrhardt  (paelzer) wrote :

qemu calls this ioctl on ppc64 as:
  sysdeps/unix/sysv/linux/powerpc/ioctl.c
result = INLINE_SYSCALL (ioctl, 3, fd, request, arg);

The mapping of macros in sysdeps/unix/sysv/linux/powerpc/sysdep.h seems to be:
INTERNAL_SYSCALL -> INTERNAL_SYSCALL_NCS -> TRY_SYSCALL_SCV -> SYSCALL_SCV

#define SYSCALL_SCV(nr) \
  ({ \
    __asm__ __volatile__ \
      (".machine \"push\"\n\t" \
       ".machine \"power9\"\n\t" \
       "scv 0\n\t" \
       ".machine \"pop\"\n\t" \
       "0:" \
       : "=&r" (r0), \
         "=&r" (r3), "=&r" (r4), "=&r" (r5), \
         "=&r" (r6), "=&r" (r7), "=&r" (r8) \
       : ASM_INPUT_##nr \
       : "r9", "r10", "r11", "r12", \
         "lr", "ctr", "memory"); \
    r3; \
  })

Christian Ehrhardt  (paelzer) wrote :

[10] suggested using PPC_FEATURE2_SCV, but [4] already does just that.
In addition, [6] added power9 machine settings, as the instruction is only
available on this ISA - like:
+ .machine "push"
+ .machine "power9"
        scv 0
+ .machine "pop"

Maybe there is some generated "scv 0" left that needs the same [6] treatment?
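
To see what the kernel advertises to userspace, the PPC_FEATURE2_SCV bit can be read from the auxiliary vector. This is a hedged Python sketch rather than what glibc itself does internally: it assumes AT_HWCAP2 is key 26 and PPC_FEATURE2_SCV is 0x00100000 (as in the kernel's powerpc headers), and it calls the C library's `getauxval` through ctypes. The bit only carries this meaning on powerpc, and the helper names are hypothetical.

```python
import ctypes

AT_HWCAP2 = 26                  # auxiliary vector key (assumed from <elf.h>)
PPC_FEATURE2_SCV = 0x00100000   # assumed from the kernel's powerpc cputable.h


def hwcap2_has_scv(hwcap2: int) -> bool:
    """True if the given AT_HWCAP2 value advertises scv support."""
    return bool(hwcap2 & PPC_FEATURE2_SCV)


def read_hwcap2() -> int:
    """Read AT_HWCAP2 for the current process via the C library."""
    libc = ctypes.CDLL(None)
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    return libc.getauxval(AT_HWCAP2)
```

On a ppc64le host, `hwcap2_has_scv(read_hwcap2())` shows whether the kernel tells userspace that scv may be used. If it reports True and the process still gets SIGILL on "scv 0", the advertised capability and the actual facility state disagree.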

OTOH In a normal test program I can run "scv 0" just fine.
But not other scv levels (expected).

# cat test.c
#include <stdio.h>

int main() {
   printf("Hello scv 0\n");
   __asm__(
   "scv 0\n\t"
   );
   printf("survived\n");
   __asm__(
   "scv 1\n\t"
   );
   printf("survived level 1\n");
   return 0;
}
# gcc -Wall -o test test.c
# ./test
Hello scv 0
survived
Illegal instruction (core dumped)

IIRC .machine is only a pseudo-op for the assembler.
So it is correct that I can't see it in the live disassembly of gdb.

The failing "scv 0" from glibc's __GI___ioctl is
   0x00007ffff66c49a0 <+320>: 01 00 00 44 scv 0
And in my test program it is
   0x0000000100000848 <+44>: 01 00 00 44 scv 0

Hmm, this is the same opcode but fails in just one of the cases.
This might need someone who is more of a ppc64/glibc expert than me :-/

@Frank - could you modify this bug to become mirrored to IBM for their arch-expertise please?

Christian Ehrhardt  (paelzer) wrote :

As my other repro-code didn't trigger the issue I looked at qemu again and found that before the failing ioctl->scv call there are plenty of other calls, some of them very similar (the same?), that work just fine.

I wonder if on guest setup qemu (or e.g. the rom we load) might set some arch-bits or such which then break the next "scv 0" call.

I attached the full ioctl log here.

Christian Ehrhardt  (paelzer) wrote :

I might be spoiled by the s390x-POP style of defining instructions, but unfortunately the following doc about the PowerISA has no list of reasons-for-SIGILL. Therefore I'm out of options on this and am waiting for someone - most likely IBM - to chime in.

https://wiki.raptorcs.com/w/images/f/f5/PowerISA_public.v3.1.pdf

Laurent Vivier (laurent-vivier) wrote :

You need a kernel with the following fix for POWER9:

commit 25edcc50d76c834479d11fcc7de46f3da4d95121
Author: Fabiano Rosas <email address hidden>
Date: Thu Feb 4 17:05:17 2021 -0300

    KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path

    The Facility Status and Control Register is a privileged SPR that
    defines the availability of some features in problem state. Since it
    can be written by the guest, we must restore it to the previous host
    value after guest exit.

    This restoration is currently done by taking the value from
    current->thread.fscr, which in the P9 path is not enough anymore
    because the guest could context switch the QEMU thread, causing the
    guest-current value to be saved into the thread struct.

    The above situation manifested when running a QEMU linked against a
    libc with System Call Vectored support, which causes scv
    instructions to be run by QEMU early during the guest boot (during
    SLOF), at which point the FSCR is 0 due to guest entry. After a few
    scv calls (1 to a couple hundred), the context switching happens and
    the QEMU thread runs with the guest value, resulting in a Facility
    Unavailable interrupt.

    This patch saves and restores the host value of FSCR in the inner
    guest entry loop in a way independent of current->thread.fscr. The old
    way of doing it is still kept in place because it works for the old
    entry path.

    Signed-off-by: Fabiano Rosas <email address hidden>
    Signed-off-by: Paul Mackerras <email address hidden>
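
The sequence described in the commit message can be illustrated with a toy model. The Python below is only a sketch of the save/restore ordering, not kernel code: it assumes facility number 12 from the dmesg line corresponds to FSCR bit 12, and `Machine`, `run_guest` and `scv_allowed` are hypothetical names.

```python
from dataclasses import dataclass

FSCR_SCV = 1 << 12   # assumed: facility 12 ('SCV') is FSCR bit 12


@dataclass
class Machine:
    hw_fscr: int       # live FSCR register on the CPU
    thread_fscr: int   # copy kept in current->thread.fscr


def run_guest(m: Machine, switched: bool, fixed: bool) -> int:
    """Model one guest entry/exit and return the FSCR seen by QEMU afterwards."""
    host_fscr = m.hw_fscr          # fixed path: remember the host value up front
    m.hw_fscr = 0                  # guest entry: FSCR is 0 early in guest boot
    if switched:
        m.thread_fscr = m.hw_fscr  # context switch saves the guest-current value
    # Old path restores from the thread struct, fixed path from the saved copy.
    m.hw_fscr = host_fscr if fixed else m.thread_fscr
    return m.hw_fscr


def scv_allowed(fscr: int) -> bool:
    """True if userspace may execute scv with this FSCR value."""
    return bool(fscr & FSCR_SCV)


for switched, fixed in [(False, False), (True, False), (True, True)]:
    fscr = run_guest(Machine(FSCR_SCV, FSCR_SCV), switched, fixed)
    print(f"switched={switched} fixed={fixed} scv_allowed={scv_allowed(fscr)}")
```

In this model only the middle case (a context switch on the old restore path) leaves QEMU with the SCV bit cleared, which mirrors why the crash appeared after a varying number of successful scv calls.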

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Frank Heimes (fheimes) wrote :

Thx Laurent, I took the hirsute master-next source and cherry-picked the patch and it applied cleanly.
Now I kicked off a kernel build of this patched kernel in the following PPA:
https://launchpad.net/~fheimes/+archive/ubuntu/lp1920784
(however, the builds will take some time to complete)

If it can be proven that this patched kernel fixes the problem, I can go ahead and work on a patch submission for hirsute/21.04. (kernel freeze is April 8th)

Changed in ubuntu-power-systems:
status: New → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Frank Heimes (fheimes)
sadoon albader (sadoonalbader) wrote :

The guys on the Fedora side seem to have found the patch to fix this:

https://bugzilla.redhat.com/show_bug.cgi?id=1941652#c6

Apparently it will go upstream in Linux 5.11, but earlier versions also need the fix, specifically 5.9 and 5.10

Thank you!

Christian Ehrhardt  (paelzer) wrote :

@Sadoon - yes, that is the same fix that Laurent pointed to a few hours before.

@Frank - the kernel I had before was 5.11.0-11-generic (failing). I've tested "5.11.0-13-generic #14~lp1920784" from your PPA and can confirm that this fixes the issue.

Thanks Laurent for identifying the fix and thanks Frank for the kernel.
I'll mark bug tasks accordingly and @Frank you'll let me know if there is anything else you need to drive this to completion.

Changed in qemu:
status: New → Invalid
Changed in glibc (Ubuntu):
status: New → Invalid
Changed in qemu (Ubuntu):
status: Confirmed → Invalid
Christian Ehrhardt  (paelzer) wrote :

And gladly this was only added in >=5.9, and we have Groovy (5.8) and Hirsute (5.11), so only the Hirsute kernel needs to be adapted; further backports are not needed.

Frank Heimes (fheimes) wrote :

The fix was sent to the kernel teams mailing list:
https://lists.ubuntu.com/archives/kernel-team/2021-March/thread.html#118449

Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in ubuntu-power-systems:
status: Confirmed → In Progress
Thomas Huth (th-huth)
no longer affects: qemu
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.11.0-14.15

---------------
linux (5.11.0-14.15) hirsute; urgency=medium

  * hirsute/linux: 5.11.0-14.15 -proposed tracker (LP: #1923103)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Include Infiniband Peer Memory interface (LP: #1923104)
    - SAUCE: RDMA/core: Introduce peer memory interface

  * Hirsute update: v5.11.12 upstream stable release (LP: #1923069)
    - arm64: mm: correct the inside linear map range during hotplug check
    - virtiofs: Fail dax mount if device does not support it
    - ext4: shrink race window in ext4_should_retry_alloc()
    - ext4: fix bh ref count on error paths
    - fs: nfsd: fix kconfig dependency warning for NFSD_V4
    - rpc: fix NULL dereference on kmalloc failure
    - iomap: Fix negative assignment to unsigned sis->pages in
      iomap_swapfile_activate
    - ASoC: rt1015: fix i2c communication error
    - ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10
    - ASoC: rt5651: Fix dac- and adc- vol-tlv values being off by a factor of 10
    - ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe
    - ASoC: es8316: Simplify adc_pga_gain_tlv table
    - ASoC: soc-core: Prevent warning if no DMI table is present
    - ASoC: cs42l42: Fix Bitclock polarity inversion
    - ASoC: cs42l42: Fix channel width support
    - ASoC: cs42l42: Fix mixer volume control
    - ASoC: cs42l42: Always wait at least 3ms after reset
    - NFSD: fix error handling in NFSv4.0 callbacks
    - ASoC: mediatek: mt8192: fix tdm out data is valid on rising edge
    - kernel: freezer should treat PF_IO_WORKER like PF_KTHREAD for freezing
    - vhost: Fix vhost_vq_reset()
    - io_uring: fix ->flags races by linked timeouts
    - io_uring: halt SQO submission on ctx exit
    - scsi: st: Fix a use after free in st_open()
    - scsi: qla2xxx: Fix broken #endif placement
    - staging: comedi: cb_pcidas: fix request_irq() warn
    - staging: comedi: cb_pcidas64: fix request_irq() warn
    - ASoC: rt5659: Update MCLK rate in set_sysclk()
    - ASoC: rt711: add snd_soc_component remove callback
    - thermal/core: Add NULL pointer check before using cooling device stats
    - locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling
    - locking/ww_mutex: Fix acquire/release imbalance in
      ww_acquire_init()/ww_acquire_fini()
    - nvmet-tcp: fix kmap leak when data digest in use
    - io_uring: imply MSG_NOSIGNAL for send[msg]()/recv[msg]() calls
    - Revert "PM: ACPI: reboot: Use S5 for reboot"
    - nouveau: Skip unvailable ttm page entries
    - static_call: Align static_call_is_init() patching condition
    - ext4: do not iput inode under running transaction in ext4_rename()
    - io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with
      MSG_WAITALL
    - net: mvpp2: fix interrupt mask/unmask skip condition
    - mptcp: deliver ssk errors to msk
    - mptcp: fix poll after shutdown
    - mptcp: init mptcp request socket earlier
    - mptcp: add a missing retransmission timer scheduling
    - flow_dissector: fix TTL and TOS dissection on IPv4 fragments
    - mptcp: fix DATA_FIN processing f...

Changed in linux (Ubuntu Hirsute):
status: In Progress → Fix Released
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Released
Frank Heimes (fheimes)
no longer affects: qemu (Ubuntu Hirsute)
no longer affects: qemu (Ubuntu)
no longer affects: glibc (Ubuntu Hirsute)
no longer affects: glibc (Ubuntu)