Azure: Jammy fio test hangs, swiotlb buffers exhausted

Bug #1998838 reported by Tim Gardner
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
linux-azure (Ubuntu)  New           Undecided   Unassigned
  Jammy               Fix Released  Critical    Tim Gardner
  Kinetic             Fix Released  High        Tim Gardner

Bug Description

SRU Justification

[Impact]
Hello Canonical Team,

This issue was found while doing the validation on CPC's Jammy CVM image. We are up against a tight timeline to deliver this to a partner on 10/5. Would appreciate prioritizing this.

While running fio, the command fails to exit after its 2-minute runtime has elapsed. I watched `top` while the command hung and saw kworkers getting blocked.

sudo fio --ioengine=libaio --bs=4K --filename=/dev/sdc1:/dev/sdd1:/dev/sde1:/dev/sdf1:/dev/sdg1:/dev/sdh1:/dev/sdi1:/dev/sdj1:/dev/sdk1:/dev/sdl1:/dev/sdm1:/dev/sdn1:/dev/sdo1:/dev/sdp1:/dev/sdq1:/dev/sdr1 --readwrite=randwrite --runtime=120 --iodepth=1 --numjob=96 --name=iteration9 --direct=1 --size=8192M --group_reporting --overwrite=1
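
The long device list in the command above is easy to mistype. As a minimal sketch, the same command line could be assembled programmatically. The function name `build_fio_command` and its parameters are hypothetical and chosen for illustration. The sketch uses fio's documented option spelling `--numjobs`, whereas the report spells it `--numjob`.

```python
import shlex


def build_fio_command(first="c", last="r", runtime=120, numjobs=96,
                      name="iteration9"):
    """Assemble the fio command from the report as an argument list.

    Targets the first partition of /dev/sd<first> through /dev/sd<last>,
    which is 16 disks (sdc..sdr) with the defaults.
    """
    devices = [f"/dev/sd{chr(c)}1" for c in range(ord(first), ord(last) + 1)]
    return [
        "fio",
        "--ioengine=libaio",
        "--bs=4K",
        "--filename=" + ":".join(devices),  # fio separates files with ':'
        "--readwrite=randwrite",
        f"--runtime={runtime}",
        "--iodepth=1",
        f"--numjobs={numjobs}",
        f"--name={name}",
        "--direct=1",
        "--size=8192M",
        "--group_reporting",
        "--overwrite=1",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; it overwrites the listed disks.
    print("sudo " + shlex.join(build_fio_command()))
```

The list form can be passed directly to `subprocess` without shell quoting concerns.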

Example system logs:
---------------------------------------------------------------------------------------------------------------
[ 1096.297641] INFO: task kworker/u192:0:8 blocked for more than 120 seconds.
[ 1096.302785] Tainted: G W 5.15.0-1024-azure #30-Ubuntu
[ 1096.306312] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1096.310489] INFO: task jbd2/sda1-8:1113 blocked for more than 120 seconds.
[ 1096.313900] Tainted: G W 5.15.0-1024-azure #30-Ubuntu
[ 1096.317481] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1096.324117] INFO: task systemd-journal:1191 blocked for more than 120 seconds.
[ 1096.331219] Tainted: G W 5.15.0-1024-azure #30-Ubuntu
[ 1096.335332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
---------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------
[ 3241.013230] systemd-udevd[1221]: sdl1: Worker [6686] processing SEQNUM=13323 killed
[ 3261.492691] systemd-udevd[1221]: sdl1: Worker [6686] failed
---------------------------------------------------------------------------------------------------------------
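
When triaging logs like the ones above, it can help to pull out which tasks the hung-task detector flagged. The following is a small sketch, assuming the message format matches the lines quoted in this report. The names `HUNG_TASK` and `blocked_tasks` are made up for this example.

```python
import re

# Matches lines such as:
# [ 1096.297641] INFO: task kworker/u192:0:8 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)

SAMPLE = """\
[ 1096.297641] INFO: task kworker/u192:0:8 blocked for more than 120 seconds.
[ 1096.302785] Tainted: G W 5.15.0-1024-azure #30-Ubuntu
[ 1096.310489] INFO: task jbd2/sda1-8:1113 blocked for more than 120 seconds.
[ 1096.324117] INFO: task systemd-journal:1191 blocked for more than 120 seconds.
"""


def blocked_tasks(log_text):
    """Return (timestamp, task name, pid) for each hung-task report."""
    found = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            # The task name may itself contain ':' (kworker/u192:0), so the
            # pid is taken as the final ':<digits>' before " blocked".
            found.append((float(match["ts"]), match["task"], int(match["pid"])))
    return found


if __name__ == "__main__":
    for ts, task, pid in blocked_tasks(SAMPLE):
        print(f"{ts:.6f} {task} (pid {pid})")
```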

TOP report:
---------------------------------------------------------------------------------------------------------------
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
417 root 20 0 0 0 0 R 66.2 0.0 0:34.61 ksoftirqd/59
435 root 20 0 0 0 0 I 24.5 0.0 0:09.03 kworker/59:1-mm_percpu_wq
416 root rt 0 0 0 0 S 23.5 0.0 0:01.86 migration/59
366 root 0 -20 0 0 0 I 19.2 0.0 0:16.64 kworker/49:1H-kblockd
378 root 0 -20 0 0 0 I 17.9 0.0 0:15.71 kworker/51:1H-kblockd
455 root 0 -20 0 0 0 I 17.9 0.0 0:14.76 kworker/62:1H-kblockd
135 root 0 -20 0 0 0 I 17.5 0.0 0:13.08 kworker/17:1H-kblockd
420 root 0 -20 0 0 0 I 16.9 0.0 0:14.63 kworker/58:1H-kblockd
...
---------------------------------------------------------------------------------------------------------------

LISAv3 Testcase: perf_premium_datadisks_4k
Image : "canonical-test 0001-com-ubuntu-confidential-vm-jammy-preview 22_04-lts-cvm latest"
VMSize : "Standard_DC96ads_v5"

Reproducibility: I see this every time I run the storage perf tests. It always seems to happen on iteration 9 or 10. When running the command manually, I had to run it three or four times to reproduce the issue.
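
Because the hang only shows up after several runs, a reproduction attempt could be wrapped in a loop with a watchdog. This is a sketch only: `run_hangs` and `first_hanging_iteration` are hypothetical helpers, the time limit is an assumption, and the demo command is a harmless placeholder rather than the real fio invocation.

```python
import subprocess
import sys


def run_hangs(cmd, limit_s):
    """Return True if cmd is still running after limit_s seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        proc.wait(timeout=limit_s)
        return False
    except subprocess.TimeoutExpired:
        # Best-effort kill. Do not wait again: a process stuck in
        # uninterruptible I/O may never be reaped.
        proc.kill()
        return True


def first_hanging_iteration(cmd, limit_s, max_iterations=10):
    """Run cmd repeatedly; return the 1-based iteration that hung, or None."""
    for iteration in range(1, max_iterations + 1):
        if run_hangs(cmd, limit_s):
            return iteration
    return None


if __name__ == "__main__":
    # Placeholder command; substitute the fio argument list, with a limit
    # comfortably above fio's own --runtime (for example 180 s for 120 s).
    demo = [sys.executable, "-c", "pass"]
    print(first_hanging_iteration(demo, limit_s=30, max_iterations=3))
```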

[Test Case]

Microsoft tested. Reproducing the hang requires many cores (96) and disks (16).

[Where things could go wrong]

swiotlb buffers could be double freed.
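
One way to watch for either exhaustion or broken accounting is to track how much of the swiotlb pool is in use over time. The sketch below assumes the kernel exposes `io_tlb_nslabs` and `io_tlb_used` under `/sys/kernel/debug/swiotlb` (debugfs mounted, root access). These paths are an assumption and are not confirmed by this report. The 2 KiB slab size follows from `IO_TLB_SHIFT` being 11 in the kernel source.

```python
from pathlib import Path

SLAB_BYTES = 2048  # 1 << IO_TLB_SHIFT, with IO_TLB_SHIFT == 11
DEBUGFS_DIR = Path("/sys/kernel/debug/swiotlb")  # assumed location


def swiotlb_usage(nslabs, used):
    """Return (used_bytes, total_bytes, fraction_used) for slab counts."""
    total_bytes = nslabs * SLAB_BYTES
    used_bytes = used * SLAB_BYTES
    fraction = used / nslabs if nslabs else 0.0
    return used_bytes, total_bytes, fraction


def read_counters(directory=DEBUGFS_DIR):
    """Read the total and in-use slab counts from debugfs (assumed files)."""
    nslabs = int((directory / "io_tlb_nslabs").read_text())
    used = int((directory / "io_tlb_used").read_text())
    return nslabs, used


if __name__ == "__main__":
    try:
        used_bytes, total_bytes, fraction = swiotlb_usage(*read_counters())
        print(f"{used_bytes} of {total_bytes} bytes in use ({fraction:.1%})")
    except OSError as exc:
        print(f"swiotlb counters not readable: {exc}")
```

A used count that keeps climbing while the system is idle would be consistent with a leak. A count that drops below what outstanding I/O should need would hint at the double free mentioned above.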

[Other Info]

SF: #00349781

Tim Gardner (timg-tpi) wrote :

https://<email address hidden>/ is considered a root cause fix.

affects: linux (Ubuntu) → linux-azure (Ubuntu)
Changed in linux-azure (Ubuntu Jammy):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → Critical
status: New → In Progress
Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu Kinetic):
assignee: nobody → Tim Gardner (timg-tpi)
importance: Undecided → High
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.15.0-1029.36

---------------
linux-azure (5.15.0-1029.36) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1029.36 -proposed tracker (LP: #1998845)

  * Azure: Jammy fio test hangs, swiotlb buffers exhausted (LP: #1998838)
    - SAUCE: scsi: storvsc: Fix swiotlb bounce buffer leak in confidential VM

linux-azure (5.15.0-1027.33) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1027.33 -proposed tracker (LP: #1997044)

  [ Ubuntu: 5.15.0-56.62 ]

  * jammy/linux: 5.15.0-56.62 -proposed tracker (LP: #1997079)
  * CVE-2022-3566
    - tcp: Fix data races around icsk->icsk_af_ops.
  * CVE-2022-3567
    - ipv6: annotate some data-races around sk->sk_prot
    - ipv6: Fix data races around sk->sk_prot.
  * CVE-2022-3621
    - nilfs2: fix NULL pointer dereference at nilfs_bmap_lookup_at_level()
  * CVE-2022-3564
    - Bluetooth: L2CAP: Fix use-after-free caused by l2cap_reassemble_sdu
  * CVE-2022-3524
    - tcp/udp: Fix memory leak in ipv6_renew_options().
  * CVE-2022-3565
    - mISDN: fix use-after-free bugs in l1oip timer handlers
  * CVE-2022-3594
    - r8152: Rate limit overflow messages
  * CVE-2022-43945
    - SUNRPC: Fix svcxdr_init_decode's end-of-buffer calculation
    - SUNRPC: Fix svcxdr_init_encode's buflen calculation
    - NFSD: Protect against send buffer overflow in NFSv2 READDIR
    - NFSD: Protect against send buffer overflow in NFSv3 READDIR
    - NFSD: Protect against send buffer overflow in NFSv2 READ
    - NFSD: Protect against send buffer overflow in NFSv3 READ
    - NFSD: Remove "inline" directives on op_rsize_bop helpers
    - NFSD: Cap rsize_bop result based on send buffer size
  * CVE-2022-42703
    - mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuse
  * 5.15.0-53-generic no longer boots (LP: #1996740)
    - drm/amd/display: Add helper for blanking all dp displays

linux-azure (5.15.0-1024.30) jammy; urgency=medium

  * jammy/linux-azure: 5.15.0-1024.30 -proposed tracker (LP: #1996817)

  * Azure: Jammy fio test causes panic (LP: #1996806)
    - scsi: storvsc: Fix unsigned comparison to zero

 -- Tim Gardner <email address hidden> Mon, 05 Dec 2022 11:54:13 -0700

Changed in linux-azure (Ubuntu Jammy):
status: Fix Committed → Fix Released
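
To confirm that a running VM already carries the fix, the kernel release string can be compared against the first fixed builds named in the changelogs of this report (5.15.0-1029 for Jammy, 5.19.0-1015 for Kinetic). This is a rough sketch; `has_fix` is a hypothetical helper, and it assumes release strings of the form `5.15.0-1024-azure` as seen in the logs above.

```python
import platform
import re

# First linux-azure ABI numbers carrying the storvsc swiotlb fix,
# taken from the changelogs quoted in this report.
FIXED_ABI = {"5.15.0": 1029, "5.19.0": 1015}


def has_fix(release):
    """Return True/False for known series, None if the release is unknown."""
    match = re.fullmatch(r"(\d+\.\d+\.\d+)-(\d+)-azure", release)
    if not match or match[1] not in FIXED_ABI:
        return None
    return int(match[2]) >= FIXED_ABI[match[1]]


if __name__ == "__main__":
    release = platform.release()
    print(release, has_fix(release))
```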
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.15.0-1030.37 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-azure verification-needed-jammy
Tim Gardner (timg-tpi) wrote :

Microsoft tested and accepted. Marking verification-done-jammy

tags: added: verification-done-jammy
removed: verification-needed-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure/5.19.0-1016.17 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-kinetic' to 'verification-done-kinetic'. If the problem still exists, change the tag 'verification-needed-kinetic' to 'verification-failed-kinetic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-kinetic-linux-azure verification-needed-kinetic
Tim Gardner (timg-tpi)
tags: added: verification-done-kinetic
removed: verification-needed-kinetic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 5.19.0-1016.17

---------------
linux-azure (5.19.0-1016.17) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1016.17 -proposed tracker (LP: #1999735)

  * Packaging resync (LP: #1786013)
    - [Packaging] update helper scripts

  [ Ubuntu: 5.19.0-28.29 ]

  * kinetic/linux: 5.19.0-28.29 -proposed tracker (LP: #1999746)
  * mm:vma05 in ubuntu_ltp fails with '[vdso] bug not patched' on kinetic/linux
    5.19.0-27.28 (LP: #1999094)
    - fix coredump breakage

linux-azure (5.19.0-1015.16) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1015.16 -proposed tracker (LP: #1999417)

  * Azure: Jammy fio test hangs, swiotlb buffers exhausted (LP: #1998838)
    - SAUCE: scsi: storvsc: Fix swiotlb bounce buffer leak in confidential VM

  * Azure: MANA New Feature MANA XDP_Redirect Action (LP: #1998351)
    - net: mana: Add support of XDP_REDIRECT action

linux-azure (5.19.0-1014.15) kinetic; urgency=medium

  * kinetic/linux-azure: 5.19.0-1014.15 -proposed tracker (LP: #1997782)

  * Jammy/linux-azure: CONFIG_BLK_DEV_FD=n (LP: #1972017)
    - [Config] azure: CONFIG_BLK_DEV_FD=n

  * remove circular dep between linux-image and modules (LP: #1989334)
    - [Packaging] remove circular dep between modules and image

  * [Azure] [NVMe] cpu soft lockup issue when run fio against nvme disks
    (LP: #1995408)
    - PCI: hv: Only reuse existing IRTE allocation for Multi-MSI

  * [Azure][Arm64] Unable to detect all VF nics / Failing provisioning
    (LP: #1996117)
    - PCI: hv: Fix the definition of vector in hv_compose_msi_msg()

  * Kinetic update: v5.19.9 upstream stable release (LP: #1994068) // Kinetic
    update: v5.19.15 upstream stable release (LP: #1994078) // Kinetic update:
    v5.19.17 upstream stable release (LP: #1994179)
    - [Configs] azure: Updates after rebase

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/master)

  [ Ubuntu: 5.19.0-27.28 ]

  * kinetic/linux: 5.19.0-27.28 -proposed tracker (LP: #1997794)
  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.11.14)
  * selftests/.../nat6to4 breaks the selftests build (LP: #1996536)
    - [Config] Disable selftests/net/bpf/nat6to4
  * Expose built-in trusted and revoked certificates (LP: #1996892)
    - [Packaging] Expose built-in trusted and revoked certificates
  * support for same series backports versioning numbers (LP: #1993563)
    - [Packaging] sameport -- add support for sameport versioning
  * Add cs35l41 firmware loading support (LP: #1995957)
    - ASoC: cs35l41: Move cs35l41 exit hibernate function into shared code
    - ASoC: cs35l41: Add common cs35l41 enter hibernate function
    - ASoC: cs35l41: Do not print error when waking from hibernation
    - ALSA: hda: cs35l41: Don't dereference fwnode handle
    - ALSA: hda: cs35l41: Allow compilation test on non-ACPI configurations
    - ALSA: hda: cs35l41: Drop wrong use of ACPI_PTR()
    - ALSA: hda: cs35l41: Consolidate selections under SND_HDA_SCODEC_CS35L41
    - ALSA: hda: hda_cs_dsp_ctl: Add Library to support CS_DSP ALSA controls
    - ALSA: hda: hda_cs_dsp_c...

Changed in linux-azure (Ubuntu Kinetic):
status: Fix Committed → Fix Released