[SRU] bcache deadlock during read IO in writeback mode

Bug #1980925 reported by nikhil kshirsagar
This bug affects 1 person
Affects          Status        Importance  Assigned to  Milestone
linux (Ubuntu)   Invalid       Undecided   Unassigned
Focal            Invalid       Medium      Unassigned
Jammy            Fix Released  Medium      Unassigned

Bug Description

SRU Justification:

[Impact]

When random read I/O is started on a bcache device in writeback mode with a test such as:

fio --name=read_iops --directory=/home/ubuntu/bcache_mount/ --size=16G --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=128 --rw=randread --randrepeat=0

or random read/write I/O with a test such as:

fio --filename=/home/ubuntu/bcache_mount/cachedfile --size=15GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=128 --name=iops-test-job --randrepeat=0

the following hung-task trace is seen in the kernel log:

[ 4473.699902] INFO: task bcache_writebac:1835 blocked for more than 120 seconds.
[ 4474.050921] Not tainted 5.15.50-051550-generic #202206251445
[ 4474.350883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4474.731391] task:bcache_writebac state:D stack: 0 pid: 1835 ppid: 2 flags:0x00004000
[ 4474.731408] Call Trace:
[ 4474.731411] <TASK>
[ 4474.731413] __schedule+0x23d/0x5a0
[ 4474.731433] schedule+0x4e/0xb0
[ 4474.731436] rwsem_down_write_slowpath+0x220/0x3d0
[ 4474.731441] down_write+0x43/0x50
[ 4474.731446] bch_writeback_thread+0x78/0x320 [bcache]
[ 4474.731471] ? read_dirty_submit+0x70/0x70 [bcache]
[ 4474.731487] kthread+0x12a/0x150
[ 4474.731491] ? set_kthread_struct+0x50/0x50
[ 4474.731494] ret_from_fork+0x22/0x30
[ 4474.731499] </TASK>

The bug exists in kernels up to and including 5.15.50-051550-generic.

The reproducer is pasted below:

# uname -a
Linux bronzor 5.15.50-051550-generic #202206251445 SMP Sat Jun 25 14:51:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sdd 8:48 0 279.4G 0 disk
└─sdd1 8:49 0 60G 0 part
  └─bcache0 252:0 0 60G 0 disk /home/ubuntu/bcache_mount
nvme0n1 259:0 0 372.6G 0 disk
└─nvme0n1p1 259:2 0 15G 0 part
  └─bcache0 252:0 0 60G 0 disk /home/ubuntu/bcache_mount
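The lsblk output above does not show the cache mode itself. A minimal sketch for confirming that bcache0 is in writeback mode before running the test follows. It assumes the kernel's documented sysfs attribute /sys/block/bcache0/bcache/cache_mode, which lists all modes with the active one in square brackets. The helper name active_cache_mode is hypothetical and not part of any tool.

```shell
# Print the active bcache cache mode from sysfs-style text on stdin.
# Assumption: the input looks like "writethrough [writeback] writearound none",
# with the active mode enclosed in square brackets.
active_cache_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

Possible usage on the test machine: active_cache_mode < /sys/block/bcache0/bcache/cache_mode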

fio --name=read_iops --directory=/home/ubuntu/bcache_mount --size=12G --ioengine=libaio --direct=1 --verify=0 --bs=4K --iodepth=128 --rw=randread --group_reporting=1
read_iops: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
fio-3.28
Starting 1 process
read_iops: Laying out IO file (1 file / 12288MiB)

The test stops making progress after a few minutes, and the following trace then appears in the kernel log:

[ 4473.699902] INFO: task bcache_writebac:1835 blocked for more than 120 seconds.
[ 4474.050921] Not tainted 5.15.50-051550-generic #202206251445
[ 4474.350883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4474.731391] task:bcache_writebac state:D stack: 0 pid: 1835 ppid: 2 flags:0x00004000
[ 4474.731408] Call Trace:
[ 4474.731411] <TASK>
[ 4474.731413] __schedule+0x23d/0x5a0
[ 4474.731433] schedule+0x4e/0xb0
[ 4474.731436] rwsem_down_write_slowpath+0x220/0x3d0
[ 4474.731441] down_write+0x43/0x50
[ 4474.731446] bch_writeback_thread+0x78/0x320 [bcache]
[ 4474.731471] ? read_dirty_submit+0x70/0x70 [bcache]
[ 4474.731487] kthread+0x12a/0x150
[ 4474.731491] ? set_kthread_struct+0x50/0x50
[ 4474.731494] ret_from_fork+0x22/0x30
[ 4474.731499] </TASK>
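When re-running the reproducer, it can help to check the kernel log for this specific hung-task report rather than reading it by eye. The following is only a sketch. It matches the "INFO: task bcache_writebac:<pid> blocked for more than" line shown above, and the function name count_bcache_hung_tasks is hypothetical.

```shell
# Count hung-task reports for the bcache writeback thread in kernel log
# text supplied on stdin. Prints 0 when there are none.
count_bcache_hung_tasks() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function usable under "set -e".
    grep -c 'INFO: task bcache_writebac:[0-9]* blocked for more than' || true
}
```

Possible usage: dmesg | count_bcache_hung_tasks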

[Fix]
The following three commits are needed for the SRU:

dea3560e5f31965165bcf34ecf0b47af28bfd155
6445ec3df23f24677064a327dce437ef3e02dc6
dc60301fb408e06e0b718c0980cdd31d2b238bee

I built a 5.15.0-39-generic (jammy) kernel with these fixes applied and verified in testing that the problem no longer occurs.

[Regression Potential]

Regression potential should be minimal. I have not seen any potential drawbacks or harmful effects of this fix in my testing.

Changed in linux (Ubuntu):
milestone: none → jammy-updates
milestone: jammy-updates → none
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1980925

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Changed in linux (Ubuntu Jammy):
status: New → Incomplete
nikhil kshirsagar (nkshirsagar) wrote :

Logs have been pasted in the description of the bug.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Focal):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Jammy):
status: Incomplete → Confirmed
nikhil kshirsagar (nkshirsagar) wrote :

<removed this comment since it mentioned the fixes needed for the SRU, all of which are already now in the description>

description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Stefan Bader (smb) wrote :

The patches that introduced the problem both landed in v5.7 and have not been backported to v5.4.

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Changed in linux (Ubuntu Focal):
status: Confirmed → Invalid
Stefan Bader (smb) wrote :

Two of the three patches were already included in upstream v5.15.46. I updated the shared commits to refer to both reports and committed the third patch for the next cycle (the stable updates are also for the next cycle).

Changed in linux (Ubuntu Jammy):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-47.51

---------------
linux (5.15.0-47.51) jammy; urgency=medium

  * jammy/linux: 5.15.0-47.51 -proposed tracker (LP: #1983903)

  * Jammy update: v5.15.46 upstream stable release (LP: #1981864)
    - UBUNTU: [Packaging] Move python3-dev to build-depends

  * touchpad and touchscreen doesn't work at all on ACER Spin 5 (SP513-54N)
    (LP: #1884232)
    - x86/PCI: Eliminate remove_e820_regions() common subexpressions
    - x86: Log resource clipping for E820 regions
    - x86/PCI: Clip only host bridge windows for E820 regions
    - x86/PCI: Add kernel cmdline options to use/ignore E820 reserved regions
    - x86/PCI: Disable E820 reserved region clipping via quirks
    - x86/PCI: Revert "x86/PCI: Clip only host bridge windows for E820 regions"

  * [SRU][H/OEM-5.13/OEM-5.14/U][J/OEM-5.17/U] Fix invalid MAC address after
    hotplug tbt dock (LP: #1942999)
    - SAUCE: igc: wait for the MAC copy when enabled MAC passthrough

  * Mass Storage Gadget driver truncates device >2TB (LP: #1981390)
    - usb: gadget: storage: add support for media larger than 2T

  * AMD Rembrandt: DP tunneling fails with Thunderbolt monitors (LP: #1983143)
    - SAUCE: drm/amd: Fix DP Tunneling with Thunderbolt monitors
    - drm/amd/display: Fix for dmub outbox notification enable
    - Revert "drm/amd/display: Fix DPIA outbox timeout after S3/S4/reset"
    - drm/amd/display: Reset link encoder assignments for GPU reset
    - drm/amd/display: Fix DPIA outbox timeout after S3/S4/reset
    - drm/amd/display: Fix new dmub notification enabling in DM
    - SAUCE: thunderbolt: Add DP out resource when DP tunnel is discovered.

  * Fix sub-optimal I210 network speed (LP: #1976438)
    - igb: Make DMA faster when CPU is active on the PCIe link

  * e1000e report hardware hang (LP: #1973104)
    - e1000e: Enable GPT clock before sending message to CSME
    - Revert "e1000e: Fix possible HW unit hang after an s0ix exit"

  * ioam6.sh in net from ubuntu_kernel_selftests fails with 5.15 kernels in
    Focal (LP: #1982930)
    - selftests: net: fix IOAM test skip return code

  * Additional fix for TGL + AUO panel flickering (LP: #1983297)
    - Revert "UBUNTU: SAUCE: drm/i915/display/psr: Fix flicker on TGL + AUO panel"
    - drm/i915/display: Fix sel fetch plane offset calculation
    - drm/i915: Nuke ORIGIN_GTT
    - drm/i915/display: Drop PSR support from HSW and BDW
    - drm/i915/display/psr: Handle plane and pipe restrictions at every page flip
    - drm/i915/display/psr: Do full fetch when handling multi-planar formats
    - drm/i915/display: Drop unnecessary frontbuffer flushes
    - drm/i915/display: Handle frontbuffer rendering when PSR2 selective fetch is
      enabled
    - drm/i915/display: Fix glitches when moving cursor with PSR2 selective fetch
      enabled
    - SAUCE: drm/i915/display/psr: Reinstate fix for TGL + AUO panel flicker

  * AMD Yellow Carp DMCUB fw update for s0i3 B0 fixes (LP: #1957026)
    - drm/amd/display: Optimize bandwidth on following fast update
    - drm/amd/display: Fix surface optimization regression on Carrizo
    - drm/amd/display: Reset DMCUB before HW init

  * GPIO character devi...

Changed in linux (Ubuntu Jammy):
status: Fix Committed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-lowlatency-hwe-5.15/5.15.0-48.54~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-hwe-5.15/5.15.0-48.54~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia/5.15.0-1006.6 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia/5.15.0-1007.7 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy' to 'verification-done-jammy'. If the problem still exists, change the tag 'verification-needed-jammy' to 'verification-failed-jammy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-gkeop-5.15/5.15.0-1003.5~20.04.2 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

nikhil kshirsagar (nkshirsagar) wrote :

I have verified that this issue is fixed in the jammy kernel 5.15.0-48-generic.
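For anyone checking whether a jammy generic kernel is at or past the first release carrying the fix (5.15.0-47.51, per the changelog in this report), a rough version comparison might look like the sketch below. It relies on GNU sort's -V (version sort) option. The helper name version_ge is hypothetical, and other kernel flavours (nvidia, gkeop, mtk and so on) use different version numbers, so the threshold here applies only to the generic jammy kernel.

```shell
# Return success if version string $1 sorts at or after $2 under
# GNU "sort -V" ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
```

Possible usage: version_ge "$(uname -r)" 5.15.0-47 && echo "kernel should include the fix"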

tags: added: verification-done-jammy
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-mtk/5.15.0-1030.34 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-done-jammy-linux-mtk'. If the problem still exists, change the tag 'verification-needed-jammy-linux-mtk' to 'verification-failed-jammy-linux-mtk'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-jammy-linux-mtk-v2 verification-needed-jammy-linux-mtk