Regression: nfs cannot access/list wildcard file unless its cached when there is a symlink in path

Bug #1971482 reported by Jonathan
This bug affects 1 person
Affects         Status        Importance  Assigned to   Milestone
linux (Ubuntu)  Invalid       Undecided   Unassigned
  Bionic        Fix Released  Medium      Stefan Bader
  Focal         Fix Released  Medium      Stefan Bader
  Impish        Fix Released  Medium      Stefan Bader
  Jammy         Fix Released  Medium      Unassigned
  Kinetic       Invalid       Undecided   Unassigned

Bug Description

Some of our build machines have recently started failing builds. It was noted that all the machines that fail the build are running the most recent kernel 5.4.0-109-generic #123~18.04.1-Ubuntu.

The following command was created to minimally reproduce the issue:

$ while true; do sudo /usr/local/scripts/drop_cache.sh; ls -la /shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk | tee -a /tmp/ls.log 2>&1; sleep 1;done
ls: cannot access '/shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk': No such file or directory
ls: cannot access '/shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk': No such file or directory
ls: cannot access '/shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk': No such file or directory
ls: cannot access '/shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk': No such file or directory

Note that listing the file directly by its full name works every time, so the bug appears to be related to the use of the wildcard.

$ while true; do sudo /usr/local/scripts/drop_cache.sh; ls -la /shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq_1.0.9323-r3_cortexa9hf-neon.ipk | tee -a /tmp/ls.log 2>&1; sleep 1;done
-rw-r--r-- 1 jenkins engineer 202526 May 2 13:47 /shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq_1.0.9323-r3_cortexa9hf-neon.ipk
-rw-r--r-- 1 jenkins engineer 202526 May 2 13:47 /shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq_1.0.9323-r3_cortexa9hf-neon.ipk
-rw-r--r-- 1 jenkins engineer 202526 May 2 13:47 /shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq_1.0.9323-r3_cortexa9hf-neon.ipk

The drop_cache script was needed to force the error every time; without it, the listing would fail only a few times, then succeed many times in a row before randomly failing again.

$ cat /usr/local/scripts/drop_cache.sh
#!/bin/bash

# Test script to drop filesystem cache
sync

# Clear pagecache, dentries, and inodes
echo 3 > /proc/sys/vm/drop_caches

Downgrading the kernel to 5.4.0-107-generic on one of the machines caused the problem to go away.
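The reproduction loop above can be wrapped in a small helper that counts failures instead of printing them, which makes it easier to compare kernels. The following is a minimal sketch, not the reporter's script: the function name count_glob_failures and its arguments are assumptions made for illustration. It relies on the behaviour visible in the output above, where ls is handed the literal pattern when the shell's glob expansion finds nothing.

```shell
# count_glob_failures PATTERN ATTEMPTS [PRE_COMMAND...]
# Prints how many of ATTEMPTS wildcard listings failed.
# Hypothetical helper for illustration; not part of the original report.
count_glob_failures() {
    local pattern attempts failures i
    pattern=$1
    attempts=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Optional pre-command, for example the cache-dropping script.
        if [ "$#" -gt 0 ]; then
            "$@" || return 2
        fi
        # $pattern is deliberately unquoted so the shell expands the glob
        # (paths containing spaces are not supported by this sketch).
        # When expansion finds nothing, ls receives the literal pattern
        # and fails, as in the "cannot access" output above.
        if ! ls -d -- $pattern >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}
```

With the paths from this report, a call might look like: count_glob_failures '/shared/projects/MityCAM-jenkins/intermediates/MityCAM-Yocto/latest/libdaq*.ipk' 20 sudo /usr/local/scripts/drop_cache.sh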

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-5.4.0-109-generic 5.4.0-109.123~18.04.1
ProcVersionSignature: Ubuntu 5.4.0-107.121~18.04.1-generic 5.4.174
Uname: Linux 5.4.0-107-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.27
Architecture: amd64
Date: Tue May 3 15:04:34 2022
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
 PATH=(custom, no user)
SourcePackage: linux-signed-hwe-5.4
UpgradeStatus: No upgrade log present (probably fresh install)

Jonathan (jjcf89) wrote :

Attaching tcpdump for machine seeing errors
$ sudo tcpdump -w /tmp/data.pcap -i bond0 host 10.0.0.23

Jonathan (jjcf89) wrote :

Also noticed the issue on a 20.04 server which got updated to kernel 5.13.0-40. Downgrading to 5.13.0-39 fixed the issue.

Jonathan (jjcf89) wrote :

It looks like an NFS patch for https://bugs.launchpad.net/bugs/cve/2022-24448 was introduced in 5.13.0-40 and linux-image-5.4.0-109, so it could be the cause of the regression.

Jonathan (jjcf89) wrote :

I checked out the focal kernel and reverted the following two commits, and the problem went away. Reverting just the first one wasn't enough to fix the issue.

- NFSv4: nfs_atomic_open() can race when looking up a non-regular file
- NFSv4: Handle case where the lookup of a directory fails

git clone git://kernel.ubuntu.com/ubuntu/ubuntu-focal.git linux-ubuntu-focal
cd linux-ubuntu-focal
git checkout Ubuntu-hwe-5.13-5.13.0-40.45_20.04.1 -b Ubuntu-hwe-5.13-5.13.0-40.45_20.04.1

vim debian/changelog # modify version in head of file
LANG=C fakeroot debian/rules clean
LANG=C fakeroot debian/rules binary-headers binary-generic binary-perarch skipmodule=true
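The commands above cover the checkout and the build but not the revert step itself. One possible way to do it is sketched below under assumptions: the helper names find_commit_by_subject and revert_by_subject are hypothetical, and the sketch assumes each subject line appears in the checked-out tree exactly as listed above. It would run after the git checkout and before the build.

```shell
# find_commit_by_subject SUBJECT
# Prints the hash of the newest commit reachable from HEAD whose subject
# line equals SUBJECT exactly. Hypothetical helper for illustration.
find_commit_by_subject() {
    git log --format='%H %s' | awk -v s="$1" '
        {
            hash = $1
            sub(/^[^ ]+ /, "")      # strip the hash, leaving the subject
            if ($0 == s) { print hash; exit }
        }'
}

# revert_by_subject SUBJECT...
# Reverts each matching commit, in the order the subjects are given.
revert_by_subject() {
    local subject hash
    for subject in "$@"; do
        hash=$(find_commit_by_subject "$subject")
        if [ -z "$hash" ]; then
            echo "no commit found for: $subject" >&2
            return 1
        fi
        git revert --no-edit "$hash" || return 1
    done
}
```

A call with the two subjects from this comment might look like: revert_by_subject "NFSv4: nfs_atomic_open() can race when looking up a non-regular file" "NFSv4: Handle case where the lookup of a directory fails". Matching on the subject line rather than on a hash is a design choice for this sketch, since backported commits do not keep their upstream hashes.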

summary: - nfs cannot access/list wildcard file unless its cached
+ Regression: nfs cannot access/list wildcard file unless its cached
Jonathan (jjcf89) wrote : Re: Regression: nfs cannot access/list wildcard file unless its cached

It looks like the following patch solves the problem, since we have a symlink in the test path. It needs to be applied to Ubuntu's hwe-5.4 and hwe-5.13 kernels, and probably to any other kernel the above commits were backported to.

[PATCH] NFS: LOOKUP_DIRECTORY is also ok with symlinks

https://<email address hidden>/T/#m5d587247611e36afcfcd157125e910d4f7075cb7

Mauricio Faria de Oliveira (mfo) wrote :

Assigning to the linux source package, and marking as confirmed per comment #5.
Thanks for the detailed report and test steps.

Changed in linux (Ubuntu):
status: New → Confirmed
Mauricio Faria de Oliveira (mfo) wrote :

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e0caaf75d443e02e55e146fd75fe2efc8aed5540

commit e0caaf75d443e02e55e146fd75fe2efc8aed5540
Author: Trond Myklebust <email address hidden>
Date: Tue Feb 8 13:38:23 2022 -0500

    NFS: LOOKUP_DIRECTORY is also ok with symlinks

    Commit ac795161c936 (NFSv4: Handle case where the lookup of a directory
    fails) [1], part of Linux since 5.17-rc2, introduced a regression, where
    a symbolic link on an NFS mount to a directory on another NFS does not
    resolve(?) the first time it is accessed:

    Reported-by: Paul Menzel <email address hidden>
    Fixes: ac795161c936 ("NFSv4: Handle case where the lookup of a directory fails")
    Signed-off-by: Trond Myklebust <email address hidden>
    Tested-by: Donald Buczek <email address hidden>
    Signed-off-by: Anna Schumaker <email address hidden>

~/git/linux$ git describe --contains e0caaf75d443
v5.17-rc5~13^2~1

no longer affects: linux-signed-hwe-5.13 (Ubuntu)
no longer affects: linux-signed-hwe-5.4 (Ubuntu)
no longer affects: linux-signed-hwe-5.13 (Ubuntu Focal)
no longer affects: linux-signed-hwe-5.13 (Ubuntu Impish)
no longer affects: linux-signed-hwe-5.13 (Ubuntu Jammy)
no longer affects: linux-signed-hwe-5.13 (Ubuntu Kinetic)
no longer affects: linux-signed-hwe-5.4 (Ubuntu Focal)
no longer affects: linux-signed-hwe-5.4 (Ubuntu Impish)
no longer affects: linux-signed-hwe-5.4 (Ubuntu Jammy)
no longer affects: linux-signed-hwe-5.4 (Ubuntu Kinetic)
Changed in linux (Ubuntu Kinetic):
status: Confirmed → New
Changed in linux (Ubuntu Focal):
status: New → Confirmed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1971482

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jonathan (jjcf89) wrote : Re: Regression: nfs cannot access/list wildcard file unless its cached

How do I add 18.04 Bionic to the affects list?

Mauricio Faria de Oliveira (mfo) wrote :

Hi Jonathan. Done! That requires membership in certain Ubuntu groups.

Stefan Bader (smb) wrote :

I looked for the mentioned patch in Focal/5.4: it seems to be part of upstream v5.4.181, which was included in 5.4.0-110.124. That should be released around today.

Changed in linux (Ubuntu Focal):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → Medium
status: Confirmed → Fix Committed
Stefan Bader (smb) wrote :

For Bionic/4.15 the fix is included in upstream stable patchset 2022-03-29 which made it into 4.15.0-177.186, also to be released around today.

Changed in linux (Ubuntu Bionic):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → Medium
status: New → Fix Committed
Stefan Bader (smb) wrote :

Similarly, for Impish/5.13 the fix is included in upstream stable patchset 2022-04-07, which made it into 5.13.0-41.46.

Changed in linux (Ubuntu Impish):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → Medium
status: New → Fix Committed
Stefan Bader (smb) wrote :

For Jammy/5.15 this was fixed in the v5.15.25 upstream stable release, which was already included in 5.15.0-23.23 (Jammy was released with 5.15.0-27.28).

Changed in linux (Ubuntu Jammy):
importance: Undecided → Medium
status: New → Fix Released
Changed in linux (Ubuntu Kinetic):
status: Incomplete → Invalid
Jonathan (jjcf89) wrote :

Do these changes also affect the linux-hwe branches?

Jonathan (jjcf89)
summary: - Regression: nfs cannot access/list wildcard file unless its cached
+ Regression: nfs cannot access/list wildcard file unless its cached when
+ there is a symlink in path
Stefan Bader (smb) wrote : Re: [Bug 1971482] Re: Regression: nfs cannot access/list wildcard file unless its cached

On 09.05.22 17:39, Jonathan wrote:
> Do these changes also affect the linux-hwe branches?
>
HWE kernels are directly linked to the main series kernel of the same version. A fix
applied to one is also included in the related HWE kernel.

Jonathan (jjcf89) wrote :

Great thanks

Stefan Bader (smb) wrote :

This should be fixed in 5.4.0-110.124 (pending publishing).

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Stefan Bader (smb) wrote :

This should be fixed in 5.13.0-41.46 (pending publishing).

Changed in linux (Ubuntu Impish):
status: Fix Committed → Fix Released
Jonathan (jjcf89) wrote :

Updated 20.04 VM to 5.13.0-41-generic #46~20.04.1-Ubuntu and nfs bug is fixed.

Updated 18.04 machine to 5.4.0-110-generic #124~18.04.1-Ubuntu and nfs bug is fixed

Stefan Bader (smb) wrote :

Same fix now released in 18.04/Bionic 4.15.0-177.186.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
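
The fixed builds named in this thread (4.15.0-177, 5.4.0-110, 5.13.0-41, 5.15.0-23) can be turned into a quick check of a running kernel. This is a rough sketch under assumptions: the function name kernel_has_nfs_symlink_fix is hypothetical, the comparison looks only at the ABI number in the `uname -r` string, and a result of "below the fixed build" does not by itself mean that a given older kernel carried the regression.

```shell
# kernel_has_nfs_symlink_fix RELEASE
# RELEASE is a string as printed by `uname -r`, e.g. 5.4.0-110-generic.
# Returns 0 if the ABI number is at or above the first fixed build named
# in this report, 1 if it is below, 2 if the series is not covered here.
# Hypothetical helper for illustration.
kernel_has_nfs_symlink_fix() {
    local release series abi fixed
    release=$1
    case $release in
        *-*) ;;
        *) return 2 ;;              # no ABI part to compare
    esac
    series=${release%%-*}           # e.g. 5.4.0
    abi=${release#*-}               # e.g. 110-generic
    abi=${abi%%[!0-9]*}             # e.g. 110
    case $series in
        4.15.0) fixed=177 ;;
        5.4.0)  fixed=110 ;;
        5.13.0) fixed=41 ;;
        5.15.0) fixed=23 ;;
        *) return 2 ;;
    esac
    [ -n "$abi" ] || return 2
    [ "$abi" -ge "$fixed" ]
}
```

On an affected machine this could be called as: kernel_has_nfs_symlink_fix "$(uname -r)" && echo "fix expected to be present"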