Bug #2015827 “NFS performance issue while clearing the file acce...” : Bugs : linux package : Ubuntu

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2023-04-11: Missing required logs.

#1

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2015827

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Chengen Du (chengendu) on 2023-04-11

Changed in linux (Ubuntu):
assignee:	nobody → ChengEn, Du (chengendu)
status:	Incomplete → In Progress

Jan Ingvoldstad (jan-launchpad-xud) wrote on 2023-04-11:

#2

Please note that this is a bug that, for unknown reasons, has been backported from 6.2-rc3 to LTS released kernels in Ubuntu Server LTS.

Upstream is not responsible for making the decision of whether this backported change should be part of older kernels in Ubuntu Server LTS.

Please revert the changes to LTS released kernels, so that server hosting environments can use Ubuntu Server as a server platform.

Allan G Soeby (soeby) wrote on 2023-04-12:

#3

Du ChengEn, I would second Jan's opinion.

This whole chain of fixes that has gone in to fix LP: #2003053 should be rolled back. There were no strong arguments for cherry-picking those changes in the first place. (They are not in upstream LTS either.)

Once it was discovered what kind of impact it had, it should have been rolled back.

It is now also clear that LP: #2003053 cannot be solved without impacting other use cases, and only by introducing an extra mount option. This is just another argument for reverting this set of patches.

Jan Ingvoldstad (jan-launchpad-xud) wrote on 2023-05-02:

#4

Judging from the utter lack of response to the core issue of backporting untested patches from an - at the time - release candidate unstable upstream Linux version, back to what is supposed to be *three* long term support "enterprise grade" Ubuntu editions, it seems that Ubuntu's policy for Linux kernels has gone to "move fast and break things".

Basically, for stability with Ubuntu, there is now only one option:

Roll your own Linux LTS kernels.

This experience has completely undermined my trust in Ubuntu as a stable platform for servers.

Kleber Sacilotto de Souza (kleber-souza) wrote on 2023-05-02:

#5

Hello @jan-launchpad-xud and @soeby.

The patches we introduced to fix bug 2003053 unfortunately introduced a regression that was not caught by our tests and the reviews done internally and by upstream. The regression was fixed as soon as we could and we apologize for the inconvenience. We do have extensive quality control processes but unfortunately sometimes issues are discovered after a kernel is released.

The Ubuntu LTS kernels are not necessarily a 1-to-1 match with the upstream LTS releases: we do pick up every patch applied to the upstream stable kernels, but we also apply other patches to provide extra fixes for our users. The reason we backported a patchset from an upstream -rc release was neither unknown nor random; it was based on a real issue that was affecting our users.

If you could kindly provide more information about the issues that you are currently having with the Ubuntu kernels that were caused by the changes to fix bug 2003053 and bug 2009325 we would be happy to investigate and provide a solution if possible.

Jan Ingvoldstad (jan-launchpad-xud) wrote on 2023-05-03:

#6

Reverts upstream commits 21fd9e8700de86d1169f6336e97d7a74916ed04a, 029085b8949f5d269ae2bbd14915407dd0c7f902, and 0eb43812c0270ee3d005ff32f91f7d0a6c4943af (838 bytes, text/plain)

Hello @kleber-souza.

The regression was not fixed. There have only been mitigations.

Please see our comments in the other bug report.

All information required is available in the previous bug report, but I have attached a patchset that actually fixes the regression.

Chengen Du (chengendu) wrote on 2023-05-03 (last edit on 2023-05-03):

#7

The NFS patchset did resolve the issue our user encountered, but unfortunately introduced some performance overhead that may have significant impacts in certain scenarios.
We wanted to let you know that we have submitted a patch (https://patchwork.kernel<email address hidden>/) that we propose to address the issue.
We are currently awaiting a response from upstream.

Ubuntu Foundations Team Bug Bot (crichton) on 2023-05-03

tags:	added: patch

Allan G Soeby (soeby) wrote on 2023-05-03:

#8

Hi @kleber-souza, @chengendu

Thanks for your attention.

Allow me to give my perception of the impact of fixing "bug" LP: #2003053.

The original patchset introduced *two* regressions. One (an NFS deadlock) hit everybody and was fixed by #2009325, but the remaining one is now hitting those of us who spawn new user processes frequently, causing new "login times" to be created and the access cache to be zapped. As a result, we are looking at a 300-400% increase in *overall* NFS operations, making the current kernels unusable for production. We do not have that kind of headroom on our NFS servers.
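To illustrate the mechanism behind the second regression, here is a minimal toy model in Python. It is not kernel code and does not reproduce the measured 300-400% figure; the class, the `simulate` helper, the uid, and the file paths are all hypothetical, and the "login time" check is only an assumption about how cached access results become stale after a new session starts.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAccessCache:
    """Toy model of a per-user NFS access cache (hypothetical, not kernel code)."""
    invalidate_on_login: bool
    entries: dict = field(default_factory=dict)     # (uid, path) -> time the result was cached
    login_time: dict = field(default_factory=dict)  # uid -> latest "login time"
    access_rpcs: int = 0                            # simulated ACCESS calls sent to the server

    def new_login(self, uid: int, now: int) -> None:
        # Models a new session (for example, a freshly spawned user process tree).
        self.login_time[uid] = now

    def check_access(self, uid: int, path: str, now: int) -> None:
        cached_at = self.entries.get((uid, path))
        # Assumption: an entry cached before the latest login counts as stale.
        stale = (
            cached_at is not None
            and self.invalidate_on_login
            and cached_at < self.login_time.get(uid, 0)
        )
        if cached_at is None or stale:
            self.access_rpcs += 1            # cache miss: ask the server again
            self.entries[(uid, path)] = now


def simulate(invalidate_on_login: bool, sessions: int = 100, files: int = 10) -> int:
    """Run `sessions` short-lived sessions that each touch the same `files` paths."""
    cache = ToyAccessCache(invalidate_on_login)
    paths = [f"/srv/data/file{i}" for i in range(files)]
    now = 0
    for _ in range(sessions):
        now += 1
        cache.new_login(uid=1000, now=now)
        now += 1
        for path in paths:
            cache.check_access(1000, path, now)
    return cache.access_rpcs


if __name__ == "__main__":
    print("ACCESS calls, cache kept across logins:  ", simulate(False))
    print("ACCESS calls, cache zapped on each login:", simulate(True))
```

In this toy run the cache-preserving variant issues 10 simulated ACCESS calls and the invalidating variant issues 1000, because every new session re-validates every path. The real ratio depends on the workload, which is why services that spawn user processes frequently are affected far more than interactive logins.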

The result is that we are simply stuck with kernels prior to the #2003053 fixes. With recent CVE fixes in the current kernels, we have now also resorted to building our own kernels. This is very counter-productive.

I understand the use case for the changes that went into "bug" #2003053. The reason why I call this a "bug" (in quotes) is that the behaviour has been around for more than 15 years. While age alone is not a qualifier, I am just saying that this has been accepted behaviour for that long. Furthermore, #2003053 will only apply in environments where the NFS server has knowledge of users and their secondary groups and validates them for ACCESS calls. (ours don't)

From the original upstream commit message 0eb43812c0270ee3d005ff32f91f7d0a6c4943af : "While it is reasonable to expect that such group membership changes are rare, and that we do not want to optimise the cache to accommodate them, it is also not unreasonable for the user to expect that if they log out and log back in again, that the staleness would clear up".

It is clear that a trade-off was considered; however, the use case was a "user" (a physical, interactive person), not a service of any kind. I am quite certain that, had a use case with a 3-4x increase in NFS ops been considered, this would not have gone in the way it did.

I understand why there are sometimes strong reasons to cherry-pick changes from upstream, or to make your own changes. IMHO, the use case for #2003053 was not strong enough to justify that.

The regression risk for #2003053 was assessed as low, as these were upstream changes. We now know this was not the case.

With that knowledge, and comparing it to the weak use case the changes were trying to address, the right decision would have been to revert the changes.

The suggested upstream changes to introduce a mount option to address this should be turned around. The option should be added for those wanting to zap/re-validate their access caches on re-login, but the default behaviour should be left as is.