Comment 8 for bug 2015827

Allan G Soeby (soeby) wrote :

Hi @kleber-souza, @chengendu

Thanks for your attention.

Allow me to give my perception of the impact of fixing "bug" LP: #2003053.

The original patchset introduced *two* regressions. One (an NFS deadlock) hit everybody and was fixed by #2009325. The remaining one is now hitting those of us who spawn new user processes frequently, causing new "login times" to be created and the access cache to be zapped. As a result we are looking at a 300-400% increase in *overall* NFS operations, making the current kernels unusable for production. We do not have that kind of headroom on our NFS servers.
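To illustrate the mechanism, here is a toy model in plain Python. It is not kernel code, and every name in it (`ToyAccessCache`, `simulate`, `clear_on_login`) is made up for this sketch. It assumes a cached access entry is treated as stale when it was cached before the caller's login time, and it ignores every other cache expiry path, so the absolute numbers mean nothing. Only the shape matters: with the new behaviour, each freshly spawned session pays for at least one extra ACCESS call per file it touches.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAccessCache:
    """Toy access cache for a single file, keyed by uid (illustration only)."""
    clear_on_login: bool
    entries: dict = field(default_factory=dict)  # uid -> time the entry was cached
    access_calls: int = 0  # simulated ACCESS calls sent to the server

    def check(self, uid: int, login_time: float, now: float) -> None:
        cached_at = self.entries.get(uid)
        # Assumption: with clear_on_login, an entry cached before the
        # caller's login time is considered stale and must be revalidated.
        stale = cached_at is None or (self.clear_on_login and cached_at < login_time)
        if stale:
            self.access_calls += 1
            self.entries[uid] = now


def simulate(clear_on_login: bool, sessions: int, checks_per_session: int) -> int:
    """Count simulated ACCESS calls for short-lived sessions of one user."""
    cache = ToyAccessCache(clear_on_login)
    now = 0.0
    for _ in range(sessions):
        now += 1.0
        login_time = now  # each newly spawned session gets a new login time
        for _ in range(checks_per_session):
            now += 0.01
            cache.check(uid=1000, login_time=login_time, now=now)
    return cache.access_calls


if __name__ == "__main__":
    print("without login-time check:", simulate(False, sessions=100, checks_per_session=5))
    print("with login-time check:   ", simulate(True, sessions=100, checks_per_session=5))
```

In this model the first run needs a single server lookup for all 100 sessions, while the second needs one per session. A real client also expires entries for other reasons, so the real-world ratio depends on the workload.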

The result is that we are simply stuck with kernels prior to the #2003053 fixes. With recent CVE fixes landing in current kernels, we have now also resorted to building our own kernels. This is very counter-productive.

I understand the use case for the changes that went into "bug" #2003053. The reason I call this a "bug" (in quotes) is that the behaviour has been around for more than 15 years. While age alone is not a qualifier, I am just saying that this has been accepted behaviour for that long. Furthermore, #2003053 only applies in environments where the NFS server has knowledge of users and their secondary groups and validates them for ACCESS calls (ours do not).

From the original upstream commit message 0eb43812c0270ee3d005ff32f91f7d0a6c4943af : "While it is reasonable to expect that such group membership changes are rare, and that we do not want to optimise the cache to accommodate them, it is also not unreasonable for the user to expect that if they log out and log back in again, that the staleness would clear up".

It is clear that a trade-off was considered; however, the use case in mind was a "user" (a physical, interactive person), not a service of any kind. I am quite certain that, had a use case with a regression of a 3-4x increase in NFS ops been considered, this would not have gone in the way it did.

I understand that there are sometimes strong reasons to cherry-pick changes from upstream, or to make your own changes. IMHO, the use case for #2003053 was not strong enough to justify that.

The main regression risk for #2003053 was assessed as low, since the changes came from upstream. We now know this was not the case.

With that knowledge, and comparing it to the weak use case the changes were trying to address, the right decision would have been to revert the changes.

The suggested upstream change, introducing a mount option to address this, should be turned around. The option should be added for those wanting to zap/re-validate their access caches on re-login, but the default behaviour should stay as it was before #2003053.