Comment 58 for bug 624877

Theodore Ts'o (tytso) wrote :

There are two issues here that interact, which is why they are confusing people. The first is that the kernel has a potential livelock problem in the writeback code: if new pages that require writeback are constantly being dirtied, the sync(2) system call will not return until all of the pages are clean, and on a busy system with lots of processes writing to the disk that may never happen. It doesn't happen every time sync(2) is called, but since dpkg was calling sync(2) all the time, it tended to happen there. Still, this problem can happen without dpkg being involved at all, and on many different file systems, since it's a problem with the generic writeback code.

Trying to backport this fix to the ancient kernel in 10.04 is going to be _hard_. There are people at Red Hat who are paid the big bucks to do this kind of painful backporting (which in this case means multiple patches, spread across multiple kernel releases before the problem was finally fixed, and with all sorts of dependencies). Good luck finding a volunteer willing to figure this out. I wouldn't; I would much rather run a 3.x kernel. And if I had a business that needed to use a stable enterprise kernel, I'd pay the darned Red Hat or SLES support fees and get a professionally managed enterprise kernel. Unfortunately, in my experience Canonical doesn't have paid kernel engineers with either the skill or the bandwidth (not sure which) to do this kind of very tricky backporting to ancient LTS kernels, the way Red Hat has. I've seen this with ext4 bug fixes that never get made to 10.04, but which Red Hat has been willing to backport to their RHEL6 kernel.

Note that this problem is much less likely to hit desktop/laptop systems, where there generally aren't servers continuously writing to the file system. So for most Ubuntu systems, which tend not to be production servers running highly stressful workloads, this won't be an issue. The people who are complaining on this Launchpad bug are probably outliers, which probably explains the low priority that Canonical's paid engineers give to this kind of backporting.

The second problem/bugfix is the fix to dpkg, which significantly improves both its performance and its impact on the system as a whole by using sync_file_range() instead of sync(). Fixing this also tends to remove one of the more common ways of tickling the bug above, but that's not the only reason to backport this dpkg fix: it also speeds up package installs and decreases their overall system impact.
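
To make the difference concrete, here is a minimal sketch of the per-file approach on Linux. This is not dpkg's actual code: the helper name write_file_durably() is made up for illustration, and pairing sync_file_range() with a later fsync() is my assumption about how one would normally use it, since sync_file_range() on its own only starts writeback and gives no durability guarantee.

```c
#define _GNU_SOURCE            /* exposes sync_file_range() in <fcntl.h> (Linux-only) */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from dpkg): write len bytes to path and
 * flush only that one file, instead of calling sync() for the whole system.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
int write_file_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be partial, so loop until everything is handed over. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /*
     * Ask the kernel to start writing back this file's dirty pages.
     * offset 0 with nbytes 0 means "through the end of the file".
     * This does not wait for the I/O; it is treated as a hint here,
     * so a failure is deliberately ignored.
     */
    (void)sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

    /*
     * Assumption: the caller wants the data on stable storage before
     * continuing. fsync() waits only for this file, unlike sync().
     */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The point of the design is that the wait is bounded by one file's dirty pages rather than by everything any process on the system happens to be writing, which is why it sidesteps the sync(2) behavior described above. A program writing many files would probably call sync_file_range() on each one as it goes and defer the fsync() calls to the end, so the writeback overlaps with the remaining work.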

Or, people could just upgrade their system to Ubuntu 12.04 LTS...