Comment 45 for bug 317781

Theodore Ts'o (tytso) wrote:

So, I've been aware of this problem and have been working on a solution, but since I'm not subscribed to this bug, I wasn't aware of the huge discussion going on here until Nullack prodded me and asked me to "take another look at bug 317781". The short answer is: (a) yes, I'm aware of it; (b) there is a (partial) solution; (c) it's not yet in mainline, and as far as I know not in an Ubuntu kernel, but it is queued for integration at the next merge window, after 2.6.29 releases; and (d) this is really an application design problem more than anything else. The patches in question are:

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=3bf3342f394d72ed2ec7e77b5b39e1b50fad8284
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=6645f8c3bc3cdaa7de4aaa3d34d40c2e8e5f09ae
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=dbc85aa9f11d8c13c15527d43a3def8d7beffdc8

So, what is the problem? POSIX fundamentally says that what happens if the system is not shut down cleanly is undefined. If you want to force things to be stored on disk, you must use fsync() or fdatasync(). There may be performance problems with this, which is what happened with Firefox 3.0 [1] --- but that's also why POSIX doesn't require that things be synced to disk as soon as the file is closed.
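To make that concrete, here is a minimal sketch (the file name and error handling are mine, not from the bug discussion) of the portable write-then-fsync() pattern POSIX prescribes:

/* Sketch: the POSIX-portable way to make data durable.
 * Without the fsync(), POSIX makes no promise the data survives
 * a crash, no matter how long ago close() returned.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *buf = "important state\n";
    int fd = open("state.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* Force the data to stable storage before declaring success. */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    if (close(fd) < 0) { perror("close"); return 1; }
    return 0;
}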

[1] http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/

So, why wasn't this a problem in the past? Well, ext3 by default has a commit interval of 5 seconds and uses data=ordered. What does this mean? Every 5 seconds, the ext3 journal is committed; this means that any changes made since the last commit are guaranteed to survive an unclean shutdown. The journalling mode data=ordered means that only metadata is written to the journal, but data is ordered: before the commit takes place, any data blocks associated with inodes that are about to be committed in that transaction are forced out to disk. This is primarily done for security reasons; if it were not done, then any newly allocated blocks might still contain previous data belonging to some other file or user, and after a crash, accessing that file might result in a user seeing someone else's mail or p0rn, and that's unacceptable from a security perspective.
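For illustration only, here is a sketch of how those ext3 defaults could be requested explicitly via the mount(2) system call; the device and mount point are placeholders, and the same options can equally go on the mount(8) command line or in /etc/fstab:

/* Sketch: mounting ext3 with an explicit 5-second commit interval
 * and ordered data mode (the defaults described above).
 * "/dev/sdXN" and "/mnt" are placeholders; requires root.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    if (mount("/dev/sdXN", "/mnt", "ext3", 0,
              "commit=5,data=ordered") < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}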

However, this had the side effect of essentially guaranteeing that anything that had been written was on disk after 5 seconds. (This is somewhat modified if you are running on batteries and have enabled laptop mode, but we'll ignore that for the purposes of this discussion.) Since ext3 became the dominant filesystem for Linux, application writers and users have started depending on this, and so they become shocked and angry when their system locks up and they lose data --- even though POSIX never really made any such guarantee. (We could be snide and point out that they should have been shocked and angry about crappy proprietary, binary-only drivers that no one but the manufacturer can debug, or angry at themselves for not installing a UPS, but that's not helpful; expectations are expectations, and it's hard to get people to change them, even when they aren't good for themselves or the environment --- such as Americans living in exurbs driving SUVs getting shocked and angry when gasoline hit $4/gallon and their 90-minute daily commute started getting expensive. :-)

OK, so enter ext4 and delayed allocation. With delayed allocation, we don't allocate a location on disk for the data block right away. Since there is no location on disk, there is no place to write the data on a commit; but it also means that there is no security problem. It also results in massive performance improvements; for example, if you create a scratch file and then delete it 20 seconds later, it will probably never hit the disk. Unfortunately, the default VM tuning parameters, which can be controlled via /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs, mean that in practice a newly created file won't hit disk until about 45-150 seconds later, depending on how many dirty pages are in the page cache at the time. (This isn't unique to ext4, by the way --- any advanced filesystem which does delayed allocation, which includes XFS and, in the future, btrfs, will have the same issue.)
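To inspect those writeback knobs on a running system, a small sketch like the following (my illustration, not part of the patches) reads them back; values are in centiseconds:

/* Sketch: print the VM writeback tunables discussed above.
 * Values are in centiseconds (1/100 s).
 */
#include <stdio.h>

static void show(const char *path)
{
    long val;
    FILE *f = fopen(path, "r");

    if (!f) {
        perror(path);
        return;
    }
    if (fscanf(f, "%ld", &val) == 1)
        printf("%s = %ld (%.1f seconds)\n", path, val, val / 100.0);
    fclose(f);
}

int main(void)
{
    show("/proc/sys/vm/dirty_expire_centisecs");
    show("/proc/sys/vm/dirty_writeback_centisecs");
    return 0;
}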

So the difference between 5 seconds and 60 seconds (the typical time if you're writing huge data sets) isn't *that* big, but it matters for certain crappy applications that apparently write huge numbers of small files in users' home directories; this appears to be the case for both GNOME and KDE. Since these applications are rewriting existing files, and are apparently doing so *frequently*, the chances that files will be lost are high.

So.... what are the solutions? The patches which are queued for the 2.6.30-rc1 merge window are basically a hack which forces blocks that had been delay-allocated to be allocated when either (a) the file being written had previously been truncated using ftruncate() or opened using O_TRUNC, in which case the blocks will be allocated when the file is closed, or (b) a file containing blocks not yet allocated is renamed using the rename(2) system call such that a previously existing file is unlinked (i.e., the application has written the file "foo.new" and is now calling rename("foo.new", "foo"), causing the old file "foo" to be unlinked); then the file's blocks will also be forcibly allocated. This solves the most common cases where some crappy desktop framework is constantly rewriting large numbers of files in ~/.gnome or ~/.kde, since in those cases, where the files are constantly being replaced, they will be forced out to disk, giving users the old ext3 behaviour. However, large files that are being streamed out, or large database files, in most cases won't meet criteria (a) or (b) above, so we end up preserving most of the performance advantages of ext4.
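Case (b) is the classic replace-via-rename idiom; here is a minimal sketch (file names are illustrative):

/* Sketch: the replace-via-rename idiom described in case (b).
 * With the queued patches, the rename() over an existing "foo"
 * forces allocation of foo.new's delayed blocks, restoring
 * ext3-like behaviour even though no fsync() is issued.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *conf = "setting=value\n";
    int fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, conf, strlen(conf)) < 0) { perror("write"); return 1; }
    if (close(fd) < 0) { perror("close"); return 1; }

    /* Atomically replace "foo"; the old "foo" is unlinked. */
    if (rename("foo.new", "foo") < 0) { perror("rename"); return 1; }
    return 0;
}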

Another solution is to make sure your system is reliable. :-) If you have your server in a proper data center, with a UPS, and you're not using any unreliable binary-only video or network drivers, then your system shouldn't be randomly locking up or crashing; in that case, a further patch, which will also be merged during the 2.6.30-rc1 merge window, will provide a mount option that disables the above-mentioned kludge, since it impacts performance.

The final solution is that we need properly written applications and desktop libraries. The proper way of doing this sort of thing is not to have hundreds of tiny files in private ~/.gnome2* and ~/.kde2* directories. Instead, the answer is to use a proper small database like sqlite for application registries, but fixed up so that it allocates and releases space for its database in chunks, and so that it uses fdatasync() instead of fsync() to guarantee that data is written to disk. If sqlite had been written so that it grabbed new space for its database storage in chunks of 16k or 64k, released space that was no longer needed in similarly large chunks via truncate(), and used fdatasync() instead of fsync(), the performance problems with Firefox 3 wouldn't have taken place. Such a solution is also far more efficient in terms of disk space utilization, and it minimizes disk writes, which is good for SSDs. It is the ultimate correct answer, but it means that you need someone with systems experience writing the libraries used by application writers.
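To illustrate the chunked-allocation idea --- this is a sketch of the technique, not sqlite's actual code --- a registry-style writer might grow its file in 64k chunks with posix_fallocate() and flush with fdatasync():

/* Sketch (not sqlite's actual code): grow a data file in 64 KiB
 * chunks so block allocation happens infrequently, and use
 * fdatasync() so only file data, not metadata like timestamps,
 * must be flushed --- which is cheaper than fsync().
 */
#define _XOPEN_SOURCE 600
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

/* Ensure the file can hold `needed` bytes, growing in whole chunks. */
static int ensure_capacity(int fd, off_t needed)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return errno;
    if (needed <= st.st_size)
        return 0;
    /* Round the new size up to the next 64 KiB chunk boundary. */
    return posix_fallocate(fd, 0, ((needed + CHUNK - 1) / CHUNK) * CHUNK);
}

int main(void)
{
    const char *rec = "key=value\n";
    int err;
    int fd = open("registry.db", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    err = ensure_capacity(fd, (off_t)strlen(rec));
    if (err) { fprintf(stderr, "ensure_capacity: %s\n", strerror(err)); return 1; }

    if (write(fd, rec, strlen(rec)) < 0) { perror("write"); return 1; }

    /* Flush the data without forcing a journal commit for
     * metadata-only changes such as mtime updates. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }
    close(fd);
    return 0;
}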