Comment 142 for bug 317781

Theodore Ts'o (tytso) wrote :

@Chris

I hate to keep repeating myself, but the 2.6.30 patches will cause open-write-close-rename (what I call "replace via rename") to have the semantics you want. It will do that by forcing a block allocation on the rename, and then when you do the journal commit, it will block waiting for the data writes to complete. So it will do what you want. Please note that this is an ext4-specific hack; there is no guarantee that btrfs, ZFS, tux3, or reiser4 will implement anything like this. And all of these filesystems do implement delayed allocation, and will have exactly the same issue. You and others keep talking about how this is a MUST implement, but the reality is that it is not mandated by POSIX, and implementing these sorts of things will hurt both benchmarks and real-life server workloads. So don't count on other filesystems implementing the same hacks.
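For code that cannot rely on a filesystem-specific workaround, the portable form of "replace via rename" does an explicit fsync() on the new file before renaming it. Here is a minimal sketch in Python; the function name replace_via_rename and the ".tmp-" prefix are made up for illustration, and it makes no attempt to preserve the old file's ownership or mode.

```python
import os
import tempfile


def replace_via_rename(path, data):
    """Replace the file at `path` with `data` (bytes) via write, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so that the rename
    # stays within one filesystem. mkstemp creates it with mode 0600.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push the data to disk
        # Only after the data is on disk do we replace the old name.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With the fsync() in place, a crash should leave either the old contents or the new contents under the name, rather than a zero-length file. The sketch does not fsync() the containing directory, so it does not promise that the rename itself has reached disk when the function returns.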

@CowbowTim,

Actually ext4's fsync() is smarter; it won't force out other files' data blocks, because of delayed allocation. If you write a new 1G file, thanks to delayed allocation, the blocks aren't allocated, so an fsync() of some other file will not cause that 1G file to be forced out to disk. What will happen instead is that the VM subsystem will gradually dribble out that 1G file over a period of time controlled by /proc/sys/vm/dirty_expire_centisecs and /proc/sys/vm/dirty_writeback_centisecs.
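If you want to see what those two knobs are set to on a given machine, something like the following rough sketch will do. The helper name read_writeback_tunables and its `base` parameter are hypothetical, added only so the function can be pointed at a different directory; the values in the files are in hundredths of a second.

```python
from pathlib import Path


def read_writeback_tunables(base="/proc/sys/vm"):
    """Return the two VM writeback timers mentioned above, in seconds."""
    names = ("dirty_expire_centisecs", "dirty_writeback_centisecs")
    result = {}
    for name in names:
        # Each file holds a single integer in centiseconds.
        centisecs = int((Path(base) / name).read_text().strip())
        result[name] = centisecs / 100.0
    return result
```

On a Linux box, calling read_writeback_tunables() with no arguments reads the live values from /proc/sys/vm.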

This problem you describe with fsync() and ext3's data=ordered mode is unique to ext3; no other filesystem has it. Fortunately or unfortunately, ext3 is the most commonly used filesystem, so people have gotten used to its quirks, and worse yet, seem to assume that they are true for all other filesystems. One of the reasons why we implemented delayed allocation was precisely to solve this problem. Of course, we're now running into the issue that there are people who have been avoiding fsync() at all costs thanks to ext3, so now we're trying to implement some hacks so that ext4 will behave somewhat similarly to ext3 in at least some circumstances.

The problem here is really balance; if I implement a data=alloc-on-commit mode, it will have all of the downsides of ext3 with respect to fsync() being slow for "entangled writes" (where you have both a large file which you are copying and a small file which you are fsync()'ing). So it will encourage the same bad behaviour, which will mean people will still have the same bad habits when they decide they want to switch to some new, more featureful filesystem, like btrfs. The one good thing about "alloc-on-replace-via-truncate" and "alloc-on-replace-via-rename" is that they handle the most annoying set of problems (which is an existing file getting rewritten turning into a zero-length file on a crash), without necessarily causing an implied fsync() on commit for all dirty files (which is what ext3 was doing).

It's interesting that some people keep talking about how terrible the implied fsync() is while simultaneously arguing that ext3's behaviour is what they want. What ext3 was doing was effectively a forced fsync() for all dirty files at each commit (which happens every 5 seconds by default). Maybe people didn't realize that was what was going on, but that's precisely what ext3's data=ordered means.