Comment 168 for bug 317781

Tom B. (tom-bg) wrote:

@CowBoyTim

I agree with you. I work with real-time industrial systems, where the shop-floor systems are considered unreliable. We have all the same issues as a regular desktop user, except our users have bigger hammers. The attraction of ext3 was journalling with the ordered data mode: if power was cut, the file system could be reassembled to a recent point in time, with only the most recent data lost. This bug in ext4 results in zero-length files, and not only in the most recent files, either.

All fsync() does is bypass one layer of write-back caching. That makes the window of data loss smaller in the specific case of infrequent fsync() calls; by itself, fsync() does nothing to guarantee data integrity. I think this is why Bogdan was complaining about defective MySQL databases: given the benchmarks, it is likely that the file system truncated the entire database file to zero length. Specifically, fsync() guarantees the data is on the disk; it does not guarantee that the file system knows where the file is. As such, one could call fsync() and still not be able to get at the data after a reboot.
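
To make that concrete, here is a minimal sketch (my own illustration with made-up names, not code from any application in this thread) of what an application has to do if it really wants the data findable after a reboot: write to a temporary file, fsync() it, rename() it over the old copy, and then fsync() the parent directory so the rename itself is committed:

/* Sketch only: error paths are simplified, names are hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int save_file(const char *dir, const char *final_path, const char *tmp_path,
              const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0)
        return -1;
    if (rename(tmp_path, final_path) != 0)  /* atomic replace of old copy */
        return -1;
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    if (fsync(dfd) != 0) {                  /* commit the directory entry */
        close(dfd);
        return -1;
    }
    return close(dfd);
}

Note how much of this is about the directory, not the file: skip the last fsync() and the data can be safely on disk while the file system still points at the old (or a zero-length) version.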

The arguments against telling every application developer to use fsync() are:
1. Under heavy file I/O, fsync() can decrease average I/O throughput by defeating the write-back cache. This can make the window of data loss larger, especially on a real-time system where the incoming data rate is fixed.
2. Repeated calls to fsync() are very rough on laptop mode and on SSDs (solid-state disks).
3. Repeated calls to fsync() will limit maximum file system performance for desktop applications. Eventually, file system developers will replace fsync() with an empty function, just as Apple did.
4. If everyone is going to want fsync(), why not just modify the close() function to call fsync()? (A trivial wrapper along those lines is sketched after this list.)
5. There is a strong correlation between user activity and system crashes. Not using fsync() leads to much more understandable system behavior.
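
For point 4, the wrapper would be trivial; this is just a hypothetical illustration, not a proposal for libc:

/* Hypothetical helper: a close() that flushes first. */
#include <unistd.h>

int fsync_close(int fd)
{
    int frc = fsync(fd);   /* flush this file's dirty data */
    int crc = close(fd);   /* then release the descriptor */
    return frc != 0 ? frc : crc;
}

Of course, making every close() behave like this is exactly what arguments 1 through 3 warn against.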

Imagine a typical self-inflicted system crash. It can be caused either directly ("press Save, then turn off the computer") or indirectly ("edit a video game config, hit Play, and watch the video driver crash").

If the write-back cache is enabled and fsync() is not used, the program will write data to the cache, cause a bunch of disk reads, and then, during idle time, the data will be written to disk. If the user-generated activity results in disk reads, the write-back cache will "protect" the old version of the file. The user learns that crashing the machine costs him his most recent changes.

On the other hand, if fsync() is used to bypass the write-back cache, programmers will start calling fsync() and close() from background threads. This results in a poor user experience: the hard disk thrashes during program startup (when all the disk reads are happening), and anything can happen if the system crashes during the fsync().
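
As a hypothetical sketch of the background-thread pattern I mean (the names are mine), the descriptor is handed to a detached worker and must not be touched afterward:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void *flush_and_close(void *arg)
{
    int fd = (int)(intptr_t)arg;
    if (fsync(fd) != 0)    /* flush off the UI thread */
        perror("fsync");
    if (close(fd) != 0)
        perror("close");
    return NULL;
}

void save_in_background(int fd)
{
    pthread_t t;
    if (pthread_create(&t, NULL, flush_and_close, (void *)(intptr_t)fd) == 0)
        pthread_detach(t);
}

The UI stays responsive, but now the flush races the very crash it was supposed to beat.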

When system crashes correlate with user activity, it is really tempting, from a software point of view, to try to get the fsync() in before the crash occurs. Unfortunately, in practice that is really tough to do. A journaled file system with an ordered data mode is a good compromise for many desktop and real-time applications. Additionally, limited fsync() use preserves the effectiveness of fsync() for the applications that really need it, like databases.
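
For reference, ordered mode is just a mount option; ext3 defaults to it, and ext4 accepts the same flag. The device and mount point in this fstab line are placeholders:

/dev/sda2  /data  ext3  defaults,data=ordered  0  2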