Comment 146 for bug 317781

Theodore Ts'o (tytso) wrote :

@pablomme,

Well, until the journal has been committed, none of the modified meta-data blocks are allowed to be written to disk, so any changes to the inode table, block allocation bitmaps, inode allocation bitmaps, indirect blocks, extent tree blocks, and directory blocks all have to stay pinned in memory. The longer you hold off on the journal commit, the more file system meta-data blocks are pinned in memory. And of course, you can't do this forever; eventually the journal will be full, and a new journal commit will be forced to happen, whether or not the data blocks have been allocated yet.
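
As a rough illustration of the pinning described above, here is a toy Python model. It is not kernel code: the class name, the block labels, and the capacity of 4 are all made up for the example, and a real journal tracks far more state than this.

```python
class ToyJournal:
    """Toy model: modified meta-data blocks stay pinned until a commit."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical limit on blocks per transaction
        self.pinned = set()       # dirty meta-data blocks that may not hit disk yet
        self.commits = 0

    def modify(self, block):
        self.pinned.add(block)
        if len(self.pinned) >= self.capacity:
            self.commit()         # journal is full, so a commit is forced

    def commit(self):
        self.pinned.clear()       # after the commit the blocks may be written in place
        self.commits += 1


journal = ToyJournal(capacity=4)
for block in ["inode-table:7", "block-bitmap:2", "inode-bitmap:2", "dir:41", "extent:9"]:
    journal.modify(block)
```

The fourth modification fills the toy journal and forces a commit; the fifth starts pinning blocks again for the next transaction.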

Part of the challenge here is that normally the VM subsystem decides when it's time to write out dirty pages, and the VM subsystem has no idea about ordering constraints based on the filesystem journal. And in practice, there are multiple files which will have been written out, and the moment one of them is fsync()'ed, we have to do a journal commit for all files, because we can't really reorder filesystem operations. All we can do is force the equivalent of an fsync() when a commit happens.
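
To make the "commit for all files" point concrete, here is another toy sketch in Python. The `ToyFilesystem` class and its method names are invented for illustration and do not correspond to any real interface; the only point is that operations sit in one ordered transaction, so an fsync() on one file takes everything ahead of it along.

```python
class ToyFilesystem:
    """Toy model: all pending operations share one ordered transaction."""

    def __init__(self):
        self.running = []    # (file, operation) pairs in the running transaction
        self.committed = []  # operations that have been committed, in order

    def write_metadata(self, name, op):
        self.running.append((name, op))

    def fsync(self, name):
        # We cannot pick out only the operations for `name`, because the
        # operations cannot be reordered; the whole transaction commits.
        self.committed.extend(self.running)
        self.running = []


fs = ToyFilesystem()
fs.write_metadata("a.txt", "extend")
fs.write_metadata("b.txt", "create")
fs.write_metadata("a.txt", "rename")
fs.fsync("b.txt")  # commits the pending a.txt operations as well
```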

So the closest approximation to what you want is a data=alloc-on-commit mode, with the commit interval set to some very large number, say 5 or 10 minutes. In practice the commit will happen sooner than that, especially if there are lots of filesystem operations taking place, but hopefully most of the time the VM subsystem will gradually push the pages out before the commit takes place; if the commit takes place first, the alloc-on-commit mode will force any remaining pages to disk on the transaction commit.
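
A back-of-the-envelope sketch of that trade-off, in Python. The function name, the constant writeback rate, and the page counts are assumptions chosen only to show the shape of the behaviour; real VM writeback is nowhere near this regular.

```python
def simulate(dirty_pages, writeback_per_second, commit_interval):
    """Return (pages pushed out by the VM, pages forced out at commit).

    Assumes a constant writeback rate and a commit that fires exactly
    at commit_interval seconds, which is a simplification.
    """
    written_by_vm = min(dirty_pages, writeback_per_second * commit_interval)
    forced_at_commit = dirty_pages - written_by_vm
    return written_by_vm, forced_at_commit


short_interval = simulate(1000, 10, 5)    # commit after 5 seconds
long_interval = simulate(1000, 10, 300)   # commit after 5 minutes
```

With the short interval most pages are still dirty when the commit arrives and have to be forced out; with the long interval the VM has already written everything and the commit has nothing left to force.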