I'm writing some software to deal with pretty critical data, and need to know what exactly I need to do to achieve durability.
Everywhere I look is contradictory information, so I'd appreciate any insight.
There are three ways I write to disk:
1. Using O_DIRECT | O_DSYNC, pread'ing and then pwrite'ing blocks of 512 bytes to 16 MB.
2. Using O_DIRECT, pread'ing and then pwrite'ing 512-byte blocks, and calling fdatasync as regularly as necessary.
3. Using a memory-mapped file, on which I call msync(..., MS_SYNC | MS_INVALIDATE) as regularly as necessary.
And this is all on ext4 with default flags.
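For concreteness, the second approach (pwrite followed by fdatasync) might look roughly like the sketch below. The names open_data_file and write_block_synced are hypothetical helpers made up for illustration, not anything from a library. The sketch only shows the call sequence and error handling; it makes no claim about what the device or the layers beneath the filesystem do once fdatasync returns.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) a data file. Pass O_DIRECT and/or
 * O_DSYNC in extra_flags as needed, or 0 for ordinary buffered I/O. */
int open_data_file(const char *path, int extra_flags)
{
    assert(path != NULL);
    return open(path, O_RDWR | O_CREAT | extra_flags, 0600);
}

/* Hypothetical helper: write len bytes at offset off, then fdatasync.
 * Returns 0 on success, -1 with errno set on failure.
 * Assumption: if fd was opened with O_DIRECT, the caller supplies a
 * suitably aligned buffer, length and offset. A short write could leave
 * the retry misaligned, which this sketch does not handle. */
int write_block_synced(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing: retry */
            return -1;
        }
        if (n == 0) {           /* no progress: give up, do not spin */
            errno = EIO;
            return -1;
        }
        p += n;
        off += n;
        len -= (size_t)n;
    }
    return fdatasync(fd);       /* 0 on success, -1 with errno set */
}
```

With O_DIRECT, the buffer would typically come from posix_memalign, aligned to the device's logical block size (often 512 or 4096 bytes, but that is an assumption to check per device), for example open_data_file(path, O_DIRECT) followed by write_block_synced(fd, buf, 512, 0).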
For all of these, is it possible for data to be lost (after the write or sync has returned) or corrupted by a power failure, panic, crash, or anything else?
Is it possible that if my server dies mid pwrite, or between the beginning of pwrite and the end of fdatasync, or between the mapped memory being altered and msync, I'll have a mix of old and new data, or will it be one or the other? I want my individual pwrite calls to be atomic and ordered. Is this the case? And is it the case if they're across multiple files? So if I write with O_DIRECT | O_DSYNC to A, then O_DIRECT | O_DSYNC to B, am I guaranteed that, no matter what happens, if the data is in B it's also in A?
Does fsync even guarantee that the data's written? This says not, but I don't know if things have changed since then.
Does the journalling of ext4 completely solve the issue of corrupt blocks that this SO answer says exist?
I'm currently growing files by calling posix_fallocate and then ftruncate. Are both of these necessary, and are they enough? I figured that ftruncate would actually initialise the allocated blocks to avoid these issues.
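As a rough sketch of the growing step: POSIX specifies that posix_fallocate itself extends the file size when offset + len goes past the current end, so a following ftruncate to the same size should be redundant, while ftruncate alone would typically leave an unallocated hole. The helper name grow_file is hypothetical, and the fsync at the end is my assumption about what is needed to make the new size durable, not something confirmed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: extend fd to at least new_size bytes with blocks
 * allocated, then flush data and metadata.
 * Returns 0 on success or an errno value on failure. */
int grow_file(int fd, off_t new_size)
{
    struct stat st;
    int err;

    assert(new_size >= 0);
    if (fstat(fd, &st) != 0)
        return errno;
    if (st.st_size >= new_size)
        return 0;   /* nothing to extend (does not check for holes) */

    /* posix_fallocate returns the error number directly, not via errno. */
    err = posix_fallocate(fd, 0, new_size);
    if (err != 0)
        return err;

    /* Assumption: fsync is used so the size change is flushed as well. */
    if (fsync(fd) != 0)
        return errno;
    return 0;
}
```

Calling grow_file(fd, 4096) on an empty file should leave st_size at 4096; a later call with a smaller size is a no-op.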
To add confusion to the mix, I'm running this on EC2, and I don't know if that affects anything. It does make this very hard to test, though, as I can't control how aggressively the instance gets shut down.
Absolutely.
No. The answer is device dependent and likely filesystem dependent. Unfortunately, that filesystem could be layers and layers above the "actual" storage device (e.g. md, lvm, fuse, loop, ib_srp, etc.).
That's true. But you can probably still use an NMI or sysrq-trigger to create a pretty abrupt halt.
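A minimal sketch of the sysrq-trigger route, assuming a Linux guest where /proc/sysrq-trigger exists and the caller is root. The helper name write_sysrq is hypothetical. It takes the path as a parameter so it can be dry-run against an ordinary file first. Per the kernel's sysrq documentation, 'c' forces a crash and 'b' reboots immediately without syncing or unmounting.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write a single sysrq command character to
 * trigger_path. Returns 0 on success, -1 on failure.
 * With trigger_path = "/proc/sysrq-trigger" and command = 'c' the kernel
 * crashes at once, so this call would normally never return. */
int write_sysrq(const char *trigger_path, char command)
{
    int fd;
    ssize_t n;

    assert(trigger_path != NULL);
    fd = open(trigger_path, O_WRONLY);
    if (fd < 0)
        return -1;
    n = write(fd, &command, 1);
    close(fd);
    return n == 1 ? 0 : -1;
}
```

Something like write_sysrq("/proc/sysrq-trigger", 'c'), issued while the writer is mid-pwrite, would approximate an abrupt halt. Only do that on a disposable instance, and check the files after reboot.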