Why are writes with O_DIRECT and O_SYNC still causing I/O merges?


Hi everyone,

Recently, I ran some tests with fio to measure my disk's performance. I configured fio to use direct I/O and O_SYNC; here is my configuration:

[global]
invalidate=0    # mandatory
direct=1
sync=1
thread=1
norandommap=1
runtime=10000
time_based=1

[write4k-rand]
stonewall
group_reporting
bs=4k
size=1g
rw=randwrite
numjobs=1
iodepth=1

However, while monitoring the disk through iostat as fio was running, I saw the following output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.12    0.00    0.08    3.81    0.00   95.98

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    39.50    0.00  176.00     0.00  1648.00     9.36     1.02    5.81   5.65  99.50

wrqm/s is 39.50. If I stop fio, wrqm/s drops to 0. Why are there still I/O merges when I'm doing direct I/O with O_SYNC? Please help me.

Thank you:-)

1 Answer

Anon (best answer):

On Linux, doing direct I/O doesn't mean "do this exact I/O" - it is only a hint telling the kernel to bypass its page cache. At the time of writing, the open(2) man page says this about O_DIRECT:

Try to minimize cache effects of the I/O to and from this file.

This means components like the Linux I/O scheduler are still free to do their thing with regard to merges, reorderings, etc. even with O_DIRECT I/O (your use of fio's sync=1 is what limits the reordering).

Additionally, if you are doing I/O to a file in a filesystem, it is legitimate for that filesystem to ignore the O_DIRECT hint and fall back to buffered I/O.
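For reference, the kind of write fio is doing here can be reproduced outside fio. The following is a minimal, Linux-oriented Python sketch (the file name and 4096-byte block size are assumptions) of a single aligned 4 KiB write opened with O_DIRECT and O_SYNC; note the fallback path, since some filesystems refuse O_DIRECT at open() instead of silently going buffered:

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; matches fio's bs=4k

# O_DIRECT needs the user buffer, the file offset, and the length to be
# block-aligned; an anonymous mmap is page-aligned, which satisfies that here.
buf = mmap.mmap(-1, BLOCK)
buf[:] = b"\xab" * BLOCK

path = "direct-test.bin"  # hypothetical scratch file
flags = os.O_WRONLY | os.O_CREAT | os.O_SYNC
flags |= getattr(os, "O_DIRECT", 0)  # O_DIRECT only exists on some platforms

try:
    fd = os.open(path, flags, 0o644)
except OSError:
    # Some filesystems (tmpfs, for example) reject O_DIRECT outright
    # rather than falling back to buffered I/O themselves.
    fd = os.open(path, flags & ~getattr(os, "O_DIRECT", 0), 0o644)

try:
    written = os.write(fd, buf)  # one synchronous, aligned 4 KiB write
finally:
    os.close(fd)
    os.unlink(path)
```

Even when the O_DIRECT open succeeds, requests produced this way still pass through the block layer, where the scheduler may merge them with adjacent requests.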

See the description of the nomerges parameter in https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt for how to tell the scheduler to avoid merging and rearranging, but note that you cannot prevent the splitting of a request that is too large.
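Concretely, the knob lives in sysfs (the device name sda is taken from your iostat output; writing it needs root):

```shell
# 0 = all merges enabled (default), 1 = only simple one-hit merges, 2 = no merges
cat /sys/block/sda/queue/nomerges
echo 2 > /sys/block/sda/queue/nomerges   # disable merging entirely
```

Setting it back to 0 re-enables the default behaviour.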

Having said all the above, not much I/O merging (as given by wrqm/s) is actually happening in your scenario, but there is still something a bit strange. The avgrq-sz is 9.36, and since that value is in 512-byte sectors, the average request submitted down to the disk is 4792.32 bytes. That is fairly close to the 4096-byte block size fio is using.

Since you can't do non-sector-sized I/O to a disk, and assuming the disk's sector size is 512 bytes, this suggests a merge of 4 KBytes + 512 bytes (I assume the rest is noise). However, since it's an average, something else could be doing larger I/O at the same time fio is doing small I/O, with the average simply coming out somewhere in between. Because the I/O is happening to a file in a filesystem, filesystem metadata updates might explain it...
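As a sanity check on that arithmetic, using the figures straight from the iostat sample in the question:

```python
# Figures from the iostat output above.
wrqm_per_s = 39.50   # write requests merged per second
w_per_s = 176.00     # write requests issued to the device per second
avgrq_sz = 9.36      # average request size, in 512-byte sectors

avg_bytes = avgrq_sz * 512
print(avg_bytes)                  # 4792.32 - just above fio's 4096-byte bs

# Fraction of incoming write requests that were merged rather than issued:
merged_fraction = wrqm_per_s / (wrqm_per_s + w_per_s)
print(round(merged_fraction, 3))  # 0.183 - roughly 18% of writes got merged
```

So fewer than one in five writes is being merged, consistent with "not all that much merging" above.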