Why does buffered i/o take longer than direct i/o with bigger write-buffer?


I have tested i/o performance and noticed an interesting behaviour that I cannot explain.

The first program sets the stream buffer to 4096 bytes and then writes one byte 100,000,000 times. On my target system this operation takes 40.806 s.

struct timespec start, finish, delta;

char write_buffer[4096];

char buffer_to_write[1] = {0x00};

FILE* fd = fopen("file.txt", "wb");
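/* setbuf() assumes the buffer holds BUFSIZ bytes, so pass the real size. */
setvbuf(fd, write_buffer, _IOFBF, sizeof(write_buffer));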

clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < 100000000; i++) {
   fwrite(buffer_to_write, sizeof(char), sizeof(buffer_to_write), fd);
}
clock_gettime(CLOCK_REALTIME, &finish);

The other program uses direct i/o: it writes a 4096-byte buffer 24,414 times with write(), so the amount of data written is about the same. This operation takes only 0.5 s.

struct timespec start, finish, delta;

char buffer_to_write[4096] = {0x00};

int fd = open("file.txt", O_WRONLY, 0);

clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < 24414; i++) {
   write(fd, buffer_to_write, sizeof(buffer_to_write));
}
clock_gettime(CLOCK_REALTIME, &finish);

From my understanding, the number of system calls should be the same in both programs. I don't see why the program using buffered i/o takes so much more time, even though the data should only be sent to kernel space on every 4096th loop pass...


There is 1 answer

Answer by Petr Skocik

Non-syscall function calls still have overhead. Multiply that by a lot and your overhead can exceed that of a single system call, which may cost quite a bit to enter but then processes the whole buffer without invoking a function on each byte.

And fwrite isn't just the function call overhead. It has to take a lock, do a range check, and probably call memcpy (expecting more than just 1 byte, otherwise why wouldn't you call fputc or fputc_unlocked instead?).

I'm getting a ratio of about 17:1, with the syscall version taking about 1.2 ns per byte (which, BTW, is slightly less than a no-op function call on my machine). Replacing fwrite with fputc/fputc_unlocked improves it to about 4:1. Having the stdio version write an equally sized buffer per call, instead of a byte at a time, makes it about 1:1 (in-cache memcpy is rather fast).