Optimizing an I/O bound Win32 application


I'm trying to optimize an I/O-bound C++ Win32 application. What it does is very similar to recursing a folder and computing a cryptographic hash of every file it finds. It's a single-threaded application using memory-mapped files, so, as is easy to imagine, it doesn't seem to use much CPU: most of the time the main thread is put to sleep waiting for I/O to complete. I'm thinking about a couple of solutions, but I'm not sure about them, so I'd like to have your opinions.

  1. I could spawn many threads (with a fixed-size pool of workers to keep memory usage under a certain threshold), but honestly I don't know if this would make the situation better: each thread is going to be put to sleep just like the main thread in the current implementation, and the scheduler would waste a lot of computation power switching contexts.
  2. I was thinking about I/O completion ports (single thread? multiple?), but that would mean abandoning memory-mapped files (am I wrong?) and using standard file operations. If this is the case, could you please provide some example code on how to use IOCP to read and process a given list of files without putting the reading thread to sleep?

Any other idea/suggestion/etc would be really appreciated :)

Thanks.


There are 5 answers

Domi (best answer)

Before parallelizing anything, always ask yourself first: does the added complexity justify the performance gained? To answer that question with minimal effort, just test what percentage of the maximum read throughput you already achieve: measure your current read throughput, then measure the maximum (don't use the theoretical maximum for this computation). Then think about how much complexity, and how many potential issues, even the simplest approach introduces in order to gain those last few percent.

As already mentioned in the comments, the greatest performance gain here is probably achieved by pipelining (i.e. overlapping computation and I/O). And the easiest way to implement that is with asynchronous reads. This thread lists multiple ways to implement asynchronous file I/O in C++.

If you don't need portability, just use the Windows OVERLAPPED API. Boost ASIO does not seem to make File I/O very easy (yet). I could not find any good examples.

Note that, depending on your system configuration, you may have to launch multiple threads to fully saturate I/O bandwidth (especially if the files in that folder actually reside on multiple disks, which is possible). Even if you only read from one device, you might fare (slightly) better with multiple threads to mitigate OS overhead.

Jerry Coffin

My experience indicates that memory mapping isn't particularly fast, so that would probably be the first thing I'd abandon.

Threading (explicitly or via IOCPs) probably won't do much good either, unless the target system has lots of disk drives, and can split things up so different threads are operating on different physical drives.

Once you've given up on memory mapping and do explicit I/O, you probably want to use FILE_FLAG_NO_BUFFERING and read relatively large blocks (say, a few megabytes at a time). Do check the alignment requirements on your block of memory though--they're a little tricky (or maybe "tedious" would be a better word to describe them). Also note that this only works for reads that are multiples of the disk's sector size, so in a typical case you need to open the file twice, once with FILE_FLAG_NO_BUFFERING to read the bulk of the data, then again without that flag to read the "tail" of the file.

Although it only copies a file (rather than processing the contents), and it's probably pure C, not C++, perhaps some demo code will be of at least a little help:

#include <windows.h>
#include <stdio.h>

int do_copy(char const *in, char const *out) {

    HANDLE infile;
    HANDLE outfile;
    char *buffer;
    DWORD read, written;
    DWORD junk=0;
    unsigned long little_tail;
    unsigned long big_tail;
    unsigned __int64 total_copied = 0;
    unsigned __int64 total_size = 0;
    BY_HANDLE_FILE_INFORMATION file_info;

#define size (1024 * 8192)

    buffer = (char *)VirtualAlloc(NULL, size, MEM_COMMIT, PAGE_READWRITE);
    if (NULL == buffer)
        return 0;

    infile = CreateFile(in, 
        GENERIC_READ, 
        FILE_SHARE_READ,
        NULL,
        OPEN_EXISTING,          /* don't create the input file if it's missing */
        FILE_FLAG_NO_BUFFERING, 
        NULL);
    if (infile == INVALID_HANDLE_VALUE)
        return 0;

    GetFileInformationByHandle(infile, &file_info);
    /* total_size holds 1% of the file size, so total_copied / total_size is a percentage */
    total_size = (((unsigned __int64)file_info.nFileSizeHigh << 32) | file_info.nFileSizeLow) / 100;
    if (total_size == 0)
        total_size = 1;

    outfile = CreateFile(out, 
        GENERIC_WRITE, 
        FILE_SHARE_READ,
        NULL,
        CREATE_ALWAYS, 
        FILE_FLAG_NO_BUFFERING, 
        NULL);

    if ((infile == INVALID_HANDLE_VALUE) || (outfile == INVALID_HANDLE_VALUE))
        return 0;

    while (ReadFile(infile, buffer, size, &read, NULL) && read == size) {
        WriteFile(outfile, buffer, read, &written, NULL);
        total_copied += written;
        fprintf(stderr, "\rcopied: %lu %%", (unsigned long)(total_copied / total_size));
    }

    /* assumes the sector size divides 4096; real code should query it, e.g. with GetDiskFreeSpace */
    little_tail = read % 4096;
    big_tail = read - little_tail;

    WriteFile(outfile, buffer, big_tail, &written, NULL);

    CloseHandle(infile);
    CloseHandle(outfile);

    outfile = CreateFile(out, 
        GENERIC_WRITE, 
        0,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_SEQUENTIAL_SCAN, 
        NULL);
    fprintf(stderr, "\rcopied: 100 %%\n");

    SetFilePointer(outfile, 0, &junk, FILE_END);
    WriteFile(outfile, buffer+big_tail, little_tail, &written, NULL);
    CloseHandle(outfile);
    VirtualFree(buffer, 0, MEM_RELEASE);    /* dwSize must be 0 with MEM_RELEASE */
    return 1;
}
gast128

Be wary of multi-threading I/O-bound scenarios. I have a case where it makes things only slightly faster on an SSD but a lot slower on an HDD. The following results are from a test reading one big file per thread on Windows 10:

SSD:

  • 1 thread 200 MB/s
  • 2 threads 100 MB/s

HDD:

  • 1 thread 100 MB/s
  • 2 threads 10 MB/s
Olof Forshell

Why does everyone appear so sure that IOCPs are out of the question? With IOCPs you can basically initiate every read you want to do, and the completions will arrive in your software as they finish, which more often than not will not be in the order they were issued. Then you do your crypto work and store the hash, or whatever; in the meantime one or more further IOCP reads will have completed, and you can process those next.

Is it not even worth a shot?

Fredrik Wahlgren

I can tell you how I successfully solved a very similar problem: JPEG optimisation.

Let's say we have four threads. Each thread is given the path and a thread number from 0 to 3. Each thread needs to know the current file number, so it needs a global variable like int gFileNumber[4].

The threads use FindFirstFile/FindNextFile, and whenever a thread finds a .jpg (in my case), it increments its file number: gFileNumber[threadnumber]++.

Now comes the part I think is clever. If (threadnumber == gFileNumber[threadnumber] % 4), then this thread should deal with this file: just calculate the hash. The threads will calculate the hashes for different files at the same time.