WaitForSingleObject vs Interlocked*

The WinAPI provides the WaitForSingleObject()/ReleaseMutex() function pair, as well as the Interlocked*() family of functions. I decided to compare the performance of acquiring a single mutex against exchanging an interlocked variable.

HANDLE mutex = CreateMutex(NULL, FALSE, NULL); // unnamed, initially unowned mutex
WaitForSingleObject(mutex, INFINITE);          // acquire
// ..
ReleaseMutex(mutex);                           // release

volatile LONG lock = 0; // 0 unlocked, 1 locked
while(InterlockedCompareExchange(&lock, 1, 0)) // spin until we swap 0 -> 1
    SwitchToThread();                          // lock taken: yield to another ready thread
// ..
InterlockedExchange(&lock, 0); // release: atomically store 0
SwitchToThread();              // give a waiting thread a chance to run

I measured the performance of these two methods and found that using Interlocked*() is about 38% faster. Why is that?

Here's my performance test:

#include <windows.h>
#include <iostream>
#include <conio.h>
using namespace std;

volatile LONG interlocked_variable = 0; // 0 unlocked, 1 locked
volatile LONG run                  = 1; // cleared by main() to stop the workers

DWORD WINAPI thread(LPVOID lpParam)
{
    while(run)
    {
        while(InterlockedCompareExchange(&interlocked_variable, 1, 0))
            SwitchToThread();          // lock taken: yield and retry
        ++(*((unsigned int*)lpParam)); // count one completed lock/unlock cycle
        InterlockedExchange(&interlocked_variable, 0); // release the lock
        SwitchToThread();              // let a waiting thread acquire it
    }

    return 0;
}

int main()
{
    unsigned int num_threads;
    cout << "number of threads: ";
    cin >> num_threads;
    unsigned int* num_cycles = new unsigned int[num_threads];
    DWORD s_time, e_time;

    s_time = GetTickCount();
    for(unsigned int i = 0; i < num_threads; ++i)
    {
        num_cycles[i] = 0;
        HANDLE handle = CreateThread(NULL, 0, thread, &num_cycles[i], 0, NULL);
        CloseHandle(handle);
    }
    _getch();
    run = 0;
    e_time = GetTickCount();

    unsigned long long total = 0;
    for(unsigned int i = 0; i < num_threads; ++i)
        total += num_cycles[i];
    for(unsigned int i = 0; i < num_threads; ++i)
        cout << "\nthread " << i << ":\t" << num_cycles[i] << " cyc\t" << ((double)num_cycles[i] / (double)total) * 100 << "%";
    cout << "\n----------------\n"
        << "cycles total:\t" << total
        << "\ntime elapsed:\t" << e_time - s_time << " ms"
        << "\n----------------"
        << '\n' << (double)(e_time - s_time) / (double)(total) << " ms\\op\n";

    delete[] num_cycles;
    _getch();
    return 0;
}

There are 2 answers

Roman Ryltsov (accepted answer):

WaitForSingleObject is not expected to be faster. It covers a much wider range of synchronization scenarios; in particular, you can wait on handles that do not "belong" to your process, which enables interprocess synchronization. Taking all of this into consideration, it is only 38% slower according to your test.

If you have everything inside your process and every nanosecond counts, InterlockedXxx might be the better option, but it is definitely not an absolutely superior one.

Additionally, you might want to look at the Slim Reader/Writer (SRW) Locks API. You could perhaps build a similar class or set of functions based purely on InterlockedXxx with slightly better performance, but the point is that with SRW locks you get something ready to use out of the box, with documented behavior, stability, and decent performance.
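
For illustration, here is a minimal sketch of mutual exclusion built on an SRW lock; the locked_increment function is made up for this example, mirroring the counter increments in your test:

#include <windows.h>

SRWLOCK srw = SRWLOCK_INIT; // statically initialized, no explicit cleanup needed

void locked_increment(unsigned int* counter)
{
    AcquireSRWLockExclusive(&srw);  // exclusive acquire gives mutex-like semantics
    ++(*counter);                   // critical section
    ReleaseSRWLockExclusive(&srw);
}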

Len Holgate:

You are not comparing equivalent locks, so it's not surprising that the performance is so different.

A mutex allows for cross-process locking; it is likely one of the most expensive ways to lock because of the flexibility it provides. It will usually put your thread to sleep when you block on the lock, using no CPU until you are woken up having acquired it. This allows other code to use the CPU.
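
To illustrate the flexibility you are paying for, here is a minimal sketch of the cross-process case; the name "Global\\MyAppLock" is a placeholder invented for this example:

#include <windows.h>

void cross_process_example()
{
    // A named mutex can be opened by other processes that use the same name.
    HANDLE mutex = CreateMutexW(NULL, FALSE, L"Global\\MyAppLock");
    if (mutex == NULL)
        return;

    WaitForSingleObject(mutex, INFINITE); // may wait on a thread in another process
    // ... protected work ...
    ReleaseMutex(mutex);
    CloseHandle(mutex);
}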

Your InterlockedCompareExchange() code is a simple spin lock. You will burn CPU waiting for your lock.

You might also want to look into Critical Sections (less overhead than a mutex) and Slim Reader/Writer Locks (which can be used for mutual exclusion if you always obtain an exclusive lock, and which provide fractionally faster performance than critical sections in the uncontended case, according to my tests).
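
As a rough sketch of the critical-section option (locked_increment is hypothetical, mirroring the thread body in the question):

#include <windows.h>

CRITICAL_SECTION cs; // call InitializeCriticalSection(&cs) before any thread uses it

void locked_increment(unsigned int* counter)
{
    EnterCriticalSection(&cs);  // user-mode fast path when uncontended, kernel wait otherwise
    ++(*counter);               // critical section
    LeaveCriticalSection(&cs);
}

// ... and call DeleteCriticalSection(&cs) once all threads are done.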

You might also want to read Kenny Kerr's "The Evolution of Synchronization in Windows and C++" and Jeff Preshing's posts on locks.