Under WinAPI there is the WaitForSingleObject()/ReleaseMutex() function pair, and there is also the Interlocked*() function family. I decided to compare the performance of acquiring a single mutex against exchanging an interlocked variable.
HANDLE mutex = CreateMutex(NULL, FALSE, NULL); // unnamed, initially unowned
WaitForSingleObject(mutex, INFINITE);          // acquire
// ..
ReleaseMutex(mutex);                           // release
// 0 unlocked, 1 locked
LONG lock = 0;
while (InterlockedCompareExchange(&lock, 1, 0)) // spin until we swap 0 -> 1
    SwitchToThread();                           // yield while someone else holds the lock
// ..
InterlockedExchange(&lock, 0);                  // release
SwitchToThread();                               // give waiting threads a chance to run
I've measured the performance of these two methods and found that using Interlocked*() is about 38% faster. Why is that?
Here's my performance test (the interlocked version):
#include <windows.h>
#include <iostream>
#include <conio.h>
using namespace std;

LONG interlocked_variable = 0; // 0 unlocked, 1 locked
volatile int run = 1;          // cleared by the main thread to stop the workers

DWORD WINAPI thread(LPVOID lpParam)
{
    while (run)
    {
        // spin until the lock is acquired (0 -> 1)
        while (InterlockedCompareExchange(&interlocked_variable, 1, 0))
            SwitchToThread();
        ++(*((unsigned int*)lpParam)); // increment this thread's cycle counter
        InterlockedExchange(&interlocked_variable, 0); // release the lock
        SwitchToThread(); // give other threads a chance to acquire it
    }
    return 0;
}

int main()
{
    unsigned int num_threads;
    cout << "number of threads: ";
    cin >> num_threads;
    unsigned int* num_cycles = new unsigned int[num_threads];
    DWORD s_time, e_time;
    s_time = GetTickCount();
    for (unsigned int i = 0; i < num_threads; ++i)
    {
        num_cycles[i] = 0;
        HANDLE handle = CreateThread(NULL, 0, thread, &num_cycles[i], 0, NULL);
        CloseHandle(handle);
    }
    _getch(); // run until a key is pressed
    run = 0;
    e_time = GetTickCount();
    unsigned long long total = 0;
    for (unsigned int i = 0; i < num_threads; ++i)
        total += num_cycles[i];
    for (unsigned int i = 0; i < num_threads; ++i)
        cout << "\nthread " << i << ":\t" << num_cycles[i] << " cyc\t"
             << ((double)num_cycles[i] / (double)total) * 100 << "%";
    cout << "\n----------------\n"
         << "cycles total:\t" << total
         << "\ntime elapsed:\t" << e_time - s_time << " ms"
         << "\n----------------"
         << '\n' << (double)(e_time - s_time) / (double)total << " ms/op\n";
    delete[] num_cycles;
    _getch();
    return 0;
}
WaitForSingleObject does not have to be faster. It covers a much wider scope of synchronization scenarios; in particular, you can wait on handles that do not "belong" to your process, which gives you interprocess synchronization. Waiting on a mutex handle also goes through a kernel object, with a user-to-kernel transition even when the mutex is free, whereas an InterlockedXxx call is a single atomic instruction executed entirely in user mode. Taking all this into consideration, it is only 38% slower according to your test.
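To illustrate the interprocess point, here is a minimal sketch of sharing a mutex across two processes by name; the name "Global\MyAppMutex" is a placeholder invented for this example:
// Process A: create a named mutex visible to other processes.
// "Global\\MyAppMutex" is a placeholder name chosen for this sketch.
HANDLE mutex = CreateMutexW(NULL, FALSE, L"Global\\MyAppMutex");

// Process B: open the same kernel object by name and wait on it.
HANDLE same = OpenMutexW(SYNCHRONIZE, FALSE, L"Global\\MyAppMutex");
WaitForSingleObject(same, INFINITE);
// ... critical section shared between the two processes ...
ReleaseMutex(same);
CloseHandle(same);
An interlocked variable lives in one process's address space, so nothing like this is possible with InterlockedXxx alone.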
If you have everything inside your process and every nanosecond counts, InterlockedXxx might be a better option, but it is definitely not an absolutely superior one. Additionally, you might want to look at the Slim Reader/Writer (SRW) Locks API. You could perhaps build a similar class or set of functions based purely on InterlockedXxx with slightly better performance; however, the point is that with SRW locks you get them ready to use out of the box, with documented behavior, stable, and with decent performance anyway.
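For completeness, a minimal sketch of exclusive locking with the SRW API (declared in windows.h, available since Windows Vista), replacing the interlocked spin loop above:
SRWLOCK srw = SRWLOCK_INIT; // statically initializable, no kernel handle to create or close
AcquireSRWLockExclusive(&srw);
// ..
ReleaseSRWLockExclusive(&srw);
The uncontended path stays in user mode, and contended threads block instead of spinning, which is roughly what a hand-rolled InterlockedXxx lock would otherwise have to reimplement itself.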