std::shared_mutex::unlock_shared() blocks even though there are no active exclusive locks on Windows


My team has encountered a deadlock that I suspect is a bug in the Windows implementation of SRW locks. The code below is a distilled version of real code. Here's the summary:

  1. Main thread acquires exclusive lock
  2. Main thread creates N child threads
  3. Each child thread
    1. Acquires a shared lock
    2. Spins until all children have acquired a shared lock
    3. Releases the shared lock
  4. Main thread releases exclusive lock

Yes this could be done with std::latch in C++20. That's not the point.

This code works most of the time. However, roughly 1 in 5000 loops it deadlocks. When it deadlocks, exactly 1 child has successfully acquired a shared lock and the other N-1 children are stuck in lock_shared(). On Windows this function calls into RtlAcquireSRWLockShared and blocks in NtWaitForAlertByThreadId.

The behavior is observed whether we use std::shared_mutex directly, std::shared_lock/std::unique_lock, or call the SRW functions directly.

A 2017 Raymond Chen post asks about this exact behavior, but user error is blamed.

This looks like an SRW bug to me. It's worth noting that if a child skips the latch attempt and simply calls unlock_shared, this wakes its blocked siblings. Nothing in the documentation for std::shared_lock or *SRW* suggests a shared acquire is allowed to block when there is no active exclusive lock.
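For reference, here is a minimal sketch of that variant, based on the DoStuff function from the example code below, where a child skips the spin-wait entirely and releases immediately:

int DoStuffNoSpin(ThreadTestData* data) {
    // Acquire reader lock
    data->sharedMutex.lock_shared();
    data->readCounter.fetch_add(1);

    // No spin-wait: releasing the shared lock immediately is what wakes
    // the siblings that are stuck inside lock_shared().
    data->sharedMutex.unlock_shared();

    return 0;
}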

This deadlock has not been observed on non-Windows platforms.

Example code:

#include <atomic>
#include <cstdint>
#include <iostream>
#include <memory>
#include <shared_mutex>
#include <thread>
#include <vector>

struct ThreadTestData {
    int32_t numThreads = 0;
    std::shared_mutex sharedMutex = {};
    std::atomic<int32_t> readCounter;
};

int DoStuff(ThreadTestData* data) {
    // Acquire reader lock
    data->sharedMutex.lock_shared();

    // wait until all read threads have acquired their shared lock
    data->readCounter.fetch_add(1);
    while (data->readCounter.load() != data->numThreads) {
        std::this_thread::yield();
    }

    // Release reader lock
    data->sharedMutex.unlock_shared();

    return 0;
}

int main() {
    int count = 0;
    while (true) {
        ThreadTestData data = {};
        data.numThreads = 5;

        // Acquire write lock
        data.sharedMutex.lock();

        // Create N threads
        std::vector<std::unique_ptr<std::thread>> readerThreads;
        readerThreads.reserve(data.numThreads);
        for (int i = 0; i < data.numThreads; ++i) {
            readerThreads.emplace_back(std::make_unique<std::thread>(DoStuff, &data));
        }

        // Release write lock
        data.sharedMutex.unlock();

        // Wait for all readers to succeed
        for (auto& thread : readerThreads) {
            thread->join();
        }

        // Cleanup
        readerThreads.clear();

        // Spew so we can tell when it's deadlocked
        count += 1;
        std::cout << count << std::endl;
    }

    return 0;
}

Here's a picture of the parallel stacks. You can see the main thread is correctly blocking on thread::join. One reader thread acquired the lock and is in a yield loop. Four reader threads are blocked within lock_shared.



There are 5 answers

LordCecil (Best Answer)

This is a confirmed bug in the OS SlimReaderWriter API.

I posted a thread in r/cpp on Reddit because I knew Reddit user u/STL works on Microsoft's STL implementation and is an active user.

u/STL posted a comment declaring it an SRW bug. He filed OS bug report OS-49268777, "SRWLOCK can deadlock after an exclusive owner has released ownership and several reader threads are attempting to acquire shared ownership together". Unfortunately this is in a Microsoft-internal bug tracker, so we can't follow it.

Thanks to commenters in this thread (RbMm in particular) for helping fully explain and understand the observed behavior.

RbMm posted a second answer which appears to show that "AcquireSRWLockShared can sometimes actually acquire a slim SRW lock in exclusive mode". Read his response for details. I think almost everyone would be surprised by this behavior!

RbMm

Really, the error is in your code logic: you are waiting inside the SRW lock. Locks are not designed for waiting; they are designed for fast operations. A thread must leave the lock as quickly as possible after entering it.

What happens with your concrete code, and why?

  1. Thread (A) releases the lock from exclusive access (ReleaseSRWLockExclusive),
  2. at the "same time" another thread (B) tries to acquire the lock shared (AcquireSRWLockShared),
  3. and at least one more thread is already waiting for the lock.

In this situation it is possible for thread B to acquire the lock with, in fact, exclusive access, despite asking for shared access only. As a result, the other shared waiters and any new shared requests will wait until thread B leaves the SRW lock. But in your case B does not leave until the other threads have entered the section, and they cannot enter until B leaves: deadlock.

However, as soon as thread B leaves the SRW lock, the other threads wake up.

How can you modify your code to test this?

Save the time before entering the

while (data->readCounter.load() != data->numThreads)

loop, and check the time inside the loop. Really, after the SRW lock is released from exclusive access, all shared waiters should enter the lock "very fast". If this does not happen within some time (say 1 second, which is a HUGE time in this case), let the thread that is currently inside the SRW lock exit the loop and release the lock, and the deadlock is gone.

Try the following code:

// Requires <windows.h> for ULONG64 / GetTickCount64
int DoStuff(ThreadTestData* data) {
    // Acquire reader lock
    data->sharedMutex.lock_shared();

    ULONG64 time = GetTickCount64() + 1000;
    // wait until all read threads have acquired their shared lock
    // but no more 1000 ms !!
    data->readCounter.fetch_add(1);
    while (data->readCounter.load() != data->numThreads && GetTickCount64() < time) {
        std::this_thread::yield();
    }

    // Release reader lock
    data->sharedMutex.unlock_shared();

    return 0;
}

However, I prefer pure WinAPI, and for a better visual effect, the following code:

struct ThreadTestData 
{
    HANDLE hEvent;
    SRWLOCK SRWLock = {};
    LONG numThreads = 1;
    LONG readCounter = 0;
    LONG done = 0;

    void EndThread()
    {
        if (!InterlockedDecrementNoFence(&numThreads))
        {
            if (!SetEvent(hEvent)) __debugbreak();
        }
    }

    void DoStuff()
    {
        AcquireSRWLockShared(&SRWLock);

        InterlockedDecrementNoFence(&readCounter);

        ULONG64 time = GetTickCount64() + 1000;

        while (readCounter)
        {
            if (GetTickCount64() > time)
            {
                if (InterlockedExchangeNoFence(&done, TRUE))
                {
                    MessageBoxW(0, 0, 0, MB_ICONHAND);
                    break;
                }
            }

            SwitchToThread();
        }

        ReleaseSRWLockShared(&SRWLock);

        EndThread();
    }

    static ULONG WINAPI _S_DoStuff(PVOID data)
    {
        reinterpret_cast<ThreadTestData*>(data)->DoStuff();
        return 0;
    }

    BOOL Test(ULONG n)
    {
        if (hEvent = CreateEventW(0, 0, 0, 0))
        {
            AcquireSRWLockExclusive(&SRWLock);

            do 
            {
                numThreads++;
                readCounter++;

                if (HANDLE hThread = CreateThread(0, 0, _S_DoStuff, this, 0, 0))
                {
                    CloseHandle(hThread);
                }
                else
                {
                    readCounter--;
                    numThreads--;
                }

            } while (--n);

            ReleaseSRWLockExclusive(&SRWLock);

            EndThread();

            if (WAIT_OBJECT_0 != WaitForSingleObject(hEvent, INFINITE))
            {
                __debugbreak();
            }

            CloseHandle(hEvent);
        }

        return done;
    }
};

BOOL DoSrwTest(ULONG nThreads)
{
    ThreadTestData data;
    return data.Test(nThreads);
}

ULONG DoSrwTest(ULONG nLoops, ULONG nThreads)
{
    while (!DoSrwTest(nThreads) && --nLoops);

    return nLoops;
}

The full project is here.
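For a quick test, a hypothetical entry point (not part of RbMm's project) might look like the following. It assumes the harness above plus <windows.h>; a non-zero return from DoSrwTest means the 1-second timeout, i.e. the suspected deadlock, was observed:

#include <cstdio>

int main()
{
    // Up to 100000 loops, 5 reader threads per loop (arbitrary values).
    ULONG remaining = DoSrwTest(100000, 5);

    if (remaining)
        std::printf("timeout hit with %lu loops remaining\n",
                    static_cast<unsigned long>(remaining));
    else
        std::printf("completed all loops without a timeout\n");

    return 0;
}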


Steps to reproduce. We need only 2 worker threads, A and B, plus the main thread M:

  • Thread A calls AcquireSRWLockShared and begins to wait.
  • Thread M calls ReleaseSRWLockExclusive; when it enters RtlpWakeSRWLock we suspend M at this point (in a real case the scheduler can do this). At this point the lock is already released!
  • Thread B calls AcquireSRWLockShared and enters the lock, because the SRW lock is unlocked.
  • Thread M continues executing RtlpWakeSRWLock, but now it does NOT wake thread A!
  • When (and, in the case of your code, if) thread B calls ReleaseSRWLockShared, RtlpWakeSRWLock is called again and thread A is finally woken.

In other words: ReleaseSRWLockExclusive first removes the Lock bit and then, if Waiters are present, walks the Wait Blocks to wake the waiters (RtlpWakeSRWLock), but these two operations are not atomic. In between, just after the Lock bit is removed, another thread can acquire the lock, and in that case the acquisition is always exclusive in fact, even if the thread asked for shared access only.
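To make that window concrete, here is a deliberately simplified toy model of the clear-then-wake sequence described above. It is not the real SRWLOCK layout or algorithm, only an illustration of why a non-atomic two-phase release leaves a gap another acquirer can slip into:

#include <atomic>

// Toy model only: a single "locked" flag stands in for the SRWLOCK Lock bit.
struct ToyLock {
    std::atomic<bool> locked{false};

    // Models the two-phase exclusive release described above: clear the lock
    // bit, then (separately) wake the waiters.
    void release_exclusive() {
        locked.store(false);   // phase 1: the lock is now observably free
        // <-- window: a thread calling try_acquire() right here succeeds,
        //     even though the waiters below have not been woken yet
        wake_waiters();        // phase 2: analogous to RtlpWakeSRWLock
    }

    // Models an acquire attempt that finds the lock bit clear.
    bool try_acquire() {
        bool expected = false;
        return locked.compare_exchange_strong(expected, true);
    }

    void wake_waiters() {
        // Waking of blocked threads is omitted in this toy model.
    }
};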

RbMm

I'm posting another answer to describe the details from another angle (not that the previous answer is wrong, but it is already overloaded).

AcquireSRWLockShared can sometimes actually acquire a slim SRW lock in exclusive mode!

Steps to reproduce the situation: https://github.com/rbmm/SRW-2/tree/main

  • The main thread starts N (> 1) worker threads.
  • N-1 of the threads are started normally, and we give them time to call AcquireSRWLockShared by using Sleep(1000); one thread is started in a suspended state.
  • Then we call ReleaseSRWLockExclusive from the main thread.

With the help of VEX we catch the moment when ReleaseSRWLockExclusive sets the K bit in the SRW lock. Here we suspend the main thread and resume the last worker, giving it time (again Sleep(1000)) to enter the lock. It does enter: AcquireSRWLockShared succeeds, since the Lock bit has already been removed. Then we continue executing the main thread in ReleaseSRWLockExclusive.

After this we have: 1 worker thread inside the SRW lock showing a message box, and the other N-1 worker threads waiting inside AcquireSRWLockShared.

This is the repro in crystal-clear form: we have a thread that requested shared access and received exclusive access to the SRW lock.

But! As soon as we close the message box and release the SRW lock, N-1 message boxes immediately appear. Nothing is stuck; no deadlock.

If the code were written correctly, such behavior would not cause any problems. We got MORE than we asked for, but so what? Did we ask for shared access? We got it. Then we must work with the data that this SRW lock protects and exit, and everything will be OK; no one will even notice anything. Vice versa, if we had asked for exclusive and got shared, that would be a critical error and could lead to data corruption.


Here I use N=4. If you run the exe (without a debugger!), you first see a single message box. If you look at the call stacks of the other threads (you can attach a debugger after the message box appears), you see that all N-1 threads are waiting in RtlAcquireSRWLockShared. The thread in fact has exclusive access to the lock, despite the code asking for shared access. Close the message box, and 3 new message boxes pop up at once: these are the last N-1 workers entering the SRW lock, now as shared.

void DoStuff()
{
    AcquireSRWLockShared(SRWLock);

    // Message() and EndThread() are helper functions from the linked project
    Message(__FUNCTIONW__, MB_ICONINFORMATION);

    ReleaseSRWLockShared(SRWLock);

    EndThread();
}

Critical sections of any kind are not meant to be waited in. They are designed to synchronize data access, not to wait for something inside them. If a thread is waiting for something inside one, that is already a design error. If a worker thread waits for other worker threads, that is another design error: a worker thread should do its own thing and not care about the state of other worker threads, whether or not they have entered some section. A worker thread can wait for data, and that is normal, but it should not wait on the state of other threads.
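To illustrate that advice, here is a minimal C++17 sketch (names are illustrative, not taken from the question) in which the protected data is touched inside the shared lock, the lock is released promptly, and the rendezvous happens outside the lock:

#include <atomic>
#include <cstdint>
#include <shared_mutex>
#include <thread>

struct TestData {
    int32_t numThreads = 0;
    std::shared_mutex sharedMutex;
    std::atomic<int32_t> doneCounter{0};
};

void DoStuffOutsideLock(TestData* data) {
    {
        std::shared_lock lock(data->sharedMutex);
        // ... read whatever sharedMutex protects, and nothing else ...
    }  // shared lock released here, before any waiting

    // Rendezvous with the other workers outside the lock, so no thread
    // ever waits while holding the SRW lock.
    data->doneCounter.fetch_add(1);
    while (data->doneCounter.load() != data->numThreads) {
        std::this_thread::yield();
    }
}

Note that this deliberately drops the requirement that all readers hold the shared lock at the same moment, which is exactly the requirement being argued against here.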

Neill Clift

At the time I updated SRW to be unfair (SRW might not have existed at that time, so it might only have been the kernel code), the idea of stealing the lock was really important. We had a number of locks that passed ownership: an unlock didn't really unlock but selected a thread or threads that would be the new owner. This could lead to convoys, as the lock hold time can be expanded by the context-swap time if any selected thread needed to be context swapped in. The solution is to make the locks unfair, so that any threads that didn't need a context swap could take the lock and hopefully release it even before any other thread could get context swapped in.

So unlock becomes a two-phase thing where we unconditionally release the lock and then find threads to wake. Between these two phases any acquire (shared or exclusive) can take the lock, but they can do so only one at a time. There isn't enough state in the single pointer to record anything else. It was really important for us to have a small lock so we could put fine-grained locking anywhere.

I would like to understand if the C++ standard requires a different behavior. Obviously, the code was likely written before the standard in question. I expect you can easily fix this if you want to support some kind of rendezvous under a shared lock. I had kind of ruled that out, given that one shared acquire is not always compatible with another if you have an intervening exclusive acquire. So my rule of thumb was: 'you can never fix a deadlock by changing a lock from exclusive to shared, except in a system with no exclusive locks'. Shared acquires could, instead of stealing, just queue the wait block in the contended case and let the waker sort everything out. Losing the unfairness is worrying, though it could well be so rare as to not matter.

arog

The SRWLOCK is behaving as designed. There is a shared -> exclusive ownership upgrade path which was added as a deliberate feature, designed to help prevent convoys when the lock is under contention and in the middle of being released / other waiters being woken.

Lock convoys are a real and serious performance problem, well known in academic literature. The seminal paper describing them, written by a group including Jim Gray, dates from 1977 and can be read here:

https://dl.acm.org/doi/pdf/10.1145/850657.850659

Unfortunately, the SRWLOCK documentation never covered this anti-convoy behaviour, but as a result of discussions arising from this issue it has been updated in https://github.com/MicrosoftDocs/sdk-api/pull/1785 to indicate the possibility of this shared -> exclusive ownership upgrade path being used.

As per the discussion on https://github.com/microsoft/STL/issues/4448, the MSFT C++ STL team believe that this shared -> exclusive ownership upgrade path in the SRWLOCK then causes the behaviour of their std::shared_mutex implementation to not conform to [thread.sharedmutex.requirements.general]/2, which states:

The maximum number of execution agents which can share a shared lock on a single shared mutex type is unspecified, but is at least 10000. If more than the maximum number of execution agents attempt to obtain a shared lock, the excess execution agents block until the number of shared locks are reduced below the maximum amount by other execution agents releasing their shared lock.

To get the behaviour they desire, the MSFT C++ STL team will have to agree a change to SRWLOCK behaviour (either optional or unilateral) with the Windows team, and then a change will need to be made to the MSFT C++ STL after that. The MSFT C++ STL issue (https://github.com/microsoft/STL/issues/4448) should be used to track progress.

However, the above explanation doesn't provide an immediate solution to the OP's problem. To avoid hitting this, I suggest rolling your own latch/barrier using condition_variable_any and shared_mutex. The following code demonstrates this approach; it is C++17-compatible and is not subject to the deadlock seen with the original code:

#include <condition_variable>
#include <cstdint>
#include <iostream>
#include <memory>
#include <shared_mutex>
#include <thread>
#include <vector>

struct ThreadTestData {
    int32_t numThreads = 0;
    std::shared_mutex sharedMutex = {};
    std::condition_variable_any allStarted;
    int32_t readCounter;
};

int DoStuff(ThreadTestData* data) {
    // Register that this thread has started
    data->sharedMutex.lock();
    data->readCounter += 1;
    data->sharedMutex.unlock();

    data->allStarted.notify_all();

    // Acquire reader lock
    std::shared_lock lock(data->sharedMutex);

    // Wait until all reader threads have started, re-acquire reader lock
    data->allStarted.wait(lock,
                          [&]{ return data->readCounter == data->numThreads; });

    // Reader lock released on scope exit (~shared_lock)

    return 0;
}

int main() {
    int count = 0;
    while (true) {
        ThreadTestData data = {};
        data.numThreads = 5;

        // Acquire write lock
        data.sharedMutex.lock();

        // Create N threads
        std::vector<std::unique_ptr<std::thread>> readerThreads;
        readerThreads.reserve(data.numThreads);
        for (int i = 0; i < data.numThreads; ++i) {
            readerThreads.emplace_back(std::make_unique<std::thread>(DoStuff, &data));
        }

        // Release write lock
        data.sharedMutex.unlock();

        // Wait for all readers to succeed
        for (auto& thread : readerThreads) {
            thread->join();
        }

        // Cleanup
        readerThreads.clear();

        // Spew so we can tell when it's deadlocked
        count += 1;
        std::cout << count << std::endl;
    }

    return 0;
}

(I ran it locally until it hit 1,000,000 iterations with no sign of deadlock)