WCF Reliable Sessions Fault when Server under heavy CPU load or Thread Pool Busy



There appears to be a design flaw in WCF Reliable Sessions that prevents the issuing or acceptance of infrastructure keep-alive messages when the server is under high CPU load (80-100%) or when there is no IO thread pool thread immediately available to handle the message. The symptoms manifest as apparently random channel aborts caused by reliable session inactivity timeouts. However, the abort logic appears to run at a higher priority, or via a different mechanism, because the abort timer fires even when the keep-alive timer cannot run.

Digging into the reference source, it appears that ChannelReliableSession uses an InterruptableTimer class to handle the inactivity timer. When that timer fires, it invokes the PollingCallback set by the ReliableOutputSessionChannel, which creates an ACKRequestedMessage and sends it to the remote endpoint. The InterruptableTimer schedules itself via the WCF-internal IOThreadTimer/IOThreadScheduler, which in turn depends on an available (non-busy) IO thread pool thread to service the timer. Under high CPU load the thread pool appears not to spawn a new thread, so if enough threads are already executing (around 8 on my 4-core machine; with a 15 second inactivityTimeout, 7 of them abort and fail), no thread is available to send the keep-alive.

Setting the client's reliable session inactivity timeout longer than the server's does not help: even under these conditions the server still unilaterally aborts the channel because it expected a message within the shorter interval. So the abort logic appears to run at a higher priority, or perhaps it throws an exception into one of the executing threads (I'm not sure which); I expected the server-side abort to be delayed by the high CPU load and the client's longer timeout to win out, but that was not the case. With lower CPU load the exact same scenario works perfectly, even with concurrent calls that take 30-90 seconds to return.
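To make the starvation mechanism concrete, here is a minimal sketch (plain .NET, not WCF internals) showing how a timer callback that depends on a thread pool thread gets delayed once the pool is saturated with CPU-bound work. WCF's IOThreadScheduler uses the IO threads rather than the worker threads shown here, but the starvation behavior it describes is analogous; the class and timings are purely illustrative.

    using System;
    using System.Diagnostics;
    using System.Threading;

    class ThreadPoolStarvationDemo
    {
        static void Main()
        {
            var clock = Stopwatch.StartNew();

            // A stand-in "keep-alive" that should tick every second; the callback
            // runs on a thread pool thread, like a scheduled WCF timer callback.
            var keepAlive = new Timer(
                _ => Console.WriteLine($"keep-alive at {clock.ElapsedMilliseconds} ms"),
                null, 1000, 1000);

            // Saturate the pool with CPU-bound work (a stand-in for busy service calls).
            for (int i = 0; i < Environment.ProcessorCount * 2; i++)
            {
                ThreadPool.QueueUserWorkItem(_ =>
                {
                    var spin = Stopwatch.StartNew();
                    while (spin.Elapsed < TimeSpan.FromSeconds(30)) { /* burn CPU */ }
                });
            }

            // While the pool is saturated the keep-alive ticks arrive late or not at
            // all; once the CPU-bound work drains, they resume on schedule.
            Thread.Sleep(TimeSpan.FromSeconds(45));
            keepAlive.Dispose();
        }
    }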

It is irrelevant what your InstanceContextMode is, what the maximum concurrent calls, sessions, or instances are, or what any of the other timeout values are (other than receiveTimeout, which must be greater than the inactivityTimeout). This is entirely a design flaw in the WCF implementation; it should use an isolated high-priority or real-time thread to service the keep-alive messages so that spurious aborts are not generated.
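For reference, a hedged sketch of the relevant binding knobs (standard WCF API; the values are illustrative, the only hard constraint being the one above: ReceiveTimeout greater than InactivityTimeout):

    using System;
    using System.ServiceModel;

    static class Bindings
    {
        // Illustrative values only: the one constraint noted above is that
        // ReceiveTimeout must be greater than ReliableSession.InactivityTimeout.
        public static NetTcpBinding CreateReliableBinding()
        {
            var binding = new NetTcpBinding(SecurityMode.None);
            binding.ReliableSession.Enabled = true;
            binding.ReliableSession.InactivityTimeout = TimeSpan.FromSeconds(15);
            binding.ReliableSession.Ordered = true;
            binding.ReceiveTimeout = TimeSpan.FromMinutes(10); // must exceed InactivityTimeout
            return binding;
        }
    }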

The short version: I can issue 1000 concurrent requests that take 60 seconds to complete, with a 15 second reliable session inactivity timeout, and see no problems as long as CPU load stays low. As soon as the CPU load gets heavy, calls randomly begin aborting, including calls that aren't consuming any CPU time and duplex sessions idling while waiting to be used. If incoming calls also add CPU load, the service enters a death spiral: execution time is wasted on requests that are guaranteed to abort while other requests sit in the inbound queue. The service cannot return to a healthy state until all requests are stopped, all in-flight threads finish, and CPU load drops. This behavior paradoxically makes Reliable Sessions one of the least reliable communication mechanisms.

The same behavior applies to clients. In that case the WCF client may be at the mercy of other processes on the box, but under high CPU load it will randomly abort its reliable sessions unless every operation completes in less than the inactivityTimeout; even then, if you don't issue a new call quickly enough, WCF may still fail to send the keep-alive and the channel may fault.

Accepted answer by russbishop:

Documenting my answer:

You can partially mitigate the issue with ThreadPool.SetMinThreads(X, Y), where Y is some number greater than the number of threads executing concurrent WCF requests. Then there may be a thread available++ to service the keep-alive, and the reliable sessions may not time out even under sustained 100% CPU load, though this has its limits as well. In my tests I raised the minimum IO threads from 2 to 20, then issued a large number of concurrent do-nothing requests that simply sleep for 10 seconds. After that I re-ran my client with the CPU-wasting call and was able to execute all 8 calls simultaneously. Restarting the service and immediately running the same client test failed, because of the thread pool's lazy initialization. Bumping the number of simultaneous calls up, I eventually started hitting timeouts again at 14 calls (10 calls aborted), which may simply be the scheduler not getting enough CPU slices to run properly. I suspect that if you could grab the IO threads and increase their priority you might be able to solve this problem.

++Because the pool uses lazy initialization, you must issue enough concurrent calls from the client(s) that take time to complete but don't use any CPU (e.g. Thread.Sleep(5000)) to force the pool to create the minimum number of threads without triggering the high-CPU-blocks-new-threads logic; otherwise the minimum number of threads won't be created and the problem still occurs.
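A sketch of that mitigation in the service host process, under the assumptions above (the warm-up step is described in comments because it depends on your own contract; any no-op operation that just sleeps will do):

    using System;
    using System.Threading;

    static class KeepAliveMitigation
    {
        // Raise only the IO completion-port minimum; leave the worker minimum alone.
        public static void RaiseIoThreadMinimum(int ioMinimum)
        {
            ThreadPool.GetMinThreads(out int workers, out int io);
            ThreadPool.SetMinThreads(workers, Math.Max(io, ioMinimum));

            // Because the pool initializes lazily, the raised minimum only helps once
            // the threads actually exist. After the host opens, have clients issue
            // enough concurrent calls that block without burning CPU (e.g. an
            // operation that just does Thread.Sleep(5000)) so the pool spins the
            // threads up while CPU load is still low.
        }
    }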

Another potential fix is to set the inactivityTimeout to a very large value. This alleviates the problem but introduces a new denial-of-service vulnerability, even from clients that unintentionally fail to close the connection.

Otherwise there does not appear to be a fix for this issue at this time. I would personally advise against using Reliable Sessions because of this flaw, since it makes the aborts random both in which connections are aborted and in the circumstances under which the aborts start to occur.