Kotlin - Suspend functions in single-threaded environments


I'm not entirely sure my mental model of suspend is correct. From what I gathered, it seems to mean that a (long-running) suspend function can be suspended if another function called inside it is marked with suspend (which creates a suspension point in the parent function).

To keep it simple, let's assume a single-threaded environment without asynchronous programming.

launch { //<--- creates a coroutine in which we can use suspend functions
    fetchUserData("Jon") 
}


// and following functions:

suspend fun fetchUserData(userName: String) {
    makeLongRunningNetworkCall(userName) //<---- suspends fetchUserData()
}

suspend fun makeLongRunningNetworkCall(userName: String) { /* ... */ }

My understanding is that makeLongRunningNetworkCall() "takes" fetchUserData() "off" the thread, so it doesn't block other computations while waiting for the result of makeLongRunningNetworkCall().

But if nothing suspends makeLongRunningNetworkCall(), isn't the thread still blocked by makeLongRunningNetworkCall()? I mean, the "waiting" for the network result has to be done somewhere, or else the result might be missed?

So to me, suspend would only make sense in this case if fetchUserData() and makeLongRunningNetworkCall() ran on different threads, so that makeLongRunningNetworkCall() could tell its parent function to go home and free its thread until it has received a result?!

Is my understanding correct? Or does suspend rather mean the whole coroutine is taken off the thread? But then again, who ensures the response of the network call is captured?


There are 3 answers

Joffrey (best answer)

Is my understanding correct? Or does suspend rather mean the whole coroutine is taken off the thread?

Rather the second, but to see the difference, you need to add some code to fetchUserData. For instance, consider:

suspend fun fetchUserData(userName: String): UserData {
    val userData = makeLongRunningNetworkCall(userName) // may suspend here
    return userData // runs only after makeLongRunningNetworkCall has resumed
}

If makeLongRunningNetworkCall suspends, then fetchUserData does need to wait for it to resume before executing the rest of its code (the return). Similarly, the caller of fetchUserData also needs to wait, etc. This is why it's easy to reason about suspend functions - they actually run sequentially in that sense.

So, with that in mind, the whole coroutine is suspended (the whole stack up to the initial coroutine builder launch), because nothing in the execution stack will carry on until makeLongRunningNetworkCall resumes.
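A minimal sketch that makes this visible, using only the standard library's kotlin.coroutines primitives (no kotlinx.coroutines). The function names follow the question, but makeLongRunningNetworkCall here is a hypothetical stand-in that does no networking: it just parks its continuation in a hypothetical pendingCall variable, and demo() plays the role of "the network" by resuming it later.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

val events = mutableListOf<String>()

// Hypothetical: holds the suspended continuation so "the network" can resume it later.
var pendingCall: Continuation<String>? = null

// Hypothetical stand-in: suspends and stores its continuation instead of doing I/O.
suspend fun makeLongRunningNetworkCall(userName: String): String =
    suspendCoroutine { cont -> pendingCall = cont }

suspend fun fetchUserData(userName: String): String {
    val userData = makeLongRunningNetworkCall(userName)
    events += "fetchUserData got: $userData" // only runs after the resume below
    return userData
}

fun demo(): List<String> {
    events.clear()
    val coroutine: suspend () -> Unit = {
        events += "coroutine started"
        fetchUserData("Jon")
        events += "coroutine finished"
    }
    // startCoroutine returns as soon as the innermost call suspends:
    // the whole stack (coroutine -> fetchUserData -> makeLongRunningNetworkCall) is parked.
    coroutine.startCoroutine(Continuation<Unit>(EmptyCoroutineContext) { it.getOrThrow() })
    events += "thread is free, doing other work"
    // Later, "the network" delivers a result and the whole stack continues where it stopped.
    pendingCall!!.resume("data for Jon")
    return events.toList()
}
```

Calling demo() yields the events in the order "coroutine started", "thread is free, doing other work", "fetchUserData got: data for Jon", "coroutine finished": neither fetchUserData nor the outer coroutine body continues until the innermost call is resumed.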

But then again, who ensures the response of the network call is captured?

This is a good question. All of the above actually started from the assumption that makeLongRunningNetworkCall does suspend. This is not an abstract concept. Concretely, it means that the function returns (as in, actually returns) a special marker value called COROUTINE_SUSPENDED. Each caller sees that marker and returns it in turn, all the way up the stack, so the whole suspension mechanism kicks in and the thread starts executing something else.
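To see the marker value itself, the sketch below starts two suspend functions through the low-level kotlin.coroutines.intrinsics API, which hands back whatever the compiled function returned. neverSuspends, reallySuspends, parked, and rawCall are hypothetical names for illustration. A suspend function that never reaches a real suspension point simply returns its value; one that does suspend returns COROUTINE_SUSPENDED.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.intrinsics.COROUTINE_SUSPENDED
import kotlin.coroutines.intrinsics.startCoroutineUninterceptedOrReturn
import kotlin.coroutines.suspendCoroutine

// Hypothetical: marked suspend, but never reaches a suspension point.
suspend fun neverSuspends(): String = "cached"

// Hypothetical: parks its continuation; this demo never resumes it.
var parked: Continuation<String>? = null

suspend fun reallySuspends(): String = suspendCoroutine { cont -> parked = cont }

// Starts a suspend block and returns exactly what its compiled code returned:
// either the result itself or the COROUTINE_SUSPENDED marker.
fun rawCall(block: suspend () -> String): Any? =
    block.startCoroutineUninterceptedOrReturn(Continuation<String>(EmptyCoroutineContext) { })

fun demo(): Pair<Any?, Any?> =
    rawCall { neverSuspends() } to rawCall { reallySuspends() }
```

The first element of demo() is the plain string "cached"; the second is the COROUTINE_SUSPENDED marker, which is what lets the caller's thread move on.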

This means that the function must be cooperative and suspend whenever possible. If the function actually blocks on a network call, then it doesn't suspend at all, and it really does block the thread, as you guessed. When functions like this do suspend, it often means that they have offloaded their blocking work to another thread, or that they are truly non-blocking (e.g. callback-based), though sometimes a callback-based API just means the offloading happens deeper.
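The difference between blocking and suspending is easy to observe on a single thread. This sketch assumes kotlinx-coroutines-core is on the classpath: runBlocking runs both launched coroutines on the calling thread only. Thread.sleep stands in for a network call that blocks, delay for one that really suspends; runTwo is a hypothetical helper.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Runs two coroutines on the single runBlocking thread and records the order of events.
fun runTwo(blocking: Boolean): List<String> {
    val log = mutableListOf<String>()
    runBlocking {
        repeat(2) { i ->
            launch {
                log += "start $i"
                // Blocking keeps the only thread busy; delay suspends and frees it.
                if (blocking) Thread.sleep(50) else delay(50)
                log += "end $i"
            }
        }
    }
    return log
}
```

With blocking = true the log is "start 0", "end 0", "start 1", "end 1": the second coroutine cannot even start until the first one's sleep is over. With blocking = false both coroutines start before either one finishes, because the first one really left the thread.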

To understand and demystify how this works, I suggest reading this nice article about how kotlinx.coroutines is built on just a couple of compiler built-ins: https://blog.kotlin-academy.com/kotlin-coroutines-animated-part-1-coroutine-hello-world-51797d8b9cd4

Tenfour04

A suspend function only actually suspends if the suspend functions that it calls also suspend, and so on down the chain.

This is one reason why, by convention, you must never call a blocking function directly in a coroutine or suspend function. All of the suspend functions in the standard library follow this convention. (The other big reason is simplicity: we never have to worry about whether a suspend function might also be a blocking function that ties up a thread it shouldn't.)

A common pattern for handling blocking calls is to wrap them in withContext. withContext is a suspend function that suspends while it runs its code using a CoroutineContext that may be able to handle blocking calls. You can use it with Dispatchers.IO (for blocking I/O) or Dispatchers.Default (for CPU-heavy work), as appropriate, to make it permissible to call blocking functions within the withContext lambda.
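A sketch of that pattern, assuming kotlinx-coroutines-core is available. blockingNetworkCall is a hypothetical stand-in for a legacy blocking client (it only sleeps). Because the blocking part runs on Dispatchers.IO, the runBlocking thread stays free for the second coroutine while the first one is suspended inside withContext.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical blocking API, e.g. an old synchronous HTTP client.
fun blockingNetworkCall(userName: String): String {
    Thread.sleep(100)
    return "data for $userName"
}

// Safe to call from any coroutine: the blocking work happens on an IO thread,
// and the caller is suspended (not blocked) until it finishes.
suspend fun fetchUserData(userName: String): String =
    withContext(Dispatchers.IO) { blockingNetworkCall(userName) }

fun demo(): List<String> {
    val log = mutableListOf<String>()
    runBlocking {
        launch { log += fetchUserData("Jon") }
        launch { log += "other work ran while waiting" }
    }
    return log
}
```

demo() records "other work ran while waiting" before "data for Jon". After withContext returns, the first coroutine continues on the runBlocking thread again, so both writes to log happen on the same thread.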

If you are using popular libraries such as Retrofit, Jetpack Room, or Google Firebase, they expose public suspend functions that you can trust not to block, as the convention requires. Since most of these libraries support both Java and Kotlin, they actually use their own thread pools under the hood rather than the kotlinx.coroutines dispatcher pools. They achieve this with the low-level suspendCancellableCoroutine suspend function, which gives finer control over exactly how the coroutine is suspended and resumed.
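A sketch of how such a library might bridge its own thread pool to a suspend function with suspendCancellableCoroutine, assuming kotlinx-coroutines-core is available. LegacyClient, fetch, and fetchSuspending are hypothetical and do not reflect the actual internals of Retrofit, Room, or Firebase; the sleep stands in for the real work.

```kotlin
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.Future
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.suspendCancellableCoroutine

// Hypothetical Java-style client: does its work on its own pool and reports through a callback.
class LegacyClient {
    private val pool: ExecutorService = Executors.newSingleThreadExecutor()

    fun fetch(userName: String, onResult: (Result<String>) -> Unit): Future<*> =
        pool.submit(Runnable {
            onResult(runCatching { Thread.sleep(50); "data for $userName" })
        })

    fun shutdown() {
        pool.shutdownNow()
    }
}

// The coroutine stays suspended until the callback fires; no coroutine thread waits meanwhile.
suspend fun LegacyClient.fetchSuspending(userName: String): String =
    suspendCancellableCoroutine { cont ->
        val task = fetch(userName) { result ->
            result.onSuccess { cont.resume(it) }
                .onFailure { cont.resumeWithException(it) }
        }
        // If the coroutine is cancelled, try to stop the underlying work as well.
        cont.invokeOnCancellation { task.cancel(true) }
    }

fun demo(): String {
    val client = LegacyClient()
    try {
        return runBlocking { client.fetchSuspending("Jon") }
    } finally {
        client.shutdown()
    }
}
```

The design choice here is that the library keeps full control of its threads; the suspend wrapper only decides when the coroutine is parked and which value (or exception) it resumes with.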


Another aspect worth mentioning, although you didn't ask about it, is whether the suspend function supports cancellation. Usually, we want to support cancellation if possible, and all the standard library suspend functions do. If you are wrapping a long blocking calculation in withContext, that alone is not enough to support cancellation. You must also intersperse some suspending calls or if (isActive) checks within your blocking code to give it opportunities to be interrupted and cancelled, if you want it to be able to stop before finishing.
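A sketch of cooperative cancellation, assuming kotlinx-coroutines-core. The busy loop in the hypothetical spinUntilCancelled function stands in for a long calculation and checks isActive on every iteration. Without that check, cancelAndJoin would wait until the loop ended on its own, which in this example would be never.

```kotlin
import java.util.concurrent.atomic.AtomicLong
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.isActive
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical long calculation: each iteration is one "step" of work.
suspend fun spinUntilCancelled(progress: AtomicLong): Unit = withContext(Dispatchers.Default) {
    // isActive turns false once the surrounding job is cancelled.
    while (isActive) {
        progress.incrementAndGet()
    }
}

fun demo(): Long = runBlocking {
    val progress = AtomicLong()
    val job = launch { spinUntilCancelled(progress) }
    delay(50)
    // Returns only because the loop observes the cancellation and exits.
    job.cancelAndJoin()
    progress.get()
}
```

Calling ensureActive() inside the loop instead would stop the work by throwing CancellationException rather than by leaving the loop; either way the code needs an explicit check point.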

Jemshit

Let's assume a single-threaded environment without asynchronous programming

Asynchronous programming is possible even in a single-threaded environment; examples are asyncio in Python and Node.js. How? Coroutines take turns on a single thread, each running briefly until its next suspension point. This gives concurrency, but not parallelism: only one of them is actually running at any instant. This is still very useful for I/O-bound tasks, because they trigger some work and then wait for the result. For CPU-bound tasks, a single thread is not enough, because there is no waiting to overlap and only real parallelism would help.
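To illustrate "taking turns" on one thread, here is a toy scheduler built only on the standard library's kotlin.coroutines primitives. EventLoop, launch, yieldTurn, and runUntilIdle are hypothetical names; real event loops (asyncio, Node.js, or runBlocking in kotlinx.coroutines) are far more elaborate. The core idea is the same, though: a suspended coroutine is just a continuation sitting in a queue, and one thread drains that queue.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical minimal event loop: a queue of tasks executed by a single thread.
class EventLoop {
    private val queue = ArrayDeque<() -> Unit>()

    fun launch(block: suspend () -> Unit) {
        queue.addLast { block.startCoroutine(Continuation<Unit>(EmptyCoroutineContext) { it.getOrThrow() }) }
    }

    // Suspends the current coroutine and puts "resume me" at the back of the queue.
    suspend fun yieldTurn() = suspendCoroutine<Unit> { cont -> queue.addLast { cont.resume(Unit) } }

    fun runUntilIdle() {
        while (queue.isNotEmpty()) queue.removeFirst()()
    }
}

fun demo(): List<String> {
    val loop = EventLoop()
    val log = mutableListOf<String>()
    for (name in listOf("A", "B")) {
        loop.launch {
            repeat(2) { step ->
                log += "$name$step"
                loop.yieldTurn() // give the thread to whoever is next in the queue
            }
        }
    }
    loop.runUntilIdle()
    return log
}
```

demo() produces A0, B0, A1, B1: each coroutine runs until its next suspension point and then hands the single thread to the other one.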

But if nothing suspends makeLongRunningNetworkCall(), isn't the thread still blocked by makeLongRunningNetworkCall()? Or does suspend rather mean the whole coroutine is taken off the thread?

When a coroutine is suspended, it is taken off the thread completely, freeing that thread. Later, when the coroutine is resumed, it might resume on a different thread.
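A standard-library sketch of that behavior: the coroutine starts on the calling thread and suspends, so startCoroutine returns and the caller's thread is free. It is then resumed from a separate, hypothetically named "resumer" thread. With no dispatcher in the context, the rest of the coroutine simply continues on whichever thread called resume; with kotlinx.coroutines, the dispatcher decides where it continues instead.

```kotlin
import kotlin.concurrent.thread
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Returns the thread names seen before and after the suspension point.
fun demo(): Pair<String, String> {
    var before = ""
    var after = ""
    var saved: Continuation<Unit>? = null
    val block: suspend () -> Unit = {
        before = Thread.currentThread().name
        suspendCoroutine<Unit> { cont -> saved = cont } // coroutine leaves the thread here
        after = Thread.currentThread().name
    }
    block.startCoroutine(Continuation<Unit>(EmptyCoroutineContext) { it.getOrThrow() })
    // startCoroutine has returned: the calling thread is free to do anything else now.
    val resumer = thread(name = "resumer") { saved!!.resume(Unit) }
    resumer.join() // only so demo() can report the result
    return before to after
}
```

The second element of demo() is "resumer", while the first is the name of whatever thread called demo().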

But then again, who ensures the response of the network call is captured? I mean, the "waiting" for the network result has to be done somewhere, or else the result might be missed?

No thread is required to wait for the result. Otherwise we would just be offloading work to dedicated thread pools in each library (network I/O, database, file I/O, ...) and creating another layer of abstraction, and there would be no real benefit if each library kept a separate thread (per read/write request) that waits for the result.

So how can you await a result without sitting on a thread?

In simplified terms: the I/O library calls (directly or indirectly) the OS's low-level functions, which call the device driver at some point. The device driver returns to the OS immediately, and the request is now in flight and performed asynchronously.

Regardless of the type of I/O request, internally I/O operations issued to a driver on behalf of the application are performed asynchronously; that is, once an I/O request has been initiated, the device driver returns to the I/O system. Whether or not the I/O system returns immediately to the caller depends on whether the handle was opened for synchronous or asynchronous I/O.

The OS returns to the library, which returns to the caller in some form (callback, Future, Observable, etc.), and the caller is not blocked. No thread is waiting for the result.
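On the JVM, one callback-style API of this kind is NIO.2's AsynchronousFileChannel with a CompletionHandler. The sketch below bridges it into a suspend function with suspendCoroutine; readSuspending and demo are hypothetical names. One caveat: how the completion is produced underneath is platform-dependent (on some JDK/OS combinations the file channel is backed by an internal thread pool rather than OS-level asynchronous I/O, i.e. the offloading happens deeper), so this shows the shape of the callback-to-suspend bridge rather than proving that no thread waits.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.AsynchronousFileChannel
import java.nio.channels.CompletionHandler
import java.nio.file.Files
import java.nio.file.StandardOpenOption
import java.util.concurrent.CompletableFuture
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Suspends until the channel reports completion through its callback.
suspend fun AsynchronousFileChannel.readSuspending(buf: ByteBuffer, position: Long): Int =
    suspendCoroutine { cont ->
        read(buf, position, Unit, object : CompletionHandler<Int, Unit> {
            override fun completed(result: Int, attachment: Unit) = cont.resume(result)
            override fun failed(exc: Throwable, attachment: Unit) = cont.resumeWithException(exc)
        })
        // read(...) has already returned here; this thread does not wait for the data.
    }

fun demo(): String {
    val path = Files.createTempFile("no-thread-demo", ".txt")
    Files.write(path, "hello from the OS".toByteArray())
    val channel = AsynchronousFileChannel.open(path, StandardOpenOption.READ)
    val done = CompletableFuture<String>()
    val reader: suspend () -> String = {
        val buf = ByteBuffer.allocate(64)
        val n = channel.readSuspending(buf, 0L)
        String(buf.array(), 0, n, Charsets.UTF_8)
    }
    reader.startCoroutine(Continuation<String>(EmptyCoroutineContext) { r ->
        r.fold({ done.complete(it) }, { done.completeExceptionally(it) })
    })
    try {
        return done.get() // blocking here only so the demo has something to return
    } finally {
        channel.close()
        Files.delete(path)
    }
}
```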

Some time after the request started, the device finishes handling the request. It notifies the CPU via an interrupt... The device driver’s Interrupt Service Routine (ISR) responds to the interrupt... then a Deferred Procedure Call (DPC) is queued... The DPC takes the IRP representing the initial request and marks it as “complete”. However, that “completion” status only exists at the OS level. The OS queues an Asynchronous Procedure Call (APC) to the thread owning the device’s underlying HANDLE... Since the library (or the .NET BCL) has already registered the handle with the I/O Completion Port (IOCP), which is part of the thread pool, an I/O thread-pool thread is borrowed briefly to execute the APC, which notifies the task that it’s complete.

There was no thread while the request was in flight. When the request completed, various threads were “borrowed” or had work briefly queued to them. This work is usually on the order of a millisecond or so (e.g., the APC running on the thread pool) down to a microsecond or so (e.g., the ISR). But there is no thread that was blocked, just waiting for that request to complete.