How to execute parallel compute shaders across multiple compute queues in Vulkan?

Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253

A similar question was asked previously, but that question initially focused on using multiple command buffers and triggering the submit across different threads to achieve parallel execution of shaders. Most of the answers suggest that the solution is to use multiple queues instead. The use of multiple queues also seems to be the consensus across various blog posts and Khronos forum answers. I have tried those suggestions, running shader executions across multiple queues, but have not been able to observe parallel execution, so I wanted to ask what I may be doing wrong. As suggested, this question includes the runnable code of multiple compute shaders being submitted to multiple queues, which will hopefully be useful for other people looking to do the same (once this is resolved).

The current implementation is in this pull request / branch, however I will cover the main Vulkan specific points, to ensure only Vulkan knowledge is required to answer this question. It's also worth mentioning that the current use-case is specifically for compute queues and compute shaders, not graphics or transfer queues (although insights/experience achieving parallelism across those would still be very useful, and would most probably also lead to the answer).

More specifically, I have the following:

A couple of points that are not visible in the examples above but are important:

  • All evalAsync run on the same application, instance and device
  • Each evalAsync executes with its own separate commandBuffer and buffers, and in a separate queue
  • If you are wondering whether memory barriers could have something to do with it, we have tried removing all memoryBarriers completely (this one, for example, that runs before shader execution), but this has made no difference to performance

The test that is used in the benchmark can be found here, however the only key things to understand are:

  • This is the shader that we use for testing; as you can see, we just add a bunch of atomicAdd steps to increase the processing time
  • Currently the test has a small buffer size and a high number of shader loop iterations, but we also tested with a large buffer size (i.e. 100,000 instead of 10) and fewer iterations (1,000 instead of 100,000,000).

When running the test, we first run a set of "synchronous" shader executions on the same queue (the number is variable, but we've tested with 6-16, the latter being the maximum number of queues). Then we run these in an asynchronous manner, where we launch all of them and then evalAwait until they are finished. When comparing the resulting times from both approaches, they take the same amount of time even though the asynchronous ones run across different compute queues.

My questions are:

  • Am I currently missing something when fetching the queues?
  • Are there further parameters in the Vulkan setup that need to be configured to ensure asynchronous execution?
  • Are there any restrictions I may not be aware of, such as operating system processes only being able to submit workloads to the GPU in a synchronous way?
  • Would multithreading be required in order for parallel execution to work properly when dealing with multiple queue submissions?

Furthermore, I have found several useful resources online across various reddit posts and Khronos Group forums that provide very in-depth conceptual and theoretical overviews of the topic, but I haven't come across end-to-end code examples that show parallel execution of shaders. If there are any practical examples out there that you can share, which have functioning parallel execution of shaders, that would be very helpful.

If there are further details or questions that can help provide further context please let me know, happy to answer them and/or provide more detail.

For completeness, my tests were using:

  • Vulkan SDK 1.2
  • Windows 10
  • NVIDIA 1650

Other relevant links that have been shared in similar posts:

There are 2 answers

Answer by axsauze (best answer)

I was able to solve this using this suggestion. To provide further context, I was trying to submit commands to multiple queues within the same family; however, as pointed out in the linked suggestion, NVIDIA (and other GPU vendors) have varying capabilities when it comes to parallel processing of command submissions.

In my particular case, the NVIDIA 1650 card I was testing with only supports concurrent processing when workloads are submitted to different queueFamilies. More specifically, it is only able to process submissions concurrently across one graphics family queue and one compute family queue.

I re-implemented the code to allow for allocation of family queues for specific commands, and I was able to achieve parallel processing (with a 2x speed improvement by submitting across two queueFamilies).

Here is further detail on the implementation: https://kompute.cc/overview/async-parallel.html
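To illustrate the idea of picking queues from different families, here is a minimal sketch of the selection logic only. It is not the Kompute implementation. QueueFamilyInfo is a hypothetical stand-in for the two fields of VkQueueFamilyProperties that matter here (in a real application you would fill it from vkGetPhysicalDeviceQueueFamilyProperties), and the names FamilyPair and pickDistinctComputeFamilies are made up for this example. The bit values match VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for VkQueueFamilyProperties (only the fields used here).
struct QueueFamilyInfo {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

// Same numeric values as VK_QUEUE_GRAPHICS_BIT and VK_QUEUE_COMPUTE_BIT.
constexpr std::uint32_t kGraphicsBit = 0x1;
constexpr std::uint32_t kComputeBit = 0x2;
constexpr std::uint32_t kNoFamily = UINT32_MAX;

// Indices of two different families that can both run compute work.
struct FamilyPair {
    std::uint32_t graphicsCompute; // supports graphics and compute
    std::uint32_t computeOnly;     // supports compute but not graphics
};

// Returns true and fills `out` if the device exposes both kinds of family.
// Returns false if all compute-capable queues live in a single kind of family.
bool pickDistinctComputeFamilies(const std::vector<QueueFamilyInfo>& families,
                                 FamilyPair& out) {
    std::uint32_t general = kNoFamily;
    std::uint32_t dedicated = kNoFamily;
    for (std::uint32_t i = 0; i < families.size(); ++i) {
        const QueueFamilyInfo& f = families[i];
        if (f.queueCount == 0 || (f.queueFlags & kComputeBit) == 0) {
            continue; // cannot run compute shaders on this family
        }
        if ((f.queueFlags & kGraphicsBit) != 0) {
            if (general == kNoFamily) general = i;
        } else {
            if (dedicated == kNoFamily) dedicated = i;
        }
    }
    if (general == kNoFamily || dedicated == kNoFamily) {
        return false;
    }
    out.graphicsCompute = general;
    out.computeOnly = dedicated;
    return true;
}
```

One queue would then be requested from each returned family at device creation, and the independent compute workloads split between them. Whether that yields real overlap depends on the hardware and driver, so it is worth measuring.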

Answer by Nicol Bolas

You are getting "asynchronous execution". You just don't expect it to behave the way it behaves.

On a CPU, if you have one thread active, then you're using one CPU core (or hyper-thread). All of that core's execution and computation capabilities are given to your thread alone (ignoring pre-emption). But at the same time, if there are other cores, your one thread cannot use any of the computational resources of those cores. Not unless you create another thread.

GPUs don't work that way. A queue is not like a CPU thread. It does not specifically relate to a particular quantity of computational resources. A queue is merely the interface through which commands get executed; the underlying hardware decides how to farm out commands to the various compute resources provided by the GPU as a whole.

What generally happens when you execute a command is that the hardware attempts to fully saturate the available shader execution units using your command. If there are more shader units available than the number of invocations your operation requires, then some resources are available immediately for the next command. But if not, then the entire GPU's compute resources will be dedicated to executing the first operation; the second one must wait for resources to become available before it can start.

It doesn't matter how many compute queues you shove work into; they're all going to try to use as many compute resources as possible. So they will largely execute in some particular order.

Queue priority systems exist, but these mainly help determine the order of execution for commands. That is, if a high-priority queue has some commands that need to be executed, then they will take priority the next time compute resources become available for a new command.

So submitting 3 dispatch batches on 3 separate queues is not going to complete faster than submitting 1 batch on one queue containing 3 dispatch operations.
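The reasoning above can be made concrete with a deliberately simplified toy model. It assumes every invocation costs the same, that the GPU runs work in "waves" of one invocation per execution unit, and that overlapped dispatches pack perfectly. Real hardware has memory bandwidth limits, scheduling overhead and vendor-specific behaviour that this ignores. The function names and the unit counts used with them are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Number of waves needed to run `invocations` on `units` execution units
// (ceiling division).
std::uint64_t wavesFor(std::uint64_t invocations, std::uint64_t units) {
    assert(units > 0);
    return (invocations + units - 1) / units;
}

// Dispatches run strictly one after another (one queue, in order).
std::uint64_t serialWaves(const std::vector<std::uint64_t>& dispatches,
                          std::uint64_t units) {
    std::uint64_t total = 0;
    for (std::uint64_t d : dispatches) {
        total += wavesFor(d, units);
    }
    return total;
}

// Idealised best case for multiple queues: all dispatches share the
// execution units and pack together perfectly.
std::uint64_t overlappedWaves(const std::vector<std::uint64_t>& dispatches,
                              std::uint64_t units) {
    std::uint64_t sum = 0;
    for (std::uint64_t d : dispatches) {
        sum += d;
    }
    return wavesFor(sum, units);
}
```

Under this model, with 1024 units and three dispatches of 4096 invocations each, both serialWaves and overlappedWaves give 12: each dispatch already saturates the device, so extra queues gain nothing. With three dispatches of only 100 invocations each, serialWaves gives 3 while overlappedWaves gives 1, which corresponds to the case where spare shader units are available for the next command.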

The main reason multiple queues (of the same family) exist is to be able to submit work from multiple threads without having them do inter-thread synchronization (and to provide some possible prioritization of submissions).