Update: This has been solved, you can find further details here: https://stackoverflow.com/a/64405505/1889253
A similar question was asked previously, but it initially focused on using multiple command buffers and triggering the submit from different threads to achieve parallel execution of shaders. Most of the answers suggest that the solution is to use multiple queues instead, which also seems to be the consensus across various blog posts and Khronos forum answers. I have followed those suggestions and run shader executions across multiple queues, but I have not been able to observe parallel execution, so I wanted to ask what I may be doing wrong. As suggested, this question includes the runnable code for submitting multiple compute shaders to multiple queues, which will hopefully be useful for other people looking to do the same (once this is resolved).
The current implementation is in this pull request / branch, however I will cover the main Vulkan-specific points, so that only Vulkan knowledge is required to answer this question. It's also worth mentioning that the current use-case is specifically for compute queues and compute shaders, not graphics or transfer queues (although insights/experience achieving parallelism across those would still be very useful, and would most probably also lead to the answer).
More specifically, I have the following:
- Multiple queues are "fetched" first - my device is an NVIDIA 1650, and supports 16 graphics+compute queues in queue family index 0, and 8 compute queues in queue family index 2
- evalAsync performs the submission (which contains the recorded shader commands) - Note that a fence is created at this point, which we can wait on later. Also, the submit doesn't have any waitStageMasks (PipelineStageFlags).
- evalAwait waits for the submission to finish, using the fence created in evalAsync
A couple of points that are not visible in the examples above but are important:
- All evalAsync calls run in the same application, instance and device
- Each evalAsync executes with its own separate commandBuffer and buffers, and in a separate queue
- If you are wondering whether memory barriers could have something to do with it, we have tried removing all memoryBarriers completely (this one, for example, which runs before shader execution), but this has not made any difference to performance
The test that is used in the benchmark can be found here, however the only key things to understand are:
- This is the shader that we use for testing. As you can see, we just add a bunch of atomicAdd steps to increase the amount of processing time
- Currently the test has a small buffer size and a high number of shader loop iterations, but we also tested with a large buffer size (i.e. 100,000 instead of 10) and fewer iterations (1,000 instead of 100,000,000).
When running the test, we first run a set of "synchronous" shader executions on the same queue (the number is variable, but we've tested with 6-16, the latter being the maximum number of queues). Then we run them asynchronously: we submit all of them with evalAsync and then call evalAwait until they have all finished. Comparing the resulting times, both approaches take the same amount of time, even though the asynchronous runs are spread across different compute queues.
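The comparison described above can be sketched as a small timing harness. This is only an illustration of the measurement pattern, not the benchmark code itself: `Job`, `timeSynchronous` and `timeAsynchronous` are hypothetical names, and `submit` / `wait` are assumed to behave like evalAsync (returns once the work is queued) and evalAwait (blocks until the fence signals).

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical stand-in for one shader dispatch. `submit` is assumed to
// return as soon as the work is queued (like evalAsync); `wait` is assumed
// to block until that work has finished (like evalAwait).
struct Job {
    std::function<void()> submit;
    std::function<void()> wait;
};

// Runs `body` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsedMs(F&& body) {
    const auto start = std::chrono::steady_clock::now();
    body();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// "Synchronous" baseline: each job is submitted and awaited before the next
// one is submitted.
double timeSynchronous(const std::vector<Job>& jobs) {
    return elapsedMs([&] {
        for (const Job& job : jobs) {
            assert(job.submit && job.wait);
            job.submit();
            job.wait();
        }
    });
}

// "Asynchronous" run: every job is submitted first, then all are awaited.
double timeAsynchronous(const std::vector<Job>& jobs) {
    return elapsedMs([&] {
        for (const Job& job : jobs) {
            assert(job.submit && job.wait);
            job.submit();
        }
        for (const Job& job : jobs) {
            job.wait();
        }
    });
}
```

If the submissions really overlap on the device, `timeAsynchronous` should come out noticeably lower than `timeSynchronous` for the same set of jobs; roughly equal numbers indicate the work is being serialized somewhere.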
My questions are:
- Am I currently missing something when fetching the queues?
- Are there further parameters in the Vulkan setup that need to be configured to ensure asynchronous execution?
- Are there any restrictions I may not be aware of, such as an operating system process only being able to submit GPU workloads to the GPU synchronously?
- Would multithreading be required in order for parallel execution to work properly when dealing with multiple queue submissions?
Furthermore, I have found several useful resources online across various reddit posts and Khronos Group forums that provide very in-depth conceptual and theoretical overviews of the topic, but I haven't come across end-to-end code examples that show parallel execution of shaders. If there are any practical examples out there that you can share, which have functioning parallel execution of shaders, that would be very helpful.
If there are further details or questions that can help provide further context please let me know, happy to answer them and/or provide more detail.
For completeness, my tests were using:
- Vulkan SDK 1.2
- Windows 10
- NVIDIA 1650
Other relevant links that have been shared in similar posts:
- Similar discussion with suggested link to example but which seems to have disappeared...
- Post on Leveraging asynchronous queues for concurrent execution (unfortunately no example code)
- (Relatively old - 5 years) Post that suggests NVIDIA cards can't do parallel execution of shaders, but doesn't seem to have a conclusive answer
- NVIDIA presentation on Vulkan Multithreading with multiple queue execution (hence my question above on threads)
I have been able to solve this using this suggestion. To provide further context, I was trying to submit commands to multiple queues within the same family. However, as pointed out in the linked suggestion, NVIDIA (and other GPU vendors) have varying capabilities when it comes to processing command submissions in parallel.
In my particular case, the NVIDIA 1650 card I was testing with only supports concurrent processing when workloads are submitted to different queueFamilies - more specifically, it can only process one submission on a graphics family queue concurrently with one submission on a compute family queue.
I re-implemented the code to allow queues to be allocated from specific families for specific commands, and I was able to achieve parallel processing (with a 2x speed improvement by submitting across two queueFamilies).
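The idea of spreading work across families rather than across queues of one family can be sketched roughly as below. This is not the actual re-implementation: `FamilyInfo`, `computeFamilies` and `familyForJob` are hypothetical names, and `FamilyInfo` is an assumed stand-in for the `queueFlags` and `queueCount` fields that `vkGetPhysicalDeviceQueueFamilyProperties` reports in `VkQueueFamilyProperties`, so that the selection logic can be shown without a device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two VkQueueFamilyProperties fields used here.
struct FamilyInfo {
    std::uint32_t queueFlags;  // bitmask using the VkQueueFlagBits values
    std::uint32_t queueCount;  // number of queues the family exposes
};

constexpr std::uint32_t kGraphicsBit = 0x1;  // value of VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2;  // value of VK_QUEUE_COMPUTE_BIT

// Returns the indices of all families that can execute compute work.
std::vector<std::uint32_t> computeFamilies(const std::vector<FamilyInfo>& families) {
    std::vector<std::uint32_t> result;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const bool hasCompute = (families[i].queueFlags & kComputeBit) != 0;
        if (hasCompute && families[i].queueCount > 0) {
            result.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return result;
}

// Assigns jobs to families round-robin, so that consecutive submissions land
// in different families instead of in different queues of the same family.
std::uint32_t familyForJob(const std::vector<std::uint32_t>& families, std::size_t job) {
    assert(!families.empty());
    return families[job % families.size()];
}
```

With the layout described in the question (family 0 with graphics+compute, family 2 with compute only), `computeFamilies` would return indices 0 and 2, and consecutive jobs would alternate between those two families. Whether two families actually run concurrently still depends on the hardware and driver.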
Here is further detail on the implementation https://kompute.cc/overview/async-parallel.html