I'm just starting to use Julia's CUDArt package to manage GPU computing. I am wondering how to ensure that if I go to pull data from the GPU (e.g. using `to_host()`) I don't do so before all of the necessary computations have been performed on it.
Through some experimentation, it seems that `to_host(CudaArray)` will lag while the particular `CudaArray` is being updated. So, perhaps just using this is enough to ensure safety? But it seems a bit chancy.
Right now, I am using the `launch()` function to run my kernels, as depicted in the package documentation.
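For reference, a minimal `launch()` call looks roughly like the vector-add example in the CUDArt README (the PTX filename and kernel name below are placeholders):

```julia
using CUDArt

result = devices(dev->capability(dev)[1] >= 2) do devlist
    # load a compiled PTX module and look up a kernel in it
    # ("mykernels.ptx" and "kernel_vadd" are hypothetical names)
    md = CuModule("mykernels.ptx", false)
    vadd = CuFunction(md, "kernel_vadd")
    a = rand(Float32, 1024)
    b = rand(Float32, 1024)
    d_a = CudaArray(a)                 # copy inputs to the device
    d_b = CudaArray(b)
    d_c = CudaArray(Float32, size(a))  # uninitialized output buffer
    # launch(kernel, grid, block, args): returns as soon as the
    # kernel is *queued*, not when it finishes
    launch(vadd, 4, 256, (d_a, d_b, d_c))
    c = to_host(d_c)
end
```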
The CUDArt documentation gives an example using Julia's `@sync` macro, which seems like it could be lovely. But for the purposes of `@sync`, my "work" is done as soon as the kernel gets launched with `launch()`, not once it finishes. As far as I understand the operation of `launch()`, there isn't a way to change this behavior (e.g. to make it wait to receive the output of the function it "launches").
How can I accomplish such synchronization?
I think the more canonical way is to make a stream for each device:
```julia
streams = [(device(dev); Stream()) for dev in devlist]
```

and then, inside the `@async` block, after you tell it to do the computations, you use the `wait(stream)` function to tell it to wait for that stream to finish its computations. See the Streams example in the README.
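Putting it together, the pattern from the README's Streams example looks roughly like the sketch below. Here `mykernel`, the grid/block dimensions, and the device arrays `d_in`, `d_out` are hypothetical, and passing the stream to `launch()` as a keyword argument is assumed from the package's `launch()` signature:

```julia
using CUDArt

devices(dev->true) do devlist
    # one stream per device; device(dev) makes dev current before Stream() is created
    streams = [(device(dev); Stream()) for dev in devlist]
    results = Array(Any, length(devlist))
    @sync begin
        for (i, dev) in enumerate(devlist)
            @async begin
                device(dev)
                # queue the kernel on this device's stream;
                # launch() returns as soon as the work is queued
                launch(mykernel, grid, block, (d_in[i], d_out[i]), stream=streams[i])
                # block *this task only* until everything queued on the stream finishes
                wait(streams[i])
                results[i] = to_host(d_out[i])  # now safe to copy back
            end
        end
    end
    results
end
```

Because `wait(streams[i])` yields to the Julia scheduler rather than spinning, the other `@async` tasks keep driving their own devices while each one waits, and the outer `@sync` returns only once every task has finished.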