This question is about the notion of a CUDA stream (`Stream`) and an apparent anomaly with CUDA events (`Event`) recorded on a stream.
Consider the following code demonstrating this anomaly:
```
cudaEventRecord(eventStart, stream1);
kernel1<<<..., stream1>>>(...);
cudaEventRecord(eventBetween, stream1);
kernel2<<<..., stream1>>>(...);
cudaEventRecord(eventEnd, stream1);
```
Since all work on a stream is sequential, the following sequence of events should have happened:

- `eventStart` is recorded as complete
- `kernel1` runs and completes
- `eventBetween` is recorded as complete
- `kernel2` runs and completes
- `eventEnd` is recorded as complete
After synchronizing the host thread with the device, the time between `eventStart` and `eventBetween` (measured by `cudaEventElapsedTime`) should be the run time of `kernel1`, and the time between `eventBetween` and `eventEnd` should be the run time of `kernel2`.
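Concretely, the expected measurement would look something like the following sketch (the kernels, launch configurations, and `stream1` are placeholders carried over from the snippet above; error checking is omitted):

```
// Sketch: timing two kernels on one stream with events.
cudaEvent_t eventStart, eventBetween, eventEnd;
cudaEventCreate(&eventStart);
cudaEventCreate(&eventBetween);
cudaEventCreate(&eventEnd);

cudaEventRecord(eventStart, stream1);
kernel1<<<grid, block, 0, stream1>>>(/* args */);
cudaEventRecord(eventBetween, stream1);
kernel2<<<grid, block, 0, stream1>>>(/* args */);
cudaEventRecord(eventEnd, stream1);

// Block the host until eventEnd (and hence everything recorded
// before it on stream1) has completed, then read elapsed times.
cudaEventSynchronize(eventEnd);

float ms1 = 0.0f, ms2 = 0.0f;
cudaEventElapsedTime(&ms1, eventStart, eventBetween); // ~ kernel1 run time
cudaEventElapsedTime(&ms2, eventBetween, eventEnd);   // ~ kernel2 run time
```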
But according to NVIDIA's documentation on `cudaStreamWaitEvent`:

> Makes all future work submitted to stream wait until event reports completion before beginning execution.

and a blog post:

> You can also synchronize operations within a single stream on a specific event using cudaStreamWaitEvent(event)

This suggests that an event need not have completed before work scheduled after it on the same stream begins executing. That comes as a surprise, since all work scheduled on a stream is supposed to run sequentially.
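For context, `cudaStreamWaitEvent` is typically used as a stream-to-stream ordering primitive. A minimal sketch of that use (the kernels and stream names are placeholders, not from the question):

```
// Sketch: make stream2 wait for an event recorded on stream1.
cudaEvent_t ready;
cudaEventCreate(&ready);

producer<<<grid, block, 0, stream1>>>(/* args */);
cudaEventRecord(ready, stream1);

// Work already in stream2 is unaffected; only work submitted to
// stream2 *after* this call waits until `ready` has completed.
cudaStreamWaitEvent(stream2, ready, 0);
consumer<<<grid, block, 0, stream2>>>(/* args */);
```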
Can someone clarify this asynchronous behavior of events within a stream?
It appears the cause of your confusion is that you are conflating host-stream synchronisation with stream-stream synchronisation.
In short:
There is no contradiction between any of the sources you cited.
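To illustrate the distinction, a sketch of the two kinds of synchronisation (the event and stream names are placeholders):

```
// Host-stream synchronisation: the *host thread* blocks until
// the event has completed on the device.
cudaEventSynchronize(eventBetween);

// Stream-stream synchronisation: the host does not block; instead,
// future work submitted to stream2 waits on the device until the
// event recorded on stream1 has completed.
cudaStreamWaitEvent(stream2, eventBetween, 0);
```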