I have read on many forums that NVIDIA Visual Profiler serializes the program in order to collect timing information.
However, the Context tab in the Visual Profiler offers advice such as "There is no time overlap between memory copies and kernels on GPU", or, if memory copies and kernel execution do overlap, it displays the overlap time. Also, slide 6 of the following webinar shows an output trace of overlapping kernels.
I want to know whether the profiler can display information about concurrent kernel execution (i.e., if we run 3 kernels in parallel using 3 different streams, can the profiler show whether this is actually happening on the GPU?). If so, where in the Visual Profiler can I find this information?
Yes.
Both nvprof and the Visual Profiler (nvvp) in CUDA Toolkit 5.0 (available as a preview release to registered CUDA developers) support profiling of concurrent kernel execution.
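To check this yourself, a small test program along the following lines may help. It is only a sketch of the 3-kernels-in-3-streams case from the question: the kernel `busyKernel`, the buffer size, and the iteration count are made up for illustration, and error checking is kept to a minimum. Whether the kernels actually overlap depends on the device supporting concurrent kernels and on each kernel leaving enough resources free for the others, which is why each launch uses a single block here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: does enough arithmetic to be visible on a timeline.
__global__ void busyKernel(float *data, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    for (int k = 0; k < iters; ++k) {
        v = v * 1.000001f + 0.000001f;
    }
    data[i] = v;
}

int main()
{
    const int kStreams = 3;
    const int n = 256;  // one small block per kernel, so the GPU has room for all three

    cudaStream_t streams[kStreams];
    float *buffers[kStreams];

    // One non-default stream and one device buffer per kernel.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        cudaMemset(buffers[s], 0, n * sizeof(float));
    }

    // Launch the same kernel into three different streams.
    // Kernels in different non-default streams are allowed to run concurrently.
    for (int s = 0; s < kStreams; ++s) {
        busyKernel<<<1, n, 0, streams[s]>>>(buffers[s], 1 << 20);
    }

    // Wait for all streams to finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(buffers[s]);
        cudaStreamDestroy(streams[s]);
    }

    // Resetting the device before exit helps the profiler flush its data.
    cudaDeviceReset();
    return err == cudaSuccess ? 0 : 1;
}
```

Assuming the file is saved as `streams.cu` (a hypothetical name), it can be built with `nvcc -o streams streams.cu`. Running it under `nvprof --print-gpu-trace ./streams` should list each kernel with its start time, duration, and stream, so overlapping start/end times indicate concurrent execution. In the Visual Profiler, the timeline view should show a separate row per stream, and kernels that overlap in time on different rows ran concurrently.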