I have a question about block execution on the SMX. I've run an experiment in which several kernels are launched from different MPI processes on a K20c GPU, and the GPU is shared among the MPI processes through CUDA MPS. According to the MPS documentation (https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf), a stream is associated with each MPI process, so each MPI process can execute its kernel concurrently with a kernel belonging to another MPI process.

To understand this behaviour, I've visualized the experiment with the Visual Profiler. Visual Profiler shows me that some kernels are not just executing concurrently but are completely overlapped in time; that is, the two kernels overlap fully, not just over a small portion of their execution. It looks as if blocks belonging to both kernels are sharing the SMXs at the same time. As far as I know, an SMX can only hold blocks belonging to the same kernel. Do you have any idea why this is happening? Thank you so much.
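To be concrete, this is roughly what each MPI rank does in my experiment; the kernel body, array size, and launch configuration are illustrative placeholders, not my actual code:

```cpp
// Sketch of the per-rank work: each MPI process launches its own kernel
// on the single K20c, which is shared among the ranks through CUDA MPS.
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // placeholder computation
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_data;
    cudaSetDevice(0);                      // all ranks target the same GPU
    cudaMalloc(&d_data, n * sizeof(float));

    // Each rank launches its own kernel; with MPS the launches from the
    // different ranks can run concurrently on the shared GPU.
    work<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    MPI_Finalize();
    return 0;
}
```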
EDIT:
Thanks for your response @RobertCrovella. I've taken a look at the slides you suggested. According to the slides, each SMX has 4 warp schedulers, and the warps can come either from different threadblocks or from different concurrent kernels. I understand that the expression "different concurrent kernels" refers to different kernels launched on different streams (concurrently). Therefore, I think warps belonging to blocks from different kernels could be scheduled by the 4 warp schedulers of an SMX. However, I've only observed this behaviour when my application is launched with CUDA MPS. When my application is launched with streams alone, only a small part of the kernels overlaps (the end of one kernel with the start of the next), which is normal as far as I know, but the MPS behaviour is strange to me.
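For reference, the streams-only version I mean is essentially this single-process sketch (again, kernel body and sizes are illustrative); with this setup I only see the tail of one kernel overlap with the head of the other:

```cpp
// Sketch of the single-process, streams-only version: two kernels issued
// on two non-default streams of the same context.
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;   // placeholder computation
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Each kernel has enough blocks to fill the SMXs on its own, so the
    // second kernel mostly waits until the first starts draining; only
    // the edges of the two kernels overlap in the profiler timeline.
    work<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    work<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```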