NVIDIA GPUs schedule complete warps to execute instructions together (well, sort of; see also this question). Thus, if we have a "linear" block of, say, 90 threads (or X x Y x Z = 90 x 1 x 1), a GPU core will have three warps to schedule instruction execution for:
- threads (0,0,0) through (31,0,0)
- threads (32,0,0) through (63,0,0)
- threads (64,0,0) through (89,0,0) + 6 unused lanes
This is straightforward and obvious. But - what happens if we have a multi-dimensional block whose X dimension is not a multiple of 32? Say, X x Y x Z = 30 x 3 x 1? There are at least two intuitive ways the threads could be broken up into warps.
Option 1 - pack threads into full warps:
- threads (0,0,0) through (29,0,0) + (0,1,0) through (1,1,0)
- threads (2,1,0) through (29,1,0) + (0,2,0) through (3,2,0)
- threads (4,2,0) through (29,2,0) + 6 unused lanes
Option 2 - keep threads with different z, y coordinates in separate warps:
- threads (0,0,0) through (29,0,0) + 2 unused lanes
- threads (0,1,0) through (29,1,0) + 2 unused lanes
- threads (0,2,0) through (29,2,0) + 2 unused lanes
The first option potentially requires fewer warps (think of a 16 x 2 x 1 block: packed, all 32 threads fit into a single warp, while keeping the rows separate takes two half-empty warps); the second option is likely to prevent some divergence within warps - although this depends on the specifics of the kernel code.
My questions:
- If I don't try to specify anything about the aggregation into warps - which option is chosen by default? And does this differ by GPU/driver?
- Can I affect which of the two options is chosen, or otherwise affect the aggregation of threads into warps in a multidimensional block?



tl;dr: CUDA packs full warps.
Deducing this from the programming guide
(Thanks @RobertCrovella)
§4.1 of the CUDA Programming Guide says:

> The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.

and §2.2 of the CUDA Programming Guide says:

> The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
So, the multi-dimensional "thread index" is linearized in a straightforward manner into a one-dimensional "thread ID", and threads are packed into warps in order of increasing thread ID.
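In code, that linearization and the resulting warp assignment look roughly like this (a small sketch; the helper names are mine, not from the guide):

```
// Thread ID as defined by the guide: x + y*Dx + z*Dx*Dy
__device__ unsigned linear_thread_id()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// Warps are filled with consecutive thread IDs, so the warp a thread lands in
// is its thread ID divided by the warp size (32 on all NVIDIA GPUs to date)
__device__ unsigned warp_index_in_block()
{
    return linear_thread_id() / warpSize;
}
```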
Seeing this for yourself
You can check the partitioning into warps for yourself with a small program along the following lines:
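```
#include <cstdio>

// Each thread reports __activemask(): the set of lanes in its warp that are
// executing this instruction together with it. With no divergence before the
// call, that is simply the set of lanes actually holding threads of the block.
__global__ void report_warp_occupancy()
{
    unsigned mask = __activemask();
    printf("Thread (%2u,%u,%u): active mask 0x%08X\n",
           threadIdx.x, threadIdx.y, threadIdx.z, mask);
}

int main()
{
    // the 30 x 3 x 1 block from the question: 90 threads, X not a multiple of 32
    report_warp_occupancy<<<1, dim3(30, 3, 1)>>>();
    cudaDeviceSynchronize();
    return 0;
}
```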
If warps are fully packed, you'll see two warps' worth of threads reporting the full mask (0xFFFFFFFF), and 26 threads reporting a 26-lane mask (0x03FFFFFF); if instead each row of the block got its own warp, you'd see three groups of 30 threads, each reporting a 30-lane mask (0x3FFFFFFF).
... and indeed, we get the first option.
"But I want option 2!"
Well, if you want separate warps for different Y and Z coordinates, what you can do is "pad" your block dimensions so that the first (X-axis) dimension is always a multiple of the warp size, 32. This has, of course, the cost of having to perform an extra check at the top of the kernel, along the lines of this sketch:
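```
__global__ void my_kernel(/* ... */)   // my_kernel is a placeholder name
{
    // the "real" X extent of the block (30 in the running example);
    // ideally known at compile time
    constexpr unsigned unpadded_x_block_size = 30;

    // threads in the padding lanes bail out immediately
    if (threadIdx.x >= unpadded_x_block_size) { return; }

    // ... actual kernel work, with threadIdx.x in [0, unpadded_x_block_size) ...
}
```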
but that's not very expensive (especially if you use threadIdx.x elsewhere anyway, and if you can calculate unpadded_x_block_size at kernel compile time).
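On the host side, the padded launch could look like this (again a sketch; my_kernel is the placeholder kernel from the snippet above):

```
// Round the real X extent up to the next multiple of the warp size (32),
// so that threads with different y/z coordinates never share a warp
unsigned unpadded_x_block_size = 30;
unsigned padded_x = ((unpadded_x_block_size + 31) / 32) * 32;   // 30 -> 32
dim3 block_dims(padded_x, 3, 1);                                // 32 x 3 x 1 instead of 30 x 3 x 1
my_kernel<<<1, block_dims>>>();
```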