So I was reading the architecture for GCN 1st Generation GPUs provided by the paper here, and I'm a bit confused on the size of the vector ALUs and some other things.
According to it, each compute unit has 1 scalar unit and 4 SIMDs. Each of these 4 SIMDs have 16 ALUs to perform vector operations. The paper states that the ALUs natively execute single precision floating point and 24-bit integer at full speed and DP and 32 bit integer at reduced speeds.
What I want to know is why do the 32 bit integers execute at reduced speed when 32 bit SP floating point can execute all right?
Secondly we know that for AMD GCN GPUs, each SIMD array executes 1-quarter of wavefront over 4 cycles. When an instruction is assigned to an SIMD unit, does it replicates across all 4? or does it take 4 different cycles in order for each SIMD unit to get an instruction?
If all 4 SIMD units execute the same instruction then, theoretically this gets us 4 wavefronts per 4 cycles. If it's the second case then only 1 wavefront gets completed at the 4th cycle.
Although note that according to the GCN whitepaper, the Local Data Share (LDS) coalesces 16 lanes from 2 different SIMD units each cycle so this gets us 2 complete wavefronts per 4 cycles. This seems to hint that it's the first case, since there is no way to get more than 1 wavefront completed per 4 cycles if the instructions aren't replicated across SIMD units.
Lastly I want to ask about a scenario.
Suppose I have a 2D workgroup assigned to a Compute Unit. The workgroup consists of 8x8 = 64 work items. Will the compute unit form 1 wavefront and execute this over 4 cycles in 1 SIMD unit, while the other 3 SIMD units remain idle? Or will something else happen?
If you look at how 32-bit floats are represented, you'll notice there is a 24-bit mantissa, a sign bit, and 7 bits of exponent. Presumably the GPU can use the floating-point ALU's capabilities directly to operate on 24-bit integers stored in what would normally be the mantissa. For operating on larger integers, explicit long multiplication of some kind will need to be done (like 64-bit arithmetic on a 32-bit CPU), slowing things down.