So I was reading the architecture for GCN 1st Generation GPUs provided by the paper here, and I'm a bit confused on the size of the vector ALUs and some other things.
- According to it, each compute unit has 1 scalar unit and 4 SIMDs. Each of these 4 SIMDs have 16 ALUs to perform vector operations. The paper states that the ALUs natively execute single precision floating point and 24-bit integer at full speed and DP and 32 bit integer at reduced speeds. - What I want to know is why do the 32 bit integers execute at reduced speed when 32 bit SP floating point can execute all right? 
- Secondly we know that for AMD GCN GPUs, each SIMD array executes 1-quarter of wavefront over 4 cycles. When an instruction is assigned to an SIMD unit, does it replicates across all 4? or does it take 4 different cycles in order for each SIMD unit to get an instruction? - If all 4 SIMD units execute the same instruction then, theoretically this gets us 4 wavefronts per 4 cycles. If it's the second case then only 1 wavefront gets completed at the 4th cycle. - Although note that according to the GCN whitepaper, the Local Data Share (LDS) coalesces 16 lanes from 2 different SIMD units each cycle so this gets us 2 complete wavefronts per 4 cycles. This seems to hint that it's the first case, since there is no way to get more than 1 wavefront completed per 4 cycles if the instructions aren't replicated across SIMD units. 
- Lastly I want to ask about a scenario. - Suppose I have a 2D workgroup assigned to a Compute Unit. The workgroup consists of 8x8 = 64 work items. Will the compute unit form 1 wavefront and execute this over 4 cycles in 1 SIMD unit, while the other 3 SIMD units remain idle? Or will something else happen? 
 
                        
If you look at how 32-bit floats are represented, you'll notice there is a 24-bit mantissa, a sign bit, and 7 bits of exponent. Presumably the GPU can use the floating-point ALU's capabilities directly to operate on 24-bit integers stored in what would normally be the mantissa. For operating on larger integers, explicit long multiplication of some kind will need to be done (like 64-bit arithmetic on a 32-bit CPU), slowing things down.