AMD recently launched a new GPU architecture called RDNA in its new Navi GPU lineup. After reading some architecture deep-dive articles and watching a few videos, my understanding is as follows (feel free to correct me if I am wrong):
Small workloads that need the same instruction executed are called "threads".
The scheduler then groups together a bunch of threads that require the same instruction. In AMD's case, GCN and RDNA are designed to process groups of 64 and 32 threads respectively.
The SIMD units then process those grouped threads. The difference is that GCN uses SIMD16, meaning 16 threads can be processed at once, while RDNA uses SIMD32, meaning 32 threads can be processed at once.
Things should work flawlessly if the GPU has all 64 threads to execute, but it is a pain in the ass if it only needs to execute 1 thread. Then only 1 SIMD16 vector unit is actually doing something productive, while the other three are basically just chilling.
The change in architecture means that, with SIMD32, the GPU can eliminate this potential bottleneck.
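To put rough numbers on the idle-lane worry above, here is a small Python sketch. It is only a toy model of my understanding, not anything taken from AMD's documentation: the function name `lane_utilization` is made up, and it assumes threads are always packed into whole groups of 64 (GCN-style) or 32 (RDNA-style).

```python
import math

def lane_utilization(active_threads: int, wave_size: int) -> float:
    """Fraction of lanes doing useful work when threads are packed into whole groups.

    Hypothetical helper: assumes every group occupies wave_size lanes
    even if only some of them hold a real thread.
    """
    if active_threads <= 0:
        return 0.0
    groups = math.ceil(active_threads / wave_size)
    return active_threads / (groups * wave_size)

# Compare 64-wide grouping (GCN-style) with 32-wide grouping (RDNA-style).
for n in (1, 32, 33, 64):
    print(n, lane_utilization(n, 64), lane_utilization(n, 32))
```

Under these assumptions a single thread keeps 1 of 64 lanes busy with 64-wide groups and 1 of 32 with 32-wide groups, so the narrower grouping wastes less but does not remove the waste.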
However, every one of those sources keeps saying "The SIMD16 design is better suited for computational workload"... This raised some questions for me:
1) Isn't the SIMD32 design just better than SIMD16 in every single way? If not, what exactly is the advantage of SIMD16 in computational work?
2) For each group of 64 threads, do the 4 SIMD16 units do the processing work simultaneously or serially? The reason I ask is that the video from Engadget depicted the process as serialized, while the video from Linus Tech Tips seemed to hint that it's parallel. This confused the hell out of me.
If everything is serial, then why doesn't AMD just go for SIMD64 or something?
If everything is parallel, then I honestly do not see the advantage of the new SIMD design at all. On GCN you have 4 SIMD16, and on RDNA you have 2 SIMD32. If you process 1 thread on GCN with SIMD16, the time to run 1 SIMD16 should equal the time to run 4 SIMD16, because, again, they are parallel. Jumping to 2 SIMD32, the time to process 1 SIMD32 should equal the time to process 2 of them. In both cases, you still have potentially 63 unused thread slots. So what's the point exactly?
I know my understanding must be flawed at some point, so I would love a deep explanation. Thank you.
Just a long comment.
In GCN, there is only 1 scalar unit per 4 vector units (16 wide each). But in RDNA, there are 2 scalar units per vector unit (32 wide). This must be a serious advantage for complex algorithms that put pressure on the scalar unit. And that scalar unit is newer, isn't it? So it is good at single-threaded problem solving, instead of expecting a fully optimized compute workload from developers. Maybe tree traversal can be better now?
In GCN, each 16-wide vector unit handles 16 threads per cycle, so a whole 64-thread wavefront takes 4 cycles to issue: 4 cycles per 64 pipelines. But in RDNA it is 1 cycle per 32 pipelines, so all 32 lanes run in parallel. This is, again, a very good advantage for latency-bound problems. When 2 units work together it is still 1 cycle per 64 pipelines, since they are two independent 32-wide vector units.
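The cycle counts above can be sketched in a few lines of Python. This is only a simplified model under my own assumptions: one instruction occupies a vector unit for `ceil(wave_size / simd_width)` cycles, a chain of dependent instructions cannot overlap within one wavefront, and the function names are hypothetical.

```python
import math

def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles one vector unit needs to push one instruction through a whole wavefront."""
    return math.ceil(wave_size / simd_width)

def dependent_chain_cycles(n_instructions: int, wave_size: int, simd_width: int) -> int:
    """Cycles for a chain of dependent instructions in a single wavefront.

    Assumes no overlap between consecutive instructions (a simplification).
    """
    return n_instructions * issue_cycles(wave_size, simd_width)

gcn_cycles = issue_cycles(64, 16)    # 64-thread wavefront on a 16-wide unit
rdna_cycles = issue_cycles(32, 32)   # 32-thread wavefront on a 32-wide unit
print(gcn_cycles, rdna_cycles)
print(dependent_chain_cycles(10, 64, 16), dependent_chain_cycles(10, 32, 32))
```

In this model a chain of 10 dependent instructions costs 40 cycles in the GCN-like case and 10 in the RDNA-like case, which is where the latency advantage for a single wavefront comes from.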
So far, that is a 4x improvement in "issue" performance and an 8x improvement on "scalar" workloads.
Getting into thread-level parallelism, it finishes the same wavefronts quicker than GCN, with or without the above advantages. This reduces register pressure. Less register pressure leaves headroom for more threads in flight. This is further boosted by having 1024 registers per vector unit, which is great compared to GCN's 256. More threads per lane, quicker lanes, a better cache system, etc.: it becomes faster and more efficient.
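As a rough illustration of how the register budget limits threads in flight, here is a minimal Python sketch. It assumes the 256 and 1024 register figures from the paragraph above, ignores allocation granularity, and uses wavefront caps of 10 and 20 per vector unit that are my own assumed values; `waves_in_flight` is a hypothetical name.

```python
def waves_in_flight(regs_per_thread: int, regs_per_vector_unit: int, max_waves: int) -> int:
    """Wavefronts one vector unit can keep resident.

    Limited by the register budget and by an assumed hardware cap (max_waves).
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(max_waves, regs_per_vector_unit // regs_per_thread)

# Assumed figures: 256 registers and a cap of 10 for the GCN-like case,
# 1024 registers and a cap of 20 for the RDNA-like case.
for regs in (16, 64, 128):
    gcn_like = waves_in_flight(regs, 256, 10)
    rdna_like = waves_in_flight(regs, 1024, 20)
    print(regs, gcn_like, rdna_like)
```

With these assumed numbers, a kernel using 64 registers per thread fits 4 wavefronts per vector unit in the GCN-like case and 16 in the RDNA-like case, so register-heavy kernels gain the most headroom.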
Scalability of the architecture must have stopped them at 32 lanes instead of going for 64, 128, etc., or smaller like 16, 8, 4. Perhaps a 64-wide vector unit cannot get enough bandwidth from the caches? I don't know. But there is a transistor budget. Where would you cut to get a wider SIMD? Cutting cache means less or slower cache overall, and less or slower cache per pipeline. I wouldn't cut the scalar unit either. Perhaps texturing units and ROPs, but gamers will buy it too. Market penetration.
They seem to have played thread-level parallelism (TLP) well, and they may not need to add any more physical threads to the same vector unit. 80 wavefronts on two vector units (when they work together) is already awesome for TLP, and with this, instruction-level parallelism (ILP) problems should matter much less now. Making 16- or 8-wide vector units in the same area would require 160 threads in flight per pipeline. Are there 160 unique operations per pipeline? I don't know. Even 80 unique operations looks like too much to me. It's like using all the math and memory features of RDNA simultaneously. Just a guess.
For now, the 80-wavefront limit means that you can try to have up to 80 x 2560 work-items in an algorithm if there are ILP or other problems. Maybe not so useful in simple algorithms like naive n-body, but useful in things like mixed-precision int, float, string, everything computed in the same instruction window. Perhaps that's why they said 16-wide is better.
In GCN there were up to 40 threads per pipeline in flight. Nvidia has even fewer, like 32 or 16. Now there are 80 in RDNA, and it is faster. Absolutely better. But it may not be when you have only 2560 particles in an n-body algorithm. The 64+ SIMD width you asked about could be better for fewer particles (maybe) for this reason. But as the particle count increases, more TLP looks better, hence less width per compute unit for the same transistor count.
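To see why a small particle count cannot use the extra threads in flight, here is a back-of-the-envelope Python sketch. It assumes that the 2560 above is the total number of lanes on the GPU and that every vector unit is 32 wide; the function name `waves_per_vector_unit` is made up for illustration.

```python
import math

def waves_per_vector_unit(work_items: int, wave_size: int,
                          total_lanes: int, simd_width: int) -> float:
    """Average number of wavefronts each vector unit gets to switch between."""
    n_waves = math.ceil(work_items / wave_size)
    n_units = total_lanes // simd_width
    return n_waves / n_units

# Assumed RDNA-like configuration: 2560 lanes split into 32-wide vector units.
small = waves_per_vector_unit(2560, 32, 2560, 32)      # one particle per lane
large = waves_per_vector_unit(256_000, 32, 2560, 32)   # 100 particles per lane
print(small, large)
```

Under these assumptions 2560 particles give each vector unit exactly one wavefront, so there is nothing to switch to while it waits on memory, whereas 256,000 particles give each unit 100 wavefronts to choose from.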