I am instancing tens of thousands of meshes on the GPU, and each mesh needs a unique transform. Is it faster to calculate tens of thousands of matrices on the CPU and pass them to the GPU via a compute buffer, or to calculate each unique TRS matrix on the GPU itself (e.g., in a compute shader)?
I have tried implementing both, but I have not yet been able to correctly calculate TRS matrices in HLSL. Before trying further, I just want to make sure that calculating on the GPU could be a good option, since there are SO MANY instances.
Yes, if you have tons of instances it can definitely be worth it to offload the calculation to the GPU. Note that if you also want to perform culling, it will need to be done on the GPU as well (in the case of instancing it's not a particularly complex operation).
On the matter of speed, there is a threshold beyond which performing the work on the GPU becomes faster than on the CPU (you still pay to upload the pose data and dispatch a compute pass); where that threshold sits varies by architecture.
This is the compute shader code I was using to convert an SRT pose to a matrix (I also tried to reasonably optimize it to avoid multiple matrix multiplications, even though those are blazing fast on the GPU).
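A minimal sketch of such a shader: it builds each TRS matrix directly from scale, quaternion rotation, and translation in one pass, folding the scale into the rotation columns instead of multiplying three separate matrices. The struct layout, buffer names, and thread-group size are assumptions, and it uses the column-vector convention (`world = mul(M, float4(local, 1))`).

```hlsl
struct InstancePose
{
    float3 position;   // T
    float4 rotation;   // R as a quaternion (x, y, z, w), assumed normalized
    float3 scale;      // S
};

StructuredBuffer<InstancePose> _Poses;       // input: one pose per instance
RWStructuredBuffer<float4x4>  _Matrices;     // output: one TRS matrix per instance

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    InstancePose p = _Poses[id.x];
    float4 q = p.rotation;

    // Precompute the quaternion products used by the rotation matrix.
    float x2 = q.x + q.x, y2 = q.y + q.y, z2 = q.z + q.z;
    float xx = q.x * x2,  yy = q.y * y2,  zz = q.z * z2;
    float xy = q.x * y2,  xz = q.x * z2,  yz = q.y * z2;
    float wx = q.w * x2,  wy = q.w * y2,  wz = q.w * z2;

    // M = T * R * S: each rotation column is scaled by the matching
    // scale component, and the translation sits in the fourth column.
    float4x4 m;
    m[0] = float4((1.0 - (yy + zz)) * p.scale.x, (xy - wz) * p.scale.y, (xz + wy) * p.scale.z, p.position.x);
    m[1] = float4((xy + wz) * p.scale.x, (1.0 - (xx + zz)) * p.scale.y, (yz - wx) * p.scale.z, p.position.y);
    m[2] = float4((xz - wy) * p.scale.x, (yz + wx) * p.scale.y, (1.0 - (xx + yy)) * p.scale.z, p.position.z);
    m[3] = float4(0, 0, 0, 1);

    _Matrices[id.x] = m;
}
```

If your pipeline multiplies row vectors (`mul(v, M)`), transpose the result or mirror the layout, with the translation in the fourth row instead of the fourth column.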