In the Kepler architecture whitepaper, NVIDIA states that there are 32 Special Function Units (SFUs) and 32 Load/Store Units (LD/ST) on an SMX.
The SFUs are for "fast approximate transcendental operations". Unfortunately, I don't understand what this is supposed to mean. On the other hand, at Special CUDA Double Precision trig functions for SFU it is said that they only work in single precision. Is this still correct on a K20Xm?
The LD/ST units are obviously for storing and loading. Does every memory load/store have to go through one of these? And are they also used by a single warp at a time? In other words, can there be only one warp that is currently writing or reading?
Cheers, Andi
SFUs compute functions like __cosf(), __expf(), etc. According to the recent CUDA C Programming Guide, section G.5.1, they still only work in single precision.
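For illustration, here is a minimal sketch (kernel and variable names are mine, not from the answer) contrasting the fast single-precision intrinsic, which maps to the SFU, with the accurate double-precision function, which is implemented as a software sequence of ordinary FP64 instructions:

```cuda
// Sketch: __cosf() uses the SFU approximation (single precision only),
// while cos() on doubles runs a software argument-reduction + polynomial path.
__global__ void cosines(const float *xf, const double *xd,
                        float *fast, double *accurate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __cosf(xf[i]);  // SFU: fast, approximate, float only
        accurate[i] = cos(xd[i]);     // FP64: accurate, no SFU involvement
    }
}
```

Note that compiling with nvcc's -use_fast_math option also maps plain cosf() calls to __cosf(); double-precision cos() is unaffected.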
It makes some sense: if you need double precision, it's unlikely you would use inaccurate math functions. You can refer to this answer for suggestions on double-precision arithmetic optimizations.
The implementation details of double-precision operations can be found in /usr/local/cuda-5.5/include/math_functions_dbl_ptx3.h (or wherever your CUDA Toolkit is installed). E.g. for sin and cos it uses Payne-Hanek argument reduction followed by a Taylor expansion (up to order 14).

For double-precision calculations, SFUs seem to be used only in
__internal_fast_rcp and __internal_fast_rsqrt, which in turn are used in acos, log, cosh and several other functions (see math_functions_dbl_ptx3.h). So most of the time the SFUs stall, just as the LD/ST units stall when there are no ongoing memory transactions.

Yes, each access to global memory goes through one of them.
The number of units constrains only the number of instructions issued per cycle: each clock cycle, up to 32 load instructions can be issued and 32 results returned.
One instruction can read/write up to 128 bytes, so if each thread in a warp reads 4 bytes and the accesses are coalesced, the whole warp requires a single load/store instruction. If the accesses are uncoalesced, more instructions must be issued.
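To make the coalescing point concrete, a hedged sketch (kernel names are mine): with the coalesced pattern, adjacent threads of a warp read adjacent 4-byte words, so 32 threads cover exactly one 128-byte segment and one load instruction suffices; with a stride of 32 elements, the same warp touches 32 different 128-byte segments:

```cuda
__global__ void coalesced(const float *in, float *out)
{
    // Adjacent threads read adjacent 4-byte words:
    // 32 threads x 4 B = 128 B, i.e. one 128-byte segment per warp.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

__global__ void strided(const float *in, float *out)
{
    // Each thread jumps 32 floats (128 B) ahead of its neighbor:
    // the warp touches 32 separate 128-byte segments, so far more
    // memory transactions are needed for the same amount of data.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * 32];
}
```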
Moreover, the units are pipelined, meaning multiple read/store requests can be in flight concurrently in a single unit.