Load/Store Units (LD/ST) and Special Function Units (SFUs) for the Kepler architecture


In the Kepler architecture whitepaper, NVIDIA states that there are 32 Special Function Units (SFUs) and 32 Load/Store Units (LD/ST) on a SMX.

The SFUs are for "fast approximate transcendental operations". Unfortunately, I don't understand what this is supposed to mean. Also, the question "Special CUDA Double Precision trig functions for SFU" says that they only work in single precision. Is this still correct on a K20Xm?

The LD/ST units are obviously for storing and loading. Is every memory load/store required to go through one of these? And are they also used by a single warp at a time? In other words, can only one warp be reading or writing at any given moment?

Cheers, Andi


There are 2 answers

aland (best answer)

The SFU are for "fast approximate transcendental operations"

SFUs compute functions like __cosf(), __expf() etc.
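
To make "fast approximate" a bit more concrete, here is a minimal sketch that evaluates cosine with the intrinsic `__cosf()` and with the regular `cosf()`, then reports the largest difference between the two. The kernel name `cos_both` and the helper `max_cos_error` are made up for this illustration, and the use of managed memory assumes CUDA 6 or later.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical kernel: evaluates cosine two ways for each input value.
__global__ void cos_both(const float* x, float* fast, float* accurate, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast[i]     = __cosf(x[i]);  // intrinsic: approximate, single precision only
        accurate[i] = cosf(x[i]);    // regular math function: slower, more accurate
    }
}

// Evaluates n >= 2 points in [-pi, pi] and stores the largest |fast - accurate|.
// Returns false if any CUDA call fails (for example, no usable GPU).
bool max_cos_error(int n, float* max_err) {
    const float pi = 3.14159265f;
    float *x = nullptr, *fast = nullptr, *accurate = nullptr;
    bool ok = cudaMallocManaged(&x, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&fast, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&accurate, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) x[i] = -pi + 2.0f * pi * i / (n - 1);
        cos_both<<<(n + 255) / 256, 256>>>(x, fast, accurate, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) {
        *max_err = 0.0f;
        for (int i = 0; i < n; ++i)
            *max_err = std::fmax(*max_err, std::fabs(fast[i] - accurate[i]));
    }
    cudaFree(x); cudaFree(fast); cudaFree(accurate);  // freeing nullptr is a no-op
    return ok;
}

int main() {
    float err = 0.0f;
    if (max_cos_error(1 << 16, &err))
        printf("max |__cosf - cosf| on [-pi, pi]: %g\n", err);
    else
        printf("CUDA call failed (no usable GPU?)\n");
    return 0;
}
```

Note that compiling with `-use_fast_math` makes nvcc replace `cosf()` with `__cosf()`, in which case both columns would be identical, so build this without that flag.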

On the other hand here is said, that they only work in single precision, is this still correct on a K20Xm?

According to a recent CUDA C Programming Guide (section G.5.1), they still only work in single precision.

It makes some sense: if you need double precision, you are unlikely to want inaccurate math functions. You can refer to this answer for suggestions on double-precision arithmetic optimizations.

The implementation details of double-precision operations can be found in /usr/local/cuda-5.5/include/math_functions_dbl_ptx3.h (or wherever your CUDA Toolkit is installed). E.g. for sin and cos it uses Payne-Hanek argument reduction followed by a Taylor expansion (up to order 14).

For double-precision calculations, SFUs seem to be used only in __internal_fast_rcp and __internal_fast_rsqrt, which in turn are used in acos, log, cosh and several other functions (see math_functions_dbl_ptx3.h). So most of the time they sit idle, just as the LD/ST units sit idle when there are no ongoing memory transactions.

Is any memory load/write required to go through one of these?

Yes, each access to global memory.

And are they also used as a single warp? In other words can there be only one warp which is currently writing or reading?

The number of units constrains only the number of instructions issued each cycle. That is, up to 32 read instructions can be issued each clock cycle, and 32 results can be returned.

One instruction can read/write up to 128 bytes, so if each thread in a warp reads 4 bytes and the accesses are coalesced, the whole warp requires a single load/store instruction. If the accesses are uncoalesced, more instructions must be issued.

Moreover, the units are pipelined, meaning that a single unit can have multiple load/store requests in flight concurrently.
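
The 128-byte arithmetic above can be sketched roughly as follows. The kernel `strided_copy` shows the access pattern, and the host function `segments_per_warp` is a simple model (not a hardware query) that counts how many aligned 128-byte segments one warp touches for a given stride. Both names are invented for this example, and the model assumes 4-byte elements and a 128-byte-aligned array.

```cuda
#include <cstdio>
#include <set>
#include <cuda_runtime.h>

// Thread i copies element i * stride. With stride == 1 the 32 threads of a
// warp read 32 * 4 = 128 contiguous bytes; larger strides spread them out.
__global__ void strided_copy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    long long j = (long long)i * stride;
    if (j < n) out[i] = in[j];
}

// Host-side model (an assumption for illustration): number of distinct aligned
// 128-byte segments touched when lane k of a warp reads the float at index
// k * stride, with the array base aligned to 128 bytes.
int segments_per_warp(int stride) {
    std::set<long long> segments;
    for (int lane = 0; lane < 32; ++lane)
        segments.insert((long long)lane * stride * 4 / 128);
    return (int)segments.size();
}

int main() {
    const int n = 1 << 20;
    float *in = nullptr, *out = nullptr;
    bool ok = cudaMallocManaged(&in, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&out, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) in[i] = (float)i;
        const int strides[] = {1, 2, 32};
        for (int stride : strides) {
            strided_copy<<<(n + 255) / 256, 256>>>(in, out, n, stride);
            cudaDeviceSynchronize();
            printf("stride %2d: %2d segment(s) per warp in the model\n",
                   stride, segments_per_warp(stride));
        }
    } else {
        printf("CUDA allocation failed (no usable GPU?)\n");
    }
    cudaFree(in); cudaFree(out);  // freeing nullptr is a no-op
    return ok ? 0 : 1;
}
```

Running the kernel under a profiler for each stride should show the number of memory transactions growing with the stride, though the exact counts depend on the GPU and its cache configuration.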

Roger Dahl

Don't accept this as an answer -- we're hoping that someone will come along and answer your question about double precision transcendental operations. I just wanted to address the second part of your question, about the LD/ST units.

The LD/ST units are obviously for storing and loading.

Yes.

Is any memory load/write required to go through one of these?

Yes.

And are they also used as a single warp?

Yes, all active threads in a warp always issue the same type of instruction in the same clock cycle. If that instruction is a load or store, it gets issued to the LD/ST units. If a thread is inactive (due to looping or conditional execution), the corresponding LD/ST unit stays idle.

In other words can there be only one warp which is currently writing or reading?

No, the LD/ST units can accept one load or store operation per clock, even though memory latency can be several hundred cycles. So, when one warp issues a load instruction, the LD/ST units will start working on retrieving that data. Instructions in the warp that depend on the data become ineligible to be issued until the data arrives. In the next clock cycle, the warp may still execute other independent instructions (instruction-level parallelism), including other, independent load or store instructions. Another warp that is eligible to be scheduled may also, in the next clock cycle, issue another load instruction and itself go into a waiting state (thread-level parallelism). At that point, the LD/ST units are keeping track of two pending results. Due to caching and coalescing, it is possible that the data for the second warp arrives first. When data for a warp arrives, it gets assigned to the registers designated in the instruction, and that particular data dependency is then resolved.
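
As a small illustration of the instruction-level parallelism mentioned above, the sketch below has each thread perform two loads that do not depend on each other, followed by an add that depends on both. The names `add_two` and `run_add_two` are invented for this example, and whether the two loads actually overlap is up to the compiler and the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_two(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];  // load 1
        float y = b[i];  // load 2: independent of x, so it need not wait for load 1
        out[i] = x + y;  // depends on both loads, so it waits until both have arrived
    }
}

// Fills a[i] = i and b[i] = 2 * i, runs the kernel, and returns the first and
// last results. Returns false if any CUDA call fails (for example, no usable GPU).
bool run_add_two(int n, float* first, float* last) {
    float *a = nullptr, *b = nullptr, *out = nullptr;
    bool ok = cudaMallocManaged(&a, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&b, n * sizeof(float)) == cudaSuccess
           && cudaMallocManaged(&out, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }
        add_two<<<(n + 255) / 256, 256>>>(a, b, out, n);
        ok = cudaGetLastError() == cudaSuccess
          && cudaDeviceSynchronize() == cudaSuccess;
    }
    if (ok) { *first = out[0]; *last = out[n - 1]; }
    cudaFree(a); cudaFree(b); cudaFree(out);  // freeing nullptr is a no-op
    return ok;
}

int main() {
    float first = 0.0f, last = 0.0f;
    if (run_add_two(1024, &first, &last))
        printf("out[0] = %g, out[1023] = %g\n", first, last);
    else
        printf("CUDA call failed (no usable GPU?)\n");
    return 0;
}
```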