Is coalesced memory access a feature or phenomenon?


I'm currently writing a smaller project in OpenCL, and I'm trying to find out what really causes memory coalescing. Every book on GPGPU programming says it's how GPGPUs should be programmed, but not why the hardware would prefer this.

So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?

There are 2 answers

Jan Lucas

Memory coalescing makes several different things more efficient. It is usually done before the requests hit the cache. Like the SIMT execution model, it is an architectural trade-off: it lets GPUs have a very high-performance, efficient memory system, but it also forces programmers to think carefully about their data layout.

Without coalescing, either the cache would need to serve a huge number of requests at the same time, or each memory access would take much longer because the individual transfers would have to be handled one at a time. This matters even for simply checking whether an access is a hit or a miss.

Merging requests is fairly easy to implement: pick one pending transfer, merge it with every other request whose upper address bits match (i.e. that falls into the same line), issue a single transaction per cycle, and replay the load or store instruction until all threads have been handled.
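To make that concrete, here is a minimal C sketch of the idea (not how any particular GPU implements it): each thread's low address bits are dropped, and requests that share the same upper bits, i.e. the same line, collapse into a single transaction. The addresses, thread count, and 128-byte line size are just illustrative assumptions.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_THREADS 32
#define LINE_BYTES  128   /* assumed transaction/line size */

/* Count how many transactions a coalescer would issue for one load,
 * by merging all thread addresses that share the same upper bits
 * (i.e. fall into the same line). */
static int count_transactions(const uint64_t addr[NUM_THREADS])
{
    uint64_t lines[NUM_THREADS];
    int n = 0;
    for (int t = 0; t < NUM_THREADS; ++t) {
        uint64_t line = addr[t] / LINE_BYTES;   /* drop the low bits   */
        int seen = 0;
        for (int i = 0; i < n; ++i)
            if (lines[i] == line) { seen = 1; break; }
        if (!seen)
            lines[n++] = line;                  /* one more replay     */
    }
    return n;
}

int main(void)
{
    uint64_t coalesced[NUM_THREADS], strided[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; ++t) {
        coalesced[t] = 0x1000 + 4ull * t;       /* thread t reads a[t]    */
        strided[t]   = 0x1000 + 4ull * t * 32;  /* thread t reads a[32*t] */
    }
    printf("coalesced pattern: %d transactions\n", count_transactions(coalesced));
    printf("strided pattern:   %d transactions\n", count_transactions(strided));
    return 0;
}
```

With these numbers the consecutive pattern collapses to a single 128-byte transaction, while the 128-byte-strided pattern needs 32 separate ones, one per thread.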

Caches also store runs of consecutive bytes (32/64/128-byte lines). This fits most applications well, matches modern DRAM, and reduces the overhead of cache bookkeeping: the cache is organized into cache lines, and each cache line has a single tag that indicates which addresses are stored in that line.

Modern DRAM uses wide interfaces and long bursts: the memory of a GPU is typically organized in 32-bit or 64-bit wide channels, and GDDR5 memory has a burst length of 8. This means that every transaction at the DRAM interface has to fetch at least 32 bits × 8 = 32 bytes or 64 bits × 8 = 64 bytes at a time, even if only a single byte of that is actually needed. Designing data layouts that lead to coalesced requests helps to use the DRAM interface efficiently.
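On the OpenCL side, this shows up in how you index global memory. A minimal sketch in OpenCL C (the kernel names and the stride parameter are just illustrative): in the first kernel, consecutive work-items read consecutive floats, so their requests coalesce into a few full bursts; in the second, a large stride puts nearly every work-item into its own burst.

```c
/* Illustrative OpenCL C kernels; buffer sizes and stride are set by the host. */

__kernel void coalesced_copy(__global const float *src,
                             __global float *dst)
{
    /* Work-item i touches element i: neighbouring work-items hit
     * neighbouring addresses, so their requests merge into few bursts. */
    size_t i = get_global_id(0);
    dst[i] = src[i];
}

__kernel void strided_copy(__global const float *src,
                           __global float *dst,
                           const int stride)
{
    /* Work-item i touches element i * stride: with a large stride,
     * almost every work-item lands in a different DRAM burst. */
    size_t i = get_global_id(0);
    dst[i] = src[i * stride];
}
```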

GPUs also keep a huge number of threads in flight while having comparatively small caches. CPUs can often use their caches to reorder memory requests into DRAM-friendly patterns; on GPUs, the larger thread count and smaller caches make this "cache-based coalescing" less effective, because data often does not stay in the cache long enough to be merged there with other requests to the same cache line.

Dragontamer5788

Despite the "random access" in the name RAM (Random-Access Memory), DDR3 (Double-Data-Rate 3) RAM is faster at accessing consecutive locations than at accessing random ones.

Case in point: the "CAS latency" is the amount of time DDR3 RAM stalls when you access a new "column", because the chip literally has to charge up to serve the data from another location on the chip.

EDIT: Jan Lucas argues that RAS Latency is more important in practice. See his comment for details.

There's roughly a 10 ns delay whenever you switch columns. So if you keep your memory accesses "close" to each other, you don't invoke that CAS delay.

So if you have 20 words to access at a particular location, it's more efficient to access all 20 of them before moving to a new memory location (which invokes a CAS delay). Otherwise, you'll have to invoke ANOTHER CAS delay to "switch back" between memory locations.

It's only around 10 nanoseconds per switch, but over millions of memory accesses that time adds up.
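A rough back-of-the-envelope sketch in C of how it adds up (the 10 ns figure, location count, and words-per-location are just illustrative assumptions): the "grouped" pattern finishes all the words at one location before moving on, so it pays one switch per location, while the "interleaved" pattern pays a switch on nearly every access.

```c
#include <stdio.h>

#define WORDS_PER_LOCATION 20
#define NUM_LOCATIONS      1000
#define SWITCH_DELAY_NS    10      /* rough, illustrative figure */

int main(void)
{
    /* Grouped: read all 20 words at one location, then move on.
     * Roughly one location switch per location. */
    long grouped = (long)NUM_LOCATIONS * SWITCH_DELAY_NS;

    /* Interleaved: bounce between locations word by word.
     * Roughly one switch per individual access. */
    long interleaved = (long)NUM_LOCATIONS * WORDS_PER_LOCATION * SWITCH_DELAY_NS;

    printf("grouped accesses:     ~%ld ns of switch delay\n", grouped);
    printf("interleaved accesses: ~%ld ns of switch delay\n", interleaved);
    return 0;
}
```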