Understanding Performance Behavior of Random Writes to Global Memory

96 views Asked by At

I'm running experiments aiming to understand the behavior of random read and write access to global memory.

The following kernel reads from an input vector (groupColumn) with a coalesced access pattern and reads random entries from a hash table in global memory.

struct Entry {
  uint group;
  uint payload;
};
typedef struct Entry Entry;

__kernel void global_random_write_access(__global const uint* restrict groupColumn,
                                         __global Entry* globalHashTable,
                                         __const uint HASH_TABLE_SIZE,
                                         __const uint HASH_TABLE_SIZE_BITS,
                                         __const uint BATCH,
                                         __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);

    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        // hash keys are pre-computed
        uint hash_key = groupColumn[idx];                       // coalesced read access
        __global Entry* entry = &globalHashTable[hash_key];     // pointer arithmetic
        sum += entry->payload;                                  // random read
    }

    if (local_id < HASH_TABLE_SIZE) {
        globalHashTable[local_id].payload = sum;                // rare coalesced write
    }
}

I ran this kernel on a NVIDIA V100 card with multiple iterations. The variance of the results is very low, thus, I only plot one dot per group configuration. The input data size is 1 GiB and each thread processes 128 entries (BATCH = 128). Here are the results: enter image description here

So far so good. The V100 has a max memory bandwidth of roughly 840GiB/sec and the measurements are close enough, given the fact that there are random memory reads involved.

Now I'm testing random writes to global memory with the following kernel:

__kernel void global_random_write_access(__global const uint* restrict groupColumn,
                                         __global Entry* globalHashTable,
                                         __const uint HASH_TABLE_SIZE,
                                         __const uint HASH_TABLE_SIZE_BITS,
                                         __const uint BATCH,
                                         __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);

    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        // hash keys are pre-computed
        uint hash_key = groupColumn[idx];                       // coalesced read access
        __global Entry* entry = &globalHashTable[hash_key];     // pointer arithmetic
        sum += i;
        entry->payload = sum;                                   // random write
    }

    if (local_id < HASH_TABLE_SIZE) {
        globalHashTable[local_id].payload = sum;                // rare coalesced write
    }
}

Godbolt: OpenCL -> PTX

The performance drops significantly to a few GiB/sec for few groups. enter image description here

I can't make any sense of the behavior. As soon as the hash table reaches the size of L1 the performance seems to be limited by L2. For fewer groups the performance is way lower. I don't really understand what the limiting factors are.

The CUDA documentation doesn't say much about how store instructions are handled internally. The only thing I could find is that the st.wb PTX instruction (Cache Operations) might cause a hit on stale L1 cache if another thread would try to read the same addess via ld.ca. However, there are no reads to the hash table involved here.

Any hints or links to understanding the performance behavior are much appreciated.


Edit:

I actually found a bug in my code that didn't pre-compute the hash keys. The access to global memory wasn't random, but actually coalesced due to how I generated the values. I further simplified my experiments by removing the hash table. Now I only have one integer input column and one interger output column. Again, I want to see how the writes to global memory actually behave for different memory ranges. Ultimately, I want to understand which hardware properties influence the performance of writes to global memory and see if I can predict based on the code what performance to expect.

I tested this with two kernels that do the following:

  1. Read from input, write to output
  2. Read from input, read from output and write to output

I also applied two different access patterns, by generating the values in the group column:

  1. SEQUENTIAL: sequentially increasing numbers until current group's size is reached. This pattern leads to a coalesced memory access when reading and writing from the output column.
  2. RANDOM: uni-distributed random numbers within the current group's size. This pattern leads to a misaligned memory access when reading and writing from the output column.

(1) Read & Write

__kernel void global_write_access(__global const uint* restrict groupColumn,
                                    __global uint *restrict output,
                                    __const uint BATCH,
                                    __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);

    uint end = BATCH * STRIDE;
    uint sum = 0;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        uint group = groupColumn[idx];                       // coalesced read access
        sum += i;
        output[group] = sum;                                 // write (coalesced | random)
    }
}

PTX Code: https://godbolt.org/z/19nTdK

Read & Write

(2) Read, Read & Write

__kernel void global_read_write_access(__global const uint* restrict groupColumn,
                                  __global uint *restrict output,
                                  __const uint BATCH,
                                  __const uint STRIDE) {
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);

    uint end = BATCH * STRIDE;
    for (int i = 0; i < end; i += STRIDE) {
        uint idx = global_id + i;
        uint group = groupColumn[idx];                       // coalesced read access
        output[group] += 1;                                  // read & write (coalesced | random)
    }
}

PTX Code: https://godbolt.org/z/b647cz

enter image description here

As ProjectPhysX pointed out, the access pattern makes a huge difference. However, for small groups the performance is quite similar for both access patterns. In general, I would like to better understand the shape of the curves and which hardware properties, architectural features etc. influence this shape.

From the cuda programming guide I learned that global memory accesses are conducted via 32-, 64-, or 128-byte transactions. Accesses to L2 are done via 32-byte transactions. So up to 8 integer words can be accessed via a single transaction. This might explain the plateau with a bump at 8 groups at the beginning of the curve. After that more transactions are needed and performance drops.

One cache line is 128 bytes long (both on L1 and L2), hence, 32 intergers fit into a single cache line. For more groups more cache lines are required which can be potentially processed in parallel by more memory controllers. That might be the reason for the performance to increase here. 8 controllers are available on the V100 So I would expect the performance to peak at 256 groups. Though, it doesn't. Instead it will steadily increase performance until reaching 4096 groups and plateau there with roughly 750 GiB/sec.

1

There are 1 answers

1
ProjectPhysX On

The plateauing in your second performane plot is GPU saturation: For only a few work groups, the GPU is partly idle and the latencies involved in launching the kernel significantly reduce performance. Above 8192 groups, the GPU fully saturates its memory bandwidth. The plateau only is at ~520GB/s because of the misaligned writes (have low performance on the V100) and also the "rare coalesced write" in the if-block, which happens at least once per group. For branching within the group, all other threads have to wait for the single write operation to finish. Also this write is not coalesced, because it is not happening for every thread in the group. On the V100, misaligned write performance is very poor at max. ~120GB/s, see the benchmark here.

Note that if you would comment the if-part, the compiler sees that you do not do anything with sum and optimizes everything out, leaving you with a blank kernel in PTX.

The first performance graph to me is a bit more confusing. The only difference in the first kernel to the second is that the random wrtite in the loop is replaced by a random read. Generally, read performance on the V100 is much better (~840GB/s, regardless of coalesced/misaligned) than misaligned write performance, so performance is expected to be much better overall and indeed it is. However I can't make sense of the performance dropping for more groups, where saturation should theoretically be better. But the performance drop isn't really that significant at ~760GB/s vs. 730GB/s.

To summarize, you are observing that the performance penalty for misaligned writes (~120GB/s vs. ~900GB/s for coalesced writes) is much larger than for reads, where performance is about the same for coalesced/misaligned at ~840GB/s. This is common thing for GPUs, with some variance of course between microarchitectures. Typically there is at least some performance penalty for misaligned reads, but not as large as for misaligned writes.