How can I mitigate the performance impact of transposed array access order?


I have an algorithm, written in C, which processes two 2-dimensional input arrays to produce a third 2-dimensional array. All three arrays contain 32-bit floats and share the same size, Y x X, where Y might be a few tens but X is a million or so.

Unfortunately:

  • all the arrays must be in row-major order (scanning through X accesses contiguous memory),
  • the algorithm requires the innermost loop to scan over the Y dimension.

Perhaps unsurprisingly, accessing data in this non-contiguous fashion is relatively slow. So...

What can I do to mitigate the performance impact of the non-contiguous memory accesses?

(NB. It was a long shot, but I've tried various patterns of prefetch instructions to bring in upcoming columns, but all to no avail.)
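For illustration, one such pattern looks like this sketch, using GCC's __builtin_prefetch on the inner loop of the demo code below (the lookahead of 16 floats, i.e. one 64-byte line, and the locality hint are just knobs to vary, not a recommendation):

for(x=0; x<NX; x++) {
    v = 1;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Hint the lines of an upcoming column: rw = 0 (read),
           locality = 1 (low temporal reuse). */
        __builtin_prefetch(&a[offset + 16], 0, 1);
        __builtin_prefetch(&b[offset + 16], 0, 1);
        if(a[offset] < 0) {
            v = 2;
        }
        c[offset] = v * b[offset];
    }
}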

The following (updated) code demonstrates the problem:

#include <stdio.h>
#include <stdlib.h>

#define NX 1000000
#define NY 30

int main(void) {
    float *a = malloc(sizeof(float) * NY * NX);
    float *b = malloc(sizeof(float) * NY * NX);
    float *c = malloc(sizeof(float) * NY * NX);
    if(!a || !b || !c)
        return 1;

    size_t y, x, offset;
    float v;

    /* Fill the inputs so the reads below are defined; the mixed
       signs in a[] keep the branch data-dependent. */
    for(offset=0; offset<(size_t)NY*NX; offset++) {
        a[offset] = (offset % 7 == 0) ? -1.0f : 1.0f;
        b[offset] = 2.0f;
    }

    for(x=0; x<NX; x++) {
        v = 1;
        for(y=0; y<NY; y++) {
            /* Row-major layout: stepping y strides NX floats. */
            offset = x + NX * y;
            if(a[offset] < 0) {
                v = 2;
            }
            c[offset] = v * b[offset];
        }
    }

    free(a);
    free(b);
    free(c);
    return 0;
}

On a test machine with an Intel Xeon E5520 @ 2.27 GHz this takes ~1 s to execute, even though it only reads ~220 MB and writes ~110 MB.

Answer by sh1:

It looks like your access pattern shouldn't be that harmful. This makes me wonder if branch prediction is your real problem.
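A quick way to test that hypothesis is to replace the branch with a select, which compilers usually turn into a conditional move. This sketch computes exactly what your original loop computes:

for(x=0; x<NX; x++) {
    v = 1;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Branchless: eligible for cmov, so nothing to mispredict. */
        v = (a[offset] < 0) ? 2.0f : v;
        c[offset] = v * b[offset];
    }
}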

Normally transposed data access is done in chunks to keep the cache healthy, but your input is so short on the inner-loop axis that the cached read of the first row should still be valid by the time you revisit it in your outer loop.

You have three arrays, each 30 rows high, and a cache line might be as much as 128 bytes (I'd expect smaller on your CPU, but things change). That's only about 12 kB of cache (3 x 30 x 128 bytes) that you need for the top row's lines to stay resident.

You could try changing v to a small array and proceeding in vertical stripes, though. Even if this didn't actually help your cache utilisation, it would at least give a hint to the compiler that it could be optimised with SIMD.
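A minimal sketch of that idea, with a stripe width of 8 picked arbitrarily (it's a tuning parameter, and this assumes NX is a multiple of it, which 1000000 is):

#define STRIPE 8  /* hypothetical stripe width; assumes NX % STRIPE == 0 */

for(size_t x0=0; x0<NX; x0+=STRIPE) {
    float v[STRIPE];
    for(size_t i=0; i<STRIPE; i++)
        v[i] = 1;
    for(size_t y=0; y<NY; y++) {
        for(size_t i=0; i<STRIPE; i++) {
            /* The inner loop now touches STRIPE contiguous floats per row. */
            size_t offset = (x0 + i) + NX * y;
            v[i] = (a[offset] < 0) ? 2.0f : v[i];
            c[offset] = v[i] * b[offset];
        }
    }
}

Writing the update as a select rather than a branch also removes the control dependency from the inner loop, which makes it a clean candidate for auto-vectorisation.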

You might also try this dangerous optimisation to eliminate branches:

/* Needs #include <stdint.h>. The casts assume float can be aliased
   as uint32_t, which strict aliasing formally forbids. */
for(x=0; x<NX; x++) {
    uint32_t v = 0;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Latch the sign bit, shifted down to the exponent's LSB (1 << 23). */
        v |= (((uint32_t *)a)[offset] & 0x80000000u) >> 8;
        /* Adding 1 << 23 to the bit pattern doubles a finite float. */
        ((uint32_t *)c)[offset] = ((uint32_t *)b)[offset] + v;
    }
}

This does the multiplication in the log domain: it takes the sign bit of the floating-point value, shifts it down to the exponent field, and adds it directly to the bit pattern, which doubles the value on the assumption that the exponent won't overflow. It also assumes that the in-memory format is uint32_t-compatible IEEE-754 binary32.
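If the aliasing assumption is a concern, the same trick can be written with memcpy (add <string.h> alongside <stdint.h>); compilers optimise the copies down to the same bit moves, and the overflow caveat still applies:

for(x=0; x<NX; x++) {
    uint32_t v = 0;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        uint32_t abits, bbits, cbits;
        /* memcpy is the strict-aliasing-safe way to reinterpret float bits. */
        memcpy(&abits, &a[offset], sizeof abits);
        memcpy(&bbits, &b[offset], sizeof bbits);
        v |= (abits & 0x80000000u) >> 8;
        cbits = bbits + v;
        memcpy(&c[offset], &cbits, sizeof cbits);
    }
}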