How can I mitigate the performance impact of transposed array access order?


I have an algorithm, written in C, which processes two 2-dimensional input arrays to produce a third 2-dimensional array. All three arrays contain 32-bit floats and share the same size, Y x X, where Y might be a few tens but X is a million or so.

Unfortunately:

  • all the arrays must be in row-major order (scanning through X accesses contiguous memory),
  • the algorithm requires the innermost loop to scan over the Y dimension.

Perhaps unsurprisingly, accessing data in this non-contiguous fashion is relatively slow. So...

What can I do to mitigate the performance impact of the non-contiguous memory accesses?

(NB. It was a long shot, but I've tried various patterns of prefetch instructions to bring in upcoming columns, but all to no avail.)
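For illustration, one such pattern looks like this sketch, using GCC's __builtin_prefetch on the inner loop of the demo code below (the lookahead of 16 floats, i.e. one 64-byte line, and the locality hint are just knobs to vary, not a recommendation):

for(x=0; x<NX; x++) {
    v = 1;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Hint the lines of an upcoming column: rw = 0 (read),
           locality = 1 (low temporal reuse). */
        __builtin_prefetch(&a[offset + 16], 0, 1);
        __builtin_prefetch(&b[offset + 16], 0, 1);
        if(a[offset] < 0) {
            v = 2;
        }
        c[offset] = v * b[offset];
    }
}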

The following (updated) code demonstrates the problem:

#include <stdio.h>
#include <stdlib.h>

#define NX 1000000
#define NY 30

int main(void) {
    float *a = malloc(sizeof(float) * NY * NX);
    float *b = malloc(sizeof(float) * NY * NX);
    float *c = malloc(sizeof(float) * NY * NX);
    if(!a || !b || !c)
        return 1;

    size_t y, x, offset;
    float v;

    /* Fill the inputs so the reads below are defined; the mixed
       signs in a[] keep the branch data-dependent. */
    for(offset=0; offset<(size_t)NY*NX; offset++) {
        a[offset] = (offset % 7 == 0) ? -1.0f : 1.0f;
        b[offset] = 2.0f;
    }

    for(x=0; x<NX; x++) {
        v = 1;
        for(y=0; y<NY; y++) {
            /* Row-major layout: stepping y strides NX floats. */
            offset = x + NX * y;
            if(a[offset] < 0) {
                v = 2;
            }
            c[offset] = v * b[offset];
        }
    }

    free(a);
    free(b);
    free(c);
    return 0;
}

On a test machine with an Intel Xeon E5520 @ 2.27 GHz this takes ~1 s to execute, even though it only reads ~220 MB and writes ~110 MB.

Answer by sh1:

It looks like your access pattern shouldn't be that harmful. This makes me wonder if branch prediction is your real problem.
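A quick way to test that hypothesis is to replace the branch with a select, which compilers usually turn into a conditional move. This sketch computes exactly what your original loop computes:

for(x=0; x<NX; x++) {
    v = 1;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Branchless: eligible for cmov, so nothing to mispredict. */
        v = (a[offset] < 0) ? 2.0f : v;
        c[offset] = v * b[offset];
    }
}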

Normally transposed data access is done in chunks to keep the cache healthy, but your input is so short on the inner-loop axis that the cached read of the first row should still be valid by the time you revisit it in your outer loop.

You have three arrays, each 30 rows high, and a cache line might be as much as 128 bytes (I'd expect smaller on your CPU, but things change). That's only about 12 kB of cache (3 x 30 x 128 bytes) that you need for the top row's lines to stay resident.

You could try changing v to a small array and proceeding in vertical stripes, though. Even if this didn't actually help your cache utilisation, it would at least give a hint to the compiler that it could be optimised with SIMD.
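A minimal sketch of that idea, with a stripe width of 8 picked arbitrarily (it's a tuning parameter, and this assumes NX is a multiple of it, which 1000000 is):

#define STRIPE 8  /* hypothetical stripe width; assumes NX % STRIPE == 0 */

for(size_t x0=0; x0<NX; x0+=STRIPE) {
    float v[STRIPE];
    for(size_t i=0; i<STRIPE; i++)
        v[i] = 1;
    for(size_t y=0; y<NY; y++) {
        for(size_t i=0; i<STRIPE; i++) {
            /* The inner loop now touches STRIPE contiguous floats per row. */
            size_t offset = (x0 + i) + NX * y;
            v[i] = (a[offset] < 0) ? 2.0f : v[i];
            c[offset] = v[i] * b[offset];
        }
    }
}

Writing the update as a select rather than a branch also removes the control dependency from the inner loop, which makes it a clean candidate for auto-vectorisation.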

You might also try this dangerous optimisation to eliminate branches:

/* Needs #include <stdint.h>. The casts assume float can be aliased
   as uint32_t, which strict aliasing formally forbids. */
for(x=0; x<NX; x++) {
    uint32_t v = 0;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        /* Latch the sign bit, shifted down to the exponent's LSB (1 << 23). */
        v |= (((uint32_t *)a)[offset] & 0x80000000u) >> 8;
        /* Adding 1 << 23 to the bit pattern doubles a finite float. */
        ((uint32_t *)c)[offset] = ((uint32_t *)b)[offset] + v;
    }
}

This does the multiplication in the log domain: it takes the sign bit of the floating-point value, shifts it down to the exponent field, and adds it directly to the bit pattern, which doubles the value on the assumption that the exponent won't overflow. It also assumes that the in-memory format is uint32_t-compatible IEEE-754 binary32.
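If the aliasing assumption is a concern, the same trick can be written with memcpy (add <string.h> alongside <stdint.h>); compilers optimise the copies down to the same bit moves, and the overflow caveat still applies:

for(x=0; x<NX; x++) {
    uint32_t v = 0;
    for(y=0; y<NY; y++) {
        offset = x + NX * y;
        uint32_t abits, bbits, cbits;
        /* memcpy is the strict-aliasing-safe way to reinterpret float bits. */
        memcpy(&abits, &a[offset], sizeof abits);
        memcpy(&bbits, &b[offset], sizeof bbits);
        v |= (abits & 0x80000000u) >> 8;
        cbits = bbits + v;
        memcpy(&c[offset], &cbits, sizeof cbits);
    }
}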