Justifying memory access counts in a loop


I have the following function:

void ikj(float (*a)[N], float (*b)[N], float (*c)[N], int n) {

    int i, j, k;
    float r;

    papi_start();

    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {

            r = a[i][k];

            for (j = 0; j < n; j++)
                c[i][j] += r * b[k][j];

        }
    }

    papi_stop();

}

I'm using PAPI to count how many loads and stores happen between papi_start() and papi_stop(), and the results I get are the following:

Loads (using PAPI_LD_INS):

32 26781
64 205053
128 1606077
256 12714815
512 101189551
1024 807406950
2048 6450848188

Stores (using PAPI_SR_INS):

32 8290
64 65698
128 524578
256 4194850
512 33555490
1024 268437701
2048 2147487778

where the first value is the size of N and the second value is the number of instructions.
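
For reference, papi_start() and papi_stop() are just thin helper wrappers; counting a single preset event with PAPI's high-level counter API looks roughly like this (a simplified sketch, not my exact code; error handling omitted):

#include <stdio.h>
#include <papi.h>

static int event[1] = { PAPI_LD_INS };   /* or PAPI_SR_INS for the store runs */
static long long count[1];

static void papi_start(void) {
    /* start counting the chosen preset event */
    PAPI_start_counters(event, 1);
}

static void papi_stop(void) {
    /* stop counting and print the accumulated value */
    PAPI_stop_counters(count, 1);
    printf("%lld\n", count[0]);
}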

I'm compiling with -O3. My cache sizes are L1 = 32 KB × 2 (instruction and data, 8-way) and L2 = 1024 KB (4-way, shared between the 2 cores). My CPU is an Intel T3200, which has SSE3.

I know that -O3 optimizes the code, so it will use prefetching among other features, and since I'm loading contiguous addresses and my cache has a 64-byte line size, each line fill brings in 16 floats at once. But my own calculations don't reach these values, so can anybody explain them to me?
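
To show what I mean: the inner statement c[i][j] += r * b[k][j] does 2 reads (b and c) and 1 write (c) per iteration, so a plain scalar count would be about 2·N³ loads and N³ stores; for N = 2048 that is roughly 17.2 billion loads and 8.6 billion stores. If instead each 64-byte line (16 floats) were counted only once, I'd expect about 1/16 of those numbers. The measured values fall between these two estimates.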

EDIT: These are my assembly files; sorry to just throw them here, but I've never worked with assembly and I can't really understand any of it:

http://dl.dropboxusercontent.com/u/878621/mmc.s
http://dl.dropboxusercontent.com/u/878621/mmc_asm.s

Thanks!


There is 1 answer.

Alan Stokes (accepted answer):

Looking at stores, the number you are getting is pretty close to N**3 / 4. We'd expect it to be O(N**3), obviously.

That suggests that 4 float writes are being coalesced into one of whatever PAPI_SR_INS measures. Put another way, you're counting the number of 16-byte writes.
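
That would be consistent with gcc vectorizing the innermost loop with SSE. As a sketch only (hand-written intrinsics to illustrate the idea, not what your compiler actually emitted; check your .s files for that), a 4-wide version of the j loop looks roughly like this, with crow = c[i] and brow = b[k]:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Illustration only: assumes n is a multiple of 4 and skips the
   alignment/remainder handling a compiler would add. */
static void inner_sse(float *crow, const float *brow, float r, int n) {
    __m128 vr = _mm_set1_ps(r);                   /* broadcast a[i][k] */
    for (int j = 0; j < n; j += 4) {
        __m128 vb = _mm_loadu_ps(&brow[j]);       /* 16-byte load of b[k][j..j+3] */
        __m128 vc = _mm_loadu_ps(&crow[j]);       /* 16-byte load of c[i][j..j+3] */
        vc = _mm_add_ps(vc, _mm_mul_ps(vr, vb));  /* c += r * b, four lanes at once */
        _mm_storeu_ps(&crow[j], vc);              /* one store covers 4 floats of c */
    }
}

One 16-byte store per 4 values of j matches the N**3 / 4 store count you measured.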

Similarly, the number of loads is roughly 3/4 N**3. The dominant term should be the loads of b and c inside the innermost loop, which would be 2 reads per iteration. To be honest, I can't make much sense of that.
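
Just to put numbers on it, at N = 2048: N**3 = 8,589,934,592, so N**3 / 4 = 2,147,483,648 against your measured 2,147,487,778 stores, and 3/4 N**3 = 6,442,450,944 against your measured 6,450,848,188 loads.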

If you don't know exactly what you're measuring, and you don't correlate it with the generated code, it's pretty hard to predict the measurement.

EDIT: the numbers appear to correspond to the load and store instructions executed, but not to the number of L1, L2, etc. transactions or misses, so they are unlikely to correlate with actual performance. Isn't the time taken a better number to worry about? Given the complexity of modern CPU architecture, I'd trust measurement over prediction any day.