I have the following function:
void ikj(float (*a)[N], float (*b)[N], float (*c)[N], int n) {
int i, j, k;
float r;
papi_start();
for (i = 0; i < n; i++) {
for (k = 0; k < n; k++) {
r = a[i][k];
for (j = 0; j < n; j++)
c[i][j] += r * b[k][j];
}
}
papi_stop();
}
And i'm using PAPI
to count how many loads and stores i have between papi_start()
and papi_stop()
and the results i have are the following:
Loads (using PAPI_LD_INS
):
32 26781
64 205053
128 1606077
256 12714815
512 101189551
1024 807406950
2048 6450848188
Stores (using PAPI_SR_INS
):
32 8290
64 65698
128 524578
256 4194850
512 33555490
1024 268437701
2048 2147487778
where the first value is the size of N and the second value is the number of instructions.
i'm compiling with O3 and the sizes of my cache are L1 = 32KB x 2(Instruction and Data, 8-way) and L2 = 1024KB (4-way) (shared for 2 cores).. my cpu is Intel T3200 and have SSE3..
I know that O3 optimizes the code so it'll use prefetching among other features and since i'm loading contiguous addresses and my cache has a 64 bytes line size i'm loading 16 floats at once, but my calculations don't reach this values, so can anybody explain this to me?
EDIT: This are my assembly files, sorry to just throw them here, but i never worked with assembly and i can't really understand any of it:
http://dl.dropboxusercontent.com/u/878621/mmc.s http://dl.dropboxusercontent.com/u/878621/mmc_asm.s
Thanks!
Looking at stores, the number you are getting is pretty close to
N**3 / 4
. We'd expect it to be O(N**3), obviously.That suggests that 4 float writes are coalesced into one of whatever PAPI_SR_INS is measuring. Looking at it alternatively you're counting the number of 16 byte writes.
Similarly the number of loads is roughly
3/4 N**3
. The dominant term should be the load from b and c inside the innermost loop, which would be 2 reads per iteration. To be honest I can't make much sense of that.If you don't know exactly what you're measuring, and you don't correlate it with the generated code, it's pretty hard to predict the measurement.
EDIT: the numbers appear to correlate to the load and store instructions executed, but not to the number of L1, L2, etc transactions or misses - so unlikely to correlate to actual performance. Isn't the time taken a better number to worry about? Given the complexity of modern CPU architecture I'd trust measurement over prediction any day.