I have the following code snippet, for which I have to calculate the arithmetic intensity (AI).
const int N = 8192;
float a[N], b[N], c[N], d[N];
...
#pragma omp parallel for simd
for (int i = 0; i < N; i++)
{
    const float tmp_a = a[i];
    const float tmp_b = b[i];
    c[i] += tmp_a * tmp_b;
    d[i] = tmp_a + tmp_b;
}
Case 1: What will be the AI if tmp_a and tmp_b are in registers?
Case 2: What will be the AI if tmp_a and tmp_b are in RAM or cache?
I know AI is given as the number of floating-point operations divided by the number of bytes transferred. How should the bytes transferred depend on whether the data is stored in RAM, cache, or registers? What additional information do we need to calculate the maximum floating-point throughput achievable by the code?
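To make the definition concrete, here is a minimal sketch of how I would currently count things. It rests on several assumptions that may well be wrong: a float is 4 bytes, the multiply-add in c[i] += tmp_a*tmp_b counts as 2 FLOPs and the add for d[i] as 1, Case 1 moves 5 floats per iteration (loads of a[i], b[i], c[i] and stores of c[i], d[i]), Case 2 additionally re-reads a[i] and b[i] for the second statement, and write-allocate traffic is ignored. The function names and the roofline-style bound with its peak_flops and bandwidth parameters are hypothetical, introduced only for illustration.

```c
#include <assert.h>

/* Assumption: float is 4 bytes, as on common platforms. */
#define BYTES_PER_FLOAT 4.0

/* Assumption: 1 multiply + 1 add for c[i] += tmp_a*tmp_b, 1 add for d[i]. */
#define FLOPS_PER_ITER 3.0

/* Arithmetic intensity = FLOPs / bytes transferred. */
double arithmetic_intensity(double flops, double bytes)
{
    assert(bytes > 0.0);
    return flops / bytes;
}

/* Case 1 (assumed counting): tmp_a and tmp_b stay in registers.
 * Loads: a[i], b[i], c[i]. Stores: c[i], d[i]. Total: 5 floats. */
double ai_operands_in_registers(void)
{
    return arithmetic_intensity(FLOPS_PER_ITER, 5.0 * BYTES_PER_FLOAT);
}

/* Case 2 (assumed counting): tmp_a and tmp_b are read again for the
 * second statement, so a[i] and b[i] are each loaded twice.
 * Loads: 2*a[i], 2*b[i], c[i]. Stores: c[i], d[i]. Total: 7 floats. */
double ai_operands_reloaded(void)
{
    return arithmetic_intensity(FLOPS_PER_ITER, 7.0 * BYTES_PER_FLOAT);
}

/* Roofline-style upper bound on throughput in FLOP/s.
 * peak_flops and bandwidth_bytes_per_s are machine parameters that
 * would have to be supplied; they are not given in the question. */
double attainable_flops(double ai, double peak_flops,
                        double bandwidth_bytes_per_s)
{
    const double memory_bound = ai * bandwidth_bytes_per_s;
    return memory_bound < peak_flops ? memory_bound : peak_flops;
}
```

Under this counting, Case 1 gives 3/20 FLOP/byte and Case 2 gives 3/28 FLOP/byte. The last function is my guess at the "additional information": it seems one needs at least the machine's peak floating-point rate and its memory bandwidth at the relevant level of the hierarchy.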