I am seeing the following issue on an ARMv8 Cortex-A53 processor. I have two functions: streaming_work, which writes to an array with no cache reuse, and non_streaming_work, which takes that array as input and does reuse it.
If I run only non_streaming_work (i.e. with the STREAMING flag turned off), perf counters report 30k fewer cache misses than when streaming_work runs first. This behavior ONLY happens when streaming_work writes to the particular array that non_streaming_work reads from.
I hypothesize this is because the processor learns it is not profitable to cache that address range during the streaming loop, and thus does not cache loads during some part of non_streaming_work, after which it unlearns the behavior.
I was wondering,
- Do my conclusions seem reasonable / is there something I'm missing?
- How can I tell the processor to cache loads for that address range, no matter what? I found the rprfm instruction, but it requires strided access and a lot more information. Is this the correct solution, or is there a simpler way to just indicate a caching policy to the processor?
Thank you!
__attribute__((noinline))
void streaming_work(float* in, int16_t* out) {
    for (int i = 0; i < N; ++i) {
        out[i] = (int16_t) CLAMP(in[i] * (1 << 9));
    }
}
__attribute__((noinline))
void non_streaming_work(int16_t* w, int16_t* in, int32_t* out) {
    for (int i = 0; i < N - 20; ++i) {
        for (int k = 0; k < 20; ++k) {
            out[i + k/2] += w[k] * in[i + k];
        }
    }
}
int main() {
    //...
#if defined(STREAMING)
    streaming_work(orig, in);
#endif
    long long bef_time = get_time_usecs();
    __sync_synchronize();
    non_streaming_work(wt, in, out);
    __sync_synchronize();
    long long aft_time = get_time_usecs();
}
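
Regarding the second question, one simpler thing to try before rprfm is a plain software prefetch hint on the input array. The following is only a minimal sketch, assuming GCC or Clang, where __builtin_prefetch with rw = 0 and locality = 3 typically lowers to a prfm pldl1keep on AArch64. The names non_streaming_work_ref, non_streaming_work_prefetch, TAPS, LINE_ELEMS and PREFETCH_DIST are hypothetical, N is given an arbitrary value because the post does not show it, and a 64-byte cache line is assumed. Whether this actually removes the extra misses has not been measured here.

```c
#include <assert.h>
#include <stdint.h>

#define N 4096            /* assumption: N is not shown in the original post */
#define TAPS 20           /* inner loop length used in non_streaming_work */
#define LINE_ELEMS 32     /* assumption: 64-byte cache line / sizeof(int16_t) */
#define PREFETCH_DIST 128 /* hypothetical distance in elements; needs tuning */

/* Reference version: same arithmetic as non_streaming_work in the post. */
__attribute__((noinline))
void non_streaming_work_ref(const int16_t *w, const int16_t *in, int32_t *out) {
    assert(w && in && out);
    for (int i = 0; i < N - TAPS; ++i) {
        for (int k = 0; k < TAPS; ++k) {
            out[i + k / 2] += w[k] * in[i + k];
        }
    }
}

/* Variant that issues one read prefetch per assumed cache line of `in`. */
__attribute__((noinline))
void non_streaming_work_prefetch(const int16_t *w, const int16_t *in, int32_t *out) {
    assert(w && in && out);
    for (int i = 0; i < N - TAPS; ++i) {
        /* Hint only: rw = 0 (read), locality = 3 (keep in all cache levels).
           The index is bounds-checked so the address stays inside the array. */
        if ((i % LINE_ELEMS) == 0 && i + PREFETCH_DIST < N) {
            __builtin_prefetch(&in[i + PREFETCH_DIST], 0, 3);
        }
        for (int k = 0; k < TAPS; ++k) {
            out[i + k / 2] += w[k] * in[i + k];
        }
    }
}
```

Swapping non_streaming_work_prefetch in for non_streaming_work in main, with and without STREAMING defined, would show under the same perf counters whether an explicit hint changes the miss count. The hint does not alter results, so both versions should produce identical output.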