In the textbook Computer Systems: A Programmer's Perspective there are some impressive benchmarks showing the benefit of row-major order access.
I created a small program to test for myself if a simple change from row-major access to column-major access would make a huge difference on my own machine.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 30000

int a[N][N] = { 0 };

int main(void) {
    srand(time(NULL));

    /* long long: summing ~9e8 values averaging ~49 would overflow int */
    long long sum = 0;

    /* fill the array */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            a[i][j] = rand() % 99;
        }
    }

    /* row-major access: the inner loop walks contiguous memory */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += a[i][j];
        }
    }

    /* print the result so the summing loop cannot be optimized away */
    printf("%lld\n", sum);
    return 0;
}
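The column-major version differs only in the summing loop; swapping the subscripts (or, equivalently, the loop headers) gives something like:

    /* column-major access: the inner loop jumps N ints ahead on every
       step, so it touches a different cache line almost every time */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            sum += a[j][i];
        }
    }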
On average, row-major order access took 8.42 s (n = 5 trials) on my system, whereas column-major order access took 30.12 s (n = 5 trials), which is a pretty significant difference.
On the surface it seems like it should be a pretty simple thing to optimize.
Why don't modern compilers optimize these scenarios?
Most loops do not consist of simple sum statements, but have side effects and dependencies between loop iterations.
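As a made-up illustration (the function name is arbitrary), interchanging the loops below is not a legal transformation: each iteration reads an element that the row-first order has already overwritten, but that a column-first order would not have touched yet.

    /* illustration only: iteration (i, j) reads m[i-1][j+1], which the
       i-outer order updated on the previous pass over the rows, but a
       j-outer order would still hold the old value -- so swapping the
       loops changes the result */
    void propagate(int n, int m[n][n]) {
        for (int i = 1; i < n; i++) {
            for (int j = 0; j + 1 < n; j++) {
                m[i][j] = m[i-1][j+1] + 1;
            }
        }
    }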
Not every operation you might perform in a loop is commutative, so the optimizer would have to understand everything that happens inside the loop to be sure the transformation doesn't change its meaning, including what happens inside any system API that is called, code in dynamically loaded libraries, and so on.
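The same goes for calls the compiler cannot see into. In a sketch like the one below (log_cell and dump are hypothetical names; imagine log_cell lives in a shared library), the optimizer has to assume the call order is observable and therefore cannot reorder the iterations:

    /* log_cell is defined elsewhere, so the compiler cannot prove it is
       free of side effects or that its behaviour is independent of the
       order in which it is called */
    extern void log_cell(int value);

    void dump(int n, int m[n][n]) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                log_cell(m[i][j]);
            }
        }
    }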
Now this is just a guess, but I expect someone tried it out, realized that the optimizer didn't have enough information about the code being run for the transformation to trigger most of the time, and then went on to focus on parallel execution optimizations, which are probably the greater optimization opportunity in most codebases.