I've spent the past few days reading about autovectorization with gcc 4.7. I followed some examples I saw online, and the setup seems to be correct. But when I actually run the code and compare runtimes with vectorization on and off, there isn't a noticeable difference.
Here's the code I've been working with:
#include <string.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <stdio.h>
#include <math.h>

int main(int argc, char **argv) {
    if (argc < 3)
        return 1;                        /* index is read from the second argument */
    long b = strtol(argv[2], NULL, 0);
    unsigned long long int i;
    unsigned long long int n = 1ULL << 29;   /* 2^29 elements */

    /* 16-byte-aligned allocations so the SSE loads/stores can be aligned */
    float *__restrict__ x1;
    float *__restrict__ y1;
    posix_memalign((void **)&x1, 16, sizeof(float) * n);
    posix_memalign((void **)&y1, 16, sizeof(float) * n);

    /* tell the compiler the pointers really are 16-byte aligned */
    float *__restrict__ x = __builtin_assume_aligned(x1, 16);
    float *__restrict__ y = __builtin_assume_aligned(y1, 16);

    for (i = 0; i < n; i++) {
        x[i] = i;
        y[i] = i;
    }
    for (i = 0; i < n; i++) {            /* the loop I expect to be vectorized */
        y[i] += x[i];
    }

    printf("y[%li]: \t\t\t\t%f\n", b, y[b]);
    printf("correct answer: \t\t\t%f\n", (double)(b * 2)); /* %f needs a double, not a long */
    return 0;
}
Some of this seems redundant to me, but it was necessary to get the compiler to understand what was going on (especially the fact that the data are aligned). The b variable read from the command line is only there because I was paranoid about the compiler optimizing the whole loop away.
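(As an aside, another way to keep the loop alive is a compiler barrier; this is just a sketch, not what I ran:)
/* sketch: an empty asm with a "memory" clobber forces GCC to assume
   the array contents may be read here, so the loop can't be deleted */
__asm__ __volatile__("" : : "r"(y) : "memory");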
Here is the compiler command when vectorization is enabled:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -ftree-vectorize -o v
Basically, this is equivalent to just using -O3. I spelled the flags out myself so that all I had to do was remove -ftree-vectorize to test the result sans vectorization.
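(To sanity-check that this flag list really matches -O3, gcc can report which optimizers a given command line enables; something like:)
gcc47 -O3 -Q --help=optimizers | grep vectorize
(which should list -ftree-vectorize as [enabled] under -O3.)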
Here is the output of the -ftree-vectorizer-verbose flag, to show that the loop is in fact being vectorized:
Analyzing loop at test.c:29
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_load_cost: aligned.
29: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
29: vect_model_store_cost: aligned.
29: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
29: cost model: Adding cost of checks for loop versioning aliasing.
29: Cost model analysis:
Vector inside of loop cost: 4
Vector outside of loop cost: 4
Scalar iteration cost: 4
Scalar outside cost: 1
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 2
29: Profitability threshold = 3
Vectorizing loop at test.c:29
29: Profitability threshold is 3 loop iterations.
29: created 1 versioning for alias checks.
29: LOOP VECTORIZED.
Analyzing loop at test.c:24
24: vect_model_induction_cost: inside_cost = 2, outside_cost = 2 .
24: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
24: not vectorized: relevant stmt not supported: D.5806_18 = (float) D.5823_58;
test.c:7: note: vectorized 1 loops in function.
Note that vectorization becomes profitable after 3 iterations, and I'm running with 2^29 ≈ 537,000,000 iterations. So I should expect a vastly different runtime with vectorization turned off, right?
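(If I'm reading the cost model output right, the threshold follows from the printed numbers: a scalar iteration costs 4, while a vector iteration also costs 4 but handles 4 floats, plus an outside cost of 4. So n elements cost roughly 4n + 1 scalar versus n + 4 vectorized; these cross at n = 1, which matches "minimum iters for profitability: 2", with the threshold of 3 presumably adding slack for the alias-versioning check.)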
Well, here are the runtimes of the code (I ran it 20 times in a row):
59.082s
79.385s
57.557s
57.264s
53.588s
54.300s
53.645s
69.044s
57.238s
59.366s
56.314s
55.224s
57.308s
57.682s
56.083s
369.590s
59.963s
55.683s
54.979s
62.309s
Throwing away that weird ~370s outlier gives a mean runtime of 58.7s, with a standard deviation of 6.0s.
Next, I'll compile with the same command as before, but without the -ftree-vectorize flag:
gcc47 -ftree-vectorizer-verbose=3 -msse2 -lm -O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -fipa-cp-clone test.c -o nov
Again running the program 20 times in a row yields the following times:
69.471s
57.134s
56.240s
57.040s
55.787s
56.530s
60.010s
60.187s
324.227s
56.377s
55.337s
54.110s
56.164s
59.919s
493.468s
63.876s
57.389s
55.553s
54.908s
56.828s
Again throwing away the outliers, this gives a mean runtime of 57.9s with a standard deviation of 3.6s.
So these two versions have statistically indistinguishable runtimes.
Can anyone point me to what I'm doing wrong? Does the "profitability threshold" reported by the compiler not mean what I think it means? I'd really appreciate any help; I've been trying to figure this out for the past week.
EDIT:
I implemented the change that @nilspipenbrinck suggested, and it seems to have worked. I stuck the vectorized loop in a function and called that function a boatload of times. The runtimes are now 24.0s (sigma < 0.1s) without vectorization vs. 20.8s (sigma < 0.2s) with vectorization, a 13% speed improvement. Not as much as I was hoping for, but at least now I know it's working! Thanks for taking the time to look at my question and write an answer; I really appreciate it.
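For reference, the restructured test looks roughly like this (a sketch, not my exact code; the array size of 4096 floats and the repetition count are illustrative values chosen so the working set stays in L1):
#include <stdlib.h>
#include <stdio.h>

/* keep the kernel in its own function so each call is a fresh,
   cache-hot pass over a small array */
__attribute__((noinline))
static void add_arrays(float *__restrict__ x, float *__restrict__ y,
                       unsigned long long n) {
    unsigned long long i;
    float *__restrict__ xa = __builtin_assume_aligned(x, 16);
    float *__restrict__ ya = __builtin_assume_aligned(y, 16);
    for (i = 0; i < n; i++)
        ya[i] += xa[i];
}

int main(void) {
    unsigned long long i, reps;
    unsigned long long n = 4096;              /* 16 KB per array: fits in L1 */
    float *__restrict__ x;
    float *__restrict__ y;
    posix_memalign((void **)&x, 16, sizeof(float) * n);
    posix_memalign((void **)&y, 16, sizeof(float) * n);
    for (i = 0; i < n; i++) { x[i] = i; y[i] = i; }

    for (reps = 0; reps < 1000000; reps++)    /* "a boatload of times" */
        add_arrays(x, y, n);

    printf("y[%llu]: %f\n", n - 1, y[n - 1]); /* use the result so it isn't dead code */
    return 0;
}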
You don't do much arithmetic, so the runtime of your test code is memory bound, i.e., you spend most of the time moving data between the CPU and main memory.
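To put rough numbers on it: the add loop does one float addition per element but moves 12 bytes per element (load x[i], load y[i], store y[i] back). Over 2^29 elements that is roughly 6.4 GB of memory traffic for about half a billion additions. SSE can quadruple the arithmetic rate, but it does nothing to speed up moving those 6.4 GB.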
Furthermore, your n is very large at 2^29 elements, so you get no benefit from the first- and second-level caches at all.
If you want to see an improvement from SSE, use a smaller n so that you only touch 8 or 16 kilobytes of data, and make sure the data is 'hot', i.e., recently accessed by the CPU. That way the data doesn't have to come from main memory; it comes from the caches, which are several orders of magnitude faster.
As an alternative, you could do a lot more arithmetic per element. That would give the memory prefetcher a chance to fetch data from main memory in the background while the CPU is busy doing math, as in the sketch below.
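For example, dropping something like this in place of the y[i] += x[i] loop in your program (the polynomial is arbitrary, just to add FPU work per element):
/* sketch: ~6 float ops per element instead of 1, so the prefetcher
   can stream in the next cache lines while the FPU is busy */
for (i = 0; i < n; i++) {
    float t = x[i];
    y[i] += ((t * 0.5f + 1.0f) * t + 2.0f) * t;  /* Horner form: 3 mul + 3 add */
}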
In summary: if the arithmetic is faster than your system can move the data around, you will not see any benefit. Memory access time is the bottleneck, and the few cycles you save using the SSE instruction set get lost in the noise of the memory access timings.