Is it possible to get a similar level of SSE2/AVX performance out of GCC?
Intel Compiler 15 appears to auto-vectorize much more effectively. As a benchmark I used the classic flops.c (https://github.com/AMDmi3/flops/blob/master/flops.c).
Here are the results on my Intel Xeon E5-2690 (Sandy Bridge):
Intel Compiler 15 [ /O2 /arch:AVX /fp:fast ]

    FLOPS C Program (double Precision), V2.0 18 Dec 1992

    Module     Error        RunTime      MFLOPS
                             (usec)
      1     -2.5613e-010     0.0034    4177.1562
      2     -1.4166e-013     0.0058    1209.1768
      3      3.1904e-010     0.0011   15487.5445
      4      9.0594e-014     0.0011   14065.9341
      5     -6.2284e-014     0.0034    8652.6807
      6      3.3640e-014     0.0021   13994.3450
      7      9.4360e-012     0.0101    1193.4732
      8      3.7637e-014     0.0022   13677.6492

    Iterations      = 512000000
    NullTime (usec) = 0.0000
    MFLOPS(1)       = 1730.8542
    MFLOPS(2)       = 2971.1755
    MFLOPS(3)       = 6296.4960
    MFLOPS(4)       = 14153.0984

GCC 6.1.0 [ -m32 -mavx -Ofast ]

    FLOPS C Program (double Precision), V2.0 18 Dec 1992

    Module     Error        RunTime      MFLOPS
                             (usec)
      1      1.8119e-013     0.0034    4177.1562
      2     -1.4166e-013     0.0055    1283.6676
      3      8.2157e-015     0.0013   13600.0000
      4      1.8874e-015     0.0023    6655.1127
      5     -2.7645e-014     0.0048    6060.4082
      6      5.1903e-014     0.0041    7159.1128
      7     -8.4583e-011     0.0200     598.5387
      8     -1.4488e-014     0.0041    7293.4473

    Iterations      = 512000000
    NullTime (usec) = 0.0000
    MFLOPS(1)       = 1823.5616
    MFLOPS(2)       = 1585.2039
    MFLOPS(3)       = 3663.4158
    MFLOPS(4)       = 7799.1296

Something tells me I forgot to enable some special switch in GCC.
P.S. Yes, I know that the Intel Compiler runs with reduced precision.