The GCC Vector Extensions provide an abstraction of SIMD instructions.
I am wondering how to use them for string processing, e.g. to mask each byte of a buffer:
#include <stdint.h>

typedef uint8_t v32ui __attribute__ ((vector_size(32)));

void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin < end; begin += 32, o += 32)
        *(v32ui *) o = (*(v32ui *) begin) & 0x0fu;
}
Assuming that the input and output buffers are properly aligned (to 32 bytes), is such casting supported and well defined with the GCC vector extensions?
And is this the most efficient way to use the vector extensions on strings?
Or do I have to explicitly store/retrieve parts of the string into the vectors?
For example like this:
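#include <string.h>

void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin < end; begin += 32, o += 32) {
        v32ui t;
        memcpy(&t, begin, 32);  /* load 32 bytes into the vector */
        t &= 0x0fu;             /* same mask as the first version */
        memcpy(o, &t, 32);      /* store the result */
    }
}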
Or are there better/more efficient ways than using memcpy?
And assuming that the input or output buffer (or both) is unaligned, how can the vector extensions then be used safely/efficiently for string processing?
Vectors need to be processed in registers, so memcpy can't possibly be useful here.
If auto-vectorization doesn't generate good code, the standard technique is to use vector intrinsics. If you can do what you need with ops that could compile to SIMD instructions on multiple architectures, then yeah, gcc vector syntax might be a good approach.
I tried out your first version with gcc 4.9.2. It generates exactly what you'd hope for, with 64bit AVX. (256bit load, vector and, store).
Without a -march option or anything, just using baseline amd64 (SSE2), it copies the input to a buffer on the stack, and loads from there. I think it's doing this in case of unaligned input/output buffers, instead of just using movdqu. Anyway, it's really horribly slow code, and it would be way faster to do 8 bytes at a time in GP registers than this nonsense.
In the output of gcc -march=native -O3 -S v32ui_and.c (on a Sandybridge, which has AVX without AVX2), note the lack of scalar cleanup, or handling of unaligned data. vmovdqu is as fast as vmovdqa when the address is aligned, so it's a bit silly not to use it.
The output of gcc -O3 -S v32ui_and.c is weird.
So I guess you can't safely use gcc vector extensions if it's sometimes going to generate code this bad. With intrinsics, the simplest implementation would be:
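A minimal sketch of such an intrinsics version, assuming an x86 CPU with AVX2 and a length that is a multiple of 32 bytes (there is no scalar cleanup). The name f_avx2 is hypothetical, and the target attribute is only there so the function builds without -mavx2; the caller is assumed to have checked for AVX2 support. The codegen remarks in this answer were not checked against this particular sketch.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical name. target("avx2") lets this one function use AVX2
   even when the file is compiled without -mavx2 (assumption: GCC or
   Clang on x86). Call it only on CPUs that actually support AVX2. */
__attribute__((target("avx2")))
void f_avx2(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    /* Broadcast the byte mask 0x0f to all 32 lanes. */
    const __m256i mask = _mm256_set1_epi8(0x0f);

    /* Whole 32-byte blocks only; any tail shorter than 32 is left alone. */
    for (; end - begin >= 32; begin += 32, o += 32) {
        /* Unaligned load/store, so no alignment is required of the caller. */
        __m256i v = _mm256_loadu_si256((const __m256i *)begin);
        v = _mm256_and_si256(v, mask);          /* VPAND, needs AVX2 */
        _mm256_storeu_si256((__m256i *)o, v);
    }
}
```

A caller could guard the call with __builtin_cpu_supports("avx2") and fall back to a scalar loop otherwise.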
This generates identical code to the gcc-vector version (compiled with AVX2). Note this uses VPAND, not VANDPS, so it requires AVX2.
With large buffers, it would be worth doing a scalar startup until either the input or output buffer is aligned to 16 or 32 bytes, then the vector loop, then any scalar cleanup needed. With small buffers, just using unaligned loads/stores and a simple scalar cleanup at the end would be best.
Since you asked about strings specifically, if your strings are nul-terminated (implicit-length), you have to be careful when crossing page boundaries that you don't fault if the string ends before the end of a page, but your read spans the boundary.
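One way to guard against that is to check, before each vector load, whether the 32 bytes starting at the pointer stay within one page, and to fall back to byte-at-a-time processing when they don't. A minimal sketch, assuming 4096-byte pages (true for typical x86 Linux setups, but an assumption all the same); the names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */
#define VEC_BYTES         32u    /* width of one vector load */

/* Returns true if reading VEC_BYTES bytes starting at p would touch
   the next page, i.e. the read spans a page boundary. */
static inline bool load_crosses_page(const void *p)
{
    uintptr_t offset = (uintptr_t)p & (ASSUMED_PAGE_SIZE - 1u);
    return offset > ASSUMED_PAGE_SIZE - VEC_BYTES;
}
```

When this returns false, the load cannot fault because of a page boundary, even if the terminator sits somewhere inside those 32 bytes. Reading past the end of the string's object is still outside what ISO C defines, so this relies on how the hardware and compiler behave in practice.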