Store, modify and retrieve strings with GCC Vector Extensions?


The GCC Vector Extensions provide an abstraction of SIMD instructions.

I am wondering how to use them for string processing, e.g. to mask each byte of a buffer:

#include <stdint.h>

typedef uint8_t v32ui __attribute__ ((vector_size(32)));

void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin < end; begin += 32, o += 32)
      *(v32ui*) o = (*(v32ui*) begin) & 0x0fu;
}

Assuming that the input and output buffers are properly aligned (to 32 bytes), is such casting supported and well defined with the GCC vector extensions?

And is this the most efficient way to use the vector extensions on strings?

Or do I have to explicitly store/retrieve parts of the string into the vectors?

For example like this:

#include <string.h>  // for memcpy

void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin < end; begin += 32, o += 32) {
      v32ui t;
      memcpy(&t, begin, 32);
      t &= 0x0fu;
      memcpy(o, &t, 32);
    }
}

Or are there better/more efficient ways than using memcpy?

And assuming that the input or output buffer (or both) are unaligned, how can the vector extensions then be used safely and efficiently for string processing?

1 answer

Answer by Peter Cordes (accepted):

Vectors need to be processed in registers, so memcpy can't possibly be useful here.

If auto-vectorization doesn't generate good code, the standard technique is to use vector intrinsics. If you can do what you need with ops that could compile to SIMD instructions on multiple architectures, then yeah, gcc vector syntax might be a good approach.

I tried out your first version with gcc 4.9.2. It generates exactly what you'd hope for in 64bit mode with AVX: a 256bit load, vector AND, and store.

Without a -march or anything, just using baseline amd64 (SSE2), it copies the input to a buffer on the stack and loads from there. I think it's doing this in case of unaligned input/output buffers, instead of just using movdqu. Anyway, it's really horribly slow code, and it would be way faster to do 8 bytes at a time in GP registers than this nonsense.
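For comparison, here is a rough sketch (mine, not part of the original answer; the name f_scalar64 is made up) of what masking 8 bytes at a time in a GP register could look like. The fixed-size memcpy calls are just the portable way to express unaligned 64bit loads/stores; compilers turn them into single mov instructions.

#include <stdint.h>
#include <string.h>

// Hypothetical scalar fallback: mask 8 bytes per iteration in a 64bit
// general-purpose register.  The fixed-size memcpy calls compile to plain
// 64bit loads/stores, so unaligned buffers are fine.
void f_scalar64(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    for (; begin + 8 <= end; begin += 8, o += 8) {
        uint64_t x;
        memcpy(&x, begin, 8);
        x &= 0x0f0f0f0f0f0f0f0fULL;   // mask every byte at once
        memcpy(o, &x, 8);
    }
    for (; begin < end; ++begin, ++o)  // leftover tail bytes
        *o = *begin & 0x0f;
}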

gcc -march=native -O3 -S v32ui_and.c (on a Sandybridge (AVX without AVX2)):

        .globl  f
f:
        cmpq    %rsi, %rdi
        jnb     .L6
        vmovdqa .LC0(%rip), %ymm1  # load a vector of 0x0f bytes
        .p2align 4,,10
        .p2align 3
.L3:
        vandps  (%rdi), %ymm1, %ymm0
        addq    $32, %rdi
        vmovdqa %ymm0, (%rdx)
        addq    $32, %rdx
        cmpq    %rdi, %rsi
        ja      .L3
        vzeroupper
.L6:
        ret

Note the lack of scalar cleanup, or handling of unaligned data. vmovdqu is as fast as vmovdqa when the address is aligned, so it's a bit silly not to use it.

gcc -O3 -S v32ui_and.c is weird.

        .globl  f
f:
.LFB0:
        cmpq    %rsi, %rdi
        movdqa  .LC0(%rip), %xmm0  # load a vector of 0x0f bytes
        jnb     .L9
        leaq    8(%rsp), %r10
        andq    $-32, %rsp
        pushq   -8(%r10)
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   %r10
        .p2align 4,,10
        .p2align 3
.L5:
        movq    (%rdi), %rax
        addq    $32, %rdi
        addq    $32, %rdx
        movq    %rax, -80(%rbp)
        movq    -24(%rdi), %rax
        movq    %rax, -72(%rbp)
        movq    -16(%rdi), %rax
        movdqa  -80(%rbp), %xmm1
        movq    %rax, -64(%rbp)
        movq    -8(%rdi), %rax
        pand    %xmm0, %xmm1
        movq    %rax, -56(%rbp)
        movdqa  -64(%rbp), %xmm2
        pand    %xmm0, %xmm2
        movaps  %xmm1, -112(%rbp)
        movq    -112(%rbp), %rcx
        movaps  %xmm2, -96(%rbp)
        movq    -96(%rbp), %rax
        movq    %rcx, -32(%rdx)
        movq    -104(%rbp), %rcx
        movq    %rax, -16(%rdx)
        movq    -88(%rbp), %rax
        movq    %rcx, -24(%rdx)
        movq    %rax, -8(%rdx)
        cmpq    %rdi, %rsi
        ja      .L5
        popq    %r10
        popq    %rbp
        leaq    -8(%r10), %rsp
.L9:
        rep ret

So I guess you can't safely use gcc vector extensions if it's sometimes going to generate code this bad. With intrinsics, the simplest implementation would be:

#include <immintrin.h>
#include <stdint.h>
void f(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    __m256i mask = _mm256_set1_epi8(0x0f);                 // broadcast 0x0f to all 32 bytes
    for (; begin < end; begin += 32, o += 32) {
        __m256i s = _mm256_loadu_si256((__m256i*)begin);   // unaligned 256bit load
        __m256i d = _mm256_and_si256(s, mask);             // byte-wise AND (VPAND)
        _mm256_storeu_si256( (__m256i*)o, d);              // unaligned 256bit store
    }
}

This generates identical code to the gcc-vector version (compiled with AVX2). Note this uses VPAND, not VANDPS, so it requires AVX2.

With large buffers, it would be worth doing a scalar startup loop until either the input or the output buffer is aligned to 16 or 32 bytes, then the vector loop, then any scalar cleanup needed. With small buffers, just using unaligned loads/stores and a simple scalar cleanup at the end would be best.
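A rough sketch of that structure (my illustration, not from the original answer; the name f_unaligned is made up, and it assumes AVX2 for the 256bit integer AND):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical sketch: scalar head until the output pointer is 32-byte
// aligned, then a vector loop (unaligned loads, aligned stores), then a
// scalar tail for whatever is left.
void f_unaligned(const uint8_t *begin, const uint8_t *end, uint8_t *o)
{
    // Scalar head: advance until o is 32-byte aligned (or the buffer ends).
    while (begin < end && ((uintptr_t)o & 31))
        *o++ = *begin++ & 0x0f;

    const __m256i mask = _mm256_set1_epi8(0x0f);
    // Vector loop: full 32-byte blocks only.
    for (; begin + 32 <= end; begin += 32, o += 32) {
        __m256i s = _mm256_loadu_si256((const __m256i *)begin);
        _mm256_store_si256((__m256i *)o, _mm256_and_si256(s, mask));
    }

    // Scalar tail for the last few bytes.
    while (begin < end)
        *o++ = *begin++ & 0x0f;
}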

Since you asked about strings specifically: if your strings are nul-terminated (implicit-length), you have to be careful around page boundaries so you don't fault when the string ends before the end of a page but your read would span into the next (possibly unmapped) page.
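The usual trick is that an aligned 32-byte load can never cross a page boundary, so reading a few bytes past the terminator can't fault (though tools like valgrind may still complain about it). As a rough illustration (mine, not from the original answer; vec_strlen is a made-up name, and it assumes AVX2), an implicit-length scan could look like this:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical sketch: find the length of a nul-terminated string with
// 32-byte vectors.  Aligned 32-byte loads never cross a page boundary,
// so reading past the terminator can't fault.
size_t vec_strlen(const char *s)
{
    const __m256i zero = _mm256_setzero_si256();

    // Round the pointer down to a 32-byte boundary; the first load may
    // contain bytes *before* s, which we shift out of the match mask.
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)31);
    __m256i v = _mm256_load_si256((const __m256i *)p);
    uint32_t m = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
    m >>= (s - p);                    // discard matches before the start of the string
    if (m)
        return (size_t)__builtin_ctz(m);

    for (;;) {
        p += 32;
        v = _mm256_load_si256((const __m256i *)p);
        m = (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(v, zero));
        if (m)
            return (size_t)(p + __builtin_ctz(m) - s);
    }
}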