How to load uint8_t "as" 32 bits integer efficiently into a SIMD register?


I have an array of 8 bit integers that I want to process through SIMD instructions. Since those integers will be used along single precision floating point numbers, I actually want to load them in 32 bit lanes instead of the more "natural" 8 bit lanes.

Assuming AVX512, if I have the following array:

std::array< std::uint8_t, 16 > i{ i0, i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14, i15 };

I wish to end up with a __m512i register filled with the following bytes:

[ 0, 0, 0, i0,
    0, 0, 0, i1,
    0, 0, 0, i2,
    0, 0, 0, i3,
    0, 0, 0, i4,
    0, 0, 0, i5,
    0, 0, 0, i6,
    0, 0, 0, i7,
    0, 0, 0, i8,
    0, 0, 0, i9,
    0, 0, 0, i10,
    0, 0, 0, i11,
    0, 0, 0, i12,
    0, 0, 0, i13,
    0, 0, 0, i14,
    0, 0, 0, i15 ]

What is the best way to achieve that? I currently hand-roll it using the following (note the setr variant, so that a[0] ends up in the lowest lane, matching the layout above; _mm512_set_epi32 takes its arguments from the highest lane down):

_mm512_setr_epi32(
    a[0], a[1], a[2], a[3],
    a[4], a[5], a[6], a[7],
    a[8], a[9], a[10], a[11],
    a[12], a[13], a[14], a[15]);

Note: I used AVX512 as an example, ideally I would like a "generic" strategy that can be abstracted on several instruction sets using e.g. Google Highway.


There are 2 answers

Rerito (accepted answer)

It is possible to do this using Google Highway.

#include <concepts> // std::integral, std::floating_point
#include <span>

#include <hwy/highway.h>

namespace hn = hwy::HWY_NAMESPACE;

namespace ns::HWY_NAMESPACE {

template< typename FD, std::integral I >
    requires(std::floating_point< hn::TFromD< FD > > &&
        sizeof(I) <= sizeof(hn::TFromD< FD >))
auto loadAsFp(FD fd, const std::span< const I >& data)
{
    using FP = hn::TFromD< FD >; // The target floating point type.
    // A tag to select the proper vector type to load:
    // It will have the same number of lanes as the vector of FP modeled by the tag FD.
    // If it is not a full vector, highway will emulate as best as it can.
    using ID = hn::Rebind< I, FD >;
    auto ld = hn::LoadU(ID{}, data.data());
    if constexpr (sizeof(I) < sizeof(FP))
    {
        // If the integer type is strictly smaller than the target floating type, we must first do a promotion.
        // Note we target a signed type regardless:
        // Highway will be smart enough to figure out if it can ZeroExtend instead of SignExtend.
        using PromotedD = hn::Rebind< hwy::MakeSigned< hn::TFromD< FD > >, FD >;
        return hn::ConvertTo(fd, hn::PromoteTo(PromotedD{}, ld));
    }
    else
    {
        return hn::ConvertTo(fd, ld);
    }
}

} // ns::HWY_NAMESPACE

Then add the usual Highway per-target plumbing and compile this for the desired targets. Assuming we compiled for AVX512, it could be used as follows:

const std::vector< std::uint8_t > is{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

auto vec = ns::N_AVX3::loadAsFp(hwy::N_AVX3::ScalableTag< float >{}, std::span{is});

For AVX512, this is equivalent to:

auto ld = _mm512_cvtepi32_ps(
    _mm512_cvtepu8_epi32(
        _mm_loadu_epi8(is.data())));
Alex Guteniev

Load them into a register as a contiguous array with _mm_loadu_si128 (casting the pointer with reinterpret_cast or a C-style cast; note the loadu variant, since the data may be unaligned). Then widen each byte into a 32-bit lane using _mm512_cvtepu8_epi32.

The usual obstacle when populating an AVX-512 register from a smaller vector is the lack of cross-lane instructions (in particular, of a cross-lane pshufb equivalent), but for this particular task the above-mentioned intrinsic is a perfect match.