How can I efficiently do bitwise majority voting on 3, 5, 7, or 9 inputs with SSE/SSE2/AVX/...?

I have several (e.g. 3, 5, 7 or 9) equally sized, huge blocks of data (e.g. 100KB-100MB), and want to do bitwise majority voting to get just one block with the most frequent value for each bit. To speed this up, I would like to use SSE/SSE2/AVX/NEON/... CPU extensions.

I tried doing it manually bit by bit, and the result was very slow.

There are 4 answers

Answer by Maxim Egorushkin

As discussed in the other answers, dedicated functions for Maj3, Maj5, Maj7, or Maj9 auto-vectorize nicely (especially with AVX-512 vpternlogd, although compilers don't use it optimally, e.g. they won't emit Maj3 as a single operation). For clang with AVX2 on Skylake, Maj3 done that way is about 19x faster than the positional-popcount approach shown here, if cache bandwidth isn't a bottleneck; for Maj7 the gap is about 11x vs. Aki's version.


If you need to support an arbitrary number of voters, positional popcount is one strategy: accumulate a count in one byte per bit position across the n inputs, so the output bit is 1 if count >= n/2+1. With AVX2 for example, we end up with a SIMD byte compare and movemask to get 32 output bits.
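
A scalar reference for the same strategy (my sketch, not the SIMD code below): count set bits per bit position across the rows, then emit a 1 wherever the count reaches the majority threshold.

#include <cstdint>
#include <cstddef>

void majority_vote_scalar(uint8_t* out, std::size_t n_bytes,
                          const uint8_t* const* rows, std::size_t n_rows) {
    std::size_t const threshold = n_rows / 2 + 1;
    for (std::size_t i = 0; i < n_bytes; ++i) {
        uint8_t result = 0;
        for (int bit = 0; bit < 8; ++bit) {
            std::size_t count = 0;
            for (std::size_t r = 0; r < n_rows; ++r)
                count += (rows[r][i] >> bit) & 1;
            if (count >= threshold)
                result |= uint8_t(1u << bit);
        }
        out[i] = result;
    }
}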

One option is to use vector instructions through a portable API like std::experimental::simd or gcc's vector-extension built-ins, processing, for example, 32 votes at once with 256-bit vectors.
One thing these lack is a movemask operation; compilers aren't smart enough to optimize the portable extract-and-OR strategy below into an x86 movemask or AArch64 shift-right-and-insert chains (which can be more efficient when reducing multiple vectors to a 64-bit mask).

On x86 with AVX2, it takes roughly 4 vector instructions to load each row of 32 votes (4 bytes) and accumulate it (broadcast load, v & mask, compare == mask, and subtract), and another 10 to compute and store the majority votes from the counts.

#include <iostream>
#include <cstring>
#include <array>
#include <cstdint>

template<size_t R>
__attribute__((noinline))
void positional_popcount_ge(uint8_t* votes, size_t n_votes, std::array<uint8_t const*, R> vote_rows, uint8_t min_votes) {
    static_assert(decltype(min_votes){R} == R);
    using U = uint32_t;
    using U8 = U __attribute__((__vector_size__(32)));
    using U4 = U __attribute__((__vector_size__(16)));
    using U32 = uint8_t __attribute__((__vector_size__(32)));
    union V2 {
        U32 u32;
        U8 u8;
        U4 u4[2];
    };
    V2 const isolate_bits4{
        0x80, 0x80, 0x80, 0x80,
        0x40, 0x40, 0x40, 0x40,
        0x20, 0x20, 0x20, 0x20,
        0x10, 0x10, 0x10, 0x10,
        0x08, 0x08, 0x08, 0x08,
        0x04, 0x04, 0x04, 0x04,
        0x02, 0x02, 0x02, 0x02,
        0x01, 0x01, 0x01, 0x01
    };
    for(size_t c = 0; c < n_votes; c += sizeof(U)) {
        // Bit counts [8 bits][4 bytes].
        U32 count32{};
        for(auto r: vote_rows) {
            U u;
            std::memcpy(&u, r + c, sizeof u);
            // Broadcast u and isolate bits.
            auto u4{u & isolate_bits4.u8};
            // Subtract uint8_t{-1} for set bits.
            count32 -= (U32)u4 == isolate_bits4.u32;
        }
        // Counts to isolated bits.
        V2 vote32{(count32 >= min_votes) & isolate_bits4.u32};

        // Reduce isolated bits back into bytes using bitsize or. Do in 3 stages for gcc.
        auto u4{vote32.u4[0] | vote32.u4[1]};
        u4 |= U4{u4[2], u4[3], 0, 0};
        u4 |= U4{u4[1], u4[0], 0, 0};
        U u{u4[0]};
        std::memcpy(votes + c, &u, sizeof u);
    }
}

void print_bits(uint8_t const* bits, ssize_t n_votes);

int main() {
    constexpr int R = 7, C = 256, C2 = 16;

    uint8_t votes[R][C];
    for(int r = 0; r < R; ++r) {
        for(int c = 0; c < C; ++c) {
            unsigned v = ((2u << r) - 1) << (c & 7);
            votes[r][c] = v;
        }
    }

    std::cout << "votes:\n";
    for(int r = 0; r < R; ++r)
        print_bits(votes[r], C2);

    uint8_t majority_votes[C] = {};
    std::array<uint8_t const*, R> vote_rows;
    for(int r = 0; r < R; ++r)
        vote_rows[r] = votes[r];
    positional_popcount_ge(majority_votes, C, vote_rows, R / 2 + 1);

    std::cout << "majority_votes:\n";
    print_bits(majority_votes, C2);
}

void print_bits(uint8_t const* bits, ssize_t n_votes) {
    for(ssize_t c = 0; c < n_votes; ++c) {
        unsigned b = bits[c];
        for(unsigned mask = 0x80; mask; mask >>= 1) {
            char c = '0' + static_cast<bool>(b & mask);
            std::cout.put(c);
        }
        std::cout.put(' ');
    }
    std::cout.put('\n');
}

Outputs:

votes:
00000001 00000010 00000100 00001000 00010000 00100000 01000000 10000000 00000001 00000010 00000100 00001000 00010000 00100000 01000000 10000000 
00000011 00000110 00001100 00011000 00110000 01100000 11000000 10000000 00000011 00000110 00001100 00011000 00110000 01100000 11000000 10000000 
00000111 00001110 00011100 00111000 01110000 11100000 11000000 10000000 00000111 00001110 00011100 00111000 01110000 11100000 11000000 10000000 
00001111 00011110 00111100 01111000 11110000 11100000 11000000 10000000 00001111 00011110 00111100 01111000 11110000 11100000 11000000 10000000 
00011111 00111110 01111100 11111000 11110000 11100000 11000000 10000000 00011111 00111110 01111100 11111000 11110000 11100000 11000000 10000000 
00111111 01111110 11111100 11111000 11110000 11100000 11000000 10000000 00111111 01111110 11111100 11111000 11110000 11100000 11000000 10000000 
01111111 11111110 11111100 11111000 11110000 11100000 11000000 10000000 01111111 11111110 11111100 11111000 11110000 11100000 11000000 10000000 
majority_votes:
00001111 00011110 00111100 01111000 11110000 11100000 11000000 10000000 00001111 00011110 00111100 01111000 11110000 11100000 11000000 10000000 

The generated assembly is decent with gcc-13.2 -O3 and clang-17.0.1 -O2 with -march=x86-64-v3: the loops are unrolled (thanks to the compile-time-constant R (rows)), with vectorized stores and broadcast loads, no loads or stores beyond the required minimum, and no stack spills.

But this is still much slower than pure vertical bitwise operations for small fixed vote counts. The worst case is Maj3: this approach bottlenecks on 18 vector ALU instructions in the loop per 4 result bytes with clang AVX2, vs. Maj3 auto-vectorized from (a&b) | (a&c) | (b&c) on unsigned long (which optimizes to 2 ANDs and 2 ORs per result vector), 10 total instructions per 32 bytes of output (Peter's answer).

With no cache misses, https://uica.uops.info/ predicts this runs Maj3 at best at 6 cycles per 4 bytes on Skylake or Ice Lake, vs. 2.5 or 2.0 cycles per 32 bytes (with some unrolling) the way clang compiles the vertical version. That's roughly a 19x difference. Maj7 is less extreme, but still about 11x vs. Aki's.


Another option is to use Intel AVX2 intrinsics, trading portability for speed, especially in the cleanup. This version can be more efficient than the above: roughly the same 4 AVX instructions per row of votes, but only 5 to compute and store the result, taking advantage of the _mm256_movemask_epi8 intrinsic, which is unavailable through the gcc vector-extension built-ins.

#include <immintrin.h>

template<size_t R>
__attribute__((noinline))
void positional_popcount_ge_avx(uint8_t* votes, size_t n_votes, std::array<uint8_t const*, R> vote_rows, uint8_t min_votes) {
    auto const isolate_bits32 = _mm256_set_epi8(
        0x80, 0x80, 0x80, 0x80,
        0x40, 0x40, 0x40, 0x40,
        0x20, 0x20, 0x20, 0x20,
        0x10, 0x10, 0x10, 0x10,
        0x08, 0x08, 0x08, 0x08,
        0x04, 0x04, 0x04, 0x04,
        0x02, 0x02, 0x02, 0x02,
        0x01, 0x01, 0x01, 0x01
        );
    auto const transpose16 = _mm256_set_epi8(
        15, 11, 7, 3,
        14, 10, 6, 2,
        13,  9, 5, 1,
        12,  8, 4, 0,
        15, 11, 7, 3,
        14, 10, 6, 2,
        13,  9, 5, 1,
        12,  8, 4, 0
        );
    auto const transpose8 = _mm256_setr_epi32(0, 4, 1, 5, 2, 6, 3, 7);
    using U = uint32_t;
    static_assert(sizeof(U) * 8 == sizeof(isolate_bits32));
    static_assert(uint8_t{R} == R);
    auto const min_votes32 = _mm256_set1_epi8(min_votes - 1);
    for(size_t c = 0; c < n_votes; c += sizeof(U)) {
        auto count32 = _mm256_setzero_si256();
        for(auto r: vote_rows) {
            U u;
            std::memcpy(&u, r + c, sizeof u);
            auto bits32 = _mm256_set1_epi32(u);
            bits32 = _mm256_and_si256(bits32, isolate_bits32);
            bits32 = _mm256_cmpeq_epi8(bits32, isolate_bits32); // -1 for 1 bits.
            count32 = _mm256_sub_epi8(count32, bits32);
        }
        count32 = _mm256_permutevar8x32_epi32(_mm256_shuffle_epi8(count32, transpose16), transpose8); // Transpose of 8x4.
        auto vote32 = _mm256_cmpgt_epi8(count32, min_votes32);
        U u = _mm256_movemask_epi8(vote32);
        std::memcpy(votes + c, &u, sizeof u);
    }
}

Answer by njuffa

If the number of source data streams is 3, 5, 7, or 9, one can take advantage of proven optimal or near-optimal constructions of majority-of-n from a majority-of-3 primitive, a.k.a. the ternary median operator ⟨xyz⟩. Knuth, TAOCP Vol. 4a, points out that any monotone Boolean function can be expressed entirely in terms of the ternary median operator and the constants 0 and 1.

Recent literature (see comments in code below) shows how to construct majority-of-7 in a proven optimal way from majority-of-3, requiring seven instances of the latter. The optimal construction of majority-of-9 in this way is still an open research problem, but a fairly efficient construction using thirteen majority-of-3 instances was found recently. The ISO-C99 code below was used to explore this approach.

Compiled with recent x86-64 toolchains and -march=skylake-avx512, the data throughput achieved is decent (tens of GB/sec) on recent x86-64 platforms, but does not yet approach system memory throughput, which would be the ultimate goal. The reason is that compilers are not yet capable of reliably mapping the majority-of-3 primitive to the vpternlogq instruction available with AVX-512, where a majority-of-3 operation is expressible with exactly one such instruction. One would have to work around this by using intrinsics or inline assembly.
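
For reference, a minimal sketch of that workaround (mine, separate from the benchmark program below), assuming AVX-512F intrinsics; truth table 0xE8 yields 1 wherever at least two of the three inputs are 1:

#include <immintrin.h>

/* one majority-of-3 for a whole 512-bit vector: exactly one vpternlogq */
static inline __m512i maj3_avx512 (__m512i a, __m512i b, __m512i c)
{
    return _mm512_ternarylogic_epi64 (a, b, c, 0xE8);
}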

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NBR_STREAMS (9)         // 3, 5, 7, or 9
#define NBR_BLOCKS  (10000000)
#define BENCH_ITERS (3)

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

uint64_t maj3 (uint64_t a, uint64_t b, uint64_t c)
{
    return (((b & c) | a) & (b | c));
}

uint64_t maj5 (uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e) 
{ 
    /* Knuth, TAOCP Vol. 4a, p. 64 */ 
    return maj3 (a, maj3 (c, d, e), maj3 (b, c, maj3 (b, d, e))); 
}

uint64_t maj7 (uint64_t a, uint64_t b, uint64_t c, uint64_t d, 
               uint64_t e, uint64_t f, uint64_t g) 
{ 
    /*
      Eleonora Testa, et al., "Mapping Monotone Boolean Functions into Majority."
      IEEE Transactions on Computers, Vol. 68, No. 5, May 2019, pp. 791-797.
    */
    uint64_t s = maj3 (a, c, d);
    uint64_t t = maj3 (e, f, g);
    return maj3 (b, maj3 (e, s, maj3 (f, g, s)), maj3 (d, t, maj3 (a, c, t)));
}

uint64_t maj9 (uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e, 
               uint64_t f, uint64_t g, uint64_t h, uint64_t i)
{
    /* 
      Thomas Häner, Damian S. Steiger, Helmut G. Katzgraber, "Parallel Tempering
      for Logic Synthesis." arXiv.2311.12394, Nov. 21, 2023
    */
    uint64_t r = maj3 (g, d, c);
    uint64_t s = maj3 (g, e, b);
    uint64_t t = maj3 (i, f, a);
    uint64_t u = maj3 (r, s, h);
    uint64_t v = maj3 (d, h, t);
    uint64_t w = maj3 (c, d, h);
    uint64_t x = maj3 (i, a, u);
    uint64_t y = maj3 (c, v, t);
    uint64_t z = maj3 (y, e, g);
    return maj3 (maj3 (x, u, f), maj3 (z, b, y), maj3 (s, w, t));
}

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    double start, stop, elapsed, datasize;
    /* generate test data */
    printf ("starting data generation\n"); fflush(stdout);
    uint64_t *stream[NBR_STREAMS+1];
    for (int i = 0; i <= NBR_STREAMS; i++) {
        stream [i] = malloc (sizeof (uint64_t) * NBR_BLOCKS);
        if (stream[i] == 0) {
            printf ("allocation for stream %d failed\n", i);
            return EXIT_FAILURE;
        }
        for (int j = 0; j < NBR_BLOCKS; j++) {
            stream[i][j] = KISS64;
        }
    }
    printf ("data generation complete\n");
    /* compute bits of output stream as majority of bits of input streams */
    printf ("generate output stream; timed portion of code\n"); fflush(stdout);
    for (int n = 0; n < BENCH_ITERS; n++) {
        start = second();
        for (int j = 0; j < NBR_BLOCKS; j++) {
#if (NBR_STREAMS == 3)
            stream[3][j] = maj3 (stream[0][j], stream[1][j], stream[2][j]);
#elif (NBR_STREAMS == 5)
            stream[5][j] = maj5 (stream[0][j], stream[1][j], stream[2][j], 
                                 stream[3][j], stream[4][j]);
#elif (NBR_STREAMS == 7)
            stream[7][j] = maj7 (stream[0][j], stream[1][j], stream[2][j], 
                                 stream[3][j], stream[4][j], stream[5][j],
                                 stream[6][j]);
#elif (NBR_STREAMS == 9)
            stream[9][j] = maj9 (stream[0][j], stream[1][j], stream[2][j], 
                                 stream[3][j], stream[4][j], stream[5][j],
                                 stream[6][j], stream[7][j], stream[8][j]);
#else
#error unsupported N
#endif
        }
        stop = second();
    }
    elapsed = stop - start;
    datasize = sizeof (uint64_t) * NBR_BLOCKS * (NBR_STREAMS + 1);
    printf ("processed at %.3f GB/sec\n", datasize * 1e-9 / elapsed);
    printf ("checking output stream\n"); fflush(stdout);
    /* check result stream, the slow way */
    for (int j = 0; j < NBR_BLOCKS; j++) {
        uint64_t t[NBR_STREAMS+1];
        for (int i = 0; i <= NBR_STREAMS; i++) {
            t[i] = stream[i][j];
        }
        for (int k = 0; k < 64; k++) {
            int majority, bitcount = 0;
            for (int i = 0; i < NBR_STREAMS; i++) {
                bitcount += (t[i] >> k) & 1;
            }
            majority = bitcount > (NBR_STREAMS / 2);
            if (majority != ((t[NBR_STREAMS] >> k) & 1)) {
                printf ("error at block %d bit %d res=%d ref=%d\n",
                        j, k, majority, (int)((t[NBR_STREAMS] >> k) & 1));
                return EXIT_FAILURE;
            }
        }
    }
    printf ("test passed\n");
    return EXIT_SUCCESS;
}

Answer by Peter Cordes

Compilers auto-vectorize boolean operations pretty well; you can write a C loop that operates on unsigned char *, or a wider chunk size like unsigned long * if that's convenient for your data without strict-aliasing violations. The resulting asm should operate on 128 or 256 bits at once, or 512 bits depending on compiler tuning options.

If you don't have AVX-512 available (or another ISA with a 3-input logic operation), see Aki Suihkonen's answer for hand-optimized maj5 and maj7 that beat what compilers make from the naive versions shown in this answer (especially compilers other than clang): about 1.33x faster for Maj7 with AVX2 on Ice Lake vs. this answer's Maj7 compiled with clang, or 3x vs. GCC. The scalar asm shown in his answer corresponds to the AVX or NEON/SVE vector asm you get if you use the functions in loops over arrays.
See njuffa's answer if you do have ternary logic available and want to use it manually (e.g. AVX-512 intrinsics) for implementing maj5 and wider in terms of maj3, with literature citations for research in this area. (That research apparently restricts itself to maj3 as a building block, rather than the arbitrary ternary logic AVX-512 allows, so it's not always optimal.) Aki's functions could also benefit from ternary logic, either manually or by letting the compiler find groups of 3-input operations, even though compilers are far from optimal at using 3-input logic.

AArch64 SVE or FEAT_SHA3 has EOR3, a 3-input XOR, which compilers use for Aki's version with -mcpu=cortex-a710 for example (in both its ASIMD and SVE forms). I don't know of other 3-input boolean ops in AArch64, but a 3-input XOR is useful for a full adder.
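
As a hedged illustration (my sketch, not from the answers): with FEAT_SHA3, the ASIMD form of EOR3 is exposed as the veor3q_* intrinsics, so the sum half of a vertical full adder becomes a single instruction:

#include <arm_neon.h>   // veor3q_* needs e.g. -march=armv8.2-a+sha3

// x ^ y ^ z in one EOR3; the carry (a maj3) still takes ordinary 2-input ops.
static inline uint64x2_t full_adder_sum(uint64x2_t x, uint64x2_t y, uint64x2_t z)
{
    return veor3q_u64(x, y, z);
}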


Manual vectorization might be useful to take better advantage of AVX-512 vpternlogd, a 3-input bitwise boolean that takes an 8-bit truth table as an immediate and runs about as cheaply as a simple 2-input vpand or vpor (https://uops.info/), although it does have to overwrite one of its inputs, so it can require a vmovdqa register copy if that input is needed later. For each vertical group of 3 bits separately, it looks up the result in that table, like an FPGA LUT. So it can compute any 3-input boolean function in one instruction (for a whole vector of e.g. 256 bits), such as Maj3. Compilers will optimize code like a & b | c to use it, but don't do a perfect job: for the straightforward way to write Maj3, GCC, Clang, and MSVC use two vpternlogd or vpternlogq instructions per vector of results.

#include <immintrin.h>   // _mm256_ternarylogic_epi32 needs AVX-512F + AVX-512VL

// manually vectorized by @harold in 4 ternlog ops, from comments under the question
// GCC's inner loop when auto-vectorizing uses 7x ternlog + 3x two-input ops
// Clang's uses 6x ternlog + 3x two-input
__m256i maj5(__m256i v0, __m256i v1, __m256i v2, __m256i v3, __m256i v4)
{
    __m256i v5 = _mm256_ternarylogic_epi32(v3, v2, v4, 0b01101000);  // These are not Maj3 operations
    __m256i v6 = _mm256_ternarylogic_epi32(v5, v0, v1, 0b11101000);
    __m256i v7 = _mm256_ternarylogic_epi32(v3, v2, v4, 0b10000001);
    __m256i v8 = _mm256_ternarylogic_epi32(v7, v3, v6, 0b11001010);
    return v8;
}

Compilers are also quite good at common-subexpression elimination if you write Maj5 as something like (a&b&c) | (a&b&d) | (a&b&e) | ..., doing the a&b part once. Auto-vectorization of simple portable source might be good enough for your use-cases, and with clang at least it's hard to beat with intrinsics unless you have a ternary operation (I think only AVX-512 out of the ISAs you mentioned; not AVX2 or earlier, and I don't know of one in AArch64 NEON/ASIMD or SVE, and clang doesn't use one). Compilers differ in how many bitwise boolean ops they use in the inner loop for Maj5 and Maj7. I didn't look at Maj9; that was more typing than I was interested in doing!

typedef unsigned long Chunk;  // a wider type might help compilers make cleanup simpler
          //  if you can't promise them the size is a multiple of 64 bytes or something.

static inline
Chunk bitwise_majority_3(Chunk x, Chunk y, Chunk z) {
    return (x&y) | (y&z) | (z&x);
}

static inline
Chunk bitwise_majority_5(Chunk a, Chunk b, Chunk c, Chunk d, Chunk e)
{
    // abc ∨ abd ∨ abe ∨ acd ∨ ace∨ ade ∨ bcd ∨ bce ∨ bde ∨ cde
    Chunk with_a = (a&b&c) | (a&b&d) | (a&b&e) | 
                             (a&c&d) | (a&c&e) | (a&d&e);
    Chunk without_a = (b&c&d) | (b&c&e) | (b&d&e) | (c&d&e);
    return with_a | without_a;
}

static inline
Chunk bitwise_majority_7(Chunk a, Chunk b, Chunk c, Chunk d, Chunk e, Chunk f, Chunk g)
{
    Chunk ab = a&b;
    Chunk abc = ab&c;  // some temporaries, mostly to save typing
    Chunk abd = ab&d;  // compilers make their own decisions about common subexpressions
    Chunk fg = f&g;
    Chunk with_ab = (abc & d) | (abc & e) | (abc & f) | (abc & g)
                              | (abd & e) | (abd & f) | (abd & g)
                                          | (ab&e& f) | (ab&e& g)
                                                      | (ab& fg);
    Chunk ac = a&c;
    Chunk with_ac = (ac&d& e) | (ac&d& f) | (ac&d& g)
                              | (ac&e& f) | (ac&e& g)
                                          | (ac& fg);
    Chunk ad = a&d;  
    Chunk with_ad = (ad& e&f) | (ad& e&g) | (ad& fg);
    Chunk with_a = with_ab | with_ac | with_ad | (a&e & fg);
    Chunk bc = b&c;
    Chunk with_bc = (bc&d& e) | (bc&d& f) | (bc&d& g)   // same as with_ac but with bc
                              | (bc&e& f) | (bc&e& g)
                                          | (bc& fg);
    Chunk bd = b&d;
    Chunk with_bd = (bd& e&f) | (bd& e&g) | (bd& fg);
    Chunk with_b = with_bc | with_bd | (b&e & fg);

    Chunk cd = c&d;
    Chunk with_cd = (cd& e&f) | (cd& e&g) | (cd& fg);
    Chunk with_c = with_cd | (c&e & fg);

    Chunk with_d = (d&e & fg);

    return with_a | with_b | with_c | with_d;
}

// bitwise_majority_9  left as an exercise for the reader
// or see njuffa's answer which implements maj9 in terms of maj3

Given those scalar building blocks, I made simple loops for compilers to vectorize. __restrict and a constant count that's a multiple of the vector width avoid any cleanup or fallback (for overlap) code, leaving just the asm we want to look at. (Godbolt - MSVC vs. clang 17 for x86-64 AVX2 or for AArch64 SVE -mcpu=cortex-a710. Change the dropdown to look at GCC output.)

void vec_majority_3(Chunk *__restrict a, const Chunk *b, Chunk *c)
{
    for (int i=0 ; i<10240 ; i++){
        a[i] = bitwise_majority_3(a[i], b[i], c[i]);
    }
}

void vec_majority_5(Chunk *__restrict a, const Chunk *b, Chunk *c, const Chunk *d, const Chunk *e)
{
    for (int i=0 ; i<10240 ; i++){
        a[i] = bitwise_majority_5(a[i], b[i], c[i], d[i], e[i]);
    }
}


void vec_majority_7(Chunk *__restrict a, const Chunk *b, Chunk *c, const Chunk *d, const Chunk *e, const Chunk *f, const Chunk *g)
{
    for (int i=0 ; i<10240 ; i++){
        a[i] = bitwise_majority_7(a[i], b[i], c[i], d[i], e[i], f[i], g[i]);
    }
}

Maj3: With AVX2 or SVE: 2x vpand / 2x vpor, plus 3 loads and a store. Clang -O3 -march=x86-64-v3 -fno-unroll-loops (AVX2) makes a loop that will run as 10 uops on Intel CPUs, so we expect Ice Lake to run it at 2 cycles per 32 bytes of output if not bottlenecked on cache bandwidth from L2 or worse. Unrolling will help some with the front-end bottleneck. (See also Micro fusion and addressing modes - fortunately gcc -march=skylake, unlike -march=x86-64-v3, does separate loads instead of using the same memory operand twice with indexed addressing modes.)

Clang's loops do Maj5 in 14 ops and Maj7 in 26 ops, with AVX2 or with ARM ASIMD/SVE.

GCC and MSVC are worse, using 17 two-input bitwise ops for Maj5.
And much worse for Maj7: GCC at 60 ops, MSVC at 36.

If everything hits in L1d cache, https://uica.uops.info/ predicts Clang's vec_majority_7 loop will run at 8.69 cycles per output vector (for AVX2 with 32-byte vectors). The bottleneck is vector ALU ports. (To use uiCA, copy/paste just the loop body from compiler asm output, starting with the label, ending with the jnz .L3 or whatever. Not the whole function with the prologue and ret.)

AVX-512

AVX-512's improvement for Maj7 is nowhere near as big as it should be because of inefficient use of vpternlogq: njuffa shows 7 total maj3 (ternary) operations, but clang used 19 total vpternlogq + vpand + vpor instructions, a factor of 2.7 more than it needs to be. uiCA predicts 6.33 cycles per 32-byte vector, only a small speedup over AVX2.
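
For comparison, a hand-written sketch (mine, untested) of what those 7 ternary operations could look like with AVX-512 intrinsics, reusing njuffa's maj3 tree for Maj7; 0xE8 is the majority-of-3 truth table:

#include <immintrin.h>

static inline __m512i maj3_zmm(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi64(a, b, c, 0xE8);   // one vpternlogq per maj3
}

// njuffa's 7-instance maj3 construction for Maj7, so 7 vpternlogq per vector.
static inline __m512i maj7_zmm(__m512i a, __m512i b, __m512i c, __m512i d,
                               __m512i e, __m512i f, __m512i g) {
    __m512i s = maj3_zmm(a, c, d);
    __m512i t = maj3_zmm(e, f, g);
    return maj3_zmm(b, maj3_zmm(e, s, maj3_zmm(f, g, s)),
                       maj3_zmm(d, t, maj3_zmm(a, c, t)));
}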

AVX-512 also allows 512-bit vectors. 512-bit uops shut down the vector ALU on port 1 on Intel CPUs, so we'd expect about a 1.5x throughput boost from vpternlogq zmm vs. vpternlogq ymm if frequency stayed the same. (Which is likely on Ice Lake / Rocket Lake desktop chips given adequate cooling, since bitwise ops are "light" in terms of limiting max turbo.)

uiCA's analysis for Ice Lake is 9.50 cycles per 64-byte vector result for Maj7, so 4.75 cycles per 32 bytes of output. This is assuming L1d cache hits, but 64-byte loads of whole cache lines might help in non-ideal situations to avoid extra conflict misses.


Cache conflict-miss considerations with many input streams

With Maj7, you have 7 input streams (and hopefully one of those is also an in-place output stream). L1d cache is only 8-way associative on many CPUs, or 12-way on some newer ones like Ice Lake. This could lead to conflict misses if all your inputs are at similar offsets within different pages, or at similar offsets within the larger region size relevant for L2 aliasing, especially on Skylake where L2 is only 4-way associative. (Later Intel CPUs have larger L2 caches, e.g. Ice Lake has 512K, and bumped the associativity back up to 8.)

See For-loop efficiency: merging loops for some details on loops with many streams of memory access on CPUs like Skylake; they can handle more than 4, but 7 to 9 are pushing it even if you don't have 4K aliasing stalls.

There's not a lot you can do about this, unless there's scope for reducing 5 inputs down to 2 or something, working in chunks to cache-block it. Or for reading all the data from any given cache line within a short interval, without reading and writing a lot of other cache lines in between. With only 16 vector regs of 32 bytes each (AVX2), loading two sets of 7 vectors leaves only two temporaries, and won't work for 9 inputs. But maybe you can leave a couple of inputs unread until later if you unroll.

With AVX-512, you have 32 vector regs. And you can read a whole cache line at once as a single vector, assuming your data is aligned by 64. (Or by 128 for the benefit of the L2 spatial prefetcher that tries to complete aligned pairs).

If conflict misses turn out to be a bad problem, perhaps cache-block by copying a cache line or two from 5 of your 9 input arrays into one contiguous buffer (5x64x2 = 640 bytes), so all accesses to each of those cache lines happen contiguously, with 4x 128-bit vectors for example. Then loop over that block, reloading vectors from fixed strides within it while loading the other 4 vectors of data from the source arrays. That way, accesses to any input cache line are only separated by accesses to 5 other cache lines that are likely to map to the same set, which should be good enough for pseudo-LRU to do its job. And the 10-cache-line buffer is small enough to stay hot in L1d, not using up L1<->L2 bandwidth.
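
A rough sketch of that idea (mine; kBlock, the function names, and the scalar maj9 combiner are placeholder assumptions, a real version would combine with vectors):

#include <cstdint>
#include <cstring>
#include <cstddef>
#include <algorithm>

uint64_t maj9(uint64_t, uint64_t, uint64_t, uint64_t, uint64_t,
              uint64_t, uint64_t, uint64_t, uint64_t);   // e.g. njuffa's maj9

void maj9_blocked(uint64_t* out, const uint64_t* const in[9], std::size_t n_words) {
    constexpr std::size_t kBlock = 16;          // 16 x 8 bytes = 2 cache lines per stream
    alignas(64) uint64_t staged[5][kBlock];     // 640-byte staging buffer, stays hot in L1d
    for (std::size_t base = 0; base < n_words; base += kBlock) {
        std::size_t len = std::min(kBlock, n_words - base);
        for (int r = 0; r < 5; ++r)             // contiguous copies, one stream at a time
            std::memcpy(staged[r], in[r] + base, len * sizeof(uint64_t));
        for (std::size_t i = 0; i < len; ++i)   // remaining 4 streams read in place
            out[base + i] = maj9(staged[0][i], staged[1][i], staged[2][i],
                                 staged[3][i], staged[4][i],
                                 in[5][base + i], in[6][base + i],
                                 in[7][base + i], in[8][base + i]);
    }
}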

If you attempt this, check the asm to make sure the compiler didn't defeat the copying. Or if it did, that there are enough vector registers for it to do contiguous loading.

Answer by Aki Suihkonen

A partial solution to the problem: maj5 can be built by first computing a two-output maj4

  is_equal(a,b,c,d) == true, if popcount(a+b+c+d) == 2   // a 2-2 tie
  value(a,b,c,d) = majority_of(a,b,c,d); // or ? for a tie

From that building block, maj5 == is_equal ? e : majority_of_four: if there's already a majority among the 4 values, the fifth value cannot change it; and if there's a tie, the fifth value makes the decision.

Optimising a Karnaugh map of the four inputs a,b,c,d gives 10 instructions (on ARM, and on VEX-encoded SIMD):

int maj5(int a, int b, int c, int d, int e)
{
//        ab  /          cd
//                00   01   11    10
//        00      0    0    e     0
//        01      0    e    1     e
//        11      e    1    1     1
//        10      0    e    1     e

    auto row0 = ~(a|b);   // true for the top row
    auto col0 = ~(c|d);   // true for the leftmost column

    auto row2 = a&b;      // true for the row of `e 1 1 1`
    auto col2 = c&d;      // true for the col of `e 1 1 1`

    // zero = true for all the entries in the map with output == 0
    auto zero = (row0 | col0) & ~(row2 | col2);
    // one = true for all the entries in the map with output == 1
    auto one = (row2 | col2) & ~(row0 | col0);

    // E = true for all those entries that are neither 0 or 1
    auto E = ~(zero | one);
 
    // final multiplex between one and `e`
    return one | (e & E);
}

  orr w5, w0, w1
  and w0, w0, w1
  orr w1, w2, w3
  and w2, w2, w3
  and w3, w5, w1

  orr w1, w0, w2
  eor w0, w1, w3
  and w1, w1, w3
  and w0, w0, w4
  orr w0, w0, w1

This 4-input logic building block could possibly be converted to pshufb on SSSE3 or vqtbl1q on ARM64, given that bit interleaving can be done efficiently. (On ARM64 a tree of vsri, vsli typically does the trick, as in a bit-matrix transpose; pmull and pclmulqdq can also be used to interleave bits, but from experience I would expect vsri/vsli to be more efficient for the transpose.)

I would also consider interleaving 4 independent streams into nibbles up front: as the number of streams grows, the number of arithmetic operations needed grows only linearly (up to a point), and it also avoids cache-line evictions caused by the cache implementation (4-way and 8-way associative caches are probably still common). If the data were interleaved by 8, then NEON vcnt and AVX-512 vpopcntb could be used directly, making the whole algorithm memory-bound and possibly a bit wasteful in storing one boolean per uint8_t; but then again, the output format is already trivially compatible with the input format.

  // Sketch of the idea (the original is pseudo-code): per-nibble popcount via a
  // pshufb lookup table, then accumulate the counts from the next group of streams.
  __m128i lo = _mm_and_si128(abcd, _mm_set1_epi8(0x0f));
  __m128i hi = _mm_and_si128(_mm_srli_epi16(abcd, 4), _mm_set1_epi8(0x0f)); // shift in 16-bit domain, then mask
  __m128i abcd_lo = _mm_shuffle_epi8(LUT, lo);   // LUT = 16-entry nibble-popcount table
  __m128i abcd_hi = _mm_shuffle_epi8(LUT, hi);
  abcd_lo = _mm_add_epi8(abcd_lo, efgh_lo);
  abcd_hi = _mm_add_epi8(abcd_hi, efgh_hi);

For the maj7 function, my best effort does not (obviously) match the 7-instruction maj3 tree, but it compiles to fewer two-input operations than the maj3-based approach.

The idea above is extended with [t, o] = full_adder(a,b,c); [T, O] = full_adder(d,e,f); where [t, o] map to (twos, ones), i.e. (carry, sum).

int med7(int t, int o, int T, int O, int g) {
  int G_OR_One = (t&o) | (T&O) | (o & T) | (O & t) | (t & T);
  int One = (t & T) | (t & o & O) | (T & O & o);
  return One | (G_OR_One & ~One & g);
}
        and     w8, w1, w3
        orr     w9, w0, w2
        and     w10, w2, w0
        and     w8, w9, w8
        orr     w11, w3, w1

        orr     w8, w8, w10
        and     w9, w11, w9
        bic     w10, w4, w8
        and     w9, w10, w9
        orr     w0, w9, w8

Theoretically the last function can be evaluated with 6 arbitrary 3-input functions, and the full adders require 4, so this is slightly worse than the optimal sequence of 9 maj3 operations. But when such a function is not available, this version might be useful.

AArch64 SVE has EOR3, a 3-input exclusive or, which a full adder can take advantage of. There's a non-SVE version for CPUs with FEAT_SHA3.


A full implementation of maj7 could look like this. (The original version of this answer had a bug: it omitted the (T & t) term from the first expression.)

typedef unsigned long long Chunk;

static inline
Chunk med7(Chunk t, Chunk o, Chunk T, Chunk O, Chunk g) {
  Chunk G_OR_One = (t&o) | (T&O) | (o & T) | (O & t) | (t & T);
  Chunk One = (t & T) | (t & o & O) | (T & O & o);
  return One | (G_OR_One & ~One & g);
}
static inline
Chunk vertical_full_adder(Chunk x, Chunk y, Chunk z, Chunk *carry_out)
{
    Chunk sum = x^y ^ z;
    *carry_out = ((x^y) & z) | (x&y);
    return sum;
    //*carry_out = (x&y)|(x&z)|(y&z);  // fun fact, this is a maj3
}

Chunk maj7_aki(Chunk a, Chunk b, Chunk c, Chunk d, Chunk e, Chunk f, Chunk g){
    //[t, o] = full_adder(a,b,c); [T, O] = full_adder(d,e,f);
    Chunk t, T;
    Chunk o = vertical_full_adder(a,b,c, &t);
    Chunk O = vertical_full_adder(d,e,f, &T);
    return med7(t,o, T,O, g);
}

void testmaj7(){
    Chunk a = 0x5555555555555555, b = 0x3333333333333333, c = 0x0f0f0f0f0f0f0f0f
    ,     d = 0x00ff00ff00ff00ff, e = 0x0000ffff0000ffff, f = 0x00000000ffffffff;

//    static_assert(maj7_aki(a,b,c,d,e,f, 0xffffffffffffffff) == bitwise_majority_7(a,b,c,d,e,f, 0xffffffffffffffff));
    volatile Chunk sink;  // look at the asm for constants stored
    Chunk bitwise_majority_7(Chunk a, Chunk b, Chunk c, Chunk d, Chunk e, Chunk f, Chunk g);  // reference version defined later

    sink =           maj7_aki(a,b,c,d,e,f, 0xffffffffffffffff);
    sink = bitwise_majority_7(a,b,c,d,e,f, 0xffffffffffffffff);  // same value: pass

    sink =           maj7_aki(0, a,b,c,d,e,f);  // 0x101110117177F
    sink = bitwise_majority_7(0, a,b,c,d,e,f);  // 0x101170117177F : fail
                          // different value!            ^

    sink =           maj7_aki(0, 0b0101,0b0011,0xf,0xf,0xf,0);   // 1 //  just the failing nibble
    sink = bitwise_majority_7(0, 0b0101,0b0011,0xf,0xf,0xf,0);   // 7
    // 7 is indeed correct: 3 set bits in each position from the 0xf inputs
    // and at least one additional set bit at each of the low 3 positions from the 0b0101 (5) and 0b0011 (3) inputs
    // the full_adder maj7 needs a bit set in *both* those inputs to produce 1
}

Maybe one final edit to the maj7 function:

Chunk maj7_for_avx512(Chunk o, Chunk t, Chunk O, Chunk T, Chunk g) {
    // maj() is bitwise majority-of-3: (x&y) | (x&z) | (y&z)
    Chunk A = maj(o,t,T), B = maj(t,O,T); // 3 inputs each
    return (A&B) | ((A|B) & g); // 3 inputs -- itself a maj3 of A, B, g
}

As can be seen, only 3 independent 3-operand functions are called now, which, combined with the two full-adder calls, should convert to just 7 vpternlogd instructions.
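
A hedged intrinsics sketch of that count (mine, untested; helper names are my own): 0x96 is the 3-input-XOR truth table for the full-adder sums, 0xE8 is majority-of-3 for the carries, for A and B, and for the final step (which is itself a maj3 of A, B, g):

#include <immintrin.h>

static inline __m512i xor3_zmm(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi64(a, b, c, 0x96);   // a ^ b ^ c
}
static inline __m512i maj3_zmm(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi64(a, b, c, 0xE8);   // majority of 3
}

__m512i maj7_fa_zmm(__m512i a, __m512i b, __m512i c, __m512i d,
                    __m512i e, __m512i f, __m512i g) {
    __m512i o = xor3_zmm(a, b, c), t = maj3_zmm(a, b, c);  // full_adder(a,b,c): ones, twos
    __m512i O = xor3_zmm(d, e, f), T = maj3_zmm(d, e, f);  // full_adder(d,e,f)
    __m512i A = maj3_zmm(o, t, T), B = maj3_zmm(t, O, T);
    return maj3_zmm(A, B, g);   // == (A&B) | ((A|B) & g); 7 vpternlogq in total
}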

On Godbolt, these maj5 and maj7 auto-vectorize in a loop where clang uses 2x EOR3 for maj7. Hopefully this is repairable without extra instructions. It's correct for most inputs.

As-is, this incorrect version compiles to 19 bitwise operations in a loop for AVX2, and also 19 bitwise ops for -mcpu=cortex-a710 with SVE, even though two of them are EOR3. At best it's shortening critical-path latency, for example doing eor3 dst1, z4, z3, z5 in parallel with eor dst2, z4, z3.