Multiple bitvectors; how to find bits that are set exactly n times?

969 views Asked by At

I have a collection of four bitvectors, for example:

b1 = 00001010
b2 = 10100111
b3 = 10010010
b4 = 10111110

I would like to get the masks of those bits that are set in exactly 0, 1, 2, 3 or 4 of the given bitvectors. So m0 would be the mask of bits that are not set in any of the four bitvectors, m3 is the mask of those bits that are set in exactly three of the bitvectors, etc.:

m0 = 01000000
m1 = 00000001
m2 = 00111100
m3 = 10000000
m4 = 00000010

What is the fastest way to find these masks using bitwise operators?

I assume these have the least operations for 0 and 4 bits:

m0 = ~(b1 | b2 | b3 | b4)  // 4 ops
m4 = b1 & b2 & b3 & b4     // 3 ops

For the other options I'm not so sure my methods have the least operations:

m1 = ((b1 ^ b2) & ~(b3 | b4)) | (~(b1 | b2) & (b3 ^ b4)) // 9 operations
m2 = ((b1 ^ b2) & (b3 ^ b4)) | ((b1 ^ b3) & (b2 ^ b4)) | ((b1 ^ b4) & (b2 ^ b3)) // 11 operations
m3 = ((b1 ^ b2) & (b3 & b4)) | ((b1 & b2) & (b3 ^ b4)) // 7 operations

Is this the fastest way to calculate these masks or can I do it faster (in fewer operations)?

For most of the cases I need one or a few of these masks, not all of them at the same time.

(Note, in reality I will be doing this for 64 or 128 bit vectors. It is probably irrelevant, but I do it in C on a 32-bit x86 platform.)

6

There are 6 answers

2
CodesInChaos On BEST ANSWER

14 operations for all masks.

The idea is to first sort the bits, using min = x & y and max = x | y as conditional swap. This costs 10 operations. Then simply extract the masks which costs 4 operations.

// Split in lower and upper half
var c1 = b1 & b2;
var c3 = b1 | b2;
var c2 = b3 & b4;
var c4 = b3 | b4;

// Sort within each half
var d1 = c1 & c2; // (b1 & b2) & (b3 & b4)
var d2 = c1 | c2; // (b1 & b2) | (b3 & b4)
var d3 = c3 & c4; // (b1 | b2) & (b3 | b4)
var d4 = c3 | c4; // (b1 | b2) | (b3 | b4)

// Sort middle
var e1 = d1;      // (b1 & b2) & (b3 & b4)
var e2 = d2 & d3; // ((b1 & b2) | (b3 & b4)) & ((b1 | b2) & (b3 | b4))
var e3 = d2 | d3; // ((b1 & b2) | (b3 & b4)) | ((b1 | b2) & (b3 | b4))
var e4 = d4;      // (b1 | b2) | (b3 | b4)

// Extract masks
var m4 = e1;      // (b1 & b2) & (b3 & b4)
var m3 = e2 ^ e1; // (((b1 & b2) | (b3 & b4)) & ((b1 | b2) & (b3 | b4))) ^ ((b1 & b2) & (b3 & b4))
var m2 = d3 ^ d2; // The same as e3 ^ e2, saves two operations if only m2 is required
var m1 = e4 ^ e3; // ((b1 | b2) | (b3 | b4)) ^ (((b1 & b2) | (b3 & b4)) | ((b1 | b2) & (b3 | b4)))
var m0 = ~e4;     // ~((b1 | b2) | (b3 | b4))

(The code is in C#, but it's trivial to convert this to C.)


If you use this code calculate only some masks and simply remove the lines that don't affect the result (a decent compiler should do this automatically). The performance isn't bad either:

m4: 3 operations
m3: 9 operations
m2: 7 operations
m1: 9 operations
m0: 4 operations

0
manlio On

Since m2 seems the harder to calculate, you could write:

m2 = ~(m0 | m1 | m3 | m4)  // 4 ops
0
reheated On

Assuming that the (inefficient) C++ search code below is correct, the expressions with the fewest ops, disallowing temporary variables, are as follows:

(4 ops): m0 = ~(((b3 | b2) | b1) | b0)
(7 ops): m1 = ((((b1 & b0) | b3) | b2) ^ (((b3 & b2) | b1) | b0))
(7 ops): m2 = (((b3 | b0) & (b2 | b1)) ^ ((b2 & b1) | (b3 & b0)))
(7 ops): m3 = (((((b3 | b2) & b1) & b0) ^ (b3 & b2)) & (b1 | b0))
(3 ops): m4 = (((b3 & b2) & b1) & b0)

I only checked m1 by hand. This is an interesting problem but are you sure that this is the bottleneck in your software? Even if it is, the implementation with the fewest ops may not be the fastest, e.g. I don't know but NOT could be faster than the other ops.

// A program to search for boolean expressions that determine
// whether n of bools x0,..x3 are true, made up of a minimal
// number of ands, ors, xors and nots.

// There are 2**4=16 possible assignments of x0,..x3
// so there are 2**16 functions (assignments) -> (output)
// thus each map can be identified as an integer
// fun[i] will be the list of ids of all functions that
// can be represented with <= n operations

// options
const int max_ops = 7; // max number of ops to search to


#include <tchar.h>
#include <vector>
#include <set>
#include <iostream>
#include <bitset>
#include <map>
#include <string>

using namespace std;

typedef enum {
    LITERAL,
    NOT,
    AND,
    OR,
    XOR
} OpType;

typedef struct {
    int first;
    int second;
    OpType type;
} op;

int get_count_fn(int k)
{
    // return the id of the function which is true if
    // k of the 4 inputs are true
    int x = 0;
    for (int j = 0; j < 16; j++)
    {
        int m = 0;
        for (int i = 0; i < 4; i++)
        {
            if (j & (1 << i))
            {
                m += 1;
            }
        }
        if (m == k)
        {
            x |= (1 << j);
        }
    }
    return x;
}

void add_triple(map<int, op> & src, int first, int second, OpType type, int result)
{
    // record an operation
    op rhs;
    rhs.first = first;
    rhs.second = second;
    rhs.type = type;
    src[result] = rhs;
}

int get_first(const vector<map<int, op>> & src, int val)
{
    // find the first n such that src[n] contains val
    for (unsigned int i = 0; i < src.size(); i++)
    {
        if (src[i].find(val) != src[i].end())
        {
            return i;
        }
    }
    return -1;
}

string display_retrace(const vector<map<int, op>> & src, int val)
{
    // trace a function backwards to find out how it was constructed
    string result;

    // find the op leading to it
    int n = get_first(src, val);
    auto iter = src[n].find(val);
    op o = iter->second;

    // print it out, recursively
    if (o.type == LITERAL)
    {
        result = string("b") + to_string(o.first);
    }
    else if (o.type == NOT)
    {
        result = string("~") + display_retrace(src, o.first);
    }
    else if (o.type == AND)
    {
        result = string("(") + display_retrace(src, o.first) + string(" & ") +
            display_retrace(src, o.second) + string(")");
    }
    else if (o.type == OR)
    {
        result = string("(") + display_retrace(src, o.first) + string(" | ") +
            display_retrace(src, o.second) + string(")");
    }
    else if (o.type == XOR)
    {
        result = string("(") + display_retrace(src, o.first) + string(" ^ ") +
            display_retrace(src, o.second) + string(")");
    }
    return result;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int all_on = (1 << 16) - 1;
    vector<int> countFuns;
    vector<bool> foundCountFuns;
    vector<map<int, op>> src;

    cout << "The `counting' functions we seek are:\n";
    for (int k = 0; k <= 4; k++)
    {
        int cf = get_count_fn(k);
        cout << std::bitset<16>(cf) << "\n";
        countFuns.push_back(cf);
        foundCountFuns.push_back(false);
    }

    for (int i = 0; i <= max_ops; i++)
    {
        src.push_back(map<int, op>());
    }

    // add all the literals to the list for 0 operations

    for (int i = 0; i < 4; i++)
    {
        int x = 0;
        for (int j = 0; j < 16; j++)
        {
            if (j & (1 << i))
            {
                x |= (1 << j);
            }
        }
        add_triple(src[0], i, -1, LITERAL, x);
    }

    // iterate over the number n of operators
    for (int n = 1; n <= max_ops; n++)
    {
        // iterate over i,j with i+j=n-1
        for (int i = 0; i <= n - 1; i++)
        {
            int j = n - i - 1;

            // add all combinations of all vectors to the list for n
            for (auto pi = src[i].begin(); pi != src[i].end(); pi++)
            {
                for (auto pj = src[j].begin(); pj != src[j].end(); pj++)
                {
                    int xi = pi->first;
                    int xj = pj->first;
                    add_triple(src[n], xi, xj, OR, xi | xj);
                    add_triple(src[n], xi, xj, AND, xi & xj);
                    add_triple(src[n], xi, xj, XOR, xi ^ xj);
                }
            }
        }
        // also add the "nots" from n-1
        for (auto pprev = src[n - 1].begin(); pprev != src[n - 1].end(); pprev++)
        {
            int xprev = pprev->first;
            add_triple(src[n], xprev, -1, NOT, all_on - xprev);
        }

        cout << "Functions with " << n << " operators: size is " << src[n].size() << " ---\n";

        // search for the functions we are interested in
        for (unsigned int k = 0; k < countFuns.size(); k++)
        {
            if (!foundCountFuns[k] && src[n].find(countFuns[k]) != src[n].end())
            {
                cout << "Found count function " << k << ":\n";
                cout << "m" << k << " = " << display_retrace(src, countFuns[k]) << "\n";
                foundCountFuns[k] = true;
            }
        }
    }

    system("pause");
    return 0;
}
3
Jonathan Mee On

Please pardon the C++. It just makes it easier to output.

const int b1 = 0b00001010;
const int b2 = 0b10100111;
const int b3 = 0b10010010;
const int b4 = 0b10111110;

// 4 operations
const int x12 = b1 ^ b2;
const int x34 = b3 ^ b4;
const int a12 = b1 & b2;
const int a34 = b3 & b4;

const int m0 = ~(b1 | b2 | b3 | b4);  // 4 operations
const int m3 = ((x12) & (a34)) | ((a12) & (x34)); // 3 operations
const int m1 = ((x12) ^ (x34)) & ~m3; // 3 operations
const int m4 = a12 & a34; //1 operation
const int m2 = ~(m0 | m1 | m3 | m4); //4 operations

cout << bitset<8>(m0) << endl << bitset<8>(m1) << endl << bitset<8>(m2) << endl << bitset<8>(m3) << endl << bitset<8>(m4) << endl;

A. The best I could do was 19 operations.

4
Evgeny Kluev On

Here are the results sorted by number of masks calculated at the same time.

Each mask needs at most 7 operations if calculated separately:

a01 = b0 & b1
a23 = b2 & b3
r01 = b0 | b1
r23 = b2 | b3

m0 = ~(r01 | r23)              // 4 ops
m1 = (a01 | r23) ^ (r01 | a23) // 7 ops
m2 = (r01 & r23) ^ (a01 | a23) // 7 ops
m3 = (r01 & r23) & (a01 ^ a23) // 7 ops
m4 = a01 & a23                 // 3 ops

There are lots of common sub-expressions here, so if you need to know any pair of masks at once, you need at most 10 operations (less with m0 or m4).

But calculation of some pairs may be optimized even further:

// m2,m3 in 9 ops
t1 = r01 & r23
t2 = a01 ^ a23
m2 = t1 ^ (a23 | t2)
m3 = t1 & t2

// m1,m3 in 9 ops
t1 = r01 ^ r23
t2 = a01 ^ a23
t3 = t1 ^ t2
m1 = t1 & t3
m3 = t2 & t3

Same approach works for mask triples. Only one triple (m1,m2,m3) may be computed faster, in 11 operations, with "sort the bits" approach, which is optimal.

If you need 4 or 5 masks at once, I believe, "sort the bits" approach would give optimal result.

Some more optimizations are possible if we allow one more operation (NAND). Actually, last 3 lines of the last code snippet may be replaced with 2 NANDs:

// m1,m3 in 8 ops (NAND is a single op)
t1 = r01 ^ r23
t2 = a01 ^ a23
m1 = t1 & ~t2
m3 = ~t1 & t2

And (m1,m2,m3) triple also could be optimized with NANDs:

// m1,m2,m3 in 10 ops (NAND is a single op)
x01 = b0 ^ b1
x23 = b2 ^ b3
a01 = b0 & b1
a23 = b2 & b3
t1 = x01 ^ x23
t2 = a01 ^ a23
m1 = t1 & ~t2
m2 = ~t1 & (x23 ^ t2)
m3 = t1 & t2

Add one more operation m4 = a01 & a23 to get all masks except m0 in 11 operations.

These results where obtained with help of exhaustive search code (see below). This code uses some simplifying assumptions to be able to run fast enough. These assumptions are not obvious, which makes this code not a very good tool to prove optimality of the results. At least the results for separate masks are optimal, which is proven by code from other answer. Optimality of results for mask pairs and triples are "proven" by my code, which means that most likely you cannot do it faster.

Here is the code (C++14 or C++11 plus binary literals):

/* Search for minimal logical expression (using &|^ operations) for bits set
 * exactly N times (in a group of 4 bits).
 *
 * Uses brute force approach to get one or two expressions for one or two
 * values of N at once. To make it possible getting results in reasonable time
 * some simplifications were made:
 * - first 4 operations pre-defined: &| or ^& or ^| for 2 pairs of input values
 * - number of ops limited, if no result found within limit, print "impossible"
 * - no attempts to perform operation on 2 expr with the same left and the same
 *   right parts
 * - unused nodes are not allowed (to minimize number of duplicate attempts)
 * - no attempt to use "not" (except for "m0")
 * 
 * Also these optimizations were tried (with no significant effect):
 * - no more than 2 different ops on same pair of sub-expressions
 * - do not apply same op on same pair of sub-expressions more than once
 *
 * operation set may be extended by "nand" (kNAnd option)
*/

#include <algorithm>
#include <array>
#include <iostream>
#include <bitset>
#include <thread>
#include <mutex>
#include <cassert>
using namespace std;

enum {
    kMaxSize = 17,
    kNTargets = 5,
    kNInputs = 4,
    kNil = 255,
    kNAnd = 0,
};

enum Op {
    OpAnd = kNInputs,
    OpOr,
    OpXor,
    OpNAndL,
    OpNAndR,
};

array<const char*, kNInputs + 3> g_op_str {
    "b0", "b1", "b2", "b3",
    " & ", " | ", " ^ ",
};

array<unsigned, kNTargets> g_target_masks {
    0b0111111111111111, // gives correct result only after additional "not"
    0b0111100000000000,
    0b0000011111100000,
    0b0000000000011110,
    0b0000000000000001,
};
//    0111122222233334
array<unsigned, kNInputs> g_literal_vals {
    0b0100011100001111,
    0b0010010011010111,
    0b0001001010111011,
    0b0000100101111101,
};

unsigned g_targets = 0;
unsigned g_score_limit = 0;
mutex g_print_mutex;

template<typename C, typename T>
ptrdiff_t findIndex(C c, T t)
{
    auto it = find(begin(c), end(c), t);
    return it - begin(c);
}

struct DAGNode
{
    unsigned value;
    uint8_t op;
    bool l_not;
    bool r_not;
    uint8_t left;
    uint8_t right;
    uint8_t use_cnt;

    void clear()
    {
        use_cnt = 0;
    }

    void setLit(const uint8_t lit, const unsigned v)
    {
        value = v;
        op = lit;
        l_not = false;
        r_not = false;
        left = kNil;
        right = kNil;
        use_cnt = 0;
    }
};

struct DAG
{
    array<DAGNode, kMaxSize> nodes;
    unsigned size;
    unsigned score;

    void print(ostream& out, size_t ind)
    {
        auto& node = nodes[ind];

        if (node.op < kNInputs)
        {
            out << g_op_str[node.op];
        }
        else
        {
            out << '(';
            if (node.l_not) out << '~';
            print(out, node.left);
            out << g_op_str[node.op];
            if (node.r_not) out << '~';
            print(out, node.right);
            out << ')';
        }
    }

    void printAll(ostream& out)
    {
        for (size_t i = 2 * kNInputs; i < size; ++i)
        {
            auto& node = nodes[i];
            auto ind = findIndex(g_target_masks, node.value);

            if ((1 << ind) & g_targets)
            {
                out << 'm' << static_cast<char>('0' + ind) << " = ";
                print(out, i);
                out << '\n';
            }
        }
    }
};

bool operator < (const DAG& l, const DAG& r)
{
    return l.score < r.score;
}

class Find
{
    using SPA = array<uint8_t, (kMaxSize - kNInputs) * (kMaxSize - kNInputs)>;
    using EDA = bitset<(kMaxSize - kNInputs) * (kMaxSize - kNInputs) * 5>;
    SPA same_pair_;
    EDA dup_op_;
    DAG dag_;
    DAG best_;
    unsigned ops_;
    unsigned targets_;
    unsigned unused_;

    class UseCnt
    {
        unsigned& unused_;
        uint8_t& use_cnt_;

    public:
        UseCnt(unsigned& unused, uint8_t& use_cnt)
        : unused_(unused)
        , use_cnt_(use_cnt)
        {
            if (!use_cnt_)
                --unused_;

            ++use_cnt_;
        }

        ~UseCnt()
        {
            --use_cnt_;

            if (!use_cnt_)
                ++unused_;
        }
    };

    class PairLim
    {
        uint8_t& counter_;

    public:
        PairLim(SPA& spa, size_t l, size_t r)
        : counter_(spa[(kMaxSize - kNInputs) * l + r])
        {
            ++counter_;
        }

        bool exceeded()
        {
            return counter_ > 2;
        }

        ~PairLim()
        {
            --counter_;
        }
    };

    class DupLim
    {
        EDA& eda_;
        size_t ind_;

    public:
        DupLim(EDA& eda, size_t l, size_t r, size_t op)
        : eda_(eda)
        , ind_(5 * ((kMaxSize - kNInputs) * l + r) + op - kNInputs)
        {
            eda_.flip(ind_);
        }

        bool used()
        {
            return !eda_.test(ind_);
        }

        ~DupLim()
        {
            eda_.flip(ind_);
        }
    };

    unsigned getPos(uint8_t l)
    {
        return dag_.nodes[l].value;
    }

    bool tryNode(uint8_t l, uint8_t r, uint8_t op)
    {
        //DupLim dl(dup_op_, l, r, op);
        //if (dl.used())
        //    return false;

        addNode(l, r, op);
        auto node = dag_.nodes[dag_.size - 1];
        const auto ind = findIndex(g_target_masks, node.value);
        const auto m = (1 << ind) & targets_;

        if (m)
        {
            ++node.use_cnt;
            --unused_;

            if (targets_ == m)
            {
                best_ = dag_;
                --dag_.size;
                --dag_.score;
                return true;
            }

            targets_ &= ~m;
        }

        search();

        if (!m)
        {
            --unused_;
        }

        targets_ |= m;
        --dag_.size;
        --dag_.score;
        return false;
    }

public:
    Find()
    : ops_(kNInputs)
    , targets_(g_targets)
    , unused_(0)
    {
        dag_.score = 0;
        dag_.size = kNInputs;
        best_.score = g_score_limit;
        best_.size = 0;

        for (int i = 0; i < kNInputs; ++i)
            dag_.nodes[i].setLit(static_cast<uint8_t>(i), g_literal_vals[i]);

        fill(begin(same_pair_), end(same_pair_), 0);
    }

    void addNode(const uint8_t l, const uint8_t r, uint8_t op)
    {
        auto& node = dag_.nodes[dag_.size];

        switch (op)
        {
            case OpAnd:
                node.value = getPos(l) & getPos(r);
                break;

            case OpOr:
                node.value = getPos(l) | getPos(r);
                break;

            case OpXor:
                node.value = getPos(l) ^ getPos(r);
                break;

            case OpNAndL:
                node.value = ~getPos(l) & getPos(r);
                break;

            case OpNAndR:
                node.value = getPos(l) & ~getPos(r);
                break;

            default:
                assert(false);
        }

        node.op = op;
        node.l_not = false;
        node.r_not = false;
        node.left = l;
        node.right = r;
        node.use_cnt = 0;

        if (op == OpNAndL)
        {
            node.l_not = true;
            node.op = OpAnd;
        }
        else if (op == OpNAndR)
        {
            node.r_not = true;
            node.op = OpAnd;
        }

        ++dag_.size;
        ++dag_.score;
        ++unused_;
    }

    void search()
    {
        if (dag_.score >= best_.score)
            return;

        for (uint8_t i_r = kNTargets; i_r < dag_.size; ++i_r)
        {
            UseCnt uc_r(unused_, dag_.nodes[i_r].use_cnt);

            if (unused_ > 2 * (best_.score - dag_.score) - 1)
                continue;

            for (uint8_t i_l = kNInputs; i_l < i_r; ++i_l)
            {
                UseCnt uc_l(unused_, dag_.nodes[i_l].use_cnt);

                if (unused_ > 2 * (best_.score - dag_.score) - 2)
                    continue;

                if (dag_.nodes[i_l].left == dag_.nodes[i_r].left &&
                    dag_.nodes[i_l].right == dag_.nodes[i_r].right
                )
                    continue;

                PairLim pl(same_pair_, i_l, i_r);
                if (pl.exceeded())
                    continue;

                if (tryNode(i_l, i_r, OpAnd))
                    return;
                if (tryNode(i_l, i_r, OpOr))
                    return;
                if (tryNode(i_l, i_r, OpXor))
                    return;

                if (kNAnd)
                {
                    if (tryNode(i_l, i_r, OpNAndL))
                        return;
                    if (tryNode(i_l, i_r, OpNAndR))
                        return;
                }
            }
        }
    }

    void print(ostream& out, const char* name)
    {
        if (best_.score < g_score_limit)
        {
            out << name << " ops = " << best_.score << '\n';
            best_.printAll(out);
            out << '\n';
        }
        else
        {
            out << name << " impossible\n\n";
        }
    }

    void process(ostream& out, const char* name)
    {
        search();
        lock_guard<mutex> lk(g_print_mutex);
        print(out, name);
    }
};

unsigned readTargets(char* str)
{
    unsigned num = 0;

    for (; *str; ++str)
    {
        if (*str >= '0' && *str <= '4')
        {
            g_targets |= 1 << (*str - '0');
            ++num;
        }
    }

    return num;
}

void usage()
{
    cerr << "Usage: bitcnt [0-4]*\n"
        "example: bitcnt 23 (to find targets m2,m3)\n";
    exit(1);
}

int main(int argc, char **argv) {
    if (argc > 1)
        g_score_limit = 6 + 2 * readTargets(argv[1]);
    else
        usage();

    // set score_limit to 10 for m1,m2,m3 (with nand), time ≈ 1h

    array<Find, 3> finders;
    finders[0].addNode(0, 1, OpAnd);
    finders[0].addNode(0, 1, OpOr);
    finders[0].addNode(2, 3, OpAnd);
    finders[0].addNode(2, 3, OpOr);

    finders[1].addNode(0, 1, OpAnd);
    finders[1].addNode(0, 1, OpXor);
    finders[1].addNode(2, 3, OpAnd);
    finders[1].addNode(2, 3, OpXor);

    finders[2].addNode(0, 1, OpXor);
    finders[2].addNode(0, 1, OpOr);
    finders[2].addNode(2, 3, OpXor);
    finders[2].addNode(2, 3, OpOr);

    auto t0 = thread([&finders]{finders[0].process(cout, "&|");});
    auto t1 = thread([&finders]{finders[1].process(cout, "&^");});
    auto t2 = thread([&finders]{finders[2].process(cout, "^|");});

    t0.join(); t1.join(); t2.join();
}

First version of this answer is still correct but obsoleted by recent results:

m2 in 10 ops:

x1 = b1 ^ b2;
x2 = b3 ^ b4;
m2 = (x1 & x2) | (((b1 & b2) ^ (b3 & b4)) & ~(x1 | x2));

m1 in 8 ops:

m1 = (b1 ^ b2 ^ b3 ^ b4) & ~((b1 & b2) | (b3 & b4));
0
Ilmari Karonen On

Start by considering the trivial case where the bit-vectors have length 1, i.e. they're just single bits. What you're really trying to do is to count the number of these bits that are set; mapping the count to the bitmask you want is then a relatively trivial exercise.

The trick is to find a way to count the bits using only bitwise operations (i.e. AND, OR, XOR and NOT), with each bit of the resulting count ending up in a separate variable. If you can do that, then you can do it simultaneously for as many bits as fit in a variable. This technique is known as bit-slicing or SIMD-within-a-register (SWAR).

So what you're basically trying to do is to implement an n-input single-bit binary adder (where, in your case, n = 4) using bitwise logic operations. Fortunately, this problem has been extensively studied by digital circuit designers.

The general solution involves maintaining an array of k bit vectors t1, t2, ..., tk (where 2k > n) storing the bits of the counts, and adding each input bit vector to the counts one at a time:

// initialize counts to zero
int count[k];
for (int j = 0; j < k; j++) count[j] = 0;

// count input bits
for (int i = 0; i < n; i++) {
    int carry = input[i];
    for (int j = 0; j < k && carry != 0; j++) {
        int temp = count[k];
        count[k] = carry ^ temp;
        carry = carry & temp;
    }
    // XXX: if carry != 0 here, some of the counts have overflowed
}

Then you can extract your bit masks from the counts:

int masks[n+1];
for (int i = 0; i <= n; i++) {
    masks[n] = ~0;  // initialize all mask bits to 1
    for (int j = 0; j < k; j++) {
        masks[n] &= (n & (1 << j) ? count[j] : ~count[j]);
    }
}

Of course, if the number of inputs is small and fixed, we can optimize the code for that specific value. For example, for n = 4, we can use:

// count input bits, store counts in bit-planes c0, c1, c2
int c0 = b0 ^ b1 ^ b2 ^ b3;
int c2 = b0 & b1 & b2 & b3;
int c1 = ((b0 & b1) | ((b0 ^ b1) & b2) | ((b0 ^ b1 ^ b2) & b3)) & ~c2;

// build masks from bit-planes
int m0 = ~c0 & ~c1 & ~c2;
int m1 =  c0 & ~c1 & ~c2;
int m2 = ~c0 &  c1 & ~c2;
int m3 =  c0 &  c1 & ~c2;
int m4 =  c2;  // XXX: count cannot be more than 4

A completely naïve compiler would generate 23 AND / OR / XOR operations and 9 NOTs from this code. However, a decent compiler should cache the values of ~c0, ~c1 and ~c2, saving up to 6 NOTs, and probably also some of the repeated subexpressions like b0 & b1, b0 ^ b1, b0 ^ b1 ^ b2, ~c1 & ~c2 and c1 & ~c2, saving up to 6 AND/XORs, for a total of 17 AND/OR/XORs plus 3 NOTs = 20 ops, which is pretty close to Jonathan Mee's 19-op solution.

In fact, we can do a little better by realizing that we don't actually need to calculate c1, but can instead work with c12 = c1 | c2. We can also optimize the mask building a bit further by noting that c2 cannot be set if c0 (or c1) is:

// count input bits, store counts in bit-planes c0, (c1 = c12 & ~c2), c2
int c0 = b0 ^ b1 ^ b2 ^ b3;
int c2 = b0 & b1 & b2 & b3;
int c12 = ((b0 & b1) | ((b0 ^ b1) & b2) | ((b0 ^ b1 ^ b2) & b3));

// build masks from bit-planes
int m0 = ~c0 & ~c12;
int m1 =  c0 & ~c12;       // c0 implies ~c2
int m2 = ~c0 &  c12 & ~c2;
int m3 =  c0 &  c12;       // c0 implies ~c2
int m4 =  c2;              // c2 implies ~c0 & ~c1

That's 19 AND/ORs and 5 NOTs for a naïve compiler, but trivial common subexpression optimization should reduce that to 15 AND/ORs and 3 NOTs, for a total of 18 ops.

(Of course, the real performance gain on modern processors will come from instruction reordering to reduce pipeline stalls. I suspect this code should do fairly well in that respect: while the masks obviously depend on the counts, and the counts on the inputs, there are no internal dependencies within either the counts or the masks, so there should be plenty of room for reordering.)


Update 23 Nov 2014: The code I had above earlier was untested, and contained a bug in the expression for c1 / c12. I've fixed it now, and even managed to make it slightly more optimizable, saving one op after common subexpression elimination (but costing one extra op for a naïve compiler). It still uses more ops than CodesInChaos' sort-based solution, though.