The problem seems quite suited for a GPU, FPGA, etc. (because it's quite parallel); but I'm looking for a CPU-based and somewhat architecture independent solution right now. I think a good answer could be just some unimplemented pseudo-code, but my program is in pure C++20, so the answer should be relevant in that context (e.g., don't assume something very-high-level like Python, don't use compiler specific intrinsics or assembly). I'm not expecting mind-blowing performance, but I do want the answer to be significantly faster than the three implementations I already have (in this file): a very naive approach without generator matrices, and a naive "multiply input vector with dense generator matrix" approach. The answer should work for arbitrary code word and input lengths, but the important code word lengths are under, say, 2000 bits, and small input lengths are not important.
Some preliminaries: the binary numbers in question have addition and multiplication defined as, respectively, the "exclusive-or" (XOR) and "and" logical/bitwise operations. This extends to binary matrix multiplication.
Hamming codes are old and well-known binary linear block error detecting/correcting codes. Each code word is a string of bits where some bit-positions are designated as parity bits, used for error detection and correction, while the rest of the bits are data bits, which are just copies of the input bits if there was no error. We are considering only Hamming codes where the parity bits are at traditional power-of-two positions (i.e., with 1-based numbering: bit 1, bit 2, bit 4, bit 8, ...). Thus each possible code can be determined using its length n
(number of bits in a code word) or its rank k
(number of data bits in a code word). A Hamming code can be referred to as (n, k)
, e.g., (7, 4)
or (40, 34)
.
Each code has a generator matrix, a binary matrix with which an input vector can be multiplied to obtain a code word. Thus the set of the code words of a certain code is exactly the set of linear combinations of the rows of the generator matrix.
The desired program is basically a coder: it takes as input an (n, k)
pair to give the code (yes, this is redundant - only one of the pair is needed in essence) and an arbitrary binary message, divides the message into k
-bits long sub-messages and outputs a sequence of n
-bits long code words, each encoding one sub-message.
I'm hoping for an answer leveraging properties specific to our generator matrices here (e.g., special representation for the generator matrix and special vector-matrix multiplication algorithm), so here are examples of generator matrices for some codes:
Hamming code (3, 1)
(has only the code words 000
and 111
):
111
Hamming code (5, 2)
(has only the code words 00000
, 11100
, 10011
and 01111
):
11100
10011
Hamming code (6, 3)
:
111000
100110
010101
(Notice how each generator matrix contains the generator matrices of all the smaller codes.)
Hamming code (150, 142)
(all zeros left blank so the ones would stand out more):
111
1 11
1 1 1
11 1 1
1 11
1 1 1
11 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 11
1 1 1
11 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 11
1 1 1
11 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
11 1 1 1 1 1
1 11
1 1 1
11 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
11 1 1 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
11 1 1 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
11 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1
11 1 1 1 1 1
1 1 1 1 1 1
1 1 1 1 1 1 1
1 1 1 1 1 1 1
11 1 1 1 1 1 1
1 11
1 1 1
11 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
11 1 1 1 1
1 1 1
1 1 1 1
1 1 1 1
11 1 1 1
1 1 1 1
1 1 1 1 1
1 1 1 1 1
Notice how there are relatively few ones among all the zeros in most generator matrices, and there's definitely a pattern, shape even, to the matrices.
I'm weak in all relevant areas here, so please try to correct any possible mistakes I made.