Ways to measure bit sequence complexity

197 views Asked by At

I'm looking for a simple way to estimate the complexity of a sequence of bits of a fixed size (probably a maximum of length 10). For example, I imagine 0000000 and 111111 aren't very complex at all, but 101010 and 101101 sit elsewhere on the spectrum.

I know that Kolmogorov complexity is uncomputable, but is it possible that it can be programmed simply for sequences of fixed (and small) length with a binary alphabet? Or is there another measure that might only approximate the measure but is much more easily computable?

It is important that the measure be rather simple so that I can explain it to other (though rather well-educated) people.

Thanks.

1

There are 1 answers

1
Mike Dunlavey On BEST ANSWER

You need to have a procedure for counting the complexity, and there is no best procedure.

For example, you could run-code the string, and count the number of runs.

You could run the string through a LZW compressor (like ZIP), and report the size it compressed to.

You don't have to choose just one method. Your method could be to try five different methods, and report the one that gave you the smallest measure.

For example, you could try initially inverting every other bit, and then try the run-coding. Or try inverting bits 2 and 3, then bits 6 and 7, and so on.

These are possible ways to get a measure, but that's all they are.

The Kolmogorov complexity is the size of the smallest program, in bits, that could reproduce the string, and that depends on the language (whether high-level, assembly, machine, or turing machine, or code to drive a special program you've created for the purpose).

You know it exists because you know there are upper and lower bounds. Any program that can reproduce the string gives you an upper bound. You know the empty program cannot, so that gives you a lower bound of zero. So it is somewhere in between, but that doesn't mean you can find it.

Keep in mind, it doesn't really make sense to talk about the complexity of just one string, because the measurement tool could be optimized for that string. You really need to be talking about a population of strings, just to keep the tool honest.