Choosing a magic byte least likely to appear in real data

381 views Asked by At

I hope this isn't too opinionated for SO; it may not have a good answer.

In a portion of a library I'm writing, I have a byte array that gets populated with values supplied by the user. These values might be of type Float, Double, Int (of different sizes), etc. with binary representations you might expect from C, say. This is all we can say about the values.

I have an opportunity for an optimization: I can initialize my byte array with the byte MAGIC, and then whenever no byte of the user-supplied value is equal to MAGIC I can take a fast path, otherwise I need to take the slow path.

So my question is: what is a principled way to go about choosing my magic byte, such that it will be reasonably likely not to appear in the (variously-encoded and distributed) data I receive?

Part of my question, I suppose, is whether there's something like a Benford's law that can tell me something about the distribution of bytes in many sorts of data.

2

There are 2 answers

3
Potatoswatter On BEST ANSWER

Capture real-world data from a diverse set of inputs that would be used by applications of your library.

Write a quick and dirty program to analyze dataset. It sounds like what you want to know is which bytes are most frequently totally excluded. So the output of the program would say, for each byte value, how many inputs do not contain it.

This is not the same as least frequent byte. In data analysis you need to be careful to mind exactly what you're measuring!

Use the analysis to define your architecture. If no byte never appears, you can abandon the optimization entirely.

1
grtamayo On

I was inclined to use byte 255 but I discovered that is also prevalent in MSWord files. So I use byte 254 now, for EOF code to terminate a file.