I have the following string that I would like to Huffman-encode and store efficiently into a bit array:
>>> print sequence
GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|
The frequencies of the symbols in sequence
are:
>>> print freqTuples
[(0.40540540540540543, 'A'), (0.1891891891891892, 'T'), (0.16216216216216217, 'C'), (0.16216216216216217, 'G'), (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
I translate this into a Huffman code dictionary:
>>> print codeDict
{'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}
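(For reference, a code with these lengths can be derived with a standard heapq-based Huffman construction. This is a sketch, not the code I actually used; the exact 0/1 assignments depend on tie-breaking, so only the code lengths are guaranteed to match the dictionary above.)

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Compute Huffman code lengths from (probability, symbol) pairs."""
    tiebreak = count()  # unique counter keeps the heap from comparing lists
    heap = [(p, next(tiebreak), [sym]) for p, sym in freqs]
    heapq.heapify(heap)
    lengths = {sym: 0 for _, sym in freqs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:  # each merge adds one bit to every member
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths

freqTuples = [(0.40540540540540543, 'A'), (0.1891891891891892, 'T'),
              (0.16216216216216217, 'C'), (0.16216216216216217, 'G'),
              (0.05405405405405406, 'N'), (0.02702702702702703, '|')]
lengths = huffman_code_lengths(freqTuples)  # {'A': 1, 'T': 3, 'C': 3, 'G': 3, 'N': 4, '|': 4}
```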
I then used the Python bitstring
package to translate the string, character by character, into an instance of the BitArray
class (which I call bitArray
), containing the bits for each character encoded with its respective Huffman code:
>>> print bitArray.bin
0b001000010100100110101100111100110101101100000100101100000001101010100000010000010111
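(The same bit string can be produced without bitstring by simply concatenating the per-character codes; a plain-Python sketch using the sequence and codeDict above:)

```python
sequence = "GTCAGGACAAGAAAGACAANTCCAATTNACATTATG|"
codeDict = {'A': '1', 'C': '010', 'G': '001', 'N': '0110', 'T': '000', '|': '0111'}

# Concatenating the prefix codes yields 84 bits for the 37 characters
bits = ''.join(codeDict[ch] for ch in sequence)
```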
Here is the bit array in bytes:
>>> print bitArray.tobytes()
!I\xac\xf3[\x04\xb0\x1a\xa0Ap
I must use tobytes()
instead of bytes
, as the bit array I generate does not divide evenly into 8-bit segments.
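(tobytes() zero-pads the final partial byte; the same packing can be sketched in plain Python, assuming the 84-bit string shown above:)

```python
bits = "001000010100100110101100111100110101101100000100101100000001101010100000010000010111"

# Pad to a multiple of 8 with trailing zeros, then pack each group into a byte
padded = bits + '0' * (-len(bits) % 8)
packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
# 84 bits -> 88 bits -> 11 bytes
```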
When I calculate the storage efficiency of the BitArray
representation (the ratio of the sizes of the bit array and the input string), I get worse performance than if I had left the input string unencoded:
>>> sys.getsizeof(bitArray.tobytes()) / float(len(sequence))
1.2972972973
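(Note that sys.getsizeof reports the size of the whole Python object, including its fixed header, not just the payload; a quick check in CPython:)

```python
import sys

payload = b'x' * 11  # same length as the 11-byte encoded output
overhead = sys.getsizeof(payload) - len(payload)
# overhead is the bytes-object header (a few dozen bytes in CPython),
# which dominates the measurement for short sequences
```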
Am I measuring storage efficiency correctly? (If I encode longer input strings, the ratio improves, but it seems to approach an asymptotic limit of around 0.28. I'd like to confirm whether this is the right way to measure things.)
Edit
The following two approaches yield different answers:
>>> print len(bitArray.tobytes()) / float(len(mergedSequence))
0.297297297297
>>> print bitArray.len / (8.*len(mergedSequence))
0.283783783784
I'm not sure which to believe. But in the process of writing data to storage, I think I would need the byte representation, which makes me inclined towards choosing the first result.
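(The discrepancy between the two is just the zero-padding in the final byte; a sketch with the numbers above, where mergedSequence is assumed to be the 37-character input:)

```python
n_chars = 37                  # length of the input sequence
n_bits = 84                   # bits in the Huffman-encoded BitArray
n_bytes = (n_bits + 7) // 8   # tobytes() pads up to 11 whole bytes

byte_ratio = n_bytes / (1.0 * n_chars)  # 11/37  = 0.2972...
bit_ratio = n_bits / (8.0 * n_chars)    # 84/296 = 0.2837...
```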
(The getsizeof calculation further above implies that the encoded version is 30% longer than the original sequence, which can't be right.)
I don't think you want to use
getsizeof
here. If you want to minimize the size of the Python object, you should use
getsizeof(sequence)
as well, rather than
len
, so that you compare like with like. If instead you want to do what Huffman coding is meant to do and minimize the binary representation, then you want to use
len
on both (assuming the sequence is represented as one byte per character). So your real ratio is 11 / 37.
I assume you're using Huffman coding as an exercise, as this doesn't seem like a logical way to efficiently store what is just a four-bit code with a termination character. At least it would be better to use arithmetic coding, which will allow you to use base-5 encoding instead of base-2, which is optimal for 5 possible characters.
Really, I would assume that in a sequence long enough to be worth compressing, there is a known ratio of G:A:C:T, and/or that a fixed-length 2-bit encoding will be just as efficient (as the ratios approach 1:1:1:1), since you don't really need to encode the termination character.
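(A fixed-length 2-bit code looks like this; a sketch with a hypothetical bit assignment, where N and the terminator would need separate handling, e.g. an out-of-band length:)

```python
two_bit = {'A': '00', 'C': '01', 'G': '10', 'T': '11'}  # hypothetical assignment

seq = "GTCAGGACAAGAAAGACAA"  # a run containing only the four bases
bits = ''.join(two_bit[ch] for ch in seq)
ratio = len(bits) / (8.0 * len(seq))  # 2 bits per 8-bit character = 0.25
```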