Entropy Rate of a source of information with memory


I have some English text and calculated its entropy. However, I noticed that compression algorithms based on LZ methods compress it to well below the limit given by that entropy.
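For concreteness, the kind of per-character entropy calculation I mean is something along these lines (a minimal sketch; the file name is just a placeholder for the text sample):

```python
import math
from collections import Counter

def entropy_per_char(text):
    """Memoryless (i.i.d.) entropy in bits per character:
    H = -sum_x p(x) * log2(p(x)), estimated from character frequencies."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Placeholder file name: any reasonably large sample of English text.
text = open("corpus.txt", encoding="utf-8").read()
print(entropy_per_char(text))
```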

That's because a source of information that models English text has memory, so the limit on compression is given by the entropy rate, not by the entropy of that source.

I have seen the definition of the entropy rate of a source with memory, but I am wondering how it can actually be calculated, as an algorithm or pseudocode, for text written in English.

Any ideas?

Thanks for the help.


1 Answer

rmalouf:

In general, estimating the entropy rate of a source with memory is a hard problem, and when there are lots of long-distance dependencies (as there are in natural language) it's especially difficult. Essentially what you need to do is to construct a grammar of the language based on the sample, and then calculate the entropy rate of the grammar. The particular assumptions that you make when extracting that grammar are going to make a big difference in the entropy rate you end up with, so the best you'll be able to do is to estimate some fairly weak bounds on the entropy rate of the actual source.
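One crude but concrete way to see the idea, without full grammar induction, is to approximate the source with an order-k Markov (character n-gram) model and compute the conditional entropy H(X_k | X_0 ... X_{k-1}): for a stationary source this decreases toward the entropy rate as k grows, until data sparsity makes the plug-in estimate unreliable. A minimal sketch (the file name is a placeholder):

```python
import math
from collections import Counter

def conditional_entropy(text, order=3):
    """Plug-in estimate of H(X_k | X_0 .. X_{k-1}) in bits per character
    for an order-k character model. For a stationary source this upper-
    bounds the entropy rate; on finite samples, large k under-estimates it."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(len(text) - order):
        ctx = text[i:i + order]
        joint_counts[(ctx, text[i + order])] += 1
        context_counts[ctx] += 1

    total = sum(joint_counts.values())
    h = 0.0
    for (ctx, _), c in joint_counts.items():
        h -= (c / total) * math.log2(c / context_counts[ctx])
    return h

# Placeholder file name; watch the estimate fall as the order grows,
# then become unreliable once contexts get too sparse.
text = open("corpus.txt", encoding="utf-8").read()
for k in range(6):
    print(k, round(conditional_entropy(text, order=k), 3))
```

This is exactly the kind of weak upper bound described above: the Markov model is a very impoverished "grammar", and it misses the long-distance dependencies entirely.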

A common practice is to estimate the entropy rate simply by compressing a large sample using a standard compression program like gzip, though, as Cosma Shalizi points out, that's a terrible idea. If you're going to use a general data compression algorithm, LZ76 is a better choice; Fermín Moscoso del Prado has a paper coming out looking at some of the alternatives.
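For what it's worth, the gzip version of that estimate is only a few lines (and, per the caveat above, it tends to overestimate the entropy rate because of header overhead and gzip's bounded window):

```python
import gzip

def gzip_bits_per_char(text):
    """Crude estimate: compressed size in bits divided by the number of
    characters. Header overhead and gzip's limited window and model mean
    this is an overestimate of the true entropy rate."""
    compressed = gzip.compress(text.encode("utf-8"), compresslevel=9)
    return 8 * len(compressed) / len(text)

# Placeholder file name, as before.
text = open("corpus.txt", encoding="utf-8").read()
print(gzip_bits_per_char(text))
```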

While compression algorithms do sort of give you a kind of grammar, it would be even better to use a grammar that accurately captures the long-distance dependencies in the language. Unfortunately, learning realistic natural language grammars from samples of raw text is very much an open research problem. But there is some promising work on learning finite-state approximations to natural language, which could be used to produce good entropy rate estimates. Check out CSSR for one way of approaching the problem along those lines.