Good compression scheme for continued fraction terms?


So I'm implementing a continued fraction library for handling a subset of quadratic integers and rational numbers. Continued fraction terms are represented by unsigned integers. I've noticed the following general patterns when working with continued fractions:

  1. Most terms are small single digit numbers, with 1 being the most common.
  2. Some terms can be enormous, the largest possible for my application is 366 bits, but these are extremely rare.
  3. A large term represents an especially good rational approximation, so fractions containing large terms usually have fewer terms overall.
  4. The worst possible continued fraction is the golden ratio, and a rational approximation of that with 366 bits of precision corresponds to roughly 525 1's in a row.
  5. Random rational numbers do not generally have large runs of the same number, but may have two to four in a row, with 1 again the most common.

So I have a limit on both the number of terms and the size of a term, with the number of terms roughly inversely proportional to their size (both also depend on the size of the fraction's numerator and denominator). Storing these terms in machine words or even bytes is usually very wasteful of space, and I still need to handle multi-word arithmetic in the worst case. Given that roughly inverse relationship, I've been trying to find or come up with a good compression scheme so that I don't waste so much space storing the integer terms.

I've considered Huffman encoding, as the speed of encoding and decoding is nice, but I'm not sure how to come up with the probabilities for the code values. I have a vague intuition that Stern-Brocot Trees might offer a hint, as they are binary trees with a direct relationship to continued fractions.

Does anyone know of a good compression approach for handling lots of small numbers with occasional huge ones, where runs of the same number are typically short (but might be long in rare cases)? In particular, I need encoding and decoding to be fairly fast (O(n*lg(n)) is the absolute worst I can tolerate, with O(n) or better preferred), and I need to be able to locate individual terms by position so that I know which term I'm operating on (fourth term, fifth term, etc.).

Also, I'm not interested in using any third party real number or continued fraction libraries. I've looked at several and they are either insufficient or overkill for my needs, and I'd like the experience of coding this up myself for my own edification.

Update

I've also learned of the Gauss–Kuzmin distribution, which gives the probability that a given continued fraction term equals k for a random number uniformly distributed between 0 and 1. It is

P(k) = -lg(1.0 - 1.0/((k + 1.0)**2)) in Python pseudo-code, where lg is the base-2 logarithm. This is the limiting distribution as the term index approaches infinity, so my case is somewhat more restricted (though 2**366 is still huge). "The Entropy of Continued Fractions" by Linas Vepstas gives the (information) entropy of the Gauss-Kuzmin distribution as approximately 3.43 bits per term. My maximum denominator is large enough that the information entropy of my continued fractions is probably close to that limit, though the graph in the linked article shows the limit is approached quite slowly, so the entropy is lower for relatively small denominators.
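Here's a quick Python sketch I put together to sanity-check those figures (the function name is mine; nothing here beyond the standard math module):

    # Tabulate the Gauss-Kuzmin probabilities P(k) = -lg(1 - 1/(k+1)^2)
    # and estimate the distribution's entropy by direct summation.
    import math

    def gauss_kuzmin(k):
        """Probability that a continued fraction term equals k."""
        return -math.log2(1.0 - 1.0 / (k + 1.0) ** 2)

    # The tail falls off like 1/k^2, so summing the first million terms
    # gets very close to the ~3.43 bits/term figure cited above.
    entropy = sum(-gauss_kuzmin(k) * math.log2(gauss_kuzmin(k))
                  for k in range(1, 10 ** 6))
    print(gauss_kuzmin(1))  # ~0.415, i.e. 1 is by far the most common term
    print(entropy)          # ~3.43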


There are 5 answers

7
David Eisenstat (BEST ANSWER)

One possibility is a simple prefix code: write the number in binary as 1x, encode each bit of x as 0 -> 10 and 1 -> 11, then append the terminator 0.

Table:

1 -> 0
2 -> 100
3 -> 110
4 -> 10100
5 -> 10110
6 -> 11100
7 -> 11110
8 -> 1010100

The corresponding Huffman code probabilities here for n are something like Theta(1/n^2) (waving my hands a bit because the alphabet is infinite).
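A rough Python sketch of this code (bit-strings rather than a packed bitstream, purely for illustration; function names are mine):

    # Prefix code above: write n in binary as 1x, encode each bit of x
    # as 0 -> "10", 1 -> "11", and append the terminator "0".
    def encode_term(n):
        assert n >= 1
        x = bin(n)[3:]  # binary digits of n after the leading 1
        return "".join("1" + b for b in x) + "0"

    def decode_terms(bits):
        i, out = 0, []
        while i < len(bits):
            n = 1
            while bits[i] == "1":            # each "1b" pair appends bit b
                n = (n << 1) | int(bits[i + 1])
                i += 2
            i += 1                           # consume the terminating "0"
            out.append(n)
        return out

    assert encode_term(8) == "1010100"       # matches the table
    assert decode_terms("0100110") == [1, 2, 3]

Note that a term of 1 costs a single bit, so the golden-ratio worst case of roughly 525 ones in a row fits in roughly 525 bits.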

0
ttw

Another possibility is to use some sort of universal encoding scheme. As randomly chosen continued fractions rarely have really big partial quotients, universal encoding should be useful.
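To name one concrete universal code (my example, not necessarily the one ttw had in mind), Elias gamma fits this shape: a term n costs 2*floor(lg n) + 1 bits, so 1s stay cheap while 366-bit outliers remain representable. A minimal sketch, again with bit-strings for clarity:

    # Elias gamma coding: emit floor(lg n) zeros, then the binary
    # representation of n itself (which starts with a 1).
    def elias_gamma(n):
        assert n >= 1
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    def elias_gamma_decode(bits, i=0):
        zeros = 0
        while bits[i] == "0":                # count the length prefix
            zeros += 1
            i += 1
        n = int(bits[i : i + zeros + 1], 2)  # the value itself
        return n, i + zeros + 1              # value and next read position

    assert elias_gamma(5) == "00101"
    assert elias_gamma_decode("00101") == (5, 5)

Gamma's implied symbol probability is roughly proportional to 1/n^2, which lines up with the Gauss-Kuzmin tail discussed in the question.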

3
Mark Adler

Your distribution seems amenable to Rice coding. You can tune the coding parameter, call it k, to your data. The coding takes the value of the number's bits above the low k bits and transmits that many 1 bits, followed by a 0. Then you transmit the low k bits directly.
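A small sketch of that scheme, with k = 1 as one hypothetical tuning for data dominated by 1s and 2s:

    # Rice coding with parameter k >= 1: send n >> k in unary (that many
    # 1 bits, then a 0), then the low k bits verbatim.
    def rice_encode(n, k):
        q, r = n >> k, n & ((1 << k) - 1)
        return "1" * q + "0" + format(r, "0%db" % k)

    def rice_decode(bits, k, i=0):
        q = 0
        while bits[i] == "1":                # unary quotient
            q += 1
            i += 1
        i += 1                               # skip the 0 separator
        r = int(bits[i : i + k], 2)
        return (q << k) | r, i + k

    assert rice_encode(1, 1) == "01"
    assert rice_encode(5, 1) == "1101"
    assert rice_decode("1101", 1) == (5, 4)

One caveat: with any fixed k, a rare 366-bit term would make the unary quotient astronomically long, so you would probably want an escape code for the huge outliers.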

0
dimitri

You can use arithmetic encoding considering each positive integer as a symbol in your source alphabet. It doesn't matter that there are infinitely many as the relative probability of larger and larger integers drops to zero.

In fact, considering a uniform distribution on [0,1), you can set the conditional probability of each new integer a_n in the continued fraction expansion (i.e. each new symbol from the entropy source) to be P(a_n = k) = 1/k - 1/(k+1). Considering the very first integer shows why this probability makes sense: half the numbers in [0,1) have a_1 = 1, one sixth of them have a_1 = 2, and so on.

Also, you may need to flip the direction of the arithmetic encoding with each new symbol for decoding to be unambiguous. Unfortunately, arithmetic en/decoding is not very fast.
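Here is a toy version of this model using exact rationals, deliberately simple and slow (it has none of the usual fixed-precision renormalization, and it sidesteps the direction-flipping issue by encoding the term sequence directly and assuming the decoder knows the term count):

    # Toy arithmetic coder for the model P(a = k) = 1/k - 1/(k+1).
    # Exact rationals keep it unambiguous but make it very slow.
    from fractions import Fraction
    import math

    def encode(terms):
        low, width = Fraction(0), Fraction(1)
        for k in terms:
            lo, hi = 1 - Fraction(1, k), 1 - Fraction(1, k + 1)  # cumulative
            low, width = low + width * lo, width * (hi - lo)
        m = 0                   # shortest dyadic v / 2**m inside the interval
        while True:
            v = math.ceil(low * 2 ** m)
            if Fraction(v, 2 ** m) < low + width:
                return v, m
            m += 1

    def decode(v, m, nterms):
        x, terms = Fraction(v, 2 ** m), []
        for _ in range(nterms):
            k = int(1 / (1 - x))            # invert the cumulative distribution
            lo, hi = 1 - Fraction(1, k), 1 - Fraction(1, k + 1)
            x = (x - lo) / (hi - lo)        # rescale into the chosen sub-interval
            terms.append(k)
        return terms

    v, m = encode([1, 2, 1, 4])
    assert decode(v, m, 4) == [1, 2, 1, 4]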

0
AudioBubble

You might consider dropping continued fractions and instead implementing their mutated cousin: continued logarithms. See the very bottom of Gosper's paper for a discussion and implementation of basic arithmetic.

There's even a hardware implementation for hugely parallel arithmetic on continued logarithms, with the same asymptotic complexity as FFT, but far simpler and thus much lower constants.

If you're looking for anything more exotic than can be built from just the reversible {+,-,*,/} operators, please see my new question on the floor operator.

As you've seen, continued fractions tend to blow up in term bit size for very large integers or very small fractions. This oscillation of gigantic terms with small terms is what you would need to exploit in any compression scheme.

Continued logarithms, on the other hand, split up large terms into a sort of 'recursive scientific notation,' as Bill Gosper puts it in that paper. Each term coroutine emits or consumes only very small messages, which lend themselves to a natural run-length encoded serialization describing the 'log base 2 of the number remaining to be described.'
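To make that concrete, here is a rough sketch of the binary recurrence as I understand it from Gosper's description, for rational x >= 1: a run of 1s counts the power of 2 stripped off, and each 0 marks a reciprocal step.

    # Binary continued-logarithm recurrence for rational x >= 1: while
    # x >= 2 emit a 1 and halve; otherwise emit a 0 and recurse on
    # 1/(x - 1); stop at exactly 1.
    from fractions import Fraction

    def continued_log_bits(x):
        assert x >= 1
        bits = []
        while x != 1:
            if x >= 2:
                bits.append(1)   # strip a factor of 2
                x /= 2
            else:
                bits.append(0)   # x = 1 + 1/x', recurse on x'
                x = 1 / (x - 1)
        return bits

    # 10 = 2**3 * (1 + 1/2**2): three 1s, a 0, then two 1s
    assert continued_log_bits(Fraction(10)) == [1, 1, 1, 0, 1, 1]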

The unfortunate side effect of using continued logarithms is that Hurwitz numbers become patternless, whereas in continued fractions they are very regular. However, most (though not all) quadratic surds remain periodic, which can itself be thought of as a compression scheme.

Once in the compact run-length encoded format, you can apply traditional compression techniques suited to runs of small numbers, such as the LZW that Mark Ransom describes in the question's comments.