Hamming numbers by intervals


Here's a somewhat different approach to generating the sequence of Hamming numbers (aka regular numbers, 5-smooth numbers) based on the interval from one number in the sequence to the next. Here's an example plot of said intervals:

[Plot: intervals between consecutive Hamming numbers]

So there is a relatively limited number of discrete intervals separating one number from the next, and the intervals get smaller as H increases. It's often noted that Hamming numbers get sparser as they increase in size, which they do in absolute terms, but in another sense (proportionally) they get closer together.

Basically, as H goes up there is greater opportunity for a product 2^i*3^j*5^k, where i,j,k may be positive, zero or negative integers, to result in a fraction just above 1.0.

Turns out that a table of just 119 intervals (i,j,k triples) covers Hamming numbers up to about 10^10000. That's about the first 1.59 trillion Hamming numbers. Such a table (C header file), sorted by the interval size from small to large, is here. Given a Hamming number, to find the next one all that's required is to find the first entry in the table where multiplication (addition of respective exponents) would yield a result with non-negative exponents i, j and k.
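The lookup described above can be sketched in a few lines of C. This is only an illustration: `Triple`, `toy_table`, `next_hamming` and `triple_value` are names made up here, and the eight-entry table was worked out by hand for the Hamming numbers 1 through 16 only. It is not the 119-entry table from the header file.

```c
#include <assert.h>
#include <stddef.h>

/* An exponent triple (i, j, k) standing for 2^i * 3^j * 5^k. */
typedef struct { int i, j, k; } Triple;

/* Toy interval table, sorted by ratio from small to large.
   Hand-derived for the Hamming numbers 1..16 only (an assumption of this
   sketch); it is NOT the 119-entry table discussed in the post. */
static const Triple toy_table[] = {
    { 4, -1, -1},  /* 16/15 ~ 1.0667 */
    { 1, -2,  1},  /* 10/9  ~ 1.1111 */
    {-3,  2,  0},  /*  9/8  = 1.125  */
    { 1,  1, -1},  /*  6/5  = 1.2    */
    {-2,  0,  1},  /*  5/4  = 1.25   */
    { 2, -1,  0},  /*  4/3  ~ 1.3333 */
    {-1,  1,  0},  /*  3/2  = 1.5    */
    { 1,  0,  0},  /*  2/1  = 2      */
};
enum { TOY_LEN = sizeof toy_table / sizeof toy_table[0] };

/* Advance *h to the next Hamming number: take the first (smallest) interval
   whose exponent sums are all non-negative. Returns the index of the entry
   used, or -1 if no entry fits (the table does not cover this range). */
static int next_hamming(Triple *h, const Triple *table, size_t len)
{
    assert(h != NULL && table != NULL);
    for (size_t n = 0; n < len; n++) {
        int i = h->i + table[n].i;
        int j = h->j + table[n].j;
        int k = h->k + table[n].k;
        if (i >= 0 && j >= 0 && k >= 0) {
            h->i = i; h->j = j; h->k = k;
            return (int)n;
        }
    }
    return -1;
}

/* Integer value of a small triple; only safe while the result fits in long. */
static long triple_value(Triple h)
{
    long v = 1;
    for (int n = 0; n < h.i; n++) v *= 2;
    for (int n = 0; n < h.j; n++) v *= 3;
    for (int n = 0; n < h.k; n++) v *= 5;
    return v;
}
```

Starting from {0,0,0} (the number 1) and calling next_hamming() eleven times walks through 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16. Only additions and comparisons are used in the search itself; triple_value() is there just to print or check small results.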

E.g., the millionth Hamming number is 2^55*3^47*5^64 which is about 5.1931278e83. The next Hamming number after that is 2^38*3^109*5^29 or about 5.1938179e83. The first appropriate table entry is:

{-17,62,-35}, // 1.000132901540844

So while those numbers are separated by about 7e79, their ratio is 1.000132901540844. Finding the next number requires trying at most 119 entries, using only additions and comparisons (no multiplications). Also, at 3 short ints per entry, the table needs under 1 KB of memory. The algorithm is basically O(1) in memory and O(n) in time, where n is the length of the sequence.

One way to speed it up: rather than searching the table from the 0th index every time, search only those entries that are empirically known to follow the given entry within the given range (n < 1.59e12). Those lists are given in the header file above in the succtab[] struct, e.g.:

{11,{47,55,58,65,66,68,70,72,73,75,76}},

So that particular index is empirically found to only be followed by 11 different indices as listed, so those are the only ones searched.
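A minimal sketch of the successor-restricted search follows. The `Succ` layout is only a guess modelled on the line quoted above, and `toy_table` / `toy_succ` are hand-derived for the Hamming numbers 1 through 16; they are not the real table or succtab[] from the header file.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int i, j, k; } Triple;            /* 2^i * 3^j * 5^k */
typedef struct { int count; int idx[2]; } Succ;    /* assumed layout */

/* Toy interval table, sorted by ratio; valid for 1..16 only. */
static const Triple toy_table[8] = {
    { 4, -1, -1},  /* 0: 16/15 */
    { 1, -2,  1},  /* 1: 10/9  */
    {-3,  2,  0},  /* 2:  9/8  */
    { 1,  1, -1},  /* 3:  6/5  */
    {-2,  0,  1},  /* 4:  5/4  */
    { 2, -1,  0},  /* 5:  4/3  */
    {-1,  1,  0},  /* 6:  3/2  */
    { 1,  0,  0},  /* 7:  2/1  */
};

/* toy_succ[n]: indices observed right after entry n while walking 1..16,
   kept in ascending order so the smallest fitting ratio is found first. */
static const Succ toy_succ[8] = {
    {0, {0, 0}},   /* 0: last step of the toy range, nothing recorded */
    {1, {3, 0}},   /* 1: 10/9 is followed by 6/5 */
    {1, {1, 0}},   /* 2: 9/8 is followed by 10/9 */
    {2, {4, 5}},   /* 3: 6/5 is followed by 5/4 or 4/3 */
    {2, {0, 3}},   /* 4: 5/4 is followed by 16/15 or 6/5 */
    {2, {2, 4}},   /* 5: 4/3 is followed by 9/8 or 5/4 */
    {1, {5, 0}},   /* 6: 3/2 is followed by 4/3 */
    {1, {6, 0}},   /* 7: 2/1 is followed by 3/2 */
};

/* Apply table entry n to *h if all exponents stay non-negative. */
static int try_entry(Triple *h, int n)
{
    int i = h->i + toy_table[n].i;
    int j = h->j + toy_table[n].j;
    int k = h->k + toy_table[n].k;
    if (i < 0 || j < 0 || k < 0) return 0;
    h->i = i; h->j = j; h->k = k;
    return 1;
}

/* Full scan from index 0; needed for the very first step. */
static int next_full(Triple *h)
{
    for (int n = 0; n < 8; n++)
        if (try_entry(h, n)) return n;
    return -1;
}

/* Restricted scan: only the recorded successors of the previous entry. */
static int next_after(Triple *h, int prev)
{
    assert(prev >= 0 && prev < 8);
    for (int s = 0; s < toy_succ[prev].count; s++) {
        int n = toy_succ[prev].idx[s];
        if (try_entry(h, n)) return n;
    }
    return -1;
}
```

Starting at {0,0,0}, one call to next_full() followed by ten calls to next_after() uses the entries 7, 6, 5, 4, 3, 5, 2, 1, 3, 4, 0 and ends at {4,0,0}, i.e. 16, while never testing more than two candidates per step. The same idea carries over to the real tables, but only inside the range in which the successor lists were observed.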

Doing that speeds up the algorithm by a factor of 4 or so, implemented here (C code) along with the header file above. Here's a plot of the execution time on an i7-2600 3.4GHz machine:

[Plot: execution time on the i7-2600 3.4GHz machine]

I believe that compares favorably with the state of the art. Is that so?

The Hamming problem is sometimes reduced to just finding the nth Hamming number without generating all the intermediate values. Adapting the above technique to a well-known scheme of just enumerating the Hamming numbers in a band around the desired range gives this plot of execution time:

[Plot: execution time for finding the nth Hamming number]

So that takes less than 2 seconds to find the 1.59 trillionth Hamming number. The C code for that is here. Does this also compare favorably with the state of the art, at least in the given bounds?
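For checking small cases of the nth-number variant, a brute-force counter is handy: if a candidate value is the nth Hamming number, exactly n Hamming numbers must be less than or equal to it. The sketch below is not the linked code and not the band scheme itself; `count_hamming_le` is a name made up here, and it simply enumerates every triple, so it is only practical for small limits.

```c
#include <stdint.h>

/* Count the Hamming numbers 2^i * 3^j * 5^k that are <= limit by walking
   all exponent triples. The "x > limit / p" tests stop each loop before
   the next multiplication could exceed limit (or overflow). */
static uint64_t count_hamming_le(uint64_t limit)
{
    uint64_t count = 0;
    if (limit == 0) return 0;
    for (uint64_t p5 = 1;; p5 *= 5) {              /* powers of 5 */
        for (uint64_t p35 = p5;; p35 *= 3) {       /* times powers of 3 */
            for (uint64_t v = p35;; v *= 2) {      /* times powers of 2 */
                count++;
                if (v > limit / 2) break;
            }
            if (p35 > limit / 3) break;
        }
        if (p5 > limit / 5) break;
    }
    return count;
}
```

For example, count_hamming_le(16) is 12 and count_hamming_le(100) is 34. A band-based implementation would replace the innermost loop with a logarithm and keep only the triples that fall inside the band.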

EDIT: the bounds for n (1.59e12, Hamming numbers up to about 10^10000) were chosen for a specific machine, where it was desired that i,j,k fit in short ints and that execution speed stay reasonable. Larger tables could be generated, e.g. a table of 200 entries would allow n to be as high as about 1e18 (Hamming numbers up to about 10^85000).

Another question would be how to speed it up further. One potential area: it turns out that some table entries are hit much more often than others, and they have a correspondingly larger list of successors to check. For example, when generating the first 1.59e12 numbers, this entry is hit by fully 46% of the iterates:

{-7470,2791,1312}

It has 23 possible different successors. Perhaps some way of narrowing that down based on other parameters (e.g., history of the previous entries traversed) would help, although there wouldn't be much room for an expensive operation.

EDIT #2:

For some info about generating the table, there are basically six classes of fractions 2^i*3^j*5^k where i,j,k are positive or negative integers: fractions with only 2,3 or 5 in the numerator, and fractions with only 2,3, or 5 in the denominator. E.g., for the class with only 2 in the numerator:

f = 2^i/(3^j*5^k), i > 0 and j,k >= 0

A C program to compute the intervals for this class of fraction is here. For Hamming numbers up to about 10^10000 it runs in a few seconds. It could probably be made more efficient.

Repeating a similar process for the other 5 classes of fractions yields six lists. Sorting them all together by the interval size and removing duplicates yields the complete table.

Answer by Will Ness

The triples enumeration is ~ n^(2/3) but the sorting of the band is ~ n^(2/3) log (n^(2/3)), i.e. ~ n^(2/3) log n. This obviously doesn't change even with the ~ n^(1/3) band space scheme.

Indeed the empirical complexities are seen in practice as ~ n^0.7.

I am yet to understand your algorithm fully, but the evidence you presented strongly suggests the pure ~ n^(2/3) operation, which would constitute a clear and significant improvement over the previous state of the art, absolutely.


This would be not so, in my opinion, if it was needed to generate the whole sequence in order to find the "intervals" (ratios) your algorithm is based on. But since you generate them independently, as your later edit seems to suggest, it's no impediment at all.

Correction: if we're only interested in the nth member of the sequence, then full sort of the band is not needed; O(n) select-kth-largest algorithms do exist.