I wrote a Java compressor for a super obscure compression format (it was mainly used on Amiga computers in the 1990s).
There is a fair amount of documentation on how to decompress the file format, but none on how to actually compress it.
So, I tried to make it myself. It works, but there's one issue: it takes 42 seconds to compress all the files I want to compress on "low-intensity" settings, and roughly 10x that time on higher-intensity settings.
I believe it can be much faster than that.
It's basically an LZ77 sliding-window variant.
The real bottleneck is searching for an existing occurrence to compress with.
Right now, I'm using a Map<Byte, List<Integer>> (the List<Integer> holds all the indices at which that byte appears).
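Conceptually, that index is built up something like this (a simplified sketch, not the exact code from my full source):

// (uses java.util.ArrayList, HashMap, List, Map)
private static Map<Byte, List<Integer>> buildIndex(byte[] data) {
    Map<Byte, List<Integer>> dictionary = new HashMap<>();
    for (int i = 0; i < data.length; i++) {
        // Record that the byte data[i] occurs at index i.
        dictionary.computeIfAbsent(data[i], k -> new ArrayList<>()).add(i);
    }
    return dictionary;
}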
To find a potential match, it takes the current index of the file being compressed, gets the List<Integer> out of the Map using the byte at that index, and then walks that list, checking how long each candidate position matches, to find the longest run of bytes that already occurs earlier in the file.
I think a better data structure could speed this up significantly, but I'm stuck at this point.
One of the restrictions I have to work with is that I need to strictly adhere to this ancient compression format due to the nature of what this program is for.
How could I optimize the compression without making it less effective at packing data?
Main bottleneck code:
private static int search(byte[] data, int bufferEnd, List<Byte> target, Map<Byte, List<Integer>> dictionary) {
    // There's a certain point past which data will not be compressed. Calculating it here
    // saves a lot of overhead and prevents this from becoming O(n^2).
    int minIndex = Math.max(0, bufferEnd - getMaximumOffset(target.size()));
    byte test = target.get(0);

    if (!dictionary.containsKey(test))
        return -1; // No results found.

    List<Integer> possibleResults = dictionary.get(test);
    for (int i = possibleResults.size() - 1; i >= 0; i--) {
        int testIndex = possibleResults.get(i);
        if (minIndex > testIndex)
            break; // We've gone too far back in the window.

        // Test whether the bytes at testIndex match the target.
        boolean pass = true;
        for (int j = 1; j < target.size(); j++) {
            if (target.get(j) != data[j + testIndex]) {
                pass = false;
                break; // Break from the j loop.
            }
        }

        if (pass) // A match has been found. Return it.
            return testIndex;
    }
    return -1;
}
Which is called by:
while ((tempIndex = search(data, i, searchList, dictionary)) >= 0) { // Find the longest compressible bunch of bytes.
    if (data.length - 1 == readIndex) // If we've reached the end of the data, exit.
        break;
    searchList.add(data[++readIndex]);
}
Full Code here for anyone who needs it.
You're missing a bunch of optimizations, especially low-level ones.
That Map<Byte, List<Integer>> is very inefficient. Actually, a Map is pretty fast, but much slower than an array. Instead of map.get(someByte), which does autoboxing and a map lookup (some index computation and a few array accesses), you can do a single array access using array[someByte & 0xFF], getting about one order of magnitude speedup.

Similarly, List<Integer> implies autoboxing, as you start with ints. Autoboxing is usually acceptable, but not when it's in the core of a demanding algorithm. You can write your own class behaving like a List<int>, or google for one (there are a few good libraries for that).
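For example, a rough sketch of what that could look like (IntList and ByteIndex are made-up names, not anything from your code or a specific library):

final class IntList { // minimal growable list of primitive ints, no boxing
    private int[] values = new int[16];
    private int size;

    void add(int value) {
        if (size == values.length)
            values = java.util.Arrays.copyOf(values, size * 2); // grow when full
        values[size++] = value;
    }

    int size()     { return size; }
    int get(int i) { return values[i]; }
}

final class ByteIndex { // replaces Map<Byte, List<Integer>>
    private final IntList[] occurrences = new IntList[256];

    // Record that byte b occurs at the given position.
    void add(byte b, int index) {
        int key = b & 0xFF;                 // single array index, no autoboxing
        if (occurrences[key] == null)
            occurrences[key] = new IntList();
        occurrences[key].add(index);
    }

    // Returns null if this byte value has never been seen.
    IntList get(byte b) {
        return occurrences[b & 0xFF];
    }
}

The search loop would then walk the IntList for the first target byte instead of a List<Integer>.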
The containsKey check followed by get in search is a needless double lookup. Unless you're using null values, it can be written as a single get plus a null check, roughly:
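List<Integer> possibleResults = dictionary.get(test);
if (possibleResults == null) // one lookup instead of containsKey + get
    return -1; // No results found.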
This is twice as fast, but as I wrote, you should use an array here.
Concerning the high-level optimizations, I don't really know how to compress efficiently, but I'm sure there are quite a few tricks. If there were no resources on compression, I'd start with a rolling hash. But read about compression in general first.
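As an illustration only, a rolling hash over a fixed-size window could look roughly like this (constants and names are arbitrary, not tuned for your format):

final class RollingHash {
    private static final int BASE = 257; // arbitrary example multiplier
    private final int topPower;          // BASE^(windowSize - 1), used to drop the oldest byte
    private int hash;

    RollingHash(int windowSize) {
        int p = 1;
        for (int i = 1; i < windowSize; i++)
            p *= BASE;                   // int overflow is fine: arithmetic is modulo 2^32
        this.topPower = p;
    }

    // Feed the first windowSize bytes with push(), then slide with roll().
    void push(byte b) {
        hash = hash * BASE + (b & 0xFF);
    }

    // Slide the window by one byte: drop the oldest byte, append the newest.
    void roll(byte oldest, byte newest) {
        hash = (hash - (oldest & 0xFF) * topPower) * BASE + (newest & 0xFF);
    }

    int value() {
        return hash;
    }
}

Hashing a window of, say, 3 bytes instead of a single byte could make the candidate lists much shorter, since every candidate position would already share a 3-byte prefix.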