Java Inflate inconsistent with large Strings


I want to recover a compressed medium-length string (665 chars) using the java.util.zip package. The compression is done by this code:

public String compress(String s){
    Deflater def = new Deflater(9);
    byte[] buffer = new byte[s.length()];
    String rta = "";
    def.setInput(s.getBytes());
    def.finish();
    def.deflate(buffer);
    rta = new String(buffer);
    rta = rta.trim().concat("*" + Integer.toString(s.length()));
  //this addition at the end is used to recover the original length of the string to dynamically create the buffer later on.
    return rta;
}

And the code to decompress is this:

public String decompress(String s){
    String rta = "";
    Inflater inf = new Inflater();
    byte[] buffer = separoArray(s, true).getBytes(); // separoArray returns the compressed string (true) or the original length (false)
    int len = Integer.valueOf(separoArray(s, false));
    byte[] decomp = new byte[len];
    inf.setInput(buffer);
    try {
        inf.inflate(decomp, 0, len);
        inf.end();
    } catch (DataFormatException e) {e.printStackTrace();}
    rta = new String(decomp);
    return rta;
}

And these are the original String and the decompressed one:

Original:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed rutrum imperdiet consequat. Nulla eu sapien tincidunt, pellentesque ipsum in, luctus eros. Nullam tristique arcu lorem, at fringilla lectus tincidunt sit amet. Ut tortor dui, cursus at erat non, interdum imperdiet odio. In hac habitasse platea dictumst. Nulla facilisi. Duis eget auctor nibh. Cras ante odio, dignissim et sem id, ultrices imperdiet erat. Aenean ut purus hendrerit, bibendum massa non, accumsan orci. Morbi quis leo sed mauris scelerisque vulputate. Fusce gravida facilisis ipsum pellentesque euismod. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae"

Decompressed:

"Lorem ipsuAdolor sit amet, consectetur adipiscing elit. Sed rutrsuAimperdiet consequat. Nulla eu sapien tincidunt, pellentesquem ipsuAin, luctus eros. Nullam tristiquemarcu lLore, at fringilla lectus tincidunt sit amet. Ut tortor dui, cursus at erat non, interdsuAimperdiet odsAimpeIn hac habitasse platea dius ms Nulla eufacilisi. Duierog odatus dunibh. Craat erddsAim, dignissim odsm ipd, ulistcesmperdiet odat n. Aenean ut pur athendreri pebibendAimmassaon, inacc msan orci. Morbi quierleodsmdmmausti sceleriuem ivulputate. Fusce gravideufacilisisipsuAinllentesquem ieuiemod. VeiqubulAin erddpsuAinlrimisipnufaucubus orciuctus erot ulistcesmposuereursbilia Cura"

The differences are visible. Why is this happening, and what can I do to avoid it?

Thank You.


There are 2 answers

Joop Eggen On BEST ANSWER

I agree with the commenters that compressed data is better kept as a byte[]. However, with a single-byte encoding like ISO-8859-1 one can (abusively) convert between byte[] and String without loss.

The following differs from your version in that it states the encoding explicitly. For the text itself, UTF-8 is a good choice: it imposes no limits and covers the full Unicode range.

Note the usage of the deflate return value.

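import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public static String compress(String s) {
    byte[] sbytes = s.getBytes(StandardCharsets.UTF_8);
    Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
    def.setInput(sbytes);
    def.finish();
    // Deflated output can be longer than the input (short or incompressible
    // text), so drain the deflater in a loop rather than relying on one call.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[1024];
    while (!def.finished()) {
        int n = def.deflate(buffer); // number of bytes actually written
        out.write(buffer, 0, n);
    }
    def.end();
    // ISO-8859-1 maps every byte value to exactly one char, so this is reversible.
    return new String(out.toByteArray(), StandardCharsets.ISO_8859_1)
            + "*" + sbytes.length;
}

public static String decompress(String s) {
    // The original UTF-8 byte length follows the last '*'.
    int pos = s.lastIndexOf('*');
    int len = Integer.parseInt(s.substring(pos + 1));
    s = s.substring(0, pos);

    Inflater inf = new Inflater();
    byte[] buffer = s.getBytes(StandardCharsets.ISO_8859_1);
    byte[] decomp = new byte[len];
    inf.setInput(buffer);
    try {
        inf.inflate(decomp, 0, len);
    } catch (DataFormatException e) {
        throw new IllegalArgumentException(e);
    } finally {
        inf.end(); // release native resources even on failure
    }
    return new String(decomp, StandardCharsets.UTF_8);
}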
Stephen C On

The problem is not with Deflater.

The primary problem is this line:

    rta = new String(buffer);

What you are doing is taking an array of bytes (representing the compressed input string) and decoding it into a String using your platform's default character encoding. This is wrong. For the majority of character encodings, there are byte values or sequences of byte values that cannot be mapped to characters. When you attempt to "decode" bytes that don't represent properly encoded text, you are liable to get a scattering of question marks or some other replacement character throughout the string. This results in loss of information ... and there's no way to recover it.

(There are one or two character sets where the decoding / encoding is fully reversible ... and you could use one of them as the encoding scheme when converting the compressed bytes to "text". But that's not the end of it!)
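The difference between a lossy and a reversible charset can be checked directly. The following is a minimal sketch, not part of the original code: the class name `CharsetRoundTrip`, the helper `survives`, and the sample bytes are made up for illustration. It decodes a few arbitrary bytes to a String, encodes them again, and compares the result with the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // Decodes the bytes with the given charset, re-encodes the resulting
    // String, and reports whether the original bytes came back unchanged.
    static boolean survives(byte[] data, Charset cs) {
        byte[] back = new String(data, cs).getBytes(cs);
        return Arrays.equals(data, back);
    }

    public static void main(String[] args) {
        // Hypothetical sample bytes; 0xDA followed by 0xF3 is not valid UTF-8.
        byte[] data = {0x78, (byte) 0xDA, (byte) 0xF3, 0x01, 0x00};
        System.out.println(survives(data, StandardCharsets.UTF_8));      // false
        System.out.println(survives(data, StandardCharsets.ISO_8859_1)); // true
    }
}
```

With UTF-8 the malformed bytes are replaced by U+FFFD during decoding, so the original values are gone. ISO-8859-1 assigns one character to each of the 256 byte values, which is why it round-trips.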

The second problem is with how you are dealing with the compressed bytes. The deflate(byte[] buffer) method compresses the input data and writes the compressed output into buffer. However, there is no guarantee that N bytes of input is going to result in N bytes of output. Instead, the deflate method returns an int giving the number of bytes written into buffer.
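To see that the output size is unrelated to the input size, here is a small sketch (the class `DeflateLength` and its helper are hypothetical names, and the 256-byte buffer is an assumption that is only adequate for these tiny inputs). It returns the value reported by a single deflate call.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

class DeflateLength {
    // Returns how many bytes a single deflate() call wrote for the given text.
    static int compressedLength(String text) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setInput(text.getBytes(StandardCharsets.UTF_8));
        def.finish();
        byte[] buffer = new byte[256]; // assumption: large enough for these inputs
        int n = def.deflate(buffer);   // only the first n bytes of buffer are valid
        def.end();
        return n;
    }

    public static void main(String[] args) {
        char[] run = new char[200];
        Arrays.fill(run, 'a');
        // One character grows (stream header and checksum overhead) ...
        System.out.println(compressedLength("a"));
        // ... while 200 repeated characters shrink to far fewer than 200 bytes.
        System.out.println(compressedLength(new String(run)));
    }
}
```

A buffer sized to the input length is therefore too small for short or incompressible input and mostly unused for repetitive input.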

But your code is then taking the entire buffer ... including the bytes that weren't written ... and turning that into a String (by the unsound procedure described above). You then trim the String to (I presume) get rid of the trailing NUL characters. But that will trim all white-space from the start and end, and some of those characters could be a significant part of the compressed string.
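The effect of trim() can be shown in isolation. This sketch uses invented bytes rather than real deflate output, and decodes with ISO-8859-1 so that only the trimming is at fault; the class name `TrimHazard` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class TrimHazard {
    // Mimics the question's approach: decode the whole buffer, then trim().
    static int lengthAfterTrim(byte[] buffer) {
        return new String(buffer, StandardCharsets.ISO_8859_1).trim().length();
    }

    public static void main(String[] args) {
        // Suppose deflate() wrote 4 bytes into a 6-byte buffer, so the last
        // two bytes are unused zeros. Bytes 3 and 4 (0x20, 0x0A) are data.
        byte[] buffer = {0x78, (byte) 0xDA, 0x20, 0x0A, 0x00, 0x00};
        System.out.println(lengthAfterTrim(buffer)); // 2, not 4
    }
}
```

String.trim() removes every leading and trailing character with a code point of U+0020 or below, so it cannot tell padding zeros from compressed bytes that happen to have small values.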


Basically, what you are doing is unsound. You should not be trying to convert arbitrary bytes into a String. Compressed data is NOT text.

My recommendation is to do one of the following:

  • Don't convert the (compressed) byte[] to a String. Keep it as a byte[] ... and deal with the length issue properly.

  • Alternatively, use a non-lossy bytes-as-characters encoding scheme; e.g. hex encoding or base64 encoding.
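As an illustration of the second option, here is a minimal sketch using java.util.Base64 (available since Java 8). The class name `Base64Zip` is made up, and the code is one possible arrangement rather than the only correct one. Both directions drain the stream in a loop, so no length suffix is needed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class Base64Zip {
    // Compresses the UTF-8 bytes of s and returns them as Base64 text.
    static String compress(String s) {
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION);
        def.setInput(s.getBytes(StandardCharsets.UTF_8));
        def.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!def.finished()) {
            int n = def.deflate(chunk); // bytes actually produced this round
            out.write(chunk, 0, n);
        }
        def.end();
        return Base64.getEncoder().encodeToString(out.toByteArray());
    }

    // Reverses compress(): Base64 text -> compressed bytes -> original String.
    static String decompress(String encoded) {
        Inflater inf = new Inflater();
        inf.setInput(Base64.getDecoder().decode(encoded));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(chunk);
                if (n == 0 && !inf.finished()) {
                    // No progress and not done: the input is incomplete.
                    throw new IllegalArgumentException("Truncated compressed data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inf.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";
        String packed = compress(original);
        System.out.println(decompress(packed).equals(original)); // true
    }
}
```

Base64 makes the text about a third larger than the compressed bytes, which is the price of having a String that survives any charset, trimming, or transport as text.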