How do I convert a Windows-1251 text to something readable?

9.3k views Asked by At

I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this:

import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }
    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }
    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }
    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}

The variable bytes contains the data shown in my debugger, it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copy and pasted that array here.

This doesn't work - readableString contains garbage.

How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Screenshot

Update 2: Another attempt:

Second screenshot

Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.

Expected result

Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}
3

There are 3 answers

1
Glory to Russia On BEST ANSWER

I fixed this problem by modifying the piece of code, which read the text from the web site.

private String readContent(final String urlAsString) {
    final StringBuilder content = new StringBuilder();
    BufferedReader reader = null;
    InputStream inputStream = null;
    try {
        final URL url = new URL(urlAsString);
        inputStream = url.openStream();
        reader =
            new BufferedReader(new InputStreamReader(inputStream);

        String inputLine;
        while ((inputLine = reader.readLine()) != null) {
            content.append(inputLine);
        }
    } catch (final IOException exception) {
        exception.printStackTrace();
    } finally {
        IOUtils.closeQuietly(reader);
        IOUtils.closeQuietly(inputStream);
    }
    return content.toString();
}

I changed the line

new BufferedReader(new InputStreamReader(inputStream);

to

new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));

and then it worked.

1
Rodney On

(In the light of updates I deleted my original answer and started again)

The text which appears

пїЅпїЅпїЅпїЅпїЅпїЅ

is an accurate decoding of these byte values

-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67

(Padded at either end with 32, which is space.)

So either

1) The text is garbage or

2) The text is supposed to look like that or

3) The encoding is not Windows-1215

This line is notably wrong

return new String(windows1251String.getBytes(), "UTF-8");

Extracting the bytes out of a string and constructing a new string from that is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 encoding internally (and you don't normally even need to know or care about that). The only times other encodings come into play are when text data is stored OUTSIDE of a string object - ie in your initial byte array. Conversion occurs when the String is constructed and then it is done. There is no conversion from one String type to another - they are all the same.

The fact that this

return new String(bytes);

does the same as this

return new String(bytes, "Windows-1251");

suggests that Windows-1251 is the platforms default encoding. (Which is further supported by your timezone being MSK)

0
bvdb On

Just to make sure you understand 100% how java deals with char and byte.

byte[] input = new byte[1];

// values > 127 become negative when you put them in an array.
input[0] = (byte)239; // the array contains value -17 now.

// but all 255 values are preserved. 
// But if you cast them to integers, you should use their unsigned value.
// (casting alone isn't enough).
int output = input[0] & 0xFF; // output is 239 again

// you shouldn't cast directly from a single-byte to a char.
// because: char is 16-bit ; but you only want to use 1 byte ; unfortunately your negative values will be applied in the 2nd byte, and break it.
char corrupted = (char) input[0]; // char-code: 65519 (2 bytes are used)
char corrupted = (char) ((int)input[0]); // char-code: 66519 (2 bytes are used)

// just casting to an integer/character is ok for values < 0x7F though
// values < 0x7F are always positive, even when casted to byte
// AND the first 7-bits in any ascii-encodings (e.g. windows-1251) are identical.
byte simple = (byte) 'a';
char chr = (char) ascii_LT_7F; // will result in 'a' again

// But it's still more reliable to use the & 0xFF conversion.
// Because it ensures that your character can never be greater than char code 255 (a single byte), even when the byte is unexpectedly negative (> 0x7F).
char chr = (char) ((byte)simple & 0xFF); // also results in 'a'

// for value 239 (which is 0xEF) it's impossible though.
// a java char is 16-bit encoded internally, following the unicode character set.
// characters 0x00 to 0x7F are identical in most encodings.
// but e.g. 0xEF in windows-1251 does not match 0xEF in UTF-16.
// so, this is a bad idea.
char corrupted = (char) (input[0] & 0xFF);

// And that's something you can only fix by using encodings.
// It's good practice to use encodings really just ALWAYS.
// the encoding indicates what your bytes[] are encoded in NOW.
// your bytes will be converted to 16-bit characters.
String text = new String(bytes, "from-encoding");

// if you want to change that text back to bytes, use an encoding !!
// this time the encoding specifies is the TARGET-ENCODING.
byte[] bytes = text.getBytes("to-encoding");

I hope this helps.

As for the displayed values: I can confirm that the byte[] is displayed correctly. I checked them in the Windows-1251 code page. (byte -17 = int 239 = 0xEF = char 'п')

In other words, your byte values are incorrect, or it's a different source-encoding.