OkHttp - ISO-8859-1 encoded webpage - � included in retrieved page source string


After hours of trial and error, and many more spent searching the web for solutions, I am at a total loss.

I am successfully using OkHttp to retrieve the source of a webpage in the following way:

Request request = new Request.Builder()
        .url(APIURL + Integer.toString(StopIndex) + "/")
        .addHeader("Content-Type", "text/html; charset=ISO-8859-1")
        .build();
client.newCall(request).enqueue(new Callback() {
    @Override
    public void onFailure(Call call, IOException e) {
        Log.e("OkHttp request issue", e.toString());
    }

    @Override
    public void onResponse(Call call, Response response) throws IOException {
        PageSource = response.body().string();
        StopActivity.this.runOnUiThread(new Runnable() {
            @Override
            public void run() {
                tv1.setText(PageSource);
            }
        });
    }
});

For testing purposes I am displaying the downloaded String in a TextView, and I noticed "�" signs in places where German special letters ("ä", "ö", etc.) were used. I figured this was a UTF-8 <-> ISO-8859-1 encoding issue, since the source doesn't use "&auml;" or similar entities but simply "ä". Indeed, the target webpage specifies the following:

<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type" />

I then tried adding the "addHeader" call to the Request.Builder(), but it doesn't change the output. I also experimented with OkHttp interceptors and ByteBuffers, but nothing worked, as I was never able to get hold of the response before it was re-encoded and the �s were introduced.

How can I tell OkHttp to respect the ISO-8859-1 encoding and prevent it from replacing all special characters ("ä", "ö", "ü", etc.) with �?

Many thanks in advance and merry Christmas to all of you.

EDIT/ ANSWER:

Using the Guava library from Google I was able to retrieve the correctly encoded page source as follows:

String pageSource = CharStreams.toString(new InputStreamReader(response.body().byteStream(), "ISO-8859-1"));
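If pulling in Guava only for this is undesirable, the same decoding can be sketched with the JDK alone. In the sketch below, `readAll` is a hypothetical helper name, and the `ByteArrayInputStream` merely stands in for `response.body().byteStream()`; treat it as an illustration under those assumptions rather than the only way to do it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1StreamReader {
    // Hypothetical helper: reads the whole stream and decodes it with the
    // charset chosen by the caller instead of a default one.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xE4 is "ä" in ISO-8859-1. This array is a stand-in for the
        // stream that response.body().byteStream() would provide.
        byte[] raw = {'H', (byte) 0xE4, 'l', 'l', 'o'};
        String page = readAll(new ByteArrayInputStream(raw), StandardCharsets.ISO_8859_1);
        System.out.println(page.equals("H\u00e4llo"));
    }
}
```

Inside `onResponse`, the call would presumably look like `readAll(response.body().byteStream(), StandardCharsets.ISO_8859_1)`.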

There is 1 answer

Jesse Wilson (best answer)

OkHttp doesn't parse your HTML to read the content-type within it. Instead you need to specify the charset yourself as an argument to string(). Even better, get your server to include the proper charset in the response’s content type header.
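To make the mechanism concrete, here is a minimal sketch using only the JDK. It shows why the "�" (U+FFFD) appears when ISO-8859-1 bytes are decoded as UTF-8, and how decoding the raw bytes with an explicit charset avoids it. The class name and the `decode` helper are hypothetical, and the byte array stands in for the raw response body.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDecodeDemo {
    // Hypothetical helper: decode raw body bytes with a charset chosen by the caller.
    static String decode(byte[] rawBody, Charset charset) {
        return new String(rawBody, charset);
    }

    public static void main(String[] args) {
        // "für" as the server would send it in ISO-8859-1: "ü" is the single byte 0xFC.
        byte[] raw = "f\u00fcr".getBytes(StandardCharsets.ISO_8859_1);

        // 0xFC is not valid UTF-8, so the decoder substitutes U+FFFD for it.
        String wrong = decode(raw, StandardCharsets.UTF_8);
        // Decoding with the charset the page was actually written in keeps the "ü".
        String right = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.indexOf('\uFFFD') >= 0);
        System.out.println(right.equals("f\u00fcr"));
    }
}
```

With OkHttp, the raw bytes are available from `response.body().bytes()`, so if the `string()` method in your OkHttp version does not accept a charset, something like `new String(response.body().bytes(), StandardCharsets.ISO_8859_1)` should give the same result. Note that the response body can only be consumed once.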