Java UTF-8 encoding not set to URLConnection

47.4k views Asked by At

I'm trying to retrieve data from http://api.freebase.com/api/trans/raw/m/0h47

As you can see in text there are sings like this: /ælˈdʒɪəriə/.

When I try to get source from the page I get text with sings like ú etc.

So far I've tried with the following code:

urlConnection.setRequestProperty("Accept-Charset", "UTF-8");
urlConnection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded;charset=utf-8");

What am I doing wrong?

My entire code:

URL url = null;
URLConnection urlConn = null;
DataInputStream input = null;
try {
url = new URL("http://api.freebase.com/api/trans/raw/m/0h47");
} catch (MalformedURLException e) {e.printStackTrace();}

try {
    urlConn = url.openConnection(); 
} catch (IOException e) { e.printStackTrace(); }
urlConn.setRequestProperty("Accept-Charset", "UTF-8");
urlConn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

urlConn.setDoInput(true);
urlConn.setUseCaches(false);

StringBuffer strBseznam = new StringBuffer();
if (strBseznam.length() > 0)
    strBseznam.deleteCharAt(strBseznam.length() - 1);

try {
    input = new DataInputStream(urlConn.getInputStream()); 
} catch (IOException e) { e.printStackTrace(); }
String str = "";
StringBuffer strB = new StringBuffer();
strB.setLength(0);
try {
    while (null != ((str = input.readLine()))) 
    {
        strB.append(str); 
    }
    input.close();
} catch (IOException e) { e.printStackTrace(); }
3

There are 3 answers

4
Joop Eggen On BEST ANSWER

The HTML page is in UTF-8, and could use arabic characters and such. But those characters above Unicode 127 are still encoded as numeric entities like ú. An Accept-Encoding will not, help, and loading as UTF-8 is entirely right.

You have to decode the entities yourself. Something like:

String decodeNumericEntities(String s) {
    StringBuffer sb = new StringBuffer();
    Matcher m = Pattern.compile("\\&#(\\d+);").matcher(s);
    while (m.find()) {
        int uc = Integer.parseInt(m.group(1));
        m.appendReplacement(sb, "");
        sb.appendCodepoint(uc);
    }
    m.appendTail(sb);
    return sb.toString();
}

By the way those entities could stem from processed HTML forms, so on the editing side of the web app.


After code in question:

I have replaced DataInputStream with a (Buffered)Reader for text. InputStreams read binary data, bytes; Readers text, Strings. An InputStreamReader has as parameter an InputStream and an encoding, and returns a Reader.

try {
    BufferedReader input = new BufferedReader(
            new InputStreamReader(urlConn.getInputStream(), "UTF-8")); 
    StringBuilder strB = new StringBuilder();
    String str;
    while (null != (str = input.readLine())) {
        strB.append(str).append("\r\n"); 
    }
    input.close();
} catch (IOException e) {
    e.printStackTrace();
}
0
Rob Marrowstone On

Well I'm thinking the problem is when you are reading from the stream. You should either call the readUTF method on the DataInputStream instead of calling readLine or, what I would do, would be to create an InputStreamReader and set the encoding, then you can read from the BufferedReader line by line (this would be inside your existing try/catch):

Charset charset = Charset.forName("UTF8");
InputStreamReader stream = new InputStreamReader(urlConn.getInputStream(), charset);
BufferedReader reader = new BufferedReader(stream);
StringBuffer responseBuffer = new StringBuffer();

String read = "";
while ((read = reader.readLine()) != null) {
    responseBuffer.append(read);
}
1
limlim On

Try adding also the user agent to your URLConnection:

urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36");

This solved my decoding problem like a charm.