I'm making an Android app that needs to fetch and parse XML. The class for that was written following the tutorial at http://www.tutorialspoint.com/android/android_rss_reader.htm and the fetcher method looks like this:
public void fetchXML() {
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                URL url = new URL(urlString);
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setReadTimeout(10000 /* milliseconds */);
                conn.setConnectTimeout(15000 /* milliseconds */);
                conn.setRequestMethod("GET");
                conn.setDoInput(true);
                // Starts the query
                conn.connect();
                InputStream stream = conn.getInputStream();
                xmlFactoryObject = XmlPullParserFactory.newInstance();
                xmlFactoryObject.setValidating(false);
                xmlFactoryObject.setFeature(Xml.FEATURE_RELAXED, true);
                xmlFactoryObject.setNamespaceAware(true);
                XmlPullParser myparser = xmlFactoryObject.newPullParser();
                //myparser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
                myparser.setInput(new InputStreamReader(stream, "UTF-8"));
                parseXMLAndStoreIt(myparser);
                stream.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    thread.start();
}
The parser looks like the one in the tutorial, with my own parsing logic in it.
As you can see from
myparser.setInput(new InputStreamReader(stream, "UTF-8"));
I'm using the UTF-8 charset. Now when I call getText() in my parser, for example on the word 'Jõhvi', the logcat output is 'J�hvi'. The same happens with the other characters of my native language, Estonian, that aren't in the English alphabet. I need to use this string as a key and in the user interface, so this isn't acceptable. I think it's a charset problem, but the site I'm pulling the XML from gives no information about its encoding, and calling
conn.getContentEncoding()
returns null, so I'm in the dark here.
Content encoding and character encoding are not the same thing.
Content encoding refers to compression such as gzip. Since
getContentEncoding()
is null, that tells you there's no compression. You should be looking at
conn.getContentType()
, because the character encoding can usually be found in the Content-Type
response header. conn.getContentType()
might return something like:
text/xml; charset=ISO-8859-1
so you will have to do some parsing. Look for the character set name after "charset=", but be prepared for the case where the MIME type is specified but the charset is not.
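A minimal sketch of that parsing step, in plain Java so it can be tried outside Android. The class name ContentTypeCharset and the helper charsetOf are made up for illustration, and falling back to a caller-supplied default when no usable charset is present is my assumption, not something the server guarantees to be correct.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/xml; charset=ISO-8859-1". Returns the fallback when
    // the header is null, has no charset parameter, or names a charset this
    // JVM does not know.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Parameter names are case-insensitive, so compare in lower case.
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Some servers quote the value: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/xml; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/xml", StandardCharsets.UTF_8));
        System.out.println(charsetOf(null, StandardCharsets.UTF_8));
    }
}
```

In fetchXML() the result could then replace the hard-coded "UTF-8", for example new InputStreamReader(stream, ContentTypeCharset.charsetOf(conn.getContentType(), StandardCharsets.UTF_8)). If the header carries no charset at all, another option worth trying is myparser.setInput(stream, null): with a null encoding the pull parser is supposed to detect the encoding from the XML declaration itself, though whether that works depends on the feed actually declaring it.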