XmlPullParser shows black diamond question marks for certain characters


I'm making an Android app that needs to fetch and parse XML. The class for that was written following the instructions at http://www.tutorialspoint.com/android/android_rss_reader.htm and the fetcher method looks like this:

public void fetchXML() {
    Thread thread = new Thread(new Runnable() {
        @Override
        public void run() {

            try {
                URL url = new URL(urlString);
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();


                conn.setReadTimeout(10000 /* milliseconds */);
                conn.setConnectTimeout(15000 /* milliseconds */);
                conn.setRequestMethod("GET");
                conn.setDoInput(true);


                // Starts the query
                conn.connect();
                InputStream stream = conn.getInputStream();

                xmlFactoryObject = XmlPullParserFactory.newInstance();
                xmlFactoryObject.setValidating(false);
                xmlFactoryObject.setFeature(Xml.FEATURE_RELAXED, true);
                xmlFactoryObject.setNamespaceAware(true);

                XmlPullParser myparser = xmlFactoryObject.newPullParser();
                //myparser.setFeature(XmlPullParser.FEATURE_PROCESS_NAMESPACES, false);
                myparser.setInput(new InputStreamReader(stream, "UTF-8"));

                parseXMLAndStoreIt(myparser);
                stream.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    thread.start();
}

The parser looks like the one in the tutorial, with my own parsing logic in it.

As you can see from

 myparser.setInput(new InputStreamReader(stream, "UTF-8"));

I'm using the UTF-8 charset. When I call getText() in my parser on, for example, the word 'Jõhvi', the logcat output is 'J�hvi'. The same happens with other characters of my native language, Estonian, that aren't in the English alphabet. I need to use this string as a key and in the user interface, so this isn't acceptable. I think it's a charset problem, but the XML site I'm pulling this from gives no encoding information, and calling

conn.getContentEncoding()

returns null, so I'm in the dark here.

kris larson On BEST ANSWER

Content encoding and character encoding are not the same thing.

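Content encoding refers to a transformation applied to the response body, typically compression such as gzip. A null result from getContentEncoding() only tells you that the response is not compressed; it says nothing about the character set.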

You should be looking at conn.getContentType(), because the character encoding can usually be found in the content-type response header.

conn.getContentType() might return something like:

text/xml; charset=ISO-8859-1

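so you will have to do some parsing. Look for the character set name after "charset=", but be prepared for the case where the MIME type is specified but the charset is not. Once you know the real charset, pass it to the InputStreamReader instead of the hard-coded "UTF-8". If the server sends a single-byte encoding such as ISO-8859-1, the byte for 'õ' is not valid UTF-8, which would explain the '�' replacement character.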
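
The parsing described above could look roughly like the sketch below. The class name ContentTypeCharset and the method fromContentType are hypothetical names invented for this illustration, and falling back to a caller-supplied default when no usable charset is found is an assumption, not something the header mandates.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type
    // header value, or the fallback when the header is null, carries no
    // charset parameter, or names a malformed or unsupported charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the MIME type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header with a charset parameter, then one without.
        System.out.println(fromContentType("text/xml; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(fromContentType("text/xml", StandardCharsets.UTF_8));
    }
}
```

In the fetcher from the question, the result would replace the hard-coded name, for example: myparser.setInput(new InputStreamReader(stream, ContentTypeCharset.fromContentType(conn.getContentType(), StandardCharsets.UTF_8))). If the header carries no charset, another option is myparser.setInput(stream, null), which asks the parser to detect the encoding itself, for instance from the XML declaration.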