Reading XML file encoded in UTF16 in Java


I am trying to read a UTF-16 XML file with Java. The file was written with C#.

Here's the Java code:

import java.io.File;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XMLReadTest
{
    public static void main(String[] s)
    {
        try
        {
            File fXmlFile = new File("C:\\my_file.xml");

            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(fXmlFile);

            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");

            for (int temp = 0; temp < nList.getLength(); temp++)
            {
                Node nNode = nList.item(temp);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;

                    System.out.println("FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent());
                }
            }
        }
        catch(Exception ex)
        {
            ex.printStackTrace();
        }
    }
}

And here's the XML file:

<?xml version="1.0" encoding="utf-16" standalone="yes"?>
<docMetadata>
  <row>
    <FILE_NAME>Выписка_Винтовые насосы.pdf</FILE_NAME>
    <FILE_CAT>GENERAL</FILE_CAT>
  </row>
</docMetadata>

When I run this code in Eclipse with the encoding on the 'Common' tab (the last tab of the Run/Debug settings window) left at Default - Inherited (Cp1253), the output is wrong:

FILE_NAME: ???????_???????? ??????.pdf

When the selected encoding in the same tab is UTF-8, the output is OK:

FILE_NAME: Выписка_Винтовые насосы.pdf

What am I doing wrong?

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

This code runs in a server where I don't want to change the default encoding of the virtual machine.

I have tested this code with both Java 7 and Java 8


There are 4 answers

0
crapatzi On BEST ANSWER

I was using an old dom4j library to parse the XML, and that was causing the problem. Using the XML parser embedded in the Java 1.7 runtime solved it:

import java.io.File;
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public XMLDoc()
    {
        try
        {
            File xmlFile = new File("C:\\my_file.xml");
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            Document doc = dBuilder.parse(xmlFile);
            doc.getDocumentElement().normalize();

            NodeList nList = doc.getElementsByTagName("row");
            for (int i = 0; i < nList.getLength(); i++)
            {
                Node nNode = nList.item(i);

                if (nNode.getNodeType() == Node.ELEMENT_NODE)
                {
                    Element eElement = (Element) nNode;
                    Node itemNode = eElement.getElementsByTagName("FILE_NAME").item(0);
                    String text = itemNode != null ? itemNode.getTextContent() : "";

                    // Russian text is fine here
                }
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
    }
2
Remy Lebeau On

The problem has nothing to do with the XML itself. Java strings are UTF-16 encoded, and the Document correctly decodes the XML data into UTF-16 strings. The real problem is that Eclipse is set to use Cp1253 (Windows-1253 Greek, which is slightly different from ISO-8859-7 Greek) as its console charset. The Russian characters you are trying to output simply do not exist in that charset, so they are replaced with ?. That also explains why the output is correct when the console charset is set to UTF-8: conversions between UTF-8 and UTF-16 are lossless.
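
To illustrate the replacement described above, here is a minimal sketch. The class name ConsoleCharsetDemo and the helper roundTrip are made up for this example, and it assumes the JVM provides the windows-1253 charset. It encodes a Cyrillic string to bytes and decodes it back, which approximates what the text goes through on its way to the console:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleCharsetDemo
{
    // Encodes text to bytes in the given charset and decodes it back,
    // roughly what happens between System.out and the console.
    public static String roundTrip(String text, Charset charset)
    {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args)
    {
        // A Cyrillic file name written with Unicode escapes, so that the
        // encoding of this source file does not matter.
        String text = "\u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf";

        // Assumption: the JVM ships the windows-1253 charset (Cp1253).
        Charset greek = Charset.forName("windows-1253");

        // Cyrillic letters cannot be mapped to windows-1253 and become '?'.
        System.out.println(roundTrip(text, greek).equals("???????.pdf"));

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

Both lines should print true. Printing booleans rather than the text itself keeps the result independent of the console charset.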

4
user3141592 On

How can I get the correct output with the default encoding (cp 1253) in eclipse project settings?

You can't. To see the correct output, the console must know the characters to display.

This code runs in a server where I don't want to change the default encoding of the virtual machine.

You could write the output to a UTF-8 or UTF-16 log file and view it with cat from another console or in a text editor.

            if (nNode.getNodeType() == Node.ELEMENT_NODE)
            {
                Element eElement = (Element) nNode;
                String message = "FILE_NAME: " + eElement.getElementsByTagName("FILE_NAME").item(0).getTextContent();
                System.out.println(message);
                // Output FILE_NAME to logfile.txt (quick and dirty).
                // Needs java.io.FileOutputStream and java.io.OutputStreamWriter.
                // Append mode (true) keeps earlier rows; the writer always encodes as UTF-8.
                try (OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("logfile.txt", true), "UTF-8"))
                {
                    writer.write(message + System.lineSeparator());
                }
            }

I ran this code in Eclipse with ISO-8859-1 selected as the encoding in the run configuration.

Eclipse output: FILE_NAME: ???????_???????? ??????.pdf

logfile output: FILE_NAME: Выписка_Винтовые насосы.pdf
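
The same idea works for any output stream, not only a file. A rough sketch, assuming Java 7 or later (the class name Utf8PrintDemo and the helper utf8 are invented for this example): wrap the target stream in a PrintStream with an explicit charset, so the bytes written do not depend on the JVM's default encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class Utf8PrintDemo
{
    // Wraps a stream in a PrintStream that always encodes as UTF-8,
    // whatever the JVM's file.encoding default is.
    public static PrintStream utf8(OutputStream out)
    {
        try
        {
            return new PrintStream(out, true, "UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // Should not happen: every JVM is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // Cyrillic text written with Unicode escapes to keep this source ASCII-only.
        String message = "FILE_NAME: \u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf";

        // A byte buffer stands in for the log file or socket.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8(buffer);
        out.print(message);
        out.flush();

        // Decoding the captured bytes as UTF-8 gives back the original text.
        String decoded = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(message.equals(decoded));
    }
}
```

Passing System.out to utf8(...) would emit UTF-8 bytes as well, but whatever displays them (console, terminal, log viewer) still has to decode them as UTF-8 for the text to look right.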

0
polypiel On

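Try setting the encoding explicitly on the input stream:

// DocumentBuilder.parse() has no Reader overload, so wrap the reader in an org.xml.sax.InputSource.
Document doc = dBuilder.parse(new InputSource(new InputStreamReader(new FileInputStream(fXmlFile), "UTF-16")));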
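
For reference, a self-contained sketch of decoding the input explicitly. The class Utf16ParseDemo and the method firstFileName are hypothetical names, and an in-memory byte array stands in for the real file. The parser can normally detect UTF-16 from the byte order mark on its own, so treat this as an option rather than a required fix.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class Utf16ParseDemo
{
    // Parses UTF-16 XML from a stream and returns the text of the first FILE_NAME element.
    public static String firstFileName(InputStream in)
    {
        try
        {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The reader decodes the bytes (honouring a byte order mark if present),
            // so the parser receives characters and does not guess the encoding.
            InputSource source = new InputSource(new InputStreamReader(in, StandardCharsets.UTF_16));
            Document doc = builder.parse(source);
            return doc.getElementsByTagName("FILE_NAME").item(0).getTextContent();
        }
        catch (Exception e)
        {
            // Checked exceptions are wrapped only to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        // Cyrillic text written with Unicode escapes to keep this source ASCII-only.
        String expected = "\u0412\u044B\u043F\u0438\u0441\u043A\u0430.pdf";
        String xml = "<?xml version=\"1.0\" encoding=\"utf-16\"?>"
                + "<docMetadata><row><FILE_NAME>" + expected + "</FILE_NAME></row></docMetadata>";

        // Stand-in for the real file: Java's UTF-16 encoder writes a byte order mark first.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16);

        String name = firstFileName(new ByteArrayInputStream(bytes));
        System.out.println(name.equals(expected));
    }
}
```

With a real file, a FileInputStream would replace the ByteArrayInputStream. Either way, this only affects how the XML is read; what the console can display is a separate matter.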