Encoding problems in the XML output

I wrote an XML parser, and everything works fine except the text encoding. I did some research to fix this, but I'm still stuck.

I have a list of strings containing movie titles, and I add each title to the XML wrapped in a CDATA section, for example:

CDATA movieTitle = new CDATA(aMovie.getTitle());
movie.addContent(new Element("title").addContent(movieTitle));

And I save it like this:

XMLOutputter xmlOutput = new XMLOutputter();
Format format = Format.getPrettyFormat();
format.setEncoding("UTF-8");
xmlOutput.setFormat(format);
xmlOutput.output(doc, new FileWriter(fileName+ ".xml"));

But the result is:

<title><![CDATA[LA LOI DU MARCHxC9]]></title>

It should be "LA LOI DU MARCHÉ".

What should I do to avoid this?

There are 2 answers

Joop Eggen (best answer)

Since the XML output already knows its encoding and records it in the <?xml ... encoding="..." ?> declaration, I prefer @rolfl's solution of writing to a binary OutputStream.

The error here is that FileWriter is a very old utility class that, as used in the question, writes with the platform default encoding, which is not portable. If you do want to keep a Writer, give it an explicit charset:

xmlOutput.output(doc, Files.newBufferedWriter(Paths.get(fileName+ ".xml"),
        StandardCharsets.UTF_8));
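
To check what a Writer opened with an explicit charset actually puts on disk, here is a minimal sketch that uses only the JDK, with no JDOM involved. The class name ExplicitCharsetWriterDemo and its helper methods are made up for illustration. It writes É through Files.newBufferedWriter with UTF-8 into a temporary file and returns the raw bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of JDOM or of the original answer.
public class ExplicitCharsetWriterDemo {

    // Writes text through a Writer opened with an explicit UTF-8 charset
    // and returns the raw bytes that ended up in the file.
    static byte[] writeUtf8(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".xml");
            try {
                try (Writer writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                    writer.write(text);
                }
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letter is given as a Unicode escape so the encoding of this
        // source file does not influence the result.
        System.out.println(hex(writeUtf8("\u00C9"))); // prints "C3 89"
    }
}
```

The two bytes C3 89 are the UTF-8 form of É, so the file content agrees with an encoding="UTF-8" declaration no matter what the platform default is.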

rolfl

This is a common problem with JDOM, and it derives from how Java handles OutputStreams and Writers. In essence, Java does not make the file encoding visible in a Writer, so JDOM cannot check that the Writer's encoding matches the one declared in the XML. In your case, the FileWriter is probably using a non-UTF-8 platform default encoding, so it does not write the Unicode É as UTF-8.

See the notes in the XMLOutputter documentation.
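
The Writer problem can be reproduced with the JDK alone. The sketch below is an illustration only: the class name HiddenWriterEncodingDemo and the method bytesWritten are made up, and OutputStreamWriter stands in for whatever Writer the caller happens to pass. The same write call produces different bytes depending on a charset that code holding only a Writer cannot see.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JDOM or of the original answer.
public class HiddenWriterEncodingDemo {

    // Sends text through a Writer wrapping an in-memory stream and returns
    // the bytes the Writer produced, formatted as uppercase hex.
    static String bytesWritten(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            // Code that receives only a Writer sees this call and nothing
            // about the charset behind it.
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // the letter from the movie title
        System.out.println(bytesWritten(eAcute, StandardCharsets.UTF_8));      // C3 89
        System.out.println(bytesWritten(eAcute, StandardCharsets.ISO_8859_1)); // C9
        System.out.println(bytesWritten(eAcute, StandardCharsets.US_ASCII));   // 3F
    }
}
```

The single byte C9 from the ISO-8859-1 run matches the xC9 shown in the question, which suggests a Latin-1-style default encoding on that machine. A pure ASCII writer would have produced 3F, a question mark.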

The solution is to use a FileOutputStream instead of a FileWriter, so that JDOM does the character encoding itself. Since UTF-8 is the default encoding of a JDOM Format, you don't need to set it. Try it:

XMLOutputter xmlOutput = new XMLOutputter();
xmlOutput.setFormat(Format.getPrettyFormat());
try (OutputStream out = new FileOutputStream(fileName+ ".xml")) {
    xmlOutput.output(doc, out);
}