I am converting a Word document to HTML using the following code:
public static byte[] generateHTMLFromDoc(byte[] docBytes) {
    try (ByteArrayInputStream inputStream = new ByteArrayInputStream(docBytes);
         ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
        XWPFDocument document = new XWPFDocument(inputStream);
        XHTMLOptions options = XHTMLOptions.create();
        Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
        options.setExtractor(imageExtractor);
        options.URIResolver(imageExtractor);
        XHTMLConverter.getInstance().convert(document, outputStream, options);
        return outputStream.toByteArray();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}
But my file contains UTF-8 text, and it is displayed as ???, for example: m?c l?c
Please help me: how do I set UTF-8 in the options?
I tried to add
options.setEncoding("UTF-8");
but there is no setEncoding method on XHTMLOptions.
From your code, I guess you are using very old versions of XDocReport and Apache POI, so I suggest updating to more current versions.
The current XDocReport version 2.0.4 already provides the ImageManager Base64EmbedImgManager, so no special Base64ImageExtractor is needed.
The following works for me.
After upgrading your versions, you should test again. If you still have problems with Unicode in WordDocument.docx, you should provide an example WordDocument.docx that produces those problems. That will make your problem reproducible for others.
Found the Unicode problem. fr.opensagres.poi.xwpf.converter.xhtml.SimpleContentHandler uses String.getBytes without an explicit Charset. That "Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."
So whether Unicode is handled correctly depends on the platform's default charset. If the platform is Windows and no other charset is set via environment variables, the platform's default charset is typically windows-1252 (the exact code page depends on the system locale). That, of course, cannot represent most Unicode characters.
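To illustrate why the charset matters, here is a minimal sketch using only the Java standard library. It is not XDocReport code; the class name CharsetRoundTrip and the helper roundTrip are made up for this illustration. It encodes the phrase from the question with windows-1252 and with UTF-8 and decodes it again, which reproduces the m?c l?c symptom in the first case.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XDocReport or Apache POI.
class CharsetRoundTrip {

    // Encode the text to bytes with the given charset, then decode it with the same charset.
    // String.getBytes(Charset) replaces characters the charset cannot represent.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The phrase from the question; U+1EE5 is written as an escape
        // so that the encoding of this source file does not matter.
        String text = "m\u1EE5c l\u1EE5c";

        Charset windows1252 = Charset.forName("windows-1252");

        // windows-1252 has no mapping for U+1EE5, so each occurrence becomes '?'.
        System.out.println(roundTrip(text, windows1252).equals("m?c l?c")); // prints true

        // UTF-8 can encode every Unicode character, so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

A call to getBytes() without an argument behaves like the first case whenever the platform's default charset is windows-1252.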
You can check via:
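One way to inspect the default charset, as a minimal sketch with the standard library only (the class name DefaultCharsetCheck and the helper defaultCharsetName are illustrative assumptions):

```java
import java.nio.charset.Charset;

// Hypothetical helper class for inspecting the JVM's default charset.
class DefaultCharsetCheck {

    // Name of the charset that String.getBytes() without arguments uses on this JVM.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
        // The file.encoding system property usually reflects the same setting.
        System.out.println("file.encoding: " + System.getProperty("file.encoding"));
    }
}
```

If this prints anything other than UTF-8, characters outside the default charset are likely to be replaced in the generated HTML.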
For how to set UTF-8 as the platform's default charset on Windows, please see "Setting the default Java character encoding".
For me, starting the class DOCXToXHTMLXDocReport with UTF-8 set as the default charset (for example with the JVM option -Dfile.encoding=UTF-8) produces the correct Unicode results in HTML on the Windows platform too.