Convert .doc with utf-8 to html using xdocreport


I am converting a doc file to HTML using the following code:

    public static byte[] generateHTMLFromDoc(byte[] docBytes) {
        try(ByteArrayInputStream inputStream = new ByteArrayInputStream(docBytes);
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            XWPFDocument document = new XWPFDocument(inputStream);
            XHTMLOptions options = XHTMLOptions.create();
            Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
            options.setExtractor(imageExtractor);
            options.URIResolver(imageExtractor);
            XHTMLConverter.getInstance().convert(document, outputStream, options);
            return outputStream.toByteArray();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

But my file contains UTF-8 (non-ASCII) characters, and in the HTML they are displayed as ???, for example: m?c l?c. How can I add UTF-8 to the options?

I tried to add

options.setEncoding("UTF-8");

but XHTMLOptions has no setEncoding method.

Answer by Axel Richter (best answer)

From your code I guess you are using very old versions of XDocReport and Apache POI, so I suggest updating to more current versions.

The current XDocReport version 2.0.4 already provides the image manager Base64EmbedImgManager, so no special Base64ImageExtractor is needed.

The following works for me.

import java.io.*;

//needed jars: xdocreport-2.0.4.jar, 
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;

//needed jars: all apache poi dependencies of poi-ooxml version 5.2.3
import org.apache.poi.xwpf.usermodel.*;

public class DOCXToXHTMLXDocReport {

 public static void main(String[] args) throws Exception {

  String docPath = "./WordDocument.docx";
  String htmlPath = "./WordDocument.html";

  // try-with-resources closes the streams and the document even if the conversion fails
  try (FileInputStream in = new FileInputStream(docPath);
       XWPFDocument document = new XWPFDocument(in);
       FileOutputStream out = new FileOutputStream(htmlPath)) {

   // embed images directly in the HTML instead of extracting them to files
   XHTMLOptions options = XHTMLOptions.create().setImageManager(new Base64EmbedImgManager());
   XHTMLConverter.getInstance().convert(document, out, options);
  }

  // open the generated HTML in the default browser
  java.awt.Desktop.getDesktop().browse(new File(htmlPath).toPath().toRealPath(java.nio.file.LinkOption.NOFOLLOW_LINKS).toUri());
 
 }
}

After upgrading your versions, you should test again. If you still have problems with Unicode in WordDocument.docx, then you should provide an example WordDocument.docx that produces those problems. That will make your problem reproducible for others.


I found the Unicode problem. fr.opensagres.poi.xwpf.converter.xhtml.SimpleContentHandler uses String.getBytes() without an explicit Charset. According to the Javadoc, that "Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."

So whether Unicode is handled correctly depends on the platform's default charset. If the platform is Windows and no other charset is set via environment variables, the default charset is typically windows-1252 (this applies to Java versions before 18; since Java 18 the default is UTF-8). windows-1252 cannot represent most Unicode characters, so getBytes() replaces them with ?.
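To illustrate the effect, here is a minimal sketch that does not use XDocReport at all. The class name CharsetRoundTrip and the method roundTrip are made up for this example. It encodes the string from the question explicitly with windows-1252 and with UTF-8, which is what String.getBytes() does implicitly with whatever the default charset happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XDocReport or Apache POI.
class CharsetRoundTrip {

    // Encodes the text with the given charset and decodes it again.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for windows-1252.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The text from the question, written with Unicode escapes so the
        // source file encoding does not matter.
        String text = "m\u1EE5c l\u1EE5c";

        // windows-1252 has no mapping for U+1EE5, so it is lost.
        System.out.println(roundTrip(text, Charset.forName("windows-1252"))); // prints m?c l?c

        // UTF-8 can encode every Unicode character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

The damage happens at encoding time, so nothing on the HTML side can repair it afterwards. The bytes have to be produced with UTF-8 in the first place.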

You can check via:

System.out.println("Default Charset=" + java.nio.charset.Charset.defaultCharset());  

For how to set UTF-8 as the platform's default charset on Windows, please see "Setting the default Java character encoding".

For me, starting the class DOCXToXHTMLXDocReport via

java -Dfile.encoding=UTF-8 ... DOCXToXHTMLXDocReport

produces the correct Unicode results in the HTML on Windows too.
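Because a forgotten -Dfile.encoding=UTF-8 silently produces ??? instead of an error, it may help to fail fast before converting. The following is only a sketch of that idea; the class DefaultCharsetGuard and its methods are hypothetical names, not part of XDocReport or Apache POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for this example.
class DefaultCharsetGuard {

    // Returns true if the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    // Throws if the JVM default charset is not UTF-8, instead of letting
    // a later conversion quietly replace characters with '?'.
    static void requireUtf8Default() {
        Charset current = Charset.defaultCharset();
        if (!isUtf8(current)) {
            throw new IllegalStateException("Default charset is " + current
                    + "; start the JVM with -Dfile.encoding=UTF-8");
        }
    }

    public static void main(String[] args) {
        requireUtf8Default();
        System.out.println("Default charset is UTF-8.");
    }
}
```

Calling requireUtf8Default() at the start of main in DOCXToXHTMLXDocReport would turn the encoding problem into an immediate, readable exception.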