Convert .doc with utf-8 to html using xdocreport

Question

Asked by Pla on 17 October 2023 at 19:38

I am converting a doc file to HTML using the following code:

    public static byte[] generateHTMLFromDoc(byte[] docBytes) {
        try(ByteArrayInputStream inputStream = new ByteArrayInputStream(docBytes);
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            XWPFDocument document = new XWPFDocument(inputStream);
            XHTMLOptions options = XHTMLOptions.create();
            Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
            options.setExtractor(imageExtractor);
            options.URIResolver(imageExtractor);
            XHTMLConverter.getInstance().convert(document, outputStream, options);
            return outputStream.toByteArray();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

But my file contains non-ASCII (UTF-8) characters, and these are displayed as ???, for example: m?c l?c. How can I set UTF-8 in the options?

I tried to add

options.setEncoding("UTF-8");

but XHTMLOptions has no setEncoding method.

There is 1 answer

**Axel Richter** · Accepted Answer · 2023-10-19T04:35:47+00:00

From your code I guess you are using very old versions of XDocReport and Apache POI, so I suggest updating to more current versions.

The current XDocReport version 2.0.4 already provides the ImageManager Base64EmbedImgManager, so no special Base64ImageExtractor is needed.

The following works for me.

import java.io.*;

//needed jars: xdocreport-2.0.4.jar, 
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;

//needed jars: all apache poi dependencies of poi-ooxml version 5.2.3
import org.apache.poi.xwpf.usermodel.*;

public class DOCXToXHTMLXDocReport {

 public static void main(String[] args) throws Exception {

  String docPath = "./WordDocument.docx";
  String htmlPath = "./WordDocument.html";

  // try-with-resources closes the document and the output stream even if the conversion fails
  try (XWPFDocument document = new XWPFDocument(new FileInputStream(docPath));
       FileOutputStream out = new FileOutputStream(htmlPath)) {

   // embed images into the HTML via Base64EmbedImgManager instead of a custom extractor
   XHTMLOptions options = XHTMLOptions.create().setImageManager(new Base64EmbedImgManager());
   XHTMLConverter.getInstance().convert(document, out, options);
  }

  // open the generated HTML file in the default browser
  java.awt.Desktop.getDesktop().browse(new File(htmlPath).toPath().toRealPath(java.nio.file.LinkOption.NOFOLLOW_LINKS).toUri());

 }
}

After upgrading your versions, you should test again. If you still have problems with Unicode in WordDocument.docx, then you should provide an example WordDocument.docx which produces those problems. That will make your problem reproducible for others.

Found the Unicode problem: fr.opensagres.poi.xwpf.converter.xhtml.SimpleContentHandler uses String.getBytes without an explicit Charset. According to the Javadoc, that method "Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array."

So whether Unicode is handled correctly depends on the platform's default charset. If the platform is Windows and no other charset is set via environment variables, the platform's default charset is typically windows-1252 (depending on the system locale). That, of course, cannot represent most Unicode characters.
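To illustrate the effect described above, here is a minimal sketch using only the JDK. It does not call XDocReport at all; the class name CharsetLossDemo and the helper roundTrip are made up for this illustration. It encodes the sample text from the question with windows-1252 and with UTF-8 and decodes it again, which is roughly what happens to text that passes through String.getBytes under each default charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of XDocReport or Apache POI.
public class CharsetLossDemo {

    // Encodes the text with the given charset and decodes the bytes again
    // with the same charset. Characters the charset cannot represent are
    // replaced by the charset's replacement byte during encoding.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "m\u1EE5c l\u1EE5c" is the Vietnamese sample text from the question.
        String text = "m\u1EE5c l\u1EE5c";

        // windows-1252 has no mapping for U+1EE5, so it is replaced by '?'.
        System.out.println(roundTrip(text, Charset.forName("windows-1252")));

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));
    }
}
```

With windows-1252 the round trip yields m?c l?c, the same damage as in the question, while the UTF-8 round trip returns the text unchanged.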

You can check via:

System.out.println("Default Charset=" + java.nio.charset.Charset.defaultCharset());
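Building on that check, one could also test at startup whether the default charset is able to encode the text at all, and warn before converting. This is only a sketch under assumptions: the class DefaultCharsetCheck and its method canEncode are hypothetical names, and the sample string is the one from the question. Note that on Java 18 and later the default charset is UTF-8 unless it is overridden, so the warning would normally not trigger there.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of XDocReport or Apache POI.
public class DefaultCharsetCheck {

    // Returns true if the charset supports encoding and can encode
    // every character of the sample text.
    public static boolean canEncode(Charset charset, String sample) {
        return charset.canEncode() && charset.newEncoder().canEncode(sample);
    }

    public static void main(String[] args) {
        Charset defaultCharset = Charset.defaultCharset();
        String sample = "m\u1EE5c l\u1EE5c"; // sample text from the question

        if (canEncode(defaultCharset, sample)) {
            System.out.println("Default charset " + defaultCharset + " can encode the sample text.");
        } else {
            System.err.println("Default charset " + defaultCharset
                + " cannot encode the sample text; consider starting the JVM with -Dfile.encoding=UTF-8.");
        }
    }
}
```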

For how to set UTF-8 as the platform's default charset on Windows, please see "Setting the default Java character encoding".

For me, starting the class DOCXToXHTMLXDocReport via

java -Dfile.encoding=UTF-8 ... DOCXToXHTMLXDocReport

produces the correct Unicode results in the HTML on the Windows platform too.
