Issue with table formatting when converting DOCX to PDF using fr.opensagres.poi.xwpf.converter.pdf (Apache POI)

I am trying to create a table in a DOCX file and then convert it to a PDF using Apache POI (version 5.2.3) and the XWPF Converter (version 2.0.4) library. I have successfully created the table and merged cells in the DOCX file. However, when I convert the DOCX file to PDF using the XWPF Converter, the resulting PDF does not have the proper formatting.

ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
PdfOptions options = PdfOptions.create();
PdfConverter.getInstance().convert(document, byteArrayOutputStream, options);
byte[] pdfBytes = byteArrayOutputStream.toByteArray();

Expected result: I expect the converted PDF to maintain the table formatting and cell merging as it appears in the original DOCX file.

Actual result: The converted PDF does not accurately reflect the formatting of the table and merged cells.

Answer by Axel Richter

The XDocReport developers have done a great job of handling the really complex file structure of a Microsoft Word *.docx document in Office Open XML format. But, of course, there are always unsolved problems.

When it comes to tables in Word, the following problems are known to me:

A Word table might have row heights that are not set explicitly and so are determined only by the content. In that case XDocReport does not take the font descenders into account when calculating the height.

A Word table might have table cells hidden using gridBefore and wBefore for cells before the first cell in a row, and/or gridAfter and wAfter for cells after the last cell in a row. Such cells are then not part of the row, and they are not set via cell merging either. XDocReport does not take this into account, and because of the missing cells the whole table structure gets damaged.

A Word table might have an alternating row background set through a table style. XDocReport does not take this into account.

There might be more. But I doubt there is any free software out there that really handles all of the complex possibilities of a Microsoft Word document. Even commercial software, except Microsoft Word itself, will have issues there.
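To check whether a given document contains the row-height or gridBefore/gridAfter constructs described above, one can inspect the XML of word/document.xml directly. The following is only a minimal sketch under stated assumptions, not a definitive implementation: the class DocxTableCheck and its method names are hypothetical and belong neither to Apache POI nor to XDocReport, it uses only the JDK's DOM parser, and it assumes the content of word/document.xml has already been read into a string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Apache POI or XDocReport.
class DocxTableCheck {

 // Namespace of the WordprocessingML elements (w:tr, w:trPr, ...).
 static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

 // Parses the XML text namespace-aware, so elements can be matched by local name.
 static Document parse(String xml) {
  try {
   DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
   factory.setNamespaceAware(true);
   return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
  } catch (Exception e) {
   throw new IllegalArgumentException("Could not parse XML", e);
  }
 }

 // Returns the first direct child element w:<localName> of parent, or null if there is none.
 static Element child(Element parent, String localName) {
  for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
   if (n instanceof Element && W_NS.equals(n.getNamespaceURI())
     && localName.equals(n.getLocalName())) {
    return (Element) n;
   }
  }
  return null;
 }

 // Counts table rows whose w:trPr contains w:gridBefore or w:gridAfter.
 static int countRowsWithHiddenCells(String documentXml) {
  NodeList rows = parse(documentXml).getElementsByTagNameNS(W_NS, "tr");
  int count = 0;
  for (int i = 0; i < rows.getLength(); i++) {
   Element trPr = child((Element) rows.item(i), "trPr");
   if (trPr != null
     && (child(trPr, "gridBefore") != null || child(trPr, "gridAfter") != null)) {
    count++;
   }
  }
  return count;
 }

 // Counts table rows that have no w:trHeight element in their w:trPr.
 static int countRowsWithoutTrHeight(String documentXml) {
  NodeList rows = parse(documentXml).getElementsByTagNameNS(W_NS, "tr");
  int count = 0;
  for (int i = 0; i < rows.getLength(); i++) {
   Element trPr = child((Element) rows.item(i), "trPr");
   if (trPr == null || child(trPr, "trHeight") == null) {
    count++;
   }
  }
  return count;
 }
}
```

A *.docx file is a ZIP archive, so word/document.xml can be read with java.util.zip.ZipFile and passed to these methods. Non-zero counts suggest the table may run into the problems described above. The alternating row background comes from the table style in word/styles.xml and is not covered by this sketch.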

The following short, complete Java program can be used to test:

import java.io.*;

//needed jars: fr.opensagres.poi.xwpf.converter.core-2.0.4.jar, 
//             fr.opensagres.poi.xwpf.converter.pdf-2.0.4.jar,
//             fr.opensagres.xdocreport.itext.extension-2.0.4.jar,
//             itext-4.2.1.jar                                   
import fr.opensagres.poi.xwpf.converter.pdf.PdfOptions;
import fr.opensagres.poi.xwpf.converter.pdf.PdfConverter;

//needed jars: apache poi 5.2.3 and its dependencies
//             and additionally: poi-ooxml-full-5.2.3.jar 
import org.apache.poi.xwpf.usermodel.*;

public class XWPFToPDFConverterSampleMin {

 public static void main(String[] args) throws Exception {

  String docPath = "./XWPFDocument.docx";
  String outputFile = "./XWPFDocument.pdf";

  // try-with-resources closes the document and both streams, even if the conversion fails
  try (InputStream in = new FileInputStream(docPath);
       XWPFDocument document = new XWPFDocument(in);
       OutputStream out = new FileOutputStream(outputFile)) {

   PdfOptions options = PdfOptions.create();
   PdfConverter.getInstance().convert(document, out, options);

  }

 }
}

The XWPFDocument.docx looks like so:

[Screenshot of XWPFDocument.docx]

The resulting XWPFDocument.pdf looks like so:

[Screenshot of XWPFDocument.pdf]