How to convert drawn shapes of HWPFDocument to XSL-FO?

I am trying to convert a .doc file to PDF. My approach is a two-step conversion: .doc > XSL-FO > PDF.

When converting the .doc to XSL-FO, drawn objects such as checkboxes, rectangles, and squares are not converted to XSL-FO.

The output looks like the following, where there should actually be a box:

PDF output

The conversion code I am using is

    // Load the .doc and convert it to an XSL-FO DOM document
    HWPFDocumentCore wordDocument = WordToFoUtils.loadDoc(is);
    WordToFoConverter wordToFoConverter = new WordToFoConverter(
            XMLHelper.getDocumentBuilderFactory().newDocumentBuilder().newDocument());
    wordToFoConverter.processDocument(wordDocument);
    File foFile = new File("D:\\Testing\\testing\\" + "test.fo");

    // Serialize the FO DOM into a byte buffer
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    StreamResult streamResult = new StreamResult(out);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.transform(new DOMSource(wordToFoConverter.getDocument()), streamResult);

    // Normalize to NFD, collapse whitespace, then URL-encode the FO text
    String result = org.apache.commons.lang3.StringUtils.normalizeSpace(
            java.text.Normalizer.normalize(
                    new String(out.toByteArray(), "UTF-8"),
                    java.text.Normalizer.Form.NFD));
    result = URLEncoder.encode(result, "UTF-8");

Apache FOP is then used to convert the .fo to PDF.

The .doc file is as below

Word input

and the WordToFoConverter converted the boxes as below

FO resulting from the conversion

There is 1 answer

K J

In plain text such as XML, check boxes usually come from basic symbol fonts.

They appear as ☐ when unchecked, or as ☑ or ☒ when checked.

In any basic text stream it should be relatively easy to insert them, or to find and replace them. However, beware of the encoding, especially UTF; the characters are best copied from a clean set such as Zapf Dingbats or the Adobe TTF Symbol font.

Many have a Unicode description, but do test visually that they still work after copy and paste from the PDF, since the font mapping may not always tally.
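
As a rough illustration of the find-and-replace idea and the encoding caveat, here is a minimal Java sketch. The `[[CHECKBOX]]` marker and the class and method names are hypothetical; you would need to inspect what the converter actually emits for a drawn shape before choosing what to search for.

```java
import java.nio.charset.StandardCharsets;

final class CheckboxReplace {
    // Hypothetical marker, not something WordToFoConverter is known to emit.
    static final String MARKER = "[[CHECKBOX]]";
    // U+2610 BALLOT BOX, the unchecked box glyph.
    static final String BALLOT_BOX = "\u2610";

    /** Replaces every marker in the FO text with the ballot box glyph. */
    static String insertCheckboxes(String foText) {
        return foText.replace(MARKER, BALLOT_BOX);
    }

    /** Encodes text as UTF-8 with an explicit charset, never the platform default. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes back to text. */
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String fo = insertCheckboxes("<fo:inline>" + MARKER + "</fo:inline>");
        byte[] bytes = toUtf8(fo);

        // Decoding with the same charset preserves the glyph.
        System.out.println(fromUtf8(bytes).equals(fo));

        // Decoding with a different charset turns the one glyph into three
        // unrelated characters, which is the kind of damage to watch for.
        String mangled = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(mangled.equals(fo));
    }
}
```

The point is only that the glyph is ordinary text: it survives as long as every read and write along the way names the same encoding.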

⌧  decimal 8999, hex 0x2327, escaped \002327: X in a rectangle box
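
If you want to check the numbers for a glyph yourself rather than trust a copied table, something like the following Java sketch should do. The class and method names are made up for illustration; `Character.getName` is a standard JDK method, and the name it returns depends on the Unicode data shipped with your JDK.

```java
import java.util.Locale;

final class CheckboxGlyphs {
    // The glyphs mentioned above, written as escapes so the source file
    // encoding cannot interfere.
    static final String BALLOT_BOX = "\u2610";
    static final String BALLOT_BOX_WITH_CHECK = "\u2611";
    static final String BALLOT_BOX_WITH_X = "\u2612";
    static final String X_IN_RECTANGLE_BOX = "\u2327";

    /** Formats the first code point of the glyph as "U+HEX decimal NAME". */
    static String describe(String glyph) {
        int cp = glyph.codePointAt(0);
        return String.format(Locale.ROOT, "U+%04X %d %s", cp, cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        String[] glyphs = {
            BALLOT_BOX, BALLOT_BOX_WITH_CHECK, BALLOT_BOX_WITH_X, X_IN_RECTANGLE_BOX
        };
        for (String glyph : glyphs) {
            System.out.println(describe(glyph));
        }
    }
}
```

This only confirms the code points; whether a given font actually draws a box at that code point still has to be checked visually in the PDF.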

By far the simplest way to use Unicode text is as Rich Text, which on the Windows command line you can export as a PDF (portable document file) using Write.exe, which can read TXT and supports /PrintTo PDF. (You don't need the dialogue in the lower left of the screenshot; it is only there to illustrate the export settings.)

It's much simpler than XML, where just one character requires:

    <w:rPr>
        <w:rFonts w:ascii="Segoe UI Symbol" w:hAnsi="Segoe UI Symbol" w:cs="Segoe UI Symbol" w:eastAsia="Segoe UI Symbol"/>
        <w:color w:val="auto"/>
        <w:spacing w:val="0"/>
        <w:position w:val="0"/>
        <w:sz w:val="48"/>
        <w:shd w:fill="auto" w:val="clear"/>
        <w:vertAlign w:val="subscript"/>
    </w:rPr>
    <w:t xml:space="preserve">☑</w:t>
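
Since the question targets XSL-FO rather than WordprocessingML, here is a hedged Java sketch of what the same single glyph might look like on the FO side, built with the JDK's own DOM and Transformer classes. It does not use or modify WordToFoConverter; the class and method names are illustrative, and it assumes the chosen font (Segoe UI Symbol here, taken from the XML above) is available to whatever renders the FO.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class FoCheckboxSketch {
    // The XSL-FO namespace URI.
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Builds an fo:inline holding one glyph and returns it as serialized XML. */
    static String checkboxInline(String fontFamily, String glyph) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            // fo:inline carries the glyph; font-family names a symbol-capable font.
            Element inline = doc.createElementNS(FO_NS, "fo:inline");
            inline.setAttribute("font-family", fontFamily);
            inline.setTextContent(glyph);
            doc.appendChild(inline);

            // Serialize with an explicit encoding and without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the FO fragment", e);
        }
    }

    public static void main(String[] args) {
        // U+2611 BALLOT BOX WITH CHECK
        System.out.println(checkboxInline("Segoe UI Symbol", "\u2611"));
    }
}
```

A fragment like this would still have to be placed inside an fo:block in the converted document, and the PDF checked visually, since the font mapping may not tally.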