I have been searching for a solution for a long time but haven't been able to find one. There are several similar questions and answers, but they didn't help me.

Basically:

  1. I have some Word documents (xxx.docx) that contain images.
  2. Each image is in WMF format (as far as I can tell when checking manually) and contains tabular information.
  3. I need to extract that table.
    So the task reduces to extracting the image and then getting the table out of it using computer vision.

1. When I try to extract the image, python-docx does not detect it as an image. I then found that the "aspose.words" library can detect it as an image object (even though it is not in a usual image format) and write it out in EMF format (xxx.emf). [If there is any other way, please mention it.]

2. Now I have the image (xxx.emf) in a folder, so the next task is to get the content of the image, which is entirely tabular information. However, I can't find a way to read this format in Python.

So extracting and reading the EMF image is not my goal in itself; the goal is to get the table from the image into Excel. Please help me with these steps, or suggest other approaches that meet the requirement. If anyone needs the docx, it is available here in a repo. Thank you.


There are 2 answers

Mark Setchell On

Word and Excel files (.docx, .xlsx) are actually just ZIP archives. You can unzip them with 7-Zip:

7z x 36C77022Q0250.docx

That gives you the following content:

ls -lR word

drwx------  3 mark  staff     96 10 May 00:38 _rels
-rw-r--r--  1 mark  staff  48763 28 Apr 21:35 document.xml
-rw-r--r--  1 mark  staff   1290 28 Apr 21:35 fontTable.xml
-rw-r--r--  1 mark  staff   2838 28 Apr 21:35 footer1.xml
-rw-r--r--  1 mark  staff   2865 28 Apr 21:35 footer2.xml
-rw-r--r--  1 mark  staff   1246 28 Apr 21:35 header1.xml
-rw-r--r--  1 mark  staff   1246 28 Apr 21:35 header2.xml
drwx------  3 mark  staff     96 10 May 00:38 media
-rw-r--r--  1 mark  staff    755 28 Apr 21:35 settings.xml
-rw-r--r--  1 mark  staff  49239 28 Apr 21:35 styles.xml
drwx------  3 mark  staff     96 10 May 00:38 theme

word/_rels:
total 8
-rw-r--r--  1 mark  staff  1307 28 Apr 21:35 document.xml.rels

word/media:
total 320
-rw-r--r--  1 mark  staff  162672 28 Apr 21:35 image1.wmf      <--- HERE IT IS

You can see your WMF file there. Copy it to the current directory and rename it for simpler access:

cp word/media/image1.wmf image.emf
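Since the question asks for Python, the unzip-and-copy steps above can be sketched with the standard library alone. This is only a minimal sketch: the function names `sniff_metafile` and `extract_media` are made up for illustration, and the header check relies on the usual metafile signatures (an EMF file starts with a type-1 header record and carries " EMF" at byte offset 40, a placeable WMF starts with the key 0x9AC6CDD7). The check is there because the file extension inside the archive does not necessarily tell you which of the two formats you really have.

```python
import struct
import zipfile
from pathlib import Path


def sniff_metafile(data: bytes) -> str:
    """Guess the metafile flavour from the first bytes (hypothetical helper)."""
    # EMF: first record is the header record (type 1), signature " EMF" at offset 40.
    if len(data) >= 44 and data[:4] == b"\x01\x00\x00\x00" and data[40:44] == b" EMF":
        return "emf"
    # Placeable WMF: starts with the little-endian magic key 0x9AC6CDD7.
    if len(data) >= 4 and struct.unpack("<I", data[:4])[0] == 0x9AC6CDD7:
        return "wmf-placeable"
    # Plain WMF: type 1 or 2, followed by a header size of 9 words.
    if len(data) >= 4 and data[:2] in (b"\x01\x00", b"\x02\x00") and data[2:4] == b"\x09\x00":
        return "wmf"
    return "unknown"


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ out of the .docx.

    Returns a list of (written path, guessed type) pairs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    results = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in zf.namelist():
            if name.startswith("word/media/") and not name.endswith("/"):
                data = zf.read(name)
                target = out / Path(name).name
                target.write_bytes(data)
                results.append((target, sniff_metafile(data)))
    return results


# Example call (file name taken from the command above):
# for path, kind in extract_media("36C77022Q0250.docx", "media"):
#     print(path, kind)
```

If the guessed type comes back as "emf", renaming the file to .emf as above is harmless and may help tools that go by the extension.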

You can then convert that to a PNG with either Inkscape or LibreOffice (with Inkscape 1.0 or later, use -o file.png in place of -e file.png):

inkscape -d 288 -e file.png image.emf

libreoffice --headless --convert-to png image.emf


The result looks slightly messed up on my system, I think because I lack your fonts.
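If you want to drive the LibreOffice conversion from Python rather than the shell, something along these lines might work. It is a rough sketch, not a tested pipeline: the helper names are invented here, it assumes the binary is on your PATH as either `libreoffice` or `soffice`, and it adds the `--outdir` option so the output location is predictable.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(src, out_dir, binary="libreoffice"):
    """Build the headless LibreOffice command line as a list of arguments."""
    return [binary, "--headless", "--convert-to", "png", "--outdir", out_dir, src]


def convert_to_png(src, out_dir="."):
    """Run the conversion and return the path where the PNG is expected."""
    # The executable name differs between installations, so try both.
    binary = shutil.which("libreoffice") or shutil.which("soffice")
    if binary is None:
        raise RuntimeError("LibreOffice not found on PATH")
    subprocess.run(build_convert_command(src, out_dir, binary), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(src).stem + ".png")


# Example call: convert_to_png("image.emf", "out")
```

Passing the arguments as a list avoids shell quoting problems when the file name contains spaces.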

kiwiwings On

I don't know much about Python, but I implemented the WMF/EMF/EMF+ classes in Apache POI. I would use the positions of the text records to give them some meaning. The rest is for you to figure out, e.g. by keeping only the lines that have the same number of columns.

import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

import org.apache.poi.hemf.usermodel.HemfPicture;
import org.apache.poi.hwmf.record.HwmfText.WmfExtTextOut;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.junit.jupiter.api.Test;

public class TestWmfExtract {
    @Test
    void blub() throws IOException {
        // Outer key: y position (row); inner key: snapped x position (column); value: cell text.
        Map<Double, Map<Double,String>> tab = new TreeMap<>();

        try (InputStream is = new FileInputStream("36C77022Q0250.docx");
             XWPFDocument doc = new XWPFDocument(is);
             InputStream is2 = doc.getAllPictures().get(0).getPackagePart().getInputStream()
        ) {
            HemfPicture emf = new HemfPicture(is2);

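            // Keep only the text-output records; each carries a string and its position.
            Stream<WmfExtTextOut> st = emf.getRecords().stream()
                .filter(r -> r instanceof WmfExtTextOut)
                .map(WmfExtTextOut.class::cast);
            for (WmfExtTextOut hr : (Iterable<WmfExtTextOut>) (st::iterator)) {
                Point2D p2d = hr.getReference();
                String txt = hr.getText(StandardCharsets.UTF_16LE);
                Rectangle2D bi = (Rectangle2D)hr.getGenericProperties().get("boundsIgnored").get();
                // Prefer the centre of the record's bounds as the column position, else the reference point.
                double x = bi != null ? bi.getCenterX() : p2d.getX();
                // Snap x to a grid of 20 units so slightly offset cells share a column.
                x = 20. * Math.round(x / 20.);
                // Rows are keyed by y, cells within a row by the snapped x.
                tab.computeIfAbsent(p2d.getY(), (d) -> new TreeMap<>()).put(x, txt);
            }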

            // All distinct column positions, sorted left to right; the list index becomes the sheet column.
            List<Double> colX = tab.values().stream().flatMap((m) -> m.keySet().stream())
                .distinct().sorted().collect(Collectors.toList());

            try (Workbook wb = new XSSFWorkbook();
                 FileOutputStream fos = new FileOutputStream("tab-out.xlsx")) {
                Sheet sh = wb.createSheet();

                int rowIdx = 0;
                for (Map<Double, String> cols : tab.values()) {
                    Row row = sh.createRow(rowIdx);
                    // Place each text in the sheet column that matches its snapped x position.
                    cols.forEach((x, txt) -> row.createCell(colX.indexOf(x)).setCellValue(txt));
                    rowIdx++;
                }

                wb.write(fos);
            }
        }
    }
}
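For Python readers, the positioning logic of the Java code above (rows keyed by y, columns by x snapped to a grid of 20 units) could be sketched roughly as follows. Everything here is an assumption made for illustration: the input is a plain list of hypothetical `(x, y, text)` tuples that you would have to obtain from the metafile by some other means, the function names are invented, and the output is CSV rather than xlsx so that only the standard library is needed (Excel can open the file). `keep_regular_rows` is one possible reading of "only using lines with the same number of columns".

```python
import csv
from collections import Counter, defaultdict


def build_grid(records, snap=20.0):
    """Turn (x, y, text) records into a dense list of rows.

    Note: Python's round() rounds halves to even, Java's Math.round() rounds
    halves up, so values exactly between two grid lines may snap differently.
    """
    rows = defaultdict(dict)
    for x, y, text in records:
        rows[y][snap * round(x / snap)] = text
    # Distinct column positions, left to right; the index is the column number.
    columns = sorted({c for cells in rows.values() for c in cells})
    index = {c: i for i, c in enumerate(columns)}
    grid = []
    for y in sorted(rows):
        line = [""] * len(columns)
        for c, text in rows[y].items():
            line[index[c]] = text
        grid.append(line)
    return grid


def keep_regular_rows(grid):
    """Keep only rows whose number of filled cells is the most common one."""
    filled = [sum(1 for cell in row if cell) for row in grid]
    if not filled:
        return []
    common = Counter(filled).most_common(1)[0][0]
    return [row for row, n in zip(grid, filled) if n == common]


def write_grid_csv(grid, path):
    """Write the rows as CSV; the BOM helps Excel detect UTF-8."""
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        csv.writer(fh).writerows(grid)


# Example with made-up records:
# grid = build_grid([(101, 10, "Item"), (299, 10, "Qty"), (98, 30, "Bolt"), (305, 30, "4")])
# write_grid_csv(keep_regular_rows(grid), "tab-out.csv")
```

The snap width of 20 is simply the value used in the Java code; a different document may need a different value.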