Scan Data Matrix codes from a PDF file and save them to CSV


Task: scan Data Matrix codes from a PDF file and save them to a CSV file.

File

Final result: 010466010514027621)ZPTsFWoUgqe,91009492ZCUruNv8/rQRlZyH/mZhkRY11D5aW4aLjpVn3DVxFIi7l9gV/pvguWxiVnpTRI0SFkNx1dPavcQYjiQ6DCSnNw==

I cannot work out how to structure this code.

I started studying libraries for working with PDF files, specifically PyPDF2, but ran into a problem: PyPDF2 finds absolutely nothing in the file. I also looked for the sequence in the raw contents of the PDF file, but could not make sense of them.

Please help me with any part of this code (except writing to CSV). Ideally the information would be extracted from the PDF without rendering it to an image, since there are many codes and speed matters.

If anyone knows the structure of PDF files, please tell me whether it is possible to extract the location of each pixel (black square) of a Data Matrix code, and whether that can be translated into the final form.

I would be grateful for any information. Thank you.


There is 1 answer

Тарасов Андрей

You can use my solution:

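import cv2
import fitz  # PyMuPDF
from pylibdmtx import pylibdmtx


def reader(pdf, csv):
    """Decode the embedded Data Matrix images in `pdf` and append the payloads to `csv`."""
    # Binary append mode: pylibdmtx returns each payload as bytes.
    with fitz.open(pdf) as pdf_file, open(csv, 'ab') as csv_file:
        for page_index in range(len(pdf_file)):
            for img in pdf_file.get_page_images(page_index):
                image = fitz.Pixmap(pdf_file, img[0])  # img[0] is the image xref
                if image.height <= 50:
                    continue  # skip small images (presumably not codes)
                image.save("1.png")
                picture = cv2.imread("1.png")
                # Add a white margin (quiet zone) so the decoder can locate the symbol.
                border = cv2.copyMakeBorder(picture, 10, 10, 10, 10,
                                            cv2.BORDER_CONSTANT, value=[255, 255, 255])
                results = pylibdmtx.decode(border)
                if results:  # decode() returns an empty list when nothing is found
                    csv_file.write(results[0].data + b'\n')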
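
One thing to watch when writing the result: the decoded payload can contain commas (the sample result in the question does), so a raw write produces a line that a CSV reader will split into two columns. If the file is meant to be read back as CSV with one code per row, something like the following may help. It is only a sketch: `write_codes` is a hypothetical helper name (not part of pylibdmtx or PyMuPDF), and it assumes the payloads are UTF-8 compatible bytes.

```python
import csv


def write_codes(payloads, csv_path):
    """Append decoded payloads (bytes) to csv_path, one field per row."""
    # newline='' lets the csv module control line endings itself.
    with open(csv_path, 'a', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_MINIMAL)
        for payload in payloads:
            # Assumption: payloads are UTF-8 compatible; undecodable
            # bytes are replaced instead of raising an error.
            text = payload.decode('utf-8', errors='replace')
            # Fields containing commas or quotes are quoted automatically.
            writer.writerow([text])
```

Inside the loop above it could be called as `write_codes([r.data for r in pylibdmtx.decode(border)], csv)` in place of the raw `csv_file.write(...)` call.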