Error on getting the xref of an image with PyMuPDF using page.get_text("dict")["blocks"]


With the following Python function, I'm trying to extract text and images from a PDF document. I also want to insert a label such as f"<<<image_{image_counter}>>>" into the extracted text at the exact location of the corresponding image. This is the function I have:

import fitz  # PyMuPDF

def extract_text_and_save_images_not_working(pdf_path):

    doc = fitz.open(pdf_path)
    full_text = ""
    image_counter = 1  # Initialize the image counter before iterating through pages
    
    for page_num in range(len(doc)): # Iterate through each page of the pdf document

        page = doc.load_page(page_num) # Load the pdf page
        blocks = page.get_text("dict")["blocks"]  # The list of block dictionaries 
        
        for block in blocks:  # Iterate through each block

            if block['type'] == 0:  # If the block is a text block
                for line in block["lines"]:  # Iterate through lines in the block
                    for span in line["spans"]:  # Iterate through spans in the line
                        full_text += span["text"] + " "  # Append text to full_text
                full_text += "\n"  # Add newline after each block

            elif block['type'] == 1:  # If the block is an image block
                image_label = f"<<<image_{image_counter}>>>"  # Label to insert in the extracted text in place of the corresponding image 
                full_text += f"{image_label}\n"  # Insert image label at the image location
                img = block['image']
                xref = img[0]
                print()
                print(xref)
                print()
                base_image = doc.extract_image(xref)  # Attempt to extract image
                image_bytes = base_image["image"]  # Get the image bytes
                image_filename = f"image_{image_counter}.png"

                with open(image_filename, "wb") as img_file:  # Save the image
                    img_file.write(image_bytes)
                
                image_counter += 1  # Increment counter for next image regardless of extraction success

    doc.close() # Close the pdf document
    return full_text

Basically, the function extracts the block dictionaries of each page with blocks = page.get_text("dict")["blocks"] and, for each block, checks whether it is a text block (block['type'] == 0) or an image block (block['type'] == 1). If the block is an image, the function saves the image in the same directory as the running script under the name f"image_{image_counter}.png" and adds a label (f"<<<image_{image_counter}>>>") to the extracted text at the line that corresponds to the image's position in the PDF. When I run this function, I get the following error:

Traceback (most recent call last):
  File "c:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\extract_text_and_images_from_pdf.py", line 93, in <module>
    extracted_text = extract_text_and_save_images_not_working(pdf_path)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\extract_text_and_images_from_pdf.py", line 76, in extract_text_and_save_images_not_working
    base_image = doc.extract_image(xref)  # Attempt to extract image
                 ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\xxxx\Desktop\X_Project\extract_images_from_pdf\venv\Lib\site-packages\fitz\__init__.py", line 3894, in extract_image
    raise ValueError( MSG_BAD_XREF)
ValueError: bad xref

The error itself makes sense: the variable xref should hold an integer representing the cross-reference number of the image, but instead it holds a different integer that is not the correct cross-reference number. For the specific PDF document I'm using, I expect xref = 52 but get xref = 137.
