EasyOCR isn't adding proper linebreaks to the extracted text

Here's the gist of the code:

import fitz
import easyocr
from PIL import Image

def extract_text_from_pdf(pdf_path):
    reader = easyocr.Reader(['en'], download_enabled=False)
    pdf_document = fitz.open(pdf_path)
    extracted_text = ""
    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        # Render the page at 300 DPI (PDF user space is 72 units per inch)
        resolution = 300
        zoomfactor = resolution / 72.0
        pixmap = page.get_pixmap(matrix=fitz.Matrix(zoomfactor, zoomfactor))
        image = pixmap.tobytes()  # PNG-encoded bytes
        result = reader.readtext(image, paragraph=True)

        print(f"Page {page_number + 1} - OCR Result:")
        for detection in result:
            extracted_text += detection[1]

    pdf_document.close()

    return extracted_text

The image passed looks something like this:

[image]

But the extracted text looks like this: "account: 1234url: xyz"

The expectation is:

"account: 1234

url: xyz"

It seems like EasyOCR is extracting each word separately instead of reading the image line by line, probably because there's a huge space between the words on a single line.

Can you please suggest something?

There is 1 answer

Answered by Markus Safar

According to the documentation, you can adjust how bounding boxes are merged:

x_ths (float, default = 1.0) - Maximum horizontal distance to merge text boxes when paragraph=True.
y_ths (float, default = 0.5) - Maximum vertical distance to merge text boxes when paragraph=True.

Modifying one of these should do the trick.

Update:
According to the OP, setting x_ths to 1000.0 solved the issue.
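
For illustration, here is a minimal sketch of how that fix might be applied. The helper names join_paragraphs and ocr_lines are hypothetical and not part of EasyOCR. The sketch assumes readtext with paragraph=True returns [bbox, text] pairs, as the indexing with detection[1] in the question suggests. The value 1000.0 is the one the OP reported, and whether closely spaced lines stay separate may also depend on y_ths. Note that the question's code concatenates the detections without any separator, so the sketch joins them with newlines.

```python
def join_paragraphs(result):
    """Join [bbox, text] pairs (paragraph=True output) with newlines."""
    return "\n".join(text for _bbox, text in result)


def ocr_lines(image_bytes, x_ths=1000.0):
    """Run EasyOCR with a wide horizontal merge distance (hypothetical helper).

    A large x_ths lets boxes on the same line merge even when the gap
    between them is wide; y_ths is left at its default.
    """
    import easyocr  # imported lazily so join_paragraphs works without it

    reader = easyocr.Reader(['en'])
    result = reader.readtext(image_bytes, paragraph=True, x_ths=x_ths)
    return join_paragraphs(result)
```

In the question's loop, this would replace the readtext call and the inner for loop, e.g. extracted_text += ocr_lines(image) + "\n", ideally with the Reader created once outside the loop as in the original code.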