Reading images from a PDF and extracting text from them

Question

263 views · Asked by Piyush Gupta on 02 May 2022 at 12:28

Problem Statement: I have a PDF with n pages, and each page contains one image whose text I need to read and then process.

What I tried: I have to do this in Python, and the library that has given me the best results so far is pytesseract. Here is the sample code I tried:

    # Module-level imports this method relies on
    import io
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    # Method body: self, kw, index, recursion and page_number_list come from the caller.
    # Uses current PyMuPDF names (formerly pageCount, loadPage, getPixmap, getImageData).
    fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
    zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
    if 3.3 < zoom < 3.5:
        zoom_config += ' --oem 3 --psm 4'
    elif 0 != page_number_list[0]:
        zoom_config += ' --psm 6'
    full_text, page_length = '', kw['doc'].page_count
    if recursion and index >= 10:
        # Stop retrying after 10 attempts and return the best text found so far
        return fn.get('most_correct') or fn.get(page_number_list[0])
    mat = fitz.Matrix(zoom, zoom)  # increase resolution
    for page_no in page_number_list:
        page = kw['doc'].load_page(page_no)  # 0-based page index
        pix = page.get_pixmap(matrix=mat)
        # Pixmap.tobytes() returns PNG data by default, which Pillow can open
        with Image.open(io.BytesIO(pix.tobytes())) as img:
            text_of_each_page = str(pytesseract.image_to_string(img, config=zoom_config)).strip()
        fn[page_no] = text_of_each_page  # cache the OCR result per page
        full_text = '\n'.join((full_text, text_of_each_page, '\n'))
    _logger.critical(f"full text in load image {full_text}")
    args = (full_text, page_number_list)
    load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
    if recursion and load:
        return self.load_image
    return full_text

The issue: My PDF contains dates such as 1/13 and 1/7, but the library reads them as 143 and 1n, and in some places it reads 17 as 1). It also appends random symbols such as { & . , = after the text, even though none of these appear in the PDF.

What I tried to improve accuracy:

1. I tried converting the image to .tiff format, but it did not help.
2. I tried adjusting the resolution of the image.
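
For reference, the resolution adjustment in the code above is controlled by the zoom factor passed to fitz.Matrix. PDF pages are measured in points (1/72 inch), so a zoom of 1 renders at 72 dpi and the effective resolution is zoom × 72. A minimal sketch of that arithmetic follows; the helper names zoom_for_dpi and dpi_for_zoom are hypothetical and are not part of PyMuPDF. Roughly 300 dpi is commonly suggested as a starting point for Tesseract, but the best value for a given scan is something to determine by experiment.

    def zoom_for_dpi(target_dpi, base_dpi=72.0):
        """Zoom factor for fitz.Matrix(zoom, zoom) that renders at target_dpi."""
        return target_dpi / base_dpi


    def dpi_for_zoom(zoom, base_dpi=72.0):
        """Effective resolution produced by a given zoom factor."""
        return zoom * base_dpi


    # The 3.3 to 3.5 zoom range tested in the question corresponds to
    # roughly 238 to 252 dpi, so it sits below a 300 dpi target.
    zoom_300 = zoom_for_dpi(300)  # about 4.17
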

Original Q&A

There is 1 answer

**learning_bunny** · Accepted Answer · 2022-05-05T04:45:13+00:00

You can use the pdftoppm tool to convert the PDF pages to images quickly. Through the pdf2image Python wrapper, its convert_from_path function lets you use multiple threads by passing thread_count=(number of threads). You can refer to this link for more info on this tool. Better-quality images can also increase the accuracy of Tesseract.
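
A minimal sketch of this suggestion, assuming the pdf2image and pytesseract packages plus the poppler and Tesseract binaries are installed. The function names build_tesseract_config and ocr_pdf, the default of 300 dpi, 4 threads, and the file name in the usage comment are illustrative assumptions, not values from the answer. The --oem and --psm flags are the same ones used in the question's code.

    def build_tesseract_config(psm=6, oem=3):
        """Build the config string handed to pytesseract (same flags as the question)."""
        return f"--oem {oem} --psm {psm}"


    def ocr_pdf(pdf_path, dpi=300, threads=4, config=None):
        """Render every page with pdftoppm (via pdf2image) and OCR each image."""
        # Imported here so the helper above works without the OCR stack installed
        from pdf2image import convert_from_path
        import pytesseract

        if config is None:
            config = build_tesseract_config()
        # thread_count lets pdf2image run several pdftoppm processes in parallel
        pages = convert_from_path(
            pdf_path, dpi=dpi, thread_count=threads, grayscale=True
        )
        # One OCR result per page, in page order
        return [pytesseract.image_to_string(page, config=config).strip() for page in pages]


    # Usage (hypothetical file name):
    # texts = ocr_pdf("scanned.pdf", dpi=300, threads=4)
    # for number, text in enumerate(texts, start=1):
    #     print(number, text)

Rendering in grayscale at a higher dpi is only one way to get "better images"; whether it fixes the misread dates depends on the scans themselves.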
