Remove the garbage words from the pdf

Question

200 views. Asked by Muhammad Samadzade on 30 August 2023 at 10:22

I am extracting text from a PDF using Python and libraries such as fitz (PyMuPDF) and pdfreader. However, my PDF contains some schematics, and they carry words that I do not need.

Here is an example.

When I extract the text, the words inside the schematics are included as well, but I do not want those words to appear, because even if the image itself can be extracted, the text inside it is not meaningful.

I could not come up with a strategy to delete these useless words from the PDF.

import fitz  # PyMuPDF

class DeleteGarbage(object):
    def __init__(self, max_table_area=1.5):
        self.max_table_area = max_table_area

    def process(self, context):
        '''Redact the text around non-table drawings (lines and rectangles) using fitz.'''
        for page_number, page in enumerate(context["fitz"]):
            if page_number != 2:
                continue
            area_of_page = page.rect.width * page.rect.height
            paths = page.get_drawings()  # extract existing drawings
            
            for path in paths:
                for item in path["items"]:
                    if item[0] == "l":  # line: item[1] and item[2] are its end points
                        # normalise so that (x0, y0) is the top-left corner whatever the drawing direction
                        x0, x1 = sorted((item[1][0], item[2][0]))
                        y0, y1 = sorted((item[1][1], item[2][1]))
                        rect = [x0, y0, x1, y1]
                        if self.check_if_not_table(rect, page_number, context['content']['pages'][page_number - 1]['tables']):
                            # pad by 10 pt on every side and mark the text there for redaction
                            rect = [x0 - 10, y0 - 10, x1 + 10, y1 + 10]
                            white = (1, 1, 1)
                            page.add_redact_annot(rect, "", align=fitz.TEXT_ALIGN_CENTER, fill=white, text_color=white)
                    elif item[0] == "re":  # rectangle: item[1] is a fitz.Rect
                        rect = item[1]
                        if rect.get_area() < area_of_page / self.max_table_area and self.check_if_not_table(rect, page_number, context['content']['pages'][page_number - 1]['tables']):
                            white = (1, 1, 1)
                            page.add_redact_annot(
                                [rect[0] - 10, rect[1] - 10, rect[2] + 10, rect[3] + 10],
                                "",
                                align=fitz.TEXT_ALIGN_CENTER,
                                fill=white,
                                text_color=white
                            )

            page.apply_redactions()
        return context

    def check_if_not_table(self, rect, page_number, tables):
        # False if rect lies inside a known table box (with a 10 pt tolerance)
        for table_coordination in tables['coordination']:
            if table_coordination[0] - 10 < rect[0] and table_coordination[1] - 10 < rect[1] and table_coordination[2] + 10 > rect[2] and table_coordination[3] + 10 > rect[3]:
                return False
        return True

1 Answer

**K J** · Answer 1 · 2023-08-30T23:59:18+00:00

Your strategy is reasonable, but the problem with many documents like this is that the contents are often written in no particular order. Here, for example, the extracted heading area is actually the last content written in the body text.

One way would be to draw redaction areas to remove the unwanted upper section of searchable graphics, but that is often more work than selecting the desired section, so let's concentrate on the tabular layout. It could just as easily be two columns, etc.

What we need is a profile for the page extraction. In this case, for page 3, we want the area as defined here.

So we can build a list of the desired areas for each page and then run them all as one script to output everything in good order.
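Since the question uses fitz, the per-page profile idea can be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: `PAGE_PROFILES`, `inside`, `keep_words` and `extract_with_profiles` are hypothetical names, the region values are illustrative rather than measured from the real file, and coordinates are taken to be PDF points with the origin at the top-left of the page. It relies on PyMuPDF's `page.get_text("words")`, which returns tuples starting with `(x0, y0, x1, y1, word, ...)`.

```python
# Hypothetical page profiles: 1-based page number -> regions to keep,
# each region given as (x0, y0, x1, y1). Values are illustrative only.
PAGE_PROFILES = {
    3: [(0, 270, 600, 520)],
}


def inside(word_box, region):
    """Return True if the centre of word_box lies inside region."""
    x0, y0, x1, y1 = word_box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    rx0, ry0, rx1, ry1 = region
    return rx0 <= cx <= rx1 and ry0 <= cy <= ry1


def keep_words(words, regions):
    """Keep the text of word tuples (x0, y0, x1, y1, text, ...) that fall
    inside any region. Regions are processed in the order given, so the
    profile also decides the output order."""
    kept = []
    for region in regions:
        kept.extend(w[4] for w in words if inside(w[:4], region))
    return kept


def extract_with_profiles(pdf_path, profiles=PAGE_PROFILES):
    """Return {page_number: text} for the profiled pages (needs PyMuPDF)."""
    import fitz  # imported lazily so the helpers above work without it

    result = {}
    with fitz.open(pdf_path) as doc:
        for page_number, regions in sorted(profiles.items()):
            words = doc[page_number - 1].get_text("words")
            result[page_number] = " ".join(keep_words(words, regions))
    return result
```

Words belonging to the schematic fall outside every region of the profile, so they are dropped without any redaction step.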

For an example of 2 columns per page see https://stackoverflow.com/a/77008749/10802527 where with a few adjustments that page profile could be used on page 1 (shown below) using

for left -x 0 -y 110 -W 300 -H 700
& right -x 300 -y 110 -W 300 -H 400

Since it's smaller, only the right half is seen here on the console, but you will be redirecting the output to a text file or similar.
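For anyone staying in Python, the same two crop boxes can be expressed as rectangles and passed to PyMuPDF as a clip. This is a rough sketch, assuming the `-x -y -W -H` values are PDF points measured from the top-left corner of the page; `crop_to_rect` and `column_text` are hypothetical helper names, while `page.get_text("text", clip=...)` is the PyMuPDF call that restricts extraction to a rectangle.

```python
def crop_to_rect(x, y, w, h):
    """Convert an x/y/width/height crop box to (x0, y0, x1, y1)."""
    return (x, y, x + w, y + h)


# The two columns quoted above for page 1.
LEFT = crop_to_rect(0, 110, 300, 700)
RIGHT = crop_to_rect(300, 110, 300, 400)


def column_text(page, rect):
    """Return only the text inside rect; page is a PyMuPDF page object."""
    return page.get_text("text", clip=rect)
```

With a document opened through `fitz.open`, calling `column_text(doc[0], LEFT)` and then `column_text(doc[0], RIGHT)` would give the left column followed by the right one.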

If you take a batch of such requirements and wrap the command in a modular function, you could simply write the following (consider adding ranges of similar pages rather than single pages):

pdfEXfunc file.pdf 2col 1 110 700 400 // for split page 1
pdfEXfunc file.pdf 2col 2 100 200 200 // for page 2 TOC
pdfEXfunc file.pdf 1col 2 300 200     // for page 2 REVisions
pdfEXfunc file.pdf 1col 3 270 250     // for full width page 3
pdfEXfunc file.pdf etc etc.
