PyMuPDF detects two paragraphs as one text block when their coordinates are close


I have a problem when using fitz to detect the layout of a PDF: two paragraphs are detected as one text block if the line spacing between them is small.

For example, I want to detect the text and the isolated formula as two text blocks, but fitz currently detects them as one. How can I handle this? Should I get the word coordinates and sort them into normal reading order, or use some similar method?


There is 1 answer

Jorj McKie

PyMuPDF also has ways to adjust the granularity of text extraction: there are more levels between and beyond block extraction and word extraction.

You can extract by line, by text span (both are at a higher level than words) and by character (the level below words). All of them deliver the wrapping rectangle of the respective text, plus a plethora of font properties (font size, font weight, font style, font color) and the writing direction.
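As a rough illustration of the span level, here is a minimal sketch that walks the dictionary returned by page.get_text("dict") and collects each span's text, rectangle, font size and font name. The helper name iter_spans and the hand-built sample_page are assumptions made for this example only; with a real document you would pass in the result of page.get_text("dict") instead.

```python
def iter_spans(page_dict):
    """Yield (text, bbox, font size, font name) for every span in a "dict" result."""
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            for span in line["spans"]:
                yield span["text"], tuple(span["bbox"]), span["size"], span["font"]


# Hand-built stand-in for page.get_text("dict"); the values are illustrative only.
sample_page = {
    "blocks": [
        {
            "type": 0,  # text block
            "lines": [
                {
                    "bbox": (72, 100, 200, 112),
                    "spans": [
                        {"text": "Hello ", "bbox": (72, 100, 110, 112),
                         "size": 11.0, "font": "Helvetica"},
                        {"text": "world", "bbox": (110, 100, 200, 112),
                         "size": 11.0, "font": "Helvetica-Bold"},
                    ],
                }
            ],
        },
        {"type": 1},  # image block without text lines
    ]
}

spans = list(iter_spans(sample_page))
```

A change of font name or size from one span to the next can be a useful hint that a formula or heading starts there.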

Here is an example that extracts lines of text:

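import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder file name, replace with your PDF
page = doc[0]  # first page of the document
details = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)  # skips images!
for block in details["blocks"]:  # delivers the block level
    for line in block["lines"]:  # the lines in this block
        bbox = fitz.Rect(line["bbox"])  # wraps this line
        line_text = "".join(span["text"] for span in line["spans"])
        print(bbox, line_text)  # rectangle and text of this line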

Please do have a look at this picture in the documentation - it shows an overview of the dictionary layout: https://pymupdf.readthedocs.io/en/latest/_images/img-textpage.png.
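Building on the line-level extraction, one possible way to separate a paragraph from an isolated formula is to split a block wherever the vertical gap between two consecutive lines is noticeably larger than usual. The following is only a sketch under assumptions: the function name split_block_by_gap and the gap_ratio threshold are invented for this example, the text is assumed to be horizontal and ordered top to bottom, and the threshold will need tuning for your PDF. The sample block is hand-built so the snippet runs without a PDF.

```python
def split_block_by_gap(block, gap_ratio=0.5):
    """Split one text block into groups of lines.

    A new group starts when the gap between the bottom of the previous line
    and the top of the current line exceeds gap_ratio * previous line height.
    """
    groups = []
    prev_bbox = None
    for line in block.get("lines", []):
        x0, y0, x1, y1 = line["bbox"]
        text = "".join(span["text"] for span in line["spans"])
        if prev_bbox is None:
            start_new = True  # the first line always opens a group
        else:
            gap = y0 - prev_bbox[3]
            prev_height = prev_bbox[3] - prev_bbox[1]
            start_new = gap > gap_ratio * prev_height
        if start_new:
            groups.append([])
        groups[-1].append((text, (x0, y0, x1, y1)))
        prev_bbox = (x0, y0, x1, y1)
    return groups


# Hand-built stand-in for one entry of page.get_text("dict")["blocks"].
sample_block = {
    "lines": [
        {"bbox": (72, 100, 300, 112), "spans": [{"text": "First line of text"}]},
        {"bbox": (72, 114, 300, 126), "spans": [{"text": "second line of text"}]},
        {"bbox": (120, 134, 250, 150), "spans": [{"text": "E = mc^2"}]},
    ]
}

groups = split_block_by_gap(sample_block)
```

With a real page you would call split_block_by_gap(block) for each block in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]. If the formula sits at exactly the normal line spacing, a gap test alone will not separate it, and comparing horizontal indentation or span fonts may work better.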