PyMuPDF detects two paragraphs with closely spaced text blocks as one block

Question

172 views Asked by CAO RUI At 19 January 2023 at 07:27

I have a problem when using fitz (PyMuPDF) to detect the layout of a PDF: two paragraphs are detected as one text block if the line spacing between them is small.

For example, I want to detect the text and the isolated formula as two text blocks, but fitz currently detects them as one. How can I handle this? Should I detect the word coordinates and sort them into normal reading order, or use some similar method?

Original Q&A

There is 1 answer

**Jorj McKie** · Answer 1 · 2023-01-19T13:00:27+00:00

PyMuPDF also has ways to adjust the granularity of text extraction: there are more levels between and beyond block extraction and word extraction.

You can extract by line, by text span (both are a higher level than word) and by character (one level below word). All of them deliver the wrapping rectangles of the respective text, plus a plethora of font properties (font size, font weight, font style, font color) and the writing direction.

Here is an example that extracts lines of text:

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")  # placeholder path, replace with your own file
page = doc[0]  # first page
details = page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)  # skips images!
for block in details["blocks"]:  # delivers the block level
    for line in block["lines"]:  # the lines in this block
        bbox = fitz.Rect(line["bbox"])  # wraps this line
        line_text = "".join(span["text"] for span in line["spans"])
        print(bbox, line_text)

Please do have a look at this picture in the documentation - it shows an overview of the dictionary layout: https://pymupdf.readthedocs.io/en/latest/_images/img-textpage.png.
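Building on the line-level extraction, one possible way to separate a paragraph from a closely spaced formula is to re-split each block wherever the vertical gap between consecutive lines is larger than usual. The following is only a sketch under assumptions: the helper names `split_lines_by_gap` and `split_page_blocks` and the `max_gap` threshold (in points) are hypothetical and not part of PyMuPDF, and the threshold has to be tuned per document. It relies only on `page.get_text("dict")` and the `blocks` / `lines` / `spans` / `bbox` keys of the dictionary it returns.

```python
def split_lines_by_gap(lines, max_gap=4.0):
    """Group line dicts; start a new group when the vertical gap to the
    previous group exceeds max_gap (hypothetical threshold, in points)."""
    groups = []
    bottom = None  # lowest y1 seen so far in the current group
    for line in sorted(lines, key=lambda ln: ln["bbox"][1]):
        y0, y1 = line["bbox"][1], line["bbox"][3]
        if bottom is None or y0 - bottom > max_gap:
            groups.append([])  # gap is large: open a new group
            bottom = y1
        else:
            bottom = max(bottom, y1)
        groups[-1].append(line)
    return groups


def split_page_blocks(page, max_gap=4.0):
    """Return a list of text chunks, one per gap-separated group of lines."""
    chunks = []
    for block in page.get_text("dict")["blocks"]:
        # image blocks carry no "lines" key, so they are skipped here
        for group in split_lines_by_gap(block.get("lines", []), max_gap):
            text = "\n".join(
                "".join(span["text"] for span in line["spans"])
                for line in group
            )
            chunks.append(text)
    return chunks
```

With a page obtained as `page = doc[0]`, calling `split_page_blocks(page, max_gap=4.0)` yields one string per group. If gaps alone do not isolate the formula, the same loop could also compare the font or size of the spans, which the dictionary output provides.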
