How to make Python Fitz detect a hyphen when using search_for?


I'm new to the Fitz library and am working on a project where I need to find a string in a PDF page. The text I'm searching for is hyphenated across a line break on the page. I am aware of the TEXT_DEHYPHENATE flag that I can pass to the search_for function, but it doesn't work for me (as shown in the image here https://postimg.cc/zHZPdd6v ). I get no matches when I search for the hyphenated string.

Python Script

import fitz  # PyMuPDF

LOC = "./test.pdf"

doc = fitz.open(LOC)
page = doc[1]  # second page (0-based index)
print(page.get_text())

# try the search string without hyphen, with hyphen, and with a space
found = page.search_for("lowcost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low-cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))
found = page.search_for("low cost", flags=fitz.TEXT_DEHYPHENATE)
print("DONE")
print(len(found))

# rectangles of the last search only
for rect in found:
    print(rect)

Output

Abstract 
The objective of “XXXXXXXXXXXXXXXXXX” was design and assemble a low-
cost and efficient tool.  
 
DONE
0
DONE
0
DONE
0

Can someone please point me to how I might be able to detect the hyphen in my file? Thank you!

Jorj McKie

Your first approach should work. Look here:

# insert some hyphenated text
page.insert_textbox((100,100,300,300),"The objective of 'xxx' was design and assemble a low-\ncost and efficient tool.")
157.94699853658676

# now search for it again
page.search_for("lowcost")  # 2 rectangles!
[Rect(159.3009796142578, 116.24800109863281, 175.8009796142578, 131.36199951171875),
 Rect(100.0, 132.49501037597656, 120.17399597167969, 147.6090087890625)]

# each containing a text portion with hyphen removed
for rect in page.search_for("lowcost"):
    print(page.get_textbox(rect))

    
low
cost

Without the original file there is no way to tell why your search fails. Are you sure there really is text, and not e.g. an image or some other hiccup?
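One quick way to check the "text or image" question is to look at what the page actually contains. The helper below is only a sketch: the name describe_page_content is made up for illustration, and it assumes the page object behaves like a PyMuPDF page, using nothing but get_text("words") and get_images().

```python
def describe_page_content(page):
    """Roughly classify a page as 'text', 'images only' or 'empty'.

    Hypothetical helper: `page` is assumed to be a PyMuPDF page.
    """
    words = page.get_text("words")  # one entry per extracted word
    images = page.get_images()      # images referenced by the page
    if words:
        return "text"
    if images:
        # no text layer at all: likely a scan, so search_for finds nothing
        return "images only"
    return "empty"

# Possible usage with PyMuPDF (file name taken from the question):
#   import fitz
#   doc = fitz.open("./test.pdf")
#   print(describe_page_content(doc[1]))
```

If this reports "images only", the page would probably need OCR before any text search can succeed. In the question, get_text() did print text, so a result of "text" is expected there and the cause would have to be something else.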

Edited: As per the comment of user @KJ below: PyMuPDF's C base library MuPDF regards all of the Unicode characters '-', 0xAD, 0x2010 and 0x2011 as hyphens in this context. They should all work the same. I just reconfirmed this in an example.
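To make the idea concrete, here is a plain-Python sketch of dehyphenation over already extracted text, treating the four characters listed above as hyphens. It is an illustration only and not MuPDF's actual implementation; the names HYPHENS and dehyphenate are invented for this example.

```python
# The characters named above: '-', soft hyphen, hyphen, non-breaking hyphen
HYPHENS = "-\u00ad\u2010\u2011"

def dehyphenate(text):
    """Join words that were split by a hyphen at the end of a line."""
    parts = []
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped and stripped[-1] in HYPHENS:
            # drop the hyphen and glue the next line on directly
            parts.append(stripped[:-1])
        else:
            # ordinary line end: separate from the next line by a space
            parts.append(stripped + " ")
    return "".join(parts).strip()

sample = "was design and assemble a low-\ncost and efficient tool."
print("lowcost" in dehyphenate(sample))
```

This also shows why "lowcost" is the search string that matches when dehyphenation is active: the line-end hyphen is removed and the two halves are joined, so "low-cost" and "low cost" no longer appear in the joined text.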