I have written some code that uses PyMuPDF to extract links from PDF files and convert them into HTML. The problem is that any text on the page that matches the text of a link is also output as a link.
For example, if the word "PDF" appears once as a link on a page, then every other occurrence of the word "PDF" on the same page is also output as a link.
How can I fix this?
Here is my code:
    # Get links
    for page in doc:
        links.clear()
        link = page.first_link
        while link:
            h = link.rect.height * 0.2
            smaller = link.rect + (0, h, 0, -h)
            linkText = page.get_textbox(smaller).strip()
            if linkText:
                links.append([link.uri, linkText])
            link = link.next

        # Match links up with text
        for block in page.get_text("dict", clip=area)["blocks"]:
            newBlock = True
            if block['type'] == 0:
                for line in block["lines"]:
                    for span in line["spans"]:
                        linkWritten = False
                        for link in links:
                            if link[1] in span['text'] or span['text'] in link[1]:
                                if span['text'] in link[1]:
                                    htmlContent += '<a href="' + link[0] + '">' + span['text'] + "</a>"
                                else:
                                    textBefore = span['text'].split(link[1])[0]
                                    textAfter = span['text'].split(link[1])[1]
                                    htmlContent += textBefore + '<a href="' + link[0] + '">' + link[1] + "</a>" + textAfter
                                linkWritten = True
Hello. According to the documentation, it should be possible to extract HTML directly from your PDF by using the "html" parameter. The documentation also has more information on extractHTML().
PS: I haven't tried it.
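If you would rather keep your own HTML generation, another option is to match spans to links by position instead of by text, so a second "PDF" elsewhere on the page is not picked up. The following is only a rough sketch in plain Python and I have not run it against a real PDF. The helper names (overlap_ratio, span_to_html), the 0.5 threshold, and the example URL and coordinates are all made up for illustration; rectangles are assumed to be (x0, y0, x1, y1) tuples.

```python
from html import escape


def overlap_ratio(span_bbox, link_rect):
    """Return the fraction of the span's area that lies inside the link rectangle."""
    sx0, sy0, sx1, sy1 = span_bbox
    lx0, ly0, lx1, ly1 = link_rect
    width = min(sx1, lx1) - max(sx0, lx0)
    height = min(sy1, ly1) - max(sy0, ly0)
    span_area = (sx1 - sx0) * (sy1 - sy0)
    if width <= 0 or height <= 0 or span_area <= 0:
        return 0.0
    return (width * height) / span_area


def span_to_html(text, span_bbox, links, threshold=0.5):
    """Wrap the span text in <a> only if it sits on top of a link rectangle.

    links is a list of (uri, rect) pairs; threshold is an assumed cut-off.
    """
    for uri, rect in links:
        if overlap_ratio(span_bbox, rect) >= threshold:
            return '<a href="' + escape(uri, quote=True) + '">' + escape(text) + "</a>"
    return escape(text)


# Hypothetical data: one link rectangle and two spans with the same text.
links = [("https://example.com/pdf", (100, 50, 130, 62))]
linked = span_to_html("PDF", (101, 51, 129, 61), links)    # inside the link rectangle
plain = span_to_html("PDF", (300, 200, 328, 210), links)   # same word, elsewhere on the page
```

With PyMuPDF, the span rectangle would come from span["bbox"] in the page.get_text("dict") output, and the link rectangle from link.rect, so you would store [link.uri, tuple(link.rect)] instead of the link text. One limitation: if a link covers only part of a span, this treats the whole span as linked or not linked; splitting it would need per-character boxes, for example from page.get_text("rawdict").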