How can I prevent text that matches a link's text from being output as a link?


I have written some code that uses PyMuPDF to extract links from PDF files and convert them into HTML. The problem is that any text on the page that matches the text of a link is also output as a link.

For example, if the word "PDF" appears once as a link on a page, every other occurrence of the word "PDF" on the same page is also output as a link.

How can I fix this?

Here is my code:

Get links

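for page in doc:
    links.clear()
    link = page.first_link
    while link:
        # Shrink the link rectangle vertically so text from neighbouring lines is not picked up
        h = link.rect.height * 0.2
        smaller = link.rect + (0, h, 0, -h)
        linkText = page.get_textbox(smaller).strip()
        if linkText:
            # Only the URI and the text are stored; the link's position is discarded here
            links.append([link.uri, linkText])
        link = link.next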

Match links up with text

    # Still inside the "for page in doc:" loop
    for block in page.get_text("dict", clip=area)["blocks"]:
        newBlock = True
        if block['type'] == 0:  # type 0 is a text block
            for line in block["lines"]:
                for span in line["spans"]:
                    linkWritten = False
                    for link in links:
                        # Links are matched by text only, so every span with the same text matches
                        if link[1] in span['text'] or span['text'] in link[1]:
                            if span['text'] in link[1]:
                                htmlContent += '<a href="' + link[0] + '">' + span['text'] + "</a>"
                            else:
                                textBefore = span['text'].split(link[1])[0]
                                textAfter = span['text'].split(link[1])[1]
                                htmlContent += textBefore + '<a href="' + link[0] + '">' + link[1] + "</a>" + textAfter
                            linkWritten = True

There is 1 answer

GuiEpi

Hello. According to the documentation, it should be possible to extract HTML directly from your PDF by passing "html" as the output format, as in page.get_text("html"). The documentation for extractHTML() has more information:

extractHTML()
Textpage content as a string in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output in Python.

P.S. I haven't tried it.
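
Another option, which keeps the approach from the question, is to match links to spans by position rather than by text. The question's code compares only the link text with the span text, so every span containing the same word matches. The sketch below is an untested-against-real-PDFs illustration, not a definitive fix: it links a span only when most of its bounding box lies inside a link rectangle. It assumes spans shaped like those from page.get_text("dict") (keys "text" and "bbox") and links shaped like those from page.get_links() (keys "uri" and "from"), with rectangles given as plain (x0, y0, x1, y1) tuples, so a PyMuPDF Rect would need converting with tuple(link["from"]). The helper names overlap_fraction and spans_to_html and the min_overlap threshold are made up for this example.

```python
import html


def overlap_fraction(span_box, link_box):
    """Return the fraction of span_box's area that lies inside link_box."""
    sx0, sy0, sx1, sy1 = span_box
    lx0, ly0, lx1, ly1 = link_box
    width = min(sx1, lx1) - max(sx0, lx0)
    height = min(sy1, ly1) - max(sy0, ly0)
    span_area = (sx1 - sx0) * (sy1 - sy0)
    if width <= 0 or height <= 0 or span_area <= 0:
        return 0.0
    return (width * height) / span_area


def spans_to_html(spans, links, min_overlap=0.5):
    """Build HTML from spans, linking only spans that sit on a link rectangle.

    min_overlap is an assumed threshold, not a PyMuPDF setting.
    """
    parts = []
    for span in spans:
        text = html.escape(span["text"])
        uri = None
        for link in links:
            if overlap_fraction(span["bbox"], link["from"]) >= min_overlap:
                uri = link["uri"]
                break
        if uri:
            parts.append('<a href="' + html.escape(uri, quote=True) + '">' + text + "</a>")
        else:
            parts.append(text)
    return "".join(parts)


# Hand-written sample data: "PDF" appears twice, but only the first one is a link.
spans = [
    {"text": "PDF", "bbox": (10, 10, 40, 22)},
    {"text": " is a format. ", "bbox": (40, 10, 120, 22)},
    {"text": "PDF", "bbox": (10, 30, 40, 42)},
]
links = [{"uri": "https://example.com/pdf", "from": (9, 9, 41, 23)}]
print(spans_to_html(spans, links))
```

One limitation of this sketch: if a link covers only part of a long span, the span as a whole falls below the threshold and is not linked. Handling that case would probably need per-character boxes, for example from page.get_text("rawdict").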