Prevent text that matches link from being outputted as a link?

Question

23 views. Asked by Cai Samuels on 01 March 2024 at 17:54

I have written some code that extracts links from PDF files and converts them into HTML using PyMuPDF. The problem is that any text on the page that matches the text of a link is also output as a link.

For example, if the word "PDF" appears once as a link on a page, every other occurrence of "PDF" on that page is also output as a link.

How can I fix this?

Here is my code:

Get links

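# doc is assumed to be an already opened PyMuPDF document
links = []  # [uri, text] pairs for the current page
for page in doc:
    links.clear()
    link = page.first_link
    while link:
        # Shrink the link rectangle vertically so text from adjacent lines is not picked up
        h = link.rect.height * 0.2
        smaller = link.rect + (0, h, 0, -h)
        linkText = page.get_textbox(smaller).strip()
        if linkText:
            links.append([link.uri, linkText])
        link = link.next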

Match links up with text

# Runs once per page, after that page's links have been collected; area is defined elsewhere
for block in page.get_text("dict", clip=area)["blocks"]:
    newBlock = True
    if block['type'] == 0:  # text block
        for line in block["lines"]:
            for span in line["spans"]:
                linkWritten = False
                for link in links:
                    # Text-only comparison: any span that shares the link's text matches
                    if link[1] in span['text'] or span['text'] in link[1]:
                        if span['text'] in link[1]:
                            htmlContent += '<a href="' + link[0] + '">' + span['text'] + "</a>"
                        else:
                            textBefore = span['text'].split(link[1])[0]
                            textAfter = span['text'].split(link[1])[1]
                            htmlContent += textBefore + '<a href="' + link[0] + '">' + link[1] + "</a>" + textAfter
                        linkWritten = True
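
The condition `link[1] in span['text']` compares text only, so every span that contains the link's text matches, wherever it sits on the page. One possible way around this is to compare positions instead: each span in the "dict" output carries a `bbox`, and each link has a rectangle, so a span can be linked only when it lies inside that rectangle. The following is a minimal sketch under that assumption, not a tested PyMuPDF solution. The helper names (`center`, `link_for_span`, `spans_to_html`) are hypothetical, rectangles are plain `(x0, y0, x1, y1)` tuples, and the sample coordinates and URL are made up.

```python
from html import escape


def center(bbox):
    """Return the center point of an (x0, y0, x1, y1) rectangle."""
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def link_for_span(span_bbox, links):
    """Return the URI of the first link whose rectangle contains
    the center of the span, or None if no link covers it."""
    cx, cy = center(span_bbox)
    for uri, (x0, y0, x1, y1) in links:
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            return uri
    return None


def spans_to_html(spans, links):
    """Wrap a span in <a> only when it is positioned inside a link."""
    parts = []
    for span in spans:
        text = escape(span["text"])
        uri = link_for_span(span["bbox"], links)
        if uri:
            parts.append('<a href="' + escape(uri, quote=True) + '">' + text + "</a>")
        else:
            parts.append(text)
    return "".join(parts)


# Made-up sample data: one link rectangle and three spans.
# The first and last spans have the same text, but only the first
# one sits inside the link rectangle.
links = [("https://example.com/pdf", (10, 10, 40, 22))]
spans = [
    {"text": "PDF", "bbox": (10, 10, 40, 22)},
    {"text": " is a format. ", "bbox": (40, 10, 120, 22)},
    {"text": "PDF", "bbox": (10, 30, 40, 42)},
]

html_out = spans_to_html(spans, links)
print(html_out)
```

In the code above, the `(uri, rectangle)` pairs would be built from `link.uri` and `link.rect` instead of from the extracted link text, and the spans would come from `page.get_text("dict")`. If a link covers only part of a span, this span-level test is too coarse and per-character boxes would be needed.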

Original Q&A

There is 1 answer

**GuiEpi** · Answer 1 · 2024-03-01T21:30:54+00:00

Hello. According to the documentation, it should be possible to extract HTML directly from your PDF with the html option, for example page.get_text("html"). The documentation for extractHTML() has more information:

extractHTML()
Textpage content as a string in HTML format. This version contains complete formatting and positioning information. Images are included (encoded as base64 strings). You need an HTML package to interpret the output in Python.
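
To illustrate the last sentence of the quoted documentation, here is a minimal sketch that uses the standard library's `html.parser` as the "HTML package" to pull plain text out of an HTML string. The class name `TextCollector` is hypothetical, and the sample string is made up for the example; it is not actual PyMuPDF output.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content of an HTML string, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


# Made-up HTML, standing in for a string returned by an HTML extraction call.
sample = '<div><p style="top:10pt">Hello <b>PDF</b></p></div>'

collector = TextCollector()
collector.feed(sample)
extracted = "".join(collector.chunks)
print(extracted)
```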

PS: I haven't tried it.
