Retain links when extracting text using PyMuPDF

I'm using the fitz module of PyMuPDF to extract text from a PDF, and I've noticed that the extracted text does not retain the hyperlinks that were present in the file.

I am able to extract all the hyperlinks present in the PDF using PyMuPDF, but I am not able to keep each link at the position in the text where it appears.

For example, text containing a hyperlink could be extracted like this: "My favorite search engine is [Google](https://google.com)."

Here is the code for extracting hyperlinks from every page:

import fitz  # PyMuPDF

# Path to the PDF file
filename = r"clinical_performance_study_plan.pdf"

with fitz.open(filename) as my_pdf_file:

    # Loop through every page, counting page numbers from 1
    for page_number, page in enumerate(my_pdf_file, start=1):

        for link in page.links():
            # External links (such as http or https) carry a "uri" key
            if "uri" in link:
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            # Internal links and file links have no URI, so skip them
            else:
                pass

Is there a way to retain them? I couldn't find anything about this in the docs or in the PyMuPDF classes.
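
One direction that might work is to match each link's rectangle against the positions of the extracted words. The following is only a minimal sketch under assumptions, not something the PyMuPDF docs prescribe. It relies on two PyMuPDF behaviours: each link dictionary has a "from" rectangle, and page.get_text("words") returns tuples that start with (x0, y0, x1, y1, word). The helper names rects_overlap, merge_links and page_text_with_links are hypothetical, made up for this illustration. Line breaks are flattened to spaces, and punctuation attached to a linked word (for example "Google.") ends up inside the brackets.

```python
def rects_overlap(a, b):
    """Return True if two (x0, y0, x1, y1) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_links(words, links):
    """Rebuild text from word tuples, wrapping linked runs as [text](url).

    words: tuples starting with (x0, y0, x1, y1, text), in output order
    links: (rect, url) pairs, where rect is (x0, y0, x1, y1)
    """
    pieces = []               # finished chunks of output text
    run, run_url = [], None   # words collected so far and the URL they share

    # The trailing None is a sentinel that forces the last run to be flushed.
    for word in list(words) + [None]:
        url = None
        if word is not None:
            # First link whose rectangle overlaps this word, if any
            url = next((u for rect, u in links
                        if rects_overlap(word[:4], rect)), None)
        if url != run_url or word is None:
            if run:
                text = " ".join(run)
                pieces.append(f"[{text}]({run_url})" if run_url else text)
            run, run_url = [], url
        if word is not None:
            run.append(word[4])
    return " ".join(pieces)


def page_text_with_links(page):
    """Apply merge_links to one PyMuPDF page (hypothetical helper)."""
    words = page.get_text("words")
    # Only links with a "uri" key are kept, as in the loop above.
    links = [(tuple(link["from"]), link["uri"])
             for link in page.links() if "uri" in link]
    return merge_links(words, links)
```

With this in place, the loop from the question could print page_text_with_links(page) for each page instead of printing the links on their own. The word order is whatever get_text("words") returns, so multi-column layouts may need extra sorting.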

There are 0 answers