Retain links when extracting text using PyMuPDF

Asked by Tejas Borse on 20 November 2023 at 05:02 (89 views)

I'm using the fitz module of PyMuPDF to extract text from a PDF, and I've noticed that the extracted text does not retain the hyperlinks that were present in the file.

I am able to extract all the hyperlinks in the PDF using PyMuPDF, but I cannot keep each link at the position in the text where it appears.

For example, I would like text containing a hyperlink to be extracted like this: "My favorite search engine is [Google](https://google.com)."

Here is the code for extracting hyperlinks from every page:

import fitz  # PyMuPDF

# Path to the PDF file
filename = r"clinical_performance_study_plan.pdf"

with fitz.open(filename) as my_pdf_file:

    # Loop through every page, numbering pages from 1 for display
    for page_number, page in enumerate(my_pdf_file, start=1):

        for link in page.links():
            # External links (http or https) carry a "uri" key
            if "uri" in link:
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            # Internal links and file links have no "uri" key; skip them
            else:
                pass

Is there a way to retain them? I couldn't find it in the docs or the PyMuPDF class.
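
One possible approach, offered as a minimal sketch rather than a built-in PyMuPDF feature: each link dictionary also carries a "from" rectangle, and page.get_text("words") returns each word with its bounding box, so the two can be matched by position. The helper names words_to_markdown and pdf_to_markdown are hypothetical, and the rule "a word belongs to a link if its centre lies inside the link rectangle" is an assumption that may need tuning for a given PDF.

```python
from itertools import groupby


def words_to_markdown(words, links):
    """Join words into text, wrapping linked words as [text](uri).

    words: tuples shaped like page.get_text("words") items:
           (x0, y0, x1, y1, text, block_no, line_no, word_no)
    links: dicts shaped like page.get_links() items; only entries with
           a "uri" key are used, and "from" is the link rectangle.
    """
    uri_links = [(tuple(l["from"]), l["uri"]) for l in links if "uri" in l]

    def uri_for(word):
        # Assumption: a word belongs to a link if the centre of its
        # bounding box lies inside the link rectangle.
        cx = (word[0] + word[2]) / 2
        cy = (word[1] + word[3]) / 2
        for (x0, y0, x1, y1), uri in uri_links:
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                return uri
        return None

    lines = []
    # Group words by (block_no, line_no) to rebuild the text lines.
    for _, line_words in groupby(words, key=lambda w: (w[5], w[6])):
        parts = []
        # Within a line, merge consecutive words that share the same link.
        for uri, run in groupby(line_words, key=uri_for):
            text = " ".join(w[4] for w in run)
            parts.append(f"[{text}]({uri})" if uri else text)
        lines.append(" ".join(parts))
    return "\n".join(lines)


def pdf_to_markdown(filename):
    """Hypothetical wrapper: apply words_to_markdown to every page."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(filename) as doc:
        return "\n\n".join(
            words_to_markdown(page.get_text("words"), page.get_links())
            for page in doc
        )
```

Calling pdf_to_markdown(filename) on the file above would return the text of all pages with external links inlined. Punctuation attached to a linked word (for example "Google.") ends up inside the brackets, and a link that wraps across two lines is emitted as two separate links, so some post-processing may still be needed.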

There are 0 answers