I'm using the fitz module of PyMuPDF to extract text from PDF files, and I've noticed that the extracted text does not retain the hyperlinks that were present in the files.
I am able to extract all the hyperlinks in a PDF using PyMuPDF, but I am not able to keep each link at the position in the text where it appears.
For example, I would like text containing a hyperlink to be extracted like this: "My favorite search engine is [Google](https://google.com)."
Here is the code for extracting hyperlinks from every page:
import fitz  # PyMuPDF

# PDF file to read
filename = r"clinical_performance_study_plan.pdf"

with fitz.open(filename) as my_pdf_file:
    # loop through every page (page numbers are 1-based for display)
    for page_number in range(1, len(my_pdf_file) + 1):
        # access the individual page (pages are 0-indexed)
        page = my_pdf_file[page_number - 1]
        for link in page.links():
            # external link with a URI (e.g. http or https)
            if "uri" in link:
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            # internal link or file link with no URI
            else:
                pass
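
To make the goal concrete, here is a rough sketch of the kind of merging I am after. It is not a PyMuPDF feature, just a hypothetical helper (`words_to_markdown` is my own name). It assumes the word tuples have the shape returned by `page.get_text("words")`, i.e. `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, and that each link dict has a `"from"` rectangle that unpacks to four coordinates, as the dicts from `page.get_links()` should. A word is treated as part of a link when its centre point lies inside the link rectangle, which is a simplification.

```python
from itertools import groupby


def words_to_markdown(words, links):
    """Join words into one string, wrapping linked words as [text](uri).

    words: tuples shaped like (x0, y0, x1, y1, text, block_no, line_no, word_no)
    links: dicts with a "from" rectangle (x0, y0, x1, y1) and, for external
           links, a "uri" key (assumed shape, see the note above)
    """

    def uri_at(word):
        # Decide by the word's centre point whether it sits inside a link rectangle.
        cx = (word[0] + word[2]) / 2
        cy = (word[1] + word[3]) / 2
        for link in links:
            if "uri" not in link:
                continue  # internal or file link: nothing to embed
            x0, y0, x1, y1 = link["from"]
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                return link["uri"]
        return None

    pieces = []
    # Consecutive words that share the same URI (or no URI) form one piece.
    for uri, group in groupby(words, key=uri_at):
        text = " ".join(word[4] for word in group)
        pieces.append(f"[{text}]({uri})" if uri else text)
    return " ".join(pieces)
```

With a real page this would presumably be called as `words_to_markdown(page.get_text("words"), page.get_links())`. Known limitations of the sketch: punctuation attached to a linked word ends up inside the brackets, and the original line breaks are lost.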
Is there a way to retain the hyperlinks at their original positions in the extracted text? I couldn't find one in the docs or in the PyMuPDF classes.