I have a script that lists the annotations of a PDF file (see "Parse annotations from a pdf"):
import argparse

import popplerqt5


def extract(fn):
    doc = popplerqt5.Poppler.Document.load(fn)
    annotations = []
    for i in range(doc.numPages()):
        page = doc.page(i)
        for annot in page.annotations():
            contents = annot.contents()
            if contents:
                annotations.append(contents)
                print(f'page={i + 1} {contents}')
    print(f'{len(annotations)} annotation(s) found')
    return annotations


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('fn')
    args = parser.parse_args()
    extract(args.fn)
However, it only works for text annotations. There are many Python libraries for working with PDFs, such as Poppler, PyPDF2, and PyMuPDF. I have searched their documentation and source code extensively, and as far as I can tell, none of them can extract the binary data of sound annotations. Do you know of any library that can do this? I need to extract the binary data of these sound annotations and convert it to MP3 files.
The next version of PyMuPDF will support extracting audio annotations. You can use this script to extract audio annotations from a PDF with PyMuPDF. It is easy to use: call the script and pass the PDF file as the first argument:
python script.py myfile.pdf
Note: the script only works on Windows.
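For the conversion step, here is a minimal sketch of how extracted samples could be written to a WAV file using only the standard library. It is not the script mentioned above. The helper name `sound_to_wav` is hypothetical, and the dictionary layout (`rate`, `channels`, `bps`, `encoding`, `compression`, `stream`) is an assumption modeled on what PyMuPDF's `Annot.get_sound()` is documented to return, so check it against your installed version. It also assumes the PDF convention that multi-byte samples are stored most significant byte first, and it only handles uncompressed `Raw` and `Signed` samples of 8 or 16 bits.

```python
import wave


def sound_to_wav(sound, path):
    """Write uncompressed PDF sound samples to a WAV file.

    `sound` is assumed to be a dict with the keys "rate" and "stream" and,
    optionally, "channels", "bps", "encoding" and "compression".
    """
    encoding = sound.get("encoding", "Raw")
    bps = int(sound.get("bps", 8))
    channels = int(sound.get("channels", 1))
    if sound.get("compression"):
        raise ValueError("compressed sound data is not handled in this sketch")
    if encoding not in ("Raw", "Signed") or bps not in (8, 16):
        raise ValueError(f"unsupported sound format: {encoding}, {bps} bits")

    samples = bytearray(sound["stream"])

    if bps == 16:
        # Drop a trailing odd byte, then swap each byte pair:
        # big-endian samples (assumed PDF order) become little-endian (WAV).
        del samples[len(samples) - len(samples) % 2:]
        samples[0::2], samples[1::2] = samples[1::2], samples[0::2]

    # WAV stores 8-bit samples unsigned and 16-bit samples signed, so the
    # sign bit is flipped when the source uses the other convention.
    flip_8bit = bps == 8 and encoding == "Signed"
    flip_16bit = bps == 16 and encoding == "Raw"
    if flip_8bit or flip_16bit:
        # After the swap, the high byte of each 16-bit sample is at odd indexes.
        start, step = (0, 1) if bps == 8 else (1, 2)
        for i in range(start, len(samples), step):
            samples[i] ^= 0x80

    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bps // 8)
        wav.setframerate(int(sound["rate"]))
        wav.writeframes(bytes(samples))
```

With PyMuPDF, the intended call would be something like `sound_to_wav(annot.get_sound(), "out.wav")` for each sound annotation, assuming that method is available in your version. The standard library has no MP3 encoder, so turning the WAV file into an MP3 still needs an external encoder such as ffmpeg or LAME.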