A Django 4.2 on Python 3.10 application misbehaves sometimes, in a place that should be stateless. If I access the database (postgresql 14 via pcygopg 3.1.18) while working with pypdf 3.17.4, the output document is broken.. in roughly 1 out of 8 attempts. How can I determine the cause of the erratic behavior?
The resulting PDF has missing content, so comparing output size is sufficient to determine the bug was triggered:
# call as: python3 manage.py minrepro input.pdf
import argparse, io
from django.core.management.base import BaseCommand
from pypdf import PdfReader, PdfWriter
from djangoapp.models import DjangoModel
def main(fin):
pdfout = PdfWriter()
pageout = pdfout.add_blank_page(width=200, height=200)
for i in range(8):
# Note: accessing the database *during* PDF merging is relevant!
# without the next line, the problem cannot be reproduced
for c in range(31): a = DjangoModel.objects.first()
fin.seek(0)
for pagein in PdfReader(fin, strict=True).pages:
pageout.merge_page(pagein)
with io.BytesIO() as fout:
pdfout.write(fout)
return fout.tell()
class Command(BaseCommand):
def add_arguments(self, parser):
parser.add_argument(dest="pdf", type=argparse.FileType("rb") )
def handle(self, *args, **options):
for i in range(30):
if i == 0: first_size = main(options["pdf"])
current_size = main(options["pdf"])
if not first_size == current_size:
print(f"presumed stateless call was not {i=}, {first_size=} != {current_size=}")
- My fist theory was pypdf failing to generate truly unique names for (post-merge) duplicate identifiers, and the DB access merely giving the opportunity (delay) for that to become noticeable. I looked for any
timeorrandomimports in the relevant libraries, no obvious bugs there. Also, nothing liketime.sleep(0.02)could not make my reproducer work, only database access. - This is CPython with insert-ordered dicts, most code should behave the same on identical input. Also called python with
-W module::ResourceWarning -W module::DeprecationWarning -W module::PendingDeprecationWarningto be more likely notified of any mistakes on my part, nothing. - Have reproduced the behavior in a single-threaded management command, it is therefore not related to the any http/web server issue.
Root cause identified: Dropping and recreating PdfReader in the loop objects has a low probability for
pypdf.PdfWriter.mergeto recycle into already used identifier mappings.The trigger appears to be garbage collection in my loop, with psycopg3 indeed just providing the opportunity for that to happen, not specifically causing it.
The relevant documentation for resetting the association between source and destination PDF object has lead me to the previously missed detail that pypdf assumes, but never checks that PdfReader objects used during merging will never recycle python
id().That has happened in my code, as easily visible by printing id(page.pdf).
Keeping a reference to the PdfReader object until the last input is merged (or simply until the output is written) also stops my reproducer from working, like this: