I need to convert a text PDF to Markdown while maintaining its structure (i.e. numbered headers and subheaders should get the corresponding number of `#` characters in Markdown, so that the same structure tree is kept).
I have explored pdfminer.six on my own, but I am basically just extracting text, and I don't see any functionality for mapping the structure tree to Markdown. Or am I wrong?
What matters to me is converting the document to text while retaining the hierarchy of the structure tree. Whether that takes one step or two makes no difference to me.
Any recommendations for Python libraries or best practices that have proven effective in similar scenarios? I am looking for a solution that can scale to hundreds of documents, so ideally nothing hardcoded, even though the documents will share most of their structure and numbering.
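As a rough illustration of the mapping described above (the second step of a two-step approach, not a full solution): the sketch below assumes the text has already been extracted from the PDF and that every header sits on its own line and starts with a dotted number such as `1`, `2.3` or `2.3.1.`. The function name and the regular expression are illustrative, not taken from any library.

```python
import re

# Matches lines such as "1 Introduction", "2.3 Methods" or "2.3.1. Details".
# Assumption: headers start with a dotted number followed by whitespace.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def numbered_headings_to_markdown(text: str, max_depth: int = 6) -> str:
    """Prefix numbered header lines with '#' according to their depth."""
    out = []
    for line in text.splitlines():
        match = HEADING_RE.match(line.strip())
        if match:
            number, title = match.groups()
            # "2.3.1" has two dots, so it becomes a level-3 header.
            depth = min(number.count(".") + 1, max_depth)
            out.append(f"{'#' * depth} {number} {title}")
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "1 Introduction\nSome text.\n1.1 Scope\n2.3.1. Details"
    print(numbered_headings_to_markdown(sample))
```

Note that this heuristic also matches body lines that happen to start with a number (for example "3 samples were taken"), so in practice it would probably need an extra signal such as font size or line length.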
Maybe try llama_parse with result_type="markdown"; this worked for me.
code:
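A minimal sketch of what such a call typically looks like, not necessarily the exact snippet used here. It assumes the `llama-parse` package is installed and that a LlamaCloud API key is available in the `LLAMA_CLOUD_API_KEY` environment variable; the helper name `pdf_to_markdown` and the file name are hypothetical.

```python
def pdf_to_markdown(pdf_path: str) -> str:
    """Parse a PDF with LlamaParse and return the result as one Markdown string."""
    # Imported inside the function so this module still loads when
    # llama-parse is not installed.
    from llama_parse import LlamaParse

    # result_type="markdown" asks the service for Markdown instead of plain text.
    # Assumption: the API key is picked up from LLAMA_CLOUD_API_KEY.
    parser = LlamaParse(result_type="markdown")

    # load_data returns a list of document objects; each has a .text attribute.
    documents = parser.load_data(pdf_path)
    return "\n\n".join(doc.text for doc in documents)


if __name__ == "__main__":
    # "report.pdf" is a placeholder file name.
    print(pdf_to_markdown("report.pdf"))
```

Since this sends each file to a hosted service, it is worth checking on a few representative documents whether the header levels come out as expected before running it over hundreds of them.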