Programmatically tag an untagged PDF with iTextSharp


I want to parse a PDF file with iTextSharp. For tagged PDFs I used TaggedPdfReaderTool and its ConvertToXml() method, which gave me acceptable XML for further parsing. Now I also have to parse untagged PDFs. In Adobe Acrobat you can add tags to a PDF using its accessibility tool, and after that I'm able to parse the file with iTextSharp. I'm looking for a free way to add tags to a PDF programmatically (with iTextSharp). In the book "iText in Action" I've read that you can create a tagged PDF (from an XML file), but I need to convert an existing one!

My code for parsing tagged PDFs:

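        // Namespaces: System.IO, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
        var path = @"C:\Users\xxx\Desktop\xxx.pdf";

        PdfReader reader = new PdfReader(path);
        // "using" releases the output file even if the conversion throws.
        using (var fs = new FileStream(@"C:\Users\xxx\Desktop\xxx_tagged.xml", FileMode.Create))
        {
            TaggedPdfReaderTool tool = new TaggedPdfReaderTool();
            tool.ConvertToXml(reader, fs);
        }
        reader.Close();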

There is 1 answer

Joris Schellekens

This is not an easy problem, since you are essentially asking for a solution that does structure recognition.

Think about it. You want to know where paragraphs begin and end, and you'd also need a way to detect tables and lists, not to mention nested tables and lists and combinations thereof.

This is an active topic of research. One popular approach is to use neural networks (treating the PDF as an image and tackling the problem as an image recognition task); the alternative is to tackle it in a rule-based fashion.
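To give a feel for what "rule-based" could mean, here is a minimal sketch of a single rule: start a new paragraph whenever the vertical gap between two consecutive lines is clearly larger than normal line spacing. None of this is iTextSharp API. The `TextLine` type, the `ParagraphHeuristic` class and the `gapFactor` threshold are hypothetical names made up for illustration, and the sketch assumes you have already extracted text lines with their positions and font sizes by some other means. A real solution would need many more rules (indentation, font changes, columns, tables, lists).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input type: one extracted line of text with its position.
// Assumption: Top is measured downward from the top of the page, in points.
public sealed class TextLine
{
    public string Text;
    public float Top;
    public float FontSize; // in points
}

public static class ParagraphHeuristic
{
    // Groups lines into paragraphs. A new paragraph starts when the distance
    // to the previous line exceeds gapFactor times that line's font size.
    public static List<List<TextLine>> Group(IEnumerable<TextLine> lines, float gapFactor = 1.5f)
    {
        var paragraphs = new List<List<TextLine>>();
        TextLine previous = null;

        // Process lines from the top of the page to the bottom.
        foreach (var line in lines.OrderBy(l => l.Top))
        {
            bool startsNew = previous == null
                || line.Top - previous.Top > gapFactor * previous.FontSize;

            if (startsNew)
                paragraphs.Add(new List<TextLine>());

            paragraphs[paragraphs.Count - 1].Add(line);
            previous = line;
        }
        return paragraphs;
    }
}
```

With 12 pt text set on 14 pt leading, consecutive lines sit 14 points apart, which stays under the 18 point threshold (1.5 × 12), while a blank line between paragraphs roughly doubles the gap and triggers a split. Each resulting group is a candidate for a paragraph tag. Detecting tables and lists needs further rules of the same kind, which is where the approach becomes hard.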