How can I extract dynamically loaded items from a PDF file?


I need to get a list of all the items in several controls on a PDF file. There is a dropdown/combo box that is dynamically populated based on which affiliated radio button is selected. Then, when you select one of the items from the dropdown/combo box, three controls below it are populated. I want to extract all this data (otherwise I have to copy and paste it all - blech!)

Every attempt to extract the data has failed. Some of the things I have tried:

Opening the file in Notepad++. It gives me some interesting "stuff", such as:

%PDF-1.7
%âãÏÓ
34 0 obj
<</Linearized 1/L 234042/O 39/E 3596/N 1/T 233689/H [ 461 175]>>
endobj

42 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<FB56CF3E25DF09408A0A82199D930FFC><0C6A1B8FEE941E4A8BB87F1D46F07BDE>]/Index[34 17]/Info 33 0 R/Length 58/Prev 233690/Root 35 0 R/Size 51/Type/XRef/W[1 2 1]>>stream
hÞbbd``b`ÊŒóAÄ=7
H0 ‚i?øz…‰‘aHŒ7ñŸqé/€  9 ò
endstream
endobj
startxref
0
%%EOF

50 0 obj
<</C 94/Filter/FlateDecode/I 116/Length 85/S 38/V 71>>stream
hÞb```c``Êa ®¨€ˆY8ÅØ ˜á8—ëI;© ¨bi' ÍÃÀ\Øæ3ƒ4ò20÷\€H3êÀM``žëµá@€ 8Å

endstream
endobj
35 0 obj

...but not what I need.
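
The unreadable blocks between `stream` and `endstream` above are compressed (`/Filter/FlateDecode`), which is why a text editor shows nothing useful. Below is a rough sketch for inflating them so the contents can be searched. It uses only the .NET base class library. The names `PdfStreamDump` and `InflateStreams` are made up for illustration, and it assumes every stream of interest is a plain zlib stream that a naive scan for the `stream`/`endstream` keywords can find. It ignores `/Length`, encryption, and other filters, so treat it as a diagnostic aid rather than a PDF parser.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class PdfStreamDump
{
    // Latin-1 maps each byte to exactly one char, so string offsets equal byte offsets.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the inflated text of every stream that decompresses cleanly.
    public static List<string> InflateStreams(byte[] pdf)
    {
        var results = new List<string>();
        string raw = Latin1.GetString(pdf);
        int pos = 0;
        while ((pos = raw.IndexOf("stream", pos, StringComparison.Ordinal)) >= 0)
        {
            int start = pos + "stream".Length;
            // The keyword is followed by CRLF or LF before the data begins.
            if (start < raw.Length && raw[start] == '\r') start++;
            if (start < raw.Length && raw[start] == '\n') start++;

            int end = raw.IndexOf("endstream", start, StringComparison.Ordinal);
            if (end < 0) break;

            string text = TryInflate(pdf, start, end - start);
            if (text != null) results.Add(text);

            // Jump past "endstream" so its embedded "stream" is not matched again.
            pos = end + "endstream".Length;
        }
        return results;
    }

    private static string TryInflate(byte[] data, int offset, int count)
    {
        if (count < 3) return null;
        int cmf = data[offset], flg = data[offset + 1];
        // A zlib header has method 8 (deflate) and a checksum divisible by 31.
        if ((cmf & 0x0F) != 8 || ((cmf << 8) | flg) % 31 != 0) return null;
        try
        {
            // DeflateStream expects raw deflate data, so skip the 2-byte zlib header.
            using (var input = new MemoryStream(data, offset + 2, count - 2))
            using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
            using (var output = new MemoryStream())
            {
                deflate.CopyTo(output);
                return Latin1.GetString(output.ToArray());
            }
        }
        catch (InvalidDataException)
        {
            return null; // not valid deflate data; skip this stream
        }
    }
}
```

Calling `PdfStreamDump.InflateStreams(File.ReadAllBytes(filename))` and searching the returned strings for recognizable text (for example XML tags or list values) should show whether the data is stored in the file at all. Cross-reference streams such as object 42 above will still inflate to binary.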

I tried several utilities from the Internet that supposedly extract text from a PDF (a couple of online tools and a couple of downloads), but in every case all I get, if anything, is this:

Please wait... 

If this message is not eventually replaced by the proper contents of the document, your PDF 
viewer may not be able to display this type of document. 

You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by 
visiting  http://www.adobe.com/products/acrobat/readstep2.html. 

For more assistance with Adobe Reader visit  http://www.adobe.com/support/products/acrreader.html. 

Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark 
of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other 
countries.

So, when all else fails, read the error message. It says to "upgrade to the latest version of Adobe Reader" and gives a link. I already have the latest version from there, though; I downloaded and installed it a few weeks ago. When I select Help > Check for Updates... with the PDF file in question open, I get:

No updates available

Installed: Adobe Acrobat XI Pro (11.0.11)

I found some code on the Internet using iTextSharp; I copied that and created a quick-and-dirty util which has this code:

using System;
using System.IO;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Inside the form class:
private void buttonExtractTextFromPDF_Click(object sender, EventArgs e)
{
    string filename = @"C:\Misc\Direct_Payment_Orig.pdf";
    if (!File.Exists(filename))
    {
        return;
    }

    PdfReader pdfReader = null;
    try
    {
        StringBuilder text = new StringBuilder();
        pdfReader = new PdfReader(filename);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // A fresh strategy per page keeps text from accumulating across pages.
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            text.AppendLine();
            text.AppendLine("Page Number: " + page);
            // GetTextFromPage already returns a .NET string, so no re-encoding is needed.
            text.Append(currentText);
        }
        pdfTextBox.Text += text.ToString();
    }
    catch (Exception ex)
    {
        MessageBox.Show("Error: " + ex.Message, "Error");
    }
    finally
    {
        // Close the reader once, after all pages are read (not inside the loop).
        if (pdfReader != null)
        {
            pdfReader.Close();
        }
    }
}

...but it simply gives me that same lame message ("Please wait... If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document....") with no error (the catch block is not reached), just a seemingly bogus message. Seemingly, I say, because I can indeed see the file just fine with my bare peepers.

What is preventing it from being "seen" programmatically? Is it a licensing issue? Could this be the root of my pain?
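
One possibility worth ruling out: the "Please wait..." page is the placeholder that dynamic XFA forms (typically built with Adobe LiveCycle Designer) show to any viewer or library that does not render XFA. In such a file the real form is stored as XML, and ordinary text extraction only ever sees the placeholder page. The sketch below assumes iTextSharp 5.x, as in the code above, and its `AcroFields.Xfa`, `XfaForm.XfaPresent`, and `XfaForm.DomDocument` members. The names `XfaProbe`, `LoadXfa`, and `ListStaticItems` are hypothetical, and the XPath assumes list entries sit in `items` elements under `field` elements, which is how XFA templates normally store static lists.

```csharp
using System.Collections.Generic;
using System.Xml;
using iTextSharp.text.pdf;

public static class XfaProbe
{
    // Returns the XFA XML of the PDF, or null when the file carries no XFA form.
    public static XmlDocument LoadXfa(string path)
    {
        PdfReader reader = new PdfReader(path);
        try
        {
            XfaForm xfa = reader.AcroFields.Xfa;
            return xfa.XfaPresent ? xfa.DomDocument : null;
        }
        finally
        {
            reader.Close();
        }
    }

    // Lists "fieldName: itemText" for every field that has an <items> child.
    // local-name() is used so the XFA namespace version does not matter.
    public static List<string> ListStaticItems(XmlDocument xfa)
    {
        var lines = new List<string>();
        XmlNodeList fields = xfa.SelectNodes(
            "//*[local-name()='field'][*[local-name()='items']]");
        foreach (XmlNode field in fields)
        {
            XmlAttribute name = field.Attributes["name"];
            string label = name != null ? name.Value : "(unnamed)";
            foreach (XmlNode item in field.SelectNodes("*[local-name()='items']/*"))
            {
                lines.Add(label + ": " + item.InnerText);
            }
        }
        return lines;
    }
}
```

If `LoadXfa` returns null, the file is not an XFA form and the cause lies elsewhere. If it returns a document, note that lists filled in at runtime by script would not appear under `items`; those values are more likely to sit inside the form's `script` elements, so saving the whole document with `XmlDocument.Save` and searching it may be the quicker route.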
