I'm parsing PDFs with PDFMiner, using it as a library in my Python script.
In most of these PDFs there is a table, where one of the columns is named "company".
Is there a way to:
- detect the existence of that table in the PDF.
- get all the company names (i.e. all the entries in the 2nd column of the table).
The best method I have found so far is the HTMLConverter class in the pdfminer library. It converts the PDF to HTML, where it is easier to work out tables, rows and columns. That worked in my case at least; I can't say whether it works with every kind of table in a PDF file.
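
Here is a rough sketch of that approach, not a definitive implementation. It assumes HTMLConverter writes each text box as a `<div>` whose `style` has absolute `left`/`top` pixel positions, so a "column" is simply the text that shares the header's `left` value. The helper names (`pdf_to_html`, `BoxParser`, `company_names`), the 5px tolerance and the hand-written `SAMPLE_HTML` are all mine for illustration; real output depends on your PDFs and on the layout analysis, so expect to tune it.

```python
import io
import re
from html.parser import HTMLParser


def pdf_to_html(path):
    """Convert a PDF file to an HTML string with pdfminer's HTMLConverter."""
    # Imported lazily so the parsing helpers below also run without pdfminer.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    out = io.BytesIO()
    device = HTMLConverter(rsrcmgr, out, codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return out.getvalue().decode("utf-8")


class BoxParser(HTMLParser):
    """Collect (left, top, text) for every absolutely positioned <div>."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self._pos = None    # (left, top) of the div being read, if any
        self._depth = 0     # nesting depth of divs inside that div
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._pos is not None:
            if tag == "div":
                self._depth += 1
            elif tag == "br":
                self._chunks.append("\n")   # line break inside a text box
            return
        if tag == "div":
            style = dict(attrs).get("style") or ""
            left = re.search(r"left:\s*(-?\d+)px", style)
            top = re.search(r"top:\s*(-?\d+)px", style)
            if left and top:
                self._pos = (int(left.group(1)), int(top.group(1)))
                self._depth = 1
                self._chunks = []

    def handle_endtag(self, tag):
        if self._pos is not None and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.boxes.append((*self._pos, "".join(self._chunks)))
                self._pos = None

    def handle_data(self, data):
        if self._pos is not None:
            self._chunks.append(data)


def company_names(html, header="company", tol=5):
    """Return the entries under the header, or None if no header is found."""
    parser = BoxParser()
    parser.feed(html)
    lines = []
    # Read boxes top to bottom, then left to right, one text line at a time.
    for left, top, text in sorted(parser.boxes, key=lambda b: (b[1], b[0])):
        for line in text.splitlines():
            if line.strip():
                lines.append((left, line.strip()))
    for i, (left, line) in enumerate(lines):
        if line.lower() == header:
            # Keep later lines whose box starts at (about) the same x position.
            return [t for x, t in lines[i + 1:] if abs(x - left) <= tol]
    return None


# Hand-written markup that imitates the assumed output, for illustration only.
SAMPLE_HTML = (
    '<div style="position:absolute; left:50px; top:100px;"><span>id</span></div>'
    '<div style="position:absolute; left:120px; top:100px;"><span>company</span></div>'
    '<div style="position:absolute; left:50px; top:120px;"><span>1</span></div>'
    '<div style="position:absolute; left:120px; top:120px;">'
    '<span>Acme</span><br><span>Globex</span></div>'
)
print(company_names(SAMPLE_HTML))  # ['Acme', 'Globex']
```

On a real file you would call `company_names(pdf_to_html("some.pdf"))`; a `None` result means no "company" header was found, which covers the detection part of the question. Note that this collects everything below the header at that x position, so text that follows the table in the same column position would need to be filtered out separately.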