I have PDF files that were created electronically (not scanned), but they come in various layouts. These PDFs contain tables, which are sometimes rotated by +90 or -90 degrees. Sometimes the first line of the file has the normal orientation (0 degrees), while all other content is rotated.
The metadata of these files does not contain the rotation information; the rotation is always reported as 0. I need to extract the tables from these files with the help of https://pdftables.com.
PDFTables was originally open source and based on the pdfminer Python library; it is now a commercial product. When I send these unusual pages to PDFTables, the content cannot be read properly, so I need to determine the orientation of the page before I send it there.
- I tried extracting the text with pdfminer myself and comparing it with the output of the Unix tool pdftotext (which extracts the text properly); whenever there was a "difference", I flagged the file for rotation.
Unfortunately, this does not always work, because pdfminer does not always give me the same results as pdftables.
- I have tried Python's OpenCV library on images of these PDFs, but it could only detect the skew of the text, not a 90-degree rotation.
- I have also tried the Hough transform to find lines of text and estimate their direction, but since the pages contain tables, it is hard to tell whether a detected line is a line of text or an actual drawn line.
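For reference, here is a minimal sketch of the comparison described in the first bullet, assuming pdfminer.six and the pdftotext command-line tool are installed. The helper names, the difflib-based similarity measure, the whitespace normalization, and the 0.8 threshold are all illustrative assumptions, not the exact code I used.

```python
import difflib
import subprocess


def normalize(text):
    # Collapse all whitespace so that layout differences alone
    # do not count as a mismatch between the two extractions.
    return " ".join(text.split())


def similarity(text_a, text_b):
    # Ratio in [0, 1]; 1.0 means the normalized texts are identical.
    matcher = difflib.SequenceMatcher(None, normalize(text_a), normalize(text_b))
    return matcher.ratio()


def needs_rotation_check(pdfminer_text, pdftotext_text, threshold=0.8):
    # threshold is a hypothetical cut-off and has to be tuned on real files.
    return similarity(pdfminer_text, pdftotext_text) < threshold


def text_from_pdfminer(path):
    # Imported here so the comparison helpers work without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def text_from_pdftotext(path):
    # "-" as the output name makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A file would then be flagged with `needs_rotation_check(text_from_pdfminer(path), text_from_pdftotext(path))`. The weakness is the one noted above: a low similarity only says the two extractors disagree, not in which direction the page is rotated.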
Do you have any suggestions on how to solve this problem? Thanks.