Camelot cannot extract entire table


I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).

Camelot identifies the table and extracts most of its data, but it fails on the bottom half: it reads the top half of the table correctly but seems unable to separate the text in the lower half.

This is the table from the PDF in question:

Table from PDF

But when I use Camelot's visual debugging to show the words it will extract, it seems to treat the bottom section of the table as one giant block:

Visual Debug of Table

Any guidance on improving Camelot's "vision" here would be helpful.


There is 1 answer

Tomper On

Apart from the block, the horizontal lines are also marked as text, which is odd.

Camelot uses pdfminer.six for text extraction, and you can pass LAParams options (page 16) to camelot.read_pdf() through its layout_kwargs argument to tweak how characters are grouped into text.
You should also check out camelot.plot(table, kind="grid") to see whether the table lines are recognized correctly. If they are not, that might be where the problem lies.
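As a rough sketch of both suggestions, something like the following could be a starting point. The margin values are illustrative guesses rather than recommended settings, the helper names `build_layout_kwargs` and `extract_and_plot` are made up for this example, and the lattice flavor is an assumption based on the table having ruled lines.

```python
def build_layout_kwargs(char_margin=1.0, line_margin=0.3, word_margin=0.1):
    """Collect pdfminer LAParams overrides to hand to Camelot.

    The defaults here are illustrative guesses (pdfminer's own defaults are
    char_margin=2.0, line_margin=0.5, word_margin=0.1). Smaller margins make
    pdfminer less eager to merge nearby characters and lines into one block.
    """
    return {
        "char_margin": char_margin,
        "line_margin": line_margin,
        "word_margin": word_margin,
    }


def extract_and_plot(pdf_path, pages="1"):
    """Read tables with tweaked layout settings and show two debug plots."""
    # Imported lazily so the helper above can be used without Camelot installed.
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",  # assumption: the table is drawn with ruled lines
        layout_kwargs=build_layout_kwargs(),
    )
    if tables.n == 0:
        raise ValueError("Camelot found no tables on the requested pages")

    # "grid" shows the cell structure Camelot detected; "text" shows the
    # text boxes that pdfminer produced from the OCR layer.
    camelot.plot(tables[0], kind="grid").show()
    camelot.plot(tables[0], kind="text").show()
    return tables
```

Calling `extract_and_plot("scanned_ocr.pdf")`, where the file name is a placeholder, opens both plots. If the text plot still shows one giant block, try lowering `char_margin` or `line_margin` further and compare the plots again.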