Headers are not getting extracted from PDF while extracting the table data from PDF using camelot

Question


Asked by Abhishek Bisht on 08 November 2018 at 08:20

I am using camelot to extract table data from a PDF, but the table headers are not being extracted along with the rest of the table.

The target PDF is linked below. The tables I need to extract are on pages 3 and 4.

https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing

One of the tables looks like below

I have read the camelot documentation and I think the problem is related to the "Detect short lines" section:

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines

However, I was not able to resolve the problem by tweaking the line_size_scaling parameter.

Please assist.


There is 1 answer

**Vinayak Mehta** · Accepted Answer · 2018-11-09T16:53:19+00:00

I plotted the detected table boundary on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot isn't including the header row in the detected table boundary [bug 1] (see image below). Then I tried the table_areas keyword argument with flavor='lattice', but it didn't include the lines inside the specified table boundary [bug 2]. I've added both to the issue tracker as #200 and #201.

You can still use the table_areas keyword argument with flavor='stream' to get the table out.

Using CLI: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf

Using API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
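If it helps, the API call above could be wrapped roughly as follows. This is only a sketch: format_table_area and extract_table are hypothetical helper names, not part of Camelot. It assumes the coordinate convention from the Camelot docs, where table_areas strings have the form x1,y1,x2,y2 with (x1, y1) the top-left and (x2, y2) the bottom-right corner in PDF coordinate space (origin at the bottom-left of the page), and it assumes a Camelot version that accepts the table_areas keyword used above.

```python
def format_table_area(x1, y1, x2, y2):
    """Build an 'x1,y1,x2,y2' string for Camelot's table_areas argument.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in
    PDF coordinate space, where y grows upwards from the bottom of the page.
    Hypothetical helper, not a Camelot function.
    """
    if x1 >= x2:
        raise ValueError("x1 (left) must be smaller than x2 (right)")
    if y1 <= y2:
        raise ValueError("y1 (top) must be larger than y2 (bottom)")
    return ",".join(str(v) for v in (x1, y1, x2, y2))


def extract_table(pdf_path, page, area, csv_path=None):
    """Read one table with flavor='stream', restricted to the given area.

    Hypothetical wrapper around camelot.read_pdf. Camelot is imported here
    so that format_table_area stays usable without it installed.
    """
    import camelot

    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",
        table_areas=[area],
    )
    if len(tables) == 0:
        raise RuntimeError("no table found in the given area")

    df = tables[0].df  # pandas DataFrame; the header text is in the first rows
    if csv_path is not None:
        df.to_csv(csv_path, index=False, header=False)
    return df


# Example with the coordinates from this answer (requires camelot and 007.pdf):
# area = format_table_area(60, 770, 520, 400)
# df = extract_table("007.pdf", 3, area, csv_path="007.csv")
```

The corner check is there because the y axis points upwards in PDF space, so it is easy to pass the top and bottom values in the wrong order.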

You can find the table boundary coordinates using the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

Hope that helps!
