I'm trying to parse tables in a PDF using Camelot. The cells have multiple lines of texts in them, and some have an empty line separating portions of the text:
First line
Second line
Third line
I would expect this to be parsed as First line\nSecond line\n\nThird line
(notice the double line breaks), but I get this instead: T\nFirst line\nSecond line\nhird line
. The first character after a double-line-break moves to the beginning of the text, and I only get a single line-break instead.
I also tried using tabula, but that one messes up de entire table (data-frame actually) when there is an empty row in the table, and also in case of some words it puts a space between the characters.
EDIT:
My main issue is the removal of multiple line-breaks. The other one I could fix from code if I knew where the empty lines were.
my friend, can you check the example here
https://camelot-py.readthedocs.io/en/master/user/advanced.html#improve-guessed-table-rows
tables = camelot.read_pdf('group_rows.pdf', flavor='stream', row_tol=10)
tables[0].df
I solved the same problem with the code below
tables = camelot.read_pdf(file, flavor = 'stream', table_areas=['24,618,579,93'], columns=['67,315,369,483,571'], row_tol=10,strip_text='t\r\n\v')