I am using the tabula-py module in Python to extract text from a PDF and write it to a file.
I am using this code:
pdf_read = tabula.read_pdf(
    input_path="Test File.pdf",
    pages=start_page_number,
    guess=False,
    area=(81.735, 18.55, 391.285, 273.61),
    relative_area=False,
    format="TSV",
    output_path="testing_area.tsv"
)
When I run this code, I get the message "The output file is empty."
Any idea why this could be?
Edit: If I remove everything except input_path and pages, the data is read into pdf_read correctly; it just is not written to an external file.
Something is wrong with this option...hmm...
Edit #2: I figured out why the area part was not working and now it is, but I still can't get this to output a file for some reason.
Edit #3: I tried following this question: "How to convert PDF to CSV with tabula-py?"
But I keep getting an error message: "build_options() got an unexpected keyword argument 'spreadsheet'"
Edit #4: I'm using the latest version of tabula-py, which doesn't have the spreadsheet option.
Still can't output a file with data though.
I don't know why the approach above wasn't working, but the output of pdf_read is a list.
I converted the list into a DataFrame and then wrote the DataFrame to a file using to_csv.
Code is below:
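A minimal sketch of this workaround, assuming pdf_read is a list of pandas DataFrames (which is what recent tabula-py versions return from read_pdf). The helper name save_tables is made up for illustration, and the tabula call is left in comments because it needs tabula-py and Java installed:

```python
import pandas as pd


def save_tables(tables, output_path, sep="\t"):
    """Combine a list of DataFrames and write them to one delimited file.

    `save_tables` is a hypothetical helper, not part of tabula-py.
    """
    if not tables:
        # An empty list means nothing was extracted from the given area/pages.
        raise ValueError("no tables were extracted; nothing to write")
    # Stack the extracted tables into a single DataFrame.
    combined = pd.concat(tables, ignore_index=True)
    # sep="\t" gives TSV output; index=False omits the row index column.
    combined.to_csv(output_path, sep=sep, index=False)
    return combined


# Usage with the call from the question (assumes start_page_number is defined):
# import tabula
# pdf_read = tabula.read_pdf("Test File.pdf", pages=start_page_number,
#                            guess=False, area=(81.735, 18.55, 391.285, 273.61))
# save_tables(pdf_read, "testing_area.tsv")
```

tabula-py also has a convert_into function that takes an input path, an output path, and an output_format such as "tsv", and writes the extracted data straight to a file. It may be worth trying in place of passing output_path to read_pdf.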