PDF Scraper: Error from tabula-java: End-of-File, expected line

I'm running a PDF Scraper:

!pip install -q tabula-py==2.7.0
#[PDF Scraper]
import tabula
from io import BytesIO

# pdf_content holds the raw bytes of the PDF (obtained elsewhere)
try:
    # Read the first table found on page 3
    df = tabula.io.read_pdf(BytesIO(pdf_content), pandas_options={'header': None}, pages=3, stream=True)[0]
except Exception as e:
    # Any failure on page 3 (e.g. an IndexOutOfBoundsException because the page is not found) falls back to page 2
    df = tabula.io.read_pdf(BytesIO(pdf_content), pandas_options={'header': None}, pages=2, stream=True)[0]
#[PDF Scraper]

It had been running successfully for months. Nothing has changed, yet it suddenly failed with this error:

Error from tabula-java: Picked up JAVA_TOOL_OPTIONS: -Djdk.jar.maxSignatureFileSize=2147483639 Error: Error: End-of-File, expected line

Error from tabula-java: Picked up JAVA_TOOL_OPTIONS: -Djdk.jar.maxSignatureFileSize=2147483639 Error: Error: End-of-File, expected line

There is 1 answer

Answered by RithwikBojja

Code:

import requests
import tabula
from io import BytesIO

# Download the PDF to test with
rithtest_url = "https://test/table.pdf"
rith_data = requests.get(rithtest_url)

try:
    # Read the first table found on page 1
    df = tabula.io.read_pdf(BytesIO(rith_data.content), pandas_options={'header': None}, pages=1, stream=True)[0]
except Exception as e:
    # Same try/except shape as the question; here the retry reads the same page again
    df = tabula.io.read_pdf(BytesIO(rith_data.content), pandas_options={'header': None}, pages=1, stream=True)[0]

df.head()

I got this error when I passed a corrupted PDF.

With a non-corrupted PDF, the same code worked as expected.
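
Since the error showed up with a corrupted PDF, one way to catch it earlier is to check the downloaded bytes before passing them to tabula. This is only a minimal sketch, assuming the corruption shows up as a missing `%PDF-` header or a missing `%%EOF` trailer (for example a truncated download or an HTML error page). It will not catch every kind of damage. `looks_like_pdf` and the sample byte strings are hypothetical names used for illustration:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic check that a byte string is a complete PDF file."""
    # A PDF starts with a "%PDF-" header (allow a little leading junk).
    has_header = b"%PDF-" in data[:1024]
    # A complete PDF has an "%%EOF" marker near the end of the file.
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


# Hypothetical samples standing in for downloaded content.
good = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n"
truncated = good[:20]  # download cut off before the trailer
html_error_page = b"<html><body>404 Not Found</body></html>"

for name, sample in [("good", good), ("truncated", truncated), ("html", html_error_page)]:
    print(name, looks_like_pdf(sample))
```

In the question's code, this check could be run on `pdf_content` before calling `tabula.io.read_pdf`, so that a bad download produces a clear message instead of the tabula-java error.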