Using AWS Textract for processing PDF


I want to use the Textract OCR service to read text from a PDF file. My problem is that I want to do it locally, without an S3 bucket. I tested it with image files and it works well, but it does not work for PDF files.

This is the code where I get an error:

response = textract.start_document_text_detection(DocumentLocation="sample2.pdf")

Error:

Invalid type for parameter DocumentLocation, value: sample2.pdf, type: <class 'str'>, valid types: <class 'dict'>

Code2:

response = textract.start_document_text_detection(DocumentLocation={"name":"sample2.pdf"})

Error:

Unknown parameter in DocumentLocation: "name", must be one of: S3Object

Code3:

response = textract.start_document_text_detection(Document={'Bytes': "sample2.pdf"})

Error:

Unknown parameter in input: "Document", must be one of: DocumentLocation, ClientRequestToken, JobTag, NotificationChannel, OutputConfig

What should I do? Is there a way to make Textract work for PDF documents without S3?


There is 1 answer

ehrene On

The short answer to your question is "No."

The start_document_text_detection operation accepts input from S3 only. You will need to follow the expected input format, which is described in the boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection

Essentially, the service wants a structured input and you need to fill that in correctly according to their specification. Here's the DocumentLocation dictionary input expected by boto3.

DocumentLocation={
    'S3Object': {
        'Bucket': 'string',
        'Name': 'string',
        'Version': 'string'
    }
}
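For what it's worth, here is a rough sketch of how the whole flow might look once the PDF is in S3: upload the local file, start the job with the DocumentLocation structure above, poll until it finishes, then collect the detected lines. The helper names (build_document_location, collect_lines, detect_pdf_text), the bucket name, and the polling interval are my own placeholders, not anything from the boto3 docs, and I have not tuned this for production use. It treats any final status other than SUCCEEDED as a failure, which is an assumption on my part.

```python
import time


def build_document_location(bucket, key):
    """Return the DocumentLocation dict in the shape boto3 expects."""
    return {"S3Object": {"Bucket": bucket, "Name": key}}


def collect_lines(pages):
    """Return the text of every LINE block across a list of result pages."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def detect_pdf_text(s3, textract, local_path, bucket, key, poll_seconds=5):
    """Upload a local PDF to S3, run async text detection, return its lines.

    `s3` and `textract` are boto3 clients, e.g. boto3.client("s3") and
    boto3.client("textract"). The bucket must already exist (assumption).
    """
    # The async API only reads from S3, so upload the local file first.
    s3.upload_file(local_path, bucket, key)

    job = textract.start_document_text_detection(
        DocumentLocation=build_document_location(bucket, key)
    )
    job_id = job["JobId"]

    # Poll until the job is no longer in progress.
    while True:
        result = textract.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {result['JobStatus']}")

    # Results are paginated; follow NextToken until it disappears.
    pages = [result]
    while "NextToken" in result:
        result = textract.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        pages.append(result)

    return collect_lines(pages)


# Hypothetical usage (needs AWS credentials and an existing bucket):
#   import boto3
#   lines = detect_pdf_text(boto3.client("s3"), boto3.client("textract"),
#                           "sample2.pdf", "my-bucket", "sample2.pdf")
```

Passing the clients in as arguments keeps the function easy to try against stand-in objects before pointing it at a real AWS account.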

I'm currently having some similar issues getting this to work in boto3 as well, but I will keep working through the docs to see what I can figure out.