Textract in R (paws) without S3Object

When using Textract from the paws package in R, the start_document_analysis call requires an S3Object in DocumentLocation:

textract$start_document_analysis(
  DocumentLocation = list(
    S3Object = list(Bucket = bucket, Name = file)
  )
)

Is it possible to use DocumentLocation without an S3Object? I would prefer to just provide the path to a local PDF.

There are 2 answers

LRutten On

The start_document_analysis API only accepts an S3 object as input. Unlike the analyze_document API, it does not accept base64-encoded bytes (see also the CLI docs at https://docs.aws.amazon.com/cli/latest/reference/textract/start-document-analysis.html).

So unfortunately you have to use S3 as a place to (temporarily) store your data. Of course you can write your own logic to do that :). A great tutorial on that can be found at https://www.gormanalysis.com/blog/connecting-to-aws-s3-with-r/. Since you have already set up credentials etc., you can skip a lot of the steps and start at step 3, for example.
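The "own logic" mentioned above could look roughly like the sketch below: upload the local file, start the job, poll for the result, and delete the temporary object. The function name analyze_local_pdf, the poll_seconds argument, the FeatureTypes default, and the use of the file name as the S3 key are my own assumptions for illustration, not something paws provides. The clients are passed in, so nothing here touches AWS until you call it.

```r
# Sketch: stage a local PDF in S3, run the asynchronous analysis, then clean up.
# `s3` and `textract` are paws clients, e.g. paws::s3() and paws::textract().
# `analyze_local_pdf` and `poll_seconds` are illustrative names, not paws API.
analyze_local_pdf <- function(path, bucket, s3, textract,
                              feature_types = list("TABLES", "FORMS"),
                              poll_seconds = 5) {
  stopifnot(file.exists(path))
  key <- basename(path)  # assumption: the file name is an acceptable object key

  # Upload the file contents as a raw vector.
  body <- readBin(path, what = "raw", n = file.size(path))
  s3$put_object(Bucket = bucket, Key = key, Body = body)

  # Remove the temporary object even if a later step fails.
  on.exit(s3$delete_object(Bucket = bucket, Key = key), add = TRUE)

  job <- textract$start_document_analysis(
    DocumentLocation = list(S3Object = list(Bucket = bucket, Name = key)),
    FeatureTypes = feature_types
  )

  # Poll until the job leaves the IN_PROGRESS state.
  repeat {
    result <- textract$get_document_analysis(JobId = job$JobId)
    if (!identical(result$JobStatus, "IN_PROGRESS")) break
    Sys.sleep(poll_seconds)
  }
  result
}

# Usage (requires AWS credentials and an existing bucket):
# result <- analyze_local_pdf("report.pdf", "my-bucket",
#                             paws::s3(), paws::textract())
```

Note that this returns only the first page of results. For larger documents you would keep calling get_document_analysis with the returned NextToken until it is empty, and you should check JobStatus for a failed job before using the blocks.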

qix On

paws also supports the synchronous functions: https://www.paws-r-sdk.com/docs/textract/

For example, see https://www.paws-r-sdk.com/docs/textract_detect_document_text/, but be careful, as the documented example isn't always perfect. In this case, the docs describe the argument as "Document [required] The input document as base64-encoded bytes or an Amazon S3 object." and give this example:

svc$detect_document_text(
  Document = list(
    Bytes = raw,
    S3Object = list(
      Bucket = "string",
      Name = "string",
      Version = "string"
    )
  )
)

However, feeding a base64 string to Bytes will throw an error:

# Error: `In file(what, "rb") : cannot open file ...`
result = textract$detect_document_text(
  Document = list(Bytes = knitr::image_uri('image.png')))

Instead, you can provide the file path:

result = textract$detect_document_text(Document = list(Bytes='image.png'))

In your case, try using analyze_document, the synchronous counterpart of start_document_analysis, in the same way.
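If you would rather not rely on a file path being accepted, a raw vector is the type the paws docs show for Bytes, so reading the file yourself should also work. This is a minimal sketch: local_document is a hypothetical helper name, and the FeatureTypes values in the usage are just an example. As far as I know, the synchronous calls only handle single-page documents, so a multi-page PDF still needs the S3 route.

```r
# Sketch: read a local file into a raw vector and build the Document argument.
# `local_document` is a hypothetical helper name, not part of paws.
local_document <- function(path) {
  stopifnot(file.exists(path))
  # Pass the raw bytes as-is; do not base64-encode them yourself.
  list(Bytes = readBin(path, what = "raw", n = file.size(path)))
}

# Usage (requires AWS credentials; `textract` is a paws client):
# textract <- paws::textract()
# result <- textract$analyze_document(
#   Document = local_document("image.png"),
#   FeatureTypes = list("TABLES", "FORMS")
# )
```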