Node.js - Problem extracting text from a PDF file using the Google Cloud Vision API

I'm new to cloud environments and programming in general, and I'm struggling to use the Google Vision API to extract text from a PDF file located in a remote bucket.

I've found it really difficult to find meaningful content on this subject in the docs or even on Stack Overflow. The closest I got to solving this problem was with this question:

Cloud Vision API - PDF OCR

But it did not work for me for the reasons described below, which is why I'm asking a question of my own.

Here is the problem:

I am making a POST request to the following URL, with the JSON body shown after it:

    https://vision.googleapis.com/v1/files:asyncBatchAnnotate?key=MY_API_KEY

    {
        "requests": [
            {
                "inputConfig": {
                    "gcsSource": {
                        "uri": "gs://BUCKET_NAME/FILE_NAME.pdf"
                    },
                    "mimeType": "application/pdf"
                },
                "features": [
                    {
                        "type": "DOCUMENT_TEXT_DETECTION"
                    }
                ],
                "outputConfig": {
                    "gcsDestination": {
                        "uri": "gs://BUCKET_NAME/output/"
                    },
                    "batchSize": 1
                }
            }
        ]
    }

The POST request succeeds. From what I found, I then have to make a GET request, using the response I received from the POST request, to check whether the document text detection is done. Once it is done, it is supposed to write the result to a file inside my bucket (which is why I configured an output destination in the JSON above).

However, when I make a GET request to the URL

    https://vision.googleapis.com/v1/operations/RESPONSE?key=API_KEY

I get the following error:

    "error": {
        "code": 7,
        "message": "Error writing final output to: gs://BUCKET_NAME/output/filename.json"
    }

Even if there is a way to fix this and get the final output written, I wonder whether this is the best way to extract data from a PDF. Having to make a POST and then a GET seems odd, especially since extracting data from an image with the same API takes only one request.

Thanks for the help.

There is 1 answer

Oleksandr Bushkovskyi On BEST ANSWER

I believe the problem is that you are using API key authentication for the "asyncBatchAnnotate" request. The documentation states that you must use service account authentication: https://cloud.google.com/vision/docs/pdf#authentication

Another important point is that the service account used must have write access to the destination bucket.
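
As a rough illustration of the two points above, here is a minimal Node.js sketch. It assumes the official `@google-cloud/vision` package and that credentials for a service account with write access to the bucket are available through Application Default Credentials (for example via the `GOOGLE_APPLICATION_CREDENTIALS` environment variable). The function names `buildPdfOcrRequest` and `extractPdfText` are made up for this example, and the request body simply mirrors the one in the question.

```javascript
// Sketch only: function names are hypothetical; the request body mirrors
// the JSON from the question.

// Build the asyncBatchAnnotate request for one PDF stored in a bucket.
function buildPdfOcrRequest(bucketName, fileName, outputPrefix) {
  return {
    requests: [
      {
        inputConfig: {
          gcsSource: { uri: `gs://${bucketName}/${fileName}` },
          mimeType: 'application/pdf',
        },
        features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
        outputConfig: {
          gcsDestination: { uri: `gs://${bucketName}/${outputPrefix}` },
          batchSize: 1,
        },
      },
    ],
  };
}

// Submit the job and wait for it to finish.
// `client` is assumed to be an ImageAnnotatorClient from @google-cloud/vision,
// authenticated as a service account (not with an API key).
async function extractPdfText(client, bucketName, fileName, outputPrefix) {
  const request = buildPdfOcrRequest(bucketName, fileName, outputPrefix);
  const [operation] = await client.asyncBatchAnnotateFiles(request);
  // The client library polls the long-running operation for us.
  const [filesResponse] = await operation.promise();
  // Returns the gs:// location where the JSON results were written.
  return filesResponse.responses[0].outputConfig.gcsDestination.uri;
}

// Usage (requires `npm install @google-cloud/vision`):
//   const vision = require('@google-cloud/vision');
//   const client = new vision.ImageAnnotatorClient();
//   extractPdfText(client, 'BUCKET_NAME', 'FILE_NAME.pdf', 'output/')
//     .then((uri) => console.log('Results written to', uri));
```

With the client library you do not have to issue the GET request yourself; waiting on `operation.promise()` covers the status check.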

Making a POST request to submit the task and then a GET request to check its status is a common REST pattern for long-running operations: http://restalk-patterns.org/long-running-operation-polling.html
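
To make the pattern concrete, here is a small generic sketch of the polling loop in Node.js. It is not tied to the Vision API: `getOperation` is a hypothetical function you supply that performs the GET request and returns the operation resource, which I assume has the usual long-running operation shape (`done`, plus either `error` or `response`). The interval and attempt limits are arbitrary defaults.

```javascript
// Poll a long-running operation until it reports that it is done.
// `getOperation` is a caller-supplied async function (hypothetical) that
// fetches the current operation state, e.g. via GET /v1/operations/<name>.
async function pollOperation(getOperation, { intervalMs = 5000, maxAttempts = 60 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const op = await getOperation();
    if (op.done) {
      if (op.error) {
        // The operation finished but failed, as with the "code 7" error
        // shown in the question.
        throw new Error(`Operation failed (code ${op.error.code}): ${op.error.message}`);
      }
      return op.response;
    }
    // Not finished yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Operation still running after ${maxAttempts} attempts`);
}
```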

I think the APIs for images and PDFs differ because recognizing an image is assumed to be a relatively short operation, while recognizing a multipage PDF can take significant time.