How to locally process a batch of files using Document AI with the Python client?


I'm trying to use the Python client to run the Document OCR processor locally over a large number of PDF documents (native and scanned) to extract the text and some metadata. The documents are stored in multiple folders and subfolders. I currently have working code that synchronously processes the documents one-by-one from local files, but since some documents have more than 15 pages I had to switch to batch processing. I was hoping batch processing would let me build the batches from local files and then process them and store the results locally as well. Here are my imports:

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

My documentai module's version is 2.20.1.

I am using Spyder IDE 5.4.3 with Python 3.11.4.
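For reference, the working synchronous version looks roughly like this (the project, location and processor ID below are placeholders, not my real values):

# Rough sketch of the synchronous, fully local processing that already works.
# "my-project", "eu" and "my-processor-id" are placeholders.
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint="eu-documentai.googleapis.com")
)
processor_name = client.processor_path("my-project", "eu", "my-processor-id")

for file_path in file_paths:
    with open(file_path, "rb") as image:
        content = image.read()

    raw_document = documentai.RawDocument(content=content, mime_type="application/pdf")
    request = documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
    result = client.process_document(request=request)

    # file_path is still in scope here, so the extracted text can be tied
    # directly to the document it came from
    print(file_path, len(result.document.text))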

The general idea is to go through a list (file_paths) containing the paths to the individual documents, open them one-by-one, and create an object holding the PDF content and the mime_type. Then, according to some logic, these objects are used to construct the batches. But I'm not sure which object it should be. ChatGPT suggested the following:

batch_input_config = documentai.types.BatchProcessRequest.Types.BatchInputConfig(

The relevant part of the code:

# Create a list to hold BatchInputConfig objects for each document
batch_input_configs = []

for file_path in file_paths:
    with open(file_path, 'rb') as image:
        content = image.read()

    # Create a BatchInputConfig object for each document
    batch_input_config = documentai.types.BatchProcessRequest.Types.BatchInputConfig(
        content=content,
        mime_type='application/pdf'
    )

    # Some logic for the batch construction is left out here
    # Add the BatchInputConfig object to the batch_input_configs list
    batch_input_configs.append(batch_input_config)

But that throws "AttributeError: module 'google.cloud.documentai' has no attribute 'types'" on the batch_input_config = documentai.types.BatchProcessRequest.Types.BatchInputConfig( line. I checked online examples and the documentation, and everything seems to use Google Cloud Storage (GCS) to store the documents and even the results.

I'm sure there are advantages to using GCS, but it would be greatly preferred if the files and results could be stored locally, without involving GCS, because the whole workflow is built around doing everything locally. Specifically, I need to attach each document's file path and name to its corresponding results. This was possible with the synchronous code, so I would expect the same from batch processing...

So my question is: Is it possible to use documentai to locally process batches of documents? (And if so, how? Which object or workaround should I use?)

Excuse any glaring errors, I'm using documentai for the very first time. Any help would be greatly appreciated, thanks!

I have tried several objects, such as:

documentai.types.Document
documentai.types.BatchProcessRequest.Types.BatchInputConfig
documentai.BatchDocumentsInputConfig

I have tried updating my documentai module and specifying the import as: from google.cloud import documentai_v1beta3 as documentai

I have also tried bothering ChatGPT a lot, but so far to no avail.

Even if the workaround would be something like having the code upload the files to GCS, process them there and then download the results, that would be okay, I guess.
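Roughly what I imagine that workaround looking like, using the google-cloud-storage client (the bucket name and prefixes are made up):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("my-docai-bucket")  # made-up bucket name

# Upload the local PDFs, keeping the original file path in the blob name so the
# results can be matched back to the source file later
for file_path in file_paths:
    blob = bucket.blob(f"input/{file_path}")
    blob.upload_from_filename(file_path)

# ... batch processing would run here, writing Document JSON files under "output/" ...

# Download the results and pair them with the original file paths again
for blob in bucket.list_blobs(prefix="output/"):
    if blob.name.endswith(".json"):
        document = documentai.Document.from_json(
            blob.download_as_bytes(), ignore_unknown_fields=True
        )
        print(blob.name, len(document.text))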


1 Answer

Holt Skinner

The code from ChatGPT is inaccurate. To batch process multiple files at once, you will need to upload the documents to Google Cloud Storage, and then you can use the code samples here to perform batch processing:

https://cloud.google.com/document-ai/docs/send-request#batch-process
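For completeness, here is a condensed sketch of what those samples do once the files are in GCS (bucket, project, location and processor ID below are placeholders):

from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholders: replace with your own project, location, processor and bucket.
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint="eu-documentai.googleapis.com")
)

gcs_documents = documentai.GcsDocuments(
    documents=[
        documentai.GcsDocument(
            gcs_uri="gs://my-docai-bucket/input/document.pdf",
            mime_type="application/pdf",
        )
    ]
)
input_config = documentai.BatchDocumentsInputConfig(gcs_documents=gcs_documents)

output_config = documentai.DocumentOutputConfig(
    gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
        gcs_uri="gs://my-docai-bucket/output/"
    )
)

request = documentai.BatchProcessRequest(
    name=client.processor_path("my-project", "eu", "my-processor-id"),
    input_documents=input_config,
    document_output_config=output_config,
)

# batch_process_documents returns a long-running operation; wait for it to
# finish, then read the Document JSON results back from the output GCS prefix.
operation = client.batch_process_documents(request)
operation.result(timeout=600)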