Batch Import and Label Assignment in Google Document AI


We are integrating Google's Document AI into our document management system and require an automated solution to import and label PDF documents for a custom classifier processor's dataset.

Is there an existing API feature, CLI tool, or a workaround that allows for the automatic assignment of labels during or after the batch import process into Document AI's dataset?

The goal is to facilitate the training of the classifier without manual intervention for a large volume of documents and labels.

We are using the v1beta3 API for batch importing (https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors.dataset/importDocuments), but it offers no way to assign labels automatically: every imported document defaults to "Unlabeled."


There is 1 answer

Holt Skinner On

This is similar to my answer at https://stackoverflow.com/a/76535922/6216983

There is no API to explicitly label documents, but if you import JSON files from Google Cloud Storage that are in the Document format, they will be read as labeled.
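As a rough illustration of what "pre-labeled" could mean for a classifier, here is a minimal sketch that adds a class label to an existing Document JSON (for example, one produced by running the PDF through an OCR processor). It assumes the label is carried as a document-level entry in `entities` whose `type` is the class name. That encoding is my assumption, not something confirmed above, so compare it against a labeled document exported from your own dataset before relying on it. `attach_label` is a hypothetical helper name.

```python
import copy


def attach_label(document: dict, label: str) -> dict:
    """Return a copy of a Document JSON dict with a classification label added.

    ASSUMPTION: the classifier label is stored as an entity whose "type"
    is the class name. Verify against a document exported from your dataset.
    """
    labeled = copy.deepcopy(document)  # leave the caller's dict untouched
    # Drop any existing entity with the same type so the label is not duplicated.
    entities = [e for e in labeled.get("entities", []) if e.get("type") != label]
    entities.append({"type": label})
    labeled["entities"] = entities
    return labeled
```

The result can be written out with `json.dump` and uploaded to the Cloud Storage prefix you import from.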

You can use the importDocuments() method to perform the import.
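For reference, a minimal sketch of the request that the REST `importDocuments` method expects, built with the standard library only. The field names (`batchDocumentsImportConfigs`, `batchInputConfig`, `gcsPrefix.gcsUriPrefix`, `datasetSplit`) and the `DATASET_SPLIT_TRAIN` value reflect my reading of the v1beta3 reference linked in the question, so check them against that page. The helper names and the example IDs are hypothetical.

```python
def build_import_request(gcs_prefix: str, split: str = "DATASET_SPLIT_TRAIN") -> dict:
    """Build a JSON body for the v1beta3 dataset importDocuments call.

    ASSUMPTION: field names follow the v1beta3 REST reference as I understand it.
    """
    if not gcs_prefix.startswith("gs://"):
        raise ValueError("gcs_prefix must be a gs:// URI")
    return {
        "batchDocumentsImportConfigs": [
            {
                # Every Document JSON under this prefix is imported.
                "batchInputConfig": {"gcsPrefix": {"gcsUriPrefix": gcs_prefix}},
                # Target split: training set here; use the test split for evaluation data.
                "datasetSplit": split,
            }
        ]
    }


def import_url(project: str, location: str, processor: str) -> str:
    """Build the regional endpoint URL for the processor's dataset."""
    return (
        f"https://{location}-documentai.googleapis.com/v1beta3/"
        f"projects/{project}/locations/{location}/processors/{processor}"
        "/dataset:importDocuments"
    )
```

You would POST the body to that URL with an OAuth access token. The call returns a long-running operation, so the import is not complete as soon as the request returns.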

This guide shows how to do this with pre-labeled data in the console.

https://cloud.google.com/document-ai/docs/workbench/build-custom-classification-processor#import_pre-labeled_data_to_the_training_and_test_sets

There's also the option of batch labeling in the console if the input documents are separated into folders by classification.

https://cloud.google.com/document-ai/docs/workbench/build-custom-classification-processor#optional_batch_label_documents_at_import
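If you already have a mapping from file to class, a small script can plan the folder-per-classification layout before you upload. This is only a sketch: `plan_label_folders`, the bucket name, and the `import` prefix are hypothetical, and it only computes destination URIs. The upload itself (for example with `gsutil cp`) and the batch-label step in the console are still separate.

```python
from pathlib import PurePosixPath


def plan_label_folders(labels: dict, bucket: str, prefix: str = "import") -> dict:
    """Map each source file to a gs:// URI under a folder named after its label.

    labels: {source_path: class_label}
    Returns {source_path: destination_uri}.
    """
    plan = {}
    for source_path, label in labels.items():
        filename = PurePosixPath(source_path).name  # keep only the file name
        plan[source_path] = f"gs://{bucket}/{prefix}/{label}/{filename}"
    return plan
```

For example, `plan_label_folders({"scans/a.pdf": "invoice"}, "my-bucket")` places `a.pdf` under `gs://my-bucket/import/invoice/`, so each folder holds exactly one class.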