We are integrating Google's Document AI into our document management system and require an automated solution to import and label PDF documents for a custom classifier processor's dataset.
Is there an existing API feature, CLI tool, or a workaround that allows for the automatic assignment of labels during or after the batch import process into Document AI's dataset?
The goal is to facilitate the training of the classifier without manual intervention for a large volume of documents and labels.
We are using the v1beta3 API for batch importing (https://cloud.google.com/document-ai/docs/reference/rest/v1beta3/projects.locations.processors.dataset/importDocuments), but it offers no way to assign labels automatically; imported documents default to "Unlabeled."
This is similar to the answer at https://stackoverflow.com/a/76535922/6216983
There's no API to explicitly label documents, but if you import JSON files from Google Cloud Storage that are in the Document format, they will be read as labeled. You can use the importDocuments() method to perform the import. This guide shows how to do this with pre-labeled data in the console.
https://cloud.google.com/document-ai/docs/workbench/build-custom-classification-processor#import_pre-labeled_data_to_the_training_and_test_sets
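As a rough sketch of what "pre-labeled" could look like in practice, the snippet below builds a Document-format JSON payload for one PDF and a request body for importDocuments. It uses only the standard library and makes several assumptions that should be checked against a document exported from an already labeled dataset: that a classification label is carried as a single entity whose type is the class name, and that the importer accepts inline base64 content. The importer may also expect OCR output such as text and pages. The helper names build_labeled_document and build_import_request are hypothetical.

```python
import base64
import json


def build_labeled_document(pdf_bytes: bytes, label: str, text: str = "") -> dict:
    """Return a dict shaped like a Document AI Document in JSON form.

    Assumption: for a classifier dataset, the label is stored as one
    entity whose "type" equals the class name.
    """
    return {
        "mimeType": "application/pdf",
        # bytes fields are base64-encoded in the JSON representation
        "content": base64.b64encode(pdf_bytes).decode("ascii"),
        "text": text,
        "entities": [{"type": label}],
    }


def build_import_request(gcs_prefix: str, split: str = "DATASET_SPLIT_TRAIN") -> dict:
    """Return a request body for the v1beta3 dataset importDocuments call.

    gcs_prefix is the Cloud Storage folder holding the labeled JSON files.
    """
    return {
        "batchDocumentsImportConfigs": [
            {
                "datasetSplit": split,
                "batchInputConfig": {"gcsPrefix": {"gcsUriPrefix": gcs_prefix}},
            }
        ]
    }


if __name__ == "__main__":
    doc = build_labeled_document(b"%PDF-1.4 placeholder", "invoice")
    # Write this JSON to Cloud Storage, then send the request body below.
    print(json.dumps(doc)[:80])
    print(json.dumps(build_import_request("gs://my-bucket/labeled/"), indent=2))
```

The bucket path gs://my-bucket/labeled/ is a placeholder. Uploading the JSON files and sending the request are left out here, since they depend on how you authenticate.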
There's also the option of batch labeling in the console if the input documents are separated into folders by classification.
https://cloud.google.com/document-ai/docs/workbench/build-custom-classification-processor#optional_batch_label_documents_at_import
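For the folder-based route, the files first have to be arranged so that each classification has its own folder. A minimal sketch of planning that layout is shown below, assuming you already have a mapping from file path to label (for example from your document management system). The function name plan_folder_layout and the bucket path are hypothetical, and the actual copy to Cloud Storage is not shown.

```python
from pathlib import PurePosixPath
from typing import Dict


def plan_folder_layout(labels_by_file: Dict[str, str], bucket_prefix: str) -> Dict[str, str]:
    """Map each source file to a destination of the form <prefix>/<label>/<filename>."""
    prefix = bucket_prefix.rstrip("/")
    return {
        src: f"{prefix}/{label}/{PurePosixPath(src).name}"
        for src, label in labels_by_file.items()
    }


if __name__ == "__main__":
    plan = plan_folder_layout(
        {"scans/a.pdf": "invoice", "scans/b.pdf": "receipt"},
        "gs://my-bucket/import/",
    )
    for src, dest in plan.items():
        print(src, "->", dest)
```

Each label folder can then be selected at import time in the console, as described in the linked guide.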