I am trying to batch process a set of documents using Document AI and its Java SDK. My code is derived from the batch processing example for Java (seen here), but I have modified it to add more than one document (40 documents of up to 5 pages each).
I wait for the result of the batch processing using the same code as in the example:
// Batch process document using a long-running operation.
// You can wait for now, or get results later.
// Note: first request to the service takes longer than subsequent
// requests.
System.out.println("Waiting for operation to complete...");
future.get();
System.out.println("Document processing complete.");
After a bit less than 5 minutes, I always get the following exception:
feb. 06, 2024 6:34:08 EM com.google.api.gax.longrunning.OperationTimedPollAlgorithm shouldRetry
VARNING: The task has been cancelled. Please refer to https://github.com/googleapis/google-cloud-java#lro-timeouts for more information
java.util.concurrent.CancellationException: Task was cancelled.
at com.google.common.util.concurrent.AbstractFuture.cancellationExceptionWithCause(AbstractFuture.java:1560)
at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:590)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:571)
at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:91)
at com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:67)
at com.google.api.gax.longrunning.OperationFutureImpl.get(OperationFutureImpl.java:125)
at ...
What can I do to avoid this timeout? I have also tried with a smaller number of documents (25), but that times out as well.
From the link listed in the error message:
LRO Timeouts
The polling operations have a default timeout that varies from service to service. The library will throw a java.util.concurrent.CancellationException with the message: Task was cancelled. if the timeout exceeds the operation. A CancellationException does not mean that the backend GCP Operation was cancelled. This exception is thrown from the client library when it has exceeded the total timeout without receiving a successful status from the operation. Our client libraries respect the configured values set in the OperationTimedPollAlgorithm for each RPC.
Note: The client library handles the Operation's polling mechanism for you. By default, there is no need to manually poll the status yourself.
You don't need to continuously poll long-running operations and it's not advised to do so, especially when processing a large number of documents, as it could take a long time. In this case, you can check the output Google Cloud Storage bucket at a later time once the operation is completed, rather than polling/waiting for it to complete.
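As a rough illustration of that approach, the sketch below treats a client-side CancellationException as "still running, check the output bucket later" rather than as a failure. It uses only the JDK: a cancelled CompletableFuture stands in for the SDK's operation future, and the class name DeferredBatchCheck, the method waitForBatch, and the WaitResult enum are hypothetical names invented for this example.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class DeferredBatchCheck {

    /** Hypothetical outcome type for this sketch; not part of any SDK. */
    enum WaitResult { COMPLETED, CHECK_OUTPUT_LATER, FAILED }

    /** Waits on the future and classifies the outcome instead of crashing. */
    static WaitResult waitForBatch(Future<?> future) throws InterruptedException {
        try {
            future.get();
            return WaitResult.COMPLETED;
        } catch (CancellationException e) {
            // The client stopped polling. Per the quoted docs, this does not
            // mean the backend operation was cancelled, so look at the
            // output bucket later instead of treating this as an error.
            return WaitResult.CHECK_OUTPUT_LATER;
        } catch (ExecutionException e) {
            // The operation itself reported a failure.
            return WaitResult.FAILED;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate the polling timeout with a cancelled future.
        CompletableFuture<String> timedOut = new CompletableFuture<>();
        timedOut.cancel(true);
        System.out.println(waitForBatch(timedOut));

        // Simulate an operation that finished in time.
        System.out.println(waitForBatch(CompletableFuture.completedFuture("done")));
    }
}
```

In the question's code, the same try/catch would wrap the existing future.get() call.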
If you want your application to block and wait for the operation to complete, you can extend the polling timeout as described in the link.
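Adapting the snippet from that README section to Document AI might look roughly like the sketch below. It rests on some assumptions: that you are using the com.google.cloud.documentai.v1 package (adjust if your sample uses a beta package), that your gax version takes org.threeten.bp.Duration in RetrySettings (newer releases also offer java.time variants), and that a 2-hour total timeout suits your workload. The class name ExtendedLroTimeout and the method createClient are made up for illustration.

```java
import com.google.api.gax.longrunning.OperationTimedPollAlgorithm;
import com.google.api.gax.retrying.RetrySettings;
import com.google.api.gax.retrying.TimedRetryAlgorithm;
import com.google.cloud.documentai.v1.DocumentProcessorServiceClient;
import com.google.cloud.documentai.v1.DocumentProcessorServiceSettings;
import java.io.IOException;
import org.threeten.bp.Duration;

public class ExtendedLroTimeout {

    /** Builds a client whose batch-process polling gives up after 2 hours. */
    static DocumentProcessorServiceClient createClient(String endpoint) throws IOException {
        DocumentProcessorServiceSettings.Builder settingsBuilder =
            DocumentProcessorServiceSettings.newBuilder().setEndpoint(endpoint);

        // Polling schedule: start at 5 s between polls, back off to at most 45 s.
        TimedRetryAlgorithm pollAlgorithm =
            OperationTimedPollAlgorithm.create(
                RetrySettings.newBuilder()
                    .setInitialRetryDelay(Duration.ofSeconds(5))
                    .setRetryDelayMultiplier(1.5)
                    .setMaxRetryDelay(Duration.ofSeconds(45))
                    .setInitialRpcTimeout(Duration.ZERO) // ignored for polling
                    .setRpcTimeoutMultiplier(1.0) // ignored for polling
                    .setMaxRpcTimeout(Duration.ZERO) // ignored for polling
                    .setTotalTimeout(Duration.ofHours(2)) // assumed value; tune as needed
                    .build());

        // Apply the polling algorithm only to the batch-process operation.
        settingsBuilder
            .batchProcessDocumentsOperationSettings()
            .setPollingAlgorithm(pollAlgorithm);

        return DocumentProcessorServiceClient.create(settingsBuilder.build());
    }
}
```

The endpoint argument would be the same regional endpoint string the sample already passes to setEndpoint. The rest of the batch processing code stays unchanged, and future.get() should then wait up to the configured total timeout before throwing.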