Currently I am using Textract queries to extract specific information from uploaded PDF documents. I have a Lambda function called textract_async_job_creation which is triggered every time a document is uploaded to an S3 bucket. This function calls the Textract start_document_analysis method and has Textract write the results to another S3 bucket. When the analysis completes, Textract publishes an SNS notification that triggers another Lambda function called textract-response-process, which stores the query results as JSON in yet another bucket.

My issue is that the first Lambda function, textract_async_job_creation, will not let me use the adapter that I trained: it throws an error saying AdaptersConfig is not recognized (I do not remember the exact wording). All the documentation I have read shows AdaptersConfig being used with start_document_analysis. Can someone tell me what I've done wrong here?
import os
import json
import boto3
from botocore.config import Config
from urllib.parse import unquote_plus
my_config = Config(
    region_name='us-east-2',
    retries={
        'max_attempts': 10,
        'mode': 'adaptive'
    }
)

textract = boto3.client('textract', config=my_config)

OUTPUT_BUCKET_NAME = os.environ["OUTPUT_BUCKET_NAME"]
OUTPUT_S3_PREFIX = os.environ["OUTPUT_S3_PREFIX"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]
SNS_ROLE_ARN = os.environ["SNS_ROLE_ARN"]


def lambda_handler(event, context):
    responses = []
    for record in event["Records"]:
        file_obj = record["s3"]
        bucketname = str(file_obj["bucket"]["name"])
        filename = unquote_plus(str(file_obj["object"]["key"]))
        print(f"Bucket: {bucketname} ::: Key: {filename}")
        response = textract.start_document_analysis(
            DocumentLocation={'S3Object': {'Bucket': bucketname, 'Name': filename}},
            FeatureTypes=['QUERIES'],
            OutputConfig={'S3Bucket': OUTPUT_BUCKET_NAME, 'S3Prefix': OUTPUT_S3_PREFIX},
            NotificationChannel={'SNSTopicArn': SNS_TOPIC_ARN, 'RoleArn': SNS_ROLE_ARN},
            QueriesConfig={
                'Queries': [
                    {'Text': 'What is the name of the claimant?', 'Pages': ['1']},
                    {'Text': 'What is the date on the document?', 'Pages': ['1']},
                    {'Text': 'What is the phone number?', 'Pages': ['1']},
                    {'Text': 'What is the address of the office?', 'Pages': ['1']}
                ]
            }
            # AdaptersConfig={
            #     'Adapters': [
            #         {'AdapterId': 'xxxxxxxxxxx', 'Version': '1'}
            #     ]
            # }
        )
        responses.append(response)

    successful_responses = [resp for resp in responses if resp["ResponseMetadata"]["HTTPStatusCode"] == 200]
    failed_responses = [resp for resp in responses if resp["ResponseMetadata"]["HTTPStatusCode"] != 200]

    if successful_responses:
        return {"statusCode": 200, "body": json.dumps(f"Job(s) created successfully for {len(successful_responses)} file(s)!")}
    else:
        return {"statusCode": 500, "body": json.dumps(f"Job creation failed for {len(failed_responses)} file(s)!")}
I tried to pass the adapter I trained in Textract to start_document_analysis so that it returns the correct query responses for the documents I upload to the S3 bucket. However, I get an error whenever I include AdaptersConfig in the start_document_analysis call, and I do not understand why, given that the documentation shows examples using AdaptersConfig.
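For reference, this is roughly what I expect the call to look like once the adapter is included. The bucket, key, and adapter ID below are just placeholders for my actual values, and note the comma that has to follow QueriesConfig once AdaptersConfig is uncommented:

import boto3

textract = boto3.client('textract')

# Placeholder bucket, key, and adapter ID; trimmed to a single query for brevity.
response = textract.start_document_analysis(
    DocumentLocation={'S3Object': {'Bucket': 'my-input-bucket', 'Name': 'document.pdf'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={
        'Queries': [
            {'Text': 'What is the name of the claimant?', 'Pages': ['1']}
        ]
    },  # this comma is required once AdaptersConfig is added below
    AdaptersConfig={
        'Adapters': [
            {'AdapterId': 'xxxxxxxxxxx', 'Version': '1'}
        ]
    }
)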
The problem was the runtime: I was using Python 3.9, whose Lambda runtime bundles an older version of boto3 that does not recognize AdaptersConfig as a parameter for start_document_analysis. I switched the runtime to Python 3.12, which ships a newer boto3, and that fixed the issue.
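If you want to confirm what your runtime is actually running before (or after) switching, printing the bundled library versions and inspecting the API model from inside the handler is a quick sanity check. This is just a diagnostic sketch:

import boto3
import botocore

# Print the library versions bundled with the Lambda runtime.
print("boto3:", boto3.__version__)
print("botocore:", botocore.__version__)

# Check whether the bundled API model for Textract knows about AdaptersConfig;
# if it is missing here, start_document_analysis will reject the parameter.
textract = boto3.client('textract')
op = textract.meta.service_model.operation_model('StartDocumentAnalysis')
print("AdaptersConfig supported:", 'AdaptersConfig' in op.input_shape.members)

An alternative to changing the runtime is packaging a newer boto3/botocore with the function (for example as a Lambda layer), which overrides the version bundled with the runtime.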