What is the cost of listing all files in AWS S3 bucket?


I am writing a Python script in which I need to find the most recently modified file under a given prefix in a bucket. As far as I have read, I cannot run that query directly against S3 (at least not with boto3), so I have to retrieve information about every object in the bucket.

I would have to query several thousand files, and I do not want any surprises on my bill.

If I retrieve the metadata of all the objects in my bucket in order to sort them locally, will I be charged for a single request, or will it count as one request per object?

Thank you all in advance


1 Answer

Answered by maronavenue

A common approach is to use s3api, which consolidates the work into a single LIST request for every 1000 objects, and then use --query to define your filtering operation, such as:

aws s3api list-objects-v2 --bucket your-bucket-name --query 'Contents[?contains(LastModified, `$DATE`)]'

Please keep in mind that this isn't a good solution, for two reasons:

  1. It does not scale well, especially with large buckets, and it does little to minimize outbound data.
  2. It does not reduce the number of S3 API calls, because the --query parameter is not evaluated server-side; it is simply a client-side feature of the AWS CLI. To illustrate, here is the boto3 equivalent, where, as you can see, we would still filter on the client side:
import boto3

client = boto3.client('s3', region_name='us-east-1')

# A single LIST request, which returns at most 1000 objects
response = client.list_objects_v2(Bucket='your-bucket-name')

# Sort client-side and take the most recently modified object
latest = sorted(response['Contents'], key=lambda item: item['LastModified'])[-1]
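If you do go the boto3 route, a paginator keeps the cost at one LIST request per 1000 keys while still letting you scope the scan to a prefix. A minimal sketch, assuming placeholder bucket/prefix names (the pure helper is separated out so it can be used without AWS access):

```python
def latest_object(objects):
    """Return the object dict with the most recent LastModified, or None."""
    if not objects:
        return None
    return max(objects, key=lambda item: item["LastModified"])

def latest_in_bucket(bucket, prefix=""):
    """Page through list_objects_v2 (one LIST request per 1000 keys)."""
    import boto3  # deferred so latest_object() works without boto3 installed

    client = boto3.client("s3", region_name="us-east-1")
    paginator = client.get_paginator("list_objects_v2")
    newest = None
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        candidate = latest_object(page.get("Contents", []))
        if candidate and (
            newest is None or candidate["LastModified"] > newest["LastModified"]
        ):
            newest = candidate
    return newest

if __name__ == "__main__":
    obj = latest_in_bucket("your-bucket-name", prefix="logs/")
    print(obj["Key"] if obj else "no objects found")
```

With ListObjectsV2 priced per request rather than per object returned, scanning N objects costs roughly N/1000 LIST requests, not N.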

Probably

One thing you could *probably* do, depending on your specific use case, is use S3 Event Notifications to automatically publish an event to SQS, which gives you the opportunity to poll for the S3 object events along with their metadata. This is more lightweight, but it still costs some money, and it will not work if you already have a large existing bucket, since notifications only cover new events. You will also have to poll actively, because the messages do not persist for long (SQS retains them for at most 14 days).
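As a sketch of what consuming those notifications might look like, assuming the bucket-to-SQS wiring is already configured and the queue URL is a placeholder (the parsing helper works on the standard S3 event record shape and can be tested without AWS access):

```python
import json

def newest_key_from_messages(bodies):
    """Scan raw SQS message bodies for S3 event records and return the
    (key, event_time) pair with the latest eventTime, or None."""
    newest = None
    for body in bodies:
        event = json.loads(body)
        for record in event.get("Records", []):
            key = record["s3"]["object"]["key"]
            when = record["eventTime"]  # ISO 8601 UTC, sorts lexicographically
            if newest is None or when > newest[1]:
                newest = (key, when)
    return newest

def poll_queue(queue_url):
    """Long-poll an SQS queue that receives S3 event notifications."""
    import boto3  # deferred so the helper above works without boto3 installed

    sqs = boto3.client("sqs", region_name="us-east-1")
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    return newest_key_from_messages(m["Body"] for m in resp.get("Messages", []))
```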

Perfect (sorta)

This sounds to me like a good use case for S3 Inventory. It will deliver a daily (or weekly) file comprising the list of your objects and their metadata, based on your specifications. See https://docs.aws.amazon.com/AmazonS3/latest/user-guide/configure-inventory.html
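The inventory report can then be sorted locally with zero per-object API calls. A sketch, assuming a CSV-formatted report configured with bucket, key, size, and last-modified-date columns in that order (inventory CSVs have no header row, and the actual columns match whatever optional fields you select in the inventory configuration):

```python
import csv
import io

def latest_from_inventory(csv_text):
    """Find the most recently modified key in an S3 Inventory CSV report.
    Assumed columns: bucket, key, size, last_modified_date (ISO 8601)."""
    newest = None
    for bucket, key, size, modified in csv.reader(io.StringIO(csv_text)):
        # ISO 8601 UTC timestamps compare correctly as strings
        if newest is None or modified > newest[1]:
            newest = (key, modified)
    return newest
```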