Getting IDs from a nested bucket structure in S3

s3://trd-data-lake-landing-zone/fetched_projects/project_65e34c4352faff00017fc8a2/locations/location_65e34c4352faff00017fc835/design_65e34c4352faff00017fc832/analysis_65e34c4352faff00017fc8a3/

Look at this file structure. Under fetched_projects there is a project_<id> folder, which contains some files and a folder called locations. Under locations there is a location_<id> folder, which contains design_<id> folders, which in turn contain analysis_<id> folders holding some JSON files. Note that none of these IDs are the same.

s3://trd-data-lake-landing-zone/
└── fetched_projects
    └── project_<id>
        ├── files...
        └── locations
            └── location_<id>
                └── design_<id>
                    └── analysis_<id>
                        └── json files...

Inside each analysis_<id> folder I get some JSON files named result_<id>.json. I want to use only those files as input to another transformation pipeline that flattens the JSON structure. How do I tackle this problem dynamically? I am trying this on my local machine.

I tried getting the list of all the IDs and then building a single key format such as f'{projects_path}project_{project_id}/result_{project_id}.json', but that did not work.

There is 1 answer

KrzysiekDD

Could you be a bit more specific? If you want to obtain a file dynamically from some key, or match your keys against the keys present in the bucket, you can list the objects in the bucket and then filter the resulting list of key strings with a regex. This is the most naive approach; better solutions can be implemented for your specific task.
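
The filtering step could be sketched roughly as follows. The pattern is an assumption built from the folder layout shown in the question, and RESULT_KEY_RE and find_result_keys are hypothetical names used only for illustration; adjust the pattern if the real keys look different.

```python
import re

# Assumed layout, taken from the question:
# fetched_projects/project_<id>/locations/location_<id>/design_<id>/analysis_<id>/result_<...>.json
RESULT_KEY_RE = re.compile(
    r"^fetched_projects/project_(?P<project_id>[^/]+)"
    r"/locations/location_(?P<location_id>[^/]+)"
    r"/design_(?P<design_id>[^/]+)"
    r"/analysis_(?P<analysis_id>[^/]+)"
    r"/result_[^/]*\.json$"
)


def find_result_keys(keys):
    """Return (key, ids) pairs for the keys that match the assumed layout."""
    matches = []
    for key in keys:
        m = RESULT_KEY_RE.match(key)
        if m:
            # groupdict() holds the four IDs captured by the named groups
            matches.append((key, m.groupdict()))
    return matches
```

Because the IDs are captured rather than supplied, there is no need to know them in advance or to assume that the project ID reappears in the file name.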

You can also split the key of each file at some character (for example /). As a sketch (assuming boto3 is set up in the environment where you run your Python code):

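import boto3

# Credentials are assumed to be configured already
# (environment variables, ~/.aws/credentials, or an IAM role).
bucket_name = "trd-data-lake-landing-zone"

s3_client = boto3.client("s3")
# list_objects_v2 returns at most 1000 keys per call, so paginate
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket_name, Prefix="fetched_projects/"):
    # "Contents" is absent when a page holds no objects
    for obj in page.get("Contents", []):
        # The key is the object's path inside the bucket (no s3://bucket/ part).
        # If the names do not match, print the key or inspect it in a debugger
        # to check that it looks the way you expect.
        file_key: str = obj["Key"]
        # Keys come back unquoted unless EncodingType="url" was requested
        key_split = file_key.split("/")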
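
Once the key is split, the IDs can be read off the individual parts. A minimal sketch, assuming every level is named <level>_<id> as in the question; ids_from_key is a hypothetical helper, not part of boto3:

```python
def ids_from_key(file_key):
    """Map each level name (project, location, ...) to the ID found in the key."""
    levels = ("project", "location", "design", "analysis")
    ids = {}
    for part in file_key.split("/"):
        name, sep, value = part.partition("_")
        # "locations" has no underscore, so sep is empty and the part is skipped
        if sep and name in levels:
            ids[name] = value
    return ids
```

Parts such as fetched_projects or the file name itself are ignored because their prefix is not one of the four level names.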

If you already know the exact key (that is, the values inside the curly brackets are filled in), you can download a JSON file from S3 like this:

import json
response = s3_client.get_object(Bucket=bucket_name, Key=file_key)
file_content = json.loads(response["Body"].read())  # "Body" is a stream, not the data
# process the data accordingly
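
Listing and downloading can also be combined so that no key has to be built by hand. This is only a sketch under the assumption that the files of interest are exactly those whose name starts with result_ and ends with .json; iter_result_json is a hypothetical helper, and the client is passed in as an argument.

```python
import json
import posixpath


def iter_result_json(s3_client, bucket_name, prefix="fetched_projects/"):
    """Yield (key, parsed_json) for every result_*.json object under prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page holds no objects
        for obj in page.get("Contents", []):
            key = obj["Key"]
            name = posixpath.basename(key)  # last path component of the key
            if name.startswith("result_") and name.endswith(".json"):
                body = s3_client.get_object(Bucket=bucket_name, Key=key)["Body"]
                yield key, json.loads(body.read())
```

It could be called as iter_result_json(boto3.client("s3"), "trd-data-lake-landing-zone"), with each parsed document then handed to the flattening step.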

But again, the intent of your question is unclear. Could you provide a minimal (at least theoretical) working example of what you would like to accomplish?