How to dynamically specify s3 path using glue?


I am writing files from a relational database source to S3 using Glue. I would like the S3 path to be in the format bucket_name/database/schema/table/year/month/day. I read the bucket name, database, schema, and table name from a configuration file, and I would like to use those parameters to dynamically build the S3 path where the source files are saved. I am writing the files to S3 using a Glue dynamic frame.

In the Glue script I build the path dynamically as: s3_target_path = 's3://' + target_bucket_name + '/' + database + '/' + schema + '/' + table + '/' + year + '/' + month + '/' + day
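That concatenation is easy to get wrong (the original was missing two `+` operators). A tidier way is an f-string helper; the bucket, database, schema, and table values below are hypothetical stand-ins for whatever the configuration file provides:

```python
from datetime import date

def build_s3_target_path(bucket: str, database: str, schema: str,
                         table: str, run_date: date) -> str:
    """Build a partitioned S3 path: bucket/database/schema/table/year/month/day."""
    return (
        f"s3://{bucket}/{database}/{schema}/{table}/"
        f"{run_date.year}/{run_date.month:02d}/{run_date.day:02d}"
    )

# Example with made-up configuration values:
s3_target_path = build_s3_target_path(
    "my-bucket", "salesdb", "public", "orders", date(2021, 3, 7)
)
# s3_target_path == "s3://my-bucket/salesdb/public/orders/2021/03/07"
```

Zero-padding the month and day keeps the prefixes sorting correctly in S3 listings.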


1 Answer

Answered by Parsifal

Glue's DynamicFrame supports writing data with Hive-style partition names (key=value). See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-writing:

connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
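Applied to the question, that looks roughly like the sketch below. Note that `partitionKeys` requires the partition values to be columns in the data, and it produces Hive-style prefixes (`year=2021/month=03/day=07`), not the plain `2021/03/07` layout the question asks for. The bucket and table names are hypothetical, and the Glue write call is shown as a comment since it needs a live GlueContext:

```python
# Hypothetical values read from the configuration file.
target_bucket_name = "my-bucket"
database, schema, table = "salesdb", "public", "orders"

s3_target_path = f"s3://{target_bucket_name}/{database}/{schema}/{table}"

# Glue appends year=YYYY/month=MM/day=DD under the path for each distinct
# combination of these columns in the data being written.
connection_options = {
    "path": s3_target_path,
    "partitionKeys": ["year", "month", "day"],
}

# In the Glue job (dyf is the DynamicFrame, glue_context the GlueContext):
# glue_context.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="s3",
#     connection_options=connection_options,
#     format="parquet",
# )
```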

This document says that you have to convert to a Spark DataFrame if you want to apply an alternate partitioning scheme. I've never done this, but I have used an RDD like so:

  1. Use map() to add the output key (e.g. xxx/yyy/yyyy/mm/dd) to each record
  2. Use groupBy() with that key field
  3. Use foreach() with a function that writes each group's records to its output path
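The three steps above can be sketched as a pure-Python analog (in a real job, steps 1–3 would be `rdd.map(...)`, `rdd.groupBy(...)`, and `rdd.foreach(...)` on a Spark RDD, with the final step uploading each group to its S3 prefix). The rows, database/schema/table names, and `ds` field are all hypothetical:

```python
from collections import defaultdict

# Hypothetical source records; "ds" stands in for the record's date.
rows = [
    {"id": 1, "ds": "2021/03/07", "val": "a"},
    {"id": 2, "ds": "2021/03/07", "val": "b"},
    {"id": 3, "ds": "2021/03/08", "val": "c"},
]

# Step 1 (map): derive the output key database/schema/table/yyyy/mm/dd per row.
# Step 2 (groupBy): collect rows sharing the same key.
groups = defaultdict(list)
for row in rows:
    key = f"salesdb/public/orders/{row['ds']}"
    groups[key].append(row)

# Step 3 (foreach): one write per key; here we just record the target prefix.
written_prefixes = []
for key, part in groups.items():
    # In Spark this function would serialize `part` and put it under
    # s3://<bucket>/<key>/ (e.g. via boto3) instead of appending to a list.
    written_prefixes.append(f"s3://my-bucket/{key}")
```

This keeps the plain `yyyy/mm/dd` layout from the question, at the cost of managing the file writes yourself instead of letting Glue do it.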