I have a parquet dataset on S3 (accessed via s3fs), partitioned like so:
STATE='DORMANT'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
....
-----> DATE=2020-11-01
STATE='ACTIVE'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
....
-----> DATE=2020-11-01
Every day new data is appended to the dataset and partitioned accordingly.
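For context, the daily append looks roughly like this (the bucket path and source location are illustrative, not my real ones):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load today's batch (illustrative staging location)
    daily_df = spark.read.parquet("s3://my-bucket/staging/today/")

    # Append into the Hive-style STATE/DATE partition layout shown above
    (daily_df.write
        .mode("append")
        .partitionBy("STATE", "DATE")
        .parquet("s3://my-bucket/my_dataset/"))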
I would like to keep only the last 90 days of data and delete the rest. So when the 91st day of data comes in, it appends and then deletes day 1 in the DATE partition; when day 92 comes in, it deletes day 2, and so on.
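The only approach I have come up with so far is to bypass Spark and delete the old partition directories directly with s3fs, something like the sketch below (bucket and prefix names are made up):

    import datetime
    import s3fs

    fs = s3fs.S3FileSystem()
    cutoff = datetime.date.today() - datetime.timedelta(days=90)

    # Walk STATE=... prefixes, then DATE=... prefixes, and drop anything older than the cutoff
    for state_dir in fs.ls("my-bucket/my_dataset/"):
        if "STATE=" not in state_dir:
            continue  # skip stray files such as _SUCCESS
        for date_dir in fs.ls(state_dir):
            if "DATE=" not in date_dir:
                continue
            date_str = date_dir.split("DATE=")[-1]  # e.g. '2020-01-01'
            if datetime.date.fromisoformat(date_str) < cutoff:
                fs.rm(date_dir, recursive=True)

But that feels like a workaround rather than the proper way to manage retention on a partitioned dataset.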
Is this possible via pyspark?