I have a parquet dataset on S3 (accessed via s3fs), partitioned like so:
STATE='DORMANT'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
....
-----> DATE=2020-11-01
STATE='ACTIVE'
-----> DATE=2020-01-01
-----> DATE=2020-01-02
....
-----> DATE=2020-11-01
Every day new data is appended to the dataset and partitioned accordingly.
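For context, the daily append looks roughly like this (the bucket path and source location are illustrative, not my real ones):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load today's batch (illustrative staging location)
    daily_df = spark.read.parquet("s3://my-bucket/staging/today/")

    # Append into the Hive-style STATE/DATE partition layout shown above
    (daily_df.write
        .mode("append")
        .partitionBy("STATE", "DATE")
        .parquet("s3://my-bucket/my_dataset/"))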
I would like to keep only the last 90 days of data and delete the rest. So when the 91st day of data comes in, it appends and then deletes day 1 in the DATE partition; when day 92 comes in, it deletes day 2, and so on.
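The only approach I have come up with so far is to bypass Spark and delete the old partition directories directly with s3fs, something like the sketch below (bucket and prefix names are made up):

    import datetime
    import s3fs

    fs = s3fs.S3FileSystem()
    cutoff = datetime.date.today() - datetime.timedelta(days=90)

    # Walk STATE=... prefixes, then DATE=... prefixes, and drop anything older than the cutoff
    for state_dir in fs.ls("my-bucket/my_dataset/"):
        if "STATE=" not in state_dir:
            continue  # skip stray files such as _SUCCESS
        for date_dir in fs.ls(state_dir):
            if "DATE=" not in date_dir:
                continue
            date_str = date_dir.split("DATE=")[-1]  # e.g. '2020-01-01'
            if datetime.date.fromisoformat(date_str) < cutoff:
                fs.rm(date_dir, recursive=True)

But that feels like a workaround rather than the proper way to manage retention on a partitioned dataset.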
Is this possible via pyspark?