Partitioning Parquet with AWS Wrangler and lakeFS


I was partitioning Parquet data on S3 with AWS Wrangler, and it worked:

import asyncio

import awswrangler as wr

# batch, MAX_ROWS_PER_FILE, and s3_session are defined earlier in my code.
basename_template = 'part.'
partitioning = ['cust_id', 'file_name', 'added_year', 'added_month', 'added_date']
s3_path = "s3://customer-data-lake/main/parquet_data"

# to_parquet is blocking, so run it in the default executor
# (this snippet lives inside an async coroutine).
loop = asyncio.get_event_loop()
await loop.run_in_executor(None, lambda: wr.s3.to_parquet(
    df=batch.to_pandas(),
    path=s3_path,
    dataset=True,
    max_rows_by_file=MAX_ROWS_PER_FILE,
    use_threads=True,
    partition_cols=partitioning,
    mode='append',
    boto3_session=s3_session,
    filename_prefix=basename_template,
))

Then I tried to switch the same code over to lakeFS by changing the S3 endpoint:

wr.config.s3_endpoint_url = lakefsEndPoint
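For completeness, here is roughly how I wire up the lakeFS side; lakefsEndPoint, lakefsAccessKey, and lakefsSecretKey are placeholders for my lakeFS server URL and the credentials it issues (not AWS IAM keys):

import boto3
import awswrangler as wr

# lakeFS exposes an S3-compatible gateway, so awswrangler can talk to it
# once the endpoint and credentials point at the lakeFS server.
wr.config.s3_endpoint_url = lakefsEndPoint
s3_session = boto3.Session(
    aws_access_key_id=lakefsAccessKey,
    aws_secret_access_key=lakefsSecretKey,
)

# Against lakeFS the path reads as s3://<repository>/<branch>/<key>,
# i.e. repository "customer-data-lake", branch "main".
s3_path = "s3://customer-data-lake/main/parquet_data"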

After that, partitioning stopped working: every write just appends to the same partition.

(Screenshot: the original, correctly partitioned layout on plain S3.)
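For reference, with partition_cols set, awswrangler writes nested Hive-style key=value prefixes, so the S3 listing looked along these lines (the values and file name are illustrative):

parquet_data/
  cust_id=123/
    file_name=csv_1/
      added_year=2023/
        added_month=7/
          added_date=2023-07-15/
            part.xxxx.snappy.parquet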

(Screenshot: the layout after switching to lakeFS.)
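To inspect the layout without the screenshots, I list the objects under the dataset root (wr.s3.list_objects is the awswrangler call I use for this):

# Print every object key under the dataset root to see which
# partition prefixes were actually created.
for key in wr.s3.list_objects(s3_path, boto3_session=s3_session):
    print(key)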

Everything just gets appended under csv_1. What am I doing wrong here?
