How can I apply a unique filter to partition column of a parquet file using wr.s3.read_parquet?


I have a parquet dataset stored in S3 and I want to read it so I can filter on the partition field, specifically to get its unique values. I tried the following, but a unique filter cannot be applied.

Here's my attempt:

query_fecha_dato = "{0}fecha_dato={1}/".format(param.delivery["output_path"], fecha_dato_formato)
print(query_fecha_dato)
df_fecha_datos = wr.s3.read_parquet(path=query_fecha_dato, dataset=True, filters=[('fecha_dato', 'unique', fecha_dato)])
print(df_fecha_datos.head(5))

It should return only the partition column "fecha_dato", but instead it returns the following:

nro_de_pedido nro_de_negocio  ... nrootchex ingest_date
0    2006968078      635922336  ...        -1  2022-08-06
1    2006968079      635912195  ...        -1  2022-08-06
2    2006968080      635921361  ...        -1  2022-08-06
3    2006968081      635922792  ...        -1  2022-08-06
4    2006968082      635922368  ...        -1  2022-08-06

I want to obtain only the partition column "fecha_dato", without duplicates.


There is 1 answer

Answered by maow:

According to the docs, I cannot find filters as an option for wr.s3.read_parquet.

It looks like, to select only fecha_dato, you need to specify columns=['fecha_dato']. Furthermore, I don't see a unique option in awswrangler, but you can use pandas' drop_duplicates afterwards:

df_fecha_datos = wr.s3.read_parquet(path=query_fecha_dato, dataset=True, columns=['fecha_dato']).drop_duplicates()

should work, at least as long as you do not get multiple DataFrames back from S3.
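If you do get multiple DataFrames back (i.e. you read with chunked=True, which makes wr.s3.read_parquet yield DataFrames instead of returning a single one), a rough sketch of the same idea is to deduplicate per chunk and once more after concatenating; query_fecha_dato is the same path as in the question:

import awswrangler as wr
import pandas as pd

# chunked=True makes read_parquet yield DataFrames one at a time
chunks = wr.s3.read_parquet(path=query_fecha_dato, dataset=True, columns=['fecha_dato'], chunked=True)
# drop duplicates within each chunk, then once more across all chunks
df_fecha_datos = pd.concat([c.drop_duplicates() for c in chunks], ignore_index=True).drop_duplicates()
print(df_fecha_datos)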

This downloads all values of fecha_dato and only drops the duplicates locally, but I have no good idea how to save that bandwidth without deploying some compute resources in AWS.
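One possible workaround, since fecha_dato is the partition key and its distinct values are already encoded in the S3 key prefixes: list the partition directories and parse the values out of the paths, so no parquet data is read at all. A rough sketch, assuming Hive-style fecha_dato=... prefixes, awswrangler's wr.s3.list_directories, and the same param.delivery["output_path"] root from the question:

import awswrangler as wr

# list the partition directories under the dataset root, e.g. .../fecha_dato=2022-08-06/
prefixes = wr.s3.list_directories(param.delivery["output_path"])
# parse the partition value out of each Hive-style prefix
fechas_unicas = sorted({p.rstrip("/").split("fecha_dato=")[-1] for p in prefixes if "fecha_dato=" in p})
print(fechas_unicas)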