I have a Parquet dataset stored in S3 and I want to read it and apply a filter to the partition field, specifically to get its unique values. I tried the approach below, but the unique operation cannot be applied.
Here's my attempt:
import awswrangler as wr

# Build the S3 path for the partition and try to read it with a filter
query_fecha_dato = "{0}fecha_dato={1}/".format(param.delivery["output_path"], fecha_dato_formato)
print(query_fecha_dato)
df_fecha_datos = wr.s3.read_parquet(path=query_fecha_dato, dataset=True, filters=[('fecha_dato', 'unique', fecha_dato)])
print(df_fecha_datos.head(5))
I expected it to show only the partition column "fecha_dato", but instead it shows the following:
nro_de_pedido nro_de_negocio ... nrootchex ingest_date
0 2006968078 635922336 ... -1 2022-08-06
1 2006968079 635912195 ... -1 2022-08-06
2 2006968080 635921361 ... -1 2022-08-06
3 2006968081 635922792 ... -1 2022-08-06
4 2006968082 635922368 ... -1 2022-08-06
I want to obtain only the partition column "fecha_dato", without duplicates.
According to the docs, I cannot find filters as an option for wr.s3.read_parquet. It looks like, to select only fecha_dato, you need to specify columns=['fecha_dato']. Furthermore, I don't see a "unique" option in awswrangler, but applying pandas drop_duplicates afterwards should work, at least as long as you do not get multiple dataframes back from S3.
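For example, a minimal sketch of that approach (the path below is a placeholder; point it at your dataset root, e.g. whatever param.delivery["output_path"] resolves to):

import awswrangler as wr

# Placeholder path; replace with your actual dataset root in S3.
path = "s3://my-bucket/my-dataset/"

# Read only the partition column; with dataset=True awswrangler adds
# the partition column to the resulting dataframe.
df = wr.s3.read_parquet(path=path, dataset=True, columns=["fecha_dato"])

# Deduplicate locally with pandas.
fechas_unicas = df["fecha_dato"].drop_duplicates()
print(fechas_unicas)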
This downloads all values in fecha_dato and only drops the duplicates locally; I have no good idea how to save that bandwidth without deploying some compute resources in AWS.