Parquet incremental updates cause out-of-order reads in Python


As a test, I wrote my data to a Parquet dataset one date at a time, partitioned by month, and then read it back in Python. This is the write loop:

for the_date in sorted(all_dates):
    # Select the rows for one trading date and append them to the dataset.
    temp = data[data.tradedate == the_date].copy()
    temp.to_parquet(
        ParquetFile,
        engine="pyarrow",  # pyarrow is recommended
        compression="gzip",
        partition_cols=["month"],
    )

When I read it back, the result is:

                     tradedate
tradedate_index           
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
2024-01-23      2024-01-23
                    ...
2024-03-06      2024-03-06
2024-03-06      2024-03-06

When I sort it in Python:

data_new.sort_values('tradedate')

I get:

                 tradedate
tradedate_index           
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
2024-01-01      2024-01-01
                    ...
2024-03-15      2024-03-15
2024-03-15      2024-03-15

So the data does not come back in the order in which I stored it. Why does this happen, and does it hurt performance?
