I have a big Feather file which I want to convert to Parquet so that I can work with PySpark. Is there a more efficient way of changing the file type than doing the following:
df = pd.read_feather('file.feather').set_index('date')
df_parquet = df.astype(str)
df_parquet.to_parquet("path/file.gzip", compression='gzip')
Since the DataFrame df kills my memory, I'm looking for alternatives. From this post I understand that I can't read Feather from PySpark directly.
With the code you posted, you are doing the following conversions:

1. Read the Feather file from disk into Arrow memory.
2. Convert the Arrow data into a pandas DataFrame (where the astype(str) cast also happens).
3. Convert the pandas DataFrame back into an Arrow table.
4. Serialize the Arrow table to Parquet.
Steps 2-4 are each expensive. You will not be able to avoid step 4, but by keeping the data in Arrow instead of taking the detour through pandas, you can avoid steps 2 and 3 with the following code snippet:
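A minimal sketch of such a snippet, assuming the pyarrow feather and parquet modules; the file names are placeholders taken from the question:

import pyarrow.feather as feather
import pyarrow.parquet as pq

# Read the Feather file directly into an Arrow table; no pandas DataFrame is created
table = feather.read_table("file.feather")

# Write the Arrow table straight to Parquet; gzip is applied to the column chunks
pq.write_table(table, "path/file.parquet", compression="gzip")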
A minor issue, but you should avoid using the .gzip ending with Parquet files. A .gzip/.gz ending indicates that the whole file is compressed with gzip and that you can unzip it with gunzip. This is not the case with gzip-compressed Parquet files: the Parquet format compresses individual segments and leaves the metadata uncompressed. This leads to nearly the same compression at a much higher compression speed. The compression algorithm is thus an implementation detail and not transparent to other tools.
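For comparison, the write from the question with the recommended naming (a hypothetical one-liner that reuses df_parquet from the question; the path is a placeholder):

# The gzip codec is recorded inside the Parquet file itself, so keep the .parquet extension
df_parquet.to_parquet("path/file.parquet", compression='gzip')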