groupby on a very large dataset (+10GB) with python libraries pandas, vaex and dask


I have more than 10 GB of transaction data. I used Dask to read the data, select the columns I am interested in, and group by the columns I wanted. All of this was incredibly fast, but computing the result didn't work well and debugging was hard.
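Roughly, that Dask attempt looked like the sketch below (assuming `path` points at the transaction file; the grouping keys and the sum aggregation are only an illustration, not my exact code):

import dask.dataframe as dd

# Lazily read the tab-separated file and keep only the columns of interest.
cols = ['client_1_id', 'amount', 'client_2_id', 'transaction_direction']
ddf = dd.read_csv(path, sep='\t', usecols=cols)

# Defining the groupby is lazy and fast; the slow / hard-to-debug part
# is triggering the actual computation with .compute().
result = ddf.groupby(['client_1_id', 'client_2_id'])['amount'].sum().compute()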

I then decided to open my data with pandas using chunksize, chunking the data one million rows at a time, and then used vaex to combine the files into one big HDF5 file. Up to this point everything went well, but when I try to group by my columns and exceed 50k rows, my code crashes. I was wondering how to manage this: should I group by on every pandas chunk before combining them into the vaex dataframe, or is it possible to convert my vaex dataframe to a dask dataframe, group by, and then convert the grouped result back into vaex, which is more user-friendly for me as it's similar to pandas?

import pandas as pd

path = ....
cols = ['client_1_id', 'amount', 'client_2_id', 'transaction_direction']

# Read the raw tab-separated file lazily, one million rows per chunk.
chunksize = 10**6
df = pd.read_csv(path,
                 iterator=True,
                 sep='\t',
                 usecols=cols,
                 chunksize=chunksize,
                 error_bad_lines=False)


import vaex
# Step 1: export to hdf5 chunks
for i, chunk in enumerate(df):
    print(i)
    df_chunk = vaex.from_pandas(chunk, copy_index=False)
    df_chunk.export_hdf5(f'dfv_{i}.hdf5')
    
dfv = vaex.open('dfv_*.hdf5')

# Step 2: Combine back into one big hdf5 file
dfv.export_hdf5('dfv.hdf5')


# Step 3: reopen the single combined file (memory-mapped by vaex)
dfv = vaex.open('dfv.hdf5')
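If the answer is to group by each pandas chunk first, I imagine it would look something like the sketch below: aggregate each chunk, concatenate the partial results, and aggregate once more. This only works directly for aggregations like sum or count (a mean would need sums and counts carried separately), and the grouping keys here are just an example:

# Sketch of the "groupby per chunk, then combine" idea (illustrative keys).
partials = []
for chunk in pd.read_csv(path, sep='\t', usecols=cols,
                         chunksize=chunksize, error_bad_lines=False):
    partials.append(
        chunk.groupby(['client_1_id', 'client_2_id'], as_index=False)['amount'].sum()
    )

# Concatenate the per-chunk partial sums and reduce them one final time.
combined = (
    pd.concat(partials)
      .groupby(['client_1_id', 'client_2_id'], as_index=False)['amount'].sum()
)

Is something along these lines the right way to go, or is the vaex-to-dask route workable?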

This is my first post; sorry if there aren't enough details or if I am unclear. Please feel free to ask me any questions.

