I have a dataset with ~7M rows and 3 columns, 2 numeric and 1 consisting of ~20M distinct string uuids. The data takes around 3G as a csv file and castra can store it in about 2G. I would like to test out bcolz with this data.
I tried
odo(dask.dataframe.from_castra('data.castra'), 'data.bcolz')
which generated ~70G of data before exhausting inodes on the disk and crashing.
What is the recommended way to get such a dataset into bcolz?
From Killian Mie on the bcolz mailing list:
Read the csv in chunks via pandas.read_csv(), convert your string column from Python object dtype to a fixed-length numpy dtype, say 'S20', then append each chunk as a numpy array to the ctable. Also, set chunklen=1000000 (or similar) at ctable creation, which will avoid creating hundreds of files under the /data folder (probably not optimal for compression, though). Those two steps worked well for me (20 million rows, 40-60 columns).
Try this:
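A minimal sketch of those two steps, assuming the csv is data.csv with columns uuid, x, and y; the column names, the output path, and the 'S36' width (sized for a hyphenated uuid, where the mailing-list example used 'S20') are placeholders to adjust for the real data:

import bcolz
import pandas as pd

csv_path = 'data.csv'      # assumed input file
bcolz_path = 'data.bcolz'  # assumed output rootdir

ct = None
for chunk in pd.read_csv(csv_path, chunksize=1000000):
    # Convert the uuid column from Python object strings to fixed-width bytes;
    # pick a width that fits your longest string.
    cols = [chunk['uuid'].values.astype('S36'),
            chunk['x'].values,
            chunk['y'].values]
    if ct is None:
        # Create the on-disk ctable from the first chunk; a large chunklen
        # keeps the number of chunk files under the data/ directory small.
        ct = bcolz.ctable(cols, names=['uuid', 'x', 'y'],
                          rootdir=bcolz_path, mode='w',
                          chunklen=1000000)
    else:
        ct.append(cols)

ct.flush()
print(ct)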