Converting a sparse matrix to HDF5 takes too much time even in Vaex, and memory crashes


I have a dataframe that contains text data and numerical features. I have vectorized the text data and plan to concatenate it with the remaining numerical features before running Machine Learning algorithms.

I vectorized the text data using TF-IDF as shown below:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the text column, capped at 10,000 features
vect = TfidfVectorizer(max_features=10000)
text_vect = vect.fit_transform(myDataframe['text_column'])  # scipy CSR sparse matrix
text_vect_df = pd.DataFrame.sparse.from_spmatrix(text_vect)  # keeps the data sparse

text_vect_df.shape: (250000, 9300)
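For context, the concatenation I plan to do is along these lines (a sketch; numeric_cols is a hypothetical placeholder for my numerical column names):

import scipy.sparse as sp

# numeric_cols is a placeholder for the list of my numerical columns
numeric_part = sp.csr_matrix(myDataframe[numeric_cols].values)
combined = sp.hstack([text_vect, numeric_part], format='csr')  # stays sparse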

I wrote text_vect_df to a CSV file and used Vaex to convert it to HDF5 as shown below, since Vaex is supposed to work well with the HDF5 format.

text_vaex_hdf5 = vaex.from_csv('text_vectorized.csv', convert=True, chunk_size=5_000_000)

The text_vectorized.csv file is 4 GB. vaex.from_csv() takes too much time, and memory crashes (8 GB RAM).

I also tried this on my JupyterHub (with an external GPU) with text_vect_df of shape 200000 x 9300. The conversion writes chunks of about 7 GB each, and reading them takes too much time:

text_vectorized.csv_chunk0.hdf5  (7.51 GB)
text_vectorized.csv_chunk1.hdf5  (7.51 GB)
text_vectorized.csv_chunk2.hdf5  (2.5 GB)

Question 1: How can the HDF5 files be larger than the original CSV file? Shouldn't they be smaller?

Question 2: How do I store a 950000 x 10000 dataframe if even this smaller one is failing/crashing?
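One workaround I am considering is skipping the CSV round-trip and keeping the matrix in a native sparse format (a sketch using scipy's save_npz; whether Vaex can consume this is part of what I am asking):

from scipy.sparse import save_npz, load_npz

# Store the TF-IDF matrix in compressed sparse form instead of a dense CSV
save_npz('text_vectorized.npz', text_vect)
text_vect = load_npz('text_vectorized.npz')  # reload later without densifying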

I read about Vaex and it looks really cool because computations happen in seconds. I would love to continue working with it, but I am stuck. I have tried Dask; it is not as cool as Vaex.

Solutions I have already tried:

  1. Pandas' to_hdf should not be used for storing the sparse matrix, according to the Vaex FAQ (https://vaex.readthedocs.io/en/latest/faq.html), which says (see the sketch after this list for what I think the Vaex-native route looks like):

"When one uses the pandas .to_hdf method, the output HDF5 file has a row based format. Vaex on the other hand expects column based HDF5 files."

  2. Without Dask or Vaex, memory crashes while running KNN, SVM, or any other ML algorithm.
  3. Tried Dask; no luck, the worker gets killed in the local cluster.
  4. With Vaex, I am not able to move forward.
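If I understand the FAQ correctly, the Vaex-native export would be along these lines (my sketch, not a tested solution; note that this materializes the sparse columns as dense arrays in memory first, which is exactly my bottleneck):

import vaex

# Convert the pandas DataFrame to Vaex and export column-based HDF5.
# Assumption: the sparse DataFrame gets densified in memory at this step.
vdf = vaex.from_pandas(text_vect_df, copy_index=False)
vdf.export_hdf5('text_vectorized.hdf5')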
