I have a dataframe that contains text data and numerical features. I have vectorized the text data and plan to concatenate it with the remaining numerical features to run Machine Learning algorithms on the result.
I have vectorized the text data using TF-IDF as shown below:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(max_features=10000)
text_vect = vect.fit_transform(myDataframe['text_column'])   # scipy sparse matrix
text_vect_df = pd.DataFrame.sparse.from_spmatrix(text_vect)  # sparse pandas DataFrame
text_vect_df.shape is (250000, 9300).
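For context, the concatenation I have in mind later is roughly the following (num_col1 and num_col2 are placeholder names for my numerical columns):

from scipy.sparse import hstack, csr_matrix

# numerical feature columns (placeholder names) as a sparse block
num_features = csr_matrix(myDataframe[['num_col1', 'num_col2']].values)

# combined feature matrix: TF-IDF block + numerical block, still sparse
X = hstack([text_vect, num_features])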
I then wrote text_vect_df out to a CSV file and used Vaex to convert it to HDF5 as shown below, since Vaex is supposed to work well with the HDF5 format.
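The CSV export itself was a plain pandas call, roughly:

text_vect_df.to_csv('text_vectorized.csv', index=False)

The Vaex conversion step: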
import vaex

text_vaex_hdf5 = vaex.from_csv('text_vectorized.csv', convert=True, chunk_size=5_000_000)
The text_vectorized.csv file is about 4 GB. vaex.from_csv() takes far too long and crashes from running out of memory (8 GB RAM).
I also tried on my JupyterHub (with an external GPU) with a smaller text_vect_df of shape 200000 x 9300. The conversion writes HDF5 chunks of about 7 GB each, and reading them back takes too much time:
text_vectorized.csv_chunk0.hdf5    7.51 GB
text_vectorized.csv_chunk1.hdf5    7.51 GB
text_vectorized.csv_chunk2.hdf5    2.5 GB
Question 1: How can the HDF5 files be larger than the original CSV file? Shouldn't they be smaller?
Question 2: How do I store a 950000 x 10000 dataframe if this smaller one already fails/crashes?
I read about Vaex and it looks really cool because computations happen in seconds. I would love to keep working with it, but I am stuck. I have also tried Dask, which I did not find as good a fit as Vaex.
Solutions I have already tried:
- Pandas' to_hdf should not be used to store the matrix for Vaex, because per the Vaex FAQ (https://vaex.readthedocs.io/en/latest/faq.html): "When one uses the pandas .to_hdf method, the output HDF5 file has a row based format. Vaex on the other hand expects column based HDF5 files." My understanding of the Vaex-native export is sketched after this list.
- Without Dask or Vaex, memory crashes while running KNN, SVM, or any other ML algorithm.
- Tried Dask: no luck, the worker gets killed in the local cluster.
- With Vaex, I am not able to move forward.
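For reference, the column-based export I understand Vaex expects would look roughly like the sketch below (the output file name is a placeholder); whether vaex.from_pandas can cope with a sparse pandas dataframe this wide is exactly where I am stuck:

import vaex

# my understanding of the Vaex-native route (instead of going through CSV):
# convert the pandas dataframe to a vaex dataframe in memory ...
vaex_df = vaex.from_pandas(text_vect_df, copy_index=False)

# ... and export it as a column-based HDF5 file that vaex.open() can memory-map later
vaex_df.export_hdf5('text_vectorized_columnar.hdf5')

df = vaex.open('text_vectorized_columnar.hdf5')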