Motivation:
I have about 30 million rows of data: one column is an index value, the other is an array of 512 int32 numbers. I only need to retrieve maybe a thousand rows at a time, so I want some sort of datastore that can look rows up by index while leaving everything else on disk.
Right now the data is split across 184 files, each of which can be opened with pandas.
This is what my dataframe looks like:
df.head()
IndexID NumpyIds
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
There is the index, and then the column 'NumpyIds', whose cells are numpy arrays of size 512 containing int32 values.
I then tried this:
store = pd.HDFStore('/data2.h5')
store.put('index', df, format='table', append=True, data_columns=True)
And got this:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-05b956667991> in <module>()
----> 1 store.put('index', df, format='table', append=True, data_columns=True)
2 store.close
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors)
1040 data_columns=data_columns,
1041 encoding=encoding,
-> 1042 errors=errors,
1043 )
1044
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors)
1707 dropna=dropna,
1708 nan_rep=nan_rep,
-> 1709 data_columns=data_columns,
1710 )
1711
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns)
4141 min_itemsize=min_itemsize,
4142 nan_rep=nan_rep,
-> 4143 data_columns=data_columns,
4144 )
4145
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
3811 nan_rep=nan_rep,
3812 encoding=self.encoding,
-> 3813 errors=self.errors,
3814 )
3815 adj_name = _maybe_adjust_name(new_name, self.version)
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
4798 # we cannot serialize this data, so report an exception on a column
4799 # by column basis
-> 4800 for i in range(len(block.shape[0])):
4801
4802 col = block.iget(i)
TypeError: object of type 'int' has no len()
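As far as I can tell, the trigger is the object-dtype column: each cell holds a whole numpy array, which the 'table' format has no atom for. A cut-down sketch that should reproduce the same TypeError on the affected pandas versions (file name hypothetical):
import numpy as np
import pandas as pd

# Two rows are enough: the column holds whole numpy arrays (object dtype).
mini = pd.DataFrame({
    'IndexID': [1899317, 22861131],
    'NumpyIds': [np.zeros(512, np.int32), np.ones(512, np.int32)],
}).set_index('IndexID')

store = pd.HDFStore('repro.h5')  # hypothetical file name
store.put('index', mini, format='table', data_columns=True)  # same TypeError
store.close()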
What am I trying to do?
I have 184 pandas dataframes (one per file) which I am trying to concatenate into one HDF5 file for fast lookup by index.
For example
store['index'][21]
would give me the 512-dimensional vector for IndexID 21.
Edit:
I tried creating one column per element, so:
cols = [str(i) for i in range(512)]
df[cols] = pd.DataFrame(df.NumpyIds.tolist(), index=df.index, columns=cols)
df.drop(columns='NumpyIds', inplace=True)
store.put('index', df, format='table', append=True)
store.close()
This works, although it feels like a hack rather than an ideal solution. But now the issue is that I can't seem to look values up by index:
store.select(key='index', start=2163410)
returns
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 ... 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
IndexID
0 rows × 512 columns
That is the column names, but no row data. This method also takes a lot of RAM; I suspect it loads all the data at once rather than just the row I asked for.
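Reading the docs again, start and stop in select look like positional row offsets, not index labels, which would explain the empty result. A label lookup presumably needs a where clause instead; a minimal sketch, assuming the table was written with format='table' so the named IndexID index is queryable:
import pandas as pd

store = pd.HDFStore('/data2.h5', mode='r')
# Query the named index; only the matching rows are read from disk.
row = store.select('index', where='IndexID == 2163410')
vector = row.iloc[0].to_numpy()  # the 512 int32 values, one per column
store.close()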
Another workaround I'm trying is opening the data directly with h5py:
df = pd.read_hdf(hdf_files[0])
df.set_index('IndexID', inplace=True)
df.to_hdf('testhdf.h5', key='df')
h = h5py.File('testhdf.h5', 'r')
But I can't figure out how to retrieve data by index from this store:
h['df'][2163410]
/usr/local/lib/python3.6/dist-packages/h5py/_hl/base.py in _e(self, name, lcpl)
135 else:
136 try:
--> 137 name = name.encode('ascii')
138 coding = h5t.CSET_ASCII
139 except UnicodeEncodeError:
AttributeError: 'int' object has no attribute 'encode'
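If I end up writing the file with h5py myself, I imagine the layout would be something like the following: one contiguous (N, 512) int32 dataset plus a parallel dataset of IndexIDs, with an in-memory id-to-row map built once at load time. A sketch under those assumptions (dataset names are my own choice, not anything pandas creates):
import numpy as np
import h5py

# Write: assumes df from above, indexed by IndexID with a NumpyIds column.
with h5py.File('vectors.h5', 'w') as f:
    f.create_dataset('ids', data=df.index.to_numpy(np.int64))
    f.create_dataset('vectors', data=np.vstack(df.NumpyIds), dtype=np.int32)

# Read: build an id -> row map once, then fetch single rows lazily.
f = h5py.File('vectors.h5', 'r')
row_of = {idx: n for n, idx in enumerate(f['ids'][:])}
vec = f['vectors'][row_of[2163410]]  # one 512-dim vector read from disk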
Answer:
As far as I know, this is a bug; see #34274. I've fixed it in #38919, so pandas now raises an appropriate error message instead.