Appending pandas data to hdf store, getting 'TypeError: object of type 'int' has no len()' error


Motivation:

I have about 30 million rows of data, one column being an index value, the other being a list of 512 int32 numbers. I wish to only retrieve maybe a thousand or so at a time, so I want to create some sort of datastore that can look up the data by index, while leaving the rest on the disk.

Right now the data is split up into 184 files, which can be opened by pandas.

This is what my dataframe looks like

df.head()

IndexID NumpyIds
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131    [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716    [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098    [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...

There is the index, and then the column 'NumpyIds', which holds numpy arrays of size 512 containing int32 values.
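For reference, a minimal sketch reproducing this layout (IDs and values made up, the real frame has ~30 million rows) would be:

    import numpy as np
    import pandas as pd

    # Toy reproduction of the layout: an IndexID index and a NumpyIds
    # column where every cell holds a whole 512-long int32 array.
    df = pd.DataFrame({
        'IndexID': [1899317, 22861131, 2163410],
        'NumpyIds': [np.zeros(512, dtype=np.int32) for _ in range(3)],
    }).set_index('IndexID')

    print(df.dtypes)   # NumpyIds is dtype 'object'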

I then tried this:

store = pd.HDFStore('/data2.h5')
store.put('index', df, format='table', append=True, data_columns=True)

And got this

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-05b956667991> in <module>()
----> 1 store.put('index', df, format='table', append=True, data_columns=True)
      2 store.close

4 frames
/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors)
   1040             data_columns=data_columns,
   1041             encoding=encoding,
-> 1042             errors=errors,
   1043         )
   1044 

/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors)
   1707             dropna=dropna,
   1708             nan_rep=nan_rep,
-> 1709             data_columns=data_columns,
   1710         )
   1711 

/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns)
   4141             min_itemsize=min_itemsize,
   4142             nan_rep=nan_rep,
-> 4143             data_columns=data_columns,
   4144         )
   4145 

/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
   3811                 nan_rep=nan_rep,
   3812                 encoding=self.encoding,
-> 3813                 errors=self.errors,
   3814             )
   3815             adj_name = _maybe_adjust_name(new_name, self.version)

/usr/local/lib/python3.6/dist-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
   4798         # we cannot serialize this data, so report an exception on a column
   4799         # by column basis
-> 4800         for i in range(len(block.shape[0])):
   4801 
   4802             col = block.iget(i)

TypeError: object of type 'int' has no len()

What am I trying to do?

I have 184 pandas files which I am trying to concatenate into 1 hdf file for fast look up using the index.

For example

store['index'][21]

Would give me that 512 dimension vector for the index of 21.
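For concreteness, this is roughly the pipeline I'm aiming for. A sketch only: the glob pattern and column names are placeholders, it assumes the source files expose IndexID and NumpyIds as columns, and it uses the expand-into-columns trick from the edit below, since the raw object-array column is what triggers the error:

    import glob
    import pandas as pd

    hdf_files = sorted(glob.glob('part_*.h5'))   # the 184 source files (pattern assumed)
    cols = [str(i) for i in range(512)]

    with pd.HDFStore('data2.h5', mode='w') as store:
        for path in hdf_files:
            part = pd.read_hdf(path)
            # expand the array column into 512 plain int32 columns so the
            # table format can serialize it
            wide = pd.DataFrame(part['NumpyIds'].to_list(),
                                index=part['IndexID'],
                                columns=cols).astype('int32')
            store.append('index', wide, format='table')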

Edit:

I tried creating a column for every number, so

df[[str(i) for i in range(512)]] = pd.DataFrame(df.NumpyIds.to_numpy(), index=df.index)
df.drop(columns='NumpyIds', inplace=True)
store.put('index', df, format='table', append=True)
store.close()

This works, although it feels like a hack rather than an ideal workaround. But now the issue is that I can't seem to look up values by index:

store.select(key='index', start=2163410)

returns

    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  ... 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511
IndexID                                                                                                                                                                                                                                                                                                                                 
0 rows × 512 columns

This returns the column names, but no data for that row. This method also takes a lot of RAM; I wonder whether it loads all the data at once rather than just the row for the specified index.
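From the docs, start and stop in select appear to be positional row numbers rather than index values, which would explain the empty result if the frame has fewer rows than that. Selecting by the stored index value would instead use a where clause, something like this sketch (assuming the index was stored under the name IndexID, as in the frame above):

    # start/stop slice by physical row position:
    first_rows = store.select('index', start=0, stop=5)

    # to filter by the value of the stored index, use a where clause;
    # only the matching rows are read from disk:
    row = store.select('index', where='IndexID == 2163410')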

Another workaround I'm trying is opening the data directly in h5py

df = pd.read_hdf(hdf_files[0])
df.set_index('IndexID', inplace=True)
df.to_hdf('testhdf.h5', key='df')
h = h5py.File('testhdf.h5')

But I can't seem to figure out how to retrieve data by index from this store

h['df'][2163410]

/usr/local/lib/python3.6/dist-packages/h5py/_hl/base.py in _e(self, name, lcpl)
    135         else:
    136             try:
--> 137                 name = name.encode('ascii')
    138                 coding = h5t.CSET_ASCII
    139             except UnicodeEncodeError:

AttributeError: 'int' object has no attribute 'encode'
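As far as I can tell, to_hdf's default fixed format writes a pandas-specific group, so h['df'] is an h5py Group that only accepts string keys, hence the encode error. If I go the raw h5py route, a sketch of what might work (dataset names here are my own, not what pandas writes) is to store one 2-D int32 dataset plus a parallel IndexID array and look rows up by position:

    import h5py
    import numpy as np

    # write: one (n_rows, 512) int32 matrix plus the matching IndexID vector
    # (assumes df still has the NumpyIds array column)
    vectors = np.stack(df['NumpyIds'].to_numpy()).astype(np.int32)
    with h5py.File('vectors.h5', 'w') as f:
        f.create_dataset('vectors', data=vectors)
        f.create_dataset('index_ids', data=df.index.to_numpy())

    # read: map an IndexID to its row position, then read just that row
    with h5py.File('vectors.h5', 'r') as f:
        ids = f['index_ids'][:]                 # small 1-D lookup array
        row = int(np.where(ids == 2163410)[0][0])
        vec = f['vectors'][row]                 # reads only this row from disk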
1 Answer

Answer from Lightyears:

As far as I know, this is a bug; see #34274.

I've fixed it in #38919. Now it shows an appropriate error message.