How to update an EArray in PyTables?


I have a NumPy array of shape (34000, 34000) that is too large to hold in memory, so I need PyTables to store it as an EArray. Because I am constrained by memory, I broke the matrix multiplication into piecewise multiplications, each of which is then appended to the EArray.
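The piecewise approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name blockwise_matmul_to_earray, the node name data, the block size, and the small matrix sizes are all made up for the demo and are not taken from the actual code.

```python
import os
import tempfile

import numpy as np
import tables as tb


def blockwise_matmul_to_earray(a, b, h5_path, block_rows=100):
    """Compute a @ b one row block at a time, appending each block to an EArray.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    with tb.open_file(h5_path, mode="w") as h5f:
        # A first dimension of 0 marks that axis as the extendable one.
        out = h5f.create_earray(
            h5f.root,
            "data",
            atom=tb.Float64Atom(),
            shape=(0, b.shape[1]),
        )
        for start in range(0, a.shape[0], block_rows):
            # Only this block of the product is held in memory.
            block = a[start:start + block_rows] @ b
            out.append(block)
        return tuple(int(n) for n in out.shape)


# Small stand-in sizes; the real matrices would be far larger.
rng = np.random.default_rng(0)
a = rng.random((250, 40))
b = rng.random((40, 60))
h5_path = os.path.join(tempfile.mkdtemp(), "blockwise.hdf5")
result_shape = blockwise_matmul_to_earray(a, b, h5_path)

# Reading everything back with [:] is acceptable here only because the demo is tiny.
with tb.open_file(h5_path, mode="r") as h5f:
    on_disk = h5f.root.data[:]
```

The block size trades memory for the number of append calls; it would need tuning for the real (34000, 34000) case.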

Here is a simpler example in which the EArray has shape (300, 30000) and every element is 9. I am trying to update it by assigning an entire array at a time.

 [[9. 9. 9. ... 9. 9. 9.]
 [9. 9. 9. ... 9. 9. 9.]
 [9. 9. 9. ... 9. 9. 9.]
 ...
 [9. 9. 9. ... 9. 9. 9.]
 [9. 9. 9. ... 9. 9. 9.]
 [9. 9. 9. ... 9. 9. 9.]]

However, I need to update the array elements repeatedly. I realize that reassignment should work on an EArray, since it inherits the __setitem__ method from tables.Array. Below is a simple piece of code that illustrates how I am updating the rows.

The problem is that the reassigned values do not persist after the file is closed.

import tables
import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

hdf5_epath = 'extendable.hdf5'
hdf5_update = tables.open_file(hdf5_epath, mode='r+')
extended_data = hdf5_update.root.data[:]

sess = tf.Session()
for each in range(len(extended_data)):
    print(extended_data[each])
    abc = tf.ones(34716, tf.float32)
    ones = sess.run(abc)
    extended_data[each] = ones

hdf5_update.close()

Am I doing something wrong, or is PyTables not meant for such a use case?

1 answer

Answer by kcw78

I'm not familiar with TensorFlow, so I can only help with the PyTables calls in your code. Yes, you can add or update data in an EArray. I have not called a set-item method on an EArray directly, and there is an easier way: index the EArray as you would a NumPy array and assign to the slice. To add data (rows) to the EArray, use the EArray.append() method. There are examples of both on the PyTables doc site. Review these references for a brief tutorial:
pytables.org: Modifying data in tables
pytables.org: Appending data to an existing table

In your code, extended_data = hdf5_update.root.data[:] reads the on-disk HDF5 EArray into memory and returns a NumPy array. That array is a copy, not a view, so modifying extended_data does NOT modify hdf5_update.root.data. That is why the changes are not persistent. (Reading with [:] also loads the whole array into memory, which is what you are trying to avoid.)

I created a simple example to show how this works. The code below modifies the on-disk data. Its output shows that the values of extended_data and h5f.root.MyData.X[:] differ after the EArray is modified: the on-disk data is modified, the in-memory copy is not. Scroll down for code to create the example HDF5 file.

CODE TO MODIFY HDF5 EARRAY IN PLACE:

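import numpy as np
import tables as tb

hdf5_epath = 'extendable.hdf5'
h5f = tb.open_file(hdf5_epath, mode='r+')

# Slicing with [:] reads the whole EArray into an in-memory NumPy copy.
extended_data = h5f.root.MyData.X[:]
print(extended_data.dtype, extended_data.shape)

# Three rows of 9.0 to write over existing rows.
myarray = np.full((3, 300), 9.0)

# Overwrite the first three rows on disk; the in-memory copy is unchanged.
h5f.root.MyData.X[0:3, :] = myarray
print(extended_data[0, 0], extended_data[2, 299])          # in-memory copy: old values
print(h5f.root.MyData.X[0, 0], h5f.root.MyData.X[2, 299])  # on disk: new values

# Overwrite the last three rows on disk.
h5f.root.MyData.X[-3:, :] = myarray
print(extended_data[-1, 0], extended_data[-1, 299])          # in-memory copy: old values
print(h5f.root.MyData.X[-1, 0], h5f.root.MyData.X[-1, 299])  # on disk: new values

h5f.close()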

CODE TO CREATE HDF5 USED ABOVE:
Run this first to create the extendable.hdf5 file used above. I suggest you inspect the data with HDFView before and after running each code segment.

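import numpy as np
import tables as tb

hdf5_epath = 'extendable.hdf5'
# mode='w' starts from a fresh file, so the script can be re-run without a
# "node already exists" error from create_group.
h5f = tb.open_file(hdf5_epath, mode='w')
dataGroup = h5f.create_group(h5f.root, 'MyData')

# First 30 x 300 block: ascending float values.
myarray = np.arange(30. * 300.).reshape((30, 300))

# Create the EArray from the first block; more rows can be appended later.
X = h5f.create_earray(dataGroup, "X", obj=myarray)
print('flavor =', X.flavor)
print('dim=', X.ndim, ', rows = ', X.nrows)

# Second 30 x 300 block: descending values, appended as new rows.
myarray = np.arange(30 * 300 + 30 * 300, 30 * 300, -1).reshape((30, 300))

X.append(myarray)
print('dim=', X.ndim, ', rows = ', X.nrows)

# Read data back with read() and inspect dtype, shape, and corner values.
Y_1 = X.read(0)
print(Y_1.dtype, Y_1.shape)

print(Y_1[0, 0])
print(Y_1[-1, -1])

Y_2 = X.read(1)
print(Y_2[0, 0])
print(Y_2[-1, -1])

h5f.close()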
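
Applying the same idea to the row-by-row loop in the question might look like the sketch below. It is only an illustration under assumptions: the file name rows.hdf5, the small (5, 8) shape, and the use of np.ones as a stand-in for the TensorFlow result are all made up for the demo. The point is that each row is assigned through the EArray node itself rather than through a [:] copy.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical demo file in a temporary directory.
h5_path = os.path.join(tempfile.mkdtemp(), "rows.hdf5")

# Build a small EArray filled with 9s (stand-in for the real data).
with tb.open_file(h5_path, mode="w") as h5f:
    h5f.create_earray(h5f.root, "data", obj=np.full((5, 8), 9.0))

# Update row by row through the node itself, not through a [:] copy.
with tb.open_file(h5_path, mode="r+") as h5f:
    data = h5f.root.data  # node handle; no [:] here, so no full in-memory copy
    n_rows, n_cols = int(data.shape[0]), int(data.shape[1])
    for i in range(n_rows):
        new_row = np.ones(n_cols)  # stand-in for the TensorFlow result
        data[i, :] = new_row       # assignment goes to the HDF5 file
    data.flush()

# Reopen the file to confirm the change persisted after closing.
with tb.open_file(h5_path, mode="r") as h5f:
    persisted = h5f.root.data[:]
```

Only one row is held in memory at a time, which matches the memory constraint described in the question.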