How to load individual .npz files as samples into a Tensorflow Dataset


I am having a problem with loading the data that make up the dataset. My previous (working) approach was to use a pandas DataFrame, but for larger datasets, the training process gets killed, as the data takes up too much memory. So I decided to use TensorFlow's Dataset class to overcome this problem, but I cannot load the individual files.

Specifically, I tried loading the file paths of the individual .npz files as samples, and then using the map method of the Dataset class to load the contents of each .npz file on demand. Each .npz file contains a numpy array of shape (1, x, x, z) and is stored in a folder named after its label.

This is the method I use for loading the dataset:

TENSOR_SUPPORTED_EXTENSIONS = ('.npy', '.npz')

def load_dataset(self):
    data = []

    for label in self.labels:
        folder = self.main_folder / label
        file_paths = [str(file_path) for file_path in folder.glob('*') if file_path.suffix in TENSOR_SUPPORTED_EXTENSIONS]
        dataset = tf.data.Dataset.from_tensor_slices(file_paths)

        # Zip dataset with labels
        dataset = dataset.map(lambda x: (x, label))

        dataset = dataset.map(map_function)
        data.append(dataset)

    # Concatenate datasets from different labels
    dataset = data[0]
    for i in range(1, len(data)):
        dataset = dataset.concatenate(data[i])

    return dataset

And this is the function passed to the map method:

def map_function(element):
    file_path, label = element
    npz_data = DatasetLoader.load_tensor(file_path)
    return (npz_data, label)

@staticmethod
def load_tensor(file_path):
    file_path = tf.get_static_value(file_path)
    file_path = Path(file_path)
    if file_path.suffix not in ('.npy', '.npz'):
        raise ValueError(f"Extension {file_path.suffix} not supported.")
    try:
        with np.load(file_path) as tensor:
            if file_path.suffix == ".npz":
                for _, item in tensor.items():
                    tensor = item
        return np.array(tensor).squeeze()
    except Exception as e:
        print(f"Error loading {file_path.stem} file: {str(e)}.", "\nFile path: ", file_path)
        raise RuntimeError(f"Error loading {file_path.stem} file: {str(e)}.") from e
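For reference, this is a minimal, self-contained sketch of the lazy-loading pattern I am aiming for, using tf.py_function to call a numpy loader once per element so only one file is in memory at a time. The file names, the load_npz helper, and the fixed (1, 4, 4, 2) shape are placeholders, not my real data:

```python
import numpy as np
import tensorflow as tf

# create two tiny .npz files to stand in for the real dataset
np.savez("sample0.npz", data=np.ones((1, 4, 4, 2), dtype=np.float32))
np.savez("sample1.npz", data=np.zeros((1, 4, 4, 2), dtype=np.float32))

def load_npz(path_tensor):
    # the path arrives as a scalar tf.string tensor; decode it to str
    path = path_tensor.numpy().decode("utf-8")
    with np.load(path) as npz:
        array = npz[npz.files[0]]  # take the first array in the archive
    return array.squeeze(0).astype(np.float32)

def map_function(file_path, label):
    # tf.py_function runs the Python loader eagerly per element
    tensor = tf.py_function(load_npz, [file_path], tf.float32)
    tensor.set_shape([4, 4, 2])  # restore shape info lost by py_function
    return tensor, label

dataset = tf.data.Dataset.from_tensor_slices(
    (["sample0.npz", "sample1.npz"], [0, 1])
).map(map_function)

for tensor, label in dataset:
    print(int(label), tensor.shape)
```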

1 Answer

hpaulj

Here's an example of creating an npz with 2 arrays, and then loading them:

In [9]: x,y = np.ones((2,3)), np.arange(5)
In [10]: np.savez('test.npz', **{'x':x, 'y':y})

And the load:

In [12]: alist = []
    ...: with np.load('test.npz') as data:
    ...:     for i,v in data.items():
    ...:         print(i,v)
    ...:         alist.append(v)    
    ...:         
x [[1. 1. 1.]
 [1. 1. 1.]]
y [0 1 2 3 4]

In [13]: alist
Out[13]: 
[array([[1., 1., 1.],
        [1., 1., 1.]]),
 array([0, 1, 2, 3, 4])]

Your error?

If I try your tensor = item pattern, I get an error:

In [15]: with np.load('test.npz') as tensor:
    ...:     for _, item in data.items():
    ...:         tensor = item
    ...: arr = np.array(tensor)    
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[15], line 2
      1 with np.load('test.npz') as tensor:
----> 2     for _, item in data.items():
      3         tensor = item
      4 arr = np.array(tensor)

File ~\miniconda3\lib\_collections_abc.py:911, in ItemsView.__iter__(self)
    909 def __iter__(self):
    910     for key in self._mapping:
--> 911         yield (key, self._mapping[key])

File ~\miniconda3\lib\site-packages\numpy\lib\npyio.py:248, in NpzFile.__getitem__(self, key)
    246     key += '.npy'
    247 if member:
--> 248     bytes = self.zip.open(key)
    249     magic = bytes.read(len(format.MAGIC_PREFIX))
    250     bytes.close()

AttributeError: 'NoneType' object has no attribute 'open'

I get this error even when I saved only one array to the npz.

Rebinding a variable inside a context manager or a for loop is dangerous: it either does not do what you intend or produces an error.
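A safer pattern is to copy each array into a separate container while the file is still open, instead of rebinding the context variable. A minimal sketch, reusing the test.npz from above:

```python
import numpy as np

np.savez("test.npz", x=np.ones((2, 3)), y=np.arange(5))

# copy the arrays out while the file is still open;
# after the with block closes, the NpzFile can no longer be read
arrays = {}
with np.load("test.npz") as data:
    for name in data.files:
        arrays[name] = data[name]  # materialize each array now

print(arrays["x"].shape, arrays["y"])
```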