Efficient way to retrieve data from multiple numpy memmap files and create a new array


For machine learning I need to get data from multiple large memmap files, combine them, and return the result. The number of variables (files) is defined by the user. At the moment I store the file paths in a list:
memmap_path = ["Folder1/file2.dat", "Folder24/file28.dat", "Folder65/file1.dat"]

The data retrieval in the dataset class itself looks like this:

def __getitem__(self, index):
    # Reopen each memmap and slice out the sample at `index` from every file.
    list_of_arrays = [
        np.memmap(memmap_file, dtype='float32', mode='r', shape=(24000, 300, 300))[index]
        for memmap_file in memmap_path
    ]
    x = np.stack(list_of_arrays)
    y = self.targets[index]

    return torch.from_numpy(x), y

Does anyone have a better approach for this situation? I am aware that loops should be avoided where possible, but I am not sure how to do that here. I thought about preallocating an array of zeros and filling it in a loop, instead of using np.stack with a list comprehension, but I am unsure whether that would actually improve performance. Any suggestions are welcome.
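The preallocate-and-fill alternative mentioned above could be sketched as follows. This is only an illustration, not the real dataset: the shape is shrunk, the files are throwaway temp files, and `get_item_prealloc` is a hypothetical helper name standing in for the body of `__getitem__`.

```python
import os
import tempfile
import numpy as np

# Small illustrative shape; the real files are (24000, 300, 300).
SHAPE = (10, 4, 4)

# Create a few throwaway memmap files to stand in for the real ones.
tmpdir = tempfile.mkdtemp()
paths = []
for k in range(3):
    p = os.path.join(tmpdir, f"file{k}.dat")
    mm = np.memmap(p, dtype='float32', mode='w+', shape=SHAPE)
    mm[:] = k  # fill file k with the constant k so the result is checkable
    mm.flush()
    paths.append(p)

def get_item_prealloc(index, memmap_paths, shape):
    """Preallocate the stacked output and fill it in a loop,
    instead of building a list of slices and calling np.stack."""
    out = np.empty((len(memmap_paths),) + shape[1:], dtype='float32')
    for i, path in enumerate(memmap_paths):
        mm = np.memmap(path, dtype='float32', mode='r', shape=shape)
        out[i] = mm[index]  # copies one sample out of file i
    return out

x = get_item_prealloc(0, paths, SHAPE)
print(x.shape)  # (3, 4, 4)
```

Either way one full sample per file is copied out of the memmap, so the main difference is whether the intermediate list of views exists before the copy; whether that matters in practice would need to be measured.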
