I am trying to read a very large parquet file in batches using the petastorm library. I need to perform some preprocessing on the loaded batches and then train a neural network.
The code I am running is:
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

data_path = 'output.parquet'
with make_batch_reader('file:///' + data_path) as reader:
    dataloader = DataLoader(reader, batch_size=20, shuffling_queue_capacity=100)
    for batches in dataloader:
        print(batches)
I am getting an error: ValueError: Type names and field names must be valid identifiers: 'M ID'
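I suspect the column name 'M ID' is the problem: petastorm appears to build namedtuple rows from the schema field names, and collections.namedtuple rejects any name that is not a valid Python identifier. A minimal reproduction of the same ValueError outside petastorm (my guess at where the error originates):

from collections import namedtuple

# 'M ID' contains a space, so it is not a valid Python identifier;
# this raises: ValueError: Type names and field names must be
# valid identifiers: 'M ID'
Row = namedtuple('Row', ['M ID', 'features', 'labels'])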
My dataset in the parquet file looks something like this:
M ID | features | labels
M4 | [[43.0, 9.0, 414.0, 6.0, 0.0], [33.0, 5.0, 808... | [808, 921, 1797, 872, 399, 1897]
M1 | [[25.0, 8.0, 600.0, 6.0, 0.0], [25.0, 2.0, 700... | [700, 800, 900, 1000, 1200, 1100]
M5 | [[78.0, 2.0, 726.0, 7.0, 0.0], [35.0, 7.0, 153... | [1535, 1116, 677, 274, 1408, 876]
M2 | [[35.0, 5.0, 600.0, 7.0, 1.0], [35.0, 2.0, 700... | [700, 800, 900, 1000, 1100, 1200]
M3 | [[68.0, 7.0, 667.0, 7.0, 0.0], [29.0, 10.0, 58... | [583, 1875, 1934, 336, 826, 1461]
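If the space in the column name is indeed the culprit, the only workaround I can think of is rewriting the file once with valid identifiers. A sketch using pyarrow, assuming the table fits in memory for the one-time rewrite ('output_renamed.parquet' is just a placeholder name):

import pyarrow.parquet as pq

# One-time rewrite: replace spaces in column names with
# underscores so every name is a valid Python identifier.
table = pq.read_table('output.parquet')
table = table.rename_columns(
    [name.replace(' ', '_') for name in table.column_names]
)
pq.write_table(table, 'output_renamed.parquet')

Is there a cleaner way to handle a column name like this directly in petastorm, without rewriting the file?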