I am trying to test batches from my training set. My training set is in a .tsv file with 3 columns: Quality (1 indicates the two sentences are similar, 0 that they are not), #1 String (the first string), and #2 String (the second string).
I have tried converting X and y to other types, such as lists, but the error remains.
Do you have any suggestions? Thank you!
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, random_split

def get_dataloaders(ds, lengths=[0.6, 0.2, 0.2], batch_size=32, seed=42, num_workers=2):
    # Split reproducibly into train/val/test; only the training loader is shuffled.
    train_set, val_set, test_set = random_split(ds, lengths=lengths, generator=torch.Generator().manual_seed(seed))
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_loader, val_loader, test_loader
data_dir = "/data.tsv"
data = pd.read_csv(data_dir, sep='\t')
y = data['Quality'].values #dtype: int64
X = data[['#1 String', '#2 String']].values #dtype: O
data_input = np.column_stack((X, y)) #dtype: O
train_loader, val_loader, test_loader = get_dataloaders(data_input)
for batch in train_loader: #TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object
----------------------------------------------------------------------------------
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
return collate(batch, collate_fn_map=default_collate_fn_map)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 119, in collate
return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 169, in collate_numpy_array_fn
raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object
Edit: I see where I misunderstood: the problem is the dtype of X and y, not their Python types. The dtype of y is int64 and the dtype of X is object. But I guess that when I combine X and y into a single array, the result has to have dtype object. How should I fix this?
PyTorch won't ingest object-dtype inputs. You need to featurize your strings into a numeric form first.
For example:
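A minimal sketch of one way to do that, assuming a simple whitespace tokenizer and a fixed sequence length. The helpers `build_vocab` and `encode`, the `PAD`/`UNK` ids, and the toy `rows` are all made up for illustration and stand in for your real columns; a pretrained tokenizer or sentence embeddings would be a reasonable substitute.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def build_vocab(sentences, min_count=1):
    """Map each whitespace token seen at least min_count times to an integer id."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, n in sorted(counts.items()):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len):
    """Turn a sentence into exactly max_len integer ids (truncate or right-pad)."""
    ids = [vocab.get(tok, UNK) for tok in sentence.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Toy rows standing in for the '#1 String', '#2 String' and 'Quality' columns.
rows = [
    ("the cat sat", "a cat was sitting", 1),
    ("the cat sat", "stocks fell sharply today", 0),
]

vocab = build_vocab([s for s1, s2, _ in rows for s in (s1, s2)])
max_len = 5  # assumed fixed length; pick one that suits your data
X1 = [encode(s1, vocab, max_len) for s1, _, _ in rows]
X2 = [encode(s2, vocab, max_len) for _, s2, _ in rows]
y = [label for _, _, label in rows]
```

Every element is now a plain integer, so `torch.tensor(X1)`, `torch.tensor(X2)` and `torch.tensor(y)` give integer tensors. Wrapping those three in a `torch.utils.data.TensorDataset` and passing that to `get_dataloaders` instead of `data_input` should let `default_collate` stack the batches, since it no longer sees an object array.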