TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

I am trying to iterate over batches of my training set. The training set is a .tsv file with 3 columns: Quality (1 indicates the two sentences are similar, 0 that they are not), #1 String (the first sentence), and #2 String (the second sentence).
I have tried converting X and y to other types, such as lists, but the error remains.
Do you have any suggestions? Thank you!

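import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, random_split

def get_dataloaders(ds, lengths=(0.6, 0.2, 0.2), batch_size=32, seed=42, num_workers=2):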
    train_set, val_set, test_set = random_split(ds, lengths=lengths, generator=torch.Generator().manual_seed(seed))

    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    val_loader = DataLoader(val_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_loader, val_loader, test_loader

data_dir = "/data.tsv"
data = pd.read_csv(data_dir, sep='\t')
y = data['Quality'].values                             #dtype: int64
X = data[['#1 String', '#2 String']].values            #dtype: O

data_input = np.column_stack((X, y))                   #dtype: O

train_loader, val_loader, test_loader = get_dataloaders(data_input)

for batch in train_loader:     # raises the TypeError shown below
    print(batch)

----------------------------------------------------------------------------------

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 119, in collate
    return collate_fn_map[elem_type](batch, collate_fn_map=collate_fn_map)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/collate.py", line 169, in collate_numpy_array_fn
    raise TypeError(default_collate_err_msg_format.format(elem.dtype))
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found object

Edit: I see what I misunderstood: the problem is the dtype of X and y, not their Python types. The dtype of y is int64 and the dtype of X is object. But I guess that when I combine the two into a single array, the result has to be dtype object. How should I fix this?

There is 1 answer

Karl

PyTorch's default collate function cannot batch object-dtype NumPy arrays. You need to featurize your strings into a numeric form first.

for example:
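
One way to do this, as a minimal sketch rather than the answerer's own code: build a word-to-id vocabulary from the sentences and encode every sentence as a fixed-length list of integers. The names build_vocab, encode, PAD, UNK and max_len are hypothetical, the toy rows stand in for the .tsv file, and whitespace splitting is only a placeholder for a real tokenizer.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and unknown words (assumed convention)


def build_vocab(sentences, min_freq=1):
    """Map every whitespace token seen at least min_freq times to an integer id."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tok, n in counts.items():
        if n >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(sentence, vocab, max_len=8):
    """Turn one sentence into exactly max_len integer ids (truncate, then pad)."""
    ids = [vocab.get(tok, UNK) for tok in sentence.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Toy rows standing in for the '#1 String', '#2 String' and 'Quality' columns.
pairs = [("the cat sat", "a cat sat down"), ("hello world", "goodbye")]
labels = [1, 0]

vocab = build_vocab(s for pair in pairs for s in pair)

# Each row is the ids of sentence 1 followed by the ids of sentence 2.
# Every row has the same length and holds only ints, so an array built
# from it has an integer dtype instead of object.
features = [encode(a, vocab) + encode(b, vocab) for a, b in pairs]
```

From here, torch.tensor(features) and torch.tensor(labels) could be wrapped in a torch.utils.data.TensorDataset and passed to get_dataloaders in place of data_input. Converting to tensors first matters: if each sample is left as a plain Python list, the default collate function batches position by position instead of stacking whole rows.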