Empty error when using collate_fn and num_workers at the same time in DataLoader


I am trying to load input and label sequences, together with their attention masks, using a collate_fn that pads the variable-length sequences in each batch to a common length. As soon as I use a higher num_workers, I get an 'Empty' error:

Empty                                     Traceback (most recent call last)
File c:\Users\Oumar Kane\AppData\Local\pypoetry\Cache\virtualenvs\pytorch1-HleOW5am-py3.10\lib\site-packages\torch\utils\data\dataloader.py:1132, in _MultiProcessingDataLoaderIter._try_get_data(self, timeout)
   1131 try:
-> 1132     data = self._data_queue.get(timeout=timeout)
   1133     return (True, data)

File C:\Python\Python310\lib\multiprocessing\queues.py:114, in Queue.get(self, block, timeout)
    113     if not self._poll(timeout):
--> 114         raise Empty
    115 elif not self._poll():

Empty: 

Here is the code of the collate_fn:

def collate_fn(batch):
    from torch.nn.utils.rnn import pad_sequence
    # Separate the input sequences, target sequences, and attention masks
    input_seqs, input_masks, target_seqs, target_masks = zip(*batch)

    # Pad the input sequences to have the same length
    padded_input_seqs = pad_sequence(input_seqs, batch_first=True)

    # Pad the target sequences to have the same length
    padded_target_seqs = pad_sequence(target_seqs, batch_first=True)

    # Pad the input masks to have the same length
    padded_input_masks = pad_sequence(input_masks, batch_first=True)

    # Pad the labels masks to have the same length
    padded_target_masks = pad_sequence(target_masks, batch_first=True)

    return padded_input_seqs, padded_input_masks, padded_target_seqs, padded_target_masks
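
To show what the collate_fn is expected to return, here is a minimal standalone check with two made-up samples of different lengths (the tensor values are only for illustration); each output is padded to the longest sequence in the batch:

import torch

# Two samples: (input_ids, input_mask, target_ids, target_mask)
sample_a = (torch.tensor([1, 2, 3]), torch.tensor([1, 1, 1]),
            torch.tensor([4, 5]), torch.tensor([1, 1]))
sample_b = (torch.tensor([6, 7]), torch.tensor([1, 1]),
            torch.tensor([8, 9, 10]), torch.tensor([1, 1, 1]))

inputs, input_masks, targets, target_masks = collate_fn([sample_a, sample_b])
print(inputs.shape)   # torch.Size([2, 3]) -> padded to the longest input
print(targets.shape)  # torch.Size([2, 3]) -> padded to the longest target

This works as expected when I call the collate_fn directly; the error only appears when the DataLoader uses worker processes.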

And here are the DataLoader initialization and the test loop:

import torch

from wolof_translate.utils.split_with_valid import split_data
from wolof_translate.data.dataset_v4 import SentenceDataset
from transformers import T5TokenizerFast

# split the data
split_data(random_state=0, csv_file='corpora_v6.csv')

# tokenizer
tokenizer = T5TokenizerFast('wolof-translate/wolof_translate/tokenizers/t5_tokenizers/tokenizer_v5.model')

# load the train sentences and load some sequences
train_dataset = SentenceDataset('data/extractions/new_data/train_set.csv', tokenizer)

dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=2, collate_fn=collate_fn, num_workers=4) # We use 4 workers (We have 12 cores)

i = 0
for input_, mask_, labels, _ in dataloader:  # iterating is what triggers the error above
    i += 1
    print(input_.shape)

I recently updated PyTorch to version 2.0.1+cu117. I don't know whether that is what caused multi-processing to stop working correctly.
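
Since worker processes are started with the spawn method on Windows, the multiprocessing docs require the entry point to be guarded by if __name__ == '__main__':. Below is a minimal sketch of the same test with that guard, and with num_workers=0 as a comparison point to rule the workers in or out; I have not confirmed whether this changes anything:

import torch

if __name__ == '__main__':
    # num_workers=0 loads batches in the main process; if the 'Empty' error
    # disappears here, the problem is specific to the worker processes
    dataloader = torch.utils.data.DataLoader(
        train_dataset, batch_size=2, collate_fn=collate_fn, num_workers=0
    )
    for input_, mask_, labels, _ in dataloader:
        print(input_.shape)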
