Impossible to fine-tune SBERT even with a 48 GB GPU


I am trying to fine-tune an SBERT model from Hugging Face using TSDAE.

I rented a server with an RTX A6000 GPU with 48 GB of memory.

I am reading chunks of the text file (train set) and, for each chunk, extracting sentences.
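For reference, the read_in_chunks helper and the splitter used below are roughly the following (a simplified sketch; my real helpers may differ slightly):

import re

def read_in_chunks(file_object, chunk_size=500 * 1024):
    # yield the file piece by piece so the whole corpus is never held in RAM
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# naive sentence splitter: break on ., ! or ? followed by whitespace
splitter = re.compile(r'(?<=[.!?])\s+')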

I have a loop where I submit 1000 sentences for training each time. Then I save the model, move on to the next iteration, read the next 1000 sentences, reload the model from disk and train again.

For the first 5 or 6 iterations the training runs without problems. Then it fails with "Out of memory", even though I am clearing the cache (torch.cuda.empty_cache()) before each new phase.
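To check whether memory really accumulates between phases, I can print the allocator statistics before each call to train_denoising. This snippet is just for debugging and is not part of the training code; it reports bytes held by live tensors versus bytes reserved by PyTorch's caching allocator:

import torch

def log_gpu_memory(tag):
    # bytes currently held by live tensors
    allocated = torch.cuda.memory_allocated() / 1024**3
    # bytes reserved by the caching allocator (includes cached, unused blocks)
    reserved = torch.cuda.memory_reserved() / 1024**3
    print("[{}] allocated: {:.2f} GiB, reserved: {:.2f} GiB".format(tag, allocated, reserved))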

The code is the following:

import gzip
import logging
import os

import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, datasets


def train_denoising(train_sentences, modelName):

  torch.cuda.empty_cache()
  # os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:22"
  os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


  word_embedding_model = models.Transformer(modelName)
  # Apply CLS pooling to get one fixed-sized sentence vector
  pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
  model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

  #model = SentenceTransformer(modelName)

  
  train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
  train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, drop_last=True)
  train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=modelName, tie_encoder_decoder=True)

  model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True,
    # checkpoint_path=model_output_path,
    use_amp=False,  # Set to True, if your GPU supports FP16 cores
    output_path='./yobb_model')
  return model

num_sentences = 1000
train_sentences = []

for path in paths:
    with gzip.open(path, 'rt', encoding='utf8') if path.endswith('.gz') else open(path, encoding='utf8') as f:

        for piece in read_in_chunks(f, chunk_size=500*1024):
            aux = [line.lower() for line in splitter.split(piece) if len(line) > 10]

            count = len(aux) // num_sentences
            index = 0
            # iterate over the sentences, taking <num_sentences> each time
            for i in range(count):
                train_sentences.extend(aux[index:index + num_sentences])
                index += num_sentences
                if len(train_sentences) <= 0:
                    continue
                print("Number of sentences {}".format(len(aux)))
                logging.info("{} train sentences".format(len(train_sentences)))
                train_denoising(train_sentences, "./yobb_model")
                train_sentences.clear()

            # train on whatever is left over (fewer than <num_sentences> sentences)
            count = len(aux) % num_sentences
            if count > 0:
                print("Number of sentences {}".format(len(aux[-count:])))
                logging.info("{} train sentences".format(len(aux[-count:])))
                train_denoising(aux[-count:], "./yobb_model")

The error is the classic one:

OutOfMemoryError: CUDA out of memory. Tried to allocate 920.00 MiB (GPU 0; 47.54 GiB total capacity; 43.35 GiB already allocated; 517.88 MiB free; 46.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The batch size is 16.

I have also tried the max_split_size_mb setting (the PYTORCH_CUDA_ALLOC_CONF line that is commented out above) after getting Out Of Memory in several different attempts.
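A minimal sketch of how I enable that setting (128 is just an example value; I experimented with several, and as far as I know the variable has to be set before the first CUDA allocation):

import os

# must be set before the first CUDA allocation to take effect;
# exporting it in the shell before launching the script also works
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"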

Why does PyTorch fail, even though it has already trained on 1000 sentences in previous iterations?

Why can't it release the GPU memory, given that I am starting a new training process each time?
