CUDA out of memory error during PEFT LoRA fine tuning


I'm trying to fine-tune the weights of a FLAN-T5 model downloaded from Hugging Face, using PEFT and specifically LoRA. I'm running the Python 3 code below on Ubuntu Server 18.04 LTS with an NVIDIA GPU that has 8 GB of RAM, and I'm getting a "CUDA out of memory" error; the full error message is below. I've tried adding:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

but I'm still getting the same error message. Can anyone see what the issue might be and suggest how to fix it?

Code:

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

# added to deal with memory allocation error
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

#
# ### Load Dataset and LLM   

huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset


# Load the pre-trained [FLAN-T5 model](https://huggingface.co/docs/transformers/model_doc/flan-t5) and its tokenizer directly from Hugging Face. Using the [base version](https://huggingface.co/google/flan-t5-base) of FLAN-T5. Setting `torch_dtype=torch.bfloat16` specifies the data type to be used by this model.

model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)



index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))


# updated 11/1/23 to ensure using gpu
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids.cuda()
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids.cuda()

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])


# To save some time subsample the dataset:

tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)


# Check the shapes of all three parts of the dataset:

# print(f"Shapes of the datasets:")
# print(f"Training: {tokenized_datasets['train'].shape}")
# print(f"Validation: {tokenized_datasets['validation'].shape}")
# print(f"Test: {tokenized_datasets['test'].shape}")
#
# print(tokenized_datasets)


# The output dataset is ready for fine-tuning.

#
# ### Perform Parameter Efficient Fine-Tuning (PEFT)
# - use LoRA

#
# ### Setup the PEFT/LoRA model for Fine-Tuning
#
# - set up the PEFT/LoRA model for fine-tuning with a new layer/parameter adapter
# - freezing the underlying LLM and only training the adapter
# - LoRA configuration below
# - Note the rank (`r`) hyper-parameter, which defines the rank/dimension of the adapter to be trained

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
#     r=4, # Rank
#     lora_alpha=4,
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5

)


# Add LoRA adapter layers/parameters to the original LLM to be trained.

peft_model = get_peft_model(original_model,
                            lora_config)
# print(print_number_of_trainable_model_parameters(peft_model))

# Enable gradient checkpointing in the model's configuration.
# peft_model.config.gradient_checkpointing = True


#
# ### Train PEFT Adapter
#
# Define training arguments and create `Trainer` instance.

output_dir = f'/home/username/stuff/username_storage/LLM/PEFT/train_args/no_log_max_depth_{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
#     auto_find_batch_size=True,
    per_device_train_batch_size=4, 
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
#     max_steps=1
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)


peft_trainer.train()

peft_model_path="/home/username/stuff/username_storage/LLM/PEFT/peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Error:

return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 1.10 GiB already allocated; 17.31 MiB free; 1.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|          | 0/32 [00:00<?, ?it/s]

Update:

I tried going down to a batch size of 1 and got the error message below:

attn_weights = nn.functional.softmax(scores.float(), dim=-1).type_as(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 7.79 GiB total capacity; 1.10 GiB already allocated; 11.31 MiB free; 1.12 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

1 Answer

Andreas Giersch answered:

Have you tried re-enabling gradient_checkpointing, or enabling it at all? According to https://huggingface.co/docs/transformers/v4.18.0/en/performance, even if everything else is set up fine, you can still run into out-of-memory issues because of the overhead of gradient calculation. To compute gradients, all activations from the forward pass are kept in memory, and this can cause a huge spike in memory consumption. To fix this, the Hugging Face docs recommend enabling gradient_checkpointing in the TrainingArguments. I see that you have the line

peft_model.config.gradient_checkpointing = True

commented out. I haven't tried that route myself, but you could re-enable it. Otherwise, follow the docs and simply enable gradient_checkpointing in the TrainingArguments, as sketched below.
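Something along these lines, reusing the variable names from your code (peft_model, tokenized_datasets, output_dir); the gradient_accumulation_steps value and the enable_input_require_grads() call are my own additions to reduce peak memory, not something taken from the docs verbatim:

from transformers import TrainingArguments, Trainer

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # keep an effective batch size of 4
    gradient_checkpointing=True,     # recompute activations in the backward pass instead of storing them all
    learning_rate=1e-3,
    num_train_epochs=1,
)

# With the base model frozen, checkpointing may need the inputs to require grads;
# this helper lives on the underlying transformers model and is reachable through the PEFT wrapper.
peft_model.enable_input_require_grads()

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

peft_trainer.train()

Gradient checkpointing trades compute for memory: training gets slower, but the activation spike during the backward pass shrinks considerably.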

I had the same problem when fine-tuning Llama 2 with PEFT and LoRA plus quantization. The model itself fit easily onto one GPU (my setup is 2 GPUs with 24 GB each), but during fine-tuning the memory consumption spiked dramatically: a 4-bit quantized Llama-2-13B would sit at around 7-8 GB and then suddenly jump to above 24 GB. The approach mentioned in the Hugging Face docs fixed the problem for me.
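For reference, a rough sketch of that kind of quantized LoRA setup; the checkpoint name, target modules, and hyper-parameters here are my assumptions, not something taken from your code:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit via bitsandbytes to keep the weights small.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-13b-hf"  # assumed checkpoint (gated on the Hub)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training; this also enables gradient checkpointing by default.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)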