Huggingface Data Collator


I'm trying to fine-tune mistral-7b on a task where it is important for the model to only output a label and nothing else. Hence I am formatting my train_dataset as follows:

f"some system prompt\n{user_input}\nLable:{label}"

my eval_dataset looks like:

f"some system prompt\n{user_input}\nLable:"

Now I am using the Hugging Face Trainer to fine-tune:

from transformers import (
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# BASE_MODEL_ID, PROJECT_NAME, model, tokenizer, train_ds and test_ds are
# defined earlier in the script.
run_name = BASE_MODEL_ID.split("/")[-1] + PROJECT_NAME
output_dir = "./" + run_name

trainer_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    max_steps=200,
    learning_rate=2e-5,  # want a small lr for fine-tuning
    bf16=True,
    optim="paged_adamw_8bit",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=32,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=32,
    evaluation_strategy="steps",
    eval_steps=32,
)

trainer = Trainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    args=trainer_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)

However, when I use the data collator, the label of every padding token is set to -100, because I defined pad_token = eos_token. Is there a way to keep this behavior but append an eos token to the end of each sequence that doesn't get converted to -100 by the data collator?

That would look something like this (assuming 2 is the eos_token_id):

[-100, -100, -100, ..., 55, 32, 4, 2]
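
The only workaround I can come up with is to subclass the collator and restore that last label after the fact, roughly like this (untested sketch; it assumes right padding, that an eos token is appended to every example during tokenization, and that the batch contains an attention_mask):

import torch
from transformers import DataCollatorForLanguageModeling

class DataCollatorKeepFinalEos(DataCollatorForLanguageModeling):
    # Behaves like DataCollatorForLanguageModeling(mlm=False), except that the
    # last real token of each sequence (the appended eos) keeps its label
    # instead of being masked to -100 along with the padding, which has the
    # same token id here.
    def torch_call(self, examples):
        batch = super().torch_call(examples)
        # Position of the last non-padding token per sequence, taken from the
        # attention mask (assumes right padding).
        last_real = batch["attention_mask"].sum(dim=1) - 1
        rows = torch.arange(batch["labels"].size(0))
        batch["labels"][rows, last_real] = batch["input_ids"][rows, last_real]
        return batch

I would then pass data_collator=DataCollatorKeepFinalEos(tokenizer, mlm=False) to the Trainer, but I'm not sure this is the intended way, or whether there is a built-in option I'm missing.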