I'm trying to fine-tune Mistral-7B on a task where it is important that the model outputs only a label and nothing else. Hence I am formatting my train_dataset as follows:
f"some system prompt\n{user_input}\nLable:{label}"
my eval_dataset looks like:
f"some system prompt\n{user_input}\nLable:"
Now I am using the Hugging Face Trainer to fine-tune:
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
)

run_name = BASE_MODEL_ID.split("/")[-1] + PROJECT_NAME
output_dir = "./" + run_name

trainer_args = TrainingArguments(
    output_dir=output_dir,
    warmup_steps=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    max_steps=200,
    learning_rate=2e-5,  # want a small lr for fine-tuning
    bf16=True,
    optim="paged_adamw_8bit",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_steps=32,
    logging_dir="./logs",
    save_strategy="steps",
    save_steps=32,
    evaluation_strategy="steps",
    eval_steps=32,
)

trainer = Trainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    args=trainer_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10)],
)
However, because I set pad_token = eos_token, the data collator masks every padding token in the labels with -100, which also masks the eos token at the end of the sequence. Is there a way to keep this padding behavior but still append an eos token to the end of the sequence that does not get converted to -100 by the data collator?
The labels would then look something like this (assuming 2 is the eos_token_id):
[-100, -100, -100, ..., 55, 32, 4, 2]
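To illustrate what I am currently getting instead, here is a minimal sketch of the behavior (the example text is made up and only the label masking matters):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

enc = tokenizer("some system prompt\nhello\nLabel:positive")
enc["input_ids"].append(tokenizer.eos_token_id)   # manually append eos
enc["attention_mask"].append(1)

batch = collator([{"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}])
print(batch["labels"][0])
# the trailing eos comes back as -100 along with the padding,
# because pad_token_id == eos_token_id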