I am using the following training arguments and trainer for fine-tuning a Hugging Face model:
from datetime import datetime

from transformers import Trainer, TrainingArguments

trainer_args = TrainingArguments(
    output_dir=model_ckpt.split('/')[0],
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    weight_decay=0.01,
    logging_steps=5,
    evaluation_strategy='steps',
    eval_steps=100,
    eval_accumulation_steps=1,
    save_steps=800,
    report_to="wandb",  # enable logging to W&B
    run_name=f"{your_name}_{model_ckpt.split('/')[0]}_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}",
    overwrite_output_dir=True,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',
)

trainer = Trainer(
    model=model,
    args=trainer_args,
    tokenizer=tokenizer,
    data_collator=seq2seq_data_collator,
    train_dataset=dataset_pt["train"],
    eval_dataset=dataset_pt["validation"],
)
I have two questions regarding the logs in W&B:
- What does this plot mean? What is train/epoch?
- Why can't I see any logs such as the epoch, batch, or train/validation loss?
Basically, I want to check which epoch my trainer is currently on. I looked at the logging-related parameters in TrainingArguments, but I couldn't work out what to change.
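For context, my current workaround (a hypothetical helper I wrote myself, not anything from the Trainer API) is to convert the global step count into an epoch number by hand, assuming a batch size of 1 (as in my TrainingArguments), a single device, and no gradient accumulation:

```python
# Hypothetical helper: convert a global training step into an epoch number.
# Assumes per-device batch size 1, one device, and no gradient accumulation,
# matching the TrainingArguments above.
def step_to_epoch(global_step, num_train_samples, per_device_batch_size=1,
                  num_devices=1, gradient_accumulation_steps=1):
    # Number of optimizer steps it takes to see the whole dataset once.
    steps_per_epoch = num_train_samples // (
        per_device_batch_size * num_devices * gradient_accumulation_steps)
    return global_step / steps_per_epoch

# e.g. with 1000 training samples, global step 2500 is halfway through epoch 3
print(step_to_epoch(2500, 1000))  # → 2.5
```

This only tells me the epoch after the fact from a step number I read off a chart, which is why I'd like the epoch logged directly.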
Edit 1: The y-axis of the graph above is 'steps'.